A Crash Course in git for Data Scientists

[This article was first published on Gage Theory, and kindly contributed to R-bloggers.]

I really like git. It’s the first versioning tool I’ve ever used so I have nothing else to compare it to, but in the world of statistical model building where iteration is constant (and almost never a strict linear progression) I find it to be an invaluable tool.

This is a set of crib notes on git for analysts like me who are reasonably comfortable working at the command line and with file systems, but are unfamiliar with versioning concepts and git. For a more academic introduction to version control directed towards scientists*, check out Software Carpentry. For those who are fearful of the command line, Software Carpentry can help you out there too.

Now that we’ve got the little stuff out of the way, why the heck should you use git, and what exactly is a version control system? Here’s a video that explains what the hell git is, featuring Scott Chacon, who’s really excited and talking fast. It’s a bit long, but entertaining and useful.

To get git installed and running, simply follow the GitHub Bootcamp instructions on how to set up git and create your first repository.

Start Versioning

The general process to add a file to your new repository, or to change one already in it, is this:

  1. Go to your local copy of the repository
  2. Add or edit the file as you normally would
  3. Fire up Git Bash (or however you want to interact with git), then use the following sequence:
# stage all new and changed files in the current directory for the next commit
$ git add .
# record the staged changes, with a message describing them
$ git commit -m "write a message about what your changes are"
# push the commits to the remote repository, assuming you've used the default "origin" naming convention
$ git push origin
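
As a rough sketch of the sequence above, here is the whole cycle run end to end in a scratch directory. A local bare repository stands in for the GitHub remote, and the directory names, file contents, and identity are all placeholders made up for illustration, so treat this as a demo rather than a recipe for a real project.

```shell
#!/bin/sh
set -e

# Work in a throwaway directory so nothing here touches a real project.
work=$(mktemp -d)
cd "$work"

# A bare repository stands in for the hosted "origin" (an assumption for this demo).
git init --quiet --bare "$work/origin.git"

# 1. Go to your local copy of the repository.
git clone --quiet "$work/origin.git" analysis
cd analysis

# Identity for this repository only; the name and e-mail are placeholders.
git config user.name "Example Analyst"
git config user.email "analyst@example.com"

# 2. Add or edit a file as you normally would.
echo "fit <- lm(y ~ x, data = df)" > model.R

# 3. Stage, commit, and push.
git status --short      # lists model.R as untracked
git add .
git commit --quiet -m "Add first linear model"
git push --quiet origin HEAD

# Show the history that now exists both locally and on "origin".
git log --oneline
```

Pushing `HEAD` sends whichever branch is currently checked out, which sidesteps the question of whether the default branch is called `master` or `main`. Running `git status` before `git add .` is a cheap way to check what is about to be staged.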


On GitHub, collaboration revolves around what they call a “Fork + Pull” model. The general idea is that if you want to use or contribute to someone else’s code, you make your own copy (a.k.a. you “fork” the repository), and then send a pull request if you’ve made a change to the code that you think the owner would like.

How to fork: http://help.github.com/fork-a-repo/

How to send a pull request: http://help.github.com/send-pull-requests/
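
The git side of the Fork + Pull model can be sketched roughly as follows. Forking and opening the pull request both happen on the GitHub website, so this demo imitates them with local bare repositories. The repository names, the branch name `fix-readme`, and the identities are hypothetical. Naming the owner's repository `upstream` is a common convention, not a requirement.

```shell
#!/bin/sh
set -e

work=$(mktemp -d)
cd "$work"

# --- Stand-ins for GitHub (assumptions for this demo) ---
# Build the owner's project with one commit, then publish it as a bare repository.
git init --quiet seed
cd seed
git config user.name "Repo Owner"
git config user.email "owner@example.com"
echo "# shared analysis" > README
git add README
git commit --quiet -m "Initial commit"
cd "$work"
git clone --quiet --bare "$work/seed" upstream.git

# Forking makes a server-side copy under your account; a bare clone imitates that.
git clone --quiet --bare "$work/upstream.git" fork.git

# --- What you would do on your own machine ---
# Clone your fork; it becomes the remote called "origin".
git clone --quiet "$work/fork.git" my-copy
cd my-copy
git config user.name "Example Analyst"
git config user.email "analyst@example.com"

# Keep a pointer to the owner's repository so you can fetch their later changes.
git remote add upstream "$work/upstream.git"
git fetch --quiet upstream

# Make the change on its own branch.
git checkout --quiet -b fix-readme
echo "Data lives in data/." >> README
git commit --quiet -am "Document where the data lives"

# Publish the branch to your fork, not to the owner's repository.
git push --quiet origin fix-readme

# List both remotes: origin (your fork) and upstream (the owner's repository).
git remote -v
```

From here, the pull request itself is opened through GitHub's web interface, asking the owner to merge `fix-readme` from your fork. Keeping the change on a separate branch leaves your default branch free to follow the owner's repository.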

There’s a lot more to know about git and version control in general, but that’s really all you need to get started. Enjoy!

*Yes, I realize I say that in a way that makes it sound like I don’t consider computer scientists to be “scientists”. That’s not the case; it’s merely a simplification, since physical and social scientists come at computing from a different angle and with a different purpose than computer scientists.
