A Crash Course in git for Data Scientists

March 10, 2012
(This article was first published on Gage Theory, and kindly contributed to R-bloggers)

I really like git. It’s the first versioning tool I’ve ever used so I have nothing else to compare it to, but in the world of statistical model building where iteration is constant (and almost never a strict linear progression) I find it to be an invaluable tool.

This is a set of crib notes on git and GitHub for analysts like me who are reasonably comfortable with the command line and file systems, but are unfamiliar with versioning concepts and git. For a more academic introduction to version control aimed at scientists*, check out Software Carpentry. For those who are fearful of the command line, Software Carpentry can help you out there too.

Now that we’ve got the little stuff out of the way, why the heck should you use GitHub, and what exactly is a version control system? Here’s a video that explains what the hell git is, featuring Scott Chacon, who’s really excited and talking fast. It’s a bit long, but entertaining and useful.

To get git installed and running, this is your go-to source:

http://help.github.com/

Simply follow the Github Bootcamp instructions on how to set up git and create your first repository.

Start Versioning

The general process for adding a file to your new repository, or changing one that is already there, is this:

  1. Go to your local copy of the repository
  2. Add or edit the file as you normally would
  3. Fire up Git Bash (or however you prefer to interact with git), then run the following sequence:
# stage every new or changed file in the current directory so it is included in the next commit
$ git add .
$ git commit -m "write a message about what your changes are"
# push the commits to the remote repository, assuming you've kept the default "origin" name for it
# (if git complains that the branch has no upstream, it will suggest: git push --set-upstream origin <branch>)
$ git push origin
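
If you'd like to rehearse that sequence before touching a real project, here is a minimal sketch that runs entirely in a throwaway directory. It assumes git is already installed. The bare repository standing in for GitHub, the file name model.R, and the user name and email are all made up for illustration. It also uses `--set-upstream` on the first push, which is my addition rather than part of the sequence above.

```shell
#!/bin/sh
set -e

# Hypothetical sandbox: a temporary directory, with a bare repository
# standing in for the remote you would normally host on GitHub.
sandbox=$(mktemp -d)
git init --quiet --bare "$sandbox/remote.git"

# Step 1: go to your local copy of the repository.
git init --quiet "$sandbox/analysis"
cd "$sandbox/analysis"
git remote add origin "$sandbox/remote.git"

# Identity used for commits in this sandbox only (made-up values).
git config user.name "Example Analyst"
git config user.email "analyst@example.com"

# Step 2: add or edit a file as you normally would.
echo 'fit <- lm(mpg ~ wt, data = mtcars)' > model.R

# Step 3: check what changed, stage it, commit it, push it.
git status --short
git add .
git commit --quiet -m "Add first linear model"

# HEAD means "the branch I am on", so this works whatever the default
# branch is called; --set-upstream lets later pushes be a plain "git push".
git push --quiet --set-upstream origin HEAD

# Show the history that now exists locally and on the stand-in remote.
git log --oneline
```

Running `git status --short` before `git add .` is a cheap habit worth picking up: it lists exactly which files are about to be staged, which matters when a working directory is full of scratch output.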

Collaboration

On GitHub, collaboration revolves around what they call a “Fork + Pull” model. The general idea is that if you want to use or contribute to someone else’s code, you make your own copy (a.k.a. you “fork” the repository), make your changes there, and then send a pull request if you’ve made a change you think the owner would like.

How to fork: http://help.github.com/fork-a-repo/

How to send a pull request: http://help.github.com/send-pull-requests/
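
For a feel of what the command-line half of Fork + Pull looks like, here is a rough sketch that simulates it locally. Everything in it is a stand-in: the "owner" repository, the bare clone playing the part of your fork on GitHub, the branch name fix-readme, and the names and emails. The fork itself and the pull request are created on the GitHub website, so those steps appear only as comments.

```shell
#!/bin/sh
set -e

sandbox=$(mktemp -d)

# Hypothetical stand-in for the owner's repository, seeded with one commit.
git init --quiet "$sandbox/owner"
cd "$sandbox/owner"
git config user.name "Repo Owner"
git config user.email "owner@example.com"
echo "Shared analysis" > README
git add README
git commit --quiet -m "Initial commit"

# Forking on GitHub gives you a server-side copy of the repository;
# a bare clone plays that role in this sandbox.
git clone --quiet --bare "$sandbox/owner" "$sandbox/fork.git"

# Clone your fork to work on it, and record where the original lives.
# "upstream" is a common name for that remote, not a requirement.
git clone --quiet "$sandbox/fork.git" "$sandbox/work"
cd "$sandbox/work"
git config user.name "Example Analyst"
git config user.email "analyst@example.com"
git remote add upstream "$sandbox/owner"

# Make the change on its own branch so it is easy to review.
git checkout --quiet -b fix-readme
echo "Run model.R to refit the model." >> README
git commit --quiet -am "Document how to refit the model"

# Push the branch to your fork ("origin"). On GitHub you would then
# open a pull request from this branch against the owner's repository.
git push --quiet origin fix-readme

# List the two remotes: origin (your fork) and upstream (the original).
git remote -v
```

Keeping the change on a separate branch is a design choice rather than a rule: it lets you keep pulling the owner's updates into your main branch while the pull request is under discussion.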

There’s a lot more to know about git and version control in general, but that’s really all you need to get started. Enjoy!

*Yes, I realize I say that in a way that makes it sound like I don’t consider computer scientists to be “scientists”. That’s not the case; it’s merely a simplification, since physical and social scientists come at computing from a different angle and with a different purpose than computer scientists.
