Roland Stevenson is a data scientist and consultant who may be reached on Linkedin.
Git and its online extensions like GitHub, Bitbucket, and GitLab are essential tools for data science. While the emphasis is often on collaboration, Git can also be very useful to the solo practitioner.
The RStudio IDE offers Git functionality via a convenient web-based interface (see the “Git” tab), as well as interaction with git via the command-line (via the “Console” tab, or via the “Git” tab’s “More”->“Shell” menu option).
Resources such as HappyGitWithR and try.github.io provide detailed setup instructions and interactive tutorials. Once you have set up Git, here are 10 commands and their RStudio GUI interface equivalents to get started using Git.
I encourage you to practice with both the command-line and the RStudio Git interface. The following commands should eventually become so habitual that they take only seconds to complete.
Before getting started with git, you should identify yourself with:
git config --global user.name "<YOUR NAME>" git config --global user.email <YOUR EMAIL>
These values will later show up in all commit and collaboration messages, so make sure they are appropriate for public visbility.
There are many ways to treat your RStudio project as a git repository, or repo for short. If you create a new project, RStudio will give you the option to use Git with the project:
To use git with an existing project, click on Tools -> Project Options -> Git/SVN:
git add <FILENAME>
Git is used to keep track of how files change. The changes to files in your project can be in one of two states:
- unstaged: changes that won’t be included in the next commit
- staged: changes that will be included in the next commit
git add to add changes to the staging area. Common changes include:
- add a new file
- add new changes to a previously-committed file
In RStudio we can easily
git add both types of changes to the staging area by clicking on the checkbox in the “Staged” column next to the filename:
In the above example, we have staged two files. The green “A” icon is for Added (as opposed to a blue “M” for Modified, or a yellow “?” for an un-tracked file).
git reset -- <FILENAME>
Unstage a file with
git reset, or in RStudio, just uncheck the checkbox. It’s that simple.
git rm <FILENAME>
Staging the removal of a tracked file looks simple, but be warned:
you cannot simply undo a
git rmremoval by running
git reset -- <FILENAME>as we did with
git add <FILENAME>
git rmstages the removal of the file so that the next commit knows the file was removed, but it also removes the actual file. This frequently causes panic in new git users when they don’t see the deleted file restored after running
git reset -- <FILENAME>. Restoring the file requires two steps:
git reset -- <FILENAME>
git checkout <FILENAME>
In RStudio, removal, and undo’ing the removal, are simpler. If you delete a tracked file, you will see a red “D” icon in the “Status” column of the “Git” tab indicating that it has been Deleted. Clicking on the checkbox in the “Staged” column will stage the removal to be included in the next commit. Clicking on the file and then clicking “More”->“Revert” will undo the deletion.
To see what files are staged, unstaged, as well as what files have been modified, use
git status. In RStudio, status is always visible by looking at the Status column of the “Git” tab.
Changes can be saved to the repo’s log in the form of a commit, a checkpoint that includes all information about how the files were changed at that point in time. Adding a concise commit message is valuable in that it allows you to quickly look through the log and understand what changes were made.
Do not use generic messages like “Updated files”: this greatly reduces the value of the commit log, as well as most other collaborative features. Other recommendations and opinions exist about how best to author a commit message, but the starting point is to include a concise (~50-character) description of what changed.
git commit via the console will open up a console-based editor that allows you to author the commit message.
In RStudio, click on the “Commit” icon in the “Git” tab. This will open up the “Review Changes” window, which allows you to see what has changed, and stage and unstage files before adding a commit message and clicking on “Commit” to finalize the commit:
When the commit is complete, you’ll see a message like the following:
git diff <FILENAME>
git diff will show you what has changed in a file that has already been committed to the repo. The “diff”erence is calculated between the what the file looked like at the last commit, and what it currently looks like in the working directory. The console
git diff command will not show you what new files exist in the unstaged area.
In RStudio, just click on the “Diff” button in the “Git” tab, to see something like the following:
Red/green shaded lines indicate lines that were removed/added to the file. In RStudio, new files appearing in the working directory will be entirely green.
Looking at git diff can be very useful when you want to know “what did I do since I last committed a change to this file?”. A common pattern after completing a segment of work is to use
git diff to see what was changed overall, and then split the changes into logical commits that describe what was done.
For example, if four different files were modified to complete one logical change, add them all in the same commit. If half the files changed were related to one topic and the other half related to another topic, add-and-commit the two sets of files with separate commit messages describing the two distinct topics addressed.
To see a log of all commits and commit messages, use
git log. In the RStudio interface, click on the “History” icon in the “Git” tab. This will pop up a window that shows the commit history:
As shown above, it will also allow you to easily view what was changed in each commit with a display of the commit’s “diff”.
git status, if there are files in the working directory that have changes, Git will provide the helpful message:
use “git checkout –
…” to discard changes in working directory
git checkout -- <FILENAME>will discard changes in the working directory, so be very careful about using
git checkout: you will lose all changes you have made to a file.
git checkout is the basis for Git’s very powerful time-traveling features, allowing you to see what your code looks like in another commit that lies backwards, forwards, or even sideways in time.
git checkout allows you to create “branches”, which we will discuss in a future post.
For now, just remember that checking-out a file that has been changed in your working directory will destroy those changes.
In RStudio, there is no interface to
git checkout– probably for good reason. It exposes a lot of functionality and can quickly lead a novice down a delicate path of learning opportunities.
Now that you have git set up, create a project and play with the above commands. Create files, add them, commit the changes, diff the changes, remove them. See what happens when you attempt to
git reset -- <FILENAME> after a
git rm <FILENAME>, or see what
git checkout <FILENAME> does when you have made changes to that file (hint: it destroys them!).
Also look at how RStudio handles these operations, likely intentionally keeping the novice safely within a subset of functionality, while also providing convenient GUI visualizations of diffs, histories, staged-state, and status.
Committing changes should become a regular part of your workflow, and understanding this essential commands will lay the foundations for the more complex workflows we’ll discuss in a future post.