Cleaning up oversized GitHub repositories for R and beyond

[This article was first published on Robin Lovelace - R, and kindly contributed to R-bloggers.]

The version control system Git is an amazing piece of software for tracking every change that you make to a project and saving its entire history. It is incredibly useful for users of R and other programming languages, which helped it shoot from zero market share in 2005 (when it was first released) to market domination in one short decade.

However, Git can cause confusion. Even (or at times especially) when used in conjunction with a nice graphical user interface such as that provided by GitHub, the main online repository of Git projects worldwide and home to over 10 million projects, Git can cause chaos. Like Linux (which, incidentally, was created by the same prolific person), Git assumes you know what you’re doing. If you do not, watch out!

Partly knowing what I was doing (but not fully), I set up a repository to host a tutorial on making maps in R. I was pretty relaxed about what went in there, and soon the repository grew to an unwieldy 60 MB in size, with over 20 MB just to download the automatically created zip file. (It is now a sprightly 2.6 MB zipped, wahey!) Needless to say, this did not help my aim of making R accessible to everyone, a tool for empowerment (as this inspiring article about R for blind people shows it can be).

So I decided to act to clean things up. In the hope it’ll be useful to others, what follows is a description of the main steps I took to sort things out.

[Figure: cleaning in action]

Step 1: delete files in the current project

The first stage was simply to identify and delete excessively sized files in the current version of the project. For this there is no better program than Baobab, which shows you where bloat exists on your system.
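Baobab is a graphical tool. Where only a terminal is available, a rough equivalent can be sketched with `find` and `du`. This is an illustration rather than part of the original workflow: the function name `big_files` and the 1024 kB default threshold are assumptions, and the `k` size suffix assumes GNU or BSD `find`.

```shell
# List files above a size threshold in a project folder, largest first.
# The .git folder is skipped here because it is dealt with separately.
# Output columns: size on disk in kB, path.
# Usage: big_files [directory] [minimum size in kB]
big_files() {
  dir="${1:-.}"
  min_kb="${2:-1024}"
  find "$dir" -path '*/.git' -prune -o \
    -type f -size +"${min_kb}"k -exec du -k {} + |
    sort -rn
}
```

Calling `big_files . 1024` from the project root lists every file over roughly 1 MB in the current version of the project.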

That was only part of the problem though: as shown in the image of disk usage from Baobab below, most (80%, almost 50 MB) of the space was taken up by the .git folder itself. This meant files I’d changed in the past were taking up the most space, and Git is designed not to let you change the past but to save it…
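To see which past files account for that space, Git's own plumbing commands can list the largest objects in the history. The following is a sketch using `git rev-list` and `git cat-file`, not something taken from the original workflow; the function name `largest_blobs` is hypothetical.

```shell
# Print the largest blobs (file versions) stored anywhere in the history of
# the repository in the current directory, biggest first.
# Output columns: uncompressed size in bytes, object id, path.
# Usage: largest_blobs [how many to show]
largest_blobs() {
  git rev-list --objects --all |
    git cat-file --batch-check='%(objecttype) %(objectsize) %(objectname) %(rest)' |
    sed -n 's/^blob //p' |
    sort -rn |
    head -n "${1:-10}"
}

# The overall size of the object store can be checked with:
#   git count-objects -vH
```

Files that show up here but no longer exist in the project folder are the ones that only a history rewrite can remove.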

[Figure: Baobab disk usage before cleaning]

Step 2: use the BFG

Next up is the BFG ‘repo cleaner’. This is just a small Java program, run from the command line, that rewrites a repository’s history to strip oversized files out of past commits.

In order for it to work, you need to mirror your repository, using the --mirror flag when you clone. The first step was thus:

    $ git clone --mirror git@github.com:Robinlovelace/Creating-maps-in-R.git

Next, you run this (in a Linux terminal, as illustrated by the $ sign) from the directory containing the mirrored repository, changing the size threshold depending on what you want to keep:

    $ java -jar ~/programs/bfg-1.11.7.jar --strip-blobs-bigger-than 1M Creating-maps-in-R.git

This successfully cut the size of the project in half, making it far more accessible, as shown in the figure below. Note that the changes made by the BFG only translate into disk space savings after running the following commands (suggested in the BFG usage section):

    $ cd Creating-maps-in-R.git/
    $ git reflog expire --expire=now --all
    $ git gc --prune=now --aggressive

[Figure: disk usage after cleaning]
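For repeat use, the mirror, BFG and garbage-collection steps can be strung together and the saving measured with `du`. This is only a sketch under stated assumptions: `shrink_repo` is a hypothetical wrapper name, the jar path and size limit are whatever you pass in, and pushing the result back to the remote is deliberately left as a manual step so the output can be inspected first.

```shell
# Hypothetical wrapper: mirror-clone a repository, run the BFG on it, then
# garbage-collect, printing the mirror's size (in kB) before and after.
# Usage: shrink_repo <repo-url> <path-to-bfg-jar> [size-limit]
shrink_repo() {
  repo_url="$1"
  bfg_jar="$2"
  limit="${3:-1M}"

  # Name the mirror folder after the last part of the URL, ending in .git
  mirror="$(basename "$repo_url")"
  case "$mirror" in
    *.git) ;;
    *) mirror="$mirror.git" ;;
  esac

  git clone --quiet --mirror "$repo_url" "$mirror" || return 1
  echo "before: $(du -sk "$mirror" | cut -f1) kB"

  java -jar "$bfg_jar" --strip-blobs-bigger-than "$limit" "$mirror" || return 1

  # The space is only reclaimed once old objects are expired and pruned
  (
    cd "$mirror" &&
      git reflog expire --expire=now --all &&
      git gc --quiet --prune=now --aggressive
  ) || return 1
  echo "after:  $(du -sk "$mirror" | cut -f1) kB"
}
```

A call might look like `shrink_repo <repo-url> ~/programs/bfg-1.11.7.jar 1M`, run from an empty working directory.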

One issue

The only issue I encountered was this message, which appeared when pushing the cleaned repository back to GitHub:

    ! [remote rejected] refs/pull/1/head -> refs/pull/1/head (deny updating a hidden ref)

Although this was repeated several times, it didn’t seem to influence the success of the operation: I’ve halved the size of my GitHub repo and roughly 1/8thed the size of the zip file people need to download to run the tutorial code. So the issue seems to be a non-issue in the grand scheme of things.
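The `refs/pull/*` entries are read-only references that GitHub maintains for pull requests, which is why a mirror push cannot update them. To confirm that nothing else was rejected, the push output can be filtered. A minimal sketch, where `unexpected_rejections` is a hypothetical helper name:

```shell
# Read `git push` output on stdin and print any rejected ref that is NOT
# under refs/pull/. Returns 0 when only pull-request refs were rejected
# (or nothing was rejected at all), and 1 otherwise.
unexpected_rejections() {
  rejected=$(grep -F '[remote rejected]' | grep -v ' refs/pull/' || true)
  if [ -n "$rejected" ]; then
    printf '%s\n' "$rejected"
    return 1
  fi
  return 0
}

# Example (git push writes its status lines to stderr, hence 2>&1):
#   git push 2>&1 | unexpected_rejections && echo "only pull refs rejected"
```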

Conclusion

Ideally we’d all be like Linus Torvalds and make no mistakes. But unfortunately we are human and prone to mistakes, which are actually one of the best ways of learning. Thanks to software like BFG and many helping hands through the open source community, 99 times out of 100 these mistakes are no big deal. I hope this post will help others to shrink unwieldy Git repositories and uncrustify their lives. More importantly, I hope this leads to better design from the outset: the experience has certainly made me think carefully about project design, including saving giant .RData files externally and keeping new objects in a project to a minimum. According to Joseph Tainter, the marginal costs of added complexity now outweigh the benefits for industrial civilization. Let’s hope R users and other programmers, at the very least, can simplify our lives sufficiently to avoid collapse. Hopefully then the rest of society will follow!
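One possible way to build that care into a project from the outset is to stop large files before they are ever committed. The sketch below checks the size of every staged file; it is an illustration, not part of the workflow described above, and the function name `check_staged_sizes` and the 1 MiB default limit are assumptions. File names containing unusual characters may be reported in Git's quoted form and are not handled here.

```shell
# Fail if any file staged for commit is larger than a limit (default 1 MiB).
# To use it as a hook, call it from an executable .git/hooks/pre-commit script.
# Usage: check_staged_sizes [limit in bytes]
check_staged_sizes() {
  limit="${1:-1048576}"
  too_big=$(
    git diff --cached --name-only --diff-filter=AM |
      while IFS= read -r file; do
        # ":<path>" names the staged (index) version of the file
        size=$(git cat-file -s ":$file")
        if [ "$size" -gt "$limit" ]; then
          printf '%s (%s bytes)\n' "$file" "$size"
        fi
      done
  )
  if [ -n "$too_big" ]; then
    echo "Refusing to commit files over $limit bytes:" >&2
    printf '%s\n' "$too_big" >&2
    return 1
  fi
  return 0
}
```

Large data files that trip the check can then be stored outside the repository, as suggested above.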
