Cleaning up oversized GitHub repositories for R and beyond

June 25, 2014

(This article was first published on Robin Lovelace - R, and kindly contributed to R-bloggers)

The version control system Git
is an amazing piece of software for tracking every change that
you make to a project and saving its entire history.
It is incredibly useful for users of R and other
programming languages, which helped it shoot from zero market share
in 2005 (when it was first released)
to market domination in one short decade.

However, Git can cause confusion. Even (or at times especially)
when used in conjunction with a nice graphical user interface
such as that provided by GitHub,
the main online repository of
Git projects worldwide and home to over
10 million projects,
Git can cause chaos.
Like Linux (which, incidentally, was created by the same prolific
person),
Git
assumes you know what you’re doing.
If you do not,
watch out!

Partly knowing what I was doing (but not fully) I set up a
repository to host a tutorial on
making maps in R.
I was pretty relaxed about what went in there, and soon the
repository grew to an unwieldy 60 MB in size, with over 20 MB
to download just for the automatically created
zip file.
(It is now a sprightly
2.6 MB zipped, wahey!)
Needless to say, this did not help my aim of making
R accessible to everyone, a tool for empowerment
(as this
inspiring article about R for blind people shows it can be).

So I decided to act to clean things up. In the hope it’ll be useful
to others, what follows is a description of the main steps I took to
sort things out.

[Image: cleaning-in-action]

Step 1: delete files in the current project

The first stage was simply to identify and delete excessively large files
in the current version of the project. For this there is no better program
than Baobab (the GNOME Disk Usage Analyzer), which shows you where
bloat exists on your system.
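
For those without Baobab, a similar list can be produced in the terminal. The following is a minimal sketch, not the method used for this tutorial: the function name `list_large_files` and the default 1024 KB threshold are assumptions for illustration, and it relies on the `k` size suffix supported by GNU and BSD `find`.

```shell
# List files above a size threshold in a project directory, largest first.
# Usage: list_large_files [DIR] [MIN_KB]   (hypothetical helper)
list_large_files() {
    dir="${1:-.}"
    min_kb="${2:-1024}"
    # Skip the .git folder: it holds the history, not the current files.
    # du -k prints "<size in KB><TAB><path>" for every match.
    find "$dir" -name .git -prune -o \
        -type f -size +"${min_kb}"k -exec du -k {} + |
        sort -rn
}
```

Running `list_large_files . 1024` from the project root prints every file over roughly 1 MB in the current version of the project, which are the candidates for deletion at this step.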

That was only part of the problem though: as shown in the image of
disk usage from Baobab below, most (80%, almost 50 MB)
of the space was taken up by the .git
folder itself. This meant files I’d changed in the past were taking up
the most space. Git is not designed to let you change the past, but to save it…

[Image: b4-clean, Baobab disk usage before cleaning]
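
To find out which past files are responsible for the bulk of the .git folder, the objects in the history can be listed by size. This is a sketch under stated assumptions rather than part of the original clean-up: `largest_blobs` is a made-up helper name, and it combines two standard Git commands, `git rev-list --objects --all` and `git cat-file --batch-check`.

```shell
# Print the largest files (blobs) stored anywhere in a repository's
# history, biggest first, as "<size in bytes> <path>".
# Usage: largest_blobs [REPO_DIR] [COUNT]   (hypothetical helper)
largest_blobs() {
    repo="${1:-.}"
    count="${2:-10}"
    # rev-list emits "<object id> <path>" for every reachable object;
    # cat-file then reports each object's type and size, and %(rest)
    # carries the path through.
    git -C "$repo" rev-list --objects --all |
        git -C "$repo" cat-file \
            --batch-check='%(objecttype) %(objectsize) %(rest)' |
        awk '$1 == "blob" { sub(/^blob /, ""); print }' |
        sort -rn |
        head -n "$count"
}
```

Files deleted from the current version still show up in this list for as long as an earlier commit refers to them. For a quick overall figure, `git count-objects -vH` reports the size of the object store in human-readable units.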

Step 2: use the BFG

Next up is the BFG ‘repo cleaner’.
This is a small Java program, run from the command line, that rewrites
a repository's history to remove unwanted large files.

In order for it to work, you need to mirror your repository,
using the --mirror flag when you clone. The first step was thus:

    $ git clone --mirror git@github.com:Robinlovelace/Creating-maps-in-R.git

Next, you run this (in a Linux terminal,
as illustrated by the $ sign), changing the size depending on what you want
to keep:

    $ java -jar ~/programs/bfg-1.11.7.jar --strip-blobs-bigger-than 1M .git

This successfully cut the size of the project in half,
making it far more accessible, as shown in the figure below.
Note that the changes
made by the BFG only translate into disk space savings
after running the following commands
(suggested in the BFG usage section):

    $ cd Creating-maps-in-R.git/
    $ git reflog expire --expire=now --all
    $ git gc --prune=now --aggressive

[Image: after-clean, disk usage after cleaning]
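
The effect of these two commands can also be checked without a disk usage tool by comparing the size of the object store before and after. A minimal sketch, assuming Git is installed: the function names `repo_size_kb` and `expire_and_gc` are made up for illustration, and the total is simply the `size` and `size-pack` fields of `git count-objects -v` added together.

```shell
# Report the size of a repository's object store in KiB
# (loose objects plus pack files).
# Usage: repo_size_kb [REPO_DIR]   (hypothetical helper)
repo_size_kb() {
    git -C "${1:-.}" count-objects -v |
        awk '/^size(-pack)?:/ { total += $2 } END { print total + 0 }'
}

# Drop reflog entries that still point at old objects, then repack and
# delete everything that is no longer reachable.
# Usage: expire_and_gc [REPO_DIR]   (hypothetical helper)
expire_and_gc() {
    git -C "${1:-.}" reflog expire --expire=now --all &&
        git -C "${1:-.}" gc --prune=now --aggressive --quiet
}
```

Calling `repo_size_kb Creating-maps-in-R.git` before and after `expire_and_gc Creating-maps-in-R.git` should show the drop. Until the reflog entries expire, Git treats the old objects as still in use and keeps them on disk.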

One issue

The only issue I encountered was this message, which appeared when pushing the cleaned repository back to GitHub:

    ! [remote rejected] refs/pull/1/head -> refs/pull/1/head (deny updating a hidden ref)

Although this was repeated several times, it didn’t seem to influence the success
of the operation: I’ve halved the size of my GitHub repo and cut the zip file
people need to download to run the tutorial code to roughly an eighth of its
former size. So the issue seems to be a non-issue in the grand scheme of things.
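
The rejected refs live under refs/pull/, which GitHub uses for pull requests and does not let clients update. A mirror clone copies them locally, so a mirrored push tries to send them back. One possible way to silence the message, offered as an untested-on-this-repository sketch rather than something done in the original clean-up, is to delete those refs from the local mirror before pushing. The helper name `drop_pull_refs` is an assumption.

```shell
# Delete every local ref under refs/pull/ so that a mirrored push no
# longer attempts to update GitHub's read-only pull request refs.
# Usage: drop_pull_refs [REPO_DIR]   (hypothetical helper)
drop_pull_refs() {
    repo="${1:-.}"
    # for-each-ref writes one "delete <refname>" instruction per ref,
    # which update-ref --stdin then applies.
    git -C "$repo" for-each-ref --format='delete %(refname)' refs/pull |
        git -C "$repo" update-ref --stdin
}
```

This only changes the local mirror. The pull request refs on GitHub itself stay as they were.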

Conclusion

Ideally we’d all be like Linus Torvalds and make
no mistakes.
But unfortunately we are human and prone to mistakes, which are
actually one of the best ways of learning. Thanks to software
like the BFG and many helping hands in the open source community,
99 times out of 100 these mistakes are no big deal. I hope this
post will help others to
shrink unwieldy git repositories and
uncrustify their lives.
More importantly, I hope this leads to better design from the outset:
the experience has certainly made me think about project design more carefully,
including saving giant .RData files externally and keeping new objects
in a project to a minimum. According to Joseph Tainter,
the marginal costs of added complexity now outweigh the benefits for
industrial civilization. Let’s hope R users and other
programmers, at the very least, can simplify
our lives sufficiently to avoid collapse. Hopefully then the rest of society will
follow!
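
One way to keep giant .RData files out of a repository from the outset is to list them in .gitignore. The following is a minimal sketch: the helper name `ignore_r_data` and the three patterns are assumptions chosen for illustration, not a recommendation from the original post.

```shell
# Append ignore rules for bulky R artefacts to a project's .gitignore.
# The patterns are examples only; adjust them to the project.
# Usage: ignore_r_data [PROJECT_DIR]   (hypothetical helper)
ignore_r_data() {
    project="${1:-.}"
    for pattern in '*.RData' '*.rds' '.Rhistory'; do
        # Add each pattern once, even if the function is run repeatedly.
        grep -qxF -- "$pattern" "$project/.gitignore" 2>/dev/null ||
            printf '%s\n' "$pattern" >> "$project/.gitignore"
    done
}
```

Ignore rules only apply to files Git is not already tracking. A file that has been committed before needs `git rm --cached` first, and it remains in the history until a tool such as the BFG removes it.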

