GitHub recently launched a more powerful search feature which has been used on more than one occasion to identify sensitive files that may be hosted in a public GitHub repository. When used innocently, there are all sorts of fun things you can find with this search feature.
Inspired by Aldo Cortesi's post documenting his exploration of public shell history files posted to GitHub, I was curious if there were any such .Rhistory files. For the uninitiated, .Rhistory files are just logs of commands entered into the interactive console during an R session. Some recent IDEs, such as RStudio, automatically create these files as you work. By default, these files would be excluded from a Git repository, but users could, for whatever reason, choose to include their .Rhistory files in the repository.
Using this search function, combined with the Python script Mr. Cortesi had put together to download the files associated with a GitHub search, I was able to download 638 .Rhistory files from public GitHub repositories (excluding forks). What follows is an exploration of those files.
Trimming out the 0-line .Rhistory files leaves us with a total of 531 non-empty files, totaling 157265 commands entered into R.
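The trimming and counting step above can be sketched roughly as follows. This is a minimal illustration in Python, not the code behind the numbers in this post. The function names and the directory layout (one downloaded history file per file in a folder) are assumptions made for the example, and it treats every line as one command, which overcounts commands that span several lines.

```python
from pathlib import Path


def load_histories(directory):
    """Read every file in `directory` and return {filename: list of lines}.

    The directory of downloaded .Rhistory files is a hypothetical layout.
    """
    histories = {}
    for path in sorted(Path(directory).iterdir()):
        if path.is_file():
            # Some histories contain odd bytes; replace them rather than fail.
            text = path.read_text(encoding="utf-8", errors="replace")
            histories[path.name] = text.splitlines()
    return histories


def drop_empty(histories):
    """Keep only the histories that contain at least one line."""
    return {name: lines for name, lines in histories.items() if lines}


def total_commands(histories):
    """Count one command per line across all histories (an approximation)."""
    return sum(len(lines) for lines in histories.values())
```

With the files on disk, something like `total_commands(drop_empty(load_histories("rhistory_files")))` would give the overall command count, where `rhistory_files` is a placeholder folder name.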
First, I was curious about the length of these files.
It seems that many of these files represent very brief (and likely unpleasant) interactions with R. For instance:

    exit
    exit
    ls
    exit

(If you're out there, you were likely looking for the 'q()' command.) Others represent quite extensive projects; the maximum was 7268 lines long.
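A quick way to look at file lengths is to summarize the number of lines per history. The sketch below assumes the histories are already held in a dictionary mapping a file name to its list of lines; that structure and the name `length_summary` are hypothetical choices for illustration.

```python
from statistics import median


def length_summary(histories):
    """Summarize the number of lines per history file.

    `histories` maps a file name to its list of lines (an assumed structure).
    """
    lengths = sorted(len(lines) for lines in histories.values())
    if not lengths:
        raise ValueError("no history files to summarize")
    return {
        "files": len(lengths),
        "min": lengths[0],
        "median": median(lengths),
        "max": lengths[-1],
    }
```

The median is probably more informative than the mean here, since a handful of very long histories would otherwise dominate the summary.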
More interesting to me was how these users were using R – what the details contained in these history files represent in terms of the user's interaction with R. For starters, which packages were the users using? We can identify the packages loaded by finding the calls that load packages in the scripts; there are 3068 such calls. The top 10 packages loaded in this set were:
(Of course, it's worth noting the selection bias from examining only R commands that were included in GitHub projects. I would imagine that the usage of devtools, for instance, is inflated among GitHub projects relative to the general population of R users.)
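One way to tally package loads is a regular expression over each history line. The sketch below assumes packages are loaded with R's `library()` or `require()` and that the package name is the first argument, quoted or not. The pattern and the name `count_packages` are illustrative, not the code used for the counts in this post.

```python
import re
from collections import Counter

# Matches library(pkg), require("pkg"), library( 'pkg' ), and similar calls.
# Assumption: the package name is the first argument of the call.
PACKAGE_LOAD = re.compile(
    r"""\b(?:library|require)\s*\(\s*["']?([A-Za-z][A-Za-z0-9.]*)"""
)


def count_packages(lines):
    """Return a Counter of package names loaded across the given lines."""
    counts = Counter()
    for line in lines:
        # findall picks up several loads on one line, e.g. separated by ';'.
        counts.update(PACKAGE_LOAD.findall(line))
    return counts
```

Calling `count_packages(all_lines).most_common(10)` would then give a top-10 list. This heuristic has known blind spots: a call such as `library(help = "stats")` would be counted as a package named `help`, and packages attached in other ways are missed entirely.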
I was also curious which functions were most widely executed. We can get a rough identification of most function names by looking for a sequence of valid characters followed by an opening parenthesis. This gives us a total of 100190 function calls of 8028 unique function names. The 20 most popular functions executed in this set were:
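The function-name heuristic described above (a run of valid name characters followed by an opening parenthesis) might look something like this in Python. It is a rough sketch under my own assumptions about what counts as a valid character (letters, digits, dots, and underscores), so its counts would not necessarily match the figures quoted here.

```python
import re
from collections import Counter

# An R-style name (letters, digits, '.', '_', not starting with a digit)
# that is immediately followed by '(' and optional whitespace.
# The lookbehind stops the match from starting in the middle of a name.
FUNCTION_CALL = re.compile(r"(?<![A-Za-z0-9._])([A-Za-z.][A-Za-z0-9._]*)\s*\(")


def count_function_calls(lines):
    """Return a Counter of apparent function names across the given lines."""
    counts = Counter()
    for line in lines:
        counts.update(FUNCTION_CALL.findall(line))
    return counts
```

Here `sum(counts.values())` is the total number of calls and `len(counts)` the number of unique names. The pattern also picks up control-flow keywords written like calls (`if (`, `for (`, `function(`) and anything that looks like a call inside a string or comment, so a real analysis might want to filter those out.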
It should also be possible to identify the functions for which the help/manual pages were viewed by identifying lines beginning with a “?” or arguments inside of a call to help(). I can identify 2409 requests for help on 1101 different function names. The top 10 most prevalent functions for which users request help follow.
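A sketch of the help-request detection might combine two patterns: one for lines that begin with a single "?" and one for the first argument of a call to R's `help()`. Treating `help()` as the call in question is my assumption, as are the patterns and the name `count_help_requests`; this is not the code that produced the numbers above.

```python
import re
from collections import Counter

# A line that starts with a single '?' followed by a topic name.
# '??topic' (a help search) is intentionally not matched.
QUESTION_MARK = re.compile(r"""^\s*\?\s*["']?([A-Za-z.][A-Za-z0-9._]*)""")

# The first argument of a help(...) call, quoted or not.
HELP_CALL = re.compile(r"""\bhelp\s*\(\s*["']?([A-Za-z.][A-Za-z0-9._]*)""")


def count_help_requests(lines):
    """Return a Counter of topics for which help was apparently requested."""
    counts = Counter()
    for line in lines:
        match = QUESTION_MARK.match(line)
        if match:
            counts[match.group(1)] += 1
        counts.update(HELP_CALL.findall(line))
    return counts
```

As with the other heuristics, there are edge cases this does not handle: `help(package = "stats")` would be recorded as a request for `package`, and `?ggplot2::qplot` would be recorded under `ggplot2` rather than `qplot`.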
Of course, there are all sorts of different types of analysis one could perform on this dataset. Post any suggestions you have in the comments; I imagine there's at least one more post of interesting finds in this data. Check out the source code on GitHub.