Analysis of Public .Rhistory Files

February 20, 2013
(This article was first published on Trestle Technology » R, and kindly contributed to R-bloggers)

GitHub recently launched a more powerful search feature, which has already been used on more than one occasion to identify sensitive files hosted in public repositories. Used more innocently, it also turns up all sorts of fun things.

Inspired by Aldo Cortesi's post documenting his exploration of public shell history files posted to GitHub, I was curious if there were any such .Rhistory files. For the uninitiated, .Rhistory files are just logs of commands entered into the interactive console during an R session. Some recent IDEs, such as RStudio, automatically create these files as you work. By default, these files would be excluded from a Git repository, but users could, for whatever reason, choose to include their .Rhistory files in the repository.

Using this search function, combined with the Python script Mr. Cortesi had put together to download the files associated with a GitHub search, I was able to download 638 .Rhistory files from public GitHub repositories (excluding forks). What follows is an exploration of those files.
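
(For reference, once you have a list of raw-file URLs, however you obtain them, downloading the files from within R is straightforward. A minimal sketch follows; the urls vector is a hypothetical placeholder, not part of the original workflow.)

# Hypothetical character vector of raw .Rhistory URLs gathered from the search
urls <- c("https://raw.githubusercontent.com/someuser/somerepo/master/.Rhistory")

dir.create("rhistory", showWarnings = FALSE)

# Download each file, numbering the local copies to avoid name collisions
for (i in seq_along(urls)) {
  dest <- file.path("rhistory", paste0("history_", i, ".Rhistory"))
  try(download.file(urls[i], destfile = dest, quiet = TRUE), silent = TRUE)
}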

Load Data

Trimming out the zero-line .Rhistory files leaves 531 non-empty files, totaling 157,265 commands entered into R.
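
A minimal sketch of that loading step, assuming the downloaded files sit in a local 'rhistory' directory (the directory name is mine, not from the original code):

# Read each downloaded .Rhistory file into a character vector of commands
files     <- list.files("rhistory", full.names = TRUE)
histories <- lapply(files, readLines, warn = FALSE)

# Drop the empty files and count the commands that remain
histories    <- histories[sapply(histories, length) > 0]
file_lengths <- sapply(histories, length)

length(histories)   # 531 non-empty files in this dataset
sum(file_lengths)   # 157,265 commands in total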

First, I was curious about the length of these files.
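
The plot below can be roughly reproduced from the file_lengths vector computed above; the original figure may have been drawn differently, but a base-graphics sketch on a log scale keeps the long tail readable:

# Distribution of .Rhistory file lengths (log scale for the long tail)
hist(log10(file_lengths),
     breaks = 50,
     main   = "Length of .Rhistory Files",
     xlab   = "log10(number of commands)")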

 

[Figure: Length of .Rhistory Files]

It seems that many of these files represent very brief (and likely unpleasant) interactions with R. For instance:

exit
exit
ls
exit

(If you're out there: you were probably looking for the q() command.) Others represent quite extensive projects; the longest was 7,268 lines.

Package Usage

More interesting to me was how these users were using R: what do the commands in these history files reveal about how people actually interact with the language? For starters, which packages were they loading? We can identify packages loaded via the library() or require() functions.
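
A sketch of that extraction over the pooled command set; this is my approximation of the approach, not the original code:

# Pool every command from the non-empty history files
commands <- unlist(histories)

# Grab the first argument of library() or require() calls (quoted or not)
pkg_calls <- regmatches(
  commands,
  regexpr("(library|require)\\s*\\(\\s*[\"']?[A-Za-z0-9._]+", commands)
)
packages <- gsub("^(library|require)\\s*\\(\\s*[\"']?", "", pkg_calls)

# Tabulate the most frequently loaded packages
head(sort(table(packages), decreasing = TRUE), 10)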

There were 3,068 such calls to load packages in the scripts. The top 10 packages loaded in this set were:

Package Name        Count
ggplot2               291
plyr                   81
GREBase                59
xtable                 59
reshape                52
reshape2               48
devtools               43
igraph                 41
RGreenplum             40
lattice                39

(Of course, it's worth noting the selection bias that comes from examining only R commands that ended up in GitHub projects. Usage of devtools, for instance, is almost certainly inflated among GitHub users relative to the general R-using population.)

Function Use

I was also curious which functions were most widely executed. We can get a rough identification of most function calls by looking for a sequence of valid identifier characters followed by an opening parenthesis.
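
A rough version of that matching, assuming the commands vector built in the package sketch above:

# Match an R identifier immediately followed by an opening parenthesis
call_matches <- gregexpr("[A-Za-z._][A-Za-z0-9._]*\\s*\\(", commands)
calls <- unlist(regmatches(commands, call_matches))

# Strip the trailing parenthesis to leave bare function names
calls <- gsub("\\s*\\($", "", calls)

# The most frequently "called" names (keywords like if and for are caught too)
head(sort(table(calls), decreasing = TRUE), 20)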

This gives a total of 100,190 function calls across 8,028 unique function names. (This simple pattern also catches language keywords such as function, for, and if, which show up below.) The 20 most popular functions executed in this set were:

Function Name        Count
source               5,191
plot                 2,552
c                    2,448
library              2,416
function             1,711
for                  1,138
summary              1,107
if                   1,062
read.csv               955
rep                    887
length                 880
head                   828
lm                     766
sum                    753
View                   722
print                  661
install.packages       648
mean                   606
setwd                  569
names                  562

It should also be possible to identify which functions' help/manual pages were viewed, by looking for lines beginning with ? or for arguments passed to help().
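
A sketch of that detection, again over the pooled commands vector:

# Lines of the form "?foo" or "??foo"
q_help <- gsub("^\\?+\\s*", "",
               grep("^\\?", trimws(commands), value = TRUE))

# Explicit calls such as help(foo) or help("foo")
help_calls <- regmatches(
  commands,
  regexpr("help\\s*\\(\\s*[\"']?[A-Za-z0-9._]+", commands)
)
h_help <- gsub("^help\\s*\\(\\s*[\"']?", "", help_calls)

# Combine both sources and tabulate the most requested help topics
head(sort(table(c(q_help, h_help)), decreasing = TRUE), 10)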

I can identify 2,409 requests for help on 1,101 different function names. The 10 functions for which users most often requested help follow.

Function Name    Count
plot                43
hist                31
lm                  25
writePage           24
order               20
sort                20
cor                 17
apply               16
read.csv            16
matrix              15

Conclusion

Of course, there are all sorts of other analyses one could perform on this dataset. Post any suggestions you have in the comments; I imagine there's at least one more post of interesting finds in this data. Check out the source code on GitHub.
