Analysis of Public .Rhistory Files

Posted on February 20, 2013 by Jeff Allen in Uncategorized | 0 Comments

[This article was first published on Trestle Technology » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

GitHub recently launched a more powerful search feature which has been used on more than one occasion to identify sensitive files that may be hosted in a public GitHub repository. When used innocently, there are all sorts of fun things you can find with this search feature.

Inspired by Aldo Cortesi’s post documenting his exploration of public shell history files posted to GitHub, I was curious if there were any such .Rhistory files. For the uninitiated, .Rhistory files are just logs of commands entered into the interactive console during an R session. Some recent IDEs, such as RStudio, automatically create these files as you work. By default, these files would be excluded from a Git repository, but users could, for whatever reason, choose to include their .Rhistory files in the repository.

Using this search function, combined with the Python script Mr. Cortesi had put together to download the files associated with a GitHub search, I was able to download 638 .Rhistory files from public GitHub repositories (excluding forks). What follows is an exploration of those files.

Load Data

Trimming out the 0-line .Rhistory files leaves us with a total of 531 non-empty files, totaling 157265 commands entered into R.

First, I was curious about the length of these files.

It seems that many of these files represent very brief (and likely unpleasant) interaction with R. For instance:

exit
exit
ls
exit

(if you’re out there, you were likely looking for the ‘q()‘ command). Others represent quite extensive projects; the maximum was 7268 lines long.

Package Usage

More interesting to me was how these users were using R – what the details contained in these history files represent in terms of the user’s interaction with R. For starters, which packages were the users using? We can identify packages loaded via the library() or require() functions.

There were 3068 such calls to load packages in the scripts. The top 10 packages loaded in this set were:

Package Name	Count
`ggplot2`	291
`plyr`	81
`GREBase`	59
`xtable`	59
`reshape`	52
`reshape2`	48
`devtools`	43
`igraph`	41
`RGreenplum`	40
`lattice`	39

(Of course it’s likely worth noting the selection bias from examining only R commands which were included in GitHub projects. I would imagine that the usage for devtools, for instance, is certainly inflated among GitHub projects over the general populace.)

Function Use

I was also curious which functions were most widely executed. We can get a rough identification of most function names by looking for a sequence of valid characters followed by an ( symbol.

This gives us a total of 100190 function calls of 8028 unique function names. The 20 most popular functions executed in this set were:

Function Name	Count
`source`	5191
`plot`	2552
`c`	2448
`library`	2416
`function`	1711
`for`	1138
`summary`	1107
`if`	1062
`read.csv`	955
`rep`	887
`length`	880
`head`	828
`lm`	766
`sum`	753
`View`	722
`print`	661
`install.packages`	648
`mean`	606
`setwd`	569
`names`	562

It should also be possible to identify for which functions the help/manual pages were viewed by identifying lines beginning with a “?” or arguments inside of a call to help().

I can identify 2409 requests for help on 1101 different function names. The top 10 most prevalent functions for which users request help follow.

Function Name	Count
`plot`	43
`hist`	31
`lm`	25
`writePage`	24
`order`	20
`sort`	20
`cor`	17
`apply`	16
`read.csv`	16
`matrix`	15

Conclusion

Of course, there are all sorts of different types of analysis one could perform on this dataset. Post any suggestions you have in the comments; I imagine there’s at least one more post of interesting finds in this data. Check out the source code on GitHub.

To leave a comment for the author, please follow the link and comment on their blog: Trestle Technology » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

R-bloggers

R news and tutorials contributed by hundreds of R bloggers

Analysis of Public .Rhistory Files

Load Data

Package Usage

Function Use

Conclusion

Related

Load Data

Package Usage

Function Use

Conclusion

Related

Never miss an update! Subscribe to R-bloggers to receive e-mails with the latest R posts. (You will not see this message again.)

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)