[This article was first published on John Myles White: Die Sudelbücher » Statistics, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

A few months ago I decided to apply word frequency analysis ideas to R code. My idea was simple: the functions that one should invest the most effort into learning are precisely those functions that are used most frequently in real world code. In fact, this simple idea can be applied to spoken languages as valuably as to programming languages: the words that will most improve your ability to understand day-to-day speech in a foreign language are precisely the words that occur most frequently in day-to-day speech. Our language classrooms tend to screw this up by emphasizing words like “spoon” over “dude”, even though people my age will use the word “dude” many more times a day than they’ll use “spoon”.1 In large part, this is because the usefulness of a word tends to be confounded with its respectability when you learn a language in an academic setting. This over-emphasis on respectability is why you’re never taught curse words, even though they’re as practically useful as the majority of other words.

But I digress. Returning to R, the attached CSV data set contains the results of my word frequency analysis. To produce this data set, I spidered all of the source code for every R package on CRAN, split the resulting corpus text into lexical tokens, and counted the occurrences of each token. To make things simpler, I decided to treat every token followed by a parenthesized expression as a function, even though this gives me some syntactical units like `if` in my list of functions.

I think the resulting raw data set is probably as useful as any summaries I can offer of it, but I’ll offer some obvious results for those interested.

Function call frequencies in R, unsurprisingly, seem to follow Zipf’s law or some variety thereof. You can see this fairly clearly in the following logarithmic plot:

And here are the top 25 most frequently used functions in R, including some syntactical units:

Function Name Occurrences in Corpus
if 333421
c 143123
function 137490
length 109847
paste 62906
cat 56199
return 56001
stop 54161
is.null 53575
list 51862
log 49479
for 48950
rep 39090
names 34439
sum 31475
as.integer 29417
matrix 29344
is.na 20230
dim 20202
max 19995
nrow 19789
as.double 19705
attr 18802
t 17889
print 17461

Some fun extensions to this work would be to use this frequency data set as a standard for the abnormality of code style: you could compare an individual programmer’s code to the standard frequency data to see which functions such-and-such a programmer tends to overuse and underuse. I personally tend to underuse `stop` and I never use `attr` at all.

Another interesting project would be to use this data set as input to a text classifier that would attempt to predict the author of a piece of R code based on token frequency information, in line with Mosteller and Wallace’s famous analysis of the Federalist Papers or anti-spam programs.

1. To this day, I still mix up the words for “spoon” (“cuchara”) and “knife” (“cuchillo”) in Spanish, thought I never make any mistakes with slang words like “dude” (“tio”).

To leave a comment for the author, please follow the link and comment on their blog: John Myles White: Die Sudelbücher » Statistics.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

# Never miss an update! Subscribe to R-bloggers to receive e-mails with the latest R posts.(You will not see this message again.)

Click here to close (This popup will not appear again)