R Function Usage Frequencies

December 7, 2009
By

(This article was first published on John Myles White: Die Sudelbücher » Statistics, and kindly contributed to R-bloggers)

A few months ago I decided to apply word frequency analysis ideas to R code. My idea was simple: the functions that one should invest the most effort into learning are precisely those functions that are used most frequently in real world code. In fact, this simple idea can be applied to spoken languages as valuably as to programming languages: the words that will most improve your ability to understand day-to-day speech in a foreign language are precisely the words that occur most frequently in day-to-day speech. Our language classrooms tend to screw this up by emphasizing words like “spoon” over “dude”, even though people my age will use the word “dude” many more times a day than they’ll use “spoon”.1 In large part, this is because the usefulness of a word tends to be confounded with its respectability when you learn a language in an academic setting. This over-emphasis on respectability is why you’re never taught curse words, even though they’re as practically useful as the majority of other words.

But I digress. Returning to R, the attached CSV data set contains the results of my word frequency analysis. To produce this data set, I spidered all of the source code for every R package on CRAN, split the resulting corpus text into lexical tokens, and counted the occurrences of each token. To make things simpler, I decided to treat every token followed by a parenthesized expression as a function, even though this gives me some syntactical units like if in my list of functions.

I think the resulting raw data set is probably as useful as any summaries I can offer of it, but I’ll offer some obvious results for those interested.

Function call frequencies in R, unsurprisingly, seem to follow Zipf’s law or some variety thereof. You can see this fairly clearly in the following logarithmic plot:

r_log_frequencies.png

And here are the top 25 most frequently used functions in R, including some syntactical units:

Function NameOccurrences in Corpus
if333421
c143123
function137490
length109847
paste62906
cat56199
return56001
stop54161
is.null53575
list51862
log49479
for48950
rep39090
names34439
sum31475
as.integer29417
matrix29344
is.na20230
dim20202
max19995
nrow19789
as.double19705
attr18802
t17889
print17461

Some fun extensions to this work would be to use this frequency data set as a standard for the abnormality of code style: you could compare an individual programmer’s code to the standard frequency data to see which functions such-and-such a programmer tends to overuse and underuse. I personally tend to underuse stop and I never use attr at all.

Another interesting project would be to use this data set as input to a text classifier that would attempt to predict the author of a piece of R code based on token frequency information, in line with Mosteller and Wallace’s famous analysis of the Federalist Papers or anti-spam programs.

  1. To this day, I still mix up the words for “spoon” (“cuchara”) and “knife” (“cuchillo”) in Spanish, thought I never make any mistakes with slang words like “dude” (“tio”).

To leave a comment for the author, please follow the link and comment on his blog: John Myles White: Die Sudelbücher » Statistics.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Tags:

Comments are closed.