by Bob Muenchen
Analytics tools take significant effort to master, so once learned people tend to stick with them for much of their careers. This makes the tools used in academia of particular interest in the study of future trends of market share. I’ve been tracking The Popularity of Data Analysis Software regularly since 2010, and I thanks to an astute reader, I now have a greatly improved estimate of Stata’s academic growth. Peter Hedström, Director of the Institute for Analytical Sociology at Linköping University, wrote to me convinced that I was underestimating Stata’s role by a wide margin, and he was right.
Two things make Stata’s popularity difficult to guage: 1) Stata means “been” in Italian, and 2) it’s a common name for the authors of scholarly papers and those they cite. Peter came up with the simple, but very effective, idea of adding Statacorp’s headquarter, College Station, Texas, to the search. That helped us find far more Stata articles while blocking the irrelevant ones. Here’s the search string we came up with:
("Stata" "College Station") OR "StataCorp" OR "Stata Corp" OR "Stata Journal" OR "Stata Press" OR "Stata command" OR "Stata module"
The blank between Stata and College Station is an implied logical “and”. This string found 20% more articles than my previous one. This success motivated me to try and improve some of my other search strings. R and SAS are both difficult to search for due to how often those letters stand for other things. I was able to improve my R search string by 15% using this:
"r-project.org" OR "R development core team" OR "lme4" OR "bioconductor" OR "RColorBrewer" OR "the R software" OR "the R project" OR "ggplot2" OR "Hmisc" OR "rcpp" OR "plyr" OR "knitr" OR "RODBC" OR "stringr" OR "mass package"
Despite hours of effort, I was unable to improve on the simple SAS search string of “SAS Institute.” Google Scholar’s logic seems to fall apart since “SAS Institute” OR “SAS procedure” finds fewer articles! If anyone can figure that out, please let me know in the comments section below. As usual, the steps I use to document all searches are detailed here.
The improved search strings have affected all the graphs in the Scholarly Articles section of The Popularity of Data Analysis Software. At the request of numerous readers, I’ve also added a log-scale plot there which shows the six most popular classic statistics packages:
If you’re interested in learning R, DataCamp.com offers my 16-hour interactive workshop,
R for SAS, SPSS and Stata Users for $25. That’s a monthly fee, but it definitely won’t take you a month to take it! For students & academics, it’s $9. I also do training on-site but I’m often booked about 8 weeks out.
I invite you to follow me on this blog and on Twitter.