An R function to determine if you are a data scientist

October 10, 2011

(This article was first published on Simply Statistics, and kindly contributed to R-bloggers)

“Data scientist” is one of the buzzwords in the running for rebranding applied statistics mixed with some computing. David Champagne, over at Revolution Analytics, described the skills for being a data scientist with a Venn Diagram. Just for fun, I wrote a little R function for determining where you land on the data science Venn Diagram. Here is an example of a plot the function makes using the Simply Statistics bloggers as examples. 

The code can be found here. You will need the png and klaR R packages to run the script. You also need to either download the file datascience.png or be connected to the internet. 

Here is the function definition:

dataScientist(names=c(“D. Scientist”),skills=matrix(rep(1/3,3),nrow=1), addSS=TRUE, just=NULL)

  • names = a character vector of the names of the people to plot
  • addSS = if TRUE will add the blog authors to the plot
  • just = whether to write the name on the right or the left of the point, just = “left” prints on the left and just =”right” prints on the right. If just=NULL, then all names will print to the right. 
  • skills = a matrix with one row for each person you are plotting, the first column corresponds to “hacking”, the second column is “substantive expertise”, and the third column is “math and statistics knowledge”

So how do you define your skills? Here is how it works:

If you are an academic

You calculate your skills by adding papers in journals. The classification scheme is the following:

  • Hacking = sum of papers in journals that are primarily dedicated to software/computation/methods for very specific problems. Examples are: Bioinformatics, Journal of Statistical Software, IEEE Computing in Science and Engineering, or a software article in Genome Biology.
  • Substantive  = sum of papers in journals that primarily publish scientific results such as JAMA, New England Journal of Medicine, Cell, Sleep, Circulation
  • Math and Statistics = sum of papers in primarily statistical journals including Biostatistics, Biometrics, JASA, JRSSB, Annals of Statistics

Some journals are general, like Nature, Science, the Nature sub-journals, PNAS, and PLoS One. For papers in those journals, assess which of the areas the paper falls in by determining the main contribution of the paper in terms of the non-academic classification below. 

If you are a non-academic

Since papers aren’t involved, determine the percent of your time you spend on the following things:

  • Hacking = downloading/transferring data, cleaning data, writing software, combining previously used software
  • Substantive = time you spend learning about the scientific problem, discussing with scientists, working in the lab/field.
  • Math and Statistics = time you spend formalizing a problem in mathematical notation, time you spend developing new mathematical/statistical theory, time you spend developing general method.


To leave a comment for the author, please follow the link and comment on their blog: Simply Statistics. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Tags: , ,

Comments are closed.


Mango solutions

RStudio homepage

Zero Inflated Models and Generalized Linear Mixed Models with R

Quantide: statistical consulting and training


CRC R books series

Contact us if you wish to help support R-bloggers, and place your banner here.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)