R and (Software) Relatives

February 18, 2014
(This article was first published on R-Chart, and kindly contributed to R-bloggers)

Post also available with code executed inline at rpubs.com.

O'Reilly recently published the results of a survey of Strata Conference attendees covering tool usage and salary.  The entire survey is available for download.  In the survey results, R was heralded as second only to SQL as a tool used by conference attendees.  A chart from the survey appeared in this post and elsewhere online.



These two technologies overlap a bit but are highly complementary.  SQL can be used to quickly extract data from relational databases and to filter, order and summarize it.  SQL queries can be executed from R itself, or from another language to produce a CSV file that can be imported into R.  R can then do additional filtering, ordering and summarizing, and can be used for more sophisticated analysis, reshaping of data and presentation in a final form.
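As a rough illustration of the CSV route, here is a minimal base R sketch. The file, its columns (region, product, sales) and its values are all hypothetical and simply stand in for whatever a SQL query might export; nothing here comes from the survey.

```r
# Hypothetical CSV export, standing in for the output of a SQL query.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "region,product,sales",
  "East,A,100",
  "East,B,250",
  "West,A,300",
  "West,B,50"
), csv_path)

# Import the file into a data frame.
orders <- read.csv(csv_path, stringsAsFactors = FALSE)

# Filter: keep rows with sales of at least 100.
large_orders <- orders[orders$sales >= 100, ]

# Order: sort the remaining rows by sales, largest first.
large_orders <- large_orders[order(-large_orders$sales), ]

# Summarize: total sales per region.
sales_by_region <- aggregate(sales ~ region, data = large_orders, FUN = sum)
print(sales_by_region)
```

The same filter, order and summarize steps could just as well be pushed into the SQL query; which side does the work is a judgment call that depends on data volume and on how much further analysis follows in R.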

As part of an in-progress R screencast, I wanted to speculate a bit about the most common "clusters" of technologies that are popular among R users (at least the Strata Conference respondents).  Although the raw data from the survey is not available, the graph cited in the survey results includes enough information to do a bit of additional analysis.  I reconstructed the original graph as a starting point with the intention of splitting out the data and non-data roles into faceted bar charts.  This would make usage reported among non-data respondents a bit clearer.

So the first step was to replicate the original plot with a few cosmetic and editorial updates - no "Respodents" appear in the new version.  This involved the use of reshape2 and ggplot2.


library(reshape2)
library(ggplot2)


With these available, I created the data frame by combining a few vectors containing the data of interest.


data.science.tools <- as.data.frame(t(rbind(
  
  c('All Respondents',
    'SQL','R','Python','Excel','Hadoop','Java',
    'Network/Graph','JavaScript','Tableau','D3',
    'Mahout','Ruby','SAS/SPSS'),
  
  c(57,42,33,26,25,23,17,16,7,15,8,7,5,9),
  
  c(43,29,10,15,11,12,17,4,13,4,5,6,6,2)
)))

names(data.science.tools)=c('DataTool', 'Data', 'NonData')


At this point, the results match up with what appeared in the chart from the O'Reilly report.  The numbers represent percent of respondents that use the given tool.


data.science.tools

          DataTool     Data     NonData
1  All Respondents       57          43
2              SQL       42          29
3                R       33          10
4           Python       26          15
5            Excel       25          11
6           Hadoop       23          12
7             Java       17          17
8    Network/Graph       16           4
9       JavaScript        7          13
10         Tableau       15           4
11              D3        8           5
12          Mahout        7           6
13            Ruby        5           6
14        SAS/SPSS        9           2

The data is easier to deal with if reshaped using melt.


data.science.tools.df <- melt(
  data.science.tools, 
  c('DataTool'), 
  variable.name='Role', 
  value.name='Respondents'
)


The resulting data frame:

data.science.tools.df

          DataTool    Role Respondents
1  All Respondents    Data          57
2              SQL    Data          42
3                R    Data          33
4           Python    Data          26
5            Excel    Data          25
6           Hadoop    Data          23
7             Java    Data          17
8    Network/Graph    Data          16
9       JavaScript    Data           7
10         Tableau    Data          15
11              D3    Data           8
12          Mahout    Data           7
13            Ruby    Data           5
14        SAS/SPSS    Data           9
15 All Respondents NonData          43
16             SQL NonData          29
17               R NonData          10
18          Python NonData          15
19           Excel NonData          11
20          Hadoop NonData          12
21            Java NonData          17
22   Network/Graph NonData           4
23      JavaScript NonData          13
24         Tableau NonData           4
25              D3 NonData           5
26          Mahout NonData           6
27            Ruby NonData           6
28        SAS/SPSS NonData           2
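For readers who want to see what the reshaping does without reshape2, here is a rough base R equivalent on a three-tool subset of the table. It is only a sketch of the wide-to-long idea, not how melt is implemented; the names wide, measure_cols and long are made up for the illustration.

```r
# A three-tool subset of the wide table, with values taken from the article.
wide <- data.frame(
  DataTool = c("SQL", "R", "Python"),
  Data     = c(42, 33, 26),
  NonData  = c(29, 10, 15),
  stringsAsFactors = FALSE
)

# Columns to stack on top of each other.
measure_cols <- c("Data", "NonData")

# Repeat the identifier once per measure column, label each block with the
# column it came from, and stack the values into a single column.
long <- data.frame(
  DataTool    = rep(wide$DataTool, times = length(measure_cols)),
  Role        = rep(measure_cols, each = nrow(wide)),
  Respondents = unlist(wide[measure_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)
print(long)
```

Each tool now appears once per role, which is the shape ggplot2 expects when a variable such as Role is mapped to fill or used for faceting.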

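Convert the data into the required numeric type. Because rbind combined the labels and the numbers into one matrix, the values are stored as text (or as factors, depending on the R version and settings).

# Going through as.character first is safe for both character and factor input
data.science.tools.df$Respondents <- as.numeric(as.character(
  data.science.tools.df$Respondents
))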
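A small sketch of why the conversion deserves some care. In versions of R before 4.0, data frames turned strings into factors by default, and calling as.numeric directly on a factor returns its internal level codes rather than the numbers it displays. The vector below is a made-up three-value example using figures from the table.

```r
# Three values from the Data column, stored as text.
as_text <- c("57", "42", "33")

# Character input converts as expected.
from_text <- as.numeric(as_text)

# The same values stored as a factor: levels are sorted as "33", "42", "57".
as_factor <- factor(as_text)

# Direct conversion returns the level codes, not the displayed numbers.
level_codes <- as.numeric(as_factor)

# Converting to character first recovers the real values.
from_factor <- as.numeric(as.character(as_factor))

print(from_text)
print(level_codes)
print(from_factor)
```

Checking the column with str() or class() before plotting is a cheap way to confirm which case applies.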

Create the original chart

ggplot(data = data.science.tools.df, 
       aes(x=reorder(DataTool, 
                      Respondents, 
                      function(x) max(x)
                     ), 
            y=Respondents, 
            fill=Role)
       ) + 
  geom_bar(stat='identity') + 
  coord_flip() + 
  theme(axis.title.y = element_blank())
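The reorder call in the plot controls the order of the bars. A minimal sketch of what it does, using base R only and three tools from the table (the vector names are made up for the illustration):

```r
# Two rows per tool, one for each role, as in the melted data frame.
tool        <- c("SQL", "R", "Java", "SQL", "R", "Java")
respondents <- c(42, 33, 17, 29, 10, 17)

# reorder() computes one score per tool (here the maximum across roles)
# and sorts the factor levels by that score, smallest first.
by_max <- reorder(tool, respondents, function(x) max(x))

print(levels(by_max))
print(attr(by_max, "scores"))
```

Since the levels run from the smallest score to the largest, the tool with the highest value ends up at the top of the flipped chart.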

Now do the faceted example

ggplot(data = data.science.tools.df,
       aes(x=reorder(DataTool, 
                     Respondents, 
                     function(x) max(x)), 
           y=Respondents, 
           fill=Role)
       ) + 
  geom_bar(stat='identity') + 
  coord_flip()  + 
  facet_grid(. ~ Role) + 
  theme(axis.title.y = element_blank())

Those in the non-data role appear to come largely from a more traditional software development/programming background.  The top tool in use after SQL is Java, followed by Python and JavaScript.  Hadoop is closely related as a Java-based framework. Excel is used more than R, which suggests a fascinating opportunity for R.  Spreadsheets are and will remain useful, but anyone involved in data munging and analysis can benefit from R.   As has been oft-trumpeted, scripted R programs are far more controlled and disciplined than clicking around in a spreadsheet. They promote reproducible, less error-prone results. Ruby ranks a bit higher than among the data users, and SAS/SPSS usage is minimal, which also fits with a programmer audience.
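These observations can be checked directly against the numbers in the table. The sketch below uses base R and named vectors (non_data and data_role are names chosen for the illustration) and leaves out the "All Respondents" row, since it is a total rather than a tool.

```r
# Percentages from the table, per tool, for each role.
non_data <- c(SQL = 29, R = 10, Python = 15, Excel = 11, Hadoop = 12,
              Java = 17, "Network/Graph" = 4, JavaScript = 13, Tableau = 4,
              D3 = 5, Mahout = 6, Ruby = 6, "SAS/SPSS" = 2)
data_role <- c(SQL = 42, R = 33, Python = 26, Excel = 25, Hadoop = 23,
               Java = 17, "Network/Graph" = 16, JavaScript = 7, Tableau = 15,
               D3 = 8, Mahout = 7, Ruby = 5, "SAS/SPSS" = 9)

# The four most reported tools in the non-data role.
top_non_data <- head(sort(non_data, decreasing = TRUE), 4)
print(top_non_data)

# Tools reported more often in the non-data role than in the data role.
more_in_non_data <- names(non_data)[non_data > data_role]
print(more_in_non_data)

# Excel versus R within the non-data role.
print(non_data[c("Excel", "R")])
```

Only JavaScript and Ruby come out higher in the non-data column, and Java is level across the two, which is consistent with the programmer-background reading above.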

To get a closer look at the non-data role, plot that subset on its own.

ggplot(data = data.science.tools.df[
  data.science.tools.df$Role=='NonData',],
       aes(x=reorder(DataTool, 
                     Respondents, 
                     function(x) max(x)
                     ), 
           y=Respondents)
  ) + 
  geom_bar(stat='identity') + 
  coord_flip() + 
  theme(axis.title.y = element_blank())

There are a number of tools conspicuously lacking in the survey.
  • Microsoft programming is completely absent.
  • Command line utilities (like awk, sed, sqlite3 and some others)
  • Perl
It would also be interesting to see related data about respondents that undoubtedly affects the results (mathematical proficiency, design abilities, typical data stores / database types accessed, typical audience for summarized data).

As I have been reviewing literature and educational resources on R, I am developing a stronger opinion that R, though a remarkably functional and powerful programming language, has not been presented well to a programming audience.  Most introductions to R are more palatable to statisticians and others who have data analysis to complete, but are not strongly aligned with programmer culture and expectations.  The fact that so many R packages are in essence full-fledged DSLs has further complicated R's presentation.  As I mentioned in my previous post, Hadley's new book and RStudio are significant inroads that highlight R in a more programmer-friendly way.  And the involvement of programmers at the Strata Conference and similar events will increase its visibility and accessibility as well.
