R and (Software) Relatives

[This article was first published on R-Chart, and kindly contributed to R-bloggers.]

Post also available with code executed inline at rpubs.com.
O’Reilly recently published the results of a survey of Strata Conference attendees covering tool usage and salary.  The entire survey is available for download.  In the survey results, R was heralded as second only to SQL as a tool used by conference attendees.  A chart from the survey appeared in this post and elsewhere online.

These two technologies overlap a bit but are highly complementary.  SQL can be used to quickly extract data from relational databases and to filter, order and summarize it.  SQL queries can be executed from R itself, or from another language to produce a CSV file that can then be imported into R.  R can do additional filtering, ordering and summarizing, and can be used for more sophisticated analysis, reshaping of data and presentation in a final form.
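
As a minimal sketch of that division of labour, the snippet below runs a SQL query from R against an in-memory SQLite database and then does one further step in R. The DBI and RSQLite packages, the `tools` table and its column names are assumptions made for illustration and do not come from the survey; the four percentages are data-role figures quoted later in this post.

```r
# Assumes the DBI and RSQLite packages are installed (an assumption, not part of the survey).
library(DBI)

# Hypothetical in-memory database holding a few of the survey figures.
con <- dbConnect(RSQLite::SQLite(), ':memory:')
dbWriteTable(con, 'tools', data.frame(
  tool = c('SQL', 'R', 'Python', 'Ruby'),
  pct  = c(42, 33, 26, 5)
))

# SQL does the filtering and ordering...
popular <- dbGetQuery(con, 'SELECT tool, pct FROM tools WHERE pct > 20 ORDER BY pct DESC')
dbDisconnect(con)

# ...and R takes over for anything further, here a simple derived column.
popular$share_of_top <- popular$pct / max(popular$pct)
print(popular)
```

The same query result could just as well be written to a CSV file by another program and read in with read.csv.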

As part of an in-progress R screencast, I wanted to speculate a bit about the most common “clusters” of technologies that are popular among R users (or at least among the Strata Conference respondents).  Although the raw data from the survey is not available, the graph cited in the survey results includes enough information to do a bit of additional analysis.  I reconstructed the original graph as a starting point, with the intention of splitting out the data and non-data roles into faceted bar charts.  This would make the usage reported among non-data respondents a bit clearer.

So the first step was to replicate the original plot with a few cosmetic and editorial updates – no “Respodents” appear in the new version.  This involved the use of reshape2 and ggplot2.

library(reshape2)
library(ggplot2)

With these available, I created the data frame by combining a few vectors containing the data of interest.

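data.science.tools <- as.data.frame(t(rbind(
  c('All Respondents', 'SQL', 'R', 'Python', 'Excel', 'Hadoop', 'Java',
    'Network/Graph', 'JavaScript', 'Tableau', 'D3', 'Mahout', 'Ruby', 'SAS/SPSS'),
  c(57, 42, 33, 26, 25, 23, 17, 16, 7, 15, 8, 7, 5, 9),
  c(43, 29, 10, 15, 11, 12, 17, 4, 13, 4, 5, 6, 6, 2))))

names(data.science.tools) <- c('DataTool', 'Data', 'NonData')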

At this point, the results match up with what appeared in the chart from the O’Reilly report.  The numbers represent percent of respondents that use the given tool.


          DataTool     Data     NonData
1  All Respondents       57          43
2              SQL       42          29
3                R       33          10
4           Python       26          15
5            Excel       25          11
6           Hadoop       23          12
7             Java       17          17
8    Network/Graph       16           4
9       JavaScript        7          13
10         Tableau       15           4
11              D3        8           5
12          Mahout        7           6
13            Ruby        5           6
14        SAS/SPSS        9           2

The data is easier to deal with if reshaped using melt.

data.science.tools.df <- melt(data.science.tools, id.vars = 'DataTool',
                              variable.name = 'Role', value.name = 'Respondents')

The resulting data frame:
          DataTool    Role Respondents
1  All Respondents    Data          57
2              SQL    Data          42
3                R    Data          33
4           Python    Data          26
5            Excel    Data          25
6           Hadoop    Data          23
7             Java    Data          17
8    Network/Graph    Data          16
9       JavaScript    Data           7
10         Tableau    Data          15
11              D3    Data           8
12          Mahout    Data           7
13            Ruby    Data           5
14        SAS/SPSS    Data           9
15 All Respondents NonData          43
16             SQL NonData          29
17               R NonData          10
18          Python NonData          15
19           Excel NonData          11
20          Hadoop NonData          12
21            Java NonData          17
22   Network/Graph NonData           4
23      JavaScript NonData          13
24         Tableau NonData           4
25              D3 NonData           5
26          Mahout NonData           6
27            Ruby NonData           6
28        SAS/SPSS NonData           2
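
To make the wide-to-long reshape concrete, here is a rough base-R equivalent on two rows of the table. It only illustrates what melt produces for this layout, not how reshape2 is implemented; the `wide` and `long` names are made up for the example.

```r
# Two rows of the wide table, already numeric for simplicity.
wide <- data.frame(
  DataTool = c('SQL', 'R'),
  Data     = c(42, 33),
  NonData  = c(29, 10),
  stringsAsFactors = FALSE
)

# Stack the two measure columns and record which column each value came from.
long <- data.frame(
  DataTool    = rep(wide$DataTool, times = 2),
  Role        = rep(c('Data', 'NonData'), each = nrow(wide)),
  Respondents = c(wide$Data, wide$NonData),
  stringsAsFactors = FALSE
)
print(long)
```

Each tool now appears once per role, which is the shape ggplot2 expects when a column such as Role drives the fill or the facets.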

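Convert the data into the required numeric type:
# as.character() first, so the conversion is also safe if the column is a factor
data.science.tools.df$Respondents <- as.numeric(as.character(data.science.tools.df$Respondents))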
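
The conversion matters because every column started out as text. Here is a small sketch of the pitfall, assuming the values ended up as a factor (the default for character data in R versions before 4.0): calling as.numeric directly on a factor returns the internal level codes rather than the printed numbers.

```r
# Percentages stored as a factor, as older R versions would do for text columns.
pct <- factor(c('57', '42', '9'))

# Level codes, not the values: the levels sort as text ('42' < '57' < '9').
codes <- as.numeric(pct)

# Going through character first recovers the intended numbers.
values <- as.numeric(as.character(pct))

print(codes)
print(values)
```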
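Create the original chart:
ggplot(data = data.science.tools.df,
       # aes() was lost from this listing; the mapping, including fill = Role, is a reconstruction
       aes(x = reorder(DataTool, Respondents, function(x) max(x)),
           y = Respondents, fill = Role)
       ) +
  geom_bar(stat='identity') +
  coord_flip() +
  theme(axis.title.y = element_blank())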
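
The plotting code passes `function(x) max(x)` as a summary function. Base R's reorder() accepts a function of exactly this kind and sorts factor levels by a per-level summary, which is the usual way to get bars in order of size. Assuming that is its role here, the sketch below shows the behaviour on a few of the survey figures.

```r
# A few tools with their Data and NonData percentages (long format).
tool <- factor(c('SQL', 'R', 'Java', 'SQL', 'R', 'Java'))
pct  <- c(42, 33, 17, 29, 10, 17)

# Order the levels by the largest value seen for each tool.
ordered_tool <- reorder(tool, pct, function(x) max(x))

# Levels now run from the smallest maximum to the largest.
print(levels(ordered_tool))
```

With coord_flip(), the last level is drawn at the top, so the most widely used tool ends up as the first bar.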
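Now do the faceted example:

ggplot(data = data.science.tools.df,
       # aes() was lost from this listing; the mapping, including fill = Role, is a reconstruction
       aes(x = reorder(DataTool, Respondents, function(x) max(x)),
           y = Respondents, fill = Role)
       ) +
  geom_bar(stat='identity') +
  coord_flip() +
  facet_grid(. ~ Role) +
  theme(axis.title.y = element_blank())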
Those in the non-data role appear to come largely from a more traditional software development/programming background.  The top tool in use after SQL is Java, followed by Python and JavaScript.  Hadoop is closely related as a Java-based framework.  Excel is used slightly more than R (11% versus 10%), which suggests a fascinating opportunity for R.  Spreadsheets are and will remain useful, but anyone involved in data munging and analysis can benefit from R.  As has been oft-trumpeted, scripted R programs are far more controlled and disciplined than clicking around in a spreadsheet.  They promote reproducible, less error-prone results.  Ruby ranks a bit higher than among the data users, and SAS/SPSS usage is minimal, which also fits with a programmer audience.
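
These comparisons can be checked directly against the percentages. This is a minimal base-R sketch using the figures from the table earlier in the post; the `tools` data frame and its `Gap` column are names introduced here for illustration.

```r
# Per-tool percentages copied from the table (the 'All Respondents' row is left out).
tools <- data.frame(
  DataTool = c('SQL', 'R', 'Python', 'Excel', 'Hadoop', 'Java', 'Network/Graph',
               'JavaScript', 'Tableau', 'D3', 'Mahout', 'Ruby', 'SAS/SPSS'),
  Data     = c(42, 33, 26, 25, 23, 17, 16, 7, 15, 8, 7, 5, 9),
  NonData  = c(29, 10, 15, 11, 12, 17, 4, 13, 4, 5, 6, 6, 2),
  stringsAsFactors = FALSE
)

# Rank tools within the non-data role, highest usage first.
nondata_rank <- tools[order(-tools$NonData), c('DataTool', 'NonData')]
print(nondata_rank)

# Positive gap: more common among non-data respondents than among data respondents.
tools$Gap <- tools$NonData - tools$Data
print(tools$DataTool[tools$Gap > 0])
```

Only JavaScript and Ruby come out with a positive gap, which fits the picture of a programmer audience.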

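To get a closer look at the “non-Data” role:
ggplot(data = data.science.tools.df[data.science.tools.df$Role == 'NonData', ],
       # subset condition and aes() were lost from this listing; both are reconstructions
       aes(x = reorder(DataTool, Respondents, function(x) max(x)),
           y = Respondents)
  ) +
  geom_bar(stat='identity') +
  coord_flip() +
  theme(axis.title.y = element_blank())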
There are a number of tools conspicuously lacking in the survey.
  • Microsoft programming is completely absent.
  • Command line utilities (like awk, sed, sqlite3 and some others)
  • Perl
It would also be interesting to see related data about respondents that undoubtedly impact the results (mathematical proficiency, design abilities, typical data stores / database types accessed, typical audience for summarized data).

As I have been reviewing literature and educational resources on R, I am developing a stronger opinion that R, though a remarkably functional and powerful programming language, has not been presented well to a programming audience.  Most introductions to R are more palatable to statisticians and others who have data analysis to complete, but are not strongly aligned with programmer culture and expectations.  The fact that so many R packages are in essence full-fledged DSLs has further complicated R’s presentation.  As I mentioned in my previous post, Hadley’s new book and RStudio are significant inroads that highlight R in a more programmer-friendly way.  And the involvement of programmers at the Strata Conference and similar events will increase its visibility and accessibility as well.
