The R community is unique as programming communities go. Many users of R come from academia and have a relatively extensive mathematical background. The R community has developed in relative isolation from some other areas of programming that have been widely adopted by business. To many business users, working with data is synonymous with dealing with a relational database system (RDBMS). Yet none of the R books that I read use relational databases at all (and online resources on the subject are limited).
It is not that such integration is not possible, there are many packages on CRAN that relate to databases. Oracle has even integrated its data mining package with R using RODM in recent days. There is a rather obvious overlap of concerns between those in the RDMS crowd and R users. The widespread use of RDBMS systems appears to be restricted more to business environments. This is changing somewhat in recent days as alternatives to relational models are explored – particularly when dealing with large data sets.
As someone coming to R from an RDBMS background, I noticed a significant difference in the terminology and concepts that exist concerning the manipulation of data. R users reshape data frames while database users construct queries and return result sets. In either case, we are dealing with the same basic idea – essentially a rectangular grid containing a set of values obtained from a data source. And we have the same basic goal, modify the structure of the grid without changing the essential meaning of the underlying data.
There is tremendous value in learning the R terms and functions involved in manipulating data. There are some things that are trivial to do in R that are a large amount of work in SQL. In the meantime, there is a package that I found useful – that also might be useful to RDBMS developers – those who “think in SQL” but want to use R for additional statistical analysis or creation of visualizations. It is sqldf, an R package for runing SQL statements on data frames.
Examples speak for themselves in large part.
# Load the package
# Use the titanic data set
Incidentally, I really appreciate the tradition in the statistics (and therefore R) communities of using well known published data sets to illustrate analysis and programming techniques. This is not done consistently in the RDBMS community, although the Oracle community tends to use sample schemas (SCOTT/TIGER, HR, etc) and other RDBMS vendors have similar sample schemas for training. There are not nearly as well know or consistently used as data sets in the statistics community (titanic, iris, etc). We will use the titanic data set in this example.
As an RDBMS developer, I might be interested in knowing the number of people that are a particular age in the data set. That would translate in my mind as selecting the count of each age listed in the titanic3 data set. A GROUP BY clause would be required to produce a COUNT(*) for each age listed.
This is better expressed recognizing that NA and NULL correspond:
sqldf(‘select age, count(*) from titanic3 where age is not null group by age’)
This returns a listing of results – very similar to what I would do in a RDBMS environment. Of course in R, this sort of information might be handled differently. Rather than having a listing that runs off the screen, I can view a histogram that reflects the entire distribution in a manner that is viewable in a small amount of space. In this case
DF=sqldf(‘select age from titanic3 where age != “NA”‘)
The GROUP BY summarization using SQL is one example of a relatively straightforward summarization available using SQL. Lets say that I am now interested in focusing in on the survival rate of 29 year olds on the titanic.
DF=sqldf(‘select count(*) total from titanic3 where age=29 group by survived’)
Using the t function to perform a matrix transform which is essentially the same as producing a pivot of the original data.
total 17 13
The sqldf package is a great resource for RDBMS folks who are working in R. I am not sure that it actually provides any functionality that is not readily available in a more R-ish form, but it certainly provides a way of conceptualizing problems and implementing solutions that is familiar to database developers. Otherwise stated: it is not so matter whether R or SQL can express it but how easy it is to express. Some queries are simply much easier to express in one notation or the other.
The package will be particularly helpful when introducing R to database developers – and could be helpful when R and database developers are collaborating by providing a common means of communication.