Pivot Tables for R: Try sqldf

May 21, 2013
By

(This article was first published on Cartopedia.co.uk » r-tutorial, and kindly contributed to R-bloggers)

Pivot tables are a a growing staple for analysis in excel yet they remain limited to the functionality which Microsoft has chosen to include. Typical operations are the inclusion of filters, choice over rows, columns, and maths operations. In R this functionality is not always so straightforward. That is until you try sqldf(). The sqldf package implements sql (Structured Query Language) query functionality onto traditional data.frames. A simple example being:

sqldf("select * from irisTable")
Sepal_Length Sepal_Width Petal_Length Petal_Width Species
1 5.10 3.50 1.40 0.20 setosa
2 4.90 3.00 1.40 0.20 setosa
3 4.70 3.20 1.30 0.20 setosa
4 4.60 3.10 1.50 0.20 setosa
5 5.00 3.60 1.40 0.20 setosa
6 5.40 3.90 1.70 0.40 setosa

This simple statement will request all columns (all indicated with *) from a table which you are basing your work on. Before we start we will create the sample data. We will use the Iris dataset and change column names to remove any full stops and spaces.

irisTable <- iris

names(irisTable) <- c("Sepal_length", "Sepal_Width", "Petal_length", "Petal_width", "Species")

However, what if we only want a selection of all columns? Further, we may wish to order our data by a particular column. This can be achieved by appending ‘order by <col_name> desc or asc.

sqldf("select Species, Sepal_length from irisTable")
Species Sepal_length
1 setosa 5.10
2 setosa 4.90
3 setosa 4.70
4 setosa 4.60
5 setosa 5.00
6 setosa 5.40

By specifying the column names as they are displayed in the data.frame we can select these specific columns. At this stage we have the basic functionality available in the standard Pivot Table tool. Moving on to aggregating our data we can use the ‘group by’ function as demonstrated below.

sqldf("select Species, count(*) from irisTable group by Species")
Species count(*)
1 setosa 50
2 versicolor 50
3 virginica 50

Having added the group by statement we are now bringing the real power of Pivot Tables to R. If we assume that Species is a list of row names we have brought together all of our records into a much smaller number of categories. Questions we may now want to ask are: What is the average Sepal Length of each Species? What is the Min, Max or average petal length? these function may be simply applied by wrapping the chosen variable with the function you wish to perform.

sqldf("select Species, avg(Sepal_length) from irisTable group by Species")
Species avg(Sepal_length)
1 setosa 5.01
2 versicolor 5.94
3 virginica 6.59

A full list of these functions can be found at W3 SQL. In addition to math operations a number of text based operations are available such as length() which determines the length of a string, UCASE() converts to upper-case and LCASE converts to lower case.

The next aspect of Pivot Tables are filters. Filters allow subsets of the data to be generated for example we want to study a specific district, or are looking for values between two points. You will notice that once we have used the ‘group by’ statement we use ‘having’ rather than ‘where’.

sqldf("select Species, Sepal_length from irisTable group by Species having Sepal_length between 5 and 6")
Species Sepal_length
1 setosa 5.00
2 versicolor 5.70
3 virginica 5.90
sqldf("select Species, Sepal_length from irisTable group by Species having Species='virginica'")
Species Sepal_length
1 virginica 5.90

The above two example highlight the use of where with two different approached. The first example is looking for values between, and including 0 and 10. The second example is looking for all records relating to Berkshire. It should be noted that where clauses are not case sensitive.

One issue which many users experience in R is the generated field names created during joins, or maths functions. In its basic format, sqldf will produce similar column names however this can be rectified using the ‘as’ statement.

sqldf("select Species, Sepal_length as 'Avg Sepal Length' from irisTable group by Species having Species='virginica'")
Species Avg Sepal Length
1 virginica 6.59

Though brief, this tutorial is able to address all and more of the functionality found within Pivot Tables and yet leaves many features of sqldf unused. Additional functionality includes table join, SQLite backend databases and data recording.

To leave a comment for the author, please follow the link and comment on his blog: Cartopedia.co.uk » r-tutorial.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.