Manipulating Data Frames Using sqldf – A Brief Overview

August 8, 2012
By

(This article was first published on Anything but R-bitrary, and kindly contributed to R-bloggers)

By Josh Mills

Introduction

For those who are learning R and who may be well-versed in SQL, the sqldf package provides a mechanism to manipulate R data frames using SQL. Even for experienced R programmers, sqldf can be a useful tool for data manipulation. This site provides a useful introduction to SQL. [SQLCourse.com 2012]

The following packages will be used in this document:
  • sqldf - A package that allows manipulation of R data frames with SQL (as well as connectivity with a limited set of database engines). [Grothendieck 2012]
  • plyr - A useful package for aggregating and summarizing data over multiple subgroups, with more advanced applications. [Wickham 2011]
Load these packages into memory.


Data Sets Used

Highway Data (crashes.csv and roads.csv - Click to Download)

These are fictional data sets containing crash and highway data with the following variables:
Crash data (crashes.csv)
  • Year - The year in which the observation was taken
  • Road - The name of the road being studied
  • N_Crashes - The number of crashes on the road during that year
  • Volume - Average annual daily traffic (AADT) volumes on the road during that year. See this site for a formal definition. [North Carolina Department of Transportation 2012]
Road data (roads.csv)
  • Road - The name of the road being studied
  • District - The administrative district responsible for the road's upkeep and maintenance
  • Length - Length of the road in miles


Joins and Merges with sqldf

Read in and explore the data to get a feel for the data's structure.

setwd("W:/Data Mining and Modeling/Applied Analytics - R Discussion/Related Files")
crashes <- read.csv("crashes.csv")
roads <- read.csv("roads.csv")
head(crashes)
##   Year          Road N_Crashes Volume
## 1 1991 Interstate 65 25 40000
## 2 1992 Interstate 65 37 41000
## 3 1993 Interstate 65 45 45000
## 4 1994 Interstate 65 46 45600
## 5 1995 Interstate 65 46 49000
## 6 1996 Interstate 65 59 51000
tail(crashes)
##     Year           Road N_Crashes Volume
## 105 2007 Interstate 275 32 21900
## 106 2008 Interstate 275 21 21850
## 107 2009 Interstate 275 25 22100
## 108 2010 Interstate 275 24 21500
## 109 2011 Interstate 275 23 20300
## 110 2012 Interstate 275 22 21200
print(roads)
##            Road       District Length
## 1 Interstate 65 Greenfield 262
## 2 Interstate 70 Vincennes 156
## 3 US-36 Crawfordsville 139
## 4 US-40 Greenfield 150
## 5 US-52 Crawfordsville 172

Performing joins is one of the most common operations in SQL. Left joins return all rows in the “left-hand” table - the crash data set in this case, whereas right joins return all rows in the “right-hand” table - the road data set in this case. Inner joins return only rows with matching data for the common variable, and full outer joins return all rows in all data sets, even if there are rows without matches. Currently, sqldf does not support right joins or full outer joins.

It is useful to format SQL statements with spaces and line breaks for readability and to store the query in a character string. The following statement will perform a left join of the crash data set to the road data set based on the common variable Road.

join_string <- "select
crashes.*
, roads.District
, roads.Length
from crashes
left join roads
on crashes.Road = roads.Road"

A new data frame, crashes_join_roads, will be created using the sqldf statement. The sqldf statement, at minimum, requires a character string with the SQL operation to be performed. The stringsAsFactors argument will force categorical variables (like Road and District) to have the class character rather than factor.

crashes_join_roads <- sqldf(join_string,stringsAsFactors = FALSE)
## Loading required package: tcltk
head(crashes_join_roads)
##   Year          Road N_Crashes Volume   District Length
## 1 1991 Interstate 65 25 40000 Greenfield 262
## 2 1992 Interstate 65 37 41000 Greenfield 262
## 3 1993 Interstate 65 45 45000 Greenfield 262
## 4 1994 Interstate 65 46 45600 Greenfield 262
## 5 1995 Interstate 65 46 49000 Greenfield 262
## 6 1996 Interstate 65 59 51000 Greenfield 262
tail(crashes_join_roads)
##     Year           Road N_Crashes Volume District Length
## 105 2007 Interstate 275 32 21900 <NA> NA
## 106 2008 Interstate 275 21 21850 <NA> NA
## 107 2009 Interstate 275 25 22100 <NA> NA
## 108 2010 Interstate 275 24 21500 <NA> NA
## 109 2011 Interstate 275 23 20300 <NA> NA
## 110 2012 Interstate 275 22 21200 <NA> NA

By using an inner join, only matching rows will be kept.

join_string2 <- "select
crashes.*
, roads.District
, roads.Length
from crashes
inner join roads
on crashes.Road = roads.Road"
crashes_join_roads2 <- sqldf(join_string2, stringsAsFactors = FALSE)
head(crashes_join_roads2)
##   Year          Road N_Crashes Volume   District Length
## 1 1991 Interstate 65 25 40000 Greenfield 262
## 2 1992 Interstate 65 37 41000 Greenfield 262
## 3 1993 Interstate 65 45 45000 Greenfield 262
## 4 1994 Interstate 65 46 45600 Greenfield 262
## 5 1995 Interstate 65 46 49000 Greenfield 262
## 6 1996 Interstate 65 59 51000 Greenfield 262
tail(crashes_join_roads2)
##    Year  Road N_Crashes Volume       District Length
## 83 2007 US-36 49 24000 Crawfordsville 139
## 84 2008 US-36 52 24500 Crawfordsville 139
## 85 2009 US-36 55 24700 Crawfordsville 139
## 86 2010 US-36 35 23000 Crawfordsville 139
## 87 2011 US-36 33 21000 Crawfordsville 139
## 88 2012 US-36 31 20500 Crawfordsville 139

The merge statement in base R can perform the equivalent of inner and left joins, as well as right and full outer joins, which are unavailable in sqldf.

crashes_merge_roads <- merge(crashes, roads, by = c("Road"))
head(crashes_merge_roads)
##            Road Year N_Crashes Volume   District Length
## 1 Interstate 65 2000 95 74000 Greenfield 262
## 2 Interstate 65 1997 76 52000 Greenfield 262
## 3 Interstate 65 1998 90 58000 Greenfield 262
## 4 Interstate 65 1999 95 65000 Greenfield 262
## 5 Interstate 65 1991 25 40000 Greenfield 262
## 6 Interstate 65 1992 37 41000 Greenfield 262
tail(crashes_merge_roads)
##     Road Year N_Crashes Volume   District Length
## 83 US-40 2003 94 55200 Greenfield 150
## 84 US-40 2004 25 55300 Greenfield 150
## 85 US-40 2009 67 65000 Greenfield 150
## 86 US-40 2010 102 67000 Greenfield 150
## 87 US-40 2011 87 67500 Greenfield 150
## 88 US-40 2012 32 67500 Greenfield 150
crashes_merge_roads2 <- merge(crashes, roads, by = c("Road"), all.x = TRUE)
head(crashes_merge_roads2)
##             Road Year N_Crashes Volume District Length
## 1 Interstate 275 1994 21 21200 <NA> NA
## 2 Interstate 275 1995 28 23200 <NA> NA
## 3 Interstate 275 1996 22 20000 <NA> NA
## 4 Interstate 275 1997 27 18000 <NA> NA
## 5 Interstate 275 1998 21 19500 <NA> NA
## 6 Interstate 275 1999 22 21000 <NA> NA
tail(crashes_merge_roads2)
##      Road Year N_Crashes Volume   District Length
## 105 US-40 2003 94 55200 Greenfield 150
## 106 US-40 2004 25 55300 Greenfield 150
## 107 US-40 2009 67 65000 Greenfield 150
## 108 US-40 2010 102 67000 Greenfield 150
## 109 US-40 2011 87 67500 Greenfield 150
## 110 US-40 2012 32 67500 Greenfield 150
crashes_merge_roads3 <- merge(crashes, roads, by = c("Road"), all.y = TRUE)
head(crashes_merge_roads3)
##            Road Year N_Crashes Volume   District Length
## 1 Interstate 65 2000 95 74000 Greenfield 262
## 2 Interstate 65 1997 76 52000 Greenfield 262
## 3 Interstate 65 1998 90 58000 Greenfield 262
## 4 Interstate 65 1999 95 65000 Greenfield 262
## 5 Interstate 65 1991 25 40000 Greenfield 262
## 6 Interstate 65 1992 37 41000 Greenfield 262
tail(crashes_merge_roads3)
##     Road Year N_Crashes Volume       District Length
## 84 US-40 2004 25 55300 Greenfield 150
## 85 US-40 2009 67 65000 Greenfield 150
## 86 US-40 2010 102 67000 Greenfield 150
## 87 US-40 2011 87 67500 Greenfield 150
## 88 US-40 2012 32 67500 Greenfield 150
## 89 US-52 NA NA NA Crawfordsville 172
crashes_merge_roads4 <- merge(crashes, roads, by = c("Road"), all.x = TRUE, 
all.y = TRUE)
head(crashes_merge_roads4)
##             Road Year N_Crashes Volume District Length
## 1 Interstate 275 1994 21 21200 <NA> NA
## 2 Interstate 275 1995 28 23200 <NA> NA
## 3 Interstate 275 1996 22 20000 <NA> NA
## 4 Interstate 275 1997 27 18000 <NA> NA
## 5 Interstate 275 1998 21 19500 <NA> NA
## 6 Interstate 275 1999 22 21000 <NA> NA
tail(crashes_merge_roads4)
##      Road Year N_Crashes Volume       District Length
## 106 US-40 2004 25 55300 Greenfield 150
## 107 US-40 2009 67 65000 Greenfield 150
## 108 US-40 2010 102 67000 Greenfield 150
## 109 US-40 2011 87 67500 Greenfield 150
## 110 US-40 2012 32 67500 Greenfield 150
## 111 US-52 NA NA NA Crawfordsville 172

Note how the order of the rows in the data frames were rearranged when using the merge statement.

The sqldf statement can process SQLite commands, which include most of the standard syntax used in ANSI SQL, except for some of the join operations outlined previously mentioned. [SQLite.org 2012]

Modifying the inner join query to include a where is the equivalent of combining merge and subset statements.

join_string2 <- "select
crashes.*
, roads.District
, roads.Length
from crashes
inner join roads
on crashes.Road = roads.Road
where crashes.Road = 'US-40'"
crashes_join_roads4 <- sqldf(join_string2,stringsAsFactors = FALSE)
head(crashes_join_roads4)
##   Year  Road N_Crashes Volume   District Length
## 1 1991 US-40 46 21000 Greenfield 150
## 2 1992 US-40 101 21500 Greenfield 150
## 3 1993 US-40 76 23000 Greenfield 150
## 4 1994 US-40 72 21000 Greenfield 150
## 5 1995 US-40 75 24000 Greenfield 150
## 6 1996 US-40 136 23500 Greenfield 150
tail(crashes_join_roads4)
##    Year  Road N_Crashes Volume   District Length
## 17 2007 US-40 45 59500 Greenfield 150
## 18 2008 US-40 23 61000 Greenfield 150
## 19 2009 US-40 67 65000 Greenfield 150
## 20 2010 US-40 102 67000 Greenfield 150
## 21 2011 US-40 87 67500 Greenfield 150
## 22 2012 US-40 32 67500 Greenfield 150


Aggregation Functions and Limitations of sqldf

Aggregate functions available using SQLite can be used through the use of a group by clause.

group_string <- "select
crashes.Road
, avg(crashes.N_Crashes) as Mean_Crashes
from crashes
left join roads
on crashes.Road = roads.Road
group by 1"
sqldf(group_string)
##             Road Mean_Crashes
## 1 Interstate 275 24.95
## 2 Interstate 65 107.82
## 3 Interstate 70 65.18
## 4 US-36 48.00
## 5 US-40 68.68

The available aggregation functions within SQLite or ANSI SQL are limited, however. While sqldf can make certain data manipulation operations easier, more advanced data manipulation tasks and calculations must be performed in R, such as using Hadley Wickham's plyr package.

ddply(crashes_merge_roads,
c("Road"),
function(X) data.frame(Mean_Crashes = mean(X$N_Crashes),
Q1_Crashes = quantile(X$N_Crashes, 0.25),
Q3_Crashes = quantile(X$N_Crashes, 0.75),
Median_Crashes = quantile(X$N_Crashes, 0.50))
)
##            Road Mean_Crashes Q1_Crashes Q3_Crashes Median_Crashes
## 1 Interstate 65 107.82 63.25 140.25 108.5
## 2 Interstate 70 65.18 52.00 75.50 66.5
## 3 US-36 48.00 42.00 57.25 47.0
## 4 US-40 68.68 45.25 90.75 70.0

In short, the sqldf package can make it easy for SQL users to begin making the transition to R. The package provides a convenient mechanism for data manipulation in R using SQL. While there are limitations to the use of SQL within R, the added convenience provides a useful alternative to using standard R functions. In addition, SQL statements used can easily be modified to function in a large-scale database environment such as Teradata or Netezza.

Download this document in R Markdown format.

References
  • SQLCourse.com (2012). SQLCourse.com®: Interactive Online SQL Training. Link
  • G. Grothendieck (2012). sqldf: Perform SQL Selects on R Data Frames. R package version 0.4-6.4. Link
  • H. Wickham (2011). The Split-Apply-Combine Strategy for Data Analysis. Journal of Statistical Software, 40(1), 1-29. Link
  • North Carolina Department of Transportation (2012). Training Material for Traffic Engineering Accident Analysis System (TEAAS). Link
  • SQLite.org (2012). SQL As Understood By SQLite. Link
  • R Development Core Team (2012). R: A language and environment for statistical  computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN  3-900051-07-0. Link
This document was generated primarily using R Markdown and knitr.

To leave a comment for the author, please follow the link and comment on his blog: Anything but R-bitrary.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.