Translating between R and SQL: the basics
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
An introductory comparison of using the two languages.
Background
R was made especially for data analysis and graphics. SQL was made especially for databases. They are allies.
The data structure in R that most closely matches a SQL table is a data frame. The terms rows and columns are used in both.
A mashup
There is an R package called sqldf
that allows you to use SQL commands to extract data from an R data frame. We will use this package in the examples. There are two basic steps to using an R package:
- it must be installed on your machine (once)
- it must be available in each R session where it is used
You can use sqldf
by doing (one time only, and assuming an internet connection):
install.packages("sqldf")
Then in each R session where you want to use it:
require(sqldf)
To simplify the examples, we’ll slightly modify one of the inbuilt data frames:
myCO2 <- CO2 attributes(myCO2) <- attributes(CO2)[ c("names", "row.names", "class")] class(myCO2) <- "data.frame"
Note that the character between C and 2 is a capital-O and not a zero. The CO2
object has a complicated value for its class but the result of the sqldf
function has only "data.frame"
in its class. We want to cleanly see if two objects are the same, and hence we want the classes to match.
Column names
In R the colnames
function returns the names of the columns:
> colnames(myCO2) [1] "Plant" "Type" "Treatment" "conc" [5] "uptake"
The result is a vector of character strings.
Subsetting columns
Columns in SQL are also called “fields”. In R it is rather common for columns to be called “variables”.
In SQL the subset of columns is determined by select
. Here we want to get the Type and conc columns:
s01 <- sqldf("select Type, conc from myCO2")
Subsetting in R (commonly called “subscripting”) is done with square brackets. When subscripting a data frame there will be two places inside the square brackets separated by a comma. The R equivalent of the command above is:
r01 <- myCO2[, c("Type", "conc")]
The first part inside the square brackets (corresponding to rows) is empty. The second part (corresponding to columns) has a character vector with the names of the two columns we want.
We can test that r01
and s01
are the same:
> all.equal(s01, r01) [1] TRUE
In R the vector of column names could be created as an object and then used in the subscripting:
someCols <- c("Type", "conc") r01b <- myCO2[, someCols]
The r01
and r01b
objects are the same.
All columns
An asterisk is used in SQL to indicate that you want all columns:
s02 <- sqldf("select * from myCO2")
When you want all items in a dimension in R, you leave it blank:
r02 <- myCO2[ , ]
You might have been able to guess that because we’ve seen that done for rows already. Note that spaces almost never matter in R — the command above has spaces either side of the comma, but would be exactly the same with no spaces.
Only one column
How to select only a single column is no surprise in either language:
s03 <- sqldf("select Type from myCO2") r03 <- myCO2[ , "Type"]
But there is a surprise when you test if these two objects are equal:
all.equal(s03, r03)
The command above results in a bunch of stuff, indicating they are quite different.
The r03
object is not a data frame, it is an object of the type of the column. While surprising to those used to SQL, this is quite natural for R’s purposes. For example, we give the mean
function a vector of numbers, not a data frame:
> mean(myCO2[, "uptake"]) [1] 27.2131
You can get a one-column data frame by slightly modifying the command:
r03d <- myCO2[ , "Type", drop=FALSE]
The s03
and r03d
objects are the same.
Data frames are not natural inputs to some functions:
> mean(myCO2[, "uptake", drop=FALSE]) [1] NA Warning message: In mean.default(myCO2[, "uptake", drop = FALSE]) : argument is not numeric or logical: returning NA
Case sensitivity
SQL is not case-sensitive:
s04 <- sqldf("select type, coNC from myCO2")
s04
is the same as s01
.
On the other hand, R is case-sensitive:
> r04 <- myCO2[, c("type", "coNC")] Error in `[.data.frame`(myCO2, , c("type", "coNC")) : undefined columns selected
R extensions
We've seen how to select columns of an R data frame with the names of the columns. There are other ways of selecting columns as well.
The order of the columns in an R data frame is of significance. You can select columns by number. For example, you can select column 5 and then column 2:
myCO2[, c(5, 2)]
You can use negative numbers to exclude columns. Here you are asking for all columns except the first and the fourth:
myCO2[, c(-1, -4)]
Column selection in R can also be done with logical values:
myCO2[, c(TRUE, FALSE, FALSE, TRUE, FALSE)]
Those logical values can be created by a command:
myCO2[, colnames(myCO2) > "d"]
Subsetting rows
In SQL a common synonym for “row” is “record”. In R a common synonym is “observation”.
Conditions
The common way of getting a subset of rows in SQL is with the where
command:
s05 <- sqldf("select * from myCO2 where uptake < 20")
In R the equivalent of the where
is put in the first position inside the square brackets:
r05 <- myCO2[ myCO2[, "uptake"] < 20, ]
s05
and r05
are in most respects the same. The difference is that the row names are different. r05
has the row names from the original data frame while s05
has new ones that are sequential from 1.
with
The command that created r05
is a little convoluted (but logical once you stare at it long enough). The with
function allows a command that is more in the spirit of what is done in SQL:
r05w <- with(myCO2, myCO2[uptake < 20, ]) # same as r05
Inside the with
call the columns of the data frame named in the first argument can be used as objects. In this example uptake
is used directly instead of pulling that column out of the data frame.
Logical operators
Logical comparisons in SQL are combined with AND
and OR
:
s06 <- sqldf("select * from myCO2 where uptake < 20 and Type='Quebec'")
Also note that testing equality is with =
.
In R this type of ‘and’ operation is done with &
and the ‘or’ is |
:
r06 <- with(myCO2, myCO2[uptake < 20 & Type == 'Quebec', ])
A possible trouble spot is that equality in R is tested with ==
(while =
is an assignment operator).
The s06
and r06
objects are the same except for their row names.
First few
The limit
command in SQL limits the number of rows that are given:
s07 <- sqldf("select * from myC02 limit 6")
One way to see just the column names is to limit the number of rows to zero.
You can get the first few rows in R with head
:
r07 <- head(myCO2)
The tail
function gives you the last few rows, and the corner
function is a logical extension of head
and tail
.
Row names versus numbers
A source of possible confusion is that row names are character even though they are, by default, representations of numbers. Let’s experiment with r06
:
> r06 Plant Type Treatment conc uptake 1 Qn1 Quebec nonchilled 95 16.0 8 Qn2 Quebec nonchilled 95 13.6 15 Qn3 Quebec nonchilled 95 16.2 22 Qc1 Quebec chilled 95 14.2 29 Qc2 Quebec chilled 95 9.3 36 Qc3 Quebec chilled 95 15.1
Select the first three rows:
> r06[1:3,] Plant Type Treatment conc uptake 1 Qn1 Quebec nonchilled 95 16.0 8 Qn2 Quebec nonchilled 95 13.6 15 Qn3 Quebec nonchilled 95 16.2
Now let’s select the characters one through three:
> r06[c 1=""2"," 2=""3")," language="("1","][/c] Plant Type Treatment conc uptake 1 Qn1 Quebec nonchilled 95 16.0 NA <NA> <NA> <NA> NA NA 36 Qc3 Quebec chilled 95 15.1
What happened? The first row is correct — the first row name is 1. In the second row it looked for a row name called “2″ and didn’t find one, so it put in missing values. The third row is even weirder: it looked for a row name called “3″; there was a single row name starting with “3″ so it did a partial match and gave us that row.
Trying to give numbers instead of the actual names doesn’t necessarily work either:
> r06[c 1="8," 2="15)," language="(1,"][/c] Plant Type Treatment conc uptake 1 Qn1 Quebec nonchilled 95 16 NA <NA> <NA> <NA> NA NA NA.1 <NA> <NA> <NA> NA NA
Additional details
NULL
In SQL NULL
means missing value. Confusingly R also has NULL
but the equivalent of SQL NULL
is NA
in R.
Let’s create some data to play with:
r08 <- r06 r08[2:4, 1] <- NA r08[5, 4] <- NA
This looks like:
> r08 Plant Type Treatment conc uptake 1 Qn1 Quebec nonchilled 95 16.0 8 <NA> Quebec nonchilled 95 13.6 15 <NA> Quebec nonchilled 95 16.2 22 <NA> Quebec chilled 95 14.2 29 Qc2 Quebec chilled NA 9.3 36 Qc3 Quebec chilled 95 15.1
Get the rows where Plant is not missing:
s09 <- sqldf("select * from r08 where plant is not null") r09 <- with(r08, r08[!is.na(Plant), ])
We can also get the rows where Plant is missing:
s10 <- sqldf("select * from r08 where plant is null") r10 <- with(r08, r08[is.na(Plant), ])
To get the rows that have no missing values in R, you can do:
> na.omit(r08) Plant Type Treatment conc uptake 1 Qn1 Quebec nonchilled 95 16.0 36 Qc3 Quebec chilled 95 15.1
Quotes
In SQL single quotes are used to delimit character strings. A single quote inside a string is given with two single quotes in a row. Some implementations allow you to specify the delimiter.
In R either single quotes or double quotes can be used. You can use whichever you find more convenient but R always prints using double quotes. The backslash is used to escape a quote character that is the same as the delimiting quote:
> c("he's", 'he\'s', "she has \"it\"") [1] "he's" "he's" "she has \"it\""
Semicolons
Semicolons are sometimes used at the end of statements in both SQL and R.
Some SQL implementations require a semicolon at the end of a statement.
Semicolons are used to separate R commands on the same line. They can be used after all R commands, but probably shouldn’t be.
Single subscript
Note that both
r06[1:3, ]
and
r06[1:3]
are legal R commands, but they do different things. The first gives the first three rows and all of the columns; the second gives all of the rows and the first three columns (for a reason you need not be concerned about initially).
Efficiency
If a data frame is large and the manipulation is complex, then R can be inefficient. Why should someone with access to a database put up with such inefficiency? One reason is the flexibility that R gives you. It might surprise some people that not all data naturally fit into a structure of rows and columns. Besides it usually doesn’t matter. As has been asked facetiously by a certain someone: “What are you going to do with that extra millisecond?”
But if there are millions of those milliseconds, then you might start to care. The data.table
package provides an alternative form of data frames that is highly efficient.
Resources to learn R
“Impatient R” is a minimal set of things to learn about R.
“Some hints for the R beginner” suggests additional resources.
Resources to learn SQL
I’m certainly no expert at learning materials for SQL — please make suggestions. But here are things I’ve found that seem at least okay:
SQLZoo provides quite a nice interactive set of exercises.
Tutorialspoint has information easily arranged for learning and refreshing.
See also
“Select operations on R data frames”
The post Translating between R and SQL: the basics appeared first on Burns Statistics.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.