An introductory comparison of using the two languages.
R was made especially for data analysis and graphics. SQL was made especially for databases. They are allies.
The data structure in R that most closely matches a SQL table is a data frame. The terms rows and columns are used in both.
There is an R package called
sqldf that allows you to use SQL commands to extract data from an R data frame. We will use this package in the examples. There are two basic steps to using an R package:
- it must be installed on your machine (once)
- it must be available in each R session where it is used
You can use
sqldf by doing (one time only, and assuming an internet connection):
Then in each R session where you want to use it:
To simplify the examples, we’ll slightly modify one of the inbuilt data frames:
myCO2 <- CO2 attributes(myCO2) <- attributes(CO2)[ c("names", "row.names", "class")] class(myCO2) <- "data.frame"
Note that the character between C and 2 is a capital-O and not a zero. The
CO2 object has a complicated value for its class but the result of the
sqldf function has only
"data.frame" in its class. We want to cleanly see if two objects are the same, and hence we want the classes to match.
In R the
colnames function returns the names of the columns:
> colnames(myCO2)  "Plant" "Type" "Treatment" "conc"  "uptake"
The result is a vector of character strings.
Columns in SQL are also called “fields”. In R it is rather common for columns to be called “variables”.
In SQL the subset of columns is determined by
select. Here we want to get the Type and conc columns:
s01 <- sqldf("select Type, conc from myCO2")
Subsetting in R (commonly called “subscripting”) is done with square brackets. When subscripting a data frame there will be two places inside the square brackets separated by a comma. The R equivalent of the command above is:
r01 <- myCO2[, c("Type", "conc")]
The first part inside the square brackets (corresponding to rows) is empty. The second part (corresponding to columns) has a character vector with the names of the two columns we want.
We can test that
s01 are the same:
> all.equal(s01, r01)  TRUE
In R the vector of column names could be created as an object and then used in the subscripting:
someCols <- c("Type", "conc") r01b <- myCO2[, someCols]
r01b objects are the same.
An asterisk is used in SQL to indicate that you want all columns:
s02 <- sqldf("select * from myCO2")
When you want all items in a dimension in R, you leave it blank:
r02 <- myCO2[ , ]
You might have been able to guess that because we’ve seen that done for rows already. Note that spaces almost never matter in R — the command above has spaces either side of the comma, but would be exactly the same with no spaces.
Only one column
How to select only a single column is no surprise in either language:
s03 <- sqldf("select Type from myCO2") r03 <- myCO2[ , "Type"]
But there is a surprise when you test if these two objects are equal:
The command above results in a bunch of stuff, indicating they are quite different.
r03 object is not a data frame, it is an object of the type of the column. While surprising to those used to SQL, this is quite natural for R’s purposes. For example, we give the
mean function a vector of numbers, not a data frame:
> mean(myCO2[, "uptake"])  27.2131
You can get a one-column data frame by slightly modifying the command:
r03d <- myCO2[ , "Type", drop=FALSE]
r03d objects are the same.
Data frames are not natural inputs to some functions:
> mean(myCO2[, "uptake", drop=FALSE])  NA Warning message: In mean.default(myCO2[, "uptake", drop = FALSE]) : argument is not numeric or logical: returning NA
SQL is not case-sensitive:
s04 <- sqldf("select type, coNC from myCO2")
s04 is the same as
On the other hand, R is case-sensitive:
> r04 <- myCO2[, c("type", "coNC")] Error in `[.data.frame`(myCO2, , c("type", "coNC")) : undefined columns selected
We’ve seen how to select columns of an R data frame with the names of the columns. There are other ways of selecting columns as well.
The order of the columns in an R data frame is of significance. You can select columns by number. For example, you can select column 5 and then column 2:
myCO2[, c(5, 2)]
You can use negative numbers to exclude columns. Here you are asking for all columns except the first and the fourth:
myCO2[, c(-1, -4)]
Column selection in R can also be done with logical values:
myCO2[, c(TRUE, FALSE, FALSE, TRUE, FALSE)]
Those logical values can be created by a command:
myCO2[, colnames(myCO2) > "d"]
In SQL a common synonym for “row” is “record”. In R a common synonym is “observation”.
The common way of getting a subset of rows in SQL is with the
s05 <- sqldf("select * from myCO2 where uptake < 20")
In R the equivalent of the
where is put in the first position inside the square brackets:
r05 <- myCO2[ myCO2[, "uptake"] < 20, ]
r05 are in most respects the same. The difference is that the row names are different.
r05 has the row names from the original data frame while
s05 has new ones that are sequential from 1.
The command that created
r05 is a little convoluted (but logical once you stare at it long enough). The
with function allows a command that is more in the spirit of what is done in SQL:
r05w <- with(myCO2, myCO2[uptake < 20, ]) # same as r05
with call the columns of the data frame named in the first argument can be used as objects. In this example
uptake is used directly instead of pulling that column out of the data frame.
Logical comparisons in SQL are combined with
s06 <- sqldf("select * from myCO2 where uptake < 20 and Type='Quebec'")
Also note that testing equality is with
In R this type of ‘and’ operation is done with
& and the ‘or’ is
r06 <- with(myCO2, myCO2[uptake < 20 & Type == 'Quebec', ])
A possible trouble spot is that equality in R is tested with
= is an assignment operator).
r06 objects are the same except for their row names.
limit command in SQL limits the number of rows that are given:
s07 <- sqldf("select * from myC02 limit 6")
One way to see just the column names is to limit the number of rows to zero.
You can get the first few rows in R with
r07 <- head(myCO2)
tail function gives you the last few rows, and the
corner function is a logical extension of
Row names versus numbers
A source of possible confusion is that row names are character even though they are, by default, representations of numbers. Let’s experiment with
> r06 Plant Type Treatment conc uptake 1 Qn1 Quebec nonchilled 95 16.0 8 Qn2 Quebec nonchilled 95 13.6 15 Qn3 Quebec nonchilled 95 16.2 22 Qc1 Quebec chilled 95 14.2 29 Qc2 Quebec chilled 95 9.3 36 Qc3 Quebec chilled 95 15.1
Select the first three rows:
> r06[1:3,] Plant Type Treatment conc uptake 1 Qn1 Quebec nonchilled 95 16.0 8 Qn2 Quebec nonchilled 95 13.6 15 Qn3 Quebec nonchilled 95 16.2
Now let’s select the characters one through three:
> r06 Plant Type Treatment conc uptake 1 Qn1 Quebec nonchilled 95 16.0 NA <NA> <NA> <NA> NA NA 36 Qc3 Quebec chilled 95 15.1
What happened? The first row is correct — the first row name is 1. In the second row it looked for a row name called “2″ and didn’t find one, so it put in missing values. The third row is even weirder: it looked for a row name called “3″; there was a single row name starting with “3″ so it did a partial match and gave us that row.
Trying to give numbers instead of the actual names doesn’t necessarily work either:
> r06 Plant Type Treatment conc uptake 1 Qn1 Quebec nonchilled 95 16 NA <NA> <NA> <NA> NA NA NA.1 <NA> <NA> <NA> NA NA
NULL means missing value. Confusingly R also has
NULL but the equivalent of SQL
NA in R.
Let’s create some data to play with:
r08 <- r06 r08[2:4, 1] <- NA r08[5, 4] <- NA
This looks like:
> r08 Plant Type Treatment conc uptake 1 Qn1 Quebec nonchilled 95 16.0 8 <NA> Quebec nonchilled 95 13.6 15 <NA> Quebec nonchilled 95 16.2 22 <NA> Quebec chilled 95 14.2 29 Qc2 Quebec chilled NA 9.3 36 Qc3 Quebec chilled 95 15.1
Get the rows where Plant is not missing:
s09 <- sqldf("select * from r08 where plant is not null") r09 <- with(r08, r08[!is.na(Plant), ])
We can also get the rows where Plant is missing:
s10 <- sqldf("select * from r08 where plant is null") r10 <- with(r08, r08[is.na(Plant), ])
To get the rows that have no missing values in R, you can do:
> na.omit(r08) Plant Type Treatment conc uptake 1 Qn1 Quebec nonchilled 95 16.0 36 Qc3 Quebec chilled 95 15.1
In SQL single quotes are used to delimit character strings. A single quote inside a string is given with two single quotes in a row. Some implementations allow you to specify the delimiter.
In R either single quotes or double quotes can be used. You can use whichever you find more convenient but R always prints using double quotes. The backslash is used to escape a quote character that is the same as the delimiting quote:
> c("he's", 'he\'s', "she has \"it\"")  "he's" "he's" "she has \"it\""
Semicolons are sometimes used at the end of statements in both SQL and R.
Some SQL implementations require a semicolon at the end of a statement.
Semicolons are used to separate R commands on the same line. They can be used after all R commands, but probably shouldn’t be.
Note that both
are legal R commands, but they do different things. The first gives the first three rows and all of the columns; the second gives all of the rows and the first three columns (for a reason you need not be concerned about initially).
If a data frame is large and the manipulation is complex, then R can be inefficient. Why should someone with access to a database put up with such inefficiency? One reason is the flexibility that R gives you. It might surprise some people that not all data naturally fit into a structure of rows and columns. Besides it usually doesn’t matter. As has been asked facetiously by a certain someone: “What are you going to do with that extra millisecond?”
But if there are millions of those milliseconds, then you might start to care. The
data.table package provides an alternative form of data frames that is highly efficient.
Resources to learn R
“Impatient R” is a minimal set of things to learn about R.
“Some hints for the R beginner” suggests additional resources.
Resources to learn SQL
I’m certainly no expert at learning materials for SQL — please make suggestions. But here are things I’ve found that seem at least okay:
SQLZoo provides quite a nice interactive set of exercises.
Tutorialspoint has information easily arranged for learning and refreshing.