The R language is weird – particularly for those coming from a typical programmer’s background, which likely includes OO languages in the curly-brace family and relational databases using SQL. A key data structure in R, the data.frame, is used something like a table in a relational database. In terms of R’s somewhat byzantine type system (which is explained nicely here), a data.frame is a list of vectors of varying types. Each vector is a column in the data.frame making this a column-oriented data structure as opposed to the row-oriented nature of relational databases.
In spite of this difference, we often want to do the same sorts of things to an R data.frame that we would to a SQL table. The R docs confuse the SQL-savvy by using different terminology, so here is a quick crib-sheet for applying SQL concepts to data.frames.
We’re going to use a sample data.frame with the following configuration of columns, or schema, if you prefer: (sequence:factor, strand:factor, start:integer, end:integer, common_name:character, value:double) where the type character is a string and a factor is something like an enum. Well, more accurately, value is a vector of type double and so forth. Anyway, our example is motivated by annotation of genome sequences, but the techniques aren’t particular to any type of data.
> head(df) sequence strand start end common_name value 1 chromosome + 1450 2112 yvrO 0.9542516 2 chromosome + 41063 41716 graD6 0.2374012 3 chromosome + 62927 63640 graD3 1.0454790 4 chromosome + 63881 64807 gmd 1.4383845 5 chromosome + 71811 72701 moaE -1.8739953 6 chromosome + 73639 74739 moaA 1.2711058
So, given a data.frame of that schema, how do we do some simple select operations?
Selecting columns by name is easy:
> df[,c('sequence','start','end')] sequence start end 1 chromosome 1450 2112 2 chromosome 41063 41716 3 chromosome 62927 63640 4 chromosome 63881 64807 5 chromosome 71811 72701 ...
As is selecting row names, or both:
> df[566:570,c('sequence','start','end')] sequence start end 566 chromosome 480999 479860 567 chromosome 481397 480999 568 chromosome 503053 501275 569 chromosome 506476 505712 570 chromosome 515461 514277
Selecting rows that meet certain criteria is a lot like a SQL where clause:
> df[df$value>3.0,] sequence strand start end common_name value 199 chromosome + 907743 909506 hutU 3.158821 321 chromosome + 1391811 1393337 nadB 3.092771 556 chromosome - 431600 431037 apt 3.043373 572 chromosome - 519043 518186 hbd1 3.077040
For extra bonus points, let’s find tRNAs.
> df[grep("trna", df$common_name, ignore.case=T),] sequence strand start end common_name value 18 chromosome + 115152 115224 Asn tRNA -0.461038128 19 chromosome + 115314 115422 Ile tRNA -0.925268307 31 chromosome + 167315 167388 Tyr tRNA 0.112527023 32 chromosome + 191112 191196 Ser tRNA 0.986357577 ...
Duplicate row names
Row names are not necessarily unique in R, which breaks the method shown above for selecting by row name. Take matrix a:
< a = matrix(1:18, nrow=6, ncol=3) < rownames(a)
It looks to me like trying to index by the row names just returns the first row of a given name:
< a['a',] foo bar bat 1 7 13 < a['b',] foo bar bat 4 10 16
But this works:
< a[rownames(a)=='a',] foo bar bat a 1 7 13 a 2 8 14 a 3 9 15
- R programming for those coming from other languages
- Extract or Replace Parts of a Data Frame
- Subsetting Vectors, Matrices and Data Frames
- R & BioConductor Manual
- R Help search engine
- Abandon hope all ye who enter The R Inferno
- R Programming for Bioinformatics
Help for R, the R language, or the R project is notoriously hard to search for, so I like to stick in a few extra keywords, like R, data frames, data.frames, select subset, subsetting, selecting rows from a data.frame that meet certain criteria, and find.