
The R language is weird - particularly for those coming from a typical programmer's background, which likely includes OO languages in the curly-brace family and relational databases using SQL. A key data structure in R, the data.frame, is used something like a table in a relational database. In terms of R's somewhat byzantine type system (which is explained nicely here), a data.frame is a list of equal-length vectors, each of which may have a different type. Each vector is a column in the data.frame, making this a column-oriented data structure, as opposed to the row-oriented nature of relational databases.
In spite of this difference, we often want to do the same sorts of things to an R data.frame that we would to a SQL table. The R docs confuse the SQL-savvy by using different terminology, so here is a quick crib-sheet for applying SQL concepts to data.frames.
We're going to use a sample data.frame with the following configuration of columns, or schema, if you prefer: (sequence:factor, strand:factor, start:integer, end:integer, common_name:character, value:double) where the type character is a string and a factor is something like an enum. Well, more accurately, value is a vector of type double and so forth. Anyway, our example is motivated by annotation of genome sequences, but the techniques aren't particular to any type of data.
> head(df)
sequence strand start end common_name value
1 chromosome + 1450 2112 yvrO 0.9542516
2 chromosome + 41063 41716 graD6 0.2374012
3 chromosome + 62927 63640 graD3 1.0454790
4 chromosome + 63881 64807 gmd 1.4383845
5 chromosome + 71811 72701 moaE -1.8739953
6 chromosome + 73639 74739 moaA 1.2711058
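If you want to follow along without the original data set, here is a minimal sketch that rebuilds just the six rows printed above. The way the data.frame is constructed (and the choice of factor levels for strand) is my assumption, not how the original df was loaded.

```r
# Hypothetical reconstruction of the six rows shown by head(df).
df <- data.frame(
  sequence    = factor(rep("chromosome", 6)),
  strand      = factor(rep("+", 6), levels = c("+", "-")),  # assumed levels
  start       = c(1450L, 41063L, 62927L, 63881L, 71811L, 73639L),
  end         = c(2112L, 41716L, 63640L, 64807L, 72701L, 74739L),
  common_name = c("yvrO", "graD6", "graD3", "gmd", "moaE", "moaA"),
  value       = c(0.9542516, 0.2374012, 1.0454790, 1.4383845,
                  -1.8739953, 1.2711058),
  stringsAsFactors = FALSE  # keep common_name as character, not factor
)

# One class per column is the closest thing to a schema.
# Note that a double vector reports its class as "numeric".
schema <- sapply(df, class)
print(schema)

# A data.frame really is a list of columns.
is.list(df)   # TRUE
length(df)    # number of columns
nrow(df)      # number of rows
```

str(df) gives a similar overview, with a preview of each column's values.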
So, given a data.frame of that schema, how do we do some simple select operations?
Selecting columns by name is easy:
> df[,c('sequence','start','end')]
sequence start end
1 chromosome 1450 2112
2 chromosome 41063 41716
3 chromosome 62927 63640
4 chromosome 63881 64807
5 chromosome 71811 72701
...
As is selecting rows by index, or rows and columns together:
> df[566:570,c('sequence','start','end')]
sequence start end
566 chromosome 480999 479860
567 chromosome 481397 480999
568 chromosome 503053 501275
569 chromosome 506476 505712
570 chromosome 515461 514277
Selecting rows that meet certain criteria is a lot like a SQL where clause:
> df[df$value>3.0,]
sequence strand start end common_name value
199 chromosome + 907743 909506 hutU 3.158821
321 chromosome + 1391811 1393337 nadB 3.092771
556 chromosome - 431600 431037 apt 3.043373
572 chromosome - 519043 518186 hbd1 3.077040
For extra bonus points, let's find tRNAs.
> df[grep("trna", df$common_name, ignore.case=TRUE),]
sequence strand start end common_name value
18 chromosome + 115152 115224 Asn tRNA -0.461038128
19 chromosome + 115314 115422 Ile tRNA -0.925268307
31 chromosome + 167315 167388 Tyr tRNA 0.112527023
32 chromosome + 191112 191196 Ser tRNA 0.986357577
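The where-clause idea stretches a bit further. The sketch below uses a small stand-in data.frame: four rows borrow values from the output above, and the fifth row (named "unknown", with an NA value) is invented purely to show what happens with missing data. It compares a bare logical index with which() and subset(), and uses grepl() rather than grep() so that a pattern match can be combined with another condition.

```r
# Stand-in data; the "unknown" row with value NA is a made-up illustration.
df <- data.frame(
  common_name = c("hutU", "nadB", "Asn tRNA", "Ile tRNA", "unknown"),
  start       = c(907743L, 1391811L, 115152L, 115314L, 1L),
  value       = c(3.158821, 3.092771, -0.461038128, -0.925268307, NA),
  stringsAsFactors = FALSE
)

# SELECT * FROM df WHERE value > 3.0
# A bare logical index returns an extra all-NA row wherever the test is NA ...
with_na <- df[df$value > 3.0, ]
# ... while which() and subset() keep only rows where the test is TRUE.
by_which  <- df[which(df$value > 3.0), ]
by_subset <- subset(df, value > 3.0)

# SELECT common_name, value FROM df
#   WHERE common_name LIKE '%trna%' AND value < 0
# grepl() returns a logical vector, so it combines with & like any condition.
trnas <- df[grepl("trna", df$common_name, ignore.case = TRUE) & df$value < 0,
            c("common_name", "value")]
print(trnas)
```

Whether you reach for which() or subset() is mostly taste; subset() reads more like SQL, while explicit indexing is usually recommended inside functions.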
Duplicate row names
Row names are not necessarily unique in R, at least for matrices (a data.frame insists on unique row names), which breaks selecting by row name. Take matrix a:
> a <- matrix(1:18, nrow=6, ncol=3)
> rownames(a) <- c('a', 'a', 'a', 'b', 'b', 'b')
> colnames(a) <- c('foo', 'bar', 'bat')
> a
foo bar bat
a 1 7 13
a 2 8 14
a 3 9 15
b 4 10 16
b 5 11 17
b 6 12 18
It looks to me like trying to index by the row names just returns the first row of a given name:
> a['a',]
foo bar bat
1 7 13
> a['b',]
foo bar bat
4 10 16
But this works:
> a[rownames(a)=='a',]
foo bar bat
a 1 7 13
a 2 8 14
a 3 9 15
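Here is the same experiment as a self-contained sketch, with two additions of my own: %in% to match several names at once, and a check that a data.frame refuses duplicate row names outright. The variable names are just for illustration.

```r
a <- matrix(1:18, nrow = 6, ncol = 3)
rownames(a) <- c('a', 'a', 'a', 'b', 'b', 'b')
colnames(a) <- c('foo', 'bar', 'bat')

# Character indexing matches only the first row with that name.
first_a <- a['a', ]

# A logical index built from rownames() returns every matching row.
all_a <- a[rownames(a) == 'a', ]

# %in% matches several names; drop = FALSE keeps the result a matrix
# even if only one row happens to match.
a_or_b <- a[rownames(a) %in% c('a', 'b'), , drop = FALSE]

# A data.frame rejects duplicate row names with an error.
d <- data.frame(foo = 1:3)
rejected <- tryCatch(
  suppressWarnings({
    rownames(d) <- c('a', 'a', 'b')
    FALSE
  }),
  error = function(e) TRUE
)
print(rejected)
```

So the duplicate-name trap mostly bites with matrices; with a data.frame you would keep the repeated label in an ordinary column and filter on that instead.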
More Resources:
- R programming for those coming from other languages
- Extract or Replace Parts of a Data Frame
- Subsetting Vectors, Matrices and Data Frames
- R & BioConductor Manual
- R Help search engine
- Abandon hope all ye who enter The R Inferno
- R Programming for Bioinformatics
Help for R, the R language, or the R project is notoriously hard to search for, so I like to stick in a few extra keywords, like R, data frames, data.frames, select subset, subsetting, selecting rows from a data.frame that meet certain criteria, and find.