Select operations on R data frames

July 26, 2009
By

(This article was first published on Digithead's Lab Notebook, and kindly contributed to R-bloggers)

The R Project

The R language is weird - particularly for those coming from a typical programmer's background, which likely includes OO languages in the curly-brace family and relational databases using SQL. A key data structure in R, the data.frame, is used something like a table in a relational database. In terms of R's somewhat byzantine type system (which is explained nicely here), a data.frame is a list of vectors of varying types. Each vector is a column in the data.frame making this a column-oriented data structure as opposed to the row-oriented nature of relational databases.

In spite of this difference, we often want to do the same sorts of things to an R data.frame that we would to a SQL table. The R docs confuse the SQL-savvy by using different terminology, so here is a quick crib-sheet for applying SQL concepts to data.frames.

We're going to use a sample data.frame with the following configuration of columns, or schema, if you prefer: (sequence:factor, strand:factor, start:integer, end:integer, common_name:character, value:double) where the type character is a string and a factor is something like an enum. Well, more accurately, value is a vector of type double and so forth. Anyway, our example is motivated by annotation of genome sequences, but the techniques aren't particular to any type of data.

> head(df)
    sequence strand start   end common_name      value
1 chromosome      +  1450  2112        yvrO  0.9542516
2 chromosome      + 41063 41716       graD6  0.2374012
3 chromosome      + 62927 63640       graD3  1.0454790
4 chromosome      + 63881 64807         gmd  1.4383845
5 chromosome      + 71811 72701        moaE -1.8739953
6 chromosome      + 73639 74739        moaA  1.2711058

So, given a data.frame of that schema, how do we do some simple select operations?

Selecting columns by name is easy:

> df[,c('sequence','start','end')]
       sequence   start     end
1    chromosome    1450    2112
2    chromosome   41063   41716
3    chromosome   62927   63640
4    chromosome   63881   64807
5    chromosome   71811   72701
...

As is selecting row names, or both:

> df[566:570,c('sequence','start','end')]
      sequence  start    end
566 chromosome 480999 479860
567 chromosome 481397 480999
568 chromosome 503053 501275
569 chromosome 506476 505712
570 chromosome 515461 514277

Selecting rows that meet certain criteria is a lot like a SQL where clause:

> df[df$value>3.0,]
      sequence strand   start     end common_name    value
199 chromosome      +  907743  909506        hutU 3.158821
321 chromosome      + 1391811 1393337        nadB 3.092771
556 chromosome      -  431600  431037         apt 3.043373
572 chromosome      -  519043  518186        hbd1 3.077040

For extra bonus points, let's find tRNAs.

> df[grep("trna", df$common_name, ignore.case=T),]
      sequence strand   start     end common_name        value
18  chromosome      +  115152  115224    Asn tRNA -0.461038128
19  chromosome      +  115314  115422    Ile tRNA -0.925268307
31  chromosome      +  167315  167388    Tyr tRNA  0.112527023
32  chromosome      +  191112  191196    Ser tRNA  0.986357577
...

Duplicate row names

Row names are not necessarily unique in R, which breaks the method shown above for selecting by row name. Take matrix a:

< a = matrix(1:18, nrow=6, ncol=3)
< rownames(a) <- c('a', 'a', 'a', 'b', 'b', 'b')
< colnames(a) <- c('foo', 'bar', 'bat')
< a
  foo bar bat
a   1   7  13
a   2   8  14
a   3   9  15
b   4  10  16
b   5  11  17
b   6  12  18

It looks to me like trying to index by the row names just returns the first row of a given name:

< a['a',]
foo bar bat 
  1   7  13
< a['b',]
foo bar bat 
  4  10  16 

But this works:

< a[rownames(a)=='a',]
  foo bar bat
a   1   7  13
a   2   8  14
a   3   9  15

More Resources:

Help for R, the R language, or the R project is notoriously hard to search for, so I like to stick in a few extra keywords, like R, data frames, data.frames, select subset, subsetting, selecting rows from a data.frame that meet certain criteria, and find.

To leave a comment for the author, please follow the link and comment on his blog: Digithead's Lab Notebook.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Tags: , ,

Comments are closed.