sqldf and grouping rows in R

February 8, 2011
By

(This article was first published on Recipes, scripts and genomics, and kindly contributed to R-bloggers)

In R, you can treat tables (or data.frames as they are called in R) as SQL tables. That means you can query them as you would query a database with SQL commands. This is particularly useful 1) if you know SQL, hahah:)  2) if you have large tables with millions of rows. In R, querying a database will be much faster than iterating through the rows of a million row table.

Say you have a table of exon locations and RPKM value for each exon and you want to get the total RPKM value for each transcript:


>head(rnaEx)
  ex_rank                ex_name tx_chr  tx_start    tx_end     score len   strand
       1 heart_k25:1002070u:2.9   chrX 101864664 101864760  1314.937  65     +
       2 heart_k25:1002070u:2.9   chrX 101864664 101864760     0.000   5     +
       1 heart_k25:1002622u:3.0   chrX  96534438  96534487  1055.568  50     -



# load the library that can treat tables as SQL tables
>require(sqldf)

# query the table with SQL command: group by transcripts and sum up the scores
>int.res=sqldf("select ex_name,tx_chr,tx_start,tx_
end,strand,SUM(score) AS sum_score,SUM(len) AS len from rnaEx GROUP BY ex_name,tx_chr,tx_start,tx_end")

>head(int.res)
                 ex_name tx_chr  tx_start    tx_end strand sum_score len
  heart_k25:100000u:2.5   chrX  73370385  73370481      +  2448.919  97
 heart_k25:1000027u:5.5   chr1  93550994  93551053      -  2352.406  60
 heart_k25:1000076u:2.5   chr5  15764153  15764204      -  1140.013  52
  heart_k25:1000078:2.6  chr11 116415816 116415975      +  3787.981 160


Check here for more info on sqldf:

Similar functionality also exists in R through another package called plyr. Although, I feel like sqldf will be faster for large data.frames, I have no evidence on that at the moment. plyr is described here and here, and check below for a quick intro:

To leave a comment for the author, please follow the link and comment on his blog: Recipes, scripts and genomics.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.