Joining data frames in R

December 17, 2009
By

(This article was first published on Digithead's Lab Notebook, and kindly contributed to R-bloggers)

The R ProjectWant to join two R data frames on a common key? Here's one way do a SQL database style join operation in R.

We start with a data frame describing probes on a microarray. The key is the probe_id and the rest of the information describes the location on the genome targeted by that probe.

> head(probes)
          probe_id sequence strand   start     end
1 mm_ex_fwd_000541      Chr      + 1192448 1192507
2 mm_ex_fwd_000542      Chr      + 1192453 1192512
3 mm_ex_fwd_000543      Chr      + 1192458 1192517
4 mm_ex_fwd_000544      Chr      + 1192463 1192522
5 mm_ex_fwd_000545      Chr      + 1192468 1192527
6 mm_ex_fwd_000546      Chr      + 1192473 1192532

> dim(probes)
[1] 241019      5

We also have a bunch of measurements in a numeric vector. For each probe (well, a few probes missing due to bad data) we have a value.

> head(value)
mm_fwd_000002 mm_fwd_000003 mm_fwd_000004 mm_fwd_000005 mm_fwd_000006 mm_fwd_000007 
   0.05294899    0.11979251    0.28160017    0.57284569    0.74402510    0.78644199 

> length(value)
[1] 241007

Let's join up these tables, er data frame and vector. We'll use the match function. Match returns a vector of positions of the (first) matches of its first argument in its second (or NA if there is no match). So, we're matching our values into our probes.

> joined = cbind(probes[match(names(value), probes$probe_id),], value)

> dim(joined)
[1] 241007      6

> head(joined)
          probe_id sequence strand start end         value
3695 mm_fwd_000002      Chr      +    15  74 0.05294899
3696 mm_fwd_000003      Chr      +    29  88 0.11979251
3697 mm_fwd_000004      Chr      +    43 102 0.28160017
3698 mm_fwd_000005      Chr      +    57 116 0.57284569
3699 mm_fwd_000006      Chr      +    71 130 0.74402510
3700 mm_fwd_000007      Chr      +    85 144 0.78644199

Merge is probably more similar to a database join.

Inner joinmerge(df1, df2, by="common_key_column")
Outer joinmerge(df1, df2, by="common_key_column", all=TRUE)
Left outermerge(df1, df2, by="common_key_column", all.x=TRUE)
Right outermerge(df1, df2, by="common_key_column", all.y=TRUE)

If we have two data frames, we can use merge. Let's convert our vector tp to a data frame and merge, getting the same result (in a different sort order).

> tp.df = data.frame(probe_id=names(tp), value=tp)

> head(tp.df)
                   probe_id      value
mm_fwd_000002 mm_fwd_000002 0.05294899
mm_fwd_000003 mm_fwd_000003 0.11979251
mm_fwd_000004 mm_fwd_000004 0.28160017
mm_fwd_000005 mm_fwd_000005 0.57284569
mm_fwd_000006 mm_fwd_000006 0.74402510
mm_fwd_000007 mm_fwd_000007 0.78644199

> m = merge(probes, tp.df, by="probe_id")

> dim(m)
[1] 241007      6

> head(mmm)
          probe_id sequence strand   start     end     value
1 mm_ex_fwd_000541      Chr      + 1192448 1192507 0.1354668
2 mm_ex_fwd_000542      Chr      + 1192453 1192512 0.1942794
3 mm_ex_fwd_000543      Chr      + 1192458 1192517 0.1924457
4 mm_ex_fwd_000544      Chr      + 1192463 1192522 0.2526351
5 mm_ex_fwd_000545      Chr      + 1192468 1192527 0.1922655
6 mm_ex_fwd_000546      Chr      + 1192473 1192532 0.2610747

There's a good discussion of merge on Stack Overflow, which includes right, left, inner and outer joins. Also the R wiki covers both match and merge. See also, the prior entry on select operations on R data frames.

To leave a comment for the author, please follow the link and comment on his blog: Digithead's Lab Notebook.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Tags: ,

Comments are closed.