An efficient way to do dataset intersection

July 27, 2011

(This article was first published on One Tip Per Day, and kindly contributed to R-bloggers)

The main message: use `match` to get the indices of the rows you need, then select the rows by those indices. Selecting rows by row name instead is much slower on large data. Here is an example:
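A minimal sketch of the two ways of selecting rows, assuming a hypothetical data frame `df` whose row names are unique IDs and a hypothetical vector `wanted` of IDs to pull out (neither name comes from the original post):

```r
# Hypothetical data: a data frame whose row names are unique IDs.
set.seed(1)
n   <- 10000
ids <- paste0("id", seq_len(n))
df  <- data.frame(value = rnorm(n),
                  group = sample(letters, n, replace = TRUE),
                  row.names = ids)

# The IDs we want to extract, in the order we want them.
wanted <- sample(ids, 1000)

# Approach 1: select by row name.
by_name <- df[wanted, ]

# Approach 2: look up integer positions with match(), then select by index.
idx      <- match(wanted, rownames(df))
by_index <- df[idx, ]

# Both approaches return the same rows in the same order.
stopifnot(identical(rownames(by_name), rownames(by_index)),
          identical(by_name$value, by_index$value))
```

Wrapping each selection in `system.time()` is one way to compare the two approaches on your own data; the size of the gap will depend on the data and the R version.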


In the example above, we know that rows with the same value in the 2nd column also have the same values in columns 4 through the end. So, instead of calling `unique` on the whole matrix, take the unique values of the 2nd column and then use `match` to get the row index of each one. `match(a, b)` returns only the index of the first occurrence of each element of `a` in `b`. For example:
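A small sketch of that first-occurrence behaviour and of de-duplicating by a key column, assuming a hypothetical matrix `m` in which column 2 is the key and columns 4 onward are fully determined by it:

```r
# match(a, b): for each element of a, the position of its FIRST match in b.
b <- c("x", "y", "x", "z", "y")
match(c("y", "x", "w"), b)   # 2 1 NA ("w" does not occur in b)

# Hypothetical matrix: rows sharing a key (column 2) are assumed to agree
# in columns 4 and 5 as well.
m <- cbind(id  = 1:5,
           key = c(10, 20, 10, 30, 20),
           pos = c(5, 4, 3, 2, 1),
           v1  = c(1, 2, 1, 3, 2),
           v2  = c(7, 8, 7, 9, 8))

keys     <- unique(m[, 2])          # distinct key values
first    <- match(keys, m[, 2])     # row index of the first row for each key
m_unique <- m[first, , drop = FALSE]
```

Here `first` is `1 2 4`, so `m_unique` keeps one row per key. Under the same assumption, `m[!duplicated(m[, 2]), , drop = FALSE]` selects the same rows.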


This tip also helps when intersecting two big data frames. For example:
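One way this could look, as a sketch: the data frames `df1` and `df2` and their shared `id` column are hypothetical, and the IDs are assumed to be unique within each data frame.

```r
# Two hypothetical data frames sharing an "id" key column.
df1 <- data.frame(id = c("a", "b", "c", "d"), x = 1:4,
                  stringsAsFactors = FALSE)
df2 <- data.frame(id = c("c", "a", "e"), y = c(30, 10, 50),
                  stringsAsFactors = FALSE)

# Keys present in both data frames.
common <- intersect(df1$id, df2$id)

# Integer row positions of the shared keys in each data frame.
i1 <- match(common, df1$id)
i2 <- match(common, df2$id)

# The rows are now aligned by key, so they can be combined column-wise.
both <- cbind(df1[i1, ], y = df2$y[i2])
```

Because `i1` and `i2` are plain integer vectors, the row selection itself never has to look anything up by row name.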
