An efficient way to do dataset intersection

July 27, 2011

(This article was first published on One Tip Per Day, and kindly contributed to R-bloggers)

The main message is to use match() to get the indices of the rows you need and then select the rows by index, instead of selecting them by row name, which is much slower. Here is an example:
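A minimal sketch of that comparison, assuming a hypothetical data frame `df` with unique character row names and a hypothetical vector `wanted` of names to extract (the data and all names here are illustrative, not taken from the original post):

```r
# Hypothetical data: 100,000 rows with unique character row names.
set.seed(1)
n <- 1e5
df <- data.frame(value = rnorm(n), row.names = paste0("id", seq_len(n)))

# Row names we want to pull out, in the order we want them.
wanted <- sample(rownames(df), 1000)

# Option 1: select directly by row name.
by_name <- df[wanted, , drop = FALSE]

# Option 2: look up the positions once with match(), then select by index.
idx <- match(wanted, rownames(df))
by_index <- df[idx, , drop = FALSE]

# Check that both options give the same rows in the same order.
same <- identical(by_name, by_index)
```

To compare the two options on your own data, wrap each selection in system.time(); the difference should grow with the number of rows and the number of names looked up.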

In the example above, we know that rows with the same value in column 2 also have the same values in columns 4 through the last. So, instead of calling unique() on the whole matrix, take the unique values of column 2 and then find their row indices with match(). Note that match(a, b) returns only the index of the first occurrence of each element of a in b. For example:
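The first-occurrence behaviour and the deduplication trick can be sketched as follows. The matrix `m` and its column names are hypothetical, built so that columns 4 and 5 depend only on column 2:

```r
# match(a, b): for each element of a, the position of its FIRST match in b.
b <- c("x", "y", "x", "z", "y")
pos <- match(c("y", "x"), b)   # 2 1

# Hypothetical matrix: column 2 is the key, columns 4 and 5 are determined by it.
key <- c(10, 20, 10, 30, 20, 10)
m <- cbind(a = 1:6, key = key, b = 6:1, c = key * 2, d = key + 1)
cols <- c(2, 4:ncol(m))

# Route 1: deduplicate over the key and every dependent column.
slow <- unique(m[, cols])

# Route 2: deduplicate the key alone, then locate the first row of each key.
first_rows <- match(unique(m[, 2]), m[, 2])   # 1 2 4
fast <- m[first_rows, cols]
```

Both routes keep the first row for each distinct key, so `slow` and `fast` hold the same rows; the second route only has to compare a single column.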

This tip also helps when intersecting two big data frames. For example:
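A minimal sketch of such an intersection, assuming two hypothetical data frames `A` and `B` that share an `id` column whose values are unique within each data frame (with duplicated ids, match() would only find the first row for each):

```r
# Two hypothetical data frames sharing an "id" column.
A <- data.frame(id = c("g1", "g2", "g3", "g4"),
                score = c(0.1, 0.2, 0.3, 0.4),
                stringsAsFactors = FALSE)
B <- data.frame(id = c("g3", "g5", "g1"),
                count = c(7L, 8L, 9L),
                stringsAsFactors = FALSE)

# IDs present in both data frames.
common <- intersect(A$id, B$id)   # "g1" "g3"

# Look up the row positions of the common IDs in each data frame,
# then select by index so both subsets come out in the same order.
A_common <- A[match(common, A$id), ]
B_common <- B[match(common, B$id), ]

# The subsets are aligned row by row, so they can be combined directly.
both <- cbind(A_common, count = B_common$count)
```

Because both subsets are ordered by `common`, no sorting or row-name lookup is needed before combining them.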
