Set operations on more than two sets in R

February 6, 2013

(This article was first published on Odd Hypothesis, and kindly contributed to R-bloggers)

Problem

Set operations are commonplace in R, and the enabling functions in the base package are:

  • intersect(x, y)
  • union(x, y)
  • setdiff(x, y)
  • setequal(x, y)

That said, you’ll note that each ONLY takes two arguments – i.e. set X and set Y – which was a limitation I needed to overcome recently.

Googling around I found that you can apply set operations to multiple (>2) sets using the Reduce() function.  For example, an intersection of sets stored in the list x would be achieved with:
Reduce(intersect, x)
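
As a quick illustration, here is that call on a small list, together with the matching union(). The list contents are made up for this example; only Reduce(), intersect() and union() come from base R.

```r
# Hypothetical example data: three sets stored in a named list
x <- list(
  a = c("p", "q", "r", "s"),
  b = c("q", "r", "s", "t"),
  c = c("r", "s", "u")
)

# Reduce() folds the two-argument function over the list, i.e.
# intersect(intersect(x$a, x$b), x$c)
common <- Reduce(intersect, x)

# The same pattern gives the union of all sets
everything <- Reduce(union, x)

print(common)
print(everything)
```

Here common should hold only the elements present in all three vectors, and everything should hold each distinct element once.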

However, things get trickier if you want to do a setdiff() in this manner, that is, find the elements that are only in set a but not in sets b, c, d, and so on. Since I’m not a master of set theory, I decided to write my own brute-force method to get intersections, unions, and setdiffs for an arbitrary number of sets.
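
For comparison, one way to express "only in a" with the base functions alone is to pool the other sets first and subtract them in one step. This is a minimal sketch with made-up sets, not the method developed in this post, and it has to be repeated for each set you want to single out.

```r
# Hypothetical example data: four numeric sets in a named list
sets <- list(
  a = c(1, 2, 3, 4, 5),
  b = c(2, 6),
  c = c(3, 7),
  d = c(4, 8)
)

# Pool every set except the first one ...
others <- Reduce(union, sets[-1])

# ... and keep only the elements of a that appear in none of them
only_a <- setdiff(sets$a, others)

print(only_a)
```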

Implementation

The function I wrote uses a truth table to do this where:

  • rows are the union of all elements in all sets
  • columns are sets

So each “cell” of the table is TRUE if the element represented by the row is in the set represented by the column.
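
A truth table of this shape can be sketched in a few lines. The sets x, y and z below are assumptions chosen to match the small example shown later in this post; the variable names are illustrative only.

```r
# Hypothetical example sets
sets <- list(
  x = c("a", "c", "d"),
  y = c("b", "d"),
  z = c("b", "c", "d")
)

# Rows: the union of all elements in all sets (sorted for readability)
universe <- sort(Reduce(union, sets))

# Columns: one per set; each cell is TRUE if the row element is in that set
truth <- vapply(sets, function(s) universe %in% s, logical(length(universe)))
rownames(truth) <- universe

print(truth)
```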

To find an element that is in the intersection of all sets, the values across its row need to be all TRUE.
To find an element that is in only one set, exactly one value in its row is TRUE.  To determine which set that element is in, each column is assigned a numeric value that is a power of two, such that:

  • col1 = 1
  • col2 = 2
  • col3 = 4
  • col4 = 8
  • … and so on

Each row’s TRUE/FALSE values (i.e. 1/0) are multiplied by the column values above and summed across the row to produce that row’s inclusion value.

For example, if an element is only in set 2 (that is column 2) the corresponding row inclusion value is 2.  If an element is in both sets 2 and 3 (columns 2 and 3) the corresponding row inclusion value is 6.
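
The arithmetic can be checked with a small matrix product. The truth table below is typed in by hand as an assumption for illustration, and the names weights and inc_val are hypothetical.

```r
# Hand-built truth table: rows are elements, columns are sets
truth <- matrix(
  c(TRUE,  FALSE, FALSE,
    FALSE, TRUE,  TRUE,
    TRUE,  FALSE, TRUE,
    TRUE,  TRUE,  TRUE),
  ncol = 3, byrow = TRUE,
  dimnames = list(c("a", "b", "c", "d"), c("x", "y", "z"))
)

# Column values: 1, 2, 4, ... (one power of two per set)
weights <- 2^(seq_len(ncol(truth)) - 1)

# Multiply each row's 1/0 values by the column values and sum across
# (truth * 1 converts the logical matrix to numeric)
inc_val <- drop((truth * 1) %*% weights)

print(inc_val)
```

Because every column value is a distinct power of two, each combination of sets maps to a different sum, so the inclusion value identifies exactly which sets contain the element.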

Visually …

value:  1  2  4
set:    x  y  z  inc.val
a       T  F  F  1
b       F  T  T  6
c       T  F  T  5
d       T  T  T  7

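Thus, determining intersections or setdiffs becomes a simple matter of filtering rows by their inclusion value: the sum of all column values marks an element that is in every set, and a single column’s value marks an element found only in that set.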

Code
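
What follows is a minimal sketch of the approach described above, not the author's original listing. The function name inclusion_table(), its return structure, and the example sets are all assumptions made for illustration.

```r
# Hypothetical helper: builds the truth table and inclusion values
# for a list of sets (vectors of the same basic type).
inclusion_table <- function(sets) {
  stopifnot(is.list(sets), length(sets) > 0)
  if (is.null(names(sets))) {
    names(sets) <- paste0("set", seq_along(sets))
  }

  # Rows: union of all elements; columns: one per set
  universe <- Reduce(union, sets)
  truth <- matrix(
    unlist(lapply(sets, function(s) universe %in% s)),
    nrow = length(universe),
    ncol = length(sets),
    dimnames = list(as.character(universe), names(sets))
  )

  # Column values 1, 2, 4, 8, ... and the weighted row sums
  weights <- 2^(seq_along(sets) - 1)
  names(weights) <- names(sets)
  inc_val <- as.vector((truth * 1) %*% weights)

  list(elements = universe, truth = truth,
       weights = weights, inc_val = inc_val)
}

# Example sets (made up for this sketch)
sets <- list(
  x = c("a", "c", "d"),
  y = c("b", "d"),
  z = c("b", "c", "d")
)
tab <- inclusion_table(sets)

# Intersection of all sets: inclusion value equals the sum of all column values
in_all <- tab$elements[tab$inc_val == sum(tab$weights)]

# Elements only in x: inclusion value equals the column value of x
only_x <- tab$elements[tab$inc_val == tab$weights["x"]]

# Elements in y and z but not in x: 2 + 4 = 6
only_yz <- tab$elements[tab$inc_val == sum(tab$weights[c("y", "z")])]

print(in_all)
print(only_x)
print(only_yz)
```

One caveat with this sketch: the column values are stored as doubles, so the sums stay exact only up to roughly 50 sets. For larger collections, filtering directly on the logical matrix (for example with rowSums()) would be a safer choice.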
