Doppelgänger search with R and MatchIt
In his book Everybody Lies, Seth Stephens-Davidowitz discusses the Doppelgänger Discovery method, used most notably in baseball in the case of slugger David Ortiz. Doppelgänger Discovery is a way to load up a model with as many data points about a person as possible and find their statistical twins. In the case of David Ortiz, it suggested that he wasn’t quite out of his prime, based on the career arcs of other players just like him.
We are slightly modifying the scenario here. Let’s assume you are charged with selecting participants for a particularly difficult professional development program that requires a specific personality profile and resume for someone to truly get the most out of it. You have 3 spots open, and 3 idealized candidate profiles that represent those individuals who would be best suited to participate. There are 4 key factors to match on, and just sorting names in a spreadsheet doesn’t really cut it. As with most analytics scenarios, there’s an R package for that. There are several. I’ve used and prefer MatchIt.
First, get your data straight. In this case, we want a spreadsheet with our individual identifiers (names, Person X, or participant numbers), a group indicator (1 for the idealized profiles, 0 for the pool of candidates to choose from), and the factors to match on. Something like this:
Group | ID | Factor1 | Factor2 | Factor3 | Factor4 |
---|---|---|---|---|---|
0 | Person A | .333 | .2 | .571 | 3 |
0 | Person B | .667 | .2 | .571 | 4 |
0 | Person C | .667 | .6 | -.285 | -2 |
0 | Person D | .333 | 1.2 | .571 | 6 |
0 | Person E | .000 | .8 | -.285 | 8 |
0 | Person F | .000 | .4 | -.285 | -5 |
1 | Person G | .333 | 1.4 | -.285 | -1 |
1 | Person H | .667 | .6 | -.571 | 0 |
1 | Person I | .000 | .2 | .285 | 6 |
Let’s figure out who would be our ideal candidates. First, install the MatchIt package (for example, with install.packages("MatchIt")). Next, load your spreadsheet (assuming a CSV format) as a data frame named matching.
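As a minimal sketch of the loading step, the snippet below reads the table above from an inline CSV string so that it runs on its own. The variable name csv_text is just an illustration, and the column names are assumed to match the table; with a real file you would pass its path to read.csv() instead.

```r
# Inline CSV mirroring the table above (assumed layout: one header row,
# Group coded 1 for idealized profiles and 0 for the candidate pool).
csv_text <- "Group,ID,Factor1,Factor2,Factor3,Factor4
0,Person A,0.333,0.2,0.571,3
0,Person B,0.667,0.2,0.571,4
0,Person C,0.667,0.6,-0.285,-2
0,Person D,0.333,1.2,0.571,6
0,Person E,0.000,0.8,-0.285,8
0,Person F,0.000,0.4,-0.285,-5
1,Person G,0.333,1.4,-0.285,-1
1,Person H,0.667,0.6,-0.571,0
1,Person I,0.000,0.2,0.285,6
"

# With a file on disk this would be read.csv("your-file.csv") instead.
matching <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Quick sanity checks before matching: column types and group sizes.
str(matching)
table(matching$Group)
```

Checking the group counts up front is worthwhile, since the matching step needs more rows in the pool than there are profiles to match.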
The following script calls the MatchIt package and performs the matching:
```r
# Load the library
library(MatchIt)

# Set a seed for reproducibility
set.seed(1234)

# Run the matching function: 1:1 nearest-neighbor matching on a
# propensity score estimated from all 4 factors (MatchIt's default distance)
match.it <- matchit(Group ~ Factor1 + Factor2 + Factor3 + Factor4,
                    data = matching, method = "nearest", ratio = 1)
a <- summary(match.it)

# Put the matched set in a new data frame, keeping only the original columns
df.match <- match.data(match.it)[1:ncol(matching)]

# Plot the results
plot(match.it, type = "jitter", interactive = FALSE)
```
Now you have a data frame containing the 3 prototypical profiles and the 3 candidates chosen to match them. Keep in mind that these are nearest-neighbor matches: each chosen candidate is the closest one available, not an exact twin of a profile. See the MatchIt documentation for more information on alternate methods and exact matching.
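To make the idea of a nearest-neighbor match concrete, here is a hand-rolled sketch in base R. It is not MatchIt's algorithm: by default MatchIt matches on an estimated propensity score, whereas this illustration assumes a plain Euclidean distance on standardized factors, so that all four count equally, and a greedy pass in row order. The names z, pool, and pairs are hypothetical, and the results may differ from what matchit() returns.

```r
# Same nine rows as the table above, rebuilt inline so the sketch runs alone.
matching <- data.frame(
  Group   = c(0, 0, 0, 0, 0, 0, 1, 1, 1),
  ID      = paste("Person", LETTERS[1:9]),
  Factor1 = c(0.333, 0.667, 0.667, 0.333, 0.000, 0.000, 0.333, 0.667, 0.000),
  Factor2 = c(0.2, 0.2, 0.6, 1.2, 0.8, 0.4, 1.4, 0.6, 0.2),
  Factor3 = c(0.571, 0.571, -0.285, 0.571, -0.285, -0.285, -0.285, -0.571, 0.285),
  Factor4 = c(3, 4, -2, 6, 8, -5, -1, 0, 6),
  stringsAsFactors = FALSE
)

factors <- c("Factor1", "Factor2", "Factor3", "Factor4")

# Standardize each factor so all four contribute on a comparable scale.
z <- scale(as.matrix(matching[, factors]))
rownames(z) <- matching$ID

profiles <- matching$ID[matching$Group == 1]  # idealized profiles
pool     <- matching$ID[matching$Group == 0]  # candidates still available

# Greedy 1:1 matching without replacement: each profile, in row order,
# takes the closest candidate that has not been claimed yet.
pairs <- data.frame(profile = profiles, match = NA_character_,
                    distance = NA_real_, stringsAsFactors = FALSE)
for (i in seq_along(profiles)) {
  diffs <- sweep(z[pool, , drop = FALSE], 2, z[profiles[i], ])
  d <- sqrt(rowSums(diffs^2))
  best <- which.min(d)
  pairs$match[i] <- pool[best]
  pairs$distance[i] <- d[best]
  pool <- pool[-best]
}

print(pairs)
```

Because the pass is greedy, the order of the profiles can change who gets matched to whom; the distance column gives a rough sense of how close each pairing really is.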
The post Doppelgänger search with R and MatchIt appeared first on Jonathan Fowler.