Doppelgänger search with R and MatchIt

[This article was first published on r – Jonathan Fowler, and kindly contributed to R-bloggers.]

In his book Everybody Lies, Seth Stephens-Davidowitz discusses the Doppelgänger Discovery method, used most notably in baseball in the case of slugger David Ortiz. Doppelgänger Discovery is a way to load up a model with as many data points about a person as possible and find their statistical twins. In the case of David Ortiz, it suggested that he wasn’t quite out of his prime, based on the career arcs of other players just like him.

We are slightly modifying the scenario here. Let’s assume you are charged with selecting participants for a particularly difficult professional development program, one that requires a specific personality profile and resume for someone to get the most out of it. You have 3 spots open and 3 idealized candidate profiles that represent the individuals best suited to participate. There are 4 key factors to match on, and just sorting names in a spreadsheet doesn’t cut it. As with most analytics scenarios, there’s an R package for that. In fact, there are several; I’ve used and prefer MatchIt.

First, get your data straight. In this case, we want a spreadsheet with our individual identifiers (names, Person X, or participant numbers), a group indicator (0 for the control pool of candidates, 1 for the idealized profiles to match against), and the factors to match on. Something like this:

Group  ID        Factor1  Factor2  Factor3  Factor4
0      Person A  0.333    0.2       0.571    3
0      Person B  0.667    0.2       0.571    4
0      Person C  0.667    0.6      -0.285   -2
0      Person D  0.333    1.2       0.571    6
0      Person E  0.000    0.8      -0.285    8
0      Person F  0.000    0.4      -0.285   -5
1      Person G  0.333    1.4      -0.285   -1
1      Person H  0.667    0.6      -0.571    0
1      Person I  0.000    0.2       0.285    6

Let’s figure out who our ideal candidates would be. First, install the MatchIt package, for example with install.packages("MatchIt"). Next, load your spreadsheet (assuming a CSV format) as a data frame named matching.
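As a minimal sketch of the loading step, the snippet below reads the CSV if it exists and otherwise builds the example table by hand. The file name matching.csv is a hypothetical placeholder, and the hand-typed values are my reading of the example table above, so treat both as assumptions rather than the author's actual data file.

```r
# install.packages("MatchIt")  # run once if the package is not installed yet

# Hypothetical file name; point this at your own spreadsheet export
csv_path <- "matching.csv"

if (file.exists(csv_path)) {
  matching <- read.csv(csv_path, stringsAsFactors = FALSE)
} else {
  # Fallback: the example table typed in by hand (assumed reading of the values)
  matching <- data.frame(
    Group   = c(0, 0, 0, 0, 0, 0, 1, 1, 1),
    ID      = paste("Person", LETTERS[1:9]),
    Factor1 = c(0.333, 0.667, 0.667, 0.333, 0.000, 0.000, 0.333, 0.667, 0.000),
    Factor2 = c(0.2, 0.2, 0.6, 1.2, 0.8, 0.4, 1.4, 0.6, 0.2),
    Factor3 = c(0.571, 0.571, -0.285, 0.571, -0.285, -0.285, -0.285, -0.571, 0.285),
    Factor4 = c(3, 4, -2, 6, 8, -5, -1, 0, 6),
    stringsAsFactors = FALSE
  )
}

# The group indicator must be binary (0/1) for the matching step
stopifnot(all(matching$Group %in% c(0, 1)))

# Quick look at column types and the first few values
str(matching)
```

Checking the Group column up front is a small design choice: a stray label such as "control" in the spreadsheet is easier to catch here than in a matching error later.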

The following script calls the MatchIt package and performs the matching:

# Call the library
library(MatchIt)

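# Set a seed so the run is reproducible
set.seed(1234)

# Run matching function; all 4 factors enter the default propensity model
# (logistic regression), and each Group = 1 profile is paired with its
# nearest Group = 0 candidate on the resulting distance
match.it <- matchit(Group ~ Factor1 + Factor2 + Factor3 + Factor4, data = matching, method = "nearest", ratio = 1)

# Review covariate balance before and after matching
a <- summary(match.it)
print(a)

# Put matched set in a new data frame, keeping only the original columns
df.match <- match.data(match.it)[seq_len(ncol(matching))]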

# Plot the results
plot(match.it, type = 'jitter', interactive = FALSE)

Now you have a data frame with the 3 prototypical candidates and the 3 chosen candidates. Keep in mind that this is not an exact 1:1 correspondence: these are nearest-neighbor matches, so each chosen candidate is the closest available match, not an identical twin of a profile. See the MatchIt documentation for more information on alternate methods and exact matching.
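The matched data frame does not say which candidate was paired with which profile. One way to see the pairing, sketched here under assumptions, is the match.matrix element of the fitted matchit object. The data values are my reading of the example table, and using the ID column as row names is my own convenience so that names appear in the output instead of row numbers.

```r
library(MatchIt)

# Example table typed in by hand (assumed reading of the values)
matching <- data.frame(
  Group   = c(0, 0, 0, 0, 0, 0, 1, 1, 1),
  ID      = paste("Person", LETTERS[1:9]),
  Factor1 = c(0.333, 0.667, 0.667, 0.333, 0.000, 0.000, 0.333, 0.667, 0.000),
  Factor2 = c(0.2, 0.2, 0.6, 1.2, 0.8, 0.4, 1.4, 0.6, 0.2),
  Factor3 = c(0.571, 0.571, -0.285, 0.571, -0.285, -0.285, -0.285, -0.571, 0.285),
  Factor4 = c(3, 4, -2, 6, 8, -5, -1, 0, 6),
  stringsAsFactors = FALSE
)

# Use the IDs as row names so the match matrix is labeled by person
rownames(matching) <- matching$ID

# Same call as in the script above; with so few rows the default
# propensity model may emit warnings about its fit
match.it <- matchit(Group ~ Factor1 + Factor2 + Factor3 + Factor4,
                    data = matching, method = "nearest", ratio = 1)

# Rows of match.matrix are the Group = 1 profiles; the single column
# (ratio = 1) holds the row name of the matched Group = 0 candidate
matched_pairs <- data.frame(
  profile   = rownames(match.it$match.matrix),
  candidate = match.it$match.matrix[, 1],
  stringsAsFactors = FALSE
)
print(matched_pairs)
```

With only 9 rows and 4 factors the fitted distances can be unstable, so it is worth eyeballing each pair against the raw factor values before acting on it.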

The post Doppelgänger search with R and MatchIt appeared first on Jonathan Fowler.