Doppelgänger search with R and MatchIt

March 6, 2018

(This article was first published on r – Jonathan Fowler, and kindly contributed to R-bloggers)

In his book Everybody Lies, Seth Stephens-Davidowitz discusses the Doppelgänger Discovery method, used most notably in baseball in the case of slugger David Ortiz. Doppelgänger Discovery is a way to load up a model with as many data points about a person as possible and find their statistical twins. In the case of David Ortiz, it suggested that he wasn’t quite out of his prime, based on the career arcs of other players just like him.

We are slightly modifying the scenario here. Let’s assume you are charged with selecting participants for a particularly demanding professional development program, one that requires a specific personality profile and resume for someone to get the most out of it. You have 3 spots open and 3 idealized candidate profiles that represent the individuals best suited to participate. There are 4 key factors to match on, and just sorting names in a spreadsheet doesn’t cut it. As with most analytics scenarios, there’s an R package for that; in fact, there are several. I’ve used and prefer MatchIt.

First, get your data straight. In this case, we want a spreadsheet with our individual identifiers (names, Person X, or participant numbers), a group indicator (1 for the idealized profiles, 0 for the pool of real candidates), and the factors to match on. Something like this:

Group  ID        Factor1  Factor2  Factor3  Factor4
0      Person A  .333     .2       .571     3
0      Person B  .667     .2       .571     4
0      Person C  .667     .6       -.285    -2
0      Person D  .333     1.2      .571     6
0      Person E  .000     .8       -.285    8
0      Person F  .000     .4       -.285    -5
1      Person G  .333     1.4      -.285    -1
1      Person H  .667     .6       -.571    0
1      Person I  .000     .2       .285     6

Let’s figure out who would be our ideal candidates. First, install the MatchIt package from CRAN. Next, load your spreadsheet (assuming a CSV format) as a data frame named matching.
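
As a rough sketch of those two steps: the snippet below assumes the spreadsheet has exactly the columns shown in the table above. The file name matching.csv is hypothetical, so the table is inlined here to keep the example runnable on its own.

```r
# One-time install from CRAN (uncomment on first use)
# install.packages("MatchIt")

# With a real file this would look something like:
#   matching <- read.csv("matching.csv")   # hypothetical file name
# Here the example table is inlined instead.
matching <- read.csv(text = "Group,ID,Factor1,Factor2,Factor3,Factor4
0,Person A,.333,.2,.571,3
0,Person B,.667,.2,.571,4
0,Person C,.667,.6,-.285,-2
0,Person D,.333,1.2,.571,6
0,Person E,.000,.8,-.285,8
0,Person F,.000,.4,-.285,-5
1,Person G,.333,1.4,-.285,-1
1,Person H,.667,.6,-.571,0
1,Person I,.000,.2,.285,6", stringsAsFactors = FALSE)

# Quick sanity check: Group should be 0/1 and the factors numeric
str(matching)
table(matching$Group)
```

It is worth confirming that Group was read as 0/1 and the factors as numbers before going further, since matchit() needs a binary grouping variable.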

The following script calls the MatchIt package and performs the matching:

# Call the library
library(MatchIt)

# Initialize
set.seed(1234)

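# Run nearest-neighbor matching, one candidate per profile (ratio = 1).
# All 4 factors enter the model on equal footing, but note that by default
# matchit() matches on a propensity score estimated from them by logistic
# regression, not on the raw factors themselves.
match.it <- matchit(Group ~ Factor1 + Factor2 + Factor3 + Factor4,
                    data = matching, method = "nearest", ratio = 1)

# Store the balance summary; print a to inspect it
a <- summary(match.it)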

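# Put matched set in a new data frame, keeping only the original columns
# (match.data() appends extra columns such as distance and weights)
df.match <- match.data(match.it)[1:ncol(matching)]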

# Plot the results
plot(match.it, type = 'jitter', interactive = FALSE)
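
To see which candidate was paired with which profile, one option is to look at the match.matrix element of the fitted object. The sketch below repeats the example data so it runs on its own. Setting the row names to the ID column is my own addition so that the output shows names rather than row numbers, and the pairs data frame is just an illustrative name. With so few rows, the underlying logistic regression may emit warnings.

```r
library(MatchIt)

# Example data from the table above, inlined so the snippet is self-contained
matching <- read.csv(text = "Group,ID,Factor1,Factor2,Factor3,Factor4
0,Person A,.333,.2,.571,3
0,Person B,.667,.2,.571,4
0,Person C,.667,.6,-.285,-2
0,Person D,.333,1.2,.571,6
0,Person E,.000,.8,-.285,8
0,Person F,.000,.4,-.285,-5
1,Person G,.333,1.4,-.285,-1
1,Person H,.667,.6,-.571,0
1,Person I,.000,.2,.285,6", stringsAsFactors = FALSE)

# Assumption: use the IDs as row names so matches are reported by name
rownames(matching) <- matching$ID

match.it <- matchit(Group ~ Factor1 + Factor2 + Factor3 + Factor4,
                    data = matching, method = "nearest", ratio = 1)

# match.matrix has one row per Group == 1 unit; each entry is the row name
# of the Group == 0 unit matched to it (NA if no match was found)
pairs <- data.frame(profile   = rownames(match.it$match.matrix),
                    candidate = match.it$match.matrix[, 1],
                    row.names = NULL)
print(pairs)
```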

You now have a data frame containing the 3 prototypical profiles and the 3 chosen candidates. Keep in mind that these are nearest-neighbor matches rather than exact ones: each chosen candidate is the closest one available, not an identical twin of a profile on all four factors. See the MatchIt documentation for more information on alternative methods and exact matching.
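
To make the idea of a nearest-neighbor match concrete, here is a minimal base-R sketch. It is a deliberately simplified stand-in, not what matchit() does by default (which is to match on a propensity score). It standardizes the four factors so each counts equally, measures Euclidean distance between people, and greedily gives each profile its closest candidate who has not already been taken. The names z, d, and pairs are illustrative.

```r
# Example data from the table above, inlined so the snippet is self-contained
matching <- read.csv(text = "Group,ID,Factor1,Factor2,Factor3,Factor4
0,Person A,.333,.2,.571,3
0,Person B,.667,.2,.571,4
0,Person C,.667,.6,-.285,-2
0,Person D,.333,1.2,.571,6
0,Person E,.000,.8,-.285,8
0,Person F,.000,.4,-.285,-5
1,Person G,.333,1.4,-.285,-1
1,Person H,.667,.6,-.571,0
1,Person I,.000,.2,.285,6", stringsAsFactors = FALSE)

factors <- c("Factor1", "Factor2", "Factor3", "Factor4")

# Standardize each factor (mean 0, sd 1) so no single factor dominates
z <- scale(matching[, factors])
rownames(z) <- matching$ID

profiles <- matching$ID[matching$Group == 1]  # idealized profiles
pool     <- matching$ID[matching$Group == 0]  # real candidates

# Euclidean distances, restricted to profile rows and candidate columns
d <- as.matrix(dist(z))[profiles, pool]

# Greedy matching without replacement, in the order the profiles appear
pairs <- character(0)
for (p in profiles) {
  available <- setdiff(pool, pairs)
  pairs[p] <- names(which.min(d[p, available]))
}
print(pairs)  # names are profiles, values are their chosen candidates
```

Because the matching is greedy, the order in which profiles are processed can change the result; that is one reason to lean on a package such as MatchIt, which offers other matching methods and orderings.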

