I like you and you like me…but what does it all mean. (Part 1)

August 19, 2014

(This article was first published on Mathew Analytics » R, and kindly contributed to R-bloggers)

Tinder is a popular matchmaking application that lets users connect with others with whom they share a physical attraction. New members build their profile by importing their age, gender, geographic information, and photos from their Facebook account. Users are then presented with profiles that meet their search criteria and can like or dislike each one. Unlike traditional online dating sites, members can only communicate with individuals with whom they share a mutual affinity (you liked them and they liked you).

Tinder is an interesting case study for statisticians and data scientists who want to understand how people interact on mobile dating applications. Because large-scale data collection is nearly impossible without a large team of interns, I decided to collect data on the profiles that were presented to me over a one-week period. My goal was to extract information on users and their profiles in order to determine whether certain people were more likely to like my Tinder profile. After a couple of days, I realized that receiving likes on Tinder was a difficult proposition, and I was forced to adjust the data in order to have a robust occurrence rate. Using Naive Bayes, I attempted to glean insights from the data I collected.

> head(dat)
  Hair_Color  Race Text Pictures Age Miles_Away Shared_Interest Overweight Liked_You
1      Black White    Y        5  23      Close               0          N         N
2     Blonde White    N        4  23      Close               1          N         N
3      Black Other    Y        4  28      Close               4          N         N
4     Blonde White    Y        5  23      Close               0          N         N
5     Blonde White    N        4  21      Close               1          N         N
6   Brunette White    Y        6  23      Close               0          N         N

  • The Naive Bayes classifier did a surprisingly good job (60 to 70% accuracy) of predicting whether a user liked me in both the training and test data.
  • Based on the logistic regression model, the most important predictors of whether someone liked me were the number of pictures on their profile, their hair color, and their physical distance from me. The predicted probabilities of someone liking me were higher for users who had fewer pictures, were farther away, and were brunettes.

Part 1 of this series simply provides a high-level overview of the problem and what I found. In Part 2, I’ll offer a review of Naive Bayes classification and provide a worked-out example.

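library(klaR)     # provides NaiveBayes()
library(effects)  # provides effect()

# Hold out one third of the rows for testing
train.ind <- sample(1:nrow(dat), ceiling(nrow(dat)*2/3), replace=FALSE)

# Naive Bayes: fit on the training rows, evaluate on the held-out rows
nb.res <- NaiveBayes(Liked_You ~ Hair_Color + Text + Pictures + Age + Miles_Away, data=dat[train.ind,])
nb.pred <- predict(nb.res, dat[-train.ind,])
nb.accuracy <- table(nb.pred$class, dat[-train.ind,"Liked_You"])
sum(diag(nb.accuracy)) / sum(nb.accuracy)

# Logistic regression on the same training rows, with effect plots
mod <- glm(Liked_You ~ Hair_Color + Text + Pictures + Age + Miles_Away + Shared_Interest,
           data=dat[train.ind,], family=binomial(link = "logit"))
plot(effect("Pictures", mod), rescale.axis=FALSE)
plot(effect("Miles_Away", mod), rescale.axis=FALSE)
plot(effect("Hair_Color", mod), rescale.axis=FALSE)

# Training-set accuracy of the logistic model at a 0.5 cutoff
fit <- fitted(mod)
accuracy <- table(fit > .5, dat[train.ind, "Liked_You"])
sum(diag(accuracy)) / sum(accuracy)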
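
The logistic regression accuracy above is measured on the same rows the model was fit on, so it is likely optimistic. Here is a rough sketch of how a held-out check and per-profile predicted probabilities could be computed. Everything in it is an assumption for illustration: the data frame `sim` is simulated (it is not my Tinder data), the effect sizes are made up, and `classify()` is a hypothetical helper. It uses base R only.

```r
# Simulated stand-in data: NOT the Tinder data from this post.
set.seed(42)
n <- 300
sim <- data.frame(
  Hair_Color = factor(sample(c("Black", "Blonde", "Brunette"), n, replace = TRUE)),
  Pictures   = sample(1:6, n, replace = TRUE),
  Miles_Away = factor(sample(c("Close", "Far"), n, replace = TRUE))
)

# Made-up effect sizes, chosen only so the example has some signal
lin <- 0.5 - 0.3 * sim$Pictures +
  0.8 * (sim$Miles_Away == "Far") +
  0.6 * (sim$Hair_Color == "Brunette")
sim$Liked_You <- factor(ifelse(runif(n) < plogis(lin), "Y", "N"),
                        levels = c("N", "Y"))

# Same 2/3 vs 1/3 split as in the code above
train_idx <- sample(seq_len(n), ceiling(n * 2 / 3))
train <- sim[train_idx, ]
test  <- sim[-train_idx, ]

fit <- glm(Liked_You ~ Hair_Color + Pictures + Miles_Away,
           data = train, family = binomial(link = "logit"))

# Hypothetical helper: turn predicted probabilities into N/Y labels
classify <- function(model, newdata, cutoff = 0.5) {
  p <- predict(model, newdata = newdata, type = "response")
  factor(ifelse(p > cutoff, "Y", "N"), levels = c("N", "Y"))
}

# Accuracy on the training rows and on the held-out rows
train_acc <- mean(classify(fit, train) == train$Liked_You)
test_acc  <- mean(classify(fit, test) == test$Liked_You)

# Confusion table for the held-out rows
conf_mat <- table(Predicted = classify(fit, test), Actual = test$Liked_You)

# Predicted probability of a like for a grid of hypothetical profiles
profiles <- expand.grid(
  Hair_Color = levels(sim$Hair_Color),
  Pictures   = c(2, 6),
  Miles_Away = levels(sim$Miles_Away)
)
profiles$P_Like <- predict(fit, newdata = profiles, type = "response")

print(conf_mat)
print(round(c(train = train_acc, test = test_acc), 3))
print(profiles)
```

Comparing `train_acc` with `test_acc` gives a rough sense of how much the training-set number overstates performance, and the `profiles` table is one way to read off statements like "fewer pictures, higher predicted probability" without relying only on the effect plots.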

