R, Twitter and McDonald’s

March 23, 2012

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Ed Chen is a data scientist at Twitter, so he's accustomed to working with big data and complex models. In an interview with MIT Technology Review, he describes his data science toolbox:

A common pattern for me is that I'll code a MapReduce job in Scala, do some simple command-line munging on the results, pass the data into Python or R for further analysis, pull from a database to grab some extra fields, and so on, often integrating what I find into some machine learning models in the end.

He put this toolbox to great use in a recent blog post, Infinite Mixture Models with Nonparametric Bayes and the Dirichlet Process. After using simulation in Ruby and Python to generate some test data, he used the R language to create a novel clustering model which can group "similar" members of a data set, without needing to specify the number of groups in advance. He used this model to categorize the McDonald's menu into a number of "food groups" containing products with similar nutritional content.
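To see how a model can avoid fixing the number of groups up front, it may help to look at the Chinese restaurant process, a standard way of describing the Dirichlet process. The Python sketch below is only an illustration of that idea, not Chen's code: the function name, the `alpha` concentration parameter, and the seed are all assumptions made for this example. Each new "customer" joins an existing table with probability proportional to its size, or opens a new table with probability proportional to `alpha`.

```python
import random
from collections import Counter


def chinese_restaurant_process(num_customers, alpha, seed=None):
    """Seat customers at tables (clusters) one at a time.

    The number of tables is not chosen in advance; it grows as needed.
    `alpha` (> 0) is a hypothetical name for the concentration parameter:
    larger values make new tables more likely.
    """
    rng = random.Random(seed)
    table_sizes = []   # table_sizes[k] = customers currently at table k
    assignments = []   # assignments[i] = table index of customer i
    for _ in range(num_customers):
        # Existing table k has weight table_sizes[k]; a new table has weight alpha.
        weights = table_sizes + [alpha]
        table = rng.choices(range(len(weights)), weights=weights)[0]
        if table == len(table_sizes):
            table_sizes.append(1)      # open a new table
        else:
            table_sizes[table] += 1    # join an existing table
        assignments.append(table)
    return assignments


assignments = chinese_restaurant_process(100, alpha=1.0, seed=0)
cluster_sizes = Counter(assignments)
print(len(cluster_sizes), "clusters, sizes:",
      sorted(cluster_sizes.values(), reverse=True))
```

Running this with different values of `alpha` gives a feel for how the number of groups emerges from the data-generating process instead of being supplied as a parameter. A full mixture model would also attach a distribution (for example over nutritional attributes) to each table, which this sketch leaves out.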

One cluster contained all the desserts: Baked Hot Apple Pie, Snack Size McFlurry, and others.

The foods in the "dessert" group cluster together because of high trans fat content, low fiber, and other similar nutritional attributes. Other groups identified by the model include salads, burgers and other fried food, three categories of sauces, and (in a cluster all on its own) Fruit and Maple Oatmeal: the only high-fiber item on the menu.

For an astounding amount of detail about the analysis, visit Ed Chen's blog at the link below.

Edwin Chen's Blog: Infinite Mixture Models with Nonparametric Bayes and the Dirichlet Process

