Ed Chen is a data scientist at Twitter, so he's accustomed to working with big data and complex models. In an interview with MIT Technology Review, he describes his data science toolbox:
A common pattern for me is that I'll code a MapReduce job in Scala, do some simple command-line munging on the results, pass the data into Python or R for further analysis, pull from a database to grab some extra fields, and so on, often integrating what I find into some machine learning models in the end.
He put this toolbox to great use in a recent blog post, Infinite Mixture Models with Nonparametric Bayes and the Dirichlet Process. After using simulation in Ruby and Python to generate some test data, he used the R language to create a novel clustering model that can group “similar” members of a data set without needing to specify the number of groups in advance. He used this model to categorize the McDonald's menu into a number of “food groups” containing products with similar nutritional content.
One cluster contained all the desserts: Baked Hot Apple Pie, Snack Size McFlurry, and others including the three below:
As you can see, the foods in the “dessert” group cluster together because of high trans fat content, low fiber, and other similar nutritional attributes. Other groups identified by the model include salads, burgers and other fried food, three categories of sauces, and (in a cluster all on its own) Fruit and Maple Oatmeal: the only high-fiber item on the menu.
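To get a feel for how a model can avoid fixing the number of groups up front, it helps to simulate the Chinese restaurant process, a standard way of describing the Dirichlet process prior named in the post's title. The Python sketch below is not Ed Chen's code: the function name, the `alpha` concentration parameter, and the seed are illustrative assumptions, and it only samples group assignments from the prior rather than fitting anything to the nutritional data.

```python
import random


def chinese_restaurant_process(num_customers, alpha, seed=None):
    """Sample cluster assignments from a Chinese restaurant process.

    Each "customer" (data point) joins an existing "table" (cluster) with
    probability proportional to the number of customers already seated
    there, or opens a new table with probability proportional to alpha.
    The number of tables is never specified in advance.
    """
    rng = random.Random(seed)
    table_sizes = []   # table_sizes[k] = number of customers at table k
    assignments = []   # assignments[i] = table chosen by customer i

    for _ in range(num_customers):
        # Weights for the existing tables, plus one extra slot for a new table.
        weights = table_sizes + [alpha]
        table = rng.choices(range(len(weights)), weights=weights)[0]

        if table == len(table_sizes):
            table_sizes.append(1)      # open a new table
        else:
            table_sizes[table] += 1    # join an existing table
        assignments.append(table)

    return assignments, table_sizes


if __name__ == "__main__":
    assignments, sizes = chinese_restaurant_process(100, alpha=1.0, seed=42)
    print("number of clusters:", len(sizes))
    print("cluster sizes:", sizes)
```

Larger values of `alpha` tend to produce more clusters and smaller values fewer, so the number of groups emerges from the process instead of being chosen beforehand. A full infinite mixture model would combine this prior with a likelihood over the observed features.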
For an astounding amount of detail about the analysis, visit Ed Chen's blog at the link below.
Edwin Chen's Blog: Infinite Mixture Models with Nonparametric Bayes and the Dirichlet Process