A Delicious Analysis! (aka topic modelling using recipes)

February 17, 2014
(This article was first published on Data and Analysis with R, for Fun (and Maybe Work!), and kindly contributed to R-bloggers)

A few months ago, I saw a link on Twitter to an awesome graph charting the similarities of different foods based on their flavour compounds, in addition to their prevalence in recipes (see the whole study, The Flavor Network and the Principles of Food Pairing). I thought this was really neat and became interested in using the data for something slightly different: figuring out which ingredients tend to correlate across recipes. I emailed one of the authors, Yong-Yeol Ahn, who is a real mensch by the way, and he let me know that the raw recipe data are readily available on his website!

Given my goal of looking for which ingredients correlate across recipes, I figured this would be the perfect opportunity to use topic modelling (here I use Latent Dirichlet Allocation, or LDA). Usually in topic modelling you have a lot of filtering to do (stop words, punctuation, and so on) before the text is usable. Not so with these recipe data, where all the words (ingredients) in the corpus are of potential interest, and there aren’t even any punctuation marks! The topics coming out of the analysis would represent clusters of ingredients that co-occur with one another across recipes, and would possibly teach me something about cooking (of which I know precious little!).

All my code is at the bottom, so all you’ll find up here are graphs and my textual summary. The first thing I did was to put the 3 raw recipe files together using Python. Each file consisted of one recipe per line, with the cuisine of the recipe as the first entry on the line, and all other entries (the ingredients) separated by tab characters. In my Python script, I separated out the cuisines from the ingredients, and created two files, one for the recipes, and one for the cuisines of the recipes.
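To make that preprocessing step concrete, here is a minimal Python sketch of the splitting logic. It is not the original script, just an illustration under the format described above (cuisine first, tab-separated ingredients after). The function name `split_recipes`, the made-up sample rows, and the space-separated output layout are all assumptions.

```python
from itertools import chain


def split_recipes(lines):
    """Split raw recipe lines into parallel lists: cuisines and ingredient lists."""
    cuisines, recipes = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        # First tab-separated field is the cuisine, the rest are ingredients.
        cuisine, *ingredients = line.split("\t")
        cuisines.append(cuisine)
        recipes.append(ingredients)
    return cuisines, recipes


# Tiny in-memory stand-in for the three raw files (made-up rows, not real data).
raw_files = [
    ["American\tegg\twheat\tbutter\n", "Italian\ttomato\tgarlic\tbasil\n"],
    ["Asian\tsoy_sauce\tsesame_oil\tchicken\n"],
    ["American\tvanilla\tcream\n"],
]

# Chain the files together so they are processed as one stream of lines.
cuisines, recipes = split_recipes(chain.from_iterable(raw_files))

# One line per recipe in each output, kept in the same order so that
# line i of the recipes text matches line i of the cuisines text.
recipes_text = "\n".join(" ".join(r) for r in recipes)
cuisines_text = "\n".join(cuisines)
```

With real files, each entry of `raw_files` would be an open file handle instead of a list of strings, and the two text blobs would be written out to disk.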

Then I loaded up the recipes into R and got word/ingredient counts.  As you can see below, the 3 most popular ingredients were egg, wheat, and butter.  It makes sense, considering the fact that roughly 70% of all the recipes fall under the “American” cuisine.  I did this analysis for novelty’s sake, and so I figured I would take those ingredients out of the running before I continued on.  Egg makes me fart, wheat is not something I have at home in its raw form, and butter isn’t important to me for the purpose of this analysis!
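The counting and filtering step can be sketched in a few lines. I did this part in R, so the Python below is only an illustration of the idea, not my actual code. It assumes "popularity" means the number of recipes an ingredient appears in, and the function name `ingredient_popularity` and the toy recipes are made up for the example.

```python
from collections import Counter


def ingredient_popularity(recipes, exclude=()):
    """Count how many recipes each ingredient appears in, skipping `exclude`."""
    excluded = set(exclude)
    counts = Counter()
    for ingredients in recipes:
        # set() so an ingredient listed twice in one recipe is counted once.
        counts.update(set(ingredients) - excluded)
    return counts


# Made-up recipes, just to exercise the function.
recipes = [
    ["egg", "wheat", "butter", "vanilla"],
    ["egg", "butter", "onion"],
    ["egg", "wheat", "garlic", "onion"],
    ["tomato", "garlic", "onion"],
]

all_counts = ingredient_popularity(recipes)
filtered = ingredient_popularity(recipes, exclude={"egg", "wheat", "butter"})

# The two most popular ingredients once egg, wheat and butter are dropped.
top = filtered.most_common(2)
```

Calling `most_common(30)` on the real counts would give the kind of top-30 ranking shown in the graphs.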

Figure: Recipe Popularity of Top 30 Ingredients

Here are the top ingredients without the three filtered-out ones:

Figure: Recipe Popularity of Top 30 Ingredients - No Egg, Wheat or Butter

Finally, I ran the LDA, extracting 50 topics, and the top 5 most characteristic ingredients of each topic.  You can see the full complement of topics at the bottom of my post, but I thought I’d review some that I find intriguing.  You will, of course, find other topics intriguing, or some to be bizarre and inappropriate (feel free to tell me in the comment section).  First, topic 4:
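If you are curious what "running the LDA" involves, here is a toy Python sketch that fits LDA with collapsed Gibbs sampling and then pulls out the top terms per topic. This is not the code I used (the real analysis was done in R, with 50 topics on the full corpus), and the sampler, the hyperparameters `alpha` and `beta`, the function names, and the tiny made-up corpus are all assumptions for illustration. It also assumes "most characteristic" means the terms assigned to a topic most often. A pure-Python loop like this would be far too slow for the real data.

```python
import random
from collections import Counter


def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; return per-topic word counts."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * n_topics for _ in docs]         # tokens in doc d with topic k
    topic_word = [Counter() for _ in range(n_topics)]  # times word w has topic k
    topic_total = [0] * n_topics                       # tokens with topic k
    assign = []                                        # topic of every token

    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(n_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(zs)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assign[d][i]
                # Take the token out of the counts before resampling it.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Unnormalised conditional probability of each topic.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + beta * vocab_size)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return topic_word


def top_terms(topic_word, n=5):
    """Return the n most frequently assigned words for each topic."""
    return [[w for w, c in counts.most_common(n) if c > 0] for counts in topic_word]


# Made-up corpus: each "document" is one recipe's ingredient list.
docs = [
    ["tomato", "garlic", "oregano", "basil"],
    ["tomato", "onion", "basil", "garlic"],
    ["vanilla", "cocoa", "milk", "cinnamon"],
    ["vanilla", "milk", "walnut", "cocoa"],
] * 5

topic_word = lda_gibbs(docs, n_topics=2)
topics = top_terms(topic_word)
```

Each entry of `topics` is a list of up to five ingredients, which is the same shape as the topic listings below.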

[1] "tomato"  "garlic"  "oregano" "onion"   "basil"

Here’s a cluster of ingredients that seems decidedly Italian.  The ingredients seem to make perfect sense together, and so I think I’ll try them together next time I’m making pasta (although I don’t like tomatoes in their original form, just tomato sauce).

Next, topic 19:

[1] "vanilla" "cream"   "almond"  "coconut" "oat"

This one caught my attention, and I’m curious if the ingredients even make sense together. Vanilla and cream make sense… Adding coconut would seem to make sense as well. Almond would give it that extra crunch (unless it’s almond milk!). I don’t know whether it would be tasty, however, so I’ll probably pass this one by.

Next, topic 20:

[1] "onion"         "black_pepper"  "vegetable_oil" "bell_pepper"   "garlic"

This one looks tasty!  I like spicy foods and so putting black pepper in with onion, garlic and bell pepper sounds fun to me!

Next, topic 23:

[1] "vegetable_oil" "soy_sauce"     "sesame_oil"    "fish"          "chicken"

Now we’re into the meaty zone! I’m all for putting sauces/oils onto meats, but putting vegetable oil, soy sauce, and sesame oil together does seem like overkill. I wonder whether soy sauce tends to show up with vegetable oil in some recipes and with sesame oil in others, rather than all three appearing together in the same recipe. I’ve always liked the extra salty flavour of soy sauce, even though I’ve heard it’s not great for you (it’s high in sodium and can contain MSG). I wonder what vegetable oil, soy sauce, and chicken would taste like. Something to try, for sure!
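That question about soy sauce is easy to check directly against the recipes, since a topic only tells you that ingredients tend to co-occur, not that they all land in the same recipe. Here is a rough Python sketch of such a check. I have not run this on the real data, and the function name `pairing_breakdown` and the sample recipes are made up for illustration.

```python
from collections import Counter


def pairing_breakdown(recipes, anchor, a, b):
    """Among recipes containing `anchor`, count those with only a, only b, both, or neither."""
    counts = Counter()
    for ingredients in recipes:
        items = set(ingredients)
        if anchor not in items:
            continue  # only recipes that use the anchor ingredient matter
        has_a, has_b = a in items, b in items
        if has_a and has_b:
            counts["both"] += 1
        elif has_a:
            counts[a + "_only"] += 1
        elif has_b:
            counts[b + "_only"] += 1
        else:
            counts["neither"] += 1
    return counts


# Made-up recipes, not taken from the real corpus.
recipes = [
    ["soy_sauce", "vegetable_oil", "chicken"],
    ["soy_sauce", "sesame_oil", "fish"],
    ["soy_sauce", "vegetable_oil", "sesame_oil"],
    ["soy_sauce", "ginger"],
    ["vegetable_oil", "onion"],
    ["soy_sauce", "vegetable_oil", "garlic"],
]

breakdown = pairing_breakdown(recipes, "soy_sauce", "vegetable_oil", "sesame_oil")
```

If the "both" count turned out to be small next to the two "only" counts on the real data, that would suggest the topic is stitching together separate pairings rather than describing one three-way combination.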

Now, topic 26:

[1] "cumin"      "coriander"  "turmeric"   "fenugreek"  "lemongrass"

These are a whole lot of spices that I never use on my food. Not for lack of wanting, but rather out of ignorance and laziness. One of my co-workers recently commented that cumin adds a really nice flavour to food (I think she called it “Middle Eastern”). I’ve never heard a thing about the other spices here, but why not try them out!

Next, topic 28:

[1] "onion"       "vinegar"     "garlic"      "lemon_juice" "ginger"

I tend to find that anything with an intense flavour can be very appetizing for me. Spices, vinegar, and anything citric are what really register on my tongue. So, this topic does look very interesting to me, probably as a topping or a sauce. It’s interesting that ginger shows up here, as I think of it as neutralizing other flavours, so I wonder whether I’d include it in any sauce that I make.

Last one!  Topic 41:

[1] "vanilla"  "cocoa"    "milk"     "cinnamon" "walnut"

These look like the kinds of ingredients for a nice drink of some sort (would you crush the walnuts? I’m not sure!).

Well, I hope you enjoyed this as much as I did!  It’s not a perfect analysis, but it definitely is a delicious one :)  Again, feel free to leave any comments about any of the ingredient combinations, or questions that you think could be answered with a different analysis!


