Analyzing Unique Ingredients in World Cuisines

[This article was first published on R Tutorials – Omni Analytics Group, and kindly contributed to R-bloggers.]

Cross-posted with permission from Omni Analytics Innovative Technologies Initiative (OAITI)

Certain ingredients are staples of particular world cuisines. The use of hard cheeses in Italian cooking and the use of masalas in Indian cooking are two particularly well-known examples. We set out to discover which ingredients are most uniquely associated with various other cuisines.

Using the World Cuisine Recipes page on AllRecipes.com, we selected the featured recipes from all 17 available cuisines. The list is not exhaustive: each cuisine has many recipes, and we scraped only those displayed on the page before scrolling down. In total, 470 recipes were scraped. Each recipe appears on the page as a card.

The cards contain URLs to the full recipes. Using web scraping techniques, we created a recipes dataset in the following format:

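library(dplyr)       # sample_n() and the %>% pipe
library(knitr)       # kable()
library(kableExtra)  # kable_styling()
# Show five randomly chosen recipes as a styled HTML table
recipe_df %>%
    sample_n(5) %>%
    kable("html") %>%
    kable_styling(bootstrap_options = c("striped", "hover"))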
| Cuisine | Name | Ingredients |
|---|---|---|
| United States | Minnesotas Favorite Cookie | 1 cup butter, softened 1 ½ cups brown sugar 2 eggs 2 teaspoons vanilla extract 2 ½ cups all-purpose flour 1 teaspoon baking powder ¼ teaspoon salt 1 cup milk chocolate chips ½ cup semisweet chocolate chips 2/3 cup toffee baking bits 1 cup chopped pecans |
| Mediterranean | Baked Falafel | ¼ cup chopped onion 1 (15 ounce) can garbanzo beans, rinsed and drained ¼ cup chopped fresh parsley 3 cloves garlic, minced 1 teaspoon ground cumin ¼ teaspoon ground coriander ¼ teaspoon salt ¼ teaspoon baking soda 1 tablespoon all-purpose flour 1 egg, beaten 2 teaspoons olive oil |
| Australian and New Zealander | Black Bean and Salsa Soup | 2 (15 ounce) cans black beans, drained and rinsed 1 ½ cups vegetable broth 1 cup chunky salsa 1 teaspoon ground cumin 4 tablespoons sour cream 2 tablespoons thinly sliced green onion |
| Thai | Goong Tod Kratiem Prik Thai Prawns Fried with Garlic and White Pepper | 8 cloves garlic, chopped, or more to taste 2 tablespoons tapioca flour 2 tablespoons fish sauce 2 tablespoons light soy sauce 1 tablespoon white sugar ½ teaspoon ground white pepper ¼ cup vegetable oil, divided, or as needed 1 pound whole unpeeled prawns, divided |
| United States | Kendras Maid Rite Sandwiches | 2 pounds ground beef 1 chopped onion ¾ cup ketchup 2 tablespoons brown sugar 2 tablespoons distilled white vinegar 1 tablespoon Worcestershire sauce 2 teaspoons prepared yellow mustard ½ teaspoon salt 16 hamburger buns, warmed |
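
The scraping code itself is not part of this post. As a rough sketch of the card-to-URL step, assuming the rvest package, the snippet below parses a small inline HTML fragment. The markup, the `card-link` class, and the example.com URLs are all hypothetical stand-ins, not the real AllRecipes page structure.

```r
library(rvest)

# Hypothetical listing markup: two recipe cards, each linking to a full recipe
listing_html <- '
<div class="card"><a class="card-link" href="https://example.com/recipe/1">Baked Falafel</a></div>
<div class="card"><a class="card-link" href="https://example.com/recipe/2">Black Bean and Salsa Soup</a></div>'

# Parse the fragment and select every card link
page  <- read_html(listing_html)
links <- html_elements(page, "a.card-link")

# One row per card: the recipe name and the URL of the full recipe
cards <- data.frame(
  Name = html_text2(links),
  URL  = html_attr(links, "href"),
  stringsAsFactors = FALSE
)
cards
```

In a real scrape, each URL would then be fetched in turn to extract the ingredient list for that recipe.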

The next step is to use the tidytext package to process the ingredient lists for each cuisine and determine the most unique ingredients. We first create a new words dataset that filters out stop words, as well as words associated with measurements or preparation rather than actual recipe ingredients.

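library(tidytext)  # unnest_tokens(), stop_words, bind_tf_idf()

recipe_words <- recipe_df %>%
    # Strip digits so quantities such as "2" or "15" do not become tokens
    mutate(Ingredients = gsub("[0-9]", "", Ingredients)) %>%
    # One row per word (lowercased, punctuation removed)
    unnest_tokens(word, Ingredients) %>%
    count(Cuisine, word, sort = TRUE) %>%
    # Drop measurement and preparation words that are not ingredients
    filter(!(word %in% c("teaspoon", "cup", "ounce", "tablespoons", 
                         "chopped", "teaspoons", "tablespoon", "ground", "fresh", 
                         "can", "sauce", "cups", "plain", "piece", "temperature",
                         "jar", "round", "delicious", "degrees", "minced", "dried",
                         "grated"))) %>%
    # Drop common English stop words
    anti_join(stop_words, by = "word")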

recipe_words %>%
    sample_n(5) %>%
    kable("html") %>%
    kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| Cuisine | word | n |
|---|---|---|
| European | lemonade | 1 |
| Australian and New Zealander | unsalted | 1 |
| Korean | roast | 2 |
| Middle Eastern | half | 2 |
| Canadian | squash | 1 |
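
To see what the tokenizing and filtering steps do, here is a minimal sketch on two made-up ingredient strings, assuming the dplyr and tidytext packages. The `toy_recipes` data and the shortened `measurement_words` list are illustrative assumptions, not part of the scraped dataset.

```r
library(dplyr)
library(tidytext)

# Two made-up recipes, one per cuisine
toy_recipes <- tibble(
  Cuisine     = c("Indian", "Italian"),
  Ingredients = c("2 teaspoons garam masala 1 cup chopped onion",
                  "1 cup grated Parmesan cheese 2 cloves garlic, minced")
)

# A shortened version of the measurement/preparation word list
measurement_words <- c("teaspoons", "cup", "chopped", "grated", "minced")

toy_words <- toy_recipes %>%
  mutate(Ingredients = gsub("[0-9]", "", Ingredients)) %>%  # drop digits
  unnest_tokens(word, Ingredients) %>%                      # one lowercase word per row
  filter(!word %in% measurement_words) %>%                  # drop measurement words
  anti_join(stop_words, by = "word") %>%                    # drop stop words
  count(Cuisine, word, sort = TRUE)

toy_words
```

Note that `unnest_tokens()` lowercases the text and strips punctuation, so "Parmesan" becomes "parmesan" and "garlic," becomes "garlic".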

The recipe_words data provides a count of the occurrences of each word in each cuisine. We can now easily get the top n words for each cuisine like so (in this post, we display only Indian and Italian for readability):

recipe_words %>%
    group_by(Cuisine) %>%
    # Rank explicitly by the count column; top_n() keeps ties
    top_n(5, n) %>%
    arrange(Cuisine, desc(n)) %>%
    filter(Cuisine %in% c("Indian", "Italian")) %>%
    kable("html") %>%
    kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
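| Cuisine | word | n |
|---|---|---|
| Indian | salt | 30 |
| Indian | pepper | 26 |
| Indian | oil | 24 |
| Indian | garlic | 21 |
| Indian | onion | 20 |
| Italian | cheese | 30 |
| Italian | pepper | 30 |
| Italian | salt | 28 |
| Italian | garlic | 21 |
| Italian | oil | 19 |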
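
One caveat with `top_n()` is that it keeps ties, so a cuisine can return more rows than requested. A minimal sketch of an alternative, assuming dplyr 1.0.0 or later, uses `slice_max()` with `with_ties = FALSE`. The `toy_counts` data is made up for illustration, with a few counts borrowed from the table above.

```r
library(dplyr)

# A small stand-in for recipe_words
toy_counts <- tibble(
  Cuisine = c("Indian", "Indian", "Indian", "Italian", "Italian", "Italian"),
  word    = c("salt", "pepper", "oil", "cheese", "pepper", "salt"),
  n       = c(30, 26, 24, 30, 30, 28)
)

# Keep exactly two rows per cuisine, even when counts are tied
top_words <- toy_counts %>%
  group_by(Cuisine) %>%
  slice_max(n, n = 2, with_ties = FALSE) %>%
  ungroup()

top_words
```

Here the first `n` names the column to rank by and `n = 2` is the number of rows to keep per group.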

In Indian recipes, salt and pepper are the most commonly occurring ingredients, while in Italian recipes, cheese ties with pepper for the top spot. However, salt, pepper, and cheese are likely common in many cuisines. The real question is: which ingredients are the most unique to each cuisine? To answer that, we can use term frequency-inverse document frequency (TF-IDF), which weights how often a word appears within a cuisine by how few cuisines use that word at all. From there, we can plot the top TF-IDF values for each cuisine to visualize the results.

library(ggplot2)
library(ggthemes)  # provides ptol_pal()

## Create a TF-IDF column
tf_words <- recipe_words %>%
    bind_tf_idf(word, Cuisine, n)

## Plot the top 8 words per cuisine by TF-IDF
tf_words %>%
    arrange(desc(tf_idf)) %>%
    mutate(word = tools::toTitleCase(word)) %>%
    # Fix the factor order so bars stay sorted by TF-IDF after coord_flip()
    mutate(word = factor(word, levels = rev(unique(word)))) %>% 
    group_by(Cuisine) %>% 
    top_n(8, tf_idf) %>% 
    # top_n() keeps ties, so cap each cuisine at 8 rows
    slice(1:8) %>%
    ungroup() %>%
    ggplot(aes(word, tf_idf, fill = Cuisine)) +
        geom_col(show.legend = FALSE) +
        labs(x = NULL, y = "Term Frequency - Inverse Document Frequency") +
        theme_minimal() +
        # Interpolate the 12-colour Paul Tol palette to one colour per cuisine
        scale_fill_manual(values = colorRampPalette(ptol_pal()(12))(length(unique(tf_words$Cuisine)))) +
        facet_wrap(~Cuisine, ncol = 3, scales = "free") +
        coord_flip()
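
To make the TF-IDF measure concrete, the sketch below computes it by hand in base R on made-up counts. It follows the definitions that `bind_tf_idf()` uses: term frequency is a word's share of all counted words in a cuisine, and inverse document frequency is the natural log of the number of cuisines divided by the number of cuisines containing the word. The `toy_counts` values are invented for illustration.

```r
# Made-up counts: "salt" appears in every cuisine, the other words in only one
toy_counts <- data.frame(
  Cuisine = c("Indian", "Indian", "Italian", "Italian", "Korean", "Korean"),
  word    = c("salt", "masala", "salt", "cheese", "salt", "sesame"),
  n       = c(30, 10, 28, 30, 20, 12),
  stringsAsFactors = FALSE
)

# Term frequency: each word's share of its cuisine's total word count
total_per_cuisine <- tapply(toy_counts$n, toy_counts$Cuisine, sum)
toy_counts$tf <- toy_counts$n / as.numeric(total_per_cuisine[toy_counts$Cuisine])

# Inverse document frequency: ln(number of cuisines / cuisines using the word)
n_cuisines        <- length(unique(toy_counts$Cuisine))
cuisines_per_word <- table(toy_counts$word)
toy_counts$idf <- log(n_cuisines / as.numeric(cuisines_per_word[toy_counts$word]))

# TF-IDF is the product of the two
toy_counts$tf_idf <- toy_counts$tf * toy_counts$idf
toy_counts
```

Because "salt" appears in all three toy cuisines, its IDF is ln(3/3) = 0, so its TF-IDF is zero no matter how often it is used. "Masala" appears only in the Indian rows, so it keeps a positive score.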

In the resulting plot, unique words rise to the top. We see Masala in Indian cooking, Sesame in Korean cooking, and Garbanzo in African cooking. The best part is that these concepts apply far beyond recipes: any text analysis can use these ideas to determine unique words across some grouping variable. Look for more posts on text analysis coming soon, which will expand on these ideas.
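
As a small illustration of the same idea outside of recipes, the sketch below applies the same pipeline to two made-up snippets of text grouped by a hypothetical `source` column, assuming the dplyr and tidytext packages. Only the grouping variable changes.

```r
library(dplyr)
library(tidytext)

# Two made-up documents; "source" plays the role that Cuisine played above
toy_docs <- tibble(
  source = c("weather", "sports"),
  text   = c("rain and wind today, rain again tomorrow",
             "the goalkeeper saved a shot, the goalkeeper celebrated")
)

# Tokenize, drop stop words, count, score, and keep the top word per source
toy_tf_idf <- toy_docs %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  count(source, word, sort = TRUE) %>%
  bind_tf_idf(word, source, n) %>%
  group_by(source) %>%
  slice_max(tf_idf, n = 1, with_ties = FALSE) %>%
  ungroup()

toy_tf_idf
```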
