Cross-posted with permission from Omni Analytics Innovative Technologies Initiative (OAITI)
Certain ingredients are staples of particular world cuisines. The use of hard cheeses in Italian cooking and of masalas in Indian cooking are two particularly well-known examples. We set out to discover which ingredients are most uniquely associated with various other cuisines.
Using the World Cuisine Recipes page on AllRecipes.com, we selected the featured recipes from all 17 available cuisines. This is a non-exhaustive list: the recipes are numerous, and we scraped only those displayed on the page before scrolling down. In total, 470 recipes were scraped. Each recipe appears as a card like this:
The cards contain URLs to the full recipes. Using web scraping techniques, we created a recipes dataset in the following format:
library(dplyr)
library(knitr)
library(kableExtra)

# Preview five random recipes
recipe_df %>%
  sample_n(5) %>%
  kable("html") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
| Cuisine | Recipe | Ingredients |
|---|---|---|
| United States | Minnesotas Favorite Cookie | 1 cup butter, softened 1 ½ cups brown sugar 2 eggs 2 teaspoons vanilla extract 2 ½ cups all-purpose flour 1 teaspoon baking powder ¼ teaspoon salt 1 cup milk chocolate chips ½ cup semisweet chocolate chips 2/3 cup toffee baking bits 1 cup chopped pecans |
| Mediterranean | Baked Falafel | ¼ cup chopped onion 1 (15 ounce) can garbanzo beans, rinsed and drained ¼ cup chopped fresh parsley 3 cloves garlic, minced 1 teaspoon ground cumin ¼ teaspoon ground coriander ¼ teaspoon salt ¼ teaspoon baking soda 1 tablespoon all-purpose flour 1 egg, beaten 2 teaspoons olive oil |
| Australian and New Zealander | Black Bean and Salsa Soup | 2 (15 ounce) cans black beans, drained and rinsed 1 ½ cups vegetable broth 1 cup chunky salsa 1 teaspoon ground cumin 4 tablespoons sour cream 2 tablespoons thinly sliced green onion |
| Thai | Goong Tod Kratiem Prik Thai Prawns Fried with Garlic and White Pepper | 8 cloves garlic, chopped, or more to taste 2 tablespoons tapioca flour 2 tablespoons fish sauce 2 tablespoons light soy sauce 1 tablespoon white sugar ½ teaspoon ground white pepper ¼ cup vegetable oil, divided, or as needed 1 pound whole unpeeled prawns, divided |
| United States | Kendras Maid Rite Sandwiches | 2 pounds ground beef 1 chopped onion ¾ cup ketchup 2 tablespoons brown sugar 2 tablespoons distilled white vinegar 1 tablespoon Worcestershire sauce 2 teaspoons prepared yellow mustard ½ teaspoon salt 16 hamburger buns, warmed |
The next step is to use the tidytext package to process the ingredients list for each cuisine and determine its most unique ingredients. We first create a new words dataset, then filter out stop words as well as words associated with measurements or cooking parameters rather than actual ingredients.
library(tidytext)  # unnest_tokens(), stop_words

recipe_words <- recipe_df %>%
  mutate(Ingredients = gsub("[0-9]", "", Ingredients)) %>%  # strip digits
  unnest_tokens(word, Ingredients) %>%                      # one row per word
  count(Cuisine, word, sort = TRUE) %>%
  ungroup() %>%
  # Drop measurement and preparation words
  filter(!(word %in% c("teaspoon", "cup", "ounce", "tablespoons", "chopped",
                       "teaspoons", "tablespoon", "ground", "fresh", "can",
                       "sauce", "cups", "plain", "piece", "temperature", "jar",
                       "round", "delicious", "degrees", "minced", "dried",
                       "grated"))) %>%
  anti_join(stop_words, by = "word")

recipe_words %>%
  sample_n(5) %>%
  kable("html") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| Cuisine | word | n |
|---|---|---|
| Australian and New Zealander | unsalted | 1 |
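As a rough illustration of how rows like this arise, the sketch below tokenizes and counts words in base R. It is only an approximation of what unnest_tokens() and count() do, not the code used for the analysis. The two strings are the opening ingredients of the two United States recipes in the sample above, and the removal list is a shortened subset of the filter shown earlier.

```r
# Opening ingredients of the two United States recipes from the sample table
us_ingredients <- c(
  "1 cup butter, softened 1 ½ cups brown sugar 2 eggs",
  "2 pounds ground beef 1 chopped onion ¾ cup ketchup 2 tablespoons brown sugar"
)

# Lowercase, then split on anything that is not a letter. This drops digits,
# fractions, and punctuation (a simplification of the tidytext tokenizer).
tokens <- unlist(strsplit(tolower(us_ingredients), "[^a-z]+"))
tokens <- tokens[tokens != ""]

# Shortened subset of the measurement/preparation words filtered in the post
drop_words <- c("cup", "cups", "tablespoons", "ground", "chopped")
kept <- tokens[!(tokens %in% drop_words)]

# Count how often each remaining word occurs, most frequent first
word_counts <- sort(table(kept), decreasing = TRUE)
print(word_counts)
```

In this toy input, "brown" and "sugar" each occur twice because both recipes call for brown sugar, while words such as "pounds" and "softened" survive because they are not on the removal list.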
The recipe_words dataset counts how often each word occurs in each cuisine. We can now easily get the top n words for each cuisine like so (in this post, we display only Indian and Italian for readability):
recipe_words %>%
  group_by(Cuisine) %>%
  top_n(5, n) %>%  # rank by the count column explicitly
  arrange(Cuisine, desc(n)) %>%
  filter(Cuisine %in% c("Indian", "Italian")) %>%
  kable("html") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
In Indian recipes, salt and pepper are the most commonly occurring ingredients, while in Italian recipes, cheese rises to the top. However, salt, pepper, and cheese are likely common in many cuisines. The real question is: which ingredients are most unique to each cuisine? To answer that, we can use term frequency-inverse document frequency (TF-IDF) as a measure of uniqueness. TF-IDF weighs how often a word appears within one cuisine against how many cuisines use it at all, so ingredients that appear everywhere are down-weighted. From there, we can plot the top TF-IDF values for each cuisine to visualize the results.
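Before applying this to the full dataset, a small worked example may help show what the score measures. The sketch below uses made-up counts (not taken from the scraped data) and base R only. It assumes the definitions used by tidytext's bind_tf_idf(): term frequency is a word's share of all counted words in a cuisine, and inverse document frequency is the natural log of the number of cuisines divided by the number of cuisines containing the word.

```r
# Toy word counts per cuisine (illustrative numbers, not from the scraped data)
toy <- data.frame(
  Cuisine = c("Indian", "Indian", "Italian", "Italian", "Thai", "Thai"),
  word    = c("salt", "masala", "salt", "parmesan", "salt", "lemongrass"),
  n       = c(10, 4, 12, 6, 8, 3),
  stringsAsFactors = FALSE
)

# Term frequency: the word's share of all counted words in its cuisine
toy$tf <- toy$n / ave(toy$n, toy$Cuisine, FUN = sum)

# Document frequency: the number of distinct cuisines that use each word
n_cuisines <- length(unique(toy$Cuisine))
doc_freq <- table(unique(toy[, c("Cuisine", "word")])$word)

# Inverse document frequency: ln(total cuisines / cuisines using the word)
toy$idf <- log(n_cuisines / as.numeric(doc_freq[toy$word]))

# TF-IDF is the product of the two
toy$tf_idf <- toy$tf * toy$idf

# Highest-scoring (most distinctive) words first
print(toy[order(-toy$tf_idf), ])
```

In this toy data, salt appears in all three cuisines, so its idf is ln(3/3) = 0 and its TF-IDF is zero despite its high counts. Masala, parmesan, and lemongrass each appear in a single cuisine and keep a positive score.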
library(ggplot2)
library(ggthemes)  # ptol_pal()

## Create a TF-IDF column
tf_words <- recipe_words %>%
  bind_tf_idf(word, Cuisine, n)

## Plot the top 8 words per cuisine by TF-IDF
tf_words %>%
  arrange(desc(tf_idf)) %>%
  mutate(word = tools::toTitleCase(word)) %>%
  mutate(word = factor(word, levels = rev(unique(word)))) %>%
  group_by(Cuisine) %>%
  top_n(8, tf_idf) %>%
  slice(1:8) %>%  # cap at 8 rows per cuisine when ties occur
  ungroup() %>%
  ggplot(aes(word, tf_idf, fill = Cuisine)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "Term Frequency - Inverse Document Frequency") +
  theme_minimal() +
  scale_fill_manual(values = colorRampPalette(ptol_pal()(12))(length(unique(tf_words$Cuisine))),
                    guide = guide_legend(nrow = 2)) +
  facet_wrap(~Cuisine, ncol = 3, scales = "free") +
  coord_flip()
Now, unique words rise to the top. We see Masala in Indian cooking, Sesame in Korean cooking, and Garbanzo in African cooking. The best part is that these concepts apply far beyond recipes: any text analysis can use them to find the words most unique to each level of a grouping variable. Look for more posts on text analysis coming soon, which will extend these ideas.