A Fun Gastronomical Dataset: What’s on the Menu?

Posted on September 20, 2016 by Rasmus Bååth in R bloggers | 0 Comments

[This article was first published on Publishable Stuff, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I just found a fun food themed dataset that I’d never heard about and that I thought I’d share. It’s from a project called What’s on the menu where the New York Public Library has crowdsourced a digitization of their collection of historical restaurant menus. The collection stretches all the way back to the 19th century and well into the 1990’s, and on the home page it is stated that there are “1,332,271 dishes transcribed from 17,545 menus”. Here is one of those menus, from a turn of the (old) century Chinese-American restaurant:

The data is freely available in csv format (yay!) and here I ‘ll just show how to the get the data into R and I’ll use it to plot the popularity of some foods over time.

First we’re going to download the data, “unzip” csv files into a temporary directory, and read them into R.

library(tidyverse)
library(stringr)
library(curl)

# This url changes every month, check what's the latest at http://menus.nypl.org/data
menu_data_url <- "https://s3.amazonaws.com/menusdata.nypl.org/gzips/2016_09_16_07_00_30_data.tgz"
temp_dir <- tempdir()
curl_download(menu_data_url, file.path(temp_dir, "menu_data.tgz"))
untar(file.path(temp_dir, "menu_data.tgz"), exdir = temp_dir)
dish <- read_csv(file.path(temp_dir, "Dish.csv"))
menu <- read_csv(file.path(temp_dir, "Menu.csv"))
menu_item <- read_csv(file.path(temp_dir, "MenuItem.csv"))
menu_page <- read_csv(file.path(temp_dir, "MenuPage.csv"))

The resulting tables together describe the contents of the menus, but in order to know which dish was on which menu we need to join together the four tables. While doing this we’re also going to remove some uninteresting columns and remove some records that were not coded correctly.

d <- menu_item %>% select( id, menu_page_id, dish_id, price) %>%
  left_join(dish %>% select(id, name) %>% rename(dish_name = name),
            by = c("dish_id" = "id")) %>%
  left_join(menu_page %>% select(id, menu_id),
            by = c("menu_page_id" = "id")) %>%
  left_join(menu %>% select(id, date, place, location),
            by = c("menu_id" = "id")) %>%
  mutate(year = lubridate::year(date)) %>%
  filter(!is.na(year)) %>%
  filter(year > 1800 & year <= 2016) %>%
  select(year, location, menu_id, dish_name, price, place)

d$decennium = floor(d$year / 10) * 10 foods <- c("coffee", "tea", "pancake", "ice cream", "french frie", "french peas", "apple", "banana", "strawberry") # Above I dropped the "d" in French fries in order # to also match "French fried potatoes." food_over_time <- map_df(foods, function(food) { d %>% filter(year >= 1900 & year <= 1980) %>% group_by(decennium, menu_id) %>% summarise(contains_food = any(str_detect(dish_name, regex(food, ignore_case = TRUE)), na.rm = TRUE)) %>% summarise(prop_food = mean(contains_food, na.rm = TRUE)) %>% mutate(food = food) })

# A reusable list of ggplot2 directives to produce a lineplot food_time_plot <- list( geom_line(), geom_point(), scale_y_continuous("% of menus include",labels = scales::percent, limits = c(0, NA)), scale_x_continuous(""), facet_wrap(~ food), theme_minimal(), theme(legend.position = "none")) food_over_time %>% filter(food %in% c("coffee", "tea")) %>% ggplot(aes(decennium, prop_food, color = food)) + food_time_plot