A Fun Gastronomical Dataset: What’s on the Menu?

[This article was first published on Publishable Stuff, and kindly contributed to R-bloggers.]


I just found a fun food-themed dataset that I’d never heard about and that I thought I’d share. It’s from a project called What’s on the menu? where the New York Public Library has crowdsourced a digitization of their collection of historical restaurant menus. The collection stretches from the 19th century well into the 1990s, and the home page states that there are “1,332,271 dishes transcribed from 17,545 menus”. Here is one of those menus, from a turn of the (old) century Chinese-American restaurant:

The data is freely available in csv format (yay!), and here I’ll just show how to get the data into R and use it to plot the popularity of some foods over time.

First we’re going to download the data, “unzip” the csv files into a temporary directory, and read them into R.

<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%">library(tidyverse)
library(stringr)
library(curl)

<span style="color: #408080; font-style: italic"># This url changes every month, check what's the latest at http://menus.nypl.org/data</span>
menu_data_url <span style="color: #666666"><-</span> <span style="color: #BA2121">"https://s3.amazonaws.com/menusdata.nypl.org/gzips/2016_09_16_07_00_30_data.tgz"</span>
temp_dir <span style="color: #666666"><-</span> tempdir()
curl_download(menu_data_url, file.path(temp_dir, <span style="color: #BA2121">"menu_data.tgz"</span>))
untar(file.path(temp_dir, <span style="color: #BA2121">"menu_data.tgz"</span>), exdir <span style="color: #666666">=</span> temp_dir)
dish <span style="color: #666666"><-</span> read_csv(file.path(temp_dir, <span style="color: #BA2121">"Dish.csv"</span>))
menu <span style="color: #666666"><-</span> read_csv(file.path(temp_dir, <span style="color: #BA2121">"Menu.csv"</span>))
menu_item <span style="color: #666666"><-</span> read_csv(file.path(temp_dir, <span style="color: #BA2121">"MenuItem.csv"</span>))
menu_page <span style="color: #666666"><-</span> read_csv(file.path(temp_dir, <span style="color: #BA2121">"MenuPage.csv"</span>))
</pre></div>

The resulting tables together describe the contents of the menus, but in order to know which dish was on which menu we need to join together the four tables. While doing this we’re also going to remove some uninteresting columns and remove some records that were not coded correctly.

<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%">d <span style="color: #666666"><-</span> menu_item <span style="color: #666666">%>%</span> select( id, menu_page_id, dish_id, price) <span style="color: #666666">%>%</span>
  left_join(dish <span style="color: #666666">%>%</span> select(id, name) <span style="color: #666666">%>%</span> rename(dish_name <span style="color: #666666">=</span> name),
            by <span style="color: #666666">=</span> c(<span style="color: #BA2121">"dish_id"</span> <span style="color: #666666">=</span> <span style="color: #BA2121">"id"</span>)) <span style="color: #666666">%>%</span>
  left_join(menu_page <span style="color: #666666">%>%</span> select(id, menu_id),
            by <span style="color: #666666">=</span> c(<span style="color: #BA2121">"menu_page_id"</span> <span style="color: #666666">=</span> <span style="color: #BA2121">"id"</span>)) <span style="color: #666666">%>%</span>
  left_join(menu <span style="color: #666666">%>%</span> select(id, date, place, location),
            by <span style="color: #666666">=</span> c(<span style="color: #BA2121">"menu_id"</span> <span style="color: #666666">=</span> <span style="color: #BA2121">"id"</span>)) <span style="color: #666666">%>%</span>
  mutate(year <span style="color: #666666">=</span> lubridate<span style="color: #666666">::</span>year(date)) <span style="color: #666666">%>%</span>
  filter(<span style="color: #666666">!</span>is.na(year)) <span style="color: #666666">%>%</span>
  filter(year <span style="color: #666666">></span> <span style="color: #666666">1800</span> <span style="color: #666666">&</span> year <span style="color: #666666"><=</span> <span style="color: #666666">2016</span>) <span style="color: #666666">%>%</span>
  select(year, location, menu_id, dish_name, price, place)
</pre></div>
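If the chain of joins above looks opaque, here is a minimal sketch of the same pattern on made-up data. The four toy tables and all their values are hypothetical; only the column names are borrowed from the code above, and I use base R's `format()` instead of lubridate just to keep the dependencies down.

```r
library(dplyr)

# Toy stand-ins for the four real tables; every value here is made up.
menu_item <- data.frame(id = 1:3, menu_page_id = c(10, 10, 11), dish_id = c(100, 101, 100))
dish      <- data.frame(id = c(100, 101), name = c("Coffee", "Celery"))
menu_page <- data.frame(id = c(10, 11), menu_id = c(1000, 1001))
menu      <- data.frame(id = c(1000, 1001),
                        date = as.Date(c("1900-04-15", "1933-01-01")))

# Each left_join() follows one foreign key:
# menu item -> dish, menu item -> menu page, menu page -> menu.
toy <- menu_item %>%
  left_join(dish %>% select(id, dish_name = name), by = c("dish_id" = "id")) %>%
  left_join(menu_page, by = c("menu_page_id" = "id")) %>%
  left_join(menu, by = c("menu_id" = "id")) %>%
  mutate(year = as.integer(format(date, "%Y")))

toy
```

Every row of `menu_item` survives the joins and simply picks up the dish name, the menu id and the menu date along the way.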

What we are left with in the d data frame is a table of what dishes were served, where they were served and when. Here is a sampler:

<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%">d[sample(<span style="color: #666666">1:</span>nrow(d), <span style="color: #666666">10</span>), ]
</pre></div>

<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%"><span style="color: #408080; font-style: italic"># A tibble: 10 × 6</span>
    year                      location menu_id                         dish_name price
   <span style="color: #666666"><</span>dbl<span style="color: #666666">></span>                         <span style="color: #666666"><</span>chr<span style="color: #666666">></span>   <span style="color: #666666"><</span>int<span style="color: #666666">></span>                             <span style="color: #666666"><</span>chr<span style="color: #666666">></span> <span style="color: #666666"><</span>dbl<span style="color: #666666">></span>
<span style="color: #666666">1</span>   <span style="color: #666666">1900</span>            Fifth Avenue Hotel   <span style="color: #666666">25394</span>            Broiled Mutton Kidneys    <span style="color: #008000; font-weight: bold">NA</span>
<span style="color: #666666">2</span>   <span style="color: #666666">1971</span>                 Tadlich Grill   <span style="color: #666666">26670</span>                       Mixed Green  <span style="color: #666666">0.85</span>
<span style="color: #666666">3</span>   <span style="color: #666666">1939</span>                Maison Prunier   <span style="color: #666666">30325</span>                  Entrecote Minute    <span style="color: #008000; font-weight: bold">NA</span>
<span style="color: #666666">4</span>   <span style="color: #666666">1914</span>          The Beekman Café Co.   <span style="color: #666666">33898</span>                  Camembert cheese  <span style="color: #666666">0.10</span>
<span style="color: #666666">5</span>   <span style="color: #666666">1900</span>         Carlton Hotel Company   <span style="color: #666666">21865</span>                        Pork Chops  <span style="color: #666666">0.15</span>
<span style="color: #666666">6</span>   <span style="color: #666666">1914</span> Gutmann's Café and Restaurant   <span style="color: #666666">33982</span> Cold Boiled Ham with Potato Salad  <span style="color: #666666">0.40</span>
<span style="color: #666666">7</span>   <span style="color: #666666">1912</span>               Waldorf<span style="color: #666666">-</span>Astoria   <span style="color: #666666">34512</span>            Stuffed Figs and Dates  <span style="color: #666666">0.30</span>
<span style="color: #666666">8</span>   <span style="color: #666666">1933</span>                   Hotel Astor   <span style="color: #666666">31262</span>              Assorted Small Cakes  <span style="color: #666666">0.25</span>
<span style="color: #666666">9</span>   <span style="color: #666666">1933</span>              Ambassador Grill   <span style="color: #666666">31291</span>                    Stuffed celery  <span style="color: #666666">0.55</span>
<span style="color: #666666">10</span>  <span style="color: #666666">1901</span>            Del Coronado Hotel   <span style="color: #666666">14512</span>                           peaches    <span style="color: #008000; font-weight: bold">NA</span>
<span style="color: #408080; font-style: italic"># ... with 1 more variables: place <chr></span>
</pre></div>

Personally I’d go for the Stuffed Figs and Dates at the Waldorf-Astoria followed by some Assorted Small Cakes 21 years later at the Astor. If you want to download this slightly processed version of the dataset it’s available here in csv format. We can also see which are the most common menu items in the dataset:

<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%">d <span style="color: #666666">%>%</span> count(tolower(dish_name)) <span style="color: #666666">%>%</span> arrange(desc(n)) <span style="color: #666666">%>%</span> head(<span style="color: #666666">10</span>)
</pre></div>

<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%"><span style="color: #408080; font-style: italic"># A tibble: 10 × 2</span>
   <span style="color: #BA2121">`tolower(dish_name)`</span>     n
                  <span style="color: #666666"><</span>chr<span style="color: #666666">></span> <span style="color: #666666"><</span>int<span style="color: #666666">></span>
<span style="color: #666666">1</span>                coffee  <span style="color: #666666">8532</span>
<span style="color: #666666">2</span>                celery  <span style="color: #666666">4865</span>
<span style="color: #666666">3</span>                olives  <span style="color: #666666">4737</span>
<span style="color: #666666">4</span>                   tea  <span style="color: #666666">4682</span>
<span style="color: #666666">5</span>              radishes  <span style="color: #666666">3426</span>
<span style="color: #666666">6</span>       mashed potatoes  <span style="color: #666666">2999</span>
<span style="color: #666666">7</span>       boiled potatoes  <span style="color: #666666">2502</span>
<span style="color: #666666">8</span>     vanilla ice cream  <span style="color: #666666">2379</span>
<span style="color: #666666">9</span>         chicken salad  <span style="color: #666666">2306</span>
<span style="color: #666666">10</span>                 milk  <span style="color: #666666">2218</span>
</pre></div>
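The count above lowercases the names but leaves whitespace alone, so a name with a trailing space would be tallied separately from the same name without one. I haven't checked how often that happens in this dataset, so take this as a sketch on made-up names rather than a correction: it squishes whitespace with `str_squish()` and gives the count column a readable name.

```r
library(dplyr)
library(stringr)

# Made-up dish names standing in for d$dish_name.
toy <- data.frame(
  dish_name = c("Coffee", "coffee ", "COFFEE", "Celery", "celery", "Tea")
)

# Normalise case and whitespace inside count(), and sort by frequency.
top_dishes <- toy %>%
  count(dish = str_squish(str_to_lower(dish_name)), sort = TRUE)

top_dishes
```

Naming the expression inside `count()` also avoids the backticked `tolower(dish_name)` column header.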

That coffee is king isn’t that surprising, but the popularity of celery seems weird. My current hypothesis is that “celery” often refers to some kind of celery salad, or maybe it was common as a snack in the New York area in the 1900s. It should be remembered that the dataset does not represent what people ate in general, but is based on which menus were collected by the New York Public Library (presumably from the New York area). Also, the bulk of the menus are from between 1900 and 1980:

<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%">ggplot(d, aes(year)) <span style="color: #666666">+</span>
  geom_histogram(binwidth <span style="color: #666666">=</span> <span style="color: #666666">5</span>, center <span style="color: #666666">=</span> <span style="color: #666666">1902.5</span>, color <span style="color: #666666">=</span> <span style="color: #BA2121">"black"</span>, fill <span style="color: #666666">=</span> <span style="color: #BA2121">"lightblue"</span>) <span style="color: #666666">+</span>
  scale_y_continuous(<span style="color: #BA2121">"N.o. menu items"</span>)
</pre></div>

Even though it’s not completely clear what the dataset represents, we can still have a look at some food trends over time. Below I’m going to go through a couple of common foodstuffs and, for each decennium, calculate what proportion of menus include that foodstuff.

<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%">d<span style="color: #666666">$</span>decennium <span style="color: #666666">=</span> floor(d<span style="color: #666666">$</span>year <span style="color: #666666">/</span> <span style="color: #666666">10</span>) <span style="color: #666666">*</span> <span style="color: #666666">10</span>
foods <span style="color: #666666"><-</span> c(<span style="color: #BA2121">"coffee"</span>, <span style="color: #BA2121">"tea"</span>, <span style="color: #BA2121">"pancake"</span>, <span style="color: #BA2121">"ice cream"</span>, <span style="color: #BA2121">"french frie"</span>,
           <span style="color: #BA2121">"french peas"</span>, <span style="color: #BA2121">"apple"</span>, <span style="color: #BA2121">"banana"</span>, <span style="color: #BA2121">"strawberry"</span>)
<span style="color: #408080; font-style: italic"># Above I dropped the "d" in French fries in order </span>
<span style="color: #408080; font-style: italic"># to also match "French fried potatoes."</span>
food_over_time <span style="color: #666666"><-</span> map_df(foods, <span style="color: #008000; font-weight: bold">function</span>(food) {
  d <span style="color: #666666">%>%</span>
    filter(year <span style="color: #666666">>=</span> <span style="color: #666666">1900</span> <span style="color: #666666">&</span> year <span style="color: #666666"><=</span> <span style="color: #666666">1980</span>) <span style="color: #666666">%>%</span>
    group_by(decennium, menu_id) <span style="color: #666666">%>%</span>
    summarise(contains_food <span style="color: #666666">=</span>
      any(str_detect(dish_name, regex(food, ignore_case <span style="color: #666666">=</span> <span style="color: #008000; font-weight: bold">TRUE</span>)),
          na.rm <span style="color: #666666">=</span> <span style="color: #008000; font-weight: bold">TRUE</span>)) <span style="color: #666666">%>%</span>
    summarise(prop_food <span style="color: #666666">=</span> mean(contains_food, na.rm <span style="color: #666666">=</span> <span style="color: #008000; font-weight: bold">TRUE</span>)) <span style="color: #666666">%>%</span>
    mutate(food <span style="color: #666666">=</span> food)
})
</pre></div>
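One caveat with the matching above: the patterns are matched as plain substrings, so a short pattern like "tea" will also match any dish name that happens to contain those letters. I haven't measured how much this affects the numbers, but here is a minimal sketch on made-up dish names, assuming that anchoring the pattern at a word boundary (`\\b`) is an acceptable fix. I only anchor the start of the word so that plurals and "French fried potatoes" still match.

```r
library(stringr)

# Made-up dish names, chosen to show accidental substring matches.
dishes <- c("Tea", "Iced tea", "Beefsteak", "Steamed clams",
            "Apple pie", "Pineapple")

# Plain substring match, as in the code above.
loose_tea <- str_detect(dishes, regex("tea", ignore_case = TRUE))

# Same pattern, but it must start at a word boundary.
strict_tea   <- str_detect(dishes, regex("\\btea", ignore_case = TRUE))
strict_apple <- str_detect(dishes, regex("\\bapple", ignore_case = TRUE))

data.frame(dishes, loose_tea, strict_tea, strict_apple)
```

In this toy example the loose pattern counts "Beefsteak" and "Steamed clams" as tea, while the anchored versions only keep the two teas and the apple pie. Whether pineapple should count as "apple" is of course a matter of taste.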

First up, Coffee vs. Tea:

<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%"><span style="color: #408080; font-style: italic"># A reusable list of ggplot2 directives to produce a lineplot</span>
food_time_plot <span style="color: #666666"><-</span> list(
  geom_line(),
  geom_point(),
  scale_y_continuous(<span style="color: #BA2121">"% of menus include"</span>, labels <span style="color: #666666">=</span> scales<span style="color: #666666">::</span>percent,
                     limits <span style="color: #666666">=</span> c(<span style="color: #666666">0</span>, <span style="color: #008000; font-weight: bold">NA</span>)),
  scale_x_continuous(<span style="color: #BA2121">""</span>),
  facet_wrap(<span style="color: #666666">~</span> food),
  theme_minimal(),
  theme(legend.position <span style="color: #666666">=</span> <span style="color: #BA2121">"none"</span>))

food_over_time <span style="color: #666666">%>%</span> filter(food <span style="color: #666666">%in%</span> c(<span style="color: #BA2121">"coffee"</span>, <span style="color: #BA2121">"tea"</span>)) <span style="color: #666666">%>%</span>
  ggplot(aes(decennium, prop_food, color <span style="color: #666666">=</span> food)) <span style="color: #666666">+</span> food_time_plot
</pre></div>

Both are pretty popular menu items, but I’m not sure what to make of the trends… Next up, ice cream vs. pancakes:

<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%">food_over_time <span style="color: #666666">%>%</span> filter(food <span style="color: #666666">%in%</span> c(<span style="color: #BA2121">"pancake"</span>, <span style="color: #BA2121">"ice cream"</span>)) <span style="color: #666666">%>%</span>
  ggplot(aes(decennium, prop_food, color <span style="color: #666666">=</span> food)) <span style="color: #666666">+</span> food_time_plot
</pre></div>

Ice cream wins, but again I’m not sure what to make of how ice cream varies over time. Maybe it’s just an artifact of how the data was collected or maybe it actually reflects the icegeist somehow. What about French fries vs. French peas:

<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%">food_over_time <span style="color: #666666">%>%</span> filter(food <span style="color: #666666">%in%</span> c(<span style="color: #BA2121">"french frie"</span>, <span style="color: #BA2121">"french peas"</span>)) <span style="color: #666666">%>%</span>
  ggplot(aes(decennium, prop_food, color <span style="color: #666666">=</span> food)) <span style="color: #666666">+</span> food_time_plot
</pre></div>

Seems like the heyday of French peas is over, but French fries also seem to have peaked in the 40s… Finally, let’s look at some fruit:

<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%">food_over_time <span style="color: #666666">%>%</span> filter(food <span style="color: #666666">%in%</span> c(<span style="color: #BA2121">"apple"</span>, <span style="color: #BA2121">"banana"</span>, <span style="color: #BA2121">"strawberry"</span>)) <span style="color: #666666">%>%</span>
  ggplot(aes(decennium, prop_food, color <span style="color: #666666">=</span> food)) <span style="color: #666666">+</span> food_time_plot
</pre></div>

Banana has really dropped in menu popularity since the early 1900s…

Anyway, this is a really cool dataset and I barely scratched the surface of what could be done with it. If you decide to explore this dataset further, and you make some plots and/or analyses, do send me a link and I will link to it here.

To finish off let’s look at this elegant cocktail menu from 1937 which, among cocktails and fizzes, advertises tiny cocktail tamales:
