Fun data: open data that is fun to analyse


Joe Russell, Adnan Fiaz

Jeremy Singer-Vine sends out a weekly newsletter highlighting a number of interesting open datasets (you can explore all the datasets here). At Mango we are all for open data, so we thought we would also share some of the open datasets we think are fun to explore.

Open Food Facts

Open Food Facts is a collaborative, free and open database of food products, and a prime example of how effective crowdsourced data collection can be. People from around the world have collected details on more than 300,000 food products and uploaded the information through mobile apps. The data is available as a MongoDB dump, a CSV export and an experimental API. We have downloaded the CSV export and will try to visualise the ingredients across all products.

# http://world.openfoodfacts.org/data/en.openfoodfacts.org.products.csv
library(tidyverse)  # readr, dplyr, tibble and ggplot2 are used throughout this post
# the export is tab-separated despite the .csv extension, hence read_tsv
foodFacts <- read_tsv("data/en.openfoodfacts.org.products.csv")
dim(foodFacts)
## [1] 295958    162
library(stringr)
ingredients <- foodFacts %>%
  # ideally, the unnest_tokens function is what we want but it was too slow
  #tidytext::unnest_tokens(ingredient, ingredients_text,
  #                        token = stringr::str_split, pattern=",|\\(|\\)|\\[|\\]") %>%
  # so instead we chose a different approach involving transmute and unlist
  # transmute will give us a list-column
  transmute(ingredients = str_split(ingredients_text, pattern=",|\\(|\\)|\\[|\\]")) %>%
  # unlist will convert the list-column into a character vector
  unlist() %>%
  # enframe will convert the vector into a data frame which is easier to group by
  enframe() %>%
  # now we clean up some of the text
  mutate(value = str_replace(value, "organic|org", ""),  # longer alternative first so "organic" is stripped whole
         value = str_replace(value, "-> en:\\w+", ""),
         value = str_replace_all(value, "'", ""),
         value = str_trim(value)) %>%
  # and finally filter some of the weirder entries
  filter(value!="", value!=".",
         !str_detect(value, "completed|\\d")) %>%
  # to then group by and count the ingredients
  count(value) %>%
  arrange(desc(n))
head(ingredients, 10)
## # A tibble: 10 x 2
##          value      n
##          <chr>  <int>
##  1        salt 122183
##  2       sugar  88463
##  3       water  80037
##  4         sel  38852
##  5       sucre  29971
##  6         eau  28502
##  7 citric acid  28475
##  8  riboflavin  21527
##  9        milk  21265
## 10      niacin  21201

There are no surprises at the top, but further down a few of the ingredients look odd. Let's create a word cloud to show the relative frequencies.

library(wordcloud)
top100 <- head(ingredients, 100)
wordcloud::wordcloud(top100$value, top100$n)

Ingredients are only one aspect of this very interesting dataset. We could go on to look at the co-occurrence of certain ingredients (using network analysis) and then continue by analysing their quantities. We could also bring in the data on nutritional value and calculate correlations. The data could also use some more cleaning, given that some ingredients appear in several languages (e.g. water and eau).
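As a first pass at that clean-up, here is a minimal sketch that folds a handful of common French ingredient names into their English equivalents before re-counting. The translation pairs below are illustrative assumptions, not an exhaustive mapping.

# fold a few common French ingredient names into their English equivalents
# (illustrative subset only) and re-count
ingredientsMerged <- ingredients %>%
  mutate(value = recode(value, sel = "salt", sucre = "sugar", eau = "water", lait = "milk")) %>%
  group_by(value) %>%
  summarise(n = sum(n)) %>%
  arrange(desc(n))
head(ingredientsMerged)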

Food prices

Continuing the edible theme, VAM (the World Food Programme's Vulnerability Analysis and Mapping unit) collates commodity prices from the globe's poorer nations and uses them to help identify food insecurity hotspots. The data we will be using can be downloaded here; we'll attempt to visualise how prices have changed over the past 20 years.

# https://data.humdata.org/dataset/wfp-food-prices
#Providing column names and types to make workings easier later on
colnames <- c("Country_ID", "Country", "Region_ID", "Region", "Market_ID", "Market", 
              "Commodity_ID", "Commodity", "Currency_ID", "Currency", "Sector_ID", "Sector",
              "Units_ID", "Units", "Month", "Year", "Price", "Commodity_Source")
coltypes <- cols(Year = "n", Price = "n", Month = "n")
foodPrices <- read_csv("data/FoodPrices/WFPVAM_FoodPrices_13-03-2017.csv", 
                       col_names = colnames,
                       col_types = coltypes) %>%
  filter(row_number() != 1)

#Large number of commodities - won't be able to plot them all!
length(unique(foodPrices$Commodity))
## [1] 304
#Overall price trend - trend of all commodity prices over time
overallPriceTrend <- foodPrices %>%
  group_by(Commodity, Year) %>%
  #As different commodities will clearly have different prices, we'll make
  #them comparable by scaling based on their max price within our timeframe
  summarise(globalAveragePrice = mean(Price)/max(Price))

#Food prices trend over time, grouped by commodity - same as above, 
#this time selecting a smaller sample for plotting
commodities <- c("Wheat", "Milk", "Coffee", "Bananas", "Sugar")
priceTrend <- foodPrices %>%
  select(Price, Commodity, Year) %>%
  #selecting our reduced commodities
  filter(Commodity %in% commodities) %>%
  group_by(Commodity, Year) %>%
  summarise(globalAveragePrice = mean(Price)/max(Price))

priceTrend
## # A tibble: 77 x 3
## # Groups:   Commodity [?]
##    Commodity  Year globalAveragePrice
##        <chr> <dbl>              <dbl>
##  1   Bananas  2008          1.0000000
##  2   Bananas  2009          0.4487272
##  3   Bananas  2010          0.1727820
##  4   Bananas  2011          0.2909874
##  5   Bananas  2012          0.4470188
##  6   Bananas  2013          0.1271515
##  7   Bananas  2014          0.1263784
##  8   Bananas  2015          0.1129815
##  9   Bananas  2016          0.2050480
## 10   Bananas  2017          0.2630983
## # ... with 67 more rows

We can see what our data looks like for each commodity.
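For a quick per-commodity view, a minimal sketch is a simple faceted line chart of the scaled yearly averages, just as a sanity check on the data we are about to smooth.

# quick per-commodity view of the scaled yearly averages (minimal sketch)
ggplot(priceTrend, aes(x = Year, y = globalAveragePrice)) +
  geom_line() +
  facet_wrap(~Commodity)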

Now let’s create a graphic.

library(forcats)
#We'll create a ggplot graphic, using geom_smooth
#Specify some colours semi-related to their genuine appearance
colours <- c("Bananas" = "#fea814", "Coffee" = "#383838", "Sugar" = "#4fcfff", 
             "Milk" = "#cccccc", "Wheat" = "#005692")
#Specify commodity levels for use in our legend
fctLevels <- c("Coffee", "Bananas", "Milk", "Sugar", "Wheat")
ggplot(priceTrend, aes(x = Year, y = globalAveragePrice)) +
  geom_smooth(aes(colour = fct_relevel(Commodity, fctLevels)), se = FALSE, size = 1.8, linetype = 5) +
  geom_smooth(data = overallPriceTrend, colour = "red", se = FALSE, size = 3.5) +
  geom_line(aes(size = "All Commodities", linetype = NA), colour = "red") +
  scale_colour_manual(values = colours) +
  scale_x_continuous(breaks = seq(1992, 2017, 2)) +
  scale_y_continuous(limits = c(0, 1), labels = c("Min", "Max"), breaks = c(0, 1)) +
  labs(title = "Average Global Commodity Prices over Time",
       subtitle = "Commodity Price Relative to Max in Period",
       caption = "Data from https://data.humdata.org/dataset/wfp-food-prices",
       x = "", 
       y = "",
       colour = "Commodity",
       size = "") +
  theme_classic()

The trend for our selected commodities suggests a gradual decrease in prices: all but coffee and milk are now cheaper than at the beginning of our timeframe. The overall price trend tells a similar story, with a slight downward drift, although this has partly reversed in the past three years.

Our analysis is only a brief peek into the dataset. For example, we could look at seasonality trends, or subset by country, region or even market. Indeed, it could be taken a step further, as VAM have done, and be used as a tool for predicting when and where food insecurity will occur in the future.
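As a flavour of the seasonality idea, a minimal sketch could average the scaled prices by calendar month for the same five commodities, reusing the column names and scaling approach from above rather than proposing anything definitive.

# average scaled price by calendar month for a few commodities (minimal sketch)
seasonality <- foodPrices %>%
  filter(Commodity %in% c("Wheat", "Milk", "Coffee", "Bananas", "Sugar")) %>%
  group_by(Commodity, Month) %>%
  summarise(averagePrice = mean(Price)/max(Price))

ggplot(seasonality, aes(x = Month, y = averagePrice, colour = Commodity)) +
  geom_line() +
  scale_x_continuous(breaks = 1:12)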

North Korea Missile Tests

The next dataset is the James Martin Center for Nonproliferation Studies (CNS) North Korea Missile Test Database. We agree it's not a fun topic, but we also wanted to show the breadth of open data that is out there. You can get the data as an Excel file, but it is also hosted on data.world. Fortunately for the R community, data.world have an R package for accessing their API.

library(data.world)
# We've set the configuration in a previous session
path <- "ian/the-cns-north-korea-missile-test-database"
missiles <- query(qry_sql("SELECT f1, facility_latitude, facility_longitude, 
                          distance_travelled_km, facility_name 
                          FROM missile_tests 
                          WHERE success='Success'"), 
                  path)
# additional filtering outside of query
missiles <- missiles %>% 
  filter(distance_travelled_km!="Unknown", distance_travelled_km!="N/A") %>% 
  drop_na() %>% 
  mutate(distance_travelled_km = as.integer(distance_travelled_km),
         facility_name = substr(facility_name, 0, 20))
head(missiles)
## # A tibble: 6 x 5
##      f1 facility_latitude facility_longitude distance_travelled_km
##   <int>             <dbl>              <dbl>                 <int>
## 1   101          39.65960           124.7057                  1000
## 2   102          39.65960           124.7057                  1000
## 3    12          40.85000           129.6667                   500
## 4    40          38.99083           127.6236                   200
## 5    41          38.99083           127.6236                   200
## 6    42          38.99083           127.6236                   200
## # ... with 1 more variables: facility_name <chr>

The data contains information on the launch location but not on the precise location of where the missile landed. However, we can use the distance travelled to approximate this by drawing a circle of that radius around the launch site.

# slightly adapted from https://stackoverflow.com/questions/34183049/plot-circle-with-a-certain-radius-around-point-on-a-map-in-ggplot2#34187454

# drop duplicate locations and distances
dups <- duplicated(missiles %>% select(facility_name, distance_travelled_km))
missiles <- missiles %>% filter(!dups)

# define the circle we want for each missile
circles <- data_frame(f1 = rep(missiles$f1, each = 100),
                      angle = rep(seq(0, 2*pi, length.out = 100), nrow(missiles)))

meanLatitude <- mean(missiles$facility_latitude)

missile_circles <- missiles %>% 
  # a degree of latitude is roughly 111 km; a degree of longitude shrinks with
  # cos(latitude), so correct for that (dividing by 57.3 converts degrees to radians)
  mutate(radiusLon = distance_travelled_km/111/cos(meanLatitude/57.3),
         radiusLat = distance_travelled_km/111) %>% 
  left_join(circles) %>% 
  mutate(longitude = facility_longitude + radiusLon * cos(angle),
         latitude = facility_latitude + radiusLat * sin(angle))

Now that we have our circles, we can plot them on a map.

library(ggmap)
nk <- get_map(location = c(lon = mean(missile_circles$facility_longitude), 
                          lat = mean(missile_circles$facility_latitude)), 
             zoom = 5, maptype = "terrain")
ggmap(nk, extent = "panel") + 
  geom_point(aes(x = facility_longitude, y = facility_latitude, colour=facility_name), 
             data = missiles) +
  ########### add circles
  geom_polygon(data = missile_circles, aes(longitude, latitude, group = f1, 
                                           colour=facility_name), alpha = 0)

These are obviously not missile ranges, and none of the missiles will have flown over China. The CNS have also created visualisations with this data, and from those we can see that the time dimension is important. That is something we could add to our visualisation, or we could look further into the success/fail dimension.
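For the success/fail angle, a minimal sketch could pull the outcome column alongside the facility name (both appear in the query above) and tally outcomes per launch site.

# count outcomes per launch facility (minimal sketch, same data.world table)
outcomes <- query(qry_sql("SELECT facility_name, success FROM missile_tests"), path)
outcomes %>%
  count(facility_name, success) %>%
  arrange(desc(n))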

Flight Delays

The Bureau of Transportation Statistics is a leading source of data on U.S. transportation systems, helping to shape transportation policy and research projects across the US. We pulled their data, selecting the Month and DayOfWeek fields as well as the Departure and Arrival delays. The site only allows us to download data one month at a time, so we begin by reading the 12 files into a list and binding them together.

# https://www.transtats.bts.gov/DL_SelectFields.asp
# Month files were downloaded to separate csvs, so we'll read them in and combine them
files <- list.files(path = "data/FlightData/", pattern = "*.csv", full.names = TRUE)
flightsRaw <- map_df(files, read_csv)

Now let's recode our Months and Weekdays as names rather than integers, and then prepare the data for plotting. In the plot, we'll aim to show the difference between departure and arrival delays, grouped by Weekday and Month.

#Begin by renaming integer values to days of the week and months
flightData <- flightsRaw %>%
  mutate(DAY_OF_WEEK = as.factor(DAY_OF_WEEK),
         DAY_OF_WEEK = fct_recode(DAY_OF_WEEK, Monday = "1", Tuesday = "2", Wednesday = "3",
                              Thursday = "4", Friday = "5", Saturday = "6", Sunday = "7"),
         MONTH = as.factor(MONTH),
         MONTH = fct_recode(MONTH, January = "1", February = "2", March = "3", April = "4",
                            May = "5", June = "6", July = "7", August = "8", September = "9",
                            October = "10", November = "11", December = "12"))

#Summarise the data to show the mean delay/arrival time split by Month and Weekday
monthDayDelays <- flightData %>%
  select(DEP_DELAY, ARR_DELAY, MONTH, DAY_OF_WEEK) %>%
  group_by(MONTH, DAY_OF_WEEK) %>%
  #Calculate means split by Month and Weekday
  summarise(Departures = mean(DEP_DELAY, na.rm = TRUE), Arrivals = mean(ARR_DELAY, na.rm = TRUE)) %>%
  filter(!is.na(MONTH)) %>%
  ungroup() %>%
  #Gather our data so Departure and Arrival delay times are stored within one column
  gather(key = delayType, value = Delay, Arrivals, Departures) %>%
  #Reorder Departure/Arrival factors so Departures appear first in our graphic
  mutate(delayType = fct_rev(delayType))
monthDayDelays
## # A tibble: 168 x 4
##       MONTH DAY_OF_WEEK delayType       Delay
##      <fctr>      <fctr>    <fctr>       <dbl>
##  1  January      Monday  Arrivals  3.08526171
##  2  January     Tuesday  Arrivals  0.49934650
##  3  January   Wednesday  Arrivals -0.64372809
##  4  January    Thursday  Arrivals -0.61050874
##  5  January      Friday  Arrivals  3.37778214
##  6  January    Saturday  Arrivals  0.05099645
##  7  January      Sunday  Arrivals  4.17985079
##  8 February      Monday  Arrivals  1.30443327
##  9 February     Tuesday  Arrivals  5.09620906
## 10 February   Wednesday  Arrivals  3.00267779
## # ... with 158 more rows

We now have our Delay times in one column, with delayType either arrival or departure. We also have our Month and Weekday columns for grouping our data in the plot.

Now let’s create a graphic. Note that the colours used for our graphic were obtained from ColorBrewer 2.0.

#Specify colours vector for use in scale_fill_manual.
colours <- c("#ffffb2", "#fed976", "#feb24c", "#fd8d3c", "#fc4e2a", "#e31a1c", "#b10026")
#Graphic
ggplot(monthDayDelays, aes(x = MONTH, y = Delay)) +
  #Adding reference lines to our (soon to become) circular chart
  geom_hline(yintercept = seq(0, 20, by = 5), alpha = 0.1) +
  #Bars for every day of the week grouped by month
  geom_bar(stat = "identity", aes(fill = DAY_OF_WEEK), position = "dodge", colour = "black") +
  #This bit is hacky. To get a single label to appear on our reference lines at zero degrees, we select a
  #single datapoint from January (Month at zero degrees) and use it as our data argument.
  #monthDayDelays[1, ] is a datapoint for mean Arrival time in January 
  geom_text(data = monthDayDelays[1, ], y = 5, label = "5", size = 3) +
  geom_text(data = monthDayDelays[1, ], y = 10, label = "10", size = 3) +
  geom_text(data = monthDayDelays[1, ], y = 15, label = "15", size = 3) +
  geom_text(data = monthDayDelays[1, ], y = 20, label = "20", size = 3) +
  #monthDayDelays[85, ] is a datapoint for mean Departure time in January 
  geom_text(data = monthDayDelays[85, ], y = 10, label = "10", size = 3) +
  geom_text(data = monthDayDelays[85, ], y = 15, label = "15", size = 3) +
  geom_text(data = monthDayDelays[85, ], y = 20, label = "20", size = 3) +
  #Specify colours for chart
  scale_fill_manual(values = rev(colours)) + 
  #Remove yAxis scale from side of plot
  scale_y_continuous(breaks = NULL) +
  #Make plot circular, starting at due north
  coord_polar(theta = "x", start = -pi/12) + 
  #Separate our departure and arrival plots
  facet_wrap(~delayType, labeller = label_parsed) +
  #Add labels
  labs(title = "Average Minute Delay of American Flights - 2016",
       x = "",
       y = "", 
       fill = "",
       caption = "Data from https://www.transtats.bts.gov/DL_SelectFields.asp") +
  theme_minimal(base_size = 8) +
  #Make our facet titles pretty
  theme(strip.text.x = element_text(size = 15, colour = "#f03b20"))

Interestingly, we can see that even if your flight is delayed, you can expect that delay to have shrunk by the time you arrive. In fact, on average 5 minutes and 25 seconds are recovered during air time.
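That figure presumably comes from the gap between the average departure delay and the average arrival delay; a minimal sketch of the calculation, using the flightData prepared above:

# gap between mean departure delay and mean arrival delay, in minutes
flightData %>%
  summarise(recovered = mean(DEP_DELAY, na.rm = TRUE) - mean(ARR_DELAY, na.rm = TRUE))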

Perhaps not so interestingly, we can see that the months with the longest delays fall in the summer and winter holidays. Conversely, if you’re not bound by school holidays, November looks to be an excellent time to travel, as you can expect to arrive a whole 2 minutes early!

Again, there are many more pathways we could explore with this analysis. For example, FiveThirtyEight have produced a similar delay chart, only this time grouped by airport. Combined with our brief analysis, that makes your dream of an undelayed vacation a real possibility!
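For completeness, a per-airport version might look something like the sketch below. Note that this assumes the ORIGIN airport field was also selected when downloading from the BTS site, which the data pulled above does not include.

# hypothetical per-airport summary; assumes an ORIGIN column was downloaded,
# which is not among the fields selected earlier in this post
flightsRaw %>%
  group_by(ORIGIN) %>%
  summarise(meanArrivalDelay = mean(ARR_DELAY, na.rm = TRUE),
            flights = n()) %>%
  arrange(desc(meanArrivalDelay))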

This blogpost turned out longer than we expected, but that's what happens when you're having fun. If you have a cool dataset that's fun to analyse, let us know on Twitter. You can find the code for this blogpost on GitHub.
