{tidycovid19}: New data and documentation

Posted on May 9, 2020 by An Accounting and Data Science Nerd's Corner in R bloggers | 0 Comments

[This article was first published on An Accounting and Data Science Nerd's Corner, and kindly contributed to R-bloggers].

A recent update to the {tidycovid19} package adds data on testing, alternative case data, some regional data and proper data documentation. With all this, you can use the package to explore the associations between governmental measures (and their lifting), citizen behavior and the spread of Covid-19.

Installation

The package is hosted on GitHub. As the underlying data sources often change their format and access methods, I have no plans to publish the package on CRAN for the time being. To install it, you need the {remotes} package. Then you can install and attach {tidycovid19}, along with some other packages that we will need below, with:

remotes::install_github("joachim-gassen/tidycovid19")
library(tidycovid19)
library(tidyverse)
library(ggrepel)
library(gghighlight)
library(zoo)

Included Data Sources

By now, the package contains the code to download data from nine authoritative data sources:

Covid-19 data from the Johns Hopkins University CSSE GitHub repo. This data has developed into a standard resource for researchers and the general public interested in assessing the global spread of the virus. The data is provided at country and sub-country levels.
Covid-19 case data provided by the European Centre for Disease Prevention and Control. The data is updated daily and contains the latest available public data on the number of new Covid-19 cases reported per day and per country.
Testing data collected by the ‘Our World in Data’ team. This team systematically collects data on Covid-19 testing from multiple national sources.
Government measures dataset provided by the Assessment Capacities Project (ACAPS). These data allow researchers to study the effect of non-pharmaceutical interventions on the development of the virus.
Oxford Covid-19 Government Response Tracker, an alternative data source for governmental interventions.
Mobility Trends Reports provided by Apple related to Covid-19. The data is provided at country and sub-country levels.
Google COVID-19 Community Mobility Reports data. This data is available at the country, regional and U.S. county level.
Google Trends data on the search volume for the term “coronavirus”. This data can be used to assess the public attention to Covid-19 across countries and over time within a given country. The data is available at the country, regional and city level, but availability varies across countries.
Country level information provided by the World Bank. These data allow researchers to calculate per capita measures of the virus spread and to assess the association of macro-economic variables with the development of the virus.

Each dataset can be downloaded separately with its own download function (download_..._data()). By default, the functions download the data from the authoritative source and print some diagnostic messages and a short data description. You can silence them by setting silent = TRUE. If you set cached = TRUE, the data is downloaded from the GitHub repository of the package instead, which speeds things up considerably. The data in the GitHub repository is updated daily.

download_merged_data() provides a country-day data frame pulling together data from various sources. The data frame tidycovid19_variable_definitions contains variable definitions for the merged data frame.

merged <- download_merged_data(silent = TRUE, cached = TRUE)
tidycovid19_variable_definitions %>%
  select(var_name, var_def) %>%
  knitr::kable() %>%
  kableExtra::kable_styling()

var_name	var_def
iso3c	ISO3c country code as defined by ISO 3166-1 alpha-3
country	Country name
date	Calendar date
confirmed	Confirmed Covid-19 cases as reported by JHU CSSE (accumulated)
deaths	Covid-19-related deaths as reported by JHU CSSE (accumulated)
recovered	Covid-19 recoveries as reported by JHU CSSE (accumulated)
ecdc_cases	Covid-19 cases as reported by ECDC (accumulated)
ecdc_deaths	Covid-19-related deaths as reported by ECDC (accumulated)
total_tests	Accumulated test counts as reported by Our World in Data
tests_units	Definition of what constitutes a ‘test’
soc_dist	Number of social distancing measures reported up to date by ACAPS, net of lifted restrictions
mov_rest	Number of movement restrictions reported up to date by ACAPS, net of lifted restrictions
pub_health	Number of public health measures reported up to date by ACAPS, net of lifted restrictions
gov_soc_econ	Number of social and economic measures reported up to date by ACAPS, net of lifted restrictions
lockdown	Number of lockdown measures reported up to date by ACAPS, net of lifted restrictions
apple_mtr_driving	Apple Maps usage for driving directions, as percentage*100 relative to the baseline of Jan 13, 2020
apple_mtr_walking	Apple Maps usage for walking directions, as percentage*100 relative to the baseline of Jan 13, 2020
apple_mtr_transit	Apple Maps usage for public transit directions, as percentage*100 relative to the baseline of Jan 13, 2020
gcmr_retail_recreation	Google Community Mobility Reports data for the frequency that people visit retail and recreation places expressed as a percentage*100 change relative to the baseline period Jan 3 – Feb 6, 2020
gcmr_grocery_pharmacy	Google Community Mobility Reports data for the frequency that people visit grocery stores and pharmacies expressed as a percentage*100 change relative to the baseline period Jan 3 – Feb 6, 2020
gcmr_parks	Google Community Mobility Reports data for the frequency that people visit parks expressed as a percentage*100 change relative to the baseline period Jan 3 – Feb 6, 2020
gcmr_transit_stations	Google Community Mobility Reports data for the frequency that people visit transit stations expressed as a percentage*100 change relative to the baseline period Jan 3 – Feb 6, 2020
gcmr_workplaces	Google Community Mobility Reports data for the frequency that people visit workplaces expressed as a percentage*100 change relative to the baseline period Jan 3 – Feb 6, 2020
gcmr_residential	Google Community Mobility Reports data for the frequency that people visit residential places expressed as a percentage*100 change relative to the baseline period Jan 3 – Feb 6, 2020
gtrends_score	Google search volume for the term ‘coronavirus’, relative across time with the country maximum scaled to 100
gtrends_country_score	Country-level Google search volume for the term ‘coronavirus’ over a period starting Jan 1, 2020, relative across countries with the country having the highest search volume scaled to 100 (time-stable)
region	Country region as classified by the World Bank (time-stable)
income	Country income group as classified by the World Bank (time-stable)
population	Country population as reported by the World Bank (original identifier ‘SP.POP.TOTL’, time-stable)
land_area_skm	Country land mass in square kilometers as reported by the World Bank (original identifier ‘AG.LND.TOTL.K2’, time-stable)
pop_density	Country population density as reported by the World Bank (original identifier ‘EN.POP.DNST’, time-stable)
pop_largest_city	Population in the largest metropolitan area of the country as reported by the World Bank (original identifier ‘EN.URB.LCTY’, time-stable)
life_expectancy	Average life expectancy at birth of country citizens in years as reported by the World Bank (original identifier ‘SP.DYN.LE00.IN’, time-stable)
gdp_capita	Country gross domestic product per capita, measured in 2010 US-$ as reported by the World Bank (original identifier ‘NY.GDP.PCAP.KD’, time-stable)
timestamp	Date and time when the data was collected from the authoritative sources
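
Because the case and death counts are accumulated and population is time-stable, per-capita measures follow directly from the latest row of each country. The sketch below illustrates this on a made-up data frame that only borrows the column names from the table above. It uses base R so that it runs without the package or network access, and the country codes and numbers are hypothetical.

```r
# Toy country-day rows using the merged column names (all numbers are made up).
toy <- data.frame(
  iso3c = c("AAA", "AAA", "BBB", "BBB"),
  date = as.Date(c("2020-05-01", "2020-05-02", "2020-05-01", "2020-05-02")),
  deaths = c(100, 120, 10, 15),
  population = c(2e6, 2e6, 5e5, 5e5)
)

# Keep the latest row per country; counts are accumulated, so this row holds the total.
latest <- do.call(
  rbind,
  lapply(split(toy, toy$iso3c), function(d) d[which.max(d$date), ])
)

# Deaths per 100,000 inhabitants.
latest$deaths_per_100k <- 1e5 * latest$deaths / latest$population
print(latest[, c("iso3c", "deaths", "deaths_per_100k")])
```

With the real data, the same calculation would apply to the data frame returned by download_merged_data().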

Included Visualization Methods

Based on this data, you can use the visualization functions of the package to quickly produce visuals of the spread. The function plot_covid19_spread() offers many customization options. For example:

plot_covid19_spread(
  merged, type = "deaths", min_cases = 1000, edate_cutoff = 60, 
  cumulative = FALSE, change_ave = 7, 
  highlight = c("USA", "ESP", "ITA", "FRA", "GBR", "DEU", "BRA", "RUS", "TUR")  
)

To customize it, you can also spin up its shiny variant, customize the plot until it fits your needs and then export the code to the clipboard with a simple click.

shiny_covid19_spread()
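
The plot_covid19_spread() call above sets cumulative = FALSE and change_ave = 7. Assuming that this combination means "plot daily changes, averaged over a seven-day window" (my reading of the parameter names, not a description of the package internals), the underlying arithmetic could be sketched in base R as follows. The death counts are made up.

```r
# Hypothetical accumulated death counts for one country (toy numbers).
cum_deaths <- c(0, 1, 3, 6, 10, 15, 21, 28, 36, 45)

# Daily new deaths: first difference of the accumulated series.
new_deaths <- diff(c(0, cum_deaths))

# Trailing 7-day average; the first six days lack a full window and stay NA.
# Assumption: a trailing window. The package may align its window differently.
window <- 7
avg_new_deaths <- as.numeric(
  stats::filter(new_deaths, rep(1 / window, window), sides = 1)
)
print(avg_new_deaths)
```

Averaging over a full week smooths out weekday reporting patterns, which is why a window of 7 is a common choice for daily Covid-19 series.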

For many countries, the function plot_covid19_stripes() provides an alternative way to visualize the spread.

plot_covid19_stripes(merged, type = "deaths", min_cases = 1000, cumulative = FALSE)

And, if you like maps (who does not?), you can also visualize the spread that way.

map_covid19(merged)

By the way, if you provide multiple days to map_covid19() it will produce an animated map, but this will take a while to complete.

Some stuff that you can do with the data

Governmental measures over time

The ACAPS data allows for a quick impression of how governmental restrictions are implemented and lifted over time.

acaps <- download_acaps_npi_data(cached = TRUE, silent = TRUE)
df <- acaps %>%
  rename(date = date_implemented) %>%
  mutate(nobs = 1*(log_type == "Introduction / extension of measures") -
           1*(log_type != "Introduction / extension of measures")) %>%
  select(iso3c, date, log_type, category, nobs) %>%
  filter(date <= "2020-05-10")

ggplot(df, aes(x = date, fill = category, weight = nobs)) +
  geom_histogram(data = df %>% filter(log_type == "Introduction / extension of measures"),
                 position = "stack", binwidth = 24*3600*7) +
  geom_histogram(data = df %>% filter(log_type != "Introduction / extension of measures"),
                 position = "stack", binwidth = 24*3600*7) +
  theme_minimal() +
  labs(title = "Implementation and Lifting of Interventions over Calendar Time",
       x = NULL,
       y = "Number of interventions",
       fill = "Intervention") +
  theme(legend.position = c(0.25, 0.8),
        legend.background = element_rect(fill = "white", color = NA),
        legend.title = element_text(size = 8),
        legend.text = element_text(size = 7))
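
The nobs variable in the code above counts each introduced or extended measure as +1 and every other log entry as -1, so that the histogram weights show introductions above and liftings below the zero line. A minimal base R sketch on made-up log entries shows how the same signed count accumulates to a net number of active measures. The "Phase-out measure" label is only a placeholder for any log type other than the introduction label.

```r
# Hypothetical log entries for one country (dates and rows are made up).
toy_log <- data.frame(
  date = as.Date(c("2020-03-10", "2020-03-15", "2020-03-20",
                   "2020-04-25", "2020-05-04")),
  log_type = c(
    "Introduction / extension of measures",
    "Introduction / extension of measures",
    "Introduction / extension of measures",
    "Phase-out measure",  # placeholder label for a lifted measure
    "Phase-out measure"
  )
)
toy_log <- toy_log[order(toy_log$date), ]

# +1 for an introduced or extended measure, -1 for everything else.
intro <- toy_log$log_type == "Introduction / extension of measures"
toy_log$nobs <- ifelse(intro, 1, -1)

# The running sum gives the number of measures in place, net of lifted ones.
toy_log$net_measures <- cumsum(toy_log$nobs)
print(toy_log[, c("date", "nobs", "net_measures")])
```

This net count is the same idea as the "net of lifted restrictions" wording in the variable definitions for soc_dist, mov_rest and the other ACAPS variables.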

Association between testing and deaths

Is there an association between testing during the first 30 days of the spread and the number of deaths that a country observes?

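early_tests <- merged %>%
  group_by(iso3c) %>%
  # Keep countries with more than 10 million inhabitants
  filter(population > 10e6) %>%
  # Start each country's clock at its first confirmed case
  filter(confirmed > 0) %>%
  # Drop countries without any testing data
  filter(!all(is.na(total_tests))) %>%
  # Interpolate missing test counts linearly, anchored at zero tests before the first day
  mutate(total_tests = na.approx(c(0, total_tests), rule = 2)[-1]) %>%
  filter(date - min(date) < 30) %>%
  summarise(early_tests = unique(1e5*max(total_tests, na.rm = TRUE)/population)) %>%
  filter(!is.na(early_tests))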
  
deaths <- merged %>%  
  group_by(iso3c) %>%
  filter(deaths > 1000) %>%
  filter(population > 10e6) %>%
  summarise(deaths = unique(1e5*max(deaths, na.rm = TRUE)/population)) 

deaths %>% left_join(early_tests, by = "iso3c") %>%
  filter(!is.na(early_tests)) %>%
  ggplot(aes(x = early_tests, y = deaths)) + 
  geom_point() +
  theme_minimal() + 
  geom_label_repel(aes(label = iso3c)) +
  scale_x_log10() +
  scale_y_log10() +
  labs(
    x = "Tests within the first 30 days per 100,000 inhabitants (interpolated)",
    y = "Deaths per 100,000 inhabitants",
    caption = "Case data: JHU CSSE, Test data: Our World in Data."
  )
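
The interpolation step na.approx(c(0, total_tests), rule = 2)[-1] in the code above is compact, so here is a small sketch of what it does, written with base R's approx() instead of {zoo} so that it runs without extra packages. The test counts are made up, and I assume that approx() with rule = 2 behaves like the na.approx() call for this simple case.

```r
# Hypothetical accumulated test counts with gaps (toy numbers).
total_tests <- c(NA, NA, 100, NA, 300, NA)

# Prepend a zero so that leading gaps are interpolated up from zero tests.
y <- c(0, total_tests)
idx <- seq_along(y)
known <- !is.na(y)

# Linear interpolation between known points; rule = 2 carries the last
# known value forward instead of returning NA at the end.
filled <- approx(x = idx[known], y = y[known], xout = idx, rule = 2)$y

# Drop the prepended zero again to restore the original length.
filled <- filled[-1]
print(round(filled, 1))
```

Anchoring the series at zero is a design choice: it treats a country as having done no tests before its first observed day, which can understate early testing for countries that report late.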

Regional variance in individual behavior

Do the social distancing measures work differently in East and West Germany?

gcmr <- download_google_cmr_data(type = "country_region", cached = TRUE, silent = TRUE)

east_regions <- c("Berlin", "Brandenburg", "Mecklenburg-Vorpommern",
          "Saxony", "Saxony-Anhalt", "Thuringia")

df <- gcmr %>% 
  filter(iso3c == "DEU") %>%
  mutate(east = ifelse(region %in% east_regions, "East Germany", "West Germany")) %>%
  select(-iso3c, -region, -timestamp) %>%
  group_by(date, east) %>%
  summarise_all(mean)

ggplot(df, aes(x = date, y = retail_recreation, color = east)) +
  geom_line() +
  theme_minimal() +
  labs(
    x = NULL,
    y = "Percentage change of visits in retail shopping\nand recreational areas",
    caption = "Movement data: Google CMR."
  ) + 
  gghighlight(TRUE, label_key = east)

The same analysis for the Apple data (but without Berlin, as it is classified as a city in the Apple data):

amtr <- download_apple_mtr_data(type = "country_region", cached = TRUE, silent = TRUE)

east_regions <- c("Brandenburg", "Mecklenburg-Vorpommern",
          "Saxony", "Saxony-Anhalt", "Thuringia")

df <- amtr %>% 
  filter(iso3c == "DEU") %>%
  mutate(east = ifelse(region %in% east_regions, "East Germany", "West Germany")) %>%
  select(-iso3c, -region, -timestamp) %>%
  group_by(date, east) %>%
  summarise_all(mean)

ggplot(df, aes(x = date, y = driving, color = east)) +
  geom_line() +
  theme_minimal() +
  labs(
    x = NULL,
    y = "Percentage change of Apple Map requests\nfor driving directions",
    caption = "Movement data: Apple MTR."
  ) + 
  gghighlight(TRUE, label_key = east)
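
Both regional comparisons above label each region as East or West and then average all mobility columns within each date and group. The following base R sketch reproduces that grouping step on made-up numbers. The regions and values are hypothetical, and aggregate() stands in for the group_by() and summarise_all(mean) calls.

```r
# Toy regional mobility data (all values are made up).
toy <- data.frame(
  date = as.Date(rep(c("2020-04-01", "2020-04-02"), each = 3)),
  region = rep(c("Saxony", "Bavaria", "Hesse"), times = 2),
  retail_recreation = c(-60, -50, -40, -58, -48, -44)
)

# Label each region, then average within date and group.
east_regions <- c("Saxony")
toy$east <- ifelse(toy$region %in% east_regions, "East Germany", "West Germany")
by_group <- aggregate(retail_recreation ~ date + east, data = toy, FUN = mean)
print(by_group)
```

Note that this is an unweighted mean across regions, so a small federal state counts as much as a populous one. Weighting by regional population would need additional data that is not part of the mobility data frames shown here.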

Compare Apple Mobility Trend Reports across major European cities

For driving directions:

amtr <- download_apple_mtr_data(type = "country_city", cached = TRUE, silent = TRUE)

cities <- c("Berlin", "London", "Madrid", 
            "Paris", "Rome", "Stockholm")

df <- amtr %>% 
  filter(city %in% cities) %>%
  select(-iso3c, -timestamp) %>%
  group_by(date, city) %>%
  summarise_all(mean)

ggplot(df, aes(x = date, y = driving, color = city)) +
  geom_line() +
  theme_minimal() +
  labs(
    x = NULL,
    y = "Percentage change of Apple Map\nrequests for driving directions",
    caption = "Movement data: Apple MTR."
  ) + 
  gghighlight(TRUE, label_key = city)

And for public transport directions:

ggplot(df, aes(x = date, y = transit, color = city)) +
  geom_line() +
  theme_minimal() +
  labs(
    x = NULL,
    y = "Percentage change of Apple Map requests\nfor public transport directions",
    caption = "Movement data: Apple MTR."
  ) + 
  gghighlight(TRUE, label_key = city)

Wrapping Up

I hope that this quick walk-through helped you to assess the old and new content of the {tidycovid19} package. There are some more use cases in the file example_code.R of the tidycovid19 GitHub repository.

Everybody: Enjoy, stay well and keep #FlattenTheCurve!
