Decoding Case Data Through the COVID-19 Data Hub

[This article was first published on R Consortium, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

by Emanuele Guidotti & David Ardia, originally published in the Journal of Open Source Software

In December 2019 the first cases of pneumonia of unknown etiology were reported in Wuhan city, People’s Republic of China.[1] Since the outbreak of the disease, officially called COVID–19 by World Health Organization (WHO), a multitude of papers have appeared. By one estimate, the COVID-19 literature published in January-May 2019 has reached more than 23,000 papers — among the biggest explosions of scientific literature ever.[2]

In response to the COVID-19 pandemic, the White House and a coalition of leading research groups have prepared the COVID-19 Open Research Dataset[3], a resource of over 134,000 scholarly articles about COVID-19, SARS-CoV-2, and related coronaviruses. The Center for Systems Science and Engineering at the Whiting School of Engineering, with technical support from ESRI and the Johns Hopkins University Applied Physics Laboratory, is maintaining an interactive web-based dashboard to track COVID-19 in real time[4]. All data collected and displayed are made freely available through a GitHub repository. [5] A team of over one hundred Oxford University students and staff from every part of the world is collecting information on several different common policy responses governments have taken. The data are aggregated in The Oxford COVID-19 Government Response Tracker[6]. Google and Apple released mobility reports [7] [8] to help public health officials. Governments all over the world are releasing COVID-19 data to track the outbreak as it unfolds.

It becomes critical to harmonize the amount of heterogeneous data that have become available to help researchers and policy makers in containing the epidemic. To this end, we developed the COVID-19 Data Hub, designed to aggregate the data from several sources and allow contributors to collaborate on the implementation of additional data providers.[9] The goal of our project is to provide the research community with a unified data hub by collecting worldwide fine-grained case data, merged with exogenous variables helpful for a better understanding of COVID-19.

The data are hourly crunched by a dedicated server and harmonized in CSV format on a cloud storage, in order to be easily accessible from R, Python, MATLAB, Excel, and any other software. The data are available at different levels of granularity: 1) administrative area of top-level, usually countries; 2) states, regions, cantons; 3) cities, municipalities.

The first prototype of the platform was developed in spring 2020, initially as part of a research project that was later published in Springer Nature and showcased on the website of the Joint Research Center of the European Commission. The project was then started at the #CodeVSCovid19 hackathon in March, funded by the Canadian Institute for Data Valorization IVADO in April, won the CovidR contest in May, presented at the European R Users Meeting eRum2020 in June, and published in the Journal of Open Source Software in July. At the time of writing, we count 3.43 million downloads and more than 100 members in the community around the project.

COVID-19 Data Hub has recently received support by the R Consortium[10], the worlwide organization that promotes key organizations and groups developing, maintaining, distributing and using R software as a leading platform for data science and statistical computing.[11]

We are now in the process of establishing close cooperation with professors from the Department of Statistics and Biostatistics of the California State University, in a joint effort to maintain the project.

Emanuele Guidotti & David Ardia


Guidotti, E., Ardia, D., (2020), “COVID-19 Data Hub”, Journal of Open Source Software 5(51):2376, doi: 10.21105/joss.02376.

[1] World Health Organization, Novel Coronavirus (2019-nCoV) SITUATION REPORT 1 – 21 JANUARY 2020


[3] Wang, Lucy Lu, Kyle Lo, Yoganand Chandrasekhar, Russell Reas, Jiangjiang Yang, Darrin Eide, Kathryn Funk, et al. 2020. “CORD-19: The Covid-19 Open Research Dataset.” arXiv Preprint arXiv:2004.10706.

[4] Dong, Ensheng, Hongru Du, and Lauren Gardner. 2020. “An Interactive Web-Based Dashboard to Track Covid-19 in Real Time.” The Lancet Infectious Diseases 20 (5): 533–34.


[6] Hale, Thomas, Samuel Webster, Anna Petherick, Toby Phillips, and Beatriz Kira. 2020. “Oxford Covid-19 Government Response Tracker.” Blavatnik School of Government.






The post Decoding Case Data Through the COVID-19 Data Hub appeared first on R Consortium.

To leave a comment for the author, please follow the link and comment on their blog: R Consortium. offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)