An Overview of Data for COVID Analysis

[This article was first published on R on Robert Kubinec, and kindly contributed to R-bloggers.]

The COVID-19 pandemic has led to a wealth of research studies examining various aspects of the pandemic. In this post, I will discuss some of the available datasets for aggregate-level analysis, i.e., comparing countries, provinces, or counties to see what the correlates of COVID-19 infections (or related outcomes like COVID-19 restrictions) are. This overview is not comprehensive, of course, but it can hopefully give you a sense of what is out there and what is worth downloading and potentially incorporating into your analysis.

The Basics: COVID-19 Cases, Deaths and Tests

The most standard indicators available for COVID-19 are reported confirmed cases and deaths. This data is displayed in many data visualizations, but it is not actually as standardized as it may seem. The most commonly used source is the Johns Hopkins University (JHU) COVID-19 dashboard (visualization and GitHub page). The JHU dashboard was one of the earliest resources for COVID-19 cases, has been continually updated, and is even used by Google for COVID-19 reporting in its search engine.

While this is a good resource as a first cut, especially when aggregating to the country level, it is not perfect, especially for recent data. The world is simply too large for an automated system to do this perfectly, and JHU’s GitHub repository has 1,400+ open issues reflecting data quality problems. In my experience the data are generally fine, but some caution is advised when using the most recent data and when checking for potential anomalies in individual countries.
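One cheap check for such anomalies is to difference the cumulative counts and look for days on which the total goes down, which usually signals a revision or a reporting error. The sketch below assumes the wide layout used by the JHU global time series files (one row per province or country, one column per date, cumulative counts). The country names and numbers are invented, and the helper names `country_totals`, `daily_new`, and `flag_anomalies` are my own, not part of any JHU tooling.

```python
import csv
import io
from collections import defaultdict

# Toy data in the wide layout of the JHU global time series
# (one row per province/country, one column per date, cumulative counts).
# Country names and numbers are invented for illustration only.
SAMPLE_CSV = """Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20,1/25/20
,Atlantis,0.0,0.0,1,3,2,6
North,Borduria,1.0,1.0,0,2,5,5
South,Borduria,2.0,2.0,1,1,4,9
"""

META_COLUMNS = {"Province/State", "Country/Region", "Lat", "Long"}


def country_totals(csv_text):
    """Sum cumulative counts over provinces: {country: [count per date]}."""
    reader = csv.DictReader(io.StringIO(csv_text))
    dates = [c for c in reader.fieldnames if c not in META_COLUMNS]
    totals = defaultdict(lambda: [0] * len(dates))
    for row in reader:
        for i, date in enumerate(dates):
            totals[row["Country/Region"]][i] += int(row[date])
    return dates, dict(totals)


def daily_new(cumulative):
    """Difference a cumulative series to get daily new counts."""
    return [b - a for a, b in zip(cumulative, cumulative[1:])]


def flag_anomalies(dates, totals):
    """Return (country, date, change) wherever the cumulative count drops."""
    flagged = []
    for country, series in totals.items():
        for date, change in zip(dates[1:], daily_new(series)):
            if change < 0:
                flagged.append((country, date, change))
    return flagged


dates, totals = country_totals(SAMPLE_CSV)
print(totals["Borduria"])             # country-level cumulative series
print(daily_new(totals["Borduria"]))  # daily new cases
print(flag_anomalies(dates, totals))  # dates where the cumulative count fell
```

A falling cumulative count is only one symptom of a data problem. Large one-day spikes from backlogged reports will pass this check and need a separate look.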

To collect higher-quality and more specialized case data, it is best to turn to lower-level sources, which are primarily national health ministries. For example, the US Centers for Disease Control and Prevention (CDC) has its own historical dataset for states and even down to the county level. The advantage of using this data is that it is probably the most reliable source of data on cases for the United States. The New York Times also collected this data and has an easy-to-use dataset on GitHub, though this was more important earlier in the pandemic, when the CDC did not release timely data on cases.

The European CDC released its own world-wide data that appeared to be higher quality than JHU’s data, though it has since discontinued the world-wide releases. It does, however, still publish detailed, regularly updated data for EU countries.

As such, if the aim is to analyze a specific region, then using a regional or country-level source of COVID cases is more reliable than using JHU or other aggregators, given their known measurement issues. However, JHU remains the only available source of world-wide COVID cases. I would certainly not use other sources of information like Statista, which simply re-package JHU data.

However, there is a big caveat here. It’s widely known at this point that there is serious measurement error in both cases and deaths if the aim is to study the severity of the disease. My own work shows how poorly the U.S.’s reported case numbers match actual infections at the state level. So you might want to consider using some kind of statistically adjusted measure of COVID-19. You can use our estimates from that paper for infections at the state level, though they only cover the first four months of the pandemic. For world-wide coverage, consider looking at The Economist’s release of adjusted death numbers at the country level. Otherwise, it can be very hard to remove bias from these numbers without careful thought.

One variable that can help adjust the numbers is reported tests. However, there is only one source of reported tests at the world-wide level, and that is released by Our World in Data. Including tests in your model is arguably the best way to tackle the under-reporting of COVID cases apart from modeling it directly, as we do above and as others do with SIR/SEIR models. However, this source of testing data is incomplete, as Our World in Data is a relatively small team and this is only one of the datasets they put together and visualize. It is possible to get more detailed test numbers, though even the US CDC does not release test counts at the country level (they are still only available via the volunteer-led COVID Tracking Project).
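To see why tests matter, it can help to put case counts next to testing effort before building any model. The sketch below computes test positivity (cases divided by tests) for three hypothetical regions with invented figures. This is only a crude descriptive check, not the model-based adjustment described above, and the `positivity` helper is my own naming.

```python
# Invented figures for three hypothetical regions; not real data.
reports = [
    {"region": "A", "cases": 500, "tests": 5_000},
    {"region": "B", "cases": 500, "tests": 50_000},
    {"region": "C", "cases": 120, "tests": None},  # tests not reported
]


def positivity(cases, tests):
    """Share of tests that came back positive, or None if tests are missing."""
    if not tests:
        return None
    return cases / tests


for report in reports:
    report["positivity"] = positivity(report["cases"], report["tests"])
    print(report["region"], report["cases"], report["positivity"])
```

Regions A and B report the same number of cases, but A found them with a tenth of the testing, so its raw count is much more likely to understate infections. Region C shows the practical problem: where tests are not reported, the adjustment is simply unavailable.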

The only other way to obtain testing data is to check each country’s or region’s website individually. Again, reporting of tests varies widely and is quite frustrating to collect, so be warned before attempting this. On the whole, the recent Economist adjusted death data release may be the simplest way to model COVID-19 severity if you don’t want to get into the details of the disease too much. It is important to adjust for factors that affect the death rate, though, such as the age structure of the population.
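One standard way to account for age structure is direct age standardization: compute death rates within age bands and weight them by the age shares of a common reference population. The sketch below is a minimal illustration with two hypothetical countries. All numbers, the age bands, and the reference shares in `STANDARD_SHARES` are invented, and this is not the adjustment used in the Economist data.

```python
# Hypothetical reference population shares by age band (sum to 1).
STANDARD_SHARES = {"0-39": 0.5, "40-69": 0.4, "70+": 0.1}

# Invented deaths and population by age band for two hypothetical countries.
# Both have identical age-specific death rates but different age structures.
countries = {
    "Younger": {
        "0-39": {"deaths": 7, "population": 700_000},
        "40-69": {"deaths": 50, "population": 250_000},
        "70+": {"deaths": 250, "population": 50_000},
    },
    "Older": {
        "0-39": {"deaths": 4, "population": 400_000},
        "40-69": {"deaths": 80, "population": 400_000},
        "70+": {"deaths": 1_000, "population": 200_000},
    },
}


def crude_rate(bands):
    """Deaths per 100,000 people, ignoring age structure."""
    deaths = sum(band["deaths"] for band in bands.values())
    population = sum(band["population"] for band in bands.values())
    return 100_000 * deaths / population


def standardized_rate(bands, shares):
    """Directly age-standardized deaths per 100,000 people."""
    return sum(
        shares[age] * 100_000 * band["deaths"] / band["population"]
        for age, band in bands.items()
    )


for name, bands in countries.items():
    print(
        name,
        round(crude_rate(bands), 1),
        round(standardized_rate(bands, STANDARD_SHARES), 1),
    )
```

In this toy example the crude death rates differ by more than a factor of three, yet the standardized rates coincide, because the entire gap comes from one country having a larger elderly population.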

Time-Varying COVID-19 Predictors

The COVID-19 pandemic varied tremendously over time, and to understand it, it is imperative to have time-varying information that is as finely grained as possible. Unfortunately, there is not as much of this as we would like, because academics and other researchers were caught off-guard and it took a long time to get data collection up to speed.
