Getting correct data on covid-19 cases is important for obtaining up-to-date information on how the disease is progressing. It’s also necessary to create models and make accurate forecasts.
However, I’m starting to think most of the charts we see for covid cases over time are incorrect. The reason is that the datasets with aggregate statistics by state scrape data and look at the changes in ‘headline’ numbers: how many total cases of covid there are by day. However, scraping the data this way only accounts for when new deaths were reported, not when they happened.
For instance, take a look at deaths by covid for Colorado in the NY Times infographic here. One can see a dramatic spike of over 100 deaths in CO on April 24. However, this is due to ‘backfilling’, that is, reporting on previously unreported cases. I believe this methodology permeates other data sources, including the automatic graph on Google, IHME, and covidtracking.
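To make the difference concrete, here is a minimal sketch in Python. The records are made up for illustration (they are not Colorado's numbers), and the function names are my own. It assumes each death has both a date of death and a date it was reported, then compares the daily change in the cumulative reported total with the count by actual date of death.

```python
from collections import Counter
from datetime import date

# Hypothetical records: (date of death, date the death was reported).
# These values are invented for illustration only.
records = [
    (date(2020, 4, 20), date(2020, 4, 21)),
    (date(2020, 4, 21), date(2020, 4, 22)),
    (date(2020, 4, 21), date(2020, 4, 24)),  # backfilled late
    (date(2020, 4, 22), date(2020, 4, 24)),  # backfilled late
    (date(2020, 4, 23), date(2020, 4, 24)),
]

days = [date(2020, 4, d) for d in range(20, 25)]


def cumulative_reported(records, days):
    """Headline number: total deaths reported on or before each day."""
    return [sum(1 for _, reported in records if reported <= day) for day in days]


def daily_change(cumulative):
    """Day-over-day difference of a cumulative series (what a scraper sees)."""
    return [cumulative[0]] + [b - a for a, b in zip(cumulative, cumulative[1:])]


def deaths_by_date_of_death(records, days):
    """Deaths counted on the day they actually happened."""
    counts = Counter(died for died, _ in records)
    return [counts[day] for day in days]


headline = cumulative_reported(records, days)
by_report = daily_change(headline)
by_death = deaths_by_date_of_death(records, days)

for day, r, d in zip(days, by_report, by_death):
    print(day.isoformat(), "change in cumulative:", r, "by date of death:", d)
```

Both series add up to the same total, but the scraped series piles the late reports onto April 24, while the series by date of death spreads them over the days they occurred.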
The official data can be found on the covid19 colorado website and looks like a smooth and somewhat symmetric curve. Below is a chart comparing the two datasets.
This appears to be a known issue, but they still seem OK with publishing these incorrect charts. This can lead to erroneous conclusions, both for people looking at these charts and for anyone trying to do an analysis with them.
It looks like the CDC is modeling on similar data. They use the variable “Daily change in cumulative COVID-19 death counts”, not “Daily Deaths” (see slide 4). This may also be a reason that forecasting models are not doing so well.