[This article was first published on R – Hi! I am Nagdev, and kindly contributed to R-bloggers].

Every country is facing a global pandemic caused by COVID-19, and it's quite scary for everyone. Unlike any other pandemic we have faced before, COVID-19 is generating plenty of quality data in near real time. Making this data available to the general public has helped citizen data scientists share their reports, forecast trends and build real-time dashboards.

Like everyone else, I am curious about "How long will all this last?" So I decided to pull up some data for my state and see if I could build a prediction model.

## Getting all the Data Needed

The CDC and your state government websites should be publishing data every day. I got my data from Michigan.gov by clicking on Detroit. Here is the link to compiled data on my GitHub.

## Visualize Data

From the above plot we can clearly see that total cases are increasing exponentially, and total deaths seem to follow a similar trend.

## Correlation

The correlation between the variables is shown below. We will use only Day and Cases for model building, because we want to be able to extrapolate the data to visualize future trends.

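|          | Day       | Cases     | Daily     | Previous  | Deaths    |
|----------|-----------|-----------|-----------|-----------|-----------|
| Day      | 1.0000000 | 0.8699299 | 0.8990702 | 0.8715494 | 0.7617497 |
| Cases    | 0.8699299 | 1.0000000 | 0.9614424 | 0.9570949 | 0.9597218 |
| Daily    | 0.8990702 | 0.9614424 | 1.0000000 | 0.9350738 | 0.8990124 |
| Previous | 0.8715494 | 0.9570949 | 0.9350738 | 1.0000000 | 0.9004541 |
| Deaths   | 0.7617497 | 0.9597218 | 0.8990124 | 0.9004541 | 1.0000000 |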
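
A matrix like this can be produced with base R's `cor`. The sketch below is a minimal example on made-up numbers, not the Michigan data. In particular, treating `Daily` as new cases per day and `Previous` as the prior day's total is my assumption about what those columns hold.

```r
# Hypothetical data (NOT the real Michigan counts), only to show the call.
data <- data.frame(Day = 1:8,
                   Cases = c(1, 3, 6, 12, 25, 50, 100, 200))

# Assumed meanings: Daily = new cases that day, Previous = prior day's total.
data$Daily    <- c(data$Cases[1], diff(data$Cases))
data$Previous <- c(0, head(data$Cases, -1))
data$Deaths   <- c(0, 0, 0, 1, 1, 2, 4, 7)

# Pearson correlation between every pair of columns
cor_matrix <- cor(data)
round(cor_matrix, 3)
```

Because `cor` measures linear association, columns that all grow together over a short window will tend to correlate strongly with each other, as they do in the table above.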

## Build a Model for Total Cases

To build the model, we first split the data into train and test sets, with 80% of the rows used for training. Next, we build an exponential regression model with the simple lm function, regressing log(Cases) on Day and Day^2. Finally, we view the summary of the model.

# sample 80% of the row indices for the training set
samples = sample(seq_len(nrow(data)), size = floor(nrow(data) * 0.8))

# build an exponential regression model: log(Cases) as a quadratic in Day
model = lm(log(Cases) ~ Day + I(Day^2), data = data[samples, ])

# look at the summary of the model
summary(model)
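
Since `sample` draws different rows on every run, the split and the fitted coefficients change from run to run unless a seed is set. The following is a minimal sketch of a reproducible version of the same steps. The data frame is synthetic (it is not the Michigan data) and the seed value is arbitrary, so the numbers it produces will not match the summary in this post.

```r
# Hypothetical stand-in for the article's data: 16 days of roughly
# exponential case counts (NOT the real Michigan numbers).
set.seed(42)  # arbitrary seed, only to make the example reproducible
data <- data.frame(Day = 1:16)
data$Cases <- round(exp(0.5 + 0.45 * data$Day + rnorm(16, sd = 0.2)))

# 80% of the rows for training: 12 of 16
n_train <- floor(0.8 * nrow(data))
samples <- sample(seq_len(nrow(data)), size = n_train)

# Same model form as in the post: log(Cases) as a quadratic in Day
model <- lm(log(Cases) ~ Day + I(Day^2), data = data[samples, ])
coef(model)
```

With 12 training rows and 3 estimated coefficients, the fit has 9 residual degrees of freedom, which is the same count reported in the summary for the real data.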

In the summary below, Day is highly significant (p ≈ 7.6e-05), while Day^2 is only marginally significant (p ≈ 0.067); we will keep it anyway. The adjusted R-squared is 0.976, meaning the model explains most of the variance in log(Cases) on the training rows, and the p-value of the overall F-test (1.951e-08) is well below 0.05.

Note: Don't bash me about the number of samples. I agree this is not a lot of data, and I might be overfitting.

Call:
lm(formula = log(Cases) ~ Day + I(Day^2), data = data[samples, ])

Residuals:
     Min       1Q   Median       3Q      Max
-0.58417 -0.13007  0.07647  0.17218  0.56305

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.091554   0.347073  -0.264   0.7979
Day          0.711025   0.104040   6.834 7.61e-05 ***
I(Day^2)    -0.013296   0.006391  -2.080   0.0672 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3772 on 9 degrees of freedom
Multiple R-squared: 0.9806, Adjusted R-squared: 0.9763
F-statistic: 228 on 2 and 9 DF, p-value: 1.951e-08
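
Reading the coefficient estimates off the summary above (rounded to four decimals), the fitted model can be written out and back-transformed to the case scale. This is only a restatement of the printed estimates, not an extra result from the original analysis:

$$
\log(\widehat{\text{Cases}}) = -0.0916 + 0.7110\,\text{Day} - 0.0133\,\text{Day}^2
\quad\Longrightarrow\quad
\widehat{\text{Cases}} = \exp\!\left(-0.0916 + 0.7110\,\text{Day} - 0.0133\,\text{Day}^2\right)
$$

The fitted daily growth rate of log cases is the derivative with respect to Day:

$$
\frac{d}{d\,\text{Day}} \log(\widehat{\text{Cases}}) = 0.7110 - 2(0.0133)\,\text{Day},
\qquad
\text{which is zero at } \text{Day} = \frac{0.711025}{2 \times 0.013296} \approx 26.7
$$

So the negative Day^2 coefficient makes the fitted curve flatten and turn over near day 27. Since that coefficient is not significant at the 5% level and comes from 12 training rows, this turning point should be treated as a property of the fitted formula rather than as a forecast.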

## Prediction Time

Now that we have a model, we can make predictions on the test data. In all honesty, I did not intend to make the prediction call this complicated, but here it is. From the predictions, we calculate the mean absolute error (MAE), which indicates that our predictions are off by about 114 cases on average. The error can go in either direction: we may be overestimating or underestimating.

“Seems like Overfitting!!”

# predict on the held-out rows, then undo the log with exp()
test = data[-samples, ]
results = data.frame(actual = test$Cases,
                     Prediction = exp(predict(model, newdata = test)))
# view test results
results

#   actual Prediction
# 1     25   12.67729
# 2     53   40.28360
# 3    110  186.92442
# 4   2294 2646.77897

# calculate mae
Metrics::mae(results$actual, results$Prediction)
# [1] 113.6856
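
The MAE above can be reproduced by hand from the four printed test rows, which also shows how much it depends on the largest count. The sketch below uses only base R and the values copied from the results table. The relative-error column is an addition of mine, not part of the original analysis.

```r
# Test-set values copied from the printed results above
results <- data.frame(
  actual     = c(25, 53, 110, 2294),
  Prediction = c(12.67729, 40.28360, 186.92442, 2646.77897)
)

# Mean absolute error: average of |actual - predicted|
abs_err <- abs(results$actual - results$Prediction)
mae <- mean(abs_err)
mae

# Relative error per row (my addition): absolute error as a share of actual
rel_err <- abs_err / results$actual
round(rel_err, 2)
```

The last row alone contributes about 353 of the roughly 455 cases of total absolute error, so the MAE says little about the fit on the smaller counts, where the relative errors range from roughly 24% to 70%.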

## Visualize the Predictions

Let's plot the model results over the entire data set, train and test, to see how close we are. The plot seems to show that our predictions are very accurate. This might be because of scaling.

Now let's try a log scale, as shown below. Here we can see that our prediction model was overestimating the total cases. This is also a valuable lesson in how two different charts of the same results can lead to different interpretations.
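
The effect of the axis scale can be seen even with the four test rows printed earlier. The following is a rough sketch in base R graphics, not the original plotting code: it draws predicted against actual cases twice, once on linear axes and once on log axes, with a dashed line marking perfect predictions.

```r
# Test-set values copied from the results table earlier in the post
results <- data.frame(
  actual     = c(25, 53, 110, 2294),
  Prediction = c(12.67729, 40.28360, 186.92442, 2646.77897)
)

# Ratio of predicted to actual: above 1 is an overestimate, below 1 an underestimate
ratio <- results$Prediction / results$actual

op <- par(mfrow = c(1, 2))  # two panels side by side

# Linear axes: the three small counts bunch together near the origin
plot(results$actual, results$Prediction,
     xlab = "Actual cases", ylab = "Predicted cases", main = "Linear scale")
abline(0, 1, lty = 2)  # y = x

# Log axes: each point's distance from the line reflects its ratio
plot(results$actual, results$Prediction, log = "xy",
     xlab = "Actual cases", ylab = "Predicted cases", main = "Log scale")
abline(0, 1, lty = 2)  # still y = x, drawn in log10 coordinates

par(op)  # restore the previous layout
```

On linear axes a miss of a dozen cases is invisible next to a count in the thousands, so points tend to look as if they sit on the line. On log axes equal ratios take up equal distance, which makes misses on the small counts much easier to see.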

## Conclusion

From the above analysis and model building, we saw how we can predict the number of pandemic cases in Michigan. On further analysis, we found that the model looked too good to be true, which points to overfitting. For now, I don't have a lot of data to work with. I will give this model another try in a week to see how it performs when fed more data. This would be a good experiment.

Let me know what you think, and leave a comment on what I should have done differently.

The post COVID-19 Data and Prediction for Michigan appeared first on Hi! I am Nagdev.