Every country is facing a global pandemic caused by COVID-19, and it is quite scary for everyone. Unlike any pandemic we have faced before, COVID-19 is producing plenty of quality data in near real time. Making this data available to the general public has helped citizen data scientists share their reports, forecast trends, and build real-time dashboards.
Like everyone else, I am curious about one question: "How long will all this last?" So I decided to pull up some data for my state and see if I could build a prediction model.
Getting all the Data Needed
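The data-loading step is not shown here; as a minimal sketch, assuming a CSV (hypothetical file name) with the columns used throughout this post, it could look like:

```r
# load the daily case data (hypothetical file name; columns assumed from the analysis below)
data = read.csv("michigan_covid19.csv")  # Day, Cases, Daily, Previous, Deaths

# plot total cases over time
plot(data$Day, data$Cases, type = "b",
     xlab = "Day", ylab = "Total cases",
     main = "Michigan COVID-19 total cases")
```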
From the above plot we can clearly see that total cases are increasing in an exponential trend, and total deaths appear to follow a similar pattern.
The correlation between each pair of variables is shown below. We will use only Day and Cases for model building, because we want to be able to extrapolate our data to visualize future trends.
```
               Day     Cases     Daily  Previous    Deaths
Day      1.0000000 0.8699299 0.8990702 0.8715494 0.7617497
Cases    0.8699299 1.0000000 0.9614424 0.9570949 0.9597218
Daily    0.8990702 0.9614424 1.0000000 0.9350738 0.8990124
Previous 0.8715494 0.9570949 0.9350738 1.0000000 0.9004541
Deaths   0.7617497 0.9597218 0.8990124 0.9004541 1.0000000
```
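A matrix like this comes straight from base R's cor() function; a one-liner, assuming the same data frame used throughout:

```r
# pairwise Pearson correlations between all numeric columns
cor(data[, c("Day", "Cases", "Daily", "Previous", "Deaths")])
```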
Build a Model for Total Cases
To build the model, we first split the data into train and test sets, with the split ratio set at 80%. Next, we build an exponential regression model by fitting log(Cases) with the base lm function. Finally, we view the summary of the model.
```r
# create samples from the data
samples = sample(1:16, size = 16 * 0.8)

# build an exponential regression model
model = lm(log(Cases) ~ Day + I(Day^2), data = data[samples, ])

# look at the summary of the model
summary(model)
```
In the summary below, we can see that the Day term is highly significant for our prediction, while the Day^2 term is not (p ≈ 0.067); we will keep it anyway. The adjusted R-squared is 0.976, indicating a very close fit, and the model's overall p-value is far below 0.05.
Note: Don't bash me about the number of samples. I agree this is not a good amount of data, and I might be overfitting.
```
Call:
lm(formula = log(Cases) ~ Day + I(Day^2), data = data[samples, ])

Residuals:
     Min       1Q   Median       3Q      Max
-0.58417 -0.13007  0.07647  0.17218  0.56305

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.091554   0.347073  -0.264   0.7979
Day          0.711025   0.104040   6.834 7.61e-05 ***
I(Day^2)    -0.013296   0.006391  -2.080   0.0672 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3772 on 9 degrees of freedom
Multiple R-squared:  0.9806, Adjusted R-squared:  0.9763
F-statistic: 228 on 2 and 9 DF,  p-value: 1.951e-08
```
Prediction for New Data
Now that we have a model, we can make predictions on the test data. In all honesty, I did not intend to make the prediction call this complicated, but here it is. From the predictions, we compute the Mean Absolute Error, which tells us that our average error is about 114 cases: on average, we are either overestimating or underestimating by that much.
“Seems like Overfitting!!”
```r
# predict on the held-out rows and back-transform from the log scale
results = data.frame(
  actual = data[-samples, ]$Cases,
  Prediction = exp(predict(model, data.frame(Day = data$Day[-samples])))
)

# view test results
results
#   actual Prediction
# 1     25   12.67729
# 2     53   40.28360
# 3    110  186.92442
# 4   2294 2646.77897

# calculate mae
Metrics::mae(results$actual, results$Prediction)
# 113.6856
```
Visualize the Predictions
Let's plot the entire set of model results, train and test, to see how close we are. The plot seems to show that our predictions are very accurate, but this might just be an effect of the scaling.
Now, let's try a log scale, shown below. Here we can see that our prediction model was overestimating the total cases. This is also a valuable lesson in how two different charts of the same results can suggest different interpretations.
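For reference, this kind of log-scale comparison can be sketched with base graphics, assuming the results data frame from the prediction step (same column names as above):

```r
# actual vs. predicted cases on a log10 y-axis
plot(results$actual, type = "b", log = "y",
     xlab = "Test observation", ylab = "Cases (log scale)")
lines(results$Prediction, type = "b", lty = 2)
legend("topleft", legend = c("Actual", "Predicted"), lty = 1:2)
```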
From the above analysis and model building, we saw how we can predict the number of pandemic cases in Michigan. On further analysis, we found that the model was too good to be true, in other words overfitting. For now, I don't have a lot of data to work with, so I will give this model another try in a week to see how it performs with more data. That should be a good experiment.
Let me know what you think of this, and comment on what you would have done differently.