[This article was first published on R-english – Freakonometrics, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Following my previous post, I wanted to spend more time, on the time series with “global weather-related disaster losses as a proportion of global GDP” over the time period 1990-2016 that Roger Pilke sent me last night.

db=data.frame(year=1990:2016, ratio=c(.23,.27,.32,.37,.22,.26,.29,.15,.40,.28,.14,.09,.24,.18,.29,.51,.13,.17,.25,.13,.21,.29,.25,.2,.15,.12,.12))

In my previous post, I spend some time explaining that we should provide some sort of ‘confidence interval’ when we try to predict a pattern. That was what we call ‘model uncertainty’. But there are two (important) issues that I did not mention. (1) it is a time series, so why not use techniques dedicated to time series objects ? (2) we do not really care actually about ‘model uncertainty’ (unless we want to assess if a decreasing trend is significant, or not), and we care more about real prediction uncertainty: in the next ten years, what could be the range for the this ratio, with some given probability (say 95%)? Could we say that with 95% chance the global weather-related disaster losses as a proportion of global GDP should be (each year) within 0 and 0.35 or 0 and 0.7?

A first idea might be to use exponential smoothing techniques (without a seasonal component here).

ratio=ts(db$ratio,start=1990,frequency=1) plot(ratio,xlim=c(1990,2030)) hw=HoltWinters(ratio,gamma=FALSE) phw=predict(hw,n.ahead=15,prediction.interval = TRUE) plot(hw,phw,xlim=c(1990,2030)) polygon(c(2017:2031,rev(2017:2031)), c(phw[,2],rev(phw[,3])),border=NA,col=rgb(0,0,1,.2)) The decreasing trend is coming from the fact that exponential smoothing is here a linear regression, with weight exponentially decaying with time (the older, the smaller the weight). But we cannot use that prediction, since the ratio cannot (obviously) be negative. So why not consider, here, the logarithm of the ratio plot(db$year,db$ratio,type="l",xlim=c(1990,2030),ylim=c(-.2,.7),xlab="year",ylab="ratio") hw=HoltWinters(log(ratio),gamma=FALSE) phw=predict(hw,n.ahead=15,prediction.interval = TRUE) abline(v=2016,lty=2,col="grey") lines(2017:2031,exp(phw[,2]),col="blue") lines(2017:2031,exp(phw[,3]),col="blue") lines(c(1992:2016,2017:2031),c(exp(hw$fitted[,1]),exp(phw[,1])),col="red") polygon(c(2017:2031,rev(2017:2031)),exp(c(phw[,2],rev(phw[,3]))),border=NA,col=rgb(0,0,1,.2)) abline(h=0,lty=2)

The confidence band is huge, here. What if we consider some ARIMA model here?

fit=auto.arima(ratio) farma=forecast(fit,15) farma=cbind(as.numeric(farma$fitted)[1:15],as.numeric(farma$lower[,1]),as.numeric(farma$upper[,1]),as.numeric(farma$lower[,2]),as.numeric(farma$upper[,2])) plot(db$year,db$ratio,type="l",xlim=c(1990,2030),ylim=c(-.2,.7),xlab="year",ylab="ratio") abline(v=2016,lty=2,col="grey") lines(2017:2031,farma[,4],col="blue") lines(2017:2031,farma[,5],col="blue") lines(2017:2031,farma[,1],col="red") polygon(c(2017:2031,rev(2017:2031)),c(farma[,4],rev(farma[,5])),border=NA,col=rgb(0,0,1,.2)) abline(h=0,lty=2) Here, there is an intercept, but no dynamics for the time series (which is considered, here, as a pure white noise). We get exactly the same if we consider the average value of the series fit=lm(ratio~1,data=db) s=summary(fit)$sigma plot(db$year,db$ratio,type="l",xlim=c(1990,2030),ylim=c(-.2,.7),xlab="year",ylab="ratio") abline(v=2016,lty=2,col="grey") ndb=data.frame(year=2017:2031) pf=predict(fit,newdata=ndb) farma=cbind(pf,pf-1.96*s,pf+1.96*s) lines(2017:2031,farma[,2],col="blue") lines(2017:2031,farma[,3],col="blue") lines(1990:2031,c(predict(fit),farma[,1]),col="red") polygon(c(2017:2031,rev(2017:2031)),c(farma[,2],rev(farma[,3])),border=NA,col=rgb(0,0,1,.2)) abline(h=0,lty=2)

Here, we get back to my previous post, if we want to consider a possible trend (and not only an intercept)

fit=lm(ratio~year,data=db) s=summary(fit)$sigma plot(db$year,db$ratio,type="l",xlim=c(1990,2030),ylim=c(-.2,.7),xlab="year",ylab="ratio") abline(v=2016,lty=2,col="grey") ndb=data.frame(year=2017:2031) pf=predict(fit,newdata=ndb) farma=cbind(pf,pf-1.96*s,pf+1.96*s) lines(2017:2031,farma[,2],col="blue") lines(2017:2031,farma[,3],col="blue") lines(1990:2031,c(predict(fit),farma[,1]),col="red") polygon(c(2017:2031,rev(2017:2031)),c(farma[,2],rev(farma[,3])),border=NA,col=rgb(0,0,1,.2)) abline(h=0,lty=2) Again, the confidence region is not based on inference related error, but on model uncertainty: we try to visualize where future observations might be with (say) 95% chance. Note we can also consider (why not?) a quadratic regression fit=lm(ratio~poly(year,2),data=db) s=summary(fit)$sigma plot(db$year,db$ratio,type="l",xlim=c(1990,2030),ylim=c(-.2,.7),xlab="year",ylab="ratio") abline(v=2016,lty=2,col="grey") ndb=data.frame(year=2017:2031) pf=predict(fit,newdata=ndb) farma=cbind(pf,pf-1.96*s,pf+1.96*s) lines(2017:2031,farma[,2],col="blue") lines(2017:2031,farma[,3],col="blue") lines(1990:2031,c(predict(fit),farma[,1]),col="red") polygon(c(2017:2031,rev(2017:2031)),c(farma[,2],rev(farma[,3])),border=NA,col=rgb(0,0,1,.2)) abline(h=0,lty=2)

I am usually not a huge fan of those polynomial regression, but recently, I’ve seen that a lot in economic papers (like “if it’s not linear, add a squared version of the explanatory variable”, which is a rather odd strategy, I’ll publish some posts on that issue this year).

Here again, it might be more clever to consider a logarithmic transformation of the ratio, to insure that the ratio remains positive

fit=lm(log(ratio)~year,data=db) s=summary(fit)$sigma plot(db$year,db$ratio,type="l",xlim=c(1990,2030),ylim=c(-.2,.7),xlab="year",ylab="ratio") abline(v=2016,lty=2,col="grey") ndb=data.frame(year=2017:2031) pf=predict(fit,newdata=ndb) farma=cbind(exp(pf+s^2/2),exp(pf-1.96*s),exp(pf+1.96*s)) lines(2017:2031,farma[,2],col="blue") lines(2017:2031,farma[,3],col="blue") lines(1990:2031,c(exp(predict(fit)+s^2/2),farma[,1]),col="red") polygon(c(2017:2031,rev(2017:2031)),c(farma[,2],rev(farma[,3])),border=NA,col=rgb(0,0,1,.2)) abline(h=0,lty=2) Observe that future trend is mainly driven by the three latest observations, that were rather low (compared with older observations). What if we remove them? dbna=db db$ratio[25:27]=NA fit=lm(ratio~1,data=dbna) s=summary(fit)$sigma plot(db$year,db$ratio,type="l",xlim=c(1990,2030),ylim=c(-.2,.7),xlab="year",ylab="ratio") abline(v=2016-3,lty=2,col="grey") ndb=data.frame(year=2014:2031) pf=predict(fit,newdata=ndb) farma=cbind(pf,pf-1.96*s,pf+1.96*s) lines(2014:2031,farma[,2],col="blue") lines(2014:2031,farma[,3],col="blue") lines(1990:2031,c(predict(fit)[1:24],farma[,1]),col="red") polygon(c(2014:2031,rev(2014:2031)),c(farma[,2],rev(farma[,3])),border=NA,col=rgb(0,0,1,.2)) abline(h=0,lty=2) More funny, if we consider a quadratic regression, we obtain an increasing trend for the future fit=lm(ratio~poly(year,2),data=dbna) s=summary(fit)$sigma plot(db$year,db$ratio,type="l",xlim=c(1990,2030),ylim=c(-.2,.7),xlab="year",ylab="ratio") abline(v=2016-3,lty=2,col="grey") ndb=data.frame(year=2014:2031) pf=predict(fit,newdata=ndb) farma=cbind(pf,pf-1.96*s,pf+1.96*s) lines(2014:2031,farma[,2],col="blue") lines(2014:2031,farma[,3],col="blue") lines(1990:2031,c(predict(fit)[1:24],farma[,1]),col="red") polygon(c(2014:2031,rev(2014:2031)),c(farma[,2],rev(farma[,3])),border=NA,col=rgb(0,0,1,.2)) abline(h=0,lty=2)

As we can see, it is rather difficult to get relevant prediction for the future, based on 25 observations…. If anyone has a suggestion, comments are open…