Site icon R-bloggers

Autocorrelation in project Tycho’s measles data

[This article was first published on Wiekvoet, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Project Tycho includes data from all weekly notifiable disease reports for the United States dating back to 1888. These data are freely available to anybody interested.I have looked at Ptoject Tycho’s measles data before, general look, incidence, some high incidence data and correlation between states. After a detour, it is now time to look at the autocorrelations in these data. These show positive correlation at three years.

Data

first steps, see correlation between states

Preparation for autocorrelation

As detailed before, the data contain weekly counts. Summer has less incidence, winter more. For this reason calender year is abolished and a shifted year used (named cycle), which runs from summer to summer. The data, real world as they are, contain plenty of missings. Arbitrarily chosen, if a cycle has data from at least 40 weeks, then I will us this particular year.
r7 <- aggregate(
    r6$incidence,
    list(cycle=r6$cycle,
        State=r6$State),
    function(x)
        if(length(x)>40)
            sum(x) else
            NA)

To calculate an autocorrelation, a set of consecutive years are needed. Again arbitrarily chosen, 15 years is the minimum. As a first attempt I just kicked out all missing years. Since that resulted in sufficient states with data, no attempts to refine were made. As additional item the number of data points is stored. All in a nice list.
la <- lapply(levels(r7$State),function(x) {
        datain <- r7[r7$State==x,]
        datain <- datain[complete.cases(datain),]
        if (nrow(datain)==(1+max(datain$cycle)-min(datain$cycle)) &
            nrow(datain)>15)
            aa <- acf(datain$x,plot=FALSE,lag.max=6) else aa <- TRUE
        list(aa=aa,nr=nrow(datain))
    })


To make a plot, the autocorrelations are pulled out and it is all stuck in a dataframe.
la2 <- la[which(sapply(la,function(x) class(x$aa))==’acf’)]
scfs <- as.data.frame(t(sapply(la2,function(x) as.numeric(x$aa$acf))))
scfs$state <- levels(r7$State)[sapply(la,function(x) class(x$aa)==’acf’)]
scfs$n <- sapply(la2,function(x) x$nr)

And a reshape prior to plotting.
tc <- reshape(scfs,
    idvar=c(‘state’,’n’),
    varying=list(names(scfs[1:7])),
    timevar=’lag’,
    times=0:6,
    direction=’long’,
    v.names=’acf’)

Plot

The plot shows a somewhat negative correlation after one year and a positive correlation after 3 years. The only state not to show the positive correlation after 3 years had much less data, hence for conclusion I ignore that result.
library(ggplot2)
ggplot(tc,
        aes(y=acf, x=lag,group=state,col=state) )+
    geom_line(aes(size = n)) +
    scale_size(range = c(0.5, 3))+
    theme(text=element_text(family=’Arial’))

References

Willem G. van Panhuis, John Grefenstette, Su Yon Jung, Nian Shong Chok, Anne Cross, Heather Eng, Bruce Y Lee, Vladimir Zadorozhny, Shawn Brown, Derek Cummings, Donald S. Burke. Contagious Diseases in the United States from 1888 to the present. NEJM 2013; 369(22): 2152-2158.

To leave a comment for the author, please follow the link and comment on their blog: Wiekvoet.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.