# Project Tycho, Correlation between states

[This article was first published on

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In this fourth post on Measles data I want to have a look at correlation between states. As described before, the data is from Project Tycho, which contains data from all weekly notifiable disease reports for the United States dating back to 1888. These data are freely available to anybody interested.**Wiekvoet**, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

### Data

I discovered an error in previous code which made 1960 to appear twice. Hence updated script.setwd(‘/home/kees/Documents/tycho/’)

r1 <- read.csv('MEASLES_Cases_1909-1982_20140323140631.csv',

na.strings=’-‘,

skip=2)

r2 <- reshape(r1,

varying=names(r1)[-c(1,2)],

v.names=’Cases’,

idvar=c(‘YEAR’ , ‘WEEK’),

times=names(r1)[-c(1,2)],

timevar=’STATE’,

direction=’long’)

r2$STATE=factor(r2$STATE)

####################3

years <- dir(pattern='+.txt')

years

pop1 <-

lapply(years,function(x) {

rl <- readLines(x)

print(x)

sp <- grep('^U.S.',rl)

st1 <- grep('^AL',rl)

st2 <- grep('^WY',rl)

rl1 <- rl[c language="(sp[1"][/c]-2,st1[1]:st2[1])]

rl2 <- rl[c language="(sp[2"][/c]-2,st1[2]:st2[2])]

read1 <- function(rlx) {

rlx[1] <- paste('abb',rlx[1])

rlx <- gsub(',','',rlx,fixed=TRUE)

rt <- read.table(textConnection(rlx),header=TRUE)

rt[,grep(‘census’,names(rt),invert=TRUE)]

}

rr <- merge(read1(rl1),read1(rl2))

ll <- reshape(rr,

list(names(rr)[-1]),

v.names=’pop’,

timevar=’YEAR’,

idvar=’abb’,

times=as.integer(gsub(‘X’,”,names(rr)[-1])),

direction=’long’)

})

pop <- do.call(rbind,pop1)

pop <- pop[grep('19601',rownames(pop),invert=TRUE),]

states <- rbind(

data.frame(

abb=datasets::state.abb,

State=datasets::state.name),

data.frame(abb=’DC’,

State=’District of Columbia’))

states$STATE=gsub(‘ ‘,’.’,toupper(states$State))

r3 <- merge(r2,states)

r4 <- merge(r3,pop)

r4$incidence <- r4$Cases/r4$pop

r5 <- subset(r4,r4$YEAR>1927,-STATE)

r6 <- r5[complete.cases(r5),]

#### New variable

In previous posts it became clear there is in general a yearly cycle. However, the minimum in this cycle is in summer. This means for yearly summary it might be best not to use calender years, but rather something which breaks during summer. My choice is week 37.with(r6[r6$WEEK>30 & r6$WEEK<45,],

aggregate(incidence,by=list(WEEK=WEEK),mean))

WEEK x

1 31 0.016757440

2 32 0.013776182

3 33 0.011313391

4 34 0.008783259

5 35 0.007348603

6 36 0.006843930

7 37 0.006528467

8 38 0.007078171

9 39 0.008652546

10 40 0.016784205

11 41 0.013392375

12 42 0.016158805

13 43 0.018391632

14 44 0.021788221

r6$cycle <- r6$YEAR + (r6$WEEK>37)

### Plot

#### States over time

Since not all states have complete data, it was decided to use state-year combinations with at least 40 observations (weeks). As can be seen there is some correlation between states, especially in 1945. If anything, correlation gets weaker past 1955.library(ggplot2)ggplot(with(r6,aggregate(incidence,

list(cycle=cycle,

State=State),

function(x)

if(length(x)>40)

sum(x) else

NA)),

aes(cycle, x,group=State)) +

geom_line(size=.1) +

ylab(‘Incidence registered Measles Cases Per Year’) +

theme(text=element_text(family=’Arial’)) +

scale_y_log10()

#### Between states

I have seen too many examples of people rebuilding maps based on traveling times or distances. Now I want to do the same. Proper (euclidean) distance of the states would make the variables the year/week combinations, which gives all kind of scaling issues. What I did is to use correlation and transform that into something distance like. ftime is just a helper variable, so I am sure the reshape works correctly.r6$ftime <- interaction(r6$YEAR,r6$WEEK)

xm <- reshape(r6,

v.names=’incidence’,

idvar=’ftime’,

timevar=’State’,

direction=’wide’,

drop=c(‘abb’,’Cases’,’pop’))

xm2 <- xm[,grep('incidence',names(xm))]

cc <- cor(xm2,use='pairwise.complete.obs')

dimnames(cc) <- lapply(dimnames(cc),function(x) sub('incidence.','',x))

dd <- as.dist(1-cc/2)

The heatmap reveals the structure best.

heatmap(as.matrix(dd),dist=as.dist,symm=TRUE)

MDS is most nice to look at. I will leave comparisons to the US map to those who actually know all these state’s relative locations.

library(MASS)

mdsx <- isoMDS(dd)

par(mai=rep(0,4))

plot(mdsx$points,

type = “n”,

axes=FALSE,

xlim=c(-1,1),

ylim=c(-1,1.1))

text(mdsx$points, labels = dimnames(cc)[[1]])

### References

Willem G. van Panhuis, John Grefenstette, Su Yon Jung, Nian Shong Chok, Anne Cross, Heather Eng, Bruce Y Lee, Vladimir Zadorozhny, Shawn Brown, Derek Cummings, Donald S. Burke. Contagious Diseases in the United States from 1888 to the present.*NEJM*2013; 369(22): 2152-2158.

To

**leave a comment**for the author, please follow the link and comment on their blog:**Wiekvoet**.R-bloggers.com offers

**daily e-mail updates**about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.