# Looking at Measles Data in Project Tycho

This article was first published on **Wiekvoet**, and kindly contributed to R-bloggers.


Project Tycho includes data from all weekly notifiable disease reports for the United States dating back to 1888. These data are freely available to anybody interested. I wanted to play around with the data a bit, so I registered.

### Measles

Measles is part of the level 2 data. *These are standardized data for immediate use and include a large number of diseases, locations, and years. These data are not complete because standardization is ongoing.* The data are retrieved as measles case counts, and while I know I should convert them to cases per 100 000 inhabitants, I have not done so here.
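For reference, a minimal sketch of what that conversion could look like. The population table `pop` and all numbers in it are invented for illustration; real figures would have to come from census estimates, which are not part of the download used in this post.

```r
# Toy counts in long format (values are invented for illustration)
cases <- data.frame(
  YEAR  = c(1950, 1950, 1951),
  State = c("ALABAMA", "ALASKA", "ALABAMA"),
  Cases = c(120, 3, NA)
)

# Hypothetical population table, one row per year and state
pop <- data.frame(
  YEAR       = c(1950, 1950, 1951),
  State      = c("ALABAMA", "ALASKA", "ALABAMA"),
  Population = c(3000000, 150000, 3050000)
)

# Join on year and state, then scale counts to a rate per 100 000
rates <- merge(cases, pop, by = c("YEAR", "State"))
rates$Per100k <- rates$Cases / rates$Population * 1e5
print(rates)
```

Missing counts stay missing in the rate, which keeps unreported weeks distinguishable from weeks without cases.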

The data come in wide format, with one column per state, so the first step is conversion to long format. The Northern Mariana Islands variable was read in as logical, so I removed it. In addition, data from before 1928 seemed very sparse, so those are removed too.

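    # Read the Tycho export: '-' marks a missing report, and the two
    # lines above the column header are skipped
    r1 <- read.csv('MEASLES_Cases_1909-1982_20140323140631.csv',
                   na.strings = '-',
                   skip = 2)
    # Drop the column that was read in as logical
    r1 <- subset(r1, select = -NORTHERN.MARIANA.ISLANDS)
    # Wide to long: one row per year, week and state
    r2 <- reshape(r1,
                  varying = names(r1)[-c(1, 2)],
                  v.names = 'Cases',
                  idvar = c('YEAR', 'WEEK'),
                  times = names(r1)[-c(1, 2)],
                  timevar = 'State',
                  direction = 'long')
    r2$State <- factor(r2$State)
    # Keep 1928 onwards
    r3 <- r2[r2$YEAR > 1927, ]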
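To see what the reshape step does without the real file, here is a small sketch on a toy table. The data frame `wide` and its values are made up; only the column layout (YEAR, WEEK, then one column per state) mirrors the export described above.

```r
# Toy wide table mimicking the layout: one column per state
wide <- data.frame(
  YEAR    = c(1928, 1928),
  WEEK    = c(1, 2),
  ALABAMA = c(10, NA),
  ALASKA  = c(0, 5)
)
state_cols <- names(wide)[-c(1, 2)]

# Same reshape() call as used on the real data
long <- reshape(wide,
                varying   = state_cols,
                v.names   = "Cases",
                idvar     = c("YEAR", "WEEK"),
                times     = state_cols,
                timevar   = "State",
                direction = "long")
rownames(long) <- NULL
print(long)
```

Each state column becomes a block of rows, with the former column name stored in State and the count in Cases.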

### Plotting

The first plot is total cases by year. It shows the drop in cases after vaccination started: licensed vaccines to prevent the disease became available in 1963, and an improved measles vaccine became available in 1968.

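    library(ggplot2)

    # Total registered cases per year; missing reports are ignored
    qplot(x = Year,
          y = x,
          data = with(r3, aggregate(Cases, list(Year = YEAR),
                                    function(x) sum(x, na.rm = TRUE))),
          ylab = 'Sum registered Measles Cases') +
      theme(text = element_text(family = 'Arial'))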

#### Occurrence within a year by week

Winter and spring seem to be the periods in which most cases occur. The curve is quite smooth, with a few small fluctuations. The transition between week 52 and week 1 is a bit steep, which may be because I removed week 53 (only present in some of the years).

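    # qplot() comes from ggplot2, loaded with library(ggplot2)
    # Total cases per week of the year, pre-vaccine years only, week 53 dropped
    qplot(x = Week,
          y = x,
          data = with(r3[r3$WEEK != 53 & r3$YEAR < 1963, ],
                      aggregate(Cases, list(Week = WEEK),
                                function(x) sum(x, na.rm = TRUE))),
          ylab = 'Sum Registered Measles Cases',
          main = 'Measles 1928-1962') +
      theme(text = element_text(family = 'Arial'))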

#### A more detailed look

Trying to understand why the week plot was not smooth, I made that plot with year facets. This revealed an interesting number of zeros, which are an artefact of the processing method (remember, sum(c(NA,NA),na.rm=TRUE) is 0). I do not know whether the data distinguish between 0 and '-'. There are 872 occurrences of 0, which suggests 0 is used. On the other hand, weeks 6 and 9 of 1980 in Arkansas each have one case, while the other weeks from 1 to 22 are '-', which suggests 0 is not used. My feeling for that period is that registration became lax after measles was under control, and that getting reliable data from the underlying documentation is a laborious task.
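One way around the artefact is an aggregation function that only returns a number when at least one report exists. The sketch below is not what was used for the plots above; the helper `sum_or_na` and the toy data are introduced here purely for illustration.

```r
# Sum that returns NA when every value is missing, instead of 0
sum_or_na <- function(x) {
  if (all(is.na(x))) NA_real_ else sum(x, na.rm = TRUE)
}

# Toy data: week 1 has no reports at all, week 2 has a reported zero
toy <- data.frame(
  WEEK  = c(1, 1, 2, 2, 3, 3),
  Cases = c(NA, NA, 0, NA, 4, 1)
)

# Same aggregate() pattern as in the plots, with the NA-aware sum
weekly <- with(toy, aggregate(Cases, list(Week = WEEK), sum_or_na))
print(weekly)
```

With this function a week without any report stays NA, so a plotted zero can only come from a zero that was actually registered.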

### References

Willem G. van Panhuis, John Grefenstette, Su Yon Jung, Nian Shong Chok, Anne Cross, Heather Eng, Bruce Y Lee, Vladimir Zadorozhny, Shawn Brown, Derek Cummings, Donald S. Burke. Contagious Diseases in the United States from 1888 to the present. *NEJM* 2013; 369(22): 2152-2158.
