Looking at Measles Data in Project Tycho

[This article was first published on Wiekvoet, and kindly contributed to R-bloggers].

Project Tycho includes data from all weekly notifiable disease reports for the United States dating back to 1888. These data are freely available to anybody interested. I wanted to play around with the data a bit, so I registered.

Measles

The measles data are part of the level 2 data. These are standardized data for immediate use and include a large number of diseases, locations, and years. They are not complete, because standardization is ongoing. The data are retrieved as case counts, and while I know I should convert them to cases per 100 000, I have not done so here.
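For reference, the conversion to cases per 100 000 is just cases divided by population, times 100 000. The sketch below shows it on made-up numbers. The `pop` table and its `Population` column are hypothetical; population figures would have to come from another source, such as census estimates.

```r
# Toy weekly case counts for two states (values are made up)
cases <- data.frame(
    YEAR  = c(1950, 1950, 1950, 1950),
    State = c('A', 'A', 'B', 'B'),
    Cases = c(120, 80, 30, NA)
)
# Hypothetical population table; not part of the case file used in this post
pop <- data.frame(
    YEAR  = c(1950, 1950),
    State = c('A', 'B'),
    Population = c(2000000, 500000)
)
# Yearly totals per state; the formula interface drops rows with missing Cases
yearly <- aggregate(Cases ~ YEAR + State, data = cases, FUN = sum)
# Attach the population and scale to cases per 100 000
rates <- merge(yearly, pop, by = c('YEAR', 'State'))
rates$Per100k <- rates$Cases / rates$Population * 1e5
rates
```

With real data the population would differ per year, so the merge has to be on both year and state, as it is here.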
The data come in wide format, so the first step is conversion to long format. The Northern Mariana Islands variable was created as logical, so I removed it. In addition, data from before 1928 seemed very sparse, so those were removed too.
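# '-' marks a missing value; the first two lines of the file are skipped
r1 <- read.csv('MEASLES_Cases_1909-1982_20140323140631.csv',
    na.strings='-',
    skip=2)
# drop the column that was read as logical
r1 <- subset(r1,,-NORTHERN.MARIANA.ISLANDS)
# wide to long: one row per year, week and state
r2 <- reshape(r1,
    varying=names(r1)[-c(1,2)],
    v.names='Cases',
    idvar=c('YEAR','WEEK'),
    times=names(r1)[-c(1,2)],
    timevar='State',
    direction='long')
r2$State <- factor(r2$State)
# keep 1928 and later
r3 <- r2[r2$YEAR>1927,]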
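Why a column ends up logical can be seen on a tiny example. When every entry in a column equals the `na.strings` marker, `read.csv` has nothing to infer a type from and falls back to logical. The sketch below uses an invented three-column file (the column name `EMPTY` is made up) and then applies the same reshape call as above.

```r
# Toy CSV: the column EMPTY holds only the missing-value marker '-'
txt <- 'YEAR,WEEK,ALABAMA,ARIZONA,EMPTY
1928,1,5,2,-
1928,2,-,7,-'
toy <- read.csv(text = txt, na.strings = '-')
# Column classes: the all-missing column is read as logical
classes <- sapply(toy, class)
classes
# Drop the empty column, then go from wide to long
toy <- subset(toy, select = -EMPTY)
long <- reshape(toy,
    varying = names(toy)[-c(1, 2)],
    v.names = 'Cases',
    idvar = c('YEAR', 'WEEK'),
    times = names(toy)[-c(1, 2)],
    timevar = 'State',
    direction = 'long')
long
```

The long result has one row per year, week and state, with the missing week for Alabama kept as `NA`.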

Plotting

The first plot is total cases by year. It shows the drop in cases after vaccination started: licensed vaccines to prevent the disease became available in 1963, and an improved measles vaccine became available in 1968.
library(ggplot2)
qplot(x=Year,
        y=x,
        data=with(r3,aggregate(Cases,list(Year=YEAR),function(x) sum(x,na.rm=TRUE))),
        ylab='Sum registered Measles Cases')+
    theme(text=element_text(family='Arial'))


Occurrence within a year by week

Winter and spring seem to be the periods in which most cases occur. The curve seems quite smooth, with a few small fluctuations. The transition between week 52 and week 1 is a bit steep, which may be because I removed week 53 (only present in some of the years).

qplot(x=Week,
        y=x,
        data=with(r3[r3$WEEK!=53 & r3$YEAR<1963,],
            aggregate(Cases,list(Week=WEEK),
                function(x) sum(x,na.rm=TRUE))),
        ylab='Sum Registered Measles Cases',
        main='Measles 1928-1962')+
    theme(text=element_text(family='Arial'))

A more detailed look

Trying to understand why the week plot was not smooth, I made that plot with year facets. This revealed an interesting number of zeros, which are an artefact of the processing method (remember, sum(c(NA,NA),na.rm=TRUE)=0). I do not know if the data distinguish between 0 and '-'. There are 872 occurrences of 0, which suggests 0 is used. On the other hand, weeks 6 and 9 of 1980 in Arkansas each have one case, while the other weeks from 1 to 22 are '-', which suggests 0 is not used. My feeling for that period is that registration became lax after measles was under control, and that getting reliable data from the underlying documentation is a laborious task.
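One way to keep these artificial zeros out of the aggregated data is to return NA when every value in a group is missing. The sketch below is an illustration on toy data, not the processing used for the plots above. The helper name `sum_or_na` is made up.

```r
# Sum that returns NA when every value is missing, instead of 0
sum_or_na <- function(x) {
    if (all(is.na(x))) NA_real_ else sum(x, na.rm = TRUE)
}
# Toy data: week 1 has no reports at all, week 2 has one report
toy <- data.frame(WEEK = c(1, 1, 2, 2), Cases = c(NA, NA, 4, NA))
# Same aggregate call as in the plots, with the NA-aware sum
agg <- with(toy, aggregate(Cases, list(Week = WEEK), sum_or_na))
agg
```

Here week 1 stays missing and week 2 sums to 4, so a plot would show a gap instead of a spurious zero. A reported 0 would still be summed as 0.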

References


Willem G. van Panhuis, John Grefenstette, Su Yon Jung, Nian Shong Chok, Anne Cross, Heather Eng, Bruce Y Lee, Vladimir Zadorozhny, Shawn Brown, Derek Cummings, Donald S. Burke. Contagious Diseases in the United States from 1888 to the present. NEJM 2013; 369(22): 2152-2158.
