Looking at Measles Data in Project Tycho

March 30, 2014

(This article was first published on Wiekvoet, and kindly contributed to R-bloggers)

Project Tycho includes data from all weekly notifiable disease reports for the United States dating back to 1888. These data are freely available to anybody interested. I wanted to play around with the data a bit, so I registered.


Measles are in level 2 data. These are standardized data for immediate use and include a large number of diseases, locations, and years. These data are not complete because standardization is ongoing. The data are retrieved as measles cases and while I know I should convert to cases per 100 000, I have not done so here.
The data come in wide format, so the first step is conversion to long format. The Northern Mariana Islands variable was created as logical, so I removed it. In addition, data from before 1927 seemed very sparse, so those are removed too.
r1 <- read.csv(‘MEASLES_Cases_1909-1982_20140323140631.csv’,
r2 <- reshape(r1,
    idvar=c(‘YEAR’ , ‘WEEK’),
r3 <- r2[r2$YEAR>1927,]


The first plot is total cases by year. It shows the drop in cases from vaccine (Licensed vaccines to prevent the disease became available in 1963. An improved measles vaccine became available in 1968.)

        data=with(r3,aggregate(Cases,list(Year=YEAR),function(x) sum(x,na.rm=TRUE))),

        ylab=’Sum registered Measles Cases’)+

Occurrence within a year by week

Winter and spring seems to be the periods in which most cases occur. The curve seems quite smooth, with a few small fluctuations. The transfer between week 52 and week 1 is a bit steep, which may be because I removed week 53 (only present in part of the years).

        data=with(r3[r3$WEEK!=53 & r3$YEAR<1963,],
                function(x) sum(x,na.rm=TRUE))),
        ylab=’Sum Registered Measles Cases’,
        main=’Measles 1928-1962′)+

A more detailed look

Trying to understand why the week plot was not smooth, I made that plot with year facets. This revealed an interesting number of zeros, which are an artefact of processing method (remember, sum(c(NA,NA),na.rm=TRUE)=0). I do not know if the data distinguishes between 0 and ‘-‘. There are 872 occurrences of 0 which suggests 0 is used. On the other hand, week 6 and 9 in 1980 in Arkansas each have one case, the other weeks from 1 to 22 are ‘-‘, which suggests 0 is not used. My feeling for that time part is that registration became lax after measles was under control and getting reliable data from the underlying documentation is a laborious task. 


Willem G. van Panhuis, John Grefenstette, Su Yon Jung, Nian Shong Chok, Anne Cross, Heather Eng, Bruce Y Lee, Vladimir Zadorozhny, Shawn Brown, Derek Cummings, Donald S. Burke. Contagious Diseases in the United States from 1888 to the present. NEJM 2013; 369(22): 2152-2158.

To leave a comment for the author, please follow the link and comment on their blog: Wiekvoet.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.


Mango solutions

plotly webpage

dominolab webpage

Zero Inflated Models and Generalized Linear Mixed Models with R

Quantide: statistical consulting and training





CRC R books series

Six Sigma Online Training

Contact us if you wish to help support R-bloggers, and place your banner here.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)