analyze the surveillance epidemiology and end results (seer) with r and monetdb

Posted on July 15, 2013 by Anthony Damico in R bloggers

[This article was first published on asdfree by anthony damico, and kindly contributed to R-bloggers].


the surveillance epidemiology and end results program is the aggregation of all cancer registry statistics in the united states. created by congressional decree, seer has captured a nationally-representative quarter of american cancer incidence since 1973. when acs, cdc, nci, and naaccr publish their collaborative annual report, they use seer. when the aacr predicts that america will have 18 million cancer survivors by 2022, they use seer too. you can use seer three.

the national cancer institute blessedly provides a bouquet of free statistical software to import and analyze this microdata. obviously, my code won’t compete with the legions of epidemiological software programmers at the largest of the nih institutes. but plenty of other r users have written packages to work with this stuff, so maybe, just maybe, someone will find value in my automated importation syntax. plus, the seer microdata include a sas import script – which triggers my fight or fight harder reflex. list of things i hate, descending sort order: mosquitoes, cancer, then sas a very distant third. but still.

aside from easing the importation of this data into the r language, i suppose i have contributed one tangible improvement to the seer-analyst community: these download and import scripts will put all eight million records into wickedly-fast monetdb. so long as you can perform your analysis using sql, you can perform your analysis (on all eight million records) in basically one second. haa-cha! i’ve said it before, i’ll say it again: the import takes forrrrrever (leave it overnight). but once it’s loaded, it’ll outrun lightning.

this new github repository contains four scripts:

download.R

after setting your username and password, download and unzip the seer text data file to some working directory

import all tables into rda.R

grep through the unzipped seer text folders to find individual- and population-level tables
import each individual-level table into an r data.frame with sascii, then save to disk for fast loading later.
import each population-level table into an r data.frame with sascii, then save to disk for fast loading later.
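
the import steps above can be sketched roughly as follows. this is a minimal illustration, not the repository's code: it uses base r's `read.fwf` as a stand-in for sascii, and the toy files, column widths, and column names (`patient_id`, `sex`, `age_dx`, `year_dx`) are all hypothetical. the real scripts derive the layout from the sas import script that ships with the seer microdata.

```r
# toy stand-in for the unzipped seer folder: two tiny fixed-width text files
# (file names and record layout are hypothetical, not the real seer layout)
seer_dir <- file.path( tempdir() , "seer_sketch" )
dir.create( seer_dir , showWarnings = FALSE )
writeLines( c( "0000000120631975" , "0000000210712001" ) , file.path( seer_dir , "BREAST.TXT" ) )
writeLines( "0000000320581999" , file.path( seer_dir , "COLRECT.TXT" ) )

# hypothetical layout: an eight-character id, then sex, age at diagnosis, year of diagnosis
widths <- c( 8 , 1 , 3 , 4 )
col_names <- c( "patient_id" , "sex" , "age_dx" , "year_dx" )
col_classes <- c( "character" , "integer" , "integer" , "integer" )

# find every text table in the folder
txt_files <- list.files( seer_dir , pattern = "\\.TXT$" , full.names = TRUE )

for ( fn in txt_files ){

    # read the fixed-width file into a data.frame
    x <- read.fwf( fn , widths = widths , col.names = col_names , colClasses = col_classes )

    # save it next to the text file for fast loading later
    save( x , file = sub( "\\.TXT$" , ".rda" , fn ) )
}
```

after this runs, each `.rda` file can be pulled back into memory with `load()`, which is much faster than re-parsing the text file.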

import individual-level tables into monetdb.R

grep through the unzipped seer text folders to find individual-level tables
initiate a monetdb server on the local disk, then import each individual-level table with read.sascii.monetdb
stack all of the imported individual-level tables into one, thereby replicating the total case count
create a well-documented block of code to re-initiate the monetdb server in the future
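
the stacking step can be illustrated with a `union all`. this sketch is an assumption-heavy stand-in: it uses an in-memory sqlite database through the DBI and RSQLite packages instead of a monetdb server (so it runs without any server setup), and the table names and columns are hypothetical toy data rather than the real seer tables.

```r
library(DBI)

# in-memory sqlite database, standing in for the monetdb server
db <- dbConnect( RSQLite::SQLite() , ":memory:" )

# two toy individual-level tables (hypothetical names and columns)
dbWriteTable( db , "breast" , data.frame( patient_id = c( "00000001" , "00000002" ) , year_dx = c( 1975L , 2001L ) ) )
dbWriteTable( db , "colrect" , data.frame( patient_id = "00000003" , year_dx = 1999L ) )

tablenames <- c( "breast" , "colrect" )

# one select per table, tagging each record with the table it came from
selects <- paste0( "select '" , tablenames , "' as tablename , * from " , tablenames )

# stack every table into a single table named x
dbExecute( db , paste( "create table x as" , paste( selects , collapse = " union all " ) ) )

# the stacked table should hold every record from every source table
stacked_n <- dbGetQuery( db , "select count(*) as n from x" )$n

dbDisconnect( db )
```

keeping the source table name as a column means the per-table counts can still be recovered from the stacked table later.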

replicate case counts table.R

connect to the seer microdata stored in monetdb
replicate the count statistics shown on the nci’s seer data page with sql
shut the whole thing down
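
as a rough sketch of what counting cases with sql looks like from r: the example below again assumes an in-memory sqlite database (DBI plus RSQLite) as a stand-in for monetdb, and a hypothetical stacked table `x` with a `tablename` column. the same `dbGetQuery` pattern should carry over to any DBI-compatible connection, but the actual table and column names in the repository may differ.

```r
library(DBI)

# in-memory sqlite database, standing in for the monetdb server
db <- dbConnect( RSQLite::SQLite() , ":memory:" )

# hypothetical stacked table: one record per case, tagged with its source table
dbWriteTable( db , "x" , data.frame(
    tablename = c( "breast" , "breast" , "colrect" ) ,
    year_dx = c( 1975L , 2001L , 1999L )
) )

# case counts for each source table
case_counts <- dbGetQuery( db , "select tablename , count(*) as cases from x group by tablename order by tablename" )

# total case count across all tables
total_cases <- dbGetQuery( db , "select count(*) as cases from x" )$cases

# shut the whole thing down
dbDisconnect( db )
```

because the database does the counting, only the small summary table comes back into r, which is why this approach stays fast even on millions of records.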

click here to view these four scripts

for more detail about surveillance epidemiology and end results microdata, visit:

the seer datasets and software tab and brochure, both good starting points
seer recodes you’ll need to implement in r if you want to match nci-created (free) software

notes:

seer is publicly-available, you just gotta sign and e-mail in this form, then wait two business days for them to send you the login and password needed for the box that pops up when you click this download link.

confidential to sas, spss, stata, and sudaan users: it’s black tie dinner night at the governor’s mansion and you’re still wearing a t-shirt. ready to change into your tuxedo? time to transition to r. 😀
