the national cancer institute blessedly provides a bouquet of free statistical software to import and analyze this microdata. obviously, my code won't compete with the legions of epidemiological software programmers at the largest of the nih institutes. but plenty of other r users have written packages to work with this stuff, so maybe, just maybe, someone will find value in my automated importation syntax. plus, the seer microdata include a sas import script - which triggers my fight or fight harder reflex. list of things i hate, descending sort order: mosquitoes, cancer, then sas a very distant third. but still.
aside from easing the importation of this data into the r language, i suppose i have contributed one tangible improvement to the seer-analyst community: these download and import scripts will put all eight million records into wickedly-fast monetdb. so long as you can perform your analysis using sql, you can perform your analysis (on all eight million records) in basically one second. haa-cha! i've said it before, i'll say it again: the import takes forrrrrever (leave it overnight). but once it's loaded, it'll outrun lightning.
this new github repository contains four scripts:
- after setting your username and password, download and unzip the seer text data file to some working directory
import all tables into rda.R
- grep through the unzipped seer text folders to find individual- and population-level tables
- import each individual-level table into an r data.frame with sascii, then save to disk for fast loading later.
- import each population-level table into an r data.frame with sascii, then save to disk for fast loading later.
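the find-then-save pattern in the steps above can be sketched in base r. this is only a rough illustration: the folder names, file names, and the toy data.frame are all made up for the example (the real seer download is laid out differently), and the actual fixed-width import with sascii is left out entirely.

```r
# build a throwaway folder that mimics an unzipped download.
# every folder and file name below is hypothetical, not the real seer layout.
seer_dir <- file.path( tempdir() , "seer_sketch" )
for ( sub in c( "incidence" , "populations" ) ) {
	dir.create( file.path( seer_dir , sub ) , recursive = TRUE , showWarnings = FALSE )
}
invisible( file.create( file.path( seer_dir , "incidence" , c( "breast.txt" , "colrect.txt" , "import.sas" ) ) ) )
invisible( file.create( file.path( seer_dir , "populations" , "county.txt" ) ) )

# list every file relative to the root, then grep for the text files in each folder
all_files <- list.files( seer_dir , recursive = TRUE )
ind_files <- grep( "^incidence/.*\\.txt$" , all_files , value = TRUE , ignore.case = TRUE )
pop_files <- grep( "^populations/.*\\.txt$" , all_files , value = TRUE , ignore.case = TRUE )

# stand-in for one imported table (the real scripts would build this with sascii)
x <- data.frame( id = 1:3 , site = c( "a" , "b" , "c" ) )

# save to disk once, then reload in a later session without re-parsing the text file
rda_path <- file.path( seer_dir , "breast.rda" )
save( x , file = rda_path )
rm( x )
load( rda_path )
nrow( x )
```

saving each table as its own .rda file means the slow text parsing only has to happen once per table.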
import individual-level tables into monetdb.R
- grep through the unzipped seer text folders to find individual-level tables
- initiate a monetdb server on the local disk, then import each individual-level table with read.sascii.monetdb
- stack all of the imported individual-level tables into one, thereby replicating the total case count
- create a well-documented block of code to re-initiate the monetdb server in the future
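the stacking step boils down to one big `union all`. here's a minimal sketch of how that sql statement could be assembled in r, assuming hypothetical table names (`breast`, `colrect`, `digothr`) and a hypothetical target table `all_cases`. it only builds the string; it assumes every table shares the same columns, and the trailing `with data` is my understanding of monetdb's create-table-as syntax, so double-check it against your monetdb version.

```r
# hypothetical names of individual-level tables already sitting in the database
ind_tables <- c( "breast" , "colrect" , "digothr" )

# one select per table (this assumes all tables have identical columns)
selects <- paste( "select * from" , ind_tables )

# glue the selects together with union all, which keeps every row (no de-duplication)
stack_sql <-
	paste(
		"create table all_cases as" ,
		paste( selects , collapse = " union all " ) ,
		"with data"
	)

# look at the statement before sending it to the database
stack_sql
```

`union all` rather than plain `union` matters here: the stacked table's row count should equal the sum of the individual tables' row counts, which is what lets it replicate the total case count.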
replicate case counts table.R
- connect to the seer microdata stored in monetdb
- replicate the count statistics shown on the nci's seer data page with sql
- shut the whole thing down
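the connect, count, and disconnect cycle above looks roughly like this with the DBI package. loud assumptions: an in-memory sqlite database (via the RSQLite package, which needs to be installed) stands in for the monetdb server, and the `all_cases` table with its `site` and `yr_dx` columns is toy data invented for the example, not real seer records or real seer variable names.

```r
library(DBI)

# assumption: sqlite in memory stands in for the monetdb server
db <- dbConnect( RSQLite::SQLite() , ":memory:" )

# toy individual-level records with hypothetical column names
cases <-
	data.frame(
		site = c( "breast" , "breast" , "colrect" ) ,
		yr_dx = c( 2001L , 2002L , 2001L ) ,
		stringsAsFactors = FALSE
	)
dbWriteTable( db , "all_cases" , cases )

# total number of cases across the whole stacked table
total <- dbGetQuery( db , "select count(*) as n from all_cases" )

# case counts broken out by cancer site
by_site <- dbGetQuery( db , "select site , count(*) as n from all_cases group by site order by site" )

# shut the whole thing down
dbDisconnect( db )
```

since the counting happens inside the database, only the tiny result tables ever come back into r.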
click here to view these four scripts
for more detail about surveillance, epidemiology, and end results (seer) microdata, visit:
- the seer datasets and software tab and brochure, both good starting points
- seer recodes you'll need to implement in r if you want to match nci-created (free) software
seer is publicly available: you just gotta sign and e-mail in this form, then wait two business days for them to send you the login and password needed for the box that pops up when you click this download link.
confidential to sas, spss, stata, and sudaan users: it's black tie dinner night at the governor's mansion and you're still wearing a t-shirt. ready to change into your tuxedo? time to transition to r. :D