a speed test of three sql queries on sixty-seven million records using my personal computer:
# calculate the sum, mean, median, and standard deviation of a single variable
system.time( dbGetQuery( db , 'select sum( car_hcpcs_pmt_amt ), avg( car_hcpcs_pmt_amt ), median( car_hcpcs_pmt_amt ), stddev( car_hcpcs_pmt_amt ), count(*) from carrier08' ) )
user system elapsed
0.00 0.03 25.96
# calculate the same statistics, broken down by six age and two gender categories
system.time( dbGetQuery( db , 'select bene_sex_ident_cd, bene_age_cat_cd, sum( car_hcpcs_pmt_amt ), avg( car_hcpcs_pmt_amt ), median( car_hcpcs_pmt_amt ), stddev( car_hcpcs_pmt_amt ), count(*) from carrier08 group by bene_sex_ident_cd, bene_age_cat_cd' ) )
user system elapsed
0.00 0.02 121.56
# calculate the same statistics, broken down by six age, two gender, and 924 icd-9 diagnosis code categories
system.time( dbGetQuery( db , 'select bene_sex_ident_cd, bene_age_cat_cd, car_line_icd9_dgns_cd, sum( car_hcpcs_pmt_amt ), avg( car_hcpcs_pmt_amt ), median( car_hcpcs_pmt_amt ), stddev( car_hcpcs_pmt_amt ), count(*) from carrier08 group by bene_sex_ident_cd, bene_age_cat_cd, car_line_icd9_dgns_cd' ) )
user system elapsed
0.30 0.03 125.16
if you’re not using computer hardware built in 1966, you shouldn’t use software written for that era, either.
like so many other legends, our story begins at the 2007 r statistical programming conference in iowa. dr. thomas lumley presented his idea for big survey data to his contemporaries, who – in predictable contemporary form – failed to acknowledge its genius. over the next half-decade, only ill-advised attempts were made at analyzing the big survey data beast.
for work (invention’s mama), i needed to access the fifteen-million record, five-year american community survey files, but since database-backed survey objects read all replicate weights into ram, my sixteen gigabyte desktop hissed, popped, crapped out. so i e-mailed dr. lumley and asked for ideas. next thing i knew, he had developed:
- an r package to connect to the column-oriented, ultra-fast monetdb
- a sql-driven branch of his broader survey software.
turns out, monetdb is lightning fast on any big data, not just surveys. no reason for demographers to hog all the fun.
for more detail about monetdb, visit:
- the monetdb homepage
- the monetdb sql reference guide
- the monetdb funders (scroll down)
- one, two bugs with the windows version of monetdb that i found and they promptly fixed. heroes.
- blog post outlining why column-store databases are faster
there’s a price for such fast data access. importing a table into monetdb takes a while, so prepare to let your initial import code run overnight. it’s a good deal: leaving your computer on for one night beats running hour-long commands for every new analysis.
the RMonetDB and sqlsurvey packages are experimental. the more you use them, the sooner they won’t be. if you hit something you don’t understand (especially a line of code that works without error in the survey package), read the documentation carefully before contacting the author. sqlsurvey commands do not uniformly behave like survey commands.
remember, all scripts on this archive work on my 2009-era windows seven laptop (with four gigabytes of ram). by default, r reads objects into memory. when a data set is too big, the analysis scripts presented on this website work around memory limitations by connecting to either a monetdb (speedy) or sqlite (easy-to-use) database.
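here's a minimal sketch of that work-around, assuming the DBI and RSQLite packages are installed. the `toy` table and its columns are made up for illustration. the point is only that the database does the aggregation on disk and hands back a few summary rows, so the full table never has to fit in r's memory.

```r
# assumption: DBI and RSQLite are installed ( install.packages( c( "DBI" , "RSQLite" ) ) )
library(DBI)

# store the sqlite database in a temporary file on disk, not in ram
db_file <- tempfile( fileext = ".sqlite" )
db <- dbConnect( RSQLite::SQLite() , db_file )

# hypothetical toy table standing in for a table too big for memory
toy <- data.frame(
    sex = c( 1 , 1 , 2 , 2 , 2 ) ,
    pmt_amt = c( 10 , 20 , 30 , 40 , 50 )
)
dbWriteTable( db , "toy" , toy )

# the sql engine computes the statistics;
# only the two summary rows get read into r
result <- dbGetQuery(
    db ,
    "select sex, sum( pmt_amt ) as total, avg( pmt_amt ) as mean, count(*) as n
     from toy group by sex order by sex"
)
print( result )

# clean up
dbDisconnect( db )
unlink( db_file )
```

the same dbGetQuery() pattern is what the timing commands at the top of this post use, just pointed at a monetdb connection instead of sqlite.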
the folks at monetdb have begun work on a direct monetdb-to-r connector. until that’s complete, dr. lumley’s java-driven connector (the RMonetDB package) works just fine.
many government data sets are only available as fixed-width (flat) files accompanied by a sas import script, and the big data that necessitates RMonetDB is no exception. i’ve written a variant of the read.SAScii() function to import ascii files directly into a monet database all in one step. you may notice it in the code for some of the large surveys.
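to give a rough feel for the idea (this is not the read.SAScii() variant itself, just a sketch of the general approach), the code below reads a fixed-width file a few lines at a time and appends each chunk to a database table. everything here is an assumption for illustration: the three-column layout, the tiny example file, the chunk size, and the use of an in-memory sqlite database (via DBI and RSQLite) as a stand-in for monetdb. in practice, the column names and widths would come from the sas import script.

```r
# assumption: DBI and RSQLite are installed; sqlite stands in for monetdb here
library(DBI)

# hypothetical layout: normally parsed out of the sas import script
col_names <- c( "id" , "age" , "amt" )
col_widths <- c( 3 , 2 , 5 )
starts <- cumsum( c( 1 , head( col_widths , -1 ) ) )
ends <- cumsum( col_widths )

# write a tiny made-up flat file to import
fwf_file <- tempfile()
writeLines(
    c( "0012500010" , "0024000250" , "0033101000" , "0046700075" , "0055200500" ) ,
    fwf_file
)

db <- dbConnect( RSQLite::SQLite() , ":memory:" )

# read a few lines at a time so the whole file never sits in ram
chunk_size <- 2
con <- file( fwf_file , "r" )
repeat {
    lines <- readLines( con , n = chunk_size )
    if ( length( lines ) == 0 ) break

    # slice each line into columns by character position
    cols <- lapply(
        seq_along( col_widths ) ,
        function( i ) as.numeric( substring( lines , starts[ i ] , ends[ i ] ) )
    )
    chunk <- as.data.frame( setNames( cols , col_names ) )

    # append this chunk to the table (created on the first pass)
    dbWriteTable( db , "flat" , chunk , append = TRUE )
}
close( con )

# confirm everything landed in the database
n_rows <- dbGetQuery( db , "select count(*) as n from flat" )$n
total_amt <- dbGetQuery( db , "select sum( amt ) as total from flat" )$total

dbDisconnect( db )
unlink( fwf_file )
```

reading in chunks is the design choice that matters: memory use depends on the chunk size, not on the size of the flat file.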
confidential to sas, spss, stata, sudaan users: if you start analyzing big data with r + monetdb, you will no longer have to wait around long enough to take a coffee break after running each command. for that, i apologize. 😀