analyze the program for international student assessment (pisa) with r and monetdb

December 9, 2013

(This article was first published on asdfree by anthony damico, and kindly contributed to R-bloggers)

the authoritative source for evaluating educational achievement across nations, the program(me) for international student assessment ranks the math, science, and reading skills of 15-year-olds in more than sixty countries.  coordinated by the organisation for economic co-operation and development (oecd) and released every three years, this data set gives finland reason to gloat and anti-poverty advocates in the united states reason to fight.  participating countries must sample at least 5,000 teenagers, though some governments survey many more in order to provide education researchers with enough of a sample to perform within-country comparisons.  in the world of cross-border standardized testing, this is the big momma.

to understand what's possible with pisa, either visit the international products page or - if you only care about one country - start on the participating economies page and click through to the country-specific website (so here's america's).

instead of processing the pisa microdata line-by-line, the r language stoically attempts to read everything into memory at once.  to avoid the unpleasantness of a seized-up computer, dr. lumley wrote the entire sqlsurvey package (to deal with this monster), and i tweaked, pruned, and manicured that code to work on multiply-imputed big survey data.  if you're already familiar with the syntax of the survey package, be patient and read my sqlsurvey examples carefully when something doesn't behave as you expect it to - some sqlsurvey commands require a different structure (e.g. svyby gets called through svymean) and others might not exist anytime soon (like svyolr).  gimme some good news: sqlsurvey uses ultra-fast monetdb (click here for speed tests), so follow the monetdb installation instructions before running my code.  monetdb imports, writes, and recodes data slowly, but reads it hyper-fast.  a magnificent trade-off: data exploration typically requires you to think, send an analysis command, think some more, send another query, repeat.  importation scripts (especially the ones i've already written for you) can be left running overnight sans hand-holding.

pisa is a pita to analyze, because it's both multiply-imputed (like the survey of consumer finances) and big data (like the american community survey).  to help researchers deal with that complexity, the twentieth-century-dwelling statisticians at oecd wrote sas macros and spss functions as part of their analysis manual.  well guess what?  those languages are prohibitively expensive, so i've done gone and translated everything over to the r language, precisely reproducing their published results, then automating the download and importation into everybody's favorite monetdb.  say buh-bye to buying proprietary statistical software.  this new github repository contains four scripts:


download import and design.R
  • create the batch (.bat) file needed to initiate the monet database in the future
  • download, unzip, and import each file for every year and size specified by the user
  • split all `plausible value` variables into five, yeah, five tables to account for the uncertainty of imputed responses
  • create a well-documented block of code to re-initiate the monetdb server in the future
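the plausible-value split can be sketched in miniature with plain r data frames.  the toy columns below (pv1math through pv5math) mirror the real pisa naming convention, but the values are made up for illustration:

```r
# toy stand-in for a pisa table: two countries, five plausible values in math
pisa <- data.frame(
  cnt = c("USA", "FIN"),
  pv1math = c(480, 520), pv2math = c(478, 519),
  pv3math = c(482, 521), pv4math = c(479, 518), pv5math = c(481, 522)
)

# build the five implicate tables: drop every pv column,
# then attach the i-th plausible value as a single `math` column
implicates <- lapply(1:5, function(i) {
  out <- pisa[, !grepl("^pv[1-5]", names(pisa)), drop = FALSE]
  out$math <- pisa[[paste0("pv", i, "math")]]
  out
})
```

the actual script performs the same maneuver inside monetdb, once per plausible-value subject area, which is why you end up with five tables instead of one.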

analysis examples.R
  • run the well-documented block of code to re-initiate the monetdb server
  • load the r data file (.rda) containing all five, yup, five replicate weight designs for the 2012 file
  • detour and coerce a numeric variable to categorical, then match some compendium statistics in the ict file
  • perform the standard repertoire of analysis examples, using a jolly mix of sqlsurvey and custom functions
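the numeric-to-categorical detour can be previewed outside of any survey design with base r's cut() function.  the variable below is an invented stand-in, not an actual ict-file column:

```r
# made-up numeric index standing in for a continuous pisa variable
escs <- c(-1.7, -0.3, 0.2, 1.1, 2.4)

# chop the continuous values into four labeled categories
escs_cat <- cut(
  escs,
  breaks = c(-Inf, -1, 0, 1, Inf),
  labels = c("low", "mid-low", "mid-high", "high")
)

table(escs_cat)
```

in the monetdb-backed scripts the equivalent recode gets written in sql rather than with cut(), but the binning logic is the same.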

variable recode example.R
  • run the well-documented block of code to re-initiate the monetdb server
  • copy the five, yes, five tables to maintain the pristine originals
  • add a new categorical variable by hand
  • re-create, then save the multiply-imputed sqlsurvey object from these sparkling new tables
  • close everything, then load everything back up in a fresh instance of r
  • run a simple statistic on the brand new variable
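as a plain-r analogue of that by-hand recode (the script itself issues the equivalent sql against each of the five tables), here's the idea with an invented 1/2-coded survey item — the coding is an assumption for illustration only:

```r
# invented 1/2-coded item; the 1 = female coding is assumed, not from the codebook
st04q01 <- c(1, 2, 2, 1, 2)

# derive a brand new categorical variable from it
gender <- ifelse(st04q01 == 1, "female", "male")
```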

replicate oecd publications.R
  • run the well-documented block of code to re-initiate the monetdb server
  • load the r data file (.rda) containing the five, yay, five designs for the 2009 file
  • match every type of statistic in the oecd's official technical documentation
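matching those published statistics ultimately rests on rubin's combining rules: run the estimate once per plausible-value design, average the five point estimates, and inflate the variance by the spread across them.  a minimal sketch with made-up numbers:

```r
# made-up point estimates and sampling SEs, one per plausible-value design
pv_means <- c(492.1, 493.4, 491.8, 492.9, 493.0)
pv_ses   <- c(3.1, 3.0, 3.2, 3.1, 3.0)
m <- length(pv_means)  # five implicates

combined <- mean(pv_means)           # final point estimate
within   <- mean(pv_ses^2)           # average sampling variance
between  <- var(pv_means)            # imputation variance across the five
total_se <- sqrt(within + (1 + 1/m) * between)  # rubin's total standard error
```

the multiply-imputed sqlsurvey design objects carry out this combination for you; the arithmetic above is just what's happening under the hood.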



click here to view these four scripts



for more detail about the program for international student assessment (pisa), visit:

if you're just looking for a couple data points, you ought to give the australian council for educational research's interactive data selection tools a spin.  it's a menu-driven table creator, so it's easy to use but inflexible.

you wouldn't be analyzing the program for international student assessment right now without the work of not one but two dr. thomas lumleys.  (or, in latin, lumlii)  if you decide to hand-write a thank-you letter for all of their hard work using jefferson's polygraph, you won't even need to switch out the paper to fill in specific names.  just another example of the unparalleled efficiencies you'll find when working in the r language with monetdb.

confidential to sas, spss, stata, and sudaan users: you are kissing the wrong frogs.  time to transition to r.  :D
