The PISA2009lite package is released

[This article was first published on SmarterPoland » PISA in english, and kindly contributed to R-bloggers.]

This post introduces a new R package named PISA2009lite. I will show how to install this package, what is inside and how to use it.


PISA (Programme for International Student Assessment) is a worldwide study that measures the performance of 15-year-old school pupils. More precisely, scholastic performance in mathematics, science and reading is measured for more than 500 000 pupils from 65 countries.

The first PISA study was performed in 2000, the second in 2003, followed by 2006, 2009 and the latest one in 2012. Data from the latest study will be made public in December 2013. Data from previous studies are accessible through the PISA website.
Note that this data set is quite large. This is why the PISA2009lite package will use more than 220MB of your disk [data sets are compressed on disk] and much more RAM [data sets will be decompressed after [lazy]loading to R].

Let's see some numbers. In the PISA 2009 study, the number of examined pupils is 515 958 (437 variables for each pupil), the number of examined parents is 106 287 (no, their questions are not related to scholastic performance), and the number of schools from which pupils were sampled is 18 641. A pretty large, complex and interesting dataset!

On the official PISA webpage there are instructions on how to read data from the 2000-2009 studies into the SAS and SPSS statistical packages. I am transforming these datasets into R packages to make them easier to use for R users.
Right now, PISA2009lite is mature enough to share. There are still many things to correct/improve/add. Feel free to point them out [or fix them].

This is not the first attempt to make the PISA data available to R users. On GitHub you can find the 'pisa' package maintained by Jason Bryer, with data from the PISA 2009 study.
But since I need data from all PISA editions, namely 2000, 2003, 2006, 2009 and 2012, I've decided to create a few new packages that will support a consistent way to access data from different PISA studies.

Open the R session

The package is on GitHub, so to install it you just need:

library(devtools)  # provides install_github()
# don't download 220MB of compressed data if the package is already
# installed
if (length(find.package("PISA2009lite", quiet = TRUE)) == 0) install_github("PISA2009lite", 
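
The guard in the installation snippet relies on find.package() returning a zero-length character vector when a package is missing (with quiet = TRUE it does so without raising an error). A minimal sketch of that check, wrapped in a hypothetical helper is_installed() that is not part of devtools or PISA2009lite; the second package name is made up and assumed not to exist on your machine:

```r
# Hypothetical helper (not part of any package): TRUE if `pkg` is installed.
is_installed <- function(pkg) {
    # find.package() returns character(0) for a missing package when quiet = TRUE
    length(find.package(pkg, quiet = TRUE)) > 0
}

print(is_installed("stats"))                      # ships with every R installation
print(is_installed("noSuchPackage.PISAexample"))  # made-up name, assumed absent
```

Checking first is worthwhile here because a reinstall would fetch the whole compressed data set again.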

Now PISA2009lite is ready to be loaded:

library(PISA2009lite)


You will find five data sets in this package [actually ten, I will explain this later]. These are: data from student questionnaire, school questionnaire, parent questionnaire, cognitive items and scored cognitive items.


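# student questionnaire (student2009): rows, columns
## [1] 515958    437

# parent questionnaire
## [1] 106287     90

# school questionnaire
## [1] 18641   247

# cognitive items
## [1] 515958    273

# scored cognitive items
## [1] 515958    227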

For most variables in each data set there is a dictionary that decodes the answers to a particular question. The dictionaries for all questions in a given data set are stored as a list of named vectors; each list is named after the corresponding data set [just add the suffix 'dict'].
For example, here are the first six entries of the dictionary for the variable CNT in the data set student2009:

head(student2009dict$CNT)

##          ALB          ARG          AUS          AUT          AZE 
##    "Albania"  "Argentina"  "Australia"    "Austria" "Azerbaijan" 
##          BEL 
##    "Belgium"
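
Because each dictionary is a named character vector, a whole column of codes can be decoded by indexing the dictionary with those codes. Here is a small sketch that runs without the package; toy_dict and toy_codes are made-up stand-ins for student2009dict$CNT and student2009$CNT:

```r
# Made-up stand-in for a dictionary such as student2009dict$CNT
toy_dict <- c(ALB = "Albania", ARG = "Argentina", AUS = "Australia")

# Made-up stand-in for a coded column such as student2009$CNT
toy_codes <- factor(c("AUS", "ALB", "AUS", "ARG"))

# Index the dictionary by name; as.character() matters when the codes are a
# factor, because indexing by a factor would use its integer level codes
decoded <- unname(toy_dict[as.character(toy_codes)])
print(decoded)
```

The as.character() call is a precaution: whether the real column is stored as a factor or as character is not stated here, and the conversion is harmless in either case.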

You can do a lot of things with these data sets, and I am going to show some examples in upcoming posts.

Country ranking in just a few lines of code

But as a warm-up, let's use it to calculate the average performance in mathematics for each country.

Note that student2009$W_FSTUWT stands for the sampling weights, student2009$PV1MATH stands for the first plausible value from the MATH scale, and student2009$CNT stands for the country.

means <- unclass(by(student2009[, c("PV1MATH", "W_FSTUWT")], student2009[, "CNT"], 
    function(x) weighted.mean(x[, 1], x[, 2])))
# sort them
means <- sort(means)
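
To see what the by() / weighted.mean() combination does, here is the same pattern on a tiny made-up data frame (toy; the values are invented and are not real PISA data), together with a hand computation of the weighted mean as sum(w * x) / sum(w):

```r
# Made-up toy data: two countries, three pupils each (not real PISA values)
toy <- data.frame(
    CNT      = c("AAA", "AAA", "AAA", "BBB", "BBB", "BBB"),
    PV1MATH  = c(400, 500, 600, 450, 550, 650),
    W_FSTUWT = c(1, 1, 2, 3, 1, 1)
)

# Same pattern as for student2009: split by country, then take the weighted mean
toy_means <- unclass(by(toy[, c("PV1MATH", "W_FSTUWT")], toy[, "CNT"],
    function(x) weighted.mean(x[, 1], x[, 2])))

# Check country AAA by hand: sum(w * x) / sum(w)
aaa <- toy[toy$CNT == "AAA", ]
by_hand <- sum(aaa$W_FSTUWT * aaa$PV1MATH) / sum(aaa$W_FSTUWT)

print(sort(toy_means))
print(by_hand)
```

In this toy example the weighted mean for AAA is 525 rather than the unweighted 500, because the pupil with the highest score carries twice the weight of the others.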

Let's add proper country names [here dictionaries are useful] and plot it.

names(means) <- student2009dict$CNT[names(means)]
dotchart(means, pch = 19)
abline(v = seq(350, 600, 50), lty = 3, col = "grey")

[Figure: dot chart of the mean mathematics score (PV1MATH) for each country, sorted in increasing order]
