The PISA2003lite package is released. Let’s explore!

[This article was first published on SmarterPoland » PISA in english, and kindly contributed to R-bloggers.]

Today I’m going to show how to install PISA2003lite, what is inside it, and how to use this R package. Datasets from this package will be used to compare student performance in four math sub-areas across different countries.
By the end we will find out in which areas top performers from different countries are stronger and in which they are weaker.

In the post "The PISA2009lite is released", an R package, PISA2009lite, with data from PISA 2009 was introduced. The same approach was applied to convert data from PISA 2003 into an R package. PISA (Programme for International Student Assessment) is a worldwide study focused on measuring the scholastic performance of 15-year-old school pupils in mathematics, science and reading.
The study performed in 2003 was focused on mathematics. Note that PISA 2012 was focused on mathematics as well, so it will be very interesting to compare results between both studies [once the data from PISA 2012 become public].

The package PISA2003lite is on GitHub, so to install it you just need:

library(devtools)

# don't download 120MB of compressed data if the package is already installed
if (length(find.package("PISA2003lite", quiet = TRUE)) == 0)
    install_github("pbiecek/PISA2003lite")  # current devtools expects "user/repo"

# and load the package
library(PISA2003lite)

You will find three datasets in this package: data from the student questionnaire, the school questionnaire, and the cognitive items.

dim(student2003)
## [1] 276165    407
 
dim(school2003)
## [1] 10274   192
 
dim(scoredItem2003)
## [1] 276165    178

Let's plot something! What about strengths and weaknesses in particular math sub-areas?
In this dataset the overall performance is represented by five plausible values: PV1MATH, PV2MATH, PV3MATH, PV4MATH, PV5MATH. But for each student, performance in four sub-scales is also measured. These sub-scales are: Space and Shape, Change and Relationships, Uncertainty, and Quantity (plausible values: PVxMATHy, x=1..5, y=1..4).
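
The naming pattern can be spelled out programmatically. The following is a minimal sketch that only builds the twenty column names from the PVxMATHy pattern described above; it assumes the columns in `student2003` follow that pattern exactly, which is worth confirming with `colnames(student2003)`.

```r
# Build the names of all sub-scale plausible-value columns, PVxMATHy,
# where x = 1..5 indexes the plausible value and y = 1..4 the sub-scale.
pvIndex  <- 1:5
subIndex <- 1:4

pvNames <- outer(pvIndex, subIndex,
                 function(x, y) paste0("PV", x, "MATH", y))
dimnames(pvNames) <- list(paste0("PV", pvIndex), paste0("MATH", subIndex))

# 5 x 4 character matrix: one row per plausible value, one column per sub-scale
pvNames

# Column y holds the five plausible-value names for sub-scale y
pvNames[, 1]
```

The analysis in this post uses only the first plausible value of each sub-scale, that is, the first row of this matrix.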

Let's find out how good the top performers in each country are in the different sub-scales.
For every country, let's calculate the 95% quantile (the 95th percentile) of performance in every sub-scale.

cnts <- as.character(unique(student2003$CNT))
res <- t(sapply(cnts, function(cnt) {
    singleCountry <- student2003[student2003$CNT == cnt, ]
    sapply(c("PV1MATH1", "PV1MATH2", "PV1MATH3", "PV1MATH4"),
           function(x) quantile(singleCountry[, x], 0.95))
}))
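
To see the shape of the result without downloading the data, here is a small sketch of the same computation on synthetic data. The data frame `toy`, its three country codes and its random scores are invented for illustration and have nothing to do with real PISA results.

```r
set.seed(2003)  # synthetic data only, fixed seed for reproducibility

# Hypothetical stand-in for student2003: three made-up countries,
# 200 students each, with random scores in the four sub-scale columns
toy <- data.frame(
  CNT      = factor(rep(c("AAA", "BBB", "CCC"), each = 200)),
  PV1MATH1 = rnorm(600, mean = 500, sd = 100),
  PV1MATH2 = rnorm(600, mean = 500, sd = 100),
  PV1MATH3 = rnorm(600, mean = 500, sd = 100),
  PV1MATH4 = rnorm(600, mean = 500, sd = 100)
)

subScales <- c("PV1MATH1", "PV1MATH2", "PV1MATH3", "PV1MATH4")
toyCnts <- as.character(unique(toy$CNT))

# Same structure as above: one row per country, one column per sub-scale
toyRes <- t(sapply(toyCnts, function(cnt) {
  singleCountry <- toy[toy$CNT == cnt, ]
  sapply(subScales, function(x) quantile(singleCountry[, x], 0.95))
}))

dim(toyRes)
# quantile() returns a named value, so each column name is the
# variable name followed by a suffix
colnames(toyRes)
```

Because of that suffix, the naming step that follows keeps only the first eight characters of each column name.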

Just a few more lines to add proper row and column names.

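# look up the variable labels and keep the second dash-separated part
subAreasNames <- sapply(strsplit(student2003dict$colnames[substr(colnames(res), 1, 8)],
    split = " *- *"), `[`, 2)
colnames(res) <- subAreasNames
# PISA2009lite is loaded here only for its dictionary of country names
library(PISA2009lite)
rownames(res) <- student2009dict$CNT[rownames(res)]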
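
The lookup idiom used above may be easier to follow on a tiny example. This sketch assumes a dictionary stored as a named character vector; the vector `toyDict` and the wording of its labels are hypothetical, and the real `student2003dict` may be organised differently.

```r
# Hypothetical dictionary: names are variable codes, values are labels.
# The label wording is invented for this sketch.
toyDict <- c(
  PV1MATH1 = "Plausible value in math - Space and Shape",
  PV1MATH2 = "Plausible value in math - Change and Relationships",
  PV1MATH3 = "Plausible value in math - Uncertainty",
  PV1MATH4 = "Plausible value in math - Quantity"
)

# Example column names: a variable code followed by a suffix
toyCols <- paste0(names(toyDict), ".95%")

codes  <- substr(toyCols, 1, 8)           # strip the suffix
labels <- toyDict[codes]                  # look up labels by name
parts  <- strsplit(labels, split = " *- *")  # split each label on the dash
subAreas <- unname(sapply(parts, `[`, 2))    # keep the second piece

subAreas
```

Indexing a named vector by name is also what turns country codes into full country names in the last line of the original code.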

And here are the results. The table looks nice, but there are so many numbers.
Let's use PCA to reduce the dimensionality of the data.

res
 
##                    Space and Shape Change and Relationships Uncertainty Quantity
## Australia                    687.5                    679.8       687.2    671.3
## Austria                      706.3                    669.0       652.5    655.2
## Belgium                      704.8                    712.2       692.1    692.2
## Brazil                       513.7                    546.0       523.2    542.8
## Canada                       664.8                    674.7       672.0    668.2
## Czech Republic               744.4                    702.3       673.4    700.9
## Denmark                      674.2                    666.2       661.4    660.5
## Finland                      685.7                    694.6       679.1    681.4
## France                       675.9                    677.9       653.9    657.1
## Germany                      684.4                    679.0       649.0    675.1
## Greece                       595.7                    604.9       601.7    605.8
## Hong Kong-China              733.4                    702.0       712.8    699.2
## Hungary                      660.3                    656.3       631.4    645.5
## Iceland                      655.1                    663.7       678.9    666.5
## Indonesia                    508.1                    506.8       497.5    509.8
## Ireland                      632.8                    647.1       664.8    645.5
## Italy                        675.8                    644.5       641.2    662.3
## Japan                        724.9                    707.6       682.5    683.5
## Korea                        743.2                    705.4       680.9    679.3
## Latvia                       656.1                    651.9       613.7    619.4
## Liechtenstein                701.3                    708.5       670.5    675.9
## Luxembourg                   653.9                    652.5       649.7    647.2
## Macao-China                  688.7                    674.3       674.8    669.7
## Mexico                       535.8                    534.8       535.0    562.0
## Netherlands                  676.9                    704.1       692.2    681.7
## New Zealand                  697.1                    693.5       700.1    670.9
## Norway                       651.4                    646.2       675.9    646.2
## Poland                       668.4                    654.0       632.2    636.2
## Portugal                     609.4                    623.7       604.8    619.3
## Russian Federation           666.2                    644.6       593.7    625.8
## Slovak Republic              702.8                    667.7       622.8    663.3
## Spain                        628.9                    642.3       632.1    652.3
## Sweden                       657.9                    680.9       672.8    655.5
## Switzerland                  701.7                    684.5       663.8    669.2
## Thailand                     590.9                    583.9       561.1    591.4
## Tunisia                      513.5                    510.8       485.6    520.1
## Turkey                       596.9                    629.8       613.0    611.3
## United Kingdom               661.7                    671.3       672.8    664.2
## United States                631.3                    639.3       649.8    640.7
## Uruguay                      579.3                    601.8       582.8    603.4
 
par(xpd = NA)
biplot(prcomp(res))

[Figure: biplot of the first two principal components of res]

Quite interesting!
It looks like the first principal component is an average over all sub-scales. Thus, on the plot above, the further left a country lies, the better its top performers are in all sub-scales.
The second principal component differentiates between countries in which top performers are better in 'Space and Shape' [top] versus 'Uncertainty' [bottom]. Compare this plot with the table above, and you will see that for the Czech Republic, the Slovak Republic and the Russian Federation, 'Space and Shape' is the strongest side of top performers.
On the other side, Sweden, the United States, Norway and Ireland score higher in 'Uncertainty'.
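
This reading of the components can also be checked numerically by looking at the loadings instead of the plot. The sketch below runs `prcomp()` on just eight rows copied from the table above, so its numbers will differ from those of the full 40-country analysis; it is meant only to show where to look.

```r
# Eight rows copied from the table above (95% quantiles per sub-scale)
top <- rbind(
  "Brazil"             = c(513.7, 546.0, 523.2, 542.8),
  "Czech Republic"     = c(744.4, 702.3, 673.4, 700.9),
  "Ireland"            = c(632.8, 647.1, 664.8, 645.5),
  "Korea"              = c(743.2, 705.4, 680.9, 679.3),
  "Norway"             = c(651.4, 646.2, 675.9, 646.2),
  "Russian Federation" = c(666.2, 644.6, 593.7, 625.8),
  "Tunisia"            = c(513.5, 510.8, 485.6, 520.1),
  "United States"      = c(631.3, 639.3, 649.8, 640.7)
)
colnames(top) <- c("Space and Shape", "Change and Relationships",
                   "Uncertainty", "Quantity")

# Centred but not scaled, matching the prcomp() defaults used for the biplot
pc <- prcomp(top)

# Share of the total variance carried by each component
summary(pc)$importance["Proportion of Variance", ]

# Loadings of the four sub-scales on the first two components.
# The overall sign of each component is arbitrary.
pc$rotation[, 1:2]
```

If the first column of loadings has the same sign throughout, that component acts as an overall level; opposite signs for 'Space and Shape' and 'Uncertainty' in the second column correspond to the contrast described above.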

As you can easily check, the results are pretty similar if you focus on averages or on performers from the bottom tail.
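
One way to run such a check is to make the summary statistic a parameter. This is a sketch on synthetic data: the helper `summariseBy`, the data frame `toy` and its country codes are all hypothetical names introduced here, not part of either package.

```r
set.seed(1)  # synthetic data only, fixed seed for reproducibility

# Hypothetical stand-in for student2003 with made-up countries and scores
subScales <- paste0("PV1MATH", 1:4)
toy <- data.frame(CNT = rep(c("AAA", "BBB", "CCC"), each = 200))
for (s in subScales) toy[[s]] <- rnorm(600, mean = 500, sd = 100)

# Apply any summary function to each listed column within each country;
# returns a matrix with one row per country and one column per sub-scale
summariseBy <- function(data, columns, stat) {
  sapply(columns, function(x) tapply(data[[x]], data$CNT, stat))
}

topTail    <- summariseBy(toy, subScales, function(v) quantile(v, 0.95))
average    <- summariseBy(toy, subScales, mean)
bottomTail <- summariseBy(toy, subScales, function(v) quantile(v, 0.05))

round(average, 1)
```

With the real data, each of these matrices could be passed to `biplot(prcomp(...))` in place of `res` to compare the resulting pictures.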

Which direction is better? Of course it depends. But since "change is the only sure thing", an understanding of uncertainty might be useful.
