# 24 Days of R: Day 16

December 16, 2013
(This article was first published on PirateGrunt » R, and kindly contributed to R-bloggers)

Yesterday I said that I'd carry on with the Monte Carlo simulation of insurance data. I'm not going to, as I don't think I've got enough time and mental energy to do it justice. I'm sure tens of people are disappointed to learn this.

Instead, I'm going to have a look at the recently released PISA study, which assesses student performance in many countries around the world. This is always a subject of great (if fleeting) interest here in the US as we tend to punch well below our weight. There are many, many issues around education reform in the US and I couldn't possibly treat them here. Suffice it to say that the PISA study prompts many questions about why one of the world's wealthiest nations (seemingly) can't educate its children as well as other countries with fewer resources. News outlets will have their say, pundits will have theirs and wingnuts will also air their views.

But this isn't a site devoted to conjecture. I'd rather devote my time to the objective assessment of data. Noting that this is a nearly impossible goal (my biases will invariably surface), let's talk about data.

First, the data is not easy to get or interpret. Some digging on the PISA site leads us to the spot where we can get data. All that may be found here. (Thanks, Australia!) After a few minutes sifting through various options, I opted to look at the schools questionnaire file. This is a fixed-width file, so I get to have my first experience in using something other than read.csv. After a cursory look at the codebook, I'm going to focus on just a few columns of information. I'm not clever enough to wrap my head around how to use the read.fwf function to pull just the columns I want, so I'm going to write a helper function.

filename = "./Data/INT_SCQ12_DEC03/INT_SCQ12_DEC03.txt"
filewidth = 1271

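ReadColumn = function(filename, start, width, filewidth) {
  # Negative widths tell read.fwf to skip that many characters.
  skipBefore = start - 1
  skipAfter = filewidth - skipBefore - width
  widths = c(-skipBefore, width, -skipAfter)
  # Drop zero-length entries so that they are not read as empty fields.
  df = read.fwf(filename, widths[widths != 0])
  df
}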
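
For what it's worth, read.fwf can also pull several fields in one pass, since negative widths skip characters. Below is a minimal sketch of that idea. The file it reads is a made-up two-line toy, not the PISA layout, and FieldWidths is a hypothetical helper name introduced only for this illustration.

```r
# Toy fixed-width file, 10 characters per line (made-up layout, NOT the PISA file).
tmp = tempfile(fileext = ".txt")
writeLines(c("AUS0011200", "USA1022100"), tmp)

# Hypothetical helper: turn 1-based start positions and field widths into the
# widths vector read.fwf expects, with negative entries for the gaps to skip.
# Assumes the fields are given in order and do not overlap.
FieldWidths = function(starts, widths) {
  out = numeric(0)
  pos = 1  # next unread character position
  for (i in seq_along(starts)) {
    gap = starts[i] - pos
    if (gap > 0) out = c(out, -gap)
    out = c(out, widths[i])
    pos = starts[i] + widths[i]
  }
  out
}

# Fields at columns 1-3, 4 and 8; columns 5-7 are skipped and anything after
# column 8 is simply never read.
w = FieldWidths(starts = c(1, 4, 8), widths = c(3, 1, 1))
toy = read.fwf(tmp, widths = w,
               col.names = c("Country", "OECD", "Public"),
               colClasses = "character")
```

Reading everything as character avoids guessing at types before the codes have been checked against the codebook.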


And I quickly find that these column specifications are wrong. At this late hour, I can't spend any more time trying to decode every column, but I can identify whether or not a school is in an OECD country and whether or not it's private.

public = ReadColumn(filename, 32, 1, filewidth)
OECD = ReadColumn(filename, 18, 1, filewidth)

df = cbind(public, OECD)
colnames(df) = c("Public", "OECD")
library(reshape2)
df$variable = 1
pivot = dcast(df, "Public ~ OECD", sum)
pivot = pivot[pivot$Public <= 2, ]

public.oecd = pivot[1, 2]/sum(pivot[, 2])
public.other = pivot[1, 3]/sum(pivot[, 3])


The fraction of schools that are public in OECD countries is 0.8107, compared to 0.8066 in other countries. That was a long walk to learn very little. There's undoubtedly loads of great information in here. It's a shame that the file specification isn't clearer. No wonder the pundits spend little time looking at the data.
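
The same kind of share can be sketched in base R with table and prop.table, which lets the cells be picked by label rather than by position. The data frame below is toy data invented for the illustration, and the coding (Public: 1 = public, 2 = private, anything else invalid; OECD: 1 = member, 0 = not) is an assumption rather than something confirmed from the codebook.

```r
# Toy data only; the codes are assumed, not taken from the PISA codebook.
schools = data.frame(
  Public = c(1, 1, 2, 1, 2, 1, 1, 9),
  OECD   = c(1, 1, 1, 0, 0, 0, 0, 1)
)

# Keep only the two valid ownership codes, mirroring the Public <= 2 filter above.
valid = schools[schools$Public <= 2, ]

# Count schools by ownership (rows) and OECD membership (columns).
counts = table(Public = valid$Public, OECD = valid$OECD)

# margin = 2 divides each column by its own total, so every column sums to 1.
shares = prop.table(counts, margin = 2)

# Index by label so the result does not depend on column order.
public.oecd  = shares["1", "1"]
public.other = shares["1", "0"]
```

Indexing by label makes it harder to mix up which column holds the OECD schools, which is easy to do when relying on column positions.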

sessionInfo()

## R version 3.0.2 (2013-09-25)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base
##
## other attached packages:
## [1] knitr_1.5        RWordPress_0.2-3 reshape2_1.2.2   plyr_1.8
##
## loaded via a namespace (and not attached):
## [1] evaluate_0.5.1 formatR_0.10   markdown_0.6.3 RCurl_1.95-4.1
## [5] stringr_0.6.2  tools_3.0.2    XML_3.98-1.1   XMLRPC_0.3-0