24 Days of R: Day 24

December 24, 2013
By

(This article was first published on PirateGrunt » R, and kindly contributed to R-bloggers)

OK, so I phoned it in last night. Final post and maybe this one will be a bit better. Can't recall what got me thinking about it, but I was running over the issue of school performance and the erroneous notion that small class sizes will produce better students. This is occasionally debunked, but I thought it'd be fun to demonstrate the appeal of this notion through a random simulation. If memory serves, this is somewhat similar to something that Markus Gesman wrote about on his blog earlier this year.

I'll construct a set of 100 classes, each of which has a random number of students. Each student will have the same probability distribution that describes their performance on an exam. In each case, I'll use a normal distribution, but will round the results so that I get only integral values. I'll also ensure that I only have positive values.

IntegralNormal = function(n, mean, sd, min, max) {
    val = rnorm(n, mean = mean, sd = sd)
    val = round(val)
    val = pmin(val, max)
    val = pmax(val, min)
    val
}

set.seed(1234)
numClasses = 100
numStudents = IntegralNormal(numClasses, 20, 3, 10, 30)
dfClass = data.frame(ClassID = 1:numClasses, NumStudents = numStudents)

CreateStudentResults = function(x, mean, sd, min, max) {
    numStudents = x$NumStudents
    ClassID = x$ClassID
    Score = IntegralNormal(numStudents, mean, sd, min, max)
    df = data.frame(ClassID = rep(ClassID, numStudents), Student = 1:numStudents, 
        Score = Score)
    df
}

library(plyr)
dfTestResults = ddply(dfClass, .(ClassID), CreateStudentResults, mean = 70, 
    sd = 10, min = 0, max = 100)

So how did the students do? Looks like they average about a C, with some extraordinary performers at either end.

hist(dfTestResults$Score, xlab = "Test score", main = "Histogram of student test scores")

plot of chunk StudentResults

I'm sure that any administrator would like to know which classrooms managed to produce the best overall results.

dfClassResults = ddply(dfTestResults, .(ClassID), summarize, AverageScore = mean(Score))
dfClassResults = merge(dfClassResults, dfClass)
dfClassResults = dfClassResults[order(dfClassResults$AverageScore, decreasing = TRUE), 
    ]
sum(dfClassResults$NumStudents[1:25] < 20)
## [1] 15

Amazing! 60% of classes in the top quartile had a class size which was below average. Class size is clearly a strong indicator of student performance. But perhaps this was just a fluke. What happens in the following year?

dfTestResultsTwo = ddply(dfClass, .(ClassID), CreateStudentResults, mean = 70, 
    sd = 10, min = 0, max = 100)
dfClassResultsTwo = ddply(dfTestResultsTwo, .(ClassID), summarize, AverageScore2 = mean(Score))
dfClassResults = merge(dfClassResults, dfClassResultsTwo)
dfClassResults = dfClassResults[order(dfClassResults$AverageScore2, decreasing = TRUE), 
    ]
sum(dfClassResults$NumStudents[1:25] < 20)
## [1] 14
sum(dfClassResults$NumStudents[1:10] < 20)
## [1] 8

Similar results in the second year! In fact, 8 out of the top 10 schools have class sizes below average. Well, that proves it.

Anyone who's reading this blog knows that it doesn't prove anything. I've cherry picked the data, I've not observed the results in the same class from the first to the second year, I've not looked at any other variables which may cause these results and I've not constructed and vetted any model to explain the results.

fit = lm(AverageScore ~ 1 + NumStudents, data = dfClassResults)
summary(fit)
## 
## Call:
## lm(formula = AverageScore ~ 1 + NumStudents, data = dfClassResults)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.276 -1.379 -0.163  1.078  4.240 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  69.1965     1.3097    52.8   <2e-16 ***
## NumStudents   0.0401     0.0663     0.6     0.55    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.99 on 98 degrees of freedom
## Multiple R-squared:  0.00372,    Adjusted R-squared:  -0.00645 
## F-statistic: 0.366 on 1 and 98 DF,  p-value: 0.547

The number of students is- as we would expect- not a meaningful predictor of test scores. I'll emphasize the obvious point that this is an artificial example and does not demonstrate that smaller class sizes have impact on student performance. This is merely demonstrates that if there were no impact, we could find one. Smaller samples have greater volatility and will appear as outliers on either end of the spectrum.

And that does it. 24 days. 24 posts about R. Later, I'll ruminate on what the exercise meant. For now, thanks to folks who have read and responded. I had a lot of fun doing this and hope that it was- at a minimum- a pleasant diversion.

sessionInfo
## function (package = NULL) 
## {
##     z <- list()
##     z$R.version <- R.Version()
##     z$platform <- z$R.version$platform
##     if (nzchar(.Platform$r_arch)) 
##         z$platform <- paste(z$platform, .Platform$r_arch, sep = "/")
##     z$platform <- paste0(z$platform, " (", 8 * .Machine$sizeof.pointer, 
##         "-bit)")
##     z$locale <- Sys.getlocale()
##     if (is.null(package)) {
##         package <- grep("^package:", search(), value = TRUE)
##         keep <- sapply(package, function(x) x == "package:base" || 
##             !is.null(attr(as.environment(x), "path")))
##         package <- sub("^package:", "", package[keep])
##     }
##     pkgDesc <- lapply(package, packageDescription, encoding = NA)
##     if (length(package) == 0) 
##         stop("no valid packages were specified")
##     basePkgs <- sapply(pkgDesc, function(x) !is.null(x$Priority) && 
##         x$Priority == "base")
##     z$basePkgs <- package[basePkgs]
##     if (any(!basePkgs)) {
##         z$otherPkgs <- pkgDesc[!basePkgs]
##         names(z$otherPkgs) <- package[!basePkgs]
##     }
##     loadedOnly <- loadedNamespaces()
##     loadedOnly <- loadedOnly[!(loadedOnly %in% package)]
##     if (length(loadedOnly)) {
##         names(loadedOnly) <- loadedOnly
##         pkgDesc <- c(pkgDesc, lapply(loadedOnly, packageDescription))
##         z$loadedOnly <- pkgDesc[loadedOnly]
##     }
##     class(z) <- "sessionInfo"
##     z
## }
## <bytecode: 0x0000000010f44640>
## <environment: namespace:utils>

To leave a comment for the author, please follow the link and comment on his blog: PirateGrunt » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.