Student Performance Indicators
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Check out my:
Source: http://archive.ics.uci.edu/ml/datasets/Student+Performance
This project is based upon two datasets of the academic performance of Portuguese students in two different classes: Math and Portuguese. Initially, I show the simplicity of predicting student performance using linear regression. Later, I show that it is still possible, yet more difficult, to predict the final grade without Period 1 and Period 2 grades but we we learn from those predictions provides much deeper insight. I ask deeper questions about the mathematical structure of student performance and potential indicators that can be used for early support and intervention.
Preparation¶
Load R and packages.
%load_ext rpy2.ipython
%%R suppressPackageStartupMessages(library(ggplot2)) suppressPackageStartupMessages(library(dplyr)) suppressPackageStartupMessages(library(caret)) suppressPackageStartupMessages(library(gridExtra)) suppressPackageStartupMessages(library(MASS)) suppressPackageStartupMessages(library(leaps)) suppressPackageStartupMessages(library(relaimpo)) suppressPackageStartupMessages(library(mgcv))
Read in data.
%%R student.mat <- read.csv("student-mat.csv",sep=";") student.por <- read.csv("student-por.csv",sep=";") head(student.mat)
Linear Model¶
For determining the best linear model, we will use student.mat as a training set and student.por as a test set.
%%R train <- student.mat test <- student.por
Saturated Model¶
Let’s fit a linear model to all of the variables. The saturated model will overfit the data, but it will provide a control that can be used to test against.
%%R fit <- lm(G3 ~ ., train)
Compare Adjusted R2, BIC, and Mallow’s CP With Best Subsets¶
5 variables give the lowest BIC and Mallow’s CP while providing an optimal Adjusted R2.
%%R subs <- regsubsets(G3 ~ ., data = train) df <- data.frame(est = c(summary(subs)$adjr2, summary(subs)$cp, summary(subs)$bic), x = rep(1:8, 33), type = rep(c("adjr2", "cp", "bic"), each = 8)) qplot(x, est, data = df, geom = "line") + theme_bw() + facet_grid(type ~ ., scales = "free_y")
From the summary, we need to pick the top 5 variables. G1, G2, absences, and famrel will be the first four and the fifth will either be age or activities.
%%R fit <- lm(formula = G3 ~ ., data = train) summary(fit)
ANOVA¶
The ANOVA test tells us that the best model is the one with age.
%%R model1 <- lm(G3~ G1 + G2 + absences + famrel + age, data = train) model2 <- lm(G3~ G1 + G2 + absences + famrel + activities, data = train) anova(fit, model1, model2)
Test Set¶
Very quickly, we have an accurate model that did a great job predicting our test set. Notice the darker alpha areas snugly against the line.
We can visually compare the success of the final model versus the saturated model by graphing the predicted values versus the actual values. The line represents a perfect model.
Please note the outliers around the actual values of 0. I will go into more detail about this group later in this project.
%%R #Saturated Model control.model <- lm(G3 ~ ., data = test) control.graph <- qplot(G3, predict(control.model), data = test, geom = "point", position = "jitter", alpha=.5, main="Saturated Model") + geom_abline(intercept=0, slope=1) + theme(legend.position="none") #Final Model final.model <- lm(G3~ G1 + G2 + absences + famrel + age, data = test) final.graph <- qplot(G3, predict(final.model), data = test, geom = "point", position = "jitter", alpha=.5, main="Final Model", guide=FALSE) + geom_abline(intercept=0, slope=1) + theme(legend.position="none") grid.arrange(control.graph,final.graph,nrow=2)
Diagnostics¶
Overall, our model looks pretty good. The main issue with our model is the cluster when G3 is 0.
It affects the residuals at the lower end of our distribution.
%%R plot(final.model)
The 0-Cluster¶
Upon further inspection of the data, it becomes obvious that this cluster most likely belongs to students who dropped the course.
- They have G1 and/or G2 grades but final grades of 0.
- There are no G1s of 0 but there are G2s with 0 value.
- The exploratory model predicts these students as scoring between 0 and 10 which would constitute failing grades.
As a result, we should drop these data points before continuing our analysis since they will not be useful for the question we are researching.
%%R score0 <- subset(student.por, G3==0) score0
Final Model¶
Here is the final model for students who finish the course.
%%R #Final Model test <- subset(train, G3!=0) final.model.no0 <- lm(G3~ G1 + G2 + absences + famrel + age, data = test) qplot(G3, predict(final.model.no0), data = test, geom = "point", position = "jitter", alpha=.5, main="Final Model", guide=FALSE) + geom_abline(intercept=0, slope=1) + theme(legend.position="none")
Deeper Questions and Analysis¶
Our model does a great job at predicting student success; however, there are deeper questions that this model doesn’t address. In particular, it doesn’t demonstrate how we can pick which students are most likely to fail classes at an early age when they lack the best predictors in this model.
As we’ve seen, the best predictors of success are current grades within the course (G1 and G2), age, quality of family relationships, and absences.
Current grades are already present once a problem exists.
Let’s try to see if we can determine what factors can be more useful at preventing student failure and promoting academic success.
Let’s start by looking at all the variables within a linear model, but remove our strongest indicators, G1 and G2, which overshadow other potential factors.
%%R fit <- lm(G3 ~ . -G1 -G2, student.mat)
Our predictions stop at 15 but actual scores rise until 20. Without G1 and G2, our model is unable to make predictions that are any higher.
A score of 15 shows a clear dividing line where the "potential" futures merge into current academic success. This line is important in that it can help us determine what deeper differences successful students have from their peers and also allows to create a definition of a "successful" student that we can use.
For this section, it becomes clear that two models will need to be analyzed: one for grades below 15 and another for grades above 15.
%%R qplot(G3, predict(fit), data = student.mat, geom = "point", position = "jitter", alpha=.8) + geom_abline(intercept=0, slope=1) + theme(legend.position="none")
Breaking Up the Analysis¶
So far, the data has shown that it should be broken into three parts in order to analyze deeper predictors of future success.
Students who drop 1. The first isolates students who drop a course. Their final outcome is 0 even though they should have a higher predicted outcome. These students have predicted scores below 10.
Students who finish 2. Between 0 and 15, one set of predictors (one model) will be used to predict student outcomes. 3. Between 15 and 20, a different set of predictors (a different model) will be used.
%%R #Prep Data score0 <- subset(student.mat, G3==0) score.no0 <- subset(student.mat, G3!=0) score14 <- subset(score.no0, G3<15) score15 <- subset(score.no0, G3>14)
Students Above 15¶
Students in this group have 3 things that stand out:
1. All of them have parents that live together. 2. None of them have had past class failures. 3. All of them plan on seeking higher education.
%%R
score15
Students Below 14¶
Create a training and test set for this group.
%%R set.seed(123) inTraining <- createDataPartition(score14$G3, p = .75, list = FALSE) training <- score14[ inTraining,] testing <- score14[-inTraining,]
Saturated Model¶
Below is a general model with all of our variables using the training set. This can help determine which predictors are statistically significant.
%%R saturated14 <- lm(G3 ~ . -G1 -G2, data = training) summary(saturated14)
Let’s use the step function to find a cut down version of Model 1 that removes uneccesary predictors.
%%R step(saturated14)
Model 2¶
Model 2 will be equivalent to the output of the step function.
%%R model2 <- lm(formula = G3 ~ sex + address + studytime + failures + goout + absences, data = training)
%%R subs <- regsubsets(G3 ~ sex + address + studytime + failures + goout + absences, data = training) df <- data.frame(est = c(summary(subs)$adjr2, summary(subs)$bic), x = rep(1:6, 6), type = rep(c("adjr2", "bic"), each = 6)) qplot(x, est, data = df, geom = "line") + theme_bw() + facet_grid(type ~ ., scales = "free_y")
%%R summary(model2)
Model 3¶
Model 3 will be our final model.
%%R model3 <- lm(formula = G3 ~ sex + failures, data = training) summary(model3)
ANOVA¶
We can now compare the 3 models we made using ANOVA.
%%R anova(saturated14,model2,model3)
In this case, ANOVA isn’t very useful since the strongest predictors from the original model have been cut out. By comparing models graphically, it’s easier to get an idea of what’s going on.
By removing the strong predictors of the original model, single predictors become less important and holistic models become more accurate. Below, we see that Model 1 performs the best on the test set.
This gives insight into how we should approach these students early on. One indicator will not make or break a child, but the overall profile can still be a strong indicator.
%%R #Models final1 <- lm(G3 ~ . -G1 -G2, data=testing) final2 <- lm(G3 ~ sex + address + studytime + failures + goout + absences, data= testing) final3 <- lm(G3 ~ sex + failures, data=testing) #Graphs plot1 <- qplot(G3, predict(final1), data = testing, geom = "point", position = "jitter", alpha=.8, main="Model 1") + geom_abline(intercept=0, slope=1) + theme(legend.position="none") plot2 <- qplot(G3, predict(final2), data = testing, geom = "point", position = "jitter", alpha=.8, main="Model 2") + geom_abline(intercept=0, slope=1) + theme(legend.position="none") plot3 <- qplot(G3, predict(final3), data = testing, geom = "point", position = "jitter", alpha=.8, main="Model 3") + geom_abline(intercept=0,slope=1) + theme(legend.position="none") grid.arrange(plot1,plot2,plot3,nrow=2,main="3 Models")
The most important influencers of the holistic model are: – The school the student attends – Access to school supplies – Past failures – Absences – How often the student goes out
%%R tester <- lm(G3 ~ . -G1 -G2, data=score14) summary(tester)
Students Who Drop (The 0-Cluster): Logistic Regression¶
%%R score0$drop <- 1 score14$drop <- 0 score15$drop <- 0 droppers <- rbind(score0,score14,score15) #Train and test sets set.seed(123) inTraining <- createDataPartition(droppers$G3, p = .75, list = FALSE) training <- droppers[ inTraining,] testing <- droppers[-inTraining,]
Half the people who have a G1 grade of 0 drop the course while all students with a G2 grade of 0 drop the course. As can be expected, the G2 curve is steeper since more students drop the course as the first two bad grades come in.
%%R plot1 <- qplot(G1, drop, data = droppers) + geom_smooth(se = FALSE) plot2 <- qplot(G2, drop, data = droppers) + geom_smooth(se = FALSE) grid.arrange(plot1,plot2)
The most influential predictor for dropping a course seems to be the initial grades in the course. The school one attends also plays a factor.
%%R mod <- gam(drop ~ school + G1 + G2 + Pstatus + higher, data = training, family = binomial) summary(mod)
The table confirms that the distribution is heavily skewed to one school.
%%R table(droppers$school, droppers$drop)
The final model does a pretty good job of predicting.
%%R final <- gam(drop ~ G2, data = testing, family = binomial) qplot(G2, drop, data = testing) + geom_smooth(se = FALSE) + geom_line(aes(y = fitted(final)), color = "red")
Final Results¶
The first part of this project shows how simple it can be to build an accurate linear model to predict student success. However accurate that model may be, it lacks the ability of providing insight that can allow one to intervene before a student reaches failure.
However, one can still build a fairly accurate model with weaker predictors… to a point. The holistic profile of a student works quite well at predicting a student’s failure or success until a grade of about 15. The model’s profile then breaks down and can no longer perform in predicting success beyond that point, requiring a separate model (or rather profile) to accurately predict a student’s performance. This change in models is suggestive of a dividing line in student profiles that reinforce themselves within the same model to continue to succeed or continue to fail. Within these Portuguese classes the natural point of division lies at a grade of 15.
Successful students tend to have parents who live together, a history of success, and a desire to continue on to higher education. The best weak predictors for students include school, past failures, access to school supplies, absences, and frequency of going out. Student are more likely to drop a course if they’ve had bad initial grades in that course.
In the end, solutions cannot be proposed, but instead insight in the structure of student performances, how to create student profiles based on that structure, and importantly where to draw the line of success we should push students to cross.