Student Performance Indicators

[This article was first published on NYC Data Science Academy » R, and kindly contributed to R-bloggers].

Source: http://archive.ics.uci.edu/ml/datasets/Student+Performance

This project is based on two datasets of the academic performance of Portuguese students in two different classes: Math and Portuguese. Initially, I show how simple it is to predict student performance using linear regression. Later, I show that it is still possible, though more difficult, to predict the final grade without the Period 1 and Period 2 grades, and that what we learn from those predictions provides much deeper insight. I ask deeper questions about the mathematical structure of student performance and about potential indicators that can be used for early support and intervention.


Preparation

Load R and packages.

In [1]:
%load_ext rpy2.ipython
In [2]:
%%R
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(caret))
suppressPackageStartupMessages(library(gridExtra))
suppressPackageStartupMessages(library(MASS))
suppressPackageStartupMessages(library(leaps))
suppressPackageStartupMessages(library(relaimpo))
suppressPackageStartupMessages(library(mgcv))

Read in data.

In [3]:
%%R
student.mat <- read.csv("student-mat.csv",sep=";")
student.por <- read.csv("student-por.csv",sep=";")
head(student.mat)
school sex age address famsize Pstatus Medu Fedu     Mjob     Fjob     reason
1     GP   F  18       U     GT3       A    4    4  at_home  teacher     course
2     GP   F  17       U     GT3       T    1    1  at_home    other     course
3     GP   F  15       U     LE3       T    1    1  at_home    other      other
4     GP   F  15       U     GT3       T    4    2   health services       home
5     GP   F  16       U     GT3       T    3    3    other    other       home
6     GP   M  16       U     LE3       T    4    3 services    other reputation
  guardian traveltime studytime failures schoolsup famsup paid activities
1   mother          2         2        0       yes     no   no         no
2   father          1         2        0        no    yes   no         no
3   mother          1         2        3       yes     no  yes         no
4   mother          1         3        0        no    yes  yes        yes
5   father          1         2        0        no    yes  yes         no
6   mother          1         2        0        no    yes  yes        yes
  nursery higher internet romantic famrel freetime goout Dalc Walc health
1     yes    yes       no       no      4        3     4    1    1      3
2      no    yes      yes       no      5        3     3    1    1      3
3     yes    yes      yes       no      4        3     2    2    3      3
4     yes    yes      yes      yes      3        2     2    1    1      5
5     yes    yes       no       no      4        3     2    1    2      5
6     yes    yes      yes       no      5        4     2    1    2      5
  absences G1 G2 G3
1        6  5  6  6
2        4  5  5  6
3       10  7  8 10
4        2 15 14 15
5        4  6 10 10
6       10 15 15 15

Linear Model

For determining the best linear model, we will use student.mat as a training set and student.por as a test set.

In [4]:
%%R
train <- student.mat
test <- student.por

Saturated Model

Let’s fit a linear model to all of the variables. The saturated model will overfit the data, but it will provide a control that can be used to test against.

In [5]:
%%R
fit <- lm(G3 ~ ., train)

Compare Adjusted R2, BIC, and Mallows' Cp With Best Subsets

Five variables give the lowest BIC and Mallows' Cp while providing an optimal Adjusted R2.

In [6]:
%%R
subs <- regsubsets(G3 ~ ., data = train)
df <- data.frame(est = c(summary(subs)$adjr2, 
                         summary(subs)$cp,
                         summary(subs)$bic),
                 # regsubsets reports model sizes 1 to 8 by default, repeated once per criterion
                 x = rep(1:8, 3),
                 type = rep(c("adjr2", "cp", "bic"), 
                            each = 8))
qplot(x, est, data = df, geom = "line") +
      theme_bw() + facet_grid(type ~ ., scales = "free_y")

From the summary, we need to pick the top 5 variables. G1, G2, absences, and famrel will be the first four and the fifth will either be age or activities.
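
To make the idea behind best subsets concrete, here is a minimal base-R sketch of the search that regsubsets automates: for each model size, keep the subset with the lowest residual sum of squares, then compare the sizes by adjusted R2 and BIC. It is only an illustration under assumptions, not the author's method. It uses the built-in mtcars data and four arbitrarily chosen predictors as stand-ins for the student data, and brute-force enumeration that would be far too slow for all the student variables.

```r
# Stand-in data: built-in mtcars with four hand-picked predictors (an assumption for illustration)
predictors <- c("wt", "hp", "qsec", "drat")

# For each model size k, fit every subset of k predictors and keep the one with the lowest RSS
best_by_size <- lapply(seq_along(predictors), function(k) {
  combos <- combn(predictors, k, simplify = FALSE)
  fits <- lapply(combos, function(v) {
    lm(reformulate(v, response = "mpg"), data = mtcars)
  })
  rss <- sapply(fits, deviance)  # deviance() of an lm is its residual sum of squares
  fits[[which.min(rss)]]
})

# Summarise the winner of each size with the criteria discussed in the text
scores <- data.frame(
  size  = seq_along(best_by_size),
  rss   = sapply(best_by_size, deviance),
  adjr2 = sapply(best_by_size, function(m) summary(m)$adj.r.squared),
  bic   = sapply(best_by_size, BIC)
)
print(scores)

# Size with the lowest BIC and the variables it contains
best_size <- scores$size[which.min(scores$bic)]
print(formula(best_by_size[[best_size]]))
```

The RSS can only fall as variables are added, which is why penalised criteria such as BIC or adjusted R2 are needed to choose a size.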

In [7]:
%%R
fit <- lm(formula = G3 ~ ., data = train)
summary(fit)
Call:
lm(formula = G3 ~ ., data = train)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.9339 -0.5532  0.2680  0.9689  4.6461 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)      -1.115488   2.116958  -0.527 0.598573    
schoolMS          0.480742   0.366512   1.312 0.190485    
sexM              0.174396   0.233588   0.747 0.455805    
age              -0.173302   0.100780  -1.720 0.086380 .  
addressU          0.104455   0.270791   0.386 0.699922    
famsizeLE3        0.036512   0.226680   0.161 0.872128    
PstatusT         -0.127673   0.335626  -0.380 0.703875    
Medu              0.129685   0.149999   0.865 0.387859    
Fedu             -0.133940   0.128768  -1.040 0.298974    
Mjobhealth       -0.146426   0.518491  -0.282 0.777796    
Mjobother         0.074088   0.332044   0.223 0.823565    
Mjobservices      0.046956   0.369587   0.127 0.898973    
Mjobteacher      -0.026276   0.481632  -0.055 0.956522    
Fjobhealth        0.330948   0.666601   0.496 0.619871    
Fjobother        -0.083582   0.476796  -0.175 0.860945    
Fjobservices     -0.322142   0.493265  -0.653 0.514130    
Fjobteacher      -0.112364   0.601448  -0.187 0.851907    
reasonhome       -0.209183   0.256392  -0.816 0.415123    
reasonother       0.307554   0.380214   0.809 0.419120    
reasonreputation  0.129106   0.267254   0.483 0.629335    
guardianmother    0.195741   0.252672   0.775 0.439046    
guardianother     0.006565   0.463650   0.014 0.988710    
traveltime        0.096994   0.157800   0.615 0.539170    
studytime        -0.104754   0.134814  -0.777 0.437667    
failures         -0.160539   0.161006  -0.997 0.319399    
schoolsupyes      0.456448   0.319538   1.428 0.154043    
famsupyes         0.176870   0.224204   0.789 0.430710    
paidyes           0.075764   0.222100   0.341 0.733211    
activitiesyes    -0.346047   0.205938  -1.680 0.093774 .  
nurseryyes       -0.222716   0.254184  -0.876 0.381518    
higheryes         0.225921   0.500398   0.451 0.651919    
internetyes      -0.144462   0.287528  -0.502 0.615679    
romanticyes      -0.272008   0.219732  -1.238 0.216572    
famrel            0.356876   0.114124   3.127 0.001912 ** 
freetime          0.047002   0.110209   0.426 0.670021    
goout             0.012007   0.105230   0.114 0.909224    
Dalc             -0.185019   0.153124  -1.208 0.227741    
Walc              0.176772   0.114943   1.538 0.124966    
health            0.062995   0.074800   0.842 0.400259    
absences          0.045879   0.013412   3.421 0.000698 ***
G1                0.188847   0.062373   3.028 0.002645 ** 
G2                0.957330   0.053460  17.907  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.901 on 353 degrees of freedom
Multiple R-squared:  0.8458,	Adjusted R-squared:  0.8279 
F-statistic: 47.21 on 41 and 353 DF,  p-value: < 2.2e-16

ANOVA

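The ANOVA comparison shows that dropping the other 36 terms of the saturated model does not significantly worsen the fit (p = 0.82). The two 5-variable models have the same residual degrees of freedom, so no F-test separates them, but the one with age has the lower residual sum of squares (1376.1 versus 1391.4), so it is the one we keep.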

In [8]:
%%R
model1 <- lm(G3~ G1 + G2 + absences + famrel + age, data = train)
model2 <- lm(G3~ G1 + G2 + absences + famrel + activities, data = train)
anova(fit, model1, model2)
Analysis of Variance Table

Model 1: G3 ~ school + sex + age + address + famsize + Pstatus + Medu + 
    Fedu + Mjob + Fjob + reason + guardian + traveltime + studytime + 
    failures + schoolsup + famsup + paid + activities + nursery + 
    higher + internet + romantic + famrel + freetime + goout + 
    Dalc + Walc + health + absences + G1 + G2
Model 2: G3 ~ G1 + G2 + absences + famrel + age
Model 3: G3 ~ G1 + G2 + absences + famrel + activities
  Res.Df    RSS  Df Sum of Sq      F Pr(>F)
1    353 1275.5                            
2    389 1376.1 -36  -100.589 0.7733 0.8248
3    389 1391.4   0   -15.309
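
As a check on how to read the table, the F statistic in row 2 can be reproduced by hand from the printed residual sums of squares and degrees of freedom. The sketch below uses only the rounded values shown in the table, so the result should land close to, but not exactly on, the printed F of 0.7733 and p-value of 0.8248. The variable names are my own.

```r
# Values copied from the ANOVA table (saturated model vs. the model with age)
rss_full    <- 1275.5
df_full     <- 353
rss_reduced <- 1376.1
df_reduced  <- 389

# F = [(RSS_reduced - RSS_full) / (df_reduced - df_full)] / (RSS_full / df_full)
extra_df <- df_reduced - df_full
f_stat   <- ((rss_reduced - rss_full) / extra_df) / (rss_full / df_full)

# Upper-tail probability of the F distribution with (36, 353) degrees of freedom
p_value <- pf(f_stat, extra_df, df_full, lower.tail = FALSE)

print(round(c(F = f_stat, p = p_value), 4))
```

A large p-value here means the 36 dropped terms do not explain significantly more variance than the 5-variable model already does.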

Test Set

Very quickly, we have an accurate model that does a great job predicting our test set. Notice how the darker areas, where many semi-transparent points overlap, sit snugly against the line.

We can visually compare the success of the final model versus the saturated model by graphing the predicted values versus the actual values. The line represents a perfect model.

Please note the outliers around the actual values of 0. I will go into more detail about this group later in this project.

In [9]:
%%R
#Saturated Model (fit on the training set, evaluated on the test set)
control.model <- lm(G3 ~ ., data = train)
control.graph <- qplot(G3, predict(control.model, newdata = test), data = test, geom = "point", 
                       position = "jitter", alpha=.5, main="Saturated Model") + 
                       geom_abline(intercept=0, slope=1) +
                       theme(legend.position="none")
#Final Model (fit on the training set, evaluated on the test set)
final.model <- lm(G3~ G1 + G2 + absences + famrel + age, data = train)
final.graph <- qplot(G3, predict(final.model, newdata = test), data = test, geom = "point", 
                     position = "jitter", alpha=.5, main="Final Model", guide=FALSE) + 
                     geom_abline(intercept=0, slope=1) +
                     theme(legend.position="none")

grid.arrange(control.graph,final.graph,nrow=2)
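
A visual comparison can be backed up with a number. The sketch below shows one common way to score predictions on held-out data, using root mean squared error (RMSE) and the squared correlation between actual and predicted values. It runs on simulated stand-in data, not the student files: the data frame grades, the split, and the noise level are all assumptions made for illustration.

```r
set.seed(42)

# Hypothetical stand-in data: a final grade driven mostly by the previous period's grade
n  <- 400
g2 <- sample(4:19, n, replace = TRUE)
grades <- data.frame(G2 = g2, G3 = g2 + rnorm(n, mean = 0, sd = 1))

# First half for fitting, second half held out for evaluation
train_idx <- seq_len(n / 2)
toy_train <- grades[train_idx, ]
toy_test  <- grades[-train_idx, ]

# Fit on the training rows only, then predict the rows the model has not seen
model <- lm(G3 ~ G2, data = toy_train)
pred  <- predict(model, newdata = toy_test)

# Out-of-sample error measures
rmse    <- sqrt(mean((toy_test$G3 - pred)^2))
r2_test <- cor(toy_test$G3, pred)^2

print(c(rmse = rmse, r2_test = r2_test))
```

The key detail is the newdata argument: calling predict() without it returns the fitted values for the data the model was trained on.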

Diagnostics

Overall, our model looks pretty good. The main issue with our model is the cluster when G3 is 0.

It affects the residuals at the lower end of our distribution.

In [10]:
%%R
plot(final.model)




The 0-Cluster

Upon further inspection of the data, it becomes clear that this cluster most likely consists of students who dropped the course.

  1. They have G1 and/or G2 grades but final grades of 0.
  2. There are no G1s of 0 but there are G2s with 0 value.
  3. The exploratory model predicts these students as scoring between 0 and 10 which would constitute failing grades.

As a result, we should drop these data points before continuing our analysis since they will not be useful for the question we are researching.

In [11]:
%%R
score0 <- subset(student.por, G3==0)
score0
school sex age address famsize Pstatus Medu Fedu     Mjob     Fjob
164     GP   M  18       U     LE3       T    1    1    other    other
441     MS   M  16       U     GT3       T    1    1  at_home services
520     MS   M  16       R     GT3       T    2    1    other services
564     MS   M  17       U     GT3       T    2    2    other    other
568     MS   M  18       R     GT3       T    3    2 services    other
584     MS   F  18       R     GT3       T    2    2    other    other
587     MS   F  17       U     GT3       T    4    2  teacher services
598     MS   F  18       R     GT3       T    2    2  at_home    other
604     MS   F  18       R     LE3       A    4    2  teacher    other
606     MS   F  19       U     GT3       T    1    1  at_home services
611     MS   F  19       R     GT3       A    1    1  at_home  at_home
627     MS   F  18       R     GT3       T    4    4    other  teacher
638     MS   M  18       R     GT3       T    2    1    other    other
640     MS   M  19       R     GT3       T    1    1    other services
641     MS   M  18       R     GT3       T    4    2    other    other
        reason guardian traveltime studytime failures schoolsup famsup    paid
164     course   mother          1         1        2        no     no      no
441       home   mother          2         2        0        no    yes      no
520 reputation   mother          2         2        0        no     no      no
564     course   mother          1         1        1        no     no      no
568     course   mother          1         1        1        no     no      no
584      other   mother          2         1        1        no     no      no
587       home   mother          1         2        0       yes    yes      no
598     course   mother          3         2        1        no     no      no
604 reputation   mother          1         2        0        no     no      no
606      other   father          2         1        1        no     no      no
611     course    other          2         2        3        no    yes      no
627      other   father          3         2        0        no    yes      no
638      other   mother          2         1        0        no     no      no
640      other   mother          2         1        1        no     no      no
641       home   father          2         1        1        no     no     yes
    activities nursery higher internet romantic famrel freetime goout Dalc Walc
164         no     yes     no      yes      yes      2        3     5    2    5
441        yes     yes    yes       no      yes      5        4     5    4    5
520        yes     yes    yes      yes       no      5        2     1    1    1
564        yes     yes    yes       no      yes      1        2     1    2    3
568         no     yes     no      yes       no      2        3     1    2    2
584         no     yes     no      yes      yes      5        5     5    1    1
587        yes     yes    yes      yes       no      5        5     5    1    3
598        yes     yes    yes       no      yes      4        3     3    1    1
604        yes     yes    yes      yes      yes      5        3     1    1    1
606         no     yes     no       no       no      5        5     5    2    3
611        yes     yes     no       no      yes      3        5     4    1    4
627         no      no    yes      yes      yes      3        2     2    4    2
638        yes      no    yes      yes      yes      4        4     3    1    3
640         no     yes    yes       no       no      4        3     2    1    3
641         no     yes    yes       no       no      5        4     3    4    3
    health absences G1 G2 G3
164      4        0 11  9  0
441      3        0  7  0  0
520      2        0  8  7  0
564      5        0  7  0  0
568      5        0  4  0  0
584      3        0  8  6  0
587      5        0  8  8  0
598      4        0  9  0  0
604      5        0  5  0  0
606      2        0  5  0  0
611      1        0  8  0  0
627      5        0  7  5  0
638      5        0  7  7  0
640      5        0  5  8  0
641      3        0  7  7  0
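
The first two observations about this cluster can be checked directly. The sketch below retypes the G1, G2 and G3 columns of the 15 rows printed above into a small data frame (the name score0_grades is my own) and counts the zeros in each period.

```r
# G1, G2 and G3 copied from the 15 rows printed above (students with a final grade of 0)
score0_grades <- data.frame(
  G1 = c(11, 7, 8, 7, 4, 8, 8, 9, 5, 5, 8, 7, 7, 5, 7),
  G2 = c( 9, 0, 7, 0, 0, 6, 8, 0, 0, 0, 0, 5, 7, 8, 7),
  G3 = rep(0, 15)
)

# Observation 1: every student has a non-zero G1 despite the final grade of 0
any_g1_zero <- any(score0_grades$G1 == 0)

# Observation 2: some students already show a 0 in G2
n_g2_zero <- sum(score0_grades$G2 == 0)

print(c(any_g1_zero = any_g1_zero, n_g2_zero = n_g2_zero))
```

With the full data loaded, the same checks would be run on score0 itself instead of a retyped copy.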

Final Model

Here is the final model for students who finish the course.

In [12]:
%%R
#Final Model
#Note: test is reassigned here to the training data without the G3 == 0 rows
test <- subset(train, G3!=0)
final.model.no0 <- lm(G3~ G1 + G2 + absences + famrel + age, data = test)
qplot(G3, predict(final.model.no0), data = test, geom = "point", 
      position = "jitter", alpha=.5, main="Final Model", guide=FALSE) + 
      geom_abline(intercept=0, slope=1) +
      theme(legend.position="none")

Deeper Questions and Analysis

Our model does a great job of predicting student success; however, there are deeper questions that this model doesn’t address. In particular, it doesn’t show how we can identify which students are most likely to fail at an early stage, when the best predictors in this model are not yet available.

As we’ve seen, the best predictors of success are current grades within the course (G1 and G2), age, quality of family relationships, and absences.

Current grades, however, only become available once a problem already exists.

Let’s try to see if we can determine what factors can be more useful at preventing student failure and promoting academic success.

Let’s start by looking at all the variables within a linear model, but remove our strongest indicators, G1 and G2, which overshadow other potential factors.

In [13]:
%%R
fit <- lm(G3 ~ . -G1 -G2, student.mat)

Our predictions stop at 15 but actual scores rise until 20. Without G1 and G2, our model is unable to make predictions that are any higher.

A score of 15 marks a clear dividing line where the “potential” futures merge into current academic success. This line is important in that it can help us determine what deeper differences successful students have from their peers, and it also allows us to create a definition of a “successful” student that we can use.

For this section, it becomes clear that two models will need to be analyzed: one for grades below 15 and another for grades of 15 and above.

In [14]:
%%R
qplot(G3, predict(fit), data = student.mat, geom = "point", position = "jitter", alpha=.8) + 
     geom_abline(intercept=0, slope=1) +
     theme(legend.position="none")
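
A ceiling on the predictions is what one would expect from a weak linear model. For least squares with an intercept, the variance of the fitted values equals R2 times the variance of the response, so a low R2 squeezes the predictions toward the mean. The sketch below demonstrates this on simulated data; the variables and noise level are assumptions chosen only to produce a weak fit.

```r
set.seed(7)

# Hypothetical data: one weak predictor plus a lot of unexplained noise
n <- 300
x <- rnorm(n)
y <- 10 + x + rnorm(n, sd = 3)

weak_fit <- lm(y ~ x)

# The fitted values cover a much narrower range than the actual values
pred_range   <- range(fitted(weak_fit))
actual_range <- range(y)

# Variance identity for least squares with an intercept: var(fitted) = R^2 * var(y)
variance_ratio <- var(fitted(weak_fit)) / var(y)
r_squared      <- summary(weak_fit)$r.squared

print(rbind(predicted = pred_range, actual = actual_range))
print(c(variance_ratio = variance_ratio, r_squared = r_squared))
```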

Breaking Up the Analysis

So far, the data has shown that it should be broken into three parts in order to analyze deeper predictors of future success.

Students who drop

  1. The first group isolates students who drop a course. Their final outcome is 0 even though they should have a higher predicted outcome. These students have predicted scores below 10.

Students who finish

  2. Between 0 and 15, one set of predictors (one model) will be used to predict student outcomes.
  3. Between 15 and 20, a different set of predictors (a different model) will be used.

In [15]:
<span class="o">%%</span>R
%%R
#Prep Data
score0 <- subset(student.mat, G3==0)
score.no0 <- subset(student.mat, G3!=0)
score14 <- subset(score.no0, G3<15)
score15 <- subset(score.no0, G3>14)
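
The same three-way split can also be expressed as a single grouping variable, which is handy for counting or plotting by group. This is a small sketch on made-up grades, not the author's code: it assumes integer grades from 0 to 20, and the labels are my own.

```r
# Hypothetical final grades, including the boundary values 0, 14, 15 and 20
G3 <- c(0, 6, 10, 14, 15, 20)

# Intervals are closed on the right: (-Inf, 0], (0, 14], (14, 20]
group <- cut(G3,
             breaks = c(-Inf, 0, 14, 20),
             labels = c("dropped", "below15", "15plus"))

print(data.frame(G3, group))
print(table(group))
```

Because the breaks sit at 0 and 14, a grade of exactly 15 falls in the top group, matching the G3>14 condition used for score15.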

Students Above 15

Students in this group have 3 things that stand out:

  1. Most of them have parents who live together (Pstatus is T).
  2. None of them have had past class failures.
  3. All of them plan on seeking higher education.

In [16]:
%%R
score15
school sex age address famsize Pstatus Medu Fedu     Mjob     Fjob
4       GP   F  15       U     GT3       T    4    2   health services
6       GP   M  16       U     LE3       T    4    3 services    other
9       GP   M  15       U     LE3       A    3    2 services    other
10      GP   M  15       U     GT3       T    3    4    other    other
15      GP   M  15       U     GT3       A    2    2    other    other
21      GP   M  15       U     GT3       T    4    3  teacher    other
22      GP   M  15       U     GT3       T    4    4   health   health
23      GP   M  16       U     LE3       T    4    2  teacher    other
28      GP   M  15       U     GT3       T    4    2   health services
32      GP   M  15       U     GT3       T    4    4 services services
33      GP   M  15       R     GT3       T    4    3  teacher  at_home
35      GP   M  16       U     GT3       T    3    2    other    other
37      GP   M  15       U     LE3       T    4    3  teacher services
38      GP   M  16       R     GT3       A    4    4    other  teacher
43      GP   M  15       U     GT3       T    4    4 services  teacher
48      GP   M  16       U     GT3       T    4    3   health services
57      GP   F  15       U     GT3       A    4    3 services services
58      GP   M  15       U     GT3       T    4    4  teacher   health
60      GP   F  16       U     GT3       T    4    2 services    other
66      GP   F  16       U     LE3       T    4    3  teacher services
70      GP   F  15       R     LE3       T    3    1    other    other
71      GP   M  16       U     GT3       T    3    1    other    other
84      GP   M  15       U     LE3       T    2    2 services services
92      GP   F  15       U     GT3       T    4    3 services    other
97      GP   M  16       R     GT3       T    4    3 services    other
102     GP   M  16       U     GT3       T    4    4 services  teacher
105     GP   M  15       U     GT3       A    3    4 services    other
108     GP   M  16       U     GT3       T    3    3 services    other
110     GP   F  16       U     LE3       T    4    4   health   health
111     GP   M  15       U     LE3       A    4    4  teacher  teacher
114     GP   M  15       U     LE3       T    4    2  teacher    other
116     GP   M  16       U     GT3       T    4    4  teacher  teacher
121     GP   F  15       U     GT3       T    1    2  at_home services
122     GP   M  15       U     GT3       T    2    2 services services
130     GP   M  16       R     GT3       T    4    4  teacher  teacher
140     GP   F  15       U     GT3       T    4    4  teacher  teacher
159     GP   M  16       R     GT3       T    2    2  at_home    other
168     GP   F  16       U     GT3       T    4    2   health services
172     GP   M  16       U     GT3       T    1    0    other    other
183     GP   F  17       U     GT3       T    2    4 services services
188     GP   M  16       U     LE3       T    2    1    other    other
196     GP   F  17       U     LE3       T    2    4 services services
197     GP   M  17       U     GT3       T    4    4 services  teacher
199     GP   F  17       U     GT3       T    4    4 services  teacher
201     GP   F  16       U     GT3       T    4    3   health    other
216     GP   F  17       U     LE3       T    3    2    other    other
223     GP   F  16       U     GT3       T    2    3 services  teacher
227     GP   F  17       U     GT3       T    3    2    other    other
246     GP   M  16       U     GT3       T    2    1    other    other
250     GP   M  16       U     GT3       T    0    2    other    other
261     GP   F  18       U     GT3       T    4    3 services    other
266     GP   M  18       R     LE3       A    3    4    other    other
287     GP   F  18       U     GT3       T    2    2  at_home  at_home
290     GP   M  18       U     LE3       A    4    4  teacher  teacher
292     GP   F  17       U     GT3       T    4    3   health services
294     GP   F  17       R     LE3       T    3    1 services    other
300     GP   M  18       U     LE3       T    4    4  teacher  teacher
304     GP   F  17       U     GT3       T    3    2   health   health
307     GP   M  20       U     GT3       A    3    2 services    other
324     GP   F  17       U     GT3       T    3    1 services services
325     GP   F  17       U     LE3       T    0    2  at_home  at_home
327     GP   M  17       U     GT3       T    3    3    other services
336     GP   F  17       U     GT3       T    3    4 services    other
339     GP   F  18       U     LE3       T    3    3 services services
343     GP   M  18       U     LE3       T    3    4 services    other
347     GP   M  18       R     GT3       T    4    3  teacher services
349     GP   F  17       U     GT3       T    4    3   health    other
360     MS   F  18       U     LE3       T    1    1  at_home services
364     MS   F  17       U     LE3       T    4    4  at_home  at_home
375     MS   F  18       R     LE3       T    4    4    other    other
377     MS   F  20       U     GT3       T    4    2   health    other
379     MS   F  18       U     GT3       T    3    3    other    other
392     MS   M  17       U     LE3       T    3    1 services services
        reason guardian traveltime studytime failures schoolsup famsup paid
4         home   mother          1         3        0        no    yes  yes
6   reputation   mother          1         2        0        no    yes  yes
9         home   mother          1         2        0        no    yes  yes
10        home   mother          1         2        0        no    yes  yes
15        home    other          1         3        0        no    yes   no
21  reputation   mother          1         2        0        no     no   no
22       other   father          1         1        0        no    yes  yes
23      course   mother          1         2        0        no     no   no
28       other   mother          1         1        0        no     no  yes
32  reputation   mother          2         2        0        no    yes   no
33      course   mother          1         2        0        no    yes   no
35        home   mother          1         1        0        no    yes  yes
37        home   mother          1         3        0        no    yes   no
38  reputation   mother          2         3        0        no    yes   no
43      course   father          1         2        0        no    yes   no
48  reputation   mother          1         4        0        no     no   no
57  reputation   mother          1         2        0        no    yes  yes
58  reputation   mother          1         2        0        no    yes   no
60      course   mother          1         2        0        no    yes   no
66      course   mother          3         2        0        no    yes   no
70  reputation   father          2         4        0        no    yes   no
71  reputation   father          2         4        0        no    yes  yes
84        home   mother          2         2        0        no     no  yes
92  reputation   mother          1         1        0        no     no  yes
97  reputation   mother          2         1        0       yes    yes   no
102      other   father          1         3        0        no    yes   no
105     course   mother          1         2        0        no    yes  yes
108       home   father          1         3        0        no    yes   no
110      other   mother          1         3        0        no    yes  yes
111     course   mother          1         1        0        no     no   no
114     course   mother          1         1        0        no     no   no
116     course   father          1         2        0        no    yes   no
121     course   mother          1         2        0        no     no   no
122       home   father          1         4        0        no    yes  yes
130     course   mother          1         1        0        no     no  yes
140     course   mother          2         1        0        no     no   no
159     course   mother          3         1        0        no     no   no
168       home   father          1         2        0        no     no  yes
172 reputation   mother          2         2        0        no    yes  yes
183 reputation   father          1         2        0        no    yes   no
188     course   mother          1         2        0        no     no  yes
196     course   father          1         2        0        no     no   no
197       home   mother          1         1        0        no     no   no
199       home   mother          2         1        1        no    yes   no
201       home   mother          1         2        0        no    yes   no
216 reputation   mother          2         2        0        no     no  yes
223      other   mother          1         2        0       yes     no   no
227     course   mother          1         2        0        no     no   no
246     course   mother          3         1        0        no     no   no
250      other   mother          1         1        0        no     no  yes
261       home   father          1         2        0        no    yes  yes
266 reputation   mother          2         2        0        no    yes  yes
287      other   mother          1         3        0        no    yes  yes
290 reputation   mother          1         2        0        no    yes  yes
292 reputation   mother          1         3        0        no    yes  yes
294 reputation   mother          2         4        0        no    yes  yes
300       home   mother          1         1        0        no    yes  yes
304 reputation   father          1         4        0        no    yes  yes
307     course    other          1         1        0        no     no   no
324     course   father          1         3        0        no    yes   no
325       home   father          2         3        0        no     no   no
327 reputation   mother          1         1        0        no     no   no
336     course   mother          1         3        0        no     no   no
339       home   mother          1         4        0        no    yes   no
343       home   mother          1         2        0        no     no   no
347     course   mother          1         3        0        no     no   no
349 reputation   mother          1         3        0        no    yes  yes
360     course   father          2         3        0        no     no   no
364     course   mother          1         2        0        no    yes  yes
375 reputation   mother          2         3        0        no     no   no
377     course    other          2         3        2        no    yes  yes
379       home   mother          1         2        0        no     no  yes
392     course   mother          2         1        0        no     no   no
    activities nursery higher internet romantic famrel freetime goout Dalc Walc
4          yes     yes    yes      yes      yes      3        2     2    1    1
6          yes     yes    yes      yes       no      5        4     2    1    2
9           no     yes    yes      yes       no      4        2     2    1    1
10         yes     yes    yes      yes       no      5        5     1    1    1
15          no     yes    yes      yes      yes      4        5     2    1    1
21          no     yes    yes      yes       no      4        4     1    1    1
22          no     yes    yes      yes       no      5        4     2    1    1
23         yes     yes    yes      yes       no      4        5     1    1    3
28          no     yes    yes      yes       no      2        2     4    2    4
32         yes     yes    yes      yes       no      4        3     1    1    1
33         yes     yes    yes      yes      yes      4        5     2    1    1
35          no      no    yes      yes       no      5        4     3    1    1
37         yes     yes    yes      yes       no      5        4     3    1    1
38         yes     yes    yes      yes      yes      2        4     3    1    1
43         yes     yes    yes      yes       no      4        3     3    1    1
48         yes     yes    yes      yes       no      4        2     2    1    1
57         yes     yes    yes      yes       no      4        3     2    1    1
58         yes     yes    yes       no       no      3        2     2    1    1
60          no     yes    yes      yes       no      4        2     3    1    1
66         yes     yes    yes      yes       no      5        4     3    1    2
70          no      no    yes      yes       no      4        4     2    2    3
71          no     yes    yes      yes       no      4        3     2    1    1
84         yes     yes    yes      yes       no      5        3     3    1    3
92         yes     yes    yes      yes       no      4        5     5    1    3
97         yes      no    yes      yes       no      3        3     3    1    1
102        yes     yes    yes      yes      yes      4        4     3    1    1
105        yes     yes    yes      yes       no      5        4     4    1    1
108        yes     yes    yes      yes       no      5        3     3    1    1
110        yes     yes    yes      yes      yes      5        4     5    1    1
111        yes     yes    yes      yes       no      5        5     3    1    1
114         no     yes    yes      yes       no      3        5     2    1    1
116        yes     yes    yes      yes       no      5        4     4    1    2
121         no      no    yes      yes       no      3        2     3    1    2
122        yes     yes    yes      yes       no      5        5     4    1    2
130        yes     yes    yes      yes       no      3        5     5    2    5
140        yes     yes    yes      yes       no      4        3     2    1    1
159         no      no    yes       no       no      4        2     2    1    2
168         no     yes    yes      yes      yes      4        2     3    1    1
172        yes     yes    yes      yes      yes      4        3     2    1    1
183        yes     yes    yes       no       no      5        4     2    2    3
188        yes     yes    yes      yes      yes      4        2     3    1    2
196        yes     yes    yes      yes      yes      4        3     2    1    1
197         no     yes    yes      yes       no      5        2     3    1    2
199         no     yes    yes      yes       no      4        2     4    2    3
201        yes     yes    yes      yes       no      4        3     5    1    5
216         no     yes    yes      yes       no      4        4     4    1    3
223         no     yes    yes      yes       no      2        3     1    1    1
227        yes      no    yes      yes       no      5        3     4    1    3
246         no     yes    yes      yes       no      4        3     3    1    1
250         no      no    yes      yes       no      4        3     2    2    4
261         no     yes    yes      yes      yes      3        1     2    1    3
266        yes     yes    yes      yes       no      4        2     5    3    4
287         no     yes    yes      yes       no      4        3     3    1    2
290        yes     yes    yes      yes       no      5        4     3    1    1
292         no     yes    yes      yes       no      4        2     2    1    2
294         no     yes    yes       no       no      3        1     2    1    1
300         no     yes    yes      yes      yes      1        4     2    2    2
304        yes      no    yes      yes       no      5        2     2    1    2
307        yes     yes    yes       no       no      5        5     3    1    1
324         no      no    yes      yes       no      3        4     3    2    3
325         no     yes    yes      yes       no      3        3     3    2    3
327        yes      no    yes      yes       no      4        3     5    3    5
336         no     yes    yes      yes       no      4        4     5    1    3
339         no     yes    yes      yes       no      5        3     3    1    1
343        yes     yes    yes      yes      yes      4        3     3    1    3
347         no     yes    yes      yes      yes      5        3     2    1    2
349        yes     yes    yes      yes      yes      4        4     3    1    3
360         no     yes    yes      yes       no      5        3     2    1    1
364        yes     yes    yes      yes      yes      2        3     4    1    1
375         no     yes    yes      yes       no      5        4     4    1    1
377         no      no    yes      yes      yes      5        4     3    1    1
379         no     yes    yes      yes      yes      4        1     3    1    2
392         no      no    yes      yes       no      2        4     5    3    4
    health absences G1 G2 G3
4        5        2 15 14 15
6        5       10 15 15 15
9        1        0 16 18 19
10       5        0 14 15 15
15       3        0 14 16 16
21       1        0 13 14 15
22       5        0 12 15 15
23       5        2 15 15 16
28       1        4 15 16 15
32       5        0 17 16 17
33       5        0 17 16 16
35       5        0 12 14 15
37       4        2 15 16 18
38       5        7 15 16 15
43       5        2 19 18 18
48       2        4 19 19 20
57       1        0 14 15 15
58       5        4 14 15 15
60       5        2 15 16 16
66       1        2 16 15 15
70       3       12 16 16 16
71       5        0 13 15 15
84       4        4 15 15 15
92       1        4 16 17 18
97       4        2 11 15 15
102      4        0 16 17 17
105      1        0 16 18 18
108      5        2 16 18 18
110      4        4 14 15 16
111      4        6 18 19 19
114      3       10 18 19 19
116      5        2 15 15 16
121      1        2 16 15 15
122      5        6 16 14 15
130      4        8 18 18 18
140      5        0 16 16 15
159      3        2 17 15 15
168      3        0 14 15 16
172      3        2 13 15 16
183      5        0 16 17 17
188      5        0 15 15 15
196      5        0 14 15 15
197      5        4 17 15 16
199      2       24 18 18 18
201      2        2 16 16 16
216      1        2 14 15 15
223      3        2 16 16 17
227      3       10 16 15 15
246      4        6 18 18 18
250      5        0 13 15 15
261      2       21 17 18 18
266      1       13 17 17 17
287      2        5 18 18 19
290      2        9 15 13 15
292      3        0 15 15 15
294      3        6 18 18 18
300      1        5 16 15 16
304      5        0 17 17 18
307      5        0 17 18 18
324      5        1 12 14 15
325      2        0 16 15 15
327      5        3 14 15 16
336      5       16 16 15 15
339      1        7 16 15 17
343      5       11 16 15 15
347      4        9 16 15 16
349      4        0 13 15 15
360      4        0 18 16 16
364      1        0 16 15 15
375      1        0 19 18 19
377      3        4 15 14 15
379      1        0 15 15 15
392      2        3 14 16 16

Students Below 14

Create a training and test set for this group.
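
As a rough illustration of what a stratified 75/25 split does, here is a base-R sketch that bins a numeric outcome at its quintiles and samples about 75% of the rows from each bin. It is only meant to convey the idea; caret's `createDataPartition` may differ in its details. The vector `g3` is simulated stand-in data, and `pick` and `train_idx` are names made up for this example.

```r
# Base-R sketch of a quantile-stratified 75/25 split.
# `g3` is simulated stand-in data, not the student dataset.
set.seed(123)
g3 <- sample(4:13, size = 200, replace = TRUE)

# Bin the outcome at its quintiles; unique() guards against tied break points.
breaks <- unique(quantile(g3, probs = seq(0, 1, length.out = 6)))
bins <- cut(g3, breaks = breaks, include.lowest = TRUE)

# Draw about 75% of the row indices from every non-empty bin.
pick <- function(rows) rows[sample.int(length(rows), ceiling(0.75 * length(rows)))]
groups <- split(seq_along(g3), bins, drop = TRUE)
train_idx <- sort(unlist(lapply(groups, pick), use.names = FALSE))

train_y <- g3[train_idx]
test_y <- g3[-train_idx]

# The two parts should have similar sizes ratio (about 3:1) and similar means.
print(c(train = length(train_y), test = length(test_y)))
print(sapply(list(train = train_y, test = test_y), mean))
```

Sampling within bins keeps the distribution of the outcome similar in both parts, which matters when the group is small.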

In [17]:
%%R
set.seed(123)
inTraining <- createDataPartition(score14$G3, p = .75, list = FALSE)
training <- score14[ inTraining,]
testing  <- score14[-inTraining,]

Saturated Model

Below is a full model fit on the training set, using every variable except the period grades G1 and G2 as predictors of G3. Its summary helps identify which predictors are statistically significant.
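
The model is specified with the formula `G3 ~ . - G1 - G2`. As a small sketch of how R reads that formula, the example below expands it against a toy data frame (`toy` and its values are hypothetical, chosen only for illustration) and lists the predictors that remain.

```r
# Toy data frame (hypothetical values) with a few of the same column names.
toy <- data.frame(
  G1 = c(10, 12, 9, 14),
  G2 = c(11, 12, 8, 15),
  G3 = c(11, 13, 8, 15),
  studytime = c(2, 3, 1, 4),
  failures = c(0, 0, 1, 0)
)

# "." expands to every column except the response G3;
# "- G1 - G2" then removes those two terms again.
predictors <- attr(terms(G3 ~ . - G1 - G2, data = toy), "term.labels")
print(predictors)
```

Only `studytime` and `failures` remain here, which is the same mechanism that keeps the two period grades out of the student model.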

In [18]:
%%R
saturated14 <- lm(G3 ~ . - G1 - G2, data = training)
summary(saturated14)
Call:
lm(formula = G3 ~ . - G1 - G2, data = training)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.0532 -1.2297 -0.0758  1.5029  5.1073 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)   
(Intercept)      10.740459   3.406452   3.153  0.00190 **
schoolMS         -0.669689   0.564845  -1.186  0.23739   
sexM              0.511533   0.416269   1.229  0.22079   
age               0.029858   0.168227   0.177  0.85933   
addressU          0.735560   0.458102   1.606  0.11016   
famsizeLE3        0.013457   0.379180   0.035  0.97173   
PstatusT          0.188910   0.558990   0.338  0.73581   
Medu             -0.008148   0.254834  -0.032  0.97453   
Fedu              0.006450   0.222098   0.029  0.97686   
Mjobhealth        1.584729   0.895228   1.770  0.07845 . 
Mjobother        -0.029289   0.528770  -0.055  0.95589   
Mjobservices      0.545055   0.619720   0.880  0.38033   
Mjobteacher      -0.175333   0.787911  -0.223  0.82416   
Fjobhealth       -0.772045   1.062648  -0.727  0.46849   
Fjobother        -0.539129   0.757634  -0.712  0.47767   
Fjobservices     -0.479938   0.785370  -0.611  0.54193   
Fjobteacher      -0.079528   1.048411  -0.076  0.93962   
reasonhome        0.688622   0.431180   1.597  0.11207   
reasonother       0.321321   0.607912   0.529  0.59778   
reasonreputation  0.435443   0.437953   0.994  0.32147   
guardianmother    0.217571   0.432852   0.503  0.61585   
guardianother     0.250323   0.754461   0.332  0.74045   
traveltime        0.092044   0.260529   0.353  0.72429   
studytime         0.411308   0.227115   1.811  0.07186 . 
failures         -0.806048   0.250153  -3.222  0.00152 **
schoolsupyes     -0.491972   0.479693  -1.026  0.30650   
famsupyes        -0.298508   0.366403  -0.815  0.41636   
paidyes           0.214191   0.373014   0.574  0.56656   
activitiesyes     0.388324   0.358694   1.083  0.28048   
nurseryyes       -0.407032   0.407825  -0.998  0.31964   
higheryes        -0.816840   0.799668  -1.021  0.30845   
internetyes      -0.134818   0.453433  -0.297  0.76657   
romanticyes       0.101779   0.360754   0.282  0.77818   
famrel           -0.016745   0.185782  -0.090  0.92829   
freetime          0.050437   0.184857   0.273  0.78530   
goout            -0.462603   0.179052  -2.584  0.01060 * 
Dalc              0.290830   0.240016   1.212  0.22727   
Walc             -0.025814   0.188709  -0.137  0.89135   
health           -0.056258   0.120458  -0.467  0.64106   
absences         -0.054226   0.020125  -2.694  0.00774 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.238 on 174 degrees of freedom
Multiple R-squared:  0.2508,	Adjusted R-squared:  0.08291 
F-statistic: 1.494 on 39 and 174 DF,  p-value: 0.04302
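
The gap between the multiple and adjusted R-squared comes from the penalty for the 39 estimated slopes. As a quick check, the sketch below recomputes the adjusted value from the numbers printed in the summary, using the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1). The variable names are made up for this example, and R² is used as rounded in the output, so the result agrees with the reported 0.08291 only up to rounding.

```r
# Values read off the summary output above.
r2 <- 0.2508      # Multiple R-squared (rounded as printed)
df_resid <- 174   # residual degrees of freedom
p <- 39           # model degrees of freedom (coefficients excluding the intercept)

n <- df_resid + p + 1   # implied number of training rows
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_resid

print(n)
print(round(adj_r2, 4))
```

With 214 training rows and 39 slopes, most of the apparent fit is absorbed by the penalty, which is the motivation for pruning predictors.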

Let’s use the step function to find a cut-down version of the saturated model that removes unnecessary predictors.
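
For linear models, `step` compares candidates by the AIC that `extractAIC` reports, which is n * log(RSS / n) + 2 * edf, where edf is the number of estimated coefficients. The sketch below plugs in figures from the output that follows (RSS 871.86 for the full model, 876.93 after dropping Fjob) to show where the AIC column comes from. The helper name `aic_lm` is made up for this example, and n = 214 is inferred from the 174 residual degrees of freedom plus 40 coefficients.

```r
# AIC as used by step() for lm fits (defined up to an additive constant):
#   n * log(RSS / n) + 2 * edf
aic_lm <- function(rss, n, edf) n * log(rss / n) + 2 * edf

n <- 214   # 174 residual df + 40 estimated coefficients

# Full model: 40 coefficients, RSS = 871.86.
aic_full <- aic_lm(871.86, n, 40)

# Dropping Fjob removes 4 coefficients and raises RSS to 876.93.
aic_no_fjob <- aic_lm(876.93, n, 36)

print(round(c(full = aic_full, no_Fjob = aic_no_fjob), 2))
```

Removing Fjob costs little in RSS but saves four parameters, so its AIC is the lowest in the first table and it is the first term dropped.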

In [19]:
%%R
step(saturated14)
Start:  AIC=380.6
G3 ~ (school + sex + age + address + famsize + Pstatus + Medu + 
    Fedu + Mjob + Fjob + reason + guardian + traveltime + studytime + 
    failures + schoolsup + famsup + paid + activities + nursery + 
    higher + internet + romantic + famrel + freetime + goout + 
    Dalc + Walc + health + absences + G1 + G2) - G1 - G2

             Df Sum of Sq    RSS    AIC
- Fjob        4     5.065 876.93 373.84
- guardian    2     1.346 873.21 376.93
- reason      3    13.248 885.11 377.82
- Fedu        1     0.004 871.87 378.60
- Medu        1     0.005 871.87 378.60
- famsize     1     0.006 871.87 378.60
- famrel      1     0.041 871.91 378.61
- Walc        1     0.094 871.96 378.62
- age         1     0.158 872.02 378.64
- freetime    1     0.373 872.24 378.69
- romantic    1     0.399 872.26 378.69
- internet    1     0.443 872.31 378.71
- Pstatus     1     0.572 872.44 378.74
- traveltime  1     0.625 872.49 378.75
- health      1     1.093 872.96 378.86
- paid        1     1.652 873.52 379.00
- famsup      1     3.326 875.19 379.41
- nursery     1     4.991 876.86 379.82
- higher      1     5.228 877.09 379.88
- schoolsup   1     5.271 877.14 379.89
- activities  1     5.873 877.74 380.03
- school      1     7.043 878.91 380.32
- Dalc        1     7.357 879.22 380.40
- Mjob        4    32.450 904.32 380.42
- sex         1     7.567 879.43 380.45
<none>                    871.86 380.60
- address     1    12.919 884.78 381.74
- studytime   1    16.434 888.30 382.59
- goout       1    33.447 905.31 386.65
- absences    1    36.379 908.24 387.34
- failures    1    52.025 923.89 391.00

Step:  AIC=373.84
G3 ~ school + sex + age + address + famsize + Pstatus + Medu + 
    Fedu + Mjob + reason + guardian + traveltime + studytime + 
    failures + schoolsup + famsup + paid + activities + nursery + 
    higher + internet + romantic + famrel + freetime + goout + 
    Dalc + Walc + health + absences

             Df Sum of Sq    RSS    AIC
- guardian    2     0.947 877.88 370.07
- reason      3    13.191 890.12 371.03
- famsize     1     0.002 876.93 371.84
- Medu        1     0.014 876.94 371.84
- Fedu        1     0.027 876.96 371.84
- famrel      1     0.110 877.04 371.86
- Walc        1     0.378 877.31 371.93
- age         1     0.383 877.31 371.93
- freetime    1     0.400 877.33 371.93
- romantic    1     0.484 877.41 371.95
- traveltime  1     0.554 877.48 371.97
- Pstatus     1     0.674 877.60 372.00
- internet    1     0.705 877.63 372.01
- health      1     1.025 877.95 372.09
- paid        1     1.608 878.54 372.23
- famsup      1     3.436 880.37 372.67
- schoolsup   1     4.248 881.18 372.87
- higher      1     4.945 881.87 373.04
- nursery     1     5.473 882.40 373.17
- Mjob        4    30.591 907.52 373.17
- activities  1     6.427 883.36 373.40
- sex         1     6.927 883.86 373.52
- school      1     7.118 884.05 373.57
<none>                    876.93 373.84
- Dalc        1     9.398 886.33 374.12
- address     1    12.765 889.69 374.93
- studytime   1    15.487 892.42 375.58
- goout       1    31.904 908.83 379.48
- absences    1    36.836 913.77 380.64
- failures    1    54.906 931.84 384.83

Step:  AIC=370.07
G3 ~ school + sex + age + address + famsize + Pstatus + Medu + 
    Fedu + Mjob + reason + traveltime + studytime + failures + 
    schoolsup + famsup + paid + activities + nursery + higher + 
    internet + romantic + famrel + freetime + goout + Dalc + 
    Walc + health + absences

             Df Sum of Sq    RSS    AIC
- reason      3    13.466 891.34 367.33
- Medu        1     0.001 877.88 368.07
- Fedu        1     0.002 877.88 368.07
- famsize     1     0.003 877.88 368.07
- famrel      1     0.099 877.98 368.09
- Walc        1     0.382 878.26 368.16
- freetime    1     0.520 878.40 368.19
- traveltime  1     0.521 878.40 368.19
- Pstatus     1     0.541 878.42 368.20
- romantic    1     0.548 878.42 368.20
- age         1     0.660 878.54 368.23
- internet    1     0.818 878.69 368.27
- health      1     1.034 878.91 368.32
- paid        1     1.537 879.41 368.44
- famsup      1     3.107 880.98 368.82
- schoolsup   1     4.288 882.16 369.11
- higher      1     4.696 882.57 369.21
- Mjob        4    29.905 907.78 369.24
- nursery     1     5.667 883.54 369.44
- activities  1     6.202 884.08 369.57
- sex         1     6.609 884.49 369.67
- school      1     7.736 885.61 369.94
<none>                    877.88 370.07
- Dalc        1     9.348 887.22 370.33
- address     1    13.101 890.98 371.24
- studytime   1    17.000 894.88 372.17
- goout       1    32.920 910.80 375.95
- absences    1    35.981 913.86 376.66
- failures    1    57.841 935.72 381.72

Step:  AIC=367.33
G3 ~ school + sex + age + address + famsize + Pstatus + Medu + 
    Fedu + Mjob + traveltime + studytime + failures + schoolsup + 
    famsup + paid + activities + nursery + higher + internet + 
    romantic + famrel + freetime + goout + Dalc + Walc + health + 
    absences

             Df Sum of Sq    RSS    AIC
- Fedu        1     0.001 891.34 365.33
- Medu        1     0.012 891.35 365.33
- famsize     1     0.034 891.38 365.33
- famrel      1     0.165 891.51 365.36
- Pstatus     1     0.222 891.56 365.38
- freetime    1     0.312 891.65 365.40
- romantic    1     0.765 892.11 365.51
- Walc        1     0.804 892.15 365.52
- age         1     0.852 892.19 365.53
- internet    1     0.989 892.33 365.56
- health      1     1.151 892.49 365.60
- traveltime  1     1.173 892.52 365.61
- schoolsup   1     3.406 894.75 366.14
- higher      1     3.426 894.77 366.15
- famsup      1     3.993 895.34 366.28
- paid        1     4.108 895.45 366.31
- nursery     1     5.282 896.62 366.59
- Mjob        4    31.118 922.46 366.67
- activities  1     7.662 899.00 367.16
<none>                    891.34 367.33
- school      1     8.631 899.97 367.39
- sex         1     8.987 900.33 367.47
- Dalc        1     9.983 901.33 367.71
- address     1    14.591 905.93 368.80
- studytime   1    17.603 908.95 369.51
- absences    1    28.273 919.62 372.01
- goout       1    31.410 922.75 372.74
- failures    1    59.037 950.38 379.05

Step:  AIC=365.33
G3 ~ school + sex + age + address + famsize + Pstatus + Medu + 
    Mjob + traveltime + studytime + failures + schoolsup + famsup + 
    paid + activities + nursery + higher + internet + romantic + 
    famrel + freetime + goout + Dalc + Walc + health + absences

             Df Sum of Sq    RSS    AIC
- Medu        1     0.014 891.36 363.33
- famsize     1     0.034 891.38 363.33
- famrel      1     0.167 891.51 363.37
- Pstatus     1     0.224 891.57 363.38
- freetime    1     0.313 891.66 363.40
- romantic    1     0.770 892.11 363.51
- Walc        1     0.808 892.15 363.52
- age         1     0.852 892.20 363.53
- internet    1     0.994 892.34 363.56
- health      1     1.155 892.50 363.60
- traveltime  1     1.174 892.52 363.61
- higher      1     3.434 894.78 364.15
- schoolsup   1     3.479 894.82 364.16
- famsup      1     4.105 895.45 364.31
- paid        1     4.110 895.45 364.31
- nursery     1     5.289 896.63 364.59
- Mjob        4    31.118 922.46 364.67
- activities  1     7.692 899.03 365.16
<none>                    891.34 365.33
- school      1     8.660 900.00 365.39
- sex         1     8.986 900.33 365.47
- Dalc        1     9.983 901.33 365.71
- address     1    14.827 906.17 366.86
- studytime   1    17.708 909.05 367.53
- absences    1    28.845 920.19 370.14
- goout       1    31.423 922.77 370.74
- failures    1    60.614 951.96 377.40

Step:  AIC=363.33
G3 ~ school + sex + age + address + famsize + Pstatus + Mjob + 
    traveltime + studytime + failures + schoolsup + famsup + 
    paid + activities + nursery + higher + internet + romantic + 
    famrel + freetime + goout + Dalc + Walc + health + absences

             Df Sum of Sq    RSS    AIC
- famsize     1     0.027 891.38 361.34
- famrel      1     0.168 891.52 361.37
- Pstatus     1     0.233 891.59 361.38
- freetime    1     0.343 891.70 361.41
- romantic    1     0.761 892.12 361.51
- Walc        1     0.797 892.15 361.52
- age         1     0.892 892.25 361.54
- internet    1     0.982 892.34 361.56
- health      1     1.142 892.50 361.60
- traveltime  1     1.170 892.53 361.61
- schoolsup   1     3.478 894.83 362.16
- higher      1     3.506 894.86 362.17
- paid        1     4.104 895.46 362.31
- famsup      1     4.164 895.52 362.33
- nursery     1     5.298 896.65 362.60
- Mjob        4    31.633 922.99 362.79
- activities  1     7.679 899.04 363.16
<none>                    891.36 363.33
- school      1     8.652 900.01 363.40
- sex         1     8.973 900.33 363.47
- Dalc        1    10.176 901.53 363.76
- address     1    14.821 906.18 364.86
- studytime   1    17.697 909.05 365.54
- absences    1    29.530 920.89 368.30
- goout       1    32.214 923.57 368.93
- failures    1    61.869 953.23 375.69

Step:  AIC=361.34
G3 ~ school + sex + age + address + Pstatus + Mjob + traveltime + 
    studytime + failures + schoolsup + famsup + paid + activities + 
    nursery + higher + internet + romantic + famrel + freetime + 
    goout + Dalc + Walc + health + absences

             Df Sum of Sq    RSS    AIC
- famrel      1     0.173 891.56 359.38
- Pstatus     1     0.261 891.65 359.40
- freetime    1     0.360 891.74 359.42
- romantic    1     0.754 892.14 359.52
- Walc        1     0.792 892.18 359.53
- age         1     0.890 892.27 359.55
- internet    1     0.972 892.36 359.57
- traveltime  1     1.147 892.53 359.61
- health      1     1.159 892.54 359.61
- schoolsup   1     3.481 894.86 360.17
- higher      1     3.512 894.90 360.18
- paid        1     4.133 895.52 360.33
- famsup      1     4.145 895.53 360.33
- nursery     1     5.441 896.83 360.64
- Mjob        4    31.713 923.10 360.82
- activities  1     7.657 899.04 361.17
<none>                    891.38 361.34
- school      1     8.746 900.13 361.42
- sex         1     8.948 900.33 361.47
- Dalc        1    10.150 901.53 361.76
- address     1    14.989 906.37 362.90
- studytime   1    17.738 909.12 363.55
- absences    1    29.551 920.93 366.31
- goout       1    32.349 923.73 366.96
- failures    1    61.924 953.31 373.71

Step:  AIC=359.38
G3 ~ school + sex + age + address + Pstatus + Mjob + traveltime + 
    studytime + failures + schoolsup + famsup + paid + activities + 
    nursery + higher + internet + romantic + freetime + goout + 
    Dalc + Walc + health + absences

             Df Sum of Sq    RSS    AIC
- Pstatus     1     0.237 891.79 357.43
- freetime    1     0.320 891.88 357.45
- Walc        1     0.721 892.28 357.55
- romantic    1     0.759 892.32 357.56
- age         1     0.783 892.34 357.56
- internet    1     0.967 892.52 357.61
- traveltime  1     1.180 892.74 357.66
- health      1     1.302 892.86 357.69
- higher      1     3.509 895.07 358.22
- schoolsup   1     3.730 895.29 358.27
- famsup      1     4.043 895.60 358.34
- paid        1     4.088 895.64 358.36
- nursery     1     5.517 897.07 358.70
- Mjob        4    32.533 924.09 359.05
- activities  1     7.833 899.39 359.25
<none>                    891.56 359.38
- school      1     8.591 900.15 359.43
- sex         1     8.785 900.34 359.47
- Dalc        1    10.553 902.11 359.89
- address     1    15.168 906.72 360.99
- studytime   1    17.751 909.31 361.60
- absences    1    29.446 921.00 364.33
- goout       1    33.672 925.23 365.31
- failures    1    61.777 953.33 371.71

Step:  AIC=357.43
G3 ~ school + sex + age + address + Mjob + traveltime + studytime + 
    failures + schoolsup + famsup + paid + activities + nursery + 
    higher + internet + romantic + freetime + goout + Dalc + 
    Walc + health + absences

             Df Sum of Sq    RSS    AIC
- freetime    1     0.393 892.19 355.53
- romantic    1     0.734 892.53 355.61
- Walc        1     0.790 892.58 355.62
- internet    1     0.856 892.65 355.64
- age         1     0.865 892.66 355.64
- traveltime  1     1.167 892.96 355.71
- health      1     1.328 893.12 355.75
- higher      1     3.510 895.30 356.27
- schoolsup   1     3.851 895.64 356.36
- famsup      1     3.904 895.70 356.37
- paid        1     4.337 896.13 356.47
- nursery     1     5.610 897.40 356.78
- Mjob        4    32.387 924.18 357.07
<none>                    891.79 357.43
- activities  1     8.471 900.26 357.46
- school      1     8.555 900.35 357.48
- sex         1     9.120 900.91 357.61
- Dalc        1    10.434 902.23 357.92
- address     1    15.119 906.91 359.03
- studytime   1    17.625 909.42 359.62
- absences    1    31.137 922.93 362.78
- goout       1    33.963 925.76 363.43
- failures    1    61.749 953.54 369.76

Step:  AIC=355.53
G3 ~ school + sex + age + address + Mjob + traveltime + studytime + 
    failures + schoolsup + famsup + paid + activities + nursery + 
    higher + internet + romantic + goout + Dalc + Walc + health + 
    absences

             Df Sum of Sq    RSS    AIC
- romantic    1     0.743 892.93 353.71
- internet    1     0.873 893.06 353.74
- age         1     0.878 893.06 353.74
- Walc        1     0.943 893.13 353.75
- traveltime  1     1.099 893.28 353.79
- health      1     1.328 893.51 353.85
- higher      1     3.576 895.76 354.38
- famsup      1     3.807 895.99 354.44
- schoolsup   1     3.955 896.14 354.47
- paid        1     4.371 896.56 354.57
- nursery     1     5.809 897.99 354.92
- Mjob        4    32.890 925.08 355.27
<none>                    892.19 355.53
- school      1     8.466 900.65 355.55
- activities  1     8.650 900.84 355.59
- sex         1     9.728 901.91 355.85
- Dalc        1    11.010 903.20 356.15
- address     1    14.924 907.11 357.08
- studytime   1    17.236 909.42 357.62
- absences    1    31.582 923.77 360.97
- goout       1    34.992 927.18 361.76
- failures    1    61.373 953.56 367.76

Step:  AIC=353.71
G3 ~ school + sex + age + address + Mjob + traveltime + studytime + 
    failures + schoolsup + famsup + paid + activities + nursery + 
    higher + internet + goout + Dalc + Walc + health + absences

             Df Sum of Sq    RSS    AIC
- Walc        1     0.768 893.70 351.89
- internet    1     0.822 893.75 351.90
- age         1     1.116 894.04 351.97
- traveltime  1     1.200 894.13 351.99
- health      1     1.440 894.37 352.05
- higher      1     3.931 896.86 352.65
- famsup      1     3.974 896.90 352.66
- schoolsup   1     4.136 897.06 352.69
- paid        1     4.382 897.31 352.75
- nursery     1     5.476 898.40 353.01
- school      1     8.246 901.17 353.67
<none>                    892.93 353.71
- Mjob        4    34.369 927.30 353.79
- activities  1     9.014 901.94 353.86
- sex         1     9.236 902.16 353.91
- Dalc        1    10.926 903.86 354.31
- address     1    15.568 908.50 355.40
- studytime   1    17.880 910.81 355.95
- absences    1    30.856 923.78 358.98
- goout       1    35.702 928.63 360.10
- failures    1    62.751 955.68 366.24

Step:  AIC=351.89
G3 ~ school + sex + age + address + Mjob + traveltime + studytime + 
    failures + schoolsup + famsup + paid + activities + nursery + 
    higher + internet + goout + Dalc + health + absences

             Df Sum of Sq    RSS    AIC
- internet    1     0.832 894.53 350.09
- traveltime  1     1.178 894.87 350.17
- age         1     1.280 894.98 350.20
- health      1     1.674 895.37 350.29
- schoolsup   1     3.747 897.44 350.78
- famsup      1     3.907 897.60 350.82
- higher      1     4.045 897.74 350.86
- paid        1     4.107 897.80 350.87
- nursery     1     5.185 898.88 351.13
- school      1     8.093 901.79 351.82
<none>                    893.70 351.89
- sex         1     8.529 902.23 351.92
- Mjob        4    34.632 928.33 352.03
- activities  1     9.147 902.84 352.07
- Dalc        1    10.831 904.53 352.47
- address     1    16.038 909.73 353.70
- studytime   1    18.926 912.62 354.37
- absences    1    33.120 926.82 357.68
- goout       1    44.792 938.49 360.36
- failures    1    64.389 958.09 364.78

Step:  AIC=350.09
G3 ~ school + sex + age + address + Mjob + traveltime + studytime + 
    failures + schoolsup + famsup + paid + activities + nursery + 
    higher + goout + Dalc + health + absences

             Df Sum of Sq    RSS    AIC
- traveltime  1     1.205 895.73 348.38
- age         1     1.372 895.90 348.42
- health      1     1.457 895.99 348.44
- paid        1     3.700 898.23 348.97
- famsup      1     3.862 898.39 349.01
- schoolsup   1     3.869 898.40 349.01
- higher      1     4.046 898.58 349.05
- nursery     1     5.122 899.65 349.31
- school      1     8.066 902.60 350.01
- sex         1     8.380 902.91 350.08
<none>                    894.53 350.09
- Mjob        4    34.276 928.81 350.14
- activities  1     9.092 903.62 350.25
- Dalc        1    10.697 905.23 350.63
- address     1    15.359 909.89 351.73
- studytime   1    18.504 913.03 352.47
- absences    1    35.357 929.89 356.38
- goout       1    46.356 940.89 358.90
- failures    1    63.557 958.09 362.78

Step:  AIC=348.38
G3 ~ school + sex + age + address + Mjob + studytime + failures + 
    schoolsup + famsup + paid + activities + nursery + higher + 
    goout + Dalc + health + absences

             Df Sum of Sq    RSS    AIC
- age         1     1.129 896.86 346.65
- health      1     1.463 897.20 346.73
- paid        1     3.902 899.64 347.31
- famsup      1     3.993 899.73 347.33
- schoolsup   1     4.111 899.84 347.36
- higher      1     4.645 900.38 347.48
- nursery     1     4.855 900.59 347.53
- school      1     7.129 902.86 348.07
- Mjob        4    33.484 929.22 348.23
<none>                    895.73 348.38
- sex         1     8.708 904.44 348.45
- activities  1     9.307 905.04 348.59
- Dalc        1    10.420 906.15 348.85
- address     1    14.186 909.92 349.74
- studytime   1    18.379 914.11 350.72
- absences    1    35.641 931.37 354.73
- goout       1    46.348 942.08 357.17
- failures    1    62.368 958.10 360.78

Step:  AIC=346.65
G3 ~ school + sex + address + Mjob + studytime + failures + schoolsup + 
    famsup + paid + activities + nursery + higher + goout + Dalc + 
    health + absences

             Df Sum of Sq    RSS    AIC
- health      1     1.494 898.36 345.00
- paid        1     3.929 900.79 345.58
- famsup      1     4.117 900.98 345.63
- nursery     1     5.106 901.97 345.86
- higher      1     5.181 902.04 345.88
- schoolsup   1     5.555 902.42 345.97
- school      1     6.043 902.91 346.08
- Mjob        4    32.895 929.76 346.35
<none>                    896.86 346.65
- sex         1     8.662 905.52 346.70
- activities  1     8.875 905.74 346.75
- Dalc        1    10.389 907.25 347.11
- address     1    13.763 910.63 347.91
- studytime   1    19.266 916.13 349.19
- absences    1    34.513 931.37 352.73
- goout       1    45.430 942.29 355.22
- failures    1    61.840 958.70 358.92

Step:  AIC=345
G3 ~ school + sex + address + Mjob + studytime + failures + schoolsup + 
    famsup + paid + activities + nursery + higher + goout + Dalc + 
    absences

             Df Sum of Sq    RSS    AIC
- paid        1     3.868 902.22 343.92
- famsup      1     4.211 902.57 344.00
- nursery     1     5.102 903.46 344.21
- higher      1     5.326 903.68 344.27
- schoolsup   1     5.371 903.73 344.28
- school      1     5.855 904.21 344.39
- Mjob        4    31.786 930.14 344.44
<none>                    898.36 345.00
- sex         1     8.595 906.95 345.04
- activities  1     9.235 907.59 345.19
- Dalc        1     9.792 908.15 345.32
- address     1    14.177 912.53 346.35
- studytime   1    20.456 918.81 347.82
- absences    1    33.903 932.26 350.93
- goout       1    44.888 943.24 353.44
- failures    1    62.365 960.72 357.37

Step:  AIC=343.92
G3 ~ school + sex + address + Mjob + studytime + failures + schoolsup + 
    famsup + activities + nursery + higher + goout + Dalc + absences

             Df Sum of Sq    RSS    AIC
- famsup      1     2.669 904.89 342.55
- nursery     1     4.396 906.62 342.96
- higher      1     4.483 906.71 342.98
- Mjob        4    31.089 933.31 343.17
- school      1     5.625 907.85 343.25
- schoolsup   1     5.748 907.97 343.28
- sex         1     7.586 909.81 343.71
- activities  1     8.104 910.33 343.84
<none>                    902.22 343.92
- Dalc        1    11.099 913.32 344.54
- address     1    13.441 915.67 345.09
- studytime   1    24.800 927.02 347.72
- absences    1    33.936 936.16 349.82
- goout       1    44.784 947.01 352.29
- failures    1    64.064 966.29 356.60

Step:  AIC=342.55
G3 ~ school + sex + address + Mjob + studytime + failures + schoolsup + 
    activities + nursery + higher + goout + Dalc + absences

             Df Sum of Sq    RSS    AIC
- Mjob        4    28.859 933.75 341.27
- nursery     1     4.102 909.00 341.52
- higher      1     4.575 909.47 341.63
- school      1     4.830 909.72 341.69
- schoolsup   1     6.597 911.49 342.11
- activities  1     8.247 913.14 342.50
<none>                    904.89 342.55
- sex         1     9.493 914.39 342.79
- Dalc        1    10.634 915.53 343.05
- address     1    14.000 918.89 343.84
- studytime   1    23.656 928.55 346.08
- absences    1    33.952 938.85 348.44
- goout       1    45.351 950.24 351.02
- failures    1    62.076 966.97 354.75

Step:  AIC=341.27
G3 ~ school + sex + address + studytime + failures + schoolsup + 
    activities + nursery + higher + goout + Dalc + absences

             Df Sum of Sq    RSS    AIC
- higher      1     2.544 936.30 339.85
- nursery     1     3.300 937.05 340.03
- school      1     5.738 939.49 340.58
- activities  1     6.604 940.36 340.78
- schoolsup   1     7.273 941.03 340.93
- Dalc        1     7.334 941.09 340.95
<none>                    933.75 341.27
- sex         1    12.155 945.91 342.04
- address     1    15.614 949.37 342.82
- studytime   1    18.323 952.08 343.43
- absences    1    32.306 966.06 346.55
- goout       1    36.907 970.66 347.57
- failures    1    57.539 991.29 352.07

Step:  AIC=339.85
G3 ~ school + sex + address + studytime + failures + schoolsup + 
    activities + nursery + goout + Dalc + absences

             Df Sum of Sq    RSS    AIC
- nursery     1     3.745 940.04 338.71
- school      1     5.372 941.67 339.08
- activities  1     6.023 942.32 339.23
- Dalc        1     6.560 942.86 339.35
- schoolsup   1     7.469 943.77 339.55
<none>                    936.30 339.85
- sex         1    14.398 950.69 341.12
- address     1    15.218 951.51 341.30
- studytime   1    17.742 954.04 341.87
- absences    1    30.266 966.56 344.66
- goout       1    38.949 975.25 346.58
- failures    1    55.098 991.39 350.09

Step:  AIC=338.71
G3 ~ school + sex + address + studytime + failures + schoolsup + 
    activities + goout + Dalc + absences

             Df Sum of Sq    RSS    AIC
- school      1     5.025 945.07 337.85
- activities  1     6.277 946.32 338.13
- Dalc        1     6.681 946.72 338.22
- schoolsup   1     7.978 948.02 338.52
<none>                    940.04 338.71
- address     1    13.750 953.79 339.82
- sex         1    15.573 955.61 340.22
- studytime   1    17.445 957.49 340.64
- absences    1    29.515 969.56 343.32
- goout       1    41.370 981.41 345.93
- failures    1    53.366 993.41 348.53

Step:  AIC=337.85
G3 ~ sex + address + studytime + failures + schoolsup + activities + 
    goout + Dalc + absences

             Df Sum of Sq    RSS    AIC
- Dalc        1     5.606 950.67 337.12
- schoolsup   1     6.666 951.73 337.35
- activities  1     8.239 953.31 337.71
<none>                    945.07 337.85
- sex         1    15.633 960.70 339.36
- studytime   1    20.818 965.88 340.51
- address     1    21.896 966.96 340.75
- absences    1    26.956 972.02 341.87
- goout       1    42.577 987.64 345.28
- failures    1    52.706 997.77 347.46

Step:  AIC=337.12
G3 ~ sex + address + studytime + failures + schoolsup + activities + 
    goout + absences

             Df Sum of Sq    RSS    AIC
- activities  1     5.991 956.66 336.46
- schoolsup   1     7.160 957.83 336.72
<none>                    950.67 337.12
- address     1    18.871 969.54 339.32
- studytime   1    20.347 971.02 339.65
- absences    1    23.952 974.62 340.44
- sex         1    24.752 975.42 340.62
- goout       1    37.787 988.46 343.46
- failures    1    49.046 999.72 345.88

Step:  AIC=336.46
G3 ~ sex + address + studytime + failures + schoolsup + goout + 
    absences

            Df Sum of Sq     RSS    AIC
- schoolsup  1     5.570  962.23 335.70
<none>                    956.66 336.46
- address    1    15.417  972.08 337.88
- absences   1    22.411  979.07 339.42
- studytime  1    23.997  980.66 339.76
- sex        1    29.866  986.53 341.04
- goout      1    36.975  993.64 342.57
- failures   1    51.533 1008.20 345.69

Step:  AIC=335.7
G3 ~ sex + address + studytime + failures + goout + absences

            Df Sum of Sq     RSS    AIC
<none>                    962.23 335.70
- address    1    13.323  975.56 336.64
- absences   1    21.649  983.88 338.46
- studytime  1    24.962  987.20 339.18
- goout      1    34.506  996.74 341.24
- sex        1    35.079  997.31 341.36
- failures   1    52.412 1014.65 345.05

Call:
lm(formula = G3 ~ sex + address + studytime + failures + goout + 
    absences, data = training)

Coefficients:
(Intercept)         sexM     addressU    studytime     failures        goout  
   10.25165      0.86537      0.60566      0.45018     -0.64907     -0.37654  
   absences  
   -0.03561
Model 2

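Model 2 is the model that the step function settles on above: G3 ~ sex + address + studytime + failures + goout + absences, reached at AIC = 335.7, where removing any remaining term would raise the AIC.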
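The AIC column in the step trace can be checked by hand. For lm fits, step() ranks candidates with extractAIC(), which computes n * log(RSS / n) + 2 * edf, where edf is the number of estimated coefficients. The sketch below assumes n = 214 training rows (207 residual degrees of freedom plus 7 coefficients in the selected model); the helper name aic_from_rss is made up for illustration.

```r
# Hypothetical helper: reproduce the AIC values printed by step() for an lm fit.
# rss = residual sum of squares taken from the step table
# edf = number of estimated coefficients, intercept included
# n   = number of training rows (ASSUMED: 207 residual df + 7 coefficients = 214)
aic_from_rss <- function(rss, edf, n = 214) {
  n * log(rss / n) + 2 * edf
}

# Selected model: sex + address + studytime + failures + goout + absences
aic_final <- aic_from_rss(rss = 962.23, edf = 7)
# Previous step: the same model with schoolsup still included
aic_prev <- aic_from_rss(rss = 956.66, edf = 8)

print(round(c(final = aic_final, previous = aic_prev), 2))
```

Both values agree with the trace (335.70 and 336.46): dropping schoolsup raises the RSS slightly, but the saved parameter more than pays for it.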

In [20]:
%%R
model2 <- lm(formula = G3 ~ sex + address + studytime + failures + goout + 
    absences, data = training)
In [21]:
%%R
subs <- regsubsets(G3 ~ sex + address + studytime + failures + goout + absences, data = training)
# One x value (subset size 1 to 6) per estimate: 6 adjr2 values followed by 6 bic values
df <- data.frame(est = c(summary(subs)$adjr2, 
                         summary(subs)$bic),
                 x = rep(1:6, 2),
                 type = rep(c("adjr2", "bic"), 
                            each = 6))
qplot(x, est, data = df, geom = "line") +
      theme_bw() + facet_grid(type ~ ., scales = "free_y")

In [22]:
%%R
summary(model2)
Call:
lm(formula = G3 ~ sex + address + studytime + failures + goout + 
    absences, data = training)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.2007 -1.3576 -0.1115  1.6244  4.4124 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 10.25165    0.68067  15.061  < 2e-16 ***
sexM         0.86537    0.31502   2.747 0.006544 ** 
addressU     0.60566    0.35776   1.693 0.091972 .  
studytime    0.45018    0.19427   2.317 0.021464 *  
failures    -0.64907    0.19330  -3.358 0.000935 ***
goout       -0.37654    0.13820  -2.725 0.006990 ** 
absences    -0.03561    0.01650  -2.158 0.032075 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.156 on 207 degrees of freedom
Multiple R-squared:  0.1732,	Adjusted R-squared:  0.1492 
F-statistic: 7.226 on 6 and 207 DF,  p-value: 5.128e-07
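
To make the coefficient table concrete, the fitted equation can be evaluated by hand. The student profile below is invented for illustration (male, urban address, studytime 2, no past failures, goout 3, 4 absences); only the coefficients come from the summary above.

```r
# Coefficients copied from summary(model2)
coefs <- c(intercept = 10.25165, sexM = 0.86537, addressU = 0.60566,
           studytime = 0.45018, failures = -0.64907, goout = -0.37654,
           absences = -0.03561)

# Hypothetical student (NOT a row from the dataset)
student <- c(intercept = 1, sexM = 1, addressU = 1,
             studytime = 2, failures = 0, goout = 3, absences = 4)

# Linear predictor: sum of coefficient * value, matched by name
predicted_G3 <- sum(coefs * student[names(coefs)])
print(predicted_G3)
```

Read this way, each additional past failure lowers the predicted final grade by about 0.65 points with the other predictors held fixed.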
Model 3

Model 3 is the last and smallest model: it keeps only sex and failures.

In [23]:
%%R
model3 <- lm(formula = G3 ~ sex + failures, data = training)
summary(model3)
Call:
lm(formula = G3 ~ sex + failures, data = training)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.3655 -1.3655 -0.0253  1.6345  3.8171 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  10.3655     0.2136  48.520  < 2e-16 ***
sexM          0.6599     0.3088   2.137   0.0337 *  
failures     -0.8424     0.1949  -4.323 2.37e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.235 on 211 degrees of freedom
Multiple R-squared:  0.09429,	Adjusted R-squared:  0.08571 
F-statistic: 10.98 on 2 and 211 DF,  p-value: 2.898e-05

ANOVA

We can now compare the 3 models we made using ANOVA.

In [24]:
%%R
anova(saturated14,model2,model3)
Analysis of Variance Table

Model 1: G3 ~ (school + sex + age + address + famsize + Pstatus + Medu + 
    Fedu + Mjob + Fjob + reason + guardian + traveltime + studytime + 
    failures + schoolsup + famsup + paid + activities + nursery + 
    higher + internet + romantic + famrel + freetime + goout + 
    Dalc + Walc + health + absences + G1 + G2) - G1 - G2
Model 2: G3 ~ sex + address + studytime + failures + goout + absences
Model 3: G3 ~ sex + failures
  Res.Df     RSS  Df Sum of Sq      F   Pr(>F)   
1    174  871.86                                 
2    207  962.23 -33   -90.368 0.5465 0.978867   
3    211 1054.04  -4   -91.806 4.5805 0.001532 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

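In this case, ANOVA isn’t very useful, since the strongest predictors from the original model (the G1 and G2 grades) have been cut out. The table shows only that shrinking Model 1 to Model 2 does not significantly worsen the fit (p = 0.98), while shrinking Model 2 to Model 3 does (p = 0.0015). Comparing the models graphically gives a better idea of what’s going on.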
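
The F column of the table can be reproduced from its other columns. When anova() compares several lm fits, each F statistic divides the change in RSS per degree of freedom by the residual mean square of the largest model (Model 1 here). The sketch below uses only numbers copied from the table; the variable and function names are illustrative.

```r
# Residual sum of squares and residual df of the largest model (Model 1)
rss_model1 <- 871.86
df_model1 <- 174
error_ms <- rss_model1 / df_model1   # error scale shared by every row of the table

# F = (Sum of Sq / Df) / error mean square
f_stat <- function(sum_sq, df) (sum_sq / df) / error_ms

f_2_vs_1 <- f_stat(90.368, 33)   # Model 2 against Model 1
f_3_vs_2 <- f_stat(91.806, 4)    # Model 3 against Model 2

# Upper-tail probabilities on (Df, df_model1) degrees of freedom
p_2_vs_1 <- pf(f_2_vs_1, 33, df_model1, lower.tail = FALSE)
p_3_vs_2 <- pf(f_3_vs_2, 4, df_model1, lower.tail = FALSE)

print(round(c(F_2_vs_1 = f_2_vs_1, F_3_vs_2 = f_3_vs_2), 4))
print(signif(c(p_2_vs_1 = p_2_vs_1, p_3_vs_2 = p_3_vs_2), 4))
```

The two F values round to the 0.5465 and 4.5805 shown in the table.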

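With the strong predictors of the original model removed, no single predictor carries much weight, and holistic models that combine many predictors become more accurate. Below, each of the three formulas is refit on the test set and its fitted values are plotted against the observed G3; Model 1 performs the best. Because the models are refit on the same rows they are plotted against, the plots show how well each formula can fit the test set rather than how well the training-set models predict unseen students.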

This gives insight into how we should approach these students early on. One indicator will not make or break a child, but the overall profile can still be a strong indicator.
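
One way to put a number on test-set performance is to score models fitted on the training rows against rows they have not seen, for example with the root mean squared error (RMSE). The sketch below is only an illustration of that idea: it runs on synthetic stand-in data, and the names toy, toy_train, toy_test and rmse are invented here, not taken from the analysis.

```r
# Synthetic stand-in data: NOT the student dataset
set.seed(1)
n <- 300
toy <- data.frame(failures = rpois(n, 0.4),
                  goout = sample(1:5, n, replace = TRUE))
toy$G3 <- 11 - 0.8 * toy$failures - 0.4 * toy$goout + rnorm(n, sd = 2)

# Hold out 90 of the 300 rows for testing
train_rows <- sample(n, size = 210)
toy_train <- toy[train_rows, ]
toy_test <- toy[-train_rows, ]

# RMSE of a fitted model on data it was not trained on
rmse <- function(model, newdata) {
  sqrt(mean((newdata$G3 - predict(model, newdata = newdata))^2))
}

fit_small <- lm(G3 ~ failures, data = toy_train)
fit_large <- lm(G3 ~ failures + goout, data = toy_train)

scores <- c(small = rmse(fit_small, toy_test),
            large = rmse(fit_large, toy_test))
print(scores)
```

With the real data, the same pattern would pass the training-set models and the testing data frame to rmse(), giving one held-out error per model.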

In [25]:
%%R
#Models: refit each of the three formulas on the test set
final1 <- lm(G3 ~ . -G1 -G2, data=testing)
final2 <- lm(G3 ~ sex + address + studytime + failures + goout + absences, data= testing)
final3 <- lm(G3 ~ sex + failures, data=testing)

#Graphs: fitted values against observed G3, with the line y = x for reference
#(geom = "jitter" replaces the deprecated position = "jitter" argument of qplot)
plot1 <- qplot(G3, predict(final1), data = testing, geom = "jitter",
         alpha=.8, main="Model 1") + 
         geom_abline(intercept=0, slope=1) +
         theme(legend.position="none")
plot2 <- qplot(G3, predict(final2), data = testing, geom = "jitter",
         alpha=.8, main="Model 2") + 
         geom_abline(intercept=0, slope=1) +
         theme(legend.position="none")
plot3 <- qplot(G3, predict(final3), data = testing, geom = "jitter",
         alpha=.8, main="Model 3") + 
         geom_abline(intercept=0,slope=1) +
         theme(legend.position="none")

#Recent versions of gridExtra take the overall title as `top` instead of `main`
grid.arrange(plot1,plot2,plot3,nrow=2,top="3 Models")

The most important influencers of the holistic model are:
- The school the student attends (school)
- Extra educational support from the school (schoolsup)
- Past failures (failures)
- Absences (absences)
- How often the student goes out (goout)

In [26]:
%%R
tester <- lm(G3 ~ . -G1 -G2, data=...

To leave a comment for the author, please follow the link and comment on their blog: NYC Data Science Academy » R.
