An exercise in non-linear modeling

September 7, 2014
By

(This article was first published on G-Forge » R, and kindly contributed to R-bloggers)

Finding the right curve can be tricky. The image is CC by Martin Gommel.
Finding the right curve can be tricky. The image is CC by Martin Gommel.

In my previous post I wrote about the importance of age and why it is a good idea to try avoiding modeling it as a linear variable. In this post I will go through multiple options for (1) modeling non-linear effects in a linear regression setting, (2) benchmark the methods on a real dataset, and (3) look at how the non-linearities actually look. The post is based on the supplement in my article on age and health-related quality of life (HRQoL).

Background

What is linearity?

Wikipedia has an excellent explanation of linearity:

linearity refers to a mathematical relationship or function that can be graphically represented as a straight line

Why do we assume linearity?

Linearity is a common assumption that is made when building a linear regression model. In a linear relation, a continuous variable has the same impact throughout the variable’s span. This makes the estimate is easy to interpret; an increase of one unit gives the corresponding coefficient’s change in the outcome. While this generates simple models with its advantages, it is difficult to believe that nature follows a simple straight line. With todays large data sets I believe that our models should permit non-linear effects.

How do we do non-linearity?

There are plenty of non-linear alternatives that can be used to better find the actual relationships. Most of them rely on converting the single continuous variable into several. The simplest form is when we transform the variable into a polynomial, e.g. instead of having the model:

HRQoL ~ β0 + βage * Age + βsex * Sex

We expand age to also include the age squared:

HRQoL ~ β0 + βage * Age + βage” * Age2 + βsex * Sex

This allows for the line for a bend, unfortunately as we add the squared term the coefficients are more difficult to interpret, and after adding a cubic term, i.e. Age3, it is almost impossible to interpret the coefficients. Due to this difficulty in interpretation I either use the rms::contrast function or the stats::predict in order to illustrate how the variable behaves.

Splines

A frequently used alternative to polynomials are splines. The most basic form of a spline consists of lines that are connected at different “knots”. For instance, a linear spline with 1 knot can assume a V-shape, while 2 knots allow for an N-shaped relationship. The number of knots decide the flexibility of a spline, i.e. more knots allow a more detailed description of the relationship.

The models

The dataset comes from the Swedish Hip Arthroplasty Register‘s excellent PROM database and consists of more than 30,000 patients that have filled out the EQ-5D form both before and after surgery. We will focus on age and its impact on the EQ-5D index and the EQ-VAS one year after total hip replacement surgery. We will control for sex, Charnley class, pre-operative HRQoL, pre-operative pain.

The restricted cubic spline

I’m a big fan of Frank Harrell‘s rms-package so we will start there although we start by splitting the dataset using the caret::createDataPartition. We then need to set the rms::datadist in order for the rms-functions to work as expected:

?View Code RSPLUS
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
library(caret)
train_no <- createDataPartition(1:nrow(my_dataset), 
                                list = FALSE,
                                p = .7)
train <- my_dataset[train_no, ]
test <- my_dataset[-train_no, ]
 
# A list with all the fits that are later to be benchmarked
fits <- list(eq5d = list(),
             eq_vas = list())
 
# The rms-setup
library(rms)
ddist <- datadist(train)
options("datadist" = "ddist")

Frank Harrell is a proponent of restricted cubic splines, alias natural cubic splines. This is a type of spline that uses cubic terms in the center of the data and restricts the ends to a straight line, preventing the center from distorting the ends, i.e. curling. His rcs() also nicely integrates with the anova in order to check if non-linearity actually exists:

?View Code RSPLUS
1
2
3
4
5
idx_model <- ols(eq5d1 ~ rcs(Age, 3) + 
                   Sex * Charnley_class + 
                   rcs(eq5d0, 3)+rcs(pain0, 3),
             data=train, x=TRUE, y=TRUE)
anova(idx_model, ss=FALSE)

gives:

                Analysis of Variance          Response: eq5d1 

 Factor                                              F      d.f. P     
 Age                                                 140.71  2   <.0001
  Nonlinear                                           41.07  1   <.0001
 Sex  (Factor+Higher Order Factors)                   26.94  3   <.0001
  All Interactions                                     6.80  2   0.0011
 Charnley_class  (Factor+Higher Order Factors)       275.08  4   <.0001
  All Interactions                                     6.80  2   0.0011
 eq5d0                                               522.37  2   <.0001
  Nonlinear                                           36.54  1   <.0001
 pain0                                                 3.85  2   0.0213
  Nonlinear                                            4.56  1   0.0328
 Sex * Charnley_class  (Factor+Higher Order Factors)   6.80  2   0.0011
 TOTAL NONLINEAR                                      30.48  3   <.0001
 TOTAL NONLINEAR + INTERACTION                        21.02  5   <.0001
 TOTAL                                               330.16 11   <.0001

Error d.f.: 19096 

As we can see the model seems OK for the EQ-5D index, supporting both the non-linearity and the interaction between Charnley class and sex. If we for some reason cannot use the rms-package we can check for linearity by using the splines::ns with two regular models as suggested below:

?View Code RSPLUS
1
2
3
4
5
6
7
8
9
lm1 <- lm(eq5d1 ~ Age + 
            Sex * Charnley_class + 
            ns(eq5d0, 3)+ns(pain0, 3),
          data=train)
lm2 <- lm(eq5d1 ~ ns(Age, 3) + 
            Sex * Charnley_class + 
            ns(eq5d0, 3)+ns(pain0, 3),
          data=train)
anova(lm1, lm2)

gives:

Analysis of Variance Table

Model 1: eq5d1 ~ Age + Sex * Charnley_class + ns(eq5d0, 3) + ns(pain0, 
    3)
Model 2: eq5d1 ~ ns(Age, 3) + Sex * Charnley_class + ns(eq5d0, 3) + ns(pain0, 
    3)
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
1  19095 193.01                                  
2  19093 192.66  2   0.35112 17.398 2.825e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

In order to avoid overfitting we will try to select models based upon the AIC/BIC criteria. The selection is simply finding the lowest value where in general AIC allows slightly more complex models compared to BIC. We will start with finding the optimal number of knots for the EQ-5D index using the AIC:

?View Code RSPLUS
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
#' A simple function for updating the formula and extracting
#' the information criteria
#' 
#' @param no A number that is used together with the add_var_str
#' @param fit A regression fit that is used for the update
#' @param rm_var The variable that is to be substituted
#' @param add_var_str A sprintf() string that accepts the no
#'  parameter for each update
#' @param ic_fn The information criteria function (AIC/BIC)
getInfCrit <- function(no, fit, rm_var, add_var_str, ic_fn) {
  new_var <- sprintf(add_var_str, no)
  updt_frml <- as.formula(sprintf(".~.-%s+%s", 
                                  rm_var,
                                  new_var))
  ret <- ic_fn(update(fit, updt_frml))
  names(ret) <- new_var
  return(ret)}
 
# We start with a model where the other splines 
# have been AIC-selected
idx_model <- ols(eq5d1 ~ rcs(Age, 3) + 
                   Sex * Charnley_class + 
                   rcs(eq5d0, 8)+rcs(pain0, 6),
             data=train, x=TRUE, y=TRUE)
 
sapply(3:9, getInfCrit, fit = idx_model,
  rm_var = "rcs(Age, 3)", add_var_str = "rcs(Age, %d)", ic_fn=AIC)
# rcs(Age, 3) rcs(Age, 4) rcs(Age, 5) rcs(Age, 6) rcs(Age, 7) rcs(Age, 8) rcs(Age, 9) 
#  -33678.89   -33674.68   -33686.30   -33686.93   -33685.95   -33686.73   -33699.37 
 
fits$eq5d[["RCS with AIC"]] <- update(idx_model, .~.-rcs(Age, 3)+rcs(Age, 5))

It can be discussed if the model should stop at 3 knots but I chose to continue a little higher as the drop was relatively large between the 4 and 5 knots. This is most likely due to a unfortunate fit for the 4 knots. We could also have motivated a larger number of knots but even with proper visualization these are difficult to interpret. When modeling confounders, such as the preoperative EQ-5D index (eq5d0) and the pre-operative pain (pain0), we may prefer a higher number of knots in order to avoid any residual confounding and we do not need to worry about visualizing/explaining the relations.

Now if we apply the same methodology to the EQ-VAS we get:

?View Code RSPLUS
1
2
3
4
5
6
7
8
vas_model <- ols(eq_vas1 ~ Sex * Charnley_cat + Sex + rcs(Age, 3) + 
                   rcs(eq_vas0, 3) + OpNr + rcs(pain0, 3),
                 data=train, x=TRUE, y=TRUE)
sapply(3:9, getInfCrit, fit = vas_model,
  rm_var = "rcs(Age, 3)", add_var_str = "rcs(Age, %d)", ic_fn=AIC)
# rcs(Age, 3) rcs(Age, 4) rcs(Age, 5) rcs(Age, 6) rcs(Age, 7) rcs(Age, 8) rcs(Age, 9) 
#    166615.6    166619.8    166600.2    166600.1    166602.0    166603.0    166596.7 
fits$eq_vas[["RCS with AIC"]] <- update(vas_model, .~.-rcs(Age, 3)+rcs(Age, 5))

Now lets do the same for the BIC:

?View Code RSPLUS
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
idx_model <- ols(eq5d1 ~ rcs(Age, 3) + 
                   Sex * Charnley_cat + 
                   rcs(eq5d0, 3)+rcs(pain0, 3),
                 data=train, x=TRUE, y=TRUE)
 
sapply(3:9, getInfCrit, fit = idx_model,
  rm_var = "rcs(Age, 3)", add_var_str = "rcs(Age, %d)", ic_fn=BIC)
# rcs(Age, 3) rcs(Age, 4) rcs(Age, 5) rcs(Age, 6) rcs(Age, 7) rcs(Age, 8) rcs(Age, 9) 
#   -33486.35   -33474.16   -33477.95   -33470.69   -33462.17   -33455.28   -33459.98 
fits$eq5d[["RCS with BIC"]] <- idx_model
 
vas_model <- ols(eq_vas1 ~ rcs(Age, 3) + 
                   Sex * Charnley_cat + 
                   rcs(eq_vas0, 3) + OpNr + rcs(pain0, 3),
                 data=train, x=TRUE, y=TRUE)
 
sapply(3:9, getInfCrit, fit = vas_model,
  rm_var = "rcs(Age, 3)", add_var_str = "rcs(Age, %d)", ic_fn=BIC)
# rcs(Age, 3) rcs(Age, 4) rcs(Age, 5) rcs(Age, 6) rcs(Age, 7) rcs(Age, 8) rcs(Age, 9) 
#    166725.7    166737.8    166726.0    166733.8    166743.5    166752.4    166754.0 
fits$eq_vas[["RCS with BIC"]] <- vas_model

B-splines

B-splines, alias Basis spline, are common alternatives to restricted cubic splines that also use knots to control for flexibility. As these are not restricted at the ends they have more flexible tails than restricted cubic splines. In order to get a comparison we will use the same model for the other variables:

?View Code RSPLUS
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
# Use the same setting model as used in the RCS
vars <- attr(terms(fits$eq5d[["RCS with AIC"]]), "term.labels")
rm_var <- vars[grep("Age", vars, fixed = TRUE)]
idx_model <- update(fits$eq5d[["RCS with AIC"]], 
                    sprintf(".~.-%s+bs(Age, 3)", rm_var))
sapply(3:9, getInfCrit, fit = idx_model,
  rm_var = "bs(Age, 3)", add_var_str = "bs(Age, %d)", ic_fn=AIC)
# bs(Age, 3) bs(Age, 4) bs(Age, 5) bs(Age, 6) bs(Age, 7) bs(Age, 8) bs(Age, 9) 
#  -33669.07  -33683.35  -33680.55  -33683.44  -33683.03  -33681.79  -33685.55 
# Chose 6 knots for illustration as it otherwise be the
# same as the BIC model - not that interesting
fits$eq5d[["BS with AIC"]] <- 
  update(idx_model, .~.-bs(Age, 3)+bs(Age, 6))
 
vars <- attr(terms(fits$eq5d[["RCS with BIC"]]), "term.labels")
rm_var <- vars[grep("Age", vars, fixed = TRUE)]
idx_model <- update(fits$eq5d[["RCS with BIC"]], 
                    sprintf(".~.-%s+bs(Age, 3)", rm_var))
sapply(3:9, getInfCrit, fit = idx_model,
  rm_var = "bs(Age, 3)", add_var_str = "bs(Age, %d)", ic_fn=BIC)
# bs(Age, 3) bs(Age, 4) bs(Age, 5) bs(Age, 6) bs(Age, 7) bs(Age, 8) bs(Age, 9) 
#  -33468.29  -33475.09  -33464.40  -33459.42  -33451.12  -33442.38  -33438.71 
fits$eq5d[["BS with BIC"]] <- 
  update(idx_model, .~.-bs(Age, 3)+bs(Age, 4))
 
vars <- attr(terms(fits$eq_vas[["RCS with AIC"]]), "term.labels")
rm_var <- vars[grep("Age", vars, fixed = TRUE)]
vas_model <- update(fits$eq_vas[["RCS with AIC"]], 
                    sprintf(".~.-%s+bs(Age, 3)", rm_var))
sapply(3:9, getInfCrit, fit = vas_model,
  rm_var = "bs(Age, 3)", add_var_str = "bs(Age, %d)", ic_fn=AIC)
# bs(Age, 3) bs(Age, 4) bs(Age, 5) bs(Age, 6) bs(Age, 7) bs(Age, 8) bs(Age, 9) 
#   166640.3   166617.2   166621.1   166612.9   166613.2   166614.8   166615.0 
fits$eq_vas[["BS with AIC"]] <- update(vas_model, .~.-bs(Age, 3)+bs(Age, 6))
 
vars <- attr(terms(fits$eq_vas[["RCS with BIC"]]), "term.labels")
rm_var <- vars[grep("Age", vars, fixed = TRUE)]
vas_model <- update(fits$eq_vas[["RCS with BIC"]], 
                    sprintf(".~.-%s+bs(Age, 3)", rm_var))
sapply(3:9, getInfCrit, fit = vas_model,
  rm_var = "bs(Age, 3)", add_var_str = "bs(Age, %d)", ic_fn=BIC)
# bs(Age, 3) bs(Age, 4) bs(Age, 5) bs(Age, 6) bs(Age, 7) bs(Age, 8) bs(Age, 9) 
#   166750.4   166735.1   166746.9   166746.5   166754.7   166764.2   166772.2 
fits$eq_vas[["BS with BIC"]] <- update(vas_model, .~.-bs(Age, 3)+bs(Age, 4))

Polynomials

Polynomials can be of varying degrees and have often been argued as a simple approach to fitting a more flexible non-linear relationship. As the majority of the patients are located around the mean age, 69.1 years, it is important to remember that these patients will have the strongest influence on the curve appearance. It is therefore possible that the tails are less reliable than the central portion.

?View Code RSPLUS
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
# Use the same setting model as used in the RCS
vars <- attr(terms(fits$eq5d[["RCS with AIC"]]), "term.labels")
rm_var <- vars[grep("Age", vars, fixed = TRUE)]
idx_model <- update(fits$eq5d[["RCS with AIC"]], 
                    sprintf(".~.-%s+poly(Age, 2)", rm_var))
sapply(2:9, getInfCrit, fit = idx_model,
  rm_var = "poly(Age, 2)", add_var_str = "poly(Age, %d)", ic_fn=AIC)
# poly(Age, 2) poly(Age, 3) poly(Age, 4) poly(Age, 5) poly(Age, 6) poly(Age, 7) poly(Age, 8) poly(Age, 9) 
#    -33669.58    -33669.07    -33680.10    -33678.83    -33681.82    -33681.89    -33680.35    -33680.17 
fits$eq5d[["Polynomial with AIC"]] <- 
  update(idx_model, .~.-poly(Age, 2)+poly(Age, 6))
 
vars <- attr(terms(fits$eq5d[["RCS with BIC"]]), "term.labels")
rm_var <- vars[grep("Age", vars, fixed = TRUE)]
idx_model <- update(fits$eq5d[["RCS with BIC"]], 
                    sprintf(".~.-%s+poly(Age, 2)", rm_var))
sapply(2:9, getInfCrit, fit = idx_model,
  rm_var = "poly(Age, 2)", add_var_str = "poly(Age, %d)", ic_fn=BIC)
# poly(Age, 2) poly(Age, 3) poly(Age, 4) poly(Age, 5) poly(Age, 6) poly(Age, 7) poly(Age, 8) poly(Age, 9) 
#    -33476.79    -33468.29    -33471.83    -33462.59    -33457.76    -33450.37    -33440.97    -33432.99 
fits$eq5d[["Polynomial with BIC"]] <- idx_model
 
vars <- attr(terms(fits$eq_vas[["RCS with AIC"]]), "term.labels")
rm_var <- vars[grep("Age", vars, fixed = TRUE)]
vas_model <- update(fits$eq_vas[["RCS with AIC"]], 
                    sprintf(".~.-%s+poly(Age, 2)", rm_var))
sapply(2:9, getInfCrit, fit = vas_model,
  rm_var = "poly(Age, 2)", add_var_str = "poly(Age, %d)", ic_fn=AIC)
# poly(Age, 2) poly(Age, 3) poly(Age, 4) poly(Age, 5) poly(Age, 6) poly(Age, 7) poly(Age, 8) poly(Age, 9) 
#     166638.4     166640.3     166622.3     166623.9     166615.7     166616.8     166617.2     166617.5 
fits$eq_vas[["Polynomial with AIC"]] <- 
  update(vas_model, .~.-poly(Age, 2)+poly(Age, 6))
 
vars <- attr(terms(fits$eq_vas[["RCS with BIC"]]), "term.labels")
rm_var <- vars[grep("Age", vars, fixed = TRUE)]
vas_model <- update(fits$eq_vas[["RCS with BIC"]], 
                    sprintf(".~.-%s+poly(Age, 2)", rm_var))
sapply(2:9, getInfCrit, fit = vas_model,
  rm_var = "poly(Age, 2)", add_var_str = "poly(Age, %d)", ic_fn=BIC)
# poly(Age, 2) poly(Age, 3) poly(Age, 4) poly(Age, 5) poly(Age, 6) poly(Age, 7) poly(Age, 8) poly(Age, 9) 
#     166740.6     166750.4     166740.2     166749.7     166749.3     166758.3     166766.6     166774.7 
fits$eq_vas[["Polynomial with BIC"]] <- 
  update(vas_model, .~.-poly(Age, 2)+poly(Age, 4))

Multiple Fractional Polynomial Models

Multiple fractional polynomials (MFP) have been proposed as an alternative to splines. These use a defined set of exponential transformations of the variable, where it omits predictors according to their p-values. The mfp has a built-in algorithm and we don't need to rely on either BIC or AIC with MFP.

?View Code RSPLUS
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
library(mfp)
 
# We will use the simple BIC models and
# instead of rcs() that is not available
# for mfp we use ns() from the splines package
fits$eq5d[["MFP"]] <- 
  mfp(eq5d1 ~ fp(Age) + 
        Sex * Charnley_class + 
        ns(eq5d0, 3)+ns(pain0, 3),
      data=train)
fits$eq_vas[["MFP"]] <- 
  mfp(eq_vas1 ~ fp(Age) + 
        Sex * Charnley_class + 
        ns(eq_vas0, 3)+ns(pain0, 3),
      data=train)

Generalized additive models

Generalized additive model (GAM) are an extension to generalized linear models and specializes in non-linear relationships. Elements of statistical learning by Hastie et. al. is an excellent source for diving deeper into these. The simplest way to explain the GAM smoothers is that they penalize the flexibility in order to avoid over-fitting, there plenty of options - the ones used here are:

Thin plate regression splines: This is generally the most common type of smoother in GAM models. The name refers to the physical analogy of bending a thin sheet of metal.
Cubic regression splines: The basis for the spline is cubic with evenly spread knots.
P-splines: P-splines are similar to B-splines in that they share basis with the main difference that P-splines enforce a penalty on the coefficients.

?View Code RSPLUS
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
library(mgcv)
 
fits$eq5d[["GAM thin plate"]] <- 
  gam(eq5d1 ~ s(Age, bs="tp") + 
        Sex * Charnley_class + 
        ns(eq5d0, 3) + ns(pain0, 3),
      data=train)
fits$eq_vas[["GAM thin plate"]] <- 
  gam(eq_vas1 ~ s(Age, bs="tp") + 
        Sex * Charnley_class + 
        ns(eq_vas0, 3) + ns(pain0, 3), 
      data=train)
 
fits$eq5d[["GAM cubic regression"]] <-
  update(fits$eq5d[["GAM thin plate"]], 
         .~.-s(Age, bs="tp")+s(Age, bs="cr"))
 
fits$eq_vas[["GAM cubic regression"]] <- 
  update(fits$eq_vas[["GAM thin plate"]], 
         .~.-s(Age, bs="tp")+s(Age, bs="cr"))
 
 
fits$eq5d[["GAM P-splines"]] <-
  update(fits$eq5d[["GAM thin plate"]], 
         .~.-s(Age, bs="tp")+s(Age, bs="ps"))
 
fits$eq_vas[["GAM P-splines"]] <- 
  update(fits$eq_vas[["GAM thin plate"]], 
         .~.-s(Age, bs="tp")+s(Age, bs="ps"))

Benchmark

With all these fancy models we will first try to evaluate how they perform when cross-validated and then on our test-set that we've set apart at the start. We will evaluate according to root-mean-square error (RMSE) and mean absolute percentage error (MAPE). RMSE tends to penalize for having outliers while the MAPE is more descriptive of the performance on average. Our testing functions will therefore be:

?View Code RSPLUS
1
2
3
4
5
6
7
8
rmse <- function(fit, newdata, outcome_var){
  pred <- predict(fit, newdata=newdata)
  sqrt(mean((newdata[, outcome_var]-pred)^2, na.rm=TRUE))
}
mape <- function(fit, newdata, outcome_var){
  pred <- predict(fit, newdata=newdata)
  mean(abs(newdata[, outcome_var]-pred)/mean(newdata[, outcome_var], na.rm=TRUE)*100, na.rm=TRUE)
}

Furthermore in this particular article I wanted to look into what happens at the tails as almost 70 % of the patients were between 60 and 80 years while the variable span was 40 to 100 years. I therefore defined a central vs peripheral portion where the central portion was defined as being between the 15:th and 85:th percentile.

The cross-validation was a straight forward implementation. As this can take a little time it is useful to think about parallelization, we will here use the parallel-package although the foreach is also an excellent option.

?View Code RSPLUS
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
crossValidateInParallel <- function(fit, df, 
                                    outcome_var, 
                                    k=10, tail_prop=.15){
  df$central <- with(df, Age >= quantile(Age, probs=tail_prop) & 
                       Age <= quantile(Age, probs=1-tail_prop))
  df$cv <- rep(1:k, length.out=nrow(df))
 
  cv_internal_fn <- function(x, cv_fit,
                             df, outcome_var){
    lc_train <- subset(df, cv != x)
    lc_test <- subset(df, cv == x)
 
    # In the original version I had a BIC/AIC optimizing
    # function that worked with the cross-validation
    # Here, we'll keep it simple and just use the previously
    # chosen functions
    cv_fit <- update(fit, data=lc_train)
 
    return(c(Main_RMSE = 
               rmse(cv_fit, 
                    newdata=lc_test, 
                    outcome_var=outcome_var),
             Central_RMSE = 
               rmse(cv_fit, 
                    newdata=subset(lc_test, central == TRUE), 
                    outcome_var=outcome_var),
             Peripheral_RMSE = 
               rmse(cv_fit, 
                    newdata=subset(lc_test, central ==  FALSE), 
                    outcome_var=outcome_var),
             Main_MAPE = 
               mape(cv_fit, 
                    newdata=lc_test, 
                    outcome_var=outcome_var),
             Central_MAPE = 
               mape(cv_fit, 
                    newdata=subset(lc_test, central == TRUE), 
                    outcome_var=outcome_var),
             Peripheral_MAPE = 
               mape(cv_fit, 
                    newdata=subset(lc_test, central ==  FALSE), 
                    outcome_var=outcome_var)))
    }
 
  # It is convenient to use the detectCores() function in
  # order to use the machines full capacity. Subtracting
  # 1 is also preferable as it prohibits the computer from
  # stopping other tasks, i.e. you can keep surfing the web 🙂
  cl <- makeCluster(mc <- getOption("cl.cores", ifelse(detectCores() <= 1,
                                                       1,
                                                       detectCores() - 1)))
  # In Windows each cores starts out fresh and we therefore need
  # to export the functions, data etc so that they can access
  # it as expected or you will get nasty errors
  clusterEvalQ(cl, library(mfp))
  clusterEvalQ(cl, library(mgcv))
  clusterEvalQ(cl, library(rms))
  clusterEvalQ(cl, library(splines))
  clusterEvalQ(cl, library(stringr))
  clusterExport(cl, c("rmse", "mape"))
 
  res <- parallel::parLapply(1:k,
                             cl=cl,
                             fun=cv_internal_fn,
                             outcome_var=outcome_var,
                             cv_fit = fit,
                             df=df)
 
  stopCluster(cl)
  res <- as.data.frame(Gmisc::mergeLists(lapplyOutput=res))
  ret <- colMeans(res)
  attr(ret, "raw") <- res
  return(ret)
}

If we run all the models we get the following result:

RMSE   MAPE
Method Main Central Peripheral   Main Central Peripheral
The EQ-5D index
Restricted cubic splines
  AIC 0.100 0.098 0.105   8.83 8.60 9.42
  BIC 0.100 0.099 0.105   8.86 8.63 9.46
B-splines
  AIC 0.100 0.099 0.105   8.85 8.80 9.48
  BIC 0.101 0.099 0.105   8.91 8.79 9.48
Polynomials
  AIC 0.103 0.105 0.117   8.94 9.09 10.21
  BIC 0.103 0.105 0.116   8.96 9.07 10.09
MFP 0.101 0.099 0.105   8.87 8.64 9.43
Generalized additive models
  Thin plate 0.100 0.098 0.105   8.86 8.62 9.47
  Cubic regression 0.100 0.098 0.105   8.86 8.62 9.47
  P-splines 0.100 0.098 0.105   8.86 8.62 9.47
The EQ VAS
Restricted cubic splines
  AIC 18.54 18.49 18.66   19.0 18.7 19.7
  BIC 18.54 18.49 18.67   19.0 18.7 19.7
B-splines
  AIC 18.54 18.66 18.65   19.0 19.1 19.7
  BIC 18.55 18.66 18.69   19.1 19.0 19.7
Polynomials
  AIC 19.27 20.06 21.81   19.5 20.1 22.9
  BIC 19.24 19.96 21.67   19.5 20.0 22.6
MFP 18.55 18.50 18.69   19.0 18.7 19.7
Generalized additive models
  Thin plate 18.54 18.48 18.67   19.0 18.7 19.7
  Cubic regression 18.54 18.48 18.67   19.0 18.7 19.7
  P-splines 18.54 18.49 18.67   19.0 18.7 19.7

The difference between is almost negligable although the polynomial is clearly not performing as well as the other methods. This was though less pronounced in the testset where the polynomial had some issues with the tails:

RMSE   MAPE
Method Main Central Peripheral   Main Central Peripheral
The EQ-5D index
Restricted cubic splines
  AIC 0.100 0.098 0.104   8.75 8.56 9.23
  BIC 0.100 0.099 0.104   8.79 8.60 9.27
B-splines
  AIC 0.100 0.099 0.104   8.77 8.77 9.30
  BIC 0.100 0.099 0.105   8.85 8.78 9.34
Polynomials
  AIC 0.100 0.099 0.106   8.73 8.66 9.18
  BIC 0.101 0.099 0.106   8.78 8.69 9.15
MFP 0.100 0.099 0.104   8.79 8.61 9.25
Generalized additive models
  Thin plate 0.100 0.098 0.105   8.79 8.59 9.29
  Cubic regression 0.100 0.098 0.104   8.79 8.60 9.27
  P-splines 0.100 0.099 0.104   8.79 8.60 9.28
The EQ VAS
Restricted cubic splines
  AIC 18.70 18.50 19.18   19.1 18.7 20.1
  BIC 18.71 18.51 19.19   19.1 18.8 20.2
B-splines
  AIC 18.70 18.71 19.18   19.2 19.2 20.2
  BIC 18.71 18.71 19.17   19.2 19.1 20.2
Polynomials
  AIC 18.74 18.80 19.67   19.1 19.0 20.4
  BIC 18.74 18.75 19.64   19.1 19.0 20.4
MFP 18.72 18.52 19.19   19.2 18.8 20.1
Generalized additive models
  Thin plate 18.69 18.50 19.17   19.1 18.7 20.1
  Cubic regression 18.69 18.50 19.17   19.1 18.7 20.1
  P-splines 18.69 18.50 19.17   19.1 18.7 20.1
?View Code RSPLUS
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
cv_results <- 
  list(eq5d = NULL,
       eq_vas = NULL)
for (group in names(fits)){
  for (method in names(fits[[group]])){
    cv_results[[group]] <-
      rbind(cv_results[[group]],
            crossValidateInParallel(fits[[group]][[method]],
                                    df=train, 
                                    outcome_var = sprintf("%s1", group)))
 
  }
  rownames(cv_results[[group]]) <-
    names(fits[[group]])
}
 
out <- rbind(t(apply(cv_results$eq5d, 1, 
                     function(x) c(sprintf("%.3f", x[1:3]),
                                   sprintf("%.2f", x[4:6])))),
             t(apply(cv_results$eq_vas, 1, 
                     function(x) c(sprintf("%.2f", x[1:3]),
                                   sprintf("%.1f", x[4:6])))))
cgroup <- unique(gsub("^.+_", "", colnames(cv_results$eq5d)))
n.cgroup <- as.numeric(table(gsub("^.+_", "", colnames(cv_results$eq5d))))
colnames(out) <- gsub("_.+$", "", colnames(cv_results$eq5d))
rownames(out) <- capitalize(gsub("(RCS|BS|GAM|Polynomial) (with |)", "",
                                 c(names(fits$eq5d),
                                   names(fits$eq_vas))))
htmlTable(out, 
          rgroupCSSstyle = "",
          rowlabel = "Method", 
          cgroup = cgroup, n.cgroup = n.cgroup, 
          rgroup = rep(c("Restricted cubic splines", "B-splines", "Polynomials", "", "Generalized additive models"), 2),
          n.rgroup = rep(c(2, 2, 2, 1, 3), 2),
          tspanner = c("The EQ-5D index",
                       "The EQ VAS"),
          n.tspanner = sapply(cv_results, nrow),
          ctable=TRUE)
test$central <- with(test, 
                     Age >= quantile(Age, probs=.15) & 
                       Age <= quantile(Age, probs=.85))
 
testset_results <- 
  list(eq5d = NULL,
       eq_vas = NULL)
for (group in names(fits)){
  outcome_var <- sprintf("%s1", group)
  for (method in names(fits[[group]])){
    testset_results[[group]] <-
      rbind(testset_results[[group]],
            c(Main_RMSE = 
               rmse(fits[[group]][[method]], 
                    newdata=test, 
                    outcome_var=outcome_var),
             Central_RMSE = 
               rmse(fits[[group]][[method]], 
                    newdata=subset(test, central == TRUE), 
                    outcome_var=outcome_var),
             Peripheral_RMSE = 
               rmse(fits[[group]][[method]], 
                    newdata=subset(test, central ==  FALSE), 
                    outcome_var=outcome_var),
             Main_MAPE = 
               mape(fits[[group]][[method]], 
                    newdata=test, 
                    outcome_var=outcome_var),
             Central_MAPE = 
               mape(fits[[group]][[method]], 
                    newdata=subset(test, central == TRUE), 
                    outcome_var=outcome_var),
             Peripheral_MAPE = 
               mape(fits[[group]][[method]], 
                    newdata=subset(test, central ==  FALSE), 
                    outcome_var=outcome_var)))
  }
  rownames(testset_results[[group]]) <-
    names(fits[[group]])
}
 
 
out <- rbind(t(apply(testset_results$eq5d, 1, 
                     function(x) c(sprintf("%.3f", x[1:3]),
                                   sprintf("%.2f", x[4:6])))),
             t(apply(testset_results$eq_vas, 1, 
                     function(x) c(sprintf("%.2f", x[1:3]),
                                   sprintf("%.1f", x[4:6])))))
cgroup <- unique(gsub("^.+_", "", colnames(testset_results$eq5d)))
n.cgroup <- as.numeric(table(gsub("^.+_", "", colnames(testset_results$eq5d))))
colnames(out) <- gsub("_.+$", "", colnames(testset_results$eq5d))
rownames(out) <- capitalize(gsub("(RCS|BS|GAM|Polynomial) (with |)", "",
                                 c(names(fits$eq5d),
                                   names(fits$eq_vas))))
htmlTable(out, 
          rgroupCSSstyle = "",
          rowlabel = "Method", 
          cgroup = cgroup, n.cgroup = n.cgroup, 
          rgroup = rep(c("Restricted cubic splines", "B-splines", "Polynomials", "", "Generalized additive models"), 2),
          n.rgroup = rep(c(2, 2, 2, 1, 3), 2),
          tspanner = c("The EQ-5D index",
                       "The EQ VAS"),
          n.tspanner = sapply(testset_results, nrow),
          ctable=TRUE)

Graphs

Now lets look at how the relations found actually look. In order to quickly style all graphs I use a common setup:

?View Code RSPLUS
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
# The ggplot look and feel
my_theme <-   theme_bw() +
  theme(legend.position = "bottom", 
        legend.text=element_text(size=7),
        axis.title.x = element_text(size=9),
        axis.title.y = element_text(size=9),
        axis.text.x  = element_text(size=8),
        axis.text.y  = element_text(size=8))
 
 
getPredDf <- function(ds, eq_type, eq_values, Age){
  new_data <- data.frame(Age = Age)
  new_data$Sex="Male"
  new_data[[sprintf("%s0", eq_type)]] = eq_values
  new_data$Charnley_class="A"
  new_data$pain0 = median(train$pain0, na.rm=TRUE)
  new_data$OpNr =  1
 
  ret <- list("Best" = new_data)
  new_data$Sex="Female"
  new_data$Charnley_class="C"
  new_data$OpNr =  2
  ret[["Worst"]] <- new_data
 
  return(ret)
}
 
getAgePredDf <- function(ds, eq_type){
  mode_for_preop_eq <- as.numeric(names(which.max(table(ds[[sprintf("%s0", eq_type)]]))))
  Age <- seq(from=40, to=90, by=.1)
  return(getPredDf(ds, eq_type, 
                   eq_values = rep(mode_for_preop_eq, length.out=length(Age)),
                   Age=Age))
}
 
getRmsPred <- function(nd, model, eq_type){
  new_data <- nd[["Best"]]
  best_age_spl <- predict(model, newdata=new_data, conf.type="mean", conf.int=.95)
  new_data <- nd[["Worst"]]
  worst_age_spl <- predict(model, newdata=new_data, conf.type="mean", conf.int=.95)
 
  getDf <- function(pred, newdata, eq_type){
    df <- data.frame(Age = newdata$Age,
                     Lower = pred$lower,
                     Upper = pred$upper)
    df[[sprintf("%s1", eq_type)]] = pred$linear.predictors
    df[[sprintf("%s0", eq_type)]] = newdata[[sprintf("%s0", eq_type)]]
    df$Pain = newdata$pain0
    df$OpNr = newdata$OpNr
    return(df)
  }
  df <- getDf(best_age_spl, new_data, eq_type)
  tmp <- getDf(worst_age_spl, new_data, eq_type)
  df$Cat = 1
  tmp$Cat = 2
  df <- rbind(df, tmp)
  rm(tmp)
  df$Cat <- factor(as.numeric(df$Cat), 
                   labels=c(" Sex: Male n Charnley class: A n First THR  ", 
                            " Sex: Female n Charnley class: C n Previous contralateral THR  "))
  return(df)
}
 
getGlmPred <- function(nd, model, eq_type){
  new_data <- nd[["Best"]]
  best_age_spl <- predict(model, newdata=new_data, se.fit = TRUE)
  new_data <- nd[["Worst"]]
  worst_age_spl <- predict(model, newdata=new_data, se.fit = TRUE)
 
  getDf <- function(pred, newdata, eq_type){
    df <- data.frame(Age = newdata$Age,
                     Lower = pred$fit - qnorm(0.975)*pred$se.fit,
                     Upper = pred$fit + qnorm(0.975)*pred$se.fit)
    df[[sprintf("%s1", eq_type)]] = pred$fit
    df[[sprintf("%s0", eq_type)]] = newdata[[sprintf("%s0", eq_type)]]
    df$Pain = newdata$pain0
    df$OpNr = newdata$OpNr
    return(df)
  }
  df <- getDf(best_age_spl, new_data, eq_type)
  tmp <- getDf(worst_age_spl, new_data, eq_type)
  df$Cat = 1
  tmp$Cat = 2
  df <- rbind(df, tmp)
  rm(tmp)
  df$Cat <- factor(as.numeric(df$Cat), 
                   labels=c(" Sex: Malen Charnley class: A n First THR  ", 
                            " Sex: Femalen Charnley class: C n Previous contralateral THR  "))
  return(df)
}
 
g_legend<-function(a.gplot){
  tmp <- ggplot_gtable(ggplot_build(a.gplot))
  leg <- which(sapply(tmp$grobs, function(x) x$name) == "guide-box")
  legend <- tmp$grobs[[leg]]
  return(legend)
}
 
plot2Lines <- function(eq5d_df, vas_df){
  clrs <- brewer.pal(3, "Pastel1")
  clrs <- GISTools::add.alpha(clrs, .5)
  clrs <- clrs[c language="(2,1,3)"][/c]
  gg_eq5d <- ggplot(data=eq5d_df, aes(x=Age, y=eq5d1, fill=Cat, linetype=Cat)) +
    geom_hline(yintercept=eq5d_df$eq5d0[1], lwd=1, lty=2, colour="#00640055") +
    geom_ribbon(aes(ymin=Lower, ymax=Upper)) +
    geom_line() +
    ylab("EQ-5D index one year after surgery") +
    xlab("Age (years)") +
    scale_x_continuous(expand=c(0,0))+
    scale_linetype_manual(name = "", values=c(1,2)) +
    scale_fill_manual(name = "", 
                      values = clrs[1:2],
                      guide="none")  +
    my_theme
 
  gg_eqvas <- ggplot(data=vas_df, aes(x=Age, y=eq_vas1, fill=Cat, linetype=Cat)) +
    geom_hline(yintercept=vas_df$eq_vas0[1], lwd=1, lty=2, colour="#00640055") +
    geom_ribbon(aes(ymin=Lower, ymax=Upper)) +
    geom_line() +
    theme_bw() +
    ylab("EQ VAS one year after surgery") +
    xlab("Age (years)") +
    scale_x_continuous(expand=c(0,0))+
    scale_linetype_manual(name = "", values=c(1,2)) +
    scale_fill_manual(name = "", 
                      values = clrs[1:2]) +
    theme(legend.position = "bottom") +
    my_theme
 
 
  mylegend<-g_legend(gg_eqvas)
 
  grid.arrange(arrangeGrob(gg_eq5d + theme(legend.position="none"),
                           gg_eqvas + theme(legend.position="none"),
                           nrow=1),
               mylegend, nrow=2,heights=c(10, 2))
 
}

It is important to remember that different algorithms will find different optima and may therefore seem different to the eye even though they fit the data equally well. I think of it as a form of skeleton that defines how it can move, it will try to adapt the best it can but within its boundaries. We can for instance bend our elbow but not the forearm (unless you need my surgical skill).

Restricted cubic splines

AIC

RCS_AIC

BIC

RCS_BIC

B-splines

AIC

BS_AIC

BIC

BS_BIC

Polynomials

Note that the y-scale differs for the polynomials

AIC

Poly_AIC

BIC

Poly_BIC

MFP - multiple fractional polynomial

MFP_plot

Generalized additive models

Thin plate regression splines

GAM_tp_plot

Cubic regression splines

GAM_cr_plot

P-splines

GAM_ps_plot

Summary

I hope you enjoyed the post and found something useful. If there is some model that you lack, write up some code and I can see if it runs. Note that there are differences from the original supplement that used a slightly more complex setup for choosing the number of knots and a slightly different dataset.

flattr this!

To leave a comment for the author, please follow the link and comment on their blog: G-Forge » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)