The rms package offers a variety of tools to build and evaluate regression models in R. Originally named ‘Design’, the package accompanies the book “Regression Modeling Strategies” by Frank Harrell, which is essential reading for anyone who works in the ‘data science’ space. Over the past year or so, I have transitioned my personal modeling scripts to rms as it makes things such as bootstrapping, model validation, and plotting predicted probabilities easier to do. While the package is fairly well documented, I wanted to put together a ‘simpler’ and more accessible introduction that would explain to R-beginners how they could start using the rms package. For those with limited statistics training, I strongly suggest reading “Clinical Prediction Models” and working your way up to “Regression Modeling Strategies”.
We start this introduction to the rms package with the datadist function, which computes statistical summaries of predictors to automate estimation and plotting of effects. The user will generally supply the final data frame to the datadist function and set the data distribution using the options function. Note that if you modify the data in your data frame, then you will need to reset the distribution summaries using datadist.
my.df = data.frame(one=c(rnorm(100)), two=c(rnorm(100)), y=rnorm(100)) dd = datadist(my.df) options(datadist="dd") my.df_new = subset(my.df, two >= 0.05) ddd <- datadist(my.df_new) options( datadist = "ddd" ) ddd
The main functions to estimate models in rms are ols for linear models and lrm for logistic regression or ordinal logistic regression. There are also a few other functions for performing survival analysis,but they will not be covered in this post.
getHdata(prostate) head(prostate) ddd <- datadist(prostate) options( datadist = "ddd" )
Using the prostate data, we built a linear model using ordinary least squares estimation. The argument x and y must be set to true if we plan on evaluate the model in later stages using the validate and calibrate functions. Because we haven’t altered the default contrasts or incorporated any smoothing splines, the coefficients and standard errors should be identical to the results of lm. Use the model variable name to see the estimates from the model and use the summary function to get an overview of the effects of each predictor on the response variable. One important thing to note that the effect point estimates in the summary.rms output relate to the estimated effect of an inter-quartile range increase in the predictor variable.
lmod = ols(wt ~ age + sbp + rx, data=prostate, x=TRUE, y=TRUE) lmod summary(lmod) summary(lmod, age=c(50,70))
This may not seem like anything to write home about. But what makes the rms package special is that it makes the modeling process significantly easier. For the above linear regression model, let’s plot the predicted values and perform internal bootstrapped validation of the model. In the following code, the validate function is used to assess model fit and calibrate is used to assess if the observed responses agree with predicted responses.
plot(anova(lmod), what='proportion chisq') # relative importance plot(Predict(lmod)) # predicted values rms::validate(lmod, method="boot", B=500) # bootstrapped validation my.calib <- rms::calibrate(lmod, method="boot", B=500) # model calibration plot(my.calib, las=1) vif(lmod) # test for multicolinearity Predict(lmod)
Let us now build a logistic regression model using the lrm function, plot the expected probabilities, and evaluate the model. We also use the pentrace function to perform logistic regression with penalized maximum likelihood estimation.
mod1 = lrm(as.factor(bm) ~ age + sbp + rx, data=prostate, x=TRUE, y=TRUE) mod1 summary(mod1) plot(anova(mod1), what='proportion chisq') # relative importance plot(Predict(mod1, fun=plogis)) # predicted values rms::validate(mod1, method="boot", B=500) # bootstrapped validation my.calib <- rms::calibrate(mod1, method="boot", B=500) # model calibration plot(my.calib, las=1) penalty <- pentrace(mod1, penalty=c(0.5,1,2,3,4,6,8,12,16,24), maxit=25) mod1_pen <- update(mod1, penalty=penalty$penalty) effective.df(mod1_pen) mod1_pen
There you have it, a very basic introduction to the rms package for beginners to the R programming language. Once again, I strongly suggest that readers who are not trained statisticians should read and fully comprehend “Clinical Prediction Models” and “Regression Modeling Strategies” by Frank Harrell. You can also access a number of handouts and lecture notes at here.