by Max Kuhn
The formula interface to symbolically specify blocks of data is ubiquitous in R. It is commonly used to generate design matrices for modeling function (e.g.
lm). In traditional linear model statistics, the design matrix is the two-dimensional representation of the predictor set where instances of data are in rows and variable attributes are in columns (a.k.a. the X matrix).
A simple motivating example uses the inescapable iris data in a linear regression model:
While the purpose of this code chunk is to fit a linear regression models, the formula is used to specify the symbolic model as well as generating the intended design matrix. Note that the formula method defines the columns to be included in the design matrix, as well as which rows should be retained.
In this post, I’ll walk through the mechanics of how some modeling functions use formulas to make a design matrix using
lm to illustrate the details. Note, however, that the syntactical minutiae are likely to be different from function to function, even within base R.
Formulas and Terms
lm initially uses the formula and the appropriate environment to translate the relationships between variables to creating a data frame containing the data. R has a fairly standard set of operators that can be used to create a matrix of predictors for models.
We will start by looking at some of the internals of
lm (circa December 2016).
Preparing for the Model Frame
The main tools used to get the design matrix are the
model.matrix functions. The definition and first few lines of
The goal of this code is to manipulate the formula and other arguments into an acceptable set of arguments to the
model.frame function in the
stats package. The code will modify the call to
lm to use as the substrate into
model.frame, which has many similar arguments (e.g.
na.action). However, there are arguments that are not common to both functions.
mf is initially created to mirror the original call. After executing
match.call(expand.dots = FALSE), our original call generates a value of
class(mf) has a value of
"call". Note that the first element of the call,
mf[[1L]], has a value of
lm with the class of
The next few lines remove any arguments to
lm that are not arguments to
model.frame, and adds another (
drop.unused.levels). Finally, the call is modified by replacing the first element of the call (
mf, has a value of
Creating the Model Frame and Terms
When this code is executed using
eval(expr = mf, envir = parent.frame()), the
model.frame function returns
data.framecontaining the variables used in formula plus those specified in
.... It will have additional attributes, including “
terms” for an object of class “
terms” derived from
formula, and possibly “
na.action” giving information on the handling of
NAs (which will not be present if no special handling was done, e.g. by
For our particular call, the first six values of
Note that :
- all of the columns are present, predictors and response,
- the filtering defined by the
subsetcommand is executed here (note the row names above),
Petal.Lengthhas been log transformed and the column name is not a valid name, and
- the Species variable has not generated any dummy variables.
If weights or an offset was used in the model, the resulting model frame would also include these.
As alluded to above,
mf has several attributes, and includes one that would not normally be associated with a data frame (e.g.
terms object contains the data that defines the relationships between variables in the formulas, as well as any transformations of the individual predictors (e.g.
log). For our original model:
terms object will be used to generate design matrices on new data (e.g. samples being predicted).
Creating the Design Matrix
lm code has some additional steps to save the model terms and generate the design matrix:
model.matrix function uses the data in the
terms object to generate any interactions and/or dummy variables from factors. This work is mostly accomplished by a C routine.
The Predictive Nature of the
In the previous example, the log transformation is applied to one of the columns. When using an inline function inside a formula, this transformation will be applied to the current data, as well as any future data points (say, via
predict.lm). The same workflow is followed where a model frame is used with the
terms object and
However, there are some operations that can be specified in a formula that require statistical estimates. Two examples:
- a natural spline (
splines::ns) takes a numeric variable, does some computations, and expands that variable into multiple features that can be used to model that predictor in a nonlinear fashion.
- orthogonal polynomials (
stats::poly) is a basis expansion that takes a single predictor and produces new columns that correspond to the polynomial degree.
As an example of natural splines:
ns returns multiple elements: the basis function spline results (shown just above) and the data required to generate them for new data (in the attributes).
It turns out that the only statistics required to produce new spline results are the
intercept attributes. When new data are predicted, those statistical quantities are used:
Now, getting back to formulas, we can include a function like this inline:
terms object saves the model specification, and also the values required to reproduce the spline basis function. The
terms object contains an attribute that is misleadingly named
predvars that has this information:
predict.lm is invoked with a new data set, the terms are passed to
model.frame and the
predvars are evaluated on the new data, e.g.,
In summary, R’s formula interface works by exploiting the original formula as a general expression, and the
terms object is where most of the information about the design matrix is stored.
Finally, it is possible to include more complex operations in the formula. For example, two techniques for imputation in predictive models are using tree ensembles and K-nearest neighbors. In each case, a model can be created to impute a predictor, and this model also could be embedded into
predvars as long as the prediction function can be exploited as an expression.
There are some severe limitations to formulas which may not be obvious and various packages have had to work around these issues. The second post on formulas (“The Bad”) will illustrate these shortcomings.