# Example 9.24: Changing the parameterization for categorical predictors

[This article was first published on

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In our book, we discuss the important question of how to assign different parameterizations to categorical variables when fitting models (section 3.1.3). We show code in R for use in the **SAS and R**, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

`lm()`function, as follows:

`lm(y ~ x, contrasts=list(x,"contr.treatment")`

This works great in

`lm()`and some other functions, notably

`glm()`. But for functions from contributed packages, the

`contrasts`option may not work.

Here we show a more generic approach to setting contrasts in R, using Firth logistic regression, which is discussed in Example 8.15, to demonstrate. This approach is also shown in passing in section 3.7.5.

**R**

We’ll simulate a simple data set for logistic regression, then examine the results of the default parameterization.

n = 100 j = rep(c(0,1,2), each = n) linpred = -2.5 + j y = (runif(n*3) < exp(linpred)/(1 + exp(linpred)) ) library(logistf) flrfactor = logistf(y ~ as.factor(j)) summary(flrfactor) coef se(coef) Chisq p (Intercept) -2.1539746 0.3276441 Inf 0.000000e+00 as.factor(j)1 0.3679788 0.4343756 0.7331622 3.918601e-01 as.factor(j)2 1.7936917 0.3855682 26.2224650 3.042623e-07

To see what R is doing, use the

`contrasts()`function:

> contrasts(as.factor(j)) 1 2 0 0 0 1 1 0 2 0 1

R made indicator ("dummy") variables for two of the three levels, so that the estimated coefficients are the log relative odds for these levels vs. the omitted level. This is the "contr.treatment" structure (default for unordered factors). The defaults can be changed with

`options("contrasts")`, but this is a sensible one.

But what if we wanted to assess whether a linear effect was plausible, independent of any quadratic effect? For

`glm()`objects we could examine the

`anova()`between the model with the linear term and the model with the linear and quadratic terms. Or, we could use the syntax shown in the introduction, but with "contr.poly" in place of "contr.treatment". The latter approach may be preferable, and for the

`logistf()`function (and likely many other contributed functions) the

`contrasts =`option does not work. In those cases, use the

`contrasts`function:

jfactor = as.factor(j) contrasts(jfactor) = contr.poly(3) flrfc = logistf(y ~ jfactor) summary(flrfc) coef se(coef) Chisq p (Intercept) -1.4334177 0.1598591 Inf 0.000000e+00 jfactor.L 1.2683316 0.2726379 26.222465 3.042623e-07 jfactor.Q 0.4318181 0.2810660 2.472087 1.158840e-01

Not surprisingly, there is no need for a quadratic term, after the linear trend is accounted for. The canned contrasts available in R are somewhat limited--effect cell coding is not included, for example. You can assign

`contrasts(x)`a matrix you write manually in such cases.

**SAS**

In SAS, the

`class`statement for the

`logistic`procedure allows many parametrizations, including "orthpoly", which matches the "contr.poly" contrast from R. However, most modeling procedures do not have this flexibility, and you would have to generate your contrasts manually in those cases, typically by creating new variables with the appropriate contrast values. Here we show the reference cell coding that is the default in R. Perversely, it is not the the default in

`proc logistic`despite it being the only option in most procedures. On the other hand, it does allow the user to specify the reference category.

data test; do i = 1 to 300; j = (i gt 100) + (i gt 200); linpred = -2.5 + j; y = (uniform(0) lt exp(linpred)/(1 + exp(linpred)) ); output; end; run; title "Reference cell"; proc logistic data = test; class j (param=ref ref='0'); model y(event='1') = j / firth clparm = pl; run; title "Polynomials"; proc logistic data = test; class j (param=orthpoly); model y(event='1') = j; run;

With the results:

Reference cell Standard Wald Parameter DF Estimate Error Chi-Square Pr > ChiSq Intercept 1 -2.6110 0.1252 434.6071

**An unrelated note about aggregators**

We love aggregators! Aggregators collect blogs that have similar coverage for the convenience of readers, and for blog authors they offer a way to reach new audiences. SAS and R is aggregated by R-bloggers and PROC-X with our permission, and by at least 2 other aggregating services which have never contacted us. If you read this on an aggregator that does not credit the blogs it incorporates, please come visit us at SAS and R. We answer comments there and offer direct subscriptions if you like our content. In addition, no one is allowed to profit by this work under our license; if you see advertisements on this page, the aggregator is violating the terms by which we publish our work.

To

**leave a comment**for the author, please follow the link and comment on their blog:**SAS and R**.R-bloggers.com offers

**daily e-mail updates**about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.