Regression on categorical variables

Posted on January 30, 2013 by arthur charpentier in Uncategorized | 0 Comments

[This article was first published on Freakonometrics » R-english, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This morning, Stéphane asked me tricky question about extracting coefficients from a regression with categorical explanatory variates. More precisely, he asked me if it was possible to store the coefficients in a nice table, with information on the variable and the modality (those two information being in two different columns). Here is some code I did to produce the table he was looking for, but I guess that some (much) smarter techniques can be used (comments – see below – are open). Consider the following dataset

> base
   x sex   hair
1  1   H  Black
2  4   F  Brown
3  6   F  Black
4  6   H  Black
5 10   H  Brown
6  5   H Blonde

with two factors,

> levels(base$hair)
[1] "Black"  "Blonde" "Brown" 
> levels(base$sex)
[1] "F" "H"

Let us run a (standard linear) regression,

> reg=lm(x~hair+sex,data=base)

which is here

> summary(reg)

Call:
lm(formula = x ~ hair + sex, data = base)

Residuals:
         1          2          3          4          5          6 
-3.714e+00 -2.429e+00  2.429e+00  1.286e+00  2.429e+00 -2.220e-16 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   3.5714     3.4405   1.038    0.408
hairBlonde    0.2857     4.8655   0.059    0.959
hairBrown     2.8571     3.7688   0.758    0.528
sexH          1.1429     3.7688   0.303    0.790

Residual standard error: 4.071 on 2 degrees of freedom
Multiple R-squared: 0.2352,	Adjusted R-squared: -0.9121 
F-statistic: 0.205 on 3 and 2 DF,  p-value: 0.886

If we want to extract the names of the factors (assuming here that there are no numbers in the name of the factor), and the values of the associated modality, one can use

> VARIABLE=c("",gsub("[-^0-9]", "", names(unlist(reg$xlevels))))
> MODALITY=c("",as.character(unlist(reg$xlevels)))
> names=data.frame(VARIABLE,MODALITY,NOMVAR=c(
+ "(Intercept)",paste(VARIABLE,MODALITY,sep="")[-1]))
> regression=data.frame(NOMVAR=names(coefficients(reg)),
+ COEF=as.numeric(coefficients(reg)))
> merge(names,regression,all.x=TRUE)
       NOMVAR VARIABLE MODALITE      COEF
1 (Intercept)                   3.5714286
2   hairBlack     hair    Black        NA
3  hairBlonde     hair   Blonde 0.2857143
4   hairBrown     hair    Brown 2.8571429
5        sexF      sex        F        NA
6        sexH      sex        H 1.1428571

or, if we want modalities exluding references,

> merge(names,regression)
       NOMVAR VARIABLE MODALITE      COEF
1 (Intercept)                   3.5714286
2  hairBlonde     hair   Blonde 0.2857143
3   hairBrown     hair    Brown 2.8571429
4        sexH      sex        H 1.1428571

In order to reproduce the table Stéphane sent me, let us use the following code to produce an html table,

> library(xtable)
> htlmtable <- xtable(merge(names,regression))
> print(htlmtable,type="html")

	NOMVAR	VARIABLE	MODALITY	COEF
1	(Intercept)			3.57
2	hairBlonde	hair	Blonde	0.29
3	hairBrown	hair	Brown	2.86
4	sexH	sex	H	1.14

So yes, it is possible to build a table with the variable, modalities, and coefficients. This function can be interesting on prospective mortality, when we do have a large number of modalities per factor (years, ages and year of birth). Consider the following datasets

> DEATH=read.table(
+ "http://freakonometrics.free.fr/DeathsSwitzerland.txt",
+ header=TRUE,skip=2)
> EXPOSURE=read.table(
+ "http://freakonometrics.free.fr/ExposuresSwitzerland.txt",
+ header=TRUE,skip=2)
> DEATH$Age=as.numeric(as.character(DEATH$Age))
> DEATH=DEATH[-which(is.na(DEATH$Age)),]
> EXPOSURE$Age=as.numeric(as.character(EXPOSURE$Age))
> EXPOSURE=EXPOSURE[-which(is.na(EXPOSURE$Age)),]
> base=data.frame(y=as.factor(DEATH$Year),a=as.factor(DEATH$Age),
+ c=as.factor(DEATH$Year-DEATH$Age),D=DEATH$Total,E= EXPOSURE$Total)
> base=base[base$E>0,]

and the following nonlinear model, based on Lee-Carter model (including a cohort effect),

$http://latex.codecogs.com/gif.latex?N_{x,t}\sim\mathcal{P}(E_{x,t}\cdot%20\exp[\alpha_x+\beta_x%20\kappa_t%20+%20\gamma_x%20\delta_{t-x}])$

can be estimated using

> library(gnm)
> reg=gnm(D~a+Mult(a,y)+Mult(a,c),offset=log(E),family=poisson,data=base)

In order to extract the 671 coefficients from the regresssion,

> length(coefficients(reg))
[1] 671

(as properly as possible) we have to be careful: names of coefficients are not that simple to handle. For instance, we can see things like

> coefficients(reg)[200]
Mult(., year).age98 
         0.04203519

In order to extract them, define

> na=length((reg$xlevels)$age)
> ny=length((reg$xlevels)$year)
> nc=length((reg$xlevels)$cohort)
> VARIABLElong=c("",rep("age",na),rep("Mult(., year).age",na),
+ rep("Mult(a, .).y",ny),
+ rep("Mult(., cohort).age",na),rep("Mult(age, .).cohort",nc))
> VARIABLEshort=c("",rep("age",na),rep("age",na),rep("year",ny),
+ rep("age",na),rep("cohort",nc))
> MODALITY=c("",(reg$xlevels)$age,(reg$xlevels)$age,
+ (reg$xlevels)$year,(reg$xlevels)$age,(reg$xlevels)$cohort)
> names=data.frame(VARIABLElong,VARIABLEshort,
+ MODALITY,NOMVAR=c("(Intercept)",paste(VARIABLElong,MODALITY,sep="")[-1]))
> regression=data.frame(NOMVAR=names(coefficients(reg)),
+ COEF=as.numeric(coefficients(reg)))

Here we go, now we have the coefficients from the regression in a nice table,

> outputreg=merge(names,regression)
> outputreg[1:10,]
        NOMVAR VARIABLElong VARIABLEshort MODALITY        COEF
1  (Intercept)                                     -8.22225458
2         age1          age           age        1 -0.87495451
3        age10          age           age       10 -1.67145704
4       age100          age           age      100  4.91041650
5        age11          age           age       11 -1.00186990
6        age12          age           age       12 -1.05953497
7        age13          age           age       13 -0.90952859
8        age14          age           age       14  0.02880668
9        age15          age           age       15  0.42830738
10       age16          age           age       16  1.35961403

It is now possible to plot all the coefficients, as functions of the age, the year of observation, or the year of birth. For instance, for the standard average age effect (namely $http://latex.codecogs.com/gif.latex?\alpha_x$ as a function of $http://latex.codecogs.com/gif.latex?x$ ), we can use

> typevariable=as.character(unique(outputreg$VARIABLElong))
> basegraph=outputreg[outputreg$VARIABLElong==typevariable[2],]
> x=as.numeric(as.character(basegraph$MODALITY))
> y=basegraph$COEF
> plot(x,y,type="p",col="blue",xlab="Age")

while the cohort effect ( $http://latex.codecogs.com/gif.latex?\delta_t$ as a function of $http://latex.codecogs.com/gif.latex?t$ ) is obtained using

> basegraph=outputreg[outputreg$VARIABLElong==typevariable[5],]
> x=as.numeric(as.character(basegraph$MODALITY))
> y=basegraph$COEF
> plot(x,y,type="p",col="blue",xlab="Cohort (year of birth)",ylim=c(0,10))

Arthur Charpentier

Arthur Charpentier, professor in Montréal, in Actuarial Science. Former professor-assistant at ENSAE Paristech, associate professor at Ecole Polytechnique and assistant professor in Economics at Université de Rennes 1. Graduated from ENSAE, Master in Mathematical Economics (Paris Dauphine), PhD in Mathematics (KU Leuven), and Fellow of the French Institute of Actuaries.