# Regression on categorical variables

January 30, 2013
By

(This article was first published on Freakonometrics » R-english, and kindly contributed to R-bloggers)

This morning, Stéphane asked me tricky question about extracting coefficients from a regression with categorical explanatory variates. More precisely, he asked me if it was possible to store the coefficients in a nice table, with information on the variable and the modality (those two information being in two different columns). Here is some code I did to produce the table he was looking for, but I guess that some (much) smarter techniques can be used (comments – see below – are open). Consider the following dataset

> base
x sex   hair
1  1   H  Black
2  4   F  Brown
3  6   F  Black
4  6   H  Black
5 10   H  Brown
6  5   H Blonde

with two factors,

> levels(base$hair) [1] "Black" "Blonde" "Brown" > levels(base$sex)
[1] "F" "H"

Let us run a (standard linear) regression,

> reg=lm(x~hair+sex,data=base)

which is here

> summary(reg)

Call:
lm(formula = x ~ hair + sex, data = base)

Residuals:
1          2          3          4          5          6
-3.714e+00 -2.429e+00  2.429e+00  1.286e+00  2.429e+00 -2.220e-16

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)   3.5714     3.4405   1.038    0.408
hairBlonde    0.2857     4.8655   0.059    0.959
hairBrown     2.8571     3.7688   0.758    0.528
sexH          1.1429     3.7688   0.303    0.790

Residual standard error: 4.071 on 2 degrees of freedom
Multiple R-squared: 0.2352,	Adjusted R-squared: -0.9121
F-statistic: 0.205 on 3 and 2 DF,  p-value: 0.886

If we want to extract the names of the factors (assuming here that there are no numbers in the name of the factor), and the values of the associated modality, one can use

> VARIABLE=c("",gsub("[-^0-9]", "", names(unlist(reg$xlevels)))) > MODALITY=c("",as.character(unlist(reg$xlevels)))
> names=data.frame(VARIABLE,MODALITY,NOMVAR=c(
+ "(Intercept)",paste(VARIABLE,MODALITY,sep="")[-1]))
> regression=data.frame(NOMVAR=names(coefficients(reg)),
+ COEF=as.numeric(coefficients(reg)))
> merge(names,regression,all.x=TRUE)
NOMVAR VARIABLE MODALITE      COEF
1 (Intercept)                   3.5714286
2   hairBlack     hair    Black        NA
3  hairBlonde     hair   Blonde 0.2857143
4   hairBrown     hair    Brown 2.8571429
5        sexF      sex        F        NA
6        sexH      sex        H 1.1428571

or, if we want modalities exluding references,

> merge(names,regression)
NOMVAR VARIABLE MODALITE      COEF
1 (Intercept)                   3.5714286
2  hairBlonde     hair   Blonde 0.2857143
3   hairBrown     hair    Brown 2.8571429
4        sexH      sex        H 1.1428571

In order to reproduce the table Stéphane sent me, let us use the following code to produce an html table,

> library(xtable)
> htlmtable <- xtable(merge(names,regression))
> print(htlmtable,type="html")
NOMVAR VARIABLE MODALITY COEF
1 (Intercept) 3.57
2 hairBlonde hair Blonde 0.29
3 hairBrown hair Brown 2.86
4 sexH sex H 1.14

So yes, it is possible to build a table with the variable, modalities, and coefficients. This function can be interesting on prospective mortality, when we do have a large number of modalities per factor (years, ages and year of birth). Consider the following datasets

> DEATH=read.table(
+ "http://freakonometrics.free.fr/DeathsSwitzerland.txt",
+ "http://freakonometrics.free.fr/ExposuresSwitzerland.txt",
> DEATH$Age=as.numeric(as.character(DEATH$Age))
> DEATH=DEATH[-which(is.na(DEATH$Age)),] > EXPOSURE$Age=as.numeric(as.character(EXPOSURE$Age)) > EXPOSURE=EXPOSURE[-which(is.na(EXPOSURE$Age)),]
> base=data.frame(y=as.factor(DEATH$Year),a=as.factor(DEATH$Age),
+ c=as.factor(DEATH$Year-DEATH$Age),D=DEATH$Total,E= EXPOSURE$Total)
> base=base[base$E>0,] and the following nonlinear model, based on Lee-Carter model (including a cohort effect), can be estimated using > library(gnm) > reg=gnm(D~a+Mult(a,y)+Mult(a,c),offset=log(E),family=poisson,data=base) In order to extract the 671 coefficients from the regresssion, > length(coefficients(reg)) [1] 671 (as properly as possible) we have to be careful: names of coefficients are not that simple to handle. For instance, we can see things like > coefficients(reg)[200] Mult(., year).age98 0.04203519 In order to extract them, define > na=length((reg$xlevels)$age) > ny=length((reg$xlevels)$year) > nc=length((reg$xlevels)$cohort) > VARIABLElong=c("",rep("age",na),rep("Mult(., year).age",na), + rep("Mult(a, .).y",ny), + rep("Mult(., cohort).age",na),rep("Mult(age, .).cohort",nc)) > VARIABLEshort=c("",rep("age",na),rep("age",na),rep("year",ny), + rep("age",na),rep("cohort",nc)) > MODALITY=c("",(reg$xlevels)$age,(reg$xlevels)$age, + (reg$xlevels)$year,(reg$xlevels)$age,(reg$xlevels)$cohort) > names=data.frame(VARIABLElong,VARIABLEshort, + MODALITY,NOMVAR=c("(Intercept)",paste(VARIABLElong,MODALITY,sep="")[-1])) > regression=data.frame(NOMVAR=names(coefficients(reg)), + COEF=as.numeric(coefficients(reg))) Here we go, now we have the coefficients from the regression in a nice table, > outputreg=merge(names,regression) > outputreg[1:10,] NOMVAR VARIABLElong VARIABLEshort MODALITY COEF 1 (Intercept) -8.22225458 2 age1 age age 1 -0.87495451 3 age10 age age 10 -1.67145704 4 age100 age age 100 4.91041650 5 age11 age age 11 -1.00186990 6 age12 age age 12 -1.05953497 7 age13 age age 13 -0.90952859 8 age14 age age 14 0.02880668 9 age15 age age 15 0.42830738 10 age16 age age 16 1.35961403 It is now possible to plot all the coefficients, as functions of the age, the year of observation, or the year of birth. For instance, for the standard average age effect (namely as a function of ), we can use > typevariable=as.character(unique(outputreg$VARIABLElong))
> basegraph=outputreg[outputreg$VARIABLElong==typevariable[2],] > x=as.numeric(as.character(basegraph$MODALITY))
> y=basegraph$COEF > plot(x,y,type="p",col="blue",xlab="Age") while the cohort effect ( as a function of ) is obtained using > basegraph=outputreg[outputreg$VARIABLElong==typevariable[5],]
> x=as.numeric(as.character(basegraph$MODALITY)) > y=basegraph$COEF
> plot(x,y,type="p",col="blue",xlab="Cohort (year of birth)",ylim=c(0,10))

### Arthur Charpentier

Arthur Charpentier, professor in Montréal, in Actuarial Science. Former professor-assistant at ENSAE Paristech, associate professor at Ecole Polytechnique and assistant professor in Economics at Université de Rennes 1.  Graduated from ENSAE, Master in Mathematical Economics (Paris Dauphine), PhD in Mathematics (KU Leuven), and Fellow of the French Institute of Actuaries.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...