Voting Twice in France

February 19, 2014
By

(This article was first published on Freakonometrics » R-english, and kindly contributed to R-bloggers)

On the Monkey Cage blog, Baptiste Coulmont (a.k.a. @coulmont) recently uploaded a post entitled “You can vote twice ! The many political appeals of proxy votes in France“, coauthored with Joël Gombin (a.k.a. @joelgombin), and myself. The study was initially written in French as mentioned in a previous post. Baptiste posted additional information on his blog (http://coulmont.com/blog/…) and I also wanted to post some lines of code, to mention a model that was not used in that study (more complex to analyze, but more realistic, and with the same conclusions). The econometric study is based on aggregated voted, with a possible ecological misinterpretation.

  • Regression Model: Possible Explanatory Variables

The first idea was to model proxies using a binomial regression, per pooling station  where  denote the number of proxy vote, per station , and  denotes the number of voters. Proportion  can be a function of possible explanatory variables (on Baptiste’s blog there are additional information about the datasets, obtained from insee.fr and opendata.paris.fr)

> bt1=read.table("paris2007-pres-t1.csv",header=TRUE,sep=";")
> bt2=read.table("paris2007-pres-t2.csv",header=TRUE,sep=";")
> bv=read.table("paris-bv-insee-07.csv",header=TRUE,sep=";")
> bv$BV=bv$BVCOM
> baset1=merge(bt1,bv,by="BV")
> baset2=merge(bt2,bv,by="BV")
> baset1$LOGEMENT=baset1$PROPRIO+baset1$LOCNONHLM+baset1$LOCHLM+baset1$GRATUIT
> baset2$LOGEMENT=baset2$PROPRIO+baset2$LOCNONHLM+baset2$LOCHLM+baset2$GRATUIT

For instance, assume that  is a function of the proportion of owner of the place people live in, denoted  in the neighborhood of the pooling station,

> variable="PROPRIO"
> reference="LOGEMENT"
> baset1$taux=baset1[,variable]/baset1[,reference]
> baset2$taux=baset2[,variable]/baset2[,reference]

We can consider a logistic regression

or a logistic regression with splines, if we do not want to assume a linear model

With cubic splines, the code is

> b=hist(baset1$taux,plot=FALSE)
> library(splines)
> regt1=glm(PROCURATIONS/INSCRITS~bs(taux,6),family=binomial,weights=INSCRITS,data=baset1)
> regt2=glm(PROCURATIONS/INSCRITS~bs(taux,6),family=binomial,weights=INSCRITS,data=baset2)
> u=seq(min(baset1$taux)+.015,max(baset1$taux)-.015,by=.001)
> ND=data.frame(taux=u)
> ug=seq(0,max(baset1$taux)+.05,by=.001)
> pt1=predict(regt1,newdata=ND,se=TRUE,type="response")
> pt2=predict(regt2,newdata=ND,se=TRUE,type="response")
> library(RColorBrewer)
> CL=brewer.pal(6, "RdBu")
> plot(ug,ug*1,col="white",xlab=nom,ylab="Taux de procuration",
+ ylim=c(0,.1))
> for(i in 1:(length(b$breaks)-1)){
+ polygon(b$breaks[i+c(0,0,1,1)],c(0,b$counts[i],b$counts[i],0)
+ /max(b$counts)*.05,col="light yellow",border=NA)}
> polygon(c(u,rev(u)),c(pt1$fit+2*pt1$se.fit,rev(pt1$fit-2*pt1$se.fit)),
+ border=NA,density=30,col=CL[4])

while a standard logistic regression would be

> lines(u,pt1$fit,col=CL[6],lwd=2)
> polygon(c(u,rev(u)),c(pt2$fit+2*pt2$se.fit,rev(pt2$fit-2*pt2$se.fit)),
+ border=NA,density=30,col=CL[3])
> lines(u,pt2$fit,col=CL[1],lwd=2)
> regt1l=glm(PROCURATIONS/INSCRITS~taux,family=binomial,weights=INSCRITS,data=baset1)
> regt2l=glm(PROCURATIONS/INSCRITS~taux,family=binomial,weights=INSCRITS,data=baset2)
> ND=data.frame(taux=ug)
> pt1l=predict(regt1l,newdata=ND,se=TRUE,type="response")
> pt2l=predict(regt2l,newdata=ND,se=TRUE,type="response")
> lines(ug,pt1l$fit,col=CL[5],lty=2)
> lines(ug,pt2l$fit,col=CL[2],lty=2)
> legend(0,.1,c("Second Tour","Premier Tour"),col=CL[c language="(1,6)"][/c],
+ lwd=2,lty=1,border=NA)

Here it is (the confidence region is for the spline regression) with on blue the first round of the Presidential election, and in red, the second round (in France, it’s a two-round system)

(the legend of the y axis is not correct). We can consider as explanatory variable the rate of H.L.M., low-cost housing or council housing,

If I like the graph, unfortunately, the interpretation of coefficient  might be complicated

> summary(regt1l)

Call:
glm(formula = PROCURATIONS/INSCRITS ~ taux, family = binomial, 
    data = baset1, weights = INSCRITS)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-12.9549   -1.5722    0.0319    1.6292   13.1303  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -3.70811    0.01516  -244.6   <2e-16 ***
taux         1.49666    0.04012    37.3   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 12507  on 836  degrees of freedom
Residual deviance: 11065  on 835  degrees of freedom
AIC: 15699

Number of Fisher Scoring iterations: 4

> summary(regt2l)

Call:
glm(formula = PROCURATIONS/INSCRITS ~ taux, family = binomial, 
    data = baset2, weights = INSCRITS)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-15.4872   -1.7817   -0.1615    1.6035   12.5596  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -3.24272    0.01230 -263.61   <2e-16 ***
taux         1.45816    0.03266   44.65   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 9424.7  on 836  degrees of freedom
Residual deviance: 7362.3  on 835  degrees of freedom
AIC: 12531

Number of Fisher Scoring iterations: 4

So we did consider a standard linear regression model, for the proxy rate, per station,

(again, either a model with splines, or a standard linear model). The code is

> regt1=lm(PROCURATIONS/INSCRITS~bs(taux,6),weights=INSCRITS,data=baset1)
> regt2=lm(PROCURATIONS/INSCRITS~bs(taux,6),weights=INSCRITS,data=baset2)
> u=seq(min(baset1$taux)+.015,max(baset1$taux)-.015,by=.001)
> ND=data.frame(taux=u)
> ug=seq(0,max(baset1$taux)+.05,by=.001)
> pt1=predict(regt1,newdata=ND,se=TRUE,type="response")
> pt2=predict(regt2,newdata=ND,se=TRUE,type="response")
> library(RColorBrewer)
> CL=brewer.pal(6, "RdBu")
> plot(ug,ug*1,col="white",xlab=nom,ylab="Taux de procuration",
+ ylim=c(0,.1))
> for(i in 1:(length(b$breaks)-1)){
+ polygon(b$breaks[i+c(0,0,1,1)],c(0,b$counts[i],b$counts[i],0)
+ /max(b$counts)*.05,col="light yellow",border=NA)}
> polygon(c(u,rev(u)),c(pt1$fit+2*pt1$se.fit,rev(pt1$fit-2*pt1$se.fit)),
+ border=NA,density=30,col=CL[4])
> lines(u,pt1$fit,col=CL[6],lwd=2)
> polygon(c(u,rev(u)),c(pt2$fit+2*pt2$se.fit,rev(pt2$fit-2*pt2$se.fit)),
+ border=NA,density=30,col=CL[3])
> lines(u,pt2$fit,col=CL[1],lwd=2)
> regt1l=lm(PROCURATIONS/INSCRITS~taux,weights=INSCRITS,data=baset1)
> regt2l=lm(PROCURATIONS/INSCRITS~taux,weights=INSCRITS,data=baset2)
> ND=data.frame(taux=ug)
> pt1l=predict(regt1l,newdata=ND,se=TRUE,type="response")
> pt2l=predict(regt2l,newdata=ND,se=TRUE,type="response")
> lines(ug,pt1l$fit,col=CL[5],lty=2)
> lines(ug,pt2l$fit,col=CL[2],lty=2)
> legend(0,.1,c("Second Tour","Premier Tour"),col=CL[c language="(1,6)"][/c],
+ lwd=2,lty=1,border=NA)

Here is again the evolution as a function of the rate of owner of their homes,

The graph is rather close to the one before, and here, the interpretation of the summary table is more conventional,

> summary(regt1l)

Call:
lm(formula = PROCURATIONS/INSCRITS ~ taux, data = baset1, weights = INSCRITS)

Weighted Residuals:
    Min      1Q  Median      3Q     Max 
-1.9994 -0.2926  0.0011  0.3173  3.2072 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.021268   0.001739   12.23   <2e-16 ***
taux        0.054371   0.004812   11.30   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.646 on 835 degrees of freedom
Multiple R-squared:  0.1326,	Adjusted R-squared:  0.1316 
F-statistic: 127.7 on 1 and 835 DF,  p-value: < 2.2e-16

> summary(regt2l)

Call:
lm(formula = PROCURATIONS/INSCRITS ~ taux, data = baset2, weights = INSCRITS)

Weighted Residuals:
    Min      1Q  Median      3Q     Max 
-2.9029 -0.4148 -0.0338  0.4029  3.4907 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.033909   0.001866   18.17   <2e-16 ***
taux        0.079749   0.005165   15.44   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6934 on 835 degrees of freedom
Multiple R-squared:  0.2221,	Adjusted R-squared:  0.2212 
F-statistic: 238.4 on 1 and 835 DF,  p-value: < 2.2e-16

We have used those codes to produce the graphs mentioned in the post. But before mentioning the residuals of the multiple model we considered, I wanted to share some awesome code that produce maps (I can say that those codes are awesome since Baptiste wrote most of them).

  • Visualization of Residuals on a Map of Paris

To plot the neighborhood of the pooling stations, one more time the post on Baptiste’s blog, explains how the shapefile was obtained from cartelec.net

> library(maptools)
> library(rgdal)
> library(classInt)
> paris=readShapeSpatial("paris-cartelec.shp")

To visualize the proxy rate (the average of round one and round two), here is the code

> elec=data.frame()
> elec=cbind(bt1$BV,(bt1$PROCURATIONS+bt2$PROCURATIONS),(bt1$EXPRIMES+bt2$EXPRIMES))
> colnames(elec)=c("BV","PROCURATIONS","EXPRIMES")
> elec=as.data.frame(elec)
> elec$BV=bt1$BV

To get nice colors, function of the rates, we use

> m=match(paris$BUREAU,elec$BV)
> plotvar=100*elec$PROCURATIONS/elec$EXPRIMES
> nclr=7
> plotclr=brewer.pal(nclr,"RdYlBu")[nclr:1] 
>(plotvar[m], nclr, style="fisher",dataPrecision=1)
> colcode=findColours(class, plotclr)

and finally

> par(mar=c(1,1,1,1))
> plot(paris,col=colcode,border=colcode)
> legend(656274.9, 6867308,legend=names(attr(colcode,"table")), 
+ fill=attr(colcode, "palette"), cex=1, bty="n",
+ title="Frequence procurations (%)")

If we consider a model with three explanatory variable, to explain the proxy rate,

> regt1=lm(PROCURATIONS/INSCRITS~I(POP65P/POP)+
+ I(PROPRIO/LOGEMENT)+I(CS3/POP1564),weights=INSCRITS,data=baset1)

we can plot the residuals using

> m=match(paris$BUREAU,elec$BV)
> plotvar=100*residuals(regt1)
> nclr=7
> plotclr=brewer.pal(nclr,"RdYlBu")[nclr:1] 
>(plotvar[m], nclr, style="fisher",dataPrecision=1)
> colcode=findColours(class, plotclr)
> par(mar=c(1,1,1,1))
> plot(paris,col=colcode,border=colcode)
> legend(656274.9, 6867308,legend=names(attr(colcode,"table")), 
+ fill=attr(colcode, "palette"), cex=1, bty="n",title="Residus")

It might not be a pure random spatial noise… But we could not get better with our small set of covariates.

To leave a comment for the author, please follow the link and comment on his blog: Freakonometrics » R-english.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.