Voting Twice in France

Posted on February 19, 2014 by arthur charpentier in R bloggers | 0 Comments

[This article was first published on Freakonometrics » R-english, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

On the Monkey Cage blog, Baptiste Coulmont (a.k.a. @coulmont) recently uploaded a post entitled “You can vote twice ! The many political appeals of proxy votes in France“, coauthored with Joël Gombin (a.k.a. @joelgombin), and myself. The study was initially written in French as mentioned in a previous post. Baptiste posted additional information on his blog (http://coulmont.com/blog/…) and I also wanted to post some lines of code, to mention a model that was not used in that study (more complex to analyze, but more realistic, and with the same conclusions). The econometric study is based on aggregated voted, with a possible ecological misinterpretation.

Regression Model: Possible Explanatory Variables

The first idea was to model proxies using a binomial regression, per pooling station $P_i\sim\mathcal{B}(N_i,p_i)$ where $P_i$ denote the number of proxy vote, per station $i$ , and $N_i$ denotes the number of voters. Proportion $p_i$ can be a function of possible explanatory variables (on Baptiste’s blog there are additional information about the datasets, obtained from insee.fr and opendata.paris.fr)

> bt1=read.table("paris2007-pres-t1.csv",header=TRUE,sep=";")
> bt2=read.table("paris2007-pres-t2.csv",header=TRUE,sep=";")
> bv=read.table("paris-bv-insee-07.csv",header=TRUE,sep=";")
> bv$BV=bv$BVCOM
> baset1=merge(bt1,bv,by="BV")
> baset2=merge(bt2,bv,by="BV")
> baset1$LOGEMENT=baset1$PROPRIO+baset1$LOCNONHLM+baset1$LOCHLM+baset1$GRATUIT
> baset2$LOGEMENT=baset2$PROPRIO+baset2$LOCNONHLM+baset2$LOCHLM+baset2$GRATUIT

For instance, assume that $p_i$ is a function of the proportion of owner of the place people live in, denoted $X_i$ in the neighborhood of the pooling station,

> variable="PROPRIO"
> reference="LOGEMENT"
> baset1$taux=baset1[,variable]/baset1[,reference]
> baset2$taux=baset2[,variable]/baset2[,reference]

We can consider a logistic regression

$p_i=h(X_i)=\frac{\exp[\beta_0+\beta_1 X_i]}{1+\exp[\beta_0+\beta_1 X_i]}$

or a logistic regression with splines, if we do not want to assume a linear model

$p_i=\tilde h(X_i)=\frac{\exp[s(X_i)]}{1+\exp[s(X_i)]}$

With cubic splines, the code is

> b=hist(baset1$taux,plot=FALSE)
> library(splines)
> regt1=glm(PROCURATIONS/INSCRITS~bs(taux,6),family=binomial,weights=INSCRITS,data=baset1)
> regt2=glm(PROCURATIONS/INSCRITS~bs(taux,6),family=binomial,weights=INSCRITS,data=baset2)
> u=seq(min(baset1$taux)+.015,max(baset1$taux)-.015,by=.001)
> ND=data.frame(taux=u)
> ug=seq(0,max(baset1$taux)+.05,by=.001)
> pt1=predict(regt1,newdata=ND,se=TRUE,type="response")
> pt2=predict(regt2,newdata=ND,se=TRUE,type="response")
> library(RColorBrewer)
> CL=brewer.pal(6, "RdBu")
> plot(ug,ug*1,col="white",xlab=nom,ylab="Taux de procuration",
+ ylim=c(0,.1))
> for(i in 1:(length(b$breaks)-1)){
+ polygon(b$breaks[i+c(0,0,1,1)],c(0,b$counts[i],b$counts[i],0)
+ /max(b$counts)*.05,col="light yellow",border=NA)}
> polygon(c(u,rev(u)),c(pt1$fit+2*pt1$se.fit,rev(pt1$fit-2*pt1$se.fit)),
+ border=NA,density=30,col=CL[4])

while a standard logistic regression would be

> lines(u,pt1$fit,col=CL[6],lwd=2)
> polygon(c(u,rev(u)),c(pt2$fit+2*pt2$se.fit,rev(pt2$fit-2*pt2$se.fit)),
+ border=NA,density=30,col=CL[3])
> lines(u,pt2$fit,col=CL[1],lwd=2)
> regt1l=glm(PROCURATIONS/INSCRITS~taux,family=binomial,weights=INSCRITS,data=baset1)
> regt2l=glm(PROCURATIONS/INSCRITS~taux,family=binomial,weights=INSCRITS,data=baset2)
> ND=data.frame(taux=ug)
> pt1l=predict(regt1l,newdata=ND,se=TRUE,type="response")
> pt2l=predict(regt2l,newdata=ND,se=TRUE,type="response")
> lines(ug,pt1l$fit,col=CL[5],lty=2)
> lines(ug,pt2l$fit,col=CL[2],lty=2)
> legend(0,.1,c("Second Tour","Premier Tour"),col=CL[c language="(1,6)"][/c],
+ lwd=2,lty=1,border=NA)

Here it is (the confidence region is for the spline regression) with on blue the first round of the Presidential election, and in red, the second round (in France, it’s a two-round system)

(the legend of the y axis is not correct). We can consider as explanatory variable the rate of H.L.M., low-cost housing or council housing,

If I like the graph, unfortunately, the interpretation of coefficient $\beta_1$ might be complicated

> summary(regt1l)

Call:
glm(formula = PROCURATIONS/INSCRITS ~ taux, family = binomial, 
    data = baset1, weights = INSCRITS)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-12.9549   -1.5722    0.0319    1.6292   13.1303  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -3.70811    0.01516  -244.6   <2e-16 ***
taux         1.49666    0.04012    37.3   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 12507  on 836  degrees of freedom
Residual deviance: 11065  on 835  degrees of freedom
AIC: 15699

Number of Fisher Scoring iterations: 4

> summary(regt2l)

Call:
glm(formula = PROCURATIONS/INSCRITS ~ taux, family = binomial, 
    data = baset2, weights = INSCRITS)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-15.4872   -1.7817   -0.1615    1.6035   12.5596  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -3.24272    0.01230 -263.61   <2e-16 ***
taux         1.45816    0.03266   44.65   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 9424.7  on 836  degrees of freedom
Residual deviance: 7362.3  on 835  degrees of freedom
AIC: 12531

Number of Fisher Scoring iterations: 4

So we did consider a standard linear regression model, for the proxy rate, per station,

$\frac{P_i}{N_i}=\beta_0+\beta_1 X_i+\varepsilon_i$

(again, either a model with splines, or a standard linear model). The code is

> regt1=lm(PROCURATIONS/INSCRITS~bs(taux,6),weights=INSCRITS,data=baset1)
> regt2=lm(PROCURATIONS/INSCRITS~bs(taux,6),weights=INSCRITS,data=baset2)
> u=seq(min(baset1$taux)+.015,max(baset1$taux)-.015,by=.001)
> ND=data.frame(taux=u)
> ug=seq(0,max(baset1$taux)+.05,by=.001)
> pt1=predict(regt1,newdata=ND,se=TRUE,type="response")
> pt2=predict(regt2,newdata=ND,se=TRUE,type="response")
> library(RColorBrewer)
> CL=brewer.pal(6, "RdBu")
> plot(ug,ug*1,col="white",xlab=nom,ylab="Taux de procuration",
+ ylim=c(0,.1))
> for(i in 1:(length(b$breaks)-1)){
+ polygon(b$breaks[i+c(0,0,1,1)],c(0,b$counts[i],b$counts[i],0)
+ /max(b$counts)*.05,col="light yellow",border=NA)}
> polygon(c(u,rev(u)),c(pt1$fit+2*pt1$se.fit,rev(pt1$fit-2*pt1$se.fit)),
+ border=NA,density=30,col=CL[4])
> lines(u,pt1$fit,col=CL[6],lwd=2)
> polygon(c(u,rev(u)),c(pt2$fit+2*pt2$se.fit,rev(pt2$fit-2*pt2$se.fit)),
+ border=NA,density=30,col=CL[3])
> lines(u,pt2$fit,col=CL[1],lwd=2)
> regt1l=lm(PROCURATIONS/INSCRITS~taux,weights=INSCRITS,data=baset1)
> regt2l=lm(PROCURATIONS/INSCRITS~taux,weights=INSCRITS,data=baset2)
> ND=data.frame(taux=ug)
> pt1l=predict(regt1l,newdata=ND,se=TRUE,type="response")
> pt2l=predict(regt2l,newdata=ND,se=TRUE,type="response")
> lines(ug,pt1l$fit,col=CL[5],lty=2)
> lines(ug,pt2l$fit,col=CL[2],lty=2)
> legend(0,.1,c("Second Tour","Premier Tour"),col=CL[c language="(1,6)"][/c],
+ lwd=2,lty=1,border=NA)

Here is again the evolution as a function of the rate of owner of their homes,

The graph is rather close to the one before, and here, the interpretation of the summary table is more conventional,

> summary(regt1l)

Call:
lm(formula = PROCURATIONS/INSCRITS ~ taux, data = baset1, weights = INSCRITS)

Weighted Residuals:
    Min      1Q  Median      3Q     Max 
-1.9994 -0.2926  0.0011  0.3173  3.2072 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.021268   0.001739   12.23   <2e-16 ***
taux        0.054371   0.004812   11.30   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.646 on 835 degrees of freedom
Multiple R-squared:  0.1326,	Adjusted R-squared:  0.1316 
F-statistic: 127.7 on 1 and 835 DF,  p-value: < 2.2e-16

> summary(regt2l)

Call:
lm(formula = PROCURATIONS/INSCRITS ~ taux, data = baset2, weights = INSCRITS)

Weighted Residuals:
    Min      1Q  Median      3Q     Max 
-2.9029 -0.4148 -0.0338  0.4029  3.4907 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.033909   0.001866   18.17   <2e-16 ***
taux        0.079749   0.005165   15.44   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6934 on 835 degrees of freedom
Multiple R-squared:  0.2221,	Adjusted R-squared:  0.2212 
F-statistic: 238.4 on 1 and 835 DF,  p-value: < 2.2e-16

We have used those codes to produce the graphs mentioned in the post. But before mentioning the residuals of the multiple model we considered, I wanted to share some awesome code that produce maps (I can say that those codes are awesome since Baptiste wrote most of them).

Visualization of Residuals on a Map of Paris

To plot the neighborhood of the pooling stations, one more time the post on Baptiste’s blog, explains how the shapefile was obtained from cartelec.net

> library(maptools)
> library(rgdal)
> library(classInt)
> paris=readShapeSpatial("paris-cartelec.shp")

To visualize the proxy rate (the average of round one and round two), here is the code

> elec=data.frame()
> elec=cbind(bt1$BV,(bt1$PROCURATIONS+bt2$PROCURATIONS),(bt1$EXPRIMES+bt2$EXPRIMES))
> colnames(elec)=c("BV","PROCURATIONS","EXPRIMES")
> elec=as.data.frame(elec)
> elec$BV=bt1$BV

To get nice colors, function of the rates, we use

> m=match(paris$BUREAU,elec$BV)
> plotvar=100*elec$PROCURATIONS/elec$EXPRIMES
> nclr=7
> plotclr=brewer.pal(nclr,"RdYlBu")[nclr:1] 
>(plotvar[m], nclr, style="fisher",dataPrecision=1)
> colcode=findColours(class, plotclr)

and finally

> par(mar=c(1,1,1,1))
> plot(paris,col=colcode,border=colcode)
> legend(656274.9, 6867308,legend=names(attr(colcode,"table")), 
+ fill=attr(colcode, "palette"), cex=1, bty="n",
+ title="Frequence procurations (%)")

If we consider a model with three explanatory variable, to explain the proxy rate,

> regt1=lm(PROCURATIONS/INSCRITS~I(POP65P/POP)+
+ I(PROPRIO/LOGEMENT)+I(CS3/POP1564),weights=INSCRITS,data=baset1)

we can plot the residuals using

> m=match(paris$BUREAU,elec$BV)
> plotvar=100*residuals(regt1)
> nclr=7
> plotclr=brewer.pal(nclr,"RdYlBu")[nclr:1] 
>(plotvar[m], nclr, style="fisher",dataPrecision=1)
> colcode=findColours(class, plotclr)
> par(mar=c(1,1,1,1))
> plot(paris,col=colcode,border=colcode)
> legend(656274.9, 6867308,legend=names(attr(colcode,"table")), 
+ fill=attr(colcode, "palette"), cex=1, bty="n",title="Residus")

It might not be a pure random spatial noise… But we could not get better with our small set of covariates.

To leave a comment for the author, please follow the link and comment on their blog: Freakonometrics » R-english.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

R-bloggers

R news and tutorials contributed by hundreds of R bloggers

Voting Twice in France

Related

Related

Never miss an update! Subscribe to R-bloggers to receive e-mails with the latest R posts. (You will not see this message again.)

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)