Discrimination by proxy (a real case study)

Posted on February 15, 2024 by arthur charpentier in R bloggers | 0 Comments

[This article was first published on R-english – Freakonometrics, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Yesterday, with Laurence Barry, we posted a blog post “Who benefits from data sharing?” explaining why data sharing, in insurance, could end mutualization. Actually, it can also be bad in the context of discrimination. Consider here the same dataset, with claim occurence, in a real insurance portfolio,

library(InsurFair) library(randomForest)

Consider a version of this dataset without the gender, and use variable importance to get a list of variables we can use in a predictive model

subfrenchmotor = frenchmotor[,-which(names(frenchmotor)=="sensitive")] RF = randomForest(y~. ,data=subfrenchmotor) vi = varImpPlot(RF , sort = TRUE)

We sort variables based on variable importance (the first one is the “most important” one), and add splines for three continuous variables

dfvi = data.frame(nom = names(subfrenchmotor)[-15], g = as.numeric(vi)) dfvi = dfvi[rev(order(dfvi$g)),] nom = dfvi$nom nom[1] = "bs(LicAge)" nom[3] = "bs(DrivAge)" nom[7] = "bs(BonusMalus)"

Then, the idea is simple : at stage $k$, we keep the $k$ most important variables, and run a logistic regression on those $k$ variables. Again, I should stress that the gender of the driver is not among those $k$ variables. Then, we compute the average prediction of claim frequency, for mean and women.

n=nrow(subfrenchmotor) library(splines) idx_F = which(frenchmotor$sensitive == "Female") idx_M = which(frenchmotor$sensitive == "Male") metric_gender= function(k =3){ if(k==0){ reg = glm(y~1, family=binomial, data=subfrenchmotor) yp = predict(reg, type="response") yp_F = yp[idx_F] yp_M = yp[idx_M] sortie = c(mean(yp_F),mean(yp_M),quantile(yp_F,c(.1,.9)),quantile(yp_M,c(.1,.9))) names(sortie)[1:2]=c("mean_F","mean_M") } if(k>0){ vr = paste(nom[1:k],collapse = " + ") fm = paste("y ~ ",vr,sep="") reg = glm(fm, family=binomial, data=subfrenchmotor) yp = predict(reg, type="response") yp_F = yp[idx_F] yp_M = yp[idx_M] sortie = c(mean(yp_F),mean(yp_M),quantile(yp_F,c(.1,.9)),quantile(yp_M,c(.1,.9))) names(sortie)[1:2]=c("mean_F","mean_M") } sortie}

Let us not compute it for all variables

N = 0:15 M = Vectorize(metric_gender)(N)

and plot it

plot(N,M[1,]*100, xlab="Number of predictive variables (without gender)", ylab= "Average predicted claims frequency (%)", type="b", pch=19, col=COLORS[2], ylim=c(8.12,9)) lines(N, M[2,]*100, type="b", pch=15, col=COLORS[3])

Interestingly, we can clearly see that with 15 explanatory variables, even if our model is gender-blind (since it is not in the training dataset), our model reproduce the difference we can observe in the dataset : annual claim frequency for men is almost 9% and 8.2% for women.

Actually, it is not possible to predict the gender for our 15 variables (below is the ROC curve of the logistic regression to predict the gender)

metric_gender_2= function(k =3){ if(k==0){ reg = glm((sensitive=="Female")~1, family=binomial, data=frenchmotor) } if(k>0){ vr = paste(nom[1:k],collapse = " + ") fm_genre = paste('(sensitive=="Female") ~ ',vr,sep="") reg = glm(fm_genre, family=binomial, data=frenchmotor) } pred = prediction(predict(reg,type="response"),(frenchmotor$sensitive=="Female")) performance(pred,"tpr","fpr")} plot(metric_gender_2(15))

but still, when using 15 variables, we obtain discrimination in our portfolio, since the average predictions for mean and women are significantly difference (even if our models are, per se, gender-blind).

To leave a comment for the author, please follow the link and comment on their blog: R-english – Freakonometrics.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

R-bloggers

R news and tutorials contributed by hundreds of R bloggers

Discrimination by proxy (a real case study)

Related

Related

Never miss an update! Subscribe to R-bloggers to receive e-mails with the latest R posts. (You will not see this message again.)

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)