[This article was first published on rdata.lu Blog | Data science with R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Hello everyone,

In the last post we have decided to continue our study with the logistic regression. We have obtained the following ROC curve with an area under the curve (AUC) of 0.843.

Now we imagine 2 scenarios. In the first one, we suppose we have a large budget and we want to target many customers. In the second one, we have a restricted budget. Hence, we want to contact just 25% of the customers in our database.
For the first scenario, we will focus on the threshold value. We want to optimize the number of good classifications.
Then, in the second scenario, we will focus on the probability that each customer has to churn. We will sort our database by their probability to churn, we also call this probability, the score. Then we will select the top 25% customers of our database based on their probability to churn by using a lift curve.

### Scenario 1: Threshold Optimization

Each customer has a score that corresponds to his probability to churn. Let’s see the score of the five first customers of our database.

head(pred_logistic,5)
##          1          2          3          4          5
## 0.62513394 0.32512336 0.52812027 0.31273422 0.01264661

Originally, the threshold is 0.5. We predict a positive answer if the score is higher than 0.5 and a negative answer if the score is lower than 0.5.
It means for the 5 customers above that the model predicts a positive answer for customer 1 and customer 3 and a negative answer for the 3 other customers.

To optimize the threshold, we want to compute different statistical indicators (accuracy, precision, sensitivity, f1 and kappa) for different threshold values from 0.05 to 0.847 with a step value of 0.001. We don’t go above 0.847 because after this value, we just have negative answers.

if (!require("tidyverse")) install.packages("tidyverse")
library("tidyverse")
if (!require("caret")) install.packages("caret")
library("caret")
if (!require("e1071")) install.packages("e1071")
library("e1071")
comp = cbind.data.frame(answer = db_test$ChurnNum, pred=pred_logistic) %>% arrange(desc(pred)) indic_perf = function(x){ compare = comp %>% mutate(pred = ifelse(pred>x,1,0)) if(ncol(table(compare))>1){ mat = confusionMatrix(table(compare), positive = "1") #acuracy acc = mat$overall["Accuracy"]

#Kappa
kap = mat$overall["Kappa"] #sensitivity sen = mat$byClass["Sensitivity"]

#F1
f1_stat = mat$byClass["F1"] #Precision prec = mat$byClass["Precision"]

}else{
acc = NA
prec = NA
sen = NA
kap = NA
f1_stat = NA
}
return(data.frame(threshold = x, accuracy = acc,
precision = prec, sensitivity = sen,
kappa = kap, f1= f1_stat))
}
indics = do.call(rbind, lapply(seq(0.05,0.95, by=0.001),
indic_perf)) %>%
filter(!is.na(accuracy))

if (!require("plotly")) install.packages("plotly")
library("plotly")
if (!require("IRdisplay")) install.packages("IRdisplay")
library("IRdisplay")

gather_indics = tidyr::gather(indics, variable,
value, -threshold) %>%
group_by(variable) %>%
mutate(color =  (max(value) == value),
threshold = as.numeric(threshold) )

q=ggplot(gather_indics , aes(x= threshold, y=value)) +
ggtitle("Indicator values by thresholds")+
geom_point(aes(color = color), size=0.5) +
facet_wrap(~variable, scales = 'free_x') +
scale_color_manual(values = c(NA, "tomato")) +
labs(x="thresholds", y=" ") +
geom_line(color="navy") + theme_bw()+
theme( legend.position="none")
offline(ggplotly(q),  width = '100%')