**rdata.lu Blog | Data science with R**, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Hello everyone,

In the last post we have decided to continue our study with the logistic regression. We have obtained the following ROC curve with an area under the curve (AUC) of `0.843`

.

Now we imagine 2 scenarios. In the first one, we suppose we have a large budget and we want to target many customers. In the second one, we have a restricted budget. Hence, we want to contact just 25% of the customers in our database.

For the first scenario, we will focus on the threshold value. We want to optimize the number of good classifications.

Then, in the second scenario, we will focus on the probability that each customer has to churn. We will sort our database by their probability to churn, we also call this probability, the score. Then we will select the top 25% customers of our database based on their probability to churn by using a lift curve.

### Scenario 1: Threshold Optimization

Each customer has a score that corresponds to his probability to churn. Let’s see the score of the five first customers of our database.

`head(pred_logistic,5)`

```
## 1 2 3 4 5
## 0.62513394 0.32512336 0.52812027 0.31273422 0.01264661
```

Originally, the threshold is `0.5`

. We predict a positive answer if the score is higher than `0.5`

and a negative answer if the score is lower than `0.5`

.

It means for the 5 customers above that the model predicts a positive answer for customer 1 and customer 3 and a negative answer for the 3 other customers.

To optimize the threshold, we want to compute different statistical indicators (accuracy, precision, sensitivity, f1 and kappa) for different threshold values from `0.05`

to `0.847`

with a step value of `0.001`

. We don’t go above `0.847`

because after this value, we just have negative answers.

```
if (!require("tidyverse")) install.packages("tidyverse")
library("tidyverse")
if (!require("caret")) install.packages("caret")
library("caret")
if (!require("e1071")) install.packages("e1071")
library("e1071")
comp = cbind.data.frame(answer = db_test$ChurnNum,
pred=pred_logistic) %>%
arrange(desc(pred))
indic_perf = function(x){
compare = comp %>%
mutate(pred = ifelse(pred>x,1,0))
if(ncol(table(compare))>1){
mat = confusionMatrix(table(compare), positive = "1")
#acuracy
acc = mat$overall["Accuracy"]
#Kappa
kap = mat$overall["Kappa"]
#sensitivity
sen = mat$byClass["Sensitivity"]
#F1
f1_stat = mat$byClass["F1"]
#Precision
prec = mat$byClass["Precision"]
}else{
acc = NA
prec = NA
sen = NA
kap = NA
f1_stat = NA
}
return(data.frame(threshold = x, accuracy = acc,
precision = prec, sensitivity = sen,
kappa = kap, f1= f1_stat))
}
indics = do.call(rbind, lapply(seq(0.05,0.95, by=0.001),
indic_perf)) %>%
filter(!is.na(accuracy))
if (!require("plotly")) install.packages("plotly")
library("plotly")
if (!require("IRdisplay")) install.packages("IRdisplay")
library("IRdisplay")
gather_indics = tidyr::gather(indics, variable,
value, -threshold) %>%
group_by(variable) %>%
mutate(color = (max(value) == value),
threshold = as.numeric(threshold) )
q=ggplot(gather_indics , aes(x= threshold, y=value)) +
ggtitle("Indicator values by thresholds")+
geom_point(aes(color = color), size=0.5) +
facet_wrap(~variable, scales = 'free_x') +
scale_color_manual(values = c(NA, "tomato")) +
labs(x="thresholds", y=" ") +
geom_line(color="navy") + theme_bw()+
theme( legend.position="none")
offline(ggplotly(q), width = '100%')
```

**leave a comment**for the author, please follow the link and comment on their blog:

**rdata.lu Blog | Data science with R**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.