Churn Analysis: Model Selection (Part 1)

[This article was first published on rdata.lu Blog | Data science with R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Hello everyone,

Today we will make a churn analysis with a dataset provided by IBM.
You can find the dataset here.

What is a churn?

We can shortly define customer churn (most commonly called “churn”) as customers that stop doing business with a company or a service. There are customer churns in different business area. In this post, we will focus on the telecom area. Here, we want to predict which customers will leave their current telecom provider. This type of studies are called churn analysis. It’s a trendy topic in customer relationship management (CRM) departments because it costs more money to find new customers than keeping the existing ones. So companies want to prevent them to leave.

How could we identify customers who are likely to leave?

To identify the customers, we need to have a database with data about the previous customers that churned. Using this data, we develop a model which identifies customers that have a profile close to the ones that already left.
To simulate an experiment where we want to predict if our customers will churn, we need to work with a partitioned database. The database has 2 parts, one part will be the training set. This will be used to create the model. The second part will be the testing set which will be used to evaluate our model.
In this case we know customer answers from the testing dataset so we can compare the model prediction with the true answers. Nevertheless in reality, we don’t know what will be the true answers. So we have to target mainly customers with high probability to churn. This probability is given by our model.

Let’s start

We import data and look at quick insight results. We encode the SeniorCitizen variable as a factor variable.

#to manipulate data
if (!require("tidyverse")) install.packages("tidyverse")
library("tidyverse")
path = "https://community.watsonanalytics.com/wp-content/uploads/2015/03/WA_Fn-UseC_-Telco-Customer-Churn.csv?cm_mc_uid=58920755505115141495567&cm_mc_sid_50200000=1514149556&cm_mc_sid_52640000=1514149556"
db_churn = read_csv(path)

#To see variables structure 
#str(db_churn)

db_churn$SeniorCitizen = as.factor(db_churn$SeniorCitizen)
summary(db_churn)
##   customerID           gender          SeniorCitizen   Partner         
##  Length:7043        Length:7043        0:5901        Length:7043       
##  Class :character   Class :character   1:1142        Class :character  
##  Mode  :character   Mode  :character                 Mode  :character  
##                                                                        
##                                                                        
##                                                                        
##                                                                        
##   Dependents            tenure      PhoneService       MultipleLines     
##  Length:7043        Min.   : 0.00   Length:7043        Length:7043       
##  Class :character   1st Qu.: 9.00   Class :character   Class :character  
##  Mode  :character   Median :29.00   Mode  :character   Mode  :character  
##                     Mean   :32.37                                        
##                     3rd Qu.:55.00                                        
##                     Max.   :72.00                                        
##                                                                          
##  InternetService    OnlineSecurity     OnlineBackup      
##  Length:7043        Length:7043        Length:7043       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##  DeviceProtection   TechSupport        StreamingTV       
##  Length:7043        Length:7043        Length:7043       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##  StreamingMovies      Contract         PaperlessBilling  
##  Length:7043        Length:7043        Length:7043       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##  PaymentMethod      MonthlyCharges    TotalCharges       Churn          
##  Length:7043        Min.   : 18.25   Min.   :  18.8   Length:7043       
##  Class :character   1st Qu.: 35.50   1st Qu.: 401.4   Class :character  
##  Mode  :character   Median : 70.35   Median :1397.5   Mode  :character  
##                     Mean   : 64.76   Mean   :2283.3                     
##                     3rd Qu.: 89.85   3rd Qu.:3794.7                     
##                     Max.   :118.75   Max.   :8684.8                     
##                                      NA's   :11

There are 3 numerical variables and 18 categorical variables. Then we observe the churn rate.

if (!require("janitor")) install.packages("janitor")
library("janitor")
prop = tabyl(db_churn$Churn)
prop
##   db_churn$Churn    n   percent
## 1             No 5174 0.7346301
## 2            Yes 1869 0.2653699

We realize that we have only 11 missing values in our dataset and they are all linked to the variable TotalCharges and none of them are churning. They represent only 0,16 % of our total observations. So we decide to remove them.

n_NA = db_churn %>%
        filter(is.na(TotalCharges)) %>%
        select(Churn)
#The next line give the percentage of missing values in our dataset
#100*nrow(n_NA)/nrow(db_churn)

#11 NA, that 0.16% of our database and none of them decode to churn
db_churn = db_churn %>%
        filter(!is.na(TotalCharges))

Now we can split our database in 2 parts. The training part and the testing part.

if (!require("caret")) install.packages("caret")
library("caret")
set.seed(7)
trainId = createDataPartition(db_churn$Churn, 
                       p=0.7, list=FALSE,times=1)
 
db_train = db_churn[trainId,]
db_test = db_churn[-trainId,]

As we have seen in summary(df), there are 3 numerical variables and they don’t have the same scale. Hence, we need to use a method to rescale them. First, we want to see their distribution.

gather_train =gather(db_train %>% 
               select(customerID, MonthlyCharges,TotalCharges, tenure),
               variable, value,
       -customerID)
ggplot(gather_train , aes(value)) + facet_wrap(~variable, scales = 'free_x') +
        geom_histogram() + theme_bw()

None of the variables on the graph above has a gaussian distribution, so we rescale them without standardization.

normalize = function(x) {
        result = (x - min(x, na.rm = TRUE)
                  ) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
        return(result)
}
norm.train = lapply(db_train %>% 
                    select(MonthlyCharges, TotalCharges, tenure),
                    normalize)
norm.train = do.call(cbind, norm.train) %>%
             as.data.frame()

Then we see in summary(df) that some variables have the following factors:

  • “Yes”
  • “No”
  • “No internet service” (or “No phone service”).

The third factor doesn’t give anymore no more informations so we recode the third factor in “No”.

factor.train = lapply(db_train %>% 
                      select(-customerID,-MonthlyCharges, 
                      -TotalCharges, -tenure), 
                      function(x){
        x = gsub("No internet service", "No", x)
        x = gsub("No phone service", "No", x)
        return(x)
})

factor.train = do.call(cbind, factor.train) %>% 
               as.data.frame()

db_train = cbind( customerID = db_train[,1], 
                  factor.train, norm.train)

Now that our training set is clean, we can estimate our train with 3 different models.

  1. Decision tree
  2. Random Forest
  3. Logistic Regression

Decision tree

First, we make a decision tree with library("rpart"). We keep the original options.

if (!require("rpart")) install.packages("rpart")
library("rpart")
if (!require("rpart.plot")) install.packages("rpart.plot")
library("rpart.plot")

tree = rpart(Churn ~., data = db_train %>% 
                     select(-customerID), method="class")
rpart.plot(tree)

In the graphic above, we see the decision tree that groups observations by their variable values. A decision tree has 2 mains components, leaves and nodes.

  • Leaves represent a group of observation. For each leaf, an answer is given, “Yes” or “No”. Below these answers, figures represent the percentage of churn in a leaf and finally we see the percentage of total observations in the leaf.
  • Nodes show which varible where used to seperate a leaf in two sub-leaves

Contract is the main variable in the churn decision. It makes sense because it is harder to change telecom providers if customers have a long term contract than a month-to-month contract.

Random forest

Imagine that you want to buy Kendrick Lamar’s new album. You ask a friend if you should buy it. Your friend tells you: “Buy it”. But you are not sure that your friend and you share the same musical taste so you decide to ask the same question to 29 other friends. At the end you have 20 pros and 10 cons. By the majority you decide to buy it. Random forest works the same way!
Instead of using one tree, we will use many trees with different variable options to draw our model. Then for each observation, we choose their category by taking the majority vote of our different trees.

Here we use a 5-fold cross-validation method. We don’t have so many variables so we choose 5 different values to optimize the number of variables via tuneLenght.

if (!require("randomForest")) install.packages("randomForest")
library("randomForest")

ctrl = trainControl(method = "cv", number=5, 
                classProbs = TRUE, summaryFunction = twoClassSummary)

model.rf = train(Churn ~., data = db_train %>% select(-customerID),
                         method = "rf",
                         ntree = 75,
                         tuneLength = 5,
                         metric = "ROC",
                         trControl = ctrl)
        
model.rf
## Random Forest 
## 
## 4924 samples
##   19 predictor
##    2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 3939, 3939, 3939, 3940, 3939 
## Resampling results across tuning parameters:
## 
##   mtry  ROC        Sens       Spec     
##    2    0.8293918  0.9244813  0.4560674
##    7    0.8180692  0.8946058  0.4988535
##   12    0.8150772  0.8940526  0.4996227
##   17    0.8129159  0.8882434  0.4934924
##   23    0.8066348  0.8882434  0.4980843
## 
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.

Logistic regression

The logistic regression fits perfectly for a model that explains a binomial variable. We need to do 2 things. First, recode the churn variable as 0 for “No” and 1 for “Yes”. Second, we have to choose which variable combinations will the best explain the churn decision.

db_train$ChurnNum = ifelse(db_train$Churn == "Yes",1,0)
good_model = step(glm(ChurnNum ~.,data = db_train %>% 
                  select(-customerID, -Churn ), 
                  family=binomial(link='logit')), 
                  direction="both")
## Start:  AIC=4115.6
## ChurnNum ~ gender + SeniorCitizen + Partner + Dependents + PhoneService + 
##     MultipleLines + InternetService + OnlineSecurity + OnlineBackup + 
##     DeviceProtection + TechSupport + StreamingTV + StreamingMovies + 
##     Contract + PaperlessBilling + PaymentMethod + MonthlyCharges + 
##     TotalCharges + tenure
## 
##                    Df Deviance    AIC
## - Partner           1   4067.6 4113.6
## - TechSupport       1   4067.7 4113.7
## - Dependents        1   4067.9 4113.9
## - gender            1   4068.0 4114.0
## - OnlineSecurity    1   4068.1 4114.1
## - OnlineBackup      1   4068.2 4114.2
## - PhoneService      1   4068.3 4114.3
## <none>                  4067.6 4115.6
## - DeviceProtection  1   4069.8 4115.8
## - MonthlyCharges    1   4070.1 4116.1
## - SeniorCitizen     1   4070.5 4116.5
## - StreamingMovies   1   4071.0 4117.0
## - StreamingTV       1   4071.1 4117.1
## - InternetService   2   4073.7 4117.7
## - MultipleLines     1   4075.5 4121.5
## - PaperlessBilling  1   4076.3 4122.3
## - TotalCharges      1   4081.4 4127.4
## - PaymentMethod     3   4088.7 4130.7
## - Contract          2   4120.3 4164.3
## - tenure            1   4144.6 4190.6
## 
## Step:  AIC=4113.62
## ChurnNum ~ gender + SeniorCitizen + Dependents + PhoneService + 
##     MultipleLines + InternetService + OnlineSecurity + OnlineBackup + 
##     DeviceProtection + TechSupport + StreamingTV + StreamingMovies + 
##     Contract + PaperlessBilling + PaymentMethod + MonthlyCharges + 
##     TotalCharges + tenure
## 
##                    Df Deviance    AIC
## - TechSupport       1   4067.8 4111.8
## - gender            1   4068.0 4112.0
## - Dependents        1   4068.1 4112.1
## - OnlineSecurity    1   4068.2 4112.2
## - OnlineBackup      1   4068.2 4112.2
## - PhoneService      1   4068.3 4112.3
## <none>                  4067.6 4113.6
## - DeviceProtection  1   4069.8 4113.8
## - MonthlyCharges    1   4070.1 4114.1
## - SeniorCitizen     1   4070.5 4114.5
## - StreamingMovies   1   4071.1 4115.1
## - StreamingTV       1   4071.1 4115.1
## + Partner           1   4067.6 4115.6
## - InternetService   2   4073.8 4115.8
## - MultipleLines     1   4075.5 4119.5
## - PaperlessBilling  1   4076.4 4120.4
## - TotalCharges      1   4081.4 4125.4
## - PaymentMethod     3   4088.7 4128.7
## - Contract          2   4120.3 4162.3
## - tenure            1   4145.4 4189.4
## 
## Step:  AIC=4111.75
## ChurnNum ~ gender + SeniorCitizen + Dependents + PhoneService + 
##     MultipleLines + InternetService + OnlineSecurity + OnlineBackup + 
##     DeviceProtection + StreamingTV + StreamingMovies + Contract + 
##     PaperlessBilling + PaymentMethod + MonthlyCharges + TotalCharges + 
##     tenure
## 
##                    Df Deviance    AIC
## - gender            1   4068.2 4110.2
## - Dependents        1   4068.2 4110.2
## - OnlineSecurity    1   4068.3 4110.3
## <none>                  4067.8 4111.8
## - OnlineBackup      1   4070.6 4112.6
## - SeniorCitizen     1   4070.7 4112.7
## + TechSupport       1   4067.6 4113.6
## + Partner           1   4067.7 4113.7
## - PhoneService      1   4072.8 4114.8
## - DeviceProtection  1   4075.6 4117.6
## - PaperlessBilling  1   4076.5 4118.5
## - TotalCharges      1   4081.4 4123.4
## - MonthlyCharges    1   4083.9 4125.9
## - StreamingTV       1   4084.4 4126.4
## - StreamingMovies   1   4084.8 4126.8
## - PaymentMethod     3   4088.9 4126.9
## - MultipleLines     1   4093.8 4135.8
## - InternetService   2   4102.4 4142.4
## - Contract          2   4121.1 4161.1
## - tenure            1   4145.5 4187.5
## 
## Step:  AIC=4110.16
## ChurnNum ~ SeniorCitizen + Dependents + PhoneService + MultipleLines + 
##     InternetService + OnlineSecurity + OnlineBackup + DeviceProtection + 
##     StreamingTV + StreamingMovies + Contract + PaperlessBilling + 
##     PaymentMethod + MonthlyCharges + TotalCharges + tenure
## 
##                    Df Deviance    AIC
## - Dependents        1   4068.6 4108.6
## - OnlineSecurity    1   4068.7 4108.7
## <none>                  4068.2 4110.2
## - OnlineBackup      1   4071.0 4111.0
## - SeniorCitizen     1   4071.1 4111.1
## + gender            1   4067.8 4111.8
## + TechSupport       1   4068.0 4112.0
## + Partner           1   4068.1 4112.1
## - PhoneService      1   4073.2 4113.2
## - DeviceProtection  1   4075.9 4115.9
## - PaperlessBilling  1   4077.0 4117.0
## - TotalCharges      1   4081.8 4121.8
## - MonthlyCharges    1   4084.3 4124.3
## - StreamingTV       1   4084.8 4124.8
## - StreamingMovies   1   4085.2 4125.2
## - PaymentMethod     3   4089.3 4125.3
## - MultipleLines     1   4094.2 4134.2
## - InternetService   2   4102.8 4140.8
## - Contract          2   4121.4 4159.4
## - tenure            1   4145.9 4185.9
## 
## Step:  AIC=4108.63
## ChurnNum ~ SeniorCitizen + PhoneService + MultipleLines + InternetService + 
##     OnlineSecurity + OnlineBackup + DeviceProtection + StreamingTV + 
##     StreamingMovies + Contract + PaperlessBilling + PaymentMethod + 
##     MonthlyCharges + TotalCharges + tenure
## 
##                    Df Deviance    AIC
## - OnlineSecurity    1   4069.1 4107.1
## <none>                  4068.6 4108.6
## - OnlineBackup      1   4071.5 4109.5
## - SeniorCitizen     1   4072.1 4110.1
## + Dependents        1   4068.2 4110.2
## + gender            1   4068.2 4110.2
## + Partner           1   4068.4 4110.4
## + TechSupport       1   4068.5 4110.5
## - PhoneService      1   4073.7 4111.7
## - DeviceProtection  1   4076.4 4114.4
## - PaperlessBilling  1   4077.4 4115.4
## - TotalCharges      1   4082.5 4120.5
## - MonthlyCharges    1   4084.8 4122.8
## - StreamingTV       1   4085.3 4123.3
## - StreamingMovies   1   4085.8 4123.8
## - PaymentMethod     3   4090.0 4124.0
## - MultipleLines     1   4094.8 4132.8
## - InternetService   2   4103.5 4139.5
## - Contract          2   4123.1 4159.1
## - tenure            1   4147.3 4185.3
## 
## Step:  AIC=4107.13
## ChurnNum ~ SeniorCitizen + PhoneService + MultipleLines + InternetService + 
##     OnlineBackup + DeviceProtection + StreamingTV + StreamingMovies + 
##     Contract + PaperlessBilling + PaymentMethod + MonthlyCharges + 
##     TotalCharges + tenure
## 
##                    Df Deviance    AIC
## <none>                  4069.1 4107.1
## - SeniorCitizen     1   4072.5 4108.5
## + OnlineSecurity    1   4068.6 4108.6
## + Dependents        1   4068.7 4108.7
## + gender            1   4068.7 4108.7
## + Partner           1   4068.9 4108.9
## + TechSupport       1   4069.0 4109.0
## - OnlineBackup      1   4074.4 4110.4
## - PaperlessBilling  1   4078.1 4114.1
## - PhoneService      1   4080.8 4116.8
## - DeviceProtection  1   4081.9 4117.9
## - TotalCharges      1   4083.0 4119.0
## - PaymentMethod     3   4090.4 4122.4
## - StreamingTV       1   4101.9 4137.9
## - StreamingMovies   1   4102.5 4138.5
## - MultipleLines     1   4106.8 4142.8
## - MonthlyCharges    1   4106.9 4142.9
## - Contract          2   4123.4 4157.4
## - InternetService   2   4146.5 4180.5
## - tenure            1   4148.1 4184.1

It seems that 13 variables are enough for our model. Now let’s see coefficient estimates.

summary(good_model)
## 
## Call:
## glm(formula = ChurnNum ~ SeniorCitizen + PhoneService + MultipleLines + 
##     InternetService + OnlineBackup + DeviceProtection + StreamingTV + 
##     StreamingMovies + Contract + PaperlessBilling + PaymentMethod + 
##     MonthlyCharges + TotalCharges + tenure, family = binomial(link = "logit"), 
##     data = db_train %>% select(-customerID, -Churn))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9012  -0.6731  -0.2923   0.7290   3.3888  
## 
## Coefficients:
##                                      Estimate Std. Error z value Pr(>|z|)
## (Intercept)                           0.60150    0.22743   2.645 0.008174
## SeniorCitizen1                        0.18366    0.09965   1.843 0.065337
## PhoneServiceYes                       1.04493    0.30762   3.397 0.000682
## MultipleLinesYes                      0.69549    0.11417   6.092 1.12e-09
## InternetServiceFiber optic            2.82341    0.32796   8.609  < 2e-16
## InternetServiceNo                    -2.87861    0.40080  -7.182 6.86e-13
## OnlineBackupYes                       0.25915    0.11367   2.280 0.022619
## DeviceProtectionYes                   0.41179    0.11563   3.561 0.000369
## StreamingTVYes                        0.92831    0.16380   5.667 1.45e-08
## StreamingMoviesYes                    0.92595    0.16194   5.718 1.08e-08
## ContractOne year                     -0.66307    0.12800  -5.180 2.22e-07
## ContractTwo year                     -1.24824    0.20212  -6.176 6.59e-10
## PaperlessBillingYes                   0.26399    0.08847   2.984 0.002847
## PaymentMethodCredit card (automatic) -0.18326    0.13747  -1.333 0.182523
## PaymentMethodElectronic check         0.30024    0.11163   2.690 0.007152
## PaymentMethodMailed check            -0.02901    0.13531  -0.214 0.830262
## MonthlyCharges                       -8.02745    1.32045  -6.079 1.21e-09
## TotalCharges                          2.63684    0.72491   3.637 0.000275
## tenure                               -4.24828    0.52098  -8.154 3.51e-16
##                                         
## (Intercept)                          ** 
## SeniorCitizen1                       .  
## PhoneServiceYes                      ***
## MultipleLinesYes                     ***
## InternetServiceFiber optic           ***
## InternetServiceNo                    ***
## OnlineBackupYes                      *  
## DeviceProtectionYes                  ***
## StreamingTVYes                       ***
## StreamingMoviesYes                   ***
## ContractOne year                     ***
## ContractTwo year                     ***
## PaperlessBillingYes                  ** 
## PaymentMethodCredit card (automatic)    
## PaymentMethodElectronic check        ** 
## PaymentMethodMailed check               
## MonthlyCharges                       ***
## TotalCharges                         ***
## tenure                               ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5702.8  on 4923  degrees of freedom
## Residual deviance: 4069.1  on 4905  degrees of freedom
## AIC: 4107.1
## 
## Number of Fisher Scoring iterations: 6

Most of them are significants they have a p-value lower than 0.05. As we could expect, ContractOne year, ContractTwo year or a high value for the variable tenure reduce the probability to churn.

Now that we have our 3 models, we decide to test them on our testing dataset. But first, we need to apply the same transformations that we appliedd to the training set.

norm.test = lapply(db_test %>% select(MonthlyCharges,TotalCharges, tenure), normalize)
norm.test = do.call(cbind, norm.test) %>%
        as.data.frame()

factor.test = lapply(db_test %>% 
                     select(-customerID,-MonthlyCharges, 
                            -TotalCharges, -tenure), 
                     function(x){
        x = gsub("No internet service", "No", x)
        x = gsub("No phone service", "No", x)
        return(x)
})
factor.test = do.call(cbind, factor.test) %>% as.data.frame()


db_test = cbind( customerID = db_test[,1], factor.test , norm.test)
db_test$ChurnNum  = ifelse(db_test$Churn == "Yes", 1, 0)

if (!require("ROSE")) install.packages("ROSE")
library("ROSE")
pred_tree = predict(tree, db_test %>% 
                    select(-customerID), type = c("prob"))[,2]
pred_rf = predict(object=model.rf, db_test %>% 
                          select(-customerID), type='prob')[,2]
pred_logistic = predict(good_model, db_test, type="response")

roc_tree = roc.curve(response = db_test$ChurnNum, pred_tree, 
                   col = "#0d84da")
roc_rf = roc.curve(response = db_test$ChurnNum, pred_rf, 
                   col = "#ef0a30", add.roc=TRUE)
roc_logistic = roc.curve(response = db_test$ChurnNum, pred_logistic, 
                   col = "#45a163", add.roc=TRUE)

legend("bottomright", legend=c("Decision Tree", "Random Forest", 
                               "Logistic Regression"), 
       lwd = 2, col = c("#0d84da", "#ef0a30", "#45a163"))

The logistic regression gives the best model: a better true positive rate for less false positive observations.

In the next post we will focus on the different way to analyse results from the logistic regression.

Don’t hesitate to contact me if you have any comments or suggestions. See you for the next post.

To leave a comment for the author, please follow the link and comment on their blog: rdata.lu Blog | Data science with R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)