Churn Analysis – Part 1: Model Selection
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Hello everyone,
Today we will make a churn analysis with a dataset provided by IBM.
You can find the dataset here.
What is a churn?
We can shortly define customer churn (most commonly called “churn”) as customers that stop doing business with a company or a service. There are customer churns in different business area. In this post, we will focus on the telecom area. Here, we want to predict which customers will leave their current telecom provider. This type of studies are called churn analysis. It’s a trendy topic in customer relationship management (CRM) departments because it costs more money to find new customers than keeping the existing ones. So companies want to prevent them to leave.
How could we identify customers who are likely to leave?
To identify the customers, we need to have a database with data about the previous customers that churned. Using this data, we develop a model which identifies customers that have a profile close to the ones that already left.
To simulate an experiment where we want to predict if our customers will churn, we need to work with a partitioned database. The database has 2 parts, one part will be the training set. This will be used to create the model. The second part will be the testing set which will be used to evaluate our model.
In this case we know customer answers from the testing dataset so we can compare the model prediction with the true answers. Nevertheless in reality, we don’t know what will be the true answers. So we have to target mainly customers with high probability to churn. This probability is given by our model.
Let’s start
We import data and look at quick insight results. We encode the SeniorCitizen
variable as a factor variable.
#to manipulate data if (!require("tidyverse")) install.packages("tidyverse") library("tidyverse") path = "https://community.watsonanalytics.com/wp-content/uploads/2015/03/WA_Fn-UseC_-Telco-Customer-Churn.csv?cm_mc_uid=58920755505115141495567&cm_mc_sid_50200000=1514149556&cm_mc_sid_52640000=1514149556" db_churn = read_csv(path) #To see variables structure #str(db_churn) db_churn$SeniorCitizen = as.factor(db_churn$SeniorCitizen) summary(db_churn) ## customerID gender SeniorCitizen Partner ## Length:7043 Length:7043 0:5901 Length:7043 ## Class :character Class :character 1:1142 Class :character ## Mode :character Mode :character Mode :character ## ## ## ## ## Dependents tenure PhoneService MultipleLines ## Length:7043 Min. : 0.00 Length:7043 Length:7043 ## Class :character 1st Qu.: 9.00 Class :character Class :character ## Mode :character Median :29.00 Mode :character Mode :character ## Mean :32.37 ## 3rd Qu.:55.00 ## Max. :72.00 ## ## InternetService OnlineSecurity OnlineBackup ## Length:7043 Length:7043 Length:7043 ## Class :character Class :character Class :character ## Mode :character Mode :character Mode :character ## ## ## ## ## DeviceProtection TechSupport StreamingTV ## Length:7043 Length:7043 Length:7043 ## Class :character Class :character Class :character ## Mode :character Mode :character Mode :character ## ## ## ## ## StreamingMovies Contract PaperlessBilling ## Length:7043 Length:7043 Length:7043 ## Class :character Class :character Class :character ## Mode :character Mode :character Mode :character ## ## ## ## ## PaymentMethod MonthlyCharges TotalCharges Churn ## Length:7043 Min. : 18.25 Min. : 18.8 Length:7043 ## Class :character 1st Qu.: 35.50 1st Qu.: 401.4 Class :character ## Mode :character Median : 70.35 Median :1397.5 Mode :character ## Mean : 64.76 Mean :2283.3 ## 3rd Qu.: 89.85 3rd Qu.:3794.7 ## Max. :118.75 Max. :8684.8 ## NA's :11
There are 3 numerical variables and 18 categorical variables. Then we observe the churn rate.
if (!require("janitor")) install.packages("janitor") library("janitor") prop = tabyl(db_churn$Churn) prop ## db_churn$Churn n percent ## 1 No 5174 0.7346301 ## 2 Yes 1869 0.2653699
We realize that we have only 11 missing values in our dataset and they are all linked to the variable TotalCharges
and none of them are churning. They represent only 0,16 % of our total observations. So we decide to remove them.
n_NA = db_churn %>% filter(is.na(TotalCharges)) %>% select(Churn) #The next line give the percentage of missing values in our dataset #100*nrow(n_NA)/nrow(db_churn) #11 NA, that 0.16% of our database and none of them decode to churn db_churn = db_churn %>% filter(!is.na(TotalCharges))
Now we can split our database in 2 parts. The training part and the testing part.
if (!require("caret")) install.packages("caret") library("caret") set.seed(7) trainId = createDataPartition(db_churn$Churn, p=0.7, list=FALSE,times=1) db_train = db_churn[trainId,] db_test = db_churn[-trainId,]
As we have seen in summary(df)
, there are 3 numerical variables and they don’t have the same scale. Hence, we need to use a method to rescale them. First, we want to see their distribution.
gather_train =gather(db_train %>% select(customerID, MonthlyCharges,TotalCharges, tenure), variable, value, -customerID) ggplot(gather_train , aes(value)) + facet_wrap(~variable, scales = 'free_x') + geom_histogram() + theme_bw()
None of the variables on the graph above has a gaussian distribution, so we rescale them without standardization.
normalize = function(x) { result = (x - min(x, na.rm = TRUE) ) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE)) return(result) } norm.train = lapply(db_train %>% select(MonthlyCharges, TotalCharges, tenure), normalize) norm.train = do.call(cbind, norm.train) %>% as.data.frame()
Then we see in summary(df)
that some variables have the following factors:
- “Yes”
- “No”
- “No internet service” (or “No phone service”).
The third factor doesn’t give anymore no more informations so we recode the third factor in “No”.
factor.train = lapply(db_train %>% select(-customerID,-MonthlyCharges, -TotalCharges, -tenure), function(x){ x = gsub("No internet service", "No", x) x = gsub("No phone service", "No", x) return(x) }) factor.train = do.call(cbind, factor.train) %>% as.data.frame() db_train = cbind( customerID = db_train[,1], factor.train, norm.train)
Now that our training set is clean, we can estimate our train with 3 different models.
- Decision tree
- Random Forest
- Logistic Regression
Decision tree
First, we make a decision tree with library("rpart")
. We keep the original options.
if (!require("rpart")) install.packages("rpart") library("rpart") if (!require("rpart.plot")) install.packages("rpart.plot") library("rpart.plot") tree = rpart(Churn ~., data = db_train %>% select(-customerID), method="class") rpart.plot(tree)
In the graphic above, we see the decision tree that groups observations by their variable values. A decision tree has 2 mains components, leaves and nodes.
- Leaves represent a group of observation. For each leaf, an answer is given, “Yes” or “No”. Below these answers, figures represent the percentage of churn in a leaf and finally we see the percentage of total observations in the leaf.
- Nodes show which varible where used to seperate a leaf in two sub-leaves
Contract is the main variable in the churn decision. It makes sense because it is harder to change telecom providers if customers have a long term contract than a month-to-month contract.
Random forest
Imagine that you want to buy Kendrick Lamar’s new album. You ask a friend if you should buy it. Your friend tells you: “Buy it”. But you are not sure that your friend and you share the same musical taste so you decide to ask the same question to 29 other friends. At the end you have 20 pros and 10 cons. By the majority you decide to buy it. Random forest works the same way!
Instead of using one tree, we will use many trees with different variable options to draw our model. Then for each observation, we choose their category by taking the majority vote of our different trees.
Here we use a 5-fold cross-validation method. We don’t have so many variables so we choose 5 different values to optimize the number of variables via tuneLenght
.
if (!require("randomForest")) install.packages("randomForest") library("randomForest") ctrl = trainControl(method = "cv", number=5, classProbs = TRUE, summaryFunction = twoClassSummary) model.rf = train(Churn ~., data = db_train %>% select(-customerID), method = "rf", ntree = 75, tuneLength = 5, metric = "ROC", trControl = ctrl) model.rf ## Random Forest ## ## 4924 samples ## 19 predictor ## 2 classes: 'No', 'Yes' ## ## No pre-processing ## Resampling: Cross-Validated (5 fold) ## Summary of sample sizes: 3939, 3939, 3939, 3940, 3939 ## Resampling results across tuning parameters: ## ## mtry ROC Sens Spec ## 2 0.8293918 0.9244813 0.4560674 ## 7 0.8180692 0.8946058 0.4988535 ## 12 0.8150772 0.8940526 0.4996227 ## 17 0.8129159 0.8882434 0.4934924 ## 23 0.8066348 0.8882434 0.4980843 ## ## ROC was used to select the optimal model using the largest value. ## The final value used for the model was mtry = 2.
Logistic regression
The logistic regression fits perfectly for a model that explains a binomial variable. We need to do 2 things. First, recode the churn variable as 0 for “No” and 1 for “Yes”. Second, we have to choose which variable combinations will the best explain the churn decision.
db_train$ChurnNum = ifelse(db_train$Churn == "Yes",1,0) good_model = step(glm(ChurnNum ~.,data = db_train %>% select(-customerID, -Churn ), family=binomial(link='logit')), direction="both") ## Start: AIC=4115.6 ## ChurnNum ~ gender + SeniorCitizen + Partner + Dependents + PhoneService + ## MultipleLines + InternetService + OnlineSecurity + OnlineBackup + ## DeviceProtection + TechSupport + StreamingTV + StreamingMovies + ## Contract + PaperlessBilling + PaymentMethod + MonthlyCharges + ## TotalCharges + tenure ## ## Df Deviance AIC ## - Partner 1 4067.6 4113.6 ## - TechSupport 1 4067.7 4113.7 ## - Dependents 1 4067.9 4113.9 ## - gender 1 4068.0 4114.0 ## - OnlineSecurity 1 4068.1 4114.1 ## - OnlineBackup 1 4068.2 4114.2 ## - PhoneService 1 4068.3 4114.3 ## <none> 4067.6 4115.6 ## - DeviceProtection 1 4069.8 4115.8 ## - MonthlyCharges 1 4070.1 4116.1 ## - SeniorCitizen 1 4070.5 4116.5 ## - StreamingMovies 1 4071.0 4117.0 ## - StreamingTV 1 4071.1 4117.1 ## - InternetService 2 4073.7 4117.7 ## - MultipleLines 1 4075.5 4121.5 ## - PaperlessBilling 1 4076.3 4122.3 ## - TotalCharges 1 4081.4 4127.4 ## - PaymentMethod 3 4088.7 4130.7 ## - Contract 2 4120.3 4164.3 ## - tenure 1 4144.6 4190.6 ## ## Step: AIC=4113.62 ## ChurnNum ~ gender + SeniorCitizen + Dependents + PhoneService + ## MultipleLines + InternetService + OnlineSecurity + OnlineBackup + ## DeviceProtection + TechSupport + StreamingTV + StreamingMovies + ## Contract + PaperlessBilling + PaymentMethod + MonthlyCharges + ## TotalCharges + tenure ## ## Df Deviance AIC ## - TechSupport 1 4067.8 4111.8 ## - gender 1 4068.0 4112.0 ## - Dependents 1 4068.1 4112.1 ## - OnlineSecurity 1 4068.2 4112.2 ## - OnlineBackup 1 4068.2 4112.2 ## - PhoneService 1 4068.3 4112.3 ## <none> 4067.6 4113.6 ## - DeviceProtection 1 4069.8 4113.8 ## - MonthlyCharges 1 4070.1 4114.1 ## - SeniorCitizen 1 4070.5 4114.5 ## - StreamingMovies 1 4071.1 4115.1 ## - StreamingTV 1 4071.1 4115.1 ## + Partner 1 4067.6 4115.6 ## - InternetService 2 4073.8 4115.8 ## - MultipleLines 1 4075.5 4119.5 ## - PaperlessBilling 1 4076.4 4120.4 ## - TotalCharges 1 4081.4 4125.4 ## - PaymentMethod 3 4088.7 4128.7 ## - Contract 2 4120.3 4162.3 ## - tenure 1 4145.4 4189.4 ## ## Step: AIC=4111.75 ## ChurnNum ~ gender + SeniorCitizen + Dependents + PhoneService + ## MultipleLines + InternetService + OnlineSecurity + OnlineBackup + ## DeviceProtection + StreamingTV + StreamingMovies + Contract + ## PaperlessBilling + PaymentMethod + MonthlyCharges + TotalCharges + ## tenure ## ## Df Deviance AIC ## - gender 1 4068.2 4110.2 ## - Dependents 1 4068.2 4110.2 ## - OnlineSecurity 1 4068.3 4110.3 ## <none> 4067.8 4111.8 ## - OnlineBackup 1 4070.6 4112.6 ## - SeniorCitizen 1 4070.7 4112.7 ## + TechSupport 1 4067.6 4113.6 ## + Partner 1 4067.7 4113.7 ## - PhoneService 1 4072.8 4114.8 ## - DeviceProtection 1 4075.6 4117.6 ## - PaperlessBilling 1 4076.5 4118.5 ## - TotalCharges 1 4081.4 4123.4 ## - MonthlyCharges 1 4083.9 4125.9 ## - StreamingTV 1 4084.4 4126.4 ## - StreamingMovies 1 4084.8 4126.8 ## - PaymentMethod 3 4088.9 4126.9 ## - MultipleLines 1 4093.8 4135.8 ## - InternetService 2 4102.4 4142.4 ## - Contract 2 4121.1 4161.1 ## - tenure 1 4145.5 4187.5 ## ## Step: AIC=4110.16 ## ChurnNum ~ SeniorCitizen + Dependents + PhoneService + MultipleLines + ## InternetService + OnlineSecurity + OnlineBackup + DeviceProtection + ## StreamingTV + StreamingMovies + Contract + PaperlessBilling + ## PaymentMethod + MonthlyCharges + TotalCharges + tenure ## ## Df Deviance AIC ## - Dependents 1 4068.6 4108.6 ## - OnlineSecurity 1 4068.7 4108.7 ## <none> 4068.2 4110.2 ## - OnlineBackup 1 4071.0 4111.0 ## - SeniorCitizen 1 4071.1 4111.1 ## + gender 1 4067.8 4111.8 ## + TechSupport 1 4068.0 4112.0 ## + Partner 1 4068.1 4112.1 ## - PhoneService 1 4073.2 4113.2 ## - DeviceProtection 1 4075.9 4115.9 ## - PaperlessBilling 1 4077.0 4117.0 ## - TotalCharges 1 4081.8 4121.8 ## - MonthlyCharges 1 4084.3 4124.3 ## - StreamingTV 1 4084.8 4124.8 ## - StreamingMovies 1 4085.2 4125.2 ## - PaymentMethod 3 4089.3 4125.3 ## - MultipleLines 1 4094.2 4134.2 ## - InternetService 2 4102.8 4140.8 ## - Contract 2 4121.4 4159.4 ## - tenure 1 4145.9 4185.9 ## ## Step: AIC=4108.63 ## ChurnNum ~ SeniorCitizen + PhoneService + MultipleLines + InternetService + ## OnlineSecurity + OnlineBackup + DeviceProtection + StreamingTV + ## StreamingMovies + Contract + PaperlessBilling + PaymentMethod + ## MonthlyCharges + TotalCharges + tenure ## ## Df Deviance AIC ## - OnlineSecurity 1 4069.1 4107.1 ## <none> 4068.6 4108.6 ## - OnlineBackup 1 4071.5 4109.5 ## - SeniorCitizen 1 4072.1 4110.1 ## + Dependents 1 4068.2 4110.2 ## + gender 1 4068.2 4110.2 ## + Partner 1 4068.4 4110.4 ## + TechSupport 1 4068.5 4110.5 ## - PhoneService 1 4073.7 4111.7 ## - DeviceProtection 1 4076.4 4114.4 ## - PaperlessBilling 1 4077.4 4115.4 ## - TotalCharges 1 4082.5 4120.5 ## - MonthlyCharges 1 4084.8 4122.8 ## - StreamingTV 1 4085.3 4123.3 ## - StreamingMovies 1 4085.8 4123.8 ## - PaymentMethod 3 4090.0 4124.0 ## - MultipleLines 1 4094.8 4132.8 ## - InternetService 2 4103.5 4139.5 ## - Contract 2 4123.1 4159.1 ## - tenure 1 4147.3 4185.3 ## ## Step: AIC=4107.13 ## ChurnNum ~ SeniorCitizen + PhoneService + MultipleLines + InternetService + ## OnlineBackup + DeviceProtection + StreamingTV + StreamingMovies + ## Contract + PaperlessBilling + PaymentMethod + MonthlyCharges + ## TotalCharges + tenure ## ## Df Deviance AIC ## <none> 4069.1 4107.1 ## - SeniorCitizen 1 4072.5 4108.5 ## + OnlineSecurity 1 4068.6 4108.6 ## + Dependents 1 4068.7 4108.7 ## + gender 1 4068.7 4108.7 ## + Partner 1 4068.9 4108.9 ## + TechSupport 1 4069.0 4109.0 ## - OnlineBackup 1 4074.4 4110.4 ## - PaperlessBilling 1 4078.1 4114.1 ## - PhoneService 1 4080.8 4116.8 ## - DeviceProtection 1 4081.9 4117.9 ## - TotalCharges 1 4083.0 4119.0 ## - PaymentMethod 3 4090.4 4122.4 ## - StreamingTV 1 4101.9 4137.9 ## - StreamingMovies 1 4102.5 4138.5 ## - MultipleLines 1 4106.8 4142.8 ## - MonthlyCharges 1 4106.9 4142.9 ## - Contract 2 4123.4 4157.4 ## - InternetService 2 4146.5 4180.5 ## - tenure 1 4148.1 4184.1
It seems that 13 variables are enough for our model. Now let’s see coefficient estimates.
summary(good_model) ## ## Call: ## glm(formula = ChurnNum ~ SeniorCitizen + PhoneService + MultipleLines + ## InternetService + OnlineBackup + DeviceProtection + StreamingTV + ## StreamingMovies + Contract + PaperlessBilling + PaymentMethod + ## MonthlyCharges + TotalCharges + tenure, family = binomial(link = "logit"), ## data = db_train %>% select(-customerID, -Churn)) ## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -1.9012 -0.6731 -0.2923 0.7290 3.3888 ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) 0.60150 0.22743 2.645 0.008174 ## SeniorCitizen1 0.18366 0.09965 1.843 0.065337 ## PhoneServiceYes 1.04493 0.30762 3.397 0.000682 ## MultipleLinesYes 0.69549 0.11417 6.092 1.12e-09 ## InternetServiceFiber optic 2.82341 0.32796 8.609 < 2e-16 ## InternetServiceNo -2.87861 0.40080 -7.182 6.86e-13 ## OnlineBackupYes 0.25915 0.11367 2.280 0.022619 ## DeviceProtectionYes 0.41179 0.11563 3.561 0.000369 ## StreamingTVYes 0.92831 0.16380 5.667 1.45e-08 ## StreamingMoviesYes 0.92595 0.16194 5.718 1.08e-08 ## ContractOne year -0.66307 0.12800 -5.180 2.22e-07 ## ContractTwo year -1.24824 0.20212 -6.176 6.59e-10 ## PaperlessBillingYes 0.26399 0.08847 2.984 0.002847 ## PaymentMethodCredit card (automatic) -0.18326 0.13747 -1.333 0.182523 ## PaymentMethodElectronic check 0.30024 0.11163 2.690 0.007152 ## PaymentMethodMailed check -0.02901 0.13531 -0.214 0.830262 ## MonthlyCharges -8.02745 1.32045 -6.079 1.21e-09 ## TotalCharges 2.63684 0.72491 3.637 0.000275 ## tenure -4.24828 0.52098 -8.154 3.51e-16 ## ## (Intercept) ** ## SeniorCitizen1 . ## PhoneServiceYes *** ## MultipleLinesYes *** ## InternetServiceFiber optic *** ## InternetServiceNo *** ## OnlineBackupYes * ## DeviceProtectionYes *** ## StreamingTVYes *** ## StreamingMoviesYes *** ## ContractOne year *** ## ContractTwo year *** ## PaperlessBillingYes ** ## PaymentMethodCredit card (automatic) ## PaymentMethodElectronic check ** ## PaymentMethodMailed check ## MonthlyCharges *** ## TotalCharges *** ## tenure *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for binomial family taken to be 1) ## ## Null deviance: 5702.8 on 4923 degrees of freedom ## Residual deviance: 4069.1 on 4905 degrees of freedom ## AIC: 4107.1 ## ## Number of Fisher Scoring iterations: 6
Most of them are significants they have a p-value lower than 0.05. As we could expect, ContractOne year
, ContractTwo year
or a high value for the variable tenure
reduce the probability to churn.
Now that we have our 3 models, we decide to test them on our testing dataset. But first, we need to apply the same transformations that we appliedd to the training set.
norm.test = lapply(db_test %>% select(MonthlyCharges,TotalCharges, tenure), normalize) norm.test = do.call(cbind, norm.test) %>% as.data.frame() factor.test = lapply(db_test %>% select(-customerID,-MonthlyCharges, -TotalCharges, -tenure), function(x){ x = gsub("No internet service", "No", x) x = gsub("No phone service", "No", x) return(x) }) factor.test = do.call(cbind, factor.test) %>% as.data.frame() db_test = cbind( customerID = db_test[,1], factor.test , norm.test) db_test$ChurnNum = ifelse(db_test$Churn == "Yes", 1, 0) if (!require("ROSE")) install.packages("ROSE") library("ROSE") pred_tree = predict(tree, db_test %>% select(-customerID), type = c("prob"))[,2] pred_rf = predict(object=model.rf, db_test %>% select(-customerID), type='prob')[,2] pred_logistic = predict(good_model, db_test, type="response") roc_tree = roc.curve(response = db_test$ChurnNum, pred_tree, col = "#0d84da") roc_rf = roc.curve(response = db_test$ChurnNum, pred_rf, col = "#ef0a30", add.roc=TRUE) roc_logistic = roc.curve(response = db_test$ChurnNum, pred_logistic, col = "#45a163", add.roc=TRUE) legend("bottomright", legend=c("Decision Tree", "Random Forest", "Logistic Regression"), lwd = 2, col = c("#0d84da", "#ef0a30", "#45a163"))
The logistic regression gives the best model: a better true positive rate for less false positive observations.
In the next post we will focus on the different way to analyse results from the logistic regression.
Don’t hesitate to contact me if you have any comments or suggestions. See you for the next post.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.