[This article was first published on Data Apple » R Blogs in English, and kindly contributed to R-bloggers.]

Data mining techniques and algorithms such as Decision Tree, Naïve Bayes, Support Vector Machine, Random Forest, and Logistic Regression are “most commonly used for predicting a specific outcome such as response / no-response, high / medium / low-value customer, likely to buy / not buy.”[1]

In this article, we will demonstrate how to use R to build classification models that identify the potential customers who are likely to buy an insurance product. We will build models with Decision Tree, Random Forest, Naïve Bayes, and SVM, and then compare them to find the best one. The scenario and data are based on the public tutorial “Using Oracle Data Miner 11g Release 2”.[2]

## Data Observation and Preparation

The dataset is exported from the database mentioned in the above tutorial. Let’s take a look at the variables in the dataset.

> dim(dataset)

[1] 1015   31

> names(dataset)

[5] "STATE" "REGION" "SEX" "PROFESSION"

[9] "AGE" "HAS_CHILDREN" "SALARY" "N_OF_DEPENDENTS"

[13] "CAR_OWNERSHIP" "HOUSE_OWNERSHIP" "TIME_AS_CUSTOMER" "MARITAL_STATUS"

[17] "CREDIT_BALANCE" "BANK_FUNDS" "CHECKING_AMOUNT" "MONEY_MONTLY_OVERDRAWN"

[21] "T_AMOUNT_AUTOM_PAYMENTS" "MONTHLY_CHECKS_WRITTEN" "MORTGAGE_AMOUNT" "N_TRANS_ATM"

[25] "N_MORTGAGES" "N_TRANS_TELLER" "CREDIT_CARD_LIMITS" "N_TRANS_KIOSK"

[29] "N_TRANS_WEB_BANK" "LTV" "LTV_BIN"

There are 1015 cases and 31 variables in the dataset. “BUY_INSURANCE” is the dependent variable. The other variables hold the customers’ basic information, geographical information, and bank account information. In R, the variables should be stored as either “factor” or “numeric”.

> table(dataset$BUY_INSURANCE)

No Yes

742 273

Of the 1015 cases, 273 people bought the insurance product in the past.

Checking Missing Value

> sum(complete.cases(dataset))

[1] 1015

There are no missing values in the dataset for us to deal with.

Removing Unnecessary Variables

The variables “CUSTOMER_ID”, “LAST”, and “FIRST” do not help with the data mining, so we can remove them.

> dataset <- subset(dataset, select = -c(CUSTOMER_ID, LAST, FIRST))

Some of the algorithms limit the number of levels a categorical variable may have. If a variable has too many levels, we need to combine the lower levels into higher levels to reduce the total number of levels, or we can simply remove the variable if doing so doesn’t influence the data mining result. Let’s check how many levels there are in the following categorical variables.

> dim(table(dataset$PROFESSION))

[1] 95

> dim(table(dataset$STATE))

[1] 22

> dim(table(dataset$REGION))

[1] 5

PROFESSION has 95 levels, which exceeds the limit of some algorithms, and STATE covers only 22 of the 50 states in this dataset. For simplicity, we remove both variables here.

> dataset <-subset(dataset, select = -c(PROFESSION, STATE))
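The alternative mentioned above, combining sparse levels instead of dropping the variable, could look roughly like the sketch below. It uses a small made-up vector rather than the real PROFESSION column, and `lump_rare_levels()` is a hypothetical helper written for this illustration, not a function from any package.

```r
# Toy data standing in for a high-cardinality column such as PROFESSION
profession <- factor(c("Nurse", "Nurse", "Nurse", "Teacher", "Teacher",
                       "Pilot", "Chef", "Welder"))

# Hypothetical helper: keep levels seen at least `min_count` times,
# relabel every other value as `other`
lump_rare_levels <- function(x, min_count = 2, other = "Other") {
  counts <- table(x)
  rare <- names(counts)[counts < min_count]
  x <- as.character(x)
  x[x %in% rare] <- other
  factor(x)
}

lumped <- lump_rare_levels(profession, min_count = 2)
nlevels(profession)  # 5 levels before lumping
nlevels(lumped)      # 3 levels after lumping: Nurse, Other, Teacher
table(lumped)
```

Whether lumping is preferable to removal depends on how much signal the rare levels carry; the threshold of 2 here is arbitrary.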

Since variable of LTV has been binned into the LTV_BIN already in the dataset, we remove LTV as well.

> dataset <-subset(dataset, select = -c(LTV))

Transferring the Data Type

> dataset$REGION <-as.factor(dataset$REGION)

> dataset$SEX <-as.factor(dataset$SEX)

> dataset$CAR_OWNERSHIP <-as.factor(dataset$CAR_OWNERSHIP)

> dataset$HOUSE_OWNERSHIP <-as.factor(dataset$HOUSE_OWNERSHIP)

> dataset$MARITAL_STATUS <-as.factor(dataset$MARITAL_STATUS)

> dataset$HAS_CHILDREN <-as.factor(dataset$HAS_CHILDREN)

> dataset$LTV_BIN <-as.ordered(dataset$LTV_BIN)

> dataset$AGE <-as.numeric(dataset$AGE)

> dataset$SALARY <-as.numeric(dataset$SALARY)

> dataset$N_OF_DEPENDENTS <-as.numeric(dataset$N_OF_DEPENDENTS)

> dataset$TIME_AS_CUSTOMER <-as.numeric(dataset$TIME_AS_CUSTOMER)

> dataset$CREDIT_BALANCE <-as.numeric(dataset$CREDIT_BALANCE)

> dataset$BANK_FUNDS <-as.numeric(dataset$BANK_FUNDS)

> dataset$CHECKING_AMOUNT <-as.numeric(dataset$CHECKING_AMOUNT)

> dataset$MONEY_MONTLY_OVERDRAWN <-as.numeric(dataset$MONEY_MONTLY_OVERDRAWN)

> dataset$T_AMOUNT_AUTOM_PAYMENTS <-as.numeric(dataset$T_AMOUNT_AUTOM_PAYMENTS)

> dataset$MONTHLY_CHECKS_WRITTEN <-as.numeric(dataset$MONTHLY_CHECKS_WRITTEN)

> dataset$MORTGAGE_AMOUNT <-as.numeric(dataset$MORTGAGE_AMOUNT)

> dataset$N_TRANS_ATM <-as.numeric(dataset$N_TRANS_ATM)

> dataset$N_MORTGAGES <-as.numeric(dataset$N_MORTGAGES)

> dataset$N_TRANS_TELLER <-as.numeric(dataset$N_TRANS_TELLER)

> dataset$CREDIT_CARD_LIMITS <-as.numeric(dataset$CREDIT_CARD_LIMITS)

> dataset$N_TRANS_KIOSK <-as.numeric(dataset$N_TRANS_KIOSK)

> dataset$N_TRANS_WEB_BANK <-as.numeric(dataset$N_TRANS_WEB_BANK)
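The same conversions can be written more compactly by listing the column names once and applying the conversion to each group. The sketch below uses a tiny made-up data frame; `factor_cols` and `numeric_cols` are illustrative names, not objects from the article.

```r
# Toy data frame in which every column was read in as character
df <- data.frame(SEX    = c("M", "F", "F"),
                 REGION = c("South", "West", "South"),
                 AGE    = c("34", "51", "27"),
                 SALARY = c("52000", "61000", "48000"),
                 stringsAsFactors = FALSE)

factor_cols  <- c("SEX", "REGION")
numeric_cols <- c("AGE", "SALARY")

# lapply() returns a list of converted columns, assigned back in place
df[factor_cols]  <- lapply(df[factor_cols], as.factor)
df[numeric_cols] <- lapply(df[numeric_cols], as.numeric)

str(df)
```

One caveat: if a numeric column has been read in as a factor, `as.numeric()` returns the internal level codes rather than the original values, so `as.numeric(as.character(x))` is the safer conversion in that case.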

Checking the Correlations between Numeric Variables

We could use the pairs20x() function (from the s20x package) to check the correlations visually. Because there are many numeric variables, the resulting figure is too large to display, so we do not show it here and only use the cor() function to get the correlations.

> cor(dataset$TIME_AS_CUSTOMER, dataset$N_OF_DEPENDENTS)

[1] 0.7667451

> cor(dataset$T_AMOUNT_AUTOM_PAYMENTS, dataset$CREDIT_BALANCE)

[1] 0.8274963

> cor(dataset$N_TRANS_WEB_BANK, dataset$MORTGAGE_AMOUNT)

[1] 0.7679546

We found that the three pairs of variables above are highly correlated. Ideally, we should remove one variable from each pair in turn when building the models, to see whether model performance improves. However, for simplicity, we do not deal with the correlated variables in this article.
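Rather than checking pairs one at a time, the full correlation matrix can be scanned for pairs above a cut-off. This is a minimal sketch on simulated data; the 0.75 threshold and the variable names are assumptions made for the illustration.

```r
set.seed(1)
n <- 200
a <- rnorm(n)
toy <- data.frame(a = a,
                  b = a + rnorm(n, sd = 0.3),  # built to correlate strongly with a
                  c = rnorm(n),                # unrelated noise
                  g = factor(sample(c("x", "y"), n, replace = TRUE)))

# keep only the numeric columns, then compute all pairwise correlations
num <- toy[sapply(toy, is.numeric)]
cm  <- cor(num)

# look only above the diagonal so that each pair is reported once
high <- which(abs(cm) > 0.75 & upper.tri(cm), arr.ind = TRUE)
pairs_high <- data.frame(var1 = rownames(cm)[high[, "row"]],
                         var2 = colnames(cm)[high[, "col"]],
                         r    = round(cm[high], 3))
pairs_high
```

Applied to the article's data, the same pattern would be expected to list the three pairs reported above, provided the threshold is set below their correlations.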

Breaking Data into Training and Test Sample

> # breaking the data set into training and test samples by half

> d = sort(sample(nrow(dataset), nrow(dataset)*.5))

> train<-dataset[d,]

> test<-dataset[-d,]
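Because sample() draws at random, the split above changes on every run, and so do the figures reported later. A hedged sketch of a reproducible variant follows, on a toy data frame; the seed value is arbitrary and `train_toy` and `test_toy` are illustrative names.

```r
set.seed(42)  # arbitrary seed, used only so that the split can be repeated
toy <- data.frame(id = 1:11,
                  y  = rep(c("No", "Yes"), length.out = 11))

# floor() makes the training size an explicit whole number
n_train <- floor(nrow(toy) * 0.5)
idx <- sort(sample(nrow(toy), n_train))

train_toy <- toy[idx, ]
test_toy  <- toy[-idx, ]

nrow(train_toy)                               # 5
nrow(test_toy)                                # 6
length(intersect(train_toy$id, test_toy$id))  # 0, the two samples do not overlap
```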

## Building the Models

In this part, we will use the Decision Tree, Random Forest, Naive Bayes, and SVM classifiers in R to build one model each. For simplicity, we will not conduct k-fold cross-validation for the classifiers that have no built-in cross-validation.

Decision Tree

Decision Tree is one of the most commonly used classifiers. It can handle both numerical and categorical variables, and it is relatively insensitive to data errors and even missing data. Most importantly, it provides human-readable rules.

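> library("rpart")

> # fit the tree on the training sample (this call is missing from the original listing; default settings are assumed)

> model.tree <- rpart(BUY_INSURANCE ~ ., data = train)

> plot(model.tree, uniform = TRUE, margin = 0.1)

> text(model.tree, use.n = TRUE, cex = 0.8)

We can view the output tree structure in Figure 1.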

Figure 1

The tree is a bit complex, so we will prune it. First, we need to find the right complexity parameter (cp) value, and hence the number of splits (or size) of the tree, for pruning. The right cp is the threshold at which the increased cost of further splitting outweighs the reduction in lack of fit.

> # plot cp

> plotcp(model.tree)

Figure 2

“A good choice of cp for pruning is often the leftmost value for which the mean lies below the horizontal line.” In Figure 2, we can see that the optimal cp value is 0.042.

> # prune the tree with the optimal cp value

> pTree <- prune(model.tree, cp = 0.042)

> # draw the pruned tree

> plot(pTree, uniform = TRUE, margin = 0.1)

> text(pTree, use.n = TRUE, cex = 0.8)
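The cp value read off the plot can also be picked programmatically from the fitted object's cp table. The sketch below applies the same "leftmost value below the horizontal line" rule (one standard error above the minimum cross-validated error) to the kyphosis data that ships with rpart, not to the insurance data; the object names are illustrative.

```r
library(rpart)

set.seed(1)  # the cross-validated errors depend on random fold assignment
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             control = rpart.control(cp = 0.001))
cpt <- fit$cptable  # columns: CP, nsplit, rel error, xerror, xstd

# horizontal line in plotcp(): minimum xerror plus one standard error
best      <- which.min(cpt[, "xerror"])
threshold <- cpt[best, "xerror"] + cpt[best, "xstd"]

# leftmost (smallest) tree whose xerror lies on or below that line
pick   <- min(which(cpt[, "xerror"] <= threshold))
cp_1se <- cpt[pick, "CP"]

pruned <- prune(fit, cp = cp_1se)
```

Note that plotcp() labels its axis with geometric means of adjacent CP values, so a number read from the plot will usually differ from the table entry while still selecting the same tree.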

Figure 3

As shown in Figure 3, it is easy to describe the rules that decide whether a customer is more likely to buy.

Take the rightmost leaf as an example; its rule can be described as follows.

IF BANK_FUNDS >= 320.5

AND CHECKING_AMOUNT < 162

AND MONEY_MONTLY_OVERDRAWN >= 53.68

AND CREDIT_BALANCE < 3850

THEN the customer is likely to buy.

The numbers below the leaf show that, under the above conditions, 4 customers didn’t buy and 50 customers bought in the training dataset.

After we build the model, we can use it to predict for the customers in the test dataset.

> pred.tree <- predict(pTree,test[,-1])

The prediction result can be No or Yes for each customer, or it can give the probabilities of No and Yes for each customer, as follows.

No               Yes

1 0.8380952         0.16190476

2 0.9823009         0.01769912

3 0.8380952         0.16190476

7 0.9823009         0.01769912

8 0.9823009         0.01769912

9 0.9823009         0.01769912
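The two output forms described above are selected with the `type` argument of predict(). A small sketch on rpart's built-in kyphosis data (not the insurance data) shows both; for a classification tree, omitting `type` returns the probability matrix.

```r
library(rpart)

fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)
new_cases <- kyphosis[1:5, c("Age", "Number", "Start")]

# hard labels: one factor value per case
pred_class <- predict(fit, new_cases, type = "class")

# class probabilities: one column per class, each row sums to 1
pred_prob <- predict(fit, new_cases, type = "prob")

pred_class
pred_prob
```

The probability form is the one needed later for ROC curves, since a curve requires a score that can be thresholded at different levels.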

Random Forest

Random Forest is based on Decision Tree. It can handle large datasets with thousands of input variables without variable deletion. In random forests, there is no need for cross-validation or a separate test set to get an unbiased estimate of the test set error, because the error is estimated internally from the out-of-bag cases during training. It also gives estimates of which variables are important in the classification.[3]

> library(randomForest)

> model.randomForest <- randomForest(BUY_INSURANCE ~ ., data = train, importance = TRUE, proximity = TRUE)

> # look at variable importance

> round(importance(model.randomForest),2)

| Variable | No | Yes | MeanDecreaseAccuracy | MeanDecreaseGini |
|---|---|---|---|---|
| REGION | 0.18 | 0.74 | 0.34 | 4.68 |
| SEX | 0.01 | 0.09 | 0.04 | 0.96 |
| AGE | 0.40 | 0.85 | 0.53 | 11.10 |
| HAS_CHILDREN | -0.12 | 0.21 | -0.03 | 1.12 |
| SALARY | -0.06 | 0.01 | -0.04 | 8.98 |
| N_OF_DEPENDENTS | -0.04 | 1.08 | 0.39 | 4.01 |
| CAR_OWNERSHIP | -0.06 | -0.15 | -0.08 | 0.44 |
| HOUSE_OWNERSHIP | 0.08 | 0.38 | 0.19 | 1.11 |
| TIME_AS_CUSTOMER | 0.09 | 0.32 | 0.17 | 2.90 |
| MARITAL_STATUS | 0.22 | 0.83 | 0.45 | 3.62 |
| CREDIT_BALANCE | 0.76 | 0.62 | 0.67 | 4.26 |
| BANK_FUNDS | 1.02 | 2.69 | 1.27 | 27.15 |
| CHECKING_AMOUNT | 1.26 | 1.78 | 1.18 | 15.12 |
| MONEY_MONTLY_OVERDRAWN | 0.93 | 2.31 | 1.16 | 24.09 |
| T_AMOUNT_AUTOM_PAYMENTS | 0.85 | 1.58 | 0.99 | 15.97 |
| MONTHLY_CHECKS_WRITTEN | 0.27 | 1.35 | 0.64 | 10.43 |
| MORTGAGE_AMOUNT | 0.10 | 1.45 | 0.60 | 8.26 |
| N_TRANS_ATM | 0.66 | 1.95 | 0.96 | 13.97 |
| N_MORTGAGES | -0.01 | 0.29 | 0.10 | 1.58 |
| N_TRANS_TELLER | 0.66 | 1.52 | 0.82 | 7.62 |
| CREDIT_CARD_LIMITS | 0.00 | 0.82 | 0.25 | 6.35 |
| N_TRANS_KIOSK | -0.01 | -0.51 | -0.17 | 4.03 |
| N_TRANS_WEB_BANK | 0.29 | 1.20 | 0.60 | 9.27 |
| LTV_BIN | -0.01 | 0.15 | 0.04 | 2.47 |

Higher values in the table above indicate variables that are more important to the classification. We can see that BANK_FUNDS, CHECKING_AMOUNT, and MONEY_MONTLY_OVERDRAWN have the highest MeanDecreaseAccuracy values and are therefore the most helpful to the classification.
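Two things are easy to read from a fitted forest: the importance table sorted by one of its measures, and the out-of-bag (OOB) error that stands in for a separate test set. The sketch below assumes the randomForest package is installed and uses the built-in iris data rather than the insurance data; the seed and `ntree` value are arbitrary.

```r
library(randomForest)

set.seed(1)  # forests are random; the seed only makes this run repeatable
rf <- randomForest(Species ~ ., data = iris, ntree = 200, importance = TRUE)

# OOB error after the last tree: each case is scored only by trees
# that did not see it during training
oob_error <- rf$err.rate[rf$ntree, "OOB"]

# importance table, most important variable first
imp <- importance(rf)
imp_sorted <- imp[order(imp[, "MeanDecreaseAccuracy"], decreasing = TRUE), ]

oob_error
round(imp_sorted, 2)
```

Sorting by MeanDecreaseGini instead can give a somewhat different ranking, as the table above also suggests, so it is worth checking both measures before dropping variables.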

Let’s use the model to predict the cases in the test data set.

> pred.randomForest <- predict(model.randomForest, test[,-1], type = "prob")

No      Yes

1 0.764    0.236

2 0.990    0.010

3 0.600    0.400

7 0.864    0.136

8 0.996    0.004

9 0.996    0.004

Naïve Bayes

“A naive Bayes classifier is a simple probabilistic classifier based on applying Bayes’ theorem with strong (naive) independence assumptions. Despite the fact that the far-reaching independence assumptions are often inaccurate, it has several properties that make it surprisingly useful in practice.”[4] Naive Bayes can deal with both categorical and numeric data. Since the training sample is not very large, for simplicity we will not discretize the continuous variables by binning.

> library(e1071)

> model.naiveBayes <- naiveBayes(BUY_INSURANCE ~ ., data = train, laplace = 3)

> pred.naiveBayes <- predict(model.naiveBayes, test[,-1], type = "raw")

No                    Yes

[1,] 1.0000000       5.244713e-18

[2,] 0.9953059       4.694106e-03

[3,] 0.5579982       4.420018e-01

[4,] 0.2221896       7.778104e-01

[5,] 0.9857277       1.427231e-02

[6,] 0.9923343       7.665676e-03
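To make the "independence assumption" concrete, the posterior probabilities for a single case can be computed by hand in base R. This is a simplified sketch on two classes of the built-in iris data, with each numeric predictor modelled as a Gaussian within each class; the new case `x_new` is made up, and the code illustrates the idea rather than reproducing the internals of naiveBayes().

```r
# two classes and two numeric predictors from the built-in iris data
d <- droplevels(subset(iris, Species != "setosa"))
x_new <- c(Petal.Length = 4.9, Petal.Width = 1.6)  # hypothetical new case
features <- names(x_new)
classes  <- levels(d$Species)

# prior probability of each class: its share of the training rows
prior <- sapply(classes, function(k) mean(d$Species == k))

# unnormalised posterior: prior times the product of per-feature densities
# (the product is where the independence assumption enters)
score <- sapply(classes, function(k) {
  rows <- d$Species == k
  dens <- sapply(features, function(f) {
    dnorm(x_new[[f]], mean = mean(d[rows, f]), sd = sd(d[rows, f]))
  })
  prior[[k]] * prod(dens)
})

# normalise so that the class probabilities sum to 1
posterior <- score / sum(score)
posterior
```

Very small values such as the 5.244713e-18 in the output above arise naturally from multiplying many small densities together.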

SVM

“A support vector machine constructs a hyperplane or set of hyperplanes in a high- or infinite-dimensional space, which can be used for classification, regression, or other tasks.”[5] We will use the svm() function in the e1071 package to build the classification models. The kernel and cost parameters are important for svm() to yield sensible results. We will try the linear and radial kernel functions in turn below.

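> # build the models with two different kernel functions respectively

> model.svm.linear <- svm(BUY_INSURANCE ~ ., data = train, kernel = "linear", probability = TRUE)

> # the radial-kernel call is missing from the original listing; it is reconstructed here by analogy with the linear one

> model.svm.radial <- svm(BUY_INSURANCE ~ ., data = train, kernel = "radial", probability = TRUE)

> # prediction with the two models respectively

> pred.svm.linear <- predict(model.svm.linear, test[,-1], probability = TRUE)

> pred.svm.radial <- predict(model.svm.radial, test[,-1], probability = TRUE)

> attr(pred.svm.linear, "probabilities")[1:6,]

No                 Yes

1 0.9816020           0.01839796

2 0.9391826           0.06081737

3 0.5237238           0.47627615

7 0.9310071           0.06899288

8 0.9531510           0.04684897

9 0.9444462           0.05555381

> attr(pred.svm.radial, "probabilities")[1:6,]

No                Yes

1 0.8849981           0.11500191

2 0.9664234           0.03357663

3 0.5672350           0.43276502

7 0.9591768           0.04082316

8 0.9624121           0.03758789

9 0.9862672           0.01373277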
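Since the kernel and cost parameters matter, a small grid search is a common next step. The sketch below uses tune() from e1071 on the built-in iris data, not the insurance data; the grid values are arbitrary choices for the illustration, and by default tune() scores each combination with 10-fold cross-validation.

```r
library(e1071)

set.seed(1)  # cross-validation folds are random
tuned <- tune(svm, Species ~ ., data = iris,
              kernel = "radial",
              ranges = list(cost  = c(0.1, 1, 10),
                            gamma = c(0.01, 0.1, 1)))

tuned$best.parameters   # cost/gamma pair with the lowest CV error
tuned$best.performance  # that lowest cross-validated error
best_model <- tuned$best.model
```

For the article's models, the same call with `BUY_INSURANCE ~ .` and `data = train` would be the natural adaptation, with `probability = TRUE` added if class probabilities are needed afterwards.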

## Comparing the Models

To compare the models generated above, we will plot the ROC curves and calculate the area under the ROC curve (AUC for short).

> library(ROCR)  # provides prediction() and performance()

> # prepare the legend strings for the ROC figure

> c.legend <- c("decision tree, auc=", "random forest, auc=", "naive Bayes, auc=", "svm.linear, auc=", "svm.radial, auc=")

> # ROC for Decision Tree

> pred <- prediction(pred.tree[,2], test[,1])

> perf <- performance(pred, "tpr", "fpr")

> plot(perf, col = "red", lwd = 2)

> # calculate the AUC and add it to the legend vector

> c.legend[1] <- paste(c.legend[1], round((performance(pred, "auc")@y.values)[[1]], 3))

> # ROC for Random Forest

> pred <- prediction(pred.randomForest[,2], test[,1])

> perf <- performance(pred, "tpr", "fpr")

> # add the curve to the existing plot (colours follow the legend call at the end)

> plot(perf, add = TRUE, col = "green", lwd = 2)

> # calculate the AUC and add it to the legend vector

> c.legend[2] <- paste(c.legend[2], round((performance(pred, "auc")@y.values)[[1]], 3))

> # ROC for Naive Bayes

> pred <- prediction(pred.naiveBayes[,2], test[,1])

> perf <- performance(pred, "tpr", "fpr")

> plot(perf, add = TRUE, col = "blue", lwd = 2)

> # calculate the AUC and add it to the legend vector

> c.legend[3] <- paste(c.legend[3], round((performance(pred, "auc")@y.values)[[1]], 3))

> # ROC for SVM with linear kernel

> pred <- prediction(attr(pred.svm.linear, "probabilities")[,2], test[,1])

> perf <- performance(pred, "tpr", "fpr")

> plot(perf, add = TRUE, col = "purple", lwd = 2)

> # calculate the AUC and add it to the legend vector

> c.legend[4] <- paste(c.legend[4], round((performance(pred, "auc")@y.values)[[1]], 3))

> # ROC for SVM with radial kernel

> pred <- prediction(attr(pred.svm.radial, "probabilities")[,2], test[,1])

> perf <- performance(pred, "tpr", "fpr")

> plot(perf, add = TRUE, col = "black", lwd = 2)

> # calculate the AUC and add it to the legend vector

> c.legend[5] <- paste(c.legend[5], round((performance(pred, "auc")@y.values)[[1]], 3))

> # draw the legend

> legend(0.5, 0.6, c.legend, lty = c(1,1,1,1,1), lwd = c(2,2,2,2,2), col = c("red","green","blue","purple","black"))

Figure 4

As shown in Figure 4, the model built by Random Forest (green line) performs best in this case, with an AUC of 0.921. We can use this model in practice to predict Buy or Not Buy for new customers who are not in the existing data set.
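For intuition about what the AUC measures, it can be computed without any package: it equals the probability that a randomly chosen buyer receives a higher score than a randomly chosen non-buyer. The sketch below uses the rank-sum form of that identity on made-up scores; `auc_rank()` is a hypothetical helper, not part of ROCR.

```r
# Hypothetical helper: AUC from scores and 0/1 labels via the rank-sum identity
auc_rank <- function(score, label) {
  r <- rank(score)  # average ranks, so tied scores count as half a win
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# made-up predicted probabilities of "Yes" and the true outcomes (1 = Yes)
score <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1)
label <- c(1,   1,   0,   1,   0,    0,   1,   0)

auc_rank(score, label)  # 0.75: 12 of the 16 buyer/non-buyer pairs are ordered correctly
```

An AUC of 0.5 corresponds to random ordering and 1 to perfect separation, which puts the 0.921 reported above in context.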

## Summary

In this article, we built and compared models generated by the Decision Tree, Random Forest, Naïve Bayes, and SVM algorithms implemented in R packages. Random Forest outperformed the others in this insurance-buying use case.

References