**R Programming – DataScience+**, and kindly contributed to R-bloggers)

Categories

Tags

Machine learning logistic regressions is a widely popular method to model credit modeling. There are excellent and efficient packages in R, that can perform these types of analysis. Typically you will first create different machine learning visualizations before you perform the machine learning logistic regression analysis.

This article is the second step of a credit modeling analysis, where I recently published the first step in this article.

The first thing we need to do is to load the R packages into the library:

# Load R packages into the library # Data management packages library(DescTools) library(skimr) library(plyr) library(dplyr) library(aod) library(readxl) # Visualization packages library(Deducer) library(ggplot2) # Machine learnning method packages library(ROCR) library(pROC) library(caret) library(MASS)

Now it is time to load the dataset and do some data management. We will work with the loan lending club dataset. The below coding is the data management:

# Import dataset loan_data <- read.csv("/loan.csv") # Selecting the relevant variables in the dataset: loan_data <- loan_data[,c("grade","sub_grade","term","loan_amnt","issue_d","loan_status","emp_length", "home_ownership", "annual_inc","verification_status","purpose","dti", "delinq_2yrs","addr_state","int_rate", "inq_last_6mths","mths_since_last_delinq", "mths_since_last_record","open_acc","pub_rec","revol_bal","revol_util","total_acc")] # Data management for missing observations loan_data$mths_since_last_delinq[is.na(loan_data$mths_since_last_delinq)] <- 0 loan_data$mths_since_last_record[is.na(loan_data$mths_since_last_record)] <- 0 var.has.na <- lapply(loan_data, function(x){any(is.na(x))}) num_na <- which( var.has.na == TRUE ) per_na <- num_na/dim(loan_data)[1] loan_data <- loan_data[complete.cases(loan_data),]

Although this is the second step of a credit modeling analysis, the visualization step can be found in my previous article, let us do minimum of visualization in case the reader only reads this article:

# Visualization of the data # Bar chart of the loan amount loanamount_barchart <- ggplot(data=loan_data, aes(loan_data$loan_amnt)) + geom_histogram(breaks=seq(0, 35000, by=1000), col="black", aes(fill=..count..)) + scale_fill_gradient("Count", low="green1", high="yellowgreen")+ labs(title="Loan Amount", x="Amount", y="Number of Loans") loanamount_barchart ggplotly(p = ggplot2::last_plot()) # Box plot of loan amount box_plot_stat <- ggplot(loan_data, aes(loan_status, loan_amnt)) box_plot_stat + geom_boxplot(aes(fill = loan_status)) + theme(axis.text.x = element_blank()) + labs(list(title = "Loan amount by status", x = "Loan Status", y = "Amount")) ggplotly(p = ggplot2::last_plot())

The above coding gives us the following two visualizations:

Lets see some descriptive statistics of the data as well:

skim(loan_data)Skim summary statistics n obs: 886877 n variables: 23 -- Variable type:factor -------------------------------------------------------- variable missing complete n n_unique top_counts ordered addr_state 0 886877 886877 51 CA: 129456, NY: 74033, TX: 71100, FL: 60901 FALSE emp_length 0 886877 886877 12 10+: 291417, 2 y: 78831, < 1: 70538, 3 y: 69991 FALSE grade 0 886877 886877 7 B: 254445, C: 245721, A: 148162, D: 139414 FALSE home_ownership 0 886877 886877 6 MOR: 443319, REN: 355921, OWN: 87408, OTH: 180 FALSE issue_d 0 886877 886877 103 Oct: 48619, Jul: 45938, Dec: 44323, Oct: 38760 FALSE loan_status 0 886877 886877 8 Cur: 601533, Ful: 209525, Cha: 45956, Lat: 11582 FALSE purpose 0 886877 886877 14 deb: 524009, cre: 206136, hom: 51760, oth: 42798 FALSE sub_grade 0 886877 886877 35 B3: 56301, B4: 55599, C1: 53365, C2: 52206 FALSE term 0 886877 886877 2 36: 620739, 60: 266138, NA: 0 FALSE verification_status 0 886877 886877 3 Sou: 329393, Ver: 290896, Not: 266588, NA: 0 FALSE -- Variable type:numeric ------------------------------------------------------- variable missing complete n mean sd p0 p25 p50 p75 p100 hist annual_inc 0 886877 886877 75019.4 64687.38 0 45000 65000 90000 9500000 ???????? delinq_2yrs 0 886877 886877 0.31 0.86 0 0 0 0 39 ???????? dti 0 886877 886877 18.16 17.19 0 11.91 17.66 23.95 9999 ???????? inq_last_6mths 0 886877 886877 0.69 1 0 0 0 1 33 ???????? int_rate 0 886877 886877 13.25 4.38 5.32 9.99 12.99 16.2 28.99 ???????? loan_amnt 0 886877 886877 14756.97 8434.43 500 8000 13000 20000 35000 ???????? mths_since_last_delinq 0 886877 886877 16.62 22.89 0 0 0 30 188 ???????? mths_since_last_record 0 886877 886877 10.83 27.65 0 0 0 0 129 ???????? open_acc 0 886877 886877 11.55 5.32 1 8 11 14 90 ???????? pub_rec 0 886877 886877 0.2 0.58 0 0 0 0 86 ???????? revol_bal 0 886877 886877 16924.56 22414.33 0 6450 11879 20833 2904836 ???????? revol_util 0 886877 886877 55.07 23.83 0 37.7 56 73.6 892.3 ???????? total_acc 0 886877 886877 25.27 11.84 1 17 24 32 169 ????????

Next we need to do some more data management to prepare the dataset for machine learning analysis

# Focus on the historical loans loan_data=as.data.frame(loan_data[loan_data$loan_status!="Current", ]) limits_inc = quantile(loan_data$annual_inc, seq(0,1,0.1)) labels <- c(0, limits_inc[2:10], "+inf") labels <- prettyNum(labels, big.mark = ",") labels <- paste(labels[1:10], labels[2:11], sep = "-") loan_data$annual_inc <- cut(loan_data$annual_inc, limits_inc, labels = labels, include.lowest = T) loan_data[,"annual_inc"] <- as.character(loan_data[,"annual_inc"]) # Create binary variables for the logistic regression analysis # Annual_inc loan_data$annual_inc[loan_data$annual_inc == "70,000- 80,000"| loan_data$annual_inc == "80,000- 94,000" | loan_data$annual_inc == "94,000-120,000" | loan_data$annual_inc == "120,000- +inf" ] <- 1 loan_data$annual_inc[loan_data$annual_inc != 1] <- 0 loan_data$annual_inc <- as.numeric(loan_data$annual_inc) # Home_ownership loan_data$home_ownership <- as.character(loan_data$home_ownership) loan_data$home_ownership[loan_data$home_ownership=="OWN" | loan_data$home_ownership=="MORTGAGE" ] <- 1 loan_data$home_ownership[loan_data$home_ownership!=1] <- 0 # Dealinq_2yrs loan_data$delinq_2yrs <- as.character(loan_data$delinq_2yrs) loan_data$delinq_2yrs[loan_data$delinq_2yrs=="0"] <- 0 loan_data$delinq_2yrs[loan_data$delinq_2yrs!= 0] <- 1 # Verification status: if Verified = 1 ; otherwise = 0 loan_data$verification_status = as.character(loan_data$verification_status) loan_data$verification_status[loan_data$verification_status == "Verified" | loan_data$verification_status == "Source Verified"] = 1 loan_data$verification_status[loan_data$verification_status != 1] = 0 loan_data$verification_status=as.numeric(loan_data$verification_status) # Dti dti_quant <- quantile(loan_data$dti, seq(0, 1, 0.1)) labels = c(0,prettyNum(dti_quant[2:10], big.mark = ","), "+Inf") labels = paste(labels[1:10],labels[2:11], sep = "-") loan_data <- mutate(loan_data, dti= cut(loan_data$dti, breaks = dti_quant, labels = factor(labels), include.lowest = T)) loan_data$dti <- as.character(loan_data$dti) loan_data$dti[loan_data$dti == "0-6.57" | loan_data$dti == "12.13-14.32" | loan_data$dti == "14.32-16.49" ] <- 1 loan_data$dti[loan_data$dti!=1] <- 0 # Status loan_data$loan_status <- as.character(loan_data$loan_status) loan_data$loan_status[loan_data$loan_status == "Charged Off" | loan_data$loan_status == "Default" ] <- 1 loan_data$loan_status[loan_data$loan_status != 1] <- 0 table(loan_data$loan_status) PercTable(loan_data$loan_status) # Change to nummeric variables: loan_data[,"revol_util"] <- as.numeric(sub("%", "",loan_data$"revol_util", fixed =TRUE))/100 loan_data[,"int_rate"] <- as.numeric(sub("%", "",loan_data$"int_rate", fixed =TRUE))/100 loan_data$loan_status <- as.numeric(loan_data$loan_status) # Grouping variables loan_data$purpose <- as.character(loan_data$purpose) loan_data$purpose[loan_data$purpose == "car" | loan_data$purpose == "major_purchase" | loan_data$purpose == "home_improvement"| loan_data$purpose == "credit_card" ] <- 2 loan_data$purpose[loan_data$purpose == "moving" | loan_data$purpose == "small_business" | loan_data$purpose == "renewable_energy" ] <- 0 loan_data$purpose[loan_data$purpose!= 0 & loan_data$purpose!= 2 ] <- 1 loan_data$purpose <- as.factor(loan_data$purpose)

Now it is time to make the machine learning regression analysis. We will work with multiple logistic regression. Logistic regression is applied when you have a binary variable (y) to explain. The logistic regression model uses the cumulative distribution function to estimate the logistic function of the model with a group of explanatory variables (the x’s). We will work with a stepwise model in order to find a final model for the logistic regression. The below coding generates the multiple logistic regression analysis:

##Machine Learning: Multiple Logistic Regression Models # Logistic: Logit stepwise Regression logregmodI <- glm(loan_status ~ loan_amnt + home_ownership + annual_inc + verification_status + purpose + dti + delinq_2yrs + int_rate + inq_last_6mths + mths_since_last_delinq + revol_bal + revol_util + total_acc, data = loan_data, family = binomial(link= "logit")) step <- stepAIC(logregmodI, direction="both") step$anovaStepwise Model Path Analysis of Deviance Table Initial Model: loan_status ~ loan_amnt + home_ownership + annual_inc + verification_status + purpose + dti + delinq_2yrs + int_rate + inq_last_6mths + mths_since_last_delinq + revol_bal + revol_util + total_acc Final Model: loan_status ~ loan_amnt + home_ownership + verification_status + purpose + dti + delinq_2yrs + int_rate + inq_last_6mths + mths_since_last_delinq + revol_bal + revol_util + total_acc Step Df Deviance Resid. Df Resid. Dev AIC 1 285330 241553.4 241581.4 2 - annual_inc 0 0 285330 241553.4 241581.4

Now we need to make a training dataset and testing dataset in order to train the model and perform a ROC curve:

# Create a training- and testing dataset percing <- floor((nrow(loan_data)/4)*3) loan <- loan_data[sample(nrow(loan_data)), ] loan.training <- loan[1:percing, ] loan.testing <- loan[(percing+1):nrow(loan), ] # Begin training of the model fitting.logistic <- glm(loan_status ~ loan_amnt + home_ownership + verification_status + purpose + dti + delinq_2yrs + int_rate + inq_last_6mths + mths_since_last_delinq + revol_bal + revol_util + total_acc, data=loan.training,family = binomial(link= "logit")) summary(fitting.logistic)Call: glm(formula = loan_status ~ loan_amnt + home_ownership + verification_status + purpose + dti + delinq_2yrs + int_rate + inq_last_6mths + mths_since_last_delinq + revol_bal + revol_util + total_acc, family = binomial(link = "logit"), data = loan.training) Deviance Residuals: Min 1Q Median 3Q Max -1.5509 -0.6336 -0.5098 -0.3799 2.5925 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -2.952e+00 4.221e-02 -69.931 < 2e-16 *** loan_amnt 2.213e-06 8.330e-07 2.657 0.007894 ** home_ownership1 -1.626e-01 1.261e-02 -12.892 < 2e-16 *** verification_status 7.695e-02 1.429e-02 5.385 7.23e-08 *** purpose1 -3.342e-01 3.237e-02 -10.326 < 2e-16 *** purpose2 -3.882e-01 3.383e-02 -11.475 < 2e-16 *** dti1 -2.075e-01 1.385e-02 -14.982 < 2e-16 *** delinq_2yrs1 -5.411e-02 1.618e-02 -3.345 0.000823 *** int_rate 1.157e+01 1.545e-01 74.868 < 2e-16 *** inq_last_6mths 7.640e-02 5.113e-03 14.943 < 2e-16 *** mths_since_last_delinq -2.289e-03 2.726e-04 -8.396 < 2e-16 *** revol_bal -1.818e-06 4.002e-07 -4.543 5.54e-06 *** revol_util 3.391e-01 2.743e-02 12.366 < 2e-16 *** total_acc -6.473e-03 5.718e-04 -11.320 < 2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 191593 on 214007 degrees of freedom Residual deviance: 180719 on 213994 degrees of freedom AIC: 180747 Number of Fisher Scoring iterations: 4

Now we can create the ROC and AUC curves for the model.

# AUC and ROC curve fitted.results <- predict(fit.log, newdata = loan.testing, type = "response") loan.testing$prob <- fitted.results pred <- prediction(loan.testing$prob,loan.testing$loan_status) auc1 <- performance(pred, measure = "auc") [email protected]

And lastly we can display the ROC curve as a visualization. These codings are pretty heavy in computing power – so relax and let R do the calculations:

# Performance function ROCRperf = performance(pred, "tpr", "fpr") # Plot the ROC graph Add threshold labels plot(ROCRperf, colorize=TRUE, print.cutoffs.at=seq(0,1,by=0.1), text.adj=c(-0.2,1.7)) abline(0, 1, col= "black")

The above coding gives us the following ROC graph visualizaiton:

### References

- Using DescTools in R – CRAN.R-project.org
- Using skimr in R – CRAN.R-project.org
- Using magrittr in R – CRAN.R-project.org
- Using plyr in R – CRAN.R-project.org
- Using dplyr in R – CRAN.R-project.org
- Using aod in R – CRAN.R-project.org
- Using readxl in R – CRAN.R-project.org
- Using Deducer in R – CRAN.R-project.org
- Using ggplot2 in R – CRAN.R-project.org
- Using plotly in R – CRAN.R-project.org
- Using ROCR in R – CRAN.R-project.org
- Using pROC in R – CRAN.R-project.org
- Using caret in R – CRAN.R-project.org
- Using MASS in R – CRAN.R-project.org

Related Post

- Commercial data analytics: An economic view on the data science methods
- Weight loss in the U.S. – An analysis of NHANES data with tidyverse
- Machine Learning Results in R: one plot to rule them all! (Part 2 – Regression Models)
- Story of pairs, ggpairs, and the linear regression
- Extract FRED Data for OLS Regression Analysis: A Complete R Tutorial

**leave a comment**for the author, please follow the link and comment on their blog:

**R Programming – DataScience+**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...