Gradient Boosting: Analysis of LendingClub’s Data

[This article was first published on Kevin Davenport » R, and kindly contributed to R-bloggers].

An old 5.75% CD of mine recently matured, and since those interest rates are gone forever, I figured I’d take a statistical look at LendingClub’s data. LendingClub is the first peer-to-peer lending company to register its offerings as securities with the Securities and Exchange Commission (SEC). Their operational statistics are public and available for download.

The latest dataset consisted of 119,573 entries, and some of its attributes included:

  1. Amount Requested
  2. Amount Funded by Investors*
  3. Interest Rate
  4. Term (Loan Length)
  5. Purpose of Loan
  6. Debt/Income Ratio **
  7. State
  8. Rent or Own Home
  9. Monthly Income
  10. FICO Low
  11. FICO High
  12. Credit Lines Open
  13. Revolving Balance
  14. Inquiries in Last 6 Months
  15. Length of Employment

* LendingClub states that the amount funded by investors has no effect on the final interest rate assigned to a loan.

** The DTI ratio is the total of your monthly liabilities divided by your gross monthly income.

Once I had the .csv loaded as a data frame in R, I had a little data munging to do. I wasn’t sure which method I was going to use at this point, but I always address missing data first. In this instance I replaced NA entries with column averages and converted the interest rate to a numeric column (the % sign caused R to import it as a string). I also created a debt-to-income ratio variable by dividing annual income by 12 and dividing the result by revolving balance.
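The munging steps above can be sketched in base R as follows. This is only an illustration on a made-up four-row data frame; the column names follow those that appear later in the post, and the ratio is computed exactly as described in the text (monthly income over revolving balance), which is the reciprocal of the debt-over-income definition given in the footnote.

```r
# Toy stand-in for the LendingClub export; the values are made up.
df <- data.frame(
  int_rate   = c("10.65%", "15.27%", "7.90%", "13.49%"),
  annual_inc = c(24000, NA, 60000, 36000),
  revol_bal  = c(4000, 1500, NA, 3000),
  stringsAsFactors = FALSE
)

# Strip the "%" sign so the rate can be stored as a number
df$int_rate <- as.numeric(sub("%", "", df$int_rate, fixed = TRUE))

# Replace NA entries in every numeric column with that column's mean
num_cols <- names(df)[sapply(df, is.numeric)]
for (col in num_cols) {
  missing <- is.na(df[[col]])
  df[[col]][missing] <- mean(df[[col]], na.rm = TRUE)
}

# Ratio as described in the text: monthly income divided by revolving balance
df$DTIratio <- (df$annual_inc / 12) / df$revol_bal
```

Note that a revolving balance of zero would produce an infinite ratio, so real data would likely need an extra guard for that case.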

I wanted to create a FICOrange column (inherently a factor or character variable in R because of the “-” in an entry like “675-725”) and then a numeric MeanFICO column.

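# Build the "low-high" range string from the two FICO columns
df$FICOrange = paste(df$fico_range_low, df$fico_range_high, sep='-')
# Average the two three-digit endpoints (R strings are 1-indexed)
FICOMEAN = function(x) (as.numeric(substr(x, 1, 3)) + as.numeric(substr(x, 5, 7)))/2
df$MeanFICO = sapply(df$FICOrange, FICOMEAN)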

I try to identify confounders early when dredging through data. Mathbabe, aka Catherine O’Neil, has a great post on confounders here. A simple ANOVA listed Amount Requested, Debt/Income Ratio, Rent or Own Home, Inquiries in Last 6 Months, Length of Loan, and Purpose of Loan as being significantly associated with both MeanFICO and Interest Rate. Fair Isaac itself states the following:

Fair Isaac’s research shows that opening several credit accounts in a short period of time represents greater credit risk. When the information on your credit report indicates that you have been applying for multiple new credit lines in a short period of time (as opposed to rate shopping for a single loan, which is handled differently as discussed below), your FICO score can be lower as a result.

This, coupled with their FICO score breakdown information, further confirms that Debt/Income Ratio and Inquiries in Last 6 Months are confounders.

The next step was to express quantitatively how important each variable is in determining a loan’s interest rate. From experience I would hypothesize that FICO score and loan length are key factors, but let’s find out for sure. The current base rate is 5.05%. Below is a quick screenshot of a subset of the current rates by risk. Essentially, the point of this post is to examine how LendingClub arrives at its “Adjustment for Risk & Volatility” values.


Since there are so many variables, some of which may be highly correlated and some of which affect Interest Rate in a non-linear manner, I decided against multiple linear regression and chose gradient boosting instead (library(gbm)). Gradient boosting constructs additive regression models by sequentially fitting a simple parameterized function (the base learner) to the current “pseudo”-residuals by least squares at each iteration.[1]
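To make that definition concrete, here is a minimal hand-rolled sketch of boosting with squared-error loss, where the pseudo-residuals are simply the ordinary residuals. It is not what the gbm package does internally (gbm adds subsampling, deeper trees, and more); the function names fit_stump, predict_stump, and boost are hypothetical, and the data are simulated.

```r
set.seed(1)
# Simulated data: a non-linear relationship with noise (made-up numbers)
x <- runif(200, 660, 820)
y <- 6 + 14 * exp(-(x - 660) / 50) + rnorm(200, sd = 0.5)

# Base learner: a one-split regression stump fit by least squares
fit_stump <- function(x, r) {
  cuts <- sort(unique(x))
  cuts <- (head(cuts, -1) + tail(cuts, -1)) / 2   # midpoints between neighbours
  sse <- sapply(cuts, function(s) {
    left <- r[x <= s]
    right <- r[x > s]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  s <- cuts[which.min(sse)]
  list(split = s, left = mean(r[x <= s]), right = mean(r[x > s]))
}

predict_stump <- function(stump, x) {
  ifelse(x <= stump$split, stump$left, stump$right)
}

boost <- function(x, y, n_trees = 100, shrinkage = 0.1) {
  pred <- rep(mean(y), length(y))   # start from the constant model
  mse <- numeric(n_trees)
  for (m in seq_len(n_trees)) {
    resid <- y - pred               # pseudo-residuals for squared-error loss
    stump <- fit_stump(x, resid)    # fit the base learner to the residuals
    pred <- pred + shrinkage * predict_stump(stump, x)
    mse[m] <- mean((y - pred)^2)    # training error after m iterations
  }
  list(pred = pred, mse = mse)
}

fit <- boost(x, y)
```

Each iteration adds a small correction in the direction that most reduces the remaining squared error, so the training error in fit$mse never increases from one iteration to the next. That is also why overfitting has to be checked on held-out data rather than on the training set.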

Overfitting is a big concern when modeling data, so I used 10-fold cross-validation (which might be overkill considering the 90k+ training observations). I also used the gbm.perf() function to estimate “the optimal number of boosting iterations for a gbm object” and summary() (the summary.gbm method) to summarize the relative influence of each variable in the gbm object. Below is a graphical summary of the relative influence of each variable.

[Figure: gbm importance bar graph]
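As a rough illustration of that workflow, the sketch below fits a gbm with 10-fold cross-validation on simulated loans, then calls gbm.perf() and summary(). It assumes the gbm package is installed; the data, the coefficients used to simulate them, and the tuning values (n.trees, shrinkage, interaction.depth) are made up for the example and are not the settings used in this analysis.

```r
library(gbm)  # assumes the gbm package is installed

set.seed(123)
# Simulated loans (made-up numbers, not LendingClub data)
n <- 1000
sim <- data.frame(
  MeanFICO  = runif(n, 660, 820),
  term      = factor(sample(c("36 months", "60 months"), n, replace = TRUE)),
  loan_amnt = runif(n, 1000, 35000)
)
sim$int_rate <- 45 - 0.05 * sim$MeanFICO + 3 * (sim$term == "60 months") +
  0.00005 * sim$loan_amnt + rnorm(n, sd = 0.5)

fit <- gbm(int_rate ~ ., data = sim,
           distribution = "gaussian",  # squared-error loss
           n.trees = 500,
           shrinkage = 0.05,
           interaction.depth = 2,
           n.minobsinnode = 50,
           cv.folds = 10,              # 10-fold cross-validation
           n.cores = 1)

# Number of trees that minimises the cross-validated error
best_iter <- gbm.perf(fit, method = "cv", plot.it = FALSE)

# Relative influence of each predictor at that iteration (sums to 100)
influence <- summary(fit, n.trees = best_iter, plotit = FALSE)
print(influence)
```

Setting plotit = TRUE in summary() draws the bar chart of relative influence instead of only returning the table.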

I fit a gbm (gradient boosted model) to a subset of the data (the training data) to generate a list describing how much each variable reduced the squared error. According to this output, the three most important variables were MeanFICO, Term, and Amount Requested. Together these three account for about 93% of the model’s relative influence.

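library(gbm)
# Squared-error loss and 10-fold CV, as described in the text
gbm = gbm(int_rate ~ ., data = training, distribution = "gaussian",
            cv.folds = 10, n.minobsinnode = 50)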

I calculated the root-mean-square (RMS) error, as a percentage, to assess the power of the model. Basically, the RMS error measures the differences between the values predicted by a model or estimator and the values actually observed. The model’s RMS error was 15.48%, which is not bad.
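For reference, the RMS error can be computed in a couple of lines. The post does not say how the percentage was derived, so the sketch below assumes one common convention, dividing the RMS error by the mean of the observed values; the numbers are made up.

```r
# Root-mean-square error between observed and predicted values
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Made-up observed and predicted interest rates, in percent
actual    <- c(7.9, 10.6, 13.5, 15.3, 18.2)
predicted <- c(8.4, 10.1, 12.9, 16.0, 17.5)

err <- rmse(actual, predicted)

# Assumed definition of "RMS error %": RMSE relative to the mean observed rate
err_pct <- 100 * err / mean(actual)
```

Other normalisations (by the range or the standard deviation of the observed values) are also in use, so the percentage is only comparable across models when the same convention is applied.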


Below is the output of summary().

var             rel.inf
MeanFICO        64.92519872
term            23.35757505
loan_amnt        4.89518426
funded_amnt      3.78362094
purpose          1.0948237
annual_inc       1.06986
revol_bal        0.2888473
home_ownership   0.27482517
addr_state       0.18597635
DTIratio         0.11379097
emp_length       0.01029754

It’s easy to understand the relative weakness of the individual variables that were rolled up into the FICO score. What is interesting, however, is how insignificant home ownership and loan purpose were in determining interest rates. Consider, for example, a loan to purchase jet skis compared with a loan consolidation taken out ahead of a home loan application. I am surprised by the lack of weight given to home ownership, especially when you consider that LendingClub’s maximum loan is $35,000.

Below is a graph generated with the full dataset (119,573 obs). It shows the relationship between FICO, Interest Rate, and Term. I used a gam line with the formula y ~ s(x, bs = "cs") as opposed to a simple lm (linear model) line because I wanted to show how the slope flattens as the line approaches higher FICO scores. This illustrates the diminishing returns of higher FICO scores.
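The difference between the two kinds of line can be sketched without the plot. The example below fits both an lm and a gam with the same smoother to simulated data in which the rate flattens at high scores, then compares the slope of the smooth at the low and high ends. It uses the mgcv package, which ships with standard R installations; the data and variable names are made up.

```r
library(mgcv)  # provides gam() and the s() smoother

set.seed(7)
# Made-up data in which the rate flattens out at high FICO scores
fico <- runif(400, 660, 820)
rate <- 6 + 14 * exp(-(fico - 660) / 50) + rnorm(400, sd = 0.8)

lin_fit <- lm(rate ~ fico)                  # one constant slope everywhere
gam_fit <- gam(rate ~ s(fico, bs = "cs"))   # same smoother as in the plot formula

# Slope of the smooth over a 20-point step at the low and high ends
p <- predict(gam_fit, newdata = data.frame(fico = c(670, 690, 790, 810)))
slope_low  <- unname(p[2] - p[1]) / 20
slope_high <- unname(p[4] - p[3]) / 20
```

On data like these, slope_low is much steeper than slope_high, which is the curvature a straight lm line cannot represent.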


A linear model gives the following estimates:

  • Every 1-point increase in MeanFICO is associated with a 0.096-point decrease in interest rate.
  • Every $1,000 increase in amount requested is associated with a 0.28-point increase in interest rate.
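As a quick worked example of how to read those two coefficients, the sketch below applies them to hypothetical changes in FICO score and loan amount. It assumes both coefficients come from the same linear model and that everything else is held fixed; rate_change is a hypothetical helper, not part of the original analysis.

```r
# Coefficients quoted above; everything else here is illustrative
coef_fico   <- -0.096   # change in rate per 1-point increase in MeanFICO
coef_amount <-  0.28    # change in rate per $1,000 requested

# Predicted change in interest rate for a given change in each predictor
rate_change <- function(d_fico, d_amount_usd) {
  coef_fico * d_fico + coef_amount * (d_amount_usd / 1000)
}

rate_change(50, 0)      # 50 more FICO points, same loan amount
rate_change(0, 5000)    # same FICO score, $5,000 more requested
```

Under these assumptions, 50 extra FICO points lower the predicted rate by 4.8 points, while requesting $5,000 more raises it by 1.4 points.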

LendingClub’s models could, and perhaps should, be much more complicated than this. They could employ text analytics to assess volatility and default risk based on the purpose summary written by the user, although there is no way to verify the intent. LendingClub could also examine the microeconomic climate of each state or zip code and factor in housing availability, rent control, and socioeconomic factors such as geography, race, and gender. It would be interesting to learn how the government would assess the “fair and equal” nature of this type of lending.

[1]  <- for Chris Rice
