Gradient Boosting: Analysis of LendingClub’s Data
An old 5.75% CD of mine recently matured and, seeing that those interest rates are gone forever, I figured I’d take a statistical look at LendingClub’s data. LendingClub is the first peer-to-peer lending company to register its offerings as securities with the Securities and Exchange Commission (SEC). Their operational statistics are public and available for download.
The latest dataset consisted of 119,573 entries, and its attributes included:
- Amount Requested
- Amount Funded by Investors*
- Interest Rate
- Term (Loan Length)
- Purpose of Loan
- Debt/Income Ratio **
- State
- Rent or Own Home
- Monthly Income
- FICO Low
- FICO High
- Credit Lines Open
- Revolving Balance
- Inquiries in Last 6 Months
- Length of Employment
*LendingClub states that the amount funded by investors has no effect on the final interest rate assigned to a loan.
** DTI ratio takes all of your monthly liabilities and divides the total by your gross monthly income.
Once I had the .csv loaded as a data frame in R, I had a little data munging to do. I wasn’t sure which method I was going to use at this point, but I always address missing data first. In this instance I replaced NA entries with column averages and converted interest rate to a numeric column (the % sign caused R to import it as a string). I also created a debt-to-income ratio variable by dividing annual income by 12 and dividing the result by revolving balance.
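The cleaning steps described above might look roughly like the following sketch. The toy data frame and its values are invented for illustration. The column names int_rate, annual_inc and revol_bal follow the variable names in the model output later in the post, and DTIratio is computed the way the text describes it, which is an assumption about the original code.

```r
# Toy stand-in for the LendingClub export; the values are made up.
df <- data.frame(
  int_rate   = c("10.65%", "15.27%", NA, "7.90%"),
  annual_inc = c(24000, NA, 60000, 36000),
  revol_bal  = c(1200, 500, 2500, NA),
  stringsAsFactors = FALSE
)

# Strip the "%" sign so the rate can be stored as a number.
df$int_rate <- as.numeric(sub("%", "", df$int_rate, fixed = TRUE))

# Replace NA entries in every numeric column with that column's mean.
numeric_cols <- names(df)[sapply(df, is.numeric)]
for (col in numeric_cols) {
  missing <- is.na(df[[col]])
  df[[col]][missing] <- mean(df[[col]], na.rm = TRUE)
}

# Ratio as described in the text: monthly income divided by revolving balance.
df$DTIratio <- (df$annual_inc / 12) / df$revol_bal
```

Mean imputation is the simplest option and keeps every row, at the cost of shrinking the variance of the imputed columns.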
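I wanted to create a FICOrange column (inherently a factor or character variable in R because of the “-” in an entry like “675-725”) and then a numeric MeanFICO column.

    # Build a "low-high" range string, e.g. "675-725"
    df$FICOrange <- paste(df$fico_range_low, df$fico_range_high, sep = "-")
    # Average the two three-digit endpoints (characters 1-3 and 5-7)
    FICOMEAN <- function(x) (as.numeric(substr(x, 1, 3)) + as.numeric(substr(x, 5, 7))) / 2
    # Apply to the column created above (FICOrange, not FICO.Range)
    df$MeanFICO <- sapply(df$FICOrange, FICOMEAN, USE.NAMES = FALSE)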
I try to identify confounders early in dredging. Mathbabe, aka Catherine O’Neil, has a great post on confounders here. A simple ANOVA flagged Amount Requested, Debt/Income Ratio, Rent or Own Home, Inquiries in Last 6 Months, Length of Loan, and Purpose of Loan as being significantly associated with both MeanFICO and Interest Rate. Fair Isaac itself states the following:
Fair Isaac’s research shows that opening several credit accounts in a short period of time represents greater credit risk. When the information on your credit report indicates that you have been applying for multiple new credit lines in a short period of time (as opposed to rate shopping for a single loan, which is handled differently as discussed below), your FICO score can be lower as a result.
This coupled with their FICO score breakdown information further confirms that Debt/Income Ratio and Inquiries in Last 6 Months are definite confounders.
The next step was quantitatively expressing the importance of each variable in determining a loan’s interest rate. From experience I would hypothesize that FICO score and loan length are key factors, but let’s find out for sure. The current base rate is 5.05%. Below is a quick screenshot of a subset of the current rates by risk. Essentially, the point of this post is to examine how LendingClub obtains their “Adjustment for Risk & Volatility” values.
Since there are so many variables, some of which may be highly correlated and some of which affect Interest Rate in a non-linear manner, I decided against multiple linear regression and wanted to use gradient boosting (library(gbm)). Gradient boosting constructs additive regression models by sequentially fitting a simple parameterized function (base learner) to current “pseudo”-residuals by least squares at each iteration.[1]
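To make that definition concrete, here is a minimal base-R sketch of boosting with squared-error loss. It is not what the gbm package does internally: the base learner is a hand-rolled one-split stump, the data are synthetic, and the names fit_stump, predict_stump, n_trees and shrinkage are made up for this illustration.

```r
set.seed(1)

# Synthetic data: a non-linear signal plus noise (not LendingClub data).
n <- 200
x <- runif(n, 0, 10)
y <- sin(x) + rnorm(n, sd = 0.2)

# Base learner: a one-split regression stump chosen by least squares.
fit_stump <- function(x, r) {
  xs <- sort(unique(x))
  splits <- (head(xs, -1) + tail(xs, -1)) / 2  # midpoints between neighbours
  best <- list(sse = Inf)
  for (s in splits) {
    left <- x <= s
    l_mean <- mean(r[left])
    r_mean <- mean(r[!left])
    sse <- sum((r[left] - l_mean)^2) + sum((r[!left] - r_mean)^2)
    if (sse < best$sse) {
      best <- list(sse = sse, split = s, left = l_mean, right = r_mean)
    }
  }
  best
}

predict_stump <- function(stump, x) {
  ifelse(x <= stump$split, stump$left, stump$right)
}

n_trees <- 100
shrinkage <- 0.1

# Start from the constant that minimises squared error: the mean.
pred <- rep(mean(y), n)
stumps <- vector("list", n_trees)
train_mse <- numeric(n_trees)

for (m in seq_len(n_trees)) {
  # For squared-error loss the pseudo-residuals are the ordinary residuals.
  residuals <- y - pred
  stumps[[m]] <- fit_stump(x, residuals)
  # Add a shrunken copy of the new learner to the running prediction.
  pred <- pred + shrinkage * predict_stump(stumps[[m]], x)
  train_mse[m] <- mean((y - pred)^2)
}

print(c(first = train_mse[1], last = train_mse[n_trees]))
```

Each pass fits the stump to whatever the current model still gets wrong, so the training error falls steadily. That is also why overfitting needs watching: training error keeps dropping long after held-out error has stopped improving.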
Overfitting is a big concern when modeling data, so I used 10-fold cross-validation (which might be overkill considering the 90k+ training obs). I also used the gbm.perf() function to estimate “the optimal number of boosting iterations for a gbm object” and summary() to summarize the relative influence of each variable in the gbm object. Below is a graphical summary of the relative influence of each variable.
I fit a gbm (gradient boosted model) to a subset of the data (the training data) to generate a list describing how much each variable reduced the squared error. According to this output the three most important variables were MeanFICO, term, and amount requested. Together these three account for roughly 93% of the total relative influence on the predicted interest rate.

    library(gbm)
    # Gaussian (squared-error) boosting of int_rate on all other columns.
    # Stored as gbm_fit so the object does not share a name with gbm().
    gbm_fit <- gbm(int_rate ~ ., data = training,
                   distribution = "gaussian",
                   n.trees = 1000, shrinkage = 0.01,
                   interaction.depth = 7, bag.fraction = 0.9,
                   n.minobsinnode = 50,
                   cv.folds = 10)  # full argument name is cv.folds
I calculated the Root Mean Square (RMS) error % of gbm to assess the power of the model. Basically the RMS error is a measure of the differences between values predicted by a model/estimator and the values actually observed. The model’s RMS error % was 15.48%, not bad.
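The post does not say how the “RMS error %” was normalised, so the sketch below assumes one common choice: RMSE divided by the mean observed value. The two vectors are made-up numbers, not model output.

```r
# Hypothetical observed and predicted interest rates (in percent).
actual    <- c(7.9, 10.65, 13.49, 15.27, 18.25)
predicted <- c(8.4, 10.10, 12.90, 16.00, 17.50)

# Root mean square error: square root of the mean squared difference.
rmse <- sqrt(mean((actual - predicted)^2))

# Assumed normalisation: RMSE as a percentage of the mean observed value.
rmse_pct <- 100 * rmse / mean(actual)

print(c(rmse = rmse, rmse_pct = rmse_pct))
```

In practice the predictions would come from calling predict() on held-out rows, so that the error reflects data the model has not seen.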
Below is the output of summary() on the fitted gbm object.
| var | rel.inf |
| --- | --- |
| MeanFICO | 64.92519872 |
| term | 23.35757505 |
| loan_amnt | 4.89518426 |
| funded_amnt | 3.78362094 |
| purpose | 1.0948237 |
| annual_inc | 1.06986 |
| revol_bal | 0.2888473 |
| home_ownership | 0.27482517 |
| addr_state | 0.18597635 |
| DTIratio | 0.11379097 |
| emp_length | 0.01029754 |
It’s easy to understand the relative weakness of the individual variables that are already rolled up into the FICO score. What is interesting, however, is how little home ownership and loan purpose mattered in determining interest rates. Compare, for example, a loan to purchase jet skis with a loan consolidation ahead of a home application. I am surprised by the lack of weight on home ownership, especially when you consider that LendingClub’s maximum loan is $35,000.
Below is a graph generated with the full dataset (119,573 obs). It shows the relationship between FICO, interest rate, and term. I used a gam line with formula y ~ s(x, bs = "cs") as opposed to a simple lm (linear model) line because I wanted to show how the slope flattens as the line approaches higher FICO scores. This exhibits the diminishing returns of a higher FICO score.
A linear model gives the following estimates:
- Every 1-point increase in MeanFICO is associated with a 0.096 decrease in interest rate.
- Every $1,000 increase in amount requested is associated with a 0.28 increase in interest rate.
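Coefficients like these are read straight off a fitted lm object. The sketch below uses simulated loans, not the LendingClub data: the “true” slopes are set to the two values quoted above purely so the fit has something to recover, and the intercept of 80, the noise level and the column name loan_amnt_k are arbitrary assumptions.

```r
set.seed(42)

# Simulated loans (not LendingClub data).
n <- 500
MeanFICO  <- runif(n, 660, 820)
loan_amnt <- runif(n, 1000, 35000)
int_rate  <- 80 - 0.096 * MeanFICO + 0.28 * (loan_amnt / 1000) +
  rnorm(n, sd = 1)

# Express the amount in thousands so its slope is "per $1,000".
loans <- data.frame(int_rate, MeanFICO, loan_amnt_k = loan_amnt / 1000)

# Ordinary least squares: each coefficient is the change in interest rate
# per one-unit change in that predictor, holding the other one fixed.
fit <- lm(int_rate ~ MeanFICO + loan_amnt_k, data = loans)
print(round(coef(fit), 3))
```

Rescaling the loan amount to thousands before fitting is what makes the second coefficient directly comparable to the “per $1,000” statement.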
LendingClub’s models could, and perhaps should, be much more complicated than this. They could employ text analytics to assess volatility and default risk based on the purpose summary written by the user, although there is no way to verify the intent. LendingClub could also examine the microeconomic climate of each state or zip code and factor in housing availability, rent control, and socioeconomic factors such as geography, race, and gender. It would be interesting to learn how the government would assess the “fair and equal” nature of this type of lending.
[1] http://www-stat.stanford.edu/~jhf/ <- for Chris Rice