Mining Lending Club’s Goldmine of Loan Data Part I of II – Visualizations by State

[This article was first published on Dataspora » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I have a few friends that keep bragging about their 14% annual returns by investing their money with Lending Club, a peer-to-peer lending service that cuts out the complexities and difficulties of getting approved for a loan through a bank. To give you an idea of the sheer amount of volume Lending Club has been dealing with, here’s a snapshot of the Company Statistics as of 10/14/2011:

  • Loans funded to date: $387,043,375
  • Loans funded last month: $24,945,400
  • Interest paid to investors since inception: $32,135,688

Currently Lending Club is boasting that 91% of borrowers earn between 6-18%. Now of course, higher returns are correlated with higher risk. You can choose to diversify your investment across hundreds of different loans with different credit grades – the worse the credit grade, the higher the return percentage, and the higher the risk. I thought it would be interesting to investigate Lending Club a bit more so I navigated over to their site and found something that only a data scientist would consider to be gold:

Yes, they have provided complete loan data in CSV format for all of us data geeks to devour.  The data include the current loan status (Current, Late, Fully Paid, etc.), credit grades, interest rates, loan purposes, and all sorts of other juicy tidbits of borrower information.

I downloaded the data and quickly determined that this csv file contained information on 37,122 loans. Of course the first thing I did was fire up R:

library(ggplot2) library(maps) loans <- read.csv("LoanStats.csv", header=TRUE, skip = 1)

One of the three sexy skills of the data geek includes data munging, otherwise known as suffering. This post will briefly touch on 2 of the 3 skills - Data Munging and Data Visualization. But first, we need to get the data into a format that our tool, in this case R, can handle. We’ll replace some percentage signs and change a factor to a character string.

loans$Debt.To.Income.Ratio <- as.numeric(gsub("%", "", loans$Debt.To.Income.Ratio)) loans$State <- as.character(loans$State) loans$Interest.Rate <- as.numeric(gsub("%", "", loans$Interest.Rate)) loans$Revolving.Line.Utilization <- as.numeric(gsub("%", "", loans$Revolving.Line.Utilization))

Conveniently, map_data(“state”) will get all of the latitude and longitude data for each state so that we can draw a map of the U.S.

states <- map_data("state") loans<-loans[-which(loans$State == ""),]

#Change state abbreviations to full names so we can merge our data frames together state.names <- unlist(sapply(loans$State, function(x) if(length([grep(x,]) == 0) "District of Columbia" else[grep(x,]) ) loans$State <- tolower(state.names) colnames(loans)[23] <- "region"

Now we will determine the number of loans by state and merge this data.frame with the states data so we can plot this all out on a map using ggplot2.

state.counts <- data.frame(table(loans$region)) colnames(state.counts) <- c("region", "Num.Loans") result<-merge(state.counts, states, by=c("region")) result <- result[order(result$order),]

p <- ggplot(result, aes(x=long, y=lat, group=group, fill=Num.Loans)) + geom_polygon() + scale_fill_gradient(low = "yellow", high = "blue") + coord_equal(ratio=1.75) print(p)

It doesn’t take a geography whiz to realize that a state is missing from this map! Turns out, Lending Club has zero borrowers in North Dakota as of 10/14/2011. Since the number of loans can range anywhere from 3 in Maine to 6,452 in California, we can also plot the log of the total number of loans in order to more easily compare each state's loan activity visually. Why don’t we also add poor North Dakota onto our map? We will assign its Num.Loans variable a value of 1 since we will be taking the log for our next visualization and log(1) = 0.

nd<-map_data("state")[grep("north dakota", map_data("state")[,5]),] nd$Num.Loans <- 1 result <- rbind(result, nd) result <- result[order(result$order),]

p <- ggplot(result, aes(x=long, y=lat, group=group, fill=Num.Loans)) + geom_polygon() + scale_fill_gradient(low = "yellow", high = "blue", trans="log") + coord_equal(ratio=1.75) print (p)

That’s better. We can see that most of Lending Club’s borrowers are from CA, which makes sense given that Lending Club is headquartered in San Francisco. They also have vast reach across Texas, Florida, New York, a good portion of the east coast, and states bordering California. They have the least number of borrowers in Maine and parts of the West to Midwest.

Now let’s explore some of the other variables and project them onto our map. We will look at the median monthly incomes by state.

monthly.income <-aggregate(loans$Monthly.Income, by=list(loans$region), function(x) median(x, na.rm=TRUE)) colnames(monthly.income) <- c("region", "monthly.income") result <- merge(monthly.income, states, by="region") nd<-map_data("state")[grep("north dakota", map_data("state")[,5]),] nd$monthly.income <- 0 result <- rbind(result, nd) result <- result[order(result$order),]

p <- ggplot(result, aes(x=long, y=lat)) + geom_polygon(data=result, aes(x=long, y=lat, group = group, fill=monthly.income)) + scale_fill_gradient(low="yellow", high="purple") + coord_equal(ratio=1.75) print(p)

You may recall that Lending Club has only issued 3 loans in Maine. This means we are only looking at 3 data points, which is not a large sample size. We can add any type of information we would like to the center of each state on our map. Let’s add the total number of loans in each state using geom_text() to the center of each state to give this information a little more context.<-data.frame(region = tolower(,$x,$y) <- subset(, !region %in% c("alaska", "hawaii")) totals <- data.frame(table(loans$region)) colnames(totals) <- c("region", "total") <- merge(, totals)
p + geom_text(, aes(label=total, cex=0.5))

I have one more trick up my sleeve, which I hacked together thanks to this post from We will look at how the median interest rate for loans issued by Lending Club have varied over the past 4 years by state.


#Pull out just the year from the Issued.Date for each loan loans$Issued.Year <- substr(loans$Issued.Date, 1, 4)<-aggregate(loans$Interest.Rate,by=list(loans$Issued.Year, loans$region), function(x) median(x, na.rm=TRUE)) years <- c("2007", "2008", "2009", "2010", "2011") colnames( <- c("year", "region", "interest.rate")

result <- merge(, states, by="region") result <- result[order(result$order),]

#Calculate the lower and upper bounds for the gradient lower <- floor(summary($interest.rate)[1])[[1]] upper <- ceiling(summary($interest.rate)[6])[[1]]

states2 <- data.frame(map("state", plot=FALSE)[c("x","y")]) animateMap <- function(year){ result.year <- result[grep(year, result$year),] usamap <- ggplot(data=states2, aes(x=x, y=y)) + geom_path()+ geom_polygon(data=result.year, aes(x=long, y=lat, group = group, fill=interest.rate)) print(usamap + scale_fill_gradient(low="yellow", high="blue", limits=c(lower, upper)) + coord_equal(ratio=2.00) + opts(title = paste('Median Interest Rates for all Issued Loans by State in', year)) + labs(fill="Interest Rate (%)") + xlab("") + ylab("")) } saveMovie(for (i in 1:length(years)) animateMap(years[i]), clean = T);

If we observe our award-winning animated GIF created in R, we can see that the interest rates that Lending Club calculated for issued loans in 2007, the year of its inception, were much more heterogeneous than they are now. They are the highest in 2009 at around 14% across a majority of the U.S. and now they are more constant, hovering around 11% for most states. States without color simply indicate that there were no loans issued by Lending Club in that state for the given year.

What are some interesting visualizations you have come up with using Lending Club’s trove of borrower data?


Due to high demand, I have created a map of "Good" vs. "Bad" borrowers broken down by state. Since some states have many more borrowers than others, I also included the total number of borrowers that went into the ratio, depicted as a number on each state's center. I filtered the original loan data down to two classes of borrowers. "Good" borrowers as those that were fully paid and "Bad" borrowers are those that either charged off, defaulted, or were late on payments. This resulted in 1,542 "Bad" borrowers and 4,647 "Good" borrowers. I then simply calculated the percentage of "Bad" customers by state. Keep in mind this does not include data on the ~24,000 other loans that are current! Click the image to see a larger version.

As you can see, the only states with 0% "Bad" borrowers are those with fewer than 13 borrowers. If we compare states with multiple hundreds of borrowers, Florida consists of about 40% "Bad" borrowers! That's approximately 167 borrowers out of 418! Texas, New York and Pennsylvania borrowers on the other hand, are pretty diligent with paying back their loans and are boasting that only 20% of their borrowers are naughty. Meanwhile California, Lending Club's home state, has the most borrowers and about 30% of those have either charged off, defaulted or were late on their payments.

Hang tight for a more quantitative analysis in which we will try to determine which factors other than state of residence are most important in determining what makes a "good" or "bad" borrower.

To leave a comment for the author, please follow the link and comment on their blog: Dataspora » R. offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)