Visualization of Prosper.com’s Loan Data Part I of II – Compare and Contrast with Lending Club

[This article was first published on Dataspora » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Due to the positive feedback received on this post I thought I would re-create the analysis on another peer-to-peer lending dataset, courtesy of Prosper.com. You can access the Prosper Marketplace data via an API or by simply downloading XML files that are updated nightly http://www.prosper.com/tools/.

If you are going to follow the route I took and download the latest XML file, ProsperDataExport_xml.zip, you will find this utility helpful in converting the XML files to CSVs: Convert Prosper XML to CSV

Once you have downloaded the .jar file run the following command (changing the parameters of course!):
java -jar ProsperXMLtoCSV.jar ProsperXMLFileLocation CSVDestinationDirectory

Similar to Lending Club, Prosper provides loan-level data such as interest rate, amount funded/requested, borrower state, borrower debt to income ratio, etc. However, Prosper also provides additional information regarding their user base and loan performance history. This information includes extended credit profiles of users, groups that users belong to, social networks within the user base and even retroscores, or how a loan would be rated by Prosper under a new heuristic given macroeconomic shifts over time.

Let’s jump right into the visualizations by state:

library(ggplot2) library(maps)

## Warning: this is a very large dataset that required ~10 minutes ## to read into R on a fast 8-core Xeon server. loans <- read.csv("Loans.CSV", header=TRUE) listings <- read.csv("Listings.CSV", header=TRUE)

## Obtain the active loans from the Listings file, since it ## contains more detailed information than the Loans file listings.match <- listings[match(loans$ListingKey, listings$Key),]

listings.match$BorrowerState <- as.character(listings.match$BorrowerState) loans <- listings.match states <- map_data("state")

## Change state abbreviations to full names so we can merge our ## data frames together state.names <- unlist(sapply(loans$BorrowerState, function(x) if(length(state.name[grep(x, state.abb)]) == 0) "District of Columbia" else state.name[grep(x, state.abb)]) ) loans$BorrowerState <- tolower(state.names) colnames(loans)[11] <- "region" state.counts <- data.frame(table(loans$region)) colnames(state.counts) <- c("region", "Num.Loans") result<-merge(state.counts, states, by=c("region")) result <- result[order(result$order),]

p <- ggplot(result, aes(x=long, y=lat, group=group, fill=Num.Loans)) + geom_polygon() + scale_fill_gradient(low = "yellow", high = "blue") + coord_equal(ratio=1.75) + opts(title = 'Number of Issued Loans by State') print(p)

Click for Larger Image

It comes as no surprise that a majority of issued loans originate in California. As with Lending Club, Prosper is a San Francisco-based peer-to-peer lending company.

Now we will take the log of the number of loans issued by state and compare Prosper’s market reach with Lending Club’s.

p <- ggplot(result, aes(x=long, y=lat, group=group, fill=Num.Loans)) + geom_polygon() + scale_fill_gradient(low = "yellow", high = "blue", trans="log") + coord_equal(ratio=1.75) + opts(title = 'Log Number of Issued Loans by State') print(p)

Prosper

Click for Larger Image

Lending Club

Click for Larger Image

The two maps are extremely similar. Both lending companies issue the most loans in California, Texas and Florida. There are some minor differences such as Lending Club issuing more loans than Prosper in Wyoming and Montana.

Instead of the Monthly Income by State map that I created for Lending Club, we will observe Debt to Income Ratios by state for both Prosper borrowers and Lending Club borrowers.

## Aggregate median debt to income ratio by state debt.to.income <-aggregate(loans$DebtToIncomeRatio, by=list(loans$region), function(x) median(x, na.rm=TRUE)) colnames(debt.to.income) <- c("region", "debt.to.income") result <- merge(debt.to.income, states, by="region") result <- result[order(result$order),]

p <- ggplot(result, aes(x=long, y=lat)) + geom_polygon(data=result, aes(x=long, y=lat, group = group, fill=debt.to.income)) + scale_fill_gradient(low="yellow", high="purple") + coord_equal(ratio=1.75) + labs(fill="Debt to Income Ratio") + opts(title = 'Median Debt to Income Ratio of Borrowers by State')

Prosper

Click for Larger Image

Lending Club

Click for Larger Image

Does anyone want to start pointing fingers for the United States debt crisis yet? The states that Prosper loans to the most are also the ones with the lowest Debt to Income Ratios. New Yorkers, in particular, have the lowest median Debt to Income Ratio. Lending Club seems to have much more homogeneous interest rates. We can compare the distributions of the two companies' Debt to Income Ratios with a call to ggplot (after a bit of pre-processing that I left out due to real estate on this page):

ggplot(combined, aes(x=DebtToIncomeRatio)) + geom_histogram() + facet_grid(Company ~ .)

Click for Larger Image

It appears as if Lending Club has a hard cut-off at a 0.30 Debt to Income Ratio for borrowers. Note that this data is taking into account all loans since the inception of both companies. Prosper implemented stricter borrowing guidelines and interest rates after 2009, which can be seen in the animation below.

issue.year <- substr(loans$StartDate, 0, 4) loans$Issued.Year <- issue.year interest.by.year.by.state<-aggregate(loans$BorrowerRate,by=list(loans$Issued.Year, loans$region), function(x) median(x, na.rm=TRUE)) years <- c("2006", "2007", "2008", "2009", "2010", "2011") colnames(interest.by.year.by.state) <- c("year", "region", "interest.rate") interest.by.year.by.state$interest.rate <- interest.by.year.by.state$interest.rate * 100 result <- merge(interest.by.year.by.state, states, by="region") result <- result[order(result$order),]
#Calculate the lower and upper bounds for the gradient lower <- floor(summary(interest.by.year.by.state$interest.rate)[1])[[1]] upper <- ceiling(summary(interest.by.year.by.state$interest.rate)[6])[[1]]
states2 <- data.frame(map("state", plot=FALSE)[c("x","y")]) animateMap <- function(year){ result.year <- result[grep(year, result$year),] usamap <- ggplot(data=states2, aes(x=x, y=y)) + geom_path()+ geom_polygon(data=result.year, aes(x=long, y=lat, group = group, fill=interest.rate)) print(usamap + scale_fill_gradient(low="yellow", high="blue", limits=c(lower, upper)) + coord_equal(ratio=2.00) + opts(title = paste('Median Interest Rates for all Issued Loans by State in', year)) + labs(fill="Interest Rate (%)") + xlab("") + ylab("")) }

saveMovie(for (i in 1:length(years)) animateMap(years[i]), clean = T);

Notice the interest rates are the most varied in 2006, the year of Prosper’s inception.
It also worth noting that the median interest rates for borrowers soared after 2009, when Prosper implemented stricter guidelines for borrowers, which also resulted in lower default rates.

Stay tuned for a "social network" analysis of Prosper.com's member data coming up in Part II!

To leave a comment for the author, please follow the link and comment on their blog: Dataspora » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)