Visualizations for credit modeling in R

October 17, 2018

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)


    Visualization is a great way to get an overview of the data before credit modeling. Typically you start with data management and data cleaning, and then begin the credit modeling analysis with visualizations. This article is therefore the first part of a credit machine learning analysis. The second part of the analysis will typically use logistic regression and ROC curves.

    Library of R packages

    In the following sections we use R to visualize credit modeling data. First we load the packages:

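    # Data management packages
    library(readr)
    library(lubridate)
    library(magrittr)
    library(plyr)      # load plyr before dplyr so that dplyr's functions are not masked
    library(dplyr)
    library(gridExtra)
    library(skimr)     # provides skim(), used below for summary statistics
    # Visualization packages
    library(ggplot2)
    library(plotly)
    library(ggthemes)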
    

    Load dataset and data management

    Next we read the dataset and do some data management. We use the Lending Club loan dataset:

    # Read the dataset into R; stringsAsFactors = TRUE keeps text columns as factors (the default before R 4.0)
    loan <- read.csv("/loan.csv", stringsAsFactors = TRUE)
    # Data management of the dataset
    loan$member_id <- as.factor(loan$member_id)
    loan$grade <- as.factor(loan$grade)
    loan$sub_grade <- as.factor(loan$sub_grade)
    loan$home_ownership <- as.factor(loan$home_ownership)
    loan$verification_status <- as.factor(loan$verification_status)
    loan$loan_status <- as.factor(loan$loan_status)
    loan$purpose <- as.factor(loan$purpose)
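
    The column-by-column conversions above can also be written in one step. The following is a minimal sketch on a small made-up data frame (toy_loan and its values are illustrative assumptions, not the Lending Club data):

```r
# Toy stand-in for the loan data; the values are made up for illustration.
toy_loan <- data.frame(
  grade = c("A", "B", "B", "C"),
  purpose = c("car", "credit_card", "car", "other"),
  loan_amnt = c(5000, 12000, 8000, 20000),
  stringsAsFactors = FALSE
)

# Columns that should be treated as categorical
factor_cols <- c("grade", "purpose")

# Convert all of them to factors in one step
toy_loan[factor_cols] <- lapply(toy_loan[factor_cols], as.factor)

str(toy_loan)
```

    Keeping the column names in one vector makes it easy to see which variables are categorical and to extend the list later.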
    

    After the above data management it is time for data selection and data cleaning:

    # Selection of variables for the analysis
    loan <- loan[,c("grade","sub_grade","term","loan_amnt","issue_d","loan_status","emp_length",
                              "home_ownership", "annual_inc","verification_status","purpose","dti",
                              "delinq_2yrs","addr_state","int_rate", "inq_last_6mths","mths_since_last_delinq",
                              "mths_since_last_record","open_acc","pub_rec","revol_bal","revol_util","total_acc")]
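    # Data cleaning for missing observations
    # Treat a missing "months since" value as 0
    loan$mths_since_last_delinq[is.na(loan$mths_since_last_delinq)] <- 0
    loan$mths_since_last_record[is.na(loan$mths_since_last_record)] <- 0
    # Flag the variables that still contain missing values
    var.has.na <- sapply(loan, function(x) any(is.na(x)))
    num_na <- which(var.has.na)
    # Keep only the complete rows
    loan <- loan[complete.cases(loan), ]
    # Summary statistics (skim() comes from the skimr package)
    skimr::skim(loan)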
    Skim summary statistics
     n obs: 886877 
     n variables: 23 
    
    -- Variable type:factor --------------------------------------------------------
                variable missing complete      n n_unique                                       top_counts ordered
              addr_state       0   886877 886877       51      CA: 129456, NY: 74033, TX: 71100, FL: 60901   FALSE
              emp_length       0   886877 886877       12  10+: 291417, 2 y: 78831, < 1: 70538, 3 y: 69991   FALSE
                   grade       0   886877 886877        7       B: 254445, C: 245721, A: 148162, D: 139414   FALSE
          home_ownership       0   886877 886877        6   MOR: 443319, REN: 355921, OWN: 87408, OTH: 180   FALSE
                 issue_d       0   886877 886877      103   Oct: 48619, Jul: 45938, Dec: 44323, Oct: 38760   FALSE
             loan_status       0   886877 886877        8 Cur: 601533, Ful: 209525, Cha: 45956, Lat: 11582   FALSE
                 purpose       0   886877 886877       14 deb: 524009, cre: 206136, hom: 51760, oth: 42798   FALSE
               sub_grade       0   886877 886877       35       B3: 56301, B4: 55599, C1: 53365, C2: 52206   FALSE
                    term       0   886877 886877        2                   36: 620739,  60: 266138, NA: 0   FALSE
     verification_status       0   886877 886877        3     Sou: 329393, Ver: 290896, Not: 266588, NA: 0   FALSE
    
    -- Variable type:numeric -------------------------------------------------------
                   variable missing complete      n     mean       sd   p0   p25   p50   p75    p100
                 annual_inc       0   886877 886877  75019.4 64687.38    0 45000 65000 90000 9500000
                delinq_2yrs       0   886877 886877     0.31     0.86    0     0     0     0      39
                        dti       0   886877 886877    18.16    17.19    0 11.91 17.66 23.95    9999
             inq_last_6mths       0   886877 886877     0.69        1    0     0     0     1      33
                   int_rate       0   886877 886877    13.25     4.38 5.32  9.99 12.99  16.2   28.99
                  loan_amnt       0   886877 886877 14756.97  8434.43  500  8000 13000 20000   35000
     mths_since_last_delinq       0   886877 886877    16.62    22.89    0     0     0    30     188
     mths_since_last_record       0   886877 886877    10.83    27.65    0     0     0     0     129
                   open_acc       0   886877 886877    11.55     5.32    1     8    11    14      90
                    pub_rec       0   886877 886877      0.2     0.58    0     0     0     0      86
                  revol_bal       0   886877 886877 16924.56 22414.33    0  6450 11879 20833 2904836
                 revol_util       0   886877 886877    55.07    23.83    0  37.7    56  73.6   892.3
                  total_acc       0   886877 886877    25.27    11.84    1    17    24    32     169
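
    The cleaning steps above (count the missing values, fill some of them, drop the remaining incomplete rows) can be checked on a small example. This is a sketch on made-up values; toy_loan, na_counts and toy_clean are hypothetical names used only for illustration:

```r
# Toy data with a few missing values; the numbers are made up.
toy_loan <- data.frame(
  loan_amnt = c(5000, 12000, NA, 20000),
  mths_since_last_delinq = c(NA, 14, NA, 3),
  dti = c(10.5, 22.1, 18.0, NA)
)

# Number of missing values per column
na_counts <- colSums(is.na(toy_loan))
print(na_counts)

# Treat a missing "months since last delinquency" as 0, as in the article
toy_loan$mths_since_last_delinq[is.na(toy_loan$mths_since_last_delinq)] <- 0

# Drop the rows that still contain a missing value
toy_clean <- toy_loan[complete.cases(toy_loan), ]
print(nrow(toy_clean))
```

    Note that coding a missing value as 0 makes "no recorded delinquency" look the same as "delinquent this month", so a separate indicator column may be worth considering before modeling.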
    

    Visualizations for credit modeling

    With the dataset loaded and cleaned, it is time to make the credit modeling visualizations in R. We start with the number of customers per grade:

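    # Bar chart of customers per grade, labelled with the counts
    # (after_stat() replaces the deprecated ..count.. notation; needs ggplot2 >= 3.3.0)
    ggplot(data = loan, aes(x = grade)) +
      geom_bar(color = "blue", fill = "green") +
      geom_text(stat = "count", aes(label = after_stat(count))) +
      theme_solarized()
    ggplotly(p = ggplot2::last_plot())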
    

    The code above gives us the following graph:

    Let’s look at how customers are distributed by home ownership:

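    # Bar chart of customers by home ownership, labelled with the counts
    # (after_stat() replaces the deprecated ..count.. notation; needs ggplot2 >= 3.3.0)
    ggplot(data = loan, aes(x = home_ownership)) +
      geom_bar(color = "blue", fill = "green") +
      geom_text(stat = "count", aes(label = after_stat(count))) +
      theme_solarized()
    ggplotly(p = ggplot2::last_plot())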
    

    This gives us the following bar plot:

    For the next visualizations, we need to do some more data management:

    # Data management for loan status: merge the "Does not meet the credit policy" statuses
    loan$loan_status <- revalue(loan$loan_status, c(
      "Does not meet the credit policy. Status:Charged Off" = "Charged Off",
      "Does not meet the credit policy. Status:Fully Paid" = "Fully Paid"))
    # Number of loans per status
    loan_status_data <- loan %>% group_by(loan_status) %>% dplyr::summarize(total = n())
    # Chart with home ownership and loan status
    ggplot(data = loan, aes(x = home_ownership, fill = loan_status)) + geom_bar()
    ggplotly(p = ggplot2::last_plot())
    

    The code above gives us the following visualization:
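
    A stacked bar chart shows counts, so small groups are hard to compare with large ones. One way to read the same information as shares is a row-wise proportion table. The sketch below uses base R on made-up values (toy_loan, status_counts and status_share are hypothetical names):

```r
# Toy data; the ownership and status values are made up for illustration.
toy_loan <- data.frame(
  home_ownership = c("RENT", "RENT", "RENT", "RENT",
                     "MORTGAGE", "MORTGAGE", "OWN", "OWN"),
  loan_status = c("Current", "Charged Off", "Fully Paid", "Current",
                  "Current", "Fully Paid", "Current", "Charged Off")
)

# Cross-tabulate ownership against status (the counts behind the stacked bars)
status_counts <- table(toy_loan$home_ownership, toy_loan$loan_status)

# Convert each row to shares that sum to 1
status_share <- prop.table(status_counts, margin = 1)
print(round(status_share, 2))
```

    In ggplot2 the equivalent picture is obtained with geom_bar(position = "fill"), which scales every bar to the same height.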

    Now let's look at customers by verification status:

    # Customer and loan verification
    ggplot(data=loan, aes(x=verification_status, fill=loan_status))+ geom_bar()
    ggplotly(p = ggplot2::last_plot())
    

    This gives us the following plot:

    Let's look at the distributions of loan amount and interest rate as bar charts:

    # Loan amount
    ggplot(data = loan,aes(x = loan_amnt)) + geom_bar(color = 'green')
    ggplotly(p = ggplot2::last_plot())
    # Interest rate
    ggplot(data = loan,aes(x = int_rate))+ geom_bar(color = 'green')
    ggplotly(p = ggplot2::last_plot())
    

    This gives the following two graphs:

    Now let's look at histograms of loan amount and interest rate, colored by grade:

    #Histogram on loan amount
    ggplot(data = loan,aes (x = loan_amnt,fill= grade))+ geom_histogram()
    ggplotly(p = ggplot2::last_plot())
    #Histogram  on interest rate
    ggplot(data = loan,aes (x = int_rate,fill= grade))+ geom_histogram()
    ggplotly(p = ggplot2::last_plot())
    

    This gives us the following two histograms:
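
    The bin counts behind a histogram can also be computed directly, which helps when checking what a plot shows. This is a minimal base R sketch on made-up loan amounts; the break points are an assumption chosen for illustration:

```r
# Made-up loan amounts for illustration
loan_amnt <- c(1000, 2500, 4000, 6000, 7500, 9000, 12000, 15000)

# Compute the bins without drawing; intervals are (0, 5000], (5000, 10000], (10000, 15000]
bins <- hist(loan_amnt, breaks = seq(0, 15000, by = 5000), plot = FALSE)

print(bins$breaks)
print(bins$counts)
```

    geom_histogram() uses 30 bins unless told otherwise, so passing binwidth or bins explicitly makes the choice visible in the code.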

    Now let’s look at density plots of interest rate and loan amount:

    # Density on interest rate
    ggplot(data = loan,aes(x = int_rate)) + geom_density(fill = 'green',color = 'blue')
    ggplotly(p = ggplot2::last_plot())
    # Density on loan amount
    ggplot(data = loan,aes(x = loan_amnt)) + geom_density(fill = 'green',color = 'blue')
    ggplotly(p = ggplot2::last_plot())
    

    This gives us the following density plots:

    Next, let's look at density plots of loan amount and interest rate by grade:

    #density on loan based on grade type
    ggplot(data = loan,aes(x = loan_amnt,fill = grade)) + geom_density()
    ggplotly(p = ggplot2::last_plot())
    #density on interest rate based on grade type
    ggplot(data = loan,aes(x = int_rate,fill = grade)) + geom_density()
    ggplotly(p = ggplot2::last_plot())
    

    This gives us the following plots:

    Lastly let us look at box plots for interest rate based on purpose and grade:

    # Box plot interest rate & purpose
    boxplot(int_rate ~ purpose, col="darkgreen", data=loan)
    # Boxplot interest rate & grade 
    boxplot(int_rate ~ grade, col="darkgreen", data=loan)
    

    The code above gives us the following two box plots:
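
    The numbers a box plot draws can be printed as well. The sketch below computes the median and Tukey's five-number summary per grade in base R; toy_loan and its interest rates are made up for illustration:

```r
# Toy data; the interest rates are made up.
toy_loan <- data.frame(
  grade = c("A", "A", "A", "B", "B", "B"),
  int_rate = c(5.5, 6.0, 7.5, 9.0, 10.5, 12.0)
)

# Split the interest rates by grade
rate_by_grade <- split(toy_loan$int_rate, toy_loan$grade)

# Median interest rate per grade (the line inside each box)
rate_median <- sapply(rate_by_grade, median)
print(rate_median)

# Minimum, lower hinge, median, upper hinge, maximum per grade
rate_fivenum <- lapply(rate_by_grade, fivenum)
print(rate_fivenum)
```

    Comparing these summaries across groups is a quick check that the box plots are showing what you expect.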


