This post shares the video from the talk presented in August 2013 by Ross Gayler on Credit Scoring and R at Melbourne R Users.
Credit scoring tends to involve the balancing of mutually contradictory objectives spiced with a liberal dash of methodological conservatism. This talk emphasises the craft of credit scoring, focusing on combining technical components with some less common analytical techniques. The talk describes an analytical project which R helped to make relatively straightforward.
Ross Gayler describes himself as a recovered psychologist who studied rats and stats (minus the rats) a very long time ago. Since then he has mostly worked in credit scoring (predictive modelling of risk-related customer behaviour in retail finance) and has forgotten most of the statistics he ever knew.
Credit scoring involves counterfactual reasoning. Lenders want to set policies based on historical experience, but what they really want to know is what would have happened if their historical policies had been different. The statistical consequence is twofold: we are required to build models of structure that is not explicitly present in the available data, and the available data is systematically censored. The simplest example is that the applicants who are estimated to have the highest risk are declined credit, and consequently we do not have explicit knowledge of how they would have performed if they had been accepted. Overcoming this problem is known as ‘reject inference’ in credit scoring. Reject inference is typically discussed as a single-level phenomenon, but in reality there can be multiple levels of censoring. For example, an applicant who has been accepted by the lender may withdraw their application, with the consequence that we don’t know whether they would have successfully repaid the loan had they taken up the offer.
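To make the censoring concrete, here is a minimal simulation sketch in Python. It is not taken from the talk: the latent risk variable, the logistic link, the cutoff and the sample size are all assumptions chosen purely for illustration, and the lender is assumed to observe the true risk exactly. It only shows that the bad rate observed on accepted applicants understates the bad rate of the full applicant population, which is the gap reject inference tries to bridge.

```python
import math
import random


def simulate_applicants(n, seed=0):
    """Generate (risk, bad) pairs for n hypothetical applicants."""
    rng = random.Random(seed)
    applicants = []
    for _ in range(n):
        # Assumed latent riskiness: higher means riskier.
        risk = rng.gauss(0.0, 1.0)
        # Assumed logistic link from risk to the probability of going bad.
        p_bad = 1.0 / (1.0 + math.exp(-(risk - 2.0)))
        bad = rng.random() < p_bad
        applicants.append((risk, bad))
    return applicants


def bad_rate(rows):
    """Proportion of rows whose outcome is bad."""
    return sum(bad for _, bad in rows) / len(rows)


applicants = simulate_applicants(20000)

# Hypothetical historical policy: decline everyone above the risk cutoff.
cutoff = 0.5
accepted = [row for row in applicants if row[0] <= cutoff]
rejected = [row for row in applicants if row[0] > cutoff]

# Only the accepted outcomes are visible to the lender in practice.
observed_rate = bad_rate(accepted)
# These two are knowable here only because the data is simulated.
full_rate = bad_rate(applicants)
rejected_rate = bad_rate(rejected)

print(f"observed (accepted only): {observed_rate:.3f}")
print(f"full population:          {full_rate:.3f}")
print(f"rejected (never seen):    {rejected_rate:.3f}")
```

A second level of censoring, such as accepted applicants who do not take up the offer, could be sketched the same way by filtering `accepted` again before computing the observed rate.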
Independently of reject inference, it is standard to summarise all the available predictive information as a single score that predicts a behaviour of interest. In reality, there may be multiple behaviours that need to be simultaneously considered in decision making. These may be predicted by multiple scores, and in general there will be interactions between the scores, so they need to be considered jointly in decision making. The standard technique for implementing this is to divide each score into a small number of discrete levels and consider the cross-tabulation of the two scores. This is simple but limited because it does not make optimal use of the data, raises problems of data sparsity, and makes it difficult to achieve a fine level of control.
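The banding and cross-tabulation approach can be sketched as follows. This is an illustrative Python sketch rather than the project's code: the helper names `band` and `cross_tabulate`, the cut points and the eight records are all hypothetical. Even in this toy case several cells hold a single account, which hints at the sparsity problem described above.

```python
from bisect import bisect_right
from collections import Counter


def band(score, cut_points):
    """Return the index of the band that `score` falls into.

    `cut_points` must be sorted; a score equal to a cut point
    falls into the higher band.
    """
    return bisect_right(cut_points, score)


def cross_tabulate(records, cuts_a, cuts_b):
    """Count accounts and bad outcomes in each (band_a, band_b) cell.

    `records` is an iterable of (score_a, score_b, bad) tuples.
    """
    totals = Counter()
    bads = Counter()
    for score_a, score_b, bad in records:
        cell = (band(score_a, cuts_a), band(score_b, cuts_b))
        totals[cell] += 1
        bads[cell] += int(bad)
    return totals, bads


# Hypothetical accounts: (score A, score B, went bad?).
records = [
    (520, 610, True), (540, 700, False), (630, 580, True),
    (650, 690, False), (710, 720, False), (705, 590, False),
    (560, 600, True), (690, 710, False),
]

cuts_a = [600, 700]  # assumed: three bands for score A
cuts_b = [650]       # assumed: two bands for score B

totals, bads = cross_tabulate(records, cuts_a, cuts_b)
for cell in sorted(totals):
    rate = bads[cell] / totals[cell]
    print(cell, "n =", totals[cell], "bad rate =", round(rate, 2))
```

Adding more bands gives finer control over the decision boundary but thins out each cell further, which is the trade-off that motivates smoothing over the two scores jointly instead of banding them.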
This talk covers a project that dealt with multiple, nested reject inference problems in the context of two scores to be considered jointly. It involved multivariate smoothing spline regression and some general R carpentry to plug all the pieces together.