Articles by statcompute

Additional Thoughts on Estimating LGD with Proportional Odds Model

February 6, 2018 | statcompute

In my previous post (https://statcompute.wordpress.com/2018/01/28/modeling-lgd-with-proportional-odds-model), I’ve discussed how to use Proportional Odds Models in the LGD model development. In particular, I specifically mentioned that we would estimate a sub-model, which can be Gamma or Simplex regression, to project the conditional mean for LGD values in ... [Read more...]
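The two-stage idea, a proportional odds stage plus a conditional-mean sub-model, combines by weighting. A minimal Python sketch (the post itself works in R; the probabilities and conditional mean below are illustrative placeholders, not fitted values):

```python
def expected_lgd(p_zero, p_partial, p_full, cond_mean_partial):
    """Combine stage probabilities with the sub-model's conditional mean.

    p_zero, p_partial, p_full: probabilities of LGD = 0, 0 < LGD < 1, LGD = 1
    cond_mean_partial: conditional mean of LGD on (0, 1), e.g. from a
    Gamma or Simplex regression sub-model.
    """
    assert abs(p_zero + p_partial + p_full - 1.0) < 1e-9
    # zeros contribute nothing; full losses contribute their probability
    return p_full * 1.0 + p_partial * cond_mean_partial

e = expected_lgd(0.3, 0.5, 0.2, 0.4)  # 0.2 + 0.5 * 0.4 = 0.4
```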

Modeling LGD with Proportional Odds Model

January 28, 2018 | statcompute

The LGD model is an important component in the expected loss calculation. In https://statcompute.wordpress.com/2015/11/01/quasi-binomial-model-in-sas, I discussed how to model LGD with quasi-binomial regression, which is simple and makes no distributional assumption. In real-world LGD data, we would usually observe 3 ordered categories of values, including 0, 1, ... [Read more...]
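The three ordered categories map naturally to a cumulative logit model. A minimal Python sketch of how a proportional odds model turns one shared linear predictor and two intercepts into three category probabilities (the post itself works in R/SAS; the intercept and predictor values below are made up):

```python
import math

def ordinal_probs(eta, cuts):
    """Category probabilities under a proportional odds (cumulative logit) model.

    eta:  linear predictor x'beta, shared across all categories
    cuts: increasing intercepts alpha_1 < ... < alpha_{K-1}
    P(Y <= k) = 1 / (1 + exp(-(alpha_k - eta)))
    """
    cdf = [1.0 / (1.0 + math.exp(-(a - eta))) for a in cuts] + [1.0]
    # adjacent differences of the cumulative probabilities
    return [cdf[0]] + [cdf[k] - cdf[k - 1] for k in range(1, len(cdf))]

# three ordered LGD categories: 0, (0, 1), and 1
p = ordinal_probs(eta=0.5, cuts=[-1.0, 1.0])
```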

Model Non-Negative Numeric Outcomes with Zeros

September 17, 2017 | statcompute

As mentioned in the previous post (https://statcompute.wordpress.com/2017/06/29/model-operational-loss-directly-with-tweedie-glm/), we often need to model non-negative numeric outcomes with zeros in the operational loss model development. Tweedie GLM provides a convenient interface to model non-negative losses directly by assuming that aggregated losses are the Poisson sum of Gamma outcomes, ... [Read more...]
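The Poisson-sum-of-Gammas construction can be sketched directly, which also shows why such data contain exact zeros. A hedged pure-Python illustration (the post uses a Tweedie GLM in R; the parameter values below are arbitrary):

```python
import math
import random

def tweedie_sample(lam, shape, scale, rng):
    """One draw from a compound Poisson-Gamma (Tweedie, 1 < p < 2) distribution:
    the sum of N ~ Poisson(lam) i.i.d. Gamma(shape, scale) severities.
    Returns exactly 0 when N = 0, matching loss data with many zeros."""
    # draw N ~ Poisson(lam) by inversion (fine for small lam)
    u, k, p = rng.random(), 0, math.exp(-lam)
    cum = p
    while u > cum:
        k += 1
        p *= lam / k
        cum += p
    # aggregated loss: Poisson sum of Gamma severities
    return sum(rng.gammavariate(shape, scale) for _ in range(k))

rng = random.Random(42)
draws = [tweedie_sample(0.5, 2.0, 100.0, rng) for _ in range(200)]
```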

Variable Selection with Elastic Net

September 3, 2017 | statcompute

LASSO has been a popular algorithm for variable selection and is extremely effective with high-dimensional data. However, it often tends to “over-regularize”, yielding a model that might be overly compact and therefore under-predictive. The Elastic Net addresses the aforementioned “over-regularization” by balancing between LASSO and ridge penalties. In particular, a hyper-parameter, ... [Read more...]
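The balance between the two penalties can be written down in a few lines. A minimal Python sketch of the elastic net penalty in a glmnet-style parameterization (lam and alpha values below are illustrative):

```python
def elastic_net_penalty(beta, lam, alpha):
    """Elastic net penalty: lam * (alpha * sum|b| + (1 - alpha) / 2 * sum b^2).

    alpha = 1 recovers the LASSO penalty, alpha = 0 recovers ridge,
    and values in between blend the two."""
    l1 = sum(abs(b) for b in beta)
    l2 = sum(b * b for b in beta)
    return lam * (alpha * l1 + (1.0 - alpha) * 0.5 * l2)

lasso = elastic_net_penalty([1.0, -2.0], lam=0.1, alpha=1.0)  # pure L1
ridge = elastic_net_penalty([1.0, -2.0], lam=0.1, alpha=0.0)  # pure L2
```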

DART: Dropout Regularization in Boosting Ensembles

August 20, 2017 | statcompute

The dropout approach developed by Hinton has been widely employed in deep learning to prevent deep neural networks from overfitting, as shown in https://statcompute.wordpress.com/2017/01/02/dropout-regularization-in-deep-neural-networks. In the paper http://proceedings.mlr.press/v38/korlakaivinayak15.pdf, dropout can also be used to address the overfitting in ... [Read more...]
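In XGBoost, DART is switched on through booster parameters. An illustrative (untuned) parameter set, assuming the xgboost interface; the numeric values are placeholders, not recommendations:

```python
# Illustrative XGBoost parameter set enabling DART-style dropout in boosting.
dart_params = {
    "booster": "dart",          # use DART instead of the default gbtree
    "rate_drop": 0.1,           # fraction of existing trees dropped per round
    "skip_drop": 0.5,           # probability of skipping dropout in a round
    "sample_type": "uniform",   # how dropped trees are selected
    "normalize_type": "tree",   # how the new tree is weighted after a drop
    "max_depth": 3,
    "eta": 0.1,
}
```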

Model Operational Losses with Copula Regression

August 20, 2017 | statcompute

In the previous post (https://statcompute.wordpress.com/2017/06/29/model-operational-loss-directly-with-tweedie-glm), it has been explained why we should consider modeling operational losses for non-material UoMs directly with Tweedie models. However, for material UoMs with significant losses, it is still beneficial to model the frequency and the severity separately. In the prevailing modeling ... [Read more...]
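The copula idea, joining the frequency and severity margins through a dependence structure, can be sketched with a Gaussian copula. A minimal pure-Python illustration (the post itself works in R; rho below is an arbitrary value):

```python
import math
import random

def gaussian_copula_pair(rho, rng):
    """Draw a dependent (u1, u2) pair on (0, 1)^2 via a Gaussian copula:
    correlate two standard normals, then push each through the normal CDF.
    u1 can feed the frequency margin and u2 the severity margin."""
    z1 = rng.gauss(0.0, 1.0)
    z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
    # standard normal CDF via the error function
    phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return phi(z1), phi(z2)

rng = random.Random(1)
pairs = [gaussian_copula_pair(0.6, rng) for _ in range(100)]
```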

Model Operational Loss Directly with Tweedie GLM

June 29, 2017 | statcompute

In the development of operational loss forecasting models, the Frequency-Severity modeling approach, in which the frequency and the severity of a Unit of Measure (UoM) are modeled separately, has been widely employed in the banking industry. However, sometimes it also makes sense to model the operational loss directly, especially for UoMs ... [Read more...]

GLM with H2O in R

June 27, 2017 | statcompute

Below is an example showing how to fit a Generalized Linear Model with H2O in R. The output is much more comprehensive than the one generated by the generic R glm(). [Read more...]

H2O Benchmark for CSV Import

June 25, 2017 | statcompute

The importFile() function in H2O is extremely efficient due to the parallel reading. The benchmark comparison below shows that it is comparable to the read.df() in SparkR and significantly faster than the generic read.csv(). [Read more...]

Using Tweedie Parameter to Identify Distributions

June 24, 2017 | statcompute

In the development of operational loss models, it is important to identify which distribution should be used to model operational risk measures, e.g. frequency and severity. For instance, why should we use the Gamma distribution instead of the Inverse Gaussian distribution to model the severity? In my previous post ... [Read more...]
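The Tweedie variance power p indexes the familiar families through the variance function V(mu) = mu^p, which is what makes it useful for identification. A small Python helper summarizing the standard mapping:

```python
def tweedie_family(p):
    """Named distribution implied by the Tweedie variance function V(mu) = mu^p."""
    if p == 0:
        return "normal"
    if p == 1:
        return "poisson"
    if 1 < p < 2:
        return "compound poisson-gamma"
    if p == 2:
        return "gamma"
    if p == 3:
        return "inverse gaussian"
    return "other tweedie"
```

An estimated p close to 2, for instance, supports the Gamma over the Inverse Gaussian (p = 3) for severity.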

Finer Monotonic Binning Based on Isotonic Regression

June 15, 2017 | statcompute

In my earlier post (https://statcompute.wordpress.com/2017/01/22/monotonic-binning-with-smbinning-package/), I wrote a monobin() function based on the smbinning package by Herman Jopia to improve the monotonic binning algorithm. The function works well and provides robust binning outcomes. However, there are a couple of potential drawbacks due to the coarse binning. First ... [Read more...]
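The isotonic regression underlying the finer binning can be computed with the pool-adjacent-violators algorithm. A minimal pure-Python sketch (the post itself builds on R's isoreg; the input rates below are made up):

```python
def pava(y):
    """Pool-adjacent-violators: least-squares nondecreasing fit to y.
    Adjacent values that violate monotonicity are pooled into one block,
    which is the idea behind letting isotonic regression pick monotonic bins."""
    # blocks of [sum, count]; merge while the last two block means violate order
    blocks = []
    for v in y:
        blocks.append([v, 1])
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fit = []
    for s, c in blocks:
        fit.extend([s / c] * c)
    return fit

fit = pava([0.1, 0.3, 0.2, 0.5, 0.4, 0.9])  # pooled, nondecreasing bad rates
```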

Joining Tables in SparkR

June 12, 2017 | statcompute

[This article was first published on S+/R – Yet Another Blog in Statistical Computing, and kindly contributed to R-bloggers.]

library(SparkR, lib.loc = paste(Sys.getenv("SPARK_HOME"), "/R/lib", sep = ""))
sc <- sparkR.session(master = "local")
df1 <- read.df("nycflights13.csv", source = "csv", header = "true", inferSchema = "true")
grp1 <- groupBy(filter(df1, "month in (1, 2, 3)"), "month")
sum1 <- withColumnRenamed(agg(grp1, min_dep = min(df1$dep_delay)), "month", "month1")
grp2 <- groupBy(filter(df1, "month in (2, 3, 4)"), "month")
sum2 <- withColumnRenamed(agg(grp2, max_dep = max(df1$dep_delay)), "month", "month2")

# INNER JOIN
showDF(merge(sum1, sum2, by.x = "month1", by.y = "month2", all = FALSE))
showDF(join(sum1, sum2, sum1$month1 == sum2$month2, "inner"))
#+------+-------+------+-------+
#|month1|min_dep|month2|max_dep|
#+------+-------+------+-------+
#|     3|    -25|     3|    911|
#|     2|    -33|     2|    853|
#+------+-------+------+-------+

# LEFT JOIN
showDF(merge(sum1, sum2, by.x = "month1", by.y = "month2", all.x = TRUE))
showDF(join(sum1, sum2, sum1$month1 == sum2$month2, "left"))
#+------+-------+------+-------+
#|month1|min_dep|month2|max_dep|
#+------+-------+------+-------+
#|     1|    -30|  null|   null|
#|     3|    -25|     3|    911|
#|     2|    -33|     2|    853|
#+------+-------+------+-------+

# RIGHT JOIN
showDF(merge(sum1, sum2, by.x = "month1", by.y = "month2", all.y = TRUE))
showDF(join(sum1, sum2, sum1$month1 == sum2$month2, [...] [Read more...]

Monotonic Binning with Smbinning Package

January 22, 2017 | statcompute

The R package smbinning (http://www.scoringmodeling.com/rpackage/smbinning) provides a very user-friendly interface for the WoE (Weight of Evidence) binning algorithm employed in the scorecard development. However, there are several improvement opportunities in my view: 1. First of all, the underlying algorithm in the smbinning() function utilizes the recursive ... [Read more...]
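For reference, the WoE statistic that the binning targets is just the log ratio of the good and bad distributions per bin. A minimal Python sketch with made-up counts:

```python
import math

def woe(goods, bads):
    """Weight of Evidence per bin: ln(% of goods in bin / % of bads in bin).

    goods, bads: counts of non-events and events in each bin."""
    g_tot, b_tot = sum(goods), sum(bads)
    return [math.log((g / g_tot) / (b / b_tot)) for g, b in zip(goods, bads)]

# bins ordered by risk: the event rate falls, so WoE rises across bins
w = woe(goods=[100, 200, 300], bads=[60, 30, 10])
```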

Estimate Regression with (Type-I) Pareto Response

December 11, 2016 | statcompute

The Type-I Pareto distribution has the probability density function f(y; a, k) = k * (a ^ k) / (y ^ (k + 1)), where the scale parameter satisfies 0 < a ≤ y and the shape parameter k ≥ 1. The positive lower bound of the Type-I Pareto distribution is particularly […] [Read more...]
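Given the scale a, the shape k has a closed-form MLE, which makes a handy sanity check against any fitted regression. A minimal Python sketch (the sample values are made up):

```python
import math

def pareto_shape_mle(y, a):
    """MLE of the Type-I Pareto shape k given the scale a (all y >= a > 0).

    Maximizing sum log f = n*log(k) + n*k*log(a) - (k + 1) * sum log(y)
    over k gives k_hat = n / sum(log(y_i / a))."""
    return len(y) / sum(math.log(v / a) for v in y)

k_hat = pareto_shape_mle([1.2, 1.5, 2.0, 3.5], a=1.0)
```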

More about Flexible Frequency Models

November 27, 2016 | statcompute

Modeling the frequency is one of the most important aspects in operational risk models. In the previous post (https://statcompute.wordpress.com/2016/05/13/more-flexible-approaches-to-model-frequency), the importance of flexible modeling approaches for both under-dispersion and over-dispersion has been discussed. In addition to the quasi-poisson regression, three flexible frequency modeling techniques, including generalized ... [Read more...]
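One of the flexible alternatives, the generalized Poisson, handles both dispersion directions through a single extra parameter. A minimal Python sketch of Consul's pmf (the parameter values used in the example are illustrative):

```python
import math

def gen_poisson_pmf(y, theta, lam):
    """Consul's generalized Poisson pmf:
    P(Y = y) = theta * (theta + lam*y)^(y - 1) * exp(-theta - lam*y) / y!
    lam > 0 gives over-dispersion, lam < 0 under-dispersion, lam = 0 Poisson."""
    return (theta * (theta + lam * y) ** (y - 1)
            * math.exp(-theta - lam * y) / math.factorial(y))

# with lam = 0 the pmf collapses to the standard Poisson
p2 = gen_poisson_pmf(2, theta=1.5, lam=0.0)
```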

Fastest Way to Add New Variables to A Large Data.Frame

October 30, 2016 | statcompute

pkgs <- list("hflights", "doParallel", "foreach", "dplyr", "rbenchmark", "data.table")
lapply(pkgs, require, character.only = T)
data(hflights)
benchmark(replications = 10, order = "user.self", relative = "user.self",
  transform = {
    ### THE GENERIC FUNCTION MODIFYING THE DATA.FRAME, SIMILAR TO DATA.FRAME() ###
    transform(hflights, wday = ifelse(DayOfWeek %in% c(6, 7), 'weekend', 'weekday'), delay = ArrDelay + DepDelay)
  },
  within = {
    ### EVALUATE THE EXPRESSION WITHIN THE LOCAL ENVIRONMENT ###
    within(hflights, {wday = ifelse(DayOfWeek %in% c(6, 7), 'weekend', 'weekday'); delay = ArrDelay + DepDelay})
  },
  mutate = {
    ### THE SPECIFIC FUNCTION IN DPLYR PACKAGE TO ADD VARIABLES ###
    mutate(hflights, wday = ifelse(DayOfWeek %in% c(6, 7), 'weekend', 'weekday'), delay = ArrDelay + DepDelay)
  },
  foreach = {
    ### SPLIT AND THEN COMBINE IN PARALLEL ###
    registerDoParallel(cores = 2)
    v <- c(names(hflights), 'wday', 'delay')
    f <- expression(ifelse(hflights$DayOfWeek %in% c(6, 7), 'weekend', 'weekday'), hflights$ArrDelay + hflights$DepDelay)
    df <- foreach(fn = iter(f), .combine = mutate, .init = hflights) %dopar% { [...] [Read more...]

Risk Models with Generalized PLS

June 12, 2016 | statcompute

While developing risk models with hundreds of potential variables, we often run into the situation that risk characteristics or macro-economic indicators are highly correlated, namely multicollinearity. In such cases, we might have to drop variables with high VIFs or employ “variable shrinkage” methods, e.g. lasso or ridge, to suppress ... [Read more...]
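As a quick diagnostic for the multicollinearity mentioned above, the VIF in the simplest two-predictor case depends only on the pairwise correlation. A minimal Python sketch:

```python
def vif_two_predictors(r):
    """Variance inflation factor when a predictor's R^2 against the other
    predictors reduces to a single pairwise correlation r: VIF = 1 / (1 - r^2).
    Values above roughly 5 to 10 commonly flag multicollinearity."""
    return 1.0 / (1.0 - r * r)

v_low = vif_two_predictors(0.0)   # uncorrelated: no inflation
v_high = vif_two_predictors(0.9)  # highly correlated: strongly inflated
```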

More Flexible Approaches to Model Frequency

May 12, 2016 | statcompute

(The post below is motivated by my friend Matt Flynn https://www.linkedin.com/in/matthew-flynn-1b443b11) In the context of operational loss forecast models, the standard Poisson regression is the most popular way to model frequency measures. Conceptually speaking, there is a restrictive assumption for the standard Poisson ... [Read more...]
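The restrictive assumption is equidispersion: a Poisson outcome has variance equal to its mean. A minimal Python check of the sample variance-to-mean ratio on a made-up count sample:

```python
def dispersion_ratio(counts):
    """Sample variance-to-mean ratio; Poisson equidispersion implies a value
    near 1. Ratios well above 1 suggest over-dispersion, below 1 suggest
    under-dispersion, motivating the flexible alternatives."""
    n = len(counts)
    mean = sum(counts) / n
    var = sum((c - mean) ** 2 for c in counts) / (n - 1)
    return var / mean

r = dispersion_ratio([0, 1, 1, 2, 8, 0, 0, 5])  # clearly over-dispersed
```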
