Blog Archives

Simple Distributions for Mixtures?

February 3, 2016

The idea of GLMs is that, given some covariates $\boldsymbol{X}$, $Y|\boldsymbol{X}$ has a distribution in the exponential family (Gaussian, Poisson, Gamma, etc.). But that does not mean that $Y$ has a similar distribution… so there is no reason to test for a Gamma model for $Y$ before running a Gamma regression, for instance. But are there cases where it might work? That the non-conditional distribution is...
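
A minimal sketch (the simulation is mine, not the post's): even when $Y|\boldsymbol{X}$ is Gamma, the marginal distribution of $Y$ is a continuous mixture of Gammas, which a QQ-plot against a single fitted Gamma makes visible,

# simulate a Gamma regression with a log link
set.seed(1)
n <- 1e4
x <- runif(n)
mu <- exp(1 + 2 * x)                        # conditional mean
Y <- rgamma(n, shape = 2, rate = 2 / mu)    # Y|X=x ~ Gamma, with E[Y|X=x] = mu
# fit one Gamma to the marginal of Y, and compare quantiles
library(MASS)
fit <- fitdistr(Y, "gamma")
qqplot(qgamma(ppoints(n), fit$estimate["shape"], fit$estimate["rate"]), Y,
       xlab = "fitted Gamma quantiles", ylab = "empirical quantiles")
abline(0, 1, col = "red")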

Read more »

Confidence Regions for Parameters in the Simplex

January 18, 2016

Consider here the case where, in some parametric inference problem, the parameter $\theta$ is a point in the simplex. For instance, consider some regression on compositional data,

> library(compositions)
> data(DiagnosticProb)
> Y=DiagnosticProb-1
> X=DiagnosticProb
> model = glm(Y~ilr(X),family=binomial)
> b = ilrInv(coef(model),orig=X)
> as.numeric(b)
0.3447106 0.2374977 0.4177917

We can visualize that estimator on the simplex, using

> tripoint=function(s){
+ p=s/sum(s)...
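
A hedged sketch (the post's tripoint function is truncated above; the helper below is my own, hypothetical equivalent): a composition can be placed in the simplex by converting its barycentric coordinates to Cartesian ones,

tri_coord <- function(s) {
  p <- s / sum(s)                 # normalize onto the simplex
  c(x = p[2] + p[3] / 2,          # barycentric -> Cartesian coordinates
    y = p[3] * sqrt(3) / 2)
}
plot(NA, xlim = c(0, 1), ylim = c(0, sqrt(3) / 2), asp = 1,
     xlab = "", ylab = "", axes = FALSE)
polygon(c(0, 1, .5), c(0, 0, sqrt(3) / 2))     # the simplex (a triangle)
b <- c(0.3447106, 0.2374977, 0.4177917)        # the estimator obtained above
xy <- tri_coord(b)
points(xy["x"], xy["y"], pch = 19, col = "red")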

Read more »

Regression with Splines: Should we care about Non-Significant Components?

January 4, 2016

Following this morning's course, I got a very interesting question from a student of mine. The question was about having non-significant components in a spline regression. Should we consider a model with a small number of knots and all components significant, or one with a (much) larger number of knots, and a lot of them non-significant? My initial intuition was to...
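
A minimal sketch (the simulation is mine, not the student's setting): with many knots, individual components of the spline basis easily come out non-significant, even when the overall fit is fine,

library(splines)
set.seed(1)
x <- runif(200)
y <- sin(2 * pi * x) + rnorm(200, sd = .3)
fit_small <- lm(y ~ bs(x, df = 4))     # few knots
fit_large <- lm(y ~ bs(x, df = 15))    # many knots, many non-significant components
round(summary(fit_large)$coefficients[, 4], 3)   # p-values of individual components
AIC(fit_small, fit_large)              # compare the two fits globally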

Read more »

How Could Classification Trees Be So Fast on Categorical Variables?

December 8, 2015

I think that over the past months, I have been saying incorrect things about classification with categorical covariates, because I never took the time to look at it carefully. Consider some simulated dataset, with a logistic regression,

> n=1e3
> set.seed(1)
> X1=runif(n)
> q=quantile(X1,(0:26)/26)
> q[1]=0
> X2=cut(X1,q,labels=LETTERS)
> p=exp(-.1+qnorm(2*(abs(.5-X1))))/(1+exp(-.1+qnorm(2*(abs(.5-X1)))))
> Y=rbinom(n,size=1,p)
> df=data.frame(X1=X1,X2=X2,p=p,Y=Y)

Here, we use some continuous covariate, except...
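
A hedged sketch (the timing loop is mine, not the post's): fitting a tree on the 26-level factor is not noticeably slower than on the continuous covariate, even though a naive search over groupings of $K$ levels has $2^{K-1}-1$ candidates; for binary outcomes, CART sorts the levels by mean response, which reduces the search to $K-1$ ordered splits,

library(rpart)
# time 100 fits on the continuous covariate, then on the 26-level factor
system.time(for (i in 1:100) rpart(Y ~ X1, data = df, method = "class"))
system.time(for (i in 1:100) rpart(Y ~ X2, data = df, method = "class"))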

Read more »

Inter-relationships in a matrix

December 1, 2015

Last week, I wanted to display inter-relationships between data in a matrix. My friend Fleur, from AXA, mentioned an interesting possible application, in car accidents. In car-against-car accidents, it might be interesting to see which parts of the cars were involved. On https://www.data.gouv.fr/fr/, we can find such a dataset, with a lot of information on car accidents involving bodily injuries...
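
A minimal sketch (the matrix values below are made up, not the data.gouv.fr data): a chord diagram is one way to display such inter-relationships in a matrix,

library(circlize)
M <- matrix(c(120, 40, 15,
               40, 80, 25,
               15, 25, 60),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("front", "side", "rear"),
                            c("front", "side", "rear")))
chordDiagram(M)   # one chord per pair of impact zones, width ~ count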

Read more »

Additional thoughts about ‘Lorenz curves’ to compare models

November 28, 2015

A few months ago, I mentioned a graph of some so-called Lorenz curves to compare regression models, see e.g. Progressive's slides (thanks Guillaume for the reference). The idea is simple. Consider some model for the pure premium (in insurance, it is the quantity that we like to model), i.e. the conditional expected value $\mathbb{E}[Y|\boldsymbol{X}]$. On some dataset, we have...
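
A hedged sketch (function and variable names are mine): the curve is built by sorting policies by predicted pure premium and plotting the cumulative share of observed losses; the more the curve bends away from the diagonal, the better the model separates low- and high-risk policies,

lorenz_model <- function(pred, loss) {
  o <- order(pred)                             # sort policies by predicted premium
  list(x = seq_along(loss) / length(loss),     # cumulative share of policies
       y = cumsum(loss[o]) / sum(loss))        # cumulative share of total losses
}
# usage, with two hypothetical fitted models on a dataset 'base':
# l1 <- lorenz_model(predict(model1, type = "response"), base$loss)
# l2 <- lorenz_model(predict(model2, type = "response"), base$loss)
# plot(l1$x, l1$y, type = "l"); lines(l2$x, l2$y, col = "red"); abline(0, 1, lty = 2)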

Read more »

Profile Likelihood

November 16, 2015

Consider some simulated data

> set.seed(1)
> x=exp(rnorm(100))

Assume that those data are observed i.i.d. random variables with a Gamma distribution $\mathcal{G}(\alpha,\beta)$, with unknown parameters $\alpha,\beta>0$. The natural idea is to consider the maximum likelihood estimator $(\hat{\alpha},\hat{\beta})$. For instance,

> library(MASS)
> (F=fitdistr(x,"gamma"))
     shape       rate
  1.4214497   0.8619969
 (0.1822570) (0.1320717)
> F$estimate[1]+c(-1,1)*1.96*F$sd[1]
1.064226 1.778673

Here, we have an approximated (since the...
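
A minimal sketch (my own code, reusing x from above): the profile log-likelihood in the shape parameter, maximizing over the rate analytically (for a fixed shape $\alpha$, the rate MLE is $\hat{\beta}=\alpha/\bar{x}$),

prof_loglik <- function(shape, x) {
  rate <- shape / mean(x)          # MLE of the rate, for a fixed shape
  sum(dgamma(x, shape = shape, rate = rate, log = TRUE))
}
shapes <- seq(.8, 2.5, by = .01)
pl <- sapply(shapes, prof_loglik, x = x)
plot(shapes, pl, type = "l", xlab = "shape", ylab = "profile log-likelihood")
abline(h = max(pl) - qchisq(.95, 1) / 2, lty = 2)   # cut-off for a 95% interval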

Read more »

Variable Importance with Correlated Features

November 6, 2015

Variable importance graphs are a great tool to see, in a model, which variables are interesting. Since we usually use them with random forests, it looks like they work well with (very) large datasets. The problem with large datasets is that a lot of features are ‘correlated’, and in that case, the interpretation of the values in variable importance plots can...
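
A hedged sketch (the simulation is mine): when two features are strongly correlated, the importance tends to be split between them,

library(randomForest)
set.seed(1)
n  <- 500
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = .1)      # x2 is almost a copy of x1
x3 <- rnorm(n)
y  <- x1 + x3 + rnorm(n)          # only x1 and x3 matter
rf <- randomForest(y ~ x1 + x2 + x3, importance = TRUE)
importance(rf)                    # x1 and x2 share the 'credit'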

Read more »

Applications of Chi-Square Tests

November 3, 2015

This morning, in our mathematical statistics class, we’ve seen the use of the chi-square test. The first one was related to some goodness of fit of a multinomial distribution. Assume that $\boldsymbol{N}=(N_1,\dots,N_k)\sim\mathcal{M}(n,\boldsymbol{p})$. In order to test $H_0:\boldsymbol{p}=\boldsymbol{p}_0$ against $H_1:\boldsymbol{p}\neq\boldsymbol{p}_0$, use the statistic $Q=\sum_{j=1}^k\frac{(N_j-np_{0,j})^2}{np_{0,j}}$. Under $H_0$, $Q\overset{\mathcal{L}}{\to}\chi^2(k-1)$. For instance, we have the number of weddings, in a large city, per season, > n=c(301,356,413,262) We want to test...
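
A minimal sketch of that test in R, on the wedding counts, against the uniform distribution over the four seasons,

n <- c(301, 356, 413, 262)
chisq.test(n, p = rep(1/4, 4))            # built-in goodness-of-fit test
# or by hand, using the statistic above:
Q <- sum((n - sum(n)/4)^2 / (sum(n)/4))
1 - pchisq(Q, df = 3)                     # asymptotic p-value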

Read more »

Statistical Tests: Asymptotic, Exact, or based on Simulations?

October 20, 2015

This morning, in our mathematical statistics course, we’ve been discussing the ‘proportion test‘, i.e. given a sample of Bernoulli trials $X_1,\dots,X_n$ i.i.d., with $X_i\sim\mathcal{B}(p)$, we want to test $H_0:p=p_0$ against $H_1:p\neq p_0$. A natural test (which can be related to the maximum likelihood ratio test) is based on the statistic $T_n=\sqrt{n}\,\frac{\bar{X}_n-p_0}{\sqrt{p_0(1-p_0)}}$. The test function is here $\varphi(T_n)=\mathbf{1}(|T_n|>c)$. To get the bounds of the acceptance region, we need the...
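
A hedged sketch (the code and the sample are mine) of the three flavours of that test,

set.seed(1)
x  <- rbinom(100, size = 1, prob = .42)   # a hypothetical sample of Bernoulli trials
p0 <- .5
prop.test(sum(x), length(x), p = p0, correct = FALSE)   # asymptotic (normal approx.)
binom.test(sum(x), length(x), p = p0)                   # exact (Binomial distribution)
# simulations: Monte Carlo distribution of the statistic under H0
T_obs <- sqrt(100) * (mean(x) - p0) / sqrt(p0 * (1 - p0))
T_sim <- replicate(1e4, {
  xs <- rbinom(100, 1, p0)
  sqrt(100) * (mean(xs) - p0) / sqrt(p0 * (1 - p0))
})
mean(abs(T_sim) >= abs(T_obs))            # simulated two-sided p-value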

Read more »
