Blog Archives

Reverse Engineering with Correlated Features

February 11, 2016

In econometric modeling, I usually have a problem with correlated features. A few weeks ago, I was discussing feature selection when features are correlated. This week, I was wondering about reverse engineering when features might be correlated (not to say highly correlated). The way I see reverse engineering is the following: someone has some dataset, and based on that dataset, a...
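
The excerpt is truncated above; as a minimal sketch of the setting it describes (a dataset with correlated covariates and a model fitted on it), one could simulate something like the code below. The variable names and the 0.8 correlation are illustrative choices of mine, not taken from the original post.

set.seed(1)                              # minimal sketch, not the original post's code
n  <- 1000
X1 <- rnorm(n)
X2 <- .8*X1 + sqrt(1-.8^2)*rnorm(n)      # X2 built to be strongly correlated with X1
Y  <- 1 + 2*X1 - X2 + rnorm(n)
reg <- lm(Y ~ X1 + X2)
summary(reg)$coefficients                # standard errors inflate as cor(X1,X2) grows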

Read more »

Clustering French Cities (based on Temperatures)

February 11, 2016

In order to illustrate hierarchical clustering techniques and k-means, I borrowed François Husson's dataset, with monthly average temperatures in several French cities. > temp=read.table( + "http://freakonometrics.free.fr/FR_temp.txt", + header=TRUE,dec=",") We have 15 cities, with monthly observations, > X=temp > boxplot(X) Since the variance seems to be rather stable, we will not 'normalize' the variables here, > apply(X,2,sd) Janv Fevr Mars...
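
The excerpt stops before the clustering itself; a minimal sketch of the two techniques mentioned (hierarchical clustering and k-means) on that temperature dataset could look like the code below. The Euclidean distance, the default linkage and the choice of three clusters are my assumptions, not necessarily the post's.

temp <- read.table("http://freakonometrics.free.fr/FR_temp.txt",
                   header=TRUE, dec=",")
X <- temp                                 # assuming the file contains only the monthly temperature columns
cah <- hclust(dist(X))                    # hierarchical clustering on Euclidean distances between cities
plot(cah, cex=.6)
groups_hc <- cutree(cah, k=3)             # cut the dendrogram into three groups
set.seed(1)
groups_km <- kmeans(X, centers=3, nstart=25)$cluster
table(groups_hc, groups_km)               # compare the two partitions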

Read more »

Clusters of Texts

February 10, 2016

Another popular application of classification techniques is text mining (see e.g. an old post on French presidents' speeches). Consider the following example, inspired by Norbert Ryciak's post, with 12 Wikipedia pages on various topics, > library(tm) > library(stringi) > library(proxy) > titles = c("Boosting_(machine_learning)", + "Random_forest", + "K-nearest_neighbors_algorithm", + "Logistic_regression", + "Boston_Bruins", + "Los_Angeles_Lakers", + "Game_of_Thrones", + "House_of_Cards_(U.S._TV_series)", + "True Detective...
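
The code in the excerpt is cut off; a hedged sketch of the general pipeline the post describes (build a document-term matrix with tm, then cluster documents with a cosine distance from proxy) is given below, with a few toy documents standing in for the downloaded Wikipedia pages.

library(tm)
library(proxy)
# toy documents standing in for the 12 Wikipedia pages
docs <- c("boosting random forest nearest neighbors machine learning",
          "logistic regression classification machine learning model",
          "Boston Bruins hockey team season playoff game",
          "television drama series season episode characters")
corpus <- VCorpus(VectorSource(docs))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("english"))
dtm <- as.matrix(DocumentTermMatrix(corpus))
d <- proxy::dist(dtm, method="cosine")    # cosine distance between documents
plot(hclust(d, method="ward.D2"))         # documents on similar topics should cluster together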

Read more »

Clusters of (French) Regions

February 9, 2016

For tomorrow's data science course, I just wanted to post some functions to illustrate cluster analysis. Consider the dataset of the French 2012 elections > elections2012=read.table( "http://freakonometrics.free.fr/elections_2012_T1.csv",sep=";",dec=",",header=TRUE) > voix=which(substr(names( + elections2012),1,11)=="X..Voix.Exp") > elections2012=elections2012[,voix] > X=as.matrix(elections2012) > colnames(X)=c("JOLY","LE PEN","SARKOZY","MÉLENCHON","POUTOU","ARTHAUD","CHEMINADE","BAYROU","DUPONT-AIGNAN","HOLLANDE") > rownames(X)=elections2012 The hierarchical cluster analysis is obtained using > cah=hclust(dist(X)) > plot(cah,cex=.6) To get five groups, we have...
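
The excerpt stops at the five-group step; the natural continuation (a sketch, since the original code is not shown here) is to cut the dendrogram with cutree:

groups <- cutree(cah, k=5)                # assign each observation to one of five clusters
table(groups)
plot(cah, cex=.6)
rect.hclust(cah, k=5, border="red")       # outline the five clusters on the dendrogram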

Read more »

Simple Distributions for Mixtures?

February 3, 2016

The idea of GLMs is that, given some covariates X, the response Y (conditionally on X) has a distribution in the exponential family (Gaussian, Poisson, Gamma, etc.). But that does not mean that Y (unconditionally) has a similar distribution… so there is no reason to test for a Gamma model for Y before running a Gamma regression, for instance. But are there cases where it might work? That the non-conditional distribution is...
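
A quick simulation (mine, not from the post) makes the point concrete: with a Poisson regression, Y given the covariate is Poisson, but the unconditional distribution of Y is a mixture of Poissons, and is visibly overdispersed.

set.seed(1)
n <- 1e5
X <- rnorm(n)
Y <- rpois(n, lambda = exp(1 + X))        # the conditional distribution of Y is Poisson
c(mean = mean(Y), variance = var(Y))      # variance far exceeds the mean: the marginal Y is not Poisson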

Read more »

Confidence Regions for Parameters in the Simplex

January 18, 2016

Consider here the case where, in some parametric inference problem, the parameter is a point in the simplex. For instance, consider some regression on compositional data, > library(compositions) > data(DiagnosticProb) > Y=DiagnosticProb-1 > X=DiagnosticProb > model = glm(Y~ilr(X),family=binomial) > b = ilrInv(coef(model),orig=X) > as.numeric(b) 0.3447106 0.2374977 0.4177917 We can visualize that estimator on the simplex, using > tripoint=function(s){ + p=s/sum(s)...
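
Since the subsetting in the excerpt's code is garbled, here is a self-contained sketch of the key tool (the isometric log-ratio transform and its inverse, from the compositions package) applied to a toy three-part composition; the numbers are illustrative only.

library(compositions)
b <- acomp(c(0.34, 0.24, 0.42))           # a toy point in the simplex (three parts summing to one)
z <- ilr(b)                               # isometric log-ratio: maps the simplex to R^2
ilrInv(z)                                 # the back-transformation recovers the composition

The general idea is then to build a confidence region on the ilr scale and map it back to the simplex with ilrInv.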

Read more »

Regression with Splines: Should we care about Non-Significant Components?

January 4, 2016

Following the course of this morning, I got a very interesting question from a student of mine. The question was about having non-significant components in a spline regression. Should we consider a model with a small number of knots and all components significant, or one with a (much) larger number of knots and a lot of non-significant components? My initial intuition was to...
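
A minimal sketch of the question (my own simulated example, using splines::bs in a linear model rather than the exact setting of the course):

library(splines)
set.seed(1)
n <- 500
x <- runif(n)
y <- sin(6*x) + rnorm(n, sd=.3)
summary(lm(y ~ bs(x, df=4)))$coefficients    # few knots: the spline components tend to be significant
summary(lm(y ~ bs(x, df=15)))$coefficients   # many knots: several components are typically non-significant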

Read more »

How Could Classification Trees Be So Fast on Categorical Variables?

December 8, 2015

I think that over the past months, I have been saying incorrect things about classification with categorical covariates, because I never took the time to look at it carefully. Consider some simulated dataset, with a logistic regression, > n=1e3 > set.seed(1) > X1=runif(n) > q=quantile(X1,(0:26)/26) > q[1]=0 > X2=cut(X1,q,labels=LETTERS) > p=exp(-.1+qnorm(2*(abs(.5-X1))))/(1+exp(-.1+qnorm(2*(abs(.5-X1))))) > Y=rbinom(n,size=1,p) > df=data.frame(X1=X1,X2=X2,p=p,Y=Y) Here, we use some continuous covariate, except...
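
A hedged sketch of how one might then compare a tree grown on the continuous covariate with one grown on its 26-level recoding, e.g. with rpart (the follow-up code is not shown in the excerpt, and include.lowest=TRUE is used here instead of adjusting the first break):

library(rpart)
set.seed(1)
n  <- 1e3
X1 <- runif(n)
X2 <- cut(X1, quantile(X1, (0:26)/26), labels=LETTERS, include.lowest=TRUE)
p  <- exp(-.1 + qnorm(2*abs(.5 - X1))) / (1 + exp(-.1 + qnorm(2*abs(.5 - X1))))
Y  <- rbinom(n, size=1, prob=p)
df <- data.frame(X1, X2, Y)
tree1 <- rpart(factor(Y) ~ X1, data=df)   # splits on the continuous covariate
tree2 <- rpart(factor(Y) ~ X2, data=df)   # same information, but as 26 unordered categories

The question in the title is why growing the second tree is not prohibitively slow, given the huge number of possible binary splits of a 26-level factor.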

Read more »

Inter-relationships in a matrix

December 1, 2015

Last week, I wanted to display inter-relationships between data in a matrix. My friend Fleur, from AXA, mentioned an interesting possible application to car accidents. In car-against-car accidents, it might be interesting to see which parts of the cars were involved. On https://www.data.gouv.fr/fr/, we can find such a dataset, with a lot of information on car accidents involving bodily injuries...
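
As a toy, hypothetical illustration of the kind of matrix involved (not the data.gouv.fr file itself, and not necessarily the visualization used in the post), one can cross-tabulate which part of each vehicle was hit and display the table:

set.seed(1)
part1 <- sample(c("front","rear","left","right"), 200, replace=TRUE)   # hypothetical data
part2 <- sample(c("front","rear","left","right"), 200, replace=TRUE)
M <- table(part1, part2)                  # inter-relationship matrix: part of car 1 vs part of car 2
M
mosaicplot(M, main="Parts involved, car 1 vs car 2")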

Read more »

Additional thoughts about ‘Lorenz curves’ to compare models

November 28, 2015

A few months ago, I mentioned a graph of some so-called Lorenz curves used to compare regression models, see e.g. Progressive's slides (thanks Guillaume for the reference). The idea is simple. Consider some model for the pure premium (in insurance, it is the quantity that we like to model), i.e. the conditional expected value of the loss. On some dataset, we have...
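
A rough sketch of how such a curve can be drawn (with simulated, hypothetical predictions and losses; the exact convention in the post and in Progressive's slides may differ):

set.seed(1)
n    <- 1e4
pred <- rgamma(n, shape=2, scale=100)                   # hypothetical predicted pure premiums
obs  <- .5*pred + .5*rgamma(n, shape=2, scale=100)      # hypothetical observed losses, correlated with pred
o    <- order(pred, decreasing=TRUE)                    # sort policies from highest to lowest prediction
plot((1:n)/n, cumsum(obs[o])/sum(obs), type="l",
     xlab="share of policies (sorted by predicted premium)",
     ylab="cumulative share of losses")
abline(0, 1, lty=2)                                     # diagonal reached by an uninformative model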

Read more »
