Blog Archives

The Basics of Encoding Categorical Data for Predictive Models

October 23, 2013

Thomas Yokota asked a very straightforward question about encodings for categorical predictors: "Is it bad to feed it non-numerical data such as factors?" As usual, I will try to make my answer as complex as possible. (I've heard the old wives' tale that Eskimos have 180 different words in their language for snow. I'm starting to think that statisticians have...
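To make the question concrete, here is a minimal sketch (with a hypothetical data frame, not from the post) of how base R expands a factor into binary dummy variables:

```r
## A made-up data frame with one numeric and one categorical predictor.
dat <- data.frame(income = c(50, 62, 45, 80),
                  region = factor(c("north", "south", "east", "north")))

## model.matrix() expands the factor into 0/1 indicator columns,
## dropping one level as the reference category.
model.matrix(income ~ region, data = dat)
```

For a full set of indicators (no reference level dropped), caret's dummyVars() function does the same expansion.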

Read more »

Equivocal Zones

August 16, 2013

In Chapter 11, equivocal zones were briefly discussed. The idea is that some classification errors are close to the probability boundary (i.e. 50% for two-class outcomes). If this is the case, we can create a zone where the samples are predicted as "equivocal" or "indeterminate" instead of one of the class levels. This only works if the...
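A minimal sketch of the idea, with made-up class probabilities and an arbitrary zone half-width of 10%:

```r
## Hypothetical class probabilities for a two-class model and an
## arbitrary half-width for the equivocal zone.
prob_class1 <- c(0.95, 0.53, 0.48, 0.10, 0.61)
zone <- 0.10

## Start with the usual 50% cutoff, then relabel anything falling
## within 0.5 +/- zone as equivocal.
pred <- ifelse(prob_class1 >= 0.5, "Class1", "Class2")
pred[abs(prob_class1 - 0.5) <= zone] <- "Equivocal"
pred
```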

Read more »

UseR! Slides for “Classification Using C5.0”

July 17, 2013

I've had a lot of requests, so here they are.  Hopefully, all of the slides will be posted on the conference website.

Read more »

UseR! 2013 Highlights

July 13, 2013

The conference was excellent this year. My highlights: Bojan Mihaljevic gave a great presentation on machine learning models built from network models. Their package isn't on CRAN yet, but I'm really looking forward to it. Jim Harner's presentation ...

Read more »

Measuring Associations

June 20, 2013

In Chapter 18, we discuss a relatively new method for measuring predictor importance called the maximal information coefficient (MIC). The original paper is by Reshef et al. (2011). Summaries of the initial reactions to the MIC are by Speed and Tibshirani (and others can be found here). My (minor) beef with it is the lack...
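For anyone who wants to try it, here is a sketch using the minerva package (one CRAN implementation of the MIC; this is not the code from the chapter) on simulated non-linear data:

```r
library(minerva)

## Simulated data with a strong but non-monotonic association.
set.seed(1)
x <- runif(200)
y <- sin(4 * pi * x) + rnorm(200, sd = 0.2)

mine(x, y)$MIC   # the MIC picks up the non-linear association
cor(x, y)        # Pearson's correlation largely misses it
```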

Read more »

type = “what?”

June 13, 2013

One great thing about R is that it has a wide diversity of packages written by many different people with many different viewpoints on how software should be designed. However, this does tend to bite us periodically. When I teach newcomers about R and...
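A runnable sketch of the kind of inconsistency at issue (using a two-class subset of iris as stand-in data): three modeling functions, three different ways to ask predict() for class probabilities.

```r
library(rpart)
library(MASS)

## A two-class data set for illustration.
two_class <- subset(iris, Species != "setosa")
two_class$Species <- factor(two_class$Species)

## stats::glm wants type = "response"
glm_fit <- glm(Species ~ Sepal.Length, data = two_class, family = binomial)
head(predict(glm_fit, two_class, type = "response"))

## rpart wants type = "prob"
rpart_fit <- rpart(Species ~ Sepal.Length, data = two_class)
head(predict(rpart_fit, two_class, type = "prob"))

## MASS::lda has no type argument at all; probabilities live in $posterior
lda_fit <- lda(Species ~ Sepal.Length, data = two_class)
head(predict(lda_fit, two_class)$posterior)
```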

Read more »

Feature Selection 3 – Swarm Mentality

June 6, 2013

"Bees don't swarm in a mango grove for nothing. Where can you see a wisp of smoke without a fire?" - Hla Stavhana In the last two posts, genetic algorithms were used as feature wrappers to search for more effective subsets of predictors. Here, I will do the same with another type of search algorithm: particle swarm optimization....

Read more »

Recent Changes to caret

May 18, 2013

Here is a summary of some recent changes to caret. Feature updates: train was updated to utilize recent changes in the gbm package that allow for boosting with three or more classes (via the multinomial distribution). The Yeo-Johnson power transformation was added. This is very similar to the Box-Cox transformation, but it does not require the data to be...
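The new transformation is available through preProcess(); a minimal sketch on simulated data containing negative values (which Box-Cox cannot handle):

```r
library(caret)

## A skewed column with negative values, made up for illustration.
set.seed(3)
dat <- data.frame(skewed = rnorm(100)^2 - 0.5)

## Estimate the Yeo-Johnson transformation, then apply it.
yj <- preProcess(dat, method = "YeoJohnson")
head(predict(yj, dat))
```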

Read more »

Projection Pursuit Classification Trees

May 14, 2013

I've been looking at this article for a new tree-based method. It uses other classification methods (e.g. LDA) to find a single variable to use in the split and builds a tree in that manner. The subtleties of the model are: The model does not prune but ...
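A toy illustration of the flavor of the approach, not the paper's algorithm: fit LDA at a node, then take the predictor with the largest absolute discriminant coefficient as the splitting variable.

```r
library(MASS)

## A two-class data set for illustration.
two_class <- subset(iris, Species != "setosa")
two_class$Species <- factor(two_class$Species)

lda_fit <- lda(Species ~ ., data = two_class)
coefs <- lda_fit$scaling[, 1]
## The iris predictors share a common scale; in general they would be
## standardized before comparing coefficient magnitudes.
names(which.max(abs(coefs)))
```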

Read more »

Feature Selection 2 – Genetic Boogaloo

May 8, 2013

Previously, I talked about genetic algorithms (GA) for feature selection and illustrated the algorithm using a modified version of the GA R package and simulated data. The data were simulated with 200 non-informative predictors, 12 linear effects, and three non-linear effects. Quadratic discriminant analysis (QDA) was used to model the data. The last set of...
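A condensed sketch of the wrapper using the unmodified GA package and much simpler simulated data than the post's: a binary chromosome selects predictors, and the fitness to maximize is the QDA accuracy (resubstitution here; the posts use resampled estimates).

```r
library(GA)
library(MASS)

## Simulated data: two informative predictors out of ten.
set.seed(4)
n <- 200
x <- as.data.frame(matrix(rnorm(n * 10), ncol = 10))
y <- factor(ifelse(x$V1 - x$V2 + rnorm(n) > 0, "a", "b"))

fitness <- function(bits) {
  if (sum(bits) == 0) return(0)   # an empty subset scores zero
  keep <- which(bits == 1)
  fit <- qda(x[, keep, drop = FALSE], y)
  mean(predict(fit, x[, keep, drop = FALSE])$class == y)
}

ga_fit <- ga(type = "binary", fitness = fitness, nBits = 10,
             popSize = 20, maxiter = 25)
which(ga_fit@solution[1, ] == 1)   # selected predictor indices
```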

Read more »