Blog Archives

Bay Area RUG Talk on 3/17

March 9, 2014

I'm making my yearly pilgrimage to San Francisco to teach at PAW. I'll also be giving a short talk at the Bay Area R Users Group on model tags in the caret package and the code that produced this interactive plot. It is at 7:00 PM on Monday, March 17...

Read more »

caret webinar materials

February 28, 2014

The webinar was recorded (thanks to Ray DiGiacomo and the Orange County RUG). The slides are here minus a few typos. 

Read more »

Optimizing Probability Thresholds for Class Imbalances

February 6, 2014

One of the toughest problems in predictive modeling occurs when the classes have a severe imbalance. We spend an entire chapter on this subject. One consequence of the imbalance is that performance is generally very biased against the class with the smallest frequency. For example, if the data have a majority of samples belonging to the first...

Read more »

caret webinar on Feb 25

February 2, 2014

I'll be doing a webinar with the Orange County R User Group on the caret package on Tue, Feb 25, 2014 1:00 PM - 2:00 PM EST. Here is the URL in case you are interested: https://www3.gotomeeting.com/register/673845982 Thanks to Ray DiGiacom...

Read more »

Calibration Affirmation

January 4, 2014

In the book, we discuss the notion of a probability model being "well calibrated". There are many different mathematical techniques that classification models use to produce class probabilities. Some of these values are "probability-like" in that they are between zero and one and sum to one. This doesn't necessarily mean that the probability estimates are consistent with the true event...

Read more »

Down-Sampling Using Random Forests

December 8, 2013

We discuss dealing with large class imbalances in Chapter 16. One approach is to sample the training set to coerce a more balanced class distribution. We discuss:

down-sampling: sample the majority class to make their frequencies closer to the rarest class.
up-sampling: the minority class is resampled to increase the corresponding frequencies.
hybrid approaches: some methodologies do a little of both and...
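
As a rough illustration of the down-sampling and up-sampling ideas in the excerpt above, here is a minimal Python sketch using only the standard library. It is not the code from the post (which works with random forests in R), and the function names `down_sample` and `up_sample`, the `label_of` argument, and the fixed seed are assumptions made for this example.

```python
import random
from collections import defaultdict


def _group_by_class(rows, label_of):
    """Collect rows into lists keyed by their class label."""
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)
    return by_class


def down_sample(rows, label_of, seed=0):
    """Randomly drop rows so every class is as frequent as the rarest class."""
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatability
    by_class = _group_by_class(rows, label_of)
    n_min = min(len(group) for group in by_class.values())
    sampled = []
    for group in by_class.values():
        # Sample without replacement down to the size of the rarest class.
        sampled.extend(rng.sample(group, n_min))
    rng.shuffle(sampled)
    return sampled


def up_sample(rows, label_of, seed=0):
    """Resample smaller classes with replacement up to the largest class size."""
    rng = random.Random(seed)
    by_class = _group_by_class(rows, label_of)
    n_max = max(len(group) for group in by_class.values())
    sampled = []
    for group in by_class.values():
        # Keep every original row and add random duplicates to fill the gap.
        extra = rng.choices(group, k=n_max - len(group))
        sampled.extend(group + extra)
    rng.shuffle(sampled)
    return sampled


# Toy data: (row id, class label) pairs with a 90/10 imbalance.
toy = [(i, "majority") for i in range(90)] + [(i, "minority") for i in range(90, 100)]
balanced_down = down_sample(toy, label_of=lambda row: row[1])
balanced_up = up_sample(toy, label_of=lambda row: row[1])
```

In this toy case, down-sampling keeps 10 rows per class and up-sampling grows the minority class to 90 rows. Either kind of sampling should be applied to the training set only, so that held-out data still reflect the original class frequencies.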

Read more »

The Basics of Encoding Categorical Data for Predictive Models

October 23, 2013

Thomas Yokota asked a very straightforward question about encodings for categorical predictors: "Is it bad to feed it non-numerical data such as factors?" As usual, I will try to make my answer as complex as possible. (I've heard the old wives' tale that Eskimos have 180 different words in their language for snow. I'm starting to think that statisticians have...

Read more »

Equivocal Zones

August 16, 2013

In Chapter 11, equivocal zones were briefly discussed. The idea is that some classification errors are close to the probability boundary (i.e., 50% for two-class outcomes). If this is the case, we can create a zone where the samples are predicted as "equivocal" or "indeterminate" instead of one of the class levels. This only works if the...
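
The equivocal-zone idea described above might be sketched in Python as follows. This is only an illustration under stated assumptions: the function names, the class labels, and the zone half-width of 0.1 around the 50% boundary are all hypothetical choices, not values taken from the post or the book.

```python
def predict_with_equivocal_zone(prob_event, threshold=0.5, half_width=0.1,
                                labels=("event", "nonevent")):
    """Return a class label, or "equivocal" when the probability is near the boundary.

    prob_event: predicted probability of the first class in `labels`.
    half_width: how far on either side of `threshold` the zone extends
                (0.1 is an arbitrary value chosen for this sketch).
    """
    if abs(prob_event - threshold) < half_width:
        return "equivocal"
    return labels[0] if prob_event >= threshold else labels[1]


def summarize(probs, truths, **zone_args):
    """Report the equivocal rate and the accuracy on the remaining samples."""
    preds = [predict_with_equivocal_zone(p, **zone_args) for p in probs]
    decided = [(pred, truth) for pred, truth in zip(preds, truths)
               if pred != "equivocal"]
    equivocal_rate = 1 - len(decided) / len(preds)
    # Accuracy is undefined (None) if every sample falls inside the zone.
    accuracy = (sum(pred == truth for pred, truth in decided) / len(decided)
                if decided else None)
    return equivocal_rate, accuracy


# Made-up probabilities and true labels, purely for illustration.
probs = [0.95, 0.55, 0.45, 0.20, 0.70]
truths = ["event", "nonevent", "event", "nonevent", "nonevent"]
rate, acc = summarize(probs, truths)
```

Reporting the equivocal rate next to the accuracy matters because a wider zone can make the remaining predictions look better simply by declining to predict more samples.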

Read more »

UseR! Slides for “Classification Using C5.0”

July 17, 2013

I've had a lot of requests, so here they are. Hopefully, all of the slides will be posted on the conference website.

Read more »

UseR! 2013 Highlights

July 13, 2013

The conference was excellent this year. My highlights: Bojan Mihaljevic gave a great presentation on machine learning models built from network models. Their package isn't on CRAN yet, but I'm really looking forward to it. Jim Harner's presentation ...

Read more »