The webinar was recorded (thanks to Ray DiGiacomo and the Orange County RUG). The slides are here minus a few typos.

One of the toughest problems in predictive modeling occurs when the classes have a severe imbalance; we spend an entire chapter on this subject. One consequence is that performance is generally very biased against the class with the smallest frequency. For example, if the data have a majority of samples belonging to the first...

I'll be doing a webinar with the Orange County R User Group on the caret package on Tue, Feb 25, 2014, 1:00 PM - 2:00 PM EST. Here is the URL in case you are interested: https://www3.gotomeeting.com/register/673845982. Thanks to Ray DiGiacom...

In the book, we discuss the notion of a probability model being "well calibrated". There are many different mathematical techniques that classification models use to produce class probabilities. Some of these values are "probability-like" in that they are between zero and one and sum to one. This doesn't necessarily mean that the probability estimates are consistent with the true event...
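To make the idea concrete, here is a minimal Python sketch of the usual calibration check: bin the predicted event probabilities and compare each bin's mean prediction to its observed event rate. The function name and bin count are my own choices for illustration, not from the book.

```python
def calibration_table(probs, outcomes, n_bins=5):
    """Bin predicted event probabilities and pair the mean prediction in
    each bin with the observed event rate; for a well calibrated model
    the two numbers in each pair should track each other closely."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # clamp p == 1.0 into the last bin
        bins[idx].append((p, y))
    table = []
    for members in bins:
        if members:
            mean_pred = sum(p for p, _ in members) / len(members)
            obs_rate = sum(y for _, y in members) / len(members)
            table.append((round(mean_pred, 2), round(obs_rate, 2)))
    return table

# A toy example where predictions happen to match event rates exactly:
probs = [0.1] * 10 + [0.9] * 10
outcomes = [1] + [0] * 9 + [1] * 9 + [0]
print(calibration_table(probs, outcomes))  # [(0.1, 0.1), (0.9, 0.9)]
```

A model can be "probability-like" yet fail this check badly, e.g. if every 0.9 prediction corresponds to an event only 60% of the time.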

We discuss dealing with large class imbalances in Chapter 16. One approach is to sample the training set to coerce a more balanced class distribution. We discuss down-sampling, where the majority class is sampled to bring its frequency closer to that of the rarest class; up-sampling, where the minority class is resampled to increase its frequency; and hybrid approaches, where some methodologies do a little of both and...
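As a rough sketch of the first two approaches (in R, the caret package provides `downSample` and `upSample` for this), here is a minimal Python illustration; the function names and class labels are mine:

```python
import random

def down_sample(majority, minority, seed=13):
    """Randomly sample the majority class down to the size of the minority class."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)) + list(minority)

def up_sample(majority, minority, seed=13):
    """Sample the minority class with replacement up to the size of the majority class."""
    rng = random.Random(seed)
    resampled = [rng.choice(minority) for _ in range(len(majority))]
    return list(majority) + resampled

majority = ["common"] * 90
minority = ["rare"] * 10

print(len(down_sample(majority, minority)))  # 20 rows, 10 per class
print(len(up_sample(majority, minority)))    # 180 rows, 90 per class
```

Down-sampling discards information from the majority class; up-sampling duplicates minority samples, which can encourage overfitting. Hybrid methods try to balance those trade-offs.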

Thomas Yokota asked a very straightforward question about encodings for categorical predictors: "Is it bad to feed it non-numerical data such as factors?" As usual, I will try to make my answer as complex as possible. (I've heard the old wives' tale that Eskimos have 180 different words in their language for snow. I'm starting to think that statisticians have...
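The most common way to hand a factor to a purely numerical model is to expand it into 0/1 indicator ("dummy") columns, one per level, which is what R's `model.matrix` does under the hood. A minimal Python sketch of the same idea (function name is mine):

```python
def dummy_encode(values):
    """Expand a categorical column into 0/1 indicator columns, one per
    observed level (a full one-hot expansion, without dropping a level)."""
    levels = sorted(set(values))
    rows = [[1 if v == lvl else 0 for lvl in levels] for v in values]
    return levels, rows

levels, rows = dummy_encode(["red", "green", "red", "blue"])
print(levels)  # ['blue', 'green', 'red']
print(rows)    # [[0, 0, 1], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
```

Note that tree-based models can often consume factors directly, while linear models usually need a level dropped to avoid a rank-deficient design matrix.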

In Chapter 11, equivocal zones were briefly discussed. The idea is that some classification errors are close to the probability boundary (i.e., 50% for two-class outcomes). If this is the case, we can create a zone where the samples are predicted as "equivocal" or "indeterminate" instead of one of the class levels. This only works if the...
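The rule itself is simple; here is a minimal Python sketch for the two-class case. The zone width (here 0.45 to 0.55) is an arbitrary choice for illustration:

```python
def predict_with_zone(prob_event, lower=0.45, upper=0.55):
    """Two-class prediction with an equivocal zone: probabilities that fall
    inside [lower, upper] are declared indeterminate rather than being
    forced into one of the class levels."""
    if lower <= prob_event <= upper:
        return "equivocal"
    return "event" if prob_event > upper else "nonevent"

for p in (0.20, 0.50, 0.80):
    print(p, predict_with_zone(p))
# 0.2 nonevent / 0.5 equivocal / 0.8 event
```

Samples falling in the zone are typically excluded when computing performance, which is only honest if the reported metrics also state how many samples were declared equivocal.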

I've had a lot of requests, so here they are. Hopefully, all of the slides will be posted on the conference website.

The conference was excellent this year. My highlights: Bojan Mihaljevic gave a great presentation on machine learning models built from network models. Their package isn't on CRAN yet, but I'm really looking forward to it. Jim Harner's presentation ...

In Chapter 18, we discuss a relatively new method for measuring predictor importance called the maximal information coefficient (MIC). The original paper is by Reshef et al. (2011). A summary of the initial reactions to the MIC is by Speed and Tibshirani (and others can be found here). My (minor) beef with it is the lack...
