# Articles by Nina Zumel

### Principal Components Regression, Pt. 3: Picking the Number of Components

May 30, 2016 |

In our previous note we demonstrated Y-Aware PCA and other y-aware approaches to dimensionality reduction in a predictive modeling context, specifically Principal Components Regression (PCR). For our examples, we selected the appropriate number of principal components by eye. In this note, we will look at ways to select the appropriate ...

### Principal Components Regression, Pt. 2: Y-Aware Methods

May 23, 2016 |

In our previous note, we discussed some problems that can arise when using standard principal components analysis (specifically, principal components regression) to model the relationship between independent (x) and dependent (y) variables. In this note, we present some dimensionality reduction techniques that alleviate some of those problems, in particular what ...

### Principal Components Regression, Pt.1: The Standard Method

May 16, 2016 |

In this note, we discuss principal components regression and some of the issues with it: The need for scaling. The need for pruning. The lack of “y-awareness” of the standard dimensionality reduction step. The purpose of this article is to set the stage for presenting dimensionality reduction techniques appropriate for ...

### Finding the K in K-means by Parametric Bootstrap

February 10, 2016 |

One of the trickier tasks in clustering is determining the appropriate number of clusters. Domain-specific knowledge is always best, when you have it, but there are a number of heuristics for getting at the likely number of clusters in your data. We cover a few of them in Chapter 8 (available ...

### Using PostgreSQL in R: A quick how-to

February 1, 2016 |

The combination of R plus SQL offers an attractive way to work with what we call medium-scale data: data that’s perhaps too large to gracefully work with in its entirety within your favorite desktop analysis tool (whether that be R or Excel), but too small to justify the overhead ...

### “Introduction to Data Science” video course contest is closed

January 26, 2016 |

Congratulations to all the winners of the Win-Vector “Introduction to Data Science” Video Course giveaway! We’ve emailed all of you your individual subscription coupons. Even though this contest is over, we still encourage those interested to join our mailing list. Our updates to the list will be infrequent, but (... [Read more...]

### Upcoming Win-Vector Appearances

November 9, 2015 |

We have two public appearances coming up in the next few weeks: Workshop at ODSC, San Francisco – November 14 Both of us will be giving a two-hour workshop called Preparing Data for Analysis using R: Basic through Advanced Techniques. We will cover key issues in this important but often neglected aspect ... [Read more...]

### Our Differential Privacy Mini-series

November 1, 2015 |

We’ve just finished off a series of articles on some recent research results applying differential privacy to improve machine learning. Some of these results are pretty technical, so we thought it was worth working through concrete examples. And some of the original results are locked behind academic journal paywalls, ...

### A Simpler Explanation of Differential Privacy

October 2, 2015 |

Differential privacy was originally developed to facilitate secure analysis over sensitive data, with mixed success. It’s back in the news again now, with exciting results from Cynthia Dwork, et. al. (see references at the end of the article) that apply results from differential privacy to machine learning. In this ...

### How do you know if your model is going to work?

September 22, 2015 |

Authors: John Mount (more articles) and Nina Zumel (more articles). Our four part article series collected into one piece. Part 1: The problem Part 2: In-training set measures Part 3: Out of sample procedures Part 4: Cross-validation techniques “Essentially, all models are wrong, but some are useful.” George Box Here’s a caricature of ...

### Bootstrap Evaluation of Clusters

September 4, 2015 |

Illustration from Project Gutenberg The goal of cluster analysis is to group the observations in the data into clusters such that every datum in a cluster is more similar to other datums in the same cluster than it is to datums in other clusters. This is an analysis method of ...

### Working with Sessionized Data 2: Variable Selection

July 15, 2015 |

In our previous post in this series, we introduced sessionization, or converting log data into a form that’s suitable for analysis. We looked at basic considerations, like dealing with time, choosing an appropriate dataset for training models, and choosing appropriate (and achievable) business goals. In that previous example, we ... [Read more...]

### Working with Sessionized Data 1: Evaluating Hazard Models

July 8, 2015 |

When we teach data science we emphasize the data scientist’s responsibility to transform available data from multiple systems of record into a wide or denormalized form. In such a “ready to analyze” form each individual example gets a row of data and every fact about the example is a ... [Read more...]

### Wanted: A Perfect Scatterplot (with Marginals)

June 11, 2015 |

We saw this scatterplot with marginal densities the other day, in a blog post by Thomas Wiecki: The graph was produced in Python, using the seaborn package. Seaborn calls it a “jointplot;” it’s called a “scatterhist” in Matlab, apparently. The seaborn version also shows the strength of the linear ... [Read more...]

### Does Balancing Classes Improve Classifier Performance?

February 27, 2015 |

It’s a folk theorem I sometimes hear from colleagues and clients: that you must balance the class prevalence before training a classifier. Certainly, I believe that classification tends to be easier when the classes are nearly balanced, especially when the class you are actually interested in is the rarer ... [Read more...]

### The Geometry of Classifiers

December 18, 2014 |

As John mentioned in his last post, we have been quite interested in the recent study by Fernandez-Delgado, et.al., “Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?” (the “DWN study” for short), which evaluated 179 popular implementations of common classification algorithms over 120 or so data sets, ... [Read more...]

### Estimating Generalization Error with the PRESS statistic

September 25, 2014 |

As we’ve mentioned on previous occasions, one of the defining characteristics of data science is the emphasis on the availability of “large” data sets, which we define as “enough data that statistical efficiency is not a concern” (note that a “large” data set need not be “big data,” however ... [Read more...]

### Vtreat: designing a package for variable treatment

August 7, 2014 |

When you apply machine learning algorithms on a regular basis, on a wide variety of data sets, you find that certain data issues come up again and again: Missing values (NA or blanks) Problematic numerical values (Inf, NaN, sentinel values like 999999999 or -1) Valid categorical levels that don’t appear ... [Read more...]

### Trimming the Fat from glm() Models in R

May 30, 2014 |

One of the attractive aspects of logistic regression models (and linear models in general) is their compactness: the size of the model grows in the number of coefficients, not in the size of the training data. With R, though, glm models are not so concise; we noticed this to our ... [Read more...]

### Bandit Formulations for A/B Tests: Some Intuition

April 24, 2014 |

Controlled experiments embody the best scientific design for establishing a causal relationship between changes and their influence on user-observable behavior. – Kohavi, Henne, Sommerfeld, “Practical Guide to Controlled Experiments on the Web” (2007) A/B tests are one of the simplest ways of running controlled experiments to evaluate the efficacy of a ... [Read more...]
1 2 3