Articles by Nina Zumel

Principal Components Regression, Pt. 2: Y-Aware Methods

May 23, 2016 | Nina Zumel

In our previous note, we discussed some problems that can arise when using standard principal components analysis (specifically, principal components regression) to model the relationship between independent (x) and dependent (y) variables. In this note, we present some dimensionality reduction techniques that alleviate some of those problems, in particular what ...
[Read more...]
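As a rough sketch of what a "y-aware" scaling step can look like (synthetic data, all names illustrative; not the posts' exact code): rescale each input by the slope of a univariate regression of y on that input, so the rescaled columns are in "units of y", then run ordinary PCA on the rescaled data.

set.seed(2016)
n <- 500
x <- matrix(rnorm(n * 10), nrow = n)
colnames(x) <- paste0("x", 1:10)
y <- x[, 1] + 0.1 * x[, 2] + rnorm(n)

# y-aware scaling: multiply each column by the slope of lm(y ~ x_i)
slopes <- apply(x, 2, function(xi) coef(lm(y ~ xi))[2])
x_scaled <- sweep(x, 2, slopes, `*`)

# standard PCA on the y-scaled columns
pca <- prcomp(x_scaled, center = TRUE, scale. = FALSE)
summary(pca)$importance[, 1:3]   # leading components now track y-relevant variation

# regress y on the first principal component
pc1 <- pca$x[, 1]
summary(lm(y ~ pc1))$r.squared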

Principal Components Regression, Pt.1: The Standard Method

May 16, 2016 | Nina Zumel

In this note, we discuss principal components regression and some of the issues with it: the need for scaling, the need for pruning, and the lack of “y-awareness” of the standard dimensionality reduction step. The purpose of this article is to set the stage for presenting dimensionality reduction techniques appropriate for ...
[Read more...]
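For reference, here is a minimal sketch of the standard, y-ignorant PCR workflow the post examines (synthetic data; the number of retained components is chosen arbitrarily, which is exactly the kind of decision the post questions):

set.seed(2016)
n <- 500
x <- data.frame(matrix(rnorm(n * 5), nrow = n))
y <- x[, 1] - x[, 3] + rnorm(n)

# Step 1: scaling -- PCA is driven by variance, so unscaled columns
# with large units would otherwise dominate the components.
pca <- prcomp(x, center = TRUE, scale. = TRUE)

# Step 2: pruning -- keep only the first k components (chosen here
# without reference to y; that is the "y-awareness" problem).
k <- 2
scores <- as.data.frame(pca$x[, 1:k])

# Step 3: regress y on the retained components.
fit <- lm(y ~ ., data = cbind(scores, y = y))
summary(fit)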

Finding the K in K-means by Parametric Bootstrap

February 10, 2016 | Nina Zumel

One of the trickier tasks in clustering is determining the appropriate number of clusters. Domain-specific knowledge is always best, when you have it, but there are a number of heuristics for getting at the likely number of clusters in your data. We cover a few of them in Chapter 8 (available ...
[Read more...]
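A simplified sketch of the parametric-bootstrap idea (not the post's exact procedure, and using the built-in iris measurements purely as an example): compare the within-cluster sum of squares that k-means achieves on the real data against what it achieves on "null" data simulated from a single Gaussian.

library(MASS)
set.seed(2016)

d <- as.matrix(iris[, 1:4])   # illustrative data

wss <- function(x, k) kmeans(x, centers = k, nstart = 10)$tot.withinss
observed <- wss(d, 2)

# parametric bootstrap: simulate same-sized data from one Gaussian
# fit to d, and record the within-cluster SS k-means achieves on it
mu <- colMeans(d)
sigma <- cov(d)
nulls <- replicate(100, {
  sim <- mvrnorm(nrow(d), mu = mu, Sigma = sigma)
  wss(sim, 2)
})

# if the observed WSS is far below the null distribution, the 2-cluster
# structure is unlikely to be an artifact of a single blob of data
mean(nulls <= observed)   # approximate p-value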

Using PostgreSQL in R: A quick how-to

February 1, 2016 | Nina Zumel

The combination of R plus SQL offers an attractive way to work with what we call medium-scale data: data that’s perhaps too large to gracefully work with in its entirety within your favorite desktop analysis tool (whether that be R or Excel), but too small to justify the overhead ...
[Read more...]
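A minimal sketch of the basic pattern, using the RPostgreSQL package; the database name, table, and credentials below are placeholders:

library(RPostgreSQL)

con <- dbConnect(PostgreSQL(),
                 dbname = "mydb",
                 host = "localhost",
                 user = "analyst",
                 password = "secret")

# push the heavy aggregation to the database, pull only the summary into R
summary_df <- dbGetQuery(con, "
  SELECT region, COUNT(*) AS n, AVG(amount) AS mean_amount
  FROM sales
  GROUP BY region
")

dbDisconnect(con)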

Upcoming Win-Vector Appearances

November 9, 2015 | Nina Zumel

We have two public appearances coming up in the next few weeks. Workshop at ODSC, San Francisco – November 14: Both of us will be giving a two-hour workshop called Preparing Data for Analysis using R: Basic through Advanced Techniques. We will cover key issues in this important but often neglected aspect ... [Read more...]

Our Differential Privacy Mini-series

November 1, 2015 | Nina Zumel

We’ve just finished off a series of articles on some recent research results applying differential privacy to improve machine learning. Some of these results are pretty technical, so we thought it was worth working through concrete examples. And some of the original results are locked behind academic journal paywalls, ...
[Read more...]

A Simpler Explanation of Differential Privacy

October 2, 2015 | Nina Zumel

Differential privacy was originally developed to facilitate secure analysis over sensitive data, with mixed success. It’s back in the news again now, with exciting results from Cynthia Dwork et al. (see references at the end of the article) that apply results from differential privacy to machine learning. In this ...
[Read more...]
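Not the article's example, but for readers who want the one-line version of the core mechanism: a differentially private answer to a counting query adds Laplace noise whose scale is the query's sensitivity (1, for a count) divided by the privacy parameter epsilon.

# sample Laplace noise: the difference of two iid exponentials
rlaplace <- function(n, scale) {
  rexp(n, rate = 1 / scale) - rexp(n, rate = 1 / scale)
}

# answer a count query with noise scaled to sensitivity / epsilon
private_count <- function(x, condition, epsilon = 0.1) {
  true_count <- sum(condition(x))
  true_count + rlaplace(1, scale = 1 / epsilon)
}

set.seed(1)
ages <- sample(18:90, 1000, replace = TRUE)
private_count(ages, function(a) a > 50, epsilon = 0.1)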

How do you know if your model is going to work?

September 22, 2015 | Nina Zumel

Authors: John Mount and Nina Zumel. Our four-part article series collected into one piece: Part 1: The problem; Part 2: In-training set measures; Part 3: Out-of-sample procedures; Part 4: Cross-validation techniques. “Essentially, all models are wrong, but some are useful.” – George Box. Here’s a caricature of ...
[Read more...]
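The recurring theme of the series is that in-training measures flatter the model, so you need held-out data. A minimal sketch of that comparison (synthetic data and a simple logistic model, purely for illustration):

set.seed(2016)
n <- 1000
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- ifelse(d$x1 + d$x2 + rnorm(n) > 0, 1, 0)

# fit on a training split only
is_train <- runif(n) < 0.7
model <- glm(y ~ x1 + x2, data = d[is_train, ], family = binomial)

accuracy <- function(frame) {
  pred <- predict(model, newdata = frame, type = "response") > 0.5
  mean(pred == (frame$y == 1))
}

c(train = accuracy(d[is_train, ]),
  test  = accuracy(d[!is_train, ]))   # a large gap suggests overfitting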

Bootstrap Evaluation of Clusters

September 4, 2015 | Nina Zumel

Illustration from Project Gutenberg. The goal of cluster analysis is to group the observations in the data into clusters such that every point in a cluster is more similar to the other points in its cluster than it is to points in other clusters. This is an analysis method of ...
[Read more...]
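A rough sketch of the kind of bootstrap evaluation involved, using fpc::clusterboot with k-means (the data set and k below are illustrative; the post works through the interpretation in detail):

library(fpc)

d <- scale(iris[, 1:4])

# resample the data, re-cluster each resample, and measure how often each
# original cluster is rediscovered (clusterwise Jaccard similarity)
cboot <- clusterboot(d, B = 100,
                     clustermethod = kmeansCBI,
                     krange = 3, seed = 2015)

cboot$bootmean   # mean Jaccard similarity per cluster; values well below
                 # ~0.6 are usually read as signs of an unstable cluster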

Working with Sessionized Data 2: Variable Selection

July 15, 2015 | Nina Zumel

In our previous post in this series, we introduced sessionization, or converting log data into a form that’s suitable for analysis. We looked at basic considerations, like dealing with time, choosing an appropriate dataset for training models, and choosing appropriate (and achievable) business goals. In that previous example, we ... [Read more...]

Wanted: A Perfect Scatterplot (with Marginals)

June 11, 2015 | Nina Zumel

We saw this scatterplot with marginal densities the other day, in a blog post by Thomas Wiecki: The graph was produced in Python, using the seaborn package. Seaborn calls it a “jointplot;” it’s called a “scatterhist” in Matlab, apparently. The seaborn version also shows the strength of the linear ... [Read more...]
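Not the post's solution, but one quick way to get this kind of plot in R is ggplot2 plus ggExtra::ggMarginal (synthetic data below):

library(ggplot2)
library(ggExtra)

set.seed(2015)
d <- data.frame(x = rnorm(200))
d$y <- 0.5 * d$x + rnorm(200)   # give the plot some linear structure

p <- ggplot(d, aes(x = x, y = y)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm")     # show the strength of the linear fit

ggMarginal(p, type = "density")  # add marginal density plots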

Does Balancing Classes Improve Classifier Performance?

February 27, 2015 | Nina Zumel

It’s a folk theorem I sometimes hear from colleagues and clients: that you must balance the class prevalence before training a classifier. Certainly, I believe that classification tends to be easier when the classes are nearly balanced, especially when the class you are actually interested in is the rarer ... [Read more...]
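For concreteness, "balancing" here typically means something like upsampling the rare class before training, as in this toy sketch (whether doing so actually helps is the empirical question the post takes up):

set.seed(2015)
n <- 10000
d <- data.frame(x = rnorm(n))
d$y <- runif(n) < plogis(d$x - 3)   # rare positive class

# upsample the rare class (with replacement) until the classes are even
pos <- d[d$y, ]
neg <- d[!d$y, ]
pos_up <- pos[sample(nrow(pos), nrow(neg), replace = TRUE), ]
balanced <- rbind(neg, pos_up)

table(d$y)         # original prevalence
table(balanced$y)  # balanced prevalence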

The Geometry of Classifiers

December 18, 2014 | Nina Zumel

As John mentioned in his last post, we have been quite interested in the recent study by Fernandez-Delgado et al., “Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?” (the “DWN study” for short), which evaluated 179 popular implementations of common classification algorithms over 120 or so data sets, ... [Read more...]

Estimating Generalization Error with the PRESS statistic

September 25, 2014 | Nina Zumel

As we’ve mentioned on previous occasions, one of the defining characteristics of data science is the emphasis on the availability of “large” data sets, which we define as “enough data that statistical efficiency is not a concern” (note that a “large” data set need not be “big data,” however ... [Read more...]
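As a quick illustration of the PRESS (predicted residual sum of squares) statistic discussed in the post: for a linear model, the leave-one-out residuals can be computed from the ordinary residuals and the hat-matrix diagonal, with no refitting (synthetic data below):

set.seed(2014)
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- d$x1 + 2 * d$x2 + rnorm(n)

model <- lm(y ~ x1 + x2, data = d)

# leave-one-out residual for row i: e_i / (1 - h_ii)
loo_resid <- residuals(model) / (1 - hatvalues(model))
press <- sum(loo_resid^2)

press                    # out-of-sample flavored error estimate
sum(residuals(model)^2)  # in-sample RSS, optimistic by comparison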

Vtreat: designing a package for variable treatment

August 7, 2014 | Nina Zumel

When you apply machine learning algorithms on a regular basis, on a wide variety of data sets, you find that certain data issues come up again and again: missing values (NA or blanks); problematic numerical values (Inf, NaN, sentinel values like 999999999 or -1); valid categorical levels that don’t appear ... [Read more...]
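A rough sketch of the workflow vtreat supports (toy data; treatment-plan options omitted): design a treatment plan on training data, then "prepare" any future data frame into a clean, all-numeric form with these issues handled.

library(vtreat)

set.seed(2014)
n <- 100
d <- data.frame(
  x   = c(rnorm(n - 2), NA, 1e9),                     # a missing value and a sentinel
  cat = sample(c("a", "b", "c"), n, replace = TRUE),  # a categorical variable
  stringsAsFactors = FALSE)
d$y <- ifelse(d$cat == "a", 1, 0) + rnorm(n)

# design a treatment plan for the numeric outcome y
plan <- designTreatmentsN(d, varlist = c("x", "cat"), outcomename = "y")

# apply it: NAs get imputed plus an "is missing" indicator,
# categorical levels become numeric effect/indicator columns
d_treated <- prepare(plan, d)
head(d_treated)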

Trimming the Fat from glm() Models in R

May 30, 2014 | Nina Zumel

One of the attractive aspects of logistic regression models (and linear models in general) is their compactness: the size of the model grows in the number of coefficients, not in the size of the training data. With R, though, glm models are not so concise; we noticed this to our ... [Read more...]
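A sketch of the problem and of the kind of trimming involved (not necessarily the post's exact recipe): a fitted glm carries copies of the training data that predict() on new data does not need.

set.seed(2014)
n <- 100000
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- runif(n) < plogis(d$x1 - d$x2)

model <- glm(y ~ x1 + x2, data = d, family = binomial)
print(object.size(model), units = "MB")

# drop the heaviest components that predict() on new data does not use
# (summary() and se.fit will no longer work on the slimmed model)
slim <- model
slim$data <- NULL
slim$y <- NULL
slim$model <- NULL
slim$residuals <- NULL
slim$fitted.values <- NULL
slim$weights <- NULL
slim$prior.weights <- NULL
slim$linear.predictors <- NULL
slim$effects <- NULL
slim$qr$qr <- NULL
print(object.size(slim), units = "MB")

# predictions on new data still agree
newd <- data.frame(x1 = rnorm(5), x2 = rnorm(5))
stopifnot(all.equal(predict(model, newd, type = "response"),
                    predict(slim,  newd, type = "response")))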

Bandit Formulations for A/B Tests: Some Intuition

April 24, 2014 | Nina Zumel

“Controlled experiments embody the best scientific design for establishing a causal relationship between changes and their influence on user-observable behavior.” – Kohavi, Henne, Sommerfeld, “Practical Guide to Controlled Experiments on the Web” (2007). A/B tests are one of the simplest ways of running controlled experiments to evaluate the efficacy of a ... [Read more...]
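Not necessarily the post's formulation, but a small simulation of one popular bandit strategy (Thompson sampling with Beta priors) shows the basic intuition: traffic shifts toward the better-converting variant as evidence accumulates.

set.seed(2014)
true_rates <- c(A = 0.05, B = 0.07)   # unknown conversion rates
wins   <- c(A = 0, B = 0)
losses <- c(A = 0, B = 0)

for (i in 1:5000) {
  # sample a plausible rate for each arm from its Beta posterior
  draws <- rbeta(2, shape1 = wins + 1, shape2 = losses + 1)
  arm <- which.max(draws)               # serve the arm that looks best
  converted <- runif(1) < true_rates[arm]
  wins[arm]   <- wins[arm] + converted
  losses[arm] <- losses[arm] + !converted
}

wins + losses            # traffic allocated to each arm; B should dominate
wins / (wins + losses)   # observed conversion rates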
