Blog Archives

Estimating Generalization Error with the PRESS statistic

September 25, 2014
By
Estimating Generalization Error with the PRESS statistic

As we’ve mentioned on previous occasions, one of the defining characteristics of data science is the emphasis on the availability of “large” data sets, which we define as “enough data that statistical efficiency is not a concern” (note that a “large” data set need not be “big data,” however you choose to define it). In Related posts:

Read more »

Vtreat: designing a package for variable treatment

August 7, 2014
By
Vtreat: designing a package for variable treatment

When you apply machine learning algorithms on a regular basis, on a wide variety of data sets, you find that certain data issues come up again and again: Missing values (NA or blanks) Problematic numerical values (Inf, NaN, sentinel values like 999999999 or -1) Valid categorical levels that don’t appear in the training data (especially Related posts:

Read more »

Trimming the Fat from glm() Models in R

May 30, 2014
By
Trimming the Fat from glm() Models in R

One of the attractive aspects of logistic regression models (and linear models in general) is their compactness: the size of the model grows in the number of coefficients, not in the size of the training data. With R, though, glm models are not so concise; we noticed this to our dismay when we tried to Related posts:

Read more »

Bandit Formulations for A/B Tests: Some Intuition

April 24, 2014
By
Bandit Formulations for A/B Tests: Some Intuition

Controlled experiments embody the best scientific design for establishing a causal relationship between changes and their influence on user-observable behavior. – Kohavi, Henne, Sommerfeld, “Practical Guide to Controlled Experiments on the Web” (2007) A/B tests are one of the simplest ways of running controlled experiments to evaluate the efficacy of a proposed improvement (a new Related posts:

Read more »

Practical Data Science with R: Release date announced

March 25, 2014
By
Practical Data Science with R: Release date announced

It took a little longer than we’d hoped, but we did it! Practical Data Science with R will be released on April 2nd (physical version). The eBook version will follow soon after, on April 15th. You can preorder the pBook now on the Manning book page. The physical version comes with a complimentary eBook version Related posts:

Read more »

The Statistics behind “Verification by Multiplicity”

March 1, 2014
By
The Statistics behind “Verification by Multiplicity”

There’s a new post up at the ninazumel.com blog that looks at the statistics of “verification by multiplicity” — the statistical technique that is behind NASA’s announcement of 715 new planets that have been validated in the data from the Kepler Space Telescope. We normally don’t write about science here at Win-Vector, but we do Related posts:

Read more »

The Extra Step: Graphs for Communication versus Exploration

January 12, 2014
By
The Extra Step: Graphs for Communication versus Exploration

Visualization is a useful tool for data exploration and statistical analysis, and it’s an important method for communicating your discoveries to others. While those two uses of visualization are related, they aren’t identical. One of the reasons that I like ggplot so much is that it excels at layering together multiple views and summaries of Related posts:

Read more »

Big News! Practical Data Science with R is content complete!

December 19, 2013
By
Big News! Practical Data Science with R is content complete!

The last appendix has gone to the editors; the book is now content complete. What a relief! We are hoping to release the book late in the first quarter of next year. In the meantime, you can still get early drafts of our chapters through Manning’s Early Access program, if you haven’t yet. The link Related posts:

Read more »

Bayesian and Frequentist Approaches: Ask the Right Question

May 6, 2013
By
Bayesian and Frequentist Approaches: Ask the Right Question

It occurred to us recently that we don’t have any articles about Bayesian approaches to statistics here. I’m not going to get into the “Bayesian versus Frequentist” war; in my opinion, which style of approach to use is less about philosophy, and more about figuring out the best way to answer a question. Once you Related posts:

Read more »

Revisiting Cleveland’s The Elements of Graphing Data in ggplot2

February 18, 2013
By
Revisiting Cleveland’s The Elements of Graphing Data in ggplot2

I was flipping through my copy of William Cleveland’s The Elements of Graphing Data the other day; it’s a book worth revisiting. I’ve always liked Cleveland’s approach to visualization as statistical analysis. His quest to ground visualization principles in the context of human visual cognition (he called it “graphical perception”) generated useful advice for designing Related posts:

Read more »