Articles by Nina Zumel

Exploring the XI Correlation Coefficient

December 29, 2021 | Nina Zumel

Recently, we’ve been reading about a new correlation coefficient, \(\xi\) (“xi”), which was introduced by Professor Sourav Chatterjee in his paper, “A New Coefficient of Correlation”. The \(\xi\) coefficient has the following properties: If \(y\) is a function of \(x\), then \(\xi\) goes to 1 asymptotically as \(n\) […]
[Read more...]

When Cross-Validation is More Powerful than Regularization

November 12, 2019 | Nina Zumel

Regularization is a way of avoiding overfit by restricting the magnitude of model coefficients (or in deep learning, node weights). A simple example of regularization is the use of ridge or lasso regression to fit linear models in the presence of collinear variables or (quasi-)separation. The intuition is that ...
[Read more...]

Why Do We Plot Predictions on the x-axis?

September 27, 2019 | Nina Zumel

When studying regression models, one of the first diagnostic plots most students learn is to plot residuals versus the model’s predictions (that is, with the predictions on the x-axis). Here’s a basic example. # build an "ideal" linear process. set.seed(34524) N = 100 x1 = runif(N) x2 = runif(N) noise = 0.25*...
[Read more...]

WVPlots 1.1.2 on CRAN

September 12, 2019 | Nina Zumel

I have put a new release of the WVPlots package up on CRAN. This release adds palette and/or color controls to most of the plotting functions in the package. WVPlots was originally a catch-all package of ggplot2 visualizations that we at Win-Vector tended to use repeatedly, and wanted to ...
[Read more...]

An Ad-hoc Method for Calibrating Uncalibrated Models

July 16, 2019 | Nina Zumel

In the previous article in this series, we showed that common ensemble models like random forest and gradient boosting are uncalibrated: they are not guaranteed to estimate aggregates or rollups of the data in an unbiased way. However, they can be preferable to calibrated models such as linear or generalized ...
[Read more...]

Common Ensemble Models can be Biased

July 11, 2019 | Nina Zumel

In our previous article, we showed that generalized linear models are unbiased, or calibrated: they preserve the conditional expectations and rollups of the training data. A calibrated model is important in many applications, particularly when financial data is involved. However, when making predictions on individuals, a biased model may be ...
[Read more...]

Link Functions versus Data Transforms

July 7, 2019 | Nina Zumel

In the linear regression section of our book Practical Data Science in R, we use the example of predicting income from a number of demographic variables (age, sex, education and employment type). In the text, we choose to regress against log10(income) rather than directly against income. One obvious reason ...
[Read more...]

Cohen’s D for Experimental Planning

June 18, 2019 | Nina Zumel

In this note, we discuss the use of Cohen’s D for planning difference-of-mean experiments. Estimating sample size Let’s imagine you are testing a new weight loss program and comparing it to some existing weight loss regimen. You want to run an experiment to determine if the new program ...
[Read more...]

PDSwR2: New Chapters!

February 6, 2019 | Nina Zumel

We have two new chapters of Practical Data Science with R, Second Edition online and available for review! The newly available chapters cover: Data Engineering And Data Shaping – Explores how to use R to organize or wrangle data into a shape useful for analysis. The chapter covers applying data transforms, ...
[Read more...]

More on sigr

November 6, 2018 | Nina Zumel

If you’ve read our previous R Tip on using sigr with linear models, you might have noticed that the lm() summary object does in fact carry the R-squared and F statistics, both in the printed form: model_lm
[Read more...]

Faceted Graphs with cdata and ggplot2

October 21, 2018 | Nina Zumel

In between client work, John and I have been busy working on our book, Practical Data Science with R, 2nd Edition. To demonstrate a toy example for the section I’m working on, I needed scatter plots of the petal and sepal dimensions of the iris data, like so: I ...
[Read more...]

Announcing Practical Data Science with R, 2nd Edition

August 15, 2018 | Nina Zumel

We are pleased and excited to announce that we are working on a second edition of Practical Data Science with R! Manning Publications has just announced the launching of the MEAP (Manning Early Access Program) for the second edition. The MEAP allows you to subscribe to drafts of chapters as ...
[Read more...]

Partial Pooling for Lower Variance Variable Encoding

September 28, 2017 | Nina Zumel

[Image: Banaue rice terraces. Photo: Jon Rawlinson]
In a previous article, we showed the use of partial pooling, or hierarchical/multilevel models, for level coding high-cardinality categorical variables in vtreat. In this article, we will discuss a little more about the how and why of partial pooling in R. We will ...
[Read more...]

Custom Level Coding in vtreat

September 25, 2017 | Nina Zumel

One of the services that the R package vtreat provides is level coding (what we sometimes call impact coding): converting the levels of a categorical variable to a meaningful and concise single numeric variable, rather than coding them as indicator variables (AKA "one-hot encoding"). Level coding can be computationally and ...
[Read more...]

Teaching pivot / un-pivot

April 11, 2017 | Nina Zumel

Authors: John Mount and Nina Zumel. Introduction: In teaching thinking in terms of coordinatized data, we find the hardest operations to teach are joins and pivot. One thing we commented on is that moving data values into columns, or into a “thin” or entity/attribute/value form (often called “un-pivoting”, “...
[Read more...]

A Simple Example of Using replyr::gapply

December 19, 2016 | Nina Zumel

It’s a common situation to have data from multiple processes in a “long” data format, for example a table with columns measurement and process_that_produced_measurement. It’s also natural to split that data apart to analyze or transform it, per-process — and then to bring the results of ...
[Read more...]

Using replyr::let to Parameterize dplyr Expressions

December 6, 2016 | Nina Zumel

Imagine that in the course of your analysis, you regularly require summaries of numerical values. For some applications you want the mean of that quantity, plus/minus a standard deviation; for other applications you want the median, and perhaps an interval around the median based on the interquartile range (IQR). ...
[Read more...]