Blog Archives

Announcing Practical Data Science with R, 2nd Edition

August 15, 2018

We are pleased and excited to announce that we are working on a second edition of Practical Data Science with R! Manning Publications has just announced the launch of the MEAP (Manning Early Access Program) for the second edition. The MEAP allows you to subscribe to drafts of chapters as they become available, and give …

Partial Pooling for Lower Variance Variable Encoding

September 28, 2017

Banaue rice terraces. Photo: Jon Rawlinson

In a previous article, we showed the use of partial pooling, or hierarchical/multilevel models, for level coding high-cardinality categorical variables in vtreat. In this article, we will discuss a little more about the how and why of partial pooling in R. We will use the lme4 package to fit …

Custom Level Coding in vtreat

September 25, 2017

One of the services that the R package vtreat provides is level coding (what we sometimes call impact coding): converting the levels of a categorical variable to a meaningful and concise single numeric variable, rather than coding them as indicator variables (AKA "one-hot encoding"). Level coding can be computationally and statistically preferable to one-hot encoding …
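
As a rough illustration of the idea only (written in Python rather than R, and not vtreat's actual implementation), a level can be coded as the mean outcome within that level minus the grand mean. The function name `impact_code` and the toy data are hypothetical.

```python
from collections import defaultdict


def impact_code(levels, y):
    """Map each level to (mean of y within the level) - (grand mean of y).

    Hypothetical helper for illustration; it does no smoothing and no
    cross-validation, so rare levels can be badly overfit.
    """
    grand_mean = sum(y) / len(y)
    totals = defaultdict(float)
    counts = defaultdict(int)
    for level, value in zip(levels, y):
        totals[level] += value
        counts[level] += 1
    return {level: totals[level] / counts[level] - grand_mean for level in totals}


# Toy data: one categorical variable and a numeric outcome.
levels = ["a", "a", "b", "b", "c"]
y = [1.0, 3.0, 10.0, 12.0, 4.0]

codes = impact_code(levels, y)
# Replace the categorical column with a single numeric column.
encoded = [codes[level] for level in levels]
```

The three levels become one numeric column instead of three indicator columns. That is the concision the excerpt describes.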

Teaching pivot / un-pivot

April 11, 2017

Authors: John Mount and Nina Zumel

Introduction

In teaching thinking in terms of coordinatized data we find the hardest operations to teach are joins and pivot. One thing we commented on is that moving data values into columns, or into a “thin” or entity/attribute/value form (often called “un-pivoting”, “stacking”, “melting” or “gathering”) is easy to …

A Simple Example of Using replyr::gapply

December 19, 2016

It’s a common situation to have data from multiple processes in a “long” data format, for example a table with columns measurement and process_that_produced_measurement. It’s also natural to split that data apart to analyze or transform it per process, and then to bring the results of that data processing together for comparison. Such a work …
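
The split, process, and recombine pattern described above can be sketched in a few lines. This is Python rather than R, and it is not the `replyr::gapply` API. The helper name `split_apply_combine` and the sample rows are assumptions made for illustration; the column names come from the excerpt.

```python
from collections import defaultdict
from statistics import mean


def split_apply_combine(rows, key, apply):
    """Group rows by rows[key], run `apply` on each group, collect the results."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return {group_key: apply(group) for group_key, group in groups.items()}


# "Long" format: one row per measurement, tagged with the producing process.
rows = [
    {"measurement": 1.0, "process_that_produced_measurement": "p1"},
    {"measurement": 3.0, "process_that_produced_measurement": "p1"},
    {"measurement": 10.0, "process_that_produced_measurement": "p2"},
]

# Per-process mean, brought back together in one dictionary for comparison.
per_process_mean = split_apply_combine(
    rows,
    "process_that_produced_measurement",
    lambda group: mean(r["measurement"] for r in group),
)
```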

Using replyr::let to Parameterize dplyr Expressions

December 6, 2016

Imagine that in the course of your analysis, you regularly require summaries of numerical values. For some applications you want the mean of that quantity, plus/minus a standard deviation; for other applications you want the median, and perhaps an interval around the median based on the interquartile range (IQR). In either case, you may want …
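
A minimal sketch of the two kinds of summary, with the column name passed in as a parameter. It is written in Python, not with `replyr::let` or dplyr. The function name `summarize`, the returned field names, and the choice to report the first and third quartiles as the interval around the median are all assumptions.

```python
from statistics import mean, median, quantiles, stdev


def summarize(rows, column, method="mean"):
    """Summarize rows[column] as a center with a low/high interval.

    method="mean":   mean, plus/minus one sample standard deviation.
    method="median": median, with the first and third quartiles as bounds.
    """
    values = [row[column] for row in rows]
    if method == "mean":
        center, spread = mean(values), stdev(values)
        return {"center": center, "low": center - spread, "high": center + spread}
    if method == "median":
        q1, _, q3 = quantiles(values, n=4)  # quartile cut points
        return {"center": median(values), "low": q1, "high": q3}
    raise ValueError(f"unknown method: {method!r}")


rows = [{"x": v} for v in [1, 2, 3, 4, 5]]
mean_summary = summarize(rows, "x", method="mean")
median_summary = summarize(rows, "x", method="median")
```

Because the column name is an ordinary string argument, the same function works on any numeric column without rewriting the summary code.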

Principal Components Regression, Pt. 3: Picking the Number of Components

May 30, 2016

In our previous note we demonstrated Y-Aware PCA and other y-aware approaches to dimensionality reduction in a predictive modeling context, specifically Principal Components Regression (PCR). For our examples, we selected the appropriate number of principal components by eye. In this note, we will look at ways to select the appropriate number of principal components in …

Principal Components Regression, Pt. 2: Y-Aware Methods

May 23, 2016

In our previous note, we discussed some problems that can arise when using standard principal components analysis (specifically, principal components regression) to model the relationship between independent (x) and dependent (y) variables. In this note, we present some dimensionality reduction techniques that alleviate some of those problems, in particular what we call Y-Aware Principal Components …

Principal Components Regression, Pt. 1: The Standard Method

May 16, 2016

In this note, we discuss principal components regression and some of the issues with it: the need for scaling, the need for pruning, and the lack of “y-awareness” of the standard dimensionality reduction step. The purpose of this article is to set the stage for presenting dimensionality reduction techniques appropriate for predictive modeling, such as y-aware …

Finding the K in K-means by Parametric Bootstrap

February 10, 2016

One of the trickier tasks in clustering is determining the appropriate number of clusters. Domain-specific knowledge is always best, when you have it, but there are a number of heuristics for getting at the likely number of clusters in your data. We cover a few of them in Chapter 8 (available as a free sample …
