Blog Archives

New R package: replyr (get a grip on remote dplyr data services)

November 22, 2016
By
New R package: replyr (get a grip on remote dplyr data services)

It is a bit of a shock when R dplyr users switch from using a tbl implementation based on R in-memory data.frames to one based on a remote database or service. A lot of the power and convenience of the dplyr notation is hard to maintain with these more restricted data service providers. Things that … Continue...

Read more »

MySql in a container

November 19, 2016
By

I have previously written on using containerized PostgreSQL with R. This show the steps for using containerized MySQL with R. As a consulting data scientist I often have to debug and rehearse work away from the clients actual infrastructure. Because of this it is useful to be able to spin up disposable PostgreSQL or MySQL … Continue...

Read more »

Teaching Practical Data Science with R

November 16, 2016
By
Teaching Practical Data Science with R

Practical Data Science with R, Zumel, Mount; Manning 2014 is a book Nina Zumel and I are very proud of. I have written before how I think this book stands out and why you should consider studying from it. Please read on for a some additional comments on the intent of different sections of the … Continue...

Read more »

You should re-encode high cardinality categorical variables

November 11, 2016
By
You should re-encode high cardinality categorical variables

Nina Zumel and I have been doing a lot of writing on the (important) details of re-encoding high cardinality categorical variables for predictive modeling. These are variables that essentially take on string-values (also called levels or factors) and vary through many such levels. Typical examples include zip-codes, vendor IDs, and product codes. In a sort … Continue...

Read more »

Laplace noising versus simulated out of sample methods (cross frames)

November 9, 2016
By
Laplace noising versus simulated out of sample methods (cross frames)

Nina Zumel recently mentioned the use of Laplace noise in “count codes” by Misha Bilenko (see here and here) as a known method to break the overfit bias that comes from using the same data to design impact codes and fit a next level model. It is a fascinating method inspired by differential privacy methods, … Continue...

Read more »

Some vtreat design principles

November 1, 2016
By
Some vtreat design principles

We have already written quite a few times about our vtreat open source variable treatment package for R (which implements effects/impact coding, missing value replacement, and novel value replacement; among other important data preparation steps), but we thought we would take some time to describe some of the principles behind the package design. Introduction vtreat … Continue...

Read more »

A quick look at RStudio’s R notebooks

October 22, 2016
By

A quick demo of RStudio’s R Notebooks shown by John Mount (of Win-Vector LLC, a statistics, data science, and algorithms consulting and training firm). (see http://rmarkdown.rstudio.com/r_notebooks.html and https://www.rstudio.com/products/rstudio/download/preview/ )

Read more »

Data science for executives and managers

October 21, 2016
By

Nina Zumel recently announced upcoming speaking appearances. I want to promote the upcoming sessions at ODSC West 2016 (11:15am-1:00pm on Friday November 4th, or 3:00pm-4:30pm on Saturday November 5th) and invite executives, managers, and other data science consumers to attend. We assume most of the Win-Vector blog audience is made of practitioners (who we hope … Continue...

Read more »

On calculating AUC

October 7, 2016
By
On calculating AUC

Recently Microsoft Data Scientist Bob Horton wrote a very nice article on ROC plots. We expand on this a bit and discuss some of the issues in computing “area under the curve” (AUC). R has a number of ROC/AUC packages; for example ROCR, pROC, and plotROC. But it is instructive to see how ROC plots … Continue...

Read more »

Adding polished significance summaries to papers using R

October 4, 2016
By

When we teach “R for statistics” to groups of scientists (who tend to be quite well informed in statistics, and just need a bit of help with R) we take the time to re-work some tests of model quality with the appropriate significance tests. We organize the lesson in terms of a larger and more … Continue...

Read more »

Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)