Articles by John Mount

magrittr’s Doppelgänger

December 13, 2016 | John Mount

R picked up a nifty way to organize sequential calculations in May of 2014: magrittr by Stefan Milton Bache and Hadley Wickham. magrittr is now quite popular and also has become the backbone of current dplyr practice. If you read my last article on assignment carefully you may have noticed I ...
[Read more...]

The Case For Using -> In R

December 12, 2016 | John Mount

R has a number of assignment operators (at least “<-”, “=”, and “->”; plus “<<-” and “->>” which have different semantics). The R-style guides routinely insist on “<-” as being the only preferred form. In this note we are going to try to make the case for “->” when using magrittr pipelines. Don Quijote and …
[Read more...]

The case for index-free data manipulation

December 10, 2016 | John Mount

Statisticians and data scientists want a neat world where data is arranged in a table such that every row is an observation or instance, and every column is a variable or measurement. Getting to this state of “ready to model format” (often called a denormalized form by relational algebra types) ...
[Read more...]

Parametric variable names and dplyr

December 3, 2016 | John Mount

When writing reusable code or packages you often do not know the names of the columns or variables you need to work over. This is what I call “parametric treatment of variables.” This can be a problem when using R libraries that assume you know the variable names. The R ...
[Read more...]

Be careful evaluating model predictions

December 2, 2016 | John Mount

One thing I teach is: when evaluating the performance of regression models you should not use correlation as your score. This is because correlation tells you if a re-scaling of your result is useful, but you want to know if the result in your hand is in fact useful. For ...
[Read more...]

MySql in a container

November 19, 2016 | John Mount

I have previously written on using containerized PostgreSQL with R. This shows the steps for using containerized MySQL with R. As a consulting data scientist I often have to debug and rehearse work away from the client's actual infrastructure. Because of this it is useful to be able to spin ... [Read more...]

Teaching Practical Data Science with R

November 16, 2016 | John Mount

Practical Data Science with R, Zumel, Mount; Manning 2014 is a book Nina Zumel and I are very proud of. I have written before how I think this book stands out and why you should consider studying from it. Please read on for some additional comments on the intent of ... [Read more...]

You should re-encode high cardinality categorical variables

November 11, 2016 | John Mount

Nina Zumel and I have been doing a lot of writing on the (important) details of re-encoding high cardinality categorical variables for predictive modeling. These are variables that essentially take on string-values (also called levels or factors) and vary through many such levels. Typical examples include zip-codes, vendor IDs, and ... [Read more...]

Some vtreat design principles

November 1, 2016 | John Mount

We have already written quite a few times about our vtreat open source variable treatment package for R (which implements effects/impact coding, missing value replacement, and novel value replacement; among other important data preparation steps), but we thought we would take some time to describe some of the principles ... [Read more...]

A quick look at RStudio’s R notebooks

October 22, 2016 | John Mount

A quick demo of RStudio’s R Notebooks shown by John Mount (of Win-Vector LLC, a statistics, data science, and algorithms consulting and training firm). (see http://rmarkdown.rstudio.com/r_notebooks.html and https://www.rstudio.com/products/rstudio/download/preview/ ) [Read more...]

Data science for executives and managers

October 21, 2016 | John Mount

Nina Zumel recently announced upcoming speaking appearances. I want to promote the upcoming sessions at ODSC West 2016 (11:15am-1:00pm on Friday November 4th, or 3:00pm-4:30pm on Saturday November 5th) and invite executives, managers, and other data science consumers to attend. We assume most of the Win-Vector blog audience is ... [Read more...]

On calculating AUC

October 7, 2016 | John Mount

Recently Microsoft Data Scientist Bob Horton wrote a very nice article on ROC plots. We expand on this a bit and discuss some of the issues in computing “area under the curve” (AUC). R has a number of ROC/AUC packages; for example ROCR, pROC, and plotROC. But it is ... [Read more...]

Proofing statistics in papers

October 2, 2016 | John Mount

Recently saw a really fun article making the rounds: The prevalence of statistical reporting errors in psychology (1985–2013) Nuijten, M.B., Hartgerink, C.H.J., van Assen, M.A.L.M. et al. Behav Res (2015). doi:10.3758/s13428-015-0664-2. The authors built an R package to check psychology papers for ... [Read more...]

Relative error distributions, without the heavy tail theatrics

September 19, 2016 | John Mount

Nina Zumel prepared an excellent article on the consequences of working with relative error distributed quantities (such as wealth, income, sales, and many more) called “Living in A Lognormal World.” The article emphasizes that if you are dealing with such quantities you are already seeing effects of relative error distributions (... [Read more...]

Did she know we were writing a book?

September 3, 2016 | John Mount

Writing a book is a sacrifice. It takes a lot of time, represents a lot of missed opportunities, and does not (directly) pay very well. If you do a good job it may pay back in good-will, but producing a serious book is a great challenge. Nina Zumel and I ... [Read more...]

Variables can synergize, even in a linear model

September 1, 2016 | John Mount

Introduction: Suppose we have the task of predicting an outcome y given a number of variables v1,..,vk. We often want to “prune variables” or build models with fewer than all the variables. This can be to speed up modeling, decrease the cost of producing future data, improve robustness, improve ... [Read more...]