Blog Archives

help(let, package='replyr')

December 17, 2016

A bit more on our replyr R package.

library("replyr")
help(let, package='replyr')

let {replyr} R Documentation

Prepare expr for execution with name substitutions specified in alias.

Description

replyr::let implements a mapping from desired names (names used directly in the expr code) to names used in the data. Mnemonic: "expr code symbols are on the left, external … Continue...


Organize your data manipulation in terms of “grouped ordered apply”

December 15, 2016

Consider the following common problem: compute per-group ranks for a data set (say the infamous iris example data set). Suppose we want the rank of iris Sepal.Lengths on a per-Species basis. Frankly this is an “ugh” problem for many analysts: it involves grouping, ordering, and window functions all at the same time. It also is … Continue...
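To make the task concrete, here is a minimal base-R sketch of per-Species ranks of Sepal.Length. It is only one possible approach, not necessarily the “grouped ordered apply” pattern the article goes on to describe; the column name `rank_in_species` and the choice of `ties.method = "min"` are assumptions made for illustration.

```r
# Per-Species rank of Sepal.Length using base R only.
# ave() applies the ranking function within each Species group and
# returns the results aligned with the original row order.
d <- iris
d$rank_in_species <- ave(
  d$Sepal.Length,
  d$Species,
  FUN = function(x) rank(x, ties.method = "min")  # ties share the lowest rank
)

# Show the first few rows, ordered by group and then by rank within group.
shown <- d[order(d$Species, d$rank_in_species),
           c("Species", "Sepal.Length", "rank_in_species")]
print(head(shown))
```

The single call hides the three ingredients named above: the grouping (by Species), the ordering (by Sepal.Length), and the window function (rank).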


magrittr’s Doppelgänger

December 13, 2016

R picked up a nifty way to organize sequential calculations in May of 2014: magrittr by Stefan Milton Bache and Hadley Wickham. magrittr is now quite popular and also has become the backbone of current dplyr practice. If you read my last article on assignment carefully you may have noticed I wrote some code that … Continue...


The Case For Using -> In R

December 12, 2016

R has a number of assignment operators (at least “<-”, “=”, and “->”; plus “<<-” and “->>”, which have different semantics). The R style guides routinely insist on “<-” as the only preferred form. In this note we are going to try to make the case for “->” when using magrittr pipelines. Don Quijote and … Continue...
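As a quick reminder of how these operators behave, here is a small base-R sketch. The variable names (`a`, `b`, `c_val`, `counter`, `bump`) are made up for illustration, and the example deliberately avoids magrittr so that it runs without extra packages.

```r
a <- 1        # leftward assignment, the form most style guides prefer
b = 2         # "=" also assigns when used at top level
3 -> c_val    # rightward assignment: value on the left, name on the right

counter <- 0
bump <- function() {
  # "<<-" assigns in an enclosing environment rather than creating
  # a new variable local to the function.
  counter <<- counter + 1
  invisible(counter)
}
bump()

print(c(a = a, b = b, c_val = c_val, counter = counter))
```

With “->” the code reads left to right (first the value, then the name it is stored under), which is the same direction in which a pipeline is read.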


The case for index-free data manipulation

December 10, 2016

Statisticians and data scientists want a neat world where data is arranged in a table such that every row is an observation or instance, and every column is a variable or measurement. Getting to this state of “ready to model format” (often called a denormalized form by relational algebra types) often requires quite a bit … Continue...


Parametric variable names and dplyr

December 3, 2016

When writing reusable code or packages you often do not know the names of the columns or variables you need to work over. This is what I call “parametric treatment of variables.” This can be a problem when using R libraries that assume you know the variable names. The R data manipulation library dplyr currently … Continue...


Be careful evaluating model predictions

December 2, 2016

One thing I teach is: when evaluating the performance of regression models you should not use correlation as your score. This is because correlation tells you if a re-scaling of your result is useful, but you want to know if the result in your hand is in fact useful. For example: the Mars Climate Orbiter … Continue...
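A small sketch of the point about re-scaling, using made-up numbers (the vectors `y` and `pred` are hypothetical and not taken from the article): predictions that are a shifted and scaled copy of the truth correlate almost perfectly with it, yet are far off as actual predictions.

```r
# Hypothetical true outcomes and badly mis-scaled predictions.
y    <- c(1, 2, 3, 4, 5)
pred <- 10 * y + 100          # an exact linear function of y, but far from y

# Correlation is unchanged by positive linear re-scaling,
# so it reports an essentially perfect score here.
correlation <- cor(y, pred)

# Root mean squared error measures how far the predictions in hand
# actually are from the outcomes.
rmse <- sqrt(mean((y - pred)^2))

print(c(correlation = correlation, rmse = rmse))
```

The correlation is high because some re-scaling of `pred` would be useful; the RMSE is large because `pred` as given is not.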


vtreat data cleaning and preparation article now available on arXiv

November 30, 2016

Nina Zumel and I are happy to announce that a formal article discussing data preparation and cleaning using the vtreat methodology is now available from arXiv.org as citation arXiv:1611.09477. vtreat is an R data.frame processor/conditioner that prepares real-world data for predictive modeling in a statistically sound manner. It prepares variables so that data has fewer … Continue...


New R package: replyr (get a grip on remote dplyr data services)

November 22, 2016

It is a bit of a shock when R dplyr users switch from using a tbl implementation based on R in-memory data.frames to one based on a remote database or service. A lot of the power and convenience of the dplyr notation is hard to maintain with these more restricted data service providers. Things that … Continue...


MySql in a container

November 19, 2016

I have previously written on using containerized PostgreSQL with R. This shows the steps for using containerized MySQL with R. As a consulting data scientist I often have to debug and rehearse work away from the client's actual infrastructure. Because of this it is useful to be able to spin up disposable PostgreSQL or MySQL … Continue...

