Articles by John Mount

R Tip: Use qc() For Fast Legible Quoting

February 17, 2018 | John Mount

Here is an R tip. Need to quote a lot of names at once? Use qc(). This is particularly useful in selecting columns from data.frames:

    library("wrapr")  # get qc() definition
    head(mtcars[, qc(mpg, cyl, wt)])
    #                mpg cyl    wt
    # Mazda RX4     21.0   6 2.620
    # Mazda RX4 Wag 21.0   6 2.875
    # Datsun 710    22.8 …

Is 10,000 Cells Big?

February 12, 2018 | John Mount

Trick question: is a 10,000 cell numeric data.frame big or small? In the era of "big data" 10,000 cells is minuscule. Such data could fit on fewer than 1,000 punched cards (or less than half a box). The joking answer is: it is small when they are selling you the system, ...

Why No Exact Permutation Tests at Scale?

February 1, 2018 | John Mount

Here at Win-Vector LLC we like permutation tests. Our team has written on them (for example: How Do You Know if Your Data Has Signal?) and they are used to estimate significances in our sigr and WVPlots R packages. For example, permutation methods are used to estimate the significance reported ...

Supercharge your R code with wrapr

January 27, 2018 | John Mount

I would like to demonstrate some helpful wrapr R notation tools that really neaten up your R code. (image: Christopher Ziemnowicz) Named Map Builder: First I will demonstrate wrapr’s "named map builder": :=. The named map builder adds names to vectors and lists by nice "names on the left and ...

Latest vtreat up on CRAN

January 24, 2018 | John Mount

There is a new version of the R package vtreat now up on CRAN. vtreat is an essential data preparation system for predictive modeling that helps defend your predictive modeling work against real world data issues including: high cardinality categorical variables; rare levels (including new or novel levels during application); ...

Advisory on Multiple Assignment dplyr::mutate() on Databases

January 21, 2018 | John Mount

I currently advise R dplyr users to take care when using multiple assignment dplyr::mutate() commands on databases. (image: Kingroyos, Creative Commons Attribution-Share Alike 3.0 Unported License) In this note I exhibit a troublesome example, and a systematic solution. First let’s set up dplyr, our database, and some example data. ...

Data Reshaping with cdata

January 17, 2018 | John Mount

I’ve just shared a short webcast on data reshaping in R using the cdata package. (link) We also have two really nifty articles on the theory and methods: “Fluid data reshaping with cdata” and “Coordinatized Data: A Fluid Data Specification”. Please give it a try! This is the material I ...

Base R can be Fast

January 15, 2018 | John Mount

“Base R” (call it “Pure R”, “Good Old R”, just don’t call it “Old R” or late for dinner) can be fast for in-memory tasks. This is despite the commonly repeated claim that: “packages written in C/C++ are faster than R code.” The benchmark results of “rquery: Fast ...

Setting up RStudio Server quickly on Amazon EC2

January 13, 2018 | John Mount

I have recently been working on projects using Amazon EC2 (elastic compute cloud), and RStudio Server. I thought I would share some of my working notes. Amazon EC2 supplies near instant access to on-demand disposable computing in a variety of sizes (billed in hours). RStudio Server supplies an interactive user ...

rquery: Fast Data Manipulation in R

January 9, 2018 | John Mount

Win-Vector LLC recently announced the rquery R package, an operator based query generator. In this note I want to share some exciting and favorable initial rquery benchmark timings. Let’s take a look at rquery’s new “ad hoc” mode (made convenient through wrapr’s new “wrapr_applicable” feature). This ...

New wrapr R pipeline feature: wrapr_applicable

January 6, 2018 | John Mount

The R package wrapr now has a neat new feature: “wrapr_applicable”. This feature allows objects to declare a surrogate function to stand in for the object in wrapr pipelines. It is a powerful technique and allowed us to quickly implement a convenient new ad hoc query mode for rquery. ...

Big cdata News

January 4, 2018 | John Mount

I have some big news about our R package cdata. We have greatly improved the calling interface and Nina Zumel has just written the definitive introduction to cdata. cdata is our general coordinatized data tool. It is what powers the deep learning performance graph (here demonstrated with R and Keras) ...

Announcing rquery

December 28, 2017 | John Mount

We are excited to announce the rquery R package. rquery is Win-Vector LLC’s currently in development big data query tool for R. rquery supplies a set of operators inspired by Edgar F. Codd’s relational algebra (updated to reflect lessons learned from working with R, SQL, and dplyr at big ...

Plotting Deep Learning Model Performance Trajectories

December 23, 2017 | John Mount

I am excited to share a new deep learning model performance trajectory graph. Here is an example produced based on Keras in R using ggplot2: The ideas include: We plot model performance as a function of training epoch, data set (training and validation), and metric. For legibility we facet on ...

How to Greatly Speed Up Your Spark Queries

December 20, 2017 | John Mount

For some time we have been teaching R users "when working with wide tables on Spark or on databases: narrow to the columns you really want to work with early in your analysis." The idea behind the advice is: working with fewer columns makes for quicker queries. photo: Jacques Henri ...

More Pipes in R

December 16, 2017 | John Mount

Was enjoying Gabriel’s article Pipes in R Tutorial For Beginners and wanted to call attention to a few more pipes in R (not all for beginners). data.table has essentially used the square bracket sequence “][” in a manner equivalent to piping in R since about 2006. Here is an example. The ...

Getting started with seplyr

December 14, 2017 | John Mount

A big “thank you!!!” to Microsoft for hosting our new introduction to seplyr. If you are working with R and big data I think the seplyr package can be a valuable tool. For how and why, please check out our new introductory article.

How to Avoid the dplyr Dependency Driven Result Corruption

December 6, 2017 | John Mount

In our last article we pointed out a dangerous silent result corruption we have seen when using the R dplyr package with databases. To systematically avoid this result corruption we suggest breaking up your dplyr::mutate() statements to be dependency-free (not assigning the same value twice, and not using any ...

Please inspect your dplyr+database code

December 2, 2017 | John Mount

A note to dplyr with database users: you may benefit from inspecting/re-factoring your code to eliminate value re-use inside dplyr::mutate() statements. If you are using the R dplyr package with a database or with Apache Spark: I respectfully advise you inspect your code to ensure you are not ...

Win-Vector LLC announces new “big data in R” tools

November 29, 2017 | John Mount

Win-Vector LLC is proud to introduce two important new tool families (with documentation) in the 0.5.0 version of seplyr (also now available on CRAN): partition_mutate_se() / partition_mutate_qt(): these are query planners/optimizers that work over dplyr::mutate() assignments. When using big-data systems through R (such as PostgreSQL or ...
