Max Kuhn, Director of Nonclinical Statistics at Pfizer and author of Applied Predictive Modeling, joined us on February 17, 2015 and shared his experience with data mining.

Win-Vector LLC’s Nina Zumel and John Mount are proud to announce that their new data science video course, Introduction to Data Science, is now available on Udemy. We designed the course as an introduction to an advanced topic. The course description is: Use the R Programming Language to execute data science projects and become a data … Continue reading Announcing:...

by Marek Gagolewski, Maciej Bartoszuk, Anna Cena, and Jan Lasek (Rexamine). Introduction In a recent blog post we explained how we managed to set up a working Hadoop environment on a few CentOS7 machines. To test the installation, let’s play…Read more ›

Dici che il fiume trova la via al mare e come il fiume giungerai a me (Miss Sarajevo, U2) One way to approximate the area of a place is to enclose it in a polygon whose area you know. After that, generate coordinates inside the polygon and count how many of them fall into … Continue reading How...
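The counting idea in the excerpt above can be sketched in a few lines of base R. This is only an illustration, not the post's own code: the "place" is assumed to be a unit disk and the enclosing polygon is assumed to be the square [-1, 1] x [-1, 1], and the points are assumed to be drawn uniformly at random. All variable names are hypothetical.

```r
# Minimal sketch: estimate the area of a region by sampling points
# uniformly inside an enclosing shape whose area is known.
# Assumption: the region is a unit disk and the enclosing polygon is the
# square [-1, 1] x [-1, 1]; both are illustrative stand-ins.
set.seed(42)                        # make the run reproducible

n  <- 100000                        # number of random points
px <- runif(n, min = -1, max = 1)   # x coordinates inside the square
py <- runif(n, min = -1, max = 1)   # y coordinates inside the square

inside      <- px^2 + py^2 <= 1     # TRUE where a point falls in the disk
square_area <- 4                    # known area of the enclosing square

# The fraction of hits times the known area approximates the target area
area_estimate <- mean(inside) * square_area
area_estimate                       # should be close to pi
```

For a real map, the membership test `inside` would have to be replaced by a point-in-region check against the actual boundary; the rest of the calculation stays the same.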

Consider the following dataset, with (only) ten points

    x=c(.4,.55,.65,.9,.1,.35,.5,.15,.2,.85)
    y=c(.85,.95,.8,.87,.5,.55,.5,.2,.1,.3)
    plot(x,y,pch=19,cex=2)

We want to get – say – two clusters. Or more specifically, two sets of observations, each of them sharing some similarities. Since the number of observations is rather small, it is actually possible to get an exhaustive list of all partitions, and to minimize some criteria, such...
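A minimal sketch of that exhaustive search, using the ten points from the excerpt. The excerpt is cut off before it names the criterion, so the within-cluster sum of squares used here is an assumption, as are the helper `wss` and the other variable names.

```r
# Ten points from the excerpt
x <- c(.4, .55, .65, .9, .1, .35, .5, .15, .2, .85)
y <- c(.85, .95, .8, .87, .5, .55, .5, .2, .1, .3)
n <- length(x)

# Assumed criterion: sum of squared distances to the cluster centroid
wss <- function(idx) {
  sum((x[idx] - mean(x[idx]))^2 + (y[idx] - mean(y[idx]))^2)
}

best_cost   <- Inf
best_labels <- NULL

# Point 1 is fixed in cluster 1 so mirrored partitions are not counted twice.
# Codes 1 .. 2^(n-1) - 1 enumerate every split with a non-empty cluster 2.
for (code in 1:(2^(n - 1) - 1)) {
  # Binary digits of `code` decide the cluster of points 2..n
  in_second <- c(FALSE, (code %/% 2^(0:(n - 2))) %% 2 == 1)
  cost <- wss(which(!in_second)) + wss(which(in_second))
  if (cost < best_cost) {
    best_cost   <- cost
    best_labels <- ifelse(in_second, 2, 1)
  }
}

best_labels   # cluster label (1 or 2) for each point
best_cost     # value of the criterion for the best partition
```

With ten points there are only 2^9 - 1 = 511 such splits, so brute force is cheap here; the count doubles with every added observation, which is why this approach only works for very small datasets.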

RStudio’s data viewer provides a quick way to look at the contents of data frames and other column-based data in your R environment. You invoke it by clicking on the grid icon in the Environment pane, or at the console by typing View(mydata). As part of the RStudio Preview Release, we’ve completely overhauled RStudio’s data

by Andrie de Vries. R has strong support for parallel programming, both in base R and in additional CRAN packages. For example, we have previously written about foreach and parallel programming in the articles Tutorial: Parallel programming with foreach and Intro to Parallel Random Number Generation with RevoScaleR. The foreach package provides simple looping constructs in R, similar to lapply()...

The other day I got stuck working with a huge data set using data.table in R. It took me a little while to realise that I had to produce a minimal reproducible example to actually understand why I got stuck in the first place. I know, this is the mantra I should follow before I reach out to R-help,...

I spent last week at the Strata 2015 Conference in San José, California. As always, Strata made for a wonderful conference to catch up on the latest developments on big data and data science, and to connect with colleagues and friends old and new. Having been to every Strata conference since the first in XXXX, it's been interesting to...

