Open-source software is awesome. If I found that a piece of closed-source software was missing a feature that I wanted, well, bad luck. I probably couldn't even tell if it was actually missing or if I just didn't know about it...

This article is a demonstration of the use of the R vtreat variable preparation package followed by caret-controlled training. In previous writings we have gone to great lengths to document, explain and motivate vtreat. That necessarily gets long and unnecessarily feels complicated. In this example we are going to show what building a predictive model …

A not so enticing Le Monde mathematical puzzle: Find the minimal value of a five-digit number divided by the sum of its digits. This can be formalised as finding the minimum of N/(a+b+c+d+e) where N is written abcde. And solved by brute force. Using a rough approach to finding the digits of a five-digit number, the
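The brute-force search described above can be sketched as follows. This is a minimal illustration in Python, not the post's own code; the helper names `digit_sum` and `minimal_ratio` are made up for this example, and exact fractions are used so that the comparison does not depend on floating-point rounding.

```python
from fractions import Fraction


def digit_sum(n: int) -> int:
    """Sum of the decimal digits of a positive integer."""
    return sum(int(ch) for ch in str(n))


def minimal_ratio(lo: int = 10_000, hi: int = 99_999):
    """Return (N, N / digit_sum(N)) for the N in [lo, hi] with the smallest ratio."""
    best_n = min(range(lo, hi + 1), key=lambda n: Fraction(n, digit_sum(n)))
    return best_n, Fraction(best_n, digit_sum(best_n))


# Search all five-digit numbers (hypothetical default bounds).
best_n, best_ratio = minimal_ratio()
print(best_n, float(best_ratio))
```

Changing `lo` and `hi` runs the same search for numbers with a different digit count.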

Question I recently got a mail from Václav on reference semantics in data.tree, reading as follows: Dear Christoph, I am rather inexperienced when it comes to environments in R and henceforth I apologize if my question is basic; however, my colleagues are no better than me to answer my question. I would have a question iro

In this post we’ll show how to create Triangular Surface Plots in R. This post is based on timelyportfolio’s gist. Moebius Strip 2D Surface over a disk Chopper from python

An R programmer can determine the order in which commands are processed by using the control statements repeat{}, while(), for(), break, and next. Answers to the exercises are available here. Exercise 1: The repeat{} loop processes a block of code until the condition specified by the break statement (which is mandatory within the repeat{} loop),

A MilanoR meeting is an occasion to bring together R users from the Milano area to share R tips and experience: the next one will be Thursday, October 27th. We are looking for volunteers to present at the next meeting: if you feel you have something to contribute or you can recommend someone, please contact us!

Vocativ did an interesting analysis of the President’s State of the Union (SOTU) speeches. They showed that across the past couple hundred years and many Presidents, SOTU speeches have been targeted at audiences with lower and lower education levels. Vocativ’s in-print interpretation of the downward sloping trend was that speeches have gotten less sophisticated. Their recommended share-tweet for the article...

The last month or so has been a whirlwind of awesomeness with a veritable bevy of user group and conference talks on my part! I thought I would share the materials with you and provide some brief thoughts on how each presentation went. Sessions: SQL Saturday Exeter: Stats 101; London Business Analytics (LBAG):

Recently, in my own little scientific community bubble there was increasing interest in markdown and its use for science. As a big fan of markdown and especially rmarkdown, I created the following cheat sheet and shared it at a couple of events. Sinc...

I am pleased to announce heatmaply, my new R package for generating interactive heat maps, based on the plotly R package. tl;dr By running the following 3 lines of code:

```r
install.packages("heatmaply")
library(heatmaply)
heatmaply(mtcars, k_col = 2, k_row = 3) %>% layout(margin = list(l = 130, b = 40))
```

You will get this output in your browser …

by John Mount Ph. D., Data Scientist at Win-Vector LLC. In her series on principal components analysis for regression in R, Win-Vector LLC's Dr. Nina Zumel broke the demonstration down into the following pieces: Part 1: the proper preparation of data and use of principal components analysis (particularly for supervised learning or regression). Part 2: the introduction of y-aware...

(By Achim Zeileis) From 10 June to 10 July 2016 the best European football teams will meet in France to determine the European Champion in the UEFA European Championship 2016 tournament. For the first time 24 teams compete, expanding the format from 16 teams as in the previous five Euro tournaments. For forecasting the winning probability of each team...

Previously in this series: Understanding the beta distribution; Understanding empirical Bayes estimation; Understanding credible intervals; Understanding the Bayesian approach to false discovery rates; Understanding Bayesian A/B testing. In this series we’ve been using the empirical Bayes method to estimate batting averages of baseball players. Empirical Bayes is useful here because when we...

Today’s post is by Kurt Menke, the owner of Bird’s Eye View GIS, a GIS consultancy. Kurt also wrote the book Mastering QGIS. In my latest course (Shapefiles for R Programmers) I briefly introduce people to QGIS. Kurt’s post below gives you a roadmap for learning more. I come to this blog from a slightly different,

In my last article, I stated that for practitioners (as opposed to theorists), the real prerequisite for machine learning is data analysis, not math. One of the main reasons for making this statement is that data scientists spend an inordinate amount of time on data analysis. The traditional statement is that data scientists “spend 80%

In our previous note we demonstrated Y-Aware PCA and other y-aware approaches to dimensionality reduction in a predictive modeling context, specifically Principal Components Regression (PCR). For our examples, we selected the appropriate number of principal components by eye. In this note, we will look at ways to select the appropriate number of principal components in …

It is often said that “R is its packages.” One package of interest is ranger, a fast parallel C++ implementation of random forest machine learning. Ranger is a great package and at first glance appears to remove the “only 63 levels allowed for string/categorical variables” limit found in the Fortran randomForest package. Actually this appearance is …

The zoo package provides methods for totally ordered indexed observations. It is aimed at calculations on irregular time series of numeric vectors, matrices and factors. The zoo package interfaces to all other time series packages on CRAN. This makes it easy to pass time series objects between zoo and other time series

This blog post was first published on EXEGETIC ANALYTICS’s blog and kindly re-posted on Data Science Africa. We are planning to host one of the three inaugural satRday conferences in Cape Town during 2017. The R Consortium has committed to funding three of these events: one will be in Hungary, another will be somewhere in the USA and the third...

I got a note from Karim Lahrichi, who even thinks about math when he’s supposed to be drinking beer. The bar puzzle they were trying to solve goes like this: Using all of the numbers 1, 3, 4, 6 exactly once, and any combination of: addition, subtraction, multiplication and division (and parentheses to group operations however you
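A puzzle of this shape can be explored by exhaustive search. The sketch below, in Python, is an illustration rather than the post's method: it repeatedly combines any two remaining numbers with one of the four operations until a single value is left, and records one expression for every value reached. The excerpt breaks off before stating the target, so none is assumed here; the function name `reachable` is hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def reachable(numbers):
    """Map every value obtainable by using all `numbers` exactly once
    with +, -, *, / (any grouping) to one expression that produces it."""
    results = {}

    def combine(items):
        # items is a list of (exact value, expression string) pairs.
        if len(items) == 1:
            value, expr = items[0]
            results.setdefault(value, expr)
            return
        for i, j in combinations(range(len(items)), 2):
            (a, ea), (b, eb) = items[i], items[j]
            rest = [items[k] for k in range(len(items)) if k not in (i, j)]
            candidates = [
                (a + b, f"({ea} + {eb})"),
                (a * b, f"({ea} * {eb})"),
                (a - b, f"({ea} - {eb})"),
                (b - a, f"({eb} - {ea})"),
            ]
            if b != 0:
                candidates.append((a / b, f"({ea} / {eb})"))
            if a != 0:
                candidates.append((b / a, f"({eb} / {ea})"))
            for value, expr in candidates:
                combine(rest + [(value, expr)])

    combine([(Fraction(n), str(n)) for n in numbers])
    return results


values = reachable([1, 3, 4, 6])
# Whole-number results only; intermediate steps may still be fractional.
print(sorted(int(v) for v in values if v.denominator == 1))
```

Once the target is known, `values.get(Fraction(target))` returns an expression for it, or `None` if it cannot be reached. Exact fractions matter here because a solution may pass through a non-integer intermediate value.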

We all have used stepwise regression at some point. Stepwise regression is known to be sensitive to initial inputs. One way to mitigate this sensitivity is to repeatedly run stepwise regression on bootstrap samples. R has a nice package called bootStepAIC, which (from its description) “Implements a Bootstrap procedure to investigate the variability of model

Although this is far from a paradox when realising why the phenomenon occurred, it took me a few lines to understand why the empirical average of a log-normal sample is apparently a biased estimator of its mean. And why the biased plug-in estimator does not appear to present a bias. The picture below compares two
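The comparison can be reproduced in outline with a small simulation. The sketch below, in Python, rests on assumptions that are not in the excerpt: the "plug-in estimator" is taken to be exp(m + v/2), with m and v the sample mean and maximum-likelihood variance of the logged data, and the parameter values (`mu`, `sigma`, `n`, `reps`, `seed`) are arbitrary illustrative choices.

```python
import math
import random
import statistics


def compare_estimators(mu=0.0, sigma=1.0, n=100, reps=1000, seed=1):
    """Simulate `reps` log-normal samples of size `n` and return
    (true mean, list of empirical averages, list of plug-in estimates)."""
    rng = random.Random(seed)
    averages, plug_ins = [], []
    for _ in range(reps):
        logs = [rng.gauss(mu, sigma) for _ in range(n)]
        sample = [math.exp(x) for x in logs]
        # Estimator 1: plain empirical average of the log-normal sample.
        averages.append(statistics.fmean(sample))
        # Estimator 2 (assumed form): plug estimates of mu and sigma^2
        # into the log-normal mean formula exp(mu + sigma^2 / 2).
        m = statistics.fmean(logs)
        v = statistics.pvariance(logs, m)
        plug_ins.append(math.exp(m + v / 2))
    return math.exp(mu + sigma ** 2 / 2), averages, plug_ins


true_mean, averages, plug_ins = compare_estimators()
print("true mean:", true_mean)
print("empirical average, mean / median over replications:",
      statistics.fmean(averages), statistics.median(averages))
print("plug-in estimator, mean / median over replications:",
      statistics.fmean(plug_ins), statistics.median(plug_ins))
```

Reporting both the mean and the median of each estimator across replications helps separate actual bias from skewness in the estimator's sampling distribution; larger values of `sigma` make the difference more visible.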