## Plumber Logging

August 12, 2019

The plumber R package is used to expose R functions as API endpoints. Due to plumber’s incredible flexibility, most major API design decisions are left up to the developer. One important consideration to be made when developing APIs is how to log information about API requests and responses. This information can be used to determine how plumber APIs are...

## Synthesizing population time-series data from the USA Long Term Ecological Research Network

Introduction The availability of large quantities of freely available data is revolutionizing the world of ecological research. Open data maximizes the opportunities to perform comparative analyses and meta-analyses. Such synthesis efforts will increasingly exploit “population data”, which we define here as time series of population abundance. Such population data plays a central role in testing ecological theory and guiding management...

## Can we use a neural network to generate Shiny code?

August 12, 2019

Many news reports scare us with machines taking over our jobs in the not too distant future. Common examples of take-over targets include professions like truck drivers, lawyers and accountants. In this article we will explore how far machines are from replacing us (R programmers) in writing Shiny code. Spoiler alert: you should not be...

## Vectors and Functions

August 12, 2019

In the previous set we started with arithmetic operations on vectors. We’ll take this a step further now, by practising functions to summarize, sort and round the elements of a vector. So far, the functions we have practised (log, sqrt, exp, sin, cos, and acos) always return a vector with the same length as the input...

## tsbox 0.2: supporting additional time series classes

August 12, 2019

The tsbox package makes life with time series in R easier. It is built around a set of functions that convert time series of different classes to each other. They are frequency-agnostic, and allow the user to combine time series of multiple non-standard and irregular frequencies. A detailed overview of the package functionality is given

## Visualization of Red Tide in the Gulf of Mexico

August 11, 2019

The Red Tide Visualization App offers a quick, interactive snapshot of harmful algae blooms, or 'Red Tide,' observed in the Gulf of Mexico from 2000 to 2018. This app allows the user to examine 6,000 algal blooms recorded by the National Oceanic and Atmospheric Administration's harmful algae bloom observation system and their corresponding water temperatures, water

## objectremover RStudio Addin

August 11, 2019

A Learning Exercise Workflow I created my first ever R package and got it released onto CRAN in March 2019. It’s taken me a while to get round to actually writing about this which tells me that despite many years of trying to overcome procrast...

## R courses February 2020

August 11, 2019

Our next series of R courses for professionals and graduate students is now open for registration. The courses include 1 day for absolute beginners, 3 days for intermediate material and 3 days for the advanced class. They are r...

## Introducing mlrPlayground

You may ask yourself: how is the name ‘mlrPlayground’ even justified? What kind of person dares to put two such opposite terms in a single word and expects people to take him seriously? I as...

## My book’s pdf generation workflow

August 11, 2019

The process used to generate the pdf of my evidence-based software engineering book has been on my list of things to blog about, for ever. An email arrived this afternoon, asking how I produced various effects using Asciidoc; this post probably contains rather more than N. Psaris wanted to know. It’s very easy to get

## R: SVM to Predict MPG for 2019 Vehicles

August 11, 2019

Continuing from the posts below, I am going to use a support vector machine (SVM) to predict combined miles per gallon for all 2019 motor vehicles. Part 1: Using Decision Trees and Random Forest to Predict MPG for 2019 Vehicles. Part 2: Using Gradient Boosted Machine to Predict MPG for 2019 Vehicles. The raw data is located on the...

## Using SVM to Predict MPG for 2019 Vehicles

August 11, 2019

Continuing from the posts below, I am going to use a support vector machine (SVM) to predict combined miles per gallon for all 2019 motor vehicles. Part 1: Using Decision Trees and Random Forest to Predict MPG for 2019 Vehicles. Part 2: Using Gradient Boosted Machine to Predict MPG for 2019 Vehicles. The raw data is located on the...

## vtreat up on PyPi

August 11, 2019

I am excited to announce vtreat is now available for Python on PyPi, in addition to R on CRAN. vtreat is: A data.frame processor/conditioner that prepares real-world data for predictive modeling in a statistically sound manner. vtreat prepares variables so that data has fewer exceptional cases, making it easier to safely use models in production. …

## Gosset part 2: small sample statistics

August 10, 2019

This post is an explainer about the small sample experiment and determining the ideal sample size for inference. Economic perspectives and business logic: Brewing beer at scale. One of the problems William S. Gosset worked on was determining the quality of malt. To brew beer you need 3 ingredients: yeast, hops and a cereal grain. You start with extracting the starch from the...

## Building a data pipeline- uploading external data in AWS S3

August 10, 2019

Introduction Recently, I stepped into the AWS ecosystem to learn and explore its capabilities. I’m documenting my experiences in this series of posts. Hopefully, they will serve as a reference point for me in the future or for anyone else following this...

## Returning to Tides

August 10, 2019

Fred Viole shared a great “data only” R solution to the forecasting tides problem. The methodology comes from a finance perspective, and has some great associated notes and articles. This gives me a chance to comment on the odd relation between prediction and profit in finance. If there really was a trade-able item with low …

## Using parallelization, multiple git repositories and setting permissions when automating R applications with Jenkins

August 10, 2019

Introduction In the previous post, we focused on setting up declarative Jenkins pipelines with emphasis on parametrizing builds and using environment variables across pipeline stages. In this post, we look at various tips that can be useful when automating R application testing and continuous integration, with regard to orchestrating parallelization, combining sources from multiple git repositories and ensuring proper access right...

## Road to Rugby World Cup 2019: Rugby scores decomposition

August 10, 2019

With the Rugby World Cup 2019 Japan starting on 20th September, I thought I’d take a look at the tournament from a few different statistical angles. For this post I’ll be looking at the problem: given a rugby score, how can we decompose it into possible combinations of tries, conversions, penalties and dropped goals? Context …
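
The decomposition problem stated in this excerpt can be sketched with a small brute-force search. This is only an illustration, not the post's own method (the excerpt is truncated before it describes one). It assumes standard rugby union scoring (try 5, conversion 2, penalty or dropped goal 3) and that a conversion can only follow a try. Because penalties and dropped goals are worth the same, the hypothetical `decompose_score` helper groups them as "kicks".

```python
def decompose_score(score):
    """Return every (tries, conversions, kicks) triple that adds up to `score`.

    Assumptions (not taken from the post): a try is worth 5 points, a
    conversion 2, and a penalty or dropped goal 3. `kicks` counts penalties
    and dropped goals together because both are worth 3 points.
    """
    combos = []
    for tries in range(score // 5 + 1):
        # A conversion is only awarded after a try, so conversions <= tries.
        for conversions in range(tries + 1):
            remainder = score - 5 * tries - 2 * conversions
            if remainder >= 0 and remainder % 3 == 0:
                combos.append((tries, conversions, remainder // 3))
    return combos


if __name__ == "__main__":
    for total in (7, 10, 23):
        print(total, decompose_score(total))
```

Each returned triple could be expanded further by splitting the kicks between penalties and dropped goals, which adds `kicks + 1` variants per triple.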

## Permutation Test for NHST of 2 Samples in R

August 9, 2019

As engineers, it is not uncommon to be asked to determine whether or not two different configurations of a product perform the same. Perhaps we are asked to compare the durability of a next-generation prototype to the current generation. Sometimes we are testing the flexibility of our device versus a competitor for marketing purposes. Maybe we identify a new...
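
As a rough illustration of the technique named in the title, here is a minimal two-sample permutation test. It is written in Python with the standard library only, not in R as in the original post. The function name, the choice of test statistic (absolute difference in means), and the default number of permutations are assumptions made for this sketch.

```python
import random


def permutation_test(a, b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in means.

    Returns an estimated p-value: the share of random relabelings whose
    absolute mean difference is at least as large as the observed one.
    """
    rng = random.Random(seed)
    n_a = len(a)
    observed = abs(sum(a) / n_a - sum(b) / len(b))
    pooled = list(a) + list(b)

    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # random reassignment of the group labels
        group_a, group_b = pooled[:n_a], pooled[n_a:]
        diff = abs(sum(group_a) / n_a - sum(group_b) / len(group_b))
        if diff >= observed:
            extreme += 1

    # Add one to numerator and denominator so the estimate is never exactly 0.
    return (extreme + 1) / (n_permutations + 1)


if __name__ == "__main__":
    current = [12.1, 11.8, 12.5, 12.0, 11.9]    # made-up durability values
    prototype = [12.9, 13.1, 12.7, 13.4, 12.8]  # made-up durability values
    print(permutation_test(current, prototype))
```

A small p-value suggests that the observed difference would be unusual if the group labels were interchangeable, which is the null hypothesis this test encodes.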

## Correlation is not transitive, in general at least: A simulation approach

Let $$\rho_{XY}$$ be the correlation between the stochastic variables $$X$$ and $$Y$$ and similarly for $$\rho_{XZ}$$ and $$\rho_{YZ}$$. If we know two of these, can we say anything about the third? In a recent blog post I dealt with the problem mathematically and I used the concept of a partial correlation coefficient. Here I will take a simulation approach. First z is simulated. Then...
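
A simulation along these lines can be sketched as follows. The excerpt only says that z is simulated first; the construction of X and Y below (Z plus and minus the same independent noise term) is an assumption chosen for illustration, not necessarily the post's own. With equal variances for Z and the noise, X and Y are each correlated with Z at about $$1/\sqrt{2}$$, yet uncorrelated with each other.

```python
import math
import random


def pearson(u, v):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    ss_u = sum((a - mean_u) ** 2 for a in u)
    ss_v = sum((b - mean_v) ** 2 for b in v)
    return cov / math.sqrt(ss_u * ss_v)


def simulate_correlations(n=20_000, seed=1):
    """Simulate Z first, then X = Z + E and Y = Z - E (assumed construction).

    Returns the sample correlations (r_xy, r_xz, r_yz).
    """
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]
    e = [rng.gauss(0.0, 1.0) for _ in range(n)]  # noise independent of z
    x = [zi + ei for zi, ei in zip(z, e)]
    y = [zi - ei for zi, ei in zip(z, e)]
    return pearson(x, y), pearson(x, z), pearson(y, z)


if __name__ == "__main__":
    r_xy, r_xz, r_yz = simulate_correlations()
    print(f"r_xy={r_xy:.3f}  r_xz={r_xz:.3f}  r_yz={r_yz:.3f}")
```

Here Cov(X, Y) = Var(Z) - Var(E) = 0, which is why two clearly positive correlations with Z do not force a positive correlation between X and Y.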

## Where do p-values come from? Fundamental concepts and simulation approach

tl;dr: P-values are tail probabilities calculated from the sampling distribution of a sample-based statistic. This sampling distribution will depend on the size of the sample, the statistic being calculated and assumptions about the random population from which the data could have been sampled. For a few cases, analytical p-values are available and, for the rest of cases, approximations based...
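
The idea of a p-value as a tail probability of a sampling distribution can be illustrated with a short simulation. The sketch below is an assumption-laden example rather than the post's own code: it takes the statistic to be the sample mean, assumes the null population is standard normal, and uses a hypothetical function name and defaults.

```python
import random


def simulated_p_value(observed_mean, n, n_sims=20_000, seed=42):
    """Estimate a two-sided p-value for a sample mean by simulation.

    Assumes the null hypothesis that the data are n independent draws from
    a standard normal population. The sampling distribution of the mean is
    approximated by drawing many samples of size n under that null.
    """
    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_sims):
        sample_mean = sum(rng.gauss(0.0, 1.0) for _ in range(n)) / n
        # Count simulated means at least as far from 0 as the observed one.
        if abs(sample_mean) >= abs(observed_mean):
            extreme += 1
    return extreme / n_sims


if __name__ == "__main__":
    # An observed mean of 0.4 with n = 25 is two standard errors from 0.
    print(simulated_p_value(0.4, 25))
```

For this particular case an analytical answer exists (the two-sided normal tail beyond two standard errors, roughly 0.0455), so the simulated value can be checked against it. Changing `n` or the assumed population changes the sampling distribution and therefore the p-value.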

## Germination Project Fellows come to Penn

August 9, 2019

I was recently fortunate to be invited to speak with an impressive group of high-school students as a part of the Germination Project. They came to Penn to learn about innovation in health care and I spoke with them about how we’re using Data Science to improve patient outcomes.

## Deploying a ML model in R

August 9, 2019

As a product of a series of training sessions Draper & Dash have been undertaking on Machine Learning using R – a candidate from the course asked “how do you split a model to later be used in a live / production environment?”. This post aims to answer that question: I hope you find this...

## How to generate meaningful fake data for learning, experimentation and teaching

August 9, 2019

The Problem There’s one thing about R that a lot of people have as their Top-of-Mind. That’s the black-and-white plot of the iris dataset, which is definitely a hugely boring view of R. That’s boring because of aesthetics but also because it’s such a clichéd example used over and over again. The other problem is finding the right dataset...

## A weighty matter

August 8, 2019

When we were testing random correlations and weightings in our last post on diversification, we discovered that randomizing correlations often increased portfolio risk. Then, when we randomized stock weightings on top of our random correlations, we began to see more cases in which one would have been better off not being diversified. In other words, the percentage of portfolios whose...

## Correlation is not transitive, in general at least

Update Aug 10, 2019: I wrote a new blog post about the same topic as below but using a simulation approach. Update Aug 27, 2019: Minor change in how equations are solved (from version 0.9.0.9122). Let $$\rho_{XY}$$ be the correlation between the stochastic variables $$X$$ and $$Y$$ and similarly for $$\rho_{XZ}$$ and $$\rho_{YZ}$$. If we know two of these, can we say anything...