# Small gotcha when using negative indexing

Negative indexing is a commonly used method in R to drop elements from a vector or rows/columns from a matrix that the user does not want. For example, the code below drops the third column from the matrix M: M <- matrix(1:9, nrow = 3) M # [,1] [,2] [,3] # [1,] 1 4 7 # [2,] … Continue reading
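As a minimal sketch of the column drop described above (the name `M_dropped` and the vector `v` are introduced here for illustration; this is not the truncated code from the post):

```r
# Build the 3 x 3 matrix from the excerpt.
M <- matrix(1:9, nrow = 3)

# A negative column index drops that column and keeps the rest.
M_dropped <- M[, -3]
dim(M_dropped)   # 3 2

# The same idea on a vector: drop the second element.
v <- c(10, 20, 30)
v[-2]            # 10 30
```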

# Introducing cvwrapr for your cross-validation needs

TLDR: I’ve written an R package, cvwrapr, that helps users cross-validate hyperparameters. The code base is largely extracted from the glmnet package. The R package is available for download from GitHub, and contains two vignettes which demonstrate how to use it. Comments, feedback and bug … Continue reading

# What is the Tukey loss function?

The Tukey loss function, also known as Tukey’s biweight function, is a loss function that is used in robust statistics. Tukey’s loss is similar to Huber loss in that it demonstrates quadratic behavior near the origin. However, it is even more insensitive to … Continue reading
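The standard form of the biweight loss can be sketched in a few lines of R. The function name `tukey_loss` is hypothetical, and the default tuning constant `c = 4.685` is a conventional choice assumed here rather than something stated in the excerpt:

```r
# Tukey's biweight loss:
#   rho(r) = (c^2 / 6) * (1 - (1 - (r / c)^2)^3)  if |r| <= c
#   rho(r) = c^2 / 6                              otherwise
tukey_loss <- function(r, c = 4.685) {
  inside <- abs(r) <= c
  # Start from the constant value used for large residuals.
  loss <- rep(c^2 / 6, length(r))
  # Overwrite the entries where the residual is within the threshold.
  loss[inside] <- (c^2 / 6) * (1 - (1 - (r[inside] / c)^2)^3)
  loss
}

tukey_loss(c(0, 0.01, 2, 100))
```

For small residuals the value is approximately `r^2 / 2`, which is the quadratic behavior near the origin; beyond `c` the loss is flat, so very large residuals contribute no additional penalty.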

# Is the EPL getting more unequal?

I recently heard that Manchester City were so far ahead in the English Premier League (EPL) that the race for first was basically over, even though there were still about 6-7 more games to go (out of a total of 38 games). At the other end of the table, I heard that Sheffield United were so far … Continue reading

# Estimating pi using the method of moments

Happy Pi Day! I don’t encounter π very much in my area of statistics, so this post might seem a little forced… In this post, I’m going to show one way to estimate π. The starting point is an integral identity. There are two ways to see why this identity is true. The first is that … Continue reading
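The excerpt does not show the identity the post starts from, so the sketch below assumes one well-known candidate, the quarter-circle area identity 4 ∫₀¹ √(1 − x²) dx = π, which may or may not be the one the post uses. Under that assumption, if X is uniform on (0, 1) then E[4√(1 − X²)] = π, and matching this moment with a sample mean gives an estimate (the names `n`, `x` and `pi_hat` are illustrative):

```r
set.seed(314)

# Draw uniform samples on (0, 1).
n <- 1e5
x <- runif(n)

# The sample mean of 4 * sqrt(1 - x^2) estimates the corresponding moment.
pi_hat <- mean(4 * sqrt(1 - x^2))
pi_hat
```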

# What is a sunflower plot?

A sunflower plot is a type of scatterplot which tries to reduce overplotting. When there are multiple points that have the same (x, y) values, sunflower plots plot just one point there, but have little edges (or “petals”) coming out from the point to indicate how many points are really … Continue reading
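Base R ships a `sunflowerplot()` function in the graphics package. The toy data below is made up for illustration; `xyTable()` is used only to show the duplicate counts that the petals encode:

```r
# Toy data with repeated (x, y) pairs.
x <- c(1, 1, 1, 2, 2, 3)
y <- c(1, 1, 1, 2, 2, 3)

# Count how many observations share each distinct (x, y) location.
counts <- xyTable(x, y)
counts$number

# Draw the plot on a null PDF device so that no file is written.
pdf(NULL)
sunflowerplot(x, y, main = "Sunflower plot of toy data")
invisible(dev.off())
```

Locations with more than one observation are drawn with one petal per observation, while singletons are drawn as ordinary points.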

# covidcast package for COVID-19-related data

(This is a PSA post, where I share a package that I think might be of interest to the community but I haven’t looked too deeply into myself.) Today I learnt of the covidcast R package, which provides access to the COVIDcast Epidata API published by the Delphi group at Carnegie Mellon … Continue reading

# The Mendoza line

The Mendoza Line is a term from baseball. Named after Mario Mendoza, it refers to the threshold of incompetent hitting. It is frequently taken to be a batting average of .200, although all the sources I looked at made sure to note that Mendoza’s career average was actually a little better: … Continue reading

# glmnet v4.1: regularized Cox models for (start, stop] and stratified data

My latest work on the glmnet package has just been pushed to CRAN! In this release (v4.1), we extend the scope of regularized Cox models to include (start, stop] data and strata variables. In addition, we provide the survfit method for plotting survival curves based on the model (as the survival … Continue reading

# Simulating the dice game “Toss Up!” in R

Toss Up! is a very simple dice game that I’ve always wanted to simulate but never got around to doing so (until now!). This post outlines how to simulate a Toss Up! game in R, as well as how to evaluate the effectiveness of different game strategies. All the code for this blog post is … Continue reading

# Exploring the game “First Orchard” with simulation in R

My daughter received the board game First Orchard as a Christmas present and she’s hooked on it so far. In playing the game with her, a few probability/statistics questions came to mind. This post outlines how I answered some of them using simulation in R. All code for this blog post can be … Continue reading

# A shiny app for exploratory data analysis

I recently learnt how to build basic R Shiny apps. To practice using Shiny, I created a simple app that you can use to perform simple exploratory data analysis. You can use the app here to play around with the diamonds dataset from the ggplot2 package. To use the app for your own dataset, download … Continue reading

# Some notes when using dot-dot-dot (…) in R

When writing functions in R, the ... argument is a special argument useful for passing an unknown number of arguments to another function. This is widely used in R, especially in generic functions such as plot(), print(), and apply(). Hadley Wickham’s Advanced R has a nice short section on the … Continue reading
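A small sketch of the two most common uses of `...`: forwarding extra arguments to another function, and capturing them as a list. The function names `summarise_values` and `count_dots` are hypothetical:

```r
# Forward any extra arguments (e.g. na.rm, trim) straight to mean().
summarise_values <- function(x, ...) {
  mean(x, ...)
}

summarise_values(c(1, 2, NA), na.rm = TRUE)  # 1.5

# Capture the dots as a list to inspect what was passed.
count_dots <- function(...) {
  args <- list(...)
  length(args)
}

count_dots(1, "a", TRUE)  # 3
```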

# How is the F-statistic computed in anova() when there are multiple models?

Background: In the linear regression context, it is common to use the F-test to test whether a proposed regression model fits the data well. Say we have several predictors, and we are comparing the model fit for linear regression where some coefficients are allowed to vary freely but the rest are fixed at zero, vs. linear … Continue reading
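For the simplest case of two nested models, the F-statistic reported by `anova()` can be reproduced by hand. This sketch uses the built-in `mtcars` data and arbitrary predictors chosen for illustration; it covers only the two-model comparison, not the multiple-model case the post goes on to discuss:

```r
# Two nested linear models on a built-in dataset.
fit_small <- lm(mpg ~ wt, data = mtcars)
fit_big <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# Residual sums of squares and residual degrees of freedom.
rss_small <- sum(resid(fit_small)^2)
rss_big <- sum(resid(fit_big)^2)
df_small <- df.residual(fit_small)
df_big <- df.residual(fit_big)

# F = (drop in RSS per extra parameter) / (residual variance of the larger model).
f_manual <- ((rss_small - rss_big) / (df_small - df_big)) / (rss_big / df_big)

# Compare with the value in the second row of the anova() table.
f_anova <- anova(fit_small, fit_big)$F[2]
all.equal(f_manual, f_anova)
```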

# Attributes in R

In R, objects are allowed to have attributes, which are a way for users to attach additional information to an R object. There are a few reasons why one might want to use attributes. One reason that I encountered recently was to ensure that the type of object returned from a function remains consistent … Continue reading
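A minimal sketch of setting and reading attributes in base R. The attribute name `"units"` and its value are made up for illustration:

```r
# Attach a custom attribute to a plain integer vector.
x <- 1:5
attr(x, "units") <- "cm"

# Read one attribute back, or list all of them.
attr(x, "units")
attributes(x)

# Subsetting keeps only names, dim and dimnames, so custom attributes are lost.
y <- x[1:2]
attr(y, "units")   # NULL
```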