Articles by David Robinson

Understanding Bayesian A/B testing (using baseball statistics)

May 23, 2016 | 0 Comments

Previously in this series: Understanding the beta distribution (using baseball statistics); Understanding empirical Bayes estimation (using baseball statistics); Understanding credible intervals (using baseball statistics); Understanding the Bayesian approach to false discovery rates (using baseball statistics). Who is a better batter: Mike Piazza or Hank Aaron? Well, Mike Piazza has a ... [Read more...]

The monetizr package: make money on your open source R packages

March 31, 2016 | 0 Comments

I’ve had the great privilege to be a small part of the R open source community, contributing packages like broom, gganimate, fuzzyjoin, and ggfreehand. In the process I’ve become friends and colleagues with brilliant statisticians and data scientists and learned to engage with data in powerful ways. But ... [Read more...]

How to replace a pie chart

March 14, 2016 | 0 Comments

Yesterday a family member forwarded me a Wall Street Journal interview titled What Data Scientists Do All Day At Work. The title intrigued me immediately, partly because I find myself explaining that same topic somewhat regularly. I wasn’t disappointed in the interview: General Electric’s Dr. Narasimhan gave insightful ... [Read more...]

Why I use ggplot2

February 12, 2016 | 0 Comments

If you’ve read my blog, taken one of my classes, or sat next to me on an airplane, you probably know I’m a big fan of Hadley Wickham’s ggplot2 package, especially compared to base R plotting. Not everyone agrees. Among the anti-ggplot2 crowd is JHU Professor Jeff ... [Read more...]

Analyzing networks of characters in ‘Love Actually’

December 25, 2015 | 0 Comments

Every Christmas Eve, my family watches Love Actually. Objectively it’s not a particularly, er, good movie, but it’s well-suited for a holiday tradition. (Vox has got my back here). Even on the eighth or ninth viewing, it’s impressive what an intricate network of characters it builds. This ... [Read more...]

Modeling gene expression with broom: a case study in tidy analysis

November 25, 2015 | 0 Comments

Previously in this series: Cleaning and visualizing genomic data: a case study in tidy analysis. In the last post, we examined an available genomic dataset from Brauer et al 2008 about yeast gene expression under nutrient starvation. We learned to tidy it with the dplyr and tidyr packages, and saw how ... [Read more...]

Cleaning and visualizing genomic data: a case study in tidy analysis

November 19, 2015 | 0 Comments

I recently ran into a question looking for a case study in genomics, particularly for teaching ggplot2, dplyr, and the tidy data framework developed by Hadley Wickham. There exist many great resources for learning how to analyze genomic data using Bioconductor tools, including these workflows and package vignettes. But case ... [Read more...]

What are the most polarizing programming languages?

November 3, 2015 | 0 Comments

Users on Stack Overflow Careers, our site for matching developers with jobs, can create customized profiles (“CVs”) to show to prospective employers. As part of these profiles, they have the option of specifying specific technologies they like or dislike. This produces an interesting and unusual opportunity for our data team ... [Read more...]

Understanding the Bayesian approach to false discovery rates (using baseball statistics)

November 2, 2015 | 0 Comments

Previously in this series: Understanding the beta distribution (using baseball statistics); Understanding empirical Bayes estimation (using baseball statistics); Understanding credible intervals (using baseball statistics). In my last few posts, I’ve been exploring how to perform estimation of batting averages, as a way to demonstrate empirical Bayesian methods. We’ve ... [Read more...]

Understanding credible intervals (using baseball statistics)

October 20, 2015 | 0 Comments

Previously in this series: Understanding the beta distribution (using baseball statistics); Understanding empirical Bayes estimation (using baseball statistics). In my last post, I explained the method of empirical Bayes estimation, a way to calculate useful proportions out of many pairs of success/total counts (e.g. 0/1, 3/10, 235/1000). I used the example ... [Read more...]

Understanding empirical Bayes estimation (using baseball statistics)

September 30, 2015 | 0 Comments

Which of these two proportions is higher: 4 out of 10, or 300 out of 1000? This sounds like a silly question. Obviously 4/10 = 0.4, which is greater than 300/1000 = 0.3. But suppose you were a baseball recruiter, trying to decide which of two potential players is a better batter based on how many hits they get. One ... [Read more...]

Is Bayesian A/B Testing Immune to Peeking? Not Exactly

August 20, 2015 | 0 Comments

Since I joined Stack Exchange as a Data Scientist in June, one of my first projects has been reconsidering the A/B testing system used to evaluate new features and changes to the site. Our current approach relies on computing a p-value to measure our confidence in a new feature. ... [Read more...]

Slides from my talk on the broom package

April 13, 2015 | 0 Comments

This weekend I gave a presentation on my broom package for tidying model objects (see my introduction here) at the UP-STAT 2015 conference at SUNY Geneseo. I’m sharing the slides here, along with some highlights below. I first explained how broom fits with other tidy tools such as dplyr, tidyr ... [Read more...]

broom: a package for tidying statistical models into data frames

March 19, 2015 | 0 Comments

The concept of “tidy data”, as introduced by Hadley Wickham, offers a powerful framework for data manipulation, analysis, and visualization. Popular packages like dplyr, tidyr and ggplot2 take great advantage of this framework, as explored in several recent posts by others. But there’s an important step in a tidy ... [Read more...]

View package downloads over time with Shiny

March 5, 2015 | 0 Comments

Almost everyone with an R package in CRAN wonders how often it’s installed and used. Two years ago RStudio kindly started offering anonymized logs of their downloads from their CRAN mirror, which allows one to graph the number of downloads over time. Much easier than downloading and processing all ... [Read more...]

Introducing stackr: An R package for querying the Stack Exchange API

February 3, 2015 | 0 Comments

There’s no end of interesting data analyses that can be performed with Stack Overflow and the Stack Exchange network of Q&A sites. Earlier this week I posted a Shiny app that visualizes the personalized prediction data from their machine learning system, Providence. I’ve also looked at whether ... [Read more...]

K-means clustering is not a free lunch

January 15, 2015 | 0 Comments

I recently came across this question on Cross Validated, and I thought it offered a great opportunity to use R and ggplot2 to explore, in depth, the assumptions underlying the k-means algorithm. The question, and my response, follow. K-means is a widely used method in cluster analysis. In my understanding, ... [Read more...]