# Articles by Jacob Simmering

### Poor Donald – his tweets keep getting more negative

February 10, 2017 |

Last summer, David Robinson did this interesting text analysis of Donald Trump’s tweets and found that the angrier ones came from Android (which Trump is known to use). But he didn’t consider how Trump’s emotional state varies over time and he certainly couldn’t have considered ... [Read more...]

January 23, 2017 |

A handy little trick I picked up today when using readr. Some background: I needed a mapping between ZIP Code Tabulation Areas and counties (to link to some urban/rural data). The Census Bureau provides a CSV-style table that includes information about each of the ZCTAs (e.g., size, ... [Read more...]

### Inter-ocular trauma test

November 17, 2016 |

I’ve recently been thinking about the role statistics can play in answering questions. I think it came up on the NSSD podcast a few weeks ago. Basically, problems can be divided into three classes: those that don’t need statistics because the answer is obvious (problems without much ... [Read more...]

### Using tidytext to make sentiment analysis easy

November 15, 2016 |

Last week I discovered the R package tidytext and its very nice e-book detailing usage. Julia Silge and David Robinson have significantly reduced the effort it takes for me to “grok” text mining by making it “tidy.” It certainly helped that a lot of the examples are from Pride and ... [Read more...]

### Easy Cross Validation in R with `modelr`

November 11, 2016 |

When estimating a model, the quality of the model fit will always be higher in-sample than out-of-sample. A model will always fit the data that it is trained on, warts and all, and may use those warts and statistical noise to make predictions. As a result, a model that performs ... [Read more...]

### Parallel Simulation of Heckman Selection Model

April 22, 2015 |

One of the, if not the, fundamental problems in observational data analysis is the estimation of the value of the unobserved choice. If the \(i^{\text{th}}\) unit chooses the value of \(t\) on the basis of some factors \(\mathbf{x_i}\), which may ... [Read more...]

### The Problem with Propensity Scores

April 14, 2015 |

Are Propensity Scores Useful? Effect estimation for treatments using observational data isn't always straightforward. For example, it is very common that patients who are treated with a certain medication or procedure are healthier than those who are not treated. Those who aren't treated may not be treated due to ... [Read more...]

### Frequentist German Tank Problem

March 20, 2014 |

The German Tank Problem: The Frequentist Way Many things are given a serial number and often that serial number, logically, starts at 1 and for each new unit is increased by 1. For example, German tanks in World War II had several parts with serial numbers. By collecting the value of these ... [Read more...]

### Stop using bivariate correlations for variable selection

March 19, 2014 |

Something I've never understood is the widespread calculation and reporting of univariate and bivariate statistics in applied work, especially when it comes to model selection. Bivariate statistics are, at best, useless for multivariate model selection and, at worst, harmful. Since nearly all questions ... [Read more...]

### Bayesian Search Models

March 13, 2014 |

Bayesian Search Theory The US had a pretty big problem on their hands in 1966. Two planes had hit each other during an in-flight refueling and crashed. Normally, this would be an unfortunate thing and terrible for the families of those involved in the crash but otherwise fairly limited in importance. ... [Read more...]

### Instrumental Variables Simulation

January 9, 2014 |

Instrumental Variables Instrumental variables are an incredibly powerful tool for dealing with unobserved heterogeneity within the context of regression but the language used to define them is mind-bending. Typically, you hear something along the lines of “an instrumental variable is a variable that is correlated with x but uncorrelated with ... [Read more...]

### Penalizing P Values

November 19, 2013 |

Ioannidis' paper suggesting that most published results in medical research are not true is now high profile enough that even my dad, an artist who wouldn't know a test statistic if it hit him in the face, knows about it. It has even shown up recently in ... [Read more...]

### TV Ratings Myths

August 28, 2013 |

TV Show Cancellations: Myths and Models TV shows are amazing ways to waste time and, on occasion, the story is so good that you actually start to care. The problem is that some shows get cancelled before they jump the shark. Classic examples are shows like Firefly or Arrested Development. ... [Read more...]

### Fixing My Internet With R and Python

February 20, 2013 |

Last summer, I had some internet connectivity problems. Specifically, I would have massive latency issues that affected my conversations on Skype and my efforts at online gaming, which are relatively pathetic under the best of circumstances. It was driving me up a wall and I couldn't figure it out. It hadn't occurred ... [Read more...]

### Taking Expectations to the Next Level

January 31, 2013 |

Higher Expectations I came across this post on Thursday and found it to be quite interesting. Clearly rental prices vary according to where you live. That isn't too surprising. I started thinking a bit more about it and thought that Boston and the nearby communities would have to have some ... [Read more...]

January 30, 2013 |

A Problem A major problem in secondary data analysis is that you didn't get to decide what data was collected. Let's say you were interested in how many times a student has read the Twilight books. Specifically, you want to know how effective the ads for the movies and books ... [Read more...]

### How slow is R really?

January 28, 2013 |

One thing you always hear about R is how slow it is, especially when the code is not well vectorized or includes loops. But R is an interpreted language and its strong suit really isn't speed; rather, its comparative advantage is the 4,284 packages o... [Read more...]