# Articles by Edwin Chen

### Bayesian Confidence Intervals: Obama’s ‘That’-Addition and Informality

May 1, 2011

No “That” Left Behind? I came across a post on Language Log last week giving some evidence that Obama tends to add the word “that” to the prepared versions of his speeches. For example, in a recent speech at George Washington University, …

### Filtering for English Tweets: Unsupervised Language Detection on Twitter

April 30, 2011

(See a demo here.) While working on a Twitter sentiment analysis project, I ran into the problem of needing to filter out all non-English tweets. (Asking the Twitter API for English-only tweets doesn’t seem to work, as it nonetheless returns tweets in Spanish, Portuguese, Dutch, Russian, and a couple ...
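The teaser above only poses the filtering problem; as a rough illustration of English-vs-not filtering (and not the unsupervised method the post goes on to develop), a crude stopword-overlap heuristic already catches many obvious cases. The stopword set and threshold here are invented for this sketch:

```python
# Hypothetical stopword set and threshold, chosen only for this sketch.
ENGLISH_STOPWORDS = {"the", "is", "and", "to", "of", "a", "in", "it", "you", "that"}

def looks_english(tweet, threshold=0.15):
    # flag a tweet as likely English if enough of its words are common English stopwords
    words = tweet.lower().split()
    if not words:
        return False
    hits = sum(w.strip(".,!?") in ENGLISH_STOPWORDS for w in words)
    return hits / len(words) >= threshold

print(looks_english("the weather is lovely today"))  # True
print(looks_english("el clima está hermoso hoy"))    # False
```

Short tweets with few function words will slip through a rule like this, which is part of why a proper language model is worth building.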

### Choosing a Machine Learning Classifier

April 26, 2011

How do you know what machine learning algorithm to choose for your classification problem? Of course, if you really care about accuracy, your best bet is to test out a couple different ones (making sure to try different parameters within each algorithm as well), and select the best one by ...
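The advice above — try several classifiers under cross-validation and keep the winner — can be sketched in a few lines. Everything here (the toy 1-D dataset, the majority-class baseline, the 1-nearest-neighbor rule) is invented for illustration:

```python
def majority_classifier(train):
    # constant baseline: always predict the most common training label
    labels = [y for _, y in train]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority

def one_nn_classifier(train):
    # predict the label of the nearest training point
    return lambda x: min(train, key=lambda p: abs(p[0] - x))[1]

def cross_val_accuracy(build, data, k=5):
    # plain k-fold cross-validation: train on k-1 folds, score on the held-out fold
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test = folds[i]
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        clf = build(train)
        scores.append(sum(clf(x) == y for x, y in test) / len(test))
    return sum(scores) / k

# toy separable data: feature x, label 1 iff x >= 10
data = [(x, int(x >= 10)) for x in range(20)]
for name, build in [("majority", majority_classifier), ("1-NN", one_nn_classifier)]:
    print(name, cross_val_accuracy(build, data))  # 1-NN should beat the constant baseline
```

The same harness extends to real classifiers and to a grid of parameters per algorithm — exactly the "try a couple different ones" loop.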

### Kickstarter Data Analysis: Success and Pricing

April 25, 2011

Kickstarter is an online crowdfunding platform for launching creative projects. When starting a new project, project owners specify a deadline and the minimum amount of money they need to raise. They receive the money (less a transaction fee) only if …

### A Mathematical Introduction to Least Angle Regression

April 20, 2011

(For a layman’s introduction, see here.) Least Angle Regression (aka LARS) is a model selection method for linear regression (when you’re worried about overfitting or want your model to be easily interpretable). To motivate it, let’s consider some other model selection methods: Forward selection starts with no ...
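Forward selection, the first of the baselines the post compares LARS against, is easy to sketch: greedily add whichever feature most reduces the residual sum of squares. The toy design matrix below (no intercept; the response depends only on columns 0 and 2) is invented for illustration:

```python
def lstsq(X, y):
    # solve the normal equations (X^T X) b = X^T y by Gaussian elimination
    k = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    b = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    for i in range(k):
        p = max(range(i, k), key=lambda r: abs(A[r][i]))  # partial pivoting
        A[i], A[p] = A[p], A[i]
        b[i], b[p] = b[p], b[i]
        for r in range(i + 1, k):
            f = A[r][i] / A[i][i]
            A[r] = [a - f * c for a, c in zip(A[r], A[i])]
            b[r] -= f * b[i]
    beta = [0.0] * k
    for i in reversed(range(k)):
        beta[i] = (b[i] - sum(A[i][j] * beta[j] for j in range(i + 1, k))) / A[i][i]
    return beta

def rss(X, y, cols):
    # residual sum of squares after regressing y on the chosen columns
    sub = [[row[c] for c in cols] for row in X]
    beta = lstsq(sub, y)
    return sum((yi - sum(b * x for b, x in zip(beta, r))) ** 2
               for r, yi in zip(sub, y))

def forward_selection(X, y, n_features):
    # greedily add the feature that most reduces the RSS
    chosen = []
    while len(chosen) < n_features:
        best = min((c for c in range(len(X[0])) if c not in chosen),
                   key=lambda c: rss(X, y, chosen + [c]))
        chosen.append(best)
    return chosen

# toy data: y depends on columns 0 and 2 only; column 1 is noise-like
X = [[i, (i * 7) % 5, i * i] for i in range(10)]
y = [3 * a + 2 * c for a, _, c in X]
print(forward_selection(X, y, 2))  # → [2, 0]: strongest predictor first
```

The greediness is the point: once a feature is in, it stays in, which is exactly the rigidity LARS relaxes.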

### Introduction to Cointegration and Pairs Trading

April 15, 2011

Introduction Suppose you see two drunks (i.e., two random walks) wandering around. The drunks don’t know each other (they’re independent), so there’s no meaningful relationship between their paths. But suppose instead you have a drunk walking with her dog. This …
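The drunk-and-dog picture can be simulated directly: the difference between two independent random walks wanders without bound, while the drunk-minus-dog spread stays in a narrow band. A minimal sketch (walk length and noise scale chosen arbitrarily):

```python
import random

random.seed(0)

def random_walk(n):
    # a drunk: cumulative sum of independent Gaussian steps
    x, path = 0.0, []
    for _ in range(n):
        x += random.gauss(0, 1)
        path.append(x)
    return path

n = 5000
drunk = random_walk(n)
other_drunk = random_walk(n)                   # an unrelated drunk
dog = [x + random.gauss(0, 1) for x in drunk]  # tethered: drunk plus stationary noise

def spread_range(a, b):
    # how widely the difference of the two paths wanders
    diffs = [x - y for x, y in zip(a, b)]
    return max(diffs) - min(diffs)

print("two independent drunks:", round(spread_range(drunk, other_drunk), 1))
print("drunk and her dog:     ", round(spread_range(drunk, dog), 1))
```

The second spread is the mean-reverting series a pairs trade bets on; the first is just another random walk.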

### Hacker News Analysis

March 13, 2011

I was playing around with the Hacker News database Ronnie Roller made (thanks!), so I thought I’d post some of my findings. Activity on the Site My first question was: how has activity on the site increased over time? I …

### Piiikaaachuuuuuu vs. KHAAAAAN!

March 13, 2011

This is a fun image I found on Neil Kodner’s blog. But I’ve never actually watched any of the Star Trek movies, so I decided to recreate the graph with Pikachu instead. Here’s a smoothed version to better compare the counts …

### A Kernel Density Approach to Outlier Detection

March 13, 2011

I describe a kernel density approach to outlier detection on small datasets. In particular, my model is the set of prices for a given item that can be found online. Introduction Suppose you’re searching online for the cheapest place to …
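As a hedged sketch of the kernel-density idea (with made-up prices and an arbitrary bandwidth, not the post's actual model): score each price by the density the *other* prices assign to it, and flag the one with the lowest score:

```python
import math

def gaussian_kde_density(x, points, bandwidth):
    # average of Gaussian kernels centered at each other observed price
    return sum(
        math.exp(-((x - p) ** 2) / (2 * bandwidth ** 2))
        for p in points
    ) / (len(points) * bandwidth * math.sqrt(2 * math.pi))

# hypothetical listings for one item; one of them is suspicious
prices = [19.99, 20.49, 21.00, 20.75, 19.50, 49.99]
scores = {p: gaussian_kde_density(p, [q for q in prices if q != p], 2.0)
          for p in prices}
outlier = min(scores, key=scores.get)
print(outlier)  # 49.99 — the price with the lowest leave-one-out density
```

Leaving each point out of its own density estimate matters on small datasets; otherwise every point props up its own score.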

### Eigensheep

March 13, 2011

Aaron Koblin’s Sheep Market visualization is an awesome use of Mechanical Turk. But it’d be even more awesome if the grid were ordered, so inspired by the use of eigenfaces in facial recognition, I decided to try projecting the sheep …

### Counting Clusters

March 13, 2011

Given a set of numerical datapoints, we often want to know how many clusters the datapoints form. Two practical algorithms for determining the number of clusters are the gap statistic and the prediction strength. Gap Statistic The gap statistic algorithm …
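A minimal sketch of the gap-statistic half of that comparison (prediction strength is not shown): for each candidate k, compare the data's within-cluster dispersion to that of uniform reference samples, and pick the k with the largest gap. The 1-D k-means and the toy two-cluster data here are simplifications invented for illustration:

```python
import math
import random

def kmeans_1d(xs, k, iters=25):
    # Lloyd's algorithm on 1-D data, with deterministic quantile-style init
    s = sorted(xs)
    centers = [s[(i * len(s)) // k + len(s) // (2 * k)] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in xs:
            clusters[min(range(k), key=lambda i: abs(x - centers[i]))].append(x)
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return clusters

def log_dispersion(clusters):
    # log of the total within-cluster sum of squares
    w = 0.0
    for c in clusters:
        if c:
            m = sum(c) / len(c)
            w += sum((x - m) ** 2 for x in c)
    return math.log(w) if w > 0 else float("-inf")

def gap_statistic(xs, k, n_refs=20):
    # gap(k): expected dispersion under a uniform reference, minus the data's
    lo, hi = min(xs), max(xs)
    ref = [log_dispersion(kmeans_1d([random.uniform(lo, hi) for _ in xs], k))
           for _ in range(n_refs)]
    return sum(ref) / n_refs - log_dispersion(kmeans_1d(xs, k))

random.seed(1)
data = ([random.gauss(0, 0.3) for _ in range(50)]
        + [random.gauss(5, 0.3) for _ in range(50)])  # two well-separated clusters
gaps = {k: gap_statistic(data, k) for k in range(1, 5)}
best_k = max(gaps, key=gaps.get)
print(best_k)  # 2
```

The full algorithm also computes a standard error for each gap and picks the smallest k within one standard error of the best; the argmax above is the simplest version of that rule.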

### Topological Combinatorics and the Evasiveness Conjecture

March 13, 2011

The Kahn, Saks, and Sturtevant approach to the Evasiveness Conjecture (see the original paper here) is an epic application of pure mathematics to computer science. I’ll give an overview of the approach here, and probably try to add some more information on the problem in other posts. tl;dr ...

### Netflix Prize Summary: Scalable Collaborative Filtering with Jointly Derived Neighborhood Interpolation Weights

March 13, 2011

(Way back when, I went through all the Netflix prize papers. I’m now (very slowly) trying to clean up my notes and put them online. Eventually, I hope to have a more integrated tutorial, but here’s a rough draft for now.) This is a summary of Bell and ...

### Layman’s Introduction to Measure Theory

March 13, 2011

Measure theory studies ways of generalizing the notions of length/area/volume. Even in 2 dimensions, it might not be clear how to measure the area of even a fairly tame shape, much less the “area” of weirder shapes in higher dimensions or different spaces entirely. For example, suppose you ...

### Netflix Prize Summary: Factorization Meets the Neighborhood

March 13, 2011

(Way back when, I went through all the Netflix prize papers. I’m now (very slowly) trying to clean up my notes and put them online. Eventually, I hope to have a more integrated tutorial, but here’s a rough draft for now.) This is a summary of Koren’s 2008 ...

### Layman’s Introduction to Random Forests

March 13, 2011

Suppose you’re very indecisive, so whenever you want to watch a movie, you ask your friend Willow if she thinks you’ll like it. In order to answer, Willow first needs to figure out what movies you like, so you give her a bunch of movies and tell her ...

### Prime Numbers and the Riemann Zeta Function

March 13, 2011

Lots of people know that the Riemann Hypothesis has something to do with prime numbers, but most introductions fail to say what or why. I’ll try to give one angle of explanation. Layman’s Terms Suppose you have a bunch of friends, each with an instrument that plays at ...

### Item-to-Item Collaborative Filtering with Amazon’s Recommendation System

February 14, 2011

Introduction In making its product recommendations, Amazon makes heavy use of an item-to-item collaborative filtering approach. This essentially means that for each item X, Amazon builds a neighborhood of related items S(X); whenever you buy/look at an item, Amazon then recommends items from that item’s neighborhood. ...
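The S(X) construction described above can be sketched with co-purchase cosine similarity — a common way to build item neighborhoods, though not necessarily Amazon's exact similarity measure. The purchase history is invented for illustration:

```python
from collections import defaultdict
from math import sqrt

# toy purchase history (users and items invented for illustration)
purchases = {
    "alice": {"book", "lamp", "desk"},
    "bob":   {"book", "lamp"},
    "carol": {"book", "pen"},
    "dave":  {"lamp", "desk"},
}

def item_neighborhoods(purchases, top_n=2):
    # treat each item as the set of users who bought it, and score item pairs
    # by cosine similarity of those sets
    buyers = defaultdict(set)
    for user, items in purchases.items():
        for item in items:
            buyers[item].add(user)
    neighborhoods = {}
    for x in buyers:
        sims = []
        for y in buyers:
            if y == x:
                continue
            overlap = len(buyers[x] & buyers[y])
            if overlap:
                sims.append((overlap / sqrt(len(buyers[x]) * len(buyers[y])), y))
        sims.sort(reverse=True)
        neighborhoods[x] = [y for _, y in sims[:top_n]]
    return neighborhoods

S = item_neighborhoods(purchases)
print(S["book"])  # ['lamp', 'pen']
```

Because the neighborhoods depend only on items, they can be precomputed offline, which is the scalability win of the item-to-item approach.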