# Blog Archives

## What Is the Probability of a 16 Seed Beating a 1 Seed?

April 20, 2013
Note: I started this post way back when the NCAA men's basketball tournament was going on, but didn't finish it until now. Since the NCAA Men's Basketball Tournament moved to 64 teams, a 16 seed has never upset a 1 seed. You might be tempted to say...

## Copying Data from Excel to R and Back

February 24, 2013
A lot of times we are given a data set in Excel format and we want to run a quick analysis using R's functionality to look at advanced statistics or make better visualizations. There are packages for importing/exporting data from/to Excel, but I have f...

## Restricted Boltzmann Machines in R

January 14, 2013
Restricted Boltzmann Machines (RBMs) are an unsupervised learning method (like principal components). An RBM is a probabilistic, undirected graphical model. RBMs are becoming more popular in machine learning due to recent success in training them with contrastive divergence. They have proven useful in collaborative filtering, being one of the most successful methods...

## Factor Analysis of Baseball’s Hall of Fame Voters

January 9, 2013
Recently, Nate Silver wrote a post which analyzed how voters who voted for and against Barry Bonds for Baseball's Hall of Fame differed. Not surprisingly, those who voted for Bonds were more likely to vote for other suspected steroid users (like Roger Clemens). This got...

## Quick Post About Getting and Plotting Polls in R

November 5, 2012
With the election nearly upon us, I wanted to share an easy way I just found to download polling data and graph a few polls with ggplot2. dlinzer at GitHub created a function to download poll data from the Huffington Post's Pollster API. The default is to dow...

## Finding the Best Subset of a GAM using Tabu Search and Visualizing It in R

August 24, 2012
Finding the best subset of variables for a regression is a very common task in statistics and machine learning. There are statistical methods based on asymptotic normal theory that can help you decide whether to add or remove a variable at a time. The ...

## Random Forest Variable Importance

July 19, 2012
Random forests™ are great. They are one of the best "black-box" supervised learning methods. If you have lots of data and lots of predictor variables, you can do worse than random forests. They can deal with messy, real data. If there are lots of extraneous predictors, they have no problem. It automatically does a good job...

## Rounding in R

June 15, 2012
Forgive me if you are already aware of this, but I found it quite alarming. I know that most code is interpreted by the computer in binary and we input in decimal, so problems can arise in conversion and with floating point. But the example I have below is so simple that it really surprised me. I was converting...

## Space Time Swing Probability Plot for Ichiro

May 30, 2012
I was having some fun with PITCHf/x data and generalized additive models. PITCHf/x keeps track of the trajectory, path, and location of every pitch in the MLB. It is pretty accurate and opens up baseball to more analyses than ever before. Generalized additi...