# Factor Analysis of Baseball’s Hall of Fame Voters

[This article was first published on **Statistically Significant**, and kindly contributed to R-bloggers.]


Recently, Nate Silver wrote a post analyzing how voters who voted for and against Barry Bonds for Baseball's Hall of Fame differed. Not surprisingly, those who voted for Bonds were more likely to vote for other suspected steroid users (like Roger Clemens). This got me thinking that it would make an interesting case study for factor analysis, to see whether there are latent factors that drive Hall of Fame voting.

The Twitter user @leokitty has kept track of all the known ballots of the voters in a spreadsheet. The spreadsheet is a matrix that has one row for each voter and one column for each player being voted upon. (Players need 75% of the vote to make it to the hall of fame.) I removed all players that had no votes and all voters that had given a partial ballot.
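The filtering step can be sketched on a toy ballot matrix. The voter and player names below are made up, and the sketch assumes that a partial ballot shows up as a row containing at least one `NA`; the real spreadsheet may encode it differently.

```r
# Toy ballot matrix (hypothetical voters and players, not the real data)
raw <- data.frame(
  Player.A = c(1, 0, 1, NA),
  Player.B = c(0, 0, 0, 0),
  Player.C = c(1, 1, 0, 1),
  row.names = c("voter1", "voter2", "voter3", "voter4")
)

# Assumption: a partial ballot is a row with at least one missing entry
full_ballots <- raw[complete.cases(raw), , drop = FALSE]

# Drop players (columns) that received no votes from the remaining voters
cleaned <- full_ballots[, colSums(full_ballots) > 0, drop = FALSE]

print(cleaned)
```

Here `voter4` is dropped for the incomplete ballot and `Player.B` is dropped for receiving no votes.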

(This matrix either has a 1 or a 0 in each entry, corresponding to whether a voter voted for the player or not. Note that this kind of matrix is similar to the data that is analyzed in information retrieval. I will be decomposing the (centered) matrix using singular value decomposition to run the factor analysis. This is the same technique used for latent semantic indexing in information retrieval.)
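To make the link between the singular value decomposition and principal components concrete, here is a minimal sketch on a randomly generated 0/1 matrix (not the real ballots). It compares `svd()` on the centered matrix with `prcomp()`, which should agree up to the sign of each component. Note that `princomp()` works from an eigendecomposition of the covariance matrix with divisor `n` rather than `n - 1`, so its standard deviations differ by a constant factor.

```r
set.seed(1)

# Hypothetical 0/1 ballot matrix: 20 voters by 5 players
X <- matrix(rbinom(100, 1, 0.5), nrow = 20, ncol = 5)

# Center each column (player), then decompose
Xc <- scale(X, center = TRUE, scale = FALSE)
s <- svd(Xc)

# Principal components of the same matrix
p <- prcomp(X, center = TRUE, scale. = FALSE)

# Right singular vectors are the loadings (up to sign)
loading_gap <- max(abs(abs(s$v) - abs(unname(p$rotation))))

# Singular values rescaled by sqrt(n - 1) are the component standard deviations
sdev_from_svd <- s$d / sqrt(nrow(X) - 1)

print(loading_gap)
print(cbind(svd = sdev_from_svd, prcomp = p$sdev))
```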

Starting with the analysis, there is a drop-off in the variance after the first 2 factors, which means it might make sense to look only at the first 2 (which is good because I was only planning on doing so).

    # Read the ballot matrix: one row per voter, one column per player
    votes <- read.csv("HOF votes.csv", row.names = 1, header = TRUE)
    pca <- princomp(votes)
    screeplot(pca, type = "l", main = "Scree Plot")
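The drop-off seen in a scree plot can also be read as numbers. The sketch below uses a randomly generated 0/1 matrix as a stand-in for the ballots (so no real drop-off should be expected in it) and computes the share of variance carried by each component from the `sdev` element of a `princomp` fit.

```r
set.seed(42)

# Hypothetical stand-in for the ballot matrix: 40 voters by 5 players
toy <- as.data.frame(matrix(rbinom(200, 1, 0.5), nrow = 40, ncol = 5))
toy_pca <- princomp(toy)

# Variance of each component as a share of the total
var_explained <- toy_pca$sdev^2 / sum(toy_pca$sdev^2)
cum_explained <- cumsum(var_explained)

print(round(var_explained, 3))
print(round(cum_explained, 3))
```

On the real data, the same two lines applied to `pca` would show how much of the total variance the first 2 components account for.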

Looking at the loadings, it appears that the first principal component corresponds strongly to steroid users, with Bonds and Clemens having large negative values and other suspected steroid users also on the negative end. The players on the positive end have no steroid suspicions.

    dotchart(sort(pca$loadings[, 1]), main = "First Principal Component")

The second component isn't as easy to decipher. The players at the negative end seem to be players who are preferred by analytically minded analysts (think Moneyball). Raines, Trammell, and Martinez have more support among this group of voters. Morris, however, has less support among these voters, yet he isn't that far separated from them.

There may also be some secondary steroid association in this component, separating players for whom there is proof of steroid use from those for whom there is no proof but who “look like” they took steroids. For example, there is no hard evidence that Bagwell or Piazza took steroids, but they were very muscular and hit a lot of home runs, so they are believed to have taken steroids. There is some sort of evidence that the top five players of this component did take steroids.

    dotchart(sort(pca$loadings[, 2]), main = "Second Principal Component")

Projecting the votes onto two dimensions, we can look at how the voters for Bonds and Clemens split up. You can see there is a strong horizontal split between the voters for and against Bonds/Clemens. There are also 3 voters that voted for Bonds, but not Clemens.

    library(ggplot2)

    # Voter scores on the first two components, joined with the raw ballots
    ggplot(data.frame(cbind(pca$scores[, 1:2], votes))) +
      geom_point(aes(Comp.1, Comp.2,
                     colour = as.factor(Barry.Bonds),
                     shape = as.factor(Roger.Clemens)), size = 4) +
      coord_equal() +
      labs(colour = "Bonds", shape = "Clemens")
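The number of voters who chose Bonds but not Clemens can be checked without a plot by cross-tabulating the two columns. The sketch below uses a small made-up data frame with the same column names as the ballot matrix; the counts it prints are for the toy data only.

```r
# Hypothetical ballots for six voters (not the real data)
toy_votes <- data.frame(
  Barry.Bonds   = c(1, 1, 0, 0, 1, 0),
  Roger.Clemens = c(1, 0, 0, 0, 1, 0)
)

# 2 x 2 table of Bonds votes against Clemens votes
split_tab <- table(Bonds = toy_votes$Barry.Bonds,
                   Clemens = toy_votes$Roger.Clemens)

# Voters who picked Bonds but left Clemens off
bonds_only <- sum(toy_votes$Barry.Bonds == 1 & toy_votes$Roger.Clemens == 0)

print(split_tab)
print(bonds_only)
```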

Similarly, I can look at how the voters split up on the issue of steroids by looking at both Bonds and Bagwell. The voters in the upper left do not care about steroid use, but believe that Bagwell wasn't good enough to make it to the hall of fame. The voters in the lower right do care about steroid use, but believe that Bagwell was innocent of any wrongdoing.

    # Colour by the Bonds / Bagwell vote combination
    ggplot(data.frame(cbind(pca$scores[, 1:2], votes))) +
      geom_point(aes(Comp.1, Comp.2,
                     colour = as.factor(paste(Barry.Bonds, "/", Jeff.Bagwell))), size = 4) +
      geom_hline(yintercept = 0, size = 0.2) +
      geom_vline(xintercept = 0, size = 0.2) +
      coord_equal() +
      labs(colour = "Bonds / Bagwell")

We can also look at a similar plot with Schilling instead of Bagwell. The separation here appears to be stronger.

    # Colour by the Bonds / Schilling vote combination
    ggplot(data.frame(cbind(pca$scores[, 1:2], votes))) +
      geom_point(aes(Comp.1, Comp.2,
                     colour = as.factor(paste(Barry.Bonds, "/", Curt.Schilling))), size = 4) +
      geom_hline(yintercept = 0, size = 0.2) +
      geom_vline(xintercept = 0, size = 0.2) +
      coord_equal() +
      labs(colour = "Bonds / Schilling")

Finally, we can look at a biplot (using code from here).

    PCbiplot <- function(PC = fit, x = "PC1", y = "PC2") {
      # PC being a prcomp object
      library(grid)  # for arrow() and unit()
      data <- data.frame(obsnames = row.names(PC$x), PC$x)
      plot <- ggplot(data, aes_string(x = x, y = y)) +
        geom_text(alpha = 0.4, size = 3, aes(label = obsnames))
      plot <- plot + geom_hline(yintercept = 0, size = 0.2) +
        geom_vline(xintercept = 0, size = 0.2)
      datapc <- data.frame(varnames = rownames(PC$rotation), PC$rotation)
      # Scale the loading arrows by the ratio of the score range to the loading
      # range (parenthesized so the subtraction happens before the division)
      mult <- min(
        (max(data[, y]) - min(data[, y])) / (max(datapc[, y]) - min(datapc[, y])),
        (max(data[, x]) - min(data[, x])) / (max(datapc[, x]) - min(datapc[, x]))
      )
      datapc <- transform(datapc, v1 = 0.7 * mult * (get(x)), v2 = 0.7 * mult * (get(y)))
      plot <- plot + coord_equal() +
        geom_text(data = datapc, aes(x = v1, y = v2, label = varnames),
                  size = 5, vjust = 1, color = "red")
      plot <- plot +
        geom_segment(data = datapc, aes(x = 0, y = 0, xend = v1, yend = v2),
                     arrow = arrow(length = unit(0.2, "cm")), alpha = 0.75, color = "red")
      plot
    }

    fit <- prcomp(votes, scale. = FALSE)
    PCbiplot(fit)

I could also have attempted to rotate the factors to make them more interpretable, but they appeared easy enough to interpret as is. Since we were looking at 2-d plots, rotation would not have made a difference in interpreting the plots. It is also common to use a likelihood approach to estimate factors. I chose to use the principal component method because the data are definitely not normal (being 0's and 1's).
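As an illustration of why rotation does not change a 2-d picture, here is a minimal sketch using `varimax()` from the stats package on the first two loadings of a randomly generated 0/1 matrix (a stand-in, not the real ballots). Varimax is just one possible rotation and is not something used in the analysis above. The rotation matrix is orthogonal, so distances between players in the loading plot are preserved and only the axes turn.

```r
set.seed(7)

# Hypothetical stand-in for the ballot matrix: 50 voters by 6 players
toy <- matrix(rbinom(300, 1, 0.5), nrow = 50, ncol = 6)
fit_toy <- prcomp(toy)

# Loadings on the first two components
L <- fit_toy$rotation[, 1:2]

# Varimax rotation of the two-column loading matrix
rot <- varimax(L)
R <- rot$rotmat

# Rotated loadings are the original loadings times the rotation matrix
L_rot <- L %*% R

# R is orthogonal, so pairwise distances between players are unchanged
print(round(crossprod(R), 6))
print(max(abs(as.vector(dist(L)) - as.vector(dist(L_rot)))))
```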
