In my last two posts I talked about Ordinary Least Squares, then extended this discussion to the multiple predictor case and briefly talked about some of the problems that may arise. These problems can include omitted variable bias, heteroskedasticity, non-normality, and multicollinearity. Most of these problems are relatively minor in practice and have easy fixes, but omitted variable bias can be problematic. Unfortunately, the only way to fix that is to include every single predictor ever, which isn’t realistic. But you always want to include as many variable that you think are relevant and contribute to your model as possible with the degrees of freedom you have. If you’re curious about model selection and understanding which of the variables are relevant (and contribute to your predictions or coefficient estimates), check out the addendum to the last article. Model checking–and the use of cross-validation–is becoming more and more prevalent and expected, especially in forecasting.
Before I move on to regression with discrete data, I want to talk about a way to visualize bivariate correlations between many variables that I did not have room for last time or when I covered basic graphing. Of course, you can always just use the “cor()” function in R to evaluate the correlations between your predictors, but who doesn’t like a visualization? I’ve left it for now because it actually uses a separate library for the plotting called “lattice“. Here, I’ll walk you through the scatter plot matrix. They can also come in handy later on when we cover clustering and PCA. Thanks to R, creating something like the plot below is easy!
For this exercise–and for my upcoming post on regression with a binomial dependent variable (logit/probit)–I’ll be using the data at my website called “hallhitters2.csv“. This is the same data as for some of the earlier posts, but I added a binary variable that indicates whether or not each player was inducted into the Hall of Fame by the BBWAA. These are only data for those players who retired in 1950 or later. Let’s set the working directory and load it in:
##set working directory and load data
hall <- read.csv(file="hallhitters2.csv", h=T)
Go ahead and take a look at the variables in the data set and familiarize yourself with it. Only traditional statistics (and OPS) are included here, so we’re not going to worry about any advanced sabermetric stats at this point. Plus, we usually think of BBWAA inductions being based on more traditional stats anyways. Whenever we load in a data set, it’s always a good idea to get a feel of how the data behave.
While I did not mention this in my Regression tutorials, this should always be done. Blindly fitting models is kind of silly when you have the ability to take a look (this does not mean that if you’re doing an experiment, you should subjectively bias your data). Using something like the “summary()” command will give us a good idea of the properties of our data. But we can do a bit more visually like the histograms and box plots that I discussed earlier on in the tutorial series. In addition, we could condition any of these functions on Hall of Fame status to get an idea of the differences between the groups…but I’ll save this for next time.
Unfortunately, we have a lot of variables here! What to do? One thing we may want to do is just run a correlation of all the non-binary variables. This way, we’ll have some idea bout the consequences of including highly correlated predictors in our model. For this, as I said earlier, we can just use “cor()” and type in the data set range of columns that we want to include. I’ll only include some of the variables, as the matrix that is output from this can get ugly pretty quickly. I also round them to three decimal places using “round()” to keep things a little cleaner:
##run correlation of all interesting predictors
round(cor(hall[,c(“AB”, “R”, “H”, “HR”, “RBI”, “SB”, “BA”, “OBP”, “SLG”)]), 3)
Remember that to call vectors by their name in this situation, we need to use quotations and the “c()” command tells R that we want to use multiple columns. We could have also simply called these by their column number in the data frame, but this involves counting across the columns and that’s just annoying.
While we have the information here about the relationships of these variables, it’s just not easy to read. When numbers get complex, trying to simplify them into a visualization is often good practice. This is where the scatter plot matrix comes in.
For this portion of the code, you’ll need to download a new library. Remember to go up to your menu bar at the top of the R GUI. From there:
Click “Packages” ==> “Install Package(s)”
From there, find a mirror near you (just click a city). Now, the packages are listed in alphabetical order. Go down to the “L’s” and find “lattice“. Highlight this package and click “OK“. Now, do the same for the package called “car“. Both of these provide scatter plot matrix options (and I prefer the one in the “car” package). I’ll show code for both, but “car” has many more options for plotting.
The library will now be downloaded onto your computer. From now on, all you have to do is type the code:
And it will know that you want to source this R package during your session. You should not have to fully install it again, just use the “library()” command when you first get into R. You can always use the “help()” function to get more information about the package and its functions.
Now that we have the package loaded, we’ll want to find a function in the package that suits our needs. In the “lattice” package, there is a function called “splom()” which will create scatter plot matrices for us. Similarly, we use “spm()” in the “car” package. This will create a matrix just like our correlation matrix–with variable names on the diagonal where the correlation would be 1–but uses scatter plots instead of correlations to show the relationships between variables. In fact, the generic version of the function works just like the “cor()” function, as you can see below:
##check out variables with scatterplot matrix
splom(hall[,c(“AB”, “R”, “H”, “HR”, “RBI”, “SB”, “BA”, “OBP”, “SLG”)])
The above plots are just a bit unwieldy and very ugly (click them to make them larger). I usually like to include less variables in one of these, as this output is rather large and the diagonal axes and labels are tough to read. Let’s scale it down to some major categories, which I’ll simply call “The Big 5” statistics for hitters (they’re essentially Fantasy Roto categories). The “spm” function goes a bit nuts on the plotting, so I limit to plotting density in the diagonal and only the regression line within each plot. In addition, I indicate hall of fame players and non-hall of fame players using different colors indexed by the BBWAA variable as we’ve done in past tutorials. I add titles to the plots, rather than go with the generic look:
##narrow down variables
splom(hall[,c(“R”, “H”, “HR”, “RBI”, “SB”)], groups=hall$BBWAA,
key = list(title = “The Big 5 Stats”, text=list(“”,””)))
spm(hall[,c(“R”, “H”, “HR”, “RBI”, “SB”)], diagonal=”density”, smooth=F, groups=hall$BBWAA, main=”The Big 5 Stats”, cex.main=2.5)
As you can see, adding titles and keys is not as straight forward in “splom” for me. So let’s look at the second plot using the “spm” function and talk about what we see. If you are interested in loess smooths being included in the latter plots, definitely use the code “help(spm)” to help you out. You have the option of including histograms or boxplots in the diagonals as well.
Obviously, Hall of Fame players are in red (triangles), while non-Hall of Famers (as decided by the BBWAA) are in black (circles). There is a key for this in the top left of the scatter plot matrix. I made the title nice and big here because it is a large plot, so for this I used “cex.main=” (the default is 1). We also see from the density plots that each of the stats is generally skewed with long right tails. The right tails, of course, are the Hall of Fame players for the most part, as indicated by the red triangles that generally sit in the top-right of the scatter plots (more counting stats than non-hall guys). There is pretty good separation between the two groups, which will be an important topic when we talk about clustering, learning, and classification.
Okay, so back to correlation. A correlation simply gives us an idea of the association between two variables. In two dimensions, we’re just visualizing this on the plot. In fact, the added regression line’s R-squared is simply the correlation squared, so there is a direct comparison to our previous correlation matrix early on in this tutorial.
As we would expect, Home Runs and RBI are highly correlated as indicated by the positive upward slope in either the window in the 3rd column and 4th row, or 4th column and 3rd row (it is a symmetrical matrix, but flips the x and y axis). Looking back at our original correlation matrix, we see that the correlation between the two is 0.919, confirming the visual we’re getting from the scatter plot matrix.
You can also see that Home Runs and Stolen Bases are not highly correlated, which agrees with our intuition that HR hitters are not generally base-stealers. This is seen in the 5th column, 3rd row and 3rd column, 5th row windows. You can see the fairly flat regression line more clearly in the latter window, as there seems to be an outlier pulling up the regression line in the 5th column-3rd row window. The correlation between HR and SB in a career? A small 0.117.
Next time, we’ll use the exploratory information we have here to work with logistic regression in R using the “glm()” function and a binomial family. There are a number of options for “glm()“, and I definitely won’t get to them all. But logistic regression is widely used and important to cover.
As I said, there are a number of options to be used in these plots, and you can always find ways to spruce them up. I believe “ggplot2” package has some prettier versions of these, but I’m not very savvy with this package yet. As usual, I post the R code below using Pretty R.
ADDENDUM: Jeromy Anglim in the comments section suggests another function for creating these sorts of plots in the “psych” package. He provides this link for an explanation. I like that you can get both the correlation values AND the scatter plot matrix on the same grid. Very useful and just another way to get more information out of a single line of code. Like I’ve said before, one of the great things about R is that there are always people around that can help and tell you something you don’t know about.
############################# ################Scatterplot Matrices ############################# ##set working directory and load data setwd("c:/Users/Millsy/Dropbox/Blog Stuff/sab-R-metrics") hall