Last time, I covered ordinary least squares with a single variable. This time, I’ll extend this to using multiple predictor variables in a regression, interacting terms in R, and start thinking about using polynomials of certain terms in the regression (like Age and Age Squared). This should be a pretty straightforward tutorial, especially if you’ve got the last one down pat. For those interested in the assumptions and conditions for using multiple regression, I won’t be covering much of that. I assume some knowledge of introductory regression for these tutorials, as I want to focus on how to do them in R.
I had written up some code for a post here, but then I realized that I had skipped over both multiple regression and logistic regression (and other distributional families). I’ll do these in two posts, follow with LOESS regression and some more advanced graphing techniques, and show how to make some bubble plots (something that is becoming fairly popular these days). After these basics, we’ll get into some real fun.
So, the first thing to do is go and get the data. For today, I’m going to use data with a few more variables. For this, I’ll use 2005-2010 team-level data, and we’ll build an extremely simple model to predict expected wins or runs based on what we know. You can find the data here at my site. It is titled “teamsdata.csv”, and originally comes from Baseball Reference, where you can find a glossary for each variable. However, I’ve already changed a few of the names in my file to make things as logical and straightforward as possible and to interact with R a little better. Let’s begin by loading the data.
##set working directory and load data
setwd("c:/Users/Brian/Dropbox/Blog Stuff/sab-R-metrics")
teams <- read.csv(file="teamsdata.csv", h=T)
Alright, now what questions might we want to ask? The data include some variables that we would naturally expect to drive winning, specifically runs scored and runs allowed. Keep in mind we’re not looking to predict any unobserved teams’ wins here, but simply using a simple example to learn how to code multiple regression in R.
Let’s begin with something simple: just running a regression using Wins as the dependent variable with Runs Scored (“R“) and Runs Against (“R.allow”) as the predictor variables. We’ll do this just as in the previous post on regression with a single predictor variable, but we’ll use a plus sign to add each additional predictor. Essentially, the formula is set up just as you would write a regression equation, but without the beta coefficients, intercept, or error term (those get estimated in the output, of course):
##run multiple regression
fit.1 <- lm(W ~ R + R.allow, data=teams)
If you run the code above, you should end up with a regression output that tells you the intercept is 83.02 wins (about the average number of wins in a season, which must be 81 given a 162-game season), while each run scored is associated with an increase of 0.095 wins and each run allowed is associated with a decrease of 0.098 wins. That makes sense, as we’d expect these two to be pretty close in practice, especially in such a simplistic model. In other words, according to this first regression, a run is a run is a run, and there is some teensy bit of evidence that a run saved is slightly better than a run scored. This regression, however, isn’t particularly interesting. Predicting wins from runs scored and runs allowed is a well-documented phenomenon, and there is a fairly useful formula (the Pythagorean Expectation for baseball) that does this well. So let’s try to work with some other variables and predict run scoring instead.
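As an aside, the Pythagorean Expectation mentioned above is easy to compute directly in R. Here’s a minimal sketch; the function name and the run totals are invented for illustration:

```r
## Pythagorean Expectation: expected win pct = RS^2 / (RS^2 + RA^2)
## (the season totals passed in below are made-up examples)
pyth_wins <- function(rs, ra, games = 162) {
  games * rs^2 / (rs^2 + ra^2)
}

pyth_wins(800, 700)  # a team outscoring its opponents projects to ~92 wins
pyth_wins(700, 700)  # equal runs scored and allowed projects to exactly 81
```

Notice that a team with equal runs scored and allowed comes out to 81 wins, consistent with the “average team” intuition in the regression intercept above.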
Before moving on, I will first remind everyone that the coefficients from a regression predicting run scoring are not usually very accurate. For prediction purposes, this shouldn’t really matter. But one should take caution when making inferences about the coefficients as run values (or linear weights) for different types of hitting statistics. This is a well-documented issue, and likely has a lot to do with omitted variable bias (while the significance levels may be influenced by multicollinearity). In fact, it’s something that sabermetricians and saber-knowledgeable sports economists actually agree on (side note: I do not consider myself an economist or sabermetrician). I’ll ignore this for now, though, as it is a good place to start when learning to run a regression in R.
We’ll begin by creating a variable for singles, as we only have Hits, Doubles, Triples and Home Runs in the data. This is pretty easy, and will be a nice review for adding a new variable onto the data set:
teams$Singles <- teams$H - teams$Doubles - teams$Triples - teams$HR
Now that we have this, let’s begin with a simple regression using just hit types to predict run scoring. We’ll do just what we did with the win prediction model based on runs:
##predict run scoring using very basic stats
fit.2 <- lm(R ~ Singles + Doubles + Triples + HR, data=teams)
We see coefficients of 0.53, 0.94, 1.45 and 1.62 for 1B, 2B, 3B and HR, respectively. These aren’t totally consistent with run values that others have calculated from conditional run expectancies in given game states (for the non-saber people: the expected runs with no outs, one out with a man on second, two outs with a man on first, and so on). Also keep in mind that the intercept is not meaningful here (you can’t score negative runs); it results from the fact that our data don’t go lower than about 500 runs in a season. There are ways to fix the intercept, but it’s not something that really matters in this context. If you want to know how to score zero runs, that’s pretty easy: don’t reach base. But maybe we can improve the coefficients a bit by adding walks, strikeouts, grounded into double plays, hit by pitches, stolen bases, and caught stealing.
##more extensive regression
fit.3 <- lm(R ~ Singles + Doubles + Triples + HR + BB + SO + HBP + GDP + SB + CS, data=teams)
This seems to be a little bit better, and the R-squared is a nice 0.912 (see the bottom right of the output; for the previous regression it was around 0.83). But we see a new problem. While this fitted model is all fine and dandy for predicting run scoring, it still doesn’t seem sufficient for reading the coefficients directly off the regression and taking them as is. You can see that stolen bases have a positive coefficient, but that caught stealing does, too (*gasp*).
Obviously, that doesn’t make much sense: planning to get caught stealing as many times as possible won’t increase your runs scored. There is likely some confounding here. Teams with more stolen bases also attempt more steals, and therefore rack up more caught stealing as well. That doesn’t mean getting caught is good; it may just mean that stealing as a whole is positive, as long as teams aren’t caught too often. Others have researched this issue and found optimal success rates for steals. I believe the caught stealing coefficient is simply picking up the fact that there are more steal attempts on teams that score more runs. In other words, we’re missing something. If that interests you, Google it and you’ll find plenty of analysis on the subject.
Let’s see if we can take this a little further. Sometimes when we think of hits in a game and how they cause runs to score, we think about interactions. If someone hits a single and someone else hits a home run right after that, the result is two runs, rather than the single run a solo home run would create. (Note: I’m only talking about directly creating runs, not the fact that more runs may be expected because the batters did not make outs; this is why we’d expect even a solo home run to be worth more than a single run: there is an opportunity cost to making an out!)
Unfortunately, it might be tough to find any significant interactions with aggregate season data, but we can at least introduce the concept in R and how it is coded. If you really want to get into interactions in this light, inning-level data would probably be best.
For an interaction, R has two options. First, you can interact two variables by using a colon like this:
lm(y ~ x:z)
However, you’ll also want to include the original variable for both x and z, so you’ll want to type:
lm(y ~ x + z + x:z)
In regressions with a lot of variables, this can get a little unwieldy. Therefore, using a shortcut is useful to minimize typing. We can indicate this using a simple multiplication symbol (*):
lm(y ~ x*z)
Using this code, R will automatically include both variables in the regression in addition to the interaction between the two. It’s also possible to do this with more than two variables, but note that something like X*Z*W tells R you want all the main effects, all the two-way interactions (X and Z, X and W, Z and W), and the three-way interaction (X and Z and W). Whether or not that is appropriate is up to the modeler, but obviously the shorthand saves room in that situation as well. Let’s start simple and interact something like Home Runs and Singles, returning to the first run prediction regression to keep the code short (I include both code types to show they end up the same):
##multiple regression with interactions
fit.4 <- lm(R ~ Doubles + Triples + Singles*HR, data=teams)
fit.4b <- lm(R ~ Singles + Doubles + Triples + HR + Singles:HR, data=teams)
We see a negligible increase in R-squared under this model (though an increase is an increase), and the interaction seems to obscure any analysis of run values of each hit type (the coefficient on HR is negative, so it must of course be taken together with the Singles-HR interaction coefficient). This interaction isn’t particularly useful, which I imagine is a product of using this sort of aggregate data. But I use it simply to show how to code the interaction for your own work. There are plenty of other instances in which an interaction may be useful for your modeling.
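If you want to convince yourself that the two interaction notations really produce the same model, you can compare the fits directly. Here is a self-contained sketch with simulated data (the variable names are made up; with the teams data, the same check works on fit.4 and fit.4b):

```r
set.seed(42)
n <- 100
x <- rnorm(n)
z <- rnorm(n)
y <- 1 + 2*x - z + 0.5*x*z + rnorm(n)

fit_star  <- lm(y ~ x*z)          # shorthand: main effects plus interaction
fit_colon <- lm(y ~ x + z + x:z)  # the same model written out explicitly

## identical fitted values, hence the same model
all.equal(fitted(fit_star), fitted(fit_colon))  # TRUE
```

The two formulas estimate exactly the same set of terms; only the order in which R lists the coefficients can differ.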
Let’s use this last regression to predict run scoring. We can use these predictions to also create residuals and see who were the most efficient teams, based on the statistics they put up. If they scored more than expected, then they are making the most out of their singles, doubles, and so on to score more runs. It could be that these are fast teams, or teams that like taking an extra base. Maybe they’re small ball teams. We don’t really know, but it’s always fun to see these differences.
Just as in a regression with only one predictor, we can use the “predict()” function to do this. You can also use “resid()“:
##predict runs scored using multiple regression
teams$pred.R <- predict(fit.4b, teams, type="response")
teams$resid.R <- teams$R - teams$pred.R
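Subtracting the predictions by hand, as above, gives the same numbers as “resid()” when you predict on the data the model was fit to. A quick self-contained check with simulated data (names invented for illustration):

```r
set.seed(1)
d <- data.frame(x = rnorm(50))
d$y <- 3 + 2*d$x + rnorm(50)

fit <- lm(y ~ x, data = d)
manual <- d$y - predict(fit, d)  # residuals computed by subtraction

## same values as the built-in extractor (up to vector names)
all.equal(unname(manual), unname(resid(fit)))  # TRUE
```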
We can then find out who the teams are that most overperformed their projected runs and underperformed their projected runs scored by using the “min()” and “max()” commands.
##find out min and max
##pull entire rows for min and max to find out the teams
top <- subset(teams, teams$resid.R==max(teams$resid.R))
bot <- subset(teams, teams$resid.R==min(teams$resid.R))
ends <- rbind(top, bot)
ends
In the code above, I simply subsetted the data into the top and bottom run scoring performers based on the hitting statistics, stuck them together, and printed them on the screen. If you use all of my code, you’ll find that the team that most outperformed their expected runs–based on our regression–was the 2010 Tampa Bay Rays. The one that most underperformed? The 2005 Chicago Cubs. From the Age and steals variables, we can see that the Rays were a younger (27.5 vs. 29.8 average batter age), faster (172 steals vs. 65, with a much higher success rate), more aggressive team than the 2005 Cubs (more steal attempts), and that seemed to work for them despite a .247 team batting average with less power (SLG, HR), compared to a .270 average with more power for the Cubs.
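As an aside, when you only need a single extreme row, “which.max()” and “which.min()” do the same job as the subset() approach with a bit less typing. A small sketch on a toy data frame (the names and numbers are invented):

```r
## toy stand-in for the teams data: one residual per team
toy <- data.frame(team = c("A", "B", "C"), resid.R = c(12, -30, 25))

toy[which.max(toy$resid.R), ]  # biggest overperformer (team C here)
toy[which.min(toy$resid.R), ]  # biggest underperformer (team B here)
```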
One thing I have not touched on is using polynomial predictors. In some cases, you may have good reason to believe that certain predictors are not related to the dependent variable in a perfectly linear fashion. Age is a perfect example: players tend to improve from the beginning of a career through the middle, but there is a peak, after which performance declines with age. Normally, when using age you’ll want to include both Age and Age Squared in your regression (always include both). The squared term captures the curvature in the relationship for you. This is easily done in R; you can just create a new variable. Something like,
##create an age-squared variable for batters
teams$BatAge2 <- teams$BatAge^2
will create this variable, which you can then just include in your regression along with the standard age variable. Unfortunately, using average team age isn’t very helpful here. The window of average ages is extremely small in baseball (about age 26 to 34, but those are the tails; the bulk are more in the 27 to 30 range). For individual players (for example, predicting WAR), the age variable would be more useful.
Because graphing variables in multiple regression gets a little more complicated, I won’t cover it yet. However, if you’re dying to visualize all of your variables at once, I would suggest installing the “lattice” package and using its “splom()” function. I will try to cover this next time before moving to regression with a binomial dependent variable (logit, probit).
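For the curious, here is a minimal taste of “splom()”; I use R’s built-in iris data so the sketch runs anywhere, but you could pass a few numeric columns of the teams data instead:

```r
library(lattice)  # ships with R as a recommended package

## scatterplot matrix: every pairwise scatterplot of four numeric variables
p <- splom(iris[1:4])
print(p)  # lattice plots must be printed explicitly inside scripts
```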
Now that we have the general model set up for multiple variables, the next bunch of tutorials should be straightforward. Many of the modeling approaches and their coding that I will cover in the future will use this same format, including LOESS, Logit, Probit, Poisson, any GLM, GAM, and so on. Of course, there are some extra parameters and assumptions to think about, but the general code will be the same. Most likely, I will talk a bit more about the process behind each of the methods in following posts, as they are not as widely used as OLS regression. But I promise I won’t get too mathy.
ADDENDUM: In the comments, Jeff Cole mentions that I should have simply removed Caught Stealing from the model using some model selection criterion. He is correct, but I disagree with the general rule that a predictor must be dropped whenever its p-value is not below 0.05. In R, you can do this with the function “step“. If you see the comments, he has the simple code to do this:
##do stepwise model fit
step.fit.3 <- step(fit.3, direction="backward")
I had not covered this portion of model fitting because of the length of the post, but it is an important point that I want to mention here. However, I disagree with Jeff on the point that predictors should be excluded solely based on the p-value. I find it perfectly reasonable to include Strikeouts, SB and GDP in our model, as each coefficient is in the direction we would expect.
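To see what backward selection does in a setting where the answer is known, here is a self-contained sketch (simulated data; only x1 truly matters, and x2 is pure noise, so step() will usually drop it based on AIC):

```r
set.seed(7)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)            # pure noise, unrelated to y
y  <- 4 + 3*x1 + rnorm(n)

full <- lm(y ~ x1 + x2)
slim <- step(full, direction = "backward", trace = 0)  # AIC-based selection

names(coef(slim))  # x1 is always kept; x2 is usually dropped
```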
Since I am not a statistician, I’ll defer to the Gelman and Hill book “Data Analysis Using Regression and Multilevel/Hierarchical Models” (page 72 has a specific example). While keeping these variables doesn’t help all that much (the R-squared change is negligible), it shouldn’t cause any real problems here. So, depending on your objectives, it would be reasonable to leave these additional variables in or take them out.
Thanks to Jeff for commenting!