
The Case of the Missing Offense

[This article was first published on R – NYC Data Science Academy Blog, and kindly contributed to R-bloggers.]

Contributed by Emil Parikh. He is currently in the NYC Data Science Academy 12-week, full-time Data Science Bootcamp program taking place from January 9th to March 31st, 2017. This post is based on his first class project, Shiny (due in the 4th week of the program).

Links:   GitHub   |   App

 

Introduction

Why does it matter?

Baseball is a game flooded with statistics, and in that mountain of data it can be easy to forget what matters. On one end of the spectrum there are irrelevant statistics; I have heard commentators say of hitters something along the lines of, “Did you know that he is the youngest player ever to hit a home run on the second day of the first week of August!?” On the other end of the spectrum there is Paul DePodesta and his Moneyball strategy. If you have not seen the movie, the strategy is basically to identify key statistics that define value, then buy undervalued players and sell overvalued players. This can be a good short-run strategy, but once the market has caught up to that definition of value, there need to be new definitions.

The goal

What I initially set out to do was simply compare the statistics of playoff teams to those of non-playoff teams, to get an idea of what a long-term strategy for making the playoffs might consist of. While I found interesting differences between playoff and non-playoff teams, I was more fascinated by league-wide trends and by how the playoff-status findings combine with those trends.

 

Dashboard

A demo of the three sections of my dashboard which can also be viewed here:

 

Dataset

Sean Lahman’s Baseball Database contains a wide range of MLB data, including data on batting, pitching, and fielding. There are team-wide and individual player-level data for the regular season and playoffs, and the latest database as of this writing has data ranging from 1871 to 2015.

For the dashboard, I used the Teams.csv file, which has team summary data by year, including summarized batting and pitching statistics. I used only the most recent eleven seasons (2005 to 2015) of available data because, for the purposes of the dashboard, I did not think it was necessary to look further back in time unless I could not find trends in that window.
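The original project was built in R with Shiny; the snippet below is only a minimal Python sketch of the filtering step described above, not the author's code. It assumes the Lahman column name `yearID` and uses a tiny in-memory stand-in for Teams.csv built from rows of the sample table shown later in this post. The function name `load_team_seasons` is hypothetical.

```python
import csv
import io


def load_team_seasons(csv_file, first_year=2005, last_year=2015):
    """Return the rows (as dicts) whose yearID lies in [first_year, last_year]."""
    reader = csv.DictReader(csv_file)
    return [
        row for row in reader
        if first_year <= int(row["yearID"]) <= last_year
    ]


# Stand-in for Teams.csv: only a few columns, values taken from the post's sample.
SAMPLE_CSV = """yearID,teamID,G,W,L
2005,COL,162,67,95
2007,TBD,162,66,96
2008,PHI,162,92,70
2009,SFG,162,88,74
2011,FLA,162,72,90
"""

rows = load_team_seasons(io.StringIO(SAMPLE_CSV))
print([(r["yearID"], r["teamID"]) for r in rows])
```

With the real file, the same function could be called as `load_team_seasons(open("Teams.csv", newline=""))`, assuming the file sits in the working directory.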

Dataset sample

Preview of columns from a sample of the data:

Year  Team  Games  Wins  Losses  Runs  Hits  Home Runs  Walks  Strikeouts  ERA
2005  COL   162    67    95      740   1477  150        509    1103        5.13
2007  TBD   162    66    96      782   1500  187        545    1324        5.53
2008  PHI   162    92    70      799   1407  214        586    1117        3.88
2009  SFG   162    88    74      657   1411  122        392    1158        3.55
2011  FLA   162    72    90      625   1358  149        542    1244        3.95

 

Columns added to dataset

I calculated new columns based on existing columns:

Year  Team  BA     OBP    Win Ratio  Made Playoffs  Group
2005  COL   0.267  0.333  0.414      No             3
2007  TBD   0.268  0.336  0.407      No             3
2008  PHI   0.255  0.332  0.568      Yes            1
2009  SFG   0.257  0.309  0.543      No             2
2011  FLA   0.247  0.318  0.444      No             3

 

Calculated column definitions:

Column         Full Name           Calculation
BA*            Batting Average     Total Hits / Total At Bats
OBP            On Base Percentage  (Hits + Walks + Hit by Pitch) / (At Bats + Walks + Hit by Pitch + Sacrifice Flies)
Win Ratio      Win Ratio           Wins / Total Games Played
Made Playoffs  Made Playoffs       “Yes” if team has reached one of the playoff rounds, “No” if not
Group          Group               1: Playoff team, 2: Top half of non-playoff teams based on win ratio, 3: Bottom half of non-playoff teams based on win ratio

* In baseball, walks, hit-by-pitches, sacrifice flies, and sacrifice bunts do not count toward batting average because they are not hits. Because they are still positive outcomes, they are also excluded from the count of at-bats so that they do not drag the batting average down, hence the need for OBP.
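As a rough illustration of the definitions above, here is a minimal Python sketch (the project itself was written in R, and these function names are hypothetical). Only Win Ratio is evaluated on real rows, because the sample table does not show At Bats, Hit by Pitch, or Sacrifice Flies.

```python
def batting_average(hits, at_bats):
    """BA = Hits / At Bats."""
    return hits / at_bats


def on_base_percentage(hits, walks, hit_by_pitch, at_bats, sacrifice_flies):
    """OBP = (H + BB + HBP) / (AB + BB + HBP + SF)."""
    times_on_base = hits + walks + hit_by_pitch
    opportunities = at_bats + walks + hit_by_pitch + sacrifice_flies
    return times_on_base / opportunities


def win_ratio(wins, games):
    """Win Ratio = Wins / Total Games Played."""
    return wins / games


# (year, team, wins, games) copied from the dataset sample in this post.
sample_rows = [
    (2005, "COL", 67, 162),
    (2007, "TBD", 66, 162),
    (2008, "PHI", 92, 162),
    (2009, "SFG", 88, 162),
    (2011, "FLA", 72, 162),
]

win_ratios = {
    (year, team): round(win_ratio(wins, games), 3)
    for year, team, wins, games in sample_rows
}
print(win_ratios)
```

Rounded to three decimals, these values agree with the Win Ratio column in the table of added columns.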

 

Results

Playoff vs. non-playoff

Explanation

A key statistic I used to compare playoff and non-playoff teams is, for a given variable, the mean across years of the yearly difference between the two groups’ averages. Here is the function I wrote to calculate this:

View the code on Gist.

 

Let’s use the number of Runs Scored to go step-by-step into how the mean difference is calculated.

Here are the first six rows of Runs Scored means per year by playoff status. For example, the first row is saying that in 2005, teams that did not make the playoffs scored an average of 730.7 runs:

Year  Made Playoffs  Runs
2005  No             730.7
2005  Yes            781.1
2006  No             777.7
2006  Yes            811.2
2007  No             756.3
2007  Yes            835.5

 

The differences of means (playoff minus non-playoff) per year from the previous table:

2005 2006 2007
50.4 33.6 79.2
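The steps so far, plus the final averaging step, can be sketched in Python as follows. This is not the author's function (which is in the linked Gist and written in R); it is a minimal illustration that starts from the six rounded yearly means shown above, so it covers only 2005 to 2007. Because the inputs are already rounded, the 2006 difference comes out as 33.5 rather than the 33.6 in the table, presumably because the table was computed from unrounded means.

```python
from collections import defaultdict
from statistics import mean

# (year, made_playoffs, mean runs) copied from the table of yearly means.
yearly_means = [
    (2005, "No", 730.7), (2005, "Yes", 781.1),
    (2006, "No", 777.7), (2006, "Yes", 811.2),
    (2007, "No", 756.3), (2007, "Yes", 835.5),
]


def yearly_differences(rows):
    """Playoff mean minus non-playoff mean, per year."""
    by_year = defaultdict(dict)
    for year, made_playoffs, value in rows:
        by_year[year][made_playoffs] = value
    return {
        year: groups["Yes"] - groups["No"]
        for year, groups in sorted(by_year.items())
    }


def mean_difference(rows):
    """Mean of the yearly differences."""
    return mean(list(yearly_differences(rows).values()))


diffs = {year: round(d, 1) for year, d in yearly_differences(yearly_means).items()}
three_year_mean = round(mean_difference(yearly_means), 1)
print(diffs)
print(three_year_mean)  # covers 2005-2007 only, not the full 2005-2015 range
```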

 

The mean of the differences in Runs Scored from 2005 to 2015 is 69.338.

What this tells us is that, on average from 2005 to 2015, playoff teams scored about 69 more runs per season than non-playoff teams. Not only did they score more runs on average, they scored more in every single year, as you can see here in the difference between the blue and red bars:

 

Findings

As with Runs Scored, Hits and Walks showed the largest differences among the batting statistics, while Hits Allowed and Walks Allowed showed the largest differences among the pitching statistics, all in favor of playoff teams.

Okay, but this is expected, right? Shouldn’t the teams that can score more runs and get more hits win? Not so fast!

Examining these correlation matrices of hitting statistics (all insignificant correlations are set to 0)…

Playoff teams:

Non-playoff teams:

…it appears that:

 

League-wide trends

Not only can the previous correlation observations be useful to keep in mind when acquiring players, they can also be useful in setting rosters and lineups. However, the league-wide trends in the MLB caught my eye more than the differences between teams of varying playoff status.

I saw that runs and hits were down:

…so I thought, “Okay, it probably has a lot to do with the performance-enhancing drug crackdown.” Then I saw the home run numbers, and home runs do not appear to be trending significantly one way or the other. They have their ups and downs; 2014 just happened to be a low year.

Finally, I saw the strikeouts:

and walks numbers:

 

When home runs remain the same, hits, runs, and walks are down, and strikeouts are up, you cannot blame drug testing, large pitcher-friendly stadiums, or other external factors. Either the hitters are getting worse, or the pitchers and defense are getting better. I will explore further, but it seems logical to say that if hitters’ league-wide numbers are low in categories that depend on natural ability rather than external factors, then it’s plausible that pitchers are simply adjusting to hitters better than hitters are adjusting to pitchers. Repeatedly, I see fielders strategically shift to the side of the infield where the hitter tends to hit the ball, leaving a wide-open gap for the hitter to aim at, yet the hitter still hits it right to a fielder. A large part of this is the pitcher being able to pitch in a way that forces this outcome. I see commentators showing graphics of hitters’ weak zones, and so many times a pitch in that weak zone results in an out. The pitcher and his defense seem to have outsmarted and out-strategized the offense, and we see the effects of this in hitters’ overall numbers.

 

Conclusion

Closing remarks

Especially at a time in baseball when offensive numbers are down and the defensive side has the upper hand, every scoring opportunity must be taken and sources of runs must be varied. Every opportunity missed is another potential loss, and the opportunities are only getting fewer. The teams that make the playoffs may just be the best at this. However, efficiency and varied sources of runs might not be the only solution. The defense has adjusted; can the offense adjust back?

Next steps

I made some remarks based on eyeballing plots, but data science is more than that. With further reading and more time, I would:

  • run significance tests to compare the difference in mean differences and variable correlations
  • further divide the playoff and non-playoff teams. I would compare the bottom 3 to 5 playoff teams to the top 3 to 5 non-playoff teams
  • do a deeper player-level analysis on the league-wide trends to see what I can uncover
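As one possible starting point for the first item, here is a minimal Python sketch of a sign-flip permutation test on the yearly differences. The choice of test is an assumption, not something the analysis above used, and the function name `sign_flip_p_value` is hypothetical. It asks how often a mean at least as large in magnitude would appear if each year's difference were equally likely to be positive or negative. Only the three yearly differences shown earlier (2005 to 2007) are used, so the result says little by itself.

```python
from itertools import product
from statistics import mean


def sign_flip_p_value(differences):
    """Exact two-sided sign-flip test for a mean difference of zero."""
    observed = abs(mean(differences))
    n_extreme = 0
    n_total = 0
    # Enumerate every way of flipping the sign of each yearly difference.
    for signs in product((1, -1), repeat=len(differences)):
        flipped = [s * d for s, d in zip(signs, differences)]
        n_total += 1
        # Small tolerance so exact ties with the observed mean are counted.
        if abs(mean(flipped)) >= observed - 1e-12:
            n_extreme += 1
    return n_extreme / n_total


# Yearly Runs Scored differences (playoff minus non-playoff) for 2005-2007.
runs_differences = [50.4, 33.6, 79.2]
p_value = sign_flip_p_value(runs_differences)
print(p_value)
```

With only three seasons the smallest attainable two-sided p-value is 2/8, so the full eleven seasons of differences would be needed before drawing any conclusion.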

The post The Case of the Missing Offense appeared first on NYC Data Science Academy Blog.
