The Case of the Missing Offense

Posted on February 1, 2017 by Emil Parikh in R bloggers

[This article was first published on R – NYC Data Science Academy Blog, and kindly contributed to R-bloggers].

Contributed by Emil Parikh. He is currently in the NYC Data Science Academy 12-week, full-time Data Science Bootcamp program taking place from January 9th to March 31st, 2017. This post is based on his first class project, Shiny (due in the 4th week of the program).

Links: GitHub | App

Introduction

Why does it matter?

Baseball is a game flooded with statistics, and in the mountain of data it can be easy to forget what matters. On one end of the spectrum there are irrelevant statistics; I have heard commentators say of hitters something along the lines of, “Did you know that he is the youngest player ever to hit a home run on the second day of the first week of August!?” On the other end of the spectrum there is Paul DePodesta and his Moneyball strategy. If you have not seen the movie, the strategy is basically to identify key statistics by which to define value, then buy undervalued players and sell overvalued players. This can be a good short-run strategy, but once the market has caught up to the definition of value, there need to be new definitions.

The goal

What I had initially set out to do was solely to compare the statistics of playoff teams to those of non-playoff teams to get an idea of what a long-term strategy to make the playoffs might consist of. While I found interesting differences between playoff and non-playoff teams, I was more fascinated by league-wide trends and by how the playoff-status findings combine with those trends.

Dashboard

A demo of the three sections of my dashboard, which can also be viewed here:

Dataset

Sean Lahman’s Baseball Database contains a wide range of MLB data, including data on batting, pitching, and fielding. There are team-wide and individual player-level data for the regular season and playoffs, and the latest database as of this writing has data ranging from 1871 to 2015.

For the dashboard, I used the Teams.csv file, which has team summary data by year, including summarized batting and pitching statistics. I used only the most recent 11 seasons (2005 to 2015) of available data because, for the purposes of the dashboard, I did not think it was necessary to look further back in time unless I could not find trends in that window.

Dataset sample

Preview of columns from a sample of the data:

Year	Team	Games	Wins	Losses	Runs	Hits	Home Runs	Walks	Strikeouts	ERA
2005	COL	162	67	95	740	1477	150	509	1103	5.13
2007	TBD	162	66	96	782	1500	187	545	1324	5.53
2008	PHI	162	92	70	799	1407	214	586	1117	3.88
2009	SFG	162	88	74	657	1411	122	392	1158	3.55
2011	FLA	162	72	90	625	1358	149	542	1244	3.95

Columns added to dataset

I calculated new columns based on existing columns:

Year	Team	BA	OBP	Win Ratio	Made Playoffs	Group
2005	COL	0.267	0.333	0.414	No	3
2007	TBD	0.268	0.336	0.407	No	3
2008	PHI	0.255	0.332	0.568	Yes	1
2009	SFG	0.257	0.309	0.543	No	2
2011	FLA	0.247	0.318	0.444	No	3

Calculated column definitions:

Column	Full Name	Calculation
BA*	Batting Average	Total Hits / Total At Bats
OBP	On Base Percentage	(Hits + Walks + Hit by Pitch) / (At Bats + Walks + Hit by Pitch + Sacrifice Flies)
Win Ratio	Win Ratio	Wins / Total Games Played
Made Playoffs	Made Playoffs	“Yes” if team has reached one of the playoff rounds, “No” if not
Group	Group	1: Playoff team, 2: Top half of non-playoff teams based on win ratio, 3: Bottom half of non-playoff teams based on win ratio

* In baseball, walks, hit-by-pitches, sacrifice flies, and sacrifice bunts are not counted as hits, so they do not raise a batting average. Because they are positive outcomes, they are also excluded from the count of at-bats so that they do not lower it. Batting average therefore ignores these plate appearances entirely, hence the need for OBP.
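The calculated columns can be sketched in a few lines. This is an illustrative Python version, not the author's R code. The batting line and the five-team season are made up for the example, and the rule for splitting non-playoff teams into halves (rank by win ratio within one season, with an odd team out going to the top half) is an assumption, since the post does not spell it out.

```python
def batting_average(hits, at_bats):
    """BA = Hits / At Bats."""
    return hits / at_bats


def on_base_percentage(hits, walks, hit_by_pitch, at_bats, sacrifice_flies):
    """OBP = (H + BB + HBP) / (AB + BB + HBP + SF)."""
    times_on_base = hits + walks + hit_by_pitch
    chances = at_bats + walks + hit_by_pitch + sacrifice_flies
    return times_on_base / chances


def win_ratio(wins, games_played):
    """Win Ratio = Wins / Total Games Played."""
    return wins / games_played


def assign_groups(teams):
    """Map each team to group 1, 2, or 3.

    teams: list of dicts with keys 'team', 'win_ratio', 'made_playoffs'.
    Assumption: non-playoff teams are ranked by win ratio and split in
    half; with an odd count, the extra team lands in the top half.
    """
    groups = {t["team"]: 1 for t in teams if t["made_playoffs"]}
    non_playoff = sorted(
        (t for t in teams if not t["made_playoffs"]),
        key=lambda t: t["win_ratio"],
        reverse=True,
    )
    top_half_size = (len(non_playoff) + 1) // 2
    for rank, t in enumerate(non_playoff):
        groups[t["team"]] = 2 if rank < top_half_size else 3
    return groups


# Win ratio for the 2005 COL row in the sample: 67 wins in 162 games.
print(round(win_ratio(67, 162), 3))

# Made-up batting line, for illustration only.
print(round(batting_average(150, 500), 3))
print(round(on_base_percentage(150, 50, 5, 500, 5), 3))

# Hypothetical five-team season, for illustration only.
season = [
    {"team": "A", "win_ratio": 0.600, "made_playoffs": True},
    {"team": "B", "win_ratio": 0.540, "made_playoffs": False},
    {"team": "C", "win_ratio": 0.500, "made_playoffs": False},
    {"team": "D", "win_ratio": 0.450, "made_playoffs": False},
    {"team": "E", "win_ratio": 0.400, "made_playoffs": False},
]
print(assign_groups(season))
```

The grouping would be applied one season at a time, which is why the five sample rows above (each from a different year) cannot be grouped against one another.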

Results

Playoff vs. non-playoff

Explanation

A key statistic I used to compare playoff and non-playoff teams is, for a given variable, the mean of the yearly differences between the two groups. Here is the function I wrote to calculate this:

View the code on Gist.

Let’s use Runs Scored to walk step by step through how the mean difference is calculated.

Here are the first six rows of Runs Scored means per year by playoff status. For example, the first row says that in 2005, teams that did not make the playoffs scored an average of 730.7 runs:

Year	Made Playoffs	Runs
2005	No	730.7
2005	Yes	781.1
2006	No	777.7
2006	Yes	811.2
2007	No	756.3
2007	Yes	835.5

The differences of means (playoff minus non-playoff) per year from the previous table:

2005	2006	2007
50.4	33.6	79.2
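The two steps so far can be reproduced from the six rows shown above. This is a minimal Python sketch rather than the author's function from the Gist, and the names `yearly_differences` and `mean_difference` are chosen here for illustration. Because it starts from the rounded means in the table, 2006 comes out as 33.5 rather than 33.6; the post's figure was presumably computed from unrounded means. The three-year mean it prints is also not the 2005-2015 figure, which needs all eleven seasons.

```python
from statistics import mean

# (year, made_playoffs, mean runs) copied from the table above.
rows = [
    (2005, "No", 730.7),
    (2005, "Yes", 781.1),
    (2006, "No", 777.7),
    (2006, "Yes", 811.2),
    (2007, "No", 756.3),
    (2007, "Yes", 835.5),
]


def yearly_differences(rows):
    """Return {year: playoff mean minus non-playoff mean}."""
    by_year = {}
    for year, made_playoffs, value in rows:
        by_year.setdefault(year, {})[made_playoffs] = value
    return {
        year: groups["Yes"] - groups["No"]
        for year, groups in sorted(by_year.items())
    }


def mean_difference(rows):
    """Mean of the per-year differences."""
    return mean(yearly_differences(rows).values())


for year, diff in yearly_differences(rows).items():
    print(year, round(diff, 1))
print("mean of differences:", round(mean_difference(rows), 1))
```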

The mean of the differences of Runs Scored from 2005-2015:

What this tells us is that, on average from 2005-2015, playoff teams scored 69.338 more runs per year than non-playoff teams. Not only did they score more runs on average, they scored more in every single year, as you can see here in the difference between the blue and red bars:

Findings

As with Runs Scored, Hits and Walks showed the largest mean differences in favor of playoff teams among the batting statistics, while Hits Allowed and Walks Allowed did so among the pitching statistics.

Okay, but this is expected, right? Shouldn’t the teams that can score more runs and get more hits win? Not so fast!

Examining these correlation matrices of hitting statistics (all insignificant correlations are set to 0)…

Playoff teams:

Non-playoff teams:

…it appears that:

Runs (R) to Win Percent: Playoff teams have a weak positive correlation of Runs to Win Percent (0.25), while non-playoff teams have a moderate positive correlation (0.40). We have seen previously that playoff teams are scoring more runs overall. The lower correlation of Runs to Win Percent among playoff teams could be a sign that it is not the number of runs but the efficiency of runs that sets playoff teams apart.
Home runs (HR) to Runs and Walks (BB) to Runs: Playoff teams have a moderate positive correlation of HR to Runs (0.55), while non-playoff teams have a strong positive correlation (0.67). The reverse is true of Walks to Runs. Playoff teams are walking more and hitting more home runs, as seen previously, but are potentially spreading out their sources of runs more effectively. This matters because if a few of their players go cold, others can pick them up. I’ve seen teams rely on their home run hitters, and when those hitters went cold, the team went cold.
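For readers curious how "insignificant correlations are set to 0" can work, here is one possible sketch in Python. The post does not say which test or threshold was used, so everything here is an assumption: the 0.05 cutoff, the helper names, and the use of the Fisher z-transformation (a normal approximation) in place of an exact t-test. The sample data is made up.

```python
import math
from statistics import NormalDist, mean


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def approx_p_value(r, n):
    """Approximate two-sided p-value for H0: correlation = 0.

    Uses the Fisher z-transformation, which needs n > 3.
    """
    if abs(r) >= 1.0:
        return 0.0  # perfect correlation; atanh would be undefined
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def masked_correlation(xs, ys, alpha=0.05):
    """Return r, or 0.0 when r is not significant at level alpha."""
    r = pearson_r(xs, ys)
    return r if approx_p_value(r, len(xs)) < alpha else 0.0


# Made-up data, for illustration only.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
noisy = [3, 1, 4, 1, 5, 9, 2, 6]
print(masked_correlation(xs, noisy))                 # weak link, small sample
print(masked_correlation(xs, [2 * x for x in xs]))   # perfect linear link
```

The sample size matters as much as the coefficient: under this approximation the same r can pass the threshold with many team-seasons and fail it with few, so a correlation shown as 0 in a masked matrix is not necessarily small.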

League-wide trends

Not only can the previous correlation observations be useful to keep in mind when acquiring players, they can also be useful in setting rosters and lineups. However, the league-wide trends in MLB caught my eye more than the differences between teams of varying playoff status.

I saw that runs and hits were down:

…so I thought, “Okay, it probably has a lot to do with the performance-enhancing drug crackdown.” Then I saw the home run numbers, and it looks like home runs are not significantly trending one way or another. They have their ups and downs; 2014 just happened to be a low year.

Finally, I saw the strikeouts:

and walks numbers:

When home runs remain the same, hits, runs, and walks are down, and strikeouts are up, you cannot blame drug testing, large pitcher-friendly stadiums, or other external factors. Either the hitters are getting worse, or the pitchers and defenses are getting better. I will explore further, but if hitters’ league-wide numbers are down in categories that depend on natural ability rather than on external factors, then it is plausible that pitchers are simply adjusting to hitters better than hitters are adjusting to pitchers.

Repeatedly, I see fielders strategically shift to the side of the infield where the hitter tends to hit the ball (so much so that there is a wide-open gap for the hitter to hit the ball through), yet the hitter hits it right to a fielder. A large part of this is the pitcher being able to pitch in a way that forces this outcome. I see commentators showing graphics of hitters’ weak zones, and so many times a pitch in that weak zone results in an out. The pitcher and his defense seem to have out-smarted and out-strategized the offense, and we see the effects of this in hitters’ overall numbers.

Conclusion

Closing remarks

Especially at a time in baseball when the offensive numbers are down and the defensive side has the upper hand, every scoring opportunity must be taken and sources of runs must be varied. Every opportunity missed is another potential loss, and the opportunities are only getting fewer. The teams that make the playoffs may just be the best at this. However, efficiency and varied sources of runs might not be the only solution. The defense has adjusted; can the offense adjust back?

Next steps

I made some remarks based on eyeballing plots, but data science is more than that. After further reading and given more time, I would:

run significance tests on the mean differences and on the differences in variable correlations
further divide the playoff and non-playoff teams, comparing the bottom 3 to 5 playoff teams to the top 3 to 5 non-playoff teams
do a deeper player-level analysis on the league-wide trends to see what I can uncover
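As a rough idea of what the first item could look like, here is a Python sketch of a one-sample t statistic on the yearly differences, which is equivalent to a paired test across years. The choice of test is an assumption, not something the post commits to, and the helper name is made up. The input is only the three Runs Scored differences quoted earlier (2005 to 2007), so the result illustrates the arithmetic rather than supporting a conclusion.

```python
import math
from statistics import mean, stdev


def t_statistic(differences):
    """t = mean(d) / (sd(d) / sqrt(n)), testing H0: mean difference = 0."""
    n = len(differences)
    standard_error = stdev(differences) / math.sqrt(n)
    return mean(differences) / standard_error


# Playoff minus non-playoff Runs Scored for 2005, 2006, 2007 (from the post).
run_differences = [50.4, 33.6, 79.2]
t = t_statistic(run_differences)
print("t =", round(t, 2), "with", len(run_differences) - 1, "degrees of freedom")
```

The statistic would then be compared against a t distribution with n - 1 degrees of freedom; with all eleven seasons the test would have far more power than this three-year example.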
