Contributed by Emil Parikh. He is currently in the NYC Data Science Academy 12-week, full-time Data Science Bootcamp program, which runs from January 9th to March 31st, 2017. This post is based on his first class project – Shiny (due in the 4th week of the program).
Why does it matter?
Baseball is a game flooded with statistics, and in that mountain of data it is easy to forget what matters. On one end of the spectrum there are irrelevant statistics; I have heard commentators say of hitters something along the lines of, “Did you know that he is the youngest player ever to hit a home run on the second day of the first week of August!?” On the other end of the spectrum there is Paul DePodesta and his Moneyball strategy. If you have not seen the movie, the strategy is to identify key statistics that define value, then buy undervalued players and sell overvalued ones. This can be a good short-run strategy, but once the market catches up to that definition of value, new definitions are needed.
What I initially set out to do was simply compare the statistics of playoff teams to those of non-playoff teams, to get an idea of what a long-term strategy for making the playoffs might consist of. While I found interesting differences between playoff and non-playoff teams, I was more fascinated by league-wide trends and by what the two sets of findings say in combination.
A demo of the three sections of my dashboard which can also be viewed here:
Sean Lahman’s Baseball Database contains a wide range of MLB data, including data on batting, pitching, and fielding. There are team-wide and individual player-level data for the regular season and playoffs, and the latest database as of this writing has data ranging from 1871 to 2015.
For the dashboard, I used the Teams.csv file which has team summary data by year, including summarized batting and pitching statistics. I used only the latest decade (2005 to 2015) of available data because, for the purposes of the dashboard, I did not think it was necessary to look further back in time unless I could not find trends in the last decade.
Preview of columns from a sample of the data:
Columns added to dataset
I calculated new columns based on existing columns:
| Year | Team | BA | OBP | Win Ratio | Made Playoffs | Group |
Calculated column definitions:
| Column | Full name | Definition |
|---|---|---|
| BA* | Batting Average | Total Hits / Total At Bats |
| OBP | On Base Percentage | (Hits + Walks + Hit by Pitch) / (At Bats + Walks + Hit by Pitch + Sacrifice Flies) |
| Win Ratio | Win Ratio | Wins / Total Games Played |
| Made Playoffs | Made Playoffs | “Yes” if team has reached one of the playoff rounds, “No” if not |
| Group | Group | 1: Playoff team, 2: Top half of non-playoff teams based on win ratio, 3: Bottom half of non-playoff teams based on win ratio |
* In baseball, walks, hits-by-pitch, sacrifice flies, and sacrifice bunts do not count as hits, so they never raise a batting average. Because they are still positive outcomes, they are also excluded from the count of at-bats so that they do not lower it either. Batting average therefore ignores these plate appearances entirely, hence the need for OBP, which credits the batter for reaching base by a walk or a hit-by-pitch.
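The column definitions above can be sketched in a few lines. This is a minimal illustration rather than the dashboard's actual code: the function names, the dictionary keys, and the rule for splitting an odd number of non-playoff teams are all assumptions made for the example.

```python
def batting_average(hits, at_bats):
    """BA = Total Hits / Total At Bats."""
    return hits / at_bats


def on_base_percentage(hits, walks, hit_by_pitch, at_bats, sac_flies):
    """OBP = (H + BB + HBP) / (AB + BB + HBP + SF)."""
    return (hits + walks + hit_by_pitch) / (at_bats + walks + hit_by_pitch + sac_flies)


def win_ratio(wins, games):
    """Win Ratio = Wins / Total Games Played."""
    return wins / games


def assign_groups(teams):
    """Assign Group 1, 2 or 3 to each team of a single season.

    `teams` is a list of dicts with the hypothetical keys
    'team', 'win_ratio' and 'made_playoffs' (bool).
    """
    groups = {t["team"]: 1 for t in teams if t["made_playoffs"]}
    # Rank the non-playoff teams from best to worst win ratio.
    rest = sorted(
        (t for t in teams if not t["made_playoffs"]),
        key=lambda t: t["win_ratio"],
        reverse=True,
    )
    # Assumption: with an odd count, the extra team goes to the top half.
    top_half = (len(rest) + 1) // 2
    for rank, t in enumerate(rest):
        groups[t["team"]] = 2 if rank < top_half else 3
    return groups


# Made-up season line, for illustration only.
print(round(batting_average(1500, 5500), 3))
print(round(on_base_percentage(1500, 500, 50, 5500, 50), 3))
print(round(win_ratio(90, 162), 3))
```

Grouping is done one season at a time so that "top half" and "bottom half" are always relative to that year's non-playoff teams.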
Playoff vs. non-playoff
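A key statistic I used to compare playoff and non-playoff teams is the mean of the yearly differences between the two groups for a given variable: for each year, take the playoff teams’ average minus the non-playoff teams’ average, then average those differences across years. Here is the function I wrote to calculate this: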
View the code on Gist.
Let’s use the number of Runs Scored to go step-by-step into how the mean difference is calculated.
Here are the first six rows of mean Runs Scored per year by playoff status. For example, the first row shows that in 2005, teams that did not make the playoffs scored an average of 730.7 runs:
The differences of means (playoff minus non-playoff) per year from the previous table:
The mean of the differences of Runs Scored from 2005-2015:
What this tells us is that, on average from 2005-2015, playoff teams scored 69.338 more runs per season than non-playoff teams. Not only do they score more runs on average, they outscore non-playoff teams in every single year, as the gap between the blue and red bars shows:
As with Runs Scored, Hits and Walks showed the largest mean differences in favor of playoff teams among the batting statistics, while Hits Allowed and Walks Allowed did so among the pitching statistics.
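The three steps just described (yearly means by playoff status, yearly differences, mean of the differences) could be sketched as follows. This is not the Gist's code; the function name, the row layout, and the run totals are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from collections import defaultdict
from statistics import mean


def mean_yearly_difference(rows, stat):
    """Return (differences per year, mean of those differences).

    `rows` is a list of dicts with the hypothetical keys 'year',
    'made_playoffs' (bool) and the statistic named by `stat`.
    Each yearly difference is playoff mean minus non-playoff mean.
    """
    by_year = defaultdict(lambda: {True: [], False: []})
    for row in rows:
        by_year[row["year"]][row["made_playoffs"]].append(row[stat])

    diffs = {
        year: mean(groups[True]) - mean(groups[False])
        for year, groups in sorted(by_year.items())
    }
    return diffs, mean(list(diffs.values()))


# Made-up run totals, for illustration only (not the real Lahman data).
rows = [
    {"year": 2005, "made_playoffs": True, "R": 800},
    {"year": 2005, "made_playoffs": True, "R": 780},
    {"year": 2005, "made_playoffs": False, "R": 700},
    {"year": 2005, "made_playoffs": False, "R": 720},
    {"year": 2006, "made_playoffs": True, "R": 760},
    {"year": 2006, "made_playoffs": False, "R": 700},
    {"year": 2006, "made_playoffs": False, "R": 720},
]
diffs, overall = mean_yearly_difference(rows, "R")
print(diffs)    # yearly differences: 80 for 2005, 50 for 2006
print(overall)  # mean of the two differences: 65
```

Averaging within each year first means every season carries equal weight, regardless of how many teams made the playoffs that year.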
Okay, but this is expected, right? Shouldn’t the teams that can score more runs and get more hits win? Not so fast!
Examining these correlation matrices of hitting statistics (all insignificant correlations are set to 0)…
…it appears that:
- Runs (R) to Win Percent: Playoff teams have a weak positive correlation of Runs to Win Percent (0.25), while non-playoff teams have a moderate positive correlation (0.40). We have seen previously that playoff teams score more runs overall. Their lower correlation of Runs to Win Percent could be a sign that it is not the number of runs but the efficiency of runs that sets playoff teams apart.
- Home runs (HR) to Runs and Walks (BB) to Runs: Playoff teams have a moderate positive correlation of HR to Runs (0.55), while non-playoff teams have a strong positive correlation (0.67). The reverse is true of Walks to Runs. Playoff teams walk more and hit more home runs, as seen previously, but they may be spreading out their sources of runs more effectively. This matters because if a few of their players go cold, others can pick them up. I’ve seen teams rely on their home run hitters, and when those hitters went cold, the team went cold.
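For readers who want to reproduce this kind of comparison, here is a minimal sketch of a Pearson correlation computed separately for playoff and non-playoff teams. The function names and row keys are hypothetical, the sample values are made up, and the step that sets insignificant correlations to 0 is left out.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlation_by_status(rows, x, y):
    """Correlate columns `x` and `y` within each playoff status.

    `rows` is a list of dicts with the hypothetical key 'made_playoffs'
    (bool) plus the two statistics being compared.
    """
    result = {}
    for status in (True, False):
        subset = [r for r in rows if r["made_playoffs"] == status]
        result[status] = pearson_r([r[x] for r in subset], [r[y] for r in subset])
    return result


# Made-up values, for illustration only.
rows = [
    {"made_playoffs": True, "R": 700, "win_pct": 0.50},
    {"made_playoffs": True, "R": 750, "win_pct": 0.55},
    {"made_playoffs": True, "R": 800, "win_pct": 0.60},
    {"made_playoffs": False, "R": 600, "win_pct": 0.50},
    {"made_playoffs": False, "R": 650, "win_pct": 0.45},
    {"made_playoffs": False, "R": 700, "win_pct": 0.40},
]
print(correlation_by_status(rows, "R", "win_pct"))
```

Computing the coefficient within each group, rather than over all teams at once, is what makes it possible to compare values such as 0.25 and 0.40.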
Not only can the previous correlation observations be useful to keep in mind when acquiring players, they can also be useful in setting rosters and lineups. However, the league-wide trends in the MLB caught my eye more than the differences between teams of varying playoff status.
I saw that runs and hits were down:
…so I thought “Okay, it probably has a lot to do with the performance enhancing drug crackdown”. Then I saw the home run numbers, and it looks like home runs are not significantly trending one way or another. They have their ups and downs. 2014 just happened to be a low year.
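One way to go beyond reading a trend off a plot is to fit a least-squares line to the yearly league totals and look at its slope. The sketch below assumes nothing about the dashboard's code; the function name and the sample values are hypothetical.

```python
def linear_trend_slope(years, values):
    """Ordinary least-squares slope of `values` against `years`.

    A slope near zero suggests no steady upward or downward trend,
    even if individual years bounce up and down.
    """
    n = len(years)
    mean_year = sum(years) / n
    mean_value = sum(values) / n
    numerator = sum((x - mean_year) * (v - mean_value) for x, v in zip(years, values))
    denominator = sum((x - mean_year) ** 2 for x in years)
    return numerator / denominator


# Made-up series, for illustration only.
steady_rise = linear_trend_slope([2005, 2006, 2007, 2008, 2009], [10, 12, 14, 16, 18])
ups_and_downs = linear_trend_slope([2005, 2006, 2007, 2008, 2009], [5, 7, 5, 7, 5])
print(steady_rise, ups_and_downs)
```

A slope alone does not establish significance; it only summarizes the direction and size of the average yearly change.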
Finally, I saw the strikeouts:
and walks numbers:
When home runs remain the same, hits, runs, and walks are down, and strikeouts are up, you cannot blame drug testing, large pitcher-friendly stadiums, or other external factors. Either the hitters are getting worse, or the pitchers and defense are getting better. I will explore further, but if hitters’ numbers are down league-wide in statistics that depend on natural ability rather than on external factors, then it is plausible that pitchers are simply adjusting to hitters better than hitters are adjusting to pitchers. Repeatedly, I see fielders strategically shift to the side of the infield where the hitter tends to hit the ball, leaving a wide-open gap on the other side, yet the hitter hits it right to the fielder. A large part of this is the pitcher being able to pitch in a way that forces this outcome. I see commentators showing graphics of hitters’ weak zones, and so many times a pitch in that weak zone results in an out. The pitcher and his defense seem to have outsmarted and out-strategized the offense, and we see the effects of this in hitters’ overall numbers.
Especially at a time in baseball when offensive numbers are down and the defensive side has the upper hand, every scoring opportunity must be taken and sources of runs must be varied. Every opportunity missed is another potential loss, and the opportunities are only getting fewer. The teams that make the playoffs may just be best at this. However, efficiency and varied sources of runs might not be the only solution. The defense has adjusted; can the offense adjust back?
I made some remarks based on eyeballing plots, but data science is more than that. With further reading and more time, I would:
- run significance tests on the mean differences and on the differences in variable correlations between groups
- further divide the playoff and non-playoff teams. I would compare the bottom 3 to 5 playoff teams to the top 3 to 5 non-playoff teams
- do a deeper player-level analysis on the league-wide trends to see what I can uncover
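As a sketch of what the first item might look like, here is an exact sign-flip permutation test on a list of yearly differences (playoff minus non-playoff). This is one possible choice of test, not something the analysis above used, and the function name and sample values are hypothetical.

```python
from itertools import product


def sign_flip_p_value(diffs):
    """Two-sided exact sign-flip test that the mean difference is zero.

    Under the null hypothesis each yearly difference is equally likely
    to be positive or negative, so every sign pattern is enumerated and
    the share with a total at least as extreme as the observed one is
    returned. With n differences there are 2**n patterns.
    """
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:  # small tolerance for float inputs
            extreme += 1
    return extreme / total


# Made-up yearly differences, for illustration only.
p_value = sign_flip_p_value([80, 50, 65])
print(p_value)
```

With the 11 seasons from 2005 to 2015 there would be 2**11 = 2048 sign patterns, so the smallest possible two-sided p-value is 2/2048.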