The Case of the Missing Offense

[This article was first published on R – NYC Data Science Academy Blog, and kindly contributed to R-bloggers].

Contributed by Emil Parikh. He is currently in the NYC Data Science Academy 12-week, full-time Data Science Bootcamp program taking place from January 9th to March 31st, 2017. This post is based on his first class project – Shiny (due in the 4th week of the program).

Links:   GitHub   |   App



Why does it matter?

Baseball is a game flooded with statistics, and in that mountain of data it can be easy to forget what matters. On one end of the spectrum there are irrelevant statistics; I have heard commentators say of hitters something along the lines of, “Did you know that he is the youngest player ever to hit a home run on the second day of the first week of August!?” On the other end of the spectrum there is Paul DePodesta and his Moneyball strategy. If you have not seen the movie, the strategy is to identify key statistics that define a player's value, then buy undervalued players and sell overvalued ones. This can be a good short-run strategy, but once the market has caught up to that definition of value, new definitions are needed.

The goal

What I initially set out to do was simply compare the statistics of playoff teams to those of non-playoff teams to get an idea of what a long-term strategy for making the playoffs might consist of. While I found interesting differences between playoff and non-playoff teams, I was more fascinated by league-wide trends and by how the playoff-status findings combine with those trends.



A demo of the three sections of my dashboard which can also be viewed here:

Dashboard Demo



Sean Lahman’s Baseball Database contains a wide range of MLB data, including data on batting, pitching, and fielding. There are team-wide and individual player-level data for the regular season and playoffs, and the latest database as of this writing has data ranging from 1871 to 2015.

For the dashboard, I used the Teams.csv file, which has team summary data by year, including summarized batting and pitching statistics. I used only the most recent 11 seasons (2005 to 2015) of available data because, for the purposes of the dashboard, I did not think it was necessary to look further back in time unless I could not find trends in that period.

Dataset sample

Preview of columns from a sample of the data:


Columns added to dataset

I calculated new columns based on existing columns:


Calculated column definitions:

* In baseball, walks, hit-by-pitches, sacrifice flies, and sacrifice bunts do not count toward batting average because they are not hits. However, because they are positive outcomes, they are also excluded from the count of at-bats so that they do not drag the batting average down. Batting average therefore ignores them entirely, hence the need for OBP.
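To make the footnote concrete, here is a minimal sketch of batting average next to the standard on-base percentage formula, OBP = (H + BB + HBP) / (AB + BB + HBP + SF). It is written in Python rather than the R used for the dashboard, the function names are hypothetical, and the season totals are made up for illustration; they are not taken from the Lahman data.

```python
def batting_average(hits, at_bats):
    """Hits divided by official at-bats."""
    return hits / at_bats


def on_base_percentage(hits, walks, hit_by_pitch, at_bats, sac_flies):
    """Times on base divided by the plate appearances that count toward OBP.

    Sacrifice bunts are left out of the denominator, following the
    standard definition of OBP.
    """
    times_on_base = hits + walks + hit_by_pitch
    opportunities = at_bats + walks + hit_by_pitch + sac_flies
    return times_on_base / opportunities


# Made-up season totals for one team, for illustration only.
team = {"H": 1400, "BB": 500, "HBP": 50, "AB": 5500, "SF": 40}

avg = batting_average(team["H"], team["AB"])
obp = on_base_percentage(
    team["H"], team["BB"], team["HBP"], team["AB"], team["SF"]
)
print(f"AVG = {avg:.3f}, OBP = {obp:.3f}")
```

Because walks and hit-by-pitches appear in both the numerator and the denominator, a team that draws many walks gets credit in OBP that batting average never shows.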



Playoff vs. non-playoff


A key statistic I used to compare playoff and non-playoff teams is the mean of the yearly differences between the two groups for a given variable. Here is the function I wrote to calculate this:

View the code on Gist.
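The author's function lives in the Gist. As a rough stand-in, the calculation described here could be sketched as follows. This is Python rather than the author's R, the function name and the "year", "playoff", and "R" keys are assumptions made for the example, and the rows are toy values rather than real team data.

```python
from collections import defaultdict
from statistics import mean


def mean_yearly_difference(rows, variable):
    """Mean over years of (playoff mean - non-playoff mean) for `variable`.

    `rows` is an iterable of dicts with keys "year", "playoff" (bool)
    and the statistic named by `variable`.
    """
    # Step 1: collect the values for each (year, playoff status) group.
    groups = defaultdict(list)
    for row in rows:
        groups[(row["year"], row["playoff"])].append(row[variable])

    # Step 2: difference of the two group means within each year.
    years = sorted({year for year, _ in groups})
    diffs = {
        year: mean(groups[(year, True)]) - mean(groups[(year, False)])
        for year in years
    }

    # Step 3: average the yearly differences.
    return mean(list(diffs.values())), diffs


# Toy rows for illustration only (not Lahman data).
rows = [
    {"year": 2005, "playoff": True, "R": 800},
    {"year": 2005, "playoff": True, "R": 760},
    {"year": 2005, "playoff": False, "R": 720},
    {"year": 2005, "playoff": False, "R": 700},
    {"year": 2006, "playoff": True, "R": 820},
    {"year": 2006, "playoff": False, "R": 750},
    {"year": 2006, "playoff": False, "R": 770},
]

overall, per_year = mean_yearly_difference(rows, "R")
print(per_year, overall)
```

Averaging within each year first keeps a high-scoring or low-scoring season from dominating the comparison, since each year contributes exactly one difference.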


Let’s use the number of Runs Scored to go step-by-step into how the mean difference is calculated.

Here are the first six rows of Runs Scored means per year by playoff status. For example, the first row is saying that in 2005, teams that did not make the playoffs scored an average of 730.7 runs:


The differences of means (playoff minus non-playoff) per year from the previous table:


The mean of the differences of Runs Scored from 2005-2015:

What this tells us is that, on average from 2005-2015, playoff teams scored 69.338 more runs per season than non-playoff teams. Not only did they score more runs on average, they scored more runs in every single year, as you can see here in the difference between the blue and red bars:

Runs Scored - Playoff vs. Non-playoff Teams



As with Runs Scored, Hits and Walks showed the largest mean differences in favor of playoff teams among the batting statistics, while Hits Allowed and Walks Allowed did the same among the pitching statistics.

Okay, but this is expected, right? Shouldn’t the teams that can score more runs and get more hits win? Not so fast!

Examining these correlation matrices of hitting statistics (all insignificant correlations are set to 0)…

Playoff teams:

Playoff Team Hitting Variable Correlations

Non-playoff teams:

Non-Playoff Team Hitting Variable Correlations

…it appears that:

  • Runs (R) to Win Percent: Playoff teams have a weak positive correlation of Runs to Win Percent (0.25), while non-playoff teams have a moderate positive correlation (0.40). We saw previously that playoff teams score more runs overall. The lower correlation of Runs to Win Percent among playoff teams could be a sign that it’s not the number of runs but how efficiently those runs are turned into wins that sets playoff teams apart.
  • Home runs (HR) to Runs and Walks (BB) to Runs: Playoff teams have a moderate positive correlation of HR to Runs (0.55), while non-playoff teams have a strong positive correlation (0.67). The reverse is true of Walks to Runs. Playoff teams walk more and hit more home runs, as seen previously, but may be spreading out their sources of runs more effectively. This matters because if a few of their players go cold, others can pick them up. I’ve seen teams rely on their home run hitters, and when those hitters went cold, the team went cold.
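The post does not say how insignificant correlations were identified before being set to 0. One way to sketch the idea, assuming a permutation test at the 5% level in place of whatever test the author actually used, is shown below in Python. The function names, the default parameters, and the toy series are all hypothetical.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def masked_correlation(xs, ys, alpha=0.05, n_perm=2000, seed=0):
    """Return r, or 0.0 when a permutation test does not reject r = 0."""
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    observed = pearson_r(xs, ys)
    shuffled = list(ys)
    extreme = 0
    for _ in range(n_perm):
        # Shuffling ys breaks any real link between the two variables.
        rng.shuffle(shuffled)
        if abs(pearson_r(xs, shuffled)) >= abs(observed):
            extreme += 1
    p_value = (extreme + 1) / (n_perm + 1)
    return observed if p_value < alpha else 0.0


# Toy series for illustration only (not Lahman data).
x = list(range(20))
y_linked = [2 * i + (i % 3) for i in range(20)]   # rises with x
y_unlinked = [(-1) ** i for i in range(20)]       # alternates, no trend

print(masked_correlation(x, y_linked))
print(masked_correlation(x, y_unlinked))
```

Applied to every pair of hitting variables, a function like this would fill a matrix in which only correlations that survive the test keep their value.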


League-wide trends

Not only can the previous correlation observations be useful to keep in mind when acquiring players, they can also be useful in setting rosters and lineups. However, the league-wide trends in the MLB caught my eye more than the differences between teams of varying playoff status.

I saw that runs and hits were down:

Runs Scored - Playoff vs. Non-playoff Teams

Hits - Playoff vs. Non-playoff Teams

…so I thought, “Okay, it probably has a lot to do with the performance-enhancing drug crackdown.” Then I saw the home run numbers, and home runs do not appear to be trending significantly one way or the other. They have their ups and downs; 2014 just happened to be a low year.

Home Runs - Playoff vs. Non-playoff Teams

Finally, I saw the strikeouts:

Strikeouts - Playoff vs. Non-playoff Teams

and walks numbers:

Walks - Playoff vs. Non-playoff Teams




Closing remarks

Especially at a time in baseball when the offensive numbers are down and the defensive side has the upper hand, every scoring opportunity must be taken and sources of runs must be varied. Every opportunity missed is another potential loss, and the opportunities are only getting fewer. The teams that make the playoffs may just be best at this. However, efficiency and varied sources of runs might not be the only solution. The defense has adjusted; can the offense adjust back?

Next steps

I made some remarks based on eyeballing plots, but data science is more than that. After further reading and given more time, I would:

  • run significance tests to compare the difference in mean differences and variable correlations
  • further divide the playoff and non-playoff teams. I would compare the bottom 3 to 5 playoff teams to the top 3 to 5 non-playoff teams
  • do a deeper player-level analysis on the league-wide trends to see what I can uncover
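As one possible shape for the first item, the sketch below runs an exact sign-flip test on a list of yearly playoff-minus-non-playoff differences. It is only an illustration in Python: the function name is hypothetical, the differences are invented, and the test assumes the yearly differences are independent and symmetric around zero under the null hypothesis.

```python
from itertools import product


def sign_flip_p_value(diffs):
    """Two-sided exact sign-flip test of H0: the mean yearly difference is 0."""
    observed = abs(sum(diffs))
    count = 0
    total = 0
    # Under H0 each yearly difference is equally likely to be + or -,
    # so enumerate every sign assignment (2 ** len(diffs) of them).
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        if abs(sum(s * d for s, d in zip(signs, diffs))) >= observed:
            count += 1
    return count / total


# Invented yearly differences for 11 seasons, for illustration only.
toy_diffs = [70, 60, 65, 80, 55, 75, 62, 68, 72, 58, 66]
p_value = sign_flip_p_value(toy_diffs)
print(p_value)
```

With 11 seasons there are only 2,048 sign assignments, so full enumeration is cheap and no random sampling is needed.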
