I promised something related to Major League Soccer and here it is. Caveat: It’s not much. Why so sparse? (1) The data is a bit messy due to teams folding, expansion, name changes, etc. (2) I was backpacking all weekend and didn’t have time to work on this side project. Yes, I have a real job and working during the work week is a bit difficult.
My first step was to scrape the “stats” section of the MLS site to get all of their public data. Or at least all of the data that is relatively easy to find and easy to scrape. I’ll post the code soon once I set up the master repository on GitHub. Needless to say, I think it looks a bit better than my initial foray into BeautifulSoup as posted here.
I decided to look at goals per game by team and year. Most people who like soccer like goals, so this seems like a good starting point. Here is the initial figure.
As you can see, there are a lot of blank spaces. That's because a lot of teams changed their name and/or relocated (e.g., San Jose), some teams folded (e.g., Tampa Bay), and MLS added teams over the years (e.g., Chivas USA). The bottom line is that it makes for an ugly graph. In an attempt to clean it up a bit, I tried to consolidate some of the names. Here is the new figure.
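For the curious, the name consolidation can be sketched roughly like this. It is not the exact code behind the figure: the data frame `mls`, its column names, the single entry in `name_map`, and every number in it are made up for illustration.

```r
# Toy data: column names and all values are invented for illustration
mls <- data.frame(
  team  = c("San Jose Clash", "San Jose Earthquakes", "Chivas USA"),
  year  = c(1996, 2000, 2005),
  goals = c(50, 35, 31),
  games = c(32, 32, 32),
  stringsAsFactors = FALSE
)

# Hypothetical lookup: old team name -> name to plot it under
name_map <- c("San Jose Clash" = "San Jose Earthquakes")

# Replace only the names that appear in the lookup; leave the rest alone
idx <- mls$team %in% names(name_map)
mls$team[idx] <- unname(name_map[mls$team[idx]])

# Goals per game for each team-season
mls$gpg <- mls$goals / mls$games
print(mls)
```

Plotting `gpg` against `year` for each consolidated team name is what gives a figure like the second one.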
It still doesn’t look great, but I do think that you can learn a bit from this figure. Overall, I would say that the goals per game for each team is decreasing over time. Is it a statistically significant decline? I dunno. I’m not writing a paper here — it’s a freaking blog, i.e., speculation reigns supreme! In any case, this raises more questions. For example,
- Does this apparent decline affect attendance numbers?
- What is the cause of this decline? Better defenders coming into the league? Um, I doubt it. I would imagine that quality strikers are being added at about the same rate.
I would hypothesize that it’s just that the quality of the league has improved significantly over the years. Hence, the teams are holding possession more and not just firing shots whenever they get a chance. As a result, I will look into attendance numbers, shots, shots on goal, etc. in the upcoming days or possibly weeks. I believe that some interesting questions can be answered with these data. However, I am still trying to discover what these questions might be. If you have any ideas for questions, let me hear about them in the comments section.
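On the earlier question of whether the decline is statistically significant, a quick and dirty check would be to regress goals per game on year. The sketch below uses invented numbers and assumed column names (`year`, `team`, `gpg`). It also ignores that seasons from the same team are not independent, so treat it as a first look rather than a real analysis.

```r
# Invented goals-per-game values for two hypothetical teams over five seasons
gpg.dat <- data.frame(
  year = rep(1996:2000, each = 2),
  team = rep(c("A", "B"), times = 5),
  gpg  = c(1.9, 1.8, 1.8, 1.7, 1.6, 1.7, 1.5, 1.6, 1.4, 1.5)
)

# Simple linear trend: change in goals per game per season
fit <- lm(gpg ~ year, data = gpg.dat)

# Coefficient table with estimates, standard errors, and p-values
print(summary(fit)$coefficients)

# A negative slope means goals per game falls over time
slope <- coef(fit)[["year"]]
```

With real data, the p-value on the `year` row is the thing to look at, although a model with a per-team term would probably be a fairer test.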
The R code for this project isn’t too interesting, so I won’t post all of it here. It will be on the GitHub repository in time, though. One thing that I did learn about R is that reading in numeric data measured in the thousands (e.g., attendance figures) can be problematic if the numbers have commas: R reads those columns in as text (or factors) instead of numbers. It took me a while to find the workaround and it’s given below.
mls.reg.dat$h_tot <- as.numeric(gsub(",", "", mls.reg.dat$h_tot))
mls.reg.dat$h_avg <- as.numeric(gsub(",", "", mls.reg.dat$h_avg))
mls.reg.dat$a_tot <- as.numeric(gsub(",", "", mls.reg.dat$a_tot))
mls.reg.dat$a_avg <- as.numeric(gsub(",", "", mls.reg.dat$a_avg))
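If there are more than a handful of these columns, the same `gsub` trick can be wrapped in a small helper and applied to all of them at once. This is just a sketch: `strip_commas` and the toy data frame `toy` are names I made up, and the attendance values are invented.

```r
# Hypothetical helper: drop thousands separators, then convert to numeric
strip_commas <- function(x) as.numeric(gsub(",", "", x, fixed = TRUE))

# Toy stand-in for the real data; values are invented
toy <- data.frame(
  h_tot = c("245,120", "98,004"),
  h_avg = c("15,320", "6,125"),
  stringsAsFactors = FALSE
)

# Apply the helper to every comma-formatted column in one go
cols <- c("h_tot", "h_avg")
toy[cols] <- lapply(toy[cols], strip_commas)
str(toy)
```

Since `gsub` coerces its input to character, this should also work when the columns were read in as factors.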