Visualizing MLS Player Salaries with ggplot2

[This article was first published on DataScience+, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Recently, I came across this great visualization of MLS player salaries. I tried to do something similar with ggplot2, and while I was unable to replicate the interactivity or the tree-map layout of the original, the result still looks pretty cool.

Data

The data is contained in this PDF file. I extracted it to a CSV file using PDFtables.com. The CSV file can be found here.

Exploratory Analysis

We will need the plyr and ggplot2 packages for this. Let’s load them and read in the data. To learn more about ggplot2, read my previous tutorial.

library(plyr)
library(ggplot2)

salary <- read.csv('September 15 2015 Salary Information - Alphabetical.csv', na.strings = '')
head(salary)

  Club    Last.Name First.Name Pos X Base.Salary X.1 Compensation
1   NY        Abang    Anatole   F $   50,000.00   $    50,000.00
2   KC Abdul-Salaam       Saad   D $   60,000.00   $    73,750.00
3  CHI        Accam      David   F $  650,000.00   $   720,937.50
4  DAL       Acosta     Kellyn   M $   60,000.00   $    84,000.00
5  VAN     Adekugbe     Samuel   D $   60,000.00   $    65,000.00
6  POR          Adi    Fanendo   F $  651,500.00   $   664,000.00

The X and X.1 columns contain nothing but the $ sign, so we can remove them. The base salary is also not numeric: read.csv stores it as a factor (or, in R 4.0 and later, where stringsAsFactors defaults to FALSE, as a character vector). To convert it to numeric, we first remove the commas with the gsub function and then convert the result. A factor cannot be converted directly, because R stores each factor value as an integer level code, and as.numeric on a factor returns those codes rather than the original values. The way to convert without losing information is to go to character first and then to numeric. Since gsub already returns a character vector, the as.character call in the code is a safeguard rather than a necessity.

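# Drop the two columns that hold only the "$" sign
salary$X <- NULL
salary$X.1 <- NULL

# Strip the thousands separators; gsub() returns a character vector
salary$Base.Salary <- gsub(',', '', salary$Base.Salary)
# Convert to numeric (as.character() is a no-op on character input)
salary$Base.Salary <- as.numeric(as.character(salary$Base.Salary))
# Express salaries in millions of dollars
salary$Base.Salary <- salary$Base.Salary / 1000000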

I decided to divide the salary by a million so that everyone’s salary is displayed in units of millions of dollars.
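
The factor pitfall described above is easy to reproduce on a small vector. The sketch below uses made-up values shaped like the Base.Salary column (they are not taken from the data set) and compares a direct conversion with the conversion through character.

```r
# Toy values shaped like the Base.Salary column (made up for illustration)
raw <- factor(c("50,000.00", "650,000.00", "60,000.00"))

# Direct conversion returns the internal level codes, not the salaries
codes <- as.numeric(raw)

# Converting to character first (and removing the commas) keeps the values
clean <- as.numeric(gsub(",", "", as.character(raw)))

print(codes)
print(clean)
```

Here codes contains small integers that index the factor levels, while clean contains the actual salary amounts.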

Plotting the data

Now, for plotting the data, we will use ggplot2. We want the names of players to be displayed in the bar sections that correspond to their salaries. Normally, text is displayed at the top of each section of the bar. This can cause problems and clutter the graph. To avoid this, we need to calculate the midpoint of each section of the bars and display the name there. This can be done as follows (as explained in this StackOverflow thread):

salary <- ddply(salary, .(Club), transform, pos = cumsum(Base.Salary) - (0.5 * Base.Salary))

Basically, this splits the data frame by the Club variable and, within each club, takes the cumulative sum of salaries up to and including each player minus half of that player's base salary. The result is the midpoint of that player's section of the bar.
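
To see the midpoint arithmetic on its own, here is a minimal sketch on made-up data. It uses base R's ave() in place of ddply so that it runs without plyr; the toy data frame and its values are assumptions for illustration only.

```r
# Toy data (made up): two clubs, salaries in millions of dollars
toy <- data.frame(
  Club = c("A", "A", "A", "B", "B"),
  Base.Salary = c(3, 0.5, 2.5, 1, 4)
)

# ave() applies the function within each Club, much like
# ddply(toy, .(Club), transform, ...) does in the article
toy$pos <- ave(toy$Base.Salary, toy$Club,
               FUN = function(s) cumsum(s) - 0.5 * s)

print(toy)
```

For club A the sections span 0 to 3, 3 to 3.5, and 3.5 to 6, so the midpoints are 1.5, 3.25, and 4.75. For club B they are 0.5 and 3.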

Okay, now, let’s plot the data.

ggplot(salary, aes(x = Club, y = Base.Salary, fill = Base.Salary)) +
  geom_bar(stat = 'identity') +
  labs(y = 'Base Salary in millions of dollars', x = '') + 
  coord_flip() + 
  geom_text(data = subset(salary, Base.Salary > 2), aes(label = Last.Name, y = pos)) +
  scale_fill_gradient(low = 'springgreen4', high = 'springgreen')

which gives us the following plot:
[Figure: MLS Player Salaries]

  • labs is used to specify the labels for the axes.
  • coord_flip is used to flip the axes so that we get a horizontal bar chart instead of a vertical one.
  • geom_text is used to specify the text to include in the chart. Since some of the sections of the chart are very small and cannot fit a player's name inside them, I decided to display only the names of players whose salary is more than 2 million dollars. The position of each player's name is determined by pos as calculated earlier.
  • scale_fill_gradient is used to specify the color gradient of the chart. The default color gradient runs from dark blue to light blue. The full list of color names in R can be found here.
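
As an alternative to computing pos by hand, recent versions of ggplot2 can center labels inside stacked sections with position_stack(vjust = 0.5). The sketch below is not the approach used in this article; it runs on made-up toy data, and it blanks out short labels with ifelse instead of subsetting, because stacking only a subset of the rows would put the labels in the wrong place.

```r
library(ggplot2)

# Toy data (made up): two clubs, salaries in millions of dollars
toy <- data.frame(
  Club = c("A", "A", "A", "B", "B"),
  Last.Name = c("Smith", "Jones", "Lee", "Kim", "Diaz"),
  Base.Salary = c(3, 0.5, 2.5, 1, 4)
)

p <- ggplot(toy, aes(x = Club, y = Base.Salary, fill = Base.Salary)) +
  geom_bar(stat = "identity") +
  # Stack the labels the same way as the bars and center them (vjust = 0.5);
  # players at or below the threshold get an empty label
  geom_text(aes(label = ifelse(Base.Salary > 2, Last.Name, "")),
            position = position_stack(vjust = 0.5)) +
  labs(y = "Base Salary in millions of dollars", x = "") +
  coord_flip() +
  scale_fill_gradient(low = "springgreen4", high = "springgreen")
```

Calling print(p) draws the chart. Whether this is preferable to the explicit pos column is a matter of taste: the explicit column makes the label positions visible in the data frame, while position_stack keeps the positioning logic inside the plot.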

That brings us to the end of this article. I hope you found it useful! As always, if you have any questions or feedback, leave a comment or reach out to me on Twitter.
