Analyzing NBA Player Data II: Clustering Players


This is the second post of my little series Analyzing NBA player data. The first part was concerned with scraping and cleaning player statistics from any NBA season. This post deals with gaining some insight into the player stats, in particular by clustering players according to their stats to produce a new set of player positions.

# used libraries
library(tidyverse)   # for data wrangling
library(rvest)       # for web scraping
library(janitor)     # for data cleaning
library(factoextra)  # for pca and cluster visuals

theme_set(theme_minimal()+
            theme(legend.position = "bottom",
                  text=element_text(size = 12)))

Building Player Stats Data

Using the function scrape_stats() developed in the previous post, we start by building a data frame of player stats from the last two seasons to have a big enough sample for our analysis. We can use the map_dfr() function from the purrr package for this purpose. The function applies scrape_stats() to each season and binds the results row-wise into a single data frame. It is essentially the same as do.call("rbind", lapply(2016:2017, scrape_stats)).

player_stats <- map_dfr(2016:2017,scrape_stats)

The next step is to reduce the noise in the data by considering only players that played at least 500 minutes in a season.

player_stats <- player_stats %>% 
  dplyr::filter(mp>=500)

This leaves us with 705 rows to analyze.
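A quick check confirms the size of the remaining sample:

nrow(player_stats) # 705 player-season observations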

Clustering Player Data or: Revolutionizing Player Positions

Yes, that second title sounds rather pretentious. However, clustering has been used before to define a new set of positions which (could/should) supersede the traditional point guard (PG), shooting guard (SG), small forward (SF), power forward (PF), and center (C) (1, 2, 3). There are even two contributions at the SLOAN Conference (a big sports analytics conference) on that topic (1, 2). I will deal with the second one in the third part of this series.

The idea behind using clustering is to find sets of players that have a similar skill set. As the sources above all note, the traditional five positions oversimplify the skill sets of NBA players, and putting players into one of the five positions does not seem to accurately capture a player’s specific skill set. Defining new positions by means of similar skills may thus revolutionize modern basketball (well, at least on a highly theoretical level).

To obtain our very own set of new positions, we will go through the following steps:

  • Reduce the dimensionality of the data
  • Decide on the number of clusters
  • Cluster the low-dimensional data

Reduce Dimensionality

The first step is necessary so that we do not fall prey to the curse of dimensionality. While there are many possibilities to reduce the dimensionality, we will stick with a traditional principal component analysis (PCA) using prcomp() from the stats package.

# center and scale the stats before extracting the components
pca_nba <- player_stats %>% 
  select(fg:vorp) %>% 
  as.matrix() %>% 
  prcomp(center = TRUE, scale. = TRUE, retx = TRUE)

# plot the explained variance per PC  
fviz_eig(pca_nba,ncp = 15)

Now we need to decide how many components to keep. This is a very subjective matter and there does not seem to be a gold standard. I decided on keeping 10 components, which together account for roughly 90% of the variance in the data.

player_stats_ld <- pca_nba$x[,1:10]
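We can verify that the first 10 components indeed account for roughly 90% of the variance directly from the standard deviations returned by prcomp():

# cumulative proportion of variance explained by the components
cum_var <- cumsum(pca_nba$sdev^2) / sum(pca_nba$sdev^2)
round(cum_var[1:10], 3)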

Using the get_pca_var() function from factoextra, we can extract the stats that contribute the most to our chosen components.

nba_var <- get_pca_var(pca_nba)
pcs <- nba_var$contrib[,1:10]
colnames(pcs) <- paste0("PC",1:10)

as_tibble(pcs,rownames = "stat") %>%
  gather(pc,contrib,PC1:PC10) %>%
  mutate(pc=factor(pc,levels=paste0("PC",1:10))) %>% 
  group_by(pc) %>% 
  top_n(5,contrib) %>% 
  ggplot(aes(x=stat,y=contrib))+
  geom_col()+
  coord_flip()+
  facet_wrap(~pc,scales = "free",ncol=5)+
  labs(x="",y="")

The first two components seem to be related to the scoring skills of players (2P and 3P). The last two components are also very clearly determined, relating to free throws and blocks, respectively.

In the next step we decide on the number of clusters for a k-means clustering.

Number of Clusters

Choosing the number of clusters is yet again a very subjective matter. However, there are some guidelines which may help to settle on an appropriate number. The R package NbClust, for instance, implements 30 different indices to determine the optimal number of clusters. So a brute-force approach would be to calculate all 30 indices for a range of possible numbers of clusters and then decide via majority vote.
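Such a vote could be run roughly as follows (a sketch of NbClust’s interface; computing all indices can take a while):

library(NbClust)

# compute all indices for 2 to 10 clusters and tally the majority vote
nb <- NbClust(player_stats_ld, distance = "euclidean",
              min.nc = 2, max.nc = 10, method = "kmeans")
table(nb$Best.nc["Number_clusters", ])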

While I was very inclined to do this, I settled for a single measure, the silhouette. The silhouette value describes how similar a data point is to its own cluster compared to other clusters. For a point i with mean intra-cluster distance a(i) and mean distance b(i) to the points of the nearest other cluster, it is defined as s(i) = (b(i) − a(i)) / max(a(i), b(i)). It thus ranges from −1 to +1, where high values indicate that data points are well matched to their own cluster and poorly to other clusters.

To calculate the silhouette value for a range of possible clusters, we use the fviz_nbclust() function from the factoextra package.

fviz_nbclust(player_stats_ld,kmeans,method = "silhouette")
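This is roughly equivalent to running kmeans() for each candidate number of clusters and averaging the silhouette widths, which we could also do by hand with silhouette() from the cluster package:

library(cluster) # for silhouette()

# average silhouette width for k = 2,...,10
d <- dist(player_stats_ld)
avg_sil <- map_dbl(2:10, function(k) {
  km <- kmeans(player_stats_ld, centers = k, nstart = 25)
  mean(silhouette(km$cluster, d)[, 3]) # third column holds the silhouette widths
})
avg_sil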

So the silhouette value suggests that there are only 3 clusters of players¹. Now this is kind of a bummer. Why? Well, as in the other sources, I was hoping to find a larger set of positions than the traditional ones. Not fewer!

Of course we could just ignore the suggestion of the silhouette value and take any number greater than 5 to create our clustering. But we would not really have a convincing argument for our choice.

Clustering using kmeans

With the “optimal” number of clusters determined, we can call the kmeans() function to compute the clustering.

# several random starts stabilize the stochastic initialization of k-means
player_clus <- kmeans(player_stats_ld, centers = 3, nstart = 25)
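To get a first impression of what the clusters capture, we could cross-tabulate them with the traditional positions (assuming the scraped stats contain a position column, here called pos):

# compare the new clusters with the listed positions (`pos` is assumed to exist)
table(player_stats$pos, player_clus$cluster)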

The following code is used to visualize the cluster centers:

as_tibble(player_clus$centers) %>% 
  gather(component, value, PC1:PC10) %>% 
  mutate(clust = rep(1:3, 10)) %>% # clusters 1:3 repeat within each component
  ggplot(aes(x = factor(component, levels = paste0("PC", 10:1)), y = value)) +
  geom_col() +
  coord_flip() +
  facet_wrap(~clust) +
  labs(x = "", y = "")

Interestingly enough, the first two components already seem to determine the clusters quite well.
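Since the first two dimensions carry most of the structure, we can also plot the clustering directly, for instance with fviz_cluster() from factoextra, which projects the data onto its first two dimensions:

# visualize the clusters in the plane of the first two dimensions
fviz_cluster(player_clus, data = player_stats_ld,
             geom = "point", ellipse.type = "convex")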

So what do we make of these clusters? Hard to say, since I am not that big of an NBA expert either. We certainly did not uncover a revolutionary new set of positions, but only three seemingly boring groups of players.

The result is absolutely not what I was hoping for when I started to write this article. But maybe there is still a valuable lesson to learn here. I tried some obscure things while writing to somehow obtain a revolutionary set of new positions. I had to stop when I realized that I was falling for the Texas sharpshooter fallacy and cherry picking.

I hope the third part of the series will turn out to be a bit more exciting. There, I am going to look at similarity networks of players.


  1. Of course I could not resist using the brute-force approach. The majority vote there also clearly favored 3 as the optimal number of clusters.
