Analyzing NBA Player Data III: Similarity Networks

March 9, 2018

(This article was first published on schochastics, and kindly contributed to R-bloggers)

This is the last part of the mini-series Analyzing NBA Player Data.
The first part
was concerned with scraping and cleaning player statistics from any NBA season.
The second part
showed how to use principal component analysis and k-means clustering to “revolutionize”
player positions, which kind of failed. Anyway, this third part deals with
something a little more advanced, namely similarity networks of players and what we
can learn from them.

#used libraries
library(tidyverse) # for data wrangling
library(rvest)     # for web scraping
library(janitor)   # for data cleaning
library(igraph)    # for network data structures and tools
library(ggraph)    # for network visualization

What is a similarity network?

If you think of networks, you usually think of individuals interacting in some way.
These relations most commonly carry a positive connotation (friendship, kinship, etc.).
A network can, however, also consist of positive (being friends) and negative (being enemies) relationships.
We then speak of a signed network.
Analyzing signed networks is a bit trickier than analyzing regular networks and involves
a different set of tools. An interesting application for signed networks is
Heider’s structural balance theory.
A third type of network has neither positive nor negative ties; instead,
a connection between two nodes signifies some sort of similarity, equality or indifference.
I refer to them here as similarity networks but, as far as I know, this is not a standard term
since there is little research on such networks.

In this post, we will construct an example of a similarity network, namely a similarity network
of NBA players. Similarity is based on player stats, and if two players are connected in the network,
they can be considered to be of the same player type. I am not the first to do this.
There was a talk at the Sloan Conference by Muthu Alagappan,
who seems to have done exactly this. Unfortunately, I could not find out exactly how he
constructed his networks, since he used proprietary software. According to the
abstract, though, it yields “revolutionary insight” and “it can add tremendous
value for coaches owners, general managers, and the everyday fan”.

Constructing an NBA similarity network

Before we begin, we of course need a player stats dataset, which we obtain with the
scrape_stats() function developed in the first part. We will use data from last
season and filter out players who played fewer than 150 minutes.

player_stats <- scrape_stats(season = 2017) %>% 
  dplyr::filter(mp>=150)

According to the Wikipedia article, Muthu used
some sort of topological data analysis
to derive his similarities between players, so we will do the same. We will use UMAP,
a relatively new method based on Riemannian geometry. At the time of writing, there
is no R package for it, but I showed in a recent post how to use
the Python implementation in R. TL;DR: Install
the Python version and use rPython to create the following function.

umap <- function(x,n_neighbors=10,n_components=2,min_dist=0.1,metric="euclidean"){
  x <- as.matrix(x)
  colnames(x) <- NULL
  rPython::python.exec( c( "def umap(data,n,d,mdist,metric):",
                           "\timport umap" ,
                           "\timport numpy",
                           "\tembedding = umap.UMAP(n_neighbors=n,n_components=d,min_dist=mdist,metric=metric).fit_transform(data)",
                           "\tres = embedding.tolist()",
                           "\treturn res"))
  
  res <- rPython::python.call( "umap", x,n_neighbors,n_components,min_dist,metric)
  do.call("rbind",res)
}

I decided to map the 70 stats into a 10-dimensional space. This “new” space supposedly
preserves the intrinsic distances of the “old” space but reduces the noise of the
original data, so that the differences and similarities between players become more evident.

umap_player <- player_stats %>% 
    select(fg:vorp) %>%
    as.matrix() %>% 
    scale() %>% 
    umap(n_components = 10)
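
R-native UMAP implementations have become available since this post was written. As a hedged
alternative to the rPython bridge, the following sketch assumes that the CRAN package uwot is
installed and wraps its umap() function with the same arguments as the function above. The
wrapper name umap_r and the random toy matrix are made up for illustration; this is not the
setup that produced the results shown in this post.

```r
# Sketch: UMAP embedding via the uwot package instead of rPython.
# Assumption: uwot is installed, e.g. via install.packages("uwot").
umap_r <- function(x, n_neighbors = 10, n_components = 2,
                   min_dist = 0.1, metric = "euclidean") {
  uwot::umap(as.matrix(x),
             n_neighbors  = n_neighbors,
             n_components = n_components,
             min_dist     = min_dist,
             metric       = metric)
}

# Toy data standing in for the scaled player stats (200 "players", 20 "stats")
set.seed(1)
toy_stats <- scale(matrix(rnorm(200 * 20), nrow = 200))

# Only run the embedding if uwot is actually available
if (requireNamespace("uwot", quietly = TRUE)) {
  toy_embedding <- umap_r(toy_stats, n_components = 10)
  dim(toy_embedding)  # one row per player, one column per embedding dimension
}
```

With the real data, umap_r() could be swapped in for umap() at the end of the pipe above.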

Now that we have embedded the players in a lower-dimensional space, we calculate the
pairwise distances between them in this new space.

D <- dist(umap_player,diag = TRUE,upper = TRUE) %>% 
  as.matrix()
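
To see what the distance matrix looks like, here is a tiny example with three made-up points
(the object names pts and D_toy are only for illustration). Note that the diag and upper
arguments of dist() only affect how a dist object is printed; as.matrix() returns the full
symmetric matrix either way.

```r
# Three made-up points in a 2-dimensional "embedding"
pts <- rbind(a = c(0, 0), b = c(0.3, 0.4), c = c(3, 4))

# Euclidean distances as a full symmetric matrix
D_toy <- as.matrix(dist(pts))
D_toy

# a and b are close (distance about 0.5), c is far away from both,
# and every point has distance 0 to itself (the diagonal)
```

The zero diagonal is the reason why self-loops have to be dropped when the distances are
later turned into a network.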

You can think of the distance as an “inverse similarity”: the further apart two players are,
the less similar they are. Since we are only interested in whether players are similar or not,
we need to decide on a threshold below which players are considered to be similar.
After a bit of experimenting, I settled on 0.5 as a reasonable threshold, so pairs of
players are considered to be similar if their distance is below 0.5. We then turn the
distance matrix into a 0/1 matrix, which is used to construct a graph object.

A <- (D < 0.5) + 0
g <- graph_from_adjacency_matrix(A, mode = "undirected", diag = FALSE)
V(g)$name <- player_stats$player
ggraph(g, layout = "manual", node.positions = layout_igraph_v3(g))+
  geom_edge_link(colour = "grey")+
  geom_node_point(size = 2)+
  theme_graph()

The function layout_igraph_v3() is not part of ggraph but of visone3, an R package that is
not yet available and that provides nicer layouts for networks. There is, however, a complete
software tool that can be used for free to visualize and
analyze networks (Disclaimer: I know the developers).

If you want to plot the network without the visone package, you can use any of the
layout algorithms of igraph.

ggraph(g, layout = "kk")+
  geom_edge_link(colour = "grey")+
  geom_node_point(size = 2)+
  theme_graph()
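
The threshold of 0.5 used above was found by experimenting. A minimal sketch of what such an
experiment might look like is given below: for a range of candidate thresholds, it counts the
edges, the density and the number of isolated nodes of the resulting network. The helper
threshold_summary() and the random distance matrix D_rand are hypothetical; with the real
data you would pass D instead.

```r
# Sketch: how the choice of threshold changes the resulting network.
# D_rand is a made-up distance matrix of 100 random points.
set.seed(42)
D_rand <- as.matrix(dist(matrix(runif(100 * 2), ncol = 2)))

threshold_summary <- function(D, thresholds) {
  pairs <- D[upper.tri(D)]            # each pair of nodes counted once
  do.call(rbind, lapply(thresholds, function(th) {
    A <- (D < th) + 0
    diag(A) <- 0                      # ignore self-similarity
    data.frame(threshold = th,
               edges     = sum(pairs < th),        # number of similar pairs
               density   = mean(pairs < th),       # share of all possible pairs
               isolates  = sum(rowSums(A) == 0))   # nodes without any tie
  }))
}

res <- threshold_summary(D_rand, thresholds = c(0.05, 0.1, 0.2, 0.5))
res
```

A threshold that is too low leaves many isolated players, while one that is too high
connects almost everybody, so a value in between is usually what you are looking for.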

V(g)$Position <- player_stats$pos

ggraph(g, layout="manual", node.positions = layout_igraph_v3(g))+
  geom_edge_link(colour = "grey")+
  geom_node_point(aes(color = Position),size = 2)+
  theme_graph()+
  theme(legend.position = "bottom")

Interestingly, the positions of players seem to be a strong indicator for similarity.
Almost all centers are very similar, since they form a component by themselves.
You can also find small cohesive groups of players with the same position within the biggest component.
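
One simple way to put a number on this impression is to compare the share of edges that
connect players with the same position to the share of all player pairs with the same
position. The following is only a sketch on a made-up toy network; same_position_share(),
A_toy and pos_toy are hypothetical names, and with the real data you would pass A and
player_stats$pos.

```r
# Made-up toy network with edges 1-2, 1-3, 2-3 and 4-5
A_toy <- matrix(0, 5, 5)
A_toy[cbind(c(1, 1, 2, 4), c(2, 3, 3, 5))] <- 1
A_toy <- A_toy + t(A_toy)                  # make the matrix symmetric
pos_toy <- c("C", "C", "PF", "PG", "PG")

same_position_share <- function(A, pos) {
  ut <- upper.tri(A)                       # look at each pair only once
  is_edge  <- A[ut] == 1
  same_pos <- outer(pos, pos, "==")[ut]
  c(edges     = mean(same_pos[is_edge]),   # share among connected pairs
    all_pairs = mean(same_pos))            # baseline: share among all pairs
}

share <- same_position_share(A_toy, pos_toy)
share
# edges: 2 of the 4 edges (1-2 and 4-5) join equal positions, i.e. 0.5
# all_pairs: 2 of the 10 possible pairs have equal positions, i.e. 0.2
```

If the share among edges is clearly larger than the baseline, position is indeed related to
similarity in the network.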

Of course, it is more interesting to find where players with special skills are located,
such as the players with the highest scoring per 36 minutes.

V(g)$pts_pm <- player_stats$pts_pm

ggraph(g, layout="manual", node.positions=layout_igraph_v3(g))+
    geom_edge_link(colour = "grey")+
    geom_node_point(aes(color = pts_pm),size = 2)+
    scale_color_gradient(low="#104E8B", high="#CD2626")+
    theme_graph()+
    theme(legend.position="bottom")

Or the players with the most rebounds per 36 minutes.

V(g)$trb_pm <- player_stats$trb_pm

ggraph(g, layout="manual", node.positions=layout_igraph_v3(g))+
    geom_edge_link(colour = "grey")+
    geom_node_point(aes(color = trb_pm),size = 2)+
    scale_color_gradient(low="#104E8B", high="#CD2626")+
    theme_graph()+
    theme(legend.position="bottom")

Players with similar stats seem to neatly cluster together so that any well connected group
of players in the network describes a specific player type.
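
To get from the picture to actual lists of players, the groups can be extracted from the
graph object. The sketch below uses connected components via components() from igraph on a
made-up toy network; to split up a large component further, a community detection method
such as cluster_louvain() would be one option. This is an illustration under these
assumptions, not the analysis behind the plots above.

```r
library(igraph)

# Made-up toy network: a-b-c form one group, d-e another, f is isolated
A_toy <- matrix(0, 6, 6, dimnames = list(letters[1:6], letters[1:6]))
A_toy[cbind(c(1, 2, 4), c(2, 3, 5))] <- 1
A_toy <- A_toy + t(A_toy)
g_toy <- graph_from_adjacency_matrix(A_toy, mode = "undirected", diag = FALSE)

comp <- components(g_toy)
comp$no                                          # number of components
groups <- split(V(g_toy)$name, comp$membership)  # node names per component
groups
```

With the real network, split(V(g)$name, components(g)$membership) would list the players of
each component.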

We can now use this network to reason about team performance. Take the positions in the network
of the players of last year's NBA champions, the Golden State Warriors,

V(g)$tm <- ifelse(player_stats$tm=="GSW","GSW","other")

ggraph(g, layout="manual", node.positions=layout_igraph_v3(g))+
    geom_edge_link(colour = "grey")+
    geom_node_point(aes(color = tm),size = 2)+
    scale_color_manual(values=c("GSW"="#CD2626","other"="gray27"))+
    theme_graph()+
    theme(legend.position="bottom")

and the worst team, the Brooklyn Nets.

V(g)$tm <- ifelse(player_stats$tm=="BRK","BRK","other")

ggraph(g, layout="manual", node.positions=layout_igraph_v3(g))+
    geom_edge_link(colour = "grey")+
    geom_node_point(aes(color = tm),size = 2)+
    scale_color_manual(values=c("BRK"="#CD2626","other"="gray27"))+
    theme_graph()+
    theme(legend.position="bottom")

Most of the Golden State players are embedded in different groups, indicating that the team
has a very diverse set of players. The players of the Brooklyn Nets, on the other hand,
are closer together and do not fall into specific groups. They seem to lack players
with distinct and marked skills, which may explain their performance.
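
How diverse a roster is can also be approximated with a number, for instance the mean
pairwise distance between the players of a team in the embedded space. The following sketch
uses made-up data with two fictional teams; team_spread(), emb_toy and team_toy are
hypothetical names, and with the real data you would pass umap_player and player_stats$tm.

```r
# Made-up embedding: team "DIV" is spread out, team "SIM" is packed together
set.seed(7)
emb_toy <- rbind(
  matrix(rnorm(10 * 2, sd = 3.0), ncol = 2),   # "DIV": players far apart
  matrix(rnorm(10 * 2, sd = 0.3), ncol = 2)    # "SIM": players close together
)
team_toy <- rep(c("DIV", "SIM"), each = 10)

team_spread <- function(embedding, team) {
  # for each team, average the distances between all pairs of its players
  sapply(split(seq_len(nrow(embedding)), team), function(idx) {
    mean(dist(embedding[idx, , drop = FALSE]))
  })
}

spread <- team_spread(emb_toy, team_toy)
spread
```

A larger value means that the players of a team are less similar to each other. Whether this
relates to team success would of course need to be checked across all teams.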

A shiny app to analyze NBA similarity networks

If you are interested in different NBA seasons, teams or stats, I have built a little
Shiny application that allows you to explore interactive similarity networks back to 1990.
You can check the locations of your favorite players and teams and customize the
stats that should be shown on the network. The code for the app
can be found on GitHub. Since the networks are interactive, you need to install the
package visNetwork to run the app.
To run the app locally, use shiny::runGitHub("schochastics/NBASimNet").
