Analyzing Golden State Warriors’ passing network using GraphFrames in Spark

[This article was first published on Opiate for the masses, and kindly contributed to R-bloggers].


Databricks recently announced GraphFrames, an awesome Spark extension for graph processing on top of DataFrames.
I ran a graph analysis and visualized the Golden State Warriors’ beautiful ball-movement network using the rich data provided by NBA.com’s stats site.

Pass network of Warriors

Passes received & made

The league’s MVP, Stephen Curry, received the most passes, and the team’s MVP, Draymond Green, made the most passes.
Most of the offense starts with their pick & roll or with Curry’s off-ball cuts, with Green as the passer.

inDegree

id                    inDegree
CurryStephen          3993
GreenDraymond         3123
ThompsonKlay          2276
LivingstonShaun       1925
IguodalaAndre         1814
BarnesHarrison        1241
BogutAndrew           1062
BarbosaLeandro        946
SpeightsMarreese      826
ClarkIan              692
RushBrandon           685
EzeliFestus           559
McAdooJames Michael   182
VarejaoAnderson       67
LooneyKevon           22

outDegree

id                    outDegree
GreenDraymond         3841
CurryStephen          3300
IguodalaAndre         1896
LivingstonShaun       1878
BogutAndrew           1660
ThompsonKlay          1460
BarnesHarrison        1300
SpeightsMarreese      795
RushBrandon           772
EzeliFestus           765
BarbosaLeandro        758
ClarkIan              597
McAdooJames Michael   261
VarejaoAnderson       94
LooneyKevon           36
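
To make the two tables concrete: since every pass is stored as its own edge, a player's inDegree is simply the number of passes received and the outDegree the number of passes made. The following is a minimal plain-Python sketch of that counting, not the GraphFrames implementation. The pass counts in it are made up for illustration and are not taken from the NBA data.

```python
from collections import Counter

# Hypothetical (passer, receiver, passes) triples; not real NBA numbers.
passes = [
    ("GreenDraymond", "CurryStephen", 5),
    ("CurryStephen", "GreenDraymond", 3),
    ("CurryStephen", "ThompsonKlay", 2),
    ("ThompsonKlay", "CurryStephen", 1),
]

out_degree = Counter()
in_degree = Counter()
for passer, receiver, count in passes:
    out_degree[passer] += count    # passes made
    in_degree[receiver] += count   # passes received

# Players ordered by passes received, with passes made alongside
for player, received in in_degree.most_common():
    print(player, received, out_degree[player])
```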

Label Propagation

Label Propagation is an algorithm for finding communities in a graph.
It nicely splits the players into roughly backcourt and frontcourt groups without being given any labels!

name                    label
Thompson, Klay          3
Barbosa, Leandro        3
Curry, Stephen          3
Clark, Ian              3
Livingston, Shaun       3
Rush, Brandon           7
Green, Draymond         7
Speights, Marreese      7
Bogut, Andrew           7
McAdoo, James Michael   7
Iguodala, Andre         7
Varejao, Anderson       7
Ezeli, Festus           7
Looney, Kevon           7
Barnes, Harrison        7
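
For readers who want a feel for how label propagation arrives at such groups, here is a small plain-Python sketch. It is an assumption-laden toy, not what GraphFrames runs internally: labels are updated in place in a fixed node order, ties go to the alphabetically smallest label, and the graph (two tight triangles joined by one weak link) is hypothetical.

```python
from collections import defaultdict

# Hypothetical undirected toy graph: two tight triangles joined by one weak link.
edges = [
    ("a", "b", 5), ("a", "c", 5), ("b", "c", 5),
    ("d", "e", 5), ("d", "f", 5), ("e", "f", 5),
    ("c", "d", 1),
]

neighbours = defaultdict(dict)
for u, v, w in edges:
    neighbours[u][v] = w
    neighbours[v][u] = w


def label_propagation(neighbours, max_iter=5):
    # Every node starts in its own community.
    labels = {node: node for node in neighbours}
    for _ in range(max_iter):
        changed = False
        for node in sorted(neighbours):
            # Sum the edge weights behind each label seen among the neighbours.
            scores = defaultdict(int)
            for other, weight in neighbours[node].items():
                scores[labels[other]] += weight
            # Heaviest label wins; ties go to the alphabetically smallest label.
            best = min(scores, key=lambda lab: (-scores[lab], lab))
            if best != labels[node]:
                labels[node] = best
                changed = True
        if not changed:
            break
    return labels


communities = label_propagation(neighbours)
print(communities)
```

Each triangle ends up sharing one label, because the weak link between them never outweighs the heavy links inside a triangle.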

PageRank

PageRank can detect important nodes (players in this case) in a network.
It’s no surprise that Stephen Curry, Draymond Green and Klay Thompson are the top three.
The algorithm also detects that Shaun Livingston and Andre Iguodala play key roles in the Warriors’ passing game.

name                    pagerank
Curry, Stephen          2.17
Green, Draymond         1.99
Thompson, Klay          1.34
Livingston, Shaun       1.29
Iguodala, Andre         1.21
Barnes, Harrison        0.86
Bogut, Andrew           0.77
Barbosa, Leandro        0.72
Speights, Marreese      0.66
Clark, Ian              0.59
Rush, Brandon           0.57
Ezeli, Festus           0.48
McAdoo, James Michael   0.27
Varejao, Anderson       0.19
Looney, Kevon           0.16
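
As a rough picture of what PageRank computes, the sketch below runs a plain power iteration on a tiny, hypothetical pass graph. It assumes a reset probability of 0.15, splits each player's rank among receivers in proportion to pass counts, and normalizes the scores to sum to 1, so the scale may differ from the table above. It does not handle players with no outgoing passes and is not the GraphFrames implementation.

```python
# Hypothetical pass counts between three players; not real NBA numbers.
passes = {
    ("A", "B"): 4,
    ("B", "A"): 2,
    ("B", "C"): 2,
    ("C", "A"): 1,
}


def pagerank(passes, reset_probability=0.15, iterations=50):
    nodes = sorted({n for edge in passes for n in edge})
    # Total passes made by each player, used to split their rank.
    out_total = {n: 0 for n in nodes}
    for (src, _), count in passes.items():
        out_total[src] += count

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Every player gets the reset share, then rank flows along passes.
        new_rank = {n: reset_probability / len(nodes) for n in nodes}
        for (src, dst), count in passes.items():
            share = rank[src] * count / out_total[src]
            new_rank[dst] += (1 - reset_probability) * share
        rank = new_rank
    return rank


ranks = pagerank(passes)
for player, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(player, round(score, 3))
```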

Everything together

Here is a network visualization that combines the results above.

  • Node size: pagerank
  • Node color: community
  • Link width: passes received & made

Workflow

Calling API

I used the playerdashptpass endpoint and saved the data for every player on the team into local JSON files.
The data records how many times each player passed to each teammate during the 2015-16 season.

import os

# GSW player IDs
playerids = [201575,201578,2738,202691,101106,2760,2571,203949,203546,
203110,201939,203105,2733,1626172,203084]

# Call the API and store each player's result as a JSON file
for playerid in playerids:
    os.system('curl "http://stats.nba.com/stats/playerdashptpass?'
        'DateFrom=&'
        'DateTo=&'
        'GameSegment=&'
        'LastNGames=0&'
        'LeagueID=00&'
        'Location=&'
        'Month=0&'
        'OpponentTeamID=0&'
        'Outcome=&'
        'PerMode=Totals&'
        'Period=0&'
        'PlayerID={playerid}&'
        'Season=2015-16&'
        'SeasonSegment=&'
        'SeasonType=Regular+Season&'
        'TeamID=0&'
        'VsConference=&'
        'VsDivision=" > {playerid}.json'.format(playerid=playerid))
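
The query string above is assembled by hand. As an alternative sketch, the same URL can be built with the standard library's urlencode, which keeps the parameters in one readable list. The helper name build_url is hypothetical, the parameter names and values are copied from the curl call above, and whether the endpoint still accepts them has not been checked here.

```python
from urllib.parse import urlencode

BASE_URL = "http://stats.nba.com/stats/playerdashptpass"


def build_url(playerid, season="2015-16"):
    # Same parameters, in the same order, as the curl command in this post.
    params = [
        ("DateFrom", ""), ("DateTo", ""), ("GameSegment", ""),
        ("LastNGames", 0), ("LeagueID", "00"), ("Location", ""),
        ("Month", 0), ("OpponentTeamID", 0), ("Outcome", ""),
        ("PerMode", "Totals"), ("Period", 0), ("PlayerID", playerid),
        ("Season", season), ("SeasonSegment", ""),
        ("SeasonType", "Regular Season"), ("TeamID", 0),
        ("VsConference", ""), ("VsDivision", ""),
    ]
    # urlencode turns the space in "Regular Season" into "+"
    return BASE_URL + "?" + urlencode(params)


url = build_url(201939)
print(url)
```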

JSON -> pandas DataFrame

Then I combined all the individual JSON files into a single DataFrame for later aggregation.

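import json
import pandas as pd

# Read each player's JSON file and collect one DataFrame per player
frames = []
for playerid in playerids:
    with open("{playerid}.json".format(playerid=playerid)) as json_file:
        parsed = json.load(json_file)['resultSets'][0]
        frames.append(
            pd.DataFrame(parsed['rowSet'], columns=parsed['headers']))

# DataFrame.append was removed in pandas 2.0, so concatenate once instead
raw = pd.concat(frames)

raw = raw.rename(columns={'PLAYER_NAME_LAST_FIRST': 'PLAYER'})

# Build a compact id such as "CurryStephen" from "Curry, Stephen"
raw['id'] = raw['PLAYER'].str.replace(', ', '')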
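
To show the shape the loop above relies on, here is a small standard-library sketch that pairs the headers list with each entry of rowSet. The payload is a hypothetical, heavily trimmed stand-in with made-up values. Only the resultSets, headers and rowSet keys and the column names PLAYER_NAME_LAST_FIRST, PASS_TO and PASS come from the code in this post.

```python
import json

# Hypothetical, trimmed payload in the resultSets/headers/rowSet shape;
# the values are made up.
payload = json.loads("""
{"resultSets": [{
    "headers": ["PLAYER_NAME_LAST_FIRST", "PASS_TO", "PASS"],
    "rowSet": [
        ["Curry, Stephen", "Green, Draymond", 3],
        ["Curry, Stephen", "Thompson, Klay", 2]
    ]
}]}
""")

result = payload["resultSets"][0]
# Pair each row with the column headers to get one dict per row
rows = [dict(zip(result["headers"], row)) for row in result["rowSet"]]
for row in rows:
    # Same id convention as above: drop the ", " between last and first name
    row["id"] = row["PLAYER_NAME_LAST_FIRST"].replace(", ", "")
print(rows[0])
```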

Prepare vertices and edges

GraphFrames in Spark needs the data in a specific format: vertices and edges.
Vertices are the list of nodes in the graph, each with an ID.
Edges are the relationships between the nodes.
You can attach additional features such as weights, but I couldn’t find a way to make good use of these features in the later analysis.
The workaround below is brute force and not even a proper graph operation, but it works (suggestions/comments are very welcome).

# Make raw vertices
pandas_vertices = raw[['PLAYER', 'id']].drop_duplicates()
pandas_vertices.columns = ['name', 'id']

# Make raw edges: keep only passes between the listed players, then repeat
# each (passer, receiver) pair once per pass so edge counts act as weights
teammates = raw[raw['PASS_TO'].isin(raw['PLAYER'])]
edge_frames = []
for _, row in teammates.iterrows():
    edge_frames.append(pd.DataFrame(
        {'passer': row['id'],
         'receiver': row['PASS_TO'].replace(', ', '')},
        index=range(int(row['PASS']))))

# DataFrame.append was removed in pandas 2.0, so concatenate once instead
pandas_edges = pd.concat(edge_frames)
pandas_edges.columns = ['src', 'dst']
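
The brute-force idea is easier to see on a toy example: each weighted edge is expanded into one row per pass, and counting the duplicated rows gives the weights back. The sketch below uses plain Python and made-up pass counts. The trade-off is that the edge list grows with the total number of passes rather than with the number of player pairs.

```python
from collections import Counter

# Hypothetical weighted edges (src, dst, passes); not real NBA numbers.
weighted_edges = [
    ("CurryStephen", "GreenDraymond", 3),
    ("GreenDraymond", "CurryStephen", 2),
]

# Brute-force expansion: one (src, dst) row per pass.
expanded = [(src, dst) for src, dst, n in weighted_edges for _ in range(n)]

# Counting the duplicated rows recovers the original weights.
recovered = Counter(expanded)
print(len(expanded), recovered[("CurryStephen", "GreenDraymond")])
```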

Graph analysis

Bring the local vertices and edges to Spark and let it spark.

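from graphframes import GraphFrame

# sqlContext is assumed to exist already, as it does in a pyspark shell
vertices = sqlContext.createDataFrame(pandas_vertices)
edges = sqlContext.createDataFrame(pandas_edges)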

# Analysis part
g = GraphFrame(vertices, edges)
print("vertices")
g.vertices.show()
print("edges")
g.edges.show()
print("inDegrees")
g.inDegrees.sort('inDegree', ascending=False).show()
print("outDegrees")
g.outDegrees.sort('outDegree', ascending=False).show()
print("degrees")
g.degrees.sort('degree', ascending=False).show()
print("labelPropagation")
g.labelPropagation(maxIter=5).show()
print("pageRank")
g.pageRank(resetProbability=0.15, tol=0.01).vertices.sort(
    'pagerank', ascending=False).show()

Visualise the network

Running gsw_passing_network.py from my GitHub repo writes passes.csv, groups.csv and size.csv to your working directory.
I used the networkD3 package in R to make a cool interactive D3 chart.

library(networkD3)

# Point this at the directory that holds the three CSV files
setwd('/Users/yuki/Documents/code_for_blog/gsw_passing_network')
passes <- read.csv("passes.csv")
groups <- read.csv("groups.csv")
size <- read.csv("size.csv")

# forceNetwork expects zero-based node indices for the links
passes$source <- as.numeric(as.factor(passes$PLAYER))-1
passes$target <- as.numeric(as.factor(passes$PASS_TO))-1
# Scale the pass counts down so the link widths stay readable
passes$PASS <- passes$PASS/50

groups$nodeid <- groups$name
groups$name <- as.numeric(as.factor(groups$name))-1
groups$group <- as.numeric(as.factor(groups$label))-1
nodes <- merge(groups,size[-1],by="id")
# Square the PageRank scores to exaggerate differences in node size
nodes$pagerank <- nodes$pagerank^2*100

# Note: d3.scale.category10() is D3 v3 syntax; networkD3 releases built on
# D3 v4 may need JS("d3.scaleOrdinal(d3.schemeCategory10);") instead
forceNetwork(Links = passes,
             Nodes = nodes,
             Source = "source",
             fontFamily = "Arial",
             colourScale = JS("d3.scale.category10()"),
             Target = "target",
             Value = "PASS",
             NodeID = "nodeid",
             Nodesize = "pagerank",
             linkDistance = 350,
             Group = "group",
             opacity = 0.8,
             fontSize = 16,
             zoom = TRUE,
             opacityNoHover = TRUE)
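
The data preparation in the R script boils down to two steps: mapping player names to zero-based indices and rescaling the values. Here is a rough Python sketch of those two steps under stated assumptions. The two PageRank scores are copied from the table in this post, the pass count is made up, and sorting names in Python is assumed to give the same order as R's as.factor, which may not hold in every locale.

```python
# Two PageRank scores copied from the table in this post, plus one
# hypothetical pass count (not a real NBA number).
pagerank = {"Curry, Stephen": 2.17, "Looney, Kevon": 0.16}
links = [("Curry, Stephen", "Looney, Kevon", 100)]

# Zero-based index per player, sorted by name (similar in spirit to
# as.numeric(as.factor(x)) - 1, assuming both sort the names the same way).
index = {name: i for i, name in enumerate(sorted(pagerank))}

# Node size: squared PageRank times 100, as in the R script.
nodes = [
    {"nodeid": name, "index": index[name], "size": score ** 2 * 100}
    for name, score in sorted(pagerank.items())
]
# Link value: pass count divided by 50, as in the R script.
edges = [
    {"source": index[s], "target": index[t], "value": n / 50}
    for s, t, n in links
]
print(nodes)
print(edges)
```

Squaring stretches the gap between players: a PageRank ratio of roughly 14 to 1 between Curry and Looney becomes a node-size ratio of well over 100 to 1.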

Code

The full code is available on GitHub.


Analyzing Golden State Warriors’ passing network using GraphFrames in Spark was originally published by Kirill Pomogajko at Opiate for the masses on March 15, 2016.
