[This article was first published on   Opiate for the masses, and kindly contributed to R-bloggers].  (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
            < !--html_preserve-->
< !--/html_preserve-->
Databricks recently announced GraphFrames, awesome Spark extension to implement graph processing using DataFrames.
I performed graph analysis and visualized beautiful ball movement network of Golden State Warriors using rich data provided by NBA.com’s stats
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Pass network of Warriors
### Passes received & made The league’s MVP Stephen Curry received the most passes and the team’s MVP Draymond Green provides the most passes. We’ve seen most of the offense start with their pick & roll or Curry’s off-ball cuts with Green as a pass provider.inDegree
| id | inDegree | 
|---|---|
| CurryStephen | 3993 | 
| GreenDraymond | 3123 | 
| ThompsonKlay | 2276 | 
| LivingstonShaun | 1925 | 
| IguodalaAndre | 1814 | 
| BarnesHarrison | 1241 | 
| BogutAndrew | 1062 | 
| BarbosaLeandro | 946 | 
| SpeightsMarreese | 826 | 
| ClarkIan | 692 | 
| RushBrandon | 685 | 
| EzeliFestus | 559 | 
| McAdooJames Michael | 182 | 
| VarejaoAnderson | 67 | 
| LooneyKevon | 22 | 
outDegree
| id | outDegree | 
|---|---|
| GreenDraymond | 3841 | 
| CurryStephen | 3300 | 
| IguodalaAndre | 1896 | 
| LivingstonShaun | 1878 | 
| BogutAndrew | 1660 | 
| ThompsonKlay | 1460 | 
| BarnesHarrison | 1300 | 
| SpeightsMarreese | 795 | 
| RushBrandon | 772 | 
| EzeliFestus | 765 | 
| BarbosaLeandro | 758 | 
| ClarkIan | 597 | 
| McAdooJames Michael | 261 | 
| VarejaoAnderson | 94 | 
| LooneyKevon | 36 | 
Label Propagation
Label Propagation is an algorithm to find communities in a graph network. The algorithm nicely classifies players into backcourt and frontcourt without providing label!| name | label | 
|---|---|
| Thompson, Klay | 3 | 
| Barbosa, Leandro | 3 | 
| Curry, Stephen | 3 | 
| Clark, Ian | 3 | 
| Livingston, Shaun | 3 | 
| Rush, Brandon | 7 | 
| Green, Draymond | 7 | 
| Speights, Marreese | 7 | 
| Bogut, Andrew | 7 | 
| McAdoo, James Michael | 7 | 
| Iguodala, Andre | 7 | 
| Varejao, Anderson | 7 | 
| Ezeli, Festus | 7 | 
| Looney, Kevon | 7 | 
| Barnes, Harrison | 7 | 
Pagerank
PageRank can detect important nodes (players in this case) in a network. It’s no surprise that Stephen Curry, Draymond Green and Klay Thompson are the top three. The algoritm detects Shaun Livingston and Andre Iguodala play key roles in the Warriors’ passing games.| name | pagerank | 
|---|---|
| Curry, Stephen | 2.17 | 
| Green, Draymond | 1.99 | 
| Thompson, Klay | 1.34 | 
| Livingston, Shaun | 1.29 | 
| Iguodala, Andre | 1.21 | 
| Barnes, Harrison | 0.86 | 
| Bogut, Andrew | 0.77 | 
| Barbosa, Leandro | 0.72 | 
| Speights, Marreese | 0.66 | 
| Clark, Ian | 0.59 | 
| Rush, Brandon | 0.57 | 
| Ezeli, Festus | 0.48 | 
| McAdoo, James Michael | 0.27 | 
| Varejao, Anderson | 0.19 | 
| Looney, Kevon | 0.16 | 
Everything together
library(networkD3)
setwd('/Users/yuki/Documents/code_for_blog/gsw_passing_network')
passes <- read.csv("passes.csv")
groups <- read.csv("groups.csv")
size <- read.csv("size.csv")
passes$source <- as.numeric(as.factor(passes$PLAYER))-1
passes$target <- as.numeric(as.factor(passes$PASS_TO))-1
passes$PASS <- passes$PASS/50
groups$nodeid <- groups$name
groups$name <- as.numeric(as.factor(groups$name))-1
groups$group <- as.numeric(as.factor(groups$label))-1
nodes <- merge(groups,size[-1],by="id")
nodes$pagerank <- nodes$pagerank^2*100
forceNetwork(Links = passes,
             Nodes = nodes,
             Source = "source",
             Family = "Arial",
             colourScale = JS("d3.scale.category10()"),
             Target = "target",
             Value = "PASS",
             NodeID = "nodeid",
             Nodesize = "pagerank",
             linkDistance = 350,
             Group = "group", 
             opacity = 0.8,
             Size = 16,
             zoom = TRUE,
             opacityNoHover = TRUE)
- Node size: pagerank
- Node color: community
- Link width: passes received & made
Workflow
Calling API
I used the endpoint playerdashptpass and saved data for all the players in the team into local JSON files. The data is about who passed how many times in 2015-16 season# GSW player IDs
playerids = [201575,201578,2738,202691,101106,2760,2571,203949,203546,
203110,201939,203105,2733,1626172,203084]
# Calling API and store the results as JSON
for playerid in playerids:
    os.system('curl "http://stats.nba.com/stats/playerdashptpass?'
        'DateFrom=&'
        'DateTo=&'
        'GameSegment=&'
        'LastNGames=0&'
        'LeagueID=00&'
        'Location=&'
        'Month=0&'
        'OpponentTeamID=0&'
        'Outcome=&'
        'PerMode=Totals&'
        'Period=0&'
        'PlayerID={playerid}&'
        'Season=2015-16&'
        'SeasonSegment=&'
        'SeasonType=Regular+Season&'
        'TeamID=0&'
        'VsConference=&'
        'VsDivision=" > {playerid}.json'.format(playerid=playerid))
JSON -> Panda’s DataFrame
Then I combined all the individual JSON files into a single DataFrame for later aggregation.raw = pd.DataFrame()
for playerid in playerids:
    with open("{playerid}.json".format(playerid=playerid)) as json_file:
        parsed = json.load(json_file)['resultSets'][0]
        raw = raw.append(
            pd.DataFrame(parsed['rowSet'], columns=parsed['headers']))
raw = raw.rename(columns={'PLAYER_NAME_LAST_FIRST': 'PLAYER'})
raw['id'] = raw['PLAYER'].str.replace(', ', '')
Prepare vertices and edges
You need a special data format for GraphFrames in Spark, vertices and edges. Vertices are lis of nodes and IDs in a graph. Edges are the relathionship of the nodes. You can pass additional features like weight but I couldn’t find out a way to utilize there features well in later analysis. A workaround I took below is brute force and not even a proper graph operation but works (suggestions/comments are very welcome).# Make raw vertices
pandas_vertices = raw[['PLAYER', 'id']].drop_duplicates()
pandas_vertices.columns = ['name', 'id']
# Make raw edges
pandas_edges = pd.DataFrame()
for passer in raw['id'].drop_duplicates():
    for receiver in raw[(raw['PASS_TO'].isin(raw['PLAYER'])) &
     (raw['id'] == passer)]['PASS_TO'].drop_duplicates():
        pandas_edges = pandas_edges.append(pd.DataFrame(
        	{'passer': passer, 'receiver': receiver
        	.replace(  ', ', '')}, 
        	index=range(int(raw[(raw['id'] == passer) &
        	 (raw['PASS_TO'] == receiver)]['PASS'].values))))
pandas_edges.columns = ['src', 'dst']
Graph analysis
Bring the local vertices and edges to Spark and let it spark.vertices = sqlContext.createDataFrame(pandas_vertices)
edges = sqlContext.createDataFrame(pandas_edges)
# Analysis part
g = GraphFrame(vertices, edges)
print("vertices")
g.vertices.show()
print("edges")
g.edges.show()
print("inDegrees")
g.inDegrees.sort('inDegree', ascending=False).show()
print("outDegrees")
g.outDegrees.sort('outDegree', ascending=False).show()
print("degrees")
g.degrees.sort('degree', ascending=False).show()
print("labelPropagation")
g.labelPropagation(maxIter=5).show()
print("pageRank")
g.pageRank(resetProbability=0.15, tol=0.01).vertices.sort(
    'pagerank', ascending=False).show()
Visualise the network
When you run gsw_passing_network.py in my github repo, you have passes.csv, groups.csv and size.csv in your working directory. I used networkD3 package in R to make a cool interactive D3 chart.library(networkD3)
setwd('/Users/yuki/Documents/code_for_blog/gsw_passing_network')
passes <- read.csv("passes.csv")
groups <- read.csv("groups.csv")
size <- read.csv("size.csv")
passes$source <- as.numeric(as.factor(passes$PLAYER))-1
passes$target <- as.numeric(as.factor(passes$PASS_TO))-1
passes$PASS <- passes$PASS/50
groups$nodeid <- groups$name
groups$name <- as.numeric(as.factor(groups$name))-1
groups$group <- as.numeric(as.factor(groups$label))-1
nodes <- merge(groups,size[-1],by="id")
nodes$pagerank <- nodes$pagerank^2*100
forceNetwork(Links = passes,
             Nodes = nodes,
             Source = "source",
             Family = "Arial",
             colourScale = JS("d3.scale.category10()"),
             Target = "target",
             Value = "PASS",
             NodeID = "nodeid",
             Nodesize = "pagerank",
             linkDistance = 350,
             Group = "group", 
             opacity = 0.8,
             Size = 16,
             zoom = TRUE,
             opacityNoHover = TRUE)
Code
The full codes are available on github. < !-- htmlwidgets dependencies --> Analyzing Golden State Warriors’ passing network using GraphFrames in Spark was originally published by Kirill Pomogajko at Opiate for the masses on March 15, 2016.To leave a comment for the author, please follow the link and comment on their blog:  Opiate for the masses.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

