All the (NBA) box scores you ever wanted

[This article was first published on R – Statistical Odds & Ends, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In this previous post, I showed how one can scrape top-level NBA game data from BasketballReference.com. In the post after that, I demonstrated how to scrape play-by-play data for one game. After writing those posts, I thought to myself: why not do both? And that is what I did: scrape all the box scores for the 2017-18 NBA season and save them to an RDS object.

The code for web scraping can be found here. I don’t think it’s particularly insightful to go through the details line-by-line… What I’ll do instead is explain what is in the saved RDS object which will allow you to make use of the data, and I’ll end off with some visualizations to demonstrate what one can do with this data, specifically:

  • Do Kevin Durant and Steph Curry get in each other’s way?
  • Are the Golden State Warriors really better in the 3rd quarter?

(Code that I wrote for these questions can be found here.) Let’s load the packages and data (data available as links from this page):

library(tidyverse)
game_df <- readRDS("NBA-2018_game_data.rds")
master_list <- readRDS("NBA-2018_box_score.rds")

First, let’s take a quick look at game_df which contains the top-level game data:

head(game_df)
#>        game_id  date_game game_start_time visitor_team_name visitor_pts        home_team_name home_pts
#> 1 201710170CLE 2017-10-17           8:01p    Boston Celtics          99   Cleveland Cavaliers      102
#> 2 201710170GSW 2017-10-17          10:30p   Houston Rockets         122 Golden State Warriors      121
#> 3 201710180BOS 2017-10-18           7:30p   Milwaukee Bucks         108        Boston Celtics      100
#> 4 201710180DAL 2017-10-18           8:30p     Atlanta Hawks         117      Dallas Mavericks      111
#> 5 201710180DET 2017-10-18           7:00p Charlotte Hornets          90       Detroit Pistons      102
#> 6 201710180IND 2017-10-18           7:00p     Brooklyn Nets         131        Indiana Pacers      140
#>   overtimes attendance game_remarks game_type
#> 1                20562                Regular
#> 2                19596                Regular
#> 3                18624                Regular
#> 4                19709                Regular
#> 5                20491                Regular
#> 6                15008                Regular

The first column, game_id, is BasketBallReference.com’s way of uniquely identifying a game. For example, if we wanted to see the results of the game between the Celtics and the Cavs on 17 Oct 2017, we would look up the game_id (201710170CLE) and go to the URL https://www.basketball-reference.com/boxscores/201710170CLE.html. The game_id is how I was able to scrape the box scores for all the games.

Next, let’s look at the master_list object which contains all the box score information:

typeof(master_list)
#> [1] "list"
length(master_list)
#> [1] 1312
head(names(master_list))
#> [1] "201710170CLE" "201710170GSW" "201710180BOS" "201710180DAL" "201710180DET" "201710180IND"

It is a list of length 1312, which corresponds to the total number of games (Regular Season and Playoffs) in the 2017-18 season. The keys of the list are the game_ids, and (as we will see shortly) each element of the list is itself a list of length 5. Let’s take a closer look at the first element of the list:

str(master_list[[1]])
#> List of 5
#>  $ visitor_basic_boxscore:'data.frame':	13 obs. of  22 variables:
#>   ..$ Player: chr [1:13] "Jaylen Brown" "Kyrie Irving" "Jayson Tatum" "Al Horford" ...
#>   ..$ MP    : chr [1:13] "39:36" "39:21" "36:32" "32:07" ...
#>   ..$ FG    : num [1:13] 11 8 5 2 1 5 2 2 0 0 ...
#>   ..$ FGA   : num [1:13] 23 17 12 7 2 16 6 2 2 1 ...
#>   ..$ FG%   : num [1:13] 0.478 0.471 0.417 0.286 0.5 0.313 0.333 1 0 0 ...
#>   ..$ 3P    : num [1:13] 2 4 1 0 0 0 1 0 0 0 ...
#>   ..$ 3PA   : num [1:13] 9 9 2 2 1 4 3 0 1 1 ...
#>   ..$ 3P%   : num [1:13] 0.222 0.444 0.5 0 0 0 0.333 NA 0 0 ...
#>   ..$ FT    : num [1:13] 1 2 3 5 0 2 4 2 0 0 ...
#>   ..$ FTA   : num [1:13] 2 2 3 7 0 3 4 4 0 0 ...
#>   ..$ FT%   : num [1:13] 0.5 1 1 0.714 NA 0.667 1 0.5 NA NA ...
#>   ..$ ORB   : num [1:13] 1 2 4 0 0 0 0 2 0 0 ...
#>   ..$ DRB   : num [1:13] 5 2 6 7 1 9 3 3 0 1 ...
#>   ..$ TRB   : num [1:13] 6 4 10 7 1 9 3 5 0 1 ...
#>   ..$ AST   : num [1:13] 0 10 3 5 0 3 2 1 0 0 ...
#>   ..$ STL   : num [1:13] 2 3 0 0 0 2 4 0 0 0 ...
#>   ..$ BLK   : num [1:13] 0 0 0 1 0 2 0 1 0 0 ...
#>   ..$ TOV   : num [1:13] 3 2 1 0 0 2 0 2 0 0 ...
#>   ..$ PF    : num [1:13] 5 4 4 2 1 2 0 5 1 0 ...
#>   ..$ PTS   : num [1:13] 25 22 14 9 2 12 9 6 0 0 ...
#>   ..$ +/-   : num [1:13] -5 -1 6 8 3 -8 4 -14 -10 2 ...
#>   ..$ Role  : chr [1:13] "Starter" "Starter" "Starter" "Starter" ...
#>  $ visitor_adv_boxscore  :'data.frame':	13 obs. of  17 variables:
#>   ..$ Player: chr [1:13] "Jaylen Brown" "Kyrie Irving" "Jayson Tatum" "Al Horford" ...
#>   ..$ MP    : chr [1:13] "39:36" "39:21" "36:32" "32:07" ...
#>   ..$ TS%   : num [1:13] 0.523 0.615 0.526 0.446 0.5 0.346 0.58 0.798 0 0 ...
#>   ..$ eFG%  : num [1:13] 0.522 0.588 0.458 0.286 0.5 0.313 0.417 1 0 0 ...
#>   ..$ 3PAr  : num [1:13] 0.391 0.529 0.167 0.286 0.5 0.25 0.5 0 0.5 1 ...
#>   ..$ FTr   : num [1:13] 0.087 0.118 0.25 1 0 0.188 0.667 2 0 0 ...
#>   ..$ ORB%  : num [1:13] 2.4 4.9 10.5 0 0 0 0 10.1 0 0 ...
#>   ..$ DRB%  : num [1:13] 13.2 5.3 17.1 22.7 19.9 26.8 16 16.4 0 21.7 ...
#>   ..$ TRB%  : num [1:13] 7.6 5.1 13.7 10.9 9.5 12.8 7.7 13.1 0 10.4 ...
#>   ..$ AST%  : num [1:13] 0 46.5 13.4 22.6 0 14.1 15.8 8.1 0 0 ...
#>   ..$ STL%  : num [1:13] 2.4 3.7 0 0 0 2.8 9.9 0 0 0 ...
#>   ..$ BLK%  : num [1:13] 0 0 0 2.5 0 4.5 0 4.1 0 0 ...
#>   ..$ TOV%  : num [1:13] 11.2 10.1 7 0 0 10.4 0 34.7 0 0 ...
#>   ..$ USG%  : num [1:13] 29.9 22.2 17.3 13.8 16.8 24.3 17.5 13.3 10.2 9.1 ...
#>   ..$ ORtg  : num [1:13] 89 124 115 111 95 70 129 110 0 0 ...
#>   ..$ DRtg  : num [1:13] 104 103 108 105 107 96 87 105 112 107 ...
#>   ..$ Role  : chr [1:13] "Starter" "Starter" "Starter" "Starter" ...
#>  $ home_basic_boxscore   :'data.frame':	14 obs. of  22 variables:
#>   ..$ Player: chr [1:14] "LeBron James" "Jae Crowder" "Derrick Rose" "Dwyane Wade" ...
#>   ..$ MP    : chr [1:14] "41:12" "34:44" "31:15" "28:30" ...
#>   ..$ FG    : num [1:14] 12 3 5 3 4 4 2 3 2 0 ...
#>   ..$ FGA   : num [1:14] 19 10 14 10 9 7 3 8 3 0 ...
#>   ..$ FG%   : num [1:14] 0.632 0.3 0.357 0.3 0.444 0.571 0.667 0.375 0.667 NA ...
#>   ..$ 3P    : num [1:14] 1 1 1 0 1 1 0 0 0 0 ...
#>   ..$ 3PA   : num [1:14] 5 5 3 1 4 3 0 1 0 0 ...
#>   ..$ 3P%   : num [1:14] 0.2 0.2 0.333 0 0.25 0.333 NA 0 NA NA ...
#>   ..$ FT    : num [1:14] 4 4 3 2 6 1 1 0 0 0 ...
#>   ..$ FTA   : num [1:14] 4 4 4 2 7 1 3 0 0 0 ...
#>   ..$ FT%   : num [1:14] 1 1 0.75 1 0.857 1 0.333 NA NA NA ...
#>   ..$ ORB   : num [1:14] 1 1 1 1 3 0 1 0 1 0 ...
#>   ..$ DRB   : num [1:14] 15 4 3 1 8 4 5 0 1 0 ...
#>   ..$ TRB   : num [1:14] 16 5 4 2 11 4 6 0 2 0 ...
#>   ..$ AST   : num [1:14] 9 2 2 3 0 1 2 0 0 0 ...
#>   ..$ STL   : num [1:14] 0 2 0 0 0 0 0 0 1 0 ...
#>   ..$ BLK   : num [1:14] 2 0 0 2 0 0 0 0 0 0 ...
#>   ..$ TOV   : num [1:14] 4 1 2 4 2 0 2 1 1 0 ...
#>   ..$ PF    : num [1:14] 3 2 2 1 2 4 3 3 3 2 ...
#>   ..$ PTS   : num [1:14] 29 11 14 8 15 10 5 6 4 0 ...
#>   ..$ +/-   : num [1:14] 2 7 -7 0 1 7 2 -2 6 -1 ...
#>   ..$ Role  : chr [1:14] "Starter" "Starter" "Starter" "Starter" ...
#>  $ home_adv_boxscore     :'data.frame':	14 obs. of  17 variables:
#>   ..$ Player: chr [1:14] "LeBron James" "Jae Crowder" "Derrick Rose" "Dwyane Wade" ...
#>   ..$ MP    : chr [1:14] "41:12" "34:44" "31:15" "28:30" ...
#>   ..$ TS%   : num [1:14] 0.698 0.468 0.444 0.368 0.621 0.672 0.579 0.375 0.667 NA ...
#>   ..$ eFG%  : num [1:14] 0.658 0.35 0.393 0.3 0.5 0.643 0.667 0.375 0.667 NA ...
#>   ..$ 3PAr  : num [1:14] 0.263 0.5 0.214 0.1 0.444 0.429 0 0.125 0 NA ...
#>   ..$ FTr   : num [1:14] 0.211 0.4 0.286 0.2 0.778 0.143 1 0 0 NA ...
#>   ..$ ORB%  : num [1:14] 2.5 3 3.3 3.7 11 0 5.3 0 8.1 0 ...
#>   ..$ DRB%  : num [1:14] 35 11.1 9.2 3.4 27 17.5 24.5 0 7.5 0 ...
#>   ..$ TRB%  : num [1:14] 19.4 7.2 6.4 3.5 19.4 9.1 15.3 0 7.8 0 ...
#>   ..$ AST%  : num [1:14] 43.7 8.2 10.1 15.3 0 7.5 14.8 0 0 0 ...
#>   ..$ STL%  : num [1:14] 0 2.8 0 0 0 0 0 0 3.8 0 ...
#>   ..$ BLK%  : num [1:14] 4.2 0 0 6 0 0 0 0 0 0 ...
#>   ..$ TOV%  : num [1:14] 16.2 7.8 11.3 26.9 14.2 0 31.6 11.1 25 NA ...
#>   ..$ USG%  : num [1:14] 26 15.9 24.6 22.6 21.4 14.7 13.9 27.3 13.5 0 ...
#>   ..$ ORtg  : num [1:14] 126 100 88 67 117 138 92 67 102 0 ...
#>   ..$ DRtg  : num [1:14] 94 98 105 102 100 102 100 107 96 106 ...
#>   ..$ Role  : chr [1:14] "Starter" "Starter" "Starter" "Starter" ...
#>  $ pdp_df                :'data.frame':	114 obs. of  4 variables:
#>   ..$ time   : num [1:114] 16 72 88 99 131 151 166 191 191 238 ...
#>   ..$ visitor: num [1:114] 2 4 4 4 4 4 6 6 6 8 ...
#>   ..$ home   : num [1:114] 0 0 1 3 5 7 7 8 9 9 ...
#>   ..$ period : num [1:114] 1 1 1 1 1 1 1 1 1 1 ...

The first 4 elements of the list (visitor_basic_boxscore, visitor_adv_boxscore, home_basic_boxscore, home_basic_boxscore) are data frames for the basic and advanced box scores of both teams. They are the box scores you see on BasketBallReference.com (with some minor preprocessing):

Box scores from BasketballReference.com.

The last element of the list is a data frame containing play-by-play information, much like that in the previous post.

Hopefully some of you will take this data and make some cool data visualizations from it! Below, I will walk through how I use this dataset for two visualizations.

Do Kevin Durant and Steph Curry get in each other’s way?

Kevin Durant and Steph Curry are two of the most lethal scorers in today’s game. Does both of them being on the same team hamper the other’s production? This is pretty difficult to pin down what we mean by this statement precisely; we will instead make do with a couple of suggestive plots.

The first plot we can make is a scatterplot of the points scored by these two players. If the relationship is positively correlated, it could mean that they make each other better.

Let’s pull out the game_ids for Golden State:

team_name <- "Golden State Warriors"
player1 <- "Stephen Curry"
player2 <- "Kevin Durant"

# get game_ids
game_rows <- which(game_df$visitor_team_name == team_name |
                       game_df$home_team_name == team_name)
game_ids <- game_df[game_rows, "game_id"]

Next, we set up a data frame consisting of 4 columns: the game_id, game_type (regular or playoff), and points scored by players 1 and 2 respectively.

points_df <- data.frame(matrix(NA, ncol = 4, nrow = length(game_ids)))
names(points_df) <- c("game_id", "game_type", player1, player2)

The loop below builds up the data frame row by row (not the most R-like way of programming but it gets the job done…):

for (i in 1:nrow(points_df)) {
    id <- game_ids[i]
    points_df[i, 1] <- id
    points_df[i, 2] <- game_df[game_rows[i], "game_type"]
    
    # get the correct basic box score
    if (game_df[game_rows[i], "visitor_team_name"] == team_name) {
        boxscore <- master_list[[id]]$visitor_basic_boxscore
    } else {
        boxscore <- master_list[[id]]$home_basic_boxscore
    }
    
    # get player points
    if (player1 %in% boxscore$Player) {
        points_df[i, 3] <- subset(boxscore, Player == player1)$PTS
    }
    if (player2 %in% boxscore$Player) {
        points_df[i, 4] <- subset(boxscore, Player == player2)$PTS } } head(points_df) 
#>        game_id game_type Stephen Curry Kevin Durant
#> 1 201710170GSW   Regular            22           20
#> 2 201710200NOP   Regular            28           22
#> 3 201710210MEM   Regular            37           29
#> 4 201710230DAL   Regular            29           25
#> 5 201710250GSW   Regular            30           29
#> 6 201710270GSW   Regular            20           31

Next, we make the scatterplot (notice how we can use the get function to refer to the columns programmatically). We force the axes to start at 0 and the aspect ratio to be 1, and include the 45 degree line for reference.

ggplot(data = points_df, aes(x = get(player1), y = get(player2))) +
    geom_point(alpha = 0.5) +
    geom_smooth(se = FALSE) +
    geom_abline(slope = 1, intercept = 0, lwd = 0.5, col = "red", lty = 2) +
    scale_x_continuous(limits = c(0, NA)) +
    scale_y_continuous(limits = c(0, NA)) +
    labs(x = player1, y = player2, title = "Points scored") +
    coord_fixed()

One might say, wouldn’t it be better to compare games where KD and Curry play, vs. games where only KD plays? If KD scores more if he is playing with Curry, that could suggest that playing together makes him more productive. To make these plots, we introduce two new columns which are Boolean variables representing whether a given player played in that game or not. We then make boxplots to compare the points scored depending on whether the other player played.

# ignore games where both didn't play
points_df2 <- points_df[!(is.na(points_df[[3]]) & is.na(points_df[[4]])), ]
points_df2[[paste(player1, "played")]] <- !is.na(points_df2[[3]])
points_df2[[paste(player2, "played")]] <- !is.na(points_df2[[4]])

# boxplots
ggplot(data = points_df2, aes(x = get(paste(player1, "played")), y = get(player2))) +
    geom_boxplot() +
    geom_jitter(height = 0, alpha = 0.5, col = "blue") +
    labs(x = paste(player1, "played?"), y = "Points scored", 
         title = paste("Points", player2, "scored"))
ggplot(data = points_df2, aes(x = get(paste(player2, "played")), y = get(player1))) +
    geom_boxplot() +
    geom_jitter(height = 0, alpha = 0.5, col = "blue") +
    labs(x = paste(player2, "played?"), y = "Points scored", 
         title = paste("Points", player1, "scored"))

From the plots, it looks like each player scores a bit more when the other person is not playing, although there is quite a bit of variance.

Instead of looking at points scored, we could look at more sophisticated measures of production. For example, the NBA computes individual offensive and defensive ratings for each game based on a complicated formula (see here for details). These measures are provided as ORtg and DRtg in the advanced box score.

The code below pulls out the ORtg and DRtg values for KD and Curry:

points_df <- data.frame(matrix(NA, ncol = 6, nrow = length(game_ids)))
names(points_df) <- c("game_id", "game_type", 
                      sapply(c(player1, player2), function(x) paste(x, "ORtg")),
                      sapply(c(player1, player2), function(x) paste(x, "DRtg")))

for (i in 1:nrow(points_df)) {
    id <- game_ids[i]
    points_df[i, 1] <- id
    points_df[i, 2] <- game_df[game_rows[i], "game_type"]
    
    # get the correct advanced box score
    if (game_df[game_rows[i], "visitor_team_name"] == team_name) {
        boxscore <- master_list[[id]]$visitor_adv_boxscore
    } else {
        boxscore <- master_list[[id]]$home_adv_boxscore
    }
    
    # get player ORtg & DRtg
    if (player1 %in% boxscore$Player) {
        points_df[i, 3] <- subset(boxscore, Player == player1)$ORtg
        points_df[i, 5] <- subset(boxscore, Player == player1)$DRtg
    }
    if (player2 %in% boxscore$Player) {
        points_df[i, 4] <- subset(boxscore, Player == player2)$ORtg
        points_df[i, 6] <- subset(boxscore, Player == player2)$DRtg
    }
}

And here are the plots (along with a linear fit):


This was pretty interesting to me: it looks like the ORtgs are independent of each other while the DRtgs are highly correlated with each other. Might this suggest that defense is more of a team activity, while offense can be driven by individuals?

There are of course many other visualizations we could do to explore this question. We might also want to see how the KD-Curry pairing compares with other pairings. Because of the way the code is written, it’s really easy to generate these figures for other pairings: we just need to update the team_name, player1 and player2 variables and rerun the code. As an example, here are the ORtg and DRtg scatterplots for Damian Lillard and C.J. McCollum of the Portland Trail Blazers:


We see similar trends here.

Are the Golden State Warriors really better in the 3rd quarter?

Last season, it seemed like the Warriors were often down at the half, only to come roaring back in the third quarter. Is this really their best quarter? (This NYT article says yes and suggests why that might be the case.) To answer this question, we’ll make a lead tracker graphic (much like the previous post) but (i) we will do it for all games GSW played in, and (ii) we reset the score differential to 0 at the beginning of each period.

We first set up a data frame that records the game_id, whether the team was at home, and whether they won (we might want to use latter two variables in our plotting later):

team_name <- "Golden State Warriors"

game_rows <- which(game_df$visitor_team_name == team_name |
                       game_df$home_team_name == team_name)
game_ids <- game_df[game_rows, "game_id"]

# set up game data frame and play-by-play list
df <- data.frame(matrix(NA, ncol = 3, nrow = length(game_ids)))
names(df) <- c("game_id", "home", "win")
df$game_id <- game_ids
df$home <- game_df[game_rows, "home_team_name"] == team_name
df$win <- (df$home & game_df[game_rows, "home_pts"] > game_df[game_rows, "visitor_pts"]) |
    ((!df$home) & game_df[game_rows, "home_pts"] < game_df[game_rows, "visitor_pts"])

Next, we create a list with keys being the game_ids and the value being the play-by-play data frame from the master list we scraped earlier. We standardize the column names so that GSW’s score is labeled team and the opponent’s score is labeled opp. (Note: At the time of writing, BasketBallReference.com had incorrect data for the game 201710200NOP, so I have removed that game from our data.)

pbp_list <- list()
for (i in 1:nrow(df)) {
    id <- game_ids[i]
    pdp <- master_list[[id]]$pdp_df
    if (df[i, "home"]) {
        names(pdp) <- c("time", "opp", "team", "period")
    } else {
        names(pdp) <- c("time", "team", "opp", "period")
    }
    pbp_list[[id]] <- pdp
}

# NOTE: mistake in 201710200NOP. original data on website is wrong
# https://www.basketball-reference.com/boxscores/pbp/201710200NOP.html
# so we remove it
df <- df[-which(df$game_id == "201710200NOP"), ]
pbp_list[["201710200NOP"]] <- NULL

We create a function parse_pbp which helps us reset the score at the beginning of every quarter. To make the lines in our eventual visualization smoother, parse_pbp also adds additional rows for the beginning and end of every quarter.

parse_pbp <- function(pbp) {
    pbp <- rbind(0, pbp)
    new_pbp <- pbp
    
    # get points scored in the period
    last_opp <- 0; last_team <- 0; last_period <- 0 for (i in 2:nrow(pbp)) { if (pbp[i, "period"] > last_period + 1) {
            last_period <- last_period + 1
            last_opp <- pbp[i-1, "opp"]
            last_team <- pbp[i-1, "team"]
        }
        new_pbp$opp[i] <- pbp$opp[i] - last_opp
        new_pbp$team[i] <- pbp$team[i] - last_team
    }
    
    # add extra rows to denote beginning and end of periods
    num_period <- max(new_pbp$period)
    for (i in 1:num_period) {
        end_row <- new_pbp[max(which(new_pbp$period == i)), ]
        end_row[1] <- 12 * 60 * min(i, 4) + 5 * 60 * max(i-4, 0) 
        beg_row <- c(0, 0, 0, i)
        beg_row[1] <- 12 * 60 * min(i-1, 4) + 5 * 60 * max(i-1-4, 0) 
        new_pbp <- rbind(new_pbp, beg_row)
        new_pbp <- rbind(new_pbp, end_row)
    }
    new_pbp <- new_pbp[order(new_pbp$time), ]
    new_pbp$adv <- with(new_pbp, team - opp)
    new_pbp[-1, ]
}

pbp_list <- lapply(pbp_list, parse_pbp)

Finally, we make the list into a dataframe:

pbp_df <- data.frame(matrix(ncol = 6, nrow = 0))
names(pbp_df) <- c("game_id", "home", "win", "time", "period", "adv")
for (i in seq_along(pbp_list)) {
    xx <- pbp_list[[i]]
    xx$game_id <- df$game_id[i]
    xx$home <- df$home[i]
    xx$win <- df$win[i]
    pbp_df <- rbind(pbp_df, xx) } head(pbp_df) #>     time opp team period adv      game_id home   win
#> 124    0   0    0      1   0 201710170GSW TRUE FALSE
#> 2     13   2    0      1  -2 201710170GSW TRUE FALSE
#> 3     55   2    1      1  -1 201710170GSW TRUE FALSE
#> 4     55   2    2      1   0 201710170GSW TRUE FALSE
#> 5     90   2    5      1   3 201710170GSW TRUE FALSE
#> 6    104   5    5      1   0 201710170GSW TRUE FALSE

We can now make the line plot easily. We use the interaction function to group the line plots by both game_id and quarter, color the lines by whether GSW won or not, and manually define the breaks to be at the end of the periods:

periods <- unique(pbp_df$period)
x_value <- ifelse(periods <= 4, 12 * 60 * periods, 
                  12 * 60 * 4 + 5 * 60 * (periods - 4))
x_label <- ifelse(periods <= 4, paste0("Q", periods), 
                  paste0("OT", periods - 4))

ggplot(pbp_df, aes(x = time, y = adv)) +
    geom_line(aes(col = win, group = interaction(game_id, period)), 
              lwd = 0.1) +
    geom_smooth(aes(group = period), col = "black", se = FALSE) +
    scale_x_continuous(breaks = x_value, labels = x_label) +
    scale_color_manual(values = c("#ff6600", "#3366ff")) +
    labs(title = paste("Point Advantage by Quarter,", team_name)) +
    theme_minimal() +
    theme(plot.title = element_text(face = "bold", hjust = 0.5),
          axis.title.x = element_blank(), axis.title.y = element_blank(),
          panel.grid.minor.x = element_blank(), 
          panel.grid.minor.y = element_blank(),
          legend.position = "bottom")

It may not be so obvious in this plot, but there certainly seems to be a bit of a difference in the third quarter. On average, the Warriors are even or up 1-2 points, but in the third quarter GSW gains almost 5 points. The difference is more obvious in the plot below, where we zoom in on the y-axis. (Note that we have to use coord_cartesian instead of scale_y_continuous to zoom in: scale_y_continuous will remove the points outside the range before plotting, which results in the incorrect result for the geom_smooth layer.

ggplot(pbp_df, aes(x = time, y = adv)) +
    geom_line(aes(col = win, group = interaction(game_id, period)), 
              lwd = 0.1) +
    geom_smooth(aes(group = period), col = "black", se = FALSE) +
    scale_x_continuous(breaks = x_value, labels = x_label) +
    scale_color_manual(values = c("#ff6600", "#3366ff")) +
    coord_cartesian(ylim = c(-10, 10)) +
    labs(title = paste("Point Advantage by Quarter,", team_name)) +
    theme_minimal() +
    theme(plot.title = element_text(face = "bold", hjust = 0.5),
          axis.title.x = element_blank(), axis.title.y = element_blank(),
          panel.grid.minor.x = element_blank(), 
          panel.grid.minor.y = element_blank(),
          legend.position = "bottom")

The advantage becomes even more obvious when we facet by whether GSW won or not: in wins, they gain almost 10 points in the 3rd quarter, compared to something like 1-3 points in each of the other quarters.

To leave a comment for the author, please follow the link and comment on their blog: R – Statistical Odds & Ends.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)