Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

# Introduction

With the recent start of the college basketball season, I thought it would be fun to explore how style of play varies between conferences and reveal how the game has changed over time. Using yearly data for each school found on kenpom.com, a site for advanced college basketball stats, I examined the state of college hoops.

Embed from Getty Images

# Data Pre-Processing

library(tidyverse)
library(rvest)
library(stringi)
library(kableExtra)
library(ggthemr)
library(factoextra)
library(cluster)
library(NbClust)
library(plotly)

## Scrape and Clean Data

#define column names

#scrape kenpom data
seasons <- lapply(paste0('https://kenpom.com/index.php?y=', 2002:2018),
function(url){
url %>%
html_nodes("table") %>%
html_table() %>%
as.data.frame() %>%
select(-c(Var.1, Var.7, Var.9, Var.11, Var.13,
Strength.of.Schedule.1,
Strength.of.Schedule.3,
Strength.of.Schedule.5, NCSOS.1)) %>%
colnames<-(variables) %>%
filter(!is.na(as.numeric(OppO))) %>%
mutate(year = stri_sub(url, from = -4, length = 4))
})

#combine yearly data
ncaa <- do.call(rbind, seasons) %>%
separate(Record, into = c("Wins", "Losses")) #extract wins/losses

ncaa[,3:13] <- lapply(ncaa[,3:13], as.numeric) #convert to numeric values
ncaa$Team <- gsub('[0-9]+', "", ncaa$Team) %>% #remove ncaa/nit tourney rank
trimws("r") #remove trailing spaces
ncaa$Conference <- recode(ncaa$Conference, P10 = "P12") #combine pac10 with pac12

In total, we’ll examine 17 years of data including 5,804 individual team seasons. Each season will be classified according to the year it ends, not the year it starts. For example, the 2017-18 college basketball season will be labeled simply as “2018”. A quick look at how the data is organized is shown below.

glimpse(ncaa)
## Observations: 5,804
## Variables: 14
## $Team "Duke", "Cincinnati", "Maryland", "Kansas", "Okla... ##$ Conference    "ACC", "CUSA", "ACC", "B12", "B12", "B10", "SEC",...
## $Wins 31, 31, 32, 33, 31, 25, 22, 26, 26, 22, 26, 22, 2... ##$ Losses        4, 4, 4, 4, 5, 12, 9, 9, 9, 10, 7, 10, 10, 6, 10,...
## $AdjEM 34.19, 30.19, 29.25, 28.99, 26.04, 24.80, 24.72, ... ##$ AdjO          121.0, 118.1, 119.2, 118.7, 114.9, 114.0, 115.1, ...
## $AdjD 86.8, 87.9, 89.9, 89.7, 88.9, 89.2, 90.4, 92.5, 9... ##$ AdjT          74.5, 67.4, 73.7, 77.3, 66.5, 65.6, 69.6, 68.5, 7...
## $Luck -0.027, 0.002, 0.025, 0.022, 0.043, -0.049, -0.07... ##$ SOS.AdjEM     9.87, 6.58, 9.88, 10.67, 8.77, 14.12, 9.11, 10.60...
## $OppO 109.3, 106.5, 109.4, 110.5, 109.2, 111.1, 108.3, ... ##$ OppD          99.5, 100.0, 99.5, 99.9, 100.4, 96.9, 99.2, 99.4,...
## $NC.SOS.AdjEM 6.66, 3.48, 1.62, 8.32, -0.44, 13.54, -0.56, 6.79... ##$ year          "2002", "2002", "2002", "2002", "2002", "2002", "...

# Pace of Play

## Overall Tempo

To begin, let’s look at how tempo has varied over time. Kenpom defines tempo as a team’s total possessions per 40 minutes (the length of a game minus overtime) adjusted for opponent. So, in general, the greater the tempo, the faster pace a team plays at. The plot below maps how tempo has evolved over time with the red dot representing each year’s mean.

It looks like most teams slowly but surely began slowing down overall pace; tempo was trending downwards right until the 2016 season. So, what happened in 2016? The NCAA reduced the shot clock from 35 seconds to 30 and looks to have made an immediate impact on the way teams play. The average team possessions jumped from 64.4 in 2015 to 68.2 in 2016, the highest since 2002. That’s an increase of about 8 more total possessions per game! Next, let’s break it down on the conference level.

## Conference Tempo

The plot below shows how each of the major conferences transitioned in terms of pace during the five years prior to using a 30 second shot clock and the three years since.

before <- ncaa %>%
filter(Conference %in% c("B10", "B12", "P12", "SEC", "ACC", "BE")) %>%
filter(year < 2016 & year > 2010) %>%
group_by(Conference) %>%

after <- ncaa %>%
filter(Conference %in% c("B10", "B12", "P12", "SEC", "ACC", "BE")) %>%
filter(year > 2015) %>%
group_by(Conference) %>%

tempo <- full_join(before, after, by = "Conference") %>%
gather("Shot Clock", Tempo, 2:3)

tempo$Conference <- recode(tempo$Conference, B10 = "Big 10", B12 = "Big 12", P12 = "Pac-12", BE = "Big East")

The Big East is no longer a slower paced, defensive-minded conference like the Big 10; they lead all conferences in terms of average tempo using a 30 second shot clock.

# Conference Clustering

## High-Major vs. Mid-Major

Each school competes in one of 32 different conferences which are determined primarily by geography and school size. Not all conferences are created equally, however. At the top of the food chain, “Power Six” conferences (ACC, Big 10, Big 12, Big East, Pac-12, SEC) recruit the top talent year in, year out and are the most competitive each year. The rest of the conferences could be considered “mid-major”, but some like to make the distinction to separate into one more category: “high-major.” They are typically a step above the rest of the mid-majors but aren’t as strong top-to-bottom as the Power Six. Let’s use k-means clustering to determine which conferences deserve the high-major distinction.

## K-means

First things first, let’s determine what data to use. Conference realignment kind of throws a wrench into things, so to get a better idea of the current landscape we’ll focus our efforts only on the past five seasons. We’ll use each conference’s median team offensive and defensive efficiency metric defined as the points scored or allowed per 100 possessions (adjusted for opponent) and tempo (described above) to paint a picture of each conference’s overall skill and style of play. Combining those features let’s determine the optimal number of clusters using silhouette coefficients.

set.seed(8675309)

kmeans.data <- ncaa %>%
filter(year > 2013) %>%
group_by(Conference) %>%
summarise_all(median) %>%
column_to_rownames("Conference") %>%
scale() 

The optimal number of centers to use for k-means model is 2. The silhouette score measures the separation of each point between clusters. In short, the higher score, the better the clustering result. The silhouette plot with 2 clusters is shown below.

ncaa.km <- kmeans(kmeans.data, 2, nstart = 33)

The results of our k-means cluster can be visualized on a PCA biplot showing 96.7% (65.6 + 31.1) of the variability in the data.

Congratulations to members of the American, Atlantic 10, Mountain West, Missouri Valley, and West Coast Conferences. You have officially been promoted to High-Major status! Now of those High-Majors, let’s see which schools could make the jump to a Power Six conference based on recent performance.

#define conference type
ncaa <- ncaa %>%
mutate(type = case_when(Conference %in% c("ACC", "B10", "B12", "P12", "SEC", "BE") ~ "Power 6",
Conference %in% c("A10", "Amer", "MWC", "MVC", "WCC") ~ "High-Major",
TRUE ~ "Mid-Major")) 

## The Best of the Rest

Kenpom ranks each team by its efficiency margin which is a team’s offensive minus defensive efficiency. Since 2014, the average efficiency margin for Power Six schools’ is 13.80. During that same time, High-Majors with an overall efficiency margin of that or better are listed below.

Gonzaga 24.764
Wichita St. 23.054
Cincinnati 19.846
SMU 18.098
Saint Mary’s 15.658
San Diego St. 14.752
VCU 13.898

These schools will most likely be given a hard look from Power Six conferences looking to expand. Gonzaga, Wichita St., and VCU have all made the Final Four this decade as well.