College Basketball: Presence in the NBA over Time

November 7, 2013

(This article was first published on Decisions and R, and kindly contributed to R-bloggers)

Interested in practicing a bit of web scraping, I decided to use a nice online dataset to examine the representation of various college programs in the NBA/ABA over time. The dataset only includes retired players and ends in 2010, so counts for the most recent years are incomplete; I therefore decided to plot data only through 2000.

Originally, I was excited to try out a googleVis motion chart using this data, but the result turned out less exciting than I expected.

Here, I've restricted my attention to programs which (at some point) had at least 11 players in the league simultaneously. This turns out to limit the inclusion to a handful of programs.
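To make the "at least 11 players simultaneously" rule concrete, here is a minimal sketch on made-up data. The school names and year ranges are hypothetical, not taken from the dataset, and the use of outer() and rowsum() is just one way to do it, not necessarily how the plot was produced. It counts active players per school and year, then keeps the schools whose count ever reaches a threshold; a threshold of 2 stands in for 11 so that the toy data stays small.

```r
# Hypothetical toy data: one row per player, with first and last season.
players <- data.frame(
  school     = c("A", "A", "A", "B", "B"),
  year.start = c(1950, 1951, 1960, 1950, 1955),
  year.end   = c(1953, 1952, 1962, 1951, 1956),
  stringsAsFactors = FALSE
)
years <- 1950:1962

# active[i, j] is TRUE when player i was in the league during years[j].
active <- outer(players$year.start, years, "<=") &
          outer(players$year.end,   years, ">=")

# Sum the indicator rows within each school:
# one row per school, one column per year.
counts <- rowsum(active * 1, group = players$school)
colnames(counts) <- years

# Keep schools that reach the threshold in at least one year
# (the post uses 11; 2 is used here only to suit the toy data).
threshold <- 2
keep <- rownames(counts)[apply(counts, 1, max) >= threshold]
print(keep)
```

Because the threshold is applied to the per-year maximum, a school qualifies as soon as it has a single year with enough players in the league at once, which matches the "at some point" wording above.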

While enthusiasts of NBA history surely will not need this plot to recall these periods of schools' strong presence in the league, I think the plot nicely captures the story behind several programs. It's easy to see the relatively recent emergence of Georgia Tech and Arizona, the slow climb of UNC and Michigan, and the powerhouse years of Kentucky (1950s) and UCLA (1980s).

The code used to generate the plot is below.

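# packages used below (the published listing omitted its library() calls;
# reshape2 is assumed as the source of melt())
library(reshape2)  # melt
library(plyr)      # ddply
library(ggplot2)   # plotting

# data scrape (the URL is blank in the published listing):
site = ""
# turn off warnings:
options(warn = -1)
# readlines:
tab = readLines(site)
# pull the link target out of one line of the index page:
trim = function(x){
  temp = substr(x,9,nchar(x)-8)
  temp2 = strsplit(temp,split = ">")[[1]][1]
  # the base URL prefix is blank in the published listing
  paste("",temp2,sep = "")
}
sub = tab[81:553]
sites = sapply(sub, trim)
# find lines around players and return their "start-end" year ranges:
dates.grab = function(s1){
  temp = readLines(s1)
  start = grep("listed separately",temp)
  end = grep("font class=foot",temp)
  sub = temp[(start+3):(end-2)]
  pattern = "[[:digit:]]+-[[:digit:]]+"
  m = gregexpr(pattern, sub)
  # return step reconstructed: the published listing stops after gregexpr()
  unlist(regmatches(sub, m))
}
df = data.frame(unlist(lapply(sites,dates.grab)), stringsAsFactors = FALSE)
names(df) = c("years")
# example row name, for inspection:
test = rownames(df)[1]
# row names carry the index-page line for each school; recover the school name
# (helper name and its last line are reconstructed; the listing was garbled here)
get.school = function(name){
  temp = strsplit(name,split = ">")[[1]][2]
  strsplit(temp,split = "<")[[1]][1]
}
df$school = unlist(lapply(rownames(df), get.school))
rownames(df) = 1:nrow(df)
df$year.start = unlist(lapply(df$years, function(x){substr(x,1,4)}))
df$year.end = unlist(lapply(df$years, function(x){substr(x,6,10)}))
# keep school, year.start, year.end:
df = df[,2:4]
df$year.start = as.numeric(df$year.start)
df$year.end = as.numeric(df$year.end)
# was a player active in test.year? (1 = yes, 0 = no)
was.playing.func = function(years,test.year){
  as.numeric(test.year %in% years[1]:years[2])
}
# 65 years (1946 through 2010), one indicator column per year:
mat = matrix(rep(NA,nrow(df)*65),ncol = 65)
for(i in 1:65){
  mat[,i] = apply(df[,2:3],1,function(x){was.playing.func(x,(i + 1945))})
}
copy = df
copy = cbind(copy, mat)
names(copy)[4:ncol(copy)] = 1946:2010
# long format: one row per player-year
mdata = melt(copy, id = "school")
mdata = mdata[-which(mdata$variable %in% c("year.start","year.end")),]
names(mdata) = c("school","year","players")
mdata$year = as.numeric(as.character(mdata$year))
# players per school per year:
comb = ddply(mdata,.(school,year),summarise,tot.players = sum(players))
# looking at a subset:
comb2 = comb[comb$year<2001,]
top.sub = unique(comb2$school[which(comb2$tot.players > 10)])
df2 = comb2[which(comb2$school %in% top.sub),]
# the published line ended in a dangling "+"; the layer that followed is lost
p = ggplot(df2, aes(x = year, y = tot.players, col = school)) + geom_line(lwd = 2)
# opts()/theme_blank() are no longer in ggplot2; theme()/element_blank() replace them
print(p + ylab("players in the NBA/ABA") + theme(strip.text.y = element_blank()))
ggsave(filename = "topCollegeNBA.png",height = 8)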
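
The year-range extraction inside dates.grab can be tried offline. Below is a small sketch on hypothetical lines standing in for a scraped player table (the markup and names are invented for illustration). It uses the same pattern as the listing; using regmatches() to pull out the matched text is an assumption on my part about how that step could be done.

```r
# Hypothetical lines standing in for a scraped player table.
html <- c(
  "<td>Player One</td><td>1946-1950</td>",
  "<td>no seasons on this line</td>",
  "<td>Player Two</td><td>1971-1984</td>"
)

# Same pattern as in the listing: digits, a hyphen, digits.
pattern <- "[[:digit:]]+-[[:digit:]]+"
m <- gregexpr(pattern, html)

# regmatches() returns one character vector per input line
# (empty where nothing matched); unlist() flattens them.
years <- unlist(regmatches(html, m))

# Split each "start-end" string into numeric start and end years.
year.start <- as.numeric(substr(years, 1, 4))
year.end   <- as.numeric(substr(years, 6, 9))
print(data.frame(years, year.start, year.end))
```

Note that the pattern accepts any digits-hyphen-digits run, so on a real page it would also pick up things like scores or phone numbers; restricting the input to the lines around the player table, as dates.grab does, is what keeps it safe.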
