(This article was first published on

**NYC Data Science Academy » R**, and kindly contributed to R-bloggers) Slides:

**Taipei opendata 20141209**from**Summit Suen**## What comes up with you while talking about SPORTS DATA ANALYSIS?!

## Hot, Hot, Hot: Big Data & Sports (1/2)

## Hot, Hot, Hot: Big Data & Sports (2/2)

In fact, works analyzing baseball data with statistical techniques has already exist: Sabermatrics

## There are plenty of excellent packages that could help us analyzing baseball data.

A simple demo with r-packages: dplyr & lahman

```
library(Lahman)
library(dplyr)
totalRS <- Teams %>% select(yearID, R, G) %>% mutate(AvgRperG = R/G) %>% group_by(yearID) %>% summarise(sum(AvgRperG))
names(totalRS) <- c("yearID", "RUN")
head(totalRS)
```

## Source: local data frame [6 x 2] ## ## yearID RUN ## 1 1871 93.12897 ## 2 1872 95.21474 ## 3 1873 73.15998 ## 4 1874 58.55903 ## 5 1875 70.08774 ## 6 1876 47.01267

And one could easily observe the trend of scoring in MLB history via lahman + ggplot2

```
library(ggplot2)
ggplot(data = totalRS, aes(x = yearID, y = RUN)) + stat_smooth() + geom_line()
```

## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.

Even admire the career numbers of Derek Jeter! LOL

```
library(Lahman)
library(dplyr)
head(filter(Batting, playerID == "jeterde01"))
```

## playerID yearID stint teamID lgID G G_batting AB R H X2B X3B HR ## 1 jeterde01 1995 1 NYA AL 15 15 48 5 12 4 1 0 ## 2 jeterde01 1996 1 NYA AL 157 157 582 104 183 25 6 10 ## 3 jeterde01 1997 1 NYA AL 159 159 654 116 190 31 7 10 ## 4 jeterde01 1998 1 NYA AL 149 149 626 127 203 25 8 19 ## 5 jeterde01 1999 1 NYA AL 158 158 627 134 219 37 9 24 ## 6 jeterde01 2000 1 NYA AL 148 148 593 119 201 31 4 15 ## RBI SB CS BB SO IBB HBP SH SF GIDP G_old ## 1 7 0 0 3 11 0 0 0 0 0 15 ## 2 78 14 7 48 102 1 9 6 9 13 157 ## 3 70 23 12 74 125 0 10 8 2 14 159 ## 4 84 30 6 57 119 1 5 3 3 13 149 ## 5 102 19 8 91 116 5 12 3 6 12 158 ## 6 73 22 4 68 99 4 12 3 3 14 148

Here comes another awesome package: pitchRx

Which offers APIs accessing daily MLB logs and 2D/3D animations.

```
install.packages("pitchRx")
library(pitchRx)
```

OpenWAR is a open project for calculating WAR – Wins Above Replacement

WAR offers an estimate to answer the question, “If this player got injured and their team had to replace them with a freely available minor leaguer or a AAAA player from their bench, how much value would the team be losing?”

```
install.packages("Sxslt", repos = "http://www.omegahat.org/R", type = "source")
require(devtools)
install_github("openWAR", "beanumber")
```

## After the amazing tour of useful packages, we need to write our OWN crawler/analyzer!!!

## Data Source

## Crawler in R (1/2)

```
## RSelenium + phantomJS
library(XML)
library(RSelenium)
pJS <- phantom()
Sys.sleep(5)
remDr <- remoteDriver(browserName = 'phantomjs')
remDr$open()
url <- 'http://www.cpbl.com.tw/stats_hr.aspx'
remDr$navigate(url)
hr_tables <- list()
```

## Crawler in R (2/2)

```
year <- seq(1990, 2014)
for (i in 1:length(year)) {
option <- remDr$findElement(using = 'xpath', paste0("//*/option[@value = '",year[i],"']"))
option$clickElement()
src <- option$getPageSource()
hr_tables[i] <- readHTMLTable(src[[1]], stringAsFactors = FALSE, encoding = "utf8")
}
# View(hr_tables)
remDr$close()
pJS$stop()
```

## Let’s PLAY DATA!

```
load("hr_tables.RData")
head(hr_tables[[25]])
```

## NUMBER YEAR GID DATE STADIUM BATTER BATTERTEAM PITCHER ## 1 # YEAR GAME NO. DATE STADIUM PLAYER PLAYER TEAM PITCHER ## 2 1 2014 2 2014/3/23 天母 林 威 助 中信兄弟 費 古 洛 ## 3 2 2014 3 2014/3/23 澄清湖 詹 智 堯 Lamigo 林 正 豐 ## 4 3 2014 4 2014/3/25 新莊 張 志 豪 中信兄弟 林 晨 樺 ## 5 4 2014 5 2014/3/26 嘉義市 林 泓 育 Lamigo 林 岳 平 ## 6 5 2014 6 2014/3/27 新莊 張 志 豪 中信兄弟 黃 勝 雄 ## PITCHERTEAM RBI REMARK ## 1 PITCHER TEAM RBI REMARK ## 2 統一7-ELEVEn 1 ## 3 義大 1 ## 4 義大 1 ## 5 統一7-ELEVEn 1 ## 6 義大 1

## Draw some pictures (1/2)

```
par(family = 'Heiti TC Light')
m <- ggplot(hr_table, aes(x = as.numeric(GID)))
m + geom_density(aes(fill = factor(BatterTeam), alpha = 0.01)) +
labs(title = paste0("HR Distribution of ", year[i]," Season 中華職棒", year[i], "賽季逐場全壘打分布")) +
theme(text = element_text(family="Heiti TC Light"), plot.title = element_text(size = 26), legend.text = element_text(size = 20))
```

## Keep going – further analysis

```
load("sc_tables.RData")
# 賽伯計量學的畢達哥拉斯定理
win_pc <- function(RS, RA) {
perc <- RS^2 / (RS^2 + RA^2)
return(perc)
}
```

## Draw some pictures (2/2)

`levels(win_prob$team)`

## [1] "統一7-ELEVEn" "義大" "中信兄弟" "Lamigo"

```
# fig <- ggplot(data = win_prob, aes(x = team, y = real, fill = factor(half)))
# fig + geom_bar(stat = "identity", position="dodge")
ggplot(data = win_prob, aes(x = expd, y = real, color = factor(team))) + stat_smooth() + geom_point() + coord_fixed()
```

## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.

## An example demonstrating the relationship between “Real Value” v.s. “Market Value”

## Advertising

To

**leave a comment**for the author, please follow the link and comment on their blog:**NYC Data Science Academy » R**.R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...