Play Data, Play Ball! – Exploring Baseball Data with R

December 31, 2014
By

(This article was first published on NYC Data Science Academy » R, and kindly contributed to R-bloggers)

  • Demo day: Dec 9, 2014. No photos available.
  • Contributed by Summit Suen & Wayne Chen.
  • They took the CORP-R 002 class with Vivian Zhang (Taiwan Open data and data science / 臺北國際 OPEN DATA 培訓) in Oct, 2014. This post was based on their final project submission.

  • What comes up with you while talking about SPORTS DATA ANALYSIS?!


    Hot, Hot, Hot: Big Data & Sports (1/2)


    Hot, Hot, Hot: Big Data & Sports (2/2)


    In fact, works analyzing baseball data with statistical techniques has already exist: Sabermatrics


    There are plenty of excellent packages that could help us analyzing baseball data.

    A simple demo with r-packages: dplyr & lahman

    library(Lahman)
    library(dplyr)
    totalRS <- Teams %>% select(yearID, R, G) %>% mutate(AvgRperG = R/G) %>% group_by(yearID) %>% summarise(sum(AvgRperG))
    names(totalRS) <- c("yearID", "RUN")
    head(totalRS)
    ## Source: local data frame [6 x 2]
    ## 
    ##   yearID      RUN
    ## 1   1871 93.12897
    ## 2   1872 95.21474
    ## 3   1873 73.15998
    ## 4   1874 58.55903
    ## 5   1875 70.08774
    ## 6   1876 47.01267


    And one could easily observe the trend of scoring in MLB history via lahman + ggplot2

    library(ggplot2)
    ggplot(data = totalRS, aes(x = yearID, y = RUN)) + stat_smooth() + geom_line()
    ## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.

    plot of chunk unnamed-chunk-2


    Even admire the career numbers of Derek Jeter! LOL

    library(Lahman)
    library(dplyr)
    head(filter(Batting, playerID == "jeterde01"))
    ##    playerID yearID stint teamID lgID   G G_batting  AB   R   H X2B X3B HR
    ## 1 jeterde01   1995     1    NYA   AL  15        15  48   5  12   4   1  0
    ## 2 jeterde01   1996     1    NYA   AL 157       157 582 104 183  25   6 10
    ## 3 jeterde01   1997     1    NYA   AL 159       159 654 116 190  31   7 10
    ## 4 jeterde01   1998     1    NYA   AL 149       149 626 127 203  25   8 19
    ## 5 jeterde01   1999     1    NYA   AL 158       158 627 134 219  37   9 24
    ## 6 jeterde01   2000     1    NYA   AL 148       148 593 119 201  31   4 15
    ##   RBI SB CS BB  SO IBB HBP SH SF GIDP G_old
    ## 1   7  0  0  3  11   0   0  0  0    0    15
    ## 2  78 14  7 48 102   1   9  6  9   13   157
    ## 3  70 23 12 74 125   0  10  8  2   14   159
    ## 4  84 30  6 57 119   1   5  3  3   13   149
    ## 5 102 19  8 91 116   5  12  3  6   12   158
    ## 6  73 22  4 68  99   4  12  3  3   14   148


    Here comes another awesome package: pitchRx
    Which offers APIs accessing daily MLB logs and 2D/3D animations.

    install.packages("pitchRx")
    library(pitchRx)


    OpenWAR is a open project for calculating WAR – Wins Above Replacement
    WAR offers an estimate to answer the question, “If this player got injured and their team had to replace them with a freely available minor leaguer or a AAAA player from their bench, how much value would the team be losing?”

    install.packages("Sxslt", repos = "http://www.omegahat.org/R", type = "source")
    require(devtools)
    install_github("openWAR", "beanumber")


    After the amazing tour of useful packages, we need to write our OWN crawler/analyzer!!!


    Data Source


    Crawler in R (1/2)

    ## RSelenium + phantomJS
    library(XML)
    library(RSelenium)
    
    pJS <- phantom()
    Sys.sleep(5)
    remDr <- remoteDriver(browserName = 'phantomjs')
    remDr$open()
    
    url <- 'http://www.cpbl.com.tw/stats_hr.aspx'
    remDr$navigate(url)
    
    hr_tables <- list()

    Crawler in R (2/2)

    year <- seq(1990, 2014)
    for (i in 1:length(year)) {
      option <- remDr$findElement(using = 'xpath', paste0("//*/option[@value = '",year[i],"']"))
      option$clickElement()
      src <- option$getPageSource()
      hr_tables[i] <- readHTMLTable(src[[1]], stringAsFactors = FALSE, encoding = "utf8")
    }
    
    # View(hr_tables)
    
    remDr$close()
    pJS$stop()

    Let’s PLAY DATA!

    load("hr_tables.RData")
    
    head(hr_tables[[25]])
    ##   NUMBER YEAR      GID      DATE STADIUM   BATTER  BATTERTEAM  PITCHER
    ## 1      # YEAR GAME NO.      DATE STADIUM   PLAYER PLAYER TEAM  PITCHER
    ## 2      1 2014        2 2014/3/23    天母 林 威 助    中信兄弟 費 古 洛
    ## 3      2 2014        3 2014/3/23  澄清湖 詹 智 堯      Lamigo 林 正 豐
    ## 4      3 2014        4 2014/3/25    新莊 張 志 豪    中信兄弟 林 晨 樺
    ## 5      4 2014        5 2014/3/26  嘉義市 林 泓 育      Lamigo 林 岳 平
    ## 6      5 2014        6 2014/3/27    新莊 張 志 豪    中信兄弟 黃 勝 雄
    ##    PITCHERTEAM RBI REMARK
    ## 1 PITCHER TEAM RBI REMARK
    ## 2 統一7-ELEVEn   1       
    ## 3         義大   1       
    ## 4         義大   1       
    ## 5 統一7-ELEVEn   1       
    ## 6         義大   1


    Draw some pictures (1/2)

    par(family = 'Heiti TC Light')
    m <- ggplot(hr_table, aes(x = as.numeric(GID)))
    m + geom_density(aes(fill = factor(BatterTeam), alpha = 0.01)) + 
        labs(title = paste0("HR Distribution of ", year[i]," Season 中華職棒", year[i], "賽季逐場全壘打分布")) + 
        theme(text = element_text(family="Heiti TC Light"), plot.title = element_text(size = 26), legend.text = element_text(size = 20))

    plot of chunk unnamed-chunk-10


    Keep going – further analysis

    load("sc_tables.RData")
    
    # 賽伯計量學的畢達哥拉斯定理
    win_pc <- function(RS, RA) {
      perc <- RS^2 / (RS^2 + RA^2) 
      return(perc)
    }


    Draw some pictures (2/2)

    levels(win_prob$team)
    ## [1] "統一7-ELEVEn" "義大"         "中信兄弟"     "Lamigo"

    # fig <- ggplot(data = win_prob, aes(x = team, y = real, fill = factor(half)))
    # fig + geom_bar(stat = "identity", position="dodge")
    ggplot(data = win_prob, aes(x = expd, y = real, color = factor(team))) + stat_smooth() + geom_point() + coord_fixed()
    ## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.

    plot of chunk unnamed-chunk-13


    An example demonstrating the relationship between “Real Value” v.s. “Market Value”

    example



    Advertising

    To leave a comment for the author, please follow the link and comment on their blog: NYC Data Science Academy » R.

    R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



    If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

    Comments are closed.

    Search R-bloggers


    Sponsors

    Never miss an update!
    Subscribe to R-bloggers to receive
    e-mails with the latest R posts.
    (You will not see this message again.)

    Click here to close (This popup will not appear again)