Play Data, Play Ball! – Exploring Baseball Data with R

[This article was first published on NYC Data Science Academy » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

  • Demo day: Dec 9, 2014. No photos available.
  • Contributed by Summit Suen & Wayne Chen.
  • They took the CORP-R 002 class with Vivian Zhang (Taiwan Open data and data science / 臺北國際 OPEN DATA 培訓) in Oct, 2014. This post was based on their final project submission.

  • What comes up with you while talking about SPORTS DATA ANALYSIS?!


    Hot, Hot, Hot: Big Data & Sports (1/2)


    Hot, Hot, Hot: Big Data & Sports (2/2)


    In fact, works analyzing baseball data with statistical techniques has already exist: Sabermatrics


    There are plenty of excellent packages that could help us analyzing baseball data.

    A simple demo with r-packages: dplyr & lahman

    library(<span class="pl-vo">Lahman</span>)
    library(<span class="pl-vo">dplyr</span>)
    <span class="pl-vo">totalRS</span> <span class="pl-k"><-</span> <span class="pl-vo">Teams</span> %<span class="pl-k">></span>% select(<span class="pl-vo">yearID</span>, <span class="pl-vo">R</span>, <span class="pl-vo">G</span>) %<span class="pl-k">></span>% mutate(<span class="pl-v">AvgRperG</span> <span class="pl-k">=</span> <span class="pl-vo">R</span><span class="pl-k">/</span><span class="pl-vo">G</span>) %<span class="pl-k">></span>% group_by(<span class="pl-vo">yearID</span>) %<span class="pl-k">></span>% summarise(sum(<span class="pl-vo">AvgRperG</span>))
    names(<span class="pl-vo">totalRS</span>) <span class="pl-k"><-</span> c(<span class="pl-s1"><span class="pl-pds">"</span>yearID<span class="pl-pds">"</span></span>, <span class="pl-s1"><span class="pl-pds">"</span>RUN<span class="pl-pds">"</span></span>)
    head(<span class="pl-vo">totalRS</span>)

    ## Source: local data frame [6 x 2]
    ## 
    ##   yearID      RUN
    ## 1   1871 93.12897
    ## 2   1872 95.21474
    ## 3   1873 73.15998
    ## 4   1874 58.55903
    ## 5   1875 70.08774
    ## 6   1876 47.01267


    And one could easily observe the trend of scoring in MLB history via lahman + ggplot2

    library(<span class="pl-vo">ggplot2</span>)
    ggplot(<span class="pl-v">data</span> <span class="pl-k">=</span> <span class="pl-vo">totalRS</span>, aes(<span class="pl-v">x</span> <span class="pl-k">=</span> <span class="pl-vo">yearID</span>, <span class="pl-v">y</span> <span class="pl-k">=</span> <span class="pl-vo">RUN</span>)) <span class="pl-k">+</span> stat_smooth() <span class="pl-k">+</span> geom_line()

    ## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.

    plot of chunk unnamed-chunk-2


    Even admire the career numbers of Derek Jeter! LOL

    library(<span class="pl-vo">Lahman</span>)
    library(<span class="pl-vo">dplyr</span>)
    head(filter(<span class="pl-vo">Batting</span>, <span class="pl-vo">playerID</span> <span class="pl-k">==</span> <span class="pl-s1"><span class="pl-pds">"</span>jeterde01<span class="pl-pds">"</span></span>))

    ##    playerID yearID stint teamID lgID   G G_batting  AB   R   H X2B X3B HR
    ## 1 jeterde01   1995     1    NYA   AL  15        15  48   5  12   4   1  0
    ## 2 jeterde01   1996     1    NYA   AL 157       157 582 104 183  25   6 10
    ## 3 jeterde01   1997     1    NYA   AL 159       159 654 116 190  31   7 10
    ## 4 jeterde01   1998     1    NYA   AL 149       149 626 127 203  25   8 19
    ## 5 jeterde01   1999     1    NYA   AL 158       158 627 134 219  37   9 24
    ## 6 jeterde01   2000     1    NYA   AL 148       148 593 119 201  31   4 15
    ##   RBI SB CS BB  SO IBB HBP SH SF GIDP G_old
    ## 1   7  0  0  3  11   0   0  0  0    0    15
    ## 2  78 14  7 48 102   1   9  6  9   13   157
    ## 3  70 23 12 74 125   0  10  8  2   14   159
    ## 4  84 30  6 57 119   1   5  3  3   13   149
    ## 5 102 19  8 91 116   5  12  3  6   12   158
    ## 6  73 22  4 68  99   4  12  3  3   14   148


    Here comes another awesome package: pitchRx
    Which offers APIs accessing daily MLB logs and 2D/3D animations.

    install.packages(<span class="pl-s1"><span class="pl-pds">"</span>pitchRx<span class="pl-pds">"</span></span>)
    library(<span class="pl-vo">pitchRx</span>)


    OpenWAR is a open project for calculating WAR – Wins Above Replacement
    WAR offers an estimate to answer the question, “If this player got injured and their team had to replace them with a freely available minor leaguer or a AAAA player from their bench, how much value would the team be losing?”

    install.packages(<span class="pl-s1"><span class="pl-pds">"</span>Sxslt<span class="pl-pds">"</span></span>, <span class="pl-v">repos</span> <span class="pl-k">=</span> <span class="pl-s1"><span class="pl-pds">"</span>http://www.omegahat.org/R<span class="pl-pds">"</span></span>, <span class="pl-v">type</span> <span class="pl-k">=</span> <span class="pl-s1"><span class="pl-pds">"</span>source<span class="pl-pds">"</span></span>)
    require(<span class="pl-vo">devtools</span>)
    install_github(<span class="pl-s1"><span class="pl-pds">"</span>openWAR<span class="pl-pds">"</span></span>, <span class="pl-s1"><span class="pl-pds">"</span>beanumber<span class="pl-pds">"</span></span>)


    After the amazing tour of useful packages, we need to write our OWN crawler/analyzer!!!


    Data Source


    Crawler in R (1/2)

    <span class="pl-c">## RSelenium + phantomJS</span>
    library(<span class="pl-vo">XML</span>)
    library(<span class="pl-vo">RSelenium</span>)
    
    <span class="pl-vo">pJS</span> <span class="pl-k"><-</span> phantom()
    Sys.sleep(<span class="pl-c1">5</span>)
    <span class="pl-vo">remDr</span> <span class="pl-k"><-</span> remoteDriver(<span class="pl-v">browserName</span> <span class="pl-k">=</span> <span class="pl-s1"><span class="pl-pds">'</span>phantomjs<span class="pl-pds">'</span></span>)
    <span class="pl-vo">remDr</span><span class="pl-k">$</span>open()
    
    <span class="pl-vo">url</span> <span class="pl-k"><-</span> <span class="pl-s1"><span class="pl-pds">'</span>http://www.cpbl.com.tw/stats_hr.aspx<span class="pl-pds">'</span></span>
    <span class="pl-vo">remDr</span><span class="pl-k">$</span>navigate(<span class="pl-vo">url</span>)
    
    <span class="pl-vo">hr_tables</span> <span class="pl-k"><-</span> <span class="pl-st">list</span>()

    Crawler in R (2/2)

    <span class="pl-vo">year</span> <span class="pl-k"><-</span> seq(<span class="pl-c1">1990</span>, <span class="pl-c1">2014</span>)
    <span class="pl-k">for</span> (<span class="pl-vo">i</span> <span class="pl-k">in</span> <span class="pl-c1">1</span><span class="pl-k">:</span>length(<span class="pl-vo">year</span>)) {
      <span class="pl-vo">option</span> <span class="pl-k"><-</span> <span class="pl-vo">remDr</span><span class="pl-k">$</span>findElement(<span class="pl-v">using</span> <span class="pl-k">=</span> <span class="pl-s1"><span class="pl-pds">'</span>xpath<span class="pl-pds">'</span></span>, paste0(<span class="pl-s1"><span class="pl-pds">"</span>//*/option[@value = '<span class="pl-pds">"</span></span>,<span class="pl-vo">year</span>[<span class="pl-vo">i</span>],<span class="pl-s1"><span class="pl-pds">"</span>']<span class="pl-pds">"</span></span>))
      <span class="pl-vo">option</span><span class="pl-k">$</span>clickElement()
      <span class="pl-vo">src</span> <span class="pl-k"><-</span> <span class="pl-vo">option</span><span class="pl-k">$</span>getPageSource()
      <span class="pl-vo">hr_tables</span>[<span class="pl-vo">i</span>] <span class="pl-k"><-</span> readHTMLTable(<span class="pl-vo">src</span>[[<span class="pl-c1">1</span>]], <span class="pl-v">stringAsFactors</span> <span class="pl-k">=</span> <span class="pl-c1">FALSE</span>, <span class="pl-v">encoding</span> <span class="pl-k">=</span> <span class="pl-s1"><span class="pl-pds">"</span>utf8<span class="pl-pds">"</span></span>)
    }
    
    <span class="pl-c"># View(hr_tables)</span>
    
    <span class="pl-vo">remDr</span><span class="pl-k">$</span>close()
    <span class="pl-vo">pJS</span><span class="pl-k">$</span>stop()

    Let’s PLAY DATA!

    load(<span class="pl-s1"><span class="pl-pds">"</span>hr_tables.RData<span class="pl-pds">"</span></span>)
    
    head(<span class="pl-vo">hr_tables</span>[[<span class="pl-c1">25</span>]])

    ##   NUMBER YEAR      GID      DATE STADIUM   BATTER  BATTERTEAM  PITCHER
    ## 1      # YEAR GAME NO.      DATE STADIUM   PLAYER PLAYER TEAM  PITCHER
    ## 2      1 2014        2 2014/3/23    天母 林 威 助    中信兄弟 費 古 洛
    ## 3      2 2014        3 2014/3/23  澄清湖 詹 智 堯      Lamigo 林 正 豐
    ## 4      3 2014        4 2014/3/25    新莊 張 志 豪    中信兄弟 林 晨 樺
    ## 5      4 2014        5 2014/3/26  嘉義市 林 泓 育      Lamigo 林 岳 平
    ## 6      5 2014        6 2014/3/27    新莊 張 志 豪    中信兄弟 黃 勝 雄
    ##    PITCHERTEAM RBI REMARK
    ## 1 PITCHER TEAM RBI REMARK
    ## 2 統一7-ELEVEn   1       
    ## 3         義大   1       
    ## 4         義大   1       
    ## 5 統一7-ELEVEn   1       
    ## 6         義大   1


    Draw some pictures (1/2)

    par(<span class="pl-v">family</span> <span class="pl-k">=</span> <span class="pl-s1"><span class="pl-pds">'</span>Heiti TC Light<span class="pl-pds">'</span></span>)
    <span class="pl-vo">m</span> <span class="pl-k"><-</span> ggplot(<span class="pl-vo">hr_table</span>, aes(<span class="pl-v">x</span> <span class="pl-k">=</span> as.numeric(<span class="pl-vo">GID</span>)))
    <span class="pl-vo">m</span> <span class="pl-k">+</span> geom_density(aes(<span class="pl-v">fill</span> <span class="pl-k">=</span> <span class="pl-st">factor</span>(<span class="pl-vo">BatterTeam</span>), <span class="pl-v">alpha</span> <span class="pl-k">=</span> <span class="pl-c1">0.01</span>)) <span class="pl-k">+</span> 
        labs(<span class="pl-v">title</span> <span class="pl-k">=</span> paste0(<span class="pl-s1"><span class="pl-pds">"</span>HR Distribution of <span class="pl-pds">"</span></span>, <span class="pl-vo">year</span>[<span class="pl-vo">i</span>],<span class="pl-s1"><span class="pl-pds">"</span> Season 中華職棒<span class="pl-pds">"</span></span>, <span class="pl-vo">year</span>[<span class="pl-vo">i</span>], <span class="pl-s1"><span class="pl-pds">"</span>賽季逐場全壘打分布<span class="pl-pds">"</span></span>)) <span class="pl-k">+</span> 
        theme(<span class="pl-v">text</span> <span class="pl-k">=</span> element_text(<span class="pl-v">family</span><span class="pl-k">=</span><span class="pl-s1"><span class="pl-pds">"</span>Heiti TC Light<span class="pl-pds">"</span></span>), <span class="pl-v">plot.title</span> <span class="pl-k">=</span> element_text(<span class="pl-v">size</span> <span class="pl-k">=</span> <span class="pl-c1">26</span>), <span class="pl-v">legend.text</span> <span class="pl-k">=</span> element_text(<span class="pl-v">size</span> <span class="pl-k">=</span> <span class="pl-c1">20</span>))

    plot of chunk unnamed-chunk-10


    Keep going – further analysis

    load(<span class="pl-s1"><span class="pl-pds">"</span>sc_tables.RData<span class="pl-pds">"</span></span>)
    
    <span class="pl-c"># 賽伯計量學的畢達哥拉斯定理</span>
    <span class="pl-en">win_pc</span> <span class="pl-k"><-</span> <span class="pl-k">function</span>(<span class="pl-vo">RS</span>, <span class="pl-vo">RA</span>) {
      <span class="pl-vo">perc</span> <span class="pl-k"><-</span> <span class="pl-vo">RS</span><span class="pl-k">^</span><span class="pl-c1">2</span> <span class="pl-k">/</span> (<span class="pl-vo">RS</span><span class="pl-k">^</span><span class="pl-c1">2</span> <span class="pl-k">+</span> <span class="pl-vo">RA</span><span class="pl-k">^</span><span class="pl-c1">2</span>) 
      <span class="pl-k">return</span>(<span class="pl-vo">perc</span>)
    }


    Draw some pictures (2/2)

    levels(<span class="pl-vo">win_prob</span><span class="pl-k">$</span><span class="pl-vo">team</span>)

    ## [1] "統一7-ELEVEn" "義大"         "中信兄弟"     "Lamigo"

    <span class="pl-c"># fig <- ggplot(data = win_prob, aes(x = team, y = real, fill = factor(half)))</span>
    <span class="pl-c"># fig + geom_bar(stat = "identity", position="dodge")</span>
    ggplot(<span class="pl-v">data</span> <span class="pl-k">=</span> <span class="pl-vo">win_prob</span>, aes(<span class="pl-v">x</span> <span class="pl-k">=</span> <span class="pl-vo">expd</span>, <span class="pl-v">y</span> <span class="pl-k">=</span> <span class="pl-vo">real</span>, <span class="pl-v">color</span> <span class="pl-k">=</span> <span class="pl-st">factor</span>(<span class="pl-vo">team</span>))) <span class="pl-k">+</span> stat_smooth() <span class="pl-k">+</span> geom_point() <span class="pl-k">+</span> coord_fixed()

    ## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.

    plot of chunk unnamed-chunk-13


    An example demonstrating the relationship between “Real Value” v.s. “Market Value”

    example



    Advertising

    To leave a comment for the author, please follow the link and comment on their blog: NYC Data Science Academy » R.

    R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
    Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

    Never miss an update!
    Subscribe to R-bloggers to receive
    e-mails with the latest R posts.
    (You will not see this message again.)

    Click here to close (This popup will not appear again)