Fantasy Hockey with R

[This article was first published on Max Humber, and kindly contributed to R-bloggers.]

rvest and purrr are wonderful bedfellows. Both packages follow tidyverse conventions, so combining them when scraping the web feels simple and almost natural.

Here is a slimmed-down, worked recipe for using rvest and purrr in Fantasy Hockey.


Step 0. Load packages.

```r
library(tidyverse)
library(rvest)
# purrr and stringr are already attached by tidyverse; listed here for clarity
library(purrr)
library(stringr)
```

Step 1. Find a data source.

I’m going to use Fantasy Sports Portal for this example.

Step 2. Figure out the CSS selectors for the data.

SelectorGadget makes this dead simple.
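
To see what a positional selector such as `td:nth-child(2)` picks up, here is a small sketch against a made-up HTML table. The players and numbers are invented for illustration and do not come from the site.

```r
library(rvest)

# Hypothetical table standing in for a projections page
html <- '<table>
  <tr><td>1</td><td>Player A</td><td>5</td><td>0.4</td></tr>
  <tr><td>2</td><td>Player B</td><td>4</td><td>0.3</td></tr>
</table>'

page <- read_html(html)

# td:nth-child(2) matches every <td> that is the second child of its row
second_col <- page %>%
    html_nodes("td:nth-child(2)") %>%
    html_text()

# td:nth-child(4) matches the fourth cell of each row
fourth_col <- page %>%
    html_nodes("td:nth-child(4)") %>%
    html_text()
```

Each selector returns one value per table row, which is why the two vectors can later be lined up as columns.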

Step 3. Fetch the data elements.

I like to put everything in a tibble as soon as possible and use stringr to adjust the URL for the different position pages. You’ll notice that I’m only grabbing name and goals. Feel free to grab whatever!

```r
p_fetch <- function(position = "C") {

    # Build the projections URL for the requested position
    url <- str_c(sep = "",
        "https://www.fantasysp.com/projections/hockey/weekly/",
        position)

    page <- read_html(url)

    # Names come from the second cell of each table row
    names <- page %>%
        html_nodes("td:nth-child(2)") %>%
        html_text()

    # Goals come from the fourth cell of each table row
    goals <- page %>%
        html_nodes("td:nth-child(4)") %>%
        html_text()

    df <- tibble(name = names, goals = goals)

    return(df)
}
```
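
As a quick check on the URL step, `str_c()` is vectorised, so the same base address can be expanded for every position in one call. This is only a sketch; `base_url`, `positions` and `urls` are names made up for the illustration.

```r
library(stringr)

base_url <- "https://www.fantasysp.com/projections/hockey/weekly/"
positions <- c("C", "LW", "RW", "D")

# One call yields one URL per position
urls <- str_c(base_url, positions)
```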

Step 4. Iterate through each page.

Instead of writing a for loop, I like to use pmap from purrr to iterate through the Centre, Left-Wing, Right-Wing and Defense position projection pages (I left out the Goalies for obvious reasons).

```r
p_pull <- function() {

    # One row per position page to scrape
    params <- tibble(position = c("C", "LW", "RW", "D"))

    # pmap() calls p_fetch() once per row, matching columns to arguments by name
    df <- params %>%
        pmap(p_fetch) %>%
        bind_rows()

    return(df)
}
```
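
The pattern is easier to see without the network in the way. Below is a minimal sketch in which `fake_fetch()` is a hypothetical stand-in for `p_fetch()` that returns invented data.

```r
library(tibble)
library(purrr)
library(dplyr)

# Hypothetical stand-in for p_fetch() that needs no network access
fake_fetch <- function(position = "C") {
    tibble(name = paste0("player_", position), goals = "0.5")
}

params <- tibble(position = c("C", "LW", "RW", "D"))

# pmap() walks the tibble row by row and passes each column as a named
# argument, so the column name must match the function's argument name
pulled <- params %>%
    pmap(fake_fetch) %>%
    bind_rows()
```

Because arguments are matched by name, adding another column to `params` (a week number, say) would only require a matching argument on the function.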

Step 5. Clean and format the projection data.

This is a pretty janky use of separate but it works to get everything into a format that I like.

```r
p_clean <- function() {

    df <- p_pull() %>%
        # Split the raw cell text at capitalisation boundaries
        separate(name,
            into = c("junk", "first", "last", "meta"),
            sep = "(?=[A-Z][a-z])|(?<=[a-z])(?=[A-Z])",
            fill = "right",
            extra = "merge") %>%
        # meta holds the team and position, separated by whitespace
        separate(meta, into = c("team", "position"), sep = "\\s") %>%
        mutate(name = str_c(first, last, sep = "")) %>%
        mutate(goals = as.numeric(goals)) %>%
        # Drop rows with missing values
        drop_na() %>%
        mutate(length = str_length(team)) %>%
        # Keep only rows whose team code is at most three characters
        filter(length <= 3) %>%
        select(name, team, position, goals)

    return(df)
}

df <- p_clean()
```
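
The tail end of that pipeline is what throws away rows that are not real players. Here is a rough sketch on invented rows, assuming the debris looks something like the last two entries; what the site actually returns may differ.

```r
library(tibble)
library(dplyr)
library(tidyr)
library(stringr)

# Invented rows: the last two mimic debris a scrape might pick up
raw <- tibble(
    name = c("Player A", "Player B", "Player", "Player C"),
    team = c("PIT", "TOR", "Name", "Unknown"),
    position = c("C", "C", "Pos", "LW"),
    goals = c("0.4", "0.3", "G", "0.2"))

cleaned <- raw %>%
    # Non-numeric text becomes NA (as.numeric() would warn about the coercion)
    mutate(goals = suppressWarnings(as.numeric(goals))) %>%
    # Remove the row whose goals could not be parsed
    drop_na() %>%
    # Remove the row whose team code is longer than three characters
    filter(str_length(team) <= 3)
```

Only the first two rows survive: the third has no numeric goals value and the fourth fails the team-code length check.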

At this point you should have a nice clean tibble/dataframe with every player, their position, team, and their projected goals for this week. I could stop here, but I wanted to go a little further with a value over replacement player (VORP) calculation.

Step 6. Calculate a replacement player for each scoring position.

I’m using pmap again to pump through each position to get the mean value for the top X players. It’s a little overkill, but really flexible.

```r
p_replacement <- function(pos, slots) {

    # Note: relies on the global `df` created by p_clean() above
    rp <- df %>%
        filter(position == pos) %>%
        arrange(desc(goals)) %>%
        filter(row_number() <= slots) %>%
        group_by(position) %>%
        summarise(goals = mean(goals))

    return(rp)
}

p_vorp <- function() {

    # slots depend on how many position players start for each team
    # if there are 10 teams and 2 LW per team then slots -> 10 * 2 = 20

    params <- tribble(
        ~pos, ~slots,
        "C", 20,
        "LW", 20,
        "RW", 20,
        "D", 20)

    rp <- params %>%
        pmap(p_replacement) %>%
        bind_rows()

    return(rp)
}
```
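
To make the replacement calculation concrete, here is a toy run for a single position. The players, goals and league size are all invented for the example.

```r
library(tibble)
library(dplyr)

# Invented projections for a single position
toy <- tibble(
    name = c("A", "B", "C", "D"),
    position = "LW",
    goals = c(0.5, 0.4, 0.3, 0.1))

# Hypothetical tiny league: 3 teams starting 1 LW each -> 3 slots
slots <- 3

# Mean of the top three: (0.5 + 0.4 + 0.3) / 3 = 0.4
rp <- toy %>%
    filter(position == "LW") %>%
    arrange(desc(goals)) %>%
    filter(row_number() <= slots) %>%
    group_by(position) %>%
    summarise(goals = mean(goals))
```

Player D falls outside the three slots, so their 0.1 does not drag the replacement value down.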

Step 7. Calculate the VORP for each player.

Simple join at this point…

```r
replacement <- p_vorp()

# calculate value over replacement player

vorp <- df %>%
    left_join(replacement, by = "position") %>%
    mutate(goals_vorp = goals.x - goals.y) %>%
    rename(goals = goals.x, goals_rp = goals.y) %>%
    select(-goals_rp) %>%
    arrange(desc(goals_vorp))
```
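
The `goals.x` and `goals.y` names come from the join itself. A small sketch with invented numbers shows where they appear; `players`, `replacement_toy` and `vorp_toy` are names made up for the illustration.

```r
library(tibble)
library(dplyr)

players <- tibble(
    name = c("A", "B", "C"),
    position = c("C", "C", "D"),
    goals = c(0.6, 0.3, 0.2))

replacement_toy <- tibble(
    position = c("C", "D"),
    goals = c(0.4, 0.1))

# Both tables have a goals column, so left_join() suffixes them .x and .y
vorp_toy <- players %>%
    left_join(replacement_toy, by = "position") %>%
    mutate(goals_vorp = goals.x - goals.y) %>%
    rename(goals = goals.x) %>%
    select(-goals.y) %>%
    arrange(desc(goals_vorp))
```

A negative `goals_vorp`, as player B ends up with here, means the player projects below the replacement value for their position.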

And that’s it!
