Intro to {polite} Web Scraping of Soccer Data with R!

Fans of soccer/football have been left bereft of their prime form of
entertainment these past few months, and I’ve seen a huge uptick in the
number of casual fans and bloggers turning to programming languages such
as R or Python to augment their analytical toolkits. Free and easily
accessible data can be hard to find when you’ve only just started down
this path, and even when you do find it, you’ll realize that dragging
your mouse around and copying stuff into Excel just isn’t time-efficient
or even possible. The solution to this is web scraping! However, I feel
like a lot of people aren’t aware of the ethical conundrums surrounding
web scraping (especially if you’re coming from outside of a data
science/programming/etc. background… and even if you are, I might add).
I am by no means an expert, but since I started learning about all this
I’ve tried to “web scrape responsibly”, and this tenet will be
emphasized throughout this blog post. I will be going over examples of
scraping soccer data from Wikipedia, soccerway.com, and
transfermarkt.com. Do note that this post is focused on the web-scraping
part and won’t cover the visualizations; links to the viz code will be
given at the end of each section, and you can always check out my
soccer_ggplots Github repo for more soccer viz goodness!

Anyway, let’s get started!

Web Scraping Responsibly

When we think about R and web scraping, we normally just think straight
to loading {rvest} and going right on our merry way. However, there
are quite a lot of things you should know about web scraping practices
before you start diving in. “Just because you can, doesn’t mean you
should.” robots.txt is a file on websites that describes the
permissions/access privileges for any bots and crawlers that come across
the site. Certain parts of the website may not be accessible for certain
bots (say Twitter or Google), some may not be available at all, and in
the most extreme case, web scraping may even be prohibited. However, do
note that just because there is no robots.txt file or that it is
permissive of web scraping does not automatically mean you are
allowed to scrape. You should always check the website’s “Terms of
Use” or similar pages.

Web scraping takes up bandwidth for a host, especially if it houses lots
of data. So writing web scraping bots and functions that are polite and
respectful of the hosting site is necessary so that we don’t
inconvenience websites that are doing us a service by making the
data available for us for free! There’s a lot of things we take for
granted, especially regarding free soccer data, so let’s make sure we
can keep it that way.

In R there are a number of different packages that facilitate
responsible web scraping, including:

  • {robotstxt} is a package
    created by Peter Meissner and
    provides functions to parse robots.txt files in a clean way.

  • {ratelimitr} created by
    Tarak Shah provides ways to limit the
    rate at which functions are called. You can define a maximum of n
    calls per period of time for any function wrapped in
    ratelimitr::limit_rate(). A quick sketch of both packages follows
    below.
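
For a quick taste of those two packages, here is a minimal sketch
(these exact calls aren’t used again in this post, but paths_allowed()
and limit_rate()/rate() are the packages’ main entry points):

## check whether a path can be scraped (for bots in general)
robotstxt::paths_allowed("https://en.wikipedia.org/wiki/AFC_Asian_Cup_records_and_statistics")

## [1] TRUE

## allow at most 1 call every 5 seconds to the wrapped function
slow_read_html <- ratelimitr::limit_rate(
  xml2::read_html,
  ratelimitr::rate(n = 1, period = 5)
)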

The {polite} package rolls a lot of the things previously mentioned into
one neat package that flows seamlessly with the {rvest} API. I’ve been
using this package almost since its first release and it’s terrific! I
got to see the package author (Dmytro Perepolkin) present it at useR!
2019; you can find the video recording here. This blog post will mainly
focus on using {rvest} in combination with the {polite} package.

Single web-page (Wikipedia)

<span class="n">library</span><span class="p">(</span><span class="n">rvest</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">polite</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">dplyr</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">tidyr</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">purrr</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">stringr</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">glue</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">rlang</span><span class="p">)</span><span class="w">
</span>

For the first example, let’s start with scraping soccer data from
Wikipedia, specifically the top goal scorers of the Asian Cup.

We use polite::bow() on the URL of the Wikipedia article to get a
polite session object. This object tells you about the robots.txt, the
recommended crawl delay between scraping attempts, and whether you are
allowed to scrape this URL at all. You can also pass your own user name
to the user_agent argument to introduce yourself to the website.

<span class="n">topg_url</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s2">"https://en.wikipedia.org/wiki/AFC_Asian_Cup_records_and_statistics"</span><span class="w">

</span><span class="n">session</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">bow</span><span class="p">(</span><span class="n">topg_url</span><span class="p">,</span><span class="w">
               </span><span class="n">user_agent</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Ryo's R Webscraping Tutorial"</span><span class="p">)</span><span class="w">

</span><span class="n">session</span><span class="w">
</span>
## <polite session> https://en.wikipedia.org/wiki/AFC_Asian_Cup_records_and_statistics
##     User-agent: Ryo's R Webscraping Tutorial
##     robots.txt: 454 rules are defined for 33 bots
##    Crawl delay: 5 sec
##   The path is scrapable for this user-agent

Of course, just to make sure, remember to read the “Terms of Use” page
as well. When it comes to Wikipedia though, you could just download all
of Wikipedia’s data yourself and do a text-search through those files,
but that’s out of scope for this blog post. Maybe another time!

Now to actually get the data from the webpage. You’ve got different
options depending on what browser you’re using but on Google Chrome or
Mozilla Firefox you can find the exact HTML element by right clicking on
it and then clicking on “Inspect” or “Inspect Element” in the pop-up
menu. By doing so, a new view will open up showing you the full HTML
content of the webpage with the element you chose highlighted. (See first two pics)

You might also want to try using a handy JavaScript tool called
SelectorGadget; you can learn how to use it
here. It
allows you to click on different elements of the web page and
the gadget will try to ascertain the exact CSS Selector in the HTML. (See bottom pic)

Do be warned that web pages can change suddenly and the CSS Selector you
used in the past might not work anymore. I’ve had this happen more than a few times
as pages get updated with more info from new tournaments and such. This
is why you really should try to scrape from a more stable website, but a
lot of times for “simple” data Wikipedia is the easiest and best place
to scrape.

From here you can right-click again on the highlighted HTML code to
“Copy”, and then you can choose one of “CSS Selector”, “CSS Path”, or
“XPath”. I normally use “CSS Selector” and it will be the one I will use
throughout this tutorial. This is the exact reference within the HTML
code of the webpage of the object you want. I make sure to choose the
CSS Selector for the table itself and not just the info inside the
table.

With this copied, you can go back to your R script/RMD/etc. After
running the polite::scrape() function on your bow object, paste the CSS
Selector/Path/XPath you just copied into html_nodes(). The bow object
already stores the recommended scrape delay stipulated in the website’s
robots.txt, so you don’t have to input it manually when you scrape.

<span class="n">ac_top_scorers_node</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">scrape</span><span class="p">(</span><span class="n">session</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
  </span><span class="n">html_nodes</span><span class="p">(</span><span class="s2">"table.wikitable:nth-child(44)"</span><span class="p">)</span><span class="w">
</span>

Grabbing an HTML table is the easiest way to get data, as you usually
don’t have to do too much work to reshape it afterwards. We can do that
with the html_table() function. As the result comes back as a list, we
have to flatten it out one level using purrr::flatten_df(). Finish
cleaning it up by taking out the unnecessary “Ref” column with select()
and renaming the columns with set_names().

<span class="n">ac_top_scorers</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">ac_top_scorers_node</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">html_table</span><span class="p">()</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">flatten_df</span><span class="p">()</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">select</span><span class="p">(</span><span class="o">-</span><span class="n">Ref.</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">set_names</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="s2">"total_goals"</span><span class="p">,</span><span class="w"> </span><span class="s2">"player"</span><span class="p">,</span><span class="w"> </span><span class="s2">"country"</span><span class="p">))</span><span class="w">
</span>

After adding some flag and soccer ball images to the data.frame we get
this:

Do note that the image itself is from before the 2019 Asian Cup but
the data we scraped in the code above is updated. As a visualization
challenge try to create a similar viz with the updated data! You can
take a look at my Asian Cup 2019 blog
post
for how
I did it. Alternatively you can try doing the same as above except with
the
Euros.
Try grabbing the top goal scorer table from that page and make your own
graph!
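
A word of warning on that "table.wikitable:nth-child(44)" selector: it
is exactly the kind that breaks when a page gets edited (as mentioned
above). A more defensive sketch (not what I used originally, and the
"Player" column name is an assumption about the table headers) is to
grab every wikitable on the page and keep only the one whose headers
look right:

all_tables <- scrape(session) %>% 
  html_nodes("table.wikitable") %>% 
  html_table(fill = TRUE)

## keep tables that have a "Player" column, then take the first match
ac_top_scorers_robust <- all_tables %>% 
  purrr::keep(~ "Player" %in% names(.x)) %>% 
  purrr::pluck(1)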

Single-page (Transfermarkt)

So now let’s try a soccer-specific website, as that’s really the goal of
this blog post. This time we’ll go for one of the most famous soccer
websites around, transfermarkt.com, a website used as a data source by
everyone from your humble footy blogger to big news sites such as the
Financial Times and the BBC.

The example we’ll try comes from an Age-Value graph for the J-League
that I made around 2 years ago, when I had just started doing soccer
data viz (how time flies…).

<span class="n">url</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s2">"https://www.transfermarkt.com/j-league-division-1/startseite/wettbewerb/JAP1/saison_id/2017"</span><span class="w">

</span><span class="n">session</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">bow</span><span class="p">(</span><span class="n">url</span><span class="p">)</span><span class="w">

</span><span class="n">session</span><span class="w">
</span>
## <polite session> https://www.transfermarkt.com/j-league-division-1/startseite/wettbewerb/JAP1/saison_id/2017
##     User-agent: polite R package - https://github.com/dmi3kno/polite
##     robots.txt: 1 rules are defined for 1 bots
##    Crawl delay: 5 sec
##   The path is scrapable for this user-agent

The basic steps are the same as before, but I’ve found that it can be
quite tricky to find the right nodes on Transfermarkt, even with the
SelectorGadget or the other methods described in previous sections.
After a while you’ll get used to the quirks of how the website is
structured and easily know what certain assets (tables, columns,
images) are called. This is a website where the SelectorGadget really
comes in handy!

This time around I won’t be grabbing an entire table like I did with
Wikipedia but a number of elements from the webpage. You definitely
can scrape for the table like I showed above with html_table() but
in this case I didn’t because the table output was rather messy, gave me
way more info than I actually needed, and I wasn’t very good at
regex/stringr to clean the text 2 years ago. Try doing it the way below
and also by grabbing the entire table for more practice.

The way I did it back then also works out for this blog post because I
can show you a few other html_*() {rvest} functions:

  • html_table(): Get data from a HTML table
  • html_text(): Extract text from HTML
  • html_attr(): Extract attributes from HTML ("src" for image
    filename, "href" for URL link address)
<span class="n">team_name</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">scrape</span><span class="p">(</span><span class="n">session</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">html_nodes</span><span class="p">(</span><span class="s2">"#yw1 > table > tbody > tr > td.zentriert.no-border-rechts > a > img"</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">html_attr</span><span class="p">(</span><span class="s2">"alt"</span><span class="p">)</span><span class="w">

</span><span class="c1"># average age</span><span class="w">
</span><span class="n">avg_age</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">scrape</span><span class="p">(</span><span class="n">session</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">html_nodes</span><span class="p">(</span><span class="s2">"tbody .hide-for-pad:nth-child(5)"</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">html_text</span><span class="p">()</span><span class="w">

</span><span class="c1"># average value</span><span class="w">
</span><span class="n">avg_value</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">scrape</span><span class="p">(</span><span class="n">session</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">html_nodes</span><span class="p">(</span><span class="s2">"tbody .rechts+ .hide-for-pad"</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">html_text</span><span class="p">()</span><span class="w">

</span><span class="c1"># team image</span><span class="w">
</span><span class="n">team_img</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">scrape</span><span class="p">(</span><span class="n">session</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">html_nodes</span><span class="p">(</span><span class="s2">"#yw1 > table > tbody > tr > td.zentriert.no-border-rechts > a > img"</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">html_attr</span><span class="p">(</span><span class="s2">"src"</span><span class="p">)</span><span class="w">
</span>

With each element collected we can put them into a list and reshape it
into a nice data frame.

<span class="c1"># combine above into one list</span><span class="w">
</span><span class="n">resultados</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">team_name</span><span class="p">,</span><span class="w"> </span><span class="n">avg_age</span><span class="p">,</span><span class="w"> </span><span class="n">avg_value</span><span class="p">,</span><span class="w"> </span><span class="n">team_img</span><span class="p">)</span><span class="w">

</span><span class="c1"># specify column names</span><span class="w">
</span><span class="n">col_name</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"team"</span><span class="p">,</span><span class="w"> </span><span class="s2">"avg_age"</span><span class="p">,</span><span class="w"> </span><span class="s2">"avg_value"</span><span class="p">,</span><span class="w"> </span><span class="s2">"img"</span><span class="p">)</span><span class="w">

</span><span class="c1"># Combine into one dataframe</span><span class="w">
</span><span class="n">j_league_age_value_raw</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">resultados</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">reduce</span><span class="p">(</span><span class="n">cbind</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">tibble</span><span class="o">::</span><span class="n">as_tibble</span><span class="p">()</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">set_names</span><span class="p">(</span><span class="n">col_name</span><span class="p">)</span><span class="w">

</span><span class="n">glimpse</span><span class="p">(</span><span class="n">j_league_age_value_raw</span><span class="p">)</span><span class="w">
</span>
## Rows: 18
## Columns: 4
## $ team      <chr> "Vissel Kobe", "Urawa Red Diamonds", "Kawasaki Frontale",...
## $ avg_age   <chr> "25.9", "26.3", "25.5", "24.1", "25.4", "25.0", "25.0", "...
## $ avg_value <chr> "€1.02m", "€698Th.", "€577Th.", "€477Th.", "€524Th.", "€5...
## $ img       <chr> "https://tmssl.akamaized.net/images/wappen/tiny/3958.png?...
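
Note that avg_age and avg_value both came through as character strings.
As a minimal first-pass sketch (assuming the “m” and “Th.” suffixes seen
in the glimpse() output are the only formats, and using
readr::parse_number() to strip the “€”):

j_league_age_value <- j_league_age_value_raw %>% 
  mutate(
    avg_age   = as.numeric(avg_age),
    ## "€1.02m" -> 1,020,000; "€698Th." -> 698,000
    avg_value = case_when(
      str_detect(avg_value, "m")   ~ readr::parse_number(avg_value) * 1e6,
      str_detect(avg_value, "Th.") ~ readr::parse_number(avg_value) * 1e3,
      TRUE                         ~ readr::parse_number(avg_value)
    )
  )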

With some more cleaning and {ggplot2} magic (see
here,
start from line 53) you will then get:

Some other examples of scraping single web pages:

Multiple Web-pages (Soccerway, Transfermarkt, etc.)

The previous examples looked at scraping from a single web page, but
usually you want to collect data for each team in a league, each player
from each team, or each player from each team in every league, etc. This
is where the added complexity of web-scraping multiple pages comes in.
The most efficient way is to programmatically scrape across multiple
pages in one go instead of running the same scraping function on
different teams’ or players’ URLs over and over again.


Thinking About How to Scrape

  • Understand the website structure: How it organizes its pages, check
    out what the CSS Selector/XPaths are like, etc.
  • Get a list of links: Team page links from league page, player page
    links from team page, etc.
  • Create your own R functions: Pinpoint exactly what you want to
    scrape as well as some cleaning steps post-scraping in one function
    or multiple functions.
  • Start small, then scale up: Test your scraping function on one
    player/team, then do entire team/league.
  • Iterate over a set of URL links: Use {purrr}, for loops,
    lapply() (whatever your preference).

Look at the URL link for each web page you want to gather. What are the
similarities? What are the differences? If it’s a proper website, then
the web page for a certain data view should be structured exactly the
same for every team, as you’d expect it to contain exactly the same type
of info, just for a different team. For this example, each “squad view”
page for each Premier League team on soccerway.com is structured
similarly: https://us.soccerway.com/teams/england/
and then the “team name/”, the “team number/”, and finally the name of
the web page, “squad/”. So what we need to do here is to find out the
“team name” and “team number” for each of the teams and store them. We
can then feed each pair of these values in one at a time to scrape the
information for each team.

<span class="n">url</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s2">"https://us.soccerway.com/national/england/premier-league/20182019/regular-season/r48730/"</span><span class="w">

</span><span class="n">session</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">bow</span><span class="p">(</span><span class="n">url</span><span class="p">)</span><span class="w">

</span><span class="n">session</span><span class="w">
</span>
## <polite session> https://us.soccerway.com/national/england/premier-league/20182019/regular-season/r48730/
##     User-agent: polite R package - https://github.com/dmi3kno/polite
##     robots.txt: 4 rules are defined for 3 bots
##    Crawl delay: 5 sec
##   The path is scrapable for this user-agent

To find these elements we could just click on the link for each team and
jot them down… but wait, we can just scrape those too! We use the
html_attr() function to grab the “href” part of the HTML, which
contains the hyperlink of that element. The left picture is looking at
the URL link of one of the buttons to a team’s page via “Inspect”. The
right picture is selecting every team’s link via the SelectorGadget.

<span class="n">team_links</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">scrape</span><span class="p">(</span><span class="n">session</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">html_nodes</span><span class="p">(</span><span class="s2">"#page_competition_1_block_competition_tables_8_block_competition_league_table_1_table .large-link a"</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">html_attr</span><span class="p">(</span><span class="s2">"href"</span><span class="p">)</span><span class="w">

</span><span class="n">team_links</span><span class="p">[[</span><span class="m">1</span><span class="p">]]</span><span class="w">
</span>
## [1] "/teams/england/manchester-city-football-club/676/"

The URLs given in the href of the HTML for the team buttons
unfortunately aren’t the full URLs needed to access these pages. So we
have to cut out the important bits and re-create them ourselves. We can
use the {glue} package to combine the “team_name” and “team_num” for
each team into a complete URL, stored in a new column we’ll call link.

<span class="n">team_links_df</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">team_links</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">tibble</span><span class="o">::</span><span class="n">enframe</span><span class="p">(</span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">NULL</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="c1">## separate out each component of the URL by / and give them a name</span><span class="w">
  </span><span class="n">tidyr</span><span class="o">::</span><span class="n">separate</span><span class="p">(</span><span class="n">value</span><span class="p">,</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="kc">NA</span><span class="p">,</span><span class="w"> </span><span class="kc">NA</span><span class="p">,</span><span class="w"> </span><span class="kc">NA</span><span class="p">,</span><span class="w"> </span><span class="s2">"team_name"</span><span class="p">,</span><span class="w"> </span><span class="s2">"team_num"</span><span class="p">),</span><span class="w"> </span><span class="n">sep</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"/"</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="c1">## glue together the "team_name" and "team_num" into a complete URL</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">link</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">glue</span><span class="p">(</span><span class="s2">"https://us.soccerway.com/teams/england/{team_name}/{team_num}/squad/"</span><span class="p">))</span><span class="w">

</span><span class="n">glimpse</span><span class="p">(</span><span class="n">team_links_df</span><span class="p">)</span><span class="w">
</span>
## Rows: 20
## Columns: 3
## $ team_name <chr> "manchester-city-football-club", "liverpool-fc", "chelsea...
## $ team_num  <chr> "676", "663", "661", "675", "660", "662", "680", "674", "...
## $ link      <glue> "https://us.soccerway.com/teams/england/manchester-city-...

Fantastic! Now we have the proper URL links for each team. Next we have
to actually look into one of the web pages itself to figure out what
exactly we need to scrape from the web page. This assumes that each web
page and the CSS Selector for the various elements we want to grab are
the same for every team. As this is for a very simple goal contribution
plot all we need to gather from each team’s page is the “player name”,
“number of goals”, and “number of assists”. Use the Inspect element or
the SelectorGadget tool to grab the HTML code for those stats.

Below, I’ve split each into its own mini-scraper function. When you’re
working on this part, you should try to use the URL link from one team
and build your scraper functions from that link (I usually use Liverpool
as my test example when scraping Premier League teams). Note that all
three of the mini-functions below could just be chucked into one large
function but I like keeping things compartmentalized.

<span class="n">player_name_info</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">session</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
  
  </span><span class="n">player_name_info</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">scrape</span><span class="p">(</span><span class="n">session</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
    </span><span class="n">html_nodes</span><span class="p">(</span><span class="s2">"#page_team_1_block_team_squad_3-table .name.large-link"</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
    </span><span class="n">html_text</span><span class="p">()</span><span class="w">
</span><span class="p">}</span><span class="w">

</span><span class="n">num_goals_info</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">session</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">

  </span><span class="n">num_goals_info</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">scrape</span><span class="p">(</span><span class="n">session</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
    </span><span class="n">html_nodes</span><span class="p">(</span><span class="s2">".goals"</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
    </span><span class="n">html_text</span><span class="p">()</span><span class="w">
  
  </span><span class="c1">## first value is blank so remove it</span><span class="w">
  </span><span class="n">num_goals_info_clean</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">num_goals_info</span><span class="p">[</span><span class="m">-1</span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">

</span><span class="n">num_assists_info</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">session</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">

  </span><span class="n">num_assists_info</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">scrape</span><span class="p">(</span><span class="n">session</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
    </span><span class="n">html_nodes</span><span class="p">(</span><span class="s2">".assists"</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
    </span><span class="n">html_text</span><span class="p">()</span><span class="w">
  
  </span><span class="c1">## first value is blank so remove it</span><span class="w">
  </span><span class="n">num_assists_info_clean</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">num_assists_info</span><span class="p">[</span><span class="m">-1</span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span>
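
Following the “start small, then scale up” advice from earlier, it’s
worth test-driving these mini-scrapers on a single team before
iterating over all twenty. A quick sketch (assuming Liverpool is the
second row of team_links_df, as in the glimpse() output above):

liv_session <- bow(team_links_df$link[2])

liv_players <- player_name_info(session = liv_session)
head(liv_players, 3)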

Now that we have scrapers for each stat, we can combine these into a
larger function that will then gather them all up into a nice data frame
for each team that we want to scrape. If you input any one of the team
URLs from team_links_df, it will collect the “player name”, “number of
goals”, and “number of assists” for that team.

<span class="n">premier_stats_info</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">link</span><span class="p">,</span><span class="w"> </span><span class="n">team_name</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
  
  </span><span class="n">team_name</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rlang</span><span class="o">::</span><span class="n">enquo</span><span class="p">(</span><span class="n">team_name</span><span class="p">)</span><span class="w">
  </span><span class="c1">## `bow()` for every URL link</span><span class="w">
  </span><span class="n">session</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">bow</span><span class="p">(</span><span class="n">link</span><span class="p">)</span><span class="w">
  
  </span><span class="c1">## scrape different stats</span><span class="w">
  </span><span class="n">player_name</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">player_name_info</span><span class="p">(</span><span class="n">session</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">session</span><span class="p">)</span><span class="w">

  </span><span class="n">num_goals</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">num_goals_info</span><span class="p">(</span><span class="n">session</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">session</span><span class="p">)</span><span class="w">

  </span><span class="n">num_assists</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">num_assists_info</span><span class="p">(</span><span class="n">session</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">session</span><span class="p">)</span><span class="w">
  
  </span><span class="c1">## combine stats into a data frame</span><span class="w">
  </span><span class="n">resultados</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">player_name</span><span class="p">,</span><span class="w"> </span><span class="n">num_goals</span><span class="p">,</span><span class="w"> </span><span class="n">num_assists</span><span class="p">)</span><span class="w">
  </span><span class="n">col_names</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"name"</span><span class="p">,</span><span class="w"> </span><span class="s2">"goals"</span><span class="p">,</span><span class="w"> </span><span class="s2">"assists"</span><span class="p">)</span><span class="w"> 
  
  </span><span class="n">premier_stats</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">resultados</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
    </span><span class="n">reduce</span><span class="p">(</span><span class="n">cbind</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
    </span><span class="n">as_tibble</span><span class="p">()</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
    </span><span class="n">set_names</span><span class="p">(</span><span class="n">col_names</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
    </span><span class="n">mutate</span><span class="p">(</span><span class="n">team</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="o">!!</span><span class="n">team_name</span><span class="p">)</span><span class="w">
  
  </span><span class="c1">## A little message to keep track of how the function is progressing:</span><span class="w">
  </span><span class="c1"># cat(team_name, " done!")</span><span class="w">
  
  </span><span class="nf">return</span><span class="p">(</span><span class="n">premier_stats</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span>

OK, so now we have a function that can scrape the data for ONE team,
but it would be extremely ponderous to re-run it another NINETEEN times
for all the other teams… so what can we do? This is where the
purrr::map() family of functions and iteration comes in! The map()
family of functions allows you to apply a function (an existing one
from a package or one that you’ve created yourself) to each element of
a list or vector that you pass as an argument. For our purposes, this
means we can pass a list of URLs (for whatever number of players and/or
teams) along with a scraping function so that everything is scraped in
one go.
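
As a toy illustration of the pattern (a snippet just for this
explanation, not part of the scraping code):

## apply a function to each element of a vector;
## map() returns a list, map_dbl() a numeric vector, etc.
purrr::map_dbl(c(1, 2, 3), ~ .x * 2)

## [1] 2 4 6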

In addition, we can use purrr::safely() to wrap any function
(including custom-made ones). The wrapped function then returns a list
with the components result and error. This is extremely useful for
debugging complicated functions, as the function won’t just error out
and give you nothing: whatever worked comes back in result, and
whatever failed is captured in error.

So for example, say you are scraping data from the webpage of each team
in the Premier League (by iterating a single scraping function over
each team’s web page) and, by some weird quirk in the HTML of the web
page or in your code, the data from one team errors out (while the
other 19 teams’ data are gathered without problems). Normally, this
would mean that the data gathered from all the other web pages that did
work won’t be returned, which can be extremely frustrating. With a
safely()-wrapped function, the data from the 19 teams that the function
was able to scrape is returned in the result component of the list
object, while the one errored team and its error message are returned
in the error component. This makes it very easy to debug, as you know
exactly which iteration of the function failed.

<span class="n">safe_premier_stats_info</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">safely</span><span class="p">(</span><span class="n">premier_stats_info</span><span class="p">)</span><span class="w">
</span>

We already have a nice list of team URL links in the data frame
team_links_df, specifically in the “link” column (team_links_df$link).
So we pass that along to map2() (which is just a version of map() that
takes two argument inputs) together with our safely()-wrapped scraping
function, so that the function is applied to each team’s URL link. This
part may take a while depending on your internet connection and/or if
you set a large value for the crawl delay.

<span class="n">goal_contribution_df_ALL</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">map2</span><span class="p">(</span><span class="n">.x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">team_links_df</span><span class="o">$</span><span class="n">link</span><span class="p">,</span><span class="w"> </span><span class="n">.y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">team_links_df</span><span class="o">$</span><span class="n">team_name</span><span class="p">,</span><span class="w">
                             </span><span class="o">~</span><span class="w"> </span><span class="n">safe_premier_stats_info</span><span class="p">(</span><span class="n">link</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">.x</span><span class="p">,</span><span class="w"> </span><span class="n">team_name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">.y</span><span class="p">))</span><span class="w">

</span><span class="c1">## check out the first 4 results:</span><span class="w">
</span><span class="n">glimpse</span><span class="p">(</span><span class="n">head</span><span class="p">(</span><span class="n">goal_contribution_df_ALL</span><span class="p">,</span><span class="w"> </span><span class="m">4</span><span class="p">))</span><span class="w">
</span>

As you can see (the results/errors for the first four teams scraped),
for each team there is a list holding a “result” and “error” element.
For the first four, at least, it looks like everything was scraped
properly into a nice data.frame. We can check if any of the twenty teams
had an error by purrr::discard()-ing any elements of the list that
come out as NULL and seeing if there’s anything left.

<span class="c1">## check to see if any failed:</span><span class="w">
</span><span class="n">goal_contribution_df_ALL</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">map</span><span class="p">(</span><span class="s2">"error"</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">purrr</span><span class="o">::</span><span class="n">discard</span><span class="p">(</span><span class="o">~</span><span class="nf">is.null</span><span class="p">(</span><span class="n">.</span><span class="p">))</span><span class="w">
</span>
## list()

It comes out as an empty list, which means there were no errors in the
“error” elements. Now we can squish the individual team data.frames
together into one data.frame using dplyr::bind_rows().

<span class="n">goal_contribution_df</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">goal_contribution_df_ALL</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">map</span><span class="p">(</span><span class="s2">"result"</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">bind_rows</span><span class="p">()</span><span class="w">

</span><span class="n">glimpse</span><span class="p">(</span><span class="n">goal_contribution_df</span><span class="p">)</span><span class="w">
</span>
## Rows: 622
## Columns: 4
## $ name    <chr> "C. Bravo", "Ederson Moraes", "S. Carson", "K. Walker", "J....
## $ goals   <chr> "0", "0", "0", "1", "0", "0", "0", "0", "0", "2", "0", "0",...
## $ assists <chr> "0", "0", "0", "2", "0", "0", "0", "2", "0", "0", "0", "0",...
## $ team    <chr> "manchester-city-football-club", "manchester-city-football-...

With that we can clean the data a bit and finally get on to the
plotting! You can find the code in the original gist to see how I
created the plot below. I would really like to go into detail here,
especially as I use one of my favorite plotting packages, {ggforce},
but it deserves its own separate blog post.
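
If you just want a starting point before digging into the gist, a
minimal first-pass sketch (note that goals and assists were scraped as
character strings; across() needs dplyr >= 1.0.0):

goal_contribution_df_clean <- goal_contribution_df %>% 
  mutate(across(c(goals, assists), as.numeric))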

As you can see, this one was for the 2018-2019 season. I made a similar
one using xG per 90 and xA per 90 for the 2019-2020 season (as of
January 1st, 2020, at least) using FBRef data here. You can find the
code for it here. However, I did not web scrape it: as stated on their
Terms of Use page, FBRef (and the other SportsRef websites) do not
allow web scraping (“spidering”, “robots”). Thankfully, they make it
very easy to access their data as downloadable .csv files by just
clicking on a few buttons, so getting their data isn’t really a
problem!

For practice, try doing it for a different season or for a different
league altogether!

For other examples of scraping multiple pages:

Conclusion

This blog post went over web-scraping, focusing on getting soccer data
from soccer websites in a responsibly fashion. After a brief overview of
responsible scraping practices with R I went over several examples of
getting soccer data from various websites. I make no claims that its the
most efficient way, but importantly, it gets the job done and in a
polite way. More industrial-scale scraping over hundreds and thousands
of web pages is a bit out of scope for an introductory blog post and
it’s not something I’ve really done either, so I will pass along the
torch to someone else who wants to write about that. There are other
ways to scrape websites using R, especially websites that have dynamic
web pages, using R Selenium,
Headless Chrome (crrri), and other
tools.

In regards to FBRef: as it is now a really popular website to use
(especially with their partnership with StatsBomb), there is a blog
post out there detailing a way of using RSelenium to get around the
terms stipulated, and the reasoning seems OK, but I am still not 100%
sure about it. This goes to show again how a lot of web scraping can
sit in a rather grey area: for all the clear warnings on some websites,
many others leave you a lot more ambiguity and room for expedient
interpretation. At the end of the day, you just have to do your due
diligence, ask permission directly if possible, and be {polite} about
it.

Some other web-scraping tutorials you might be interested in:

As always, you can find more of my soccer-related stuff on this website
or in my soccer_ggplots Github repo!

Happy (responsible) Web-scraping!
