An introduction to web scraping: locating Spanish schools

February 10, 2020

[This article was first published on R on Coding Club UC3M, and kindly contributed to R-bloggers].

by Jorge Cimentada

Introduction

Whenever a new paper is released using some type of scraped data, most of my peers in the social science community get baffled at how researchers can do this. In fact, many social scientists can’t even think of research questions that can be addressed with this type of data simply because they don’t know it’s even possible. As the old saying goes, when you have a hammer, every problem looks like a nail.

With the increasing amount of data being collected on a daily basis, it is imperative that scientists start getting familiar with new technologies that can help answer old questions. Moreover, we need to be adventurous about cutting-edge data sources as they can also allow us to ask new questions which weren’t even thought of in the past.

In this tutorial I’ll be guiding you through the basics of web scraping using R and the xml2 package. I’ll begin with a simple example using fake data and elaborate further by trying to scrape the location of a sample of schools in Spain.

Basic steps

For web scraping in R, you can fulfill almost all of your needs with the xml2 package. As you wander through the web, you’ll see many examples using the rvest package. xml2 and rvest are very similar so don’t feel you’re lagging behind for learning one and not the other. In addition to these two packages, we’ll need some other libraries for plotting locations on a map (ggplot2, sf, rnaturalearth), identifying who we are when we scrape (httr) and wrangling data (tidyverse).

Additionally, we’ll also need the package scrapex. In the real-world example that we’ll be doing below, we’ll be scraping data from the website www.buscocolegio.com to locate a sample of schools in Spain. However, throughout the tutorial we won’t be scraping the data directly from the real website. What would happen to this tutorial if 6 months from now www.buscocolegio.com updates the design of their website? Everything from our real-world example would be lost.

Web scraping tutorials are usually very unstable precisely because of this. To circumvent that problem, I’ve saved a random sample of websites from some schools in www.buscocolegio.com into an R package called scrapex. Although the links we’ll be working on will be hosted locally on your machine, the HTML of the website should be very similar to the one hosted on the website (with the exception of some images/icons which were deleted on purpose to make the package lightweight).

You can install the package with:

# install.packages("devtools")
devtools::install_github("cimentadaj/scrapex")

Now, let’s move on to the fake data example and load all of our packages with:

library(xml2)
library(httr)
library(tidyverse)
library(sf)
library(rnaturalearth)
library(ggplot2)
library(scrapex)

Let’s begin with a simple example. Below we define an XML string and look at its structure:

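xml_test <- "<people>
<jason>
  <person type='fictional'>
    <first_name>
      <married>
        Jason
      </married>
    </first_name>
    <last_name>
        Bourne
    </last_name>
    <occupation>
      Spy
    </occupation>
  </person>
</jason>
<carol>
  <person type='real'>
    <first_name>
      <married>
        Carol
      </married>
    </first_name>
    <last_name>
        Kalp
    </last_name>
    <occupation>
      Scientist
    </occupation>
  </person>
</carol>
</people>
"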

cat(xml_test)
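## <people>
## <jason>
##   <person type='fictional'>
##     <first_name>
##       <married>
##         Jason
##       </married>
##     </first_name>
##     <last_name>
##         Bourne
##     </last_name>
##     <occupation>
##       Spy
##     </occupation>
##   </person>
## </jason>
## <carol>
##   <person type='real'>
##     <first_name>
##       <married>
##         Carol
##       </married>
##     </first_name>
##     <last_name>
##         Kalp
##     </last_name>
##     <occupation>
##       Scientist
##     </occupation>
##   </person>
## </carol>
## </people>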

In XML and HTML the basic building blocks are something called tags. For example, the first tag in the structure shown above is <people>. This tag is matched by </people> at the end of the string.

If you pay close attention, you’ll see that each tag in the XML structure has a beginning (signaled by <>) and an end (signaled by </>). For example, the next tag after <people> is <jason> and right before the tag <carol> is the end of the jason tag </jason>.

Similarly, you’ll find that the <carol> tag is also matched by a </carol> finishing tag.

In theory, tags can have whatever meaning you attach to them (such as <people> or <occupation>). However, in practice there are hundreds of tags which are standard in websites. If you’re just getting started, there’s no need for you to learn them but as you progress in web scraping, you’ll start to recognize them (one brief example is <strong> which simply bolds text in a website).
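To make the matching rule concrete, here is a minimal sketch that checks whether xml2 can parse a string in which a tag is never closed. The two snippets and the helper name parses_ok are made up for illustration, and the sketch assumes that read_xml raises an error on malformed XML, which is its default behavior as far as I know:

```r
library(xml2)

# Well-formed: every opening tag has a matching closing tag
well_formed <- "<people><jason>Spy</jason></people>"

# Malformed (hypothetical example): the jason tag is never closed
malformed <- "<people><jason>Spy</people>"

# Hypothetical helper: TRUE if xml2 can parse the string, FALSE otherwise
parses_ok <- function(x) {
  tryCatch(
    {
      read_xml(x)
      TRUE
    },
    error = function(e) FALSE
  )
}

parses_ok(well_formed)
parses_ok(malformed)
```

The first call should succeed and the second should fail, because the closing </people> arrives while <jason> is still open.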

The xml2 package was designed to read XML strings and to navigate the tree structure to extract information. For example, let’s read in the XML data from our fake example and look at its general structure:

xml_raw <- read_xml(xml_test)
xml_structure(xml_raw)
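## <people>
##   <jason>
##     <person [type]>
##       <first_name>
##         <married>
##           {text}
##       <last_name>
##         {text}
##       <occupation>
##         {text}
##   <carol>
##     <person [type]>
##       <first_name>
##         <married>
##           {text}
##       <last_name>
##         {text}
##       <occupation>
##         {text}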

You can see that the structure is tree-based, meaning that tags such as <jason> and <carol> are nested within the <people> tag. In XML jargon, <people> is the root node, whereas <jason> and <carol> are the child nodes of <people>.

In more detail, the structure is as follows:

  • The root node is <people>
  • The child nodes are <jason> and <carol>
  • Then each child node has a <person> node, which in turn has the nodes <first_name>, <last_name>, and <occupation> nested within it.

Put another way, if something is nested within a node, then the nested node is a child of the upper-level node. In our example, the root node is <people> so we can check which are its children:

# xml_child returns only one child (specified in search)
# Here, jason is the first child
xml_child(xml_raw, search = 1)
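## {xml_node}
## <jason>
## [1] <person type="fictional">\n  <first_name>\n    <married>\n        Ja ...
# Here, carol is the second child
xml_child(xml_raw, search = 2)
## {xml_node}
## <carol>
## [1] <person type="real">\n  <first_name>\n    <married>\n        Carol\n ...
# Use xml_children to extract **all** children
child_xml <- xml_children(xml_raw)

child_xml
## {xml_nodeset (2)}
## [1] <jason>\n  <person type="fictional">\n    <first_name>\n      <ma ...
## [2] <carol>\n  <person type="real">\n    <first_name>\n      <married>\n ...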
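If the printed node sets feel hard to read, a small sketch like the one below can help you confirm where you are in the tree. It uses a compact, made-up copy of the toy document (the object name xml_small is an assumption for illustration) and asks xml2 for node names and child counts instead of printing whole nodes:

```r
library(xml2)

# Compact stand-in for the tutorial's toy document
xml_small <- read_xml(paste0(
  "<people>",
  "<jason><person type='fictional'><occupation>Spy</occupation></person></jason>",
  "<carol><person type='real'><occupation>Scientist</occupation></person></carol>",
  "</people>"
))

# Name of the root node
root_name <- xml_name(xml_small)

# Names of the root's direct children
child_names <- xml_name(xml_children(xml_small))

# Number of direct children of the root
n_children <- xml_length(xml_small)

root_name
child_names
n_children
```

Here the root should be reported as people, with two children named jason and carol.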

Tags can also have attributes, which are specified inside the opening tag, as in <some_tag attribute='value'>, with the tag closed as usual by </some_tag>. If you look at the XML structure of our example, you’ll notice that each <person> tag has an attribute called type. As you’ll see in our real-world example, extracting these attributes is often the aim of our scraping adventure. Using xml2, we can extract the attributes of each node with xml_attrs (its sibling xml_attr, which we’ll use later, extracts a single attribute by name).

# Extract the attributes (we expect `type`) from all nodes
xml_attrs(child_xml)
## [[1]]
## named character(0)
##
## [[2]]
## named character(0)

Wait, why didn’t this work? Well, if you look at the output of child_xml, we have two nodes: one for <jason> and one for <carol>.

child_xml
## {xml_nodeset (2)}
## [1] <jason>\n  <person type="fictional">\n    <first_name>\n      <ma ...
## [2] <carol>\n  <person type="real">\n    <first_name>\n      <married>\n ...

Do these tags have an attribute? No, because if they did, they would look something like <jason type='fake_tag'>. What we need is to look down at the <person> tag within <jason> and <carol> and extract the attribute from <person>.

Does this sound familiar? Both <jason> and <carol> have a <person> tag nested below them, which makes those <person> tags their children. We can go down one level by running xml_children on these tags and then extract the attributes.

# We go down one level of children
person_nodes <- xml_children(child_xml)

# <person> is now the main node, so we can extract attributes
person_nodes
## {xml_nodeset (2)}
## [1] <person type="fictional">\n  <first_name>\n    <married>\n        Ja ...
## [2] <person type="real">\n  <first_name>\n    <married>\n        Carol\n ...
# Both type attributes
xml_attrs(person_nodes)
## [[1]]
##        type
## "fictional"
##
## [[2]]
##   type
## "real"
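When you only care about one attribute, xml_attr (singular) is usually more convenient because it returns a plain character vector rather than a list. A minimal sketch on a stripped-down, made-up copy of the toy document (the attribute name age is hypothetical and only there to show what a missing attribute looks like):

```r
library(xml2)

# Stripped-down stand-in for the toy document
doc <- read_xml(paste0(
  "<people>",
  "<jason><person type='fictional'/></jason>",
  "<carol><person type='real'/></carol>",
  "</people>"
))

person_nodes <- xml_find_all(doc, "//person")

# One value per node, returned as a character vector
types <- xml_attr(person_nodes, "type")

# An attribute that does not exist comes back as NA
ages <- xml_attr(person_nodes, "age")

types
ages
```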

Using the xml_path function you can even find the ‘address’ of these nodes to retrieve specific tags without having to write down xml_children many times. For example:

# Specific address of each person tag for the whole xml tree
# only using the `person_nodes`
xml_path(person_nodes)
## [1] "/people/jason/person" "/people/carol/person"

We have the ‘address’ of specific tags in the tree but how do we extract them automatically? To extract specific ‘addresses’ of this XML tree, the main function we’ll use is xml_find_all. This function accepts the XML tree and an ‘address’ string. We can use very simple strings, such as the one given by xml_path:

# You can use results from xml_path like directories
xml_find_all(xml_raw, "/people/jason/person")
## {xml_nodeset (1)}
## [1] <person type="fictional">\n  <first_name>\n    <married>\n        Ja ...

The expression above asks for the node at "/people/jason/person". It returns the same node as xml_raw %>% xml_child(search = 1) %>% xml_child(search = 1), since the first xml_child only takes us down to <jason>. For deeply nested trees, xml_find_all is often much cleaner than calling xml_child repeatedly.
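As a quick sanity check, the sketch below compares the two routes to the same node: an absolute path versus stepping down with xml_child twice. The compact document and the object names by_path and by_child are assumptions made for this illustration:

```r
library(xml2)

# Compact stand-in for the toy document
doc <- read_xml(paste0(
  "<people>",
  "<jason><person type='fictional'><occupation>Spy</occupation></person></jason>",
  "<carol><person type='real'><occupation>Scientist</occupation></person></carol>",
  "</people>"
))

# Route 1: ask for the node by its 'address'
by_path <- xml_find_first(doc, "/people/jason/person")

# Route 2: walk down one level at a time (people -> jason -> person)
by_child <- xml_child(xml_child(doc, search = 1), search = 1)

# Both routes should point at the same place in the tree
xml_path(by_path)
xml_path(by_child)
```

Note that xml_find_first returns a single node while xml_find_all returns a node set, even when only one node matches.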

However, in most cases the ‘addresses’ used in xml_find_all come from a separate language called XPath (in fact, the ‘address’ we’ve been looking at is XPath). XPath is a complex language (much like regular expressions are for strings) and covering it in full is beyond this brief tutorial. However, with the examples we’ve seen so far, we can use some basic XPath which we’ll need later on.

To extract all the tags with a given name in a document, we can use //name_of_tag.

# Search for all 'married' nodes
xml_find_all(xml_raw, "//married")
## {xml_nodeset (2)}
## [1] <married>\n        Jason\n      </married>
## [2] <married>\n        Carol\n      </married>

With the previous XPath, we’re searching for all married tags within the complete XML tree. The result returns all married nodes (I use the words tags and nodes interchangeably) in the complete tree structure. Another example would be finding all <occupation> tags:

xml_find_all(xml_raw, "//occupation")
## {xml_nodeset (2)}
## [1] <occupation>\n      Spy\n    </occupation>
## [2] <occupation>\n      Scientist\n    </occupation>

If you want to find any other tag you can replace "//occupation" with your tag of interest and xml_find_all will find all of them.

If you want to find only the tags below your current node, you just need to add a . at the beginning: ".//occupation". For example, if we dive into the <jason> tag and want his <occupation> tag, "//occupation" will return all <occupation> tags in the document. Instead, ".//occupation" will return only the tags found below the current tag. For example:

xml_raw %>%
  # Dive only into Jason's tag
  xml_child(search = 1) %>%
  xml_find_all(".//occupation")
## {xml_nodeset (1)}
## [1] <occupation>\n      Spy\n    </occupation>
# Instead, the wrong way would have been:
xml_raw %>%
  # Dive only into Jason's tag
  xml_child(search = 1) %>%
  # Here we get both occupation tags
  xml_find_all("//occupation")
## {xml_nodeset (2)}
## [1] <occupation>\n      Spy\n    </occupation>
## [2] <occupation>\n      Scientist\n    </occupation>

The first example only returns <jason>’s occupation whereas the second returns all occupations, regardless of where you are in the tree.

XPath also allows you to select only the tags that carry a specific attribute value, such as the ones we saw earlier. For example, to keep only the <person> tags with the attribute type set to fictional, we could do it with:

# Give me all the tags 'person' that have an attribute type='fictional'
xml_raw %>%
  xml_find_all("//person[@type='fictional']")
## {xml_nodeset (1)}
## [1] <person type="fictional">\n  <first_name>\n    <married>\n        Ja ...

If you wanted to do the same but for the tags below your current nodes, the same trick we learned earlier would work: ".//person[@type='fictional']". These are just some primers that can help you jump easily to using XPath, but I encourage you to look at other examples on the web, as complex websites often require complex XPath expressions.
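Attribute filters can also be combined with further steps in the same expression. The sketch below, run on a compact made-up copy of the toy document, first keeps only the <person> tags of type real and then steps down to their <occupation>; it also shows [@type], which matches any <person> that has a type attribute at all, whatever its value:

```r
library(xml2)

# Compact stand-in for the toy document
doc <- read_xml(paste0(
  "<people>",
  "<jason><person type='fictional'><occupation>Spy</occupation></person></jason>",
  "<carol><person type='real'><occupation>Scientist</occupation></person></carol>",
  "</people>"
))

# Occupation of every person whose type is 'real'
real_jobs <- xml_find_all(doc, "//person[@type='real']/occupation")
real_job_text <- xml_text(real_jobs)

# Every person tag that has a type attribute, regardless of its value
typed_people <- xml_find_all(doc, "//person[@type]")

real_job_text
length(typed_people)
```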

Before we begin our real-world example, you might be asking yourself how you can actually extract the text/numeric data from these nodes. Well, that’s easy: xml_text.

xml_raw %>%
  xml_find_all(".//occupation") %>%
  xml_text()
## [1] "\n      Spy\n    "       "\n      Scientist\n    "

Once you’ve narrowed down your tree-based search to one single piece of text or numbers, xml_text() will extract that for you (there’s also xml_double and xml_integer for extracting numbers). As I said, XPath is really a huge language. If you’re interested, XPath cheat sheets have helped me a lot to learn tricks for easy scraping.
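The text we extracted above still carries the line breaks and indentation that surrounded it in the XML. A minimal sketch of how to drop them, using the trim argument of xml_text on a made-up snippet that mimics that whitespace:

```r
library(xml2)

# Made-up snippet with the same kind of padding as the toy document
doc <- read_xml(paste0(
  "<people>",
  "<jason><occupation>\n      Spy\n    </occupation></jason>",
  "<carol><occupation>\n      Scientist\n    </occupation></carol>",
  "</people>"
))

occupations <- xml_find_all(doc, "//occupation")

# Default: the surrounding whitespace is kept
raw_text <- xml_text(occupations)

# trim = TRUE strips leading and trailing whitespace
clean_text <- xml_text(occupations, trim = TRUE)

raw_text
clean_text
```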

Real-world example

We’re interested in making a list of many schools in Spain and visualizing their location. This can be useful for many things such as matching population density of children across different regions to school locations. The website www.buscocolegio.com contains a database of schools similar to what we’re looking for. As described at the beginning, instead of scraping the live website we’re going to use scrapex, whose function spanish_schools_ex() returns the links to a sample of school web pages saved locally on your computer.

Let’s look at an example for one school.

school_links <- spanish_schools_ex()

# Keep only the HTML file of one particular school.
school_url <- school_links[13]

school_url
## [1] "/usr/local/lib/R/site-library/scrapex/extdata/spanish_schools_ex/school_3006839.html"

If you’re interested in looking at the website interactively in your browser, you can do it with browseURL(prep_browser(school_url)). Let’s read the HTML (XML and HTML are usually interchangeable, so here we use read_html).

# Here we use `read_html` because `read_xml` is throwing an error
# when attempting to read. However, everything we've discussed
# should be the same.
school_raw <- read_html(school_url) %>% xml_child()

school_raw
## {html_node}
## <head>
##  [1] <title>Aquí encontrarás toda la información necesaria sobre CEIP SA ...
##  [2] <meta charset="utf-8">\n
##  [3] <meta name="viewport" content="width=device-width, initial-scale=1, ...
##  [4] <meta http-equiv="x-ua-compatible" content="ie=edge">\n
##  [5] <meta name="author" content="BuscoColegio">\n
##  [6] <meta name="description" content="Encuentra toda la información nec ...
##  [7] <meta name="keywords" content="opiniones SANCHIS GUARNER, contacto  ...
##  [8] <link rel="shortcut icon" href="/favicon.ico">\n
##  [9] <link rel="stylesheet" href="//fonts.googleapis.com/css?family=Robo ...
## [10] <link rel="stylesheet" href="https://s3.eu-west-3.amazonaws.com/bus ...
## [11] <link rel="stylesheet" href="/assets/vendor/icon-awesome/css/font-a ...
## [12] <link rel="stylesheet" href="/assets/vendor/icon-line/css/simple-li ...
## [13] <link rel="stylesheet" href="/assets/vendor/icon-line-pro/style.css ...
## [14] <link rel="stylesheet" href="/assets/vendor/icon-hs/style.css">\n
## [15] <link rel="stylesheet" href="https://s3.eu-west-3.amazonaws.com/bus ...
## [16] <link rel="stylesheet" href="https://s3.eu-west-3.amazonaws.com/bus ...
## [17] <link rel="stylesheet" href="https://s3.eu-west-3.amazonaws.com/bus ...
## [18] <link rel="stylesheet" href="https://s3.eu-west-3.amazonaws.com/bus ...
## [19] <link rel="stylesheet" href="https://s3.eu-west-3.amazonaws.com/bus ...
## [20] <link rel="stylesheet" href="https://s3.eu-west-3.amazonaws.com/bus ...
## ...</code></pre>
<p>Web scraping strategies are very specific to the website you’re after. You have to get very familiar with the website you’re interested in to be able to match exactly the information you’re looking for. In many cases, scraping two websites will require vastly different strategies. For this particular example, we’re only interested in figuring out the <strong>location</strong> of each school so we only have to extract its location.</p>
<p><img src="https://i2.wp.com/codingclubuc3m.rbind.io/post/2020-02-11_files/buscocolegios_xml/main_page.png?w=450&ssl=1" style="display: block; margin: auto;" /></p>
<p>In the image above you’ll find a typical school’s page on <code>www.buscocolegio.com</code>. The page has a lot of information, but we’re only interested in the button highlighted by the orange rectangle. If you can’t find it easily, it’s below the Google Maps widget on the right and says “Buscar colegio cercano”.</p>
<p>Clicking this button points you towards the coordinates of the school, so we just have to find a way of clicking the button or of getting at the information behind it. Most browsers let you inspect the source code behind a page: press CTRL + SHIFT + c at the same time (Firefox and Chrome support this hotkey). If a panel full of code pops up on the right, then you’re on the right track:</p>
<p><img src="https://i1.wp.com/codingclubuc3m.rbind.io/post/2020-02-11_files/buscocolegios_xml/developer_tools.png?w=450&ssl=1" style="display: block; margin: auto;" /></p>
<p>Here we can search the source code of the website. If you place your mouse pointer over the lines of code from this right-most window, you’ll see sections of the website being highlighted in blue. This indicates which parts of the code refer to which parts of the website. Luckily for us, we don’t have to search the complete source code to find that specific location. We can approximate our search by typing the text we’re looking for in the search bar at the top of the right window:</p>
<p><img src="https://i1.wp.com/codingclubuc3m.rbind.io/post/2020-02-11_files/buscocolegios_xml/search_developer_tools.png?w=450&ssl=1" style="display: block; margin: auto;" /></p>
<p>After we click enter, we’ll be automatically directed to the tag that has the information that we want.</p>
<p><img src="https://i0.wp.com/codingclubuc3m.rbind.io/post/2020-02-11_files/buscocolegios_xml/location_tag.png?w=450&ssl=1" style="display: block; margin: auto;" /></p>
<p>More specifically, we can see that the latitude and longitude of schools are found in an attribute called <code>href</code> in a tag <code><a></code>:</p>
<p><img src="https://i2.wp.com/codingclubuc3m.rbind.io/post/2020-02-11_files/buscocolegios_xml/location_tag_zoomed.png?w=450&ssl=1" style="display: block; margin: auto;" /></p>
<p>Can you see the latitude and longitude fields in the text highlighted blue? It’s hidden in-between words. That is precisely the type of information we’re after. Extracting all <code><a></code> tags from the website (hint: XPath similar to <code>"//a"</code>) will yield hundreds of matches because <code><a></code> is a very common tag. Moreover, refining the search to <code><a></code> tags which have an <code>href</code> attribute will also yield hundreds of matches because <code>href</code> is the standard attribute to attach links within websites. We need to narrow down our search within the website.</p>
<p>One strategy is to find the ‘father’ or ‘grandfather’ node of this particular <code><a></code> tag and then match a node which has that same sequence of grandfather -> father -> child node. By looking at the structure of this small XML snippet from the right-most window, we see that the ‘grandfather’ of this <code><a></code> tag is <code><p class="d-flex align-items-baseline g-mt-5"></code> which has a particularly long attribute named <code>class</code>.</p>
<p><img src="https://i2.wp.com/codingclubuc3m.rbind.io/post/2020-02-11_files/buscocolegios_xml/location_tag_zoomed.png?w=450&ssl=1" style="display: block; margin: auto;" /></p>
<p>Don’t be intimidated by these tag names and long attributes. I also don’t know what any of these attributes mean. But what I do know is that this is the ‘grandfather’ of the <code><a></code> tag I’m interested in. So using our XPath skills, let’s search for that <code><p></code> tag and see if we get only one match.</p>
<pre class="r"><code># Search for all <p> tags with that class in the document
school_raw %>%
  xml_find_all("//p[@class='d-flex align-items-baseline g-mt-5']")</code></pre>
<pre><code>## {xml_nodeset (1)}
## [1] <p class="d-flex align-items-baseline g-mt-5">\r\n\t                 ...</code></pre>
<p>Only one match, so this is good news. This means that we can uniquely identify this particular <code><p></code> tag. Let’s refine the search to say: find all <code><a></code> tags nested anywhere below that specific <code><p></code> tag (its descendants, not only its direct children). This only means I’ll add a <code>"//a"</code> to the previous expression. Since there is only one <code><p></code> tag with that class, what remains is to check whether there is more than one <code><a></code> tag below it.</p>
<pre class="r"><code>school_raw %>%
  xml_find_all("//p[@class='d-flex align-items-baseline g-mt-5']//a")</code></pre>
<pre><code>## {xml_nodeset (1)}
## [1] <a href="/Colegio/buscar-colegios-cercanos.action?colegio.latitud=38 ...</code></pre>
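<p>Matching the full <code>class</code> string works here, but it breaks as soon as the website adds or reorders a class. As a hedged alternative, the sketch below uses the XPath function <code>contains()</code> to match on a single class name. The page is a tiny made-up stand-in for a school page (its links and the <code>footer</code> class are hypothetical), not the real HTML:</p>
<pre class="r"><code>library(xml2)

# Hypothetical, heavily simplified stand-in for a school page
page <- read_html(paste0(
  "<html><body>",
  "<p class='d-flex align-items-baseline g-mt-5'>",
  "<a href='/cercanos?latitud=38.8'>Buscar colegio cercano</a></p>",
  "<p class='footer'><a href='/inicio'>Inicio</a></p>",
  "</body></html>"
))

# Every link in the page: too broad
all_links <- xml_find_all(page, "//a")
length(all_links)

# Only links below a paragraph whose class contains 'g-mt-5'
target <- xml_find_all(page, "//p[contains(@class, 'g-mt-5')]//a")
target_href <- xml_attr(target, "href")
target_href</code></pre>
<p>Keep in mind that <code>contains()</code> does plain substring matching, so a class such as <code>g-mt-50</code> would also match; whether that trade-off is acceptable depends on the website.</p>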
<p>There we go! We can see the specific <code>href</code> that contains the latitude and longitude data we’re interested in. How do we extract the <code>href</code> attribute? Using <code>xml_attr</code>, which extracts a single attribute by name (the counterpart of the <code>xml_attrs</code> function we used before)!</p>
<pre class="r"><code>location_str <-
  school_raw %>%
  xml_find_all("//p[@class='d-flex align-items-baseline g-mt-5']//a") %>%
  xml_attr(attr = "href")

location_str</code></pre>
<pre><code>## [1] "/Colegio/buscar-colegios-cercanos.action?colegio.latitud=38.8274492&colegio.longitud=0.0221681"</code></pre>
<p>Ok, now we need some regex skills to get only the latitude and longitude (regular expressions are used to search for patterns inside a string, such as a date. See <a href="https://www.jumpingrivers.com/blog/regular-expressions-every-r-programmer-should-know/" rel="nofollow" target="_blank">here</a> for some examples):</p>
<pre class="r"><code>location <-
  location_str %>%
  str_extract_all("=.+$") %>%
  str_replace_all("=|colegio\\.longitud", "") %>%
  str_split("&") %>%
  .[[1]]

location</code></pre>
<pre><code>## [1] "38.8274492" "0.0221681"</code></pre>
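<p>The chain above works by deleting everything that is not a coordinate. An alternative sketch, not the approach used in the rest of this tutorial, is to describe the two numbers directly with capture groups and let <code>str_match</code> pull them out. The pattern assumes the <code>href</code> always lists latitude first and longitude second, as in the example string:</p>
<pre class="r"><code>library(stringr)

location_str <- "/Colegio/buscar-colegios-cercanos.action?colegio.latitud=38.8274492&colegio.longitud=0.0221681"

# One capture group per coordinate; -? allows negative values
pattern <- "colegio\\.latitud=(-?[0-9.]+)&colegio\\.longitud=(-?[0-9.]+)"

# str_match returns a matrix: column 1 is the full match,
# columns 2 and 3 are the two capture groups
matched <- str_match(location_str, pattern)

coords <- as.numeric(matched[1, 2:3])
coords</code></pre>
<p>If the pattern does not match, <code>str_match</code> returns <code>NA</code> for every column, which makes failures easy to spot later on.</p>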
<p>Ok, so we got the information we needed for one single school. Let’s turn that into a function so we can pass only the school’s link and get the coordinates back.</p>
<p>Before we do that, I will set something called my <code>User-Agent</code>. In short, the <code>User-Agent</code> is <strong>who</strong> you are. It is good practice to identify yourself when scraping a website: if your requests cause any trouble, the website’s administrators can tell who is responsible. You can figure out your user agent <a href="https://www.google.com/search?client=ubuntu&channel=fs&q=what%27s+my+user+agent&ie=utf-8&oe=utf-8" rel="nofollow" target="_blank">here</a> and paste it in the string below. In addition, I will add a time sleep of 5 seconds to the function because we want to make sure we don’t overload the website we’re scraping with requests.</p>
<pre class="r"><code># This sets your `User-Agent` globally so that all requests made
# through httr are identified with this `User-Agent`
set_config(
  user_agent("Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:70.0) Gecko/20100101 Firefox/70.0")
)

# Collapse all of the code from above into one function called
# school grabber

school_grabber <- function(school_url) {
  # We add a time sleep of 5 seconds to avoid
  # sending too many quick requests to the website
  Sys.sleep(5)

  school_raw <- read_html(school_url) %>% xml_child()

  location_str <-
    school_raw %>%
    xml_find_all("//p[@class='d-flex align-items-baseline g-mt-5']//a") %>%
    xml_attr(attr = "href")

  location <-
    location_str %>%
    str_extract_all("=.+$") %>%
    str_replace_all("=|colegio\\.longitud", "") %>%
    str_split("&") %>%
    .[[1]]

  # Turn into a data frame
  data.frame(
    latitude = location[1],
    longitude = location[2],
    stringsAsFactors = FALSE
  )
}


school_grabber(school_url)</code></pre>
<pre><code>##     latitude longitude
## 1 38.8274492 0.0221681</code></pre>
<p>Ok, so it’s working. The only thing left is to extract this for many schools. As shown earlier, <code>scrapex</code> contains a list of 27 school links that we can automatically scrape. Let’s loop over those, get the information of coordinates for each and collapse all of them into a data frame.</p>
<pre class="r"><code>res <- map_dfr(school_links, school_grabber)
res</code></pre>
<pre><code>##    latitude  longitude
## 1  42.72779 -8.6567935
## 2  43.24439 -8.8921645
## 3  38.95592 -1.2255769
## 4  39.18657 -1.6225903
## 5  40.38245 -3.6410388
## 6  40.22929 -3.1106322
## 7  40.43860 -3.6970366
## 8  40.33514 -3.5155669
## 9  40.50546 -3.3738441
## 10 40.63826 -3.4537107
## 11 40.38543 -3.6639500
## 12 37.76485 -1.5030467
## 13 38.82745  0.0221681
## 14 40.99434 -5.6224391
## 15 40.99434 -5.6224391
## 16 40.56037 -5.6703725
## 17 40.99434 -5.6224391
## 18 40.99434 -5.6224391
## 19 41.13593  0.9901905
## 20 41.26155  1.1670507
## 21 41.22851  0.5461471
## 22 41.14580  0.8199749
## 23 41.18341  0.5680564
## 24 42.07820  1.8203155
## 25 42.25245  1.8621546
## 26 41.73767  1.8383666
## 27 41.62345  2.0013628</code></pre>
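<p>With local files every page parses fine, but on a live website one broken page would stop the whole loop. A hedged sketch of one way to guard against that is to wrap the function with <code>possibly</code> from <code>purrr</code>, which returns a fallback value instead of raising an error. The function <code>fake_grabber</code> and the links below are made up for illustration and only mimic the shape of <code>school_grabber</code>:</p>
<pre class="r"><code>library(purrr)

# Hypothetical stand-in for school_grabber(): fails on an empty link
fake_grabber <- function(school_url) {
  if (!nzchar(school_url)) stop("no link supplied")
  data.frame(
    latitude = "38.8274492",
    longitude = "0.0221681",
    stringsAsFactors = FALSE
  )
}

# Same function, but errors are replaced by a row of missing values
safe_grabber <- possibly(
  fake_grabber,
  otherwise = data.frame(
    latitude = NA_character_,
    longitude = NA_character_,
    stringsAsFactors = FALSE
  )
)

# The second (empty) link would normally stop the loop
links <- c("school_a.html", "", "school_b.html")
res_safe <- do.call(rbind, map(links, safe_grabber))
res_safe</code></pre>
<p>The rows with missing values tell you which links to inspect by hand afterwards.</p>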
<p>So now that we have the locations of these schools, let’s plot them:</p>
<pre class="r"><code>res <- mutate_all(res, as.numeric)

sp_sf <-
  ne_countries(scale = "large", country = "Spain", returnclass = "sf") %>%
  st_transform(crs = 4326)

ggplot(sp_sf) +
  geom_sf() +
  geom_point(data = res, aes(x = longitude, y = latitude)) +
  coord_sf(xlim = c(-20, 10), ylim = c(25, 45)) +
  theme_minimal() +
  ggtitle("Sample of schools in Spain")</code></pre>
<p><img src="https://i2.wp.com/codingclubuc3m.rbind.io/post/2020-02-11_files/figure-html/unnamed-chunk-34-1.png?w=80%25&ssl=1" style="display: block; margin: auto;" /></p>
<p>There we go! We went from literally no information at the beginning of this tutorial to interpretable and summarized information only using web data. We can see some schools in Madrid (center) as well as in other regions of Spain, including Catalonia and Galicia.</p>
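<p>If you plan to do more spatial work with these points (for example, joining them to regional boundaries), it can help to turn the data frame into an <code>sf</code> object. A minimal sketch, using two of the coordinate pairs from the table above and assuming they are plain WGS84 longitude/latitude (EPSG 4326):</p>
<pre class="r"><code>library(sf)

# Two of the scraped coordinate pairs, already converted to numeric
coords <- data.frame(
  latitude = c(40.43860, 38.82745),
  longitude = c(-3.6970366, 0.0221681)
)

# Longitude is the x coordinate and latitude the y coordinate
schools_sf <- st_as_sf(
  coords,
  coords = c("longitude", "latitude"),
  crs = 4326
)

schools_sf</code></pre>
<p>An object like this can be drawn with <code>geom_sf</code> in place of <code>geom_point</code>.</p>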
<p>This marks the end of our scraping adventure but before we finish, I want to mention some of the ethical guidelines for web scraping. Scraping is extremely useful for us but can give headaches to other people maintaining the website of interest. Here’s a list of ethical guidelines you should always follow:</p>
<ul>
<li>
<p>Read the terms of service: many websites prohibit web scraping and you could be committing a breach of privacy by scraping the data. <a href="https://fortune.com/2016/05/18/okcupid-data-research/" rel="nofollow" target="_blank">One</a> famous example.</p>
</li>
<li>
<p>Check the <code>robots.txt</code> file. This is a file that most websites have (<code>www.buscocolegio.com</code> does <strong>not</strong>) which tells you which specific paths inside the website are scrapable and which are not. See <a href="https://www.robotstxt.org/robotstxt.html" rel="nofollow" target="_blank">here</a> for an explanation of what robots.txt files look like and where to find them.</p>
</li>
<li>
<p>Some websites are supported by very big servers, which means you can send 4-5 website requests per second. Others, such as <code>www.buscocolegio.com</code> are not. It’s good practice to always put a time sleep between your requests. In our example, I set it to 5 seconds because this is a small website and we don’t want to crash their servers.</p>
</li>
<li>
<p>When making requests, there are computational ways of identifying yourself. For example, every request (such as the ones we make) can have something called a <code>User-Agent</code>. It is good practice to identify yourself in the <code>User-Agent</code> (as we did in our code) because the admin of the server can directly identify if someone’s causing problems due to their web scraping.</p>
</li>
<li>
<p>Limit your scraping to non-busy hours such as overnight. This can help reduce the chances of collapsing the website since fewer people are visiting websites at night.</p>
</li>
</ul>
<p>You can read more about these ethical issues <a href="http://robertorocha.info/on-the-ethics-of-web-scraping/" rel="nofollow" target="_blank">here</a>.</p>
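<p>To give a feel for the <code>robots.txt</code> guideline, here is a deliberately naive sketch that reads the <code>Disallow</code> lines of a made-up file and checks a path against them. The file content, the paths and the helper <code>is_allowed</code> are all hypothetical, and the sketch ignores <code>User-agent</code> groups, <code>Allow</code> rules and wildcards, so treat it as an illustration rather than a complete parser:</p>
<pre class="r"><code># Hypothetical robots.txt content, used only for illustration
robots_txt <- c(
  "User-agent: *",
  "Disallow: /admin/",
  "Disallow: /private/"
)

# Keep the Disallow lines and strip the directive name
disallow_lines <- grep("^Disallow:", robots_txt, value = TRUE)
disallowed <- trimws(sub("^Disallow:", "", disallow_lines))

# Hypothetical helper: a path is allowed if it starts with none
# of the disallowed prefixes
is_allowed <- function(path) {
  !any(startsWith(path, disallowed))
}

is_allowed("/Colegio/detalles.action")
is_allowed("/admin/login")</code></pre>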
</div>
<div id="wrap-up" class="section level2">
<h2>Wrap up</h2>
<p>This tutorial introduced you to basic concepts in web scraping and applied them in a real-world setting. Web scraping is a vast field in computer science (you can find entire books on the subject such as <a href="https://www.apress.com/gp/book/9781484235812" rel="nofollow" target="_blank">this</a>). We covered some basic techniques which I think can take you a long way but there’s definitely more to learn. For those curious about where to turn, I’m looking forward to the upcoming book <a href="https://rud.is/b/books/" rel="nofollow" target="_blank">“A Field Guide for Web Scraping and Accessing APIs with R”</a> by Bob Rudis, which should be released in the near future. Now go scrape some websites ethically!</p>
</div>

<p class="syndicated-attribution"><div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 13px;">
<div style="text-align: center;">To <strong>leave a comment</strong> for the author, please follow the link and comment on their blog: <strong><a href="https://codingclubuc3m.rbind.io/post/2020-02-11/"> R on Coding Club UC3M</a></strong>.</div>
</div></p><hr /><hr />
</div>

	</div><!-- #post-## -->




       
	</div>
</div>

		<div class="wpusb wpusb-buttons wpusb-fixed-right   wpusb-fixed wpusb-layout-buttons-content wpusb-fixed-position_fixed"
		     id="wpusb-container-fixed"
		     data-element-url="https%3A%2F%2Fwww.r-bloggers.com%2Fan-introduction-to-web-scraping-locating-spanish-schools%2F%3Futm_source%3Dshare_buttons%26utm_medium%3Dsocial_media%26utm_campaign%3Dsocial_share"
		     data-element-title="An%20introduction%20to%20web%20scraping%3A%20locating%20Spanish%20schools"
		     data-attr-reference="193096"
		     data-is-term="0"
		     data-element="fixed"
		     data-attr-nonce="48323d3fd1"
		      data-disabled-share-counts="1" data-wpusb-component="counter-social-share">

			<div data-element="buttons" class="wpusb-fixed-right-container ">
			<div class="wpusb-item wpusb-email ">
				<a href="mailto:?subject=An%20introduction%20to%20web%20scraping%3A%20locating%20Spanish%20schools&body=https%3A%2F%2Fwww.r-bloggers.com%2Fan-introduction-to-web-scraping-locating-spanish-schools%2F%3Futm_source%3Dshare_buttons%26utm_medium%3Dsocial_media%26utm_campaign%3Dsocial_share
0%281%29%7D%0A%23%23%20%20%26lt%3Boccupation%26gt%3B%5Cn%20%20%20%20%20%20Spy%5Cn%20%20%20%20%26lt%3B%2Foccupation%26gt%3B%0A%23%20Instead%2C%20the%20wrong%20way%20would%20have%20been%3A%0Axml_raw%20%25%26gt%3B%25%0A%20%20%23%20Dive%20only%20into%20Jason%26%2339%3Bs%20tag%0A%20%20xml_child%28search%20%3D%201%29%20%25%26gt%3B%25%0A%20%20%23%20Here%20we%20get%20both%20occupation%20tags%0A%20%20xml_find_all%28%26quot%3B%2F%2Foccupation%26quot%3B%29%0A%23%23%20%7Bxml_nodeset%20%282%29%7D%0A%23%23%20%20%26lt%3Boccupation%26gt%3B%5Cn%20%20%20%20%20%20Spy%5Cn%20%20%20%20%26lt%3B%2Foccupation%26gt%3B%0A%23%23%20%20%26lt%3Boccupation%26gt%3B%5Cn%20%20%20%20%20%20Scientist%5Cn%20%20%20%20%26lt%3B%2Foccupation%26gt%3B%0AThe%20first%20example%20only%20returns%20%26lt%3Bjason%26gt%3B%E2%80%99s%20occupation%20whereas%20the%20second%20returned%20all%20occupations%2C%20regardless%20of%20where%20you%20are%20in%20the%20tree.%0AXPath%20also%20allows%20you%20to%20identify%20tags%20that%20contain%20only%20one%20specific%20attribute%2C%20such%20as%20the%20one%E2%80%99s%20we%20saw%20earlier.%20For%20example%2C%20to%20filter%20all%20%26lt%3Bperson%26gt%3B%20tags%20with%20the%20attribute%20filter%20set%20to%20fictional%2C%20we%20could%20do%20it%20with%3A%0A%23%20Give%20me%20all%20the%20tags%20%26%2339%3Bperson%26%2339%3B%20that%20have%20an%20attribute%20type%3D%26%2339%3Bfictional%26%2339%3B%0Axml_raw%20%25%26gt%3B%25%0A%20%20xml_find_all%28%26quot%3B%2F%2Fperson%26quot%3B%29%0A%23%23%20%7Bxml_nodeset%20%281%29%7D%0A%23%23%20%20%26lt%3Bperson%20type%3D%26quot%3Bfictional%26quot%3B%26gt%3B%5Cn%20%20%26lt%3Bfirst_name%26gt%3B%5Cn%20%20%20%20%26lt%3Bmarried%26gt%3B%5Cn%20%20%20%20%20%20%20%20Ja%20...%0AIf%20you%20wanted%20to%20do%20the%20same%20but%20for%20the%20tags%20below%20your%20current%20nodes%2C%20the%20same%20trick%20we%20learned%20earlier%20would%20work%3A%20%22.%2F%2Fperson%22.%20These%20are%20just%20some%20primers%20that%20can%20help%20you%20jump%20easily%20to%20using%20XPath%2C%20b
ut%20I%20encourage%20you%20to%20look%20at%20other%20examples%20on%20the%20web%2C%20as%20complex%20websites%20often%20require%20complex%20XPath%20expressions.%0ABefore%20we%20begin%20our%20real-word%20example%2C%20you%20might%20be%20asking%20yourself%20how%20you%20can%20actually%20extract%20the%20text%2Fnumeric%20data%20from%20these%20nodes.%20Well%2C%20that%E2%80%99s%20easy%3A%20xml_text.%0Axml_raw%20%25%26gt%3B%25%0A%20%20xml_find_all%28%26quot%3B.%2F%2Foccupation%26quot%3B%29%20%25%26gt%3B%25%0A%20%20xml_text%28%29%0A%23%23%20%20%26quot%3B%5Cn%20%20%20%20%20%20Spy%5Cn%20%20%20%20%26quot%3B%20%20%20%20%20%20%20%26quot%3B%5Cn%20%20%20%20%20%20Scientist%5Cn%20%20%20%20%26quot%3B%0AOnce%20you%E2%80%99ve%20narrowed%20down%20your%20tree-based%20search%20to%20one%20single%20piece%20of%20text%20or%20numbers%2C%20xml_text%28%29%20will%20extract%20that%20for%20you%20%28there%E2%80%99s%20also%20xml_double%20and%20xml_integer%20for%20extracting%20numbers%29.%20As%20I%20said%2C%20XPath%20is%20really%20a%20huge%20language.%20If%20you%E2%80%99re%20interested%2C%20this%20XPath%20cheat%20sheets%20have%20helped%20me%20a%20lot%20to%20learn%20tricks%20for%20easy%20scraping.%0A%0A%0AReal-world%20example%0AWe%E2%80%99re%20interested%20in%20making%20a%20list%20of%20many%20schools%20in%20Spain%20and%20visualizing%20their%20location.%20This%20can%20be%20useful%20for%20many%20things%20such%20as%20matching%20population%20density%20of%20children%20across%20different%20regions%20to%20school%20locations.%20The%20website%20www.buscocolegio.com%20contains%20a%20database%20of%20schools%20similar%20to%20what%20we%E2%80%99re%20looking%20for.%20As%20described%20at%20the%20beginning%2C%20instead%20we%E2%80%99re%20going%20to%20use%20scrapex%20which%20has%20the%20function%20spanish_schools_ex%28%29%20containing%20the%20links%20to%20a%20sample%20of%20websites%20from%20different%20schools%20saved%20locally%20on%20your%20computer.%0ALet%E2%80%99s%20look%20at%20an%20example%20for%20one%20school.%0Aschool_l
inks%20%26lt%3B-%20spanish_schools_ex%28%29%0A%0A%23%20Keep%20only%20the%20HTML%20file%20of%20one%20particular%20school.%0Aschool_url%20%26lt%3B-%20school_links%0A%0Aschool_url%0A%23%23%20%20%26quot%3B%2Fusr%2Flocal%2Flib%2FR%2Fsite-library%2Fscrapex%2Fextdata%2Fspanish_schools_ex%2Fschool_3006839.html%26quot%3B%0AIf%20you%E2%80%99re%20interested%20in%20looking%20at%20the%20website%20interactively%20in%20your%20browser%2C%20you%20can%20do%20it%20with%20browseURL%28prep_browser%28school_url%29%29.%20Let%E2%80%99s%20read%20the%20HTML%20%28XML%20and%20HTML%20are%20usually%20interchangeable%2C%20so%20here%20we%20use%20read_html%29.%0A%23%20Here%20we%20use%20%60read_html%60%20because%20%60read_xml%60%20is%20throwing%20an%20error%0A%23%20when%20attempting%20to%20read.%20However%2C%20everything%20we%26%2339%3Bve%20discussed%0A%23%20should%20be%20the%20same.%0Aschool_raw%20%26lt%3B-%20read_html%28school_url%29%20%25%26gt%3B%25%20xml_child%28%29%0A%0Aschool_raw%0A%23%23%20%7Bhtml_node%7D%0A%23%23%20%26lt%3Bhead%26gt%3B%0A%23%23%20%20%20%26lt%3Btitle%26gt%3BAqu%C3%AD%20encontrar%C3%A1s%20toda%20la%20informaci%C3%B3n%20necesaria%20sobre%20CEIP%20SA%20...%0A%23%23%20%20%20%26lt%3Bmeta%20charset%3D%26quot%3Butf-8%26quot%3B%26gt%3B%5Cn%0A%23%23%20%20%20%26lt%3Bmeta%20name%3D%26quot%3Bviewport%26quot%3B%20content%3D%26quot%3Bwidth%3Ddevice-width%2C%20initial-scale%3D1%2C%20...%0A%23%23%20%20%20%26lt%3Bmeta%20http-equiv%3D%26quot%3Bx-ua-compatible%26quot%3B%20content%3D%26quot%3Bie%3Dedge%26quot%3B%26gt%3B%5Cn%0A%23%23%20%20%20%26lt%3Bmeta%20name%3D%26quot%3Bauthor%26quot%3B%20content%3D%26quot%3BBuscoColegio%26quot%3B%26gt%3B%5Cn%0A%23%23%20%20%20%26lt%3Bmeta%20name%3D%26quot%3Bdescription%26quot%3B%20content%3D%26quot%3BEncuentra%20toda%20la%20informaci%C3%B3n%20nec%20...%0A%23%23%20%20%20%26lt%3Bmeta%20name%3D%26quot%3Bkeywords%26quot%3B%20content%3D%26quot%3Bopiniones%20SANCHIS%20GUARNER%2C%20contacto%20%20...%0A%23%23%20%20%20%26lt%3Blink%20rel%3D%26quot%3Bshortcut%20icon%26qu
ot%3B%20href%3D%26quot%3B%2Ffavicon.ico%26quot%3B%26gt%3B%5Cn%0A%23%23%20%20%20%26lt%3Blink%20rel%3D%26quot%3Bstylesheet%26quot%3B%20href%3D%26quot%3B%2F%2Ffonts.googleapis.com%2Fcss%3Ffamily%3DRobo%20...%0A%23%23%20%20%26lt%3Blink%20rel%3D%26quot%3Bstylesheet%26quot%3B%20href%3D%26quot%3Bhttps%3A%2F%2Fs3.eu-west-3.amazonaws.com%2Fbus%20...%0A%23%23%20%20%26lt%3Blink%20rel%3D%26quot%3Bstylesheet%26quot%3B%20href%3D%26quot%3B%2Fassets%2Fvendor%2Ficon-awesome%2Fcss%2Ffont-a%20...%0A%23%23%20%20%26lt%3Blink%20rel%3D%26quot%3Bstylesheet%26quot%3B%20href%3D%26quot%3B%2Fassets%2Fvendor%2Ficon-line%2Fcss%2Fsimple-li%20...%0A%23%23%20%20%26lt%3Blink%20rel%3D%26quot%3Bstylesheet%26quot%3B%20href%3D%26quot%3B%2Fassets%2Fvendor%2Ficon-line-pro%2Fstyle.css%20...%0A%23%23%20%20%26lt%3Blink%20rel%3D%26quot%3Bstylesheet%26quot%3B%20href%3D%26quot%3B%2Fassets%2Fvendor%2Ficon-hs%2Fstyle.css%26quot%3B%26gt%3B%5Cn%0A%23%23%20%20%26lt%3Blink%20rel%3D%26quot%3Bstylesheet%26quot%3B%20href%3D%26quot%3Bhttps%3A%2F%2Fs3.eu-west-3.amazonaws.com%2Fbus%20...%0A%23%23%20%20%26lt%3Blink%20rel%3D%26quot%3Bstylesheet%26quot%3B%20href%3D%26quot%3Bhttps%3A%2F%2Fs3.eu-west-3.amazonaws.com%2Fbus%20...%0A%23%23%20%20%26lt%3Blink%20rel%3D%26quot%3Bstylesheet%26quot%3B%20href%3D%26quot%3Bhttps%3A%2F%2Fs3.eu-west-3.amazonaws.com%2Fbus%20...%0A%23%23%20%20%26lt%3Blink%20rel%3D%26quot%3Bstylesheet%26quot%3B%20href%3D%26quot%3Bhttps%3A%2F%2Fs3.eu-west-3.amazonaws.com%2Fbus%20...%0A%23%23%20%20%26lt%3Blink%20rel%3D%26quot%3Bstylesheet%26quot%3B%20href%3D%26quot%3Bhttps%3A%2F%2Fs3.eu-west-3.amazonaws.com%2Fbus%20...%0A%23%23%20%20%26lt%3Blink%20rel%3D%26quot%3Bstylesheet%26quot%3B%20href%3D%26quot%3Bhttps%3A%2F%2Fs3.eu-west-3.amazonaws.com%2Fbus%20...%0A%23%23%20...%0AWeb%20scraping%20strategies%20are%20very%20specific%20to%20the%20website%20you%E2%80%99re%20after.%20You%20have%20to%20get%20very%20familiar%20with%20the%20website%20you%E2%80%99re%20interested%20to%20be%20able%20to%20match%20perfectly%20the%20in
formation%20you%E2%80%99re%20looking%20for.%20In%20many%20cases%2C%20scraping%20two%20websites%20will%20require%20vastly%20different%20strategies.%20For%20this%20particular%20example%2C%20we%E2%80%99re%20only%20interested%20in%20figuring%20out%20the%20location%20of%20each%20school%20so%20we%20only%20have%20to%20extract%20its%20location.%0A%0A%0AIn%20the%20image%20above%20you%E2%80%99ll%20find%20a%20typical%20school%E2%80%99s%20website%20in%20wwww.buscocolegio.com.%20The%20website%20has%20a%20lot%20of%20information%2C%20but%20we%E2%80%99re%20only%20interested%20in%20the%20button%20that%20is%20circled%20by%20the%20orange%20rectangle.%20If%20you%20can%E2%80%99t%20find%20it%20easily%2C%20it%E2%80%99s%20below%20the%20Google%20Maps%20on%20the%20right%20which%20says%20%E2%80%9CBuscar%20colegio%20cercano%E2%80%9D.%0AWhen%20you%20click%20on%20this%20button%2C%20this%20actually%20points%20you%20towards%20the%20coordinates%20of%20the%20school%20so%20we%20just%20have%20to%20find%20a%20way%20of%20figuring%20out%20how%20to%20click%20this%20button%20or%20figure%20out%20how%20to%20get%20its%20information.%20All%20browsers%20allow%20you%20to%20do%20this%20if%20you%20press%20CTRL%20%2B%20SHIFT%20%2B%20c%20at%20the%20same%20time%20%28Firefox%20and%20Chrome%20support%20this%20hotkey%29.%20If%20a%20window%20on%20the%20right%20popped%20in%20full%20of%20code%2C%20then%20you%E2%80%99re%20on%20the%20right%20track%3A%0A%0A%0A%0AHere%20we%20can%20search%20the%20source%20code%20of%20the%20website.%20If%20you%20place%20your%20mouse%20pointer%20over%20the%20lines%20of%20code%20from%20this%20right-most%20window%2C%20you%E2%80%99ll%20see%20sections%20of%20the%20website%20being%20highlighted%20in%20blue.%20This%20indicates%20which%20parts%20of%20the%20code%20refer%20to%20which%20parts%20of%20the%20website.%20Luckily%20for%20us%2C%20we%20don%E2%80%99t%20have%20to%20search%20the%20complete%20source%20code%20to%20find%20that%20specific%20location.%20We%20can%20approximate%20our%20search%20by%20typing%
20the%20text%20we%E2%80%99re%20looking%20for%20in%20the%20search%20bar%20at%20the%20top%20of%20the%20right%20window%3A%0A%0A%0A%0AAfter%20we%20click%20enter%2C%20we%E2%80%99ll%20be%20automatically%20directed%20to%20the%20tag%20that%20has%20the%20information%20that%20we%20want.%0A%0A%0A%0AMore%20specifically%2C%20we%20can%20see%20that%20the%20latitude%20and%20longitude%20of%20schools%20are%20found%20in%20an%20attributed%20called%20href%20in%20a%20tag%20%26lt%3Ba%26gt%3B%3A%0A%0A%0A%0ACan%20you%20see%20the%20latitude%20and%20longitude%20fields%20in%20the%20text%20highlighted%20blue%3F%20It%E2%80%99s%20hidden%20in-between%20words.%20That%20is%20precisely%20the%20type%20of%20information%20we%E2%80%99re%20after.%20Extracting%20all%20%26lt%3Ba%26gt%3B%20tags%20from%20the%20website%20%28hint%3A%20XPath%20similar%20to%20%22%2F%2Fa%22%29%20will%20yield%20hundreds%20of%20matches%20because%20%26lt%3Ba%26gt%3B%20is%20a%20very%20common%20tag.%20Moreover%2C%20refining%20the%20search%20to%20%26lt%3Ba%26gt%3B%20tags%20which%20have%20an%20href%20attribute%20will%20also%20yield%20hundreds%20of%20matches%20because%20href%20is%20the%20standard%20attribute%20to%20attach%20links%20within%20websites.%20We%20need%20to%20narrow%20down%20our%20search%20within%20the%20website.%0AOne%20strategy%20is%20to%20find%20the%20%E2%80%98father%E2%80%99%20or%20%E2%80%98grandfather%E2%80%99%20node%20of%20this%20particular%20%26lt%3Ba%26gt%3B%20tag%20and%20then%20match%20a%20node%20which%20has%20that%20same%20sequence%20of%20grandfather%20-%26gt%3B%20father%20-%26gt%3B%20child%20node.%20By%20looking%20at%20the%20structure%20of%20this%20small%20XML%20snippet%20from%20the%20right-most%20window%2C%20we%20see%20that%20the%20%E2%80%98grandfather%E2%80%99%20of%20this%20%26lt%3Ba%26gt%3B%20tag%20is%20%26lt%3Bp%20class%3D%22d-flex%20align-items-baseline%20g-mt-5%27%26gt%3B%20which%20has%20a%20particularly%20long%20attribute%20named%20class.%0A%0A%0A%0ADon%E2%80%99t%20be%20intimidated%20by%20these%20tag%20names%20
and%20long%20attributes.%20I%20also%20don%E2%80%99t%20know%20what%20any%20of%20these%20attributes%20mean.%20But%20what%20I%20do%20know%20is%20that%20this%20is%20the%20%E2%80%98grandfather%E2%80%99%20of%20the%20%26lt%3Ba%26gt%3B%20tag%20I%E2%80%99m%20interested%20in.%20So%20using%20our%20XPath%20skills%2C%20let%E2%80%99s%20search%20for%20that%20%26lt%3Bp%26gt%3B%20tag%20and%20see%20if%20we%20get%20only%20one%20match.%0A%23%20Search%20for%20all%20%26lt%3Bp%26gt%3B%20tags%20with%20that%20class%20in%20the%20document%0Aschool_raw%20%25%26gt%3B%25%0A%20%20xml_find_all%28%26quot%3B%2F%2Fp%26quot%3B%29%0A%23%23%20%7Bxml_nodeset%20%281%29%7D%0A%23%23%20%20%26lt%3Bp%20class%3D%26quot%3Bd-flex%20align-items-baseline%20g-mt-5%26quot%3B%26gt%3B%5Cr%5Cn%5Ct%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20...%0AOnly%20one%20match%2C%20so%20this%20is%20good%20news.%20This%20means%20that%20we%20can%20uniquely%20identify%20this%20particular%20%26lt%3Bp%26gt%3B%20tag.%20Let%E2%80%99s%20refine%20the%20search%20to%20say%3A%20Find%20all%20%26lt%3Ba%26gt%3B%20tags%20which%20are%20children%20of%20that%20specific%20%26lt%3Bp%26gt%3B%20tag.%20This%20only%20means%20I%E2%80%99ll%20add%20a%20%22%2F%2Fa%22%20to%20the%20previous%20expression.%20Since%20there%20is%20only%20one%20%26lt%3Bp%26gt%3B%20tag%20with%20the%20class%2C%20we%E2%80%99re%20interested%20in%20checking%20whether%20there%20is%20more%20than%20one%20%26lt%3Ba%26gt%3B%20tag%20below%20this%20%26lt%3Bp%26gt%3B%20tag.%0Aschool_raw%20%25%26gt%3B%25%0A%20%20xml_find_all%28%26quot%3B%2F%2Fp%2F%2Fa%26quot%3B%29%0A%23%23%20%7Bxml_nodeset%20%281%29%7D%0A%23%23%20%20%26lt%3Ba%20href%3D%26quot%3B%2FColegio%2Fbuscar-colegios-cercanos.action%3Fcolegio.latitud%3D38%20...%0AThere%20we%20go%21%20We%20can%20see%20the%20specific%20href%20that%20contains%20the%20latitude%20and%20longitude%20data%20we%E2%80%99re%20interested%20in.%20How%20do%20we%20extract%20the%20href%20attribute%3F%20Using%20xml_attr%20as%20we%20did%20before%21%0Alocation_str%20%26l
t%3B-%0A%20%20school_raw%20%25%26gt%3B%25%0A%20%20xml_find_all%28%26quot%3B%2F%2Fp%2F%2Fa%26quot%3B%29%20%25%26gt%3B%25%0A%20%20xml_attr%28attr%20%3D%20%26quot%3Bhref%26quot%3B%29%0A%0Alocation_str%0A%23%23%20%20%26quot%3B%2FColegio%2Fbuscar-colegios-cercanos.action%3Fcolegio.latitud%3D38.8274492%26amp%3Bcolegio.longitud%3D0.0221681%26quot%3B%0AOk%2C%20now%20we%20need%20some%20regex%20skills%20to%20get%20only%20the%20latitude%20and%20longitude%20%28regex%20expressions%20are%20used%20to%20search%20for%20patterns%20inside%20a%20string%2C%20such%20as%20for%20example%20a%20date.%20See%20here%20for%20some%20examples%29%3A%0Alocation%20%26lt%3B-%0A%20%20location_str%20%25%26gt%3B%25%0A%20%20str_extract_all%28%26quot%3B%3D.%2B%24%26quot%3B%29%20%25%26gt%3B%25%0A%20%20str_replace_all%28%26quot%3B%3D%7Ccolegio%5C%5C.longitud%26quot%3B%2C%20%26quot%3B%26quot%3B%29%20%25%26gt%3B%25%0A%20%20str_split%28%26quot%3B%26amp%3B%26quot%3B%29%20%25%26gt%3B%25%0A%20%20.%0A%0Alocation%0A%23%23%20%20%26quot%3B38.8274492%26quot%3B%20%26quot%3B0.0221681%26quot%3B%0AOk%2C%20so%20we%20got%20the%20information%20we%20needed%20for%20one%20single%20school.%20Let%E2%80%99s%20turn%20that%20into%20a%20function%20so%20we%20can%20pass%20only%20the%20school%E2%80%99s%20link%20and%20get%20the%20coordinates%20back.%0ABefore%20we%20do%20that%2C%20I%20will%20set%20something%20called%20my%20User-Agent.%20In%20short%2C%20the%20User-Agent%20is%20who%20you%20are.%20It%20is%20good%20practice%20to%20identify%20the%20person%20who%20is%20scraping%20the%20website%20because%20if%20you%E2%80%99re%20causing%20any%20trouble%20on%20the%20website%2C%20the%20website%20can%20directly%20identify%20who%20is%20causing%20problems.%20You%20can%20figure%20out%20your%20user%20agent%20here%20and%20paste%20it%20in%20the%20string%20below.%20In%20addition%2C%20I%20will%20add%20a%20time%20sleep%20of%205%20seconds%20to%20the%20function%20because%20we%20want%20to%20make%20sure%20we%20don%E2%80%99t%20cause%20any%20troubles%20to%20the%20w
ebsite%20we%E2%80%99re%20scraping%20due%20to%20an%20overload%20of%20requests.%0A%23%20This%20sets%20your%20%60User-Agent%60%20globally%20so%20that%20all%20requests%20are%0A%23%20identified%20with%20this%20%60User-Agent%60%0Aset_config%28%0A%20%20user_agent%28%26quot%3BMozilla%2F5.0%20%28X11%3B%20Ubuntu%3B%20Linux%20x86_64%3B%20rv%3A70.0%29%20Gecko%2F20100101%20Firefox%2F70.0%26quot%3B%29%0A%29%0A%0A%23%20Collapse%20all%20of%20the%20code%20from%20above%20into%20one%20function%20called%0A%23%20school%20grabber%0A%0Aschool_grabber%20%26lt%3B-%20function%28school_url%29%20%7B%0A%20%20%23%20We%20add%20a%20time%20sleep%20of%205%20seconds%20to%20avoid%0A%20%20%23%20sending%20too%20many%20quick%20requests%20to%20the%20website%0A%20%20Sys.sleep%285%29%0A%0A%20%20school_raw%20%26lt%3B-%20read_html%28school_url%29%20%25%26gt%3B%25%20xml_child%28%29%0A%0A%20%20location_str%20%26lt%3B-%0A%20%20%20%20school_raw%20%25%26gt%3B%25%0A%20%20%20%20xml_find_all%28%26quot%3B%2F%2Fp%2F%2Fa%26quot%3B%29%20%25%26gt%3B%25%0A%20%20%20%20xml_attr%28attr%20%3D%20%26quot%3Bhref%26quot%3B%29%0A%0A%20%20location%20%26lt%3B-%0A%20%20%20%20location_str%20%25%26gt%3B%25%0A%20%20%20%20str_extract_all%28%26quot%3B%3D.%2B%24%26quot%3B%29%20%25%26gt%3B%25%0A%20%20%20%20str_replace_all%28%26quot%3B%3D%7Ccolegio%5C%5C.longitud%26quot%3B%2C%20%26quot%3B%26quot%3B%29%20%25%26gt%3B%25%0A%20%20%20%20str_split%28%26quot%3B%26amp%3B%26quot%3B%29%20%25%26gt%3B%25%0A%20%20%20%20.%0A%0A%20%20%23%20Turn%20into%20a%20data%20frame%0A%20%20data.frame%28%0A%20%20%20%20latitude%20%3D%20location%2C%0A%20%20%20%20longitude%20%3D%20location%2C%0A%20%20%20%20stringsAsFactors%20%3D%20FALSE%0A%20%20%29%0A%7D%0A%0A%0Aschool_grabber%28school_url%29%0A%23%23%20%20%20%20%20latitude%20longitude%0A%23%23%201%2038.8274492%200.0221681%0AOk%2C%20so%20it%E2%80%99s%20working.%20The%20only%20thing%20left%20is%20to%20extract%20this%20for%20many%20schools.%20As%20shown%20earlier%2C%20scrapex%20contains%20a%20list%20of%2027%20school%20links%
20that%20we%20can%20automatically%20scrape.%20Let%E2%80%99s%20loop%20over%20those%2C%20get%20the%20information%20of%20coordinates%20for%20each%20and%20collapse%20all%20of%20them%20into%20a%20data%20frame.%0Ares%20%26lt%3B-%20map_dfr%28school_links%2C%20school_grabber%29%0Ares%0A%23%23%20%20%20%20latitude%20%20longitude%0A%23%23%201%20%2042.72779%20-8.6567935%0A%23%23%202%20%2043.24439%20-8.8921645%0A%23%23%203%20%2038.95592%20-1.2255769%0A%23%23%204%20%2039.18657%20-1.6225903%0A%23%23%205%20%2040.38245%20-3.6410388%0A%23%23%206%20%2040.22929%20-3.1106322%0A%23%23%207%20%2040.43860%20-3.6970366%0A%23%23%208%20%2040.33514%20-3.5155669%0A%23%23%209%20%2040.50546%20-3.3738441%0A%23%23%2010%2040.63826%20-3.4537107%0A%23%23%2011%2040.38543%20-3.6639500%0A%23%23%2012%2037.76485%20-1.5030467%0A%23%23%2013%2038.82745%20%200.0221681%0A%23%23%2014%2040.99434%20-5.6224391%0A%23%23%2015%2040.99434%20-5.6224391%0A%23%23%2016%2040.56037%20-5.6703725%0A%23%23%2017%2040.99434%20-5.6224391%0A%23%23%2018%2040.99434%20-5.6224391%0A%23%23%2019%2041.13593%20%200.9901905%0A%23%23%2020%2041.26155%20%201.1670507%0A%23%23%2021%2041.22851%20%200.5461471%0A%23%23%2022%2041.14580%20%200.8199749%0A%23%23%2023%2041.18341%20%200.5680564%0A%23%23%2024%2042.07820%20%201.8203155%0A%23%23%2025%2042.25245%20%201.8621546%0A%23%23%2026%2041.73767%20%201.8383666%0A%23%23%2027%2041.62345%20%202.0013628%0ASo%20now%20that%20we%20have%20the%20locations%20of%20these%20schools%2C%20let%E2%80%99s%20plot%20them%3A%0Ares%20%26lt%3B-%20mutate_all%28res%2C%20as.numeric%29%0A%0Asp_sf%20%26lt%3B-%0A%20%20ne_countries%28scale%20%3D%20%26quot%3Blarge%26quot%3B%2C%20country%20%3D%20%26quot%3BSpain%26quot%3B%2C%20returnclass%20%3D%20%26quot%3Bsf%26quot%3B%29%20%25%26gt%3B%25%0A%20%20st_transform%28crs%20%3D%204326%29%0A%0Aggplot%28sp_sf%29%20%2B%0A%20%20geom_sf%28%29%20%2B%0A%20%20geom_point%28data%20%3D%20res%2C%20aes%28x%20%3D%20longitude%2C%20y%20%3D%20latitude%29%29%20%2B%0A%20%20coord_sf%28xlim%20%3D%20c%28-20%2C%201
0%29%2C%20ylim%20%3D%20c%2825%2C%2045%29%29%20%2B%0A%20%20theme_minimal%28%29%20%2B%0A%20%20ggtitle%28%26quot%3BSample%20of%20schools%20in%20Spain%26quot%3B%29%0A%0AThere%20we%20go%21%20We%20went%20from%20literally%20no%20information%20at%20the%20beginning%20of%20this%20tutorial%20to%20interpretable%20and%20summarized%20information%20only%20using%20web%20data.%20We%20can%20see%20some%20schools%20in%20Madrid%20%28center%29%20as%20well%20in%20other%20regions%20of%20Spain%2C%20including%20Catalonia%20and%20Galicia.%0AThis%20marks%20the%20end%20of%20our%20scraping%20adventure%20but%20before%20we%20finish%2C%20I%20want%20to%20mention%20some%20of%20the%20ethical%20guidelines%20for%20web%20scraping.%20Scraping%20is%20extremely%20useful%20for%20us%20but%20can%20give%20headaches%20to%20other%20people%20maintaining%20the%20website%20of%20interest.%20Here%E2%80%99s%20a%20list%20of%20ethical%20guidelines%20you%20should%20always%20follow%3A%0A%0ARead%20the%20terms%20and%20services%3A%20many%20websites%20prohibit%20web%20scraping%20and%20you%20could%20be%20in%20a%20breach%20of%20privacy%20by%20scraping%20the%20data.%20One%20famous%20example.%0ACheck%20the%20robots.txt%20file.%20This%20is%20a%20file%20that%20most%20websites%20have%20%28www.buscocolegio.com%20does%20not%29%20which%20tell%20you%20which%20specific%20paths%20inside%20the%20website%20are%20scrapable%20and%20which%20are%20not.%20See%20here%20for%20an%20explanation%20of%20what%20robots.txt%20look%20like%20and%20where%20to%20find%20them.%0ASome%20websites%20are%20supported%20by%20very%20big%20servers%2C%20which%20means%20you%20can%20send%204-5%20website%20requests%20per%20second.%20Others%2C%20such%20as%20www.buscocolegio.com%20are%20not.%20It%E2%80%99s%20good%20practice%20to%20always%20put%20a%20time%20sleep%20between%20your%20requests.%20In%20our%20example%2C%20I%20set%20it%20to%205%20seconds%20because%20this%20is%20a%20small%20website%20and%20we%20don%E2%80%99t%20want%20to%20crash%20their%20servers.%0AWhen%20making%20r
equests%2C%20there%20are%20computational%20ways%20of%20identifying%20yourself.%20For%20example%2C%20every%20request%20%28such%20as%20the%20one%E2%80%99s%20we%20do%29%20can%20have%20something%20called%20a%20User-Agent.%20It%20is%20good%20practice%20to%20include%20yourself%20in%20as%20the%20User-Agent%20%28as%20we%20did%20in%20our%20code%29%20because%20the%20admin%20of%20the%20server%20can%20directly%20identify%20if%20someone%E2%80%99s%20causing%20problems%20due%20to%20their%20web%20scraping.%0ALimit%20your%20scraping%20to%20non-busy%20hours%20such%20as%20overnight.%20This%20can%20help%20reduce%20the%20chances%20of%20collapsing%20the%20website%20since%20fewer%20people%20are%20visiting%20websites%20in%20the%20evening.%0A%0AYou%20can%20read%20more%20about%20these%20ethical%20issues%20here.%0A%0A%0AWrap%20up%0AThis%20tutorial%20introduced%20you%20to%20basic%20concepts%20in%20web%20scraping%20and%20applied%20them%20in%20a%20real-world%20setting.%20Web%20scraping%20is%20a%20vast%20field%20in%20computer%20science%20%28you%20can%20find%20entire%20books%20on%20the%20subject%20such%20as%20this%29.%20We%20covered%20some%20basic%20techniques%20which%20I%20think%20can%20take%20you%20a%20long%20way%20but%20there%E2%80%99s%20definitely%20more%20to%20learn.%20For%20those%20curious%20about%20where%20to%20turn%2C%20I%E2%80%99m%20looking%20forward%20to%20the%20upcoming%20book%20%E2%80%9CA%20Field%20Guide%20for%20Web%20Scraping%20and%20Accessing%20APIs%20with%20R%E2%80%9D%20by%20Bob%20Rudis%2C%20which%20should%20be%20released%20in%20the%20near%20future.%20Now%20go%20scrape%20some%20websites%20ethically%21%0A" target="_self"
				   
				   class="wpusb-layout-buttons wpusb-button wpusb-btn "
				   title="Send by email"
				   
				   
				   rel="nofollow"
				>
				   			<svg class="wpusb-svg wpusb-email-buttons ">
				<use xlink:href="#wpusb-email" />
			</svg>
				</a>
			</div>			<div class="wpusb-item wpusb-gmail ">
				<a href="https://mail.google.com/mail/u/0/?view=cm&fs=1&su=An%20introduction%20to%20web%20scraping%3A%20locating%20Spanish%20schools&body=https%3A%2F%2Fwww.r-bloggers.com%2Fan-introduction-to-web-scraping-locating-spanish-schools%2F%3Futm_source%3Dshare_buttons%26utm_medium%3Dsocial_media%26utm_campaign%3Dsocial_share
aning%20you%20attach%20to%20them%20%28such%20as%20%26lt%3Bpeople%26gt%3B%20or%20%26lt%3Boccupation%26gt%3B%29.%20However%2C%20in%20practice%20there%20are%20hundreds%20of%20tags%20which%20are%20standard%20in%20websites%20%28for%20example%2C%20here%29.%20If%20you%E2%80%99re%20just%20getting%20started%2C%20there%E2%80%99s%20no%20need%20for%20you%20to%20learn%20them%20but%20as%20you%20progress%20in%20web%20scraping%2C%20you%E2%80%99ll%20start%20to%20recognize%20them%20%28one%20brief%20example%20is%20%26lt%3Bstrong%26gt%3B%20which%20simply%20bolds%20text%20in%20a%20website%29.%0AThe%20xml2%20package%20was%20designed%20to%20read%20XML%20strings%20and%20to%20navigate%20the%20tree%20structure%20to%20extract%20information.%20For%20example%2C%20let%E2%80%99s%20read%20in%20the%20XML%20data%20from%20our%20fake%20example%20and%20look%20at%20its%20general%20structure%3A%0Axml_raw%20%26lt%3B-%20read_xml%28xml_test%29%0Axml_structure%28xml_raw%29%0A%23%23%20%26lt%3Bpeople%26gt%3B%0A%23%23%20%20%20%26lt%3Bjason%26gt%3B%0A%23%23%20%20%20%20%20%26lt%3Bperson%20%26gt%3B%0A%23%23%20%20%20%20%20%20%20%26lt%3Bfirst_name%26gt%3B%0A%23%23%20%20%20%20%20%20%20%20%20%26lt%3Bmarried%26gt%3B%0A%23%23%20%20%20%20%20%20%20%20%20%20%20%7Btext%7D%0A%23%23%20%20%20%20%20%20%20%26lt%3Blast_name%26gt%3B%0A%23%23%20%20%20%20%20%20%20%20%20%7Btext%7D%0A%23%23%20%20%20%20%20%20%20%26lt%3Boccupation%26gt%3B%0A%23%23%20%20%20%20%20%20%20%20%20%7Btext%7D%0A%23%23%20%20%20%26lt%3Bcarol%26gt%3B%0A%23%23%20%20%20%20%20%26lt%3Bperson%20%26gt%3B%0A%23%23%20%20%20%20%20%20%20%26lt%3Bfirst_name%26gt%3B%0A%23%23%20%20%20%20%20%20%20%20%20%26lt%3Bmarried%26gt%3B%0A%23%23%20%20%20%20%20%20%20%20%20%20%20%7Btext%7D%0A%23%23%20%20%20%20%20%20%20%26lt%3Blast_name%26gt%3B%0A%23%23%20%20%20%20%20%20%20%20%20%7Btext%7D%0A%23%23%20%20%20%20%20%20%20%26lt%3Boccupation%26gt%3B%0A%23%23%20%20%20%20%20%20%20%20%20%7Btext%7D%0AYou%20can%20see%20that%20the%20structure%20is%20tree-based%2C%20meaning%20that%20tags%20such%20as%20%26
lt%3Bjason%26gt%3B%20and%20%26lt%3Bcarol%26gt%3B%20are%20nested%20within%20the%20%26lt%3Bpeople%26gt%3B%20tag.%20In%20XML%20jargon%2C%20%26lt%3Bpeople%26gt%3B%20is%20the%20root%20node%2C%20whereas%20%26lt%3Bjason%26gt%3B%20and%20%26lt%3Bcarol%26gt%3B%20are%20the%20child%20nodes%20from%20%26lt%3Bpeople%26gt%3B.%0AIn%20more%20detail%2C%20the%20structure%20is%20as%20follows%3A%0A%0AThe%20root%20node%20is%20%26lt%3Bpeople%26gt%3B%0AThe%20child%20nodes%20are%20%26lt%3Bjason%26gt%3B%20and%20%26lt%3Bcarol%26gt%3B%0AThen%20each%20child%20node%20has%20nodes%20%26lt%3Bfirst_name%26gt%3B%2C%20%26lt%3Bmarried%26gt%3B%2C%20%26lt%3Blast_name%26gt%3B%20and%20%26lt%3Boccupation%26gt%3B%20nested%20within%20them.%0A%0APut%20another%20way%2C%20if%20something%20is%20nested%20within%20a%20node%2C%20then%20the%20nested%20node%20is%20a%20child%20of%20the%20upper-level%20node.%20In%20our%20example%2C%20the%20root%20node%20is%20%26lt%3Bpeople%26gt%3B%20so%20we%20can%20check%20which%20are%20its%20children%3A%0A%23%20xml_child%20returns%20only%20one%20child%20%28specified%20in%20search%29%0A%23%20Here%2C%20jason%20is%20the%20first%20child%0Axml_child%28xml_raw%2C%20search%20%3D%201%29%0A%23%23%20%7Bxml_node%7D%0A%23%23%20%26lt%3Bjason%26gt%3B%0A%23%23%20%20%26lt%3Bperson%20type%3D%26quot%3Bfictional%26quot%3B%26gt%3B%5Cn%20%20%26lt%3Bfirst_name%26gt%3B%5Cn%20%20%20%20%26lt%3Bmarried%26gt%3B%5Cn%20%20%20%20%20%20%20%20Ja%20...%0A%23%20Here%2C%20carol%20is%20the%20second%20child%0Axml_child%28xml_raw%2C%20search%20%3D%202%29%0A%23%23%20%7Bxml_node%7D%0A%23%23%20%26lt%3Bcarol%26gt%3B%0A%23%23%20%20%26lt%3Bperson%20type%3D%26quot%3Breal%26quot%3B%26gt%3B%5Cn%20%20%26lt%3Bfirst_name%26gt%3B%5Cn%20%20%20%20%26lt%3Bmarried%26gt%3B%5Cn%20%20%20%20%20%20%20%20Carol%5Cn%20...%0A%23%20Use%20xml_children%20to%20extract%20%2A%2Aall%2A%2A%20children%0Achild_xml%20%26lt%3B-%20xml_children%28xml_raw%29%0A%0Achild_xml%0A%23%23%20%7Bxml_nodeset%20%282%29%7D%0A%23%23%20%20%26lt%3Bjason%26gt%3B%5Cn%20%20%26lt%3B
person%20type%3D%26quot%3Bfictional%26quot%3B%26gt%3B%5Cn%20%20%20%20%26lt%3Bfirst_name%26gt%3B%5Cn%20%20%20%20%20%20%26lt%3Bmarri%20...%0A%23%23%20%20%26lt%3Bcarol%26gt%3B%5Cn%20%20%26lt%3Bperson%20type%3D%26quot%3Breal%26quot%3B%26gt%3B%5Cn%20%20%20%20%26lt%3Bfirst_name%26gt%3B%5Cn%20%20%20%20%20%20%26lt%3Bmarried%26gt%3B%5Cn%20...%0ATags%20can%20also%20have%20different%20attributes%20which%20are%20usually%20specified%20as%20%26lt%3Bfake_tag%20attribute%3D%27fake%27%26gt%3B%20and%20ended%20as%20usual%20with%20%26lt%3B%2Ffake_tag%26gt%3B.%20If%20you%20look%20at%20the%20XML%20structure%20of%20our%20example%2C%20you%E2%80%99ll%20notice%20that%20each%20%26lt%3Bperson%26gt%3B%20tag%20has%20an%20attribute%20called%20type.%20As%20you%E2%80%99ll%20see%20in%20our%20real-world%20example%2C%20extracting%20these%20attributes%20is%20often%20the%20aim%20of%20our%20scraping%20adventure.%20Using%20xml2%2C%20we%20can%20extract%20all%20attributes%20that%20match%20a%20specific%20name%20with%20xml_attrs.%0A%23%20Extract%20the%20attribute%20type%20from%20all%20nodes%0Axml_attrs%28child_xml%2C%20%26quot%3Btype%26quot%3B%29%0A%23%23%20%0A%23%23%20named%20character%280%29%0A%23%23%0A%23%23%20%0A%23%23%20named%20character%280%29%0AWait%2C%20why%20didn%E2%80%99t%20this%20work%3F%20Well%2C%20if%20you%20look%20at%20the%20output%20of%20child_xml%2C%20we%20have%20two%20nodes%20on%20which%20are%20for%20%26lt%3Bjason%26gt%3B%20and%20%26lt%3Bcarol%26gt%3B.%0Achild_xml%0A%23%23%20%7Bxml_nodeset%20%282%29%7D%0A%23%23%20%20%26lt%3Bjason%26gt%3B%5Cn%20%20%26lt%3Bperson%20type%3D%26quot%3Bfictional%26quot%3B%26gt%3B%5Cn%20%20%20%20%26lt%3Bfirst_name%26gt%3B%5Cn%20%20%20%20%20%20%26lt%3Bmarri%20...%0A%23%23%20%20%26lt%3Bcarol%26gt%3B%5Cn%20%20%26lt%3Bperson%20type%3D%26quot%3Breal%26quot%3B%26gt%3B%5Cn%20%20%20%20%26lt%3Bfirst_name%26gt%3B%5Cn%20%20%20%20%20%20%26lt%3Bmarried%26gt%3B%5Cn%20...%0ADo%20these%20tags%20have%20an%20attribute%3F%20No%2C%20because%20if%20they%20did%2C%20they%20would%20have%20
something%20like%20%26lt%3Bjason%20type%3D%27fake_tag%27%26gt%3B.%20What%20we%20need%20is%20to%20look%20down%20at%20the%20%26lt%3Bperson%26gt%3B%20tag%20within%20%26lt%3Bjason%26gt%3B%20and%20%26lt%3Bcarol%26gt%3B%20and%20extract%20the%20attribute%20from%20%26lt%3Bperson%26gt%3B.%0ADoes%20this%20sound%20familiar%3F%20Both%20%26lt%3Bjason%26gt%3B%20and%20%26lt%3Bcarol%26gt%3B%20have%20an%20associated%20%26lt%3Bperson%26gt%3B%20tag%20below%20them%2C%20making%20them%20their%20children.%20We%20can%20just%20go%20down%20one%20level%20by%20running%20xml_children%20on%20these%20tags%20and%20extract%20them.%0A%23%20We%20go%20down%20one%20level%20of%20children%0Aperson_nodes%20%26lt%3B-%20xml_children%28child_xml%29%0A%0A%23%20%26lt%3Bperson%26gt%3B%20is%20now%20the%20main%20node%2C%20so%20we%20can%20extract%20attributes%0Aperson_nodes%0A%23%23%20%7Bxml_nodeset%20%282%29%7D%0A%23%23%20%20%26lt%3Bperson%20type%3D%26quot%3Bfictional%26quot%3B%26gt%3B%5Cn%20%20%26lt%3Bfirst_name%26gt%3B%5Cn%20%20%20%20%26lt%3Bmarried%26gt%3B%5Cn%20%20%20%20%20%20%20%20Ja%20...%0A%23%23%20%20%26lt%3Bperson%20type%3D%26quot%3Breal%26quot%3B%26gt%3B%5Cn%20%20%26lt%3Bfirst_name%26gt%3B%5Cn%20%20%20%20%26lt%3Bmarried%26gt%3B%5Cn%20%20%20%20%20%20%20%20Carol%5Cn%20...%0A%23%20Both%20type%20attributes%0Axml_attrs%28person_nodes%2C%20%26quot%3Btype%26quot%3B%29%0A%23%23%20%0A%23%23%20%20%20%20%20%20%20%20type%0A%23%23%20%26quot%3Bfictional%26quot%3B%0A%23%23%0A%23%23%20%0A%23%23%20%20%20type%0A%23%23%20%26quot%3Breal%26quot%3B%0AUsing%20the%20xml_path%20function%20you%20can%20even%20find%20the%20%E2%80%98address%E2%80%99%20of%20these%20nodes%20to%20retrieve%20specific%20tags%20without%20having%20to%20write%20down%20xml_children%20many%20times.%20For%20example%3A%0A%23%20Specific%20address%20of%20each%20person%20tag%20for%20the%20whole%20xml%20tree%0A%23%20only%20using%20the%20%60person_nodes%60%0Axml_path%28person_nodes%29%0A%23%23%20%20%26quot%3B%2Fpeople%2Fjason%2Fperson%26quot%3B%20%26quot%3B%2Fpeopl
e%2Fcarol%2Fperson%26quot%3B%0AWe%20have%20the%20%E2%80%98address%E2%80%99%20of%20specific%20tags%20in%20the%20tree%20but%20how%20do%20we%20extract%20them%20automatically%3F%20To%20extract%20specific%20%E2%80%98addresses%E2%80%99%20of%20this%20XML%20tree%2C%20the%20main%20function%20we%E2%80%99ll%20use%20is%20xml_find_all.%20This%20function%20accepts%20the%20XML%20tree%20and%20an%20%E2%80%98address%E2%80%99%20string.%20We%20can%20use%20very%20simple%20strings%2C%20such%20as%20the%20one%20given%20by%20xml_path%3A%0A%23%20You%20can%20use%20results%20from%20xml_path%20like%20directories%0Axml_find_all%28xml_raw%2C%20%26quot%3B%2Fpeople%2Fjason%2Fperson%26quot%3B%29%0A%23%23%20%7Bxml_nodeset%20%281%29%7D%0A%23%23%20%20%26lt%3Bperson%20type%3D%26quot%3Bfictional%26quot%3B%26gt%3B%5Cn%20%20%26lt%3Bfirst_name%26gt%3B%5Cn%20%20%20%20%26lt%3Bmarried%26gt%3B%5Cn%20%20%20%20%20%20%20%20Ja%20...%0AThe%20expression%20above%20is%20asking%20for%20the%20node%20%22%2Fpeople%2Fjason%2Fperson%22.%20This%20will%20return%20the%20same%20as%20saying%20xml_raw%20%25%26gt%3B%25%20xml_child%28search%20%3D%201%29.%20For%20deeply%20nested%20trees%2C%20xml_find_all%20will%20be%20many%20times%20much%20cleaner%20than%20calling%20xml_child%20recursively%20many%20times.%0AHowever%2C%20in%20most%20cases%20the%20%E2%80%98addresses%E2%80%99%20used%20in%20xml_find_all%20come%20from%20a%20separate%20language%20called%20XPath%20%28in%20fact%2C%20the%20%E2%80%98address%E2%80%99%20we%E2%80%99ve%20been%20looking%20at%20is%20XPath%29.%20XPath%20is%20a%20complex%20language%20%28such%20as%20regular%20expressions%20for%20strings%29%20which%20is%20beyond%20this%20brief%20tutorial.%20However%2C%20with%20the%20examples%20we%E2%80%99ve%20seen%20so%20far%2C%20we%20can%20use%20some%20basic%20XPath%20which%20we%E2%80%99ll%20need%20later%20on.%0ATo%20extract%20all%20the%20tags%20in%20a%20document%2C%20we%20can%20use%20%2F%2Fname_of_tag.%0A%23%20Search%20for%20all%20%26%2339%3Bmarried%26%2339%3B%20nodes%0Axml_find_all%2
8xml_raw%2C%20%26quot%3B%2F%2Fmarried%26quot%3B%29%0A%23%23%20%7Bxml_nodeset%20%282%29%7D%0A%23%23%20%20%26lt%3Bmarried%26gt%3B%5Cn%20%20%20%20%20%20%20%20Jason%5Cn%20%20%20%20%20%20%26lt%3B%2Fmarried%26gt%3B%0A%23%23%20%20%26lt%3Bmarried%26gt%3B%5Cn%20%20%20%20%20%20%20%20Carol%5Cn%20%20%20%20%20%20%26lt%3B%2Fmarried%26gt%3B%0AWith%20the%20previous%20XPath%2C%20we%E2%80%99re%20searching%20for%20all%20married%20tags%20within%20the%20complete%20XML%20tree.%20The%20result%20returns%20all%20married%20nodes%20%28I%20use%20the%20words%20tags%20and%20nodes%20interchangeably%29%20in%20the%20complete%20tree%20structure.%20Another%20example%20would%20be%20finding%20all%20%26lt%3Boccupation%26gt%3B%20tags%3A%0Axml_find_all%28xml_raw%2C%20%26quot%3B%2F%2Foccupation%26quot%3B%29%0A%23%23%20%7Bxml_nodeset%20%282%29%7D%0A%23%23%20%20%26lt%3Boccupation%26gt%3B%5Cn%20%20%20%20%20%20Spy%5Cn%20%20%20%20%26lt%3B%2Foccupation%26gt%3B%0A%23%23%20%20%26lt%3Boccupation%26gt%3B%5Cn%20%20%20%20%20%20Scientist%5Cn%20%20%20%20%26lt%3B%2Foccupation%26gt%3B%0AIf%20you%20want%20to%20find%20any%20other%20tag%20you%20can%20replace%20%22%2F%2Foccupation%22%20with%20your%20tag%20of%20interest%20and%20xml_find_all%20will%20find%20all%20of%20them.%0AIf%20you%20wanted%20to%20find%20all%20tags%20below%20your%20current%20node%2C%20you%20only%20need%20to%20add%20a%20.%20at%20the%20beginning%3A%20%22.%2F%2Foccupation%22.%20For%20example%2C%20if%20we%20dived%20into%20the%20%26lt%3Bjason%26gt%3B%20tag%20and%20we%20wanted%20his%20%26lt%3Boccupation%26gt%3B%20tag%2C%20%22%2F%2Foccupation%22%20will%20returns%20all%20%26lt%3Boccupation%26gt%3B%20tags.%20Instead%2C%20%22.%2F%2Foccupation%22%20will%20return%20only%20the%20found%20tags%20below%20the%20current%20tag.%20For%20example%3A%0Axml_raw%20%25%26gt%3B%25%0A%20%20%23%20Dive%20only%20into%20Jason%26%2339%3Bs%20tag%0A%20%20xml_child%28search%20%3D%201%29%20%25%26gt%3B%25%0A%20%20xml_find_all%28%26quot%3B.%2F%2Foccupation%26quot%3B%29%0A%23%23%20%7Bxml_nodeset%2
0%281%29%7D%0A%23%23%20%20%26lt%3Boccupation%26gt%3B%5Cn%20%20%20%20%20%20Spy%5Cn%20%20%20%20%26lt%3B%2Foccupation%26gt%3B%0A%23%20Instead%2C%20the%20wrong%20way%20would%20have%20been%3A%0Axml_raw%20%25%26gt%3B%25%0A%20%20%23%20Dive%20only%20into%20Jason%26%2339%3Bs%20tag%0A%20%20xml_child%28search%20%3D%201%29%20%25%26gt%3B%25%0A%20%20%23%20Here%20we%20get%20both%20occupation%20tags%0A%20%20xml_find_all%28%26quot%3B%2F%2Foccupation%26quot%3B%29%0A%23%23%20%7Bxml_nodeset%20%282%29%7D%0A%23%23%20%20%26lt%3Boccupation%26gt%3B%5Cn%20%20%20%20%20%20Spy%5Cn%20%20%20%20%26lt%3B%2Foccupation%26gt%3B%0A%23%23%20%20%26lt%3Boccupation%26gt%3B%5Cn%20%20%20%20%20%20Scientist%5Cn%20%20%20%20%26lt%3B%2Foccupation%26gt%3B%0AThe%20first%20example%20only%20returns%20%26lt%3Bjason%26gt%3B%E2%80%99s%20occupation%20whereas%20the%20second%20returned%20all%20occupations%2C%20regardless%20of%20where%20you%20are%20in%20the%20tree.%0AXPath%20also%20allows%20you%20to%20identify%20tags%20that%20contain%20only%20one%20specific%20attribute%2C%20such%20as%20the%20one%E2%80%99s%20we%20saw%20earlier.%20For%20example%2C%20to%20filter%20all%20%26lt%3Bperson%26gt%3B%20tags%20with%20the%20attribute%20filter%20set%20to%20fictional%2C%20we%20could%20do%20it%20with%3A%0A%23%20Give%20me%20all%20the%20tags%20%26%2339%3Bperson%26%2339%3B%20that%20have%20an%20attribute%20type%3D%26%2339%3Bfictional%26%2339%3B%0Axml_raw%20%25%26gt%3B%25%0A%20%20xml_find_all%28%26quot%3B%2F%2Fperson%26quot%3B%29%0A%23%23%20%7Bxml_nodeset%20%281%29%7D%0A%23%23%20%20%26lt%3Bperson%20type%3D%26quot%3Bfictional%26quot%3B%26gt%3B%5Cn%20%20%26lt%3Bfirst_name%26gt%3B%5Cn%20%20%20%20%26lt%3Bmarried%26gt%3B%5Cn%20%20%20%20%20%20%20%20Ja%20...%0AIf%20you%20wanted%20to%20do%20the%20same%20but%20for%20the%20tags%20below%20your%20current%20nodes%2C%20the%20same%20trick%20we%20learned%20earlier%20would%20work%3A%20%22.%2F%2Fperson%22.%20These%20are%20just%20some%20primers%20that%20can%20help%20you%20jump%20easily%20to%20using%20XPath%2C%20b
ut%20I%20encourage%20you%20to%20look%20at%20other%20examples%20on%20the%20web%2C%20as%20complex%20websites%20often%20require%20complex%20XPath%20expressions.%0ABefore%20we%20begin%20our%20real-word%20example%2C%20you%20might%20be%20asking%20yourself%20how%20you%20can%20actually%20extract%20the%20text%2Fnumeric%20data%20from%20these%20nodes.%20Well%2C%20that%E2%80%99s%20easy%3A%20xml_text.%0Axml_raw%20%25%26gt%3B%25%0A%20%20xml_find_all%28%26quot%3B.%2F%2Foccupation%26quot%3B%29%20%25%26gt%3B%25%0A%20%20xml_text%28%29%0A%23%23%20%20%26quot%3B%5Cn%20%20%20%20%20%20Spy%5Cn%20%20%20%20%26quot%3B%20%20%20%20%20%20%20%26quot%3B%5Cn%20%20%20%20%20%20Scientist%5Cn%20%20%20%20%26quot%3B%0AOnce%20you%E2%80%99ve%20narrowed%20down%20your%20tree-based%20search%20to%20one%20single%20piece%20of%20text%20or%20numbers%2C%20xml_text%28%29%20will%20extract%20that%20for%20you%20%28there%E2%80%99s%20also%20xml_double%20and%20xml_integer%20for%20extracting%20numbers%29.%20As%20I%20said%2C%20XPath%20is%20really%20a%20huge%20language.%20If%20you%E2%80%99re%20interested%2C%20this%20XPath%20cheat%20sheets%20have%20helped%20me%20a%20lot%20to%20learn%20tricks%20for%20easy%20scraping.%0A%0A%0AReal-world%20example%0AWe%E2%80%99re%20interested%20in%20making%20a%20list%20of%20many%20schools%20in%20Spain%20and%20visualizing%20their%20location.%20This%20can%20be%20useful%20for%20many%20things%20such%20as%20matching%20population%20density%20of%20children%20across%20different%20regions%20to%20school%20locations.%20The%20website%20www.buscocolegio.com%20contains%20a%20database%20of%20schools%20similar%20to%20what%20we%E2%80%99re%20looking%20for.%20As%20described%20at%20the%20beginning%2C%20instead%20we%E2%80%99re%20going%20to%20use%20scrapex%20which%20has%20the%20function%20spanish_schools_ex%28%29%20containing%20the%20links%20to%20a%20sample%20of%20websites%20from%20different%20schools%20saved%20locally%20on%20your%20computer.%0ALet%E2%80%99s%20look%20at%20an%20example%20for%20one%20school.%0Aschool_l
inks%20%26lt%3B-%20spanish_schools_ex%28%29%0A%0A%23%20Keep%20only%20the%20HTML%20file%20of%20one%20particular%20school.%0Aschool_url%20%26lt%3B-%20school_links%0A%0Aschool_url%0A%23%23%20%20%26quot%3B%2Fusr%2Flocal%2Flib%2FR%2Fsite-library%2Fscrapex%2Fextdata%2Fspanish_schools_ex%2Fschool_3006839.html%26quot%3B%0AIf%20you%E2%80%99re%20interested%20in%20looking%20at%20the%20website%20interactively%20in%20your%20browser%2C%20you%20can%20do%20it%20with%20browseURL%28prep_browser%28school_url%29%29.%20Let%E2%80%99s%20read%20the%20HTML%20%28XML%20and%20HTML%20are%20usually%20interchangeable%2C%20so%20here%20we%20use%20read_html%29.%0A%23%20Here%20we%20use%20%60read_html%60%20because%20%60read_xml%60%20is%20throwing%20an%20error%0A%23%20when%20attempting%20to%20read.%20However%2C%20everything%20we%26%2339%3Bve%20discussed%0A%23%20should%20be%20the%20same.%0Aschool_raw%20%26lt%3B-%20read_html%28school_url%29%20%25%26gt%3B%25%20xml_child%28%29%0A%0Aschool_raw%0A%23%23%20%7Bhtml_node%7D%0A%23%23%20%26lt%3Bhead%26gt%3B%0A%23%23%20%20%20%26lt%3Btitle%26gt%3BAqu%C3%AD%20encontrar%C3%A1s%20toda%20la%20informaci%C3%B3n%20necesaria%20sobre%20CEIP%20SA%20...%0A%23%23%20%20%20%26lt%3Bmeta%20charset%3D%26quot%3Butf-8%26quot%3B%26gt%3B%5Cn%0A%23%23%20%20%20%26lt%3Bmeta%20name%3D%26quot%3Bviewport%26quot%3B%20content%3D%26quot%3Bwidth%3Ddevice-width%2C%20initial-scale%3D1%2C%20...%0A%23%23%20%20%20%26lt%3Bmeta%20http-equiv%3D%26quot%3Bx-ua-compatible%26quot%3B%20content%3D%26quot%3Bie%3Dedge%26quot%3B%26gt%3B%5Cn%0A%23%23%20%20%20%26lt%3Bmeta%20name%3D%26quot%3Bauthor%26quot%3B%20content%3D%26quot%3BBuscoColegio%26quot%3B%26gt%3B%5Cn%0A%23%23%20%20%20%26lt%3Bmeta%20name%3D%26quot%3Bdescription%26quot%3B%20content%3D%26quot%3BEncuentra%20toda%20la%20informaci%C3%B3n%20nec%20...%0A%23%23%20%20%20%26lt%3Bmeta%20name%3D%26quot%3Bkeywords%26quot%3B%20content%3D%26quot%3Bopiniones%20SANCHIS%20GUARNER%2C%20contacto%20%20...%0A%23%23%20%20%20%26lt%3Blink%20rel%3D%26quot%3Bshortcut%20icon%26qu
ot%3B%20href%3D%26quot%3B%2Ffavicon.ico%26quot%3B%26gt%3B%5Cn%0A%23%23%20%20%20%26lt%3Blink%20rel%3D%26quot%3Bstylesheet%26quot%3B%20href%3D%26quot%3B%2F%2Ffonts.googleapis.com%2Fcss%3Ffamily%3DRobo%20...%0A%23%23%20%20%26lt%3Blink%20rel%3D%26quot%3Bstylesheet%26quot%3B%20href%3D%26quot%3Bhttps%3A%2F%2Fs3.eu-west-3.amazonaws.com%2Fbus%20...%0A%23%23%20%20%26lt%3Blink%20rel%3D%26quot%3Bstylesheet%26quot%3B%20href%3D%26quot%3B%2Fassets%2Fvendor%2Ficon-awesome%2Fcss%2Ffont-a%20...%0A%23%23%20%20%26lt%3Blink%20rel%3D%26quot%3Bstylesheet%26quot%3B%20href%3D%26quot%3B%2Fassets%2Fvendor%2Ficon-line%2Fcss%2Fsimple-li%20...%0A%23%23%20%20%26lt%3Blink%20rel%3D%26quot%3Bstylesheet%26quot%3B%20href%3D%26quot%3B%2Fassets%2Fvendor%2Ficon-line-pro%2Fstyle.css%20...%0A%23%23%20%20%26lt%3Blink%20rel%3D%26quot%3Bstylesheet%26quot%3B%20href%3D%26quot%3B%2Fassets%2Fvendor%2Ficon-hs%2Fstyle.css%26quot%3B%26gt%3B%5Cn%0A%23%23%20%20%26lt%3Blink%20rel%3D%26quot%3Bstylesheet%26quot%3B%20href%3D%26quot%3Bhttps%3A%2F%2Fs3.eu-west-3.amazonaws.com%2Fbus%20...%0A%23%23%20%20%26lt%3Blink%20rel%3D%26quot%3Bstylesheet%26quot%3B%20href%3D%26quot%3Bhttps%3A%2F%2Fs3.eu-west-3.amazonaws.com%2Fbus%20...%0A%23%23%20%20%26lt%3Blink%20rel%3D%26quot%3Bstylesheet%26quot%3B%20href%3D%26quot%3Bhttps%3A%2F%2Fs3.eu-west-3.amazonaws.com%2Fbus%20...%0A%23%23%20%20%26lt%3Blink%20rel%3D%26quot%3Bstylesheet%26quot%3B%20href%3D%26quot%3Bhttps%3A%2F%2Fs3.eu-west-3.amazonaws.com%2Fbus%20...%0A%23%23%20%20%26lt%3Blink%20rel%3D%26quot%3Bstylesheet%26quot%3B%20href%3D%26quot%3Bhttps%3A%2F%2Fs3.eu-west-3.amazonaws.com%2Fbus%20...%0A%23%23%20%20%26lt%3Blink%20rel%3D%26quot%3Bstylesheet%26quot%3B%20href%3D%26quot%3Bhttps%3A%2F%2Fs3.eu-west-3.amazonaws.com%2Fbus%20...%0A%23%23%20...%0AWeb%20scraping%20strategies%20are%20very%20specific%20to%20the%20website%20you%E2%80%99re%20after.%20You%20have%20to%20get%20very%20familiar%20with%20the%20website%20you%E2%80%99re%20interested%20to%20be%20able%20to%20match%20perfectly%20the%20in
formation%20you%E2%80%99re%20looking%20for.%20In%20many%20cases%2C%20scraping%20two%20websites%20will%20require%20vastly%20different%20strategies.%20For%20this%20particular%20example%2C%20we%E2%80%99re%20only%20interested%20in%20figuring%20out%20the%20location%20of%20each%20school%20so%20we%20only%20have%20to%20extract%20its%20location.%0A%0A%0AIn%20the%20image%20above%20you%E2%80%99ll%20find%20a%20typical%20school%E2%80%99s%20website%20in%20wwww.buscocolegio.com.%20The%20website%20has%20a%20lot%20of%20information%2C%20but%20we%E2%80%99re%20only%20interested%20in%20the%20button%20that%20is%20circled%20by%20the%20orange%20rectangle.%20If%20you%20can%E2%80%99t%20find%20it%20easily%2C%20it%E2%80%99s%20below%20the%20Google%20Maps%20on%20the%20right%20which%20says%20%E2%80%9CBuscar%20colegio%20cercano%E2%80%9D.%0AWhen%20you%20click%20on%20this%20button%2C%20this%20actually%20points%20you%20towards%20the%20coordinates%20of%20the%20school%20so%20we%20just%20have%20to%20find%20a%20way%20of%20figuring%20out%20how%20to%20click%20this%20button%20or%20figure%20out%20how%20to%20get%20its%20information.%20All%20browsers%20allow%20you%20to%20do%20this%20if%20you%20press%20CTRL%20%2B%20SHIFT%20%2B%20c%20at%20the%20same%20time%20%28Firefox%20and%20Chrome%20support%20this%20hotkey%29.%20If%20a%20window%20on%20the%20right%20popped%20in%20full%20of%20code%2C%20then%20you%E2%80%99re%20on%20the%20right%20track%3A%0A%0A%0A%0AHere%20we%20can%20search%20the%20source%20code%20of%20the%20website.%20If%20you%20place%20your%20mouse%20pointer%20over%20the%20lines%20of%20code%20from%20this%20right-most%20window%2C%20you%E2%80%99ll%20see%20sections%20of%20the%20website%20being%20highlighted%20in%20blue.%20This%20indicates%20which%20parts%20of%20the%20code%20refer%20to%20which%20parts%20of%20the%20website.%20Luckily%20for%20us%2C%20we%20don%E2%80%99t%20have%20to%20search%20the%20complete%20source%20code%20to%20find%20that%20specific%20location.%20We%20can%20approximate%20our%20search%20by%20typing%
20the%20text%20we%E2%80%99re%20looking%20for%20in%20the%20search%20bar%20at%20the%20top%20of%20the%20right%20window%3A%0A%0A%0A%0AAfter%20we%20click%20enter%2C%20we%E2%80%99ll%20be%20automatically%20directed%20to%20the%20tag%20that%20has%20the%20information%20that%20we%20want.%0A%0A%0A%0AMore%20specifically%2C%20we%20can%20see%20that%20the%20latitude%20and%20longitude%20of%20schools%20are%20found%20in%20an%20attributed%20called%20href%20in%20a%20tag%20%26lt%3Ba%26gt%3B%3A%0A%0A%0A%0ACan%20you%20see%20the%20latitude%20and%20longitude%20fields%20in%20the%20text%20highlighted%20blue%3F%20It%E2%80%99s%20hidden%20in-between%20words.%20That%20is%20precisely%20the%20type%20of%20information%20we%E2%80%99re%20after.%20Extracting%20all%20%26lt%3Ba%26gt%3B%20tags%20from%20the%20website%20%28hint%3A%20XPath%20similar%20to%20%22%2F%2Fa%22%29%20will%20yield%20hundreds%20of%20matches%20because%20%26lt%3Ba%26gt%3B%20is%20a%20very%20common%20tag.%20Moreover%2C%20refining%20the%20search%20to%20%26lt%3Ba%26gt%3B%20tags%20which%20have%20an%20href%20attribute%20will%20also%20yield%20hundreds%20of%20matches%20because%20href%20is%20the%20standard%20attribute%20to%20attach%20links%20within%20websites.%20We%20need%20to%20narrow%20down%20our%20search%20within%20the%20website.%0AOne%20strategy%20is%20to%20find%20the%20%E2%80%98father%E2%80%99%20or%20%E2%80%98grandfather%E2%80%99%20node%20of%20this%20particular%20%26lt%3Ba%26gt%3B%20tag%20and%20then%20match%20a%20node%20which%20has%20that%20same%20sequence%20of%20grandfather%20-%26gt%3B%20father%20-%26gt%3B%20child%20node.%20By%20looking%20at%20the%20structure%20of%20this%20small%20XML%20snippet%20from%20the%20right-most%20window%2C%20we%20see%20that%20the%20%E2%80%98grandfather%E2%80%99%20of%20this%20%26lt%3Ba%26gt%3B%20tag%20is%20%26lt%3Bp%20class%3D%22d-flex%20align-items-baseline%20g-mt-5%27%26gt%3B%20which%20has%20a%20particularly%20long%20attribute%20named%20class.%0A%0A%0A%0ADon%E2%80%99t%20be%20intimidated%20by%20these%20tag%20names%20
and%20long%20attributes.%20I%20also%20don%E2%80%99t%20know%20what%20any%20of%20these%20attributes%20mean.%20But%20what%20I%20do%20know%20is%20that%20this%20is%20the%20%E2%80%98grandfather%E2%80%99%20of%20the%20%26lt%3Ba%26gt%3B%20tag%20I%E2%80%99m%20interested%20in.%20So%20using%20our%20XPath%20skills%2C%20let%E2%80%99s%20search%20for%20that%20%26lt%3Bp%26gt%3B%20tag%20and%20see%20if%20we%20get%20only%20one%20match.%0A%23%20Search%20for%20all%20%26lt%3Bp%26gt%3B%20tags%20with%20that%20class%20in%20the%20document%0Aschool_raw%20%25%26gt%3B%25%0A%20%20xml_find_all%28%26quot%3B%2F%2Fp%26quot%3B%29%0A%23%23%20%7Bxml_nodeset%20%281%29%7D%0A%23%23%20%20%26lt%3Bp%20class%3D%26quot%3Bd-flex%20align-items-baseline%20g-mt-5%26quot%3B%26gt%3B%5Cr%5Cn%5Ct%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20...%0AOnly%20one%20match%2C%20so%20this%20is%20good%20news.%20This%20means%20that%20we%20can%20uniquely%20identify%20this%20particular%20%26lt%3Bp%26gt%3B%20tag.%20Let%E2%80%99s%20refine%20the%20search%20to%20say%3A%20Find%20all%20%26lt%3Ba%26gt%3B%20tags%20which%20are%20children%20of%20that%20specific%20%26lt%3Bp%26gt%3B%20tag.%20This%20only%20means%20I%E2%80%99ll%20add%20a%20%22%2F%2Fa%22%20to%20the%20previous%20expression.%20Since%20there%20is%20only%20one%20%26lt%3Bp%26gt%3B%20tag%20with%20the%20class%2C%20we%E2%80%99re%20interested%20in%20checking%20whether%20there%20is%20more%20than%20one%20%26lt%3Ba%26gt%3B%20tag%20below%20this%20%26lt%3Bp%26gt%3B%20tag.%0Aschool_raw%20%25%26gt%3B%25%0A%20%20xml_find_all%28%26quot%3B%2F%2Fp%2F%2Fa%26quot%3B%29%0A%23%23%20%7Bxml_nodeset%20%281%29%7D%0A%23%23%20%20%26lt%3Ba%20href%3D%26quot%3B%2FColegio%2Fbuscar-colegios-cercanos.action%3Fcolegio.latitud%3D38%20...%0AThere%20we%20go%21%20We%20can%20see%20the%20specific%20href%20that%20contains%20the%20latitude%20and%20longitude%20data%20we%E2%80%99re%20interested%20in.%20How%20do%20we%20extract%20the%20href%20attribute%3F%20Using%20xml_attr%20as%20we%20did%20before%21%0Alocation_str%20%26l
t%3B-%0A%20%20school_raw%20%25%26gt%3B%25%0A%20%20xml_find_all%28%26quot%3B%2F%2Fp%2F%2Fa%26quot%3B%29%20%25%26gt%3B%25%0A%20%20xml_attr%28attr%20%3D%20%26quot%3Bhref%26quot%3B%29%0A%0Alocation_str%0A%23%23%20%20%26quot%3B%2FColegio%2Fbuscar-colegios-cercanos.action%3Fcolegio.latitud%3D38.8274492%26amp%3Bcolegio.longitud%3D0.0221681%26quot%3B%0AOk%2C%20now%20we%20need%20some%20regex%20skills%20to%20get%20only%20the%20latitude%20and%20longitude%20%28regex%20expressions%20are%20used%20to%20search%20for%20patterns%20inside%20a%20string%2C%20such%20as%20for%20example%20a%20date.%20See%20here%20for%20some%20examples%29%3A%0Alocation%20%26lt%3B-%0A%20%20location_str%20%25%26gt%3B%25%0A%20%20str_extract_all%28%26quot%3B%3D.%2B%24%26quot%3B%29%20%25%26gt%3B%25%0A%20%20str_replace_all%28%26quot%3B%3D%7Ccolegio%5C%5C.longitud%26quot%3B%2C%20%26quot%3B%26quot%3B%29%20%25%26gt%3B%25%0A%20%20str_split%28%26quot%3B%26amp%3B%26quot%3B%29%20%25%26gt%3B%25%0A%20%20.%0A%0Alocation%0A%23%23%20%20%26quot%3B38.8274492%26quot%3B%20%26quot%3B0.0221681%26quot%3B%0AOk%2C%20so%20we%20got%20the%20information%20we%20needed%20for%20one%20single%20school.%20Let%E2%80%99s%20turn%20that%20into%20a%20function%20so%20we%20can%20pass%20only%20the%20school%E2%80%99s%20link%20and%20get%20the%20coordinates%20back.%0ABefore%20we%20do%20that%2C%20I%20will%20set%20something%20called%20my%20User-Agent.%20In%20short%2C%20the%20User-Agent%20is%20who%20you%20are.%20It%20is%20good%20practice%20to%20identify%20the%20person%20who%20is%20scraping%20the%20website%20because%20if%20you%E2%80%99re%20causing%20any%20trouble%20on%20the%20website%2C%20the%20website%20can%20directly%20identify%20who%20is%20causing%20problems.%20You%20can%20figure%20out%20your%20user%20agent%20here%20and%20paste%20it%20in%20the%20string%20below.%20In%20addition%2C%20I%20will%20add%20a%20time%20sleep%20of%205%20seconds%20to%20the%20function%20because%20we%20want%20to%20make%20sure%20we%20don%E2%80%99t%20cause%20any%20troubles%20to%20the%20w
ebsite%20we%E2%80%99re%20scraping%20due%20to%20an%20overload%20of%20requests.%0A%23%20This%20sets%20your%20%60User-Agent%60%20globally%20so%20that%20all%20requests%20are%0A%23%20identified%20with%20this%20%60User-Agent%60%0Aset_config%28%0A%20%20user_agent%28%26quot%3BMozilla%2F5.0%20%28X11%3B%20Ubuntu%3B%20Linux%20x86_64%3B%20rv%3A70.0%29%20Gecko%2F20100101%20Firefox%2F70.0%26quot%3B%29%0A%29%0A%0A%23%20Collapse%20all%20of%20the%20code%20from%20above%20into%20one%20function%20called%0A%23%20school%20grabber%0A%0Aschool_grabber%20%26lt%3B-%20function%28school_url%29%20%7B%0A%20%20%23%20We%20add%20a%20time%20sleep%20of%205%20seconds%20to%20avoid%0A%20%20%23%20sending%20too%20many%20quick%20requests%20to%20the%20website%0A%20%20Sys.sleep%285%29%0A%0A%20%20school_raw%20%26lt%3B-%20read_html%28school_url%29%20%25%26gt%3B%25%20xml_child%28%29%0A%0A%20%20location_str%20%26lt%3B-%0A%20%20%20%20school_raw%20%25%26gt%3B%25%0A%20%20%20%20xml_find_all%28%26quot%3B%2F%2Fp%2F%2Fa%26quot%3B%29%20%25%26gt%3B%25%0A%20%20%20%20xml_attr%28attr%20%3D%20%26quot%3Bhref%26quot%3B%29%0A%0A%20%20location%20%26lt%3B-%0A%20%20%20%20location_str%20%25%26gt%3B%25%0A%20%20%20%20str_extract_all%28%26quot%3B%3D.%2B%24%26quot%3B%29%20%25%26gt%3B%25%0A%20%20%20%20str_replace_all%28%26quot%3B%3D%7Ccolegio%5C%5C.longitud%26quot%3B%2C%20%26quot%3B%26quot%3B%29%20%25%26gt%3B%25%0A%20%20%20%20str_split%28%26quot%3B%26amp%3B%26quot%3B%29%20%25%26gt%3B%25%0A%20%20%20%20.%0A%0A%20%20%23%20Turn%20into%20a%20data%20frame%0A%20%20data.frame%28%0A%20%20%20%20latitude%20%3D%20location%2C%0A%20%20%20%20longitude%20%3D%20location%2C%0A%20%20%20%20stringsAsFactors%20%3D%20FALSE%0A%20%20%29%0A%7D%0A%0A%0Aschool_grabber%28school_url%29%0A%23%23%20%20%20%20%20latitude%20longitude%0A%23%23%201%2038.8274492%200.0221681%0AOk%2C%20so%20it%E2%80%99s%20working.%20The%20only%20thing%20left%20is%20to%20extract%20this%20for%20many%20schools.%20As%20shown%20earlier%2C%20scrapex%20contains%20a%20list%20of%2027%20school%20links%
20that%20we%20can%20automatically%20scrape.%20Let%E2%80%99s%20loop%20over%20those%2C%20get%20the%20information%20of%20coordinates%20for%20each%20and%20collapse%20all%20of%20them%20into%20a%20data%20frame.%0Ares%20%26lt%3B-%20map_dfr%28school_links%2C%20school_grabber%29%0Ares%0A%23%23%20%20%20%20latitude%20%20longitude%0A%23%23%201%20%2042.72779%20-8.6567935%0A%23%23%202%20%2043.24439%20-8.8921645%0A%23%23%203%20%2038.95592%20-1.2255769%0A%23%23%204%20%2039.18657%20-1.6225903%0A%23%23%205%20%2040.38245%20-3.6410388%0A%23%23%206%20%2040.22929%20-3.1106322%0A%23%23%207%20%2040.43860%20-3.6970366%0A%23%23%208%20%2040.33514%20-3.5155669%0A%23%23%209%20%2040.50546%20-3.3738441%0A%23%23%2010%2040.63826%20-3.4537107%0A%23%23%2011%2040.38543%20-3.6639500%0A%23%23%2012%2037.76485%20-1.5030467%0A%23%23%2013%2038.82745%20%200.0221681%0A%23%23%2014%2040.99434%20-5.6224391%0A%23%23%2015%2040.99434%20-5.6224391%0A%23%23%2016%2040.56037%20-5.6703725%0A%23%23%2017%2040.99434%20-5.6224391%0A%23%23%2018%2040.99434%20-5.6224391%0A%23%23%2019%2041.13593%20%200.9901905%0A%23%23%2020%2041.26155%20%201.1670507%0A%23%23%2021%2041.22851%20%200.5461471%0A%23%23%2022%2041.14580%20%200.8199749%0A%23%23%2023%2041.18341%20%200.5680564%0A%23%23%2024%2042.07820%20%201.8203155%0A%23%23%2025%2042.25245%20%201.8621546%0A%23%23%2026%2041.73767%20%201.8383666%0A%23%23%2027%2041.62345%20%202.0013628%0ASo%20now%20that%20we%20have%20the%20locations%20of%20these%20schools%2C%20let%E2%80%99s%20plot%20them%3A%0Ares%20%26lt%3B-%20mutate_all%28res%2C%20as.numeric%29%0A%0Asp_sf%20%26lt%3B-%0A%20%20ne_countries%28scale%20%3D%20%26quot%3Blarge%26quot%3B%2C%20country%20%3D%20%26quot%3BSpain%26quot%3B%2C%20returnclass%20%3D%20%26quot%3Bsf%26quot%3B%29%20%25%26gt%3B%25%0A%20%20st_transform%28crs%20%3D%204326%29%0A%0Aggplot%28sp_sf%29%20%2B%0A%20%20geom_sf%28%29%20%2B%0A%20%20geom_point%28data%20%3D%20res%2C%20aes%28x%20%3D%20longitude%2C%20y%20%3D%20latitude%29%29%20%2B%0A%20%20coord_sf%28xlim%20%3D%20c%28-20%2C%201
0%29%2C%20ylim%20%3D%20c%2825%2C%2045%29%29%20%2B%0A%20%20theme_minimal%28%29%20%2B%0A%20%20ggtitle%28%26quot%3BSample%20of%20schools%20in%20Spain%26quot%3B%29%0A%0AThere%20we%20go%21%20We%20went%20from%20literally%20no%20information%20at%20the%20beginning%20of%20this%20tutorial%20to%20interpretable%20and%20summarized%20information%20only%20using%20web%20data.%20We%20can%20see%20some%20schools%20in%20Madrid%20%28center%29%20as%20well%20in%20other%20regions%20of%20Spain%2C%20including%20Catalonia%20and%20Galicia.%0AThis%20marks%20the%20end%20of%20our%20scraping%20adventure%20but%20before%20we%20finish%2C%20I%20want%20to%20mention%20some%20of%20the%20ethical%20guidelines%20for%20web%20scraping.%20Scraping%20is%20extremely%20useful%20for%20us%20but%20can%20give%20headaches%20to%20other%20people%20maintaining%20the%20website%20of%20interest.%20Here%E2%80%99s%20a%20list%20of%20ethical%20guidelines%20you%20should%20always%20follow%3A%0A%0ARead%20the%20terms%20and%20services%3A%20many%20websites%20prohibit%20web%20scraping%20and%20you%20could%20be%20in%20a%20breach%20of%20privacy%20by%20scraping%20the%20data.%20One%20famous%20example.%0ACheck%20the%20robots.txt%20file.%20This%20is%20a%20file%20that%20most%20websites%20have%20%28www.buscocolegio.com%20does%20not%29%20which%20tell%20you%20which%20specific%20paths%20inside%20the%20website%20are%20scrapable%20and%20which%20are%20not.%20See%20here%20for%20an%20explanation%20of%20what%20robots.txt%20look%20like%20and%20where%20to%20find%20them.%0ASome%20websites%20are%20supported%20by%20very%20big%20servers%2C%20which%20means%20you%20can%20send%204-5%20website%20requests%20per%20second.%20Others%2C%20such%20as%20www.buscocolegio.com%20are%20not.%20It%E2%80%99s%20good%20practice%20to%20always%20put%20a%20time%20sleep%20between%20your%20requests.%20In%20our%20example%2C%20I%20set%20it%20to%205%20seconds%20because%20this%20is%20a%20small%20website%20and%20we%20don%E2%80%99t%20want%20to%20crash%20their%20servers.%0AWhen%20making%20r
equests%2C%20there%20are%20computational%20ways%20of%20identifying%20yourself.%20For%20example%2C%20every%20request%20%28such%20as%20the%20one%E2%80%99s%20we%20do%29%20can%20have%20something%20called%20a%20User-Agent.%20It%20is%20good%20practice%20to%20include%20yourself%20in%20as%20the%20User-Agent%20%28as%20we%20did%20in%20our%20code%29%20because%20the%20admin%20of%20the%20server%20can%20directly%20identify%20if%20someone%E2%80%99s%20causing%20problems%20due%20to%20their%20web%20scraping.%0ALimit%20your%20scraping%20to%20non-busy%20hours%20such%20as%20overnight.%20This%20can%20help%20reduce%20the%20chances%20of%20collapsing%20the%20website%20since%20fewer%20people%20are%20visiting%20websites%20in%20the%20evening.%0A%0AYou%20can%20read%20more%20about%20these%20ethical%20issues%20here.%0A%0A%0AWrap%20up%0AThis%20tutorial%20introduced%20you%20to%20basic%20concepts%20in%20web%20scraping%20and%20applied%20them%20in%20a%20real-world%20setting.%20Web%20scraping%20is%20a%20vast%20field%20in%20computer%20science%20%28you%20can%20find%20entire%20books%20on%20the%20subject%20such%20as%20this%29.%20We%20covered%20some%20basic%20techniques%20which%20I%20think%20can%20take%20you%20a%20long%20way%20but%20there%E2%80%99s%20definitely%20more%20to%20learn.%20For%20those%20curious%20about%20where%20to%20turn%2C%20I%E2%80%99m%20looking%20forward%20to%20the%20upcoming%20book%20%E2%80%9CA%20Field%20Guide%20for%20Web%20Scraping%20and%20Accessing%20APIs%20with%20R%E2%80%9D%20by%20Bob%20Rudis%2C%20which%20should%20be%20released%20in%20the%20near%20future.%20Now%20go%20scrape%20some%20websites%20ethically%21%0A&tf=1" target="_blank"
				   data-action="open-popup"
				   class="wpusb-layout-buttons wpusb-button wpusb-btn "
				   title="Send by Gmail"
				   
				   
				   rel="nofollow"
				>
				   			<svg class="wpusb-svg wpusb-gmail-buttons ">
				<use xlink:href="#wpusb-gmail" />
			</svg>
				</a>
			</div>				</div>
				<span class="wpusb-toggle" data-action="close-buttons">
								<svg class="wpusb-svg wpusb-angle-double-left ">
				<use xlink:href="#wpusb-angle-double-left" />
			</svg>
								<svg class="wpusb-svg wpusb-angle-double-right ">
				<use xlink:href="#wpusb-angle-double-right" />
			</svg>
				</span>
			</div>
    <script type='text/javascript' src='https://www.r-bloggers.com/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shCore.js?ver=3.0.9b'></script>
<script type='text/javascript' src='https://www.r-bloggers.com/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushAS3.js?ver=3.0.9b'></script>
<script type='text/javascript' src='https://www.r-bloggers.com/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushBash.js?ver=3.0.9b'></script>
<script type='text/javascript' src='https://www.r-bloggers.com/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushColdFusion.js?ver=3.0.9b'></script>
<script type='text/javascript' src='https://www.r-bloggers.com/wp-content/plugins/syntaxhighlighter/third-party-brushes/shBrushClojure.js?ver=20090602'></script>
<script type='text/javascript' src='https://www.r-bloggers.com/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushCpp.js?ver=3.0.9b'></script>
<script type='text/javascript' src='https://www.r-bloggers.com/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushCSharp.js?ver=3.0.9b'></script>
<script type='text/javascript' src='https://www.r-bloggers.com/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushCss.js?ver=3.0.9b'></script>
<script type='text/javascript' src='https://www.r-bloggers.com/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushDelphi.js?ver=3.0.9b'></script>
<script type='text/javascript' src='https://www.r-bloggers.com/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushDiff.js?ver=3.0.9b'></script>
<script type='text/javascript' src='https://www.r-bloggers.com/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushErlang.js?ver=3.0.9b'></script>
<script type='text/javascript' src='https://www.r-bloggers.com/wp-content/plugins/syntaxhighlighter/third-party-brushes/shBrushFSharp.js?ver=20091003'></script>
<script type='text/javascript' src='https://www.r-bloggers.com/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushGroovy.js?ver=3.0.9b'></script>
<script type='text/javascript' src='https://www.r-bloggers.com/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushJava.js?ver=3.0.9b'></script>
<script type='text/javascript' src='https://www.r-bloggers.com/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushJavaFX.js?ver=3.0.9b'></script>
<script type='text/javascript' src='https://www.r-bloggers.com/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushJScript.js?ver=3.0.9b'></script>
<script type='text/javascript' src='https://www.r-bloggers.com/wp-content/plugins/syntaxhighlighter/third-party-brushes/shBrushLatex.js?ver=20090613'></script>
<script type='text/javascript' src='https://www.r-bloggers.com/wp-content/plugins/syntaxhighlighter/third-party-brushes/shBrushMatlabKey.js?ver=20091209'></script>
<script type='text/javascript' src='https://www.r-bloggers.com/wp-content/plugins/syntaxhighlighter/third-party-brushes/shBrushObjC.js?ver=20091207'></script>
<script type='text/javascript' src='https://www.r-bloggers.com/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushPerl.js?ver=3.0.9b'></script>
<script type='text/javascript' src='https://www.r-bloggers.com/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushPhp.js?ver=3.0.9b'></script>
<script type='text/javascript' src='https://www.r-bloggers.com/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushPlain.js?ver=3.0.9b'></script>
<script type='text/javascript' src='https://www.r-bloggers.com/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushPowerShell.js?ver=3.0.9b'></script>
<script type='text/javascript' src='https://www.r-bloggers.com/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushPython.js?ver=3.0.9b'></script>
<script type='text/javascript' src='https://www.r-bloggers.com/wp-content/plugins/syntaxhighlighter/third-party-brushes/shBrushR.js?ver=20100919'></script>
<script type='text/javascript' src='https://www.r-bloggers.com/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushRuby.js?ver=3.0.9b'></script>
<script type='text/javascript' src='https://www.r-bloggers.com/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushScala.js?ver=3.0.9b'></script>
<script type='text/javascript' src='https://www.r-bloggers.com/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushSql.js?ver=3.0.9b'></script>
<script type='text/javascript' src='https://www.r-bloggers.com/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushVb.js?ver=3.0.9b'></script>
<script type='text/javascript' src='https://www.r-bloggers.com/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/scripts/shBrushXml.js?ver=3.0.9b'></script>
<script type='text/javascript'>
	(function(){
		var corecss = document.createElement('link');
		var themecss = document.createElement('link');
		var corecssurl = "https://www.r-bloggers.com/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/styles/shCore.css?ver=3.0.9b";
		if ( corecss.setAttribute ) {
				corecss.setAttribute( "rel", "stylesheet" );
				corecss.setAttribute( "type", "text/css" );
				corecss.setAttribute( "href", corecssurl );
		} else {
				corecss.rel = "stylesheet";
				corecss.href = corecssurl;
		}
		document.head.appendChild( corecss );
		var themecssurl = "https://www.r-bloggers.com/wp-content/plugins/syntaxhighlighter/syntaxhighlighter3/styles/shThemeDefault.css?ver=3.0.9b";
		if ( themecss.setAttribute ) {
				themecss.setAttribute( "rel", "stylesheet" );
				themecss.setAttribute( "type", "text/css" );
				themecss.setAttribute( "href", themecssurl );
		} else {
				themecss.rel = "stylesheet";
				themecss.href = themecssurl;
		}
		document.head.appendChild( themecss );
	})();
	SyntaxHighlighter.config.strings.expandSource = '+ expand source';
	SyntaxHighlighter.config.strings.help = '?';
	SyntaxHighlighter.config.strings.alert = 'SyntaxHighlighter\n\n';
	SyntaxHighlighter.config.strings.noBrush = 'Can\'t find brush for: ';
	SyntaxHighlighter.config.strings.brushNotHtmlScript = 'Brush wasn\'t configured for html-script option: ';
	SyntaxHighlighter.defaults['pad-line-numbers'] = false;
	SyntaxHighlighter.defaults['toolbar'] = false;
	SyntaxHighlighter.all();

	// Infinite scroll support
	if ( typeof( jQuery ) !== 'undefined' ) {
		jQuery( function( $ ) {
			$( document.body ).on( 'post-load', function() {
				SyntaxHighlighter.highlight();
			} );
		} );
	}
</script>
<link rel='stylesheet' id='wpusb-style-css'  href='https://www.r-bloggers.com/wp-content/plugins/wpupper-share-buttons/build/style.css?ver=1569067013' type='text/css' media='all' />
<script type='text/javascript' src='https://c0.wp.com/c/5.2.1/wp-includes/js/comment-reply.min.js'></script>
<script type='text/javascript' src='https://www.r-bloggers.com/wp-content/themes/magazine-basic/js/effects.js?ver=5.2.1'></script>
<script type='text/javascript' src='https://c0.wp.com/p/jetpack/7.3.2/_inc/build/photon/photon.min.js'></script>
<script type='text/javascript' src='https://s0.wp.com/wp-content/js/devicepx-jetpack.js?ver=202012'></script>
<script type='text/javascript' src='https://c0.wp.com/p/jetpack/7.3.2/_inc/build/lazy-images/js/lazy-images.min.js'></script>
<script type='text/javascript' src='https://c0.wp.com/c/5.2.1/wp-includes/js/wp-embed.min.js'></script>
<script type="text/javascript">
/* <![CDATA[ */
jQuery(function(){
	jQuery("ul.sf-menu").supersubs({ 
		minWidth:    12,
		maxWidth:    27,
		extraWidth:  1
	}).superfish({ 
		delay:       100,
		speed:       250 
	});	});
/* ]]> */
</script>



</body>
</html>