An introduction to web scraping: locating Spanish schools

[This article was first published on R on Coding Club UC3M, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

by Jorge Cimentada

Introduction

Whenever a new paper is released using some type of scraped data, most of my peers in the social science community get baffled at how researchers can do this. In fact, many social scientists can’t even think of research questions that can be addressed with this type of data simply because they don’t know it’s even possible. As the old saying goes, when you have a hammer, every problem looks like a nail.

With the increasing amount of data being collected on a daily basis, it is eminent that scientists start getting familiar with new technologies that can help answer old questions. Moreover, we need to be adventurous about cutting edge data sources as they can also allow us to ask new questions which weren’t even thought of in the past.

In this tutorial I’ll be guiding you through the basics of web scraping using R and the xml2 package. I’ll begin with a simple example using fake data and elaborate further by trying to scrape the location of a sample of schools in Spain.

Basic steps

For web scraping in R, you can fulfill almost all of your needs with the xml2 package. As you wander through the web, you’ll see many examples using the rvest package. xml2 and rvest are very similar so don’t feel you’re lacking behind for learning one and not the other. In addition to these two packages, we’ll need some other libraries for plotting locations on a map (ggplot2, sf, rnaturalearth), identifying who we are when we scrape (httr) and wrangling data (tidyverse).

Additionally, we’ll also need the package scrapex. In the real-world example that we’ll be doing below, we’ll be scraping data from the website www.buscocolegio.com to locate a sample of schools in Spain. However, throughout the tutorial we won’t be scraping the data directly from their real-website. What would happen to this tutorial if 6 months from now www.buscocolegio.com updates the design of their website? Everything from our real-world example would be lost.

Web scraping tutorials are usually very unstable precisely because of this. To circumvent that problem, I’ve saved a random sample of websites from some schools in www.buscocolegio.com into an R package called scrapex. Although the links we’ll be working on will be hosted locally on your machine, the HTML of the website should be very similar to the one hosted on the website (with the exception of some images/icons which were deleted on purpose to make the package lightweight).

You can install the package with:

# install.packages("devtools")
devtools::install_github("cimentadaj/scrapex")

Now, let’s move on the fake data example and load all of our packages with:

library(xml2)
library(httr)
library(tidyverse)
library(sf)
library(rnaturalearth)
library(ggplot2)
library(scrapex)

Let’s begin with a simple example. Below we define an XML string and look at its structure:

xml_test <- "

  
    
      
        Jason
      
    
    
        Bourne
    
    
      Spy
    
  


  
    
      
        Carol
      
    
    
        Kalp
    
    
      Scientist
    
  


"

cat(xml_test)
## 
## 
##   
##     
##       
##         Jason
##       
##     
##     
##         Bourne
##     
##     
##       Spy
##     
##   
## 
## 
##   
##     
##       
##         Carol
##       
##     
##     
##         Kalp
##     
##     
##       Scientist
##     
##   
## 
## 

In XML and HTML the basic building blocks are something called tags. For example, the first tag in the structure shown above is . This tag is matched by at the end of the string:

If you pay close attention, you’ll see that each tag in the XML structure has a beginning (signaled by <>) and an end (signaled by ). For example, the next tag after is and right before the tag is the end of the jason tag .

Similarly, you’ll find that the tag is also matched by a finishing tag.

In theory, tags can have whatever meaning you attach to them (such as or ). However, in practice there are hundreds of tags which are standard in websites (for example, here). If you’re just getting started, there’s no need for you to learn them but as you progress in web scraping, you’ll start to recognize them (one brief example is which simply bolds text in a website).

The xml2 package was designed to read XML strings and to navigate the tree structure to extract information. For example, let’s read in the XML data from our fake example and look at its general structure:

xml_raw <- read_xml(xml_test)
xml_structure(xml_raw)
## 
##   
##     
##       
##         
##           {text}
##       
##         {text}
##       
##         {text}
##   
##     
##       
##         
##           {text}
##       
##         {text}
##       
##         {text}

You can see that the structure is tree-based, meaning that tags such as and are nested within the tag. In XML jargon, is the root node, whereas and are the child nodes from .

In more detail, the structure is as follows:

  • The root node is
  • The child nodes are and
  • Then each child node has nodes , , and nested within them.

Put another way, if something is nested within a node, then the nested node is a child of the upper-level node. In our example, the root node is so we can check which are its children:

# xml_child returns only one child (specified in search)
# Here, jason is the first child
xml_child(xml_raw, search = 1)
## {xml_node}
## 
## [1] \n  \n    \n        Ja ...
# Here, carol is the second child
xml_child(xml_raw, search = 2)
## {xml_node}
## 
## [1] \n  \n    \n        Carol\n ...
# Use xml_children to extract **all** children
child_xml <- xml_children(xml_raw)

child_xml
## {xml_nodeset (2)}
## [1] \n  \n    \n      \n  \n    \n      \n ...

Tags can also have different attributes which are usually specified as and ended as usual with . If you look at the XML structure of our example, you’ll notice that each tag has an attribute called type. As you’ll see in our real-world example, extracting these attributes is often the aim of our scraping adventure. Using xml2, we can extract all attributes that match a specific name with xml_attrs.

# Extract the attribute type from all nodes
xml_attrs(child_xml, "type")
## [[1]]
## named character(0)
##
## [[2]]
## named character(0)

Wait, why didn’t this work? Well, if you look at the output of child_xml, we have two nodes on which are for and .

child_xml
## {xml_nodeset (2)}
## [1] \n  \n    \n      \n  \n    \n      \n ...

Do these tags have an attribute? No, because if they did, they would have something like . What we need is to look down at the tag within and and extract the attribute from .

Does this sound familiar? Both and have an associated tag below them, making them their children. We can just go down one level by running xml_children on these tags and extract them.

# We go down one level of children
person_nodes <- xml_children(child_xml)

#  is now the main node, so we can extract attributes
person_nodes
## {xml_nodeset (2)}
## [1] \n  \n    \n        Ja ...
## [2] \n  \n    \n        Carol\n ...
# Both type attributes
xml_attrs(person_nodes, "type")
## [[1]]
##        type
## "fictional"
##
## [[2]]
##   type
## "real"

Using the xml_path function you can even find the ‘address’ of these nodes to retrieve specific tags without having to write down xml_children many times. For example:

# Specific address of each person tag for the whole xml tree
# only using the `person_nodes`
xml_path(person_nodes)
## [1] "/people/jason/person" "/people/carol/person"

We have the ‘address’ of specific tags in the tree but how do we extract them automatically? To extract specific ‘addresses’ of this XML tree, the main function we’ll use is xml_find_all. This function accepts the XML tree and an ‘address’ string. We can use very simple strings, such as the one given by xml_path:

# You can use results from xml_path like directories
xml_find_all(xml_raw, "/people/jason/person")
## {xml_nodeset (1)}
## [1] \n  \n    \n        Ja ...

The expression above is asking for the node "/people/jason/person". This will return the same as saying xml_raw %>% xml_child(search = 1). For deeply nested trees, xml_find_all will be many times much cleaner than calling xml_child recursively many times.

However, in most cases the ‘addresses’ used in xml_find_all come from a separate language called XPath (in fact, the ‘address’ we’ve been looking at is XPath). XPath is a complex language (such as regular expressions for strings) which is beyond this brief tutorial. However, with the examples we’ve seen so far, we can use some basic XPath which we’ll need later on.

To extract all the tags in a document, we can use //name_of_tag.

# Search for all 'married' nodes
xml_find_all(xml_raw, "//married")
## {xml_nodeset (2)}
## [1] \n        Jason\n      
## [2] \n        Carol\n      

With the previous XPath, we’re searching for all married tags within the complete XML tree. The result returns all married nodes (I use the words tags and nodes interchangeably) in the complete tree structure. Another example would be finding all tags:

xml_find_all(xml_raw, "//occupation")
## {xml_nodeset (2)}
## [1] \n      Spy\n    
## [2] \n      Scientist\n    

If you want to find any other tag you can replace "//occupation" with your tag of interest and xml_find_all will find all of them.

If you wanted to find all tags below your current node, you only need to add a . at the beginning: ".//occupation". For example, if we dived into the tag and we wanted his tag, "//occupation" will returns all tags. Instead, ".//occupation" will return only the found tags below the current tag. For example:

xml_raw %>%
  # Dive only into Jason's tag
  xml_child(search = 1) %>%
  xml_find_all(".//occupation")
## {xml_nodeset (1)}
## [1] \n      Spy\n    
# Instead, the wrong way would have been:
xml_raw %>%
  # Dive only into Jason's tag
  xml_child(search = 1) %>%
  # Here we get both occupation tags
  xml_find_all("//occupation")
## {xml_nodeset (2)}
## [1] \n      Spy\n    
## [2] \n      Scientist\n    

The first example only returns ’s occupation whereas the second returned all occupations, regardless of where you are in the tree.

XPath also allows you to identify tags that contain only one specific attribute, such as the one’s we saw earlier. For example, to filter all tags with the attribute filter set to fictional, we could do it with:

# Give me all the tags 'person' that have an attribute type='fictional'
xml_raw %>%
  xml_find_all("//person[@type='fictional']")
## {xml_nodeset (1)}
## [1] \n  \n    \n        Ja ...

If you wanted to do the same but for the tags below your current nodes, the same trick we learned earlier would work: ".//person[@type='fictional']". These are just some primers that can help you jump easily to using XPath, but I encourage you to look at other examples on the web, as complex websites often require complex XPath expressions.

Before we begin our real-word example, you might be asking yourself how you can actually extract the text/numeric data from these nodes. Well, that’s easy: xml_text.

xml_raw %>%
  xml_find_all(".//occupation") %>%
  xml_text()
## [1] "\n      Spy\n    "       "\n      Scientist\n    "

Once you’ve narrowed down your tree-based search to one single piece of text or numbers, xml_text() will extract that for you (there’s also xml_double and xml_integer for extracting numbers). As I said, XPath is really a huge language. If you’re interested, this XPath cheat sheets have helped me a lot to learn tricks for easy scraping.

Real-world example

We’re interested in making a list of many schools in Spain and visualizing their location. This can be useful for many things such as matching population density of children across different regions to school locations. The website www.buscocolegio.com contains a database of schools similar to what we’re looking for. As described at the beginning, instead we’re going to use scrapex which has the function spanish_schools_ex() containing the links to a sample of websites from different schools saved locally on your computer.

Let’s look at an example for one school.

school_links <- spanish_schools_ex()

# Keep only the HTML file of one particular school.
school_url <- school_links[13]

school_url
## [1] "/usr/local/lib/R/site-library/scrapex/extdata/spanish_schools_ex/school_3006839.html"

If you’re interested in looking at the website interactively in your browser, you can do it with browseURL(prep_browser(school_url)). Let’s read the HTML (XML and HTML are usually interchangeable, so here we use read_html).

# Here we use `read_html` because `read_xml` is throwing an error
# when attempting to read. However, everything we've discussed
# should be the same.
school_raw <- read_html(school_url) %>% xml_child()

school_raw
## {html_node}
## 
##  [1] Aquí encontrarás toda la información necesaria sobre CEIP SA ...
##  [2] <meta charset="utf-8">\n
##  [3] <meta name="viewport" content="width=device-width, initial-scale=1, ...
##  [4] <meta http-equiv="x-ua-compatible" content="ie=edge">\n
##  [5] <meta name="author" content="BuscoColegio">\n
##  [6] <meta name="description" content="Encuentra toda la información nec ...
##  [7] <meta name="keywords" content="opiniones SANCHIS GUARNER, contacto  ...
##  [8] <link rel="shortcut icon" href="/favicon.ico">\n
##  [9] <link rel="stylesheet" href="//fonts.googleapis.com/css?family=Robo ...
## [10] <link rel="stylesheet" href="https://s3.eu-west-3.amazonaws.com/bus ...
## [11] <link rel="stylesheet" href="/assets/vendor/icon-awesome/css/font-a ...
## [12] <link rel="stylesheet" href="/assets/vendor/icon-line/css/simple-li ...
## [13] <link rel="stylesheet" href="/assets/vendor/icon-line-pro/style.css ...
## [14] <link rel="stylesheet" href="/assets/vendor/icon-hs/style.css">\n
## [15] <link rel="stylesheet" href="https://s3.eu-west-3.amazonaws.com/bus ...
## [16] <link rel="stylesheet" href="https://s3.eu-west-3.amazonaws.com/bus ...
## [17] <link rel="stylesheet" href="https://s3.eu-west-3.amazonaws.com/bus ...
## [18] <link rel="stylesheet" href="https://s3.eu-west-3.amazonaws.com/bus ...
## [19] <link rel="stylesheet" href="https://s3.eu-west-3.amazonaws.com/bus ...
## [20] <link rel="stylesheet" href="https://s3.eu-west-3.amazonaws.com/bus ...
## ...</pre>
<p>Web scraping strategies are very specific to the website you’re after. You have to get very familiar with the website you’re interested to be able to match perfectly the information you’re looking for. In many cases, scraping two websites will require vastly different strategies. For this particular example, we’re only interested in figuring out the <strong>location</strong> of each school so we only have to extract its location.</p>
<p><img src="https://i2.wp.com/codingclubuc3m.rbind.io/post/2020-02-11_files/buscocolegios_xml/main_page.png?w=450&ssl=1" style="display: block; margin: auto;" data-recalc-dims="1" data-lazy-src="https://i2.wp.com/codingclubuc3m.rbind.io/post/2020-02-11_files/buscocolegios_xml/main_page.png?w=450&is-pending-load=1#038;ssl=1" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7" class=" jetpack-lazy-image"><noscript><img src="https://i2.wp.com/codingclubuc3m.rbind.io/post/2020-02-11_files/buscocolegios_xml/main_page.png?w=450&ssl=1" style="display: block; margin: auto;" data-recalc-dims="1" /></noscript></p>
<p></p>
<p>In the image above you’ll find a typical school’s website in <code>wwww.buscocolegio.com</code>. The website has a lot of information, but we’re only interested in the button that is circled by the orange rectangle. If you can’t find it easily, it’s below the Google Maps on the right which says “Buscar colegio cercano”.</p>
<p>When you click on this button, this actually points you towards the coordinates of the school so we just have to find a way of figuring out how to click this button or figure out how to get its information. All browsers allow you to do this if you press CTRL + SHIFT + c at the same time (Firefox and Chrome support this hotkey). If a window on the right popped in full of code, then you’re on the right track:</p>
<p></p>
<p><img src="https://i1.wp.com/codingclubuc3m.rbind.io/post/2020-02-11_files/buscocolegios_xml/developer_tools.png?w=450&ssl=1" style="display: block; margin: auto;" data-recalc-dims="1" data-lazy-src="https://i1.wp.com/codingclubuc3m.rbind.io/post/2020-02-11_files/buscocolegios_xml/developer_tools.png?w=450&is-pending-load=1#038;ssl=1" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7" class=" jetpack-lazy-image"><noscript><img src="https://i1.wp.com/codingclubuc3m.rbind.io/post/2020-02-11_files/buscocolegios_xml/developer_tools.png?w=450&ssl=1" style="display: block; margin: auto;" data-recalc-dims="1" /></noscript></p>
<p></p>
<p>Here we can search the source code of the website. If you place your mouse pointer over the lines of code from this right-most window, you’ll see sections of the website being highlighted in blue. This indicates which parts of the code refer to which parts of the website. Luckily for us, we don’t have to search the complete source code to find that specific location. We can approximate our search by typing the text we’re looking for in the search bar at the top of the right window:</p>
<p></p>
<p><img src="https://i1.wp.com/codingclubuc3m.rbind.io/post/2020-02-11_files/buscocolegios_xml/search_developer_tools.png?w=450&ssl=1" style="display: block; margin: auto;" data-recalc-dims="1" data-lazy-src="https://i1.wp.com/codingclubuc3m.rbind.io/post/2020-02-11_files/buscocolegios_xml/search_developer_tools.png?w=450&is-pending-load=1#038;ssl=1" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7" class=" jetpack-lazy-image"><noscript><img src="https://i1.wp.com/codingclubuc3m.rbind.io/post/2020-02-11_files/buscocolegios_xml/search_developer_tools.png?w=450&ssl=1" style="display: block; margin: auto;" data-recalc-dims="1" /></noscript></p>
<p></p>
<p>After we click enter, we’ll be automatically directed to the tag that has the information that we want.</p>
<p></p>
<p><img src="https://i0.wp.com/codingclubuc3m.rbind.io/post/2020-02-11_files/buscocolegios_xml/location_tag.png?w=450&ssl=1" style="display: block; margin: auto;" data-recalc-dims="1" data-lazy-src="https://i0.wp.com/codingclubuc3m.rbind.io/post/2020-02-11_files/buscocolegios_xml/location_tag.png?w=450&is-pending-load=1#038;ssl=1" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7" class=" jetpack-lazy-image"><noscript><img src="https://i0.wp.com/codingclubuc3m.rbind.io/post/2020-02-11_files/buscocolegios_xml/location_tag.png?w=450&ssl=1" style="display: block; margin: auto;" data-recalc-dims="1" /></noscript></p>
<p></p>
<p>More specifically, we can see that the latitude and longitude of schools are found in an attributed called <code>href</code> in a tag <code><a></code>:</p>
<p></p>
<p><img src="https://i2.wp.com/codingclubuc3m.rbind.io/post/2020-02-11_files/buscocolegios_xml/location_tag_zoomed.png?w=450&ssl=1" style="display: block; margin: auto;" data-recalc-dims="1" data-lazy-src="https://i2.wp.com/codingclubuc3m.rbind.io/post/2020-02-11_files/buscocolegios_xml/location_tag_zoomed.png?w=450&is-pending-load=1#038;ssl=1" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7" class=" jetpack-lazy-image"><noscript><img src="https://i2.wp.com/codingclubuc3m.rbind.io/post/2020-02-11_files/buscocolegios_xml/location_tag_zoomed.png?w=450&ssl=1" style="display: block; margin: auto;" data-recalc-dims="1" /></noscript></p>
<p></p>
<p>Can you see the latitude and longitude fields in the text highlighted blue? It’s hidden in-between words. That is precisely the type of information we’re after. Extracting all <code><a></code> tags from the website (hint: XPath similar to <code>"//a"</code>) will yield hundreds of matches because <code><a></code> is a very common tag. Moreover, refining the search to <code><a></code> tags which have an <code>href</code> attribute will also yield hundreds of matches because <code>href</code> is the standard attribute to attach links within websites. We need to narrow down our search within the website.</p>
<p>One strategy is to find the ‘father’ or ‘grandfather’ node of this particular <code><a></code> tag and then match a node which has that same sequence of grandfather -> father -> child node. By looking at the structure of this small XML snippet from the right-most window, we see that the ‘grandfather’ of this <code><a></code> tag is <code></p>
<p class="d-flex align-items-baseline g-mt-5'></code> which has a particularly long attribute named <code>class</code>.</p>
<p></p>
<p><img src="https://i2.wp.com/codingclubuc3m.rbind.io/post/2020-02-11_files/buscocolegios_xml/location_tag_zoomed.png?w=450&ssl=1" style="display: block; margin: auto;" data-recalc-dims="1" data-lazy-src="https://i2.wp.com/codingclubuc3m.rbind.io/post/2020-02-11_files/buscocolegios_xml/location_tag_zoomed.png?w=450&is-pending-load=1#038;ssl=1" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7" class=" jetpack-lazy-image"><noscript><img src="https://i2.wp.com/codingclubuc3m.rbind.io/post/2020-02-11_files/buscocolegios_xml/location_tag_zoomed.png?w=450&ssl=1" style="display: block; margin: auto;" data-recalc-dims="1" /></noscript></p>
<p></p>
<p>Don’t be intimidated by these tag names and long attributes. I also don’t know what any of these attributes mean. But what I do know is that this is the ‘grandfather’ of the <code><a></code> tag I’m interested in. So using our XPath skills, let’s search for that <code></p>
<p></code> tag and see if we get only one match.</p>
<pre># Search for all <p> tags with that class in the document
school_raw %>%
  xml_find_all("//p[@class='d-flex align-items-baseline g-mt-5']")
## {xml_nodeset (1)}
## [1] <p class="d-flex align-items-baseline g-mt-5">\r\n\t                 ...</pre>
<p>Only one match, so this is good news. This means that we can uniquely identify this particular <code></p>
<p></code> tag. Let’s refine the search to say: Find all <code><a></code> tags which are children of that specific <code></p>
<p></code> tag. This only means I’ll add a <code>"//a"</code> to the previous expression. Since there is only one <code></p>
<p></code> tag with the class, we’re interested in checking whether there is more than one <code><a></code> tag below this <code></p>
<p></code> tag.</p>
<pre>school_raw %>%
  xml_find_all("//p[@class='d-flex align-items-baseline g-mt-5']//a")
## {xml_nodeset (1)}
## [1] <a href="/Colegio/buscar-colegios-cercanos.action?colegio.latitud=38 rel=" target="_blank">
<p>There we go! We can see the specific <code>href</code> that contains the latitude and longitude data we’re interested in. How do we extract the <code>href</code> attribute? Using <code>xml_attr</code> as we did before!</p>
<pre>location_str <-
  school_raw %>%
  xml_find_all("//p[@class='d-flex align-items-baseline g-mt-5']//a") %>%
  xml_attr(attr = "href")

location_str
## [1] "/Colegio/buscar-colegios-cercanos.action?colegio.latitud=38.8274492&colegio.longitud=0.0221681"</pre>
<p>Ok, now we need some regex skills to get only the latitude and longitude (regex expressions are used to search for patterns inside a string, such as for example a date. See <a href="https://www.jumpingrivers.com/blog/regular-expressions-every-r-programmer-should-know/" rel="nofollow" target="_blank">here</a> for some examples):</p>
<pre>location <-
  location_str %>%
  str_extract_all("=.+$") %>%
  str_replace_all("=|colegio\\.longitud", "") %>%
  str_split("&") %>%
  .[[1]]

location
## [1] "38.8274492" "0.0221681"</pre>
<p>Ok, so we got the information we needed for one single school. Let’s turn that into a function so we can pass only the school’s link and get the coordinates back.</p>
<p>Before we do that, I will set something called my <code>User-Agent</code>. In short, the <code>User-Agent</code> is <strong>who</strong> you are. It is good practice to identify the person who is scraping the website because if you’re causing any trouble on the website, the website can directly identify who is causing problems. You can figure out your user agent <a href="https://www.google.com/search?client=ubuntu&channel=fs&q=what%27s+my+user+agent&ie=utf-8&oe=utf-8" rel="nofollow" target="_blank">here</a> and paste it in the string below. In addition, I will add a time sleep of 5 seconds to the function because we want to make sure we don’t cause any troubles to the website we’re scraping due to an overload of requests.</p>
<pre># This sets your `User-Agent` globally so that all requests are
# identified with this `User-Agent`
set_config(
  user_agent("Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:70.0) Gecko/20100101 Firefox/70.0")
)

# Collapse all of the code from above into one function called
# school grabber

school_grabber <- function(school_url) {
  # We add a time sleep of 5 seconds to avoid
  # sending too many quick requests to the website
  Sys.sleep(5)

  school_raw <- read_html(school_url) %>% xml_child()

  location_str <-
    school_raw %>%
    xml_find_all("//p[@class='d-flex align-items-baseline g-mt-5']//a") %>%
    xml_attr(attr = "href")

  location <-
    location_str %>%
    str_extract_all("=.+$") %>%
    str_replace_all("=|colegio\\.longitud", "") %>%
    str_split("&") %>%
    .[[1]]

  # Turn into a data frame
  data.frame(
    latitude = location[1],
    longitude = location[2],
    stringsAsFactors = FALSE
  )
}


school_grabber(school_url)
##     latitude longitude
## 1 38.8274492 0.0221681</pre>
<p>Ok, so it’s working. The only thing left is to extract this for many schools. As shown earlier, <code>scrapex</code> contains a list of 27 school links that we can automatically scrape. Let’s loop over those, get the information of coordinates for each and collapse all of them into a data frame.</p>
<pre>res <- map_dfr(school_links, school_grabber)
res
##    latitude  longitude
## 1  42.72779 -8.6567935
## 2  43.24439 -8.8921645
## 3  38.95592 -1.2255769
## 4  39.18657 -1.6225903
## 5  40.38245 -3.6410388
## 6  40.22929 -3.1106322
## 7  40.43860 -3.6970366
## 8  40.33514 -3.5155669
## 9  40.50546 -3.3738441
## 10 40.63826 -3.4537107
## 11 40.38543 -3.6639500
## 12 37.76485 -1.5030467
## 13 38.82745  0.0221681
## 14 40.99434 -5.6224391
## 15 40.99434 -5.6224391
## 16 40.56037 -5.6703725
## 17 40.99434 -5.6224391
## 18 40.99434 -5.6224391
## 19 41.13593  0.9901905
## 20 41.26155  1.1670507
## 21 41.22851  0.5461471
## 22 41.14580  0.8199749
## 23 41.18341  0.5680564
## 24 42.07820  1.8203155
## 25 42.25245  1.8621546
## 26 41.73767  1.8383666
## 27 41.62345  2.0013628</pre>
<p>So now that we have the locations of these schools, let’s plot them:</p>
<pre>res <- mutate_all(res, as.numeric)

sp_sf <-
  ne_countries(scale = "large", country = "Spain", returnclass = "sf") %>%
  st_transform(crs = 4326)

ggplot(sp_sf) +
  geom_sf() +
  geom_point(data = res, aes(x = longitude, y = latitude)) +
  coord_sf(xlim = c(-20, 10), ylim = c(25, 45)) +
  theme_minimal() +
  ggtitle("Sample of schools in Spain")</pre>
<p><img src="https://i2.wp.com/codingclubuc3m.rbind.io/post/2020-02-11_files/figure-html/unnamed-chunk-34-1.png?w=80%25&ssl=1" style="display: block; margin: auto;" data-recalc-dims="1" data-lazy-src="https://i2.wp.com/codingclubuc3m.rbind.io/post/2020-02-11_files/figure-html/unnamed-chunk-34-1.png?w=80%25&is-pending-load=1#038;ssl=1" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7" class=" jetpack-lazy-image"><noscript><img src="https://i2.wp.com/codingclubuc3m.rbind.io/post/2020-02-11_files/figure-html/unnamed-chunk-34-1.png?w=80%25&ssl=1" style="display: block; margin: auto;" data-recalc-dims="1" /></noscript></p>
<p>There we go! We went from literally no information at the beginning of this tutorial to interpretable and summarized information only using web data. We can see some schools in Madrid (center) as well in other regions of Spain, including Catalonia and Galicia.</p>
<p>This marks the end of our scraping adventure but before we finish, I want to mention some of the ethical guidelines for web scraping. Scraping is extremely useful for us but can give headaches to other people maintaining the website of interest. Here’s a list of ethical guidelines you should always follow:</p>
<ul>
<li>
<p>Read the terms and services: many websites prohibit web scraping and you could be in a breach of privacy by scraping the data. <a href="https://fortune.com/2016/05/18/okcupid-data-research/" rel="nofollow" target="_blank">One</a> famous example.</p>
</li>
<li>
<p>Check the <code>robots.txt</code> file. This is a file that most websites have (<code>www.buscocolegio.com</code> does <strong>not</strong>) which tell you which specific paths inside the website are scrapable and which are not. See <a href="https://www.robotstxt.org/robotstxt.html" rel="nofollow" target="_blank">here</a> for an explanation of what robots.txt look like and where to find them.</p>
</li>
<li>
<p>Some websites are supported by very big servers, which means you can send 4-5 website requests per second. Others, such as <code>www.buscocolegio.com</code> are not. It’s good practice to always put a time sleep between your requests. In our example, I set it to 5 seconds because this is a small website and we don’t want to crash their servers.</p>
</li>
<li>
<p>When making requests, there are computational ways of identifying yourself. For example, every request (such as the one’s we do) can have something called a <code>User-Agent</code>. It is good practice to include yourself in as the <code>User-Agent</code> (as we did in our code) because the admin of the server can directly identify if someone’s causing problems due to their web scraping.</p>
</li>
<li>
<p>Limit your scraping to non-busy hours such as overnight. This can help reduce the chances of collapsing the website since fewer people are visiting websites in the evening.</p>
</li>
</ul>
<p>You can read more about these ethical issues <a href="http://robertorocha.info/on-the-ethics-of-web-scraping/" rel="nofollow" target="_blank">here</a>.</p>
</div>
<div id="wrap-up" class="section level2">
<h2>Wrap up</h2>
<p>This tutorial introduced you to basic concepts in web scraping and applied them in a real-world setting. Web scraping is a vast field in computer science (you can find entire books on the subject such as <a href="https://www.apress.com/gp/book/9781484235812" rel="nofollow" target="_blank">this</a>). We covered some basic techniques which I think can take you a long way but there’s definitely more to learn. For those curious about where to turn, I’m looking forward to the upcoming book <a href="https://rud.is/b/books/" rel="nofollow" target="_blank">“A Field Guide for Web Scraping and Accessing APIs with R”</a> by Bob Rudis, which should be released in the near future. Now go scrape some websites ethically!</p>
</div>

<script type="text/javascript">
    var vglnk = {key: '949efb41171ac6ec1bf7f206d57e90b8'};
    (function(d, t) {
        var s = d.createElement(t);
            s.type = 'text/javascript';
            s.async = true;
			// s.defer = true;
//          s.src = '//cdn.viglink.com/api/vglnk.js'; 
			s.src = 'https://www.r-bloggers.com/wp-content/uploads/2020/08/vglnk.js';
        var r = d.getElementsByTagName(t)[0];
            r.parentNode.insertBefore(s, r);
    }(document, 'script'));
</script>
		
<div id='jp-relatedposts' class='jp-relatedposts' >
	<h3 class="jp-relatedposts-headline"><em>Related</em></h3>
</div><aside class="mashsb-container mashsb-main mashsb-stretched"><div class="mashsb-box"><div class="mashsb-buttons"><a class="mashicon-facebook mash-small mash-center mashsb-noshadow" href="https://www.facebook.com/sharer.php?u=https%3A%2F%2Fwww.r-bloggers.com%2F2020%2F02%2Fan-introduction-to-web-scraping-locating-spanish-schools%2F" target="_blank" rel="nofollow"><span class="icon"></span><span class="text">Share</span></a><a class="mashicon-twitter mash-small mash-center mashsb-noshadow" href="https://twitter.com/intent/tweet?text=An%20introduction%20to%20web%20scraping%3A%20locating%20Spanish%20schools&url=https://www.r-bloggers.com/2020/02/an-introduction-to-web-scraping-locating-spanish-schools/&via=Rbloggers" target="_blank" rel="nofollow"><span class="icon"></span><span class="text">Tweet</span></a><div class="onoffswitch2 mash-small mashsb-noshadow" style="display:none;"></div></div>
            </div>
                <div style="clear:both;"></div></aside>
            <!-- Share buttons by mashshare.net - Version: 3.7.7-->
<p class="syndicated-attribution"><div style="border: 1px solid; background: none repeat scroll 0 0 #EDEDED; margin: 1px; font-size: 13px;">
<div style="text-align: center;">To <strong>leave a comment</strong> for the author, please follow the link and comment on their blog: <strong><a href="https://codingclubuc3m.rbind.io/post/2020-02-11/"> R on Coding Club UC3M</a></strong>.</div>
<hr />
<a href="https://www.r-bloggers.com/" rel="nofollow">R-bloggers.com</a> offers <strong><a href="https://feedburner.google.com/fb/a/mailverify?uri=RBloggers" rel="nofollow">daily e-mail updates</a></strong> about <a title="The R Project for Statistical Computing" href="https://www.r-project.org/" rel="nofollow">R</a> news and tutorials about <a title="R tutorials" href="https://www.r-bloggers.com/how-to-learn-r-2/" rel="nofollow">learning R</a> and many other topics. <a title="Data science jobs" href="https://www.r-users.com/" rel="nofollow">Click here if you're looking to post or find an R/data-science job</a>.

<hr>Want to share your content on R-bloggers?<a href="https://www.r-bloggers.com/add-your-blog/" rel="nofollow"> click here</a> if you have a blog, or <a href="http://r-posts.com/" rel="nofollow"> here</a> if you don't.
</div></p>			</div>
	</article><nav class="post-navigation clearfix" role="navigation">
<div class="post-nav left">
<a href="https://www.r-bloggers.com/2020/02/upping-your-pipe-game/" rel="prev">← Previous post</a></div>
<div class="post-nav right">
<a href="https://www.r-bloggers.com/2020/02/the-one-question-you-should-ask-your-partner-before-marrying/" rel="next">Next post →</a></div>
</nav>
	</div>
	<aside class="mh-sidebar sb-right">
	<div id="custom_html-2" class="widget_text sb-widget widget_custom_html"><div class="textwidget custom-html-widget">
<div class="top-search" style="padding-left: 0px;">
	<form id="searchform" action="http://www.google.com/cse" target="_blank">
		<div>
			<input type="hidden" name="cx" value="005359090438081006639:paz69t-s8ua" />
			<input type="hidden" name="ie" value="UTF-8" />
			<input type="text" value="" name="q" id="q" autocomplete="on" style="font-size:16px;" placeholder="Search R-bloggers.." />
			<input type="submit" id="searchsubmit2" name="sa" value="Go" style="font-size:16px;" />
		</div>
	</form>

</div>
<!-- thanks: https://stackoverflow.com/questions/14981575/google-cse-with-a-custom-form 
https://stackoverflow.com/questions/10363674/change-size-of-text-in-text-input-tag
--></div></div><div id="text-6" class="sb-widget widget_text">			<div class="textwidget"><div style="min-height:26px;border:1px solid #ccc;padding:3px;text-align:left; background: none repeat scroll 0 0 #FDEADA;">

<form  style="width:202px; float:left;" action="https://feedburner.google.com/fb/a/mailverify" method="post" target="popupwindow" onsubmit="window.open('https://feedburner.google.com/fb/a/mailverify?uri=RBloggers', 'popupwindow', 'scrollbars=yes,width=550,height=520');return true">

<input type="text" style="width:110px"  onclick="if (this.value == 'Your e-mail here') this.value = '';" value='Your e-mail here' name="email"/>
<input type="hidden" value="RBloggers" name="uri"/><input type="hidden" name="loc" value="en_US"/><input type="submit" value="Subscribe" />

<!-- https://feeds.feedburner.com/~fc/RBloggers?bg=99CCFF&fg=444444&anim=0 -->

</form>

<div>
<a href="https://feeds.feedburner.com/RBloggers"><img src="https://i2.wp.com/www.r-bloggers.com/wp-content/uploads/2020/07/RBloggers_feedburner_count_2020_07_01-e1593671704447.gif?w=578&ssl=1" style="height:17px;min-width:80px;class:skip-lazy;" alt data-recalc-dims="1" data-lazy-src="https://i2.wp.com/www.r-bloggers.com/wp-content/uploads/2020/07/RBloggers_feedburner_count_2020_07_01-e1593671704447.gif?w=578&is-pending-load=1#038;ssl=1" srcset="data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7" class=" jetpack-lazy-image"><noscript><img src="https://i2.wp.com/www.r-bloggers.com/wp-content/uploads/2020/07/RBloggers_feedburner_count_2020_07_01-e1593671704447.gif?w=578&ssl=1" style="height:17px;min-width:80px;class:skip-lazy;" alt="" data-recalc-dims="1" /></noscript></a>
</div>

</div>

<br/>

<div>
<script>
function init() {
var vidDefer = document.getElementsByTagName('iframe');
for (var i=0; i<vidDefer.length; i++) {
if(vidDefer[i].getAttribute('data-src')) {
vidDefer[i].setAttribute('src',vidDefer[i].getAttribute('data-src'));
} } }
window.onload = init;
</script>

<iframe allowtransparency="true" frameborder="0" scrolling="no"
src="" data-src="//platform.twitter.com/widgets/follow_button.html?screen_name=rbloggers&data-show-count"
  style="width:100%; height:30px;"></iframe>


<div id="fb-root"></div>
<script async defer crossorigin="anonymous" src="https://connect.facebook.net/en_GB/sdk.js#xfbml=1&version=v7.0&appId=124112670941750&autoLogAppEvents=1" nonce="RysU23SE"></script>

<div style="min-height: 154px;" class="fb-page" data-href="https://www.facebook.com/rbloggers/" data-tabs="" data-width="300" data-height="154" data-small-header="true" data-adapt-container-width="true" data-hide-cover="false" data-show-facepile="true"><blockquote cite="https://www.facebook.com/rbloggers/" class="fb-xfbml-parse-ignore"><a href="https://www.facebook.com/rbloggers/">R bloggers Facebook page</a></blockquote></div>



<!--
<iframe src="" data-src="//www.facebook.com/plugins/likebox.php?href=http%3A%2F%2Fwww.facebook.com%2Fpages%2FR-bloggers%2F191414254890&width=300&height=155&show_faces=true&colorscheme=light&stream=false&border_color&header=false&appId=400430016676958" scrolling="no" frameborder="0" style="border:none; overflow:hidden; width:100%; height:140px;" allowTransparency="true"></iframe>
-->


<!--
<br/>
<strong>If you are an R blogger yourself</strong> you are invited to <a href="https://www.r-bloggers.com/add-your-blog/">add your own R content feed to this site</a> (<strong>Non-English</strong> R bloggers should add themselves- <a href="https://www.r-bloggers.com/lang/add-your-blog">here</a>) -->

</div></div>
		</div><div id="wppp-3" class="sb-widget widget_wppp"><h4 class="widget-title">Most viewed posts (weekly)</h4>
<ul class='wppp_list'>
	<li><a href='https://www.r-bloggers.com/2016/11/5-ways-to-subset-a-data-frame-in-r/' title='5 Ways to Subset a Data Frame in R'>5 Ways to Subset a Data Frame in R</a></li>
	<li><a href='https://www.r-bloggers.com/2015/12/how-to-write-the-first-for-loop-in-r/' title='How to write the first for loop in R'>How to write the first for loop in R</a></li>
	<li><a href='https://www.r-bloggers.com/2013/08/date-formats-in-r/' title='Date Formats in R'>Date Formats in R</a></li>
	<li><a href='https://www.r-bloggers.com/2010/02/r-sorting-a-data-frame-by-the-contents-of-a-column/' title='R – Sorting a data frame by the contents of a column'>R – Sorting a data frame by the contents of a column</a></li>
	<li><a href='https://www.r-bloggers.com/2020/01/how-to-remove-outliers-in-r/' title='How to Remove Outliers in R'>How to Remove Outliers in R</a></li>
	<li><a href='https://www.r-bloggers.com/2020/09/the-fastest-way-to-read-and-writes-file-in-r/' title='The fastest way to Read and Writes file in R'>The fastest way to Read and Writes file in R</a></li>
	<li><a href='https://www.r-bloggers.com/2020/09/generalized-linear-models-and-plots-with-edger-advanced-differential-expression-analysis/' title='Generalized Linear Models and Plots with edgeR – Advanced Differential Expression Analysis'>Generalized Linear Models and Plots with edgeR – Advanced Differential Expression Analysis</a></li>
</ul>
</div><div id="text-18" class="sb-widget widget_text"><h4 class="widget-title">Sponsors</h4>			<div class="textwidget"><div style="min-height: 2055px;">

<script data-cfasync="false" type="text/javascript">
// https://support.cloudflare.com/hc/en-us/articles/200169436-How-can-I-have-Rocket-Loader-ignore-my-script-s-in-Automatic-Mode-
// this must be placed higher. Otherwise it doesn't work.
// data-cfasync="false" is for making sure cloudflares' rocketcache doesn't interfeare with this
// in this case it only works because it was used at the original script in the text widget


function createCookie(name,value,days) {
    var expires = "";
    if (days) {
        var date = new Date();
        date.setTime(date.getTime() + (days*24*60*60*1000));
        expires = "; expires=" + date.toUTCString();
    }
    document.cookie = name + "=" + value + expires + "; path=/";
}

function readCookie(name) {
    var nameEQ = name + "=";
    var ca = document.cookie.split(';');
    for(var i=0;i < ca.length;i++) {
        var c = ca[i];
        while (c.charAt(0)==' ') c = c.substring(1,c.length);
        if (c.indexOf(nameEQ) == 0) return c.substring(nameEQ.length,c.length);
    }
    return null;
}

function eraseCookie(name) {
    createCookie(name,"",-1);
}

// no longer use async because of google
// async 
async function readTextFile(file)
{
	// Helps people browse between pages without the need to keep downloading the same 
	// ads txt page everytime. This way, it allows them to use their browser's cache.
	var random_number = readCookie("ad_random_number_cookie");
	if(random_number == null) {
		var random_number = Math.floor(Math.random()*100*(new Date().getTime()/10000000000));
		createCookie("ad_random_number_cookie",random_number,1)
	}
	
    file += '?t='+random_number;
    var rawFile = new XMLHttpRequest();
    rawFile.onreadystatechange = function ()
    {
        if(rawFile.readyState === 4)
        {
            if(rawFile.status === 200 || rawFile.status == 0)
            {
                // var allText = rawFile.responseText;
                // document.write(allText);
                document.write(rawFile.responseText);
            }
        }
    }
    rawFile.open("GET", file, false);
    rawFile.send(null);
}

// readTextFile('https://raw.githubusercontent.com/Raynos/file-store/master/temp.txt');

readTextFile("https://www.r-bloggers.com/wp-content/uploads/text-widget_anti-cache.txt");


</script>

</div></div>
		</div>
		<div id="recent-posts-3" class="sb-widget widget_recent_entries">
		<h4 class="widget-title">Recent Posts</h4>
		<ul>
											<li>
					<a href="https://www.r-bloggers.com/2020/09/building-a-simple-pipeline-in-r/">Building a Simple Pipeline in R</a>
									</li>
											<li>
					<a href="https://www.r-bloggers.com/2020/09/building-apps-with-shinipsum-and-golem/">Building apps with {shinipsum} and {golem}</a>
									</li>
											<li>
					<a href="https://www.r-bloggers.com/2020/09/slicing-the-onion-3-ways-toy-problems-in-r-python-and-julia/">Slicing the onion 3 ways- Toy problems in R, python, and Julia</a>
									</li>
											<li>
					<a href="https://www.r-bloggers.com/2020/09/path-chain-concise-structure-for-chainable-paths/">path.chain: Concise Structure for Chainable Paths</a>
									</li>
											<li>
					<a href="https://www.r-bloggers.com/2020/09/generalized-linear-models-and-plots-with-edger-advanced-differential-expression-analysis/">Generalized Linear Models and Plots with edgeR – Advanced Differential Expression Analysis</a>
									</li>
											<li>
					<a href="https://www.r-bloggers.com/2020/09/national-weekly-death-rates/">National Weekly Death Rates</a>
									</li>
											<li>
					<a href="https://www.r-bloggers.com/2020/09/kmeans-clustering-of-penguins-2/">Kmeans Clustering of Penguins</a>
									</li>
											<li>
					<a href="https://www.r-bloggers.com/2020/09/running-an-r-script-on-a-schedule-overview/">Running an R Script on a Schedule: Overview</a>
									</li>
											<li>
					<a href="https://www.r-bloggers.com/2020/09/free-workshop-on-deep-learning-with-keras-and-tensorflow/">Free workshop on Deep Learning with Keras and TensorFlow</a>
									</li>
											<li>
					<a href="https://www.r-bloggers.com/2020/09/le-monde-puzzle-1155/">Le Monde puzzle [#1155]</a>
									</li>
											<li>
					<a href="https://www.r-bloggers.com/2020/09/the-fastest-way-to-read-and-writes-file-in-r/">The fastest way to Read and Writes file in R</a>
									</li>
											<li>
					<a href="https://www.r-bloggers.com/2020/09/why-r-2020-conference-starts-2020-09-26/">Why R? 2020 Conference Starts 2020-09-26</a>
									</li>
											<li>
					<a href="https://www.r-bloggers.com/2020/09/free-text-in-surveys-important-issues-in-the-2017-new-zealand-election-study-by-ellis2013nz/">Free text in surveys – important issues in the 2017 New Zealand Election Study by @ellis2013nz</a>
									</li>
											<li>
					<a href="https://www.r-bloggers.com/2020/09/lessons-learned-from-500-data-science-interviews/">Lessons learned from 500+ Data Science interviews</a>
									</li>
											<li>
					<a href="https://www.r-bloggers.com/2020/09/writing-conundrums/">Writing conundrums</a>
									</li>
					</ul>

		</div><div id="rss-7" class="sb-widget widget_rss"><h4 class="widget-title"><a class="rsswidget" href="https://feeds.feedburner.com/Rjobs"><img class="rss-widget-icon" style="border:0" width="14" height="14" src="https://www.r-bloggers.com/wp-includes/images/rss.png" alt="RSS" /></a> <a class="rsswidget" href="https://www.r-users.com/">Jobs for R-users</a></h4><ul><li><a class='rsswidget' href='http://feedproxy.google.com/~r/RJobs/~3/XUqQfUzxziw/'>Junior Data Scientist / Quantitative economist</a></li><li><a class='rsswidget' href='http://feedproxy.google.com/~r/RJobs/~3/C2KYkXtMCHw/'>Senior Quantitative Analyst</a></li><li><a class='rsswidget' href='http://feedproxy.google.com/~r/RJobs/~3/z5mEr8qKkUI/'>R programmer</a></li><li><a class='rsswidget' href='http://feedproxy.google.com/~r/RJobs/~3/wi3Gfi8GNqA/'>Data Scientist – CGIAR Excellence in Agronomy (Ref No: DDG-R4D/DS/1/CG/EA/06/20)</a></li><li><a class='rsswidget' href='http://feedproxy.google.com/~r/RJobs/~3/aSK4JGQQOfg/'>Data Analytics Auditor, Future of Audit Lead @ London or Newcastle</a></li></ul></div><div id="rss-9" class="sb-widget widget_rss"><h4 class="widget-title"><a class="rsswidget" href="https://feeds.feedburner.com/Python-bloggers"><img class="rss-widget-icon" style="border:0" width="14" height="14" src="https://www.r-bloggers.com/wp-includes/images/rss.png" alt="RSS" /></a> <a class="rsswidget" href="https://python-bloggers.com/">python-bloggers.com (python/data-science news)</a></h4><ul><li><a class='rsswidget' href='http://feedproxy.google.com/~r/Python-bloggers/~3/aeOWm291YBM/'>Writing conundrums</a></li><li><a class='rsswidget' href='http://feedproxy.google.com/~r/Python-bloggers/~3/Eu8ZoDGo_ro/'>Introducing Unguided Projects: The World’s First Interactive Code-Along Exercises</a></li><li><a class='rsswidget' href='http://feedproxy.google.com/~r/Python-bloggers/~3/nEBrVWBG7Ao/'>Document Letter Frequency in Python</a></li><li><a class='rsswidget' href='http://feedproxy.google.com/~r/Python-bloggers/~3/oQGmvR_1iGg/'>Equipping Petroleum Engineers in Calgary With Critical Data Skills</a></li><li><a class='rsswidget' href='http://feedproxy.google.com/~r/Python-bloggers/~3/UcltGO0ZtYA/'>Connecting Python to SQL Server using trusted and login credentials</a></li><li><a class='rsswidget' href='http://feedproxy.google.com/~r/Python-bloggers/~3/zRTGnHNTQR8/'>Intro to GSC API with Python (Video)</a></li><li><a class='rsswidget' href='http://feedproxy.google.com/~r/Python-bloggers/~3/Ih7X5g4SNhM/'>Technical documentation</a></li></ul></div><div id="text-16" class="sb-widget widget_text">			<div class="textwidget"><strong><a href="https://www.r-bloggers.com/blogs-list/">Full list of contributing R-bloggers</a></strong></div>
		</div><div id="archives-3" class="sb-widget widget_archive"><h4 class="widget-title">Archives</h4>		<label class="screen-reader-text" for="archives-dropdown-3">Archives</label>
		<select id="archives-dropdown-3" name="archive-dropdown">
			
			<option value="">Select Month</option>
				<option value='https://www.r-bloggers.com/2020/09/'> September 2020  (162)</option>
	<option value='https://www.r-bloggers.com/2020/08/'> August 2020  (180)</option>
	<option value='https://www.r-bloggers.com/2020/07/'> July 2020  (229)</option>
	<option value='https://www.r-bloggers.com/2020/06/'> June 2020  (204)</option>
	<option value='https://www.r-bloggers.com/2020/05/'> May 2020  (285)</option>
	<option value='https://www.r-bloggers.com/2020/04/'> April 2020  (292)</option>
	<option value='https://www.r-bloggers.com/2020/03/'> March 2020  (246)</option>
	<option value='https://www.r-bloggers.com/2020/02/'> February 2020  (219)</option>
	<option value='https://www.r-bloggers.com/2020/01/'> January 2020  (213)</option>
	<option value='https://www.r-bloggers.com/2019/12/'> December 2019  (215)</option>
	<option value='https://www.r-bloggers.com/2019/11/'> November 2019  (193)</option>
	<option value='https://www.r-bloggers.com/2019/10/'> October 2019  (216)</option>
	<option value='https://www.r-bloggers.com/2019/09/'> September 2019  (211)</option>
	<option value='https://www.r-bloggers.com/2019/08/'> August 2019  (256)</option>
	<option value='https://www.r-bloggers.com/2019/07/'> July 2019  (228)</option>
	<option value='https://www.r-bloggers.com/2019/06/'> June 2019  (218)</option>
	<option value='https://www.r-bloggers.com/2019/05/'> May 2019  (250)</option>
	<option value='https://www.r-bloggers.com/2019/04/'> April 2019  (275)</option>
	<option value='https://www.r-bloggers.com/2019/03/'> March 2019  (295)</option>
	<option value='https://www.r-bloggers.com/2019/02/'> February 2019  (255)</option>
	<option value='https://www.r-bloggers.com/2019/01/'> January 2019  (281)</option>
	<option value='https://www.r-bloggers.com/2018/12/'> December 2018  (252)</option>
	<option value='https://www.r-bloggers.com/2018/11/'> November 2018  (285)</option>
	<option value='https://www.r-bloggers.com/2018/10/'> October 2018  (308)</option>
	<option value='https://www.r-bloggers.com/2018/09/'> September 2018  (291)</option>
	<option value='https://www.r-bloggers.com/2018/08/'> August 2018  (270)</option>
	<option value='https://www.r-bloggers.com/2018/07/'> July 2018  (333)</option>
	<option value='https://www.r-bloggers.com/2018/06/'> June 2018  (298)</option>
	<option value='https://www.r-bloggers.com/2018/05/'> May 2018  (321)</option>
	<option value='https://www.r-bloggers.com/2018/04/'> April 2018  (301)</option>
	<option value='https://www.r-bloggers.com/2018/03/'> March 2018  (291)</option>
	<option value='https://www.r-bloggers.com/2018/02/'> February 2018  (241)</option>
	<option value='https://www.r-bloggers.com/2018/01/'> January 2018  (330)</option>
	<option value='https://www.r-bloggers.com/2017/12/'> December 2017  (261)</option>
	<option value='https://www.r-bloggers.com/2017/11/'> November 2017  (270)</option>
	<option value='https://www.r-bloggers.com/2017/10/'> October 2017  (290)</option>
	<option value='https://www.r-bloggers.com/2017/09/'> September 2017  (294)</option>
	<option value='https://www.r-bloggers.com/2017/08/'> August 2017  (340)</option>
	<option value='https://www.r-bloggers.com/2017/07/'> July 2017  (283)</option>
	<option value='https://www.r-bloggers.com/2017/06/'> June 2017  (317)</option>
	<option value='https://www.r-bloggers.com/2017/05/'> May 2017  (349)</option>
	<option value='https://www.r-bloggers.com/2017/04/'> April 2017  (324)</option>
	<option value='https://www.r-bloggers.com/2017/03/'> March 2017  (365)</option>
	<option value='https://www.r-bloggers.com/2017/02/'> February 2017  (317)</option>
	<option value='https://www.r-bloggers.com/2017/01/'> January 2017  (367)</option>
	<option value='https://www.r-bloggers.com/2016/12/'> December 2016  (347)</option>
	<option value='https://www.r-bloggers.com/2016/11/'> November 2016  (294)</option>
	<option value='https://www.r-bloggers.com/2016/10/'> October 2016  (306)</option>
	<option value='https://www.r-bloggers.com/2016/09/'> September 2016  (254)</option>
	<option value='https://www.r-bloggers.com/2016/08/'> August 2016  (287)</option>
	<option value='https://www.r-bloggers.com/2016/07/'> July 2016  (326)</option>
	<option value='https://www.r-bloggers.com/2016/06/'> June 2016  (263)</option>
	<option value='https://www.r-bloggers.com/2016/05/'> May 2016  (292)</option>
	<option value='https://www.r-bloggers.com/2016/04/'> April 2016  (260)</option>
	<option value='https://www.r-bloggers.com/2016/03/'> March 2016  (302)</option>
	<option value='https://www.r-bloggers.com/2016/02/'> February 2016  (268)</option>
	<option value='https://www.r-bloggers.com/2016/01/'> January 2016  (337)</option>
	<option value='https://www.r-bloggers.com/2015/12/'> December 2015  (304)</option>
	<option value='https://www.r-bloggers.com/2015/11/'> November 2015  (234)</option>
	<option value='https://www.r-bloggers.com/2015/10/'> October 2015  (259)</option>
	<option value='https://www.r-bloggers.com/2015/09/'> September 2015  (238)</option>
	<option value='https://www.r-bloggers.com/2015/08/'> August 2015  (264)</option>
	<option value='https://www.r-bloggers.com/2015/07/'> July 2015  (243)</option>
	<option value='https://www.r-bloggers.com/2015/06/'> June 2015  (213)</option>
	<option value='https://www.r-bloggers.com/2015/05/'> May 2015  (235)</option>
	<option value='https://www.r-bloggers.com/2015/04/'> April 2015  (211)</option>
	<option value='https://www.r-bloggers.com/2015/03/'> March 2015  (259)</option>
	<option value='https://www.r-bloggers.com/2015/02/'> February 2015  (212)</option>
	<option value='https://www.r-bloggers.com/2015/01/'> January 2015  (245)</option>
	<option value='https://www.r-bloggers.com/2014/12/'> December 2014  (236)</option>
	<option value='https://www.r-bloggers.com/2014/11/'> November 2014  (221)</option>
	<option value='https://www.r-bloggers.com/2014/10/'> October 2014  (218)</option>
	<option value='https://www.r-bloggers.com/2014/09/'> September 2014  (259)</option>
	<option value='https://www.r-bloggers.com/2014/08/'> August 2014  (217)</option>
	<option value='https://www.r-bloggers.com/2014/07/'> July 2014  (235)</option>
	<option value='https://www.r-bloggers.com/2014/06/'> June 2014  (241)</option>
	<option value='https://www.r-bloggers.com/2014/05/'> May 2014  (243)</option>
	<option value='https://www.r-bloggers.com/2014/04/'> April 2014  (260)</option>
	<option value='https://www.r-bloggers.com/2014/03/'> March 2014  (289)</option>
	<option value='https://www.r-bloggers.com/2014/02/'> February 2014  (269)</option>
	<option value='https://www.r-bloggers.com/2014/01/'> January 2014  (263)</option>
	<option value='https://www.r-bloggers.com/2013/12/'> December 2013  (264)</option>
	<option value='https://www.r-bloggers.com/2013/11/'> November 2013  (241)</option>
	<option value='https://www.r-bloggers.com/2013/10/'> October 2013  (234)</option>
	<option value='https://www.r-bloggers.com/2013/09/'> September 2013  (215)</option>
	<option value='https://www.r-bloggers.com/2013/08/'> August 2013  (223)</option>
	<option value='https://www.r-bloggers.com/2013/07/'> July 2013  (254)</option>
	<option value='https://www.r-bloggers.com/2013/06/'> June 2013  (272)</option>
	<option value='https://www.r-bloggers.com/2013/05/'> May 2013  (260)</option>
	<option value='https://www.r-bloggers.com/2013/04/'> April 2013  (279)</option>
	<option value='https://www.r-bloggers.com/2013/03/'> March 2013  (277)</option>
	<option value='https://www.r-bloggers.com/2013/02/'> February 2013  (294)</option>
	<option value='https://www.r-bloggers.com/2013/01/'> January 2013  (343)</option>
	<option value='https://www.r-bloggers.com/2012/12/'> December 2012  (308)</option>
	<option value='https://www.r-bloggers.com/2012/11/'> November 2012  (277)</option>
	<option value='https://www.r-bloggers.com/2012/10/'> October 2012  (308)</option>
	<option value='https://www.r-bloggers.com/2012/09/'> September 2012  (270)</option>
	<option value='https://www.r-bloggers.com/2012/08/'> August 2012  (263)</option>
	<option value='https://www.r-bloggers.com/2012/07/'> July 2012  (247)</option>
	<option value='https://www.r-bloggers.com/2012/06/'> June 2012  (298)</option>
	<option value='https://www.r-bloggers.com/2012/05/'> May 2012  (287)</option>
	<option value='https://www.r-bloggers.com/2012/04/'> April 2012  (295)</option>
	<option value='https://www.r-bloggers.com/2012/03/'> March 2012  (304)</option>
	<option value='https://www.r-bloggers.com/2012/02/'> February 2012  (264)</option>
	<option value='https://www.r-bloggers.com/2012/01/'> January 2012  (280)</option>
	<option value='https://www.r-bloggers.com/2011/12/'> December 2011  (251)</option>
	<option value='https://www.r-bloggers.com/2011/11/'> November 2011  (261)</option>
	<option value='https://www.r-bloggers.com/2011/10/'> October 2011  (281)</option>
	<option value='https://www.r-bloggers.com/2011/09/'> September 2011  (187)</option>
	<option value='https://www.r-bloggers.com/2011/08/'> August 2011  (258)</option>
	<option value='https://www.r-bloggers.com/2011/07/'> July 2011  (219)</option>
	<option value='https://www.r-bloggers.com/2011/06/'> June 2011  (225)</option>
	<option value='https://www.r-bloggers.com/2011/05/'> May 2011  (239)</option>
	<option value='https://www.r-bloggers.com/2011/04/'> April 2011  (268)</option>
	<option value='https://www.r-bloggers.com/2011/03/'> March 2011  (249)</option>
	<option value='https://www.r-bloggers.com/2011/02/'> February 2011  (205)</option>
	<option value='https://www.r-bloggers.com/2011/01/'> January 2011  (209)</option>
	<option value='https://www.r-bloggers.com/2010/12/'> December 2010  (188)</option>
	<option value='https://www.r-bloggers.com/2010/11/'> November 2010  (172)</option>
	<option value='https://www.r-bloggers.com/2010/10/'> October 2010  (219)</option>
	<option value='https://www.r-bloggers.com/2010/09/'> September 2010  (185)</option>
	<option value='https://www.r-bloggers.com/2010/08/'> August 2010  (203)</option>
	<option value='https://www.r-bloggers.com/2010/07/'> July 2010  (175)</option>
	<option value='https://www.r-bloggers.com/2010/06/'> June 2010  (167)</option>
	<option value='https://www.r-bloggers.com/2010/05/'> May 2010  (164)</option>
	<option value='https://www.r-bloggers.com/2010/04/'> April 2010  (152)</option>
	<option value='https://www.r-bloggers.com/2010/03/'> March 2010  (165)</option>
	<option value='https://www.r-bloggers.com/2010/02/'> February 2010  (135)</option>
	<option value='https://www.r-bloggers.com/2010/01/'> January 2010  (121)</option>
	<option value='https://www.r-bloggers.com/2009/12/'> December 2009  (126)</option>
	<option value='https://www.r-bloggers.com/2009/11/'> November 2009  (66)</option>
	<option value='https://www.r-bloggers.com/2009/10/'> October 2009  (87)</option>
	<option value='https://www.r-bloggers.com/2009/09/'> September 2009  (65)</option>
	<option value='https://www.r-bloggers.com/2009/08/'> August 2009  (56)</option>
	<option value='https://www.r-bloggers.com/2009/07/'> July 2009  (64)</option>
	<option value='https://www.r-bloggers.com/2009/06/'> June 2009  (54)</option>
	<option value='https://www.r-bloggers.com/2009/05/'> May 2009  (35)</option>
	<option value='https://www.r-bloggers.com/2009/04/'> April 2009  (38)</option>
	<option value='https://www.r-bloggers.com/2009/03/'> March 2009  (40)</option>
	<option value='https://www.r-bloggers.com/2009/02/'> February 2009  (33)</option>
	<option value='https://www.r-bloggers.com/2009/01/'> January 2009  (42)</option>
	<option value='https://www.r-bloggers.com/2008/12/'> December 2008  (16)</option>
	<option value='https://www.r-bloggers.com/2008/11/'> November 2008  (14)</option>
	<option value='https://www.r-bloggers.com/2008/10/'> October 2008  (10)</option>
	<option value='https://www.r-bloggers.com/2008/09/'> September 2008  (8)</option>
	<option value='https://www.r-bloggers.com/2008/08/'> August 2008  (11)</option>
	<option value='https://www.r-bloggers.com/2008/07/'> July 2008  (7)</option>
	<option value='https://www.r-bloggers.com/2008/06/'> June 2008  (8)</option>
	<option value='https://www.r-bloggers.com/2008/05/'> May 2008  (8)</option>
	<option value='https://www.r-bloggers.com/2008/04/'> April 2008  (4)</option>
	<option value='https://www.r-bloggers.com/2008/03/'> March 2008  (5)</option>
	<option value='https://www.r-bloggers.com/2008/02/'> February 2008  (6)</option>
	<option value='https://www.r-bloggers.com/2008/01/'> January 2008  (10)</option>
	<option value='https://www.r-bloggers.com/2007/12/'> December 2007  (3)</option>
	<option value='https://www.r-bloggers.com/2007/11/'> November 2007  (5)</option>
	<option value='https://www.r-bloggers.com/2007/10/'> October 2007  (9)</option>
	<option value='https://www.r-bloggers.com/2007/09/'> September 2007  (7)</option>
	<option value='https://www.r-bloggers.com/2007/08/'> August 2007  (21)</option>
	<option value='https://www.r-bloggers.com/2007/07/'> July 2007  (9)</option>
	<option value='https://www.r-bloggers.com/2007/06/'> June 2007  (3)</option>
	<option value='https://www.r-bloggers.com/2007/05/'> May 2007  (3)</option>
	<option value='https://www.r-bloggers.com/2007/04/'> April 2007  (1)</option>
	<option value='https://www.r-bloggers.com/2007/03/'> March 2007  (5)</option>
	<option value='https://www.r-bloggers.com/2007/02/'> February 2007  (4)</option>
	<option value='https://www.r-bloggers.com/2006/11/'> November 2006  (1)</option>
	<option value='https://www.r-bloggers.com/2006/10/'> October 2006  (2)</option>
	<option value='https://www.r-bloggers.com/2006/08/'> August 2006  (3)</option>
	<option value='https://www.r-bloggers.com/2006/07/'> July 2006  (1)</option>
	<option value='https://www.r-bloggers.com/2006/06/'> June 2006  (1)</option>
	<option value='https://www.r-bloggers.com/2006/05/'> May 2006  (3)</option>
	<option value='https://www.r-bloggers.com/2006/04/'> April 2006  (1)</option>
	<option value='https://www.r-bloggers.com/2006/03/'> March 2006  (1)</option>
	<option value='https://www.r-bloggers.com/2006/02/'> February 2006  (5)</option>
	<option value='https://www.r-bloggers.com/2006/01/'> January 2006  (1)</option>
	<option value='https://www.r-bloggers.com/2005/10/'> October 2005  (1)</option>
	<option value='https://www.r-bloggers.com/2005/09/'> September 2005  (3)</option>
	<option value='https://www.r-bloggers.com/2005/05/'> May 2005  (1)</option>

		</select>

<script type="text/javascript">
/* <![CDATA[ */
(function() {
	var dropdown = document.getElementById( "archives-dropdown-3" );
	function onSelectChange() {
		if ( dropdown.options[ dropdown.selectedIndex ].value !== '' ) {
			document.location.href = this.options[ this.selectedIndex ].value;
		}
	}
	dropdown.onchange = onSelectChange;
})();
/* ]]> */
</script>
			</div><div id="linkcat-3349" class="sb-widget widget_links"><h4 class="widget-title">Other sites</h4>
	<ul class='xoxo blogroll'>
<li><a href="https://www.r-users.com/">Jobs for R-users</a></li>
<li><a href="http://www.proc-x.com/" title="SAS news gathered from bloggers">SAS blogs</a></li>

	</ul>
</div>
</aside></div>
</div>
<div class="copyright-wrap">
	<p class="copyright">Copyright © 2020 | <a href="https://www.mhthemes.com/" rel="nofollow">MH Corporate basic by MH Themes</a></p>
</div>
</div>

<!--
TPC! Memory Usage (http://webjawns.com)
Memory Usage: 73676080
Memory Peak Usage: 73791984
WP Memory Limit: 820M
PHP Memory Limit: 800M
Checkpoints: 9
-->


<!-- Schema & Structured Data For WP v1.9.49.1 - -->
<script type="application/ld+json" class="saswp-schema-markup-output">
[{"@context":"https:\/\/schema.org","@graph":[{"@type":"Organization","@id":"https:\/\/www.r-bloggers.com#Organization","name":"R-bloggers","url":"http:\/\/www.r-bloggers.com","sameAs":[],"logo":{"@type":"ImageObject","url":"http:\/\/www.r-bloggers.com\/wp-content\/uploads\/2020\/07\/R_blogger_logo_02.png","width":"1061","height":"304"},"contactPoint":{"@type":"ContactPoint","contactType":"technical support","telephone":"","url":"https:\/\/www.r-bloggers.com\/contact-us\/"}},{"@type":"WebSite","@id":"https:\/\/www.r-bloggers.com#website","headline":"R-bloggers","name":"R-bloggers","description":"R news and tutorials contributed by hundreds of R bloggers","url":"https:\/\/www.r-bloggers.com","potentialAction":{"@type":"SearchAction","target":"https:\/\/www.r-bloggers.com\/?s={search_term_string}","query-input":"required name=search_term_string"},"publisher":{"@id":"https:\/\/www.r-bloggers.com#Organization"}},{"@context":"https:\/\/schema.org","@type":"WebPage","@id":"https:\/\/www.r-bloggers.com\/2020\/02\/an-introduction-to-web-scraping-locating-spanish-schools\/#webpage","name":"An introduction to web scraping: locating Spanish schools | R-bloggers","url":"https:\/\/www.r-bloggers.com\/2020\/02\/an-introduction-to-web-scraping-locating-spanish-schools\/","lastReviewed":"2020-02-10T18:00:00-06:00","reviewedBy":{"@type":"Organization","logo":{"@type":"ImageObject","url":"http:\/\/www.r-bloggers.com\/wp-content\/uploads\/2020\/07\/R_blogger_logo_02.png","width":"1061","height":"304"},"name":"R-bloggers"},"inLanguage":"en-US","description":"by Jorge Cimentada\n        \n\n\n\nIntroduction\nWhenever a new paper is released using some type of scraped data, most of my peers in the social science community get baffled at how researchers can do this. In fact, many social scientists can\u2019t even think of research questions that can be addressed with this type of data simply because they don\u2019t know it\u2019s even possible. As the old saying goes, when you have a hammer, every problem looks like a nail.\nWith the increasing amount of data being collected on a daily basis, it is eminent that scientists start getting familiar with new technologies that can help answer old questions. Moreover, we need to be adventurous about cutting edge data sources as they can also allow us to ask new questions which weren\u2019t even thought of in the past.\nIn this tutorial I\u2019ll be guiding you through the basics of web scraping using R and the xml2 package. I\u2019ll begin with a simple example using fake data and elaborate further by trying to scrape the location of a sample of schools in Spain.\n\n\nBasic steps\nFor web scraping in R, you can fulfill almost all of your needs with the xml2 package. As you wander through the web, you\u2019ll see many examples using the rvest package. xml2 and rvest are very similar so don\u2019t feel you\u2019re lacking behind for learning one and not the other. In addition to these two packages, we\u2019ll need some other libraries for plotting locations on a map (ggplot2, sf, rnaturalearth), identifying who we are when we scrape (httr) and wrangling data (tidyverse).\nAdditionally, we\u2019ll also need the package scrapex. In the real-world example that we\u2019ll be doing below, we\u2019ll be scraping data from the website www.buscocolegio.com to locate a sample of schools in Spain. However, throughout the tutorial we won\u2019t be scraping the data directly from their real-website. What would happen to this tutorial if 6 months from now www.buscocolegio.com updates the design of their website? Everything from our real-world example would be lost.\nWeb scraping tutorials are usually very unstable precisely because of this. To circumvent that problem, I\u2019ve saved a random sample of websites from some schools in www.buscocolegio.com into an R package called scrapex. Although the links we\u2019ll be working on will be hosted locally on your machine, the HTML of the website should be very similar to the one hosted on the website (with the exception of some images\/icons which were deleted on purpose to make the package lightweight).\nYou can install the package with:\n# install.packages(\"devtools\")\ndevtools::install_github(\"cimentadaj\/scrapex\")\nNow, let\u2019s move on the fake data example and load all of our packages with:\nlibrary(xml2)\nlibrary(httr)\nlibrary(tidyverse)\nlibrary(sf)\nlibrary(rnaturalearth)\nlibrary(ggplot2)\nlibrary(scrapex)\nLet\u2019s begin with a simple example. Below we define an XML string and look at its structure:\nxml_test <- \"<people>\n<jason>\n  <person type='fictional'>\n    <first_name>\n      <married>\n        Jason\n      <\/married>\n    <\/first_name>\n    <last_name>\n        Bourne\n    <\/last_name>\n    <occupation>\n      Spy\n    <\/occupation>\n  <\/person>\n<\/jason>\n<carol>\n  <person type='real'>\n    <first_name>\n      <married>\n        Carol\n      <\/married>\n    <\/first_name>\n    <last_name>\n        Kalp\n    <\/last_name>\n    <occupation>\n      Scientist\n    <\/occupation>\n  <\/person>\n<\/carol>\n<\/people>\n\"\n\ncat(xml_test)\n## <people>\n## <jason>\n##   <person type='fictional'>\n##     <first_name>\n##       <married>\n##         Jason\n##       <\/married>\n##     <\/first_name>\n##     <last_name>\n##         Bourne\n##     <\/last_name>\n##     <occupation>\n##       Spy\n##     <\/occupation>\n##   <\/person>\n## <\/jason>\n## <carol>\n##   <person type='real'>\n##     <first_name>\n##       <married>\n##         Carol\n##       <\/married>\n##     <\/first_name>\n##     <last_name>\n##         Kalp\n##     <\/last_name>\n##     <occupation>\n##       Scientist\n##     <\/occupation>\n##   <\/person>\n## <\/carol>\n## <\/people>\nIn XML and HTML the basic building blocks are something called tags. For example, the first tag in the structure shown above is <people>. This tag is matched by <\/people> at the end of the string:\n\nIf you pay close attention, you\u2019ll see that each tag in the XML structure has a beginning (signaled by <>) and an end (signaled by <\/>). For example, the next tag after <people> is <jason> and right before the tag <carol> is the end of the jason tag <\/jason>.\n\nSimilarly, you\u2019ll find that the <carol> tag is also matched by a <\/carol> finishing tag.\n\nIn theory, tags can have whatever meaning you attach to them (such as <people> or <occupation>). However, in practice there are hundreds of tags which are standard in websites (for example, here). If you\u2019re just getting started, there\u2019s no need for you to learn them but as you progress in web scraping, you\u2019ll start to recognize them (one brief example is <strong> which simply bolds text in a website).\nThe xml2 package was designed to read XML strings and to navigate the tree structure to extract information. For example, let\u2019s read in the XML data from our fake example and look at its general structure:\nxml_raw <- read_xml(xml_test)\nxml_structure(xml_raw)\n## <people>\n##   <jason>\n##     <person >\n##       <first_name>\n##         <married>\n##           {text}\n##       <last_name>\n##         {text}\n##       <occupation>\n##         {text}\n##   <carol>\n##     <person >\n##       <first_name>\n##         <married>\n##           {text}\n##       <last_name>\n##         {text}\n##       <occupation>\n##         {text}\nYou can see that the structure is tree-based, meaning that tags such as <jason> and <carol> are nested within the <people> tag. In XML jargon, <people> is the root node, whereas <jason> and <carol> are the child nodes from <people>.\nIn more detail, the structure is as follows:\n\nThe root node is <people>\n\nThe child nodes are <jason> and <carol>\n\nThen each child node has nodes <first_name>, <married>, <last_name> and <occupation> nested within them.\n\nPut another way, if something is nested within a node, then the nested node is a child of the upper-level node. In our example, the root node is <people> so we can check which are its children:\n# xml_child returns only one child (specified in search)\n# Here, jason is the first child\nxml_child(xml_raw, search = 1)\n## {xml_node}\n## <jason>\n##  <person type=\"fictional\">\\n  <first_name>\\n    <married>\\n        Ja ...\n# Here, carol is the second child\nxml_child(xml_raw, search = 2)\n## {xml_node}\n## <carol>\n##  <person type=\"real\">\\n  <first_name>\\n    <married>\\n        Carol\\n ...\n# Use xml_children to extract **all** children\nchild_xml <- xml_children(xml_raw)\n\nchild_xml\n## {xml_nodeset (2)}\n##  <jason>\\n  <person type=\"fictional\">\\n    <first_name>\\n      <marri ...\n##  <carol>\\n  <person type=\"real\">\\n    <first_name>\\n      <married>\\n ...\nTags can also have different attributes which are usually specified as <fake_tag attribute='fake'> and ended as usual with <\/fake_tag>. If you look at the XML structure of our example, you\u2019ll notice that each <person> tag has an attribute called type. As you\u2019ll see in our real-world example, extracting these attributes is often the aim of our scraping adventure. Using xml2, we can extract all attributes that match a specific name with xml_attrs.\n# Extract the attribute type from all nodes\nxml_attrs(child_xml, \"type\")\n## ]\n## named character(0)\n##\n## ]\n## named character(0)\nWait, why didn\u2019t this work? Well, if you look at the output of child_xml, we have two nodes on which are for <jason> and <carol>.\nchild_xml\n## {xml_nodeset (2)}\n##  <jason>\\n  <person type=\"fictional\">\\n    <first_name>\\n      <marri ...\n##  <carol>\\n  <person type=\"real\">\\n    <first_name>\\n      <married>\\n ...\nDo these tags have an attribute? No, because if they did, they would have something like <jason type='fake_tag'>. What we need is to look down at the <person> tag within <jason> and <carol> and extract the attribute from <person>.\nDoes this sound familiar? Both <jason> and <carol> have an associated <person> tag below them, making them their children. We can just go down one level by running xml_children on these tags and extract them.\n# We go down one level of children\nperson_nodes <- xml_children(child_xml)\n\n# <person> is now the main node, so we can extract attributes\nperson_nodes\n## {xml_nodeset (2)}\n##  <person type=\"fictional\">\\n  <first_name>\\n    <married>\\n        Ja ...\n##  <person type=\"real\">\\n  <first_name>\\n    <married>\\n        Carol\\n ...\n# Both type attributes\nxml_attrs(person_nodes, \"type\")\n## ]\n##        type\n## \"fictional\"\n##\n## ]\n##   type\n## \"real\"\nUsing the xml_path function you can even find the \u2018address\u2019 of these nodes to retrieve specific tags without having to write down xml_children many times. For example:\n# Specific address of each person tag for the whole xml tree\n# only using the `person_nodes`\nxml_path(person_nodes)\n##  \"\/people\/jason\/person\" \"\/people\/carol\/person\"\nWe have the \u2018address\u2019 of specific tags in the tree but how do we extract them automatically? To extract specific \u2018addresses\u2019 of this XML tree, the main function we\u2019ll use is xml_find_all. This function accepts the XML tree and an \u2018address\u2019 string. We can use very simple strings, such as the one given by xml_path:\n# You can use results from xml_path like directories\nxml_find_all(xml_raw, \"\/people\/jason\/person\")\n## {xml_nodeset (1)}\n##  <person type=\"fictional\">\\n  <first_name>\\n    <married>\\n        Ja ...\nThe expression above is asking for the node \"\/people\/jason\/person\". This will return the same as saying xml_raw %>% xml_child(search = 1). For deeply nested trees, xml_find_all will be many times much cleaner than calling xml_child recursively many times.\nHowever, in most cases the \u2018addresses\u2019 used in xml_find_all come from a separate language called XPath (in fact, the \u2018address\u2019 we\u2019ve been looking at is XPath). XPath is a complex language (such as regular expressions for strings) which is beyond this brief tutorial. However, with the examples we\u2019ve seen so far, we can use some basic XPath which we\u2019ll need later on.\nTo extract all the tags in a document, we can use \/\/name_of_tag.\n# Search for all 'married' nodes\nxml_find_all(xml_raw, \"\/\/married\")\n## {xml_nodeset (2)}\n##  <married>\\n        Jason\\n      <\/married>\n##  <married>\\n        Carol\\n      <\/married>\nWith the previous XPath, we\u2019re searching for all married tags within the complete XML tree. The result returns all married nodes (I use the words tags and nodes interchangeably) in the complete tree structure. Another example would be finding all <occupation> tags:\nxml_find_all(xml_raw, \"\/\/occupation\")\n## {xml_nodeset (2)}\n##  <occupation>\\n      Spy\\n    <\/occupation>\n##  <occupation>\\n      Scientist\\n    <\/occupation>\nIf you want to find any other tag you can replace \"\/\/occupation\" with your tag of interest and xml_find_all will find all of them.\nIf you wanted to find all tags below your current node, you only need to add a . at the beginning: \".\/\/occupation\". For example, if we dived into the <jason> tag and we wanted his <occupation> tag, \"\/\/occupation\" will returns all <occupation> tags. Instead, \".\/\/occupation\" will return only the found tags below the current tag. For example:\nxml_raw %>%\n  # Dive only into Jason's tag\n  xml_child(search = 1) %>%\n  xml_find_all(\".\/\/occupation\")\n## {xml_nodeset (1)}\n##  <occupation>\\n      Spy\\n    <\/occupation>\n# Instead, the wrong way would have been:\nxml_raw %>%\n  # Dive only into Jason's tag\n  xml_child(search = 1) %>%\n  # Here we get both occupation tags\n  xml_find_all(\"\/\/occupation\")\n## {xml_nodeset (2)}\n##  <occupation>\\n      Spy\\n    <\/occupation>\n##  <occupation>\\n      Scientist\\n    <\/occupation>\nThe first example only returns <jason>\u2019s occupation whereas the second returned all occupations, regardless of where you are in the tree.\nXPath also allows you to identify tags that contain only one specific attribute, such as the one\u2019s we saw earlier. For example, to filter all <person> tags with the attribute filter set to fictional, we could do it with:\n# Give me all the tags 'person' that have an attribute type='fictional'\nxml_raw %>%\n  xml_find_all(\"\/\/person\")\n## {xml_nodeset (1)}\n##  <person type=\"fictional\">\\n  <first_name>\\n    <married>\\n        Ja ...\nIf you wanted to do the same but for the tags below your current nodes, the same trick we learned earlier would work: \".\/\/person\". These are just some primers that can help you jump easily to using XPath, but I encourage you to look at other examples on the web, as complex websites often require complex XPath expressions.\nBefore we begin our real-word example, you might be asking yourself how you can actually extract the text\/numeric data from these nodes. Well, that\u2019s easy: xml_text.\nxml_raw %>%\n  xml_find_all(\".\/\/occupation\") %>%\n  xml_text()\n##  \"\\n      Spy\\n    \"       \"\\n      Scientist\\n    \"\nOnce you\u2019ve narrowed down your tree-based search to one single piece of text or numbers, xml_text() will extract that for you (there\u2019s also xml_double and xml_integer for extracting numbers). As I said, XPath is really a huge language. If you\u2019re interested, this XPath cheat sheets have helped me a lot to learn tricks for easy scraping.\n\n\nReal-world example\nWe\u2019re interested in making a list of many schools in Spain and visualizing their location. This can be useful for many things such as matching population density of children across different regions to school locations. The website www.buscocolegio.com contains a database of schools similar to what we\u2019re looking for. As described at the beginning, instead we\u2019re going to use scrapex which has the function spanish_schools_ex() containing the links to a sample of websites from different schools saved locally on your computer.\nLet\u2019s look at an example for one school.\nschool_links <- spanish_schools_ex()\n\n# Keep only the HTML file of one particular school.\nschool_url <- school_links\n\nschool_url\n##  \"\/usr\/local\/lib\/R\/site-library\/scrapex\/extdata\/spanish_schools_ex\/school_3006839.html\"\nIf you\u2019re interested in looking at the website interactively in your browser, you can do it with browseURL(prep_browser(school_url)). Let\u2019s read the HTML (XML and HTML are usually interchangeable, so here we use read_html).\n# Here we use `read_html` because `read_xml` is throwing an error\n# when attempting to read. However, everything we've discussed\n# should be the same.\nschool_raw <- read_html(school_url) %>% xml_child()\n\nschool_raw\n## {html_node}\n## <head>\n##   <title>Aqu\u00ed encontrar\u00e1s toda la informaci\u00f3n necesaria sobre CEIP SA ...\n##   <meta charset=\"utf-8\">\\n\n##   <meta name=\"viewport\" content=\"width=device-width, initial-scale=1, ...\n##   <meta http-equiv=\"x-ua-compatible\" content=\"ie=edge\">\\n\n##   <meta name=\"author\" content=\"BuscoColegio\">\\n\n##   <meta name=\"description\" content=\"Encuentra toda la informaci\u00f3n nec ...\n##   <meta name=\"keywords\" content=\"opiniones SANCHIS GUARNER, contacto  ...\n##   <link rel=\"shortcut icon\" href=\"\/favicon.ico\">\\n\n##   <link rel=\"stylesheet\" href=\"\/\/fonts.googleapis.com\/css?family=Robo ...\n##  <link rel=\"stylesheet\" href=\"https:\/\/s3.eu-west-3.amazonaws.com\/bus ...\n##  <link rel=\"stylesheet\" href=\"\/assets\/vendor\/icon-awesome\/css\/font-a ...\n##  <link rel=\"stylesheet\" href=\"\/assets\/vendor\/icon-line\/css\/simple-li ...\n##  <link rel=\"stylesheet\" href=\"\/assets\/vendor\/icon-line-pro\/style.css ...\n##  <link rel=\"stylesheet\" href=\"\/assets\/vendor\/icon-hs\/style.css\">\\n\n##  <link rel=\"stylesheet\" href=\"https:\/\/s3.eu-west-3.amazonaws.com\/bus ...\n##  <link rel=\"stylesheet\" href=\"https:\/\/s3.eu-west-3.amazonaws.com\/bus ...\n##  <link rel=\"stylesheet\" href=\"https:\/\/s3.eu-west-3.amazonaws.com\/bus ...\n##  <link rel=\"stylesheet\" href=\"https:\/\/s3.eu-west-3.amazonaws.com\/bus ...\n##  <link rel=\"stylesheet\" href=\"https:\/\/s3.eu-west-3.amazonaws.com\/bus ...\n##  <link rel=\"stylesheet\" href=\"https:\/\/s3.eu-west-3.amazonaws.com\/bus ...\n## ...\nWeb scraping strategies are very specific to the website you\u2019re after. You have to get very familiar with the website you\u2019re interested to be able to match perfectly the information you\u2019re looking for. In many cases, scraping two websites will require vastly different strategies. For this particular example, we\u2019re only interested in figuring out the location of each school so we only have to extract its location.\n\n\nIn the image above you\u2019ll find a typical school\u2019s website in wwww.buscocolegio.com. The website has a lot of information, but we\u2019re only interested in the button that is circled by the orange rectangle. If you can\u2019t find it easily, it\u2019s below the Google Maps on the right which says \u201cBuscar colegio cercano\u201d.\nWhen you click on this button, this actually points you towards the coordinates of the school so we just have to find a way of figuring out how to click this button or figure out how to get its information. All browsers allow you to do this if you press CTRL + SHIFT + c at the same time (Firefox and Chrome support this hotkey). If a window on the right popped in full of code, then you\u2019re on the right track:\n\n\n\nHere we can search the source code of the website. If you place your mouse pointer over the lines of code from this right-most window, you\u2019ll see sections of the website being highlighted in blue. This indicates which parts of the code refer to which parts of the website. Luckily for us, we don\u2019t have to search the complete source code to find that specific location. We can approximate our search by typing the text we\u2019re looking for in the search bar at the top of the right window:\n\n\n\nAfter we click enter, we\u2019ll be automatically directed to the tag that has the information that we want.\n\n\n\nMore specifically, we can see that the latitude and longitude of schools are found in an attributed called href in a tag <a>:\n\n\n\nCan you see the latitude and longitude fields in the text highlighted blue? It\u2019s hidden in-between words. That is precisely the type of information we\u2019re after. Extracting all <a> tags from the website (hint: XPath similar to \"\/\/a\") will yield hundreds of matches because <a> is a very common tag. Moreover, refining the search to <a> tags which have an href attribute will also yield hundreds of matches because href is the standard attribute to attach links within websites. We need to narrow down our search within the website.\nOne strategy is to find the \u2018father\u2019 or \u2018grandfather\u2019 node of this particular <a> tag and then match a node which has that same sequence of grandfather -> father -> child node. By looking at the structure of this small XML snippet from the right-most window, we see that the \u2018grandfather\u2019 of this <a> tag is <p class=\"d-flex align-items-baseline g-mt-5'> which has a particularly long attribute named class.\n\n\n\nDon\u2019t be intimidated by these tag names and long attributes. I also don\u2019t know what any of these attributes mean. But what I do know is that this is the \u2018grandfather\u2019 of the <a> tag I\u2019m interested in. So using our XPath skills, let\u2019s search for that <p> tag and see if we get only one match.\n# Search for all <p> tags with that class in the document\nschool_raw %>%\n  xml_find_all(\"\/\/p\")\n## {xml_nodeset (1)}\n##  <p class=\"d-flex align-items-baseline g-mt-5\">\\r\\n\\t                 ...\nOnly one match, so this is good news. This means that we can uniquely identify this particular <p> tag. Let\u2019s refine the search to say: Find all <a> tags which are children of that specific <p> tag. This only means I\u2019ll add a \"\/\/a\" to the previous expression. Since there is only one <p> tag with the class, we\u2019re interested in checking whether there is more than one <a> tag below this <p> tag.\nschool_raw %>%\n  xml_find_all(\"\/\/p\/\/a\")\n## {xml_nodeset (1)}\n##  <a href=\"\/Colegio\/buscar-colegios-cercanos.action?colegio.latitud=38 ...\nThere we go! We can see the specific href that contains the latitude and longitude data we\u2019re interested in. How do we extract the href attribute? Using xml_attr as we did before!\nlocation_str <-\n  school_raw %>%\n  xml_find_all(\"\/\/p\/\/a\") %>%\n  xml_attr(attr = \"href\")\n\nlocation_str\n##  \"\/Colegio\/buscar-colegios-cercanos.action?colegio.latitud=38.8274492&colegio.longitud=0.0221681\"\nOk, now we need some regex skills to get only the latitude and longitude (regex expressions are used to search for patterns inside a string, such as for example a date. See here for some examples):\nlocation <-\n  location_str %>%\n  str_extract_all(\"=.+$\") %>%\n  str_replace_all(\"=|colegio\\\\.longitud\", \"\") %>%\n  str_split(\"&\") %>%\n  .]\n\nlocation\n##  \"38.8274492\" \"0.0221681\"\nOk, so we got the information we needed for one single school. Let\u2019s turn that into a function so we can pass only the school\u2019s link and get the coordinates back.\nBefore we do that, I will set something called my User-Agent. In short, the User-Agent is who you are. It is good practice to identify the person who is scraping the website because if you\u2019re causing any trouble on the website, the website can directly identify who is causing problems. You can figure out your user agent here and paste it in the string below. In addition, I will add a time sleep of 5 seconds to the function because we want to make sure we don\u2019t cause any troubles to the website we\u2019re scraping due to an overload of requests.\n# This sets your `User-Agent` globally so that all requests are\n# identified with this `User-Agent`\nset_config(\n  user_agent(\"Mozilla\/5.0 (X11; Ubuntu; Linux x86_64; rv:70.0) Gecko\/20100101 Firefox\/70.0\")\n)\n\n# Collapse all of the code from above into one function called\n# school grabber\n\nschool_grabber <- function(school_url) {\n  # We add a time sleep of 5 seconds to avoid\n  # sending too many quick requests to the website\n  Sys.sleep(5)\n\n  school_raw <- read_html(school_url) %>% xml_child()\n\n  location_str <-\n    school_raw %>%\n    xml_find_all(\"\/\/p\/\/a\") %>%\n    xml_attr(attr = \"href\")\n\n  location <-\n    location_str %>%\n    str_extract_all(\"=.+$\") %>%\n    str_replace_all(\"=|colegio\\\\.longitud\", \"\") %>%\n    str_split(\"&\") %>%\n    .]\n\n  # Turn into a data frame\n  data.frame(\n    latitude = location,\n    longitude = location,\n    stringsAsFactors = FALSE\n  )\n}\n\n\nschool_grabber(school_url)\n##     latitude longitude\n## 1 38.8274492 0.0221681\nOk, so it\u2019s working. The only thing left is to extract this for many schools. As shown earlier, scrapex contains a list of 27 school links that we can automatically scrape. Let\u2019s loop over those, get the information of coordinates for each and collapse all of them into a data frame.\nres <- map_dfr(school_links, school_grabber)\nres\n##    latitude  longitude\n## 1  42.72779 -8.6567935\n## 2  43.24439 -8.8921645\n## 3  38.95592 -1.2255769\n## 4  39.18657 -1.6225903\n## 5  40.38245 -3.6410388\n## 6  40.22929 -3.1106322\n## 7  40.43860 -3.6970366\n## 8  40.33514 -3.5155669\n## 9  40.50546 -3.3738441\n## 10 40.63826 -3.4537107\n## 11 40.38543 -3.6639500\n## 12 37.76485 -1.5030467\n## 13 38.82745  0.0221681\n## 14 40.99434 -5.6224391\n## 15 40.99434 -5.6224391\n## 16 40.56037 -5.6703725\n## 17 40.99434 -5.6224391\n## 18 40.99434 -5.6224391\n## 19 41.13593  0.9901905\n## 20 41.26155  1.1670507\n## 21 41.22851  0.5461471\n## 22 41.14580  0.8199749\n## 23 41.18341  0.5680564\n## 24 42.07820  1.8203155\n## 25 42.25245  1.8621546\n## 26 41.73767  1.8383666\n## 27 41.62345  2.0013628\nSo now that we have the locations of these schools, let\u2019s plot them:\nres <- mutate_all(res, as.numeric)\n\nsp_sf <-\n  ne_countries(scale = \"large\", country = \"Spain\", returnclass = \"sf\") %>%\n  st_transform(crs = 4326)\n\nggplot(sp_sf) +\n  geom_sf() +\n  geom_point(data = res, aes(x = longitude, y = latitude)) +\n  coord_sf(xlim = c(-20, 10), ylim = c(25, 45)) +\n  theme_minimal() +\n  ggtitle(\"Sample of schools in Spain\")\n\nThere we go! We went from literally no information at the beginning of this tutorial to interpretable and summarized information only using web data. We can see some schools in Madrid (center) as well in other regions of Spain, including Catalonia and Galicia.\nThis marks the end of our scraping adventure but before we finish, I want to mention some of the ethical guidelines for web scraping. Scraping is extremely useful for us but can give headaches to other people maintaining the website of interest. Here\u2019s a list of ethical guidelines you should always follow:\n\nRead the terms and services: many websites prohibit web scraping and you could be in a breach of privacy by scraping the data. One famous example.\nCheck the robots.txt file. This is a file that most websites have (www.buscocolegio.com does not) which tell you which specific paths inside the website are scrapable and which are not. See here for an explanation of what robots.txt look like and where to find them.\nSome websites are supported by very big servers, which means you can send 4-5 website requests per second. Others, such as www.buscocolegio.com are not. It\u2019s good practice to always put a time sleep between your requests. In our example, I set it to 5 seconds because this is a small website and we don\u2019t want to crash their servers.\nWhen making requests, there are computational ways of identifying yourself. For example, every request (such as the one\u2019s we do) can have something called a User-Agent. It is good practice to include yourself in as the User-Agent (as we did in our code) because the admin of the server can directly identify if someone\u2019s causing problems due to their web scraping.\nLimit your scraping to non-busy hours such as overnight. This can help reduce the chances of collapsing the website since fewer people are visiting websites in the evening.\n\nYou can read more about these ethical issues here.\n\n\nWrap up\nThis tutorial introduced you to basic concepts in web scraping and applied them in a real-world setting. Web scraping is a vast field in computer science (you can find entire books on the subject such as this). We covered some basic techniques which I think can take you a long way but there\u2019s definitely more to learn. For those curious about where to turn, I\u2019m looking forward to the upcoming book \u201cA Field Guide for Web Scraping and Accessing APIs with R\u201d by Bob Rudis, which should be released in the near future. Now go scrape some websites ethically!","primaryImageOfPage":{"@id":"https:\/\/www.r-bloggers.com\/2020\/02\/an-introduction-to-web-scraping-locating-spanish-schools\/#primaryimage"},"mainContentOfPage":[[{"@context":"https:\/\/schema.org","@type":"SiteNavigationElement","@id":"https:\/\/www.r-bloggers.com\/#top nav","name":"Home","url":"https:\/\/www.r-bloggers.com"},{"@context":"https:\/\/schema.org","@type":"SiteNavigationElement","@id":"https:\/\/www.r-bloggers.com\/#top nav","name":"About","url":"http:\/\/www.r-bloggers.com\/about\/"},{"@context":"https:\/\/schema.org","@type":"SiteNavigationElement","@id":"https:\/\/www.r-bloggers.com\/#top nav","name":"RSS","url":"https:\/\/feeds.feedburner.com\/RBloggers"},{"@context":"https:\/\/schema.org","@type":"SiteNavigationElement","@id":"https:\/\/www.r-bloggers.com\/#top nav","name":"add your blog!","url":"http:\/\/www.r-bloggers.com\/add-your-blog\/"},{"@context":"https:\/\/schema.org","@type":"SiteNavigationElement","@id":"https:\/\/www.r-bloggers.com\/#top nav","name":"Learn R","url":"https:\/\/www.r-bloggers.com\/how-to-learn-r-2\/"},{"@context":"https:\/\/schema.org","@type":"SiteNavigationElement","@id":"https:\/\/www.r-bloggers.com\/#top nav","name":"R jobs","url":"https:\/\/www.r-users.com\/"},{"@context":"https:\/\/schema.org","@type":"SiteNavigationElement","@id":"https:\/\/www.r-bloggers.com\/#top nav","name":"Submit a new job (it's free)","url":"https:\/\/www.r-users.com\/submit-job\/"},{"@context":"https:\/\/schema.org","@type":"SiteNavigationElement","@id":"https:\/\/www.r-bloggers.com\/#top nav","name":"Browse latest jobs (also free)","url":"https:\/\/www.r-users.com\/"},{"@context":"https:\/\/schema.org","@type":"SiteNavigationElement","@id":"https:\/\/www.r-bloggers.com\/#top nav","name":"Contact us","url":"http:\/\/www.r-bloggers.com\/contact-us\/"}]],"isPartOf":{"@id":"https:\/\/www.r-bloggers.com#website"},"breadcrumb":{"@id":"https:\/\/www.r-bloggers.com\/2020\/02\/an-introduction-to-web-scraping-locating-spanish-schools\/#breadcrumb"}},{"@type":"BreadcrumbList","@id":"https:\/\/www.r-bloggers.com\/2020\/02\/an-introduction-to-web-scraping-locating-spanish-schools\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"item":{"@id":"https:\/\/www.r-bloggers.com","name":"R-bloggers"}},{"@type":"ListItem","position":2,"item":{"@id":"https:\/\/www.r-bloggers.com\/category\/r-bloggers\/","name":"R bloggers"}},{"@type":"ListItem","position":3,"item":{"@id":"https:\/\/www.r-bloggers.com\/2020\/02\/an-introduction-to-web-scraping-locating-spanish-schools\/","name":"An introduction to web scraping: locating Spanish schools | R-bloggers"}}]},{"@type":"Article","@id":"https:\/\/www.r-bloggers.com\/2020\/02\/an-introduction-to-web-scraping-locating-spanish-schools\/#article","url":"https:\/\/www.r-bloggers.com\/2020\/02\/an-introduction-to-web-scraping-locating-spanish-schools\/","inLanguage":"en-US","mainEntityOfPage":"https:\/\/www.r-bloggers.com\/2020\/02\/an-introduction-to-web-scraping-locating-spanish-schools\/#webpage","headline":"An introduction to web scraping: locating Spanish schools | R-bloggers","description":"by Jorge Cimentada\n        \n\n\n\nIntroduction\nWhenever a new paper is released using some type of scraped data, most of my peers in the social science community get baffled at how researchers can do this. In fact, many social scientists can\u2019t even think of research questions that can be addressed with this type of data simply because they don\u2019t know it\u2019s even possible. As the old saying goes, when you have a hammer, every problem looks like a nail.\nWith the increasing amount of data being collected on a daily basis, it is eminent that scientists start getting familiar with new technologies that can help answer old questions. Moreover, we need to be adventurous about cutting edge data sources as they can also allow us to ask new questions which weren\u2019t even thought of in the past.\nIn this tutorial I\u2019ll be guiding you through the basics of web scraping using R and the xml2 package. I\u2019ll begin with a simple example using fake data and elaborate further by trying to scrape the location of a sample of schools in Spain.\n\n\nBasic steps\nFor web scraping in R, you can fulfill almost all of your needs with the xml2 package. As you wander through the web, you\u2019ll see many examples using the rvest package. xml2 and rvest are very similar so don\u2019t feel you\u2019re lacking behind for learning one and not the other. In addition to these two packages, we\u2019ll need some other libraries for plotting locations on a map (ggplot2, sf, rnaturalearth), identifying who we are when we scrape (httr) and wrangling data (tidyverse).\nAdditionally, we\u2019ll also need the package scrapex. In the real-world example that we\u2019ll be doing below, we\u2019ll be scraping data from the website www.buscocolegio.com to locate a sample of schools in Spain. However, throughout the tutorial we won\u2019t be scraping the data directly from their real-website. What would happen to this tutorial if 6 months from now www.buscocolegio.com updates the design of their website? Everything from our real-world example would be lost.\nWeb scraping tutorials are usually very unstable precisely because of this. To circumvent that problem, I\u2019ve saved a random sample of websites from some schools in www.buscocolegio.com into an R package called scrapex. Although the links we\u2019ll be working on will be hosted locally on your machine, the HTML of the website should be very similar to the one hosted on the website (with the exception of some images\/icons which were deleted on purpose to make the package lightweight).\nYou can install the package with:\n# install.packages(\"devtools\")\ndevtools::install_github(\"cimentadaj\/scrapex\")\nNow, let\u2019s move on the fake data example and load all of our packages with:\nlibrary(xml2)\nlibrary(httr)\nlibrary(tidyverse)\nlibrary(sf)\nlibrary(rnaturalearth)\nlibrary(ggplot2)\nlibrary(scrapex)\nLet\u2019s begin with a simple example. Below we define an XML string and look at its structure:\nxml_test <- \"<people>\n<jason>\n  <person type='fictional'>\n    <first_name>\n      <married>\n        Jason\n      <\/married>\n    <\/first_name>\n    <last_name>\n        Bourne\n    <\/last_name>\n    <occupation>\n      Spy\n    <\/occupation>\n  <\/person>\n<\/jason>\n<carol>\n  <person type='real'>\n    <first_name>\n      <married>\n        Carol\n      <\/married>\n    <\/first_name>\n    <last_name>\n        Kalp\n    <\/last_name>\n    <occupation>\n      Scientist\n    <\/occupation>\n  <\/person>\n<\/carol>\n<\/people>\n\"\n\ncat(xml_test)\n## <people>\n## <jason>\n##   <person type='fictional'>\n##     <first_name>\n##       <married>\n##         Jason\n##       <\/married>\n##     <\/first_name>\n##     <last_name>\n##         Bourne\n##     <\/last_name>\n##     <occupation>\n##       Spy\n##     <\/occupation>\n##   <\/person>\n## <\/jason>\n## <carol>\n##   <person type='real'>\n##     <first_name>\n##       <married>\n##         Carol\n##       <\/married>\n##     <\/first_name>\n##     <last_name>\n##         Kalp\n##     <\/last_name>\n##     <occupation>\n##       Scientist\n##     <\/occupation>\n##   <\/person>\n## <\/carol>\n## <\/people>\nIn XML and HTML the basic building blocks are something called tags. For example, the first tag in the structure shown above is <people>. This tag is matched by <\/people> at the end of the string:\n\nIf you pay close attention, you\u2019ll see that each tag in the XML structure has a beginning (signaled by <>) and an end (signaled by <\/>). For example, the next tag after <people> is <jason> and right before the tag <carol> is the end of the jason tag <\/jason>.\n\nSimilarly, you\u2019ll find that the <carol> tag is also matched by a <\/carol> finishing tag.\n\nIn theory, tags can have whatever meaning you attach to them (such as <people> or <occupation>). However, in practice there are hundreds of tags which are standard in websites (for example, here). If you\u2019re just getting started, there\u2019s no need for you to learn them but as you progress in web scraping, you\u2019ll start to recognize them (one brief example is <strong> which simply bolds text in a website).\nThe xml2 package was designed to read XML strings and to navigate the tree structure to extract information. For example, let\u2019s read in the XML data from our fake example and look at its general structure:\nxml_raw <- read_xml(xml_test)\nxml_structure(xml_raw)\n## <people>\n##   <jason>\n##     <person >\n##       <first_name>\n##         <married>\n##           {text}\n##       <last_name>\n##         {text}\n##       <occupation>\n##         {text}\n##   <carol>\n##     <person >\n##       <first_name>\n##         <married>\n##           {text}\n##       <last_name>\n##         {text}\n##       <occupation>\n##         {text}\nYou can see that the structure is tree-based, meaning that tags such as <jason> and <carol> are nested within the <people> tag. In XML jargon, <people> is the root node, whereas <jason> and <carol> are the child nodes from <people>.\nIn more detail, the structure is as follows:\n\nThe root node is <people>\n\nThe child nodes are <jason> and <carol>\n\nThen each child node has nodes <first_name>, <married>, <last_name> and <occupation> nested within them.\n\nPut another way, if something is nested within a node, then the nested node is a child of the upper-level node. In our example, the root node is <people> so we can check which are its children:\n# xml_child returns only one child (specified in search)\n# Here, jason is the first child\nxml_child(xml_raw, search = 1)\n## {xml_node}\n## <jason>\n##  <person type=\"fictional\">\\n  <first_name>\\n    <married>\\n        Ja ...\n# Here, carol is the second child\nxml_child(xml_raw, search = 2)\n## {xml_node}\n## <carol>\n##  <person type=\"real\">\\n  <first_name>\\n    <married>\\n        Carol\\n ...\n# Use xml_children to extract **all** children\nchild_xml <- xml_children(xml_raw)\n\nchild_xml\n## {xml_nodeset (2)}\n##  <jason>\\n  <person type=\"fictional\">\\n    <first_name>\\n      <marri ...\n##  <carol>\\n  <person type=\"real\">\\n    <first_name>\\n      <married>\\n ...\nTags can also have different attributes which are usually specified as <fake_tag attribute='fake'> and ended as usual with <\/fake_tag>. If you look at the XML structure of our example, you\u2019ll notice that each <person> tag has an attribute called type. As you\u2019ll see in our real-world example, extracting these attributes is often the aim of our scraping adventure. Using xml2, we can extract all attributes that match a specific name with xml_attrs.\n# Extract the attribute type from all nodes\nxml_attrs(child_xml, \"type\")\n## ]\n## named character(0)\n##\n## ]\n## named character(0)\nWait, why didn\u2019t this work? Well, if you look at the output of child_xml, we have two nodes on which are for <jason> and <carol>.\nchild_xml\n## {xml_nodeset (2)}\n##  <jason>\\n  <person type=\"fictional\">\\n    <first_name>\\n      <marri ...\n##  <carol>\\n  <person type=\"real\">\\n    <first_name>\\n      <married>\\n ...\nDo these tags have an attribute? No, because if they did, they would have something like <jason type='fake_tag'>. What we need is to look down at the <person> tag within <jason> and <carol> and extract the attribute from <person>.\nDoes this sound familiar? Both <jason> and <carol> have an associated <person> tag below them, making them their children. We can just go down one level by running xml_children on these tags and extract them.\n# We go down one level of children\nperson_nodes <- xml_children(child_xml)\n\n# <person> is now the main node, so we can extract attributes\nperson_nodes\n## {xml_nodeset (2)}\n##  <person type=\"fictional\">\\n  <first_name>\\n    <married>\\n        Ja ...\n##  <person type=\"real\">\\n  <first_name>\\n    <married>\\n        Carol\\n ...\n# Both type attributes\nxml_attrs(person_nodes, \"type\")\n## ]\n##        type\n## \"fictional\"\n##\n## ]\n##   type\n## \"real\"\nUsing the xml_path function you can even find the \u2018address\u2019 of these nodes to retrieve specific tags without having to write down xml_children many times. For example:\n# Specific address of each person tag for the whole xml tree\n# only using the `person_nodes`\nxml_path(person_nodes)\n##  \"\/people\/jason\/person\" \"\/people\/carol\/person\"\nWe have the \u2018address\u2019 of specific tags in the tree but how do we extract them automatically? To extract specific \u2018addresses\u2019 of this XML tree, the main function we\u2019ll use is xml_find_all. This function accepts the XML tree and an \u2018address\u2019 string. We can use very simple strings, such as the one given by xml_path:\n# You can use results from xml_path like directories\nxml_find_all(xml_raw, \"\/people\/jason\/person\")\n## {xml_nodeset (1)}\n##  <person type=\"fictional\">\\n  <first_name>\\n    <married>\\n        Ja ...\nThe expression above is asking for the node \"\/people\/jason\/person\". This will return the same as saying xml_raw %>% xml_child(search = 1). For deeply nested trees, xml_find_all will be many times much cleaner than calling xml_child recursively many times.\nHowever, in most cases the \u2018addresses\u2019 used in xml_find_all come from a separate language called XPath (in fact, the \u2018address\u2019 we\u2019ve been looking at is XPath). XPath is a complex language (such as regular expressions for strings) which is beyond this brief tutorial. However, with the examples we\u2019ve seen so far, we can use some basic XPath which we\u2019ll need later on.\nTo extract all the tags in a document, we can use \/\/name_of_tag.\n# Search for all 'married' nodes\nxml_find_all(xml_raw, \"\/\/married\")\n## {xml_nodeset (2)}\n##  <married>\\n        Jason\\n      <\/married>\n##  <married>\\n        Carol\\n      <\/married>\nWith the previous XPath, we\u2019re searching for all married tags within the complete XML tree. The result returns all married nodes (I use the words tags and nodes interchangeably) in the complete tree structure. Another example would be finding all <occupation> tags:\nxml_find_all(xml_raw, \"\/\/occupation\")\n## {xml_nodeset (2)}\n##  <occupation>\\n      Spy\\n    <\/occupation>\n##  <occupation>\\n      Scientist\\n    <\/occupation>\nIf you want to find any other tag you can replace \"\/\/occupation\" with your tag of interest and xml_find_all will find all of them.\nIf you wanted to find all tags below your current node, you only need to add a . at the beginning: \".\/\/occupation\". For example, if we dived into the <jason> tag and we wanted his <occupation> tag, \"\/\/occupation\" will returns all <occupation> tags. Instead, \".\/\/occupation\" will return only the found tags below the current tag. For example:\nxml_raw %>%\n  # Dive only into Jason's tag\n  xml_child(search = 1) %>%\n  xml_find_all(\".\/\/occupation\")\n## {xml_nodeset (1)}\n##  <occupation>\\n      Spy\\n    <\/occupation>\n# Instead, the wrong way would have been:\nxml_raw %>%\n  # Dive only into Jason's tag\n  xml_child(search = 1) %>%\n  # Here we get both occupation tags\n  xml_find_all(\"\/\/occupation\")\n## {xml_nodeset (2)}\n##  <occupation>\\n      Spy\\n    <\/occupation>\n##  <occupation>\\n      Scientist\\n    <\/occupation>\nThe first example only returns <jason>\u2019s occupation whereas the second returned all occupations, regardless of where you are in the tree.\nXPath also allows you to identify tags that contain only one specific attribute, such as the one\u2019s we saw earlier. For example, to filter all <person> tags with the attribute filter set to fictional, we could do it with:\n# Give me all the tags 'person' that have an attribute type='fictional'\nxml_raw %>%\n  xml_find_all(\"\/\/person\")\n## {xml_nodeset (1)}\n##  <person type=\"fictional\">\\n  <first_name>\\n    <married>\\n        Ja ...\nIf you wanted to do the same but for the tags below your current nodes, the same trick we learned earlier would work: \".\/\/person\". These are just some primers that can help you jump easily to using XPath, but I encourage you to look at other examples on the web, as complex websites often require complex XPath expressions.\nBefore we begin our real-word example, you might be asking yourself how you can actually extract the text\/numeric data from these nodes. Well, that\u2019s easy: xml_text.\nxml_raw %>%\n  xml_find_all(\".\/\/occupation\") %>%\n  xml_text()\n##  \"\\n      Spy\\n    \"       \"\\n      Scientist\\n    \"\nOnce you\u2019ve narrowed down your tree-based search to one single piece of text or numbers, xml_text() will extract that for you (there\u2019s also xml_double and xml_integer for extracting numbers). As I said, XPath is really a huge language. If you\u2019re interested, this XPath cheat sheets have helped me a lot to learn tricks for easy scraping.\n\n\nReal-world example\nWe\u2019re interested in making a list of many schools in Spain and visualizing their location. This can be useful for many things such as matching population density of children across different regions to school locations. The website www.buscocolegio.com contains a database of schools similar to what we\u2019re looking for. As described at the beginning, instead we\u2019re going to use scrapex which has the function spanish_schools_ex() containing the links to a sample of websites from different schools saved locally on your computer.\nLet\u2019s look at an example for one school.\nschool_links <- spanish_schools_ex()\n\n# Keep only the HTML file of one particular school.\nschool_url <- school_links\n\nschool_url\n##  \"\/usr\/local\/lib\/R\/site-library\/scrapex\/extdata\/spanish_schools_ex\/school_3006839.html\"\nIf you\u2019re interested in looking at the website interactively in your browser, you can do it with browseURL(prep_browser(school_url)). Let\u2019s read the HTML (XML and HTML are usually interchangeable, so here we use read_html).\n# Here we use `read_html` because `read_xml` is throwing an error\n# when attempting to read. However, everything we've discussed\n# should be the same.\nschool_raw <- read_html(school_url) %>% xml_child()\n\nschool_raw\n## {html_node}\n## <head>\n##   <title>Aqu\u00ed encontrar\u00e1s toda la informaci\u00f3n necesaria sobre CEIP SA ...\n##   <meta charset=\"utf-8\">\\n\n##   <meta name=\"viewport\" content=\"width=device-width, initial-scale=1, ...\n##   <meta http-equiv=\"x-ua-compatible\" content=\"ie=edge\">\\n\n##   <meta name=\"author\" content=\"BuscoColegio\">\\n\n##   <meta name=\"description\" content=\"Encuentra toda la informaci\u00f3n nec ...\n##   <meta name=\"keywords\" content=\"opiniones SANCHIS GUARNER, contacto  ...\n##   <link rel=\"shortcut icon\" href=\"\/favicon.ico\">\\n\n##   <link rel=\"stylesheet\" href=\"\/\/fonts.googleapis.com\/css?family=Robo ...\n##  <link rel=\"stylesheet\" href=\"https:\/\/s3.eu-west-3.amazonaws.com\/bus ...\n##  <link rel=\"stylesheet\" href=\"\/assets\/vendor\/icon-awesome\/css\/font-a ...\n##  <link rel=\"stylesheet\" href=\"\/assets\/vendor\/icon-line\/css\/simple-li ...\n##  <link rel=\"stylesheet\" href=\"\/assets\/vendor\/icon-line-pro\/style.css ...\n##  <link rel=\"stylesheet\" href=\"\/assets\/vendor\/icon-hs\/style.css\">\\n\n##  <link rel=\"stylesheet\" href=\"https:\/\/s3.eu-west-3.amazonaws.com\/bus ...\n##  <link rel=\"stylesheet\" href=\"https:\/\/s3.eu-west-3.amazonaws.com\/bus ...\n##  <link rel=\"stylesheet\" href=\"https:\/\/s3.eu-west-3.amazonaws.com\/bus ...\n##  <link rel=\"stylesheet\" href=\"https:\/\/s3.eu-west-3.amazonaws.com\/bus ...\n##  <link rel=\"stylesheet\" href=\"https:\/\/s3.eu-west-3.amazonaws.com\/bus ...\n##  <link rel=\"stylesheet\" href=\"https:\/\/s3.eu-west-3.amazonaws.com\/bus ...\n## ...\nWeb scraping strategies are very specific to the website you\u2019re after. You have to get very familiar with the website you\u2019re interested to be able to match perfectly the information you\u2019re looking for. In many cases, scraping two websites will require vastly different strategies. For this particular example, we\u2019re only interested in figuring out the location of each school so we only have to extract its location.\n\n\nIn the image above you\u2019ll find a typical school\u2019s website in wwww.buscocolegio.com. The website has a lot of information, but we\u2019re only interested in the button that is circled by the orange rectangle. If you can\u2019t find it easily, it\u2019s below the Google Maps on the right which says \u201cBuscar colegio cercano\u201d.\nWhen you click on this button, this actually points you towards the coordinates of the school so we just have to find a way of figuring out how to click this button or figure out how to get its information. All browsers allow you to do this if you press CTRL + SHIFT + c at the same time (Firefox and Chrome support this hotkey). If a window on the right popped in full of code, then you\u2019re on the right track:\n\n\n\nHere we can search the source code of the website. If you place your mouse pointer over the lines of code from this right-most window, you\u2019ll see sections of the website being highlighted in blue. This indicates which parts of the code refer to which parts of the website. Luckily for us, we don\u2019t have to search the complete source code to find that specific location. We can approximate our search by typing the text we\u2019re looking for in the search bar at the top of the right window:\n\n\n\nAfter we click enter, we\u2019ll be automatically directed to the tag that has the information that we want.\n\n\n\nMore specifically, we can see that the latitude and longitude of schools are found in an attributed called href in a tag <a>:\n\n\n\nCan you see the latitude and longitude fields in the text highlighted blue? It\u2019s hidden in-between words. That is precisely the type of information we\u2019re after. Extracting all <a> tags from the website (hint: XPath similar to \"\/\/a\") will yield hundreds of matches because <a> is a very common tag. Moreover, refining the search to <a> tags which have an href attribute will also yield hundreds of matches because href is the standard attribute to attach links within websites. We need to narrow down our search within the website.\nOne strategy is to find the \u2018father\u2019 or \u2018grandfather\u2019 node of this particular <a> tag and then match a node which has that same sequence of grandfather -> father -> child node. By looking at the structure of this small XML snippet from the right-most window, we see that the \u2018grandfather\u2019 of this <a> tag is <p class=\"d-flex align-items-baseline g-mt-5'> which has a particularly long attribute named class.\n\n\n\nDon\u2019t be intimidated by these tag names and long attributes. I also don\u2019t know what any of these attributes mean. But what I do know is that this is the \u2018grandfather\u2019 of the <a> tag I\u2019m interested in. So using our XPath skills, let\u2019s search for that <p> tag and see if we get only one match.\n# Search for all <p> tags with that class in the document\nschool_raw %>%\n  xml_find_all(\"\/\/p\")\n## {xml_nodeset (1)}\n##  <p class=\"d-flex align-items-baseline g-mt-5\">\\r\\n\\t                 ...\nOnly one match, so this is good news. This means that we can uniquely identify this particular <p> tag. Let\u2019s refine the search to say: Find all <a> tags which are children of that specific <p> tag. This only means I\u2019ll add a \"\/\/a\" to the previous expression. Since there is only one <p> tag with the class, we\u2019re interested in checking whether there is more than one <a> tag below this <p> tag.\nschool_raw %>%\n  xml_find_all(\"\/\/p\/\/a\")\n## {xml_nodeset (1)}\n##  <a href=\"\/Colegio\/buscar-colegios-cercanos.action?colegio.latitud=38 ...\nThere we go! We can see the specific href that contains the latitude and longitude data we\u2019re interested in. How do we extract the href attribute? Using xml_attr as we did before!\nlocation_str <-\n  school_raw %>%\n  xml_find_all(\"\/\/p\/\/a\") %>%\n  xml_attr(attr = \"href\")\n\nlocation_str\n##  \"\/Colegio\/buscar-colegios-cercanos.action?colegio.latitud=38.8274492&colegio.longitud=0.0221681\"\nOk, now we need some regex skills to get only the latitude and longitude (regex expressions are used to search for patterns inside a string, such as for example a date. See here for some examples):\nlocation <-\n  location_str %>%\n  str_extract_all(\"=.+$\") %>%\n  str_replace_all(\"=|colegio\\\\.longitud\", \"\") %>%\n  str_split(\"&\") %>%\n  .]\n\nlocation\n##  \"38.8274492\" \"0.0221681\"\nOk, so we got the information we needed for one single school. Let\u2019s turn that into a function so we can pass only the school\u2019s link and get the coordinates back.\nBefore we do that, I will set something called my User-Agent. In short, the User-Agent is who you are. It is good practice to identify the person who is scraping the website because if you\u2019re causing any trouble on the website, the website can directly identify who is causing problems. You can figure out your user agent here and paste it in the string below. In addition, I will add a time sleep of 5 seconds to the function because we want to make sure we don\u2019t cause any troubles to the website we\u2019re scraping due to an overload of requests.\n# This sets your `User-Agent` globally so that all requests are\n# identified with this `User-Agent`\nset_config(\n  user_agent(\"Mozilla\/5.0 (X11; Ubuntu; Linux x86_64; rv:70.0) Gecko\/20100101 Firefox\/70.0\")\n)\n\n# Collapse all of the code from above into one function called\n# school grabber\n\nschool_grabber <- function(school_url) {\n  # We add a time sleep of 5 seconds to avoid\n  # sending too many quick requests to the website\n  Sys.sleep(5)\n\n  school_raw <- read_html(school_url) %>% xml_child()\n\n  location_str <-\n    school_raw %>%\n    xml_find_all(\"\/\/p\/\/a\") %>%\n    xml_attr(attr = \"href\")\n\n  location <-\n    location_str %>%\n    str_extract_all(\"=.+$\") %>%\n    str_replace_all(\"=|colegio\\\\.longitud\", \"\") %>%\n    str_split(\"&\") %>%\n    .]\n\n  # Turn into a data frame\n  data.frame(\n    latitude = location,\n    longitude = location,\n    stringsAsFactors = FALSE\n  )\n}\n\n\nschool_grabber(school_url)\n##     latitude longitude\n## 1 38.8274492 0.0221681\nOk, so it\u2019s working. The only thing left is to extract this for many schools. As shown earlier, scrapex contains a list of 27 school links that we can automatically scrape. Let\u2019s loop over those, get the information of coordinates for each and collapse all of them into a data frame.\nres <- map_dfr(school_links, school_grabber)\nres\n##    latitude  longitude\n## 1  42.72779 -8.6567935\n## 2  43.24439 -8.8921645\n## 3  38.95592 -1.2255769\n## 4  39.18657 -1.6225903\n## 5  40.38245 -3.6410388\n## 6  40.22929 -3.1106322\n## 7  40.43860 -3.6970366\n## 8  40.33514 -3.5155669\n## 9  40.50546 -3.3738441\n## 10 40.63826 -3.4537107\n## 11 40.38543 -3.6639500\n## 12 37.76485 -1.5030467\n## 13 38.82745  0.0221681\n## 14 40.99434 -5.6224391\n## 15 40.99434 -5.6224391\n## 16 40.56037 -5.6703725\n## 17 40.99434 -5.6224391\n## 18 40.99434 -5.6224391\n## 19 41.13593  0.9901905\n## 20 41.26155  1.1670507\n## 21 41.22851  0.5461471\n## 22 41.14580  0.8199749\n## 23 41.18341  0.5680564\n## 24 42.07820  1.8203155\n## 25 42.25245  1.8621546\n## 26 41.73767  1.8383666\n## 27 41.62345  2.0013628\nSo now that we have the locations of these schools, let\u2019s plot them:\nres <- mutate_all(res, as.numeric)\n\nsp_sf <-\n  ne_countries(scale = \"large\", country = \"Spain\", returnclass = \"sf\") %>%\n  st_transform(crs = 4326)\n\nggplot(sp_sf) +\n  geom_sf() +\n  geom_point(data = res, aes(x = longitude, y = latitude)) +\n  coord_sf(xlim = c(-20, 10), ylim = c(25, 45)) +\n  theme_minimal() +\n  ggtitle(\"Sample of schools in Spain\")\n\nThere we go! We went from literally no information at the beginning of this tutorial to interpretable and summarized information only using web data. We can see some schools in Madrid (center) as well in other regions of Spain, including Catalonia and Galicia.\nThis marks the end of our scraping adventure but before we finish, I want to mention some of the ethical guidelines for web scraping. Scraping is extremely useful for us but can give headaches to other people maintaining the website of interest. Here\u2019s a list of ethical guidelines you should always follow:\n\nRead the terms and services: many websites prohibit web scraping and you could be in a breach of privacy by scraping the data. One famous example.\nCheck the robots.txt file. This is a file that most websites have (www.buscocolegio.com does not) which tell you which specific paths inside the website are scrapable and which are not. See here for an explanation of what robots.txt look like and where to find them.\nSome websites are supported by very big servers, which means you can send 4-5 website requests per second. Others, such as www.buscocolegio.com are not. It\u2019s good practice to always put a time sleep between your requests. In our example, I set it to 5 seconds because this is a small website and we don\u2019t want to crash their servers.\nWhen making requests, there are computational ways of identifying yourself. For example, every request (such as the one\u2019s we do) can have something called a User-Agent. It is good practice to include yourself in as the User-Agent (as we did in our code) because the admin of the server can directly identify if someone\u2019s causing problems due to their web scraping.\nLimit your scraping to non-busy hours such as overnight. This can help reduce the chances of collapsing the website since fewer people are visiting websites in the evening.\n\nYou can read more about these ethical issues here.\n\n\nWrap up\nThis tutorial introduced you to basic concepts in web scraping and applied them in a real-world setting. Web scraping is a vast field in computer science (you can find entire books on the subject such as this). We covered some basic techniques which I think can take you a long way but there\u2019s definitely more to learn. For those curious about where to turn, I\u2019m looking forward to the upcoming book \u201cA Field Guide for Web Scraping and Accessing APIs with R\u201d by Bob Rudis, which should be released in the near future. Now go scrape some websites ethically!","articleBody":"by Jorge Cimentada             Introduction Whenever a new paper is released using some type of scraped data, most of my peers in the social science community get baffled at how researchers can do this. In fact, many social scientists can\u2019t even think of research questions that can be addressed with this type of data simply because they don\u2019t know it\u2019s even possible. As the old saying goes, when you have a hammer, every problem looks like a nail. With the increasing amount of data being collected on a daily basis, it is eminent that scientists start getting familiar with new technologies that can help answer old questions. Moreover, we need to be adventurous about cutting edge data sources as they can also allow us to ask new questions which weren\u2019t even thought of in the past. In this tutorial I\u2019ll be guiding you through the basics of web scraping using R and the xml2 package. I\u2019ll begin with a simple example using fake data and elaborate further by trying to scrape the location of a sample of schools in Spain.   Basic steps For web scraping in R, you can fulfill almost all of your needs with the xml2 package. As you wander through the web, you\u2019ll see many examples using the rvest package. xml2 and rvest are very similar so don\u2019t feel you\u2019re lacking behind for learning one and not the other. In addition to these two packages, we\u2019ll need some other libraries for plotting locations on a map (ggplot2, sf, rnaturalearth), identifying who we are when we scrape (httr) and wrangling data (tidyverse). Additionally, we\u2019ll also need the package scrapex. In the real-world example that we\u2019ll be doing below, we\u2019ll be scraping data from the website www.buscocolegio.com to locate a sample of schools in Spain. However, throughout the tutorial we won\u2019t be scraping the data directly from their real-website. What would happen to this tutorial if 6 months from now www.buscocolegio.com updates the design of their website? Everything from our real-world example would be lost. Web scraping tutorials are usually very unstable precisely because of this. To circumvent that problem, I\u2019ve saved a random sample of websites from some schools in www.buscocolegio.com into an R package called scrapex. Although the links we\u2019ll be working on will be hosted locally on your machine, the HTML of the website should be very similar to the one hosted on the website (with the exception of some images\/icons which were deleted on purpose to make the package lightweight). You can install the package with: # install.packages("devtools") devtools::install_github("cimentadaj\/scrapex") Now, let\u2019s move on the fake data example and load all of our packages with: library(xml2) library(httr) library(tidyverse) library(sf) library(rnaturalearth) library(ggplot2) library(scrapex) Let\u2019s begin with a simple example. Below we define an XML string and look at its structure: xml_test <- "<people> <jason>   <person type'fictional'>     <first_name>       <married>         Jason       <\/married>     <\/first_name>     <last_name>         Bourne     <\/last_name>     <occupation>       Spy     <\/occupation>   <\/person> <\/jason> <carol>   <person type'real'>     <first_name>       <married>         Carol       <\/married>     <\/first_name>     <last_name>         Kalp     <\/last_name>     <occupation>       Scientist     <\/occupation>   <\/person> <\/carol> <\/people> "  cat(xml_test) ## <people> ## <jason> ##   <person type'fictional'> ##     <first_name> ##       <married> ##         Jason ##       <\/married> ##     <\/first_name> ##     <last_name> ##         Bourne ##     <\/last_name> ##     <occupation> ##       Spy ##     <\/occupation> ##   <\/person> ## <\/jason> ## <carol> ##   <person type'real'> ##     <first_name> ##       <married> ##         Carol ##       <\/married> ##     <\/first_name> ##     <last_name> ##         Kalp ##     <\/last_name> ##     <occupation> ##       Scientist ##     <\/occupation> ##   <\/person> ## <\/carol> ## <\/people> In XML and HTML the basic building blocks are something called tags. For example, the first tag in the structure shown above is <people>. This tag is matched by <\/people> at the end of the string:  If you pay close attention, you\u2019ll see that each tag in the XML structure has a beginning (signaled by <>) and an end (signaled by <\/>). For example, the next tag after <people> is <jason> and right before the tag <carol> is the end of the jason tag <\/jason>.  Similarly, you\u2019ll find that the <carol> tag is also matched by a <\/carol> finishing tag.  In theory, tags can have whatever meaning you attach to them (such as <people> or <occupation>). However, in practice there are hundreds of tags which are standard in websites (for example, here). If you\u2019re just getting started, there\u2019s no need for you to learn them but as you progress in web scraping, you\u2019ll start to recognize them (one brief example is <strong> which simply bolds text in a website). The xml2 package was designed to read XML strings and to navigate the tree structure to extract information. For example, let\u2019s read in the XML data from our fake example and look at its general structure: xml_raw <- read_xml(xml_test) xml_structure(xml_raw) ## <people> ##   <jason> ##     <person > ##       <first_name> ##         <married> ##           {text} ##       <last_name> ##         {text} ##       <occupation> ##         {text} ##   <carol> ##     <person > ##       <first_name> ##         <married> ##           {text} ##       <last_name> ##         {text} ##       <occupation> ##         {text} You can see that the structure is tree-based, meaning that tags such as <jason> and <carol> are nested within the <people> tag. In XML jargon, <people> is the root node, whereas <jason> and <carol> are the child nodes from <people>. In more detail, the structure is as follows:  The root node is <people> The child nodes are <jason> and <carol> Then each child node has nodes <first_name>, <married>, <last_name> and <occupation> nested within them.  Put another way, if something is nested within a node, then the nested node is a child of the upper-level node. In our example, the root node is <people> so we can check which are its children: # xml_child returns only one child (specified in search) # Here, jason is the first child xml_child(xml_raw, search  1) ## {xml_node} ## <jason> ##  <person type"fictional">\\n  <first_name>\\n    <married>\\n        Ja ... # Here, carol is the second child xml_child(xml_raw, search  2) ## {xml_node} ## <carol> ##  <person type"real">\\n  <first_name>\\n    <married>\\n        Carol\\n ... # Use xml_children to extract **all** children child_xml <- xml_children(xml_raw)  child_xml ## {xml_nodeset (2)} ##  <jason>\\n  <person type"fictional">\\n    <first_name>\\n      <marri ... ##  <carol>\\n  <person type"real">\\n    <first_name>\\n      <married>\\n ... Tags can also have different attributes which are usually specified as <fake_tag attribute'fake'> and ended as usual with <\/fake_tag>. If you look at the XML structure of our example, you\u2019ll notice that each <person> tag has an attribute called type. As you\u2019ll see in our real-world example, extracting these attributes is often the aim of our scraping adventure. Using xml2, we can extract all attributes that match a specific name with xml_attrs. # Extract the attribute type from all nodes xml_attrs(child_xml, "type") ## ] ## named character(0) ## ## ] ## named character(0) Wait, why didn\u2019t this work? Well, if you look at the output of child_xml, we have two nodes on which are for <jason> and <carol>. child_xml ## {xml_nodeset (2)} ##  <jason>\\n  <person type"fictional">\\n    <first_name>\\n      <marri ... ##  <carol>\\n  <person type"real">\\n    <first_name>\\n      <married>\\n ... Do these tags have an attribute? No, because if they did, they would have something like <jason type'fake_tag'>. What we need is to look down at the <person> tag within <jason> and <carol> and extract the attribute from <person>. Does this sound familiar? Both <jason> and <carol> have an associated <person> tag below them, making them their children. We can just go down one level by running xml_children on these tags and extract them. # We go down one level of children person_nodes <- xml_children(child_xml)  # <person> is now the main node, so we can extract attributes person_nodes ## {xml_nodeset (2)} ##  <person type"fictional">\\n  <first_name>\\n    <married>\\n        Ja ... ##  <person type"real">\\n  <first_name>\\n    <married>\\n        Carol\\n ... # Both type attributes xml_attrs(person_nodes, "type") ## ] ##        type ## "fictional" ## ## ] ##   type ## "real" Using the xml_path function you can even find the \u2018address\u2019 of these nodes to retrieve specific tags without having to write down xml_children many times. For example: # Specific address of each person tag for the whole xml tree # only using the `person_nodes` xml_path(person_nodes) ##  "\/people\/jason\/person" "\/people\/carol\/person" We have the \u2018address\u2019 of specific tags in the tree but how do we extract them automatically? To extract specific \u2018addresses\u2019 of this XML tree, the main function we\u2019ll use is xml_find_all. This function accepts the XML tree and an \u2018address\u2019 string. We can use very simple strings, such as the one given by xml_path: # You can use results from xml_path like directories xml_find_all(xml_raw, "\/people\/jason\/person") ## {xml_nodeset (1)} ##  <person type"fictional">\\n  <first_name>\\n    <married>\\n        Ja ... The expression above is asking for the node \"\/people\/jason\/person\". This will return the same as saying xml_raw %>% xml_child(search  1). For deeply nested trees, xml_find_all will be many times much cleaner than calling xml_child recursively many times. However, in most cases the \u2018addresses\u2019 used in xml_find_all come from a separate language called XPath (in fact, the \u2018address\u2019 we\u2019ve been looking at is XPath). XPath is a complex language (such as regular expressions for strings) which is beyond this brief tutorial. However, with the examples we\u2019ve seen so far, we can use some basic XPath which we\u2019ll need later on. To extract all the tags in a document, we can use \/\/name_of_tag. # Search for all 'married' nodes xml_find_all(xml_raw, "\/\/married") ## {xml_nodeset (2)} ##  <married>\\n        Jason\\n      <\/married> ##  <married>\\n        Carol\\n      <\/married> With the previous XPath, we\u2019re searching for all married tags within the complete XML tree. The result returns all married nodes (I use the words tags and nodes interchangeably) in the complete tree structure. Another example would be finding all <occupation> tags: xml_find_all(xml_raw, "\/\/occupation") ## {xml_nodeset (2)} ##  <occupation>\\n      Spy\\n    <\/occupation> ##  <occupation>\\n      Scientist\\n    <\/occupation> If you want to find any other tag you can replace \"\/\/occupation\" with your tag of interest and xml_find_all will find all of them. If you wanted to find all tags below your current node, you only need to add a . at the beginning: \".\/\/occupation\". For example, if we dived into the <jason> tag and we wanted his <occupation> tag, \"\/\/occupation\" will returns all <occupation> tags. Instead, \".\/\/occupation\" will return only the found tags below the current tag. For example: xml_raw %>%   # Dive only into Jason's tag   xml_child(search  1) %>%   xml_find_all(".\/\/occupation") ## {xml_nodeset (1)} ##  <occupation>\\n      Spy\\n    <\/occupation> # Instead, the wrong way would have been: xml_raw %>%   # Dive only into Jason's tag   xml_child(search  1) %>%   # Here we get both occupation tags   xml_find_all("\/\/occupation") ## {xml_nodeset (2)} ##  <occupation>\\n      Spy\\n    <\/occupation> ##  <occupation>\\n      Scientist\\n    <\/occupation> The first example only returns <jason>\u2019s occupation whereas the second returned all occupations, regardless of where you are in the tree. XPath also allows you to identify tags that contain only one specific attribute, such as the one\u2019s we saw earlier. For example, to filter all <person> tags with the attribute filter set to fictional, we could do it with: # Give me all the tags 'person' that have an attribute type'fictional' xml_raw %>%   xml_find_all("\/\/person") ## {xml_nodeset (1)} ##  <person type"fictional">\\n  <first_name>\\n    <married>\\n        Ja ... If you wanted to do the same but for the tags below your current nodes, the same trick we learned earlier would work: \".\/\/person\". These are just some primers that can help you jump easily to using XPath, but I encourage you to look at other examples on the web, as complex websites often require complex XPath expressions. Before we begin our real-word example, you might be asking yourself how you can actually extract the text\/numeric data from these nodes. Well, that\u2019s easy: xml_text. xml_raw %>%   xml_find_all(".\/\/occupation") %>%   xml_text() ##  "\\n      Spy\\n    "       "\\n      Scientist\\n    " Once you\u2019ve narrowed down your tree-based search to one single piece of text or numbers, xml_text() will extract that for you (there\u2019s also xml_double and xml_integer for extracting numbers). As I said, XPath is really a huge language. If you\u2019re interested, this XPath cheat sheets have helped me a lot to learn tricks for easy scraping.   Real-world example We\u2019re interested in making a list of many schools in Spain and visualizing their location. This can be useful for many things such as matching population density of children across different regions to school locations. The website www.buscocolegio.com contains a database of schools similar to what we\u2019re looking for. As described at the beginning, instead we\u2019re going to use scrapex which has the function spanish_schools_ex() containing the links to a sample of websites from different schools saved locally on your computer. Let\u2019s look at an example for one school. school_links <- spanish_schools_ex()  # Keep only the HTML file of one particular school. school_url <- school_links  school_url ##  "\/usr\/local\/lib\/R\/site-library\/scrapex\/extdata\/spanish_schools_ex\/school_3006839.html" If you\u2019re interested in looking at the website interactively in your browser, you can do it with browseURL(prep_browser(school_url)). Let\u2019s read the HTML (XML and HTML are usually interchangeable, so here we use read_html). # Here we use `read_html` because `read_xml` is throwing an error # when attempting to read. However, everything we've discussed # should be the same. school_raw <- read_html(school_url) %>% xml_child()  school_raw ## {html_node} ## <head> ##   <title>Aqu\u00ed encontrar\u00e1s toda la informaci\u00f3n necesaria sobre CEIP SA ... ##   <meta charset"utf-8">\\n ##   <meta name"viewport" content"widthdevice-width, initial-scale1, ... ##   <meta http-equiv"x-ua-compatible" content"ieedge">\\n ##   <meta name"author" content"BuscoColegio">\\n ##   <meta name"description" content"Encuentra toda la informaci\u00f3n nec ... ##   <meta name"keywords" content"opiniones SANCHIS GUARNER, contacto  ... ##   <link rel"shortcut icon" href"\/favicon.ico">\\n ##   <link rel"stylesheet" href"\/\/fonts.googleapis.com\/css?familyRobo ... ##  <link rel"stylesheet" href"https:\/\/s3.eu-west-3.amazonaws.com\/bus ... ##  <link rel"stylesheet" href"\/assets\/vendor\/icon-awesome\/css\/font-a ... ##  <link rel"stylesheet" href"\/assets\/vendor\/icon-line\/css\/simple-li ... ##  <link rel"stylesheet" href"\/assets\/vendor\/icon-line-pro\/style.css ... ##  <link rel"stylesheet" href"\/assets\/vendor\/icon-hs\/style.css">\\n ##  <link rel"stylesheet" href"https:\/\/s3.eu-west-3.amazonaws.com\/bus ... ##  <link rel"stylesheet" href"https:\/\/s3.eu-west-3.amazonaws.com\/bus ... ##  <link rel"stylesheet" href"https:\/\/s3.eu-west-3.amazonaws.com\/bus ... ##  <link rel"stylesheet" href"https:\/\/s3.eu-west-3.amazonaws.com\/bus ... ##  <link rel"stylesheet" href"https:\/\/s3.eu-west-3.amazonaws.com\/bus ... ##  <link rel"stylesheet" href"https:\/\/s3.eu-west-3.amazonaws.com\/bus ... ## ... Web scraping strategies are very specific to the website you\u2019re after. You have to get very familiar with the website you\u2019re interested to be able to match perfectly the information you\u2019re looking for. In many cases, scraping two websites will require vastly different strategies. For this particular example, we\u2019re only interested in figuring out the location of each school so we only have to extract its location.   In the image above you\u2019ll find a typical school\u2019s website in wwww.buscocolegio.com. The website has a lot of information, but we\u2019re only interested in the button that is circled by the orange rectangle. If you can\u2019t find it easily, it\u2019s below the Google Maps on the right which says \u201cBuscar colegio cercano\u201d. When you click on this button, this actually points you towards the coordinates of the school so we just have to find a way of figuring out how to click this button or figure out how to get its information. All browsers allow you to do this if you press CTRL + SHIFT + c at the same time (Firefox and Chrome support this hotkey). If a window on the right popped in full of code, then you\u2019re on the right track:    Here we can search the source code of the website. If you place your mouse pointer over the lines of code from this right-most window, you\u2019ll see sections of the website being highlighted in blue. This indicates which parts of the code refer to which parts of the website. Luckily for us, we don\u2019t have to search the complete source code to find that specific location. We can approximate our search by typing the text we\u2019re looking for in the search bar at the top of the right window:    After we click enter, we\u2019ll be automatically directed to the tag that has the information that we want.    More specifically, we can see that the latitude and longitude of schools are found in an attributed called href in a tag <a>:    Can you see the latitude and longitude fields in the text highlighted blue? It\u2019s hidden in-between words. That is precisely the type of information we\u2019re after. Extracting all <a> tags from the website (hint: XPath similar to \"\/\/a\") will yield hundreds of matches because <a> is a very common tag. Moreover, refining the search to <a> tags which have an href attribute will also yield hundreds of matches because href is the standard attribute to attach links within websites. We need to narrow down our search within the website. One strategy is to find the \u2018father\u2019 or \u2018grandfather\u2019 node of this particular <a> tag and then match a node which has that same sequence of grandfather -> father -> child node. By looking at the structure of this small XML snippet from the right-most window, we see that the \u2018grandfather\u2019 of this <a> tag is <p class\"d-flex align-items-baseline g-mt-5'> which has a particularly long attribute named class.    Don\u2019t be intimidated by these tag names and long attributes. I also don\u2019t know what any of these attributes mean. But what I do know is that this is the \u2018grandfather\u2019 of the <a> tag I\u2019m interested in. So using our XPath skills, let\u2019s search for that <p> tag and see if we get only one match. # Search for all <p> tags with that class in the document school_raw %>%   xml_find_all("\/\/p") ## {xml_nodeset (1)} ##  <p class"d-flex align-items-baseline g-mt-5">\\r\\n\\t                 ... Only one match, so this is good news. This means that we can uniquely identify this particular <p> tag. Let\u2019s refine the search to say: Find all <a> tags which are children of that specific <p> tag. This only means I\u2019ll add a \"\/\/a\" to the previous expression. Since there is only one <p> tag with the class, we\u2019re interested in checking whether there is more than one <a> tag below this <p> tag. school_raw %>%   xml_find_all("\/\/p\/\/a") ## {xml_nodeset (1)} ##  <a href"\/Colegio\/buscar-colegios-cercanos.action?colegio.latitud38 ... There we go! We can see the specific href that contains the latitude and longitude data we\u2019re interested in. How do we extract the href attribute? Using xml_attr as we did before! location_str <-   school_raw %>%   xml_find_all("\/\/p\/\/a") %>%   xml_attr(attr  "href")  location_str ##  "\/Colegio\/buscar-colegios-cercanos.action?colegio.latitud38.8274492&colegio.longitud0.0221681" Ok, now we need some regex skills to get only the latitude and longitude (regex expressions are used to search for patterns inside a string, such as for example a date. See here for some examples): location <-   location_str %>%   str_extract_all(".+$") %>%   str_replace_all("|colegio\\\\.longitud", "") %>%   str_split("&") %>%   .]  location ##  "38.8274492" "0.0221681" Ok, so we got the information we needed for one single school. Let\u2019s turn that into a function so we can pass only the school\u2019s link and get the coordinates back. Before we do that, I will set something called my User-Agent. In short, the User-Agent is who you are. It is good practice to identify the person who is scraping the website because if you\u2019re causing any trouble on the website, the website can directly identify who is causing problems. You can figure out your user agent here and paste it in the string below. In addition, I will add a time sleep of 5 seconds to the function because we want to make sure we don\u2019t cause any troubles to the website we\u2019re scraping due to an overload of requests. # This sets your `User-Agent` globally so that all requests are # identified with this `User-Agent` set_config(   user_agent("Mozilla\/5.0 (X11; Ubuntu; Linux x86_64; rv:70.0) Gecko\/20100101 Firefox\/70.0") )  # Collapse all of the code from above into one function called # school grabber  school_grabber <- function(school_url) {   # We add a time sleep of 5 seconds to avoid   # sending too many quick requests to the website   Sys.sleep(5)    school_raw <- read_html(school_url) %>% xml_child()    location_str <-     school_raw %>%     xml_find_all("\/\/p\/\/a") %>%     xml_attr(attr  "href")    location <-     location_str %>%     str_extract_all(".+$") %>%     str_replace_all("|colegio\\\\.longitud", "") %>%     str_split("&") %>%     .]    # Turn into a data frame   data.frame(     latitude  location,     longitude  location,     stringsAsFactors  FALSE   ) }   school_grabber(school_url) ##     latitude longitude ## 1 38.8274492 0.0221681 Ok, so it\u2019s working. The only thing left is to extract this for many schools. As shown earlier, scrapex contains a list of 27 school links that we can automatically scrape. Let\u2019s loop over those, get the information of coordinates for each and collapse all of them into a data frame. res <- map_dfr(school_links, school_grabber) res ##    latitude  longitude ## 1  42.72779 -8.6567935 ## 2  43.24439 -8.8921645 ## 3  38.95592 -1.2255769 ## 4  39.18657 -1.6225903 ## 5  40.38245 -3.6410388 ## 6  40.22929 -3.1106322 ## 7  40.43860 -3.6970366 ## 8  40.33514 -3.5155669 ## 9  40.50546 -3.3738441 ## 10 40.63826 -3.4537107 ## 11 40.38543 -3.6639500 ## 12 37.76485 -1.5030467 ## 13 38.82745  0.0221681 ## 14 40.99434 -5.6224391 ## 15 40.99434 -5.6224391 ## 16 40.56037 -5.6703725 ## 17 40.99434 -5.6224391 ## 18 40.99434 -5.6224391 ## 19 41.13593  0.9901905 ## 20 41.26155  1.1670507 ## 21 41.22851  0.5461471 ## 22 41.14580  0.8199749 ## 23 41.18341  0.5680564 ## 24 42.07820  1.8203155 ## 25 42.25245  1.8621546 ## 26 41.73767  1.8383666 ## 27 41.62345  2.0013628 So now that we have the locations of these schools, let\u2019s plot them: res <- mutate_all(res, as.numeric)  sp_sf <-   ne_countries(scale  "large", country  "Spain", returnclass  "sf") %>%   st_transform(crs  4326)  ggplot(sp_sf) +   geom_sf() +   geom_point(data  res, aes(x  longitude, y  latitude)) +   coord_sf(xlim  c(-20, 10), ylim  c(25, 45)) +   theme_minimal() +   ggtitle("Sample of schools in Spain")  There we go! We went from literally no information at the beginning of this tutorial to interpretable and summarized information only using web data. We can see some schools in Madrid (center) as well in other regions of Spain, including Catalonia and Galicia. This marks the end of our scraping adventure but before we finish, I want to mention some of the ethical guidelines for web scraping. Scraping is extremely useful for us but can give headaches to other people maintaining the website of interest. Here\u2019s a list of ethical guidelines you should always follow:  Read the terms and services: many websites prohibit web scraping and you could be in a breach of privacy by scraping the data. One famous example. Check the robots.txt file. This is a file that most websites have (www.buscocolegio.com does not) which tell you which specific paths inside the website are scrapable and which are not. See here for an explanation of what robots.txt look like and where to find them. Some websites are supported by very big servers, which means you can send 4-5 website requests per second. Others, such as www.buscocolegio.com are not. It\u2019s good practice to always put a time sleep between your requests. In our example, I set it to 5 seconds because this is a small website and we don\u2019t want to crash their servers. When making requests, there are computational ways of identifying yourself. For example, every request (such as the one\u2019s we do) can have something called a User-Agent. It is good practice to include yourself in as the User-Agent (as we did in our code) because the admin of the server can directly identify if someone\u2019s causing problems due to their web scraping. Limit your scraping to non-busy hours such as overnight. This can help reduce the chances of collapsing the website since fewer people are visiting websites in the evening.  You can read more about these ethical issues here.   Wrap up This tutorial introduced you to basic concepts in web scraping and applied them in a real-world setting. Web scraping is a vast field in computer science (you can find entire books on the subject such as this). We covered some basic techniques which I think can take you a long way but there\u2019s definitely more to learn. For those curious about where to turn, I\u2019m looking forward to the upcoming book \u201cA Field Guide for Web Scraping and Accessing APIs with R\u201d by Bob Rudis, which should be released in the near future. Now go scrape some websites ethically!","keywords":"","datePublished":"2020-02-10T18:00:00-06:00","dateModified":"2020-02-10T18:00:00-06:00","author":{"@type":"Person","name":"R on Coding Club UC3M","description":"","url":"https:\/\/www.r-bloggers.com\/author\/r-on-coding-club-uc3m\/","sameAs":["\/categories\/r\/"],"image":{"@type":"ImageObject","url":"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g","height":96,"width":96}},"publisher":{"@id":"https:\/\/www.r-bloggers.com#Organization"},"image":[{"@type":"ImageObject","url":"https:\/\/codingclubuc3m.rbind.io\/post\/2020-02-11_files\/xml_examples\/xml_one.png","width":313,"height":578,"@id":"https:\/\/www.r-bloggers.com\/2020\/02\/an-introduction-to-web-scraping-locating-spanish-schools\/#primaryimage"},{"@type":"ImageObject","url":"https:\/\/codingclubuc3m.rbind.io\/post\/2020-02-11_files\/xml_examples\/xml_two.png","width":276,"height":636},{"@type":"ImageObject","url":"https:\/\/codingclubuc3m.rbind.io\/post\/2020-02-11_files\/xml_examples\/xml_three.png","width":246,"height":637},{"@type":"ImageObject","url":"https:\/\/codingclubuc3m.rbind.io\/post\/2020-02-11_files\/buscocolegios_xml\/main_page.png","width":1920,"height":988},{"@type":"ImageObject","url":"https:\/\/codingclubuc3m.rbind.io\/post\/2020-02-11_files\/buscocolegios_xml\/developer_tools.png","width":1920,"height":988},{"@type":"ImageObject","url":"https:\/\/codingclubuc3m.rbind.io\/post\/2020-02-11_files\/buscocolegios_xml\/search_developer_tools.png","width":729,"height":512},{"@type":"ImageObject","url":"https:\/\/codingclubuc3m.rbind.io\/post\/2020-02-11_files\/buscocolegios_xml\/location_tag.png","width":1920,"height":988},{"@type":"ImageObject","url":"https:\/\/codingclubuc3m.rbind.io\/post\/2020-02-11_files\/buscocolegios_xml\/location_tag_zoomed.png","width":717,"height":468},{"@type":"ImageObject","url":"https:\/\/codingclubuc3m.rbind.io\/post\/2020-02-11_files\/buscocolegios_xml\/location_tag_zoomed.png","width":717,"height":468},{"@type":"ImageObject","url":"https:\/\/codingclubuc3m.rbind.io\/post\/2020-02-11_files\/figure-html\/unnamed-chunk-34-1.png","width":1344,"height":830}],"isPartOf":{"@id":"https:\/\/www.r-bloggers.com\/2020\/02\/an-introduction-to-web-scraping-locating-spanish-schools\/#webpage"}}]}]
</script>

    <script>
        var snp_f = [];
        var snp_hostname = new RegExp(location.host);
        var snp_http = new RegExp("^(http|https)://", "i");
        var snp_cookie_prefix = '';
        var snp_separate_cookies = false;
        var snp_ajax_url = 'https://www.r-bloggers.com/wp-admin/admin-ajax.php';
		var snp_ajax_nonce = '381996bdf4';
        var snp_ignore_cookies = false;
        var snp_enable_analytics_events = false;
        var snp_enable_mobile = false;
        var snp_use_in_all = false;
        var snp_excluded_urls = [];
        snp_excluded_urls.push('');    </script>
    <div class="snp-root">
        <input type="hidden" id="snp_popup" value="" />
        <input type="hidden" id="snp_popup_id" value="" />
        <input type="hidden" id="snp_popup_theme" value="" />
        <input type="hidden" id="snp_exithref" value="" />
        <input type="hidden" id="snp_exittarget" value="" />
        	<div id="snppopup-welcome" class="snp-pop-109583 snppopup"><input type="hidden" class="snp_open" value="scroll" /><input type="hidden" class="snp_show_on_exit" value="2" /><input type="hidden" class="snp_exit_js_alert_text" value="" /><input type="hidden" class="snp_exit_scroll_down" value="" /><input type="hidden" class="snp_exit_scroll_up" value="" /><input type="hidden" class="snp_open_scroll" value="50" /><input type="hidden" class="snp_optin_redirect_url" value="" /><input type="hidden" class="snp_show_cb_button" value="yes" /><input type="hidden" class="snp_popup_id" value="109583" /><input type="hidden" class="snp_popup_theme" value="theme6" /><input type="hidden" class="snp_overlay" value="disabled" /><input type="hidden" class="snp_cookie_conversion" value="30" /><input type="hidden" class="snp_cookie_close" value="180" /><div class="snp-fb snp-theme6">
    <div class="snp-subscribe-inner">
	<h1 class="snp-header"><i>Never miss an update! </i>
<br/>
<strong>Subscribe to R-bloggers</strong> to receive <br/>e-mails with the latest R posts.<br/>

<small>(You will not see this message again.)</small></h1>	<div class="snp-form">
	    <form action="https://feedburner.google.com/fb/a/mailverify?uri=RBloggers" method="post" class="snp-subscribeform snp_subscribeform">
				<fieldset>
		    <div class="snp-field">
			<input type="text" name="email" id="snp_email" placeholder="Your E-mail..." class="snp-field snp-field-email" />		
		    </div>
		    <button type="submit" class="snp-submit">Submit</button>
		</fieldset>
	    </form>
	</div>
	<a href="#" class="snp_nothanks snp-close">Click here to close (This popup will not appear again)</a>    </div>
    </div>
<style>.snp-pop-109583 .snp-theme6 { max-width: 700px;}
.snp-pop-109583 .snp-theme6 h1 {font-size: 17px;}
.snp-pop-109583 .snp-theme6 { color: #a0a4a9;}
.snp-pop-109583 .snp-theme6 .snp-field ::-webkit-input-placeholder { color: #a0a4a9;}
.snp-pop-109583 .snp-theme6 .snp-field :-moz-placeholder { color: #a0a4a9;}
.snp-pop-109583 .snp-theme6 .snp-field :-ms-input-placeholder { color: #a0a4a9;}
.snp-pop-109583  .snp-theme6 .snp-field input { border: 1px solid #a0a4a9;}
.snp-pop-109583 .snp-theme6 .snp-field { color: #000000;}
.snp-pop-109583 .snp-theme6 { background: #f2f2f2;}
</style><script>
jQuery(document).ready(function() {
});
</script>
</div>        <script type="text/javascript">
            var CaptchaCallback = function() {
                jQuery('.g-recaptcha').each(function(index, el) {
                    grecaptcha.render(el, {
                        'sitekey' : ''
                    });
                });
            };
        </script>
    </div>
    <script type="text/javascript">/* <![CDATA[ */!function(e,n){var r={"selectors":{"block":"pre","inline":"code"},"options":{"indent":4,"ampersandCleanup":true,"linehover":true,"rawcodeDbclick":false,"textOverflow":"scroll","linenumbers":false,"theme":"enlighter","language":"r","retainCssClasses":false,"collapse":false,"toolbarOuter":"","toolbarTop":"{BTN_RAW}{BTN_COPY}{BTN_WINDOW}{BTN_WEBSITE}","toolbarBottom":""},"resources":["https:\/\/www.r-bloggers.com\/wp-content\/plugins\/enlighter\/cache\/enlighterjs.min.css?vVCnEZeurtkU0vr","https:\/\/www.r-bloggers.com\/wp-content\/plugins\/enlighter\/\/resources\/enlighterjs\/enlighterjs.min.js"]},o=document.getElementsByTagName("head")[0],t=n&&(n.error||n.log)||function(){};e.EnlighterJSINIT=function(){!function(e,n){var r=0,l=null;function c(o){l=o,++r==e.length&&(!0,n(l))}e.forEach(function(e){switch(e.match(/\.([a-z]+)(?:[#?].*)?$/)[1]){case"js":var n=document.createElement("script");n.onload=function(){c(null)},n.onerror=c,n.src=e,n.async=!0,o.appendChild(n);break;case"css":var r=document.createElement("link");r.onload=function(){c(null)},r.onerror=c,r.rel="stylesheet",r.type="text/css",r.href=e,r.media="all",o.appendChild(r);break;default:t("Error: invalid file extension",e)}})}(r.resources,function(e){e?t("Error: failed to dynamically load EnlighterJS resources!",e):"undefined"!=typeof EnlighterJS?EnlighterJS.init(r.selectors.block,r.selectors.inline,r.options):t("Error: EnlighterJS resources not loaded yet!")})},(document.querySelector(r.selectors.block)||document.querySelector(r.selectors.inline))&&e.EnlighterJSINIT()}(window,console); /* ]]> */</script><script type='text/javascript' src='https://www.r-bloggers.com/wp-content/plugins/arscode-ninja-popups/js/jquery.ck.min.js?ver=5.5.1' id='jquery-np-cookie-js'></script>
<script type='text/javascript' src='https://www.r-bloggers.com/wp-content/plugins/arscode-ninja-popups/js/dialog_trigger.js?ver=5.5.1' id='js-dialog_trigger-js'></script>
<script type='text/javascript' src='https://www.r-bloggers.com/wp-content/plugins/arscode-ninja-popups/js/ninjapopups.min.js?ver=5.5.1' id='js-ninjapopups-js'></script>
<script type='text/javascript' src='https://www.r-bloggers.com/wp-content/plugins/arscode-ninja-popups/fancybox2/jquery.fancybox.min.js?ver=5.5.1' id='fancybox2-js'></script>
<script type='text/javascript' src='https://c0.wp.com/p/jetpack/7.3.2/_inc/build/photon/photon.min.js' id='jetpack-photon-js'></script>
<script type='text/javascript' id='flying-pages-js-before'>
window.FPConfig= {
	delay: 0,
	ignoreKeywords: ["\/wp-admin","\/wp-login.php","\/cart","add-to-cart","logout","#","?",".png",".jpeg",".jpg",".gif",".svg"],
	maxRPS: 3,
    hoverDelay: 50
};
</script>
<script type='text/javascript' src='https://www.r-bloggers.com/wp-content/plugins/flying-pages/flying-pages.min.js?ver=2.4.2' id='flying-pages-js' defer></script>
<script type='text/javascript' src='https://s0.wp.com/wp-content/js/devicepx-jetpack.js?ver=202040' id='devicepx-js'></script>
<script type='text/javascript' src='https://c0.wp.com/p/jetpack/7.3.2/_inc/build/lazy-images/js/lazy-images.min.js' id='jetpack-lazy-images-js'></script>
<script type='text/javascript' src='https://c0.wp.com/c/5.5.1/wp-includes/js/wp-embed.min.js' id='wp-embed-js'></script>
<script type='text/javascript' src='https://stats.wp.com/e-202040.js' async='async' defer='defer'></script>
<script type='text/javascript'>
	_stq = window._stq || [];
	_stq.push([ 'view', {v:'ext',j:'1:7.3.2',blog:'11524731',post:'193096',tz:'-6',srv:'www.r-bloggers.com'} ]);
	_stq.push([ 'clickTrackerInit', '11524731', '193096' ]);
</script>
	<script type="text/javascript">
        jQuery(document).ready(function ($) {
            //$( document ).ajaxStart(function() {
            //});

			
            for (var i = 0; i < document.forms.length; ++i) {
                var form = document.forms[i];
				if ($(form).attr("method") != "get") { $(form).append('<input type="hidden" name="PnM-zH_AKJNfBeFs" value="v*TMZOlu5zynohG6" />'); }
if ($(form).attr("method") != "get") { $(form).append('<input type="hidden" name="-cRxUiIS" value="6UaC7c1T" />'); }
if ($(form).attr("method") != "get") { $(form).append('<input type="hidden" name="VXUxhtONi" value="Nc8akts2n" />'); }
if ($(form).attr("method") != "get") { $(form).append('<input type="hidden" name="GyoDAYLMZ" value="yE2gzT5Zm" />'); }
            }

			
            $(document).on('submit', 'form', function () {
				if ($(this).attr("method") != "get") { $(this).append('<input type="hidden" name="PnM-zH_AKJNfBeFs" value="v*TMZOlu5zynohG6" />'); }
if ($(this).attr("method") != "get") { $(this).append('<input type="hidden" name="-cRxUiIS" value="6UaC7c1T" />'); }
if ($(this).attr("method") != "get") { $(this).append('<input type="hidden" name="VXUxhtONi" value="Nc8akts2n" />'); }
if ($(this).attr("method") != "get") { $(this).append('<input type="hidden" name="GyoDAYLMZ" value="yE2gzT5Zm" />'); }
                return true;
            });

			
            jQuery.ajaxSetup({
                beforeSend: function (e, data) {

                    //console.log(Object.getOwnPropertyNames(data).sort());
                    //console.log(data.type);

                    if (data.type !== 'POST') return;

                    if (typeof data.data === 'object' && data.data !== null) {
						data.data.append("PnM-zH_AKJNfBeFs", "v*TMZOlu5zynohG6");
data.data.append("-cRxUiIS", "6UaC7c1T");
data.data.append("VXUxhtONi", "Nc8akts2n");
data.data.append("GyoDAYLMZ", "yE2gzT5Zm");
                    }
                    else {
                        data.data =  data.data + '&PnM-zH_AKJNfBeFs=v*TMZOlu5zynohG6&-cRxUiIS=6UaC7c1T&VXUxhtONi=Nc8akts2n&GyoDAYLMZ=yE2gzT5Zm';
                    }
                }
            });

        });
	</script>
	</body>
</html>
<!-- Dynamic page generated in 1.088 seconds. -->
<!-- Cached page generated by WP-Super-Cache on 2020-09-28 05:59:48 -->

<!-- Compression = gzip -->