Step-by-Step Guide to Use R and Selenium to Scrape Empleos Publicos

[This article was first published on pacha.dev/blog, and kindly contributed to R-bloggers.]

Because of delays with my scholarship payment, if this post is useful to you I kindly ask for a small donation on Buy Me a Coffee. It will be used to continue my Open Source efforts. The full explanation is in A Personal Message from an Open Source Contributor.

Motivation

My friend Nicolas Didier asked me about reading Empleos Publicos with R or Python. Here is a short example for him and anybody who may benefit from reading this.

The following steps were adapted from a tutorial I taught at the University of Michigan (GO BLUE!) in 2023.

Required R packages

  • RSelenium: R-Selenium integration
  • rvest: HTML processing
  • dplyr: to load the pipe operator (can be used later for data cleaning)
  • purrr: iteration (i.e., repeated operations)

I installed RSelenium from the R console:

if (!require(RSelenium)) install.packages("RSelenium")

# or

remotes::install_github("ropensci/RSelenium")

For the rest of the packages:

if (!require(rvest)) install.packages("rvest")
if (!require(dplyr)) install.packages("dplyr")
if (!require(purrr)) install.packages("purrr")
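
The same check can be done for all four packages at once. The following is a minimal sketch, not part of the original tutorial: missing_packages() is a hypothetical helper name, and the installation line is left commented out so that nothing gets installed by accident.

```r
# Hypothetical helper: returns the packages from `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("RSelenium", "rvest", "dplyr", "purrr"))
print(to_install)

# Uncomment to install whatever is missing:
# if (length(to_install) > 0) install.packages(to_install)
```

An empty character vector means that everything needed for this post is already available.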

Installing Selenium and Chrome/Chromium

Note for Ubuntu/Debian users: We need to check that Chrome or Chromium is installed on our system. One of the many options is to install it from the bash console.

sudo add-apt-repository ppa:savoury1/chromium
sudo apt update
sudo apt install chromium-browser
sudo apt install chromium-chromedriver

Not using the PPA will install the snap version of Chromium, which is not compatible with Selenium.

I tried to start Selenium as described in the official guide, but it did not work.

I had to install Chromium. I am on Manjaro, so I ran sudo pacman -S chromium. Windows/Mac users can use Google Chrome.

An extra requirement was to download Selenium Server. Based on this, I started by creating a directory to store the data for this post by typing this in the VS Code terminal:

mkdir -p /tmp/didier-example
cd /tmp/didier-example

Then I opened R and downloaded the JAR file:

url_jar <- "https://github.com/SeleniumHQ/selenium/releases/download/selenium-3.9.1/selenium-server-standalone-3.9.1.jar"
sel_jar <- "selenium-server-standalone-3.9.1.jar"

if (!file.exists(sel_jar)) {
  # mode = "wb" keeps the JAR intact on Windows (binary download)
  download.file(url_jar, sel_jar, mode = "wb")
}

I had to run Selenium from a new terminal:

cd /tmp/didier-example
java -jar selenium-server-standalone-3.9.1.jar

Back in the R console, I was finally able to control the browser from R:

library(RSelenium)
library(rvest)
library(dplyr)
library(purrr)

rmDr <- remoteDriver(port = 4444L, browserName = "chrome")

rmDr$open(silent = TRUE)

url <- "https://www.empleospublicos.cl"

rmDr$navigate(url)

This should display a new Chrome/Chromium window that says “Chrome is being controlled by automated test software”.
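
If the window does not appear, a first thing to check is whether Selenium Server is accepting connections on port 4444. The following is a minimal base R sketch under that assumption; port_is_open() is a hypothetical helper, not an RSelenium function.

```r
# Hypothetical helper: TRUE if something accepts TCP connections on host:port.
port_is_open <- function(host = "localhost", port = 4444L, timeout = 2) {
  con <- tryCatch(
    suppressWarnings(
      socketConnection(
        host = host, port = port, server = FALSE,
        blocking = TRUE, open = "r+", timeout = timeout
      )
    ),
    error = function(e) NULL
  )
  if (is.null(con)) {
    return(FALSE)
  }
  close(con)
  TRUE
}

if (port_is_open()) {
  message("Something is listening on port 4444 (probably Selenium Server).")
} else {
  message("Nothing is listening on port 4444; start the JAR file first.")
}
```

This only tells whether the port is reachable. It does not verify that the process behind it is Selenium or that the Chrome driver is correctly installed.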

Scraping the data

Using the browser’s inspector (Ctrl + Shift + I), I explored the page and found that the search bar corresponds to:

<input class="buscador-principal search form-control buscador-movil" name="q" type="search" autocomplete="off" placeholder="Ingresa el cargo, comuna o institución" id="buscadorprincipal">

For example, I can search for “Ministerio de Salud” because there were many posts by that organization on the landing page:

search_box <- rmDr$findElement(using = "id", value = "buscadorprincipal")
search_box$sendKeysToElement(list("Ministerio de Salud", key = "enter"))

That typed “Ministerio de Salud” and pressed Enter on my behalf. Inspecting the results, I see that each job offer starts with:

<div class="items col-md-4 col-lg-4 postulacion ...

The first offer listed is this:

<div class="items col-md-4 col-lg-4 postulacion otro otro eepp region7renta3calidad2 busqueda "><div class="item"><div class="top"><div class="label label-estado"><i class="fa fa-circle circulo-status1" aria-hidden="true"></i> Postulación hasta 30/09/2025 23:59:00</div><h3><a target="_blank" href="https://www.empleospublicos.cl/pub/convocatorias/convpostularavisoTrabajo.aspx?i=130648&c=0&j=0&tipo=convpostularavisoTrabajo" onclick="ga('send', 'event', 'convocatorias', 'Medico (a) especialista en Anestesiología 44 horas | Servicio de Salud Maule / Hospital de Constitución', 'eepp');">Medico (a) especialista en Anestesiología 44 horas</a></h3><p>Servicio de Salud Maule / Hospital de Constitución</p></div><hr><div class="cnt"><p>Ministerio de Salud</p><p>Constitución</p><br><div class="alert alert-primer"><i class="fa fa-address-card" aria-hidden="true"></i>  No pide experiencia</div><div class="row card-footer"><div class="col-xs-9 col-md-8 text-left"><a class="cronograma btn " url="https://www.empleospublicos.cl/pub/convocatorias/convpostularavisoTrabajo.aspx?i=130648&c=0&j=0&tipo=convpostularavisoTrabajo" onclick="return false;" href="#" title="Ver Cronograma de la Convocatoria"><i class="fa fa-calendar-days"></i> Calendarización</a>
        <div class="compartir-social">      
            <div class="row">
                <div class="col-xs-3 col-md-4">
                    <a class="btn" onclick="enviarRS('t', 'https://www.empleospublicos.cl/pub/convocatorias/convpostularavisoTrabajo.aspx?i=130648&c=0&j=0&tipo=convpostularavisoTrabajo', 'Medico (a) especialista en Anestesiología 44 horas Servicio de Salud Maule / Hospital de Constitución'); return false;" href="#" target="_blank" title="Compartir en Twitter"><i class="fa-brands fa-square-x-twitter fa-xl" aria-hidden="true"></i></a>
                </div>
                <div class="col-xs-3 col-md-4">
                    <a class="btn" onclick="enviarRS('f', 'https://www.empleospublicos.cl/pub/convocatorias/convpostularavisoTrabajo.aspx?i=130648&c=0&j=0&tipo=convpostularavisoTrabajo', 'Medico (a) especialista en Anestesiología 44 horas Servicio de Salud Maule / Hospital de Constitución'); return false;" href="#" target="_blank" title="Compartir en Facebook"><i class="fa-brands fa-square-facebook fa-xl" aria-hidden="true"></i></a>
                </div>
                <div class="col-xs-3 col-md-4">
                    <a class="btn" onclick="enviarRS('l', 'https://www.empleospublicos.cl/pub/convocatorias/convpostularavisoTrabajo.aspx?i=130648&c=0&j=0&tipo=convpostularavisoTrabajo', 'Medico (a) especialista en Anestesiología 44 horas Servicio de Salud Maule / Hospital de Constitución'); return false;" href="#" target="_blank" title="Compartir en Linkedin"><i class="fa-brands fa-linkedin fa-xl" aria-hidden="true"></i></a>
                </div>
                <div class="col-xs-3 col-md-4">
                    <a class="btn whatsapp-link visible-xs visible-sm" title="Compartir en Whatsapp" onclick="enviarRS('w', 'https://www.empleospublicos.cl/pub/convocatorias/convpostularavisoTrabajo.aspx?i=130648&c=0&j=0&tipo=convpostularavisoTrabajo', 'Medico (a) especialista en Anestesiología 44 horas Servicio de Salud Maule / Hospital de Constitución'); return false;" href="#" data-action="share/whatsapp/share"><i class="fa-brands fa-square-whatsapp fa-xl" aria-hidden="true"></i></a>
                </div>
            </div>
        </div>
    <div class="row"><div class="col-md-12 card-footer-contenido "></div></div></div></div></div></div></div>
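
Depending on the connection, the results may not be fully rendered right after the search is submitted. One option is to poll until at least one offer is present before reading the page. The following is a minimal sketch with a hypothetical helper, wait_for(), written in base R; the toy condition is only there so that the example runs without a browser.

```r
# Hypothetical helper: calls `condition()` until it returns TRUE or until
# `timeout` seconds have passed. Returns TRUE on success, FALSE on timeout.
wait_for <- function(condition, timeout = 10, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    ok <- tryCatch(isTRUE(condition()), error = function(e) FALSE)
    if (ok) {
      return(TRUE)
    }
    if (Sys.time() >= deadline) {
      return(FALSE)
    }
    Sys.sleep(interval)
  }
}

# Toy condition: it becomes TRUE on the third call.
calls <- 0
wait_for(function() {
  calls <<- calls + 1
  calls >= 3
}, timeout = 5, interval = 0.1)

# With the session from this post, the condition could be (not run here):
# wait_for(function() {
#   length(rmDr$findElements(using = "css selector", value = "div.items")) > 0
# })
```

Errors raised inside the condition count as "not ready yet", so a page that is still loading does not stop the script.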
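
Based on the HTML structure of an offer, I read the page source with rvest and extracted the position, organization and city of each offer:

html <- read_html(rmDr$getPageSource()[[1]])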

offers <- html %>%
  html_nodes("div.items")

offers_tbl <- map_df(offers, function(offer) {
  # Extract position (job title)
  position <- offer %>%
    html_node("h3 a") %>%
    html_text(trim = TRUE)
  
  # Extract organization (usually the first <p> inside .top)
  organization <- offer %>%
    html_node(".top p") %>%
    html_text(trim = TRUE)
  
  # Extract city (the second <p> inside .cnt)
  city <- offer %>%
    html_nodes(".cnt p") %>%
    .[2] %>%
    html_text(trim = TRUE)
  
  tibble(
    position = position,
    organization = organization,
    city = city
  )
})
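
The same selectors can also return the link and the application deadline of each offer. The sketch below is an extension under stated assumptions, not part of the original code: it runs on a trimmed copy of the first offer, so no browser is needed, parse_offer() is a hypothetical helper name, and it assumes that the deadline always sits in the element with class label-estado.

```r
library(rvest)
library(dplyr)
library(purrr)

# Trimmed copy of the first offer (sharing buttons and footer removed).
snippet <- '
<div class="items col-md-4 col-lg-4 postulacion">
  <div class="item">
    <div class="top">
      <div class="label label-estado">Postulación hasta 30/09/2025 23:59:00</div>
      <h3><a href="https://www.empleospublicos.cl/pub/convocatorias/convpostularavisoTrabajo.aspx?i=130648&amp;c=0&amp;j=0&amp;tipo=convpostularavisoTrabajo">Medico (a) especialista en Anestesiología 44 horas</a></h3>
      <p>Servicio de Salud Maule / Hospital de Constitución</p>
    </div>
    <div class="cnt"><p>Ministerio de Salud</p><p>Constitución</p></div>
  </div>
</div>'

# Hypothetical helper: one row per offer with title, link and deadline.
parse_offer <- function(offer) {
  link <- offer %>% html_node("h3 a")
  status <- offer %>%
    html_node(".label-estado") %>%
    html_text(trim = TRUE)

  tibble(
    position = html_text(link, trim = TRUE),
    url = html_attr(link, "href"),
    # Drop everything before the first digit, keeping the date and time
    deadline = sub("^[^0-9]*", "", status)
  )
}

offers_demo <- read_html(snippet, encoding = "UTF-8") %>%
  html_nodes("div.items") %>%
  map_df(parse_offer)

print(offers_demo)
```

With the live page, html from the post can replace read_html(snippet, encoding = "UTF-8") in the last pipeline.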

The result has the following structure:

offers_tbl
# A tibble: 552 × 3
   position                                                   organization city 
   <chr>                                                      <chr>        <chr>
 1 Medico (a) especialista en Anestesiología 44 horas         Servicio de… Cons…
 2 Titulares de la Planta Profesional Ley 18.834              Servicio de… Valp…
 3 ENFERMERA-O, JORNADA DIURNA, GRADO 12, PARA SERVICIO CLÍN… Servicio de… Reco…
 4 Psiquiatra infanto-juvenil sistema de atención intersecto… Servicio de… La P…
 5 Neurólogo(a) adulto GES Alzheimer y otras demencias        Servicio de… Puen…
 6 Médico(a) especialista en Neurología Infantil Hospital de… Servicio de… Cast…
 7 Arquitecto de Software                                     Central de … Ñuñoa
 8 TENS OPERADOR DE EQUIPOS DE ESTERILIZACIÓN                 Servicio de… Peña…
 9 (850-2892) Médico Especialista Broncopulmonar o Internist… Servicio de… Talc…
10 Enfermero(a) Clínico(a) Atención Abierta y Cerrada         Servicio de… Huas…

glimpse(offers_tbl)
Rows: 552
Columns: 3
$ position     <chr> "Medico (a) especialista en Anestesiología 44 horas", "Ti…
$ organization <chr> "Servicio de Salud Maule / Hospital de Constitución", "Se…
$ city         <chr> "Constitución", "Valparaíso", "Recoleta", "La Pintana", "…
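
As an example of the data cleaning that dplyr allows, the organization column can be split at the slash into two parts. This is only a sketch: the first row is copied from the output above, the second row is made up to show what happens when there is no slash, and the column names service and establishment are my own labels, not terms used by the website.

```r
library(dplyr)

offers_sample <- tibble(
  position = c(
    "Medico (a) especialista en Anestesiología 44 horas",
    "Hypothetical position"
  ),
  organization = c(
    "Servicio de Salud Maule / Hospital de Constitución",
    "Hypothetical organization"
  ),
  city = c("Constitución", "Hypothetical city")
)

offers_clean <- offers_sample %>%
  mutate(
    # Text before the first slash (the whole string if there is no slash)
    service = trimws(sub("/.*$", "", organization)),
    # Text after the first slash, or NA when there is no slash
    establishment = if_else(
      grepl("/", organization, fixed = TRUE),
      trimws(sub("^[^/]*/", "", organization)),
      NA_character_
    )
  )

print(offers_clean)

# On the full table, a count by city could be (not run here):
# offers_tbl %>% count(city, sort = TRUE)
```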

I know this is a simple example, but it should allow different kinds of exploration and data extraction. I hope it helps.
