Step-by-Step Guide to Use R and Selenium to Scrape Empleos Publicos
Because of delays with my scholarship payment, if this post is useful to you I kindly ask a minimal donation on Buy Me a Coffee. It shall be used to continue my Open Source efforts. The full explanation is here: A Personal Message from an Open Source Contributor.
Motivation
My friend Nicolas Didier asked me about reading Empleos Publicos with R or Python. Here is a short example for him and anybody who may benefit from reading this.
The following steps were adapted from a tutorial I taught at the University of Michigan (GO BLUE!) in 2023.
Required R packages
- RSelenium: R-Selenium integration
- rvest: HTML processing
- dplyr: to load the pipe operator (can be used later for data cleaning)
- purrr: iteration (i.e., repeated operations)
I installed RSelenium from the R console:
if (!require(RSelenium)) install.packages("RSelenium") # or remotes::install_github("ropensci/RSelenium")
For the rest of the packages:
if (!require(rvest)) install.packages("rvest")
if (!require(dplyr)) install.packages("dplyr")
if (!require(purrr)) install.packages("purrr")
Installing Selenium and Chrome/Chromium
Note for Ubuntu/Debian users: We need to check that chrome or chromium is installed in our system. One of the many options is to use the bash console.
sudo add-apt-repository ppa:savoury1/chromium
sudo apt update
sudo apt install chromium-browser
sudo apt install chromium-chromedriver
Not using the PPA will install the snap version of Chromium, which is not compatible with Selenium.
I tried to start Selenium as described in the official guide, and it did not work.
I had to install Chromium. I am on Manjaro, so I ran sudo pacman -S chromium. Windows/Mac users can use Google Chrome.
An extra requirement was to download Selenium Server. Based on this, I started by creating a directory to store the data for this post by typing this in the VS Code terminal:
mkdir -p /tmp/didier-example
cd /tmp/didier-example
Then I opened R by typing R in the same terminal and downloaded the JAR file:
url_jar <- "https://github.com/SeleniumHQ/selenium/releases/download/selenium-3.9.1/selenium-server-standalone-3.9.1.jar"
sel_jar <- "selenium-server-standalone-3.9.1.jar"
if (!file.exists(sel_jar)) {
  download.file(url_jar, sel_jar)
}
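The standalone server is a Java program, so the next step needs a java executable on the PATH. Here is a minimal sketch of a pre-flight check from R, using only base functions. The helper name check_selenium_prereqs is hypothetical and not part of RSelenium.

```r
# Hypothetical helper (not part of RSelenium): report whether the pieces
# needed to launch the standalone server are in place.
check_selenium_prereqs <- function(jar = "selenium-server-standalone-3.9.1.jar") {
  # Sys.which() returns "" when the executable is not on the PATH
  java_path <- unname(Sys.which("java"))
  list(
    java_found = nzchar(java_path),
    java_path = java_path,
    jar_found = file.exists(jar)
  )
}

# Run from the directory where the JAR was downloaded
str(check_selenium_prereqs())
```

If java_found is FALSE, installing a Java runtime with the system package manager would be the first thing to try.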
I had to run Selenium from a new terminal:
cd /tmp/didier-example
java -jar selenium-server-standalone-3.9.1.jar
Back in the R terminal, I was finally able to control the browser from R:
library(RSelenium)
library(rvest)
library(dplyr)
library(purrr)

rmDr <- remoteDriver(port = 4444L, browserName = "chrome")
rmDr$open(silent = TRUE)

url <- "https://www.empleospublicos.cl"
rmDr$navigate(url)
This should display a new Chrome/Chromium window that says “Chrome is being controlled by automated test software”.
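If the window does not appear, one thing to rule out is that the Selenium server is not listening yet. The sketch below uses only base R and tries to open a TCP connection to port 4444 (the port used above) on localhost. The helper name selenium_is_up is hypothetical, and a successful connection only shows that something is listening on that port, not that it is Selenium.

```r
# Hypothetical helper: TRUE if a TCP connection to host:port can be opened
selenium_is_up <- function(host = "localhost", port = 4444L, timeout = 2) {
  con <- tryCatch(
    suppressWarnings(
      socketConnection(host = host, port = port, open = "r+",
                       blocking = TRUE, timeout = timeout)
    ),
    error = function(e) NULL  # connection refused or timed out
  )
  if (is.null(con)) {
    return(FALSE)
  }
  close(con)
  TRUE
}

selenium_is_up()
```

When this returns FALSE, the terminal running java -jar is the place to look for error messages.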
Scraping the data
Using the browser’s inspector (ctrl + shift + i), I explored the page to see that the search bar corresponds to:
<input class="buscador-principal search form-control buscador-movil" name="q" type="search" autocomplete="off" placeholder="Ingresa el cargo, comuna o institución" id="buscadorprincipal">
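The tag carries both an id and a name attribute, and RSelenium's findElement() accepts either as a locator (using = "id", "name", or "css selector", among others). As a small offline check, assuming rvest is installed, the snippet below parses the same tag and confirms that both attributes point at the same element.

```r
library(rvest)

# The search box markup copied from the page
input_html <- '<input class="buscador-principal search form-control buscador-movil" name="q" type="search" autocomplete="off" placeholder="Ingresa el cargo, comuna o institución" id="buscadorprincipal">'

page <- read_html(input_html)

# Two CSS selectors that should match the same <input>
by_id <- page %>% html_node("#buscadorprincipal")
by_name <- page %>% html_node("input[name='q']")

html_attr(by_id, "name")  # "q"
html_attr(by_name, "id")  # "buscadorprincipal"
```

I would still prefer the id here, since ids are meant to be unique within a page while a name such as q could be reused by another form.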
For example, I can search for “Ministerio de Salud” because there were many posts by that organization on the landing page:
search_box <- rmDr$findElement(using = "id", value = "buscadorprincipal")
search_box$sendKeysToElement(list("Ministerio de Salud", key = "enter"))
That typed “Ministerio de Salud” and pressed Enter on my behalf. Inspecting the results, I see that each job offer starts with
<div class="items col-md-4 col-lg-4 postulacion ...
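The results may take a moment to render after the search is submitted, so it can be worth waiting until at least one of these div elements exists before reading the page source. Here is a minimal polling sketch in base R. The helper name wait_until and its defaults are assumptions for illustration, not part of RSelenium.

```r
# Hypothetical helper: call condition() repeatedly until it returns TRUE
# or until `timeout` seconds have passed. Errors count as "not ready yet".
wait_until <- function(condition, timeout = 10, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    ok <- tryCatch(isTRUE(condition()), error = function(e) FALSE)
    if (ok) return(TRUE)
    if (Sys.time() >= deadline) return(FALSE)
    Sys.sleep(interval)
  }
}

# Offline demonstration: the condition becomes TRUE on the third call
calls <- 0
ready <- wait_until(function() {
  calls <<- calls + 1
  calls >= 3
}, timeout = 5, interval = 0.1)

# With the browser session from this post (needs the running server):
# wait_until(function() {
#   length(rmDr$findElements(using = "css selector", value = "div.items")) > 0
# })
```

A fixed Sys.sleep() would also work, but polling returns as soon as the offers are present and gives up cleanly when they never show.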
The first offer listed is this:
<div class="items col-md-4 col-lg-4 postulacion otro otro eepp region7renta3calidad2 busqueda "><div class="item"><div class="top"><div class="label label-estado"><i class="fa fa-circle circulo-status1" aria-hidden="true"></i> Postulación hasta 30/09/2025 23:59:00</div><h3><a target="_blank" href="https://www.empleospublicos.cl/pub/convocatorias/convpostularavisoTrabajo.aspx?i=130648&c=0&j=0&tipo=convpostularavisoTrabajo" onclick="ga('send', 'event', 'convocatorias', 'Medico (a) especialista en Anestesiología 44 horas | Servicio de Salud Maule / Hospital de Constitución', 'eepp');">Medico (a) especialista en Anestesiología 44 horas</a></h3><p>Servicio de Salud Maule / Hospital de Constitución</p></div><hr><div class="cnt"><p>Ministerio de Salud</p><p>Constitución</p><br><div class="alert alert-primer"><i class="fa fa-address-card" aria-hidden="true"></i> No pide experiencia</div><div class="row card-footer"><div class="col-xs-9 col-md-8 text-left"><a class="cronograma btn " url="https://www.empleospublicos.cl/pub/convocatorias/convpostularavisoTrabajo.aspx?i=130648&c=0&j=0&tipo=convpostularavisoTrabajo" onclick="return false;" href="#" title="Ver Cronograma de la Convocatoria"><i class="fa fa-calendar-days"></i> Calendarización</a> <div class="compartir-social"> <div class="row"> <div class="col-xs-3 col-md-4"> <a class="btn" onclick="enviarRS('t', 'https://www.empleospublicos.cl/pub/convocatorias/convpostularavisoTrabajo.aspx?i=130648&c=0&j=0&tipo=convpostularavisoTrabajo', 'Medico (a) especialista en Anestesiología 44 horas Servicio de Salud Maule / Hospital de Constitución'); return false;" href="#" target="_blank" title="Compartir en Twitter"><i class="fa-brands fa-square-x-twitter fa-xl" aria-hidden="true"></i></a> </div> <div class="col-xs-3 col-md-4"> <a class="btn" onclick="enviarRS('f', 'https://www.empleospublicos.cl/pub/convocatorias/convpostularavisoTrabajo.aspx?i=130648&c=0&j=0&tipo=convpostularavisoTrabajo', 'Medico (a) especialista en Anestesiología 44 horas Servicio de Salud Maule / Hospital de Constitución'); return false;" href="#" target="_blank" title="Compartir en Facebook"><i class="fa-brands fa-square-facebook fa-xl" aria-hidden="true"></i></a> </div> <div class="col-xs-3 col-md-4"> <a class="btn" onclick="enviarRS('l', 'https://www.empleospublicos.cl/pub/convocatorias/convpostularavisoTrabajo.aspx?i=130648&c=0&j=0&tipo=convpostularavisoTrabajo', 'Medico (a) especialista en Anestesiología 44 horas Servicio de Salud Maule / Hospital de Constitución'); return false;" href="#" target="_blank" title="Compartir en Linkedin"><i class="fa-brands fa-linkedin fa-xl" aria-hidden="true"></i></a> </div> <div class="col-xs-3 col-md-4"> <a class="btn whatsapp-link visible-xs visible-sm" title="Compartir en Whatsapp" onclick="enviarRS('w', 'https://www.empleospublicos.cl/pub/convocatorias/convpostularavisoTrabajo.aspx?i=130648&c=0&j=0&tipo=convpostularavisoTrabajo', 'Medico (a) especialista en Anestesiología 44 horas Servicio de Salud Maule / Hospital de Constitución'); return false;" href="#" data-action="share/whatsapp/share"><i class="fa-brands fa-square-whatsapp fa-xl" aria-hidden="true"></i></a> </div> </div> </div> <div class="row"><div class="col-md-12 card-footer-contenido "></div></div></div></div></div></div></div>
Based on that structure, I parsed the page source with rvest and extracted the position, organization, and city of each offer:
html <- read_html(rmDr$getPageSource()[[1]])

offers <- html %>%
  html_nodes("div.items")

offers_tbl <- map_df(offers, function(offer) {
  # Extract position (job title)
  position <- offer %>%
    html_node("h3 a") %>%
    html_text(trim = TRUE)

  # Extract organization (usually the first <p> inside .top)
  organization <- offer %>%
    html_node(".top p") %>%
    html_text(trim = TRUE)

  # Extract city (the second <p> inside .cnt)
  city <- offer %>%
    html_nodes(".cnt p") %>%
    .[2] %>%
    html_text(trim = TRUE)

  tibble(
    position = position,
    organization = organization,
    city = city
  )
})
The result has the following structure:
offers_tbl
# A tibble: 552 × 3
   position                                                   organization city
   <chr>                                                      <chr>        <chr>
 1 Medico (a) especialista en Anestesiología 44 horas         Servicio de… Cons…
 2 Titulares de la Planta Profesional Ley 18.834              Servicio de… Valp…
 3 ENFERMERA-O, JORNADA DIURNA, GRADO 12, PARA SERVICIO CLÍN… Servicio de… Reco…
 4 Psiquiatra infanto-juvenil sistema de atención intersecto… Servicio de… La P…
 5 Neurólogo(a) adulto GES Alzheimer y otras demencias        Servicio de… Puen…
 6 Médico(a) especialista en Neurología Infantil Hospital de… Servicio de… Cast…
 7 Arquitecto de Software                                     Central de … Ñuñoa
 8 TENS OPERADOR DE EQUIPOS DE ESTERILIZACIÓN                 Servicio de… Peña…
 9 (850-2892) Médico Especialista Broncopulmonar o Internist… Servicio de… Talc…
10 Enfermero(a) Clínico(a) Atención Abierta y Cerrada         Servicio de… Huas…

glimpse(offers_tbl)
Rows: 552
Columns: 3
$ position     <chr> "Medico (a) especialista en Anestesiología 44 horas", "Ti…
$ organization <chr> "Servicio de Salud Maule / Hospital de Constitución", "Se…
$ city         <chr> "Constitución", "Valparaíso", "Recoleta", "La Pintana", "…
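The same offer markup holds more than these three fields. As a sketch of one possible extension, the snippet below pulls the application deadline and the link out of a trimmed copy of the first offer (social-sharing markup omitted, ampersands written as &amp;), so it runs without a browser. The time zone America/Santiago is my assumption, since the page does not state one.

```r
library(rvest)

# Trimmed copy of the first offer shown earlier in the post
offer_html <- '<div class="items col-md-4 col-lg-4 postulacion otro otro eepp">
  <div class="item">
    <div class="top">
      <div class="label label-estado"><i class="fa fa-circle circulo-status1" aria-hidden="true"></i> Postulación hasta 30/09/2025 23:59:00</div>
      <h3><a target="_blank" href="https://www.empleospublicos.cl/pub/convocatorias/convpostularavisoTrabajo.aspx?i=130648&amp;c=0&amp;j=0&amp;tipo=convpostularavisoTrabajo">Medico (a) especialista en Anestesiología 44 horas</a></h3>
      <p>Servicio de Salud Maule / Hospital de Constitución</p>
    </div>
    <hr>
    <div class="cnt"><p>Ministerio de Salud</p><p>Constitución</p></div>
  </div>
</div>'

offer <- read_html(offer_html) %>% html_node("div.items")

# Link to the offer: the href of the title anchor
link <- offer %>% html_node("h3 a") %>% html_attr("href")

# Deadline: keep only the dd/mm/yyyy hh:mm:ss part of the status label
status <- offer %>% html_node(".label-estado") %>% html_text(trim = TRUE)
pattern <- "[0-9]{2}/[0-9]{2}/[0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2}"
deadline_chr <- regmatches(status, regexpr(pattern, status))

# Assumption: the times are Chilean local time
deadline <- as.POSIXct(deadline_chr, format = "%d/%m/%Y %H:%M:%S",
                       tz = "America/Santiago")

data.frame(link = link, deadline = deadline)
```

Inside the map_df() call used for offers_tbl, the same two extractions could be added as extra columns of the tibble, though offers without a deadline label would need a fallback value.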
I know this is a simple example, but it should allow different kinds of exploration and data extraction. I hope it helps.