Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Because of delays with my scholarship payment, if this post is useful to you I kindly ask a minimal donation on Buy Me a Coffee. It shall be used to continue my Open Source efforts. The full explanation is here: A Personal Message from an Open Source Contributor.
Continuing with the previous Selenium post, now I will process each job offer and organize its contents.
This requires the readxl package to read XLSX files:
if (!require(readxl)) install.packages("readxl")
To load the XLSX file saved in part 2 and open a browser session, I start with:
library(RSelenium)
library(rvest)
library(dplyr)
library(purrr)
library(writexl)
library(readxl)

offers_tbl <- read_xlsx("offers_20250821.xlsx")

rmDr <- remoteDriver(port = 4444L, browserName = "chrome")
rmDr$open(silent = TRUE)
From this table, I can read the HTML of each job offer and see how it is structured. Starting with the first URL:
rmDr$navigate(offers_tbl$link[1])
html <- read_html(rmDr$getPageSource()[[1]])
This specific job offer has the following contents:
> html
{html_document}
<html lang="es">
[1] <head id="Head1">\n<http-equiv="Content-Type" content="text/html; ch ...
[2] <body>\n <form name="form1" method="post" action="convpostularavis ...
Inspecting the offer as in part 1 shows that the full description is contained in a single HTML division with sub-divisions:
<div class="item formatodeclaraciones">
  <div class="row top">
    <h2><span id="lblAvisoTrabajo">Medico (a) especialista en Anestesiología</span></h2>
  </div>
  <hr>
  <div class="bottom">
    <div class="row">
      <div class="col-md-6">
        <span id="lblAvisoTrabajoDatos"><div>
          <h3> Institución</h3><p>Ministerio de Salud / Servicio de Salud Maule / Hospital de Constitución</p>
          <h3>Convocatoria</h3><p>Medico (a) especialista en Anestesiología 44 horas</p>
          <h3>Nº de Vacantes </h3><p>1</p>
          <h3>Área de Trabajo</h3><p>Salud</p>
          <h3>Región</h3><p>Región del Maule</p>
          <h3>Ciudad</h3><p>Constitución</p>
          <h3>Tipo de Vacante</h3><p>Contrata</p>
        </div></span>
        ...
To organize this in a table, I can do the following, among other possibilities, to get the title, institution, number of vacancies, city, compensation, and educational requirements:
title <- html %>%
  html_element("h2 span#lblAvisoTrabajo") %>%
  html_text(trim = TRUE)

institution <- html %>%
  html_element("span#lblAvisoTrabajoDatos h3:contains('Institución') + p") %>%
  html_text(trim = TRUE)

positions <- html %>%
  html_element("span#lblAvisoTrabajoDatos h3:contains('Nº de Vacantes') + p") %>%
  html_text(trim = TRUE)

city <- html %>%
  html_element("span#lblAvisoTrabajoDatos h3:contains('Ciudad') + p") %>%
  html_text(trim = TRUE)

compensation <- html %>%
  html_element("span#lblRenta li ul li") %>%
  html_text(trim = TRUE)

education <- html %>%
  html_element("span#lblFormacion p") %>%
  html_text(trim = TRUE)

d <- tibble(
  title = title,
  institution = institution,
  positions = positions,
  city = city,
  compensation = compensation,
  education = education
)
The result is the following table:
> d
# A tibble: 1 × 6
  title                       institution positions city  compensation education
  <chr>                       <chr>       <chr>     <chr> <chr>        <chr>
1 Medico (a) especialista en… Ministerio… 1         Cons… Renta Bruta… Título p…
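The label-based selectors above differ only in the heading they look for, so they could be folded into a single helper. The following is a minimal sketch and not the code used in this post: `get_field()` is a hypothetical name, it uses an XPath expression rather than the `:contains()` CSS selector, and the HTML is a shortened copy of the fragment shown earlier.

```r
library(rvest)

# Shortened copy of the offer fragment, used only for illustration
page <- minimal_html('
  <span id="lblAvisoTrabajoDatos"><div>
    <h3> Institución</h3><p>Ministerio de Salud / Servicio de Salud Maule / Hospital de Constitución</p>
    <h3>Nº de Vacantes </h3><p>1</p>
    <h3>Ciudad</h3><p>Constitución</p>
  </div></span>
')

# Hypothetical helper: text of the first <p> sibling after the <h3> containing `label`
get_field <- function(html, label) {
  xpath <- sprintf(
    "//span[@id='lblAvisoTrabajoDatos']//h3[contains(., '%s')]/following-sibling::p[1]",
    label
  )
  html %>%
    html_element(xpath = xpath) %>%
    html_text(trim = TRUE)
}

get_field(page, "Ciudad")
get_field(page, "Región") # this label is absent from the fragment, so the result is NA
```

Because `html_element()` returns a missing node when nothing matches, an absent label gives `NA` instead of an error. The sketch assumes that labels contain no single quotes, since they are pasted directly into the XPath string.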
I see that the compensation value needs tidying:
> d$compensation
[1] "Renta Bruta6.398.194"
To tidy it, I can remove the leading text and the thousands separators:
d <- d %>%
  mutate(compensation = as.numeric(gsub("Renta Bruta|\\.", "", compensation)))
which leads to the desired value for later analysis:
> d$compensation
[1] 6398194
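The pattern above only removes the literal text "Renta Bruta" and the dots. A slightly more general variant keeps only the digits, which could help if other offers use a different leading label. This is a sketch under the assumption that amounts are whole pesos with no decimal part; `parse_clp()` is a hypothetical helper name.

```r
# Hypothetical helper: keep only the digits of a compensation string
parse_clp <- function(x) {
  digits <- gsub("[^0-9]", "", x)
  # Empty strings and NA inputs become NA
  as.numeric(digits)
}

parse_clp("Renta Bruta6.398.194")
parse_clp(NA_character_)
```

The trade-off is that any stray digits in the text, such as a number of hours, would be merged into the amount, so the stricter pattern from the post is safer when the label is known to be constant.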
To do the same with each of the 545 saved job offers, I repeat the same steps with purrr:
descriptions_tbl <- map_df(
  seq_len(nrow(offers_tbl)),
  function(x) {
    print(x) # just to see at which iteration it fails (if it fails)

    rmDr$navigate(offers_tbl$link[x])
    html <- read_html(rmDr$getPageSource()[[1]])

    title <- html %>%
      html_element("h2 span#lblAvisoTrabajo") %>%
      html_text(trim = TRUE)

    institution <- html %>%
      html_element("span#lblAvisoTrabajoDatos h3:contains('Institución') + p") %>%
      html_text(trim = TRUE)

    positions <- html %>%
      html_element("span#lblAvisoTrabajoDatos h3:contains('Nº de Vacantes') + p") %>%
      html_text(trim = TRUE)

    city <- html %>%
      html_element("span#lblAvisoTrabajoDatos h3:contains('Ciudad') + p") %>%
      html_text(trim = TRUE)

    compensation <- html %>%
      html_element("span#lblRenta li ul li") %>%
      html_text(trim = TRUE)

    education <- html %>%
      html_element("span#lblFormacion p") %>%
      html_text(trim = TRUE)

    d <- tibble(
      title = title,
      institution = institution,
      positions = positions,
      city = city,
      compensation = compensation,
      education = education
    )

    d <- d %>%
      mutate(compensation = as.numeric(gsub("Renta Bruta|\\.", "", compensation)))

    d
  }
)
The result is the following:
> descriptions_tbl
# A tibble: 545 × 6
   title                      institution positions city  compensation education
   <chr>                      <chr>       <chr>     <chr>        <dbl> <chr>
 1 Medico (a) especialista e… Ministerio… 1         Cons…      6398194 "Título …
 2 NA                         NA          NA        NA              NA NA
 3 ENFERMERA-O, JORNADA DIUR… Ministerio… 1         Reco…      1906087 "Título …
 4 Psiquiatra infanto-juveni… Ministerio… 1         La P…      2333658 ""
 5 Neurólogo(a) adulto GES A… Ministerio… 1         Puen…       637926 ""
 6 Médico(a) especialista en… Ministerio… 2         Cast…      5328446 "Título …
 7 Arquitecto de Software     Ministerio… 1         Ñuñoa      2256523 "Profesi…
 8 DIRECCIÓN DEL SERVICIO DE… Ministerio… 1         Chil…      1540891 ""
 9 01 CARGO DE TENS OPERADOR… Ministerio… 1         Peña…       851161 ""
10 TENS DE CUIDADOS PALIATIV… Ministerio… 2         San …       636462 "Titulo …
# ℹ 535 more rows
Some rows are blank because the links are under maintenance or lead to external municipal sites with a different structure.
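Missing elements produce `NA` values without stopping the loop, but a real error (for example, a page that cannot be loaded) would abort the whole run. One option is to wrap the per-offer function with `purrr::possibly()`. The sketch below is illustrative only: `scrape_one()` is a hypothetical stand-in for the scraping step, and it fails on purpose for one input so the example runs without a browser.

```r
library(purrr)
library(dplyr)

# Hypothetical stand-in for the per-offer scraping step
scrape_one <- function(link) {
  if (link == "broken") stop("page could not be loaded")
  tibble(link = link, title = paste("Offer at", link))
}

# On error, return a one-row tibble of NAs instead of stopping
safe_scrape <- possibly(
  scrape_one,
  otherwise = tibble(link = NA_character_, title = NA_character_)
)

links <- c("a", "broken", "c")
result <- bind_rows(map(links, safe_scrape))
result
```

With this approach a failed offer shows up as a blank row, the same way the maintenance pages do, and the remaining offers are still processed.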
Here is a count of the missing values in each field:
descriptions_tbl %>%
  summarise(
    across(
      everything(),
      list(
        na_count = ~sum(is.na(.))
      ),
      .names = "{.col}_{.fn}"
    )
  )
which shows the same count in the displayed fields, suggesting that all the blank values correspond to the same observations:
# A tibble: 1 × 6
  title_na_count institution_na_count positions_na_count city_na_count
           <int>                <int>              <int>         <int>
1            187                  187                187           187
# ℹ 2 more variables: compensation_na_count <int>, education_na_count <int>
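Equal counts per column do not strictly prove that the missing values fall on the same rows. A direct check is to compare the number of rows where every field is `NA` with the number of rows where at least one field is `NA`. The sketch below uses a small made-up tibble (`toy`) so it runs on its own; with the real data, `descriptions_tbl` would take its place.

```r
library(dplyr)

# Made-up data: rows 2 and 4 are fully blank, row 3 is only partially blank
toy <- tibble(
  title = c("A", NA, "C", NA),
  city = c("X", NA, NA, NA),
  compensation = c(1, NA, 3, NA)
)

# Rows where every column is NA
all_na <- toy %>%
  filter(if_all(everything(), is.na)) %>%
  nrow()

# Rows where at least one column is NA
any_na <- toy %>%
  filter(if_any(everything(), is.na)) %>%
  nrow()

# TRUE only when all the missing values sit in fully blank rows
all_na == any_na
```

In this toy example the comparison is `FALSE` because of the partially blank third row. Note that empty strings, such as the `""` values seen in the education column, are not counted as `NA`.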
I got 545 - 187 = 358 well-organized observations with a scraping process that took around five minutes. Not bad!
This needs an XLSX backup to avoid scraping twice:
write_xlsx(descriptions_tbl, "descriptions_20250821.xlsx")
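To make the "avoid scraping twice" idea explicit, the backup can be read when it already exists and created otherwise. This is a minimal sketch, assuming the scraping step is wrapped in a function; `read_or_scrape()` and `scrape_fun` are hypothetical names, not part of readxl or writexl.

```r
library(readxl)
library(writexl)

# Hypothetical helper: run `scrape_fun()` only when no backup file exists yet
read_or_scrape <- function(path, scrape_fun) {
  if (file.exists(path)) {
    return(read_xlsx(path))
  }
  out <- scrape_fun()
  write_xlsx(out, path)
  out
}

# Usage with a temporary file and a stand-in for the scraping step
tmp <- tempfile(fileext = ".xlsx")
first <- read_or_scrape(tmp, function() data.frame(title = c("a", "b")))
second <- read_or_scrape(tmp, function() stop("not called, the backup exists"))
```

On the second call the function passed as `scrape_fun` is never evaluated, because the file written by the first call is found and read instead.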
I hope this was useful. In the next parts I will cover some analysis and plots with this data.