Step-by-Step Guide to Use R and Selenium to Scrape Empleos Publicos (Part 3)

pacha.dev/blog

[This article was first published on pacha.dev/blog, and kindly contributed to R-bloggers].

Because of delays with my scholarship payment, if this post is useful to you I kindly ask for a small donation on Buy Me a Coffee. It will be used to continue my Open Source efforts. The full explanation is in: A Personal Message from an Open Source Contributor.

Continuing with the previous Selenium post, now I will process each job offer and organize its contents.

This requires the readxl package to read XLSX files:

if (!require(readxl)) install.packages("readxl")

To load the XLSX file saved in part 2 and open a browser session for reading each offer, I start with:

library(RSelenium)
library(rvest)
library(dplyr)
library(purrr)
library(writexl)
library(readxl)

offers_tbl <- read_xlsx("offers_20250821.xlsx")

rmDr <- remoteDriver(port = 4444L, browserName = "chrome")

rmDr$open(silent = TRUE)

With this table loaded, I can read the HTML of each job offer and see how it is structured. Starting with the first URL:

rmDr$navigate(offers_tbl$link[1])
html <- read_html(rmDr$getPageSource()[[1]])

This specific job offer has the following contents:

> html
{html_document}
<html lang="es">
[1] <head id="Head1">\n<http-equiv="Content-Type" content="text/html; ch ...
[2] <body>\n        <form name="form1" method="post" action="convpostularavis ...

Inspecting the details in the offer as in part 1, the full description is contained in a single HTML division with sub-divisions:

<div class="item formatodeclaraciones">
                                            <div class="row top">
                                                <h2><span id="lblAvisoTrabajo">Medico (a) especialista en Anestesiología</span></h2>
                                            </div>
                                            <hr>
                                            <div class="bottom">
                                                <div class="row">

                                                    <div class="col-md-6">
                                                <span id="lblAvisoTrabajoDatos"><div><h3> Institución</h3><p>Ministerio de Salud / Servicio de Salud Maule / Hospital de Constitución</p><h3>Convocatoria</h3><p>Medico (a) especialista en Anestesiología 44 horas</p><h3>Nº de Vacantes </h3><p>1</p><h3>Área de Trabajo</h3><p>Salud</p><h3>Región</h3><p>Región del Maule</p><h3>Ciudad</h3><p>Constitución</p><h3>Tipo de Vacante</h3><p>Contrata</p></div></span>
...

To organize this in a table, one option among several is to use one CSS selector per field to get the title, institution, number of vacancies, city, compensation, and educational requirements:

title <- html %>% html_element("h2 span#lblAvisoTrabajo") %>% html_text(trim = TRUE)
institution <- html %>% html_element("span#lblAvisoTrabajoDatos h3:contains('Institución') + p") %>% html_text(trim = TRUE)
positions <- html %>% html_element("span#lblAvisoTrabajoDatos h3:contains('Nº de Vacantes') + p") %>% html_text(trim = TRUE)
city <- html %>% html_element("span#lblAvisoTrabajoDatos h3:contains('Ciudad') + p") %>% html_text(trim = TRUE)
compensation <- html %>% html_element("span#lblRenta li ul li") %>% html_text(trim = TRUE)
education <- html %>% html_element("span#lblFormacion p") %>% html_text(trim = TRUE)

d <- tibble(
  title = title,
  institution = institution,
  positions = positions,
  city = city,
  compensation = compensation,
  education = education
)

The result is the following table:

> d
# A tibble: 1 × 6
  title                       institution positions city  compensation education
  <chr>                       <chr>       <chr>     <chr> <chr>        <chr>    
1 Medico (a) especialista en… Ministerio… 1         Cons… Renta Bruta… Título p…
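
As an alternative to one selector per field, the `h3`/`p` pairs inside `lblAvisoTrabajoDatos` can be collected in one pass. This is only a minimal sketch, not the approach used in this post: the `snippet` string is a shortened copy of the HTML shown earlier, and it assumes that every `h3` is immediately followed by exactly one `p`. If a page breaks that pattern, the labels and values would no longer line up.

```r
library(rvest)

# Shortened copy of the offer HTML shown earlier (illustration only)
snippet <- paste0(
  '<span id="lblAvisoTrabajoDatos"><div>',
  '<h3> Institución</h3><p>Ministerio de Salud / Servicio de Salud Maule / Hospital de Constitución</p>',
  '<h3>Nº de Vacantes </h3><p>1</p>',
  '<h3>Ciudad</h3><p>Constitución</p>',
  '</div></span>'
)

page <- read_html(snippet)

# Headings act as field names, the paragraph right after each one as its value
labels <- page %>%
  html_elements("span#lblAvisoTrabajoDatos h3") %>%
  html_text(trim = TRUE)

values <- page %>%
  html_elements("span#lblAvisoTrabajoDatos h3 + p") %>%
  html_text(trim = TRUE)

# Only pair them up when the assumed one-to-one structure holds
stopifnot(length(labels) == length(values))
fields <- setNames(values, labels)

fields[["Ciudad"]]
```

The named vector makes it easy to pick up extra fields such as the region or the type of vacancy without writing a new selector for each one.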

I see that the compensation value needs tidying:

> d$compensation
[1] "Renta Bruta6.398.194"

To tidy it, I can do this to remove the leading text and number separators:

d <- d %>%
  mutate(compensation = as.numeric(gsub("Renta Bruta|\\.", "", compensation)))

which leads to the desired numeric value for later analysis:

> d$compensation
[1] 6398194
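
The same cleaning step can be wrapped in a small helper. The function below is a sketch with a hypothetical name (`parse_clp`) that keeps only the digits instead of removing a specific prefix, which may be more tolerant if the label text varies between offers. It assumes that the field contains a single amount; if a string held two numbers, their digits would be merged.

```r
# Hypothetical helper: keep only the digits of a compensation string.
# Assumes one amount per string and "." used as thousands separator.
parse_clp <- function(x) {
  as.numeric(gsub("[^0-9]", "", x))
}

parse_clp("Renta Bruta6.398.194")
parse_clp(NA_character_) # missing input stays missing
```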

To do the same for each of the 545 saved job offers, I repeat these steps with purrr:

descriptions_tbl <- map_df(
  seq_len(nrow(offers_tbl)),
  function(x) {
    print(x) # just to see at which iteration it fails (if it fails)

    rmDr$navigate(offers_tbl$link[x])
    html <- read_html(rmDr$getPageSource()[[1]])

    title <- html %>% html_element("h2 span#lblAvisoTrabajo") %>% html_text(trim = TRUE)
    institution <- html %>% html_element("span#lblAvisoTrabajoDatos h3:contains('Institución') + p") %>% html_text(trim = TRUE)
    positions <- html %>% html_element("span#lblAvisoTrabajoDatos h3:contains('Nº de Vacantes') + p") %>% html_text(trim = TRUE)
    city <- html %>% html_element("span#lblAvisoTrabajoDatos h3:contains('Ciudad') + p") %>% html_text(trim = TRUE)
    compensation <- html %>% html_element("span#lblRenta li ul li") %>% html_text(trim = TRUE)
    education <- html %>% html_element("span#lblFormacion p") %>% html_text(trim = TRUE)

    d <- tibble(
      title = title,
      institution = institution,
      positions = positions,
      city = city,
      compensation = compensation,
      education = education
    )

    d <- d %>%
      mutate(compensation = as.numeric(gsub("Renta Bruta|\\.", "", compensation)))

    d
  }
)
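
The `print(x)` call shows where the loop stops if an error occurs, but the rows scraped so far are lost. One possible way to keep going is to catch the error and return a row of missing values instead. The sketch below uses base R only; `scrape_safely()` and `fake_scraper()` are hypothetical names, and `fake_scraper()` is a stand-in for the Selenium and rvest code above so that the example runs without a browser.

```r
# Hypothetical wrapper: run scrape_one(url) and fall back to a row of NAs
scrape_safely <- function(url, scrape_one, fields) {
  empty <- as.data.frame(
    setNames(rep(list(NA_character_), length(fields)), fields),
    stringsAsFactors = FALSE
  )

  tryCatch(
    scrape_one(url),
    error = function(e) {
      message("Failed on ", url, ": ", conditionMessage(e))
      empty
    }
  )
}

# Stand-in for the real scraping function (illustration only)
fake_scraper <- function(url) {
  if (grepl("broken", url)) stop("page under maintenance")
  data.frame(title = "Example", city = "Santiago", stringsAsFactors = FALSE)
}

urls <- c("https://example.com/ok", "https://example.com/broken")

rows <- lapply(
  urls, scrape_safely,
  scrape_one = fake_scraper, fields = c("title", "city")
)

result <- do.call(rbind, rows)
result
```

In this design a failed page produces the same kind of all-`NA` row that a page with a different structure already produces, so the later counts of missing values still apply.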

The result is the following:

> descriptions_tbl
# A tibble: 545 × 6
   title                      institution positions city  compensation education
   <chr>                      <chr>       <chr>     <chr>        <dbl> <chr>    
 1 Medico (a) especialista e… Ministerio… 1         Cons…      6398194 "Título …
 2 NA                         NA          NA        NA              NA  NA      
 3 ENFERMERA-O, JORNADA DIUR… Ministerio… 1         Reco…      1906087 "Título …
 4 Psiquiatra infanto-juveni… Ministerio… 1         La P…      2333658 ""       
 5 Neurólogo(a) adulto GES A… Ministerio… 1         Puen…       637926 ""       
 6 Médico(a) especialista en… Ministerio… 2         Cast…      5328446 "Título …
 7 Arquitecto de Software     Ministerio… 1         Ñuñoa      2256523 "Profesi…
 8 DIRECCIÓN DEL SERVICIO DE… Ministerio… 1         Chil…      1540891 ""       
 9 01 CARGO DE TENS OPERADOR… Ministerio… 1         Peña…       851161 ""       
10 TENS DE CUIDADOS PALIATIV… Ministerio… 2         San …       636462 "Titulo …
# ℹ 535 more rows

Some rows are blank because the link was under maintenance or led to an external municipal site with a different structure.

Here is a count of the missing values in each field:

descriptions_tbl %>%
  summarise(
    across(
      everything(),
      list(
        na_count = ~sum(is.na(.))
      ),
      .names = "{.col}_{.fn}"
    )
  )

The identical counts in the visible columns suggest that the missing values fall in the same observations:

# A tibble: 1 × 6
  title_na_count institution_na_count positions_na_count city_na_count
           <int>                <int>              <int>         <int>
1            187                  187                187           187
# ℹ 2 more variables: compensation_na_count <int>, education_na_count <int>
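
Equal counts per column do not strictly prove that the missing values sit in the same rows. A quick way to check is to count the missing values per row, as in the sketch below. It uses a small made-up data frame (`toy`) in place of `descriptions_tbl`, and it also counts empty strings, since some education values in the output above are `""` rather than `NA`.

```r
# Made-up data standing in for descriptions_tbl (illustration only)
toy <- data.frame(
  title = c("A", NA, "C"),
  city = c("X", NA, "Z"),
  education = c("", NA, "T"),
  stringsAsFactors = FALSE
)

# Number of missing fields in each row
na_per_row <- rowSums(is.na(toy))
table(na_per_row)

# TRUE when every row is either complete or entirely missing
all_or_nothing <- all(na_per_row %in% c(0, ncol(toy)))

# Empty strings are not NA, so they are counted separately
empty_strings <- colSums(toy == "", na.rm = TRUE)

all_or_nothing
empty_strings
```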

I got 545 - 187 = 358 well-organized observations with a scraping process that took around five minutes. Not bad!

I save an XLSX backup to avoid scraping twice:

write_xlsx(descriptions_tbl, "descriptions_20250821.xlsx")
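
To actually reuse the backup in a later session, the scraping step can be skipped when the file already exists. The following is a minimal sketch under assumptions: `load_or_scrape()` and `fake_scrape_all()` are hypothetical names, the latter stands in for the full purrr loop, and a temporary file replaces the real backup path.

```r
library(readxl)
library(writexl)

# Hypothetical helper: read the backup if present, otherwise scrape and save
load_or_scrape <- function(path, scrape_all) {
  if (file.exists(path)) {
    return(read_xlsx(path))
  }
  out <- scrape_all()
  write_xlsx(out, path)
  out
}

# Stand-in for the scraping loop; counts how often it is called
calls <- 0
fake_scrape_all <- function() {
  calls <<- calls + 1
  data.frame(title = "Example", compensation = 1000, stringsAsFactors = FALSE)
}

backup <- tempfile(fileext = ".xlsx")

first <- load_or_scrape(backup, fake_scrape_all)  # scrapes and writes the file
second <- load_or_scrape(backup, fake_scrape_all) # reads the file instead

calls
```

Deleting the backup file forces a fresh scrape, which is useful once the job offers on the site have changed.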

I hope this was useful. In the next parts I will cover some analysis and plots with this data.
