Navigating & Scraping a Job Site | rvest & RSelenium

February 13, 2016

(This article was first published on r – Recommended Texts, and kindly contributed to R-bloggers)

One of my family members gave me the idea of scraping data from a job site and arranging it so that it can easily be filtered and checked in a spreadsheet. I’m actually a little embarrassed that I didn’t think of this myself. Needless to say, I was eager to try it out.

I picked a site and started inspecting the HTML to see how I would get the information I needed from each job posting. Normally, the easiest scrapes (for me) are the ones where the site has two characteristics.

First, it helps if all (or at least most) of the information that I need to extract is on the site’s search results page. For instance, in the context of job postings, if you search for “Data Scientist” and the search results show the job title, the company that’s hiring, the years of experience required, the location, and a short summary, then there is no real need to navigate to each post and get that data from the post itself.

The second characteristic is a search results URL that shows which page you are currently on, or at least gives some indication of which search result number you are looking at. For instance, google “Data Scientist” and take note of the URL. Scroll down and click the second page, and notice that the URL now ends with “start=10”. Go to the third page and you’ll notice that it now ends with “start=20”. Although the URL doesn’t mention the page number, it does indicate that if you were to change that number to anything else (go ahead and try), the search results would begin from start + 1; i.e. if start = 10, the search results would begin with search result no. 11. If I’m lucky, the website has a clear indication in the URL, like “page=2”, which makes the task even easier.

Now why would these two characteristics make it much easier? Mainly because you can split the URL into different parts, with only one variable – the page number – and then concatenate the different parts back. After that it’s just a matter of looping through these URLs and picking up the information you need from the HTML source.
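As a minimal sketch of that idea, the snippet below builds one URL per results page by pasting a fixed base onto a changing offset. The base URL, the "start" parameter name, and the ten-results-per-page figure are assumptions for illustration; the real values depend on the site being scraped.

```r
# Hypothetical URL pieces; the real ones depend on the site being scraped
base_url = "https://www.example.com/search?q=data+scientist"
per_page = 10   # assumed number of results per page
pages    = 1:5

# Page 1 starts at offset 0, page 2 at offset 10, and so on
offsets = (pages - 1) * per_page
urls    = paste0(base_url, "&start=", offsets)

# Each element of urls could then be read in a loop,
# e.g. with rvest::read_html(urls[i])
print(urls)
```

Because only the offset changes, the whole set of pages can be generated up front and looped over without ever opening a browser.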

If both characteristics exist, all I need is the rvest package to make it all work, with dplyr and stringr for some of the “tidying”.

There are certain instances, however, when these characteristics do not exist. Usually it’s because the site relies on JavaScript, so the URL does not change when moving between search pages. To make this work I would actually have to click the page buttons to get the HTML source, and I can’t do that with rvest.

Enter RSelenium, the wonderful R package that allows me to do all that.

As always, I started off by loading the packages, assigning the URL of the search results page, and extracting the data for just the first page. You’ll have to excuse me for using the “=” operator. WordPress seems to screw up the formatting if I use the “less than” sign combined with a hyphen, which is sort of annoying.

#Load packages
library(dplyr)
library(rvest)
library(stringr)
library(RSelenium)
library(beepr)

#Manually paste the URL for the search results here
link = "jobsite search results URL here"

#Get the html source of the URL
#read_html() replaces the deprecated html() function in rvest
hlink = read_html(link)

#Extract Job Title
hlink %>% html_nodes("#main_section") %>% 
  html_nodes(".tpjob_item") %>% html_nodes(".tpjob_title") %>% 
  html_text() %>% data.frame(stringsAsFactors = FALSE) -> a
names(a) = "Title"

#Extract Recruitment Company
hlink %>% html_nodes("#main_section") %>% 
  html_nodes(".tpjobwrap") %>% html_nodes(".tpjob_cname") %>% html_text() %>% 
  data.frame(stringsAsFactors = FALSE) -> b
names(b) = "Company"

#Extract Links to Job postings
hlink %>% html_nodes("#main_section") %>% 
  html_nodes(".tpjob_item") %>% html_nodes(".tpjob_lnk") %>% 
  html_attr("href") %>% data.frame(stringsAsFactors = FALSE) -> c
names(c) = "Links"
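The three extraction steps above repeat the same pattern, so they could be folded into a small helper. The sketch below is one way to do that, run against a cut-down, made-up HTML snippet that only borrows the class names from the selectors above; the helper name grab and the sample markup are hypothetical.

```r
library(rvest)

# Hypothetical, cut-down markup reusing the class names from the selectors above
sample_html = '<div id="main_section">
  <div class="tpjob_item">
    <a class="tpjob_lnk" href="/job/1"><span class="tpjob_title">Data Scientist</span></a>
  </div>
  <div class="tpjob_item">
    <a class="tpjob_lnk" href="/job/2"><span class="tpjob_title">Data Analyst</span></a>
  </div>
</div>'

page = read_html(sample_html)

# Hypothetical helper: return the text (or one attribute) of every node matching a selector
grab = function(page, css, attr = NULL) {
  nodes = html_nodes(page, css)
  if (is.null(attr)) html_text(nodes) else html_attr(nodes, attr)
}

jobs = data.frame(
  Title = grab(page, "#main_section .tpjob_item .tpjob_title"),
  Links = grab(page, "#main_section .tpjob_item .tpjob_lnk", attr = "href"),
  stringsAsFactors = FALSE
)
print(jobs)
```

Building the data frame in one call has a side benefit: if one selector returns fewer matches than another, data.frame() will typically complain straight away instead of letting the columns drift out of alignment.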

At this point I’ve only extracted the job titles, the hiring companies’ names, and the links to the posts on the first page. To get the same details for the remaining posts, I first need to navigate to the next page, which involves clicking the Next button at the bottom of the search results page.

#From RSelenium
#Note: checkForServer() and startServer() belong to older RSelenium releases;
#newer releases replace them with rsDriver()
checkForServer() #Check if server file is available
startServer() #Start the server
mybrowser = remoteDriver(browserName = "chrome") #Change the browser to chrome
mybrowser$open(silent = TRUE) #Open the browser
Sys.sleep(5) #Wait a few seconds
mybrowser$navigate(link) #Navigate to URL
Sys.sleep(5) #Give the page time to load

Pages = 16 #Select how many pages to go through

for(i in 1:Pages){ 
  
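  #Find the "Next" button and click it
  #The assignment sits outside try(); inside the call, "=" would be read as a named argument
  wxbutton = try(mybrowser$findElement(using = 'css selector', "a.pagination_item.next.lft"), silent = TRUE)
  if(!inherits(wxbutton, "try-error")){try(wxbutton$clickElement())} # Click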
  
  Sys.sleep(8)
  
  hlink = read_html(mybrowser$getPageSource()[[1]]) #Get the html source from site
  
  hlink %>% html_text() -> service_check
  
  #If there is a 503 error, go back
  if(grepl("503 Service", service_check)){ 
    
    mybrowser$goBack()
    
  }
  else
  {
    
    #Job Title
    hlink %>% html_nodes("#main_section") %>% 
      html_nodes(".tpjob_item") %>% html_nodes(".tpjob_title") %>% 
      html_text() %>% data.frame(stringsAsFactors = FALSE) -> x
    names(x) = "Title"
    a = rbind(a,x) #Add the new job postings to the ones extracted earlier
    
    #Recruitment Company
    hlink %>% html_nodes("#main_section") %>% 
      html_nodes(".tpjobwrap") %>% html_nodes(".tpjob_cname") %>% html_text() %>% 
      data.frame(stringsAsFactors = FALSE) -> y
    names(y) = "Company"
    b = rbind(b,y)
    
    #Links
    hlink %>% html_nodes("#main_section") %>% 
      html_nodes(".tpjob_item") %>% html_nodes(".tpjob_lnk") %>% 
      html_attr("href") %>% data.frame(stringsAsFactors = FALSE) -> z
    names(z) = "Links"
    c = rbind(c,z)
    
  }
  
}

beep()

#Put everything together in one dataframe
compile = cbind(a,b,c)

#export a copy, for backup
write.csv(compile, "Backup.csv", row.names = FALSE)

#close server and browser
mybrowser$close()
mybrowser$closeServer()
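The loop above clicks Next a fixed number of times (Pages = 16). A possible variation, sketched below, keeps going until the Next link can no longer be found. The function name walk_pages and its arguments are hypothetical; it assumes driver is an already opened RSelenium remoteDriver and reuses the CSS selector from the code above. It relies on findElements() (plural), which returns an empty list when nothing matches.

```r
# Hypothetical helper: page through the results until the "Next" link disappears.
# driver  : assumed to be an open RSelenium remoteDriver
# on_page : function called with the HTML source of each newly loaded page
walk_pages = function(driver, on_page, max_pages = 50, pause = 8) {
  visited = 0
  while (visited < max_pages) {
    # An empty list means there is no "Next" link left to click
    nxt = driver$findElements(using = "css selector",
                              value = "a.pagination_item.next.lft")
    if (length(nxt) == 0) break

    nxt[[1]]$clickElement()
    Sys.sleep(pause)  # give the page time to load

    on_page(driver$getPageSource()[[1]])
    visited = visited + 1
  }
  visited  # number of pages visited after the first one
}
```

The max_pages cap is only a safety net, in case the Next link never goes away on a particular site.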

Now that I have all the links to the posts, I can loop through the compiled dataframe and get the details from each URL.

#Make another copy to loop through
compile_2 = compile

#Create 8 new columns to represent the details to be extracted
compile_2$Location = NA
compile_2$Experience = NA
compile_2$Education = NA
compile_2$Stream = NA
compile_2$Function = NA
compile_2$Role = NA
compile_2$Industry = NA
compile_2$Posted_On = NA

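#Three nested loops
#First loop to go through the links extracted
for(i in 1:nrow(compile_2)){
  
  link = compile_2$Links[i]
  
  #Fetch the post; a failed request gives a "try-error" object instead of stopping the loop
  hlink = try(read_html(link), silent = TRUE)
  
  if(!inherits(hlink, "try-error")){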
  
        hlink %>% html_nodes(".jd_infoh") %>% 
          html_text() %>% data.frame(stringsAsFactors = FALSE) -> a_column
        
        hlink %>% html_nodes(".jd_infotxt") %>% 
          html_text() %>% data.frame(stringsAsFactors = FALSE) -> l_column
   
  if(nrow(a_column) != 0){      
             
        #Second loop to drop blank entries so that the values line up with the labels
        for(j in nrow(l_column):1){
          
          if(nchar(str_trim(l_column[j,1])) == 0){l_column[-j,] %>% data.frame(stringsAsFactors = FALSE) -> l_column}
          
        }
         
    if(nrow(a_column) == nrow(l_column)){
    
        cbind(a_column, l_column) -> comp_column
        
        #Third loop to update dataframe with all the details from each post
        for(k in 1:nrow(comp_column)){
          
          if(grepl("Location", comp_column[k,1])){compile_2$Location[i] = comp_column[k,2]} 
          
          if(grepl("Experience", comp_column[k,1])){compile_2$Experience[i] = comp_column[k,2]}
          
          if(grepl("Education", comp_column[k,1])){compile_2$Education[i] = comp_column[k,2]}
          
          if(grepl("Stream", comp_column[k,1])){compile_2$Stream[i] = comp_column[k,2]}
          
          if(grepl("Function", comp_column[k,1])){compile_2$Function[i] = comp_column[k,2]}
          
          if(grepl("Role", comp_column[k,1])){compile_2$Role[i] = comp_column[k,2]}
          
          if(grepl("Industry", comp_column[k,1])){compile_2$Industry[i] = comp_column[k,2]}
          
          if(grepl("Posted", comp_column[k,1])){compile_2$Posted_On[i] = comp_column[k,2]}
        }
  
  }
  }
  }
}

beep()

#Export a copy for backup
write.csv(compile_2, "Raw_Complete.csv", row.names = FALSE)

#Alert
beep()
Sys.sleep(0.2)
beep()
Sys.sleep(0.2)
beep()
Sys.sleep(0.3)
beep(sound = 8) #That one's just me goofing around
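The third loop above checks every label against eight separate if statements. A more compact alternative, sketched here under assumptions, keeps the keyword-to-column mapping in one named vector. The sample labels and values are made up, and fill_row is a hypothetical helper, not part of the original script.

```r
# Made-up labels and values, standing in for the ".jd_infoh" and ".jd_infotxt" text
labels = c("Location:", "Experience:", "Posted On:")
values = c("City A", "2 - 3 years", "10 Feb 2016")

# Keyword to search for in a label (name) and the column it fills (value)
fields = c(Location = "Location", Experience = "Experience",
           Education = "Education", Posted = "Posted_On")

# Hypothetical helper: one named character vector per post, NA where a label is missing
fill_row = function(labels, values, fields) {
  out = setNames(rep(NA_character_, length(fields)), fields)
  for (key in names(fields)) {
    hit = grep(key, labels, fixed = TRUE)
    # Like the original loop, a later match overwrites an earlier one
    if (length(hit) > 0) out[fields[[key]]] = values[hit[length(hit)]]
  }
  out
}

row = fill_row(labels, values, fields)
print(row)
```

Adding a new detail to extract then only means adding one entry to fields.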

Alright, we now have a nice dataframe of 1840 jobs and 11 columns showing:

1. Title: The job title.
2. Company: The hiring company.
3. Links: The URL of the job posting.
4. Location: Where the job is situated.
5. Experience: Level of experience required for the job, shown as a range (e.g. 2-3 years).
6. Education: Minimum educational qualification.
7. Stream: Work stream category.
8. Function: Job function category.
9. Role: The job’s general role.
10. Industry: Which industry the hiring company is involved in.
11. Posted_On: The day the job was originally posted.

As a matter of convenience, I decided to split the 5th column, Experience, into two other columns:

12. Min_Experience: Minimum years of experience required.
13. Max_Experience: Maximum years of experience.

The code used to mutate this Experience column was:

com_clean = compile_2

#Logical vector of all the observations with no details extracted because of an error
is.na(com_clean[,4]) -> log_vec


#Place NA rows in a separate dataframe
com_clean_NA = com_clean[log_vec,]

#Place the remaining rows in another dataframe
com_clean_OK = com_clean[!log_vec,]


com_clean_OK[,"Experience"] -> Exp

#Remove whitespace and the "years" part
str_replace_all(Exp, " ", "") %>% 
  str_replace_all(pattern = "years", replacement = "") -> Exp

#Assign the location of the hyphen in each entry to a list
str_locate_all(Exp, "-") -> hyphens

#Assign empty vectors to be populated with a loop
Min = c()
Max = c()

for(i in 1:length(Exp)){
  
  substr(Exp[i], 0, hyphens[[i]][1,1] - 1) %>% 
    as.integer() -> Min[i]
  
  substr(Exp[i], hyphens[[i]][1,1] + 1, nchar(Exp[i])) %>% 
    as.integer() -> Max[i]
  
}

#Assign results to new columns
com_clean_OK$Min_Experience = Min
com_clean_OK$Max_Experience = Max

#Rearrange the columns
select(com_clean_OK, 1:4, 12:13, 5:11) -> com_clean_OK

write.csv(com_clean_OK, "Complete_No_NA.csv", row.names = FALSE)
write.csv(com_clean_NA, "Complete_All_NA.csv", row.names = FALSE)
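The loop above indexes the first hyphen of every entry, so an entry without a hyphen (say, "5 years") would presumably stop it with a subscript error. A vectorised alternative, sketched below with made-up sample values, uses str_match() from stringr, which returns NA for entries that do not fit the pattern.

```r
library(stringr)

# Made-up sample of the Experience column, including awkward entries
Exp = c("2 - 3 years", "10-15 years", "5 years", NA)

# Capture the digits on either side of the hyphen;
# entries that do not match the pattern come back as NA
parts = str_match(Exp, "(\\d+)\\s*-\\s*(\\d+)")

Min = as.integer(parts[, 2])
Max = as.integer(parts[, 3])

print(data.frame(Exp, Min, Max))
```

The pattern tolerates spaces around the hyphen, so the whitespace does not need to be stripped beforehand.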

And with that, I have a nice dataframe of all the information I need to go through the posts. I was flirting with the idea of writing some code that would automatically apply for a job if it meets certain criteria, e.g. if the job title equals X, the minimum experience is less than Y, and the location is in a list of Z, then click this, and so on. Obviously, there is the question of how to get through the Captcha walls, as a colleague once pointed out. In any case, I thought I should leave this idea for a different post. Till then, I’ll be doing some intense googling to see if someone else has actually tried it out (using R, or even Python) and maybe pick up a few things.
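The screening half of that idea (title equals X, minimum experience below Y, location in a list Z) can be sketched with dplyr’s filter(). The data frame and the three criteria below are made up for illustration; only the column names follow the ones used earlier.

```r
library(dplyr)

# Made-up rows using the column names from the cleaned dataframe
jobs = data.frame(
  Title          = c("Data Scientist", "Data Scientist", "Data Analyst"),
  Location       = c("City A", "City B", "City A"),
  Min_Experience = c(2L, 6L, 1L),
  stringsAsFactors = FALSE
)

# Hypothetical criteria: X, Y and Z from the paragraph above
wanted_title = "Data Scientist"
max_min_exp  = 5
ok_locations = c("City A", "City C")

shortlist = filter(jobs,
                   Title == wanted_title,
                   Min_Experience < max_min_exp,
                   Location %in% ok_locations)
print(shortlist)
```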

