An Overview of Room Rentals in Sydney | RSelenium, rvest, Leaflet, googleway

December 27, 2016

(This article was first published on r – Recommended Texts, and kindly contributed to R-bloggers)

Started another data scraping script similar to the post about rental rates in Houston; except this time i picked Sydney, Australia. The site that i’ve selected uses an awful lot of javascript, so the rvest package won’t be enough in this case. I’m going to have to use RSelenium.

The script for scraping the site is below. I say this every time i post a scraping script, and i’ll say it again: the URL to the actual site has been replaced for obvious reasons. Although i place plenty of comments in my code, this time i’m going to try and break down the code in little chunks and explain each part:


check_error <- function(code) {
           error = function(c) "error",
           warning = function(c) "warning",
           message = function(c) "message"

#checkForServer() #Check if server file is available
startServer() #Start the server
mybrowser <- remoteDriver(browser = "chrome") #Change the browser to chrome
mybrowser$open(silent = TRUE)

#navigate to url. enter according to whatever is filtered in the site

#run a loop to scroll all the way to the bottom
for(i in 1:100){
  mybrowser$executeScript("scroll(0, 500000);")
  Sys.sleep(3) #wait for posts to load
  print(i) #check progress

#get HTML
mybrowser$getPageSource()[[1]] %>% read_html() -> source

The packages dplyr, rvest, RSelenium, and stringr are all packages that i’ve used before so i’m not going to explain what they’re used for. The only thing that’s different is the tcltk package. The only reason i loaded this library is so that i can create a nice progress bar for the loops i use.

Unlike many other rental sites, this site uses some kind of javascript function that shows more and more posts as you scroll down the page. So, i would have to somehow simulate this scrolling. That’s what the loop at the end of the script is for. The line mybrowser$executeScript("scroll(0, 500000);") scrolls down all the way to the bottom of the page while the line Sys.sleep(3) waits three seconds for the posts to be generated before scrolling down again. This is done 100 times before the HTML of the page is extracted and stored in source.

The next step is to try and retrieve as much information from the posts that have been generated without actually having to navigate to each post.

#get rentals rates
source %>% html_nodes("#search-results") %>% 
  html_nodes("") %>% html_text() -> rentals

#get title of the post
source %>% html_nodes("#search-results") %>% 
  html_nodes(".listing-head") %>% html_text() -> title

#get the URL to the post
source %>% html_nodes("#search-results") %>% 
  html_nodes(".content-column") %>% html_nodes("a") %>% 
  html_attr("href") %>% unique() -> links

#remove redundant URLs
links[grep("/P", links)] %>% unique() -> links

There is some more information that i need but this data can’t be retrieved without navigating to each post.

#create an empty datafrmae to populated later
matrix(NA, length(links), 14) %>% -> df

names(df) <- c("title", "lat", "lon", "room_type", 
               "bathroom", "furnishing", "available", 
               "length_of_stay", "gender", "weekly_rent", 
               "bond", "bills", "internet", "parking")

#might need to remove this
  lstCharges <- list()
  lstDetails <- list()
  vecLat <- c()
  vecLon <- c()
  lstSource <- list()

#total for the progress bar
total <- length(links)

for(i in 1:length(links)){
  if(i == 1){
    pb <- tkProgressBar(title = "progress bar", min = 0,
                        max = total, width = 300)
  #navigate to the post URL
  try(mybrowser$navigate(paste("", links[i], sep = "")))
  #get HTML
  try(mybrowser$getPageSource()[[1]] %>% html() -> source)
  #check if a post has been deleted
  source %>% as.character() %>% 
    str_locate("longer available") %>% %>% c() %>% unique() -> logError
  #if the post has not been deleted...
      lat <- 0
      #extract latitude
      source %>% 
        html_nodes("#map-canvas") %>% 
        html_attr("data-latitude") -> lat
    if(length(lat) != 0){ #if lat retrieved is not empty
      vecLat[length(vecLat) + 1] <- lat
    }else{vecLat[length(vecLat) + 1] <- 0}
      lon <- 0
      #extract longitude
      source %>% 
        html_nodes("#map-canvas") %>% 
        html_attr("data-longitude") -> lon
    if(length(lon) != 0){ #if lon retrieved is not empty
      vecLon[length(vecLon) + 1] <- lon
    }else{vecLon[length(vecLon) + 1] <- 0}

      matDetails <- matrix(NA, 1, 1)
      #extract table showing Room Details
      source %>% 
        html_nodes("#property-view > div:nth-child(2) > > > div:nth-child(4) > div > div > div:nth-child(1) > table") %>% 
        html_table() -> matDetails
      if(check_error(matDetails[[1]]) != "error"){
        lstDetails[[length(lstDetails) + 1]] <- matDetails[[1]]
        lstDetails[[length(lstDetails) + 1]] <- NA
      matCharges <- matrix(NA, 1, 1)
      #extract table showing Charges for the room
      source %>% 
        html_nodes("#property-view > div:nth-child(2) > > > div:nth-child(4) > div > div > div:nth-child(2) > table") %>% 
        html_table() -> matCharges
      if(check_error(matCharges[[1]]) != "error"){
        lstCharges[[length(lstCharges) + 1]] <- matCharges[[1]]
        lstCharges[[length(lstCharges) + 1]] <- NA
    #if page is empty, populate with NA
      vecLat[length(vecLat) + 1] <- NA
      vecLon[length(vecLon) + 1] <- NA
      lstDetails[[length(lstDetails) + 1]] <- matrix(NA,1,1)
      lstCharges[[length(lstCharges) + 1]] <- matrix(NA,1,1)
  #progress bar
  setTkProgressBar(pb, i, label=paste(i, " out of ", total, sep = ""))
close(pb) #close progress bar
mybrowser$close()#close browser

#notification beeps using beepr package

Now for the cleaning of the data. There are quite a number of old posts on the site that did not have any basic details posted, so those need to be removed.

#posts with no details
nonNA <-

#remove all extractions with no details
lstDet_Clean <- lstDetails[!nonNA]

#filter charges table with according to no posts
lstCharges_Clean <- lstCharges[!nonNA] #oops! mistakenly removed original charges list
#filter lon, lat, and title by no post filter
vecLat_Clean <- vecLat[!nonNA]
vecLon_Clean <- vecLon[!nonNA]
title_Clean <- title[!nonNA]

#change dim of original df to accomodate no posts
df <- df[1:length(lstCharges_Clean),]

Now it’s just a matter of populating the empty dataframe, using the lists and vectors that were generated from the scraping script:

#start loop to populate empty dataframe
for(i in 1:nrow(df)){
  #title, lon, and lat
  df[i, "title"] <- title_Clean[i]
  df[i, "lat"] <- vecLat_Clean[i]
  df[i, "lon"] <- vecLon_Clean[i]
  lstCharges_Clean[[i]] -> tblCharges
  tolower(tblCharges[,1]) == "weekly rent" -> logRent
  if(sum(logRent) != 0){
    tblCharges[logRent,2] -> df[i, "weekly_rent"]
  tolower(tblCharges[,1]) == "internet" -> logRent
  if(sum(logRent) != 0){
    tblCharges[logRent,2] -> df[i, "internet"]
  tolower(tblCharges[,1]) == "parking" -> logRent
  if(sum(logRent) != 0){
    tblCharges[logRent,2] -> df[i, "parking"]
  tolower(tblCharges[,1]) == "bond" -> logRent
  if(sum(logRent) != 0){
    tblCharges[logRent,2] -> df[i, "bond"]
  tolower(tblCharges[,1]) == "bills" -> logRent
  if(sum(logRent) != 0){
    tblCharges[logRent,2] -> df[i, "bills"]
  lstDet_Clean[[i]] -> tblDetails
  #room type
  tolower(tblDetails[,1]) == "room type" -> logDetails
  if(sum(logDetails) != 0){
    tblDetails[logDetails,2] -> df[i, "room_type"]
  tolower(tblDetails[,1]) == "bathroom" -> logDetails
  if(sum(logDetails) != 0){
    tblDetails[logDetails,2] -> df[i, "bathroom"]
  tolower(tblDetails[,1]) == "furnishing" -> logDetails
  if(sum(logDetails) != 0){
    tblDetails[logDetails,2] -> df[i, "furnishing"]
  tolower(tblDetails[,1]) == "available" -> logDetails
  if(sum(logDetails) != 0){
    tblDetails[logDetails,2] -> df[i, "available"]
  #length of stay
  tolower(tblDetails[,1]) == "length of stay" -> logDetails
  if(sum(logDetails) != 0){
    tblDetails[logDetails,2] -> df[i, "length_of_stay"]
  tolower(tblDetails[,1]) == "gender" -> logDetails
  if(sum(logDetails) != 0){
    tblDetails[logDetails,2] -> df[i, "gender"]


All the information i needed to examine the rentals rates are in one dataframe. Now it’s a matter of plotting them, starting off with a break down of the number of posts. The next three plots will show:

1. The total number of posts per room type.
2. The total number of posts per room type and bathroom type
3. The total number of posts per room type, bathroom type, and gender.

As a side note, the “gender” field represents what sort of gender does the poster prefer to live with. The choices are male, female, couple, male/female but not couple, or anyone.

#Total number posts per room_type
ggplot(df, aes(room_type)) + 
  geom_bar(fill = "dark blue") + 
  theme_minimal() + 
  labs(x = "Room Type", 
       y = "Number of Posts") + 
  ggtitle("Number of Posts by Room Type")


#total number of posts per room type and bathroom type
ggplot(df, aes(room_type, fill = bathroom)) + 
  geom_bar(position = "dodge") + 
  theme_minimal() + 
  labs(x = "Room Type", 
       y = "Number of Posts", 
       fill = "Bathroom") + 
  ggtitle("Number of Posts by Room and Bathroom Type")


#total number of posts per room type, bathroom type, furnishing, and gender
ggplot(df, aes(room_type, fill = bathroom, color = furnishing)) + 
  geom_bar(width = 0.7, position = position_dodge(width = 0.9), 
           lwd = 0.7)+ 
  scale_color_brewer(palette = "Dark2") + 
  scale_fill_brewer(palette = "Set2") + 
  labs(x = "Room Type", 
       y = "Number of Posts", 
       color = "Furnishing",
       fill = "Bathroom") + 
  ggtitle("Number of Posts: By Room Type, Bathroom Type, Furnishing, and Gender") + 
  facet_wrap(~gender) + 
    panel.background = element_rect(fill = "grey"), 
    legend.position = c(0.85, 0.25)) + 
  scale_y_continuous(breaks = seq(0, max(df$weekly_rent), by = 100))


Click here to see the third plot a little clearer.

The obvious point you’d have to take away from this is that most people have no real preference when it comes to gender, given how the “Anyone Welcome” category is selected in most posts. If you’re looking to find a room to share with some one else, you might be out of luck due to the fact that a big portion of the total number of posts are about private rooms and not shared rooms, however most posts show that the bathrooms are shared.

The next two plots reveal the frequency distribution of the weekly rentals rates. The first plot will show the frequencies with all the posts in scope, while the second one only shows the weekly rent distribution for only shared room posts.

#Rental distribution
ggplot(df, aes(x=weekly_rent)) + 
  geom_histogram(aes(fill =..density..), 
                 binwidth = 50) + 
                      low = "green", 
                      high = "red") + 
  scale_x_continuous(breaks = seq(0, max(df$weekly_rent), by = 50)) + 
  labs(x = "Weekly Rent", y = "Frequency") + 
  guides(fill = FALSE) + ggtitle("Frequency Distribution of Rentals")


#Rental distribution for != private room
ggplot(filter(df, room_type != "Private room"), 
       aes(x=weekly_rent)) + 
  geom_histogram(aes(fill =..density..), 
                 binwidth = 20) + 
                      low = "green", 
                      high = "red") + 
  scale_x_continuous(breaks = seq(0, max(df$weekly_rent), by = 50)) + 
  labs(x = "Weekly Rent", 
       y = "Count") + 
  guides(fill = FALSE) + 
  ggtitle("Frequency Distribution of Rentals: Excluding Private Rooms")


The weekly rental range for most posts fall under the A$200 to A$300 range. If i isolate only the non-private rooms, the range drops down to A$150 to $200. I think a boxplot should give us a more clearer picture on the median weekly rental rate for each kind of post. The following plots show the median rentals and range in the following order:

1. By room type
2. By room type, bathroom type, and gender
3. By room type, bathroom type, gender, and furnishing

#Average rental per room type.
ggplot(df, aes(room_type, weekly_rent)) + geom_boxplot(fill = "yellow") + 
  labs(x = "Room Type", y = "Weekly Rent", title = "Median Rentals by Room Type") + 
  theme(panel.background = element_rect(fill = "grey")) + 


#Average rental per room type, bathroom, and gender
ggplot(df, aes(room_type, weekly_rent, fill = bathroom)) + geom_boxplot() + 
  labs(x = "Room Type", y = "Weekly Rent", 
       title = "Median Rentals: Room Type VS Bathroom Type VS Gender", 
       fill = "Bathroom") + 
  theme(panel.background = element_rect(fill = "grey"), 
    legend.position = c(0.85, 0.25)) + 
  ylim(0,600) + facet_wrap(~gender)


Higher resolution: Median Rentals: Room Type VS Bathroom Type VS Gender.

#Average rental (y) per room type (x), bathroom (fill), gender (facet), and furnishing (color)
ggplot(df, aes(room_type, weekly_rent, fill = bathroom, color = furnishing)) + 
  geom_boxplot(lwd = 0.7, 
               fatten = 1, 
               position = position_dodge(1)) + 
  labs(x = "Room Type", y = "Weekly Rent", 
       title = "Median Rentals: Room Type VS Bathroom Type VS Gender VS Furnishing", 
       fill = "Bathroom", 
       color = "Furnishing") + 
  theme(panel.background = element_rect(fill = "#f2f2f2"), 
        legend.position = c(0.85, 0.25)) + 
  ylim(0,600) + facet_wrap(~gender) + 
  scale_color_brewer(palette = "Set1") + 
  scale_fill_brewer(palette = "Pastel2") 


Higher resolution: Median Rentals: Room Type VS Bathroom Type VS Gender VS Furnishing.

You might be able to save a decent amount of money if you go for a private room with a shared bathroom rather than a room with an ensuite/private bathroom. If a private room is a little pricy, you can go for a shared room with an ensuite bathroom, since a shared bathroom is going to cost roughly the same amount. Judging from the third plot, the median weekly rental for most private rooms that are furnished where you get to have your own bathroom float around the A$300 to A$400 range. The best deal out there seems to be private room posts where the gender is tagged as “Anyone welcome”, the furnishing is “flexible with furnishing”, and the bathroom is “Own bathroom”.

There is one other point though, and that’s the question of location. I honestly don’t think that the most ideal room will cost the same if you compare between one near the city center and one that is in the outskirts. So i’m guessing there must be some kind of middle ground where you can get what you want for a reasonable rate.

First off, a rough density plot showing where most of the posts are concentrated.

#Density plot, total
SYD_map <- ggmap::get_map(location = "Sydney, Australia", 
                   source = "google", maptype = "roadmap", 
                   crop = FALSE, color = "bw", zoom = 11)

ggmap::ggmap(SYD_map) +
  stat_density2d(data=df, aes(x = lon, y = lat,
                              fill = ..level..,
                              alpha = ..level..), geom = "polygon") + 
  scale_fill_gradient(low = "#3BE819", high = "#B5170B") +  
  theme(axis.text=element_blank(), panel.grid=element_blank(), axis.title=element_blank(),
        axis.ticks.x=element_blank(), axis.ticks.y=element_blank(), legend.position="none",
        panel.background=element_blank(), plot.title = element_text(face="bold", colour="black")) +  
  labs(title = "Density plot of posts") +  


It looks like most of the posts are around the University of Sydney, Univeristy of Technology Sydney, and Surry Hills. However, the issue with a density plot using ggmaps is that i can’t break down this concentration. But you could do that using the Leaflet package.

#Leaflet cluster map
leaflet(df) %>% 
  addProviderTiles("CartoDB.DarkMatter") %>% 
    lng = ~as.numeric(df$lon), 
    lat = ~as.numeric(df$lat), 
    clusterOptions = markerClusterOptions(), 
    radius = 3
  ) %>% 
  htmlwidgets::saveWidget(file = "Leaflet_Sydney.html", 
                          selfcontained = TRUE)

Bigger version: Leaflet Cluster Map.

If you’re not too familiar with a leaflet cluster map, you can perhaps try clicking on any of the clusters. You’ll notice that the clusters are then broken down further. It’s pretty obvious now that most of the posts are concentrated in the polygon shaped section making up Darlinghurst, Paddington, Randwick, Kensington, Mascot, Newtown, and Campterdown. To see if there really is any real difference between two posts with identical details but different locations, it’s best to just plot all the posts but map the weekly rental amount to a spectrum of colors.

#Leaflet circle markers mapped to rentals, plus html popup
filter(df, weekly_rent <= 700) -> df_new

#create column showing html popup details
      paste("Location: ", df_new$title, sep = ""),
      paste("Room Type: ", df_new$room_type, sep = ""),
      paste("Bathroom: ", df_new$bathroom, sep = ""),
      paste("Furnishing: ", df_new$furnishing, sep = ""),
      paste("Internet: ", df_new$internet, sep = ""),
      paste("Parking: ", df_new$parking, sep = ""),
      paste("Preference: ", df_new$gender, sep = ""),
      paste("Bills: ", df_new$bills, sep = ""),
      paste("Bond: ", df_new$bond, sep = ""),
      paste("Weekly Rent: ", df_new$weekly_rent, sep = ""),
      paste("Availability: ", df_new$available, sep = ""),
sep = "
") -> df_new$hPop colorNumeric(c("#ff00ee", "#00e5ff", "#ffff00"), domain = df_new$weekly_rent) -> pal leaflet(df_new) %>% addProviderTiles("CartoDB.DarkMatter") %>% #Greyscale map addLegend("bottomright", pal = pal, values = ~weekly_rent, title = "Rental Price", labFormat = labelFormat(prefix = "AUD "), opacity = 1) %>% addCircleMarkers(lng = ~lon, lat = ~lat, #Latitudes and Longitudes radius = ~ifelse(weekly_rent <= 500, 4, 3), #Size of circles dependent on rental amount color = ~pal(weekly_rent), #color mapped to rental amount stroke = FALSE, fillOpacity = 0.5, popup = ~hPop) %>% htmlwidgets::saveWidget(file = "Leaflet_Sydney_Rentals.html", selfcontained = TRUE)

Bigger version: Leaflet Circle Makers with HTML Popup.

It would appear the suspicion that certain locations are more expensive than others could very well be true. If we start from the left and move on to the right towards the city center, the weekly rental amount gradually rises. This information, along with the average plots i made earlier, are normally enough for me to make an informed decision on what sort of posts should i be focusing on. But i’ve recently become more and more interested in plotting polylines. The reason being that it would be interesting to see if driving distance can be added in to the list of variables to be considered when moving to a place.

#Leaflet driving distances

#Location of company's offices
origin_lat <- -33.872795
origin_lon <- 151.207307

#list to be populated later
gResponses <- list()

#get data for each lat/lon in the dataframe
for(i in 1:nrow(df_new)){
googleway::google_directions(origin = c(origin_lat, origin_lon), 
                             destination = c(df_new[i, c("lat")], 
                                             df_new[i, "lon"]), 
                             key = Key, 
                             mode = "driving", 
                             simplify = TRUE) -> gResponses[[i]]

OK_Resp <- function(x){
  lstReturn <- list()
  for(i in 1:length(x)){
    if(x[[i]]$status == "OK"){
      x[[i]] -> lstReturn[[length(lstReturn) + 1]]

#extract only responses with a status OK
OK_Resp(gResponses) -> gResp_OK

#filter the dataframe with status OK
df_new[logResp,] -> df_new_plots

#extract duration from API response
       function(x){x$routes$legs[[1]]$duration$value/60}) -> duration

#extract distance from API response
       function(x){x$routes$legs[[1]]$distance$value/1000}) -> distance

#Add distance and duration to the data frame
df_new_plots$duration <- unlist(duration)
df_new_plots$distance <- unlist(distance)

#function to get only the polylines from the response
getPoly <- function(list_element){

#compile all polylines to one vector
lapply(gResp_OK, FUN = getPoly) %>% 
  unlist() %>% as.character() -> lstPoly

#decode all polylines and compile to list
lapply(lstPoly, FUN = decode_pl) -> lstCoords

data.table::rbindlist(lstCoords, idcol = "id") -> dfPoly

#into list with unique IDs
lst_lines <- lapply(unique(dfPoly$id), function(x){
  #Comment from stack over flow user:
  #"the order of the 'lon' and 'lat' fields is important"
  sp::Lines(sp::Line(dfPoly[id == x, .(lon, lat)]), ID = x)

spl_lst <- sp::SpatialLines(lst_lines)

filter(df_new_plots, weekly_rent <= 400) -> df_new_plots

#Map distances to colors
colorNumeric(c("#000000", "#faff00","#FF0000", "#17129e"), 
             domain = df_new_plots$distance,
             alpha = FALSE) -> pal

#Map rental rates to colors
colorNumeric(c("#ff00ee", "#00e5ff", "#ffff00"), 
             domain = df_new_plots$weekly_rent,
             alpha = FALSE) -> pal_rent

leaflet(spl_lst) %>%
  addProviderTiles("CartoDB.DarkMatterNoLabels") %>%
  addPolylines(opacity = 0.2, weight = 3, 
               color = pal(df_new_plots$distance)) %>% 
  addCircleMarkers(lng = df_new_plots$lon, 
                   lat = df_new_plots$lat, 
                   color = pal_rent(df_new_plots$weekly_rent), 
                   radius = 2, popup = df_new_plots$hPop) %>% 
  #legend for the rental amounts on the bottom right
  addLegend("bottomright", pal = pal_rent, 
            values = df_new_plots$weekly_rent, 
            title = "Rental Amount", 
            labFormat = labelFormat(prefix = "AUD ")) %>% 
  #legend for the driving distances on the top right
  addLegend("topright", pal = pal,
            values = df_new_plots$distance,
            title = "Driving Distance", 
            labFormat = labelFormat(suffix = " Kilometers")) %>% 
  htmlwidgets::saveWidget(file = "Leaflet_Sydney_Rent_Driving.html", 
                          selfcontained = FALSE)

Bigger version: Leaflet Circle Makers with HTML Popup and Polylines.

I still think that the map seems a bit too noisy, although it looks a little better if you zoom in closer. But regardless of how you look at this map or the data on the rental averages in this city, you’ll have to come to one conclusion, especially if you’re someone who’s living in Malaysia like myself. Rent in Sydney is just too damn high! Someone needs to contact Jimmy McMillan to open an Australian chapter of his famously hilarious, but very relatable political party.

As always, if you’d like a copy of the data set, do get in touch and i’ll try to send you a copy of the CSV file. 🙂

To leave a comment for the author, please follow the link and comment on their blog: r – Recommended Texts. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.


Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)