Using R to refine the search results of www.finn.no

December 12, 2012

(This article was first published on Category: R | Huidong Tian's Blog, and kindly contributed to R-bloggers)

www.finn.no is one of the most popular websites in Norway. It offers many services, such as booking flights, finding jobs, and renting or buying houses, cars, and other property. I have some experience with it myself: I have bought and sold cars, an apartment, and various other things, and it is very convenient. One thing is missing, though: when you search for a house or apartment for sale, there is no option to filter by the year it was built. To me that matters, because a newly built home generally has a sensible layout, low energy consumption, and more comfort. It would be efficient to automatically extract the ads for homes built in a particular year range, e.g. 2000-2010, and display them, like the following:

My idea is:

  1. Use the “advanced search” option to find the house/apartment ads that match some conditions, such as region, price, type, size, and number of bedrooms.

  2. Download these ads and extract the year each house/apartment was built, together with other interesting information such as price, size, and address.

  3. Select the ads that fall in a specific build-year range, e.g. 2000-2010, and use the Google Geocoding API to find the geographical location of each address.

  4. Display the result on a Google Map using the R package googleVis.

  5. Create a .bat file with content like R CMD BATCH C:\myRscript.R, and add it to the Windows Task Scheduler so the script runs automatically at a set interval, e.g. once per week (a sketch follows this list).
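
For step 5, here is a minimal sketch of the batch file and of registering it with the Task Scheduler from a command prompt. The R version and all paths are only examples; adjust them to your installation:

REM finn.bat: run the scraper script non-interactively
"C:\Program Files\R\R-2.15.2\bin\R.exe" CMD BATCH C:\myRscript.R

schtasks /create /tn FinnHouse /tr C:\finn.bat /sc weekly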

The following is my R code:

## Paste the URL of your search result
url <- "http://www.finn.no/finn/realestate/homes/result?keyword=&PRICE_FROM=&PRICE_TO=5000000&ESTATE_SIZE%2FLIVING_AREA_FROM=80&ESTATE_SIZE%2FLIVING_AREA_TO=&areaId=20045&areaId=20046&NO_OF_BEDROOMS=3&PLOT%2FAREARANGE_FROM=&PLOT%2FAREARANGE_TO=&rows=50&sort=1"
## If there is no "page" parameter in the URL (the default), add it.
if (!grepl("page=[[:digit:]]+", url)) {
  url <- paste(url, "page=1", sep = "&")
}
## Load libraries needed
library(RCurl)
library(googleVis)
library(RgoogleMaps)
## Create a function that cuts out the HTML fragments of interest: it finds
## each occurrence of the class pattern 'ptn', then walks the surrounding
## opening/closing tags to locate the <div> that encloses it.
xml.tag <- function(xml = xml, tag.1 = "<div", tag.2 = "</div>", ptn = "mod mtn mhn mbs") {
  ind.1 <- data.frame(id = gregexpr(tag.1, xml)[[1]], v =  1)  # opening tags count +1
  ind.2 <- data.frame(id = gregexpr(tag.2, xml)[[1]], v = -1)  # closing tags count -1
  ind.3 <- rbind(ind.1, ind.2)
  ind.3 <- ind.3[order(ind.3$id), ]
  pos <- data.frame(id = gregexpr(ptn, xml)[[1]], start = NA, end = NA)
  for (p in 1:nrow(pos)) {
    # Start from the last opening tag before the pattern occurrence;
    ind <- ind.3[length(which(ind.3$id < pos$id[p])):nrow(ind.3), ]
    m <- i <- 1
    # Walk forward, tracking tag depth, until the enclosing tag is closed;
    repeat {
      i <- i + 1
      m <- m + ind$v[i]
      if (m == 0) break
    }
    pos$start[p] <- ind$id[1]
    pos$end[p]   <- ind$id[i] + nchar(tag.2) - 1  # end exactly at the closing tag
  }
  tag <- rep(NA, nrow(pos))
  for (i in 1:length(tag)) tag[i] <- substr(xml, pos$start[i], pos$end[i])
  return(tag)
}
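## (Aside, not part of the original script: a toy check of xml.tag on a
## hypothetical snippet, to show what the depth-counting returns. It should
## give back the <div> carrying the target class, nested children included.)
## snippet <- '<div class="outer"><div class="mod mtn mhn mbs"><div>inner</div></div></div>'
## xml.tag(xml = snippet)
## [1] "<div class=\"mod mtn mhn mbs\"><div>inner</div></div>"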

<h2 id="downlaod-each-ad">Downlaod each ad;</h2>
<p>xml &lt;- getURL(url)
n &lt;- as.numeric(regmatches(xml, regexec(“resultlist-counter"&gt;([0-9]+)&lt;”, xml))[[1]][2])
Res &lt;- NULL
for (pg in 1:ceiling(n/50)) { print(pg)
  url.pg &lt;- gsub(“page=[[:digit:]]+”, paste(“page”, pg, sep = “=”), url)
  xml &lt;- xml.tag(xml = getURL(url.pg))
  # Transform html entity characters to displaying characters;
  xml &lt;- gsub(“\n|\t|\v”, “”, xml)
  xml &lt;- gsub(“ | ”,  “ “, xml)
  xml &lt;- gsub(“&amp;”,   “&amp;”, xml)
  xml &lt;- gsub(““”,  “’”, xml)
  xml &lt;- gsub(“””, “’”, xml)
  xml &lt;- gsub(“²”,  “2”, xml)
  xml &lt;- gsub(“'”,  “’”, xml)
  # Create a data frame for holding the information for one web page;
  res &lt;- data.frame(Size = rep(NA, length(xml)), Price = NA, Addr = NA, Img = NA, Title = NA, Link = NA, Year = NA)
  for (i in 1:nrow(res)) {
    # xml fragment for rome Size and Price per month;
    mbm &lt;- xml.tag(xml = xml[i], ptn = “line mbl”)
    mbm.Img &lt;- xml.tag(xml = xml[i], ptn = “img”)[1]
    mbm.Add &lt;- xml.tag(xml = xml[i], ptn = “unit size1of2 neutral”)
    mbm.Size &lt;- xml.tag(xml = mbm, ptn = “unit size1of3 keyinfo”)[1]
    mbm.Price &lt;- xml.tag(xml = mbm, ptn = “unit size1of3 lastUnit keyinfo”)
    ## XML containing special data
    Size &lt;- gsub(“^ +| +$”, “”, paste(regmatches(mbm.Size, gregexpr(“&lt;.<em>?&gt;”, mbm.Size), invert = T)[[1]], collapse = “”))
    Price &lt;- gsub(“^ +| +$”, “”, paste(regmatches(mbm.Price, gregexpr(“&lt;.</em>?&gt;”, mbm.Price), invert = T)[[1]], collapse = “”))
    Link &lt;- regmatches(mbm.Img, regexec(“(http.<em>?)", mbm.Img))[[1]][2]
    Img &lt;- regmatches(mbm.Img, regexec(&lt;img src="(.</em>?)", mbm.Img))[[1]][2]
    Addr &lt;- grep([[:alnum:]], regmatches(mbm.Add, gregexpr(&lt;.<em>?&gt;, mbm.Add), invert = T)[[1]], value = TRUE)[3]
    if (is.na(Addr))
    Addr &lt;- grep([[:alnum:]], regmatches(mbm.Add, gregexpr(&lt;.</em>?&gt;, mbm.Add), invert = T)[[1]], value = TRUE)[2]
    xml.ad &lt;- getURL(url = Link)
    Year &lt;- regmatches(xml.ad, regexec(&lt;dt&gt;Bygge.r&lt;/dt&gt;.<em>?&lt;dd&gt;([[:digit:]]{4})&lt;/dd&gt;, xml.ad))[[1]][2]
    Title &lt;- gsub(^ +| +$, “”, paste(regmatches(mbm.Img, gregexpr(&lt;.</em>?&gt;, mbm.Img), invert = T)[[1]], collapse = “”))
    # Extract useful information;
    res$Size[i] &lt;- Size
    res$Price[i] &lt;- gsub(“?|fra|til”, “”, Price)
    res$Title[i] &lt;- Title
    res$Img[i] &lt;- Img
    res$Addr[i] &lt;- Addr
    res$Link[i] &lt;- Link
    res$Year[i] &lt;- Year
  }
  Res &lt;- rbind(Res, res)
}
Res &lt;- Res[Res$Year &gt;= 2000 &amp; Res$Year =&lt; 2010 &amp; !is.na(Res$Year),]
## Geocoding the post nr. of Oslo using Google Geocoding API;
if (nrow(Res) &gt; 0) {
  gapi &lt;- “http://maps.googleapis.com/maps/api/geocode/xml?sensor=false&amp;address=  for (i in 1:nrow(Res)) { print(i)
    url &lt;- gsub(“ “,%20, paste(gapi, paste(Res$Addr[i], “Norway”, sep = “ “), sep = “”))
    url &lt;-  gsub(“Å|å”, “a”, url)
    url &lt;-  gsub(“Ø|ø”, “o”, url)
    url &lt;-  gsub(“Æ|æ”, “ae”, url)
    xml &lt;- getURL(url); Sys.sleep(.5)
    Res$Lon[i] &lt;- as.numeric(regmatches(xml, regexec(<lng>(.+?)</lng>, xml))[[1]][2])
    Res$Lat[i] &lt;- as.numeric(regmatches(xml, regexec(<lat>(.+?)</lat>, xml))[[1]][2])
  }
  Res &lt;- Res[!is.na(Res$Lat),]
  Res$LatLong &lt;- paste(Res$Lat, Res$Lon, sep = “:”)
  Res$Tip &lt;- paste(<a href=", Res$Link, "><img src=", Res$Img, " /></a>, sep ="”)
  Res$Tip &lt;- paste(Res$Tip, Res$Title, Res$Size, Res$Price, Res$Year, sep =<br />)
  M &lt;- gvisMap(Res, “LatLong” , “Tip”,
                options=list(showTip=TRUE, enableScrollWheel=TRUE,
                             mapType=’hybrid’, useMapTypeControl=TRUE,
                             width=800,height=400))</p>

<p>cat(M$html$chart, file = “c:/gmap.html”)
  browseURL( “c:/gmap.html”)
}
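
If the map comes up empty, the geocoding step can be checked in isolation by firing a single request by hand. A minimal sketch using the same API call as above (the address is only an example):

library(RCurl)
gapi <- "http://maps.googleapis.com/maps/api/geocode/xml?sensor=false&address="
xml  <- getURL(paste(gapi, "Karl%20Johans%20gate%20Oslo%20Norway", sep = ""))
## Latitude of the first match; for central Oslo this should be about 59.9
as.numeric(regmatches(xml, regexec("<lat>(.+?)</lat>", xml))[[1]][2])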

Enjoy!

To leave a comment for the author, please follow the link and comment on his blog: Category: R | Huidong Tian's Blog.
