
Using R to refine the search results of www.finn.no


www.finn.no is one of the most popular websites in Norway. It offers many services, such as booking flight tickets, finding jobs, and renting or buying houses, cars and other property. I have some experience with it: I have sold and bought cars, an apartment and some other things, and it is very convenient. One thing, however, is not convenient: when you search for a house or apartment for sale, there is no option for the year it was built. To me that matters, because a newly built home generally has a sensible layout, low energy consumption and better comfort. It would be efficient to automatically extract the ads for homes built in a specific year range, e.g. 2000-2010, and display them, like the following:

[Interactive Google Map of the matching ads, generated in R 2.15.0 with the googleVis 0.3.0 package]


My idea is:

  1. Use the “advanced search” option to find the house/apartment ads that satisfy certain conditions, such as region, price, type, size, and number of bedrooms.

  2. Download these ads and extract the year each house/apartment was built, together with other interesting information such as price, size and address.

  3. Select the ads whose build year falls in a specific range, e.g. 2000-2010, and use the Google Geocoding API to find the geographical location of each address.

  4. Display the results on a Google Map using the R package googleVis.

  5. Create a .bat file with content like R CMD BATCH C:\myRscript.R and add it to the Windows Task Scheduler, so that the script runs at a specified interval, e.g. once per week (see the sketch after this list).
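
For step 5, here is a minimal sketch of such a batch file and of registering it from the command line. The file name C:\runFinn.bat, the task name, the schedule and the path to R.exe are placeholders to adjust for your installation:

REM C:\runFinn.bat: run the scraper non-interactively
"C:\Program Files\R\R-2.15.0\bin\R.exe" CMD BATCH C:\myRscript.R

REM Register the batch file to run every Monday at 08:00:
schtasks /create /tn "FinnHouse" /tr C:\runFinn.bat /sc weekly /d MON /st 08:00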

The following is my R code:

## Paste the URL of your search result
url <- "http://www.finn.no/finn/realestate/homes/result?keyword=&PRICE_FROM=&PRICE_TO=5000000&ESTATE_SIZE%2FLIVING_AREA_FROM=80&ESTATE_SIZE%2FLIVING_AREA_TO=&areaId=20045&areaId=20046&NO_OF_BEDROOMS=3&PLOT%2FAREARANGE_FROM=&PLOT%2FAREARANGE_TO=&rows=50&sort=1"
## If there is no "page" parameter in the URL (the default), add it.
if (!grepl("page=[[:digit:]]+", url)) {
  url <- paste(url, "page=1", sep = "&")
}
## Load the libraries needed
library(RCurl)
library(googleVis)
library(RgoogleMaps)
## Function that extracts the balanced XML/HTML fragment around each
## occurrence of the pattern `ptn`.
xml.tag <- function(xml = xml, tag.1 = "<div", tag.2 = "</div>", ptn = "mod mtn mhn mbs") {
  ind.1 <- data.frame(id = gregexpr(tag.1, xml)[[1]], v =  1)
  ind.2 <- data.frame(id = gregexpr(tag.2, xml)[[1]], v = -1)
  ind.3 <- rbind(ind.1, ind.2)
  ind.3 <- ind.3[order(ind.3$id), ]
  pos <- data.frame(id = gregexpr(ptn, xml)[[1]], start = NA, end = NA)
  for (p in 1:nrow(pos)) {
    ind <- ind.3[length(which(ind.3$id < pos$id[p])):nrow(ind.3), ]
    m <- i <- 1
    repeat {
      i <- i + 1
      m <- m + ind$v[i]
      if (m == 0) break
    }
    pos$start[p] <- ind$id[1]
    pos$end[p]   <- ind$id[i] + nchar(tag.2)
  }
  tag <- rep(NA, nrow(pos))
  for (i in 1:length(tag)) tag[i] <- substr(xml, pos$start[i], pos$end[i])
  return(tag)
}
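## How xml.tag() works: it records the position of every opening tag
## (tag.1, weight +1) and closing tag (tag.2, weight -1), then for each
## match of `ptn` walks forward until the weights cancel out, i.e. until
## the <div> containing the pattern is closed, and returns that fragment.
## An illustrative standalone call (using the default `ptn`):
##   ads <- xml.tag(xml = getURL(url), ptn = "mod mtn mhn mbs")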

<h2 id="downlaod-each-ad">Downlaod each ad;</h2>
<p>xml &lt;- getURL(url)
n &lt;- as.numeric(regmatches(xml, regexec(“resultlist-counter"&gt;([0-9]+)&lt;”, xml))[[1]][2])
Res &lt;- NULL
for (pg in 1:ceiling(n/50)) { print(pg)
  url.pg &lt;- gsub(“page=[[:digit:]]+”, paste(“page”, pg, sep = “=”), url)
  xml &lt;- xml.tag(xml = getURL(url.pg))
  # Transform html entity characters to displaying characters;
  xml &lt;- gsub(“\n|\t|\v”, “”, xml)
  xml <- gsub("&nbsp;|&#160;", " ", xml)
  xml <- gsub("&amp;", "&", xml)
  xml <- gsub("&ldquo;", "'", xml)
  xml <- gsub("&rdquo;", "'", xml)
  xml <- gsub("&sup2;", "2", xml)
  xml <- gsub("&#39;", "'", xml)
  # Create a data frame holding the information from one result page
  res <- data.frame(Size = rep(NA, length(xml)), Price = NA, Addr = NA, Img = NA, Title = NA, Link = NA, Year = NA)
  for (i in 1:nrow(res)) {
    # HTML fragments holding the key info (size, price), image and address
    mbm <- xml.tag(xml = xml[i], ptn = "line mbl")
    mbm.Img <- xml.tag(xml = xml[i], ptn = "img")[1]
    mbm.Add <- xml.tag(xml = xml[i], ptn = "unit size1of2 neutral")
    mbm.Size <- xml.tag(xml = mbm, ptn = "unit size1of3 keyinfo")[1]
    mbm.Price <- xml.tag(xml = mbm, ptn = "unit size1of3 lastUnit keyinfo")
    ## Strip the markup and keep the text
    Size <- gsub("^ +| +$", "", paste(regmatches(mbm.Size, gregexpr("<.*?>", mbm.Size), invert = T)[[1]], collapse = ""))
    Price <- gsub("^ +| +$", "", paste(regmatches(mbm.Price, gregexpr("<.*?>", mbm.Price), invert = T)[[1]], collapse = ""))
    Link <- regmatches(mbm.Img, regexec("(http.*?)\"", mbm.Img))[[1]][2]
    Img <- regmatches(mbm.Img, regexec("<img src=\"(.*?)\"", mbm.Img))[[1]][2]
    Addr <- grep("[[:alnum:]]", regmatches(mbm.Add, gregexpr("<.*?>", mbm.Add), invert = T)[[1]], value = TRUE)[3]
    if (is.na(Addr))
      Addr <- grep("[[:alnum:]]", regmatches(mbm.Add, gregexpr("<.*?>", mbm.Add), invert = T)[[1]], value = TRUE)[2]
    # Fetch the ad page itself to get the build year ("Byggeår")
    xml.ad <- getURL(url = Link)
    Year <- regmatches(xml.ad, regexec("<dt>Bygge.r</dt>.*?<dd>([[:digit:]]{4})</dd>", xml.ad))[[1]][2]
    Title <- gsub("^ +| +$", "", paste(regmatches(mbm.Img, gregexpr("<.*?>", mbm.Img), invert = T)[[1]], collapse = ""))
    # Store the extracted information
    res$Size[i] <- Size
    res$Price[i] <- gsub("\u00a0|fra|til", "", Price)  # drop "fra"/"til" and stray non-breaking spaces
    res$Title[i] <- Title
    res$Img[i] <- Img
    res$Addr[i] <- Addr
    res$Link[i] <- Link
    res$Year[i] <- Year
  }
  Res <- rbind(Res, res)
}
## Keep only the ads built in the chosen year range, e.g. 2000-2010
Res$Year <- as.numeric(Res$Year)
Res <- Res[Res$Year >= 2000 & Res$Year <= 2010 & !is.na(Res$Year), ]
## Geocode the addresses using the Google Geocoding API
if (nrow(Res) > 0) {
  gapi <- "http://maps.googleapis.com/maps/api/geocode/xml?sensor=false&address="
  for (i in 1:nrow(Res)) { print(i)
    # URL-encode spaces and transliterate Norwegian letters
    url <- gsub(" ", "%20", paste(gapi, paste(Res$Addr[i], "Norway", sep = " "), sep = ""))
    url <- gsub("Å|å", "a", url)
    url <- gsub("Ø|ø", "o", url)
    url <- gsub("Æ|æ", "ae", url)
    # Be polite to the API: pause between requests
    xml <- getURL(url); Sys.sleep(.5)
    Res$Lon[i] <- as.numeric(regmatches(xml, regexec("<lng>(.+?)</lng>", xml))[[1]][2])
    Res$Lat[i] <- as.numeric(regmatches(xml, regexec("<lat>(.+?)</lat>", xml))[[1]][2])
  }
  Res <- Res[!is.na(Res$Lat), ]
  Res$LatLong <- paste(Res$Lat, Res$Lon, sep = ":")
  # Tooltip: a linked thumbnail followed by title, size, price and year
  Res$Tip <- paste("<a href=", Res$Link, "><img src=", Res$Img, " /></a>", sep = "\"")
  Res$Tip <- paste(Res$Tip, Res$Title, Res$Size, Res$Price, Res$Year, sep = "<br />")
  M <- gvisMap(Res, "LatLong", "Tip",
               options = list(showTip = TRUE, enableScrollWheel = TRUE,
                              mapType = "hybrid", useMapTypeControl = TRUE,
                              width = 800, height = 400))

  cat(M$html$chart, file = "c:/gmap.html")
  browseURL("c:/gmap.html")
}
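
A side note on the last two lines: a googleVis chart object can also be rendered directly with its plot() method, which writes a complete standalone HTML page to a temporary file and opens it in the browser:

plot(M)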

Enjoy!
