How to buy a used car with R (part 2)

[This article was first published on Dan Knoepfle's Blog, and kindly contributed to R-bloggers.]

Continued from Part 1.

Part 2: Digging into the Kelley Blue Book

The only thing better than a bit of data is a lot of data. Now that we can grab KBB values for a given trim of a given model in a given year, we set our ambitions higher: automating the collection of these values for all trims of a model over a set of years. To do so, let’s back up and recall how we got to the KBB results page:

Let’s suppose we’re still set on the Honda Accord and are considering the last ten model years. Going with “Search by: Year, Make & Model”, we get to the following self-explanatory screen:

[Screenshot: kbb2.png]

Choosing (2005, Honda, Accord) pushes us to the following address: http://www.kbb.com/used-cars/honda/accord/2005/. There, we are reminded that the KBB reports different values for retail, certified retail, private sellers, and trade-ins:

[Screenshot: kbb4.png]

Let’s go with “Private Party Value” for now; we end up at http://www.kbb.com/used-cars/honda/accord/2005/private-party-value. We’re now presented with a plethora of different trims, enough to make us nostalgic for Henry Ford:

[Screenshot: kbb5.png]

Start with the “DX Sedan 4D”. We arrive at http://www.kbb.com/used-cars/honda/accord/2005/private-party-value/equipment?id=846. If the previous screen didn’t freak us out, this one definitely should. But if we ignore the options at the bottom (which are set to their standard values for the given model year and trim), we’re left with the important parameters: the choice of automatic or manual transmission, the mileage, and the ZIP code (which I’ll discuss later).

I can’t drive stick, so I’m not particularly worried about changing the transmission from its default of Automatic. But if you wanted to, note that choosing Automatic with default options and 10,000 miles pushes you to http://www.kbb.com/used-cars/honda/accord/2005/private-party-value/condition?id=846&mileage=10000 whereas choosing Manual, 5-Spd with the same options and mileage gives http://www.kbb.com/used-cars/honda/accord/2005/private-party-value/condition?id=846&equipment=35014|true&mileage=10000.

[Screenshot: kbb6.png]

Either way, we end up at a completely pointless page: no matter what you select, the results page gives values for all conditions.

[Screenshot: kbb7.png]

Say we select “Good”. The results page for the Automatic is located at http://www.kbb.com/used-cars/honda/accord/2005/private-party-value/pricing-report?condition=good&id=846&mileage=10000 and the results page for the Manual, 5-Spd is located at http://www.kbb.com/used-cars/honda/accord/2005/private-party-value/pricing-report?condition=good&id=846&equipment=35014|true&mileage=10000. If we want, we can tear off the “condition” field, in which case the default condition, Excellent, is highlighted.
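Since the URL scheme is this regular, the addresses can be assembled in R rather than by clicking. Here is a minimal sketch of that idea. The helper kbbPricingURL and its argument names are made up for illustration (they don't come from any package), and the whole thing assumes the URL pattern observed above keeps holding.

```r
## Hypothetical helper (not from any package): assemble a pricing-report
## URL from the pieces identified above. 'condition' and 'equipment' are
## optional and are simply left out of the query string when NULL.
kbbPricingURL <- function(prefix, year, id, mileage,
                          type = "private-party-value",
                          condition = NULL, equipment = NULL) {
  ## query fields, in the same order as in the URLs shown above
  fields <- c(
    if (!is.null(condition)) paste0("condition=", condition),
    sprintf("id=%d", as.integer(id)),
    if (!is.null(equipment)) paste0("equipment=", equipment),
    sprintf("mileage=%d", as.integer(mileage))
  )
  sprintf("%s%d/%s/pricing-report?%s",
          prefix, as.integer(year), type, paste(fields, collapse = "&"))
}

accord.prefix <- "http://www.kbb.com/used-cars/honda/accord/"

## Automatic (default equipment), "Good" condition, 10,000 miles
auto.url <- kbbPricingURL(accord.prefix, 2005, id = 846, mileage = 10000,
                          condition = "good")

## Manual, 5-Spd: same thing plus the equipment field
manual.url <- kbbPricingURL(accord.prefix, 2005, id = 846, mileage = 10000,
                            condition = "good", equipment = "35014|true")

print(auto.url)
print(manual.url)
```

Both results should reproduce the two addresses quoted above character for character; leaving out condition gives the version that falls back to the default.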

So, if we want to grab results for a bunch of different years and trims, we need to figure out the id=846 part of the URL (and possibly the equipment=35014|true part if we’re after a manual transmission). Again, it’s time for Firebug. Back up to the trim selection page at http://www.kbb.com/used-cars/honda/accord/2005/private-party-value and load up Firebug. If we examine the page, we see that the links for the available trims are all contained within a div with id='UCPathTrim'.

The next step is to write some R code to parse the trim selection page and pull out the available trims and their corresponding id values. This will make use of some of the core functionality of the XML package.

The XML package and HTML documents

In the last post, we used the function readHTMLTable from the XML package to read the results from a webpage into an R data.frame. At the time, there was little mention of the technical details; now, we’re moving beyond convenient functions and into the great unknown.

The XML package, written by Professor Duncan Temple Lang of UC Davis, is a wrapper for libxml2. The package website, hosted by The Omega Project for Statistical Computing, is at http://www.omegahat.org/RSXML/, and the package listing on CRAN is located at http://cran.r-project.org/web/packages/XML/index.html.

At its core, the XML package is meant for parsing XML and HTML documents into tree structures and for selecting, extracting, or otherwise manipulating branches or nodes of those trees. Take a look at the HTML tab of Firebug again (on http://www.kbb.com/used-cars/honda/accord/2005/private-party-value), and note that the webpage consists of a tree of HTML tags. At its root, there’s an html node, with children head and body; within the body branch are nodes defining the structure of the document, including a branch descending from a div node (<div id="UCPathTrim" ...>) containing a branch descending from a span node (<span>) with leaf nodes like <a href="/used-cars/honda/accord/2005/private-party-value/equipment?id=846" class="link_circle_arrow_blue">Accord DX Sedan 4D</a>.

Now, moving to R, we’ll look at the tree produced by the XML package for this document. The first section of code should be fairly straightforward:

## download the webpage
kbbHTML <- readLines("http://www.kbb.com/used-cars/honda/accord/2005/private-party-value")

## load the XML package and parse the downloaded document
require(XML)
kbbTree <- htmlTreeParse(kbbHTML, asText = TRUE)

## get the root ('html') node
kbbRoot <- xmlRoot(kbbTree)

Each node object (class XMLNode) is also a list containing its immediate children as node objects.

> ## print the child nodes ('head' and 'body')
> print(summary(kbbRoot))
     Length Class   Mode
head 14     XMLNode list
body 19     XMLNode list

Thus, we can get the body of the document:

## select the 'body' child node using the usual R list element extraction syntax
kbbBody <- kbbRoot[["body"]]

Within the body, there’s a bunch of child nodes (the same ones we see in Firebug, of course):

> ## print the child nodes of the 'body'
> print(summary(kbbBody))
         Length Class          Mode
script   1      XMLNode        list
script   1      XMLNode        list
div      4      XMLNode        list
comment  0      XMLCommentNode list
script   0      XMLNode        list
script   1      XMLNode        list
script   1      XMLNode        list
script   1      XMLNode        list
noscript 1      XMLNode        list
comment  0      XMLCommentNode list
comment  0      XMLCommentNode list
script   0      XMLNode        list
div      2      XMLNode        list
script   0      XMLNode        list
script   1      XMLNode        list
comment  0      XMLCommentNode list
script   1      XMLNode        list
noscript 1      XMLNode        list
comment  0      XMLCommentNode list

Either by looking at the tree in Firebug or using summaries of the tree in R, we can identify the div node we’re looking for and access the corresponding node object in R:

## select our 'div id="UCPathTrim"...' node; instead of using node
## names (like 'div'), which aren't necessarily unique here, we use
## indices (we want the first child of the first child of the second
## child of the second child of the third child of 'body')
divUCPathTrim <- kbbBody[[3]][[2]][[2]][[1]][[1]]


> ## print the child nodes
> print(summary(divUCPathTrim))
     Length Class       Mode
h2   1      XMLNode     list
text 0      XMLTextNode list
span 9      XMLNode     list

We can then access the trim links, which are the leaf nodes of the span node under divUCPathTrim. Printing an XMLNode object outputs the raw HTML.

> ## print the HTML of the first of the link leaf nodes (children of the 'span' node)
> print(divUCPathTrim[["span"]][[1]])
<a href="/used-cars/honda/accord/2005/private-party-value/equipment?id=846" class="link_circle_arrow_blue">Accord DX Sedan 4D</a>

To get the node contents (here, the trim label), we use the xmlValue function:

> ## print the *contents* of this leaf node
> print(xmlValue(divUCPathTrim[["span"]][[1]]))
[1] "Accord DX Sedan 4D"

To get the link target (the ‘href’ attribute), we use the xmlAttrs function:

> ## print the 'href' attribute of this leaf node
> print(xmlAttrs(divUCPathTrim[["span"]][[1]])[["href"]])
[1] "/used-cars/honda/accord/2005/private-party-value/equipment?id=846"
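The same two steps can be applied to every child of a node at once. The sketch below does so on a toy document, so it runs without hitting kbb.com. The markup is made up (two links and shortened hrefs), and it uses xmlChildren and xmlGetAttr from the XML package, which the walkthrough above doesn't touch.

```r
library(XML)

## toy stand-in for the trim selection page (made-up markup, two trims only)
toyHTML <- paste0(
  "<html><head><title>toy</title></head><body>",
  "<div id='UCPathTrim'><h2>Choose a trim</h2><span>",
  "<a href='/equipment?id=846'>Accord DX Sedan 4D</a>",
  "<a href='/equipment?id=863'>Accord EX Coupe 2D</a>",
  "</span></div></body></html>")

## parse to R-level node objects and walk down by node name
toyRoot <- xmlRoot(htmlTreeParse(toyHTML, asText = TRUE))
toySpan <- toyRoot[["body"]][["div"]][["span"]]

## xmlChildren returns the child nodes as a plain list, so sapply works;
## xmlGetAttr(node, "href") fetches a single attribute by name
toyLabels <- unname(sapply(xmlChildren(toySpan), xmlValue))
toyLinks  <- unname(sapply(xmlChildren(toySpan), xmlGetAttr, name = "href"))

print(data.frame(trim = toyLabels, href = toyLinks))
```

Walking by name works here only because the toy page has a single div and a single span; on the real page the names repeat, which is why the indices were needed above.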

There’s an easier way to select a set of nodes and apply functions over this set. To do so, we must learn a bit of XPath.

XPath

XPath is a query language for selecting sets of nodes from XML or XML-like documents (like HTML webpages). A nice quick introduction to XPath syntax is the w3schools.com article XPath Syntax. Open it in a tab, read it, and come back.

Done? Good. If we’re super lazy, we can use Firebug to generate an XPath expression to select a given node—just right click on the node and choose “Copy XPath”. Here’s the XPath expression for the second of the nine trim links:

/html/body/div/div[2]/div[2]/div/div/span/a[2]

To select all of the nine trim links, we simply chop off the “[2]” on the end (match all a nodes that are children of that span):

/html/body/div/div[2]/div[2]/div/div/span/a

If we want a short XPath expression, we can instead use something like this:

//div[@id = 'UCPathTrim']//a

That is, we select all a nodes that descend from any div node with attribute id='UCPathTrim'. In XPath syntax, “//nodename” selects descendant nodes named nodename while “/nodename” selects child nodes named nodename (immediate descendants). Using double forward slashes allows us to skip specifying intermediate nodes. Expressions within brackets are conditions, evaluated to booleans, specifying whether a node should or should not be included.
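To see the difference between the single and double slash in action, here is a small sketch on a made-up document (the markup and the 'nav' div are invented for the example). It uses htmlParse and getNodeSet from the XML package and just counts the matches for three expressions.

```r
library(XML)

## made-up page: one unrelated link plus two trim links nested in a span
toyDoc <- htmlParse(paste0(
  "<html><body>",
  "<div id='nav'><a href='/home'>Home</a></div>",
  "<div id='UCPathTrim'><span>",
  "<a href='/equipment?id=846'>DX</a>",
  "<a href='/equipment?id=863'>EX</a>",
  "</span></div></body></html>"), asText = TRUE)

## '/a' after the div: immediate children only. The trim links sit one
## level further down (inside the span), so nothing matches.
n.child <- length(getNodeSet(toyDoc, "//div[@id = 'UCPathTrim']/a"))

## '//a' after the div: descendants at any depth
n.desc <- length(getNodeSet(toyDoc, "//div[@id = 'UCPathTrim']//a"))

## no bracketed condition at all: every link in the document
n.all <- length(getNodeSet(toyDoc, "//a"))

print(c(child = n.child, descendant = n.desc, all = n.all))
```

The three counts come out as 0, 2, and 3: the bracketed condition narrows which div we start from, and the slashes decide how far down we look.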

Is there any advantage to using one expression over the other? So long as the structure of the webpage doesn’t change, both will work; however, if the order of the nodes in the document changes, the former expression will fail, but the latter will continue to work (it selects on the div id attribute rather than its position in the document). Similarly, if the div id changes but the document structure otherwise remains unchanged (this is unlikely, but might happen if they messed around with their CSS styling or something), the former would continue working but the latter would fail.

We can create a fancier XPath expression using XPath functions that will continue to work so long as the KBB URL scheme stays the same. Since the rest of the code will depend on this remaining constant, our XPath expression should only fail at the same time as the rest of our code. Lists of the available XPath functions are easy to find in any XPath reference. We’ll use the function contains(x, y), which returns true if string x contains string y (else false). Our XPath expression is:

//a[contains(@href, 'used-cars/honda/accord/2005/private-party-value/equipment')]

This selects all links with target URLs containing ‘used-cars/honda/accord/2005/private-party-value/equipment’.
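The robustness argument can be checked on a made-up page where both kinds of change have happened at once: the div has been renamed and an extra wrapper div has been added. The markup, the 'TrimList' id, and the helper countMatches are all invented for this sketch, and the positional expression is shortened to fit the toy page.

```r
library(XML)

## made-up "redesigned" page: div id renamed, one extra level of nesting
changedDoc <- htmlParse(paste0(
  "<html><body><div><div id='TrimList'><span>",
  "<a href='/used-cars/honda/accord/2005/private-party-value/equipment?id=846'>DX</a>",
  "<a href='/used-cars/honda/accord/2005/private-party-value/equipment?id=863'>EX</a>",
  "</span></div>",
  "<a href='/used-cars/honda/accord/2005/'>Back</a>",
  "</div></body></html>"), asText = TRUE)

## hypothetical helper: number of nodes an expression selects
countMatches <- function(xpath) length(getNodeSet(changedDoc, xpath))

## positional path written for the old layout (span directly under one div)
by.position <- countMatches("/html/body/div/span/a")

## selection on the old div id
by.id <- countMatches("//div[@id = 'UCPathTrim']//a")

## selection on the link target itself
by.href <- countMatches(
  "//a[contains(@href, 'used-cars/honda/accord/2005/private-party-value/equipment')]")

print(c(position = by.position, id = by.id, href = by.href))
```

Only the contains() expression still finds the two trim links, and it correctly skips the 'Back' link, whose href lacks the equipment part.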

getNodeSet and xpathApply

To use XPath with the XML package, we need to parse the document a little differently. You see, the XML package can either parse the document into a tree structure of R objects (as we did above, using htmlTreeParse) or into a tree structure of pointers to C-level objects. In the latter case, the parsed structure is maintained as lower-level objects in memory, and is not immediately accessible in R. Indeed, incorrectly accessing the parsed document object can cause R to crash. However, parsing the document into this C-level structure internal to libxml2 permits the use of XPath expressions. For more, see help("xmlParse").

In practice, using XPath expressions with the XML package is fairly simple. We parse the document with htmlParse instead of htmlTreeParse, and select sets of nodes corresponding to XPath expressions using getNodeSet. We can then lapply or sapply over the resulting nodeset. If we only need to apply a single function, we can instead use xpathApply to apply a function to an XPath-defined set directly.

## parse the downloaded document to an XMLInternalDocument
kbbInternalTree <- htmlParse(kbbHTML, asText = TRUE)

## select nodes matching our XPath expression
xpath.expression <- "//a[contains(@href,'/used-cars/honda/accord/2005/private-party-value/equipment')]"
trim.nodes <- getNodeSet(doc = kbbInternalTree,
                         path = xpath.expression)


> ## the result is of class "XMLNodeSet", a list of 9 externalptr
> ## objects of class "XMLInternalElementNode"
> print(summary(trim.nodes))
      Length Class                  Mode       
 [1,] 1      XMLInternalElementNode externalptr
 [2,] 1      XMLInternalElementNode externalptr
 [3,] 1      XMLInternalElementNode externalptr
 [4,] 1      XMLInternalElementNode externalptr
 [5,] 1      XMLInternalElementNode externalptr
 [6,] 1      XMLInternalElementNode externalptr
 [7,] 1      XMLInternalElementNode externalptr
 [8,] 1      XMLInternalElementNode externalptr
 [9,] 1      XMLInternalElementNode externalptr


> ## we can now lapply or sapply over this list object
> print(lapply(trim.nodes, function(x) c(xmlValue(x), xmlAttrs(x)[["href"]])))
[[1]]
[1] " Accord DX Sedan 4D"                                              
[2] "/used-cars/honda/accord/2005/private-party-value/equipment?id=846"

[[2]]
[1] " Accord EX Coupe 2D"                                              
[2] "/used-cars/honda/accord/2005/private-party-value/equipment?id=863"

[[3]]
[1] " Accord EX Sedan 4D"                                              
[2] "/used-cars/honda/accord/2005/private-party-value/equipment?id=859"

[[4]]
[1] " Accord EX-L Coupe 2D"                                               
[2] "/used-cars/honda/accord/2005/private-party-value/equipment?id=263736"

[[5]]
[1] " Accord EX-L Sedan 4D"                                               
[2] "/used-cars/honda/accord/2005/private-party-value/equipment?id=263737"

[[6]]
[1] " Accord Hybrid Sedan 4D"                                          
[2] "/used-cars/honda/accord/2005/private-party-value/equipment?id=868"

[[7]]
[1] " Accord LX Coupe 2D"                                              
[2] "/used-cars/honda/accord/2005/private-party-value/equipment?id=856"

[[8]]
[1] " Accord LX Sedan 4D"                                              
[2] "/used-cars/honda/accord/2005/private-party-value/equipment?id=850"

[[9]]
[1] " Accord LX Special Edition Coupe 2D"                              
[2] "/used-cars/honda/accord/2005/private-party-value/equipment?id=867"
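For the single-function case, the select-then-sapply steps collapse into one call. The sketch below uses xpathSApply (the simplifying sibling of xpathApply in the XML package) on a made-up two-link document. The leading spaces in the link text mimic the ones visible in the output above, and trimws is used to drop them.

```r
library(XML)

## made-up stand-in for the trim page, with leading spaces in the labels
toyDoc <- htmlParse(paste0(
  "<html><body><div id='UCPathTrim'><span>",
  "<a href='/equipment?id=846'> Accord DX Sedan 4D</a>",
  "<a href='/equipment?id=863'> Accord EX Coupe 2D</a>",
  "</span></div></body></html>"), asText = TRUE)

toy.xpath <- "//div[@id = 'UCPathTrim']//a"

## apply one function to every matching node; extra arguments after the
## function (here "href") are passed on to it
toy.labels <- trimws(xpathSApply(toyDoc, toy.xpath, xmlValue))
toy.hrefs  <- xpathSApply(toyDoc, toy.xpath, xmlGetAttr, "href")

## the id is the run of digits at the end of each href
toy.ids <- sub(".*id=([[:digit:]]+)$", "\\1", toy.hrefs)

print(data.frame(trim = toy.labels, id = toy.ids, stringsAsFactors = FALSE))
```

Whether to go through getNodeSet first or call xpathSApply directly is mostly a matter of taste; keeping the node set around pays off when several things are extracted from the same nodes.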

Putting it all together

I’m getting tired, so let’s jump ahead to a complete function that retrieves all of the trims for a given year. If you’ve read and understood everything above, you should be able to figure out how the function works without much trouble (with the possible exception of the line that builds the XPath expression, which needlessly uses a regular expression). Go wild with help(...) until it all makes sense.

getKBBYearTrims <- function(prefix, year, type = "private-party-value") {
  require(XML)

  ## build the trim selection page URL: <prefix><year>/<type>
  kbbTrimPageURL <- sprintf("%s%i/%s", prefix, year, type)
  cat("Loading", kbbTrimPageURL, "\n")

  ## download the page and parse it to an XMLInternalDocument
  x <- readLines(kbbTrimPageURL)
  g <- htmlParse(x, asText=TRUE)

  ## drop the host part of the URL and wrap the remaining path in a
  ## contains() condition on the link target
  xpath <- gsub("([http:/w.]+kbb\\.com/)(.*)", "//a[contains(@href, '\\2/equipment')]", kbbTrimPageURL)
  cat("XPath expression is:", xpath, "\n")

  ## select the trim links; the label is the link text and the id is
  ## the run of digits at the end of the href
  trims <- getNodeSet(doc = g, path = xpath)
  trimlabels <- sapply(trims, xmlValue)
  trimids <- sapply(trims, function(node) sub(".*id=([[:digit:]]+)$", "\\1", xmlAttrs(node)[["href"]]))

  trimtable <- data.frame(year = year,
                          trim = trimlabels,
                          id = trimids,
                          stringsAsFactors = FALSE)
  return(trimtable)
}
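The two string-manipulation lines in the function are the least obvious part, so here they are in isolation (base R only, no download needed). The regex-free variant at the end is just one possible alternative, not something taken from the function; it assumes the URL always starts with exactly "http://www.kbb.com/".

```r
kbbTrimPageURL <- "http://www.kbb.com/used-cars/honda/accord/2005/private-party-value"

## As in the function: group 1 swallows "http://www.kbb.com/", group 2
## captures the rest of the path, and the replacement wraps group 2 in a
## contains() condition.
xpath.regex <- gsub("([http:/w.]+kbb\\.com/)(.*)",
                    "//a[contains(@href, '\\2/equipment')]",
                    kbbTrimPageURL)

## A possible regex-free variant (an assumption, see above): remove the
## fixed host prefix literally, then format the expression with sprintf.
url.path <- sub("http://www.kbb.com/", "", kbbTrimPageURL, fixed = TRUE)
xpath.plain <- sprintf("//a[contains(@href, '%s/equipment')]", url.path)

print(xpath.regex)
print(identical(xpath.regex, xpath.plain))

## The id extraction: everything up to "id=" is discarded and the
## trailing digits are kept.
href <- "/used-cars/honda/accord/2005/private-party-value/equipment?id=263736"
trim.id <- sub(".*id=([[:digit:]]+)$", "\\1", href)
print(trim.id)
```

Note that the id comes back as a character string, not a number, which is harmless since it only ever gets pasted back into a URL.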

The function works great for 2005 Accords:

> ## print trims and ids for 2005 Honda Accords
> print(getKBBYearTrims(prefix = "http://www.kbb.com/used-cars/honda/accord/", year = 2005))
Loading http://www.kbb.com/used-cars/honda/accord/2005/private-party-value 
XPath expression is: //a[contains(@href, 'used-cars/honda/accord/2005/private-party-value/equipment')] 
  year                                trim     id
1 2005                  Accord DX Sedan 4D    846
2 2005                  Accord EX Coupe 2D    863
3 2005                  Accord EX Sedan 4D    859
4 2005                Accord EX-L Coupe 2D 263736
5 2005                Accord EX-L Sedan 4D 263737
6 2005              Accord Hybrid Sedan 4D    868
7 2005                  Accord LX Coupe 2D    856
8 2005                  Accord LX Sedan 4D    850
9 2005  Accord LX Special Edition Coupe 2D    867

The following function wraps getKBBYearTrims to return a data.frame of trims for a set of model years.

getKBBTrims <- function(prefix, years, type = "private-party-value") {

  ## pass 'type' through; otherwise the wrapper would silently always use the default
  kbbTrimList <- lapply(years, function(year) getKBBYearTrims(prefix, year, type))
  kbbTrims <- do.call('rbind', kbbTrimList)

  return(kbbTrims)
}
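The lapply-then-rbind pattern in the wrapper can be tried without any network access by swapping in a stand-in for getKBBYearTrims. Everything in this sketch is invented for the purpose: fakeYearTrims, its rows, and its ids are placeholders, not real KBB data.

```r
## stand-in for getKBBYearTrims: same arguments, same column layout,
## but the rows are fabricated placeholders rather than scraped values
fakeYearTrims <- function(prefix, year, type = "private-party-value") {
  data.frame(year = year,
             trim = c("Accord LX Sedan 4D", "Accord EX Sedan 4D"),
             id = as.character(year * 10 + 1:2),
             stringsAsFactors = FALSE)
}

## one data.frame per year, collected in a list ...
perYear <- lapply(2003:2004, function(year) fakeYearTrims("ignored/", year))

## ... then stacked into a single data.frame; do.call hands the list
## elements to rbind as separate arguments
stacked <- do.call("rbind", perYear)
print(stacked)
```

The stacking relies on every per-year data.frame having the same column names, which holds here because they all come from the same function.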

Using it, we can try getting the trims for a series of model years:

> ## print trims and ids for years 2003 to 2007
> accord.trims <- getKBBTrims(prefix = "http://www.kbb.com/used-cars/honda/accord/", years = 2003:2007)
Loading http://www.kbb.com/used-cars/honda/accord/2003/private-party-value 
XPath expression is: //a[contains(@href, 'used-cars/honda/accord/2003/private-party-value/equipment')] 
Loading http://www.kbb.com/used-cars/honda/accord/2004/private-party-value 
XPath expression is: //a[contains(@href, 'used-cars/honda/accord/2004/private-party-value/equipment')] 
Loading http://www.kbb.com/used-cars/honda/accord/2005/private-party-value 
XPath expression is: //a[contains(@href, 'used-cars/honda/accord/2005/private-party-value/equipment')] 
Loading http://www.kbb.com/used-cars/honda/accord/2006/private-party-value 
XPath expression is: //a[contains(@href, 'used-cars/honda/accord/2006/private-party-value/equipment')] 
Loading http://www.kbb.com/used-cars/honda/accord/2007/private-party-value 
XPath expression is: //a[contains(@href, 'used-cars/honda/accord/2007/private-party-value/equipment')]
> print(accord.trims)
   year                                trim     id
1  2003                  Accord DX Sedan 4D   2488
2  2003                  Accord EX Coupe 2D   2496
3  2003                  Accord EX Sedan 4D   2498
4  2003                Accord EX-L Coupe 2D 263731
5  2003                Accord EX-L Sedan 4D 263730
6  2003                  Accord LX Coupe 2D   2495
7  2003                  Accord LX Sedan 4D   2492
8  2004                  Accord DX Sedan 4D   2664
9  2004                  Accord EX Coupe 2D   2671
10 2004                  Accord EX Sedan 4D   2676
11 2004                Accord EX-L Coupe 2D 263735
12 2004                Accord EX-L Sedan 4D 263734
13 2004                  Accord LX Coupe 2D   2669
14 2004                  Accord LX Sedan 4D   2663
15 2005                  Accord DX Sedan 4D    846
16 2005                  Accord EX Coupe 2D    863
17 2005                  Accord EX Sedan 4D    859
18 2005                Accord EX-L Coupe 2D 263736
19 2005                Accord EX-L Sedan 4D 263737
20 2005              Accord Hybrid Sedan 4D    868
21 2005                  Accord LX Coupe 2D    856
22 2005                  Accord LX Sedan 4D    850
23 2005  Accord LX Special Edition Coupe 2D    867
24 2006                  Accord EX Coupe 2D    741
25 2006                  Accord EX Sedan 4D    739
26 2006                Accord EX-L Coupe 2D 263727
27 2006                Accord EX-L Sedan 4D 263726
28 2006              Accord Hybrid Sedan 4D    744
29 2006                  Accord LX Coupe 2D    736
30 2006                  Accord LX Sedan 4D    734
31 2006                  Accord SE Sedan 4D    738
32 2006                  Accord VP Sedan 4D    737
33 2007                  Accord EX Coupe 2D  83835
34 2007                  Accord EX Sedan 4D  83834
35 2007                Accord EX-L Coupe 2D 263674
36 2007                Accord EX-L Sedan 4D 263675
37 2007              Accord Hybrid Sedan 4D  83836
38 2007                  Accord LX Coupe 2D  83833
39 2007                  Accord LX Sedan 4D  83829
40 2007                  Accord SE Sedan 4D  83832
41 2007                  Accord VP Sedan 4D  83827

Everything works great. What a shock.
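As a rough sketch of what the id table is good for, the ids can be dropped straight back into the pricing-report URL pattern from earlier. The three sample rows are copied from the output above; the names accord.sample and report.urls and the 60,000-mile figure are arbitrary choices for the example. The condition field is left out, so the default condition would be highlighted.

```r
## three rows copied from the accord.trims output above
accord.sample <- data.frame(
  year = c(2003L, 2005L, 2007L),
  trim = c("Accord DX Sedan 4D", "Accord DX Sedan 4D", "Accord EX Sedan 4D"),
  id   = c("2488", "846", "83834"),
  stringsAsFactors = FALSE)

accord.prefix <- "http://www.kbb.com/used-cars/honda/accord/"

## sprintf is vectorised, so one call yields one URL per row
report.urls <- sprintf("%s%d/private-party-value/pricing-report?id=%s&mileage=%d",
                       accord.prefix, accord.sample$year, accord.sample$id, 60000L)

print(report.urls)
```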
