by Srini Kumar, Director of Data Science at Microsoft
We tend to think of R and other such ML tools only in the context of the workplace, to do “weighty” things aimed at saving millions. A little judicious use of R may help us hugely in our personal lives too. The ideas of regression, classification trees etc. can be powerful tools in valuation, as I found out.
Recently, I was in a five-car accident on the infamous 101 in the San Francisco Bay Area. Luckily, none of us required an ambulance and all of us walked away. However, my car was, in insurance parlance, a "total loss". I was left wondering what I should expect as a check from my insurance company. I found the data I needed on the web, and used R to very quickly come up with a model to value the car. The model's astonishing accuracy was probably an exception, but the fact that it placed the value in the right ballpark illustrates how easy it is to use R for a quick yet reasonably accurate analysis.
First off, we need to recognize that our expected value should not be the blue book value of the car. It should be the amount we would have to pay (excluding taxes and other non-discretionary expenses) to get a similar used (or, at the higher end, "pre-owned") car from a car dealer. Therefore, I searched for all the cars of that model and year available in the United States, and got a list of 70 cars from all over the country.
The only part that involved some drudgery was copying the location, mileage and asking price for each car from that PDF file and putting them in a spreadsheet. A reasonable guess is that the car's value depends mostly on its mileage. A further reasonable assumption (which turned out to be a good one) is that it also depends on where the car is for sale.
The rest of the analysis was quite easy. Having read in the tab-delimited data, I checked the mileage and price for different states by way of initial exploration. As an aside, I always use and recommend the tab-separated format over the comma-separated format. Text fields rarely contain tabs, but they contain commas far more frequently.
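The reading and exploration step described above might look something like the following minimal sketch. The listings here are made up for illustration and are not the data from my search; the column names (location, mileage, price) and the "City, ST" location format are assumptions as well. Note that the location field contains a comma, which is exactly why the tab-separated format is convenient.

```r
# Hypothetical listings: made-up values, NOT the data used in the article.
listing_lines <- c(
  "location\tmileage\tprice",
  "San Jose, CA\t41000\t23900",
  "Fresno, CA\t52000\t22500",
  "Austin, TX\t38000\t24800",
  "Dallas, TX\t61000\t21700",
  "Houston, TX\t47000\t23100",
  "Miami, FL\t33000\t25400",
  "Tampa, FL\t58000\t22000"
)
path <- tempfile(fileext = ".tsv")
writeLines(listing_lines, path)

# read.delim() uses a tab separator by default, so the commas inside
# the location field need no quoting or escaping.
cars <- read.delim(path, stringsAsFactors = FALSE)

# Assumed format "City, ST": keep the text after the last ", " as the state.
cars$state <- sub(".*, ", "", cars$location)

# Mean mileage and mean asking price per state, plus the number of listings.
by_state <- aggregate(cbind(mileage, price) ~ state, data = cars, FUN = mean)
by_state$n <- as.vector(table(cars$state)[by_state$state])
print(by_state)
```

A quick boxplot(price ~ state, data = cars) on the same data frame is another way to eyeball the spread among states.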
The exploration showed too wide a spread among the states, and too little data from mine. Still, a simple linear regression on mileage and state yielded a model whose prediction was $23,122.47. I had to guess the mileage on my car, and guessed it, reasonably accurately, at about 45,000 miles. After doing this, especially since the data points from my state were too few, I tried a decision tree to check for the dependence of price on state.
The decision tree algorithm evidently did not find the state to be a factor in determining the price.
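For readers who want to try the two modeling steps, here is a rough sketch of a linear regression on mileage and state followed by a regression tree on the same predictors. It runs on synthetic data generated on the spot, so the numbers it prints have nothing to do with my actual result. The variable names, the choice of the rpart package for the tree, and the simulated price formula are all assumptions made for illustration.

```r
library(rpart)  # recommended package bundled with standard R distributions

set.seed(1)

# Synthetic stand-in for the 70 listings. All values are made up, and by
# construction the simulated price depends on mileage only, not on state.
n <- 70
cars <- data.frame(
  state   = factor(sample(c("CA", "TX", "FL", "NY"), n, replace = TRUE)),
  mileage = round(runif(n, 20000, 80000))
)
cars$price <- 30000 - 0.15 * cars$mileage + rnorm(n, sd = 800)

# Linear regression of asking price on mileage and state.
fit_lm <- lm(price ~ mileage + state, data = cars)

# Predict the price of a car with a guessed mileage of 45,000 in one state.
my_car <- data.frame(
  mileage = 45000,
  state   = factor("CA", levels = levels(cars$state))
)
pred <- predict(fit_lm, newdata = my_car)
print(pred)

# Regression tree on the same predictors.
fit_tree <- rpart(price ~ mileage + state, data = cars, method = "anova")

# Variables the tree actually split on ("<leaf>" marks terminal nodes).
split_vars <- setdiff(unique(as.character(fit_tree$frame$var)), "<leaf>")
print(split_vars)
```

If state does not appear among the split variables, the tree found no split on state worth making, which is the kind of check described above.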
Armed with this knowledge, I waited to talk with the insurance company. To their credit, what they offered me was less than 0.4 percent off the predicted value! To be sure, there were unknowns: I had guessed my mileage, my car had been loaded with options, and I had not scraped any options from the publicly available data. An additional regression that ignored the state completely and used only the mileage would have given me an error of a little over 3 percent. Still, it was interesting that the modeling could predict the price to within about 4 percent, which sets the stage very well for negotiations if it comes to that, particularly since the alternative is to make subjective guesses.
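To put the percentages above in dollar terms, the small sketch below computes the range of offers that lie within 0.4 percent of the $23,122.47 prediction. It assumes the percentage is measured relative to the predicted value; the helper name pct_off is hypothetical.

```r
predicted <- 23122.47   # model prediction quoted in the text
tolerance <- 0.4 / 100  # "less than 0.4 percent off"

# Lower and upper bounds of offers within the tolerance of the prediction.
band <- predicted * c(1 - tolerance, 1 + tolerance)
print(round(band, 2))

# Hypothetical helper: absolute percent difference between an offer and
# a prediction, measured relative to the prediction.
pct_off <- function(offer, predicted) {
  100 * abs(offer - predicted) / predicted
}
```

Calling pct_off() with an actual offer and a model prediction gives a quick measure of how far apart the two are.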
So, the next time you need to value something, you can come up with objective and defensible estimates in less than an hour to help you negotiate. The only requirement is that you can find data on that item or a comparable one, which is often easy to do. Here is the R code, if you would like to try it yourself.