(This article was first published on **Data Until I Die!**, and kindly contributed to R-bloggers)

Recently a colleague of mine showed me a nauseating interactive scatterplot that plots mobile phones along two dimensions of the user’s choice, picked from a list of possible dimensions. Although the interactive visualization was offensive to my tastes, the JSON data behind it was intriguing. The data was easy enough to get (see this link if you want an up-to-date copy, and be sure to take out the “data=” from the start of the file; I pulled this data around noon on March 23rd), so I could start asking a simple question: which of the factors provided in the dataset were the most predictive of full mobile phone price?
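For anyone who wants to load the file without editing it by hand, here is a minimal sketch in Python (not the R code used for this post) of stripping the leading “data=” before parsing. The helper name `parse_prefixed_json`, the field names, and the tiny sample payload are all made up for illustration; the real file's structure is not shown here.

```python
import json

def parse_prefixed_json(raw_text, prefix="data="):
    """Strip a leading assignment prefix (if present) and parse the rest as JSON."""
    text = raw_text.lstrip()
    if text.startswith(prefix):
        text = text[len(prefix):]
    # Assumption: a trailing semicolon may follow when JSON is embedded
    # as a JavaScript assignment, so drop it if it is there.
    text = text.strip().rstrip(";")
    return json.loads(text)

# Made-up miniature payload, only to show the shape of the call.
sample = 'data=[{"name": "Phone A", "price": 199.99}, {"name": "Phone B", "price": 599.99}]'
phones = parse_prefixed_json(sample)
print(len(phones), phones[0]["name"])
```

The same function leaves already-clean JSON untouched, so it is safe to run on a copy that has had the prefix removed by hand.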

I’ll present the graphs first, then the predictive model, and leave the code for later:

**Price by OS and Brand:**

Often when investigating a topic using data, we confirm things that we already knew to be true. This is certainly the case here with price by OS and brand. From the boxplots below we see that the bulk of iOS devices tend to be the most expensive, and that brand-wise Apple, Google, and Samsung stick out.

**Price by Storage Capacity, RAM, and SD Card Capacity:**

Storage capacity is perhaps the least surprising factor to find with such a sharply positive correlation with price. What surprises me more is that there aren’t more gradations of storage capacity in the higher range past 50 gigabytes. I’m guessing this is because the bulk of these phones (bearing in mind that roughly 90% of them are in fact smart phones) cater to lower-income folks. Can you guess which phones occupy the top-right-most position on the first graph? If your answer involved the iPhone 6 then you’re right on two counts!

As you can see, the relationship between RAM and price is pretty linear (with phones costing $171.54 more for each additional gigabyte of RAM), and the relationship between SD card capacity and price is linear once you get past the large group of phones with 0 SD card capacity (with phones costing $3.64 more for each additional gigabyte of SD card capacity).
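A “dollars per additional gigabyte” figure like the ones above is presumably the slope of a straight-line fit of price on the factor. Here is a minimal Python sketch of that calculation (not the R code used for this post), assuming an ordinary least-squares fit. The function name and the RAM and price values are made up for illustration.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up values: RAM in gigabytes and price in dollars for five phones.
ram_gb = [0.5, 1.0, 1.0, 2.0, 3.0]
price = [100.0, 180.0, 220.0, 400.0, 560.0]

slope, intercept = least_squares_line(ram_gb, price)
print(f"each extra GB of RAM adds about ${slope:.2f}")
```

For the SD card case, the same function could be applied after dropping the phones with 0 SD card capacity, which is where the text says the linear pattern starts.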

**Price by Screen Size, Battery, and Weight:**

The next factors that I think one would naturally think of when considering the price of a mobile phone are all related to how big the thing is. Smart phones these days have a lot of physical presence just by dint of their screen size alone. Add to the large screen size the batteries that are used to support such generous displays and you also get an impressive variety of weights to these phones.

In fact, for every additional inch of screen size, you can expect an additional 0.81504 ounces of weight and 565.11 mAh of battery capacity. My own humble little smartphone (an HTC Desire 601) happens to be on the smaller and lighter side of the spectrum (4.5 inches screen size, or 33rd percentile; 4.59 ounces, or 26th percentile) but has a pretty generous battery capacity all things considered (2100 mAh, or 56th percentile).
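The percentiles quoted above can be read as “what share of phones in the data are at or below this value”. Here is a minimal Python sketch of that idea (not the R code used for this post). The post does not say which percentile definition was used, so the “less than or equal to” convention here is an assumption, and the screen sizes are made up.

```python
def percentile_rank(values, x):
    """Percentage of values that are less than or equal to x."""
    at_or_below = sum(1 for v in values if v <= x)
    return 100.0 * at_or_below / len(values)

# Made-up screen sizes in inches, only to show the call.
screen_sizes = [3.5, 4.0, 4.5, 4.7, 5.0, 5.5, 6.0]
print(round(percentile_rank(screen_sizes, 4.5)))
```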

While positive correlations can be seen between price and all 3 of these factors, battery was the most correlated with price, followed by screen size and then weight. There’s obviously a lot of variability in price when you look at the phones with the bigger screen sizes, as they probably tend to come packed with a variety of premium extra features that can be used to jack up the price.
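Ranking factors by how correlated they are with price can be sketched as follows in Python (not the R code used for this post). This assumes the Pearson correlation coefficient is the measure in question. The feature names and all of the numbers are made up, so the ranking printed here says nothing about the real data.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up values for five phones.
price = [120.0, 250.0, 300.0, 480.0, 650.0]
features = {
    "battery_mah": [1500, 2000, 2100, 2600, 3000],
    "screen_size": [3.5, 4.5, 4.3, 5.0, 5.5],
    "weight_oz": [4.0, 4.6, 5.2, 4.9, 6.0],
}

# Sort feature names from most to least correlated with price.
ranked = sorted(features, key=lambda name: pearson(features[name], price), reverse=True)
for name in ranked:
    print(name, round(pearson(features[name], price), 3))
```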

**Putting it all together in a model:**

Finally, let’s lump all of the factors provided in the data set into a model, and see how well it performs on a testing sample. I decided on an 80/20 training/testing split, and am of course using Max Kuhn’s fabulous caret package to do the dirty work. I ran a gbm model, shown below, and managed to get a cross-validated R squared of 60.4% in the training sample, so not bad.

Stochastic Gradient Boosting

257 samples

23 predictors

No pre-processing

Resampling: Cross-Validated (10 fold)

Summary of sample sizes: 173, 173, 171, 171, 172, 171, ...

Resampling results across tuning parameters:

| interaction.depth | n.trees | RMSE | Rsquared | RMSE SD | Rsquared SD |
|---|---|---|---|---|---|
| 1 | 50 | 150.1219 | 0.5441107 | 45.36781 | 0.1546993 |
| 1 | 100 | 147.5400 | 0.5676971 | 46.03555 | 0.1528225 |
| 1 | 150 | 146.3710 | 0.5803005 | 45.00296 | 0.1575795 |
| 2 | 50 | 144.0657 | 0.5927624 | 45.46212 | 0.1736994 |
| 2 | 100 | 143.7181 | 0.6036983 | 44.80662 | 0.1787351 |
| 2 | 150 | 143.4850 | 0.6041207 | 45.57357 | 0.1760428 |
| 3 | 50 | 148.4914 | 0.5729182 | 45.27579 | 0.1903465 |
| 3 | 100 | 148.5363 | 0.5735842 | 43.41793 | 0.1746064 |
| 3 | 150 | 148.8497 | 0.5785677 | 43.39338 | 0.1781990 |

Tuning parameter 'shrinkage' was held constant at a value of 0.1

RMSE was used to select the optimal model using the smallest value.

The final values used for the model were n.trees = 150, interaction.depth = 2 and shrinkage = 0.1.
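The selection rule in that output (“RMSE was used to select the optimal model using the smallest value”) is easy to check by hand. Here is a small Python sketch (an illustration, not the R code that produced the output) that picks the row with the lowest RMSE from the resampling results printed above. The numbers are copied from that output; the variable names are made up.

```python
# (interaction.depth, n.trees, RMSE, Rsquared), copied from the resampling results above.
grid = [
    (1, 50, 150.1219, 0.5441107),
    (1, 100, 147.5400, 0.5676971),
    (1, 150, 146.3710, 0.5803005),
    (2, 50, 144.0657, 0.5927624),
    (2, 100, 143.7181, 0.6036983),
    (2, 150, 143.4850, 0.6041207),
    (3, 50, 148.4914, 0.5729182),
    (3, 100, 148.5363, 0.5735842),
    (3, 150, 148.8497, 0.5785677),
]

# Choose the tuning combination with the smallest cross-validated RMSE.
best = min(grid, key=lambda row: row[2])
depth, n_trees, best_rmse, best_r2 = best
print(f"interaction.depth = {depth}, n.trees = {n_trees}, "
      f"RMSE = {best_rmse}, Rsquared = {best_r2}")
```

The winning row matches the final values reported in the output, and its Rsquared is where the 60.4% figure comes from.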

Now let’s look at the terms that came out as the most important in the chosen model. Below we see some unsurprising findings! Storage, battery, weight, RAM, and whether or not the phone uses iOS make up the top 5. I guess I’m surprised that screen size was not higher up in the priority list, but at least it came in 6th place!

gbm variable importance

only 20 most important variables shown (out of 41)

| Variable | Overall |
|---|---|
| att_storage | 100.0000 |
| att_battery_mah | 59.7597 |
| att_weight | 46.5410 |
| att_ram | 27.5871 |
| att_osiOS | 26.9977 |
| att_screen_size | 21.1106 |
| att_sd_card | 20.1130 |
| att_brandSamsung | 9.1220 |

Finally, let’s look at how our model did in the testing sample. Below I’ve shown you a plot of actual versus predicted price values. The straight line is what we would expect to see if there were a perfect correlation between the two (obviously not!!) while the smoothed line is the trend that we actually do see in the scatter plot. Considering the high R squared in the testing sample of 57% (not too far off from the training sample), it’s a nice confirmation of the utility of this model to see the smoothed line following that perfect prediction line, but I won’t be calling up Rogers Wireless with the magical model just yet!
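For readers who want to score held-out predictions themselves, here is a minimal Python sketch (not the R code used for this post) of the two usual measures: RMSE, and R squared computed as the squared correlation between actual and predicted values. Whether the 57% above was computed exactly this way is an assumption, and the prices below are made up.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def r_squared(actual, predicted):
    """Squared Pearson correlation between actual and predicted values."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov * cov / (var_a * var_p)

# Made-up test-sample prices in dollars.
actual_price = [99.0, 250.0, 400.0, 650.0, 180.0]
predicted_price = [130.0, 220.0, 450.0, 560.0, 240.0]

print("RMSE:", round(rmse(actual_price, predicted_price), 2))
print("R squared:", round(r_squared(actual_price, predicted_price), 3))
```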

In fact, before I close off this post, it would be remiss of me not to investigate a couple of cases in this final graph that look like outliers: the one on the bottom right, and the one on the top left.

The one on the bottom right happens to be a Sony Xperia Z3v Black with 32GB of storage space. What I learned from checking into this is that since the pricing data on the source website is pulled from amazon.com, sometimes instead of pulling the full regular price, it happens to pull the data on a day when a special sale or service agreement price is listed. When I pulled the data, the Xperia was listed at a price of $29.99. Today, on April 6th, the price that you would get if you looked it up through the source website is .99! Interestingly, my model had predicted a full price of $632.17, which was not very far off from the full price of $599.99 that you can see if you go on the listing on amazon.com. Not bad!

Now, how about the phone that cost so much when the model said it shouldn’t? This phone was none other than the Black LG E960 Google Nexus 4 Unlocked GSM Phone with 16GB of storage space. The price I pulled that day was a whopping $699.99 but the model only predicted a price of $241.86! Considering the specs on this phone, the only features that really seem to measure up are the storage (16GB is roughly in the 85th percentile for smart phones) and the RAM (2 GB is roughly in the 93rd percentile for smart phones). Overall though, the model can’t account for any other qualities that Google might have imbued into this phone that were not measured by the source website. Hence, this is a formidable model outlier!

If you take out the Sony Xperia that I mentioned first, the adjusted R squared value goes up from 57% to 74%, and the residual standard error decreases from $156 to $121. That’s a lot of influence for just one outlier, and one that turned out to be a data quality problem at that. Wow!
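To see how a single point can move these numbers so much, here is a minimal Python sketch (not the R code used for this post). It assumes the figures come from a straight-line fit of actual price on predicted price, and computes adjusted R squared and residual standard error with and without one outlier. The function name and all of the prices are made up.

```python
import math

def fit_summary(x, y):
    """Fit y = a + b*x by least squares; return (adjusted R squared, residual standard error)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    tss = sum((yi - mean_y) ** 2 for yi in y)
    r2 = 1.0 - rss / tss
    # One predictor, so the residual degrees of freedom are n - 2.
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - 2)
    rse = math.sqrt(rss / (n - 2))
    return adj_r2, rse

# Made-up numbers: the last phone has a sale price far below what the model predicts.
predicted = [100.0, 200.0, 300.0, 400.0, 500.0, 600.0]
actual = [110.0, 190.0, 310.0, 390.0, 510.0, 30.0]

with_outlier = fit_summary(predicted, actual)
without_outlier = fit_summary(predicted[:-1], actual[:-1])
print("with outlier:    adj R2 = %.2f, RSE = %.1f" % with_outlier)
print("without outlier: adj R2 = %.2f, RSE = %.1f" % without_outlier)
```

In this toy data the effect is exaggerated because there are only six points, but the mechanism is the same: one badly mispriced phone inflates the residuals and drags down the fit.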

Reflecting on this exercise, the one factor that I wish had been collected is processor speed. I’m curious how much that would factor into pricing decisions, but alas this information was unavailable.

Anyway, this was fun, and I hope not too boring for you, the readers. Thanks for reading!!
