(This article was first published on **Win-Vector Blog » R**, and kindly contributed to R-bloggers)

We use R to take a *very* brief look at the distribution of e-book sales on Amazon.com.

Recently Hugh Howey shared some eBook sales data spidered from Amazon.com: The 50k Report. The data is largely a single scrape of statistics about various anonymized books. Howey’s analysis tries to break sales down by declared category and source, but there are a lot of difficulties due to the quality of the tags in the data. Many of the questions we would like to look into (such as whether reviews drive sales or sales drive reviews) are not practical to answer without a more longitudinal data set: one that includes many observations of the same set of books over time.

However, we can try to relate one type of reported outcome (sales rank, a number Amazon visibly shares on ebook product pages) to the number of sales (a harder-to-find quantity). Note: we are not really doing any *predictive* modeling, as we are not trying to predict future sales from features. Instead we are just trying to learn an approximate relation between two different encodings of outcomes (sales count and sales rank).

We share the steps to convert the Excel data to a usable R format here on GitHub. A quick use of the data is as follows:

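```
library('RCurl')  # provides getBinaryURL()
# Build the URL of the prepared data file on GitHub.
url <- paste('https://raw.github.com/WinVector/',
             'Examples/master/',
             'AmazonBookData/amazonBookData.Rdata', sep='')
# Download the raw bytes and load the saved objects into the workspace.
load(rawConnection(getBinaryURL(url)))
```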

The data is now in a dataframe named “`d`”. The crude analysis we want to do is to relate `Kindle.eBooks.Sales.Rank` to `Daily.Units.Sold`. We will do this on “log-log” paper (where famously most anything looks like a line).

```
# Fit a line on log-log scale: log(units sold) against log(sales rank).
model <- lm(log(Daily.Units.Sold) ~ log(Kindle.eBooks.Sales.Rank),
            data=d)
d$EstLogUnitsSold <- predict(model, newdata=d)
library('ggplot2')
# Plot the observed points with the fitted line overlaid.
ggplot(data=d, aes(x=log(Kindle.eBooks.Sales.Rank))) +
  geom_point(aes(y=log(Daily.Units.Sold))) +
  geom_line(aes(y=EstLogUnitsSold))
```

The line fit looks plausible for ebooks in the sales-rank range around 200 through 150,000. Let’s take a quick look at the model:

```
print(model)

Call:
lm(formula = log(Daily.Units.Sold) ~ log(Kindle.eBooks.Sales.Rank),
    data = d)

Coefficients:
                  (Intercept)  log(Kindle.eBooks.Sales.Rank)
                      11.5063                        -0.9334
```

This is roughly saying `Daily.Units.Sold ~ exp(11.5 - 0.93*log(Kindle.eBooks.Sales.Rank))` or (with a little algebra): `Daily.Units.Sold ~ 99339.64 / Kindle.eBooks.Sales.Rank^0.93`.

This isn’t too far from the following easy rule of thumb: `Daily.Units.Sold ~ 100000 / Kindle.eBooks.Sales.Rank`. Applying this, we would expect a typical ebook ranked at position 100,000 to sell about 1 copy a day. Now we don’t want to read too much into this, as fitting a line onto log-log paper is a classic example of heavy-handed econometrics (in econometrics you often force the structure of the results by model selection; see “Bad models and the end of the world” for some enjoyable vitriol on abuses of the idea).
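To see how close the rule of thumb is to the fitted curve, here is a minimal sketch that evaluates both at a few ranks. The coefficients are copied from the `print(model)` output above (so it runs without the data); the function names and the particular ranks are made up for illustration.

```r
# Coefficients copied from the print(model) output; not re-estimated here.
intercept <- 11.5063
slope <- -0.9334

# Fitted power law: units = exp(intercept) * rank^slope
fitted_units <- function(rank) exp(intercept) * rank^slope

# Rule of thumb from the text: units = 100000 / rank
thumb_units <- function(rank) 100000 / rank

# Illustrative ranks inside the range where the line fit looked plausible.
ranks <- c(200, 1000, 10000, 100000)
comparison <- data.frame(
  rank = ranks,
  fitted = fitted_units(ranks),
  rule_of_thumb = thumb_units(ranks)
)
print(comparison)
```

Over these ranks the fitted curve sits somewhat above the rule of thumb (by a factor of very roughly 1.4 to 2), so the rule of thumb is best read as an order-of-magnitude guide rather than a substitute for the fit.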

However, this rule of thumb is consistent with Chris Anderson’s point in The Long Tail. The fact that we see a plausible power law over a large range means we can (crudely) estimate the entire expected sales of an infinite sized catalog as: `sum_{rank=1...infinity} 99339.64 rank^pow`. In our case `pow=-0.93`, which is `≥ -1`, meaning the sum diverges: the total is infinite. If `pow` had been something smaller than -1 (like `pow=-2`), then even an infinite catalog would only have a finite total value. But in this case the theory says the ebook distributor can grow their total revenue to just about any level if they can add enough books cheaply (they don’t get overwhelmed by diminishing revenue returns early).
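The difference between the two exponents can be checked numerically. The sketch below compares partial sums of `rank^pow` for `pow=-0.93` and `pow=-2` as the catalog grows; the helper name and catalog sizes are illustrative, and the constant factor 99339.64 is left out because it does not affect convergence.

```r
# Partial sum of rank^pow over a catalog of n books.
partial_sum <- function(n, pow) sum(seq_len(n)^pow)

# Illustrative catalog sizes (number of books).
catalog_sizes <- c(1e2, 1e4, 1e6)
growth <- data.frame(
  catalog_size = catalog_sizes,
  pow_fitted = sapply(catalog_sizes, partial_sum, pow = -0.93),
  pow_minus_2 = sapply(catalog_sizes, partial_sum, pow = -2)
)
print(growth)
```

The `pow=-2` column stays below its limit of `pi^2/6` no matter how large the catalog gets, while the `pow=-0.93` column keeps climbing (slowly, but without bound).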

Amazon clearly wants the large revenue found in the popular (or “head”) books, but you can see that it is plausible they will always have more opportunity to grow their business by increasing coverage of (and lowering the cost of handling) many less popular products (the so-called “long tail”). Not a new observation, but fun to be able to pull it quickly from shared data.

(Funny side note: this sort of analysis can be stretched to say that the expected lifetime sales of any book that stays in print forever is infinite. Suppose our book starts at rank A and each day k more books are written, all of them more popular than our book. Then the modeled total unit sales of our book is `sum_{rank=A,A+k,A+2k...infinity} 100000/rank`, which also diverges (though it would stay bounded if we added a reasonable discount term for future value). Mostly we are showing you can push these analyses way too far; to get better results you need to correctly model more of the market.)
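As a rough illustration of the side note, the sketch below sums the modeled daily sales with and without a discount term. The starting rank `A`, the arrival rate `k`, and the daily discount factor `delta` are hypothetical values picked for the example, not quantities estimated from the data.

```r
# Hypothetical values, not from the data: starting rank A, k new
# more-popular books per day, and a daily discount factor delta.
A <- 1000
k <- 10
delta <- 0.999

# Modeled sales on day j (j = 0, 1, 2, ...) under the rule of thumb.
daily_units <- function(j) 100000 / (A + j * k)

horizons <- c(1e3, 1e5, 1e6)  # number of days summed
totals <- data.frame(
  days = horizons,
  undiscounted = sapply(horizons, function(n) sum(daily_units(0:(n - 1)))),
  discounted = sapply(horizons, function(n) {
    j <- 0:(n - 1)
    sum(delta^j * daily_units(j))
  })
)
print(totals)
```

The undiscounted total keeps growing with the horizon (roughly like the logarithm of the number of days), while the discounted total settles down and can never exceed `(100000/A)/(1-delta)`, since each day sells at most `100000/A` units.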
