Hard drive occupation prediction with R – part 3 – Predicting future ranges


In the second article, we saw how to use a Monte
Carlo simulation to generate samples of the disk space delta for future dates
and to calculate the probability distribution of zeroing free space in the
future.

In this article, we will see how we can plot the evolution of the predicted
distribution of occupied disk space. Instead of answering the question “how
likely is it that my disk space will zero before date X?”, we will answer
“how much disk space will I need by date X, and with what probability?”

The input data

This file has the dataset we will use as an example. It’s
the same one we used in the second part. The graph below shows it:

We now import this data into R:

  
  # Read the dataset: one row per day, with the used disk space (usd):
  duinfo <- read.table('duinfospike.dat',
  		colClasses=c("Date","numeric"),
  		col.names=c("day","usd"))
  attach(duinfo)
  # Total disk space available:
  totalspace <- 450000
  # The present day is the last day in the dataset:
  today <- tail(day, 1)
  

We then build our simulations for the next 4 months:

  # Number of Monte Carlo samples
  numsimulations <- 10000
  
  # Number of days to simulate (about 4 months)
  numdays    <- 120
  
  # Simulate: each row is a scenario, each column is a day, starting at the
  # last known value and accumulating deltas sampled from the historical pool:
  simulate <- function(data, ndays) {
  	delta <- diff(data)
  	dssimtmp0 <- replicate(numsimulations, tail(data, 1))
  	dssimtmp  <- dssimtmp0
  	f <- function(i) dssimtmp <<- dssimtmp + replicate(numsimulations, sample(delta, 1, replace=TRUE))
  	cbind(dssimtmp0, mapply(f, seq(1, ndays)))
  }
  dssim <- simulate(usd, numdays)
  
  # Future days:
  fday <- seq(today, today+numdays, by='day')
  
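As a quick sanity check (an addition to the original), we can look at the
shape of the simulation matrix: one row per scenario, one column per day:

  dim(dssim)  # numsimulations rows (scenarios), numdays+1 columns (days)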

Visualizing the possible scenarios

What kind of data have we built in our simulations? Each simulation is
built by sampling from the delta pool and adding the result to the current disk
space, once for each day in the simulated period. We can say that each
individual simulation is a possible scenario for the next 4 months. The graph
below shows the first 5 simulations:

  plot(fday, dssim[1,], ylim=c(min(dssim[1:5,]), max(dssim[1:5,])), ylab='usd', xlab='day', xaxt='n', type='l')
  axis.Date(1, day, at=seq(min(fday), max(fday), 'week'), format='%F')
  lines(fday, dssim[2,])
  lines(fday, dssim[3,])
  lines(fday, dssim[4,])
  lines(fday, dssim[5,])
  abline(h=totalspace, col='gray')

From this graph we can clearly see that the range of possible values for the
used disk space grows with time. All simulations start with the same value – the
used disk space for today – and grow apart as we sample from the delta pool.
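We can quantify this spreading. As a minimal sketch (an addition to the
original article), the standard deviation of the simulated values grows
steadily with the number of simulated days:

  spread <- apply(dssim, 2, sd)
  plot(fday, spread, ylab='sd of simulated usd', xlab='day', xaxt='n', type='l')
  axis.Date(1, day, at=seq(min(fday), max(fday), 'week'), format='%F')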

We can also plot all simulations in a single graph:

  plot(fday, dssim[1,], ylim=c(min(dssim), max(dssim)), ylab='usd', xlab='', xaxt='n', type='l')
  axis.Date(1, day, at=seq(min(fday), max(fday), 'week'), format='%F')
  f <- function(i) lines(fday, dssim[i,])
  mapply(f, seq(2, numsimulations))
  abline(h=totalspace, col='gray')

This plot gives us an idea of the overall spread of the data, but it fails to
show the density. There are 10000 black lines there, with many of them
overlapping one another.
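One possible workaround (not in the original article) is to draw each line
with a very low alpha value, so that regions where many scenarios overlap
appear darker:

  # Redraw all simulations with semi-transparent lines; denser regions get darker
  plot(fday, dssim[1,], ylim=c(min(dssim), max(dssim)), ylab='usd', xlab='', xaxt='n', type='n')
  axis.Date(1, day, at=seq(min(fday), max(fday), 'week'), format='%F')
  mapply(function(i) lines(fday, dssim[i,], col=rgb(0, 0, 0, 0.01)), seq(1, numsimulations))
  abline(h=totalspace, col='gray')

Rendering 10000 semi-transparent lines can be slow; plotting a random subset
of the rows gives a similar picture.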

Visualizing the distribution for specific days

There is another way to look at our data: we have created, for each day, a
sample of the possible used disk spaces. We can take any day of the simulation
and look at the density:

  dssimchosen <- list(density(dssim[,5]), density(dssim[,15]), density(dssim[,45]), density(dssim[,120]))
  colors <- rainbow(length(dssimchosen))
  xs <- c(mapply(function(d) d$x, dssimchosen))
  ys <- c(mapply(function(d) d$y, dssimchosen))
  plot(dssimchosen[[1]], xlab='usd', ylab='dens',
  	xlim=c(min(xs),max(xs)), ylim=c(min(ys),max(ys)), col=colors[1], main='')
  lines(dssimchosen[[2]], col=colors[2])
  lines(dssimchosen[[3]], col=colors[3])
  lines(dssimchosen[[4]], col=colors[4])
  abline(v=totalspace, col='gray')
  legend('top', c('5 days', '15 days', '45 days', '120 days'), fill=colors)

By looking at this graph we can see the trend:

  • The curves are getting flatter: the spread of possible values for the
    occupied disk space is increasing.
  • The curves are moving to the right: more simulations reach higher occupied
    disk space values (both trends are checked numerically in the sketch below).
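As a small sketch (an addition to the original), we can confirm both trends
numerically: the mean for each chosen day measures the shift to the right, and
the standard deviation measures the flattening:

  chosen <- c(5, 15, 45, 120)
  round(apply(dssim[, chosen], 2, function(s) c(mean=mean(s), sd=sd(s))))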

Visualizing the evolution of the distribution

So far, we have seen how we can visualize some simulations along the 4 months
and how we can visualize the distribution for some specific days.

We can also plot the distribution of the values for each day in the simulated 4
months. We can’t use the kernel density plot or the histogram, as they use both
axes, but there are other options, most of them involving some abuse of the
built-in plot functions.

Boxplot

In R, we can use the boxplot function to create a boxplot for each day in a
very straightforward way:

  boxplot(dssim, outline=F, names=seq(today, as.Date(today+numdays), by='day'), ylab='usd', xlab='day', xaxt='n')
  abline(h=totalspace, col='gray')

The boxplots glued together form a shape that shows us the distribution of our
simulations for any given day:

  • The thick line in the middle of the graph is the median.
  • The darker area goes from the first quartile to the third, which means that
    50% of the samples are in that range.
  • The lighter area extends to the maximum and minimum points, as long as they
    are within 1.5 IQR of the upper/lower quartile. Points outside this range
    are considered outliers and are not plotted (see the sketch below).
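As a sketch (an addition to the original), the boxplot.stats function returns
exactly the five values drawn for one day, plus the outliers that would be
omitted:

  st <- boxplot.stats(dssim[, 120])
  st$stats        # lower whisker, first quartile, median, third quartile, upper whisker
  length(st$out)  # number of points treated as outliers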

Quantile lines

We can use the quantile function to calculate the values of each
quantile per day, and plot the lines:

  q <- 6
  f <- function(i) quantile(dssim[,i], seq(0, 1, 1.0/q))
  qvals <- mapply(f, seq(1, numdays+1))
  colors <- colorsDouble(rainbow, q+1)
  plot(fday, qvals[1,], ylab='usd', xlab='day', xaxt='n', type='l', col=colors[1], ylim=c(min(qvals), max(qvals)))
  mapply(function(i) lines(fday, qvals[i,], col=colors[i]), seq(2, q+1))
  axis.Date(1, day, at=seq(min(fday), max(fday), 'week'), format='%F')
  abline(h=totalspace, col='gray')

The advantage of this type of graph over the boxplot is that it is parameterized
by q, the number of parts into which we divide our sample. The lines above show
the division. If q is even, the middle line is exactly the median. If q is 4,
the lines draw a shape similar to that of the boxplot, the only difference being
the top and bottom lines, which include outliers – the boxplot filters outliers
by using the IQR as explained above.
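The quantile function also answers the question posed in the introduction
directly. As a minimal sketch (an addition to the original), the values below
are the amounts of disk space that would be enough by the last simulated day
with 50%, 90% and 99% probability:

  quantile(dssim[, numdays+1], c(0.5, 0.9, 0.99))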

In the quantile lines code above, we used the colorsDouble function to generate
a sequence of colors that folds in the middle:

  colorsDouble <- function(colorfunc, numcolors) {
  	colors0 <- rev(colorfunc((1+numcolors)/2))
  	c(colors0, rev(if (numcolors %% 2 == 0) colors0 else head(colors0, -1)))
  }
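For example (an addition to the original), with an odd count the palette has a
unique middle color, while with an even count the two middle entries share a
color:

  colorsDouble(rainbow, 7)  # c1 c2 c3 c4 c3 c2 c1: unique middle color
  colorsDouble(rainbow, 8)  # c1 c2 c3 c4 c4 c3 c2 c1: middle color repeated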

Quantile areas

We can also abuse the barplot function to create an area graph. We have to
eliminate the bar borders, zero the distance between them, and plot a white bar
from the axis up to the lowest quantile:

  q <- 7
  f <- function(i) {
  	qa <- quantile(dssim[,i], seq(0, 1, 1.0/q))
  	c(qa[1], diff(qa))
  }
  qvals <- mapply(f, seq(1, numdays+1))
  colors <- c('white', colorsDouble(rainbow, q))
  barplot(qvals, ylab='usd', xlab='day', col=colors, border=NA, space=0,
  	names.arg=seq(min(fday), max(fday), 'day'), ylim=c(min(dssim), max(dssim)))
  abline(h=totalspace, col='gray')

In this case, using an odd q makes more sense, as we want to use the same
colors for the symmetric intervals. With an even q, there would either be a
larger middle interval with two quantiles or a broken symmetry. The code above
builds a larger middle interval when given an even q.

Quantile heat map

If we increase q and use heat.colors in a quantile area plot, we get
something similar to a heat map:

  q <- 25
  f <- function(i) {
  	qa <- quantile(dssim[,i], seq(0, 1, 1.0/q))
  	c(qa[1], diff(qa))
  }
  qvals <- mapply(f, seq(1, numdays+1))
  colors <- c('white', colorsDouble(heat.colors, q))
  barplot(qvals, ylab='usd', xlab='day', col=colors, border=NA, space=0,
  	names.arg=seq(min(fday), max(fday), 'day'), ylim=c(min(dssim), max(dssim)))
  abline(h=totalspace, col='gray')

Visualizing past, present and future

We can also plot our data in the same graph as our simulations, by extending the
axis of the barplot and using the points function:

  quantheatplot <- function(x, sim, ylim) {
  	q <- 25
  	simstart <- length(x) - length(sim[1,])
  	f <- function(i) {
  		if (i <= simstart)
  			replicate(q+1, 0)
  		else {
  			qa <- quantile(sim[,i-simstart], seq(0, 1, 1.0/q))
  			c(qa[1], diff(qa))
  		}
  	}
  	qvals <- mapply(f, seq(1, length(x)))
  	colors <- c('white', colorsDouble(heat.colors, q))
  	barplot(qvals, ylab='usd', xlab='day', col=colors, border=NA, space=0,
  		names.arg=x, ylim=ylim)
  	abline(h=totalspace, col='gray')
  }
  quantheatplot(c(day, seq(min(fday), max(fday), 'day')), dssim, ylim=c(min(c(usd, dssim)), max(dssim)))
  points(usd)
  abline(h=totalspace, col='gray')

Training set, validation set

Cross-validation is a technique that we can use to validate the use of Monte
Carlo on our data.

We first split our data in two sets: the training set and the validation set. We
then use only the first in our simulations, and plot the second over it. We can
then see graphically whether the data fits our simulation.

Let’s use the first two months as the training set, and the other three months
as the validation set:

  # Number of days to use in the training set
  numdaysTrain <- 60
  numdaysVal   <- length(day) - numdaysTrain
  
  dssim2 <- simulate(usd[seq(1, numdaysTrain)], numdaysVal-1)
  allvals <- c(usd, dssim2)
  quantheatplot(day, dssim2, c(min(allvals), max(allvals)))
  points(usd)

Looks like using only the first two months already gives us a fair simulation.
What if we used only a single month, when no disk cleanup was performed?

  # Number of days to use in the training set
  numdaysTrain <- 30
  numdaysVal   <- length(day) - numdaysTrain
  
  dssim3 <- simulate(usd[seq(1, numdaysTrain)], numdaysVal-1)
  allvals <- c(usd, dssim3)
  quantheatplot(day, dssim3, c(min(allvals), max(allvals)))
  points(usd)

If we do regular disk cleanups, we must have at least one of them in our
training set to get realistic results. Our training set is not representative
without it.

This also tests our cross-validation code. A common mistake is to use the whole
data set as both the training set and the validation set; that is not
cross-validation.

Conclusions

We can use Monte Carlo simulations not only to generate the probability
distribution of an event, as we did in the previous
article, but also to predict a possible range of future values. In this article,
disk space occupation is not the most interesting example, as we are usually
more interested in knowing when our used disk space will reach a certain value
than in knowing the most probable values over time. But imagine that the data
represents the number of miles traveled in a road trip or race. You can then not
only see when you will arrive at your destination, but also the region where you
will probably be on any given day.

There are plenty of other uses for this kind of prediction. Collect the data,
look at it, and consider whether predicting future ranges would be useful and
whether it makes sense with the data you have. Predictions based on evidence can
even be used to support a decision or a point of view; just keep in mind that
you can only rely on the past if you honestly don’t think anything different is
going to happen.
