On some environments, disk space usage can be pretty predictable. In this post,
we will see how to do a linear regression to estimate when free space will reach
zero, and how to assess the quality of such regression, all using
R – the
statistical software environment.
The first thing we need is the data. By running a simple
(date --utc; df -k; echo) >> /var/dflog.txt
everyday at 00:00 by cron, we will have more than enough, as that will store the
date along with total, free and used space for all mounted devices.
On the other hand, that is not really easy to parse in R, unless we learn more
about the language. In order to keep this post short, we invite the reader to
use his favorite scripting language (or python) to process that into a file with
the day in the first column and the occupied space in the second, and a row for
YYYY-MM-DD free space YYYY-MM-DD free space (...)
This format can be read and parsed in R with a single command.
This is the data file we will use as source for the results
provided in this article. Feel free to download it and repeat the procedures.
All number in the file are in MB units, and we assume an HD of 500GB. We will
call the date the free space reaches 0 as the df0.
After running R in the shell prompt, we get the usual license and basic help
The first step is to import the data:
> duinfo <- read.table('duinfo.dat', colClasses=c("Date","numeric"), col.names=c("day","usd")) > attach(duinfo) > totalspace <- 500000
The variable duinfo is now a list with two columns: day and usd. The
attach command allows us to use the column names directly. The
totalspace variable is there just for clarity in the code.
We can check the data graphically by issuing:
> plot(usd ~ day, xaxt='n') > axis.Date(1, day, format='%F')
That gives us an idea on how predictable the usage of our hard drive is.
From our example, we get:
We can now create and take a look at our linear model object:
> model <- lm(usd ~ day) > model
Call: lm(formula = usd ~ day) Coefficients: (Intercept) day -6424661.2 466.7
The second coefficient in the example tells us that we are consuming about 559 MB of disk space per day.
We can also plot the linear model over our data:
The example plot, with the line:
Evaluating the model
R provides us with a very generic command that generates statistical information
about objects: summary. Let’s use it on our linear model objects:
Call: lm(formula = usd ~ day) Residuals: Min 1Q Median 3Q Max -3612.1 -1412.8 300.7 1278.9 3301.0 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -6.425e+06 3.904e+04 -164.6 <2e-16 *** day 4.667e+02 2.686e+00 173.7 <2e-16 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 1697 on 161 degrees of freedom Multiple R-squared: 0.9947, Adjusted R-squared: 0.9947 F-statistic: 3.019e+04 on 1 and 161 DF, p-value: < 2.2e-16
To check the quality of a linear regression, we focus on the residuals, as
they represent the error of our model. We calculate them by subtracting the
expected value (from the model) from the sampled value, for every sample.
Let’s see what what each piece of information above means: the first is the
of the residuals. That tells us the maximum and minimum error, and that 75% of
the errors are between -1.4 GB and 1.3 GB. We then get the results of a
Student’s t-test of
the model coefficients against the data. The last column tells us roughly how
probable seeing the given residuals is, assuming that the disk space does not
depend on the date – it’s the
p-value. We usually accept an
hypothesis when the p-value is less than 5%; in this example, we have a large
margin for both coefficients. The last three lines of the summary give us more
measures of fit: the
r-squared values – the closest
to 1, the better; and the general p-value from the f-statistics, less than 5%
In order to show how bad a linear model can be, the summary bellow was generated
by using 50GB as the disk space and adding a random value between -1GB and 1GB
Call: lm(formula = drand$usd ~ drand$day) Residuals: Min 1Q Median 3Q Max -1012.97 -442.62 -96.19 532.27 1025.01 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 17977.185 33351.017 0.539 0.591 drand$day 2.228 2.323 0.959 0.340 Residual standard error: 589.7 on 84 degrees of freedom Multiple R-squared: 0.01083, Adjusted R-squared: -0.0009487 F-statistic: 0.9194 on 1 and 84 DF, p-value: 0.3404
It’s easy to notice that, even though the five-number summary is narrower, the
p-values are greater than 5%, and the r-squared values are very far from 1. That
happened because the residuals are not normally distributed.
Now that we are (hopefully) convinced that our linear model fits our data
well, we can use it to predict hard-disk shortage.
Until now, we represented disk space as a function of time, creating a model
that allows us to predict the used disk space given the date. But what we really
want now is to predict the date our disk will be full. In order to do that, we
have to invert the model. Fortunately, all statistical properties (t-tests,
f-statistics) hold in the inverted model.
> model2 <- lm(day ~ usd)
We now use the predict function to extrapolate the model.
> predict(model2, data.frame(usd = totalspace)) 1 14837.44
But… when is that? Well, that is the numeric representation of a day in R,
which is the number of days since 1970-01-01. To get the human-readable day, we
> as.Date(predict(model2, data.frame(usd = totalspace)), origin="1970-01-01") 1 "2010-08-16"
There we are: df0 will be at the above date if the
current pattern holds until then, with 95% statistical significance.
The linear model can give us the predicted hard disk space usage at any future
date, as long as collected data pattern is linear. If the data we collected
has a break point – some disk cleanup or software installation – the model will
not give good results. We will usually see that in the analysis, but we should
also always look at the graph.
In the next part of this article we will see a more
robust method for df0 prediction. We will also sacrifice our ability
to see the used space vs time in order to get a statistical distribution for the
date of exhaustion, which is a lot more useful.
- http://www.cyclismo.org/tutorial/R/index.html: R tutorial
- http://www.r-tutor.com/: An R introduction to statistics
- http://cran.r-project.org/doc/contrib/Lemon-kickstart/index.html: Kickstarting R
- http://data.princeton.edu/R/linearModels.html: “Linear models” page of
Introduction to R.