# Better than Average

August 31, 2010
By

(This article was first published on R-Chart, and kindly contributed to R-bloggers)

 The NIST's The Engineering Statistics Handbook includes an Introduction to Time Series Analysis which provides a great way of demonstrating how R can be used to make such calculations.  This post replicates the analysis of the data set introduced under Averaging Methods using R. As you might expect, Time Series Analysis is a broad subject that has been investigated in depth elsewhere.  If you need more information, a book such as Time Series Analysis and Its Applications provides a much more in depth look at the mathematical theory involved as well as providing practical examples of the use of R for analysis and forecasting.But back to the NIST handbook... the data set they used represents supplier deliveries to a warehouse.  The calculations that follow demonstrate how to perform the calculations they do in this section of the handbook using R.

Supplier Amount(in 1000 of $) 1 9 2 8 3 9 4 12 5 9 6 12 7 11 8 7 9 13 10 9 11 11 12 10 Simple Average (Mean) In R the series can be represented as a vector. v=c(9,8,9,12,9,12,11,7,13,9,11,10) The average of the series is 10. mean(v) The "error" amount that each entry in the vector differs from the mean can be calcuated as follows. s - mean(s) This value can serve as the basis for a measure to ascertain how well a model fits (Error Squared). (v - mean(v))^2 Finally, the sum or mean of these results can be used to compute values that represent the overall fit (or amount of error) for the estimate. sum((v - mean(v))^2) # SSE" is the sum of the squared errors. mean((v - mean(v))^2) # MSE" is the mean of the squared errors. Now that we have a simple values that indicate how good an estimate for a set is, we can test with other values. Rather than writing out an entire calculation each time, we can create a function in R and apply the function to each value in a vector. sse = function(x, series){sum((series - x)^2)} mse = function(x, series){mean((series - x)^2)} To compare the estimate (10) with 7, 9, and 12. sapply(c(7,9,10,12),sse,v) sapply(c(7,9,10,12),mse,v) Analyzing Time Series Data A time series is simply a sequence of data points in time. Time series data has unique characteristics which allow it to be processed in a similar manner regardless of the underlying data represented. Many disciplines deal with this type of data including statistics, signal processing, econometrics and mathematical finance. Such data appears in business in relationship to sales forecasting, budgetary analysis, yield projections, and in the process / quality control arena. In other blog entries, they are used in relation to stock market analysis and economic data. They are relevant to web sites and are available through tools like Google Analytics. So time series data is widely applicable but has common features regardless of its application. It can be analyzed to identify its characteristics and patterns. This often leads to forecasting in which a model is used to predict future events based upon past data. All time series data has the following common qualities: • a natural temporal ordering • often events that are close together are generally more closely related than those further apart • in most cases, past values are assumed to influence future values (rather than the other way around) • usually spaced at uniform intervals The data set we are working with is a bit odd to consider as a time series - a supplier is not a unit of time. However, it is useful for making the point that a "simple" average (or mean) of all past observations is only a useful estimate for when there are no trends. Not sure what to make of this. I emailed the government and asked for clarification. Will post the answer here if I receive a response. In R, a vector can be cast to a time series object as follows: s=as.ts(c(9,8,9,12,9,12,11,7,13,9,11,10)) Moving Average A moving average is described in the NIST Handbook and is also referred to as "smoothing" - a term that comes up in ggplot2 (geom_smooth). There are a myriad of functions available in R that involves some sort of lagged calculation of a series of numbers. A simple example that almost does the trick involves rollapply: rollapply(s, 3, mean) This works, but it is not clear that the first two entries were skipped. Better to use a library that has additional checks coded in... library(TTR) SMA(s,3) If you take a look at the code inside... you can get an idea of the additional verification and error checking (which accounts for missing values at the beginning of the list). To view the source, simply input the function name without any parenthesis: SMA You can drill down into the internally called methods in this case: runMean runSum With this method available, we can calculate the Error and the Error Squared: s - SMA(s,3) # Error (s - SMA(s,3))^2 # Error Squared Note that the calculated mean replaced missing entries as zeroes... x=((s - SMA(s,3))^2) x[ is.na(x) ] <- 0 mean(x) Oh - in case you were interested in the plot: library(ggplot2) df = as.data.frame(as.ts(v)) df$idx = as.numeric(rownames(df))

df$x= as.numeric(df$x)
qplot(data=df, idx, x) + geom_line() + geom_smooth()