(This article was first published on **Econometrics Beat: Dave Giles' Blog**, and kindly contributed to R-bloggers)

I was in (yet another) session with my analyst, "Jane", the other day, and quite unintentionally the conversation turned, once again, to the subject of "semi-log" regression equations.

After my previous ~~rant to~~ **discussion with her** about this matter, I've tried to stay on the straight and narrow. It's better for my blood pressure, apart from anything else! Anyway, somehow we got back to this topic, and she urged me to get some related issues off my chest. This *is* therapy, after all!

Right at the outset, let me state quite categorically that lots of people estimate semi-logarithmic regressions for the wrong reasons. I'm not condoning what they do. That's their problem - I have enough of my own! However, if they're going to insist on doing this, then I'm going to insist that they be consistent when it comes to using their estimated model for forecasting purposes!

I mean, we wouldn't use an inconsistent *estimator*, would we? So why should we put up with an inconsistent *modeller*?

Let me tell you all about it.........

Suppose that we're using a regression model of the form

$$\log(y_t) = \beta_1 + \beta_2 x_{2t} + \beta_3 x_{3t} + \dots + \beta_k x_{kt} + \varepsilon_t ,$$

where log(.) denotes the *natural* logarithm. One or more of the regressors may also be in logarithmic form, but that's irrelevant for the purposes of this post. We just need them to be non-random.

Having estimated the model by (say) OLS, we have the "fitted" (predicted) values for the dependent variable. That is, we have values of $[\log(y_t)]^* = b_1 + b_2 x_{2t} + \dots + b_k x_{kt}$, where $b_i$ is the OLS estimator for $\beta_i$ (i = 1, 2, ......, k).

Usually, of course, we'd be more interested in fitted values expressed in terms of the *original data* - that is, in terms of y itself, rather than log(y).

So, all we have to do is take the inverse of the logarithmic transformation, and the fitted values of interest will be $y_t^* = \exp\{[\log(y_t)]^*\}$. Right?

Actually, No!

Sure, we could do that, but we'd be introducing a bias into the predictions that we can easily avoid.

So, what's going on here? Well, you'll notice that I didn't state any assumptions about the random error term, $\varepsilon_t$, in the regression model. That was bad - those assumptions are really just as important as the specification of the structural part of the equation. So, let's tidy things up a bit. Specifically, let's assume that the errors have a zero mean, and are homoskedastic and serially independent. Crucially, let's also assume that they are *Normally distributed*.

This last assumption is key to what follows. If $\varepsilon_t$ is Normally distributed, then $\log(y_t)$ is also Normally distributed, assuming that the regressors are non-random. In turn, this implies that $y_t$ itself must follow a *Log-Normal* distribution.

We need to be aware of the following key relationships between these Normal and Log-Normal distributions. In general, if $X \sim N[\mu, \sigma^2]$, then $Y = \exp[X] \sim \text{Log-N}[m, v]$, where "m" and "v" are the mean and variance of the Log-Normal distribution. It's well known that:

- $m = \exp[\mu + \sigma^2/2]$
- $v = [\exp(\sigma^2) - 1]\exp[2\mu + \sigma^2]$
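The two Log-Normal moment formulas above are easy to check by simulation. The following is only a minimal sketch in Python (standard library only); the parameter values, the seed, and the helper name `lognormal_moments` are illustrative assumptions, not anything taken from the original analysis.

```python
import math
import random


def lognormal_moments(mu, sigma2):
    """Closed-form mean m and variance v of Y = exp(X), where X ~ N(mu, sigma2)."""
    m = math.exp(mu + sigma2 / 2.0)
    v = (math.exp(sigma2) - 1.0) * math.exp(2.0 * mu + sigma2)
    return m, v


# Illustrative (made-up) parameter values.
mu, sigma2 = 0.5, 0.36
n_draws = 100_000

rng = random.Random(12345)  # fixed seed so that the run is repeatable
draws = [math.exp(rng.gauss(mu, math.sqrt(sigma2))) for _ in range(n_draws)]

m, v = lognormal_moments(mu, sigma2)
sim_mean = sum(draws) / n_draws
sim_var = sum((d - sim_mean) ** 2 for d in draws) / (n_draws - 1)
naive_mean = math.exp(mu)  # what exponentiating the Normal mean alone gives

print(f"theory:     m = {m:.4f}, v = {v:.4f}")
print(f"simulation: mean = {sim_mean:.4f}, variance = {sim_var:.4f}")
print(f"exp(mu)   = {naive_mean:.4f}")
```

The simulated mean should sit close to m and clearly above exp(μ), which is the gap that matters for the predictions discussed in this post.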

In our case, applying this result to the error term, $\mu = E[\varepsilon_t] = 0$ and $\sigma^2 = \text{var.}[\varepsilon_t]$, so that $E[\exp(\varepsilon_t)] = \exp(\sigma^2/2)$.

In particular, this implies that

$$E[y_t] = \exp[\beta_1 + \beta_2 x_{2t} + \dots + \beta_k x_{kt} + (\sigma^2/2)].$$

So, when we generate our predictions ("fitted values") of $y_t$, based on our log-linear model, *really* we should create them as:

$$y_t^* = \exp\{[\log(y_t)]^* + (s^2/2)\},$$

where

$$[\log(y_t)]^* = b_1 + b_2 x_{2t} + \dots + b_k x_{kt},$$

and $s^2$ is the usual unbiased estimator of $\sigma^2$, based on the OLS estimates of the semi-log model.

To be more specific,

$$s^2 = \sum [\log(y_t) - b_1 - b_2 x_{2t} - \dots - b_k x_{kt}]^2 / (n - k),$$

where the range of summation is for t = 1, 2, ...., n.
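To see the adjustment at work, here is a small simulated illustration in Python (standard library only). It is a sketch under stated assumptions rather than the analysis used in this post: the data-generating values (`beta1`, `beta2`, `sigma`), the single regressor, the sample size and the seed are all made up, and OLS is computed from the textbook closed form for a model with an intercept and one regressor (k = 2).

```python
import math
import random

rng = random.Random(2013)  # fixed seed so that the run is repeatable

# Hypothetical data-generating process: log(y_t) = beta1 + beta2 * x_t + eps_t
beta1, beta2, sigma = 1.0, 0.05, 0.5
n = 5000
x = [rng.uniform(0.0, 20.0) for _ in range(n)]
log_y = [beta1 + beta2 * xt + rng.gauss(0.0, sigma) for xt in x]
y = [math.exp(ly) for ly in log_y]

# OLS with an intercept and one regressor (k = 2), in closed form.
k = 2
x_bar = sum(x) / n
ly_bar = sum(log_y) / n
sxx = sum((xt - x_bar) ** 2 for xt in x)
sxy = sum((xt - x_bar) * (ly - ly_bar) for xt, ly in zip(x, log_y))
b2 = sxy / sxx
b1 = ly_bar - b2 * x_bar

# Fitted values of log(y) and the unbiased estimator s^2 of sigma^2.
fitted_log = [b1 + b2 * xt for xt in x]
s2 = sum((ly - fl) ** 2 for ly, fl in zip(log_y, fitted_log)) / (n - k)

# Naive predictions versus predictions with the s^2 / 2 adjustment.
naive = [math.exp(fl) for fl in fitted_log]
adjusted = [math.exp(fl + s2 / 2.0) for fl in fitted_log]

mean_y = sum(y) / n
mean_naive = sum(naive) / n
mean_adjusted = sum(adjusted) / n

print(f"s^2 = {s2:.4f}")
print(f"mean of y:                    {mean_y:.3f}")
print(f"mean of naive predictions:    {mean_naive:.3f}")
print(f"mean of adjusted predictions: {mean_adjusted:.3f}")
```

With an error standard deviation as large as the 0.5 assumed here, the naive predictions should average noticeably below the data, while the adjusted ones should track the sample mean of y much more closely.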

The naive approach of generating the predictions simply as $y_t^* = \exp\{[\log(y_t)]^*\}$ ignores a term, and this will distort the predictions. In fact, it will distort them in a *downwards* direction, as the term we're ignoring must be positive in sign. Of course, just how important this distortion is, in practice, will depend on the scale of our data, and the "signal-to-noise" ratio for our model.

Let's take a look at a couple of simple empirical examples, first using R, and then using EViews. (Jane has been encouraging me to be more "open" in my choice of software!)

The R example uses the well-known "Airplane Passengers" (AP) time-series, and is based loosely on the analysis of Cowpertwait and Metcalfe (2009, pp. 109-118). The R script is available on this blog's **code page**, and it can be opened with any text editor. The logarithm of AP is regressed against a quadratic time trend and a bunch of Sine and Cosine terms of the form SIN(2πit) and COS(2πit); i = 1, 2, ..., 5:

The time series "APF" is the series of naive within-sample predictions, obtained by simply exponentiating the fitted values for log(AP). The time-series "APFAD" incorporates the adjustment term discussed above. In this particular case, s = 0.04811, so there's not much difference between APF and APFAD:

However, the sum of the squared (within-sample) prediction errors, based on AP and APF is 24936.24, while that based on AP and APFAD is 24840.21. So, there's a bit of improvement when we take the adjustment into account.
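A quick bit of arithmetic shows why the two series are so close here. Taking the regression standard error quoted above as s = 0.04811, the adjustment multiplies each naive prediction by exp(s²/2). The Python sketch below evaluates that factor; the helper name `adjustment_factor` and the larger comparison values of s are hypothetical, included only to show how quickly the factor grows.

```python
import math


def adjustment_factor(s):
    """Multiplicative factor exp(s^2 / 2) applied to a naive prediction."""
    return math.exp(0.5 * s ** 2)


s_ap = 0.04811  # regression standard error quoted in the text
factor_ap = adjustment_factor(s_ap)
print(f"s = {s_ap}: factor = {factor_ap:.6f} "
      f"({100.0 * (factor_ap - 1.0):.3f}% upward adjustment)")

# Hypothetical, larger standard errors for comparison.
for s in (0.1, 0.25, 0.5, 1.0):
    factor = adjustment_factor(s)
    print(f"s = {s}: factor = {factor:.4f} "
          f"({100.0 * (factor - 1.0):.1f}% upward adjustment)")
```

For the AP regression the factor is only about 1.0012, an adjustment of roughly 0.12%, whereas a model with s = 0.5 would need an adjustment of more than 13%.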

Now let's repeat this exercise using EViews. The associated workfile is on the **code page** for this blog, and the data are in a csv file on the **data page**. Here are the regression results using EViews:

(There are some small differences between the R and EViews results - this probably reflects the fact that the data were created in R, written to a file to only a limited number of decimal places, and then read into EViews.)

Then, I've selected the "Forecast" tab, and then chosen to forecast "AP", rather than "log(AP)", over the sample period:

The "modified" forecasts are easily obtained with the command:

`series apfad = apf*exp(0.5*@se^2)`

and here are the results:

There are several interesting questions that I haven't dealt with here.

For example, what does the result, $v = [\exp(\sigma^2) - 1]\exp[2\mu + \sigma^2]$, imply for possible modifications to the calculation of forecast *intervals*?

Second, what if the errors are *non-normal*? Can (how should) we modify our naive forecasts in such cases?

So many questions....... so little time!

**Reference**

**Cowpertwait, P. S. P. and A. V. Metcalfe, 2009.** *Introductory Time Series With R*. Springer, Dordrecht.

© 2013, David E. Giles
