# Another view of ordinary regression

July 8, 2013
By

(This article was first published on PirateGrunt » R, and kindly contributed to R-bloggers)

This is something I’ve been meaning to write for ages. My formal training for most things is limited. Like a lot of folks, I’m an autodidact. This is good in that I’m always learning and always studying those things that I enjoy. At the same time, it means that I take in information in a highly unstructured, probably inefficient way. Linear regression is a fantastic example of this. I feel as though I’ve known about it- from adding trendlines to Excel charts, to pondering significance of multiple parameters- for decades. My formal understanding of it- the Zen of OLS, if you will- has never really been there.

It was some time ago, that I concluded that all of statistical inference amounts to solving an optimization problem. The target function is typically, though not always, likelihood. (This epiphany was likely driven by a seminar I attended which was given by Stuart Klugman. We spent a lot of time using Excel’s Solver add-in. It really helped to de-mystify loss distributions.) If you know- or can guess- what probability distribution is involved, you can optimize for parameters and start to make inferences. Linear regression is a natural extension of this. Simply make a linear transform and solve the optimization. The cool thing is that we’ve introduced a constraint on the function, which allows us to guess the parameters of the linear transform.

Let’s take a step back. First, we’ll generate a sample of normally distributed random variables and set up a likelihood function.

set.seed(1234)
N = 100
E = rnorm(N, mean = 0, sd = 2)

lnLike = function(x, mu, sigma)
{
n = length(x)
lnLike = -n / 2 * log(2*pi)
lnLike = lnLike - n/2 * log(sigma ^2)
lnLike = lnLike - 1/(2*sigma^2)*sum((x - mu)^2)
lnLike
}


With this in hand, we can see what values for mu and sigma will lead to particular values of the likelihood function. (Log-likelihood, that is.)

# for a fixed sd, plot the likelihood function for various mu's
testMu = -50:50 / 100
likelihood = numeric(length(testMu))
for (i in 1:length(likelihood)){
likelihood[i] = lnLike(e, testMu[i], 1)
}

plot(likelihood ~ testMu, pch = 19)
abline(v = 0)
abline(v = testMu[likelihood == max(likelihood)])

testMu[likelihood == max(likelihood)]

# for a fixed mu, plot the likelihood function for various sigma's
testSigma = 50:150 / 100
likelihood = numeric(length(testSigma))

for (i in 1:length(testSigma)){
likelihood[i] = lnLike(e, 0, testSigma[i])
}

plot(likelihood ~ testSigma, pch = 19)
abline(v = 1)
abline(v = testSigma[likelihood == max(likelihood)])

testSigma[likelihood == max(likelihood)]


These plots look like the following:

Note the big asymetry for the sigma values. Also note that it’s fairly flat for a sizeable range around its true value. The one dimensional cases help us understand some of the marginal behavior, but the likelihood function exists in three dimensions. We can get a sense of that using the contour function. First, we’ll have to build up a set of values.

params = expand.grid(mu = testMu, sigma = testSigma)

params$Likelihood = mapply(lnLike, params$mu, params$sigma, MoreArgs = list(x = e)) z = matrix(params$Likelihood, length(testMu), length(testSigma))

filled.contour(x=testMu, y=testSigma, z=z, color.palette = heat.colors, xlab = "mu", ylab = "sigma")


Which produces this cool plot:

We used the max function above to identify the best fits for mu and sigma when holding the other parameter fixed. For the two-dimensional case (and also to measure in between the test values) we need something more sophisticated. Here, we can use the optim function. As is usually the case, Markus Gesman over at Mages‘ blog got there first, with a nice overview of optim. We’ll add a wrapper to our likelihood function so that we can fit for both parameters.

lnLike2 = function(x, par)
{
mu = par[1]
sigma = par[2]

lnLike(x, mu, sigma)
}
optim(par = c(-1,4), fn = lnLike2, control = list(fnscale = -1), x = e)


Now the fun part. Let’s alter the E variable by first adding a constant and assuming that mu is zero. That assumption allows us to use the likelihood function unaltered, knowing that the mu parameter which is returned will correspond to the constant.

B0 = 5
Y = B0 + e

optim(par = c(-1,4), fn = lnLike2, control = list(fnscale = -1), x = Y)

fit = lm(Y ~ 1)
fit$coefficients[[1]]  We’ve also fit using the lm function. Note that the parameters come out the same. Let’s try adding another term, this time with a slope parameter: X = as.double(1:length(E)) B1 = 1.5 Y = B0 + B1 * X + E lnLike3 = function(par, Y, X) { B0 = par[1] B1 = par[2] sigma = par[3] x = Y - B0 - B1 * X mu = 0 lnLike(x, mu, sigma) } optim(par = c(3, 2, 4), fn = lnLike3, control = list(fnscale = -1), Y = Y, X = X) fit = lm(Y ~ 1 + X) fit$coefficients


Once again, we’ve got agreements with the coefficients, although the values aren’t exactly the same. This is either due to the optimization algorithm or something ridiculous I’ve done. Either way, comments are welcome.

As usual, this is probably old hat to most folks. But for an enthusiastic novice like me, this is loads of fun. It means that when using a linear model, one need not assume a normal distribution. It’s possible to test the sensitivity of results against any other distribution. More fundamentally, it takes me closer to that Zen-like state of understanding. It’s been decades, but I’m finally starting to grok linear regression.