Veterinary Epidemiologic Research: Linear Regression Part 3 – Box-Cox and Matrix Representation

March 11, 2013
By

(This article was first published on denis haine » R, and kindly contributed to R-bloggers)

In the previous post, I forgot to show an example of Box-Cox transformation when there’s a lack of normality. The Box-Cox procedure computes values of \lambda which best “normalises” the errors.

\lambda value Transformed value of Y
2 Y^2
1 Y
0.5 \sqrt{Y}
0 \ln{Y}
-0.5 \frac{1}{\sqrt{Y}}
-1 \frac{1}{Y}
-2 \frac{1}{Y^2}

For example:

lm.wpc2 <- lm(wpc ~ vag_disch + milk120 + milk120.sq, data = daisy3)
library(MASS)
boxcox(lm.wpc2, plotit = TRUE)

boxcox

The plot indicates a log transformation.

Matrix Representation
We can use a matrix representation of the regression equation. The regression equation y_i = \beta_0 + \beta_{1}x_{1i} + \beta_{2}x_{2i} + \epsilon_i, i = 1, \dots, n can be written as y = X\beta + \epsilon where y = (y_1 \dots y_n)^T , \epsilon = (\epsilon_1 \dots \epsilon_n)^T , \beta = (\beta_0, \dots, \beta_2)^T and
X = \begin{pmatrix}  1 & x_{11} & x_{12} \\  1 & x_{21} & x_{22} \\  \vdots & \vdots & \vdots \\  1 & x_{n1} & x_{n2}  \end{pmatrix}  .

With the following regression:

lm.wpc3 <- lm(wpc ~ milk120 + hs100_ct, data = daisy3)
(lm.wpc3.sum <- summary(lm.wpc3))

Call:
lm(formula = wpc ~ milk120 + hs100_ct, data = daisy3)

Residuals:
Interval from wait period to conception 
   Min     1Q Median     3Q    Max 
-71.83 -36.57 -15.90  23.86 211.18 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 75.884063   6.115894  12.408   <2e-16 ***
milk120     -0.001948   0.001858  -1.049    0.294    
hs100_ct    18.530340   2.106920   8.795   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 50.53 on 1522 degrees of freedom
Multiple R-squared: 0.0495,	Adjusted R-squared: 0.04825 
F-statistic: 39.63 on 2 and 1522 DF,  p-value: < 2.2e-16 

anova(lm.wpc3)
Analysis of Variance Table

Response: wpc
            Df  Sum Sq Mean Sq F value Pr(>F)    
milk120      1    4865    4865  1.9053 0.1677    
hs100_ct     1  197512  197512 77.3518 <2e-16 ***
Residuals 1522 3886309    2553                   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

We start by making the X matrix and response y :

X <- cbind(1 , daisy3[, c(8, 24)])
# or alternatively: model.matrix(lm.wpc3)
y <- daisy3$wpc

Then we get the X^{T}X :

XX <- t(X) %*% X
# or alternatively: crossprod(X, X)
XX
                     1       milk120      hs100_ct
1           1525.00000     4907834.4     -29.87473
milk120  4907834.39954 16535517546.0 -120696.45468
hs100_ct     -29.87473     -120696.5     576.60900

We calculate the inverse of X^{T}X :

XX.inv <- solve(t(X) %*% X)
# or alternatively: qr.solve(XX)
# or also: lm.wpc3.sum$cov.unscaled
XX.inv
                     1       milk120      hs100_ct
1         1.464864e-02 -4.348902e-06 -1.513557e-04
milk120  -4.348902e-06  1.351675e-09  5.761289e-08
hs100_ct -1.513557e-04  5.761289e-08  1.738495e-03

Now to get the coefficients estimates:

XX.inv %*% t(X) %*% y
# or alternatively: solve(t(X) %*% X, t(X) %*% y)
# or also: XX.inv %*% crossprod(X, y)
# or: qr.solve(X, y)
                 [,1]
1        75.884062540
milk120  -0.001948437
hs100_ct 18.530340385

The fitted values:

crossprod(t(X), qr.solve(X, y)

SSE and MSE;

SSE <- crossprod(y, y) - crossprod(y, yhat)
SSE
        [,1]
[1,] 3886309
 
MSE <- SSE / (length(y) - 3)
MSE
         [,1]
[1,] 2553.423

The residual standard error:

sqrt(sum(lm.wpc3$res^2) / (1525 - 3))

The standard errors of the coefficients and the R^2 :

sqrt(diag(XX.inv)) * 50.5314
          1     milk120    hs100_ct 
6.115893490 0.001857794 2.106920139

1 - sum(lm.wpc3$res^2) / sum((y - mean(y))^2)
[1] 0.04949679

To leave a comment for the author, please follow the link and comment on his blog: denis haine » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.