(This article was first published on **Econometric Sense**, and kindly contributed to R-bloggers)

**Regression Basics**

$y = b_0 + b_1 X$ (the regression line we want to fit)

The method of least squares minimizes the sum of squared vertical distances between the fitted line and the individual data observations $y_i$.

That is, minimize

$$\sum e_i^2 = \sum (y_i - b_0 - b_1 X_i)^2$$

with respect to $b_0$ and $b_1$. This can be accomplished by taking the partial derivatives of $\sum e_i^2$ with respect to each coefficient and setting each one equal to zero.

$$\frac{\partial \sum e_i^2}{\partial b_0} = 2 \sum (y_i - b_0 - b_1 X_i)(-1) = 0$$

$$\frac{\partial \sum e_i^2}{\partial b_1} = 2 \sum (y_i - b_0 - b_1 X_i)(-X_i) = 0$$

Solving for $b_0$ and $b_1$ yields the 'formulas' for hand calculating the estimates:

$$b_0 = \bar{y} - b_1 \bar{X}$$

$$b_1 = \frac{\sum (X_i - \bar{X})(y_i - \bar{y})}{\sum (X_i - \bar{X})^2} = \frac{\sum X_i y_i - n \bar{X} \bar{y}}{\sum X_i^2 - n \bar{X}^2} = \frac{S(X,y)}{SS(X)}$$
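The step from the two first-order conditions to these formulas can be filled in as follows. This is a sketch of the standard algebra using the same symbols as above, where $n$ is the number of observations, so that $\sum X_i = n\bar{X}$ and $\sum y_i = n\bar{y}$:

$$
\begin{aligned}
\sum (y_i - b_0 - b_1 X_i) = 0 \;&\Rightarrow\; n b_0 = \sum y_i - b_1 \sum X_i \;\Rightarrow\; b_0 = \bar{y} - b_1 \bar{X} \\
\sum X_i (y_i - b_0 - b_1 X_i) = 0 \;&\Rightarrow\; \sum X_i y_i - (\bar{y} - b_1 \bar{X})\, n \bar{X} - b_1 \sum X_i^2 = 0 \\
&\Rightarrow\; b_1 \left( \sum X_i^2 - n \bar{X}^2 \right) = \sum X_i y_i - n \bar{X} \bar{y} \\
&\Rightarrow\; b_1 = \frac{\sum X_i y_i - n \bar{X} \bar{y}}{\sum X_i^2 - n \bar{X}^2}
\end{aligned}
$$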

**Example with Real Data:**

Given real data, we can use the formulas above to derive (by hand, calculator, or Excel) the estimated values for $b_0$ and $b_1$, which give us the line of best fit, minimizing $\sum e_i^2 = \sum (y_i - b_0 - b_1 X_i)^2$. For the five observations used in the code below:

$n = 5$

$\sum X_i y_i = 146$

$\sum X_i^2 = 55$

$\bar{X} = 3$

$\bar{y} = 8$

$$b_1 = \frac{\sum X_i y_i - n \bar{X} \bar{y}}{\sum X_i^2 - n \bar{X}^2} = \frac{146 - 5 \cdot 3 \cdot 8}{55 - 5 \cdot 3^2} = \frac{26}{10} = \mathbf{2.6}$$

$$b_0 = \bar{y} - b_1 \bar{X} = 8 - 2.6 \cdot 3 = \mathbf{0.20}$$

**You can verify these results in PROC REG in SAS.**

/* GENERATE DATA */

DATA REGDAT;

INPUT X Y;

CARDS;

1 3

2 7

3 5

4 11

5 14

;

RUN;

/* BASIC REGRESSION WITH PROC REG */

PROC REG DATA = REGDAT;

MODEL Y = X;

RUN;

**Similarly this can be done in R using the ‘lm’ function:**

#------------------------------------------------------------

# regression with canned lm routine

#------------------------------------------------------------

# read in data manually

x <- c(1,2,3,4,5) # read in x -values

y <- c(3,7,5,11,14) # read in y-values

data1 <- data.frame(x,y) # create data set combining x and y values

# analysis

plot(data1$x, data1$y) # plot data

reg1 <- lm(y ~ x, data = data1) # compute regression estimates

summary(reg1) # print regression output

abline(reg1) # plot fitted regression line
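As a cross-check, the hand formulas from the worked example can be evaluated directly in R and compared with `lm`. This is only a minimal sketch: the names `n`, `xbar`, `ybar`, `b1_hand`, `b0_hand`, and `fit` are illustrative and not part of the original post.

```r
# same data as in the example above
x <- c(1, 2, 3, 4, 5)
y <- c(3, 7, 5, 11, 14)

n    <- length(x)   # number of observations
xbar <- mean(x)     # Xbar
ybar <- mean(y)     # ybar

# slope: [sum(x*y) - n*xbar*ybar] / [sum(x^2) - n*xbar^2]
b1_hand <- (sum(x * y) - n * xbar * ybar) / (sum(x^2) - n * xbar^2)

# intercept: ybar - b1*xbar
b0_hand <- ybar - b1_hand * xbar

# compare with the estimates from lm()
fit <- lm(y ~ x)
print(c(intercept = b0_hand, slope = b1_hand))
print(coef(fit))
```

Both printed vectors should agree with the hand-calculated intercept and slope, up to floating-point rounding.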

**Regression Matrices**

Alternatively, this problem can be represented in matrix format.

We can then formulate the regression model as:

$$\mathbf{y} = \mathbf{Xb} + \mathbf{e}$$

where the 'errors' or deviations from the fitted line are given by the vector:

$$\mathbf{e} = \mathbf{y} - \mathbf{Xb}$$

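The matrix equivalent of $\sum e_i^2$ becomes $(\mathbf{y} - \mathbf{Xb})'(\mathbf{y} - \mathbf{Xb}) = \mathbf{e'e}$:

$$\mathbf{e'e} = (\mathbf{y} - \mathbf{Xb})'(\mathbf{y} - \mathbf{Xb}) = \mathbf{y'y} - 2\,\mathbf{b'X'y} + \mathbf{b'X'Xb}$$

Taking partials, setting them equal to zero, and solving for $\mathbf{b}$ gives:

$$\frac{\partial\, \mathbf{e'e}}{\partial\, \mathbf{b}} = -2\,\mathbf{X'y} + 2\,\mathbf{X'Xb} = 0$$

$$2\,\mathbf{X'Xb} = 2\,\mathbf{X'y}$$

$$\mathbf{X'Xb} = \mathbf{X'y}$$

$$\mathbf{b} = (\mathbf{X'X})^{-1}\mathbf{X'y}$$

which is the matrix equivalent to what we had before:

$$\frac{\sum X_i y_i - n \bar{X} \bar{y}}{\sum X_i^2 - n \bar{X}^2} = \frac{S(X,y)}{SS(X)}$$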
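For the five observations used in this post, the matrix formula can also be worked through by hand. This is a sketch that assumes $\mathbf{X}$ is built from a column of ones followed by the column of X values (1 through 5), and that $\mathbf{y}$ holds the values 3, 7, 5, 11, 14:

$$
\mathbf{X'X} = \begin{bmatrix} 5 & 15 \\ 15 & 55 \end{bmatrix}, \qquad
\mathbf{X'y} = \begin{bmatrix} 40 \\ 146 \end{bmatrix}
$$

$$
\mathbf{b} = (\mathbf{X'X})^{-1}\mathbf{X'y}
= \frac{1}{50} \begin{bmatrix} 55 & -15 \\ -15 & 5 \end{bmatrix} \begin{bmatrix} 40 \\ 146 \end{bmatrix}
= \frac{1}{50} \begin{bmatrix} 10 \\ 130 \end{bmatrix}
= \begin{bmatrix} 0.2 \\ 2.6 \end{bmatrix}
$$

The first element is the intercept and the second is the slope, the same values obtained from the scalar formulas.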

**These computations can be carried out in SAS via PROC IML commands:**

/* MATRIX REGRESSION */

PROC IML;

/* INPUT DATA AS VECTORS */

yt = {3 7 5 11 14}; /* TRANSPOSED Y VECTOR */

x0t = j(1, 5, 1); /* ROW VECTOR OF 1'S */

x1t = {1 2 3 4 5}; /* X VALUES */

xt = x0t//x1t; /* COMBINE VECTORS INTO TRANSPOSED X-MATRIX */

PRINT yt x0t x1t;

/* FORMULATE REGRESSION MATRICES */

y= yt`; /* VECTOR OF DEPENDENT VARIABLES */

x =xt`; /* FULL X OR DESIGN MATRIX */

beta = inv(x`*x)*x`*y; /* THE CLASSICAL REGRESSION MATRIX */

TITLE 'REGRESSION MATRICES VIA PROC IML'; /* SET THE TITLE BEFORE PRINTING */

PRINT beta;

QUIT;

RUN;

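OUTPUT: `beta` prints as a 2 x 1 vector containing the intercept (0.2) and the slope (2.6), matching the hand calculation above.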

**The same results can be obtained in R as follows:**

#------------------------------------------------------------

# matrix programming based approach

#------------------------------------------------------------

# regression matrices require a column of 1's in order to calculate

# the intercept or constant, create this column of 1's as x0

x0 <- c(1,1,1,1,1) # column of 1's

x1 <- c(1,2,3,4,5) # original x-values

# create the x- matrix of explanatory variables

x <- as.matrix(cbind(x0,x1))

# create the y-matrix of dependent variables

y <- as.matrix(c(3,7,5,11,14))

# estimate b = (X'X)^-1 X'y

b <- solve(t(x)%*%x)%*%t(x)%*%y

print(b) # this gives the intercept and slope - matching exactly

# the results above
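Two optional checks on the matrix result are sketched below, assuming the same data. `crossprod` and the two-argument form of `solve` are base R functions; the names `X`, `b_alt`, and `e` are illustrative and not from the original post. Solving the normal equations X'Xb = X'y directly avoids forming the inverse explicitly, and the residuals should satisfy X'e = 0, which is the matrix form of the two first-order conditions derived earlier.

```r
# design matrix (column of 1's plus x-values) and response
X <- cbind(x0 = rep(1, 5), x1 = c(1, 2, 3, 4, 5))
y <- c(3, 7, 5, 11, 14)

# solve the normal equations X'X b = X'y without an explicit inverse
b_alt <- solve(crossprod(X), crossprod(X, y))
print(b_alt)

# residuals e = y - Xb
e <- y - X %*% b_alt

# first-order conditions: X'e should be zero up to rounding error
print(crossprod(X, e))

# compare with the canned lm routine
print(coef(lm(y ~ X[, "x1"])))
```

For a small, well-conditioned problem like this one, the explicit inverse and the direct solve should give the same estimates; the direct solve is generally the more numerically stable habit.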
