Polynomial regression techniques

[This article was first published on Statistic on aiR, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Suppose we want to create a polynomial that can approximate better the following dataset on the population of a certain Italian city over 10 years. The table summarizes the data:

$$\begin{tabular}{|1|1|}\hline Year & Population\\ \hline 1959&4835\\ 1960&4970\\ 1961&5085\\ 1962&5160\\ 1963&5310\\ 1964&5260\\ 1965&5235\\ 1966&5255\\ 1967&5235\\ 1968&5210\\ 1969&5175\\ \hline \end{tabular}$$

First we import the data into R:

Year <- c(1959, 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969)
Population <- c(4835, 4970, 5085, 5160, 5310, 5260, 5235, 5255, 5235, 5210, 5175)


Now we create the dataframe named sample1:

sample1 <- data.frame(Year, Population)
sample1

   Year Population
1  1959       4835
2  1960       4970
3  1961       5085
4  1962       5160
5  1963       5310
6  1964       5260
7  1965       5235
8  1966       5255
9  1967       5235
10 1968       5210
11 1969       5175


At this point may be useful to chart these values, to observe the trend and take an idea of the final polynomial function. For convenience we modify the column Year, creating a neighborhood of zero, thus:

sample1$Year <- sample1$Year - 1964
sample1

   Year Population
1    -5       4835
2    -4       4970
3    -3       5085
4    -2       5160
5    -1       5310
6     0       5260
7     1       5235
8     2       5255
9     3       5235
10    4       5210
11    5       5175


Put the values on a chart

plot(sample1$Year, sample1$Population, type="b")




At this point we can start with the search for a polynomial model that adequately approximates our data. First, we specify that we want a polynomial function of X, ie a raw polynomial , is different from the orthogonal polynomial. This is an important addition because the controls and the results will change in the two cases R. So we want a function of X like:

$$f(x)=\beta_0+\beta_1x+\beta_2x^2+\beta_3x^3+ ... +\beta_nx^n$$

At what degree of the polynomial stop? Depends on the degree of precision that we seek. The greater the degree of the polynomial, the greater the accuracy of the model, but the greater the difficulty in calculating; we must also verify the significance of coefficients that are found. But let's get straight to the point.

In R for fitting a polynomial regression model (not orthogonal), there are two methods, among them identical. Suppose we seek the values of beta coefficients for a polynomial of degree 1, then 2nd degree, and 3rd degree:

fit1 <- lm(sample1$Population ~ sample1$Year)
fit2 <- lm(sample1$Population ~ sample1$Year + I(sample1$Year^2))
fit3 <- lm(sample1$Population ~ sample1$Year + I(sample1$Year^2) + I(sample1$Year^3))


Or we can write more quickly, for polynomials of degree 2 and 3:

fit2b <- lm(sample1$Population ~ poly(sample1$Year, 2, raw=TRUE))
fit3b <- lm(sample1$Population ~ poly(sample1$Year, 3, raw=TRUE))


The function poly is useful if you want to get a polynomial of high degree, because it avoids explicitly write the formula. If we specify raw=TRUE, the two methods provide the same output, but if we do not specify raw=TRUE (or rgb(153, 0, 0);">raw=F), the function poly give us the values of the beta parameters of an orthogonal polynomials, which is different from the general formula I wrote above, although the models are both effective.

Let's look at the output.

summary(fit2)
## or summary(fit2b)

Call:
lm(formula = sample1$Population ~ sample1$Year + I(sample1$Year^2))

Residuals:
    Min      1Q  Median      3Q     Max 
-46.888 -18.834  -3.159   2.040  86.748 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)       5263.159     17.655 298.110  < 2e-16 ***
sample1$Year        29.318      3.696   7.933 4.64e-05 ***
I(sample1$Year^2)  -10.589      1.323  -8.002 4.36e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 38.76 on 8 degrees of freedom
Multiple R-squared: 0.9407,     Adjusted R-squared: 0.9259 
F-statistic: 63.48 on 2 and 8 DF,  p-value: 1.235e-05 


The output of summary(fit2b) is the same. We obtained the values of beta0 (5263,159), beta1 (29,318) and beta2 (-10,589), which appear to be significant AII 3. The equation of polynomial of degree 2 of our model is:

$$f(x)=5263.1597+29.318x-10.589x^2$$

If we want a polynomial of 3rd degree, we have:

summary(fit3)
## of summary(fit3b)

Call:
lm(formula = sample1$Population ~ sample1$Year + I(sample1$Year^2) + 
    I(sample1$Year^3))

Residuals:
    Min      1Q  Median      3Q     Max 
-32.774 -14.802  -1.253   3.199  72.634 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)       5263.1585    15.0667 349.324 4.16e-16 ***
sample1$Year        14.3638     8.1282   1.767   0.1205    
I(sample1$Year^2)  -10.5886     1.1293  -9.376 3.27e-05 ***
I(sample1$Year^3)    0.8401     0.4209   1.996   0.0861 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 33.08 on 7 degrees of freedom
Multiple R-squared: 0.9622,     Adjusted R-squared: 0.946 
F-statistic: 59.44 on 3 and 7 DF,  p-value: 2.403e-05 


The equation is:

$$f(x)=5263.1585+14.3638x-10.5886x^2+0.8401x^3$$

In the latter case, however, the coefficients beta1 and beta3 are not significant, then the best model is the polynomial of 2nd degree. Furthermore look at the Multiple R-squared: in the 2nd degree model it is 94.07%, while in the 3rd degree model it is 96.22%. It seems that there has been an increase in accuracy of the model, but it is a significant increase? We can compare the two model with an ANOVA table:

anova(fit2, fit3)

Analysis of Variance Table

Model 1: sample1$Population ~ sample1$Year + I(sample1$Year^2)
Model 2: sample1$Population ~ sample1$Year + I(sample1$Year^2) + I(sample1$Year^3)
  Res.Df     RSS Df Sum of Sq      F Pr(>F)  
1      8 12019.8                             
2      7  7659.5  1    4360.3 3.9848 0.0861 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 


Since the p-value is greater than 0.05, we accept the null hypothesis: there wasn't a significant improvement of the model.

The biggest problem now is to represent graphically the result. In fact, R does not exist (as far as I know) a function for plotting polynomials found. We must therefore proceed with graphic artifacts still valid, but somewhat laborious.

First, we plotted the values, with the command seen before. This time only display the lines and not points, for convenience graphics:

plot(sample1$Year, sample1$Population, type="l", lwd=3)


Now add to this chart the progress of the 2nd degree polynomial, in this way:

points(sample1$Year, predict(fit2), type="l", col="red", lwd=2)


The function predict() compute the Y values given the X values. The the coordinates are linked with lines. Is not plotted the continuous, but the discrete. With a few values, this method is highly debilitating.



Let's add the graph of the polynomial of 3rd degree:

points(sample1&Year, predict(fit3), type="l", col="blue", lwd=2)




As you can see the two models have very similar trends.

If we would instead obtain the graph of continuous functions obtained, we proceed in this manner. First you create the polynomial equation we previously found:

pol2 <- function(x) fit2$coefficient[3]*x^2 + fit2$coefficient[2]*x + fit2$coefficient[1]


Remember that:
- coefficient[1] = beta0
- coefficient[2] = beta1
- coefficient[3] = beta2
and so on.

At this point we plotted the coordinates of sample1 and then the created curve with curve(x):

plot(sample1$Year, sample1$Population, type="p", lwd=3)
pol2 <- function(x) fit2$coefficient[3]*x^2 + fit2$coefficient[2]*x + fit2$coefficient[1]
curve(pol2, col="red", lwd=2)


The point, however, disappear, but we can replace them with the command points:

points(sample1$Year, sample1$Population, type="p", lwd=3)


A note: you must follow the order of commands as I have described, otherwise the function curve creates a wrong graph. So summarizing the commands to get the continuous function, and the experimental points on the same graph are the following:

plot(sample1$Year, sample1$Population, type="p", lwd=3)
pol2 <- function(x) fit2$coefficient[3]*x^2 + fit2$coefficient[2]*x + fit2$coefficient[1]
curve(pol2, col="red", lwd=2)
points(sample1$Year, sample1$Population, type="p", lwd=3)


The graph we get is the following:



Now draw the graph of the polynomial of 3rd degree:

plot(sample1$Year, sample1$Population, type="p", lwd=3)
pol3 <- function(x) fit3$coefficient[4]*x^3 + fit3$coefficient[3]*x^2 + fit3$coefficient[2]*x + fit3$coefficient[1]
curve(pol3, col="red", lwd=2)
points(sample1$Year, sample1$Population, type="p", lwd=3)



To leave a comment for the author, please follow the link and comment on their blog: Statistic on aiR.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)