R examine objects tutorial

[This article was first published on Win-Vector Blog » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This article is quick concrete example of how to use the techniques from Survive R to lower the steepness of The R Project for Statistical Computing‘s learning curve (so an apology to all readers who are not interested in R). What follows is for people who already use R and want to achieve more control of the software.
I am a fan of the R. The R software does a number of incredible things and is the result of a number of good design choices. However, you can’t fully benefit from R if you are not already familiar the internal workings of R. You can quickly become familiar with the internal workings of R if you learn how to inspect the objects of R (as an addition to using the built in help system). Here I give a concrete example of how to use the R system itself to find answers, with or without the help system. R documentation has the difficult dual responsibility of attempting to explain both how to use the R software and explain the nature of the underlying statistics; so the documentation is not always the quickest thing to browse.

First let’s give R the commands to build a fake data set that has a variable y that turns out to be 3 times x (another variable) plus some noise:

> n <- 100
> x <- rnorm(n)
> y <- 3*x + 0.2*rnorm(n)
> d <- data.frame(x,y)

This data set (by design) has a nearly a linear relation between x and y. We can plot
the data as follows:

> library(ggplot2)
> ggplot(data=d) + geom_point(aes(x=x,y=y))


With data like this the most obvious statistical analysis is a linear regression. R can very quickly perform the linear regression and report the results.

> model <- lm(y~x,data=d)
> summary(model)

lm(formula = y ~ x, data = d)

     Min       1Q   Median       3Q      Max
-0.41071 -0.12762 -0.00651  0.10240  0.62772 

            Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.02609    0.02102  -1.241    0.217
x            2.99150    0.02202 135.858   <2e-16 ***
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.2102 on 98 degrees of freedom
Multiple R-squared: 0.9947,	Adjusted R-squared: 0.9947
F-statistic: 1.846e+04 on 1 and 98 DF,  p-value: < 2.2e-16

We can read the report and see that the estimated fit formula is: y = 2.99150*x – 0.02609 (which is very close to the true formula y = 3*x) . At this point the analysis is done (if the goal of the analysis is to just print the results). However, if we want to use the results in a calculation we need to get at the numbers shown in above printout. This printout contains a lot of information (such as the estimate fit coefficients, the standard errors, the t-values and the significances) that a statistician would want to see and want to use in further calculations. But it is unclear how to get at these numbers. For example: how do you get the “standard errors” (the numbers in the “Std. Error” column) from the returned model? Are we forced to cut and paste them from the printed report? What can you do?

The documentation nearly tells us what we need to know. help(lm) yields:

The functions summary and anova are used to obtain and print a summary and analysis of variance table of the results. The generic accessor functions coefficients, effects, fitted.values and residuals extract various useful features of the value returned by lm.

To a newer R user this may not be clear (as there are technical issues from both R and statistics quickly being run through). However, the experienced R user would immediately recognize from this help that what is returned form summary(model) is an object (not just a blob of text) and that looking at the class of the returned object (which turns out to be summary.lm) might tell them what they would need to know.


[1] "summary.lm"
> help(summary.lm)


coefficients: a p x 4 matrix with columns for the estimated coefficient, its standard error, t-statistic and corresponding (two-sided) p-value. Aliased coefficients are omitted.

But if you are not very familiar with R you might miss that the summary function returns a useful object (instead of blob of text). Also you might only know to look at help(summary) which does not describe the location of the desired standard errors (but does have a reference to summary.lm, so if you are patient you might find it). We describe how to find the information you need by using R’s object inspection facilities. This is a “doing it the hard way” technique for when you do not understand the help system or you are using a package with less complete help documentation.

First (using the techniques described in the slides: Survive R) examine the model to see if the standard errors are there:

> names(model)
 [1] "coefficients"  "residuals"     "effects"       "rank"          "fitted.values" "assign"
 [7] "qr"            "df.residual"   "xlevels"       "call"          "terms"         "model"    

> model$coefficients
(Intercept)           x
-0.02609243  2.99150259

We found the coefficients, but did not find the standard errors. Now we know the standard errors are reported by summary(model), so they must be somewhere. Instead of performing a wild goose chase to find the standard errors let’s instead trace how the summary method works to find where it gets them. If we type print(summary) we don’t get any really useful information. This is because summary is a generic method and we need to know what type-qualified name the summary of a linear model is called.

> class(model)
[1] "lm"

So we see our model is of type lm so the summary(model) call would use a summary method called summary.lm (which as we saw is also the returned class of the summary(model) object). As we mentioned the solution is in help(summary.lm), but if the solution had not been there we could still make progress: we could dump the source of the summary.lm method:

> print(summary.lm)
function (object, correlation = FALSE, symbolic.cor = FALSE,
    class(ans) <- "summary.lm"

We actually deleted the bulk of the print(summary.lm) result because the important thing to notice is that the method is huge and that it returns an object instead of a blob of text. The fact that the method summary.lm was huge means that it is likely calculating the things it reports (confirming that the standard errors are not part of the model object). The fact that an object is returned means that what we are looking for may sitting somewhere in the summary waiting for us. To find what we are looking for we convert the summary into a list (using the unclass() method) and look for something with the name or value we are looking for:

> unclass(summary(model))
lm(formula = y ~ x, data = d)
               Estimate Std. Error    t value      Pr(>|t|)
(Intercept) -0.02609243 0.02102062  -1.241278  2.174662e-01
x            2.99150259 0.02201930 135.858209 2.095643e-113

And we have found it. The named slot summary(model)$coefficients is in fact a table that has what we are looking for in the second column. We can create a new list that will let us look up the standard errors by name (for the variable x and for the intercept):

> stdErrors <- as.list(summary(model)$coefficients[,2])

Now that we have the stdErrors in list form we can look up the numbers we wanted by name.

> stdErrors['x']
[1] 0.0220193

> stdErrors['(Intercept)']
[1] 0.02102062

And we finally have the standard errors. But why did we want the standard errors? In this case I wanted the standard errors so I could plot the fit model and show the uncertainty of the model. As, is often the case, R already has a function that does all of this. Also (as is often the case) the R function that does this asks the right statistical question (instead of the obvious question) and can draw error bars that display the uncertainty of future predictions. The uncertainty in future prediction is in fact different than the uncertainty of the estimate (what was most obvious to calculate from the standard errors) and (after some reflection) is what I really wanted. Having these sort of distinctions already thought out is why we are using a statistics package like R instead of just coding everything up. These calculations are all trivial to implement- but remembering to perform the calculations that answer the right statistical questions can be difficult. The built in R solution of plotting the the fit model (black line) and the region of expected prediction uncertainty (blue lines) is as follows:

> pred <- predict.lm(model,interval='prediction')
> dfit <- data.frame(x,y,fit=pred[,1],lwr=pred[,2],upr=pred[,3])
> ggplot(data=dfit) + geom_point(aes(x=x,y=y)) +
    geom_line(aes(x,fit)) +
    geom_line(aes(x=x,y=lwr),color='blue') + geom_line(aes(x=x,y=upr),color='blue')


And we are done.

Related posts:

  1. Survive R
  2. The cranky guide to trying R packages
  3. R annoyances

To leave a comment for the author, please follow the link and comment on their blog: Win-Vector Blog » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)