Describing Data: Frequently Used Commands

May 13, 2011

(This article was first published on Coffee and Econometrics in the Morning, and kindly contributed to R-bloggers)

Obtaining a coherent numerical summary of data is a routine task, and it is often useful to port these summary statistics into a table of results. When I am working interactively with my data, I apply the summary() command to my data frame. For example, the following code loads and summarizes a data frame on yogurt advertising and prices:

library(Ecdat) ## Econometrics Data (useful!)
data(Yogurt) ## Loads Yogurt from Ecdat
summary(Yogurt) ## Summarizes Yogurt

For each quantitative variable, the summary() command provides a five-number summary (min, Q1, median, Q3, max) plus the mean. For categorical variables, it reports the count of each level. This is an excellent quick summary of each variable, but you may prefer a richer set of information (especially when it comes to typing up tables).
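To make the two cases concrete, here is a minimal sketch on a small made-up data frame. The names toy, price and brand are hypothetical and are not part of the Yogurt data; they are used only so the example runs without any packages.

```r
## A tiny made-up data frame: one quantitative variable, one categorical
toy <- data.frame(
  price = c(2, 4, 6, 8, 10),
  brand = factor(c("a", "b", "a", "b", "a"))
)

summary(toy)          ## Prints the summary of both columns

## summary() on a single numeric column returns a named vector,
## so individual statistics can be pulled out by name
s <- summary(toy$price)
s[["Median"]]         ## Median of price
s[["Mean"]]           ## Mean of price

## summary() on a factor returns the count of each level
brand_counts <- summary(toy$brand)
brand_counts
```

Pulling values out by name like this is handy when only one or two statistics need to go into a table.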

I recently discovered a great way to obtain a richer set of information on a data frame. This method uses the psych library, which contains the functions describe() and describeBy() (called describe.by() in older versions of psych). Continuing with the code from above, here is the basic syntax:

library(psych) ## Provides describe()
describe(Yogurt) ## Describes in more detail the Yogurt data frame
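The object returned by describe() can be treated as a data frame with one row per variable, so you can keep only the statistics you care about. A short sketch, assuming the psych and Ecdat packages are installed (the object names desc and keep are mine, chosen for illustration):

```r
library(psych)   ## describe()
library(Ecdat)   ## Yogurt data
data(Yogurt)

desc <- describe(Yogurt)   ## One row per variable in Yogurt
names(desc)                ## Lists the statistics that were computed

## Keep a handful of columns for a compact table
keep <- c("n", "mean", "sd", "min", "max")
compact <- round(as.data.frame(desc)[, keep], 2)
compact
```

Converting with as.data.frame() first is a cautious choice: it drops the psych-specific classes so that subsetting and printing follow ordinary data frame behavior.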

Suppose you also want to break your summary statistics into two (or four) tables for comparison's sake (perhaps to illustrate stark differences across select subsets of your data). The describeBy() command (describe.by() in older versions of psych) is a convenient way to break the data down by the levels of a factor. Here’s an example on the Yogurt data:

describeBy(Yogurt, Yogurt$choice) ## Describes Yogurt separately for each level of choice
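If you want to work with the grouped results rather than just print them, something like the following may help. It is a sketch that assumes the psych and Ecdat packages are installed and uses describeBy() from current versions of psych; the object names by_choice and stacked are hypothetical.

```r
library(psych)   ## describeBy()
library(Ecdat)   ## Yogurt data
data(Yogurt)

## Default: one describe() table per level of the grouping factor
by_choice <- describeBy(Yogurt, group = Yogurt$choice)
length(by_choice)    ## One entry per level of Yogurt$choice
by_choice[[1]]       ## Summary table for the first level

## mat = TRUE stacks all groups into a single data frame instead,
## which is easier to filter or export
stacked <- describeBy(Yogurt, group = Yogurt$choice, mat = TRUE)
head(stacked)
```

The stacked form is usually the more convenient starting point when the goal is a side-by-side comparison table.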

Finally, you may want to port your data into LaTeX format and/or select particular summary statistics from the list. I wrote a function that serves as a convenience interface to describe() and toLatex(). As toLatex() does not work directly on objects created using describe(), you might find this helpful.
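For readers who want to see the general idea, here is a minimal sketch of such a helper. It is not the function referred to above: stats_to_latex() is a hypothetical name, it uses only base R, and it does not escape special LaTeX characters in variable names. The demonstration uses the built-in mtcars data so that it runs without any packages.

```r
## Hypothetical helper: turn a table of summary statistics (one row per
## variable, one column per statistic) into lines of LaTeX tabular code
stats_to_latex <- function(tab, stats = names(tab), digits = 2) {
  tab <- as.data.frame(tab)[, stats, drop = FALSE]  ## Keep chosen statistics
  rows <- vapply(seq_len(nrow(tab)), function(i) {
    vals <- formatC(unlist(tab[i, ]), format = "f", digits = digits)
    paste0(paste(c(rownames(tab)[i], vals), collapse = " & "), " \\\\")
  }, character(1))
  c(
    paste0("\\begin{tabular}{l", strrep("r", length(stats)), "}"),
    "\\hline",
    paste0(paste(c("Variable", stats), collapse = " & "), " \\\\"),
    "\\hline",
    rows,
    "\\hline",
    "\\end{tabular}"
  )
}

## Demonstration on three columns of the built-in mtcars data
vars <- mtcars[, c("mpg", "hp", "wt")]
tab <- data.frame(
  n    = sapply(vars, length),
  mean = sapply(vars, mean),
  sd   = sapply(vars, sd)
)
latex_lines <- stats_to_latex(tab, stats = c("mean", "sd"))
writeLines(latex_lines)
```

With psych loaded, a call such as stats_to_latex(describe(Yogurt), stats = c("n", "mean", "sd")) should work the same way, since the helper converts its input to a plain data frame first.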

If you would rather not see the kurtosis of your data, you could read up on the options of describe() to learn how to turn it off. If you’re going to port it into a LaTeX table anyway, you could also just modify the code I wrote here to eliminate the summary statistics you don’t want and produce LaTeX output.
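Two ways of dropping kurtosis can be sketched as follows, assuming the psych and Ecdat packages are installed. The first relies on the skew argument of describe(), which controls whether skew and kurtosis are calculated; the second simply removes the columns afterwards. The object names no_skew, full and slim are hypothetical.

```r
library(psych)   ## describe()
library(Ecdat)   ## Yogurt data
data(Yogurt)

## Option 1: ask describe() not to compute skew and kurtosis at all
no_skew <- describe(Yogurt, skew = FALSE)
names(no_skew)

## Option 2: compute everything, then drop the unwanted columns
full <- as.data.frame(describe(Yogurt))
slim <- full[, setdiff(names(full), c("skew", "kurtosis"))]
names(slim)
```

Option 2 gives finer control, since any subset of statistics can be removed, not only skew and kurtosis.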

FYI: Quick R has a nice summary of some other methods for summarizing data. Of the methods at Quick R that I didn’t describe, pastecs looks most like a method I would use.
