R Tutorial Series: Summary and Descriptive Statistics

November 1, 2009
By

(This article was first published on R Tutorial Series, and kindly contributed to R-bloggers)

Summary (or descriptive) statistics are the first figures used to represent nearly every dataset. They also form the foundation for much more complicated computations and analyses. Thus, in spite of being composed of simple methods, they are essential to the analysis process. This tutorial will explore the ways in which R can be used to calculate summary statistics, including the mean, standard deviation, range, and percentiles. Also introduced is the summary function, which is one of the most useful tools in the R set of commands.

Tutorial Files

Before we start, you may want to download the sample data (.csv) used in this tutorial. Be sure to right-click and save the file to your R working directory. This dataset contains hypothetical age and income data for 20 subjects. Note that all code samples in this tutorial assume that this data has already been read into an R variable and has been attached.

Mean

In R, a mean can be calculated on an isolated variable via the mean(VAR) command, where VAR is the name of the variable whose mean you wish to compute. Alternatively, a mean can be calculated for each of the variables in a dataset by using the mean(DATAVAR) command, where DATAVAR is the name of the variable containing the data. The code sample below demonstrates both uses of the mean function.

  1. > #calculate the mean of a variable with mean(VAR)
  2. > #what is the mean Age in the sample?
  3. > mean(Age)
  4. [1] 32.3
  5. > #calculate the mean of all variables in a dataset with mean(DATAVAR)
  6. > #what is the mean of each variable in the dataset?
  7. > mean(dataset)
  8. Age...... Income
  9. 32.3..... 34000.0

Standard Deviation

Within R, standard deviations are calculated in the same way as means. The standard deviation of a single variable can be computed with the sd(VAR) command, where VAR is the name of the variable whose standard deviation you wish to retrieve. Similarly, a standard deviation can be calculated for each of the variables in a dataset by using the sd(DATAVAR) command, where DATAVAR is the name of the variable containing the data. The code sample below demonstrates both uses of the standard deviation function.

  1. > #calculate the standard deviation of a variable with sd(VAR)
  2. > #what is the standard deviation of Age in the sample?
  3. > sd(Age)
  4. [1] 19.45602
  5. > #calculate the standard deviation of all variables in a dataset with sd(DATAVAR)
  6. > #what is the standard deviation of each variable in the dataset?
  7. > sd(dataset)
  8. Age.............. Income
  9. 19.45602.... 32306.10175

Range

Minimum and Maximum

Keeping with the pattern, a minimum can be computed on a single variable using the min(VAR) command. The maximum, via max(VAR), operates identically. However, in contrast to the mean and standard deviation functions, min(DATAVAR) or max(DATAVAR) will retrieve the minimum or maximum value from the entire dataset, not from each individual variable. Therefore, it is recommended that minimums and maximums be calculated on individual variables, rather than entire datasets, in order to produce more useful information. The sample code below demonstrates the use of the min and max functions.

  1. > #calculate the min of a variable with min(VAR)
  2. > #what is the minimum age found in the sample?
  3. > min(Age)
  4. [1] 5
  5. > #calculate the max of a variable with max(VAR)
  6. > #what is the maximum age found in the sample?
  7. > max(Age)
  8. [1] 70

Range

The range of a particular variable, that is, its maximum and minimum, can be retrieved using the range(VAR) command. As with the min and max functions, using range(DATAVAR) is not very useful, since it considers the entire dataset, rather than each individual variable. Consequently, it is recommended that ranges also be computed on individual variables. This operation is demonstrated in the following code sample.

  1. > #calculate the range of a variable with range(VAR)
  2. > #what range of age values are found in the sample?
  3. > range(Age)
  4. [1] 5....70

Percentiles

Values from Percentiles (Quantiles)

Given a dataset and a desired percentile, a corresponding value can be found using the quantile(VAR, c(PROB1, PROB2,…)) command. Here, VAR refers to the variable name and PROB1, PROB2, etc., relate to probability values. The probabilities must be between 0 and 1, therefore making them equivalent to decimal versions of the desired percentiles (i.e. 50% = 0.5). The following example shows how this function can be used to find the data value that corresponds to a desired percentile.

  1. > #calculate desired percentile values using quantile(VAR, c(PROB1, PROB2,...))
  2. > #what are the 25th and 75th percentiles for age in the sample?
  3. > quantile(Age, c(0.25, 0.75))
  4. 25%....... 75%
  5. 17.75..... 44.25

Note that quantile(VAR) command can also be used. When probabilities are not specified, the function will default to computing the 0, 25, 50, 75, and 100 percentile values, as shown in the following example.

  1. > #calculate the default percentile values using quantile(VAR)
  2. > #what are the 0, 25, 50, 75, and 100 percentiles for age in the sample?
  3. > quantile(Age)
  4. 0%...... 25%...... 50%...... 75%...... 100%
  5. 5.00... 17.75...... 30.00... 44.25..... 70.00

Percentiles from Values (Percentile Rank)

In the opposite situation, where a percentile rank corresponding to a given value is needed, one has to devise a custom method. To begin, consider the steps involved in calculating a percentile rank.

  1. count the number of data points that are at or below the given value
  2. divide by the total number of data points
  3. multiply by 100

From the preceding steps, the formula for calculating a percentile rank can be derived: percentile rank = length(VAR[VAR <= VAL]) / length(VAR) * 100, where VAR is the name of the variable and VAL is the given value. This formula makes use of the length function in two variations. The first, length(VAR[VAR <= VAL]), counts the number of data points in a variable that are below the given value. Note that the "<=" operator can be replaced with other combinations of the <, >, and = operators, supposing that the function were to be applied to different scenarios. The second, length(VAR), counts the total number of data points in the variable. Together, they accomplish steps one and two of the percentile rank computation process. The final step is to multiply the result of the division by 100 to transform the decimal value into a percentage. A sample percentile rank calculation is demonstrated below.

  1. > #calculate the percentile rank for a given value using the custom formula: length(VAR[VAR <>
  2. > #in the sample, an age of 45 is at what percentile rank?
  3. > length(Age[Age <= 45]) / length(Age) * 100
  4. [1] 75

Summary

A very useful multipurpose function in R is summary(X), where X can be one of any number of objects, including datasets, variables, and linear models, just to name a few. When used, the command provides summary data related to the individual object that was fed into it. Thus, the summary function has different outputs depending on what kind of object it takes as an argument. Besides being widely applicable, this method is valuable because it often provides exactly what is needed in terms of summary statistics. A couple examples of how summary(X) can be used are displayed in the following code sample. I encourage you to use the summary command often when exploring ways to analyze your data in R. This function will be revisited throughout the R Tutorial Series.

  1. > #summarize a variable with summary(VAR)
  2. > summary(Age)

The output of the preceding summary is pictured below.

  1. > #summarize a dataset with summary(DATAVAR)
  2. > summary(dataset)

The output of the preceding summary is pictured below.

Complete Summary Statistics Analysis

To see a complete example of how summary statistics can be used to analyze data in R, please download the summary statistics analysis example (.txt) file.

Up Next: Zero-Order Correlations

Thank you for participating in the Summary and Descriptive Statistics tutorial. I hope that it has been useful to your work with R and statistics. Please let me know of any feedback, questions, or requests that you have in the comments section of this article. Our next guide will be on the topic of Zero-Order Correlations.

To leave a comment for the author, please follow the link and comment on his blog: R Tutorial Series.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Tags: , , , ,

Comments are closed.