Exploratory Data Analysis: Useful R Functions for Exploring a Data Frame

(This article was first published on The Chemical Statistician » R programming, and kindly contributed to R-bloggers)

Introduction

Data in R are often stored in data frames, because a data frame can hold a different type of data in each column.  (In this sense, data frames are more general than matrices, because a matrix can only store one type of data.)  Today’s post highlights some common functions in R that I like to use to explore a data frame before I conduct any statistical analysis.  I will use the built-in data set “InsectSprays” to illustrate these functions, because it contains categorical (factor) and continuous (numeric) data, and that allows me to show different ways of exploring these 2 types of data.
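To make the contrast with matrices concrete, here is a small sketch using only base R.  The object name "m" is my own choice for illustration.

```r
# Each column of the data frame keeps its own type.
class(InsectSprays$count)
class(InsectSprays$spray)

# Converting to a matrix forces a single type.  Because "spray" is not
# numeric, every entry (including the counts) becomes a character string.
m <- as.matrix(InsectSprays)
is.character(m)
class(m[1, "count"])
```

This is why I keep mixed data in a data frame rather than a matrix: the counts stay numeric and can be summarized directly.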

If you have a favourite command for exploring data frames that is not in this post, please share it in the comments!

This post continues a recent series on exploratory data analysis.

 

Useful Functions for Exploring Data Frames

Use dim() to obtain the dimensions of the data frame (number of rows and number of columns).  The output is a vector.

> dim(InsectSprays)
[1] 72 2

 

Use nrow() and ncol() to get the number of rows and the number of columns, respectively.  You can get the same information by extracting the first and second elements of the output vector from dim().

> nrow(InsectSprays)   # same as dim(InsectSprays)[1]
[1] 72
> ncol(InsectSprays)   # same as dim(InsectSprays)[2]
[1] 2
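If you want to confirm that equivalence yourself, a quick check along these lines should work.  The name "d" is just a placeholder I picked for the output of dim().

```r
# Store the dimensions once, then compare each element with nrow() and ncol().
d <- dim(InsectSprays)
identical(d[1], nrow(InsectSprays))
identical(d[2], ncol(InsectSprays))
```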

Use head() to obtain the first n observations and tail() to obtain the last n observations; by default, n = 6.  These are good commands for obtaining an intuitive idea of what the data look like without revealing the entire data set, which could have millions of rows and thousands of columns.

> head(InsectSprays, n = 5)
   count spray
1     10     A
2      7     A
3     20     A
4     14     A
5     14     A
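As a small sketch of the default behaviour, leaving out "n" gives 6 rows from either end.  The names "first_rows" and "last_rows" are only for illustration.

```r
# With no "n" supplied, head() and tail() each return 6 observations.
first_rows <- head(InsectSprays)
last_rows  <- tail(InsectSprays)

nrow(first_rows)
# tail() keeps the original row labels, so you can see where the rows came from.
rownames(last_rows)
```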

 

Let s be the number of observations.  If you use a negative number for the “n” argument in head(), you will obtain the first s+n observations; in other words, the last |n| observations are dropped.  In the following example, since s = 72 and n = -62, the following command will return the first 10 observations; the calculation is

s+n = 72 + (-62) = 10.

> head(InsectSprays, n = -62)
   count spray
1     10     A
2      7     A
3     20     A
4     14     A
5     14     A
6     12     A
7     10     A
8     23     A
9     17     A
10    20     A

 

Analogously, if you use a negative number for the “n” option in tail(), you will get the last s+n observations.  For example, the following command will return the last 10 observations.

> tail(InsectSprays, n = -62)
   count spray  
63    15     F
64    22     F
65    15     F
66    16     F
67    13     F
68    10     F
69    26     F
70    26     F
71    24     F
72    13     F
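The s+n arithmetic for both functions can be checked with a short sketch.  The variables "s" and "n" mirror the notation used in the text; nothing else is assumed beyond base R.

```r
s <- nrow(InsectSprays)   # number of observations
n <- -62

# head() with a negative n drops the last |n| rows, keeping the first s + n.
identical(head(InsectSprays, n = n), head(InsectSprays, n = s + n))

# tail() with a negative n drops the first |n| rows, keeping the last s + n.
identical(tail(InsectSprays, n = n), tail(InsectSprays, n = s + n))
```

Both comparisons should return TRUE, since each pair of calls selects the same 10 rows.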

 

The names() function will return the column headers.

> names(InsectSprays)
[1] "count" "spray"
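One use of names() that I find handy is selecting a column by name without typing it out.  The following is only a sketch, and "cols" is a name I made up for it.

```r
# For a data frame, names() and colnames() give the same column headers.
cols <- names(InsectSprays)
identical(cols, colnames(InsectSprays))

# Use the stored names to pull out the first column programmatically.
head(InsectSprays[[cols[1]]])
```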

 

The str() function displays many useful pieces of information at once, including the dimensions and column names shown above, as well as the type of data in each column.  In this example, “num” denotes that the variable “count” is numeric (continuous), and “Factor” denotes that the variable “spray” is categorical with 6 categories or levels.

> str(InsectSprays)
'data.frame': 72 obs. of 2 variables:
$ count: num 10 7 20 14 14 12 10 23 17 20 ...
$ spray: Factor w/ 6 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
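str() prints its results for reading.  If you need the column types as an object that you can work with, one possible approach (a sketch, with "col_classes" as a name of my own) is to apply class() to each column.

```r
# Class of every column, as a named character vector.
col_classes <- sapply(InsectSprays, class)
col_classes

# Names of the numeric columns and of the factor columns.
names(InsectSprays)[sapply(InsectSprays, is.numeric)]
names(InsectSprays)[sapply(InsectSprays, is.factor)]
```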

 

To obtain all of the categories or levels of a categorical variable, use the levels() function.

> levels(InsectSprays$spray)
[1] "A" "B" "C" "D" "E" "F"
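Two companions to levels() that may be useful here are nlevels() and table().  This is just a sketch; "spray_counts" is a name chosen for illustration.

```r
# Number of levels, without listing them.
nlevels(InsectSprays$spray)

# Number of observations in each level.
spray_counts <- table(InsectSprays$spray)
spray_counts
```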

 

When applied to a data frame, the summary() function is essentially applied to each column, and the results for all columns are shown together.  For a continuous (numeric) variable like “count”, it returns the 5-number summary together with the mean.  (Read my previous post to learn how fivenum() and summary() return different 5-number summaries.)   If there are any missing values (denoted by “NA” for a particular datum), it also provides a count of them.  In this example, there are no missing values for “count”, so there is no display for the number of NA’s.  For a categorical variable like “spray”, it returns the levels and the number of data in each level.

> summary(InsectSprays)
count            spray
Min.   : 0.00    A:12
1st Qu.: 3.00    B:12
Median : 7.00    C:12
Mean   : 9.50    D:12
3rd Qu.:14.25    E:12
Max.   :26.00    F:12
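To see the NA count appear, and to compare summary() with fivenum() on a single column, here is a small sketch.  The data frame "sprays_na" is hypothetical; I create it only to introduce missing values, and the original InsectSprays is left untouched.

```r
# Copy the data and blank out two counts to see how summary() reports NA's.
sprays_na <- InsectSprays
sprays_na$count[c(1, 2)] <- NA
summary(sprays_na$count)

# summary() on one column of the original data, next to fivenum().
summary(InsectSprays$count)
fivenum(InsectSprays$count)
```

On this data set, the two functions differ only in the upper quartile: fivenum() reports the upper hinge, 14.5, while summary() reports 14.25.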

 

Are there any other functions for exploring data frames that you like?  If so, please share them in the comments!

