R Function of the Day: rle

September 22, 2009
(This article was first published on sigmafield - R, and kindly contributed to R-bloggers)

Edit: This post originally appeared on my Wordpress blog on September 22, 2009. I present it here in its original form.

The R Function of the Day series describes in plain language how certain R functions work, using simple examples that you can apply to gain insight into your own data.

Today, I will discuss the rle function.

What situation is rle useful in?

The name rle is an acronym for "run length encoding". What does the term "run length" mean? Imagine you flip a coin 10 times and record the outcome as "H" if the coin lands showing heads, and "T" if the coin lands showing tails. You want to know what the longest streak of heads is. You also want to know the longest streak of tails. A run is a sequence of consecutive flips with the same outcome, and the run length is the number of flips in that sequence. If the outcome of our experiment was "H T T H H H H H T H", the longest run length of heads would be 5, since there are 5 consecutive heads starting at position 4, and the longest run length for tails would be 2, since there are 2 consecutive tails starting at position 2. If you just have 10 flips, it is pretty easy to simply eyeball the answer. But if you had 100 flips, or 100,000, it would not be easy at all. However, it is very easy with the rle function in R! That function will encode the entire result into its run lengths. Using the example above, we start with 1 H, then 2 Ts, 5 Hs, 1 T, and finally 1 H. That is exactly what the rle function computes, as you will see below in the example.
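
To make the idea concrete, here is a minimal sketch that works out the run lengths of the 10-flip example by hand. It is meant only to illustrate the encoding, not to describe how rle works internally. The variable names (flips, changed, run.end, run.length, run.value) are made up for this illustration, and the sketch assumes the data contain no missing values.

```r
## the small example from the text: H T T H H H H H T H
flips <- c("H", "T", "T", "H", "H", "H", "H", "H", "T", "H")
n <- length(flips)

## TRUE wherever a flip differs from the flip just before it
changed <- flips[-1] != flips[-n]

## last position of each run; the final flip always ends a run
run.end <- c(which(changed), n)

## run lengths are the gaps between consecutive run ends
run.length <- diff(c(0, run.end))

## the outcome that each run consists of
run.value <- flips[run.end]

run.length   # 1 2 5 1 1
run.value    # "H" "T" "H" "T" "H"
```

The two vectors match the description above: 1 H, then 2 Ts, 5 Hs, 1 T, and finally 1 H.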

How do I use rle?

First, we will simulate the results of the coin flipping experiment. This is trivial in R using the sample function. We simulate flipping a coin 1000 times.


> ## generate data for coin flipping example 
> coin <- sample(c("H", "T"), 1000, replace = TRUE)
> table(coin) 
coin
  H   T 
501 499  
> head(coin, n = 20)
 [1] "T" "H" "T" "T" "T" "H" "T" "H" "T" "T" "H" "T" "H" "T"
[15] "T" "T" "H" "H" "H" "H" 

We can see the results of the first 20 tosses by using the head (as in "beginning", nothing to do with coin tosses) function on our coin vector.

So, our question is, what is the longest run of heads, and longest run of tails? First, what does the output of the rle function look like?


> ## use the rle function on our SMALL EXAMPLE above
> ## note results MATCH what I described above... 
> rle(c("H", "T", "T", "H", "H", "H", "H", "H", "T", "H"))
Run Length Encoding
  lengths: int [1:5] 1 2 5 1 1
  values : chr [1:5] "H" "T" "H" "T" "H" 
> ## use the rle function on our SIMULATED data
> coin.rle <- rle(coin)
> ## what is the structure of the returned result? 
> str(coin.rle)
List of 2
 $ lengths: int [1:500] 1 1 3 1 1 1 2 1 1 1 ...
 $ values : chr [1:500] "T" "H" "T" "H" ...
 - attr(*, "class")= chr "rle" 
> ## sort the data, this shows the longest run of
> ## ANY type (heads OR tails)
> sort(coin.rle$lengths, decreasing = TRUE)
  [1] 9 8 7 7 7 7 7 6 6 6 6 6 6 6 6 5 5 5 5 5 5 5 5 5 5 5 5
 [28] 5 5 5 5 5 5 5 5 5 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
 [55] 4 4 4 4 4 4 4 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
 [82] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
[109] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2
[136] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[163] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[190] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[217] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[244] 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[271] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[298] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[325] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[352] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[379] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[406] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[433] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[460] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[487] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
> ## use the tapply function to break up
> ## into 2 groups, and then find the maximum
> ## within each group
> 
> tapply(coin.rle$lengths, coin.rle$values, max)
H T 
9 8  

So in this case the longest run of heads is 9 and the longest run of tails is 8. The tapply function was discussed in a previous R Function of the Day article.
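
Knowing how long the longest run is often leads to the next question: where does it occur? One possible approach, sketched here on the small 10-flip example so the result is easy to check by eye, is to take the cumulative sum of the run lengths. The names flips.rle, run.start, run.end, is.head, and longest are illustrative only.

```r
## the small example from earlier in the article
flips <- c("H", "T", "T", "H", "H", "H", "H", "H", "T", "H")
flips.rle <- rle(flips)

## position in the original vector where each run ends and starts
run.end <- cumsum(flips.rle$lengths)
run.start <- run.end - flips.rle$lengths + 1

## which run is the longest run of heads?
## (if there are ties, [1] keeps the first one)
is.head <- flips.rle$values == "H"
longest <- which(is.head &
                 flips.rle$lengths == max(flips.rle$lengths[is.head]))[1]

run.start[longest]   # 4
run.end[longest]     # 8
```

This agrees with the earlier description of 5 consecutive heads starting at position 4. The same steps should carry over to the coin.rle object, although the positions will differ from one simulation to the next.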

Summary of rle

The rle function performs run length encoding. Although it is not used terribly often when programming in R, there are certain situations, such as time series and longitudinal data analysis, where knowing how it works can save a lot of time and give you insight into your data.
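
As a small illustration of the time series use case, the sketch below applies rle to a made-up daily rain record to find the longest dry spell. The data and the names rain, rain.rle, dry.max, and dry.spells are hypothetical and chosen only for this example. It also uses inverse.rle, the companion function in base R that turns an rle object back into the original vector.

```r
## hypothetical daily record: TRUE = rain, FALSE = dry (made-up values)
rain <- c(TRUE, TRUE, FALSE, FALSE, FALSE, TRUE,
          FALSE, FALSE, FALSE, FALSE, TRUE, TRUE)
rain.rle <- rle(rain)

## longest dry spell, in days
dry.max <- max(rain.rle$lengths[!rain.rle$values])
dry.max      # 4

## number of dry spells lasting at least 3 days
dry.spells <- sum(!rain.rle$values & rain.rle$lengths >= 3)
dry.spells   # 2

## inverse.rle reverses the encoding
identical(inverse.rle(rain.rle), rain)   # TRUE
```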
