Compute longest increasing/decreasing subsequence using Rcpp

[This article was first published on Mango Solutions Shop, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

By Jinjing Xie – Consultant, Shanghai.

The challenge here is interesting but also very clear: for a given un-ordered vector of numbers, select from it a longest increasing (or decreasing) subsequence. This subsequence may not be unique.

For example, in the following sequence,

62 49 42 54 59 39 35 48 57 60

One obvious longest increasing subsequence is 48, 57, 60, which has length 3;

The (unique) longest decreasing subsequence is 62, 49, 42, 39, 35, which has length 5.

For a very long vector, say having 10000 numbers, is it the case that all increasing and decreasing sequences are very short? This question was actually answered 80 years ago, the longest increasing or decreasing subsequence has ( O(sqrt{n}) ) length.

Interested readers can find the proof in Erdos P. and Szekeres G., 1935 for following theorem: any sequence of ( n^2+1 ) distinct integers has either an increasing or a decreasing subsequence of length ( n+1 ).

That means for a general vector of length ( m ), there is either an increasing or decreasing subsequence of length at least ( sqrt{m-1} ).

Once we know that there is always a longest increasing or decreasing subsequence of at least a certain size, how do we find it? At a glance this may seem difficult. Actually there is a good dynamic programming algorithm using only ( O(nlog(n)) ) time.

R version

The following code has been implemented following Wiki page.

<span style="color: #0000ff;">library</span>(compiler)
longest_subseq.R = cmpfun(<span style="color: #0000ff;">function</span>(x) {
    P = integer(length(x))
    M = integer(length(x) + <span style="color: #0000ff;">1</span>)
    L = newL = <span style="color: #0000ff;">0</span>
    <span style="color: #0000ff;">for</span> (i <span style="color: #0000ff;">in</span> seq_along(x) - <span style="color: #0000ff;">1</span>) {
        lo = <span style="color: #0000ff;">1</span>
        hi = L
        <span style="color: #0000ff;">while</span> (lo <= hi) {
            mid = (lo + hi)%/%<span style="color: #0000ff;">2</span>
            <span style="color: #0000ff;">if</span> (x[M[mid + <span style="color: #0000ff;">1</span>] + <span style="color: #0000ff;">1</span>] < x[i + <span style="color: #0000ff;">1</span>]) {
                lo = mid + <span style="color: #0000ff;">1</span>
            } <span style="color: #0000ff;">else</span> {
                hi = mid - <span style="color: #0000ff;">1</span>
            }
        }
        newL = lo
        P[i + <span style="color: #0000ff;">1</span>] = M[newL]
        <span style="color: #0000ff;">if</span> (newL > L) {
            M[newL + <span style="color: #0000ff;">1</span>] = i
            L = newL
        }<span style="color: #0000ff;"> else if</span> (x[i + 1] < x[M[newL + <span style="color: #0000ff;">1</span>] + <span style="color: #0000ff;">1</span>]) {
            M[newL + <span style="color: #0000ff;">1</span>] = i
        }
    }
    k = M[L + <span style="color: #0000ff;">1</span>]
    re = integer(L)
    <span style="color: #0000ff;">for</span> (i <span style="color: #0000ff;">in</span> L:<span style="color: #0000ff;">1</span>) {
        re[i] = k + <span style="color: #0000ff;">1</span>
        k = P[k + <span style="color: #0000ff;">1</span>]
    }
    re
})
longest_subseq.R(c(<span style="color: #0000ff;">5</span>, <span style="color: #0000ff;">1</span>, <span style="color: #0000ff;">3</span>, <span style="color: #0000ff;">2</span>, <span style="color: #0000ff;">4</span>))

## [1] 2 4 5

 

Now let’s play with a larger vector.

 

x = rnorm(<span style="color: #0000ff;">1000</span>)
plot(x, main = <span style="color: #008000;">"Longest Increasing (blue) and Decreasing (red) Subsequencesnfor 1000 random numbers"</span>, 
    ylab = <span style="color: #008000;">"Random Value"</span>, pch = <span style="color: #0000ff;">16</span>, col = "<span style="color: #008000;">#0000004D</span>")
<span style="color: #008000;"># Longest Increasing Sequence</span>
ind = longest_subseq.R(x)
points(ind, x[ind], pch = <span style="color: #0000ff;">16</span>, col = "<span style="color: #008000;">blue</span>")
<span style="color: #008000;"># Longest Decreasing Sequence</span>
rind = longest_subseq.R(-x)
points(rind, x[rind], pch = <span style="color: #0000ff;">16</span>, col = "<span style="color: #008000;">red</span>")

 

1.png

 

How about 250,000 length vectors?

 

y = rnorm(<span style="color: #0000ff;">250000</span>)
system.time(ind1 <- longest_subseq.R(y))

##    user  system elapsed 
##    3.36    0.01    3.48

 

A bit slow, right?

But 250,000 is typically not long for a time series.

 

Rcpp version

<span style="color: #0000ff;">library</span>(Rcpp)

## Warning: package 'Rcpp' was built under R version 3.0.3

cppcode <span style="color: #008000;"><- "n#include <vector>n#include <Rcpp.h>nusing namespace std;ntypedef int index_type;nn// [[Rcpp::export]]nvector<index_type> longest_subseq(SEXP X){n    Rcpp::NumericVector x(X);ntvector<index_type> P(x.size(), 0);ntvector<index_type> M(x.size()+1, 0);ntindex_type L(0), newL;ntfor(index_type i=0; i < x.size(); ++i) {nttindex_type lo(1), hi(L), mid;nttwhile( lo <= hi) {ntttmid = (lo + hi) / 2;ntttif (x[M[mid]] < x[i] ) {nttttlo = mid + 1;nttt} else {ntttthi = mid - 1;nttt}ntt}nttnewL = lo;nttP[i] = M[newL - 1];nttif (newL > L) {ntttM[newL] = i;ntttL = newL;ntt} else if (x[i] < x[M[newL]]) {ntttM[newL] = i;ntt}nt}n    vector<index_type> re(L);ntindex_type k(M[L]);ntfor(index_type i=L-1; i>=0; --i){nttre[i] = k + 1;nttk = P[k];nt}n    return(re);n}"
</span>
sourceCpp(code = cppcode)

 

The sourceCpp will export the cpp function longest_subseq as R function longest_subseq, now we can test it.

 

system.time(ind2 <- longest_subseq(y))

##    user  system elapsed 
##    0.03    0.00    0.03

all.equal(ind1, ind2)

## [1] TRUE

 

That’s a big improvement.

Now we can play with some time series data sets.

 

data(EuStockMarkets)
matplot(EuStockMarkets, main = <span style="color: #008000;">"Longest Increasing SubsequencenEU Stock Market Data (1991–1998)"</span>, 
    col = <span style="color: #0000ff;">2:5</span>, pch = "o", cex = <span style="color: #0000ff;">0.3</span>)
<span style="color: #0000ff;">for</span> (i <span style="color: #0000ff;">in 1:4</span>) {
    ind = longest_subseq(EuStockMarkets[, i])
    points(ind, EuStockMarkets[ind, i], type = <span style="color: #008000;">"l"</span>, col = i + <span style="color: #0000ff;">1</span>)
}
i = <span style="color: #0000ff;">3</span>
ind = longest_subseq(EuStockMarkets[, i])
<span style="color: #0000ff;">for</span> (j <span style="color: #0000ff;">in</span> ind) lines(c(j, j), c(<span style="color: #0000ff;">0</span>, EuStockMarkets[j, i]), col = i + <span style="color: #0000ff;">1</span>, lty = <span style="color: #0000ff;">3</span>)
legend("<span style="color: #008000;">topleft</span>", legend = colnames(EuStockMarkets), col = <span style="color: #0000ff;">2:5</span>, pch = "o")

 

2.png

 

To leave a comment for the author, please follow the link and comment on their blog: Mango Solutions Shop.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)