Simple pattern detection in numerical data

[This article was first published on yaRb, and kindly contributed to R-bloggers.]

A while ago I helped out a colleague who was testing some methods to detect economic (in)activity of companies, based on their quarterly tax declarations (our institute has access to such data). One of the things my colleague wanted was to detect simple patterns, such as a run of zero declarations followed by positive ones, or the other way around. After fiddling around for half an hour I came up with the following solution, which is so “typically R” that I decided to share it. So here it goes.

Instead of directly detecting the patterns, I decided to write a distance function which counts the number of cells in a row that do not comply with some pattern. Each cell of a pattern is specified with one of the following codes:

 1 (value is positive)
-1 (value is negative)
 0 (value is zero)
NA (value is missing)
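
To make the encoding concrete, here is a small sketch showing that base R's `sign` maps numeric values onto exactly these four codes. The vector `vals` is made up for illustration and is not part of the original data.

```r
# Hypothetical values, chosen only to illustrate the encoding
vals <- c(-2.5, 0, 3.1, NA)

# sign() gives -1 for negative, 0 for zero, 1 for positive, NA for missing
codes <- sign(vals)
codes
```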

A vector of such codes, one per column of the data, defines a pattern. Below is the function:


# Number of cells in a row not matching a simple pattern
#
# x: numeric data.frame or matrix
# pattern: row pattern to detect, vector of length ncol(x) with
# entry : meaning
#  -1   : < 0
#   1   : > 0
#   NA  : NA
#   0   : 0
#
patternDist <- function(x, pattern){
    apply(sign(x), 1,
        function(row){
            # nonzero where the signs differ; NA where either side is missing
            v <- row - pattern
            # missing in both row and pattern counts as a match
            v[is.na(pattern) & is.na(row)] <- 0
            # missing in exactly one of the two counts as a mismatch
            v[xor(is.na(pattern), is.na(row))] <- 1
            sum(v != 0)
    })
}
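
The handling of missing values is the only subtle part, so here is a sketch that walks through the body of the inner function for a single row. The row and pattern values are invented for illustration; the three steps are the same as in `patternDist`.

```r
# Illustrative row and pattern (not taken from the example data)
row     <- sign(c(-0.6, 0, NA, NA))       # -1  0 NA NA
pattern <- c(-1, 1, NA, 0)

v <- row - pattern                        #  0 -1 NA NA
v[is.na(pattern) & is.na(row)] <- 0       #  0 -1  0 NA  (both missing: match)
v[xor(is.na(pattern), is.na(row))] <- 1   #  0 -1  0  1  (only one missing: mismatch)

mismatches <- sum(v != 0)
mismatches                                # 2
```

After the two assignments no `NA` is left in `v`, which is why the final `sum` needs no `na.rm` argument.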




The type of numbers to be detected was so simple that it was sufficient to use the sign function. Also, it's kinda funny to note that the whole function consists of a single apply-statement (which is never a goal in itself of course, but nevertheless...). I think it is a nice example of the power of R. Here's an example of how to use it.


set.seed(1)
x <- matrix(rnorm(40),nrow=10)
x[sample(40,4)] <- 0
x[sample(40,4)] <- NA

x <- as.data.frame(x)
x
           V1          V2          V3          V4
1  -0.6264538  1.51178117  0.91897737          NA
2   0.1836433  0.38984324  0.78213630 -0.10278773
3  -0.8356286  0.00000000  0.07456498  0.38767161
4   1.5952808 -2.21469989 -1.98935170 -0.05380504
5          NA  1.12493092  0.61982575 -1.37705956
6  -0.8204684  0.00000000 -0.05612874 -0.41499456
7   0.4874291 -0.01619026 -0.15579551 -0.39428995
8          NA  0.00000000          NA -0.05931340
9   0.5757814  0.82122120 -0.47815006  1.10002537
10 -0.3053884  0.59390132  0.41794156  0.76317575

patternDist(x,pattern=c(-1,1,1,NA))

[1] 0 2 2 4 2 3 4 4 3 1



And indeed, only the first row completely complies with the pattern negative-positive-positive-missing.
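
Coming back to the original use case, the sketch below applies the same function to the kind of pattern my colleague was after: zero declarations followed by positive ones. The matrix `decl` holds made-up numbers for three hypothetical companies, and `patternDist` is repeated only so that the snippet runs on its own.

```r
# Copy of the function above so the sketch is self-contained
patternDist <- function(x, pattern){
    apply(sign(x), 1, function(row){
        v <- row - pattern
        v[is.na(pattern) & is.na(row)] <- 0
        v[xor(is.na(pattern), is.na(row))] <- 1
        sum(v != 0)
    })
}

# Hypothetical quarterly declarations, one row per company (made-up numbers)
decl <- matrix(c( 0,  0, 120, 340,
                 50, 60,   0,   0,
                  0, 10,  20,  30),
               nrow = 3, byrow = TRUE)

# Two zero quarters followed by two positive ones
d <- patternDist(decl, pattern = c(0, 0, 1, 1))
d                # number of mismatching cells per company
which(d == 0)    # rows that match the pattern exactly
```

Because the result is a distance rather than a yes/no answer, a condition such as `d <= 1` would also pick up rows that are one cell away from the pattern. Whether that is useful depends on the application.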
