Subsampling for dummies

July 8, 2010
By

(This article was first published on Timothée Poisot » R, and kindly contributed to R-bloggers)

A couple months ago, I was writing code for a paper that required some intense subsampling (something along the lines of several thousands of replicates on several thousands matrices). I decided to do the whole thing in R (I must confess that I don’t know how to live without it…), and as you can guess, it is really simple.

The whole idea was to pick at random submatrices from within one bigger matrix. For the sake of simplicity, I assume that my matrix will be square (i.e. same number of rows and columns), with 20 rows and 20 columns, and that approximately 40 percent of the positions will be filled with something that is not a zero. Also, the maximal value is 1.

It is easy to simulate a matrix with these properties in R, using a binomial distribution to fill the matrix at random places:

mat <- matrix(rbinom(400,4,0.35),nrow=20,ncol=20)/4

What I want to do is to see if my current sampling is large enough to get a correct idea of the « real » situation. It means that even if I decrease the sample size a little bit, I will have a comparable result. Let’s say that I really want to know how many non-zero interactions are in each row, something we will call generality, and define as a one-liner :

generality = function(x) sum(x[x>0])/length(x)

The general code for subsampling my big matrix will be

Result <- NULL
Nrep <- 100
for(Size in 3:(nrow(mat)-2))
{
  for(rep in 1:Nrep)
  {
    nmat <- mat[sample(c(1:nrow(mat)),Size), sample(c(1:nrow(mat)),Size)]
    STAT <- mean(apply(nmat,1,generality))
    Result <- rbind(Result,rbind(c(Size,rep,STAT)))
  }
}
Result <- as.data.frame(Result)
colnames(Result) <- c(‘Size’,‘Rep’,‘Stat’)

And voilà. This function simply randomly sample our matrix, and apply the function we want (here on each line). At the end, we change the matrix in a data.frame, that we can plot with lattice, for example.

The result is not exactly good looking because the data are as devoid of structure as possible, but the goal was just to illustrate how easy it is to build a subsampling routine. I’m not sure this is the best way to proceed, but I was happy with the speed of the code, and really happy with the time I spent on implementing the whole thing.

To leave a comment for the author, please follow the link and comment on his blog: Timothée Poisot » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Tags: , , , ,

Comments are closed.