Subsampling for dummies

[This article was first published on Timothée Poisot » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

A couple months ago, I was writing code for a paper that required some intense subsampling (something along the lines of several thousands of replicates on several thousands matrices). I decided to do the whole thing in R (I must confess that I don’t know how to live without it…), and as you can guess, it is really simple.

The whole idea was to pick at random submatrices from within one bigger matrix. For the sake of simplicity, I assume that my matrix will be square (i.e. same number of rows and columns), with 20 rows and 20 columns, and that approximately 40 percent of the positions will be filled with something that is not a zero. Also, the maximal value is 1.

It is easy to simulate a matrix with these properties in R, using a binomial distribution to fill the matrix at random places:

mat <- matrix(rbinom(400,4,0.35),nrow=20,ncol=20)/4

What I want to do is to see if my current sampling is large enough to get a correct idea of the « real » situation. It means that even if I decrease the sample size a little bit, I will have a comparable result. Let’s say that I really want to know how many non-zero interactions are in each row, something we will call generality, and define as a one-liner :

generality = function(x) sum(x[x>0])/length(x)

The general code for subsampling my big matrix will be

Result <- NULL
Nrep <- 100
for(Size in 3:(nrow(mat)-2))
  for(rep in 1:Nrep)
    nmat <- mat[sample(c(1:nrow(mat)),Size), sample(c(1:nrow(mat)),Size)]
    STAT <- mean(apply(nmat,1,generality))
    Result <- rbind(Result,rbind(c(Size,rep,STAT)))
Result <-
colnames(Result) <- c(‘Size’,‘Rep’,‘Stat’)

And voilà. This function simply randomly sample our matrix, and apply the function we want (here on each line). At the end, we change the matrix in a data.frame, that we can plot with lattice, for example.

The result is not exactly good looking because the data are as devoid of structure as possible, but the goal was just to illustrate how easy it is to build a subsampling routine. I’m not sure this is the best way to proceed, but I was happy with the speed of the code, and really happy with the time I spent on implementing the whole thing.

To leave a comment for the author, please follow the link and comment on their blog: Timothée Poisot » R. offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)