Code Snippet: Extracting a Subsample from a Large File

August 1, 2014
By

(This article was first published on Mad (Data) Scientist, and kindly contributed to R-bloggers)

Last week a reader of the r-help mailing list posted a query titled “Importing random subsets of a data file.”  With a very large file, it is often much easier and faster–and really, just as good–to just work with a much smaller subset of the data.

Fellow readers then posted rather sophisticated solutions, such as storing the file in a database. Here I’ll show how to perform this task much more simply.  And if you haven’t been exposed to R’s text file reading functions before, it will be a chance for you to learn a bit.

I’m assuming here that we want to avoid storing the entire file in memory at once, which may be difficult or impossible.  In other words, functions like read.table() are out.

I’m also assuming that you don’t know exactly how many records are in the file, though you probably have a rough idea.  (If you do know this number, I’ll outline an alternative approach at the end of this post.)

Finally, due to lack of knowledge of the total number of records, I’m also assuming that extracting every kth record is sufficiently “random” for you.

So, here is the code (downloadable from here):

subsamfile <- function(infile,outfile,k,header=T) {
   ci <- file(infile,"r")
   co <- file(outfile,"w")
   if (header) {
      hdr <- readLines(ci,n=1)
      writeLines(hdr,co)
   }
   recnum = 0
   numout = 0
   while (TRUE) {
      inrec <- readLines(ci,n=1)
      if (length(inrec) == 0) { # end of file?
         close(co) 
         return(numout)
      }
   recnum <- recnum + 1
   if (recnum %% k == 0) {
      numout <- numout + 1
      writeLines(inrec,co)
   }
  }
}

Very straightforward code.  We use file() to open the input and output files, and read in the input file one line at a time, by specifying the argument n = 1 in the first call to file().  Each inputted record is a character string.  To sense the end-of-file condition on the input file, we test whether the input record has length 0.  (Any record, even an empty one, will have length 1, i.e. each record is read as a 1-element vector of mode character, again due to setting n = 1.)

On a Linux or Mac platform, we can determine the number of records in the file ahead of time by running wc -l infile (either directly or via R’s system()).  This may take a long time, but if we are willing to incur that time, then the above code could be changed to extract random records. We’d do something like cullrecs <- sample(1:ntotrecs,m,replace=FALSE) where m is the desired number of records to extract, and then whenever recnum matches the next element of cullrecs, we’d write that record to outfile.

Will you be at the JSM next week? My talk is on Tuesday, but I’ll be there throughout the meeting. If you’d like to exchange some thoughts on R or statistics, I’d enjoy chatting with you.


To leave a comment for the author, please follow the link and comment on his blog: Mad (Data) Scientist.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.