(This article was first published on **Mad (Data) Scientist**, and kindly contributed to R-bloggers)

Last week a reader of the **r-help** mailing list posted a query titled “Importing random subsets of a data file.” With a very large file, it is often much easier and faster, and really just as good, to work with a much smaller subset of the data.

Fellow readers then posted rather sophisticated solutions, such as storing the file in a database. Here I’ll show how to perform this task much more simply. And if you haven’t been exposed to R’s text file reading functions before, it will be a chance for you to learn a bit.

I’m assuming here that we want to avoid storing the entire file in memory at once, which may be difficult or impossible. In other words, functions like **read.table()** are out.

I’m also assuming that you don’t know exactly how many records are in the file, though you probably have a rough idea. (If you do know this number, I’ll outline an alternative approach at the end of this post.)

Finally, since the total number of records is unknown, I’m also assuming that extracting every **k**th record is sufficiently “random” for you.

So, here is the code:

    subsamfile <- function(infile, outfile, k, header = TRUE) {
      ci <- file(infile, "r")   # input connection
      co <- file(outfile, "w")  # output connection
      if (header) {
        # copy the header line through unchanged
        hdr <- readLines(ci, n = 1)
        writeLines(hdr, co)
      }
      recnum <- 0  # records read so far
      numout <- 0  # records written so far
      while (TRUE) {
        inrec <- readLines(ci, n = 1)
        if (length(inrec) == 0) {  # end of file?
          close(ci)
          close(co)
          return(numout)
        }
        recnum <- recnum + 1
        if (recnum %% k == 0) {  # keep every k-th record
          numout <- numout + 1
          writeLines(inrec, co)
        }
      }
    }

The code is straightforward. We use **file()** to open the input and output files, and read the input file one line at a time by specifying the argument **n = 1** in each call to **readLines()**. Each record read is a character string. To detect the end-of-file condition on the input file, we test whether the input record has length 0. (Any actual record, even an empty one, will have length 1, i.e. each record is read as a 1-element vector of mode character, again due to setting **n = 1**.)
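To see the end-of-file test in isolation, here is a small sketch that writes a three-record temporary file (the middle record empty) and reads it back one line at a time, noting the length of each result. The file contents and the variable names are made up for illustration.

```r
# Build a tiny throwaway file: three records, the middle one empty.
tmp <- tempfile()
writeLines(c("first", "", "third"), tmp)

con <- file(tmp, "r")
lens <- integer(0)
repeat {
  rec <- readLines(con, n = 1)
  lens <- c(lens, length(rec))  # 1 for a real record, 0 at end of file
  if (length(rec) == 0) break
}
close(con)
unlink(tmp)

print(lens)
# [1] 1 1 1 0
```

The empty record still comes back as a 1-element vector (containing ""), so only the true end of file produces length 0.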

On a Linux or Mac platform, we can determine the number of records in the file ahead of time by running **wc -l infile** (either directly or via R’s **system()**). Note that this count includes the header line, if there is one. Counting may take a long time, but if we are willing to incur that time, then the above code could be changed to extract random records. We’d do something like **cullrecs <- sort(sample(1:ntotrecs,m,replace=FALSE))** where **m** is the desired number of records to extract; the sort puts the chosen record numbers in file order. Then whenever **recnum** matches the next element of **cullrecs**, we’d write that record to **outfile**.
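As a rough sketch of that alternative, the function below makes a single pass over the file and writes out only the records whose numbers were drawn. The name **subsamfile_random()** and its arguments are hypothetical, not part of the original code, and it assumes **ntotrecs** counts data records only (excluding any header line).

```r
# Hypothetical variant of subsamfile(): write m randomly chosen records.
# Assumes ntotrecs (number of data records, header excluded) is known.
subsamfile_random <- function(infile, outfile, ntotrecs, m, header = TRUE) {
  # Sorted so the chosen record numbers can be matched in one pass
  cullrecs <- sort(sample.int(ntotrecs, m, replace = FALSE))
  ci <- file(infile, "r")
  co <- file(outfile, "w")
  on.exit({ close(ci); close(co) })  # close both files however we leave
  if (header) writeLines(readLines(ci, n = 1), co)
  recnum <- 0   # records read so far
  nextidx <- 1  # position in cullrecs of the next record to keep
  while (nextidx <= m) {
    inrec <- readLines(ci, n = 1)
    if (length(inrec) == 0) break  # file was shorter than ntotrecs
    recnum <- recnum + 1
    if (recnum == cullrecs[nextidx]) {
      writeLines(inrec, co)
      nextidx <- nextidx + 1
    }
  }
  nextidx - 1  # number of records actually written
}
```

A call might look like **subsamfile_random("big.txt", "small.txt", ntotrecs, 1000)**, with made-up file names. Since the loop stops as soon as the last chosen record has been written, it often avoids reading the tail of the file.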

Will you be at the JSM next week? My talk is on Tuesday, but I’ll be there throughout the meeting. If you’d like to exchange some thoughts on R or statistics, I’d enjoy chatting with you.
