Easy way of determining number of lines/records in a given large file using R

February 10, 2010
(This article was first published on Econometrics_Help, and kindly contributed to R-bloggers)

Dear Readers,

Today I would like to post an easy way of determining the number of lines/records in any given large file using R.

Straight to the point.

1) If the data set is small, say around 50MB or less, R can read it with ease using:
length(readLines("xyzfile.csv"))
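
As a quick check of the one-liner above, the sketch below writes a small temporary file and counts its lines. The file contents and the names demo_path and n_small are made up for illustration; note that the count includes the header line.

```r
# Hypothetical demo file; tempfile() avoids touching any real data.
demo_path <- tempfile(fileext = ".csv")
writeLines(c("id,value", "1,10", "2,20", "3,30"), demo_path)

# readLines() loads every line into a character vector; length() counts them.
n_small <- length(readLines(demo_path))
n_small

# Remove the temporary file again.
unlink(demo_path)
```

If only the data records are wanted, subtract one for the header.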

2) But if the data set is too large, say more than 1GB, reading it this way in R runs into the memory limit, since R loads all the records into memory before returning the requested count.

3) So, how can one determine the number of lines in a large data set without running into memory problems?

a) First, for a file of about half a GB or one million records/observations (assuming you have 2GB of RAM on your PC), the code below easily determines the number of records with no memory-related errors:

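# Open the file as a read-only text connection
testcon <- file("xyzfile.csv", open = "r")
readsizeof <- 20000  # number of lines to read per chunk
nooflines <- 0
# Read chunk by chunk until no lines are left
while ((linesread <- length(readLines(testcon, readsizeof))) > 0) {
  nooflines <- nooflines + linesread
}
close(testcon)
nooflines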
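
The chunked loop above could also be wrapped in a small helper so that the connection is closed even if reading fails. This is only a sketch: the function name count_lines, its chunk_size argument, and the temporary demo file are assumptions for illustration, not part of the original post.

```r
# Hypothetical helper: count lines by reading chunk_size lines at a time,
# so only one chunk is held in memory at any moment.
count_lines <- function(path, chunk_size = 20000L) {
  con <- file(path, open = "r")
  on.exit(close(con))  # close the connection even if an error occurs
  total <- 0
  repeat {
    n_read <- length(readLines(con, n = chunk_size))
    if (n_read == 0L) break  # nothing left to read
    total <- total + n_read
  }
  total
}

# Demo on a made-up temporary file with 50001 lines (more than two chunks).
demo_path <- tempfile(fileext = ".csv")
writeLines(as.character(seq_len(50001)), demo_path)
n_chunked <- count_lines(demo_path, chunk_size = 20000L)
n_chunked
unlink(demo_path)
```

A larger chunk_size means fewer reads but more memory per read; 20000 simply mirrors the value used in the post.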

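b) Next, even for files larger than half a GB, one can determine the number of records by compressing the file with bzip2 and running the code as follows:

# file() detects the bzip2 compression and decompresses while reading
testcon <- file("xyzfile.csv.bz2", open = "r")
readsizeof <- 20000  # number of lines to read per chunk
nooflines <- 0
# Read chunk by chunk until no lines are left
while ((linesread <- length(readLines(testcon, readsizeof))) > 0) {
  nooflines <- nooflines + linesread
}
close(testcon)
nooflines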

The second method has the advantage of saving disk space: from version 2.10, R can read compressed files (such as bzip2-compressed ones) directly, without unpacking them first.
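
To try the compressed variant without a real data file, one possible sketch is to write a small bzip2-compressed file from R with bzfile() and count its lines in chunks. The file name demo_bz2, the 1000 made-up lines, and the chunk size of 300 are assumptions for illustration only.

```r
# Write a made-up bzip2-compressed file with 1000 lines.
demo_bz2 <- tempfile(fileext = ".csv.bz2")
out <- bzfile(demo_bz2, open = "w")
writeLines(as.character(seq_len(1000)), out)
close(out)

# file() opened in mode "r" detects the compression and decompresses on the fly.
con <- file(demo_bz2, open = "r")
n_bz2 <- 0
while ((n_read <- length(readLines(con, n = 300L))) > 0) {
  n_bz2 <- n_bz2 + n_read
}
close(con)
n_bz2

# Remove the temporary file again.
unlink(demo_bz2)
```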

I hope readers will find these easy methods useful from now on.

Happy programming with R.
