(This article was first published on **Econometrics_Help**, and kindly contributed to R-bloggers)

Dear Readers,

Today I would like to post an easy way of determining the number of lines/records in any large file using R.

Straight to the point.

1) If the data set is small, say around 50 MB or less, R can read it with ease using:

length(readLines("xyzfile.csv"))

2) But if the data set is too large, say more than 1 GB, reading it this way runs into R's memory limit, since R loads all the records into memory before returning the requested result.

3) So, how can we determine the number of lines in a large data set without running into memory problems?

a) First, for a file of about half a GB or one million records/observations (assuming you have 2 GB of RAM on your PC), the code below determines the number of records without memory-related errors, because it reads the file in chunks of 20,000 lines instead of all at once:

testcon <- file("xyzfile.csv",open="r")

readsizeof <- 20000

nooflines <- 0

while ((linesread <- length(readLines(testcon, readsizeof))) > 0) {

nooflines <- nooflines + linesread

}

close(testcon)

nooflines
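The same loop can also be wrapped in a small reusable function. The following is only a sketch: the function name `count_lines` and its `chunk_size` argument are hypothetical names chosen for illustration, and a temporary file stands in for xyzfile.csv.

```r
# Hypothetical helper: count lines by reading the file in fixed-size chunks,
# so that at most `chunk_size` lines are held in memory at any time.
count_lines <- function(path, chunk_size = 20000) {
  con <- file(path, open = "r")
  on.exit(close(con))  # close the connection even if an error occurs
  total <- 0
  while ((n <- length(readLines(con, n = chunk_size))) > 0) {
    total <- total + n
  }
  total
}

# Small demonstration on a temporary file (a stand-in for a real data file)
demo_file <- tempfile(fileext = ".csv")
writeLines(sprintf("%d,%d", 1:45000, 45000:1), demo_file)
demo_count <- count_lines(demo_file)
```

A smaller `chunk_size` lowers peak memory use at the cost of more read calls; the count itself does not depend on the chunk size.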

b) Next, even for files larger than half a GB, one can determine the number of records by compressing the file with bzip2 and running the same code on the compressed file:

testcon <- file("xyzfile.csv.bz2",open="r")

readsizeof <- 20000

nooflines <- 0

while ((linesread <- length(readLines(testcon, readsizeof))) > 0) {

nooflines <- nooflines + linesread

}

close(testcon)

nooflines

The second method has the advantage of saving disk space: from version 2.10 onwards, R can read compressed files (such as bzip2 files) directly.
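As a rough check that a compressed copy gives the same count, the sketch below writes a small temporary file, writes a bzip2-compressed copy of it from within R using `bzfile()`, and counts the lines of both with the same chunked loop. The file names, the sample data, and the helper `count_in_chunks` are all made up for this illustration.

```r
# Hypothetical helper using the same chunked loop as in the post
count_in_chunks <- function(path, chunk_size = 20000) {
  con <- file(path, open = "r")  # in read mode, file() detects bzip2 compression
  on.exit(close(con))
  total <- 0
  while ((n <- length(readLines(con, n = chunk_size))) > 0) {
    total <- total + n
  }
  total
}

# Sample data written to a plain temporary file
plain_file <- tempfile(fileext = ".csv")
bz_file <- paste0(plain_file, ".bz2")
rows <- sprintf("%d,%f", 1:30000, sqrt(1:30000))
writeLines(rows, plain_file)

# Write a bzip2-compressed copy of the same rows
bz_con <- bzfile(bz_file, open = "w")
writeLines(rows, bz_con)
close(bz_con)

plain_count <- count_in_chunks(plain_file)
bz_count <- count_in_chunks(bz_file)

# Compressed size relative to the plain file (below 1 means space saved)
size_ratio <- file.size(bz_file) / file.size(plain_file)
```

Both counts should agree, and `size_ratio` indicates how much disk space the compressed copy saves for this particular sample.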

I hope readers will find these easy methods useful from now on.

Happy programming with R.
