Easy way of determining number of lines/records in a given large file using R

February 10, 2010
By

(This article was first published on Econometrics_Help, and kindly contributed to R-bloggers)

Dear Readers,

Today I would like to post the easy way of determining number of lines/records in any given large file using R.

Directly to point.

1) If data set is small let say less than 50MB or around in R one can read it with ease using:
length(readLines(“xyzfile.csv”))

2) But if data set is too large say more than 1GB then reading through R throws the memory limit problem, since R takes all the records into memory and outputs the requested.

3) So, how to determine number of lines for large data set without getting into memory problems.

a) First for let’s say of size about half GB or one million records/observations (assuming you are having 2GB RAM on your PC) the below code easily determine number of records with no memory related errors:

testcon <- file("xyzfile.csv",open="r")
readsizeof <- 20000
nooflines <- 0
( while((linesread <- length(readLines(testcon,readsizeof))) > 0 )
nooflines <- nooflines+linesread )
close(testcon)
nooflines

b) Next, even for size larger than half GB one can determine the number of records by bzipping the file and running the code as follows:
testcon <- file("xyzfile.csv.bz2",open="r")
readsizeof <- 20000
nooflines <- 0
( while((linesread <- length(readLines(testcon,readsizeof))) > 0 )
nooflines <- nooflines+linesread )
close(testcon)
nooflines

Second method has an advantage of disk space efficiency R from 2.10 version can
directly read zip files.

Thus, from next time wish readers will follow these easy method.

Have a nice programing with R.

To leave a comment for the author, please follow the link and comment on their blog: Econometrics_Help.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Sponsors

Mango solutions



RStudio homepage



Zero Inflated Models and Generalized Linear Mixed Models with R

Quantide: statistical consulting and training



http://www.eoda.de







ODSC

ODSC

CRC R books series











Contact us if you wish to help support R-bloggers, and place your banner here.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)