R function for reading big tables

November 20, 2010

(This article was first published on Recipes, scripts and genomics, and kindly contributed to R-bloggers)

HugeFileLoader = function(path, sep = "\t", skip = 0, header = TRUE, nrows = 10){

### counts the number of lines using shell wc command, and converts the output to numeric
line.count = paste("wc -l ", shQuote(path), sep = "")
row.count = as.numeric(strsplit(system(line.count, intern = TRUE), split = " ")[[1]][1]) - skip

### reads in the first nrows rows of the file and determines the type of each column
first5rows = read.table(path, header = header, nrows = nrows, skip = skip, sep = sep)
tab.classes = sapply(first5rows, class)

### reads in the data
tab = read.table(path, header = header, colClasses = tab.classes, comment.char = "#", nrows = row.count, skip = skip, sep = sep)

### returns the loaded table
return(tab)
}
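The column-class step can be tried on its own, without the wc call. The following is a minimal sketch, assuming a small made-up tab-separated file; the file contents and the names sample.rows, sample.classes and full are illustrative and not from the original function.

```r
# Build a small tab-separated file to stand in for a large table (made-up data)
tmp = tempfile(fileext = ".txt")
writeLines(c("chr\tstart\tscore",
             "chr1\t100\t0.5",
             "chr2\t200\t1.5",
             "chr3\t300\t2.5"), tmp)

# Read only a few rows and record the class of each column
sample.rows = read.table(tmp, header = TRUE, sep = "\t", nrows = 2)
sample.classes = sapply(sample.rows, class)

# Reuse those classes for the full read, so read.table does not have to guess types
full = read.table(tmp, header = TRUE, sep = "\t", colClasses = sample.classes)
str(full)
```

One caveat with this approach: the classes come from the sampled rows only, so a column that looks like an integer in the sample but holds decimals further down will make the full read fail. Raising nrows for the sample reduces that risk.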

If you are using R on a Mac, you have to change the index used when parsing the wc -l output ([[1]][1]): there, wc pads the count with leading spaces, so the first element after splitting on a space is an empty string, while on a Linux machine the first element is the number of lines.
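One way to avoid changing the index per platform is to trim the wc output before splitting it. This is a sketch under the assumption that a Unix-like wc is on the PATH; count_lines_wc is a hypothetical helper name, not part of the original function.

```r
# Hypothetical helper (not from the original post): count lines with `wc -l`
# while tolerating leading whitespace in its output
count_lines_wc = function(path) {
  out = system2("wc", c("-l", shQuote(path)), stdout = TRUE)
  # trimws() removes any leading padding, so the count is always the first field
  as.numeric(strsplit(trimws(out), "[[:space:]]+")[[1]][1])
}

# Small demonstration on a temporary three-line file (made-up data)
tmp = tempfile()
writeLines(c("a\tb", "1\t2", "3\t4"), tmp)
count_lines_wc(tmp)
```

With a helper like this, the row.count line in the function could call count_lines_wc(path) - skip on either platform.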
