R function for reading big tables

November 20, 2010
(This article was first published on Recipes, scripts and genomics, and kindly contributed to R-bloggers)

HugeFileLoader = function(path, sep = "\t", skip = 0, header = TRUE, nrows = 10){

### counts the number of lines using the shell wc command, and converts the output to numeric
### shQuote() protects paths that contain spaces or other shell metacharacters
line.count = paste("wc -l", shQuote(path))
row.count = as.numeric(strsplit(system(line.count, intern = TRUE), split=" ")[[1]][1]) - skip

### reads in the first nrows lines of the file and determines the type of each column
### (header is passed through so both reads treat the first line the same way)
first.rows = read.table(path, header = header, nrows = nrows, skip = skip, sep = sep)
tab.classes = sapply(first.rows, class)

### reads in the data; nrows is only an upper bound, so a counted header line does no harm
tab = read.table(path, header=header, colClasses=tab.classes, comment.char="#", nrows=row.count, skip=skip, sep=sep)
return(tab)
}
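The two-pass idea (sample a few rows to learn the column types, then read everything with colClasses and nrows set) can be tried on a small file. This is only a sketch: the file, its columns and the names demo.path, demo, sample.rows, demo.classes and full are made up for illustration, and the row count is hard-coded instead of coming from wc.

```r
# Write a small tab-separated demo file (hypothetical data).
demo.path = tempfile(fileext = ".txt")
demo = data.frame(id = 1:100,
                  name = rep(c("a", "b"), 50),
                  score = seq(0.5, 50, by = 0.5))
write.table(demo, demo.path, sep = "\t", quote = FALSE, row.names = FALSE)

# First pass: read only a few rows and record the class of each column.
sample.rows = read.table(demo.path, header = TRUE, nrows = 10, sep = "\t")
demo.classes = sapply(sample.rows, class)

# Second pass: read the whole file with known column classes and row count,
# so read.table does not have to guess types or grow its buffers.
full = read.table(demo.path, header = TRUE, colClasses = demo.classes,
                  nrows = 100, sep = "\t")
str(full)
```

Note that the classes are inferred from the sampled rows only, so a column that looks like an integer in the first rows but holds decimals further down will make the second read fail; raising nrows in the first pass reduces that risk.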

If you are using R on a Mac, you have to change the index used when parsing the wc -l output ([[1]][1]): there the output begins with spaces, so the first element after splitting is empty rather than the number of lines, as it is on a Linux machine.
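One way to avoid a platform-specific index is to strip the leading whitespace before splitting. A minimal sketch, assuming R 3.2.0 or later for trimws(); parse_wc_count and count_lines are hypothetical helper names, not part of the original function:

```r
# Extract the line count from one line of `wc -l` output,
# whether or not the count is padded with leading spaces.
parse_wc_count = function(wc.output) {
  fields = strsplit(trimws(wc.output), split = " +")[[1]]
  as.numeric(fields[1])
}

# Count the lines of a file with the shell wc command (Unix-like systems only).
count_lines = function(path) {
  wc.output = system(paste("wc -l", shQuote(path)), intern = TRUE)
  parse_wc_count(wc.output)
}

parse_wc_count("42 data.txt")          # unpadded output
parse_wc_count("      42 data.txt")    # output padded with leading spaces
```

With a helper like this, the row.count line in the function could call count_lines(path) - skip on either platform.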
