Do basic R operations much faster in bash [Slightly off-topic]

January 25, 2016

(This article was first published on Rcrastinate, and kindly contributed to R-bloggers)

R is great, and you can do a LOT OF stuff with it.

However, sometimes you want to do really basic stuff with huge files or with a lot of files. At work, I have to do that a lot because I mostly deal with language data that often needs some pre-processing.

Most of these operations are done much, much faster on the level of the operating system (preferably in Bash on Linux or Unix, e.g. Mac OS). And since R tries to load everything into working memory, these commands might also help you to do stuff with files that are too big for your RAM.

This blog post is a kind of cheat sheet for me to remember some of the bash commands that prove very useful to me. (Most of them are quite basic for an advanced user of Linux or Unix, I guess.)

Disclaimer: Most of these calls were adapted from different StackExchange questions. There are really lots of very helpful posts. Thanks to the community!

Superfast subset of a tabulated text file (it might also be gzipped!):
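[z]grep -E "<pattern>" <input file> > <output file>
<pattern> could include your separators. If <input file> is tab-separated, use -P instead of -E for Perl-like regular expressions, so that \t matches a tab (only works with grep, not with zgrep?).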
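
A minimal sketch of such a subset, assuming a made-up tab-separated file words.tsv with a word in the first column and a count in the second. The file names and the pattern are hypothetical. The pattern is built with printf so that it contains a literal tab, which avoids relying on -P (not every grep build supports it).

```shell
# Work in a throwaway directory so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample data: word <TAB> count
printf 'haus\t12\nhausen\t3\nmaus\t7\nhaus\t5\n' > words.tsv

# Pattern with a literal tab: lines whose first column is exactly "haus".
pattern=$(printf '^haus\t')

# Keep only the matching lines and write them to a new file.
grep -E "$pattern" words.tsv > subset.tsv

# The same call on a gzipped copy, this time via zgrep.
gzip -c words.tsv > words.tsv.gz
zgrep -E "$pattern" words.tsv.gz > subset_gz.tsv

cat subset.tsv
```

Anchoring the pattern with ^ and ending it with the separator is what keeps "hausen" out of the result.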

Superfast extraction of the first column from a tab-separated file:
cut -f1 <input file> > <output file>
Just replace <input file> with * if you want to extract the first column from each file and write them all into the same <output file>.
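
A small sketch of both variants, assuming two made-up tab-separated files part1.tsv and part2.tsv (all names are hypothetical). It uses *.tsv rather than a bare * and gives the output a different extension, so an output file left over from an earlier run cannot end up among the inputs.

```shell
# Work in a throwaway directory so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Two hypothetical tab-separated input files.
printf 'a\t1\nb\t2\n' > part1.tsv
printf 'c\t3\nd\t4\n' > part2.tsv

# First column of a single file (tab is the default delimiter of cut).
cut -f1 part1.tsv > first_col_one.txt

# First column of every .tsv file, concatenated into one output file.
cut -f1 *.tsv > first_col_all.txt

cat first_col_all.txt
```

If your files use another separator, cut takes it via -d, for example -d',' for comma-separated data.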

Write unique rows of a file into a new file:
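sort <input file> | uniq > <output file>
Yes, there is no “e” after uniq! You have to sort first because uniq only removes duplicate lines that are directly adjacent to each other.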
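
To see why the sort matters, here is a tiny sketch on a made-up file rows.txt (the file names are hypothetical). It also shows sort -u, which should give the same result in a single call.

```shell
# Work in a throwaway directory so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical input with duplicates that are NOT adjacent.
printf 'b\na\nb\na\n' > rows.txt

# uniq alone only collapses adjacent duplicates, so nothing is removed here.
uniq rows.txt > not_unique.txt

# Sorting first makes the duplicates adjacent, so uniq can drop them.
sort rows.txt | uniq > unique.txt

# sort -u sorts and de-duplicates in one step.
sort -u rows.txt > unique_short.txt

cat unique.txt
```

Note that the output is in sorted order, not in the order of the original file.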

Get list of files from a directory really fast – this has to be inserted into an R script to get a list of files:
files <- system(paste0("ls -f ", shQuote(source.path)), intern = TRUE)
I used this to get a list of 1.6 million file names. It was A LOT faster than the built-in R function dir().
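
One thing to keep in mind, sketched here directly in the shell on a made-up directory with two hypothetical files: ls -f does not sort its output, and it also lists hidden entries, including "." and "..". So the vector you get back in R will probably contain those two entries as well, unless you filter them out.

```shell
# Throwaway directory with two hypothetical files.
workdir=$(mktemp -d)
touch "$workdir/b.txt" "$workdir/a.txt"

# Raw listing: unsorted, and it includes "." and "..".
ls -f "$workdir"

# Drop "." and ".." to keep only the real entries; the order is still arbitrary.
listing=$(ls -f "$workdir" | grep -v -E '^\.\.?$')
echo "$listing"
```

The same grep filter could be appended to the command string passed to system(), or the two entries could be dropped afterwards in R.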

To be continued.
