Do basic R operations much faster in bash [Slightly off-topic]

[This article was first published on Rcrastinate, and kindly contributed to R-bloggers.]
R is great, and you can do a LOT OF stuff with it.

However, sometimes you want to do really basic stuff with huge files or with a lot of files. At work, I have to do that a lot because I mostly deal with language data that often needs some pre-processing.

Most of these operations run much, much faster at the level of the operating system (preferably in Bash on Linux or another Unix-like system such as macOS). And since R tries to load everything into working memory, these commands might also help you do stuff with files that are too big for your RAM.

This blog post is a kind of cheat sheet for me to remember some of the shell commands that prove very useful to me. (Most of them are quite basic for an advanced Linux or Unix user, I guess.)

Disclaimer: Most of these calls were adapted from different StackExchange questions. There are really lots of very helpful posts. Thanks to the community!

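Superfast subset of a tabulated text file (it might also be gzipped!):
[z]grep -E '<regex pattern>' <from file> > <to file>
Use zgrep instead of grep if <from file> is gzipped. <regex pattern> can include your separators, and quoting it keeps the shell from interpreting special characters. If <from file> is tab-separated, use -P (Perl-compatible regular expressions, a GNU grep option) instead of -E so that you can write the tab as \t. zgrep hands its options on to grep, so -P should work there too as long as the underlying grep supports it.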
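As a rough sketch of the subsetting call, the following builds a tiny tab-separated file and keeps only the rows whose second column is NOUN. The file names, the column layout and the pattern are made up for illustration. Instead of -P, it puts a literal tab into the pattern via a shell variable, which should also work where grep has no -P option.

```shell
# Hypothetical sample data: word, part-of-speech tag, frequency (tab-separated).
tmpdir=$(mktemp -d)
printf 'house\tNOUN\t120\nrun\tVERB\t95\ntree\tNOUN\t40\n' > "$tmpdir/words.tsv"

# Store a literal tab in a variable so the pattern also works with plain -E.
tab=$(printf '\t')

# Match rows whose second column is exactly NOUN:
# start of line, first column (no tabs), tab, NOUN, tab.
pattern="^[^${tab}]*${tab}NOUN${tab}"

# Subset the plain-text file.
grep -E "$pattern" "$tmpdir/words.tsv" > "$tmpdir/nouns.tsv"

# Same subset straight from a gzipped copy, without unpacking it first.
gzip -c "$tmpdir/words.tsv" > "$tmpdir/words.tsv.gz"
zgrep -E "$pattern" "$tmpdir/words.tsv.gz" > "$tmpdir/nouns_from_gz.tsv"

cat "$tmpdir/nouns.tsv"
```

Anchoring the pattern to a specific column like this avoids accidental matches in other columns, which is easy to overlook when the pattern is just a bare word.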

Superfast extraction of the first column from a tab-separated file:
cut -f1 <from file> > <to file>
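cut splits on tabs by default. Just replace <from file> with * if you want to extract the first column from every file in the current directory and write them all into the same <to file>. In that case, put <to file> somewhere else so that it is not picked up by the * itself.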
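Here is a small sketch of the column extraction, for one file and for several files at once. The directory layout and file names are invented for the example, and the last call shows the -d option for files that use a separator other than a tab.

```shell
# Hypothetical input: two small tab-separated files in their own directory.
tmpdir=$(mktemp -d)
mkdir "$tmpdir/in"
printf 'a\t1\nb\t2\n' > "$tmpdir/in/part1.tsv"
printf 'c\t3\n' > "$tmpdir/in/part2.tsv"

# First column of a single file.
cut -f1 "$tmpdir/in/part1.tsv" > "$tmpdir/first_col.txt"

# First column of every file in the directory, collected in one output file.
# The output file lives outside the input directory, so the glob cannot match it.
cut -f1 "$tmpdir/in"/* > "$tmpdir/all_first_cols.txt"

# For a comma-separated file, set the delimiter with -d and pick column 2.
printf 'x,10\ny,20\n' > "$tmpdir/data.csv"
cut -d, -f2 "$tmpdir/data.csv" > "$tmpdir/second_col.txt"

cat "$tmpdir/all_first_cols.txt"
```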

Write unique rows of a file into a new file:
sort <from file> | uniq > <to file>
Yes, there is no “e” after uniq! You have to sort <from file> first because uniq only drops duplicate lines that sit directly next to each other.
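To see why the sorting step matters, here is a minimal sketch with a made-up five-line file in which the duplicates are not adjacent. It also shows sort -u, which for plain text like this usually gives the same result in a single command.

```shell
# Hypothetical input: duplicates exist, but never on neighbouring lines.
tmpdir=$(mktemp -d)
printf 'b\na\nb\na\nc\n' > "$tmpdir/rows.txt"

# uniq on its own removes nothing here, because no duplicate lines are adjacent.
uniq "$tmpdir/rows.txt" > "$tmpdir/uniq_only.txt"

# Sorting first puts identical lines next to each other, so uniq can drop them.
sort "$tmpdir/rows.txt" | uniq > "$tmpdir/unique_rows.txt"

# sort -u sorts and de-duplicates in one step.
sort -u "$tmpdir/rows.txt" > "$tmpdir/unique_rows_short.txt"

cat "$tmpdir/unique_rows.txt"
```

Note that both variants return the rows in sorted order, not in the order in which they first appeared in the file.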

Get a list of files from a directory really fast – this has to be inserted into an R script to get a list of files:
files <- system(paste("ls -f", shQuote(source.path)), intern = TRUE)
files <- setdiff(files, c(".", ".."))  # ls -f also lists . and ..
I used this to get a list of 1.6 million file names. It was A LOT faster than the built-in R function dir(), presumably because ls -f does not sort its output.
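On the shell side, the listing call behaves roughly as sketched below. The directory and file names are placeholders. The point is that ls -f returns entries in directory order and includes . and .., so those two entries are filtered out before the list is used.

```shell
# Hypothetical directory with a few files.
tmpdir=$(mktemp -d)
touch "$tmpdir/b.txt" "$tmpdir/a.txt" "$tmpdir/c.txt"

# Separate temporary file for the listing, kept outside the listed directory.
listfile=$(mktemp)

# ls -f skips sorting and also prints hidden entries, including . and ..
ls -f "$tmpdir"

# Drop the . and .. entries and store the remaining names, still unsorted.
ls -f "$tmpdir" | grep -v -E '^\.\.?$' > "$listfile"

# Sort afterwards only if the order actually matters.
sort "$listfile"
```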

To be continued.
