R is great, and you can do a LOT OF stuff with it.
However, sometimes you want to do really basic stuff with huge files, or with a lot of files. At work, I have to do that a lot because I mostly deal with language data that often needs some pre-processing.
Most of these operations run much, much faster at the level of the operating system (preferably in Bash on Linux or on a Unix such as macOS). And since R tries to load everything into working memory, these commands can also help you handle files that are too big for your RAM.
This blog post is a kind of cheat sheet to help me remember the Bash commands that have proven most useful to me. (Most of them are probably quite basic for an advanced Linux or Unix user.)
Disclaimer: Most of these calls were adapted from different StackExchange questions. There are really lots of very helpful posts. Thanks to the community!
Superfast subset of a tabulated text file (it might also be gzipped!):
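One way this could look, as a minimal sketch. It assumes that "subset" means keeping only the rows whose value in some column matches a condition. The file name `data.tsv.gz`, the column layout, and the filter value `de` are made up for illustration.

```shell
# Hypothetical sample data: a small tab-separated file, gzipped.
tmpdir=$(mktemp -d)
printf 'word\tlang\tfreq\nhaus\tde\t12\nhouse\ten\t30\nbaum\tde\t7\n' | gzip > "$tmpdir/data.tsv.gz"

# Decompress to stdout, then keep the header line plus all rows
# whose second column equals "de".
gzip -cd "$tmpdir/data.tsv.gz" | awk -F'\t' 'NR == 1 || $2 == "de"' > "$tmpdir/subset.tsv"

# Count the lines that survived the filter (tr strips padding some wc versions add).
subset_rows=$(wc -l < "$tmpdir/subset.tsv" | tr -d ' ')
```

For a file that is not gzipped, the `gzip -cd` step can be dropped and the file name passed to `awk` directly. Because `awk` reads the input line by line, the file never has to fit into RAM.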
Superfast extraction of the first column from a tab-separated file:
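A sketch of how this is commonly done with `cut`, which splits on tabs by default. The file names `data.tsv` and `first_column.txt` are placeholders.

```shell
# Hypothetical sample data: two rows, three tab-separated columns.
tmpdir=$(mktemp -d)
printf 'haus\tde\t12\nhouse\ten\t30\n' > "$tmpdir/data.tsv"

# -f1 selects the first field; the tab is cut's default delimiter.
cut -f1 "$tmpdir/data.tsv" > "$tmpdir/first_column.txt"

first_column=$(cat "$tmpdir/first_column.txt")
```

For a different separator, `cut -d',' -f1` would do the same for a comma-separated file, as long as no field contains a quoted comma.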
Write unique rows of a file into a new file:
Yes, there is no “e” after uniq! You have to sort the file first, because uniq only removes duplicate lines that are next to each other.
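A minimal sketch of the sort-then-uniq pattern, with made-up file names `rows.txt` and `unique_rows.txt`:

```shell
# Hypothetical sample data with duplicates that are not adjacent.
tmpdir=$(mktemp -d)
printf 'b\na\nb\na\nc\n' > "$tmpdir/rows.txt"

# sort brings identical lines together; uniq then collapses each run to one line.
sort "$tmpdir/rows.txt" | uniq > "$tmpdir/unique_rows.txt"

unique_rows=$(cat "$tmpdir/unique_rows.txt")
```

`sort -u` gives the same result in a single step. The pipe through `uniq` is mainly worth it when you want its extra options, such as `uniq -c` to count how often each row occurs.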
Get a list of files from a directory really fast. This one goes into an R script:
# source.path is the directory to list; shQuote() protects paths that contain spaces
files <- system(paste("ls -f", shQuote(source.path)), intern = TRUE)
I used this to get a list of 1.6 million file names. It was A LOT faster than the built-in R function dir().
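One thing to keep in mind: `-f` turns off sorting (which is where the speed comes from), and it also turns on `-a`, so the result includes `.` and `..` as well as hidden files. A sketch of how the two dot entries could be filtered out on the shell side, using a throwaway directory as a stand-in for the real one:

```shell
# Hypothetical directory with two files.
tmpdir=$(mktemp -d)
touch "$tmpdir/a.txt" "$tmpdir/b.txt"

# List without sorting, then drop the "." and ".." entries.
# -F treats the patterns as fixed strings, -x matches whole lines only.
file_list=$(ls -f "$tmpdir" | grep -vxF -e . -e ..)

file_count=$(printf '%s\n' "$file_list" | wc -l | tr -d ' ')
```

The order of the output is whatever the file system returns, so sort it afterwards if the order matters.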
To be continued.