Tips and Tools you may need for working on BIG data

[This article was first published on One Tip Per Day, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Nowadays everyone is talking about big data. As a genomic scientist, I could feel hungry of a collection of tools more specialized for the mediate-to-big data we deal everyday.

Here are some tips I found useful when getting, processing or visualizing large data set:

1. How to download data faster than wget?

We can use wget to download the data to local disk. If it’s large, we can download with other faster alternative, such as axel, aria2.

2. Process the data in parallel with hidden option in GNU commands

  • If you have many many files to process, and they are independent, you can process them in a parallel manner. GNU has a command called parallel. Lindenbaum Pierre wrote a nice notebook for “GNU Parallel in Bioinformatics“, worthy to read. 
  • Many commonly used commands also have a hidden option to run in a parallel way. For example, GNU sort command has –parallel=N to set it with multiple cores. 
  • You can set -F when doing grep -f on a large seed file. People also suggest to set export LC_ALL=C line to get X2 speed.

3. In R, there are several must-have tips for large data, e.g. data.table
  • If using read.table(), set stringsAsFactors = F and colClass. See the example here
  • use fread(), not read.table(). Some more details here. But so far, fread() does not support reading *.gz file directly. Use fread(‘zcat file.gz’)
  • use data.table, rather data.frame. Learn the difference online here.
  • There is a nice View for how to process data in parallel in R:, but I have not followed them practically. Hopefully there will be some easy tutorials there, or I become less procrastinated to learn some of them … At least I can start with foreach()
4. How to open scatter plot with too many points in Illustrator?

This is really a problem for me as we usually have a figure with >30k dots (i.e. each dot is a gene). Even though they are highly overlapping each other, opening it in Illustrator is extremely slow. Here is a tip:
From that, probably a better idea is to “compress” the data before plotting it, such as merge the overlapped ones if they overlapped some %.
or this one:
or this one:

Still working on the post…

To leave a comment for the author, please follow the link and comment on their blog: One Tip Per Day. offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)