My 15 practical tips for a bioinformatician

[This article was first published on One Tip Per Day, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.


Tips below are based on the lessons I learnt from making mistakes during my years of research. It’s purely personal opinion. Order doesn’t mean anything. If you think I should include something else, please comment below.
  1. Always set a seed number when you run tools with random option, e.g. bedtools shuffle, random etc.; You (or your boss in a day) want your work reproducible. 
  2. Set your own temporary folder (via –tmp, $TMPDIR etc., depending on your program). By default, many tools, e.g. sort, use the system /tmp as temporary folder, which may have limited quote that is not enough for your big NGS data. 
  3. Always use a valid name for your variables, column and rows of data frame. Otherwise, it can bring up unexpected problem, e.g. a ‘-‘ in the name will be transferred to ‘.’ in R unless you specify check.names=F.
  4. Always make a README file for the folder of your data; For why and how, read this: http://stackoverflow.com/questions/2304863/how-to-write-a-good-readme
  5. Always comment your code properly, for yourself and for others, as you very likely will read your ugly code again.
  6. Always backup your code timely, using github, svn, Time Machine, or simply copy/paste whatever.
  7. Always clean up the intermediate or unnecessary data, as you can easily shock your boss and yourself by generating so much data (and perhaps most of them are useless). 
  8. Don’t save into *.sam if you can use *.bam. Always zip your fastq (and other large plain files) as much as you. This applies to other file format if you can use the compressed one. As you cannot imagine how much data (aka “digital garbage”) you will generate soon.
  9. Using parallel as much as you can, e.g. using “LC_ALL=C sort –parallel=24 –buffer-size=5G” for sorting (https://www.biostars.org/p/66927/), as multi-core CPU/GPU is common nowaday.
  10. When a project is completed, remember to clean up your project folder, incl. removing the unnecessary code/data/intermediate files, and burn a CD for the project. You never know when you, your boss or your collaborators will need the data again;
  11. Make your code sustainable as possible as you can. Remember the 3 major features of OOP: Inheritance, Encapsulation, Polymorphism. (URL)
  12. When you learn some tips from others by Google search, remember to list the URL for future reference and also for acknowledging others’ credit. This applies to this post, of course 🙂
  13. Keep learning, otherwise you will be out soon. Just like the rapid development of NGS techniques, computing skills are also evolving quickly. Always catch up with the new skills/tools/techniques.
  14. When you learn some tips from others, remember to share something you learned to the community as well, as that’s how the community grows healthily. 
  15. Last but not least, stand up and move around after sitting for 1-2 hours. This is especially important for us bioinformaticians who usually sit in front of computer for hours. Only good health can last your career long. More reading: https://www.washingtonpost.com/news/wonk/wp/2015/06/02/medical-researchers-have-figured-out-how-much-time-is-okay-to-spend-sitting-each-day/

To leave a comment for the author, please follow the link and comment on their blog: One Tip Per Day.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)