Bioinformatics Tutorial with Exercises in R (part 1)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
 Bioinformatics is an interdisciplinary field of study that combines the field of biology with computer science to understand biological data. Bioinformatics is generally used in laboratories as an initial or final step to get the information. This information can subsequently be utilized for the wet lab practices. However, it can also be used as a standalone computational method to get the answers to relevant biological questions. Bioinformatics has greatly accelerated the discovery process and is used in both academics as well as industry to understand genomics & proteomics data, determine the protein structure, RNA structures, analyzing sequence data for molecular biological work, gene identification, to name a few applications. Over the years it has grown into a complex field of discipline requiring specific education and professional training.
Bioinformatics is an interdisciplinary field of study that combines the field of biology with computer science to understand biological data. Bioinformatics is generally used in laboratories as an initial or final step to get the information. This information can subsequently be utilized for the wet lab practices. However, it can also be used as a standalone computational method to get the answers to relevant biological questions. Bioinformatics has greatly accelerated the discovery process and is used in both academics as well as industry to understand genomics & proteomics data, determine the protein structure, RNA structures, analyzing sequence data for molecular biological work, gene identification, to name a few applications. Over the years it has grown into a complex field of discipline requiring specific education and professional training.
We begin with a simple introduction in sequence data. This tutorial assumes that readers have basic knowledge about the central dogma and biological molecules such as protein, DNA and RNA; and also understand that they are biopolymers naturally or synthetically produced for a specific role. Wikipedia is a great resource to get the basic introduction on these. However, I will also write a few sentences in my exercises to refresh the knowledge of the readers. Numerous programming languages have been used for the development of Bioinformatics, ranging from low end languages such as Java, C/C++ to high end scripting languages such as perl, python and the R statistical language. Bioinformatics also involves extensive database management implementation for storage, query and updating the sequence and numerical data. Most of the Bioinformatics software can be implemented either on a Windows, Mac or Linux platform. This tutorial also assumes that the reader has some understanding about R programming, RStudio and installation of packages. We will use numerous packages both common as well as strictly developed for Bioinformatics.
The open source community known as Bioconductor specifically develops the Bioinformatics tools using R for the analysis and comprehension of high-throughput genomic data. It boasts to have two releases each year, 1296 software packages, and an active user community. It has packages developed for application ranging from basic sequence alignment to recent years’ NextGen sequencing machines. We hope that we will able to discuss more about it in later chapters, when not only readers would be able to appreciate its relevance but also utilize it in their daily examples. Let us begin with some little steps by installing a few packages on our machines. I personally prefer to begin with a fresh installation to make sure their are no conflicts from the preinstalled packages.
To begin with let us first install a package known as Biostrings by running the following command on your RStudio.
source("https://bioconductor.org/biocLite.R")
biocLite("Biostrings")
This would take up to 5 minutes depending upon your internet connection and computer. Note that typing install.packages(“biostrings”) in RStudio might not result into success because of version issues. As usual use  library(Biostrings) command to load the package.
Deoxyribonucleic acid, or DNA, stores biological information which is expressed in form of RNA inside the cells. The two antiparallel DNA strands are termed polynucleotides and they are composed of simpler monomer units called nucleotides. Each nucleotide is composed of one of four nitrogen-containing nucleobases—either cytosine (C), guanine (G), adenine (A), or thymine (T). The randomly arranged nucleotide finally write up the code of life in the form of strict vocabulary.
Answers to the exercises are available here.
If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.
Exercise 1
Using the following command, create a list of DNA exactly the same as the one given below, then list the outcome by writing the vector myDNA.
myDNA <- c("ATGTTGCATTCATTAATTAAGAACGACCCAATACA") 
 myDNA 
Exercise 2
In recent years there has been exponential advancement in the field of DNA sequencing. High throughput methods have been developed to speed up the sequencing project. Still the basic DNA sequencing is a technology that reads each nucleotide step-by-step by chemical methods to decipher what order of letters A, T, G and C were placed that resulted into the specific DNA sequence. However, for this problem we have given you a short result from sequencing data. In a typical laboratory sequencing results, some of the DNA sequence could look something like this “CTGATTT-GATGGTC-NAT” where apart from ATGC we could see some dashes (skipped) and N (unknown) nucleotide. Copy these into mySequencing and print the result. Use the length() command to find out the length. Which of the following is the length
A [1] 1
B [2] 19 
C [1] 0 
Exercise 3
Now use the package specific command called  DNAString  to copy the seq
 myDNASeq <- DNAString("CTGATTT-GATGGTC-NAT") 
Now use  length  command to find the length of the myDNASeq. Compare the difference between  mySequencing  and  myDNASeq .
Exercise 4
Use the  class()  command to analyze the datatype of the myDNA and myDNASeq. Which of the following represents the result
A
 class(myDNASeq)
[1] "DNAString"
attr(,"package")
[1] "Biostrings"
> class(myDNA)
[1] "integr"
B
> class(myDNASeq)
[1] "DNAString"
attr(,"package")
[1] "Biostrings"
> class(myDNA)
[1] "character"
C None of the above
D Both
Exercise 5
In this problem let us attempt to paste the  myDNA  and  myDNASeq  sequences using  paste  and analyse the results. Try to understand the syntax of  paste  command using  ?paste  command. Copy the result into  pastedDNA  and print it by typing  pastedDNA  .
Try to analyze the  pastedDNA  using  class  command.
Exercise 6
Did you notice that final result was a  character class and not the  Biostring  as expected? This would make  pastedDNA  not usable for biostring for any purpose. What happens is that  Biostrings  introduces a new data structure hierarchy which is different than the  vector  datatype of R. It has few sequence constructors such as  DNAString(),  RNAString(),  AAString()  for type of biomolecules (DNA, RNA and Amino Acids). This results into the fact that  Biostrings  objects cannot be compared with R strings ( myDNA == myDNASeq  is an invalid command). You will always end up with  FALSE  upon attempting to compare.
Next we would look at some basic transformations in Biostrings that can be implemented on DNA data. These transformations are  reverse() ,  complement() ,  reverseComplement()  and  translate() . Run the  myDNASeq <- DNAString("CTGATTT-GATGGTC-NAT")  again. Run the  complement(myDNASeq) . Analyze the data.
Exercise 7
Did you find out what as happened in previous problem? This just created the complement of each nucleotide. The DNA usually exists as a double helix with both strands running antiparallel. Each base of ATGC is paired with some base on complementary strand. A has preference for T (and vise versa), G has preference for C (and vise versa). So the complementary strand is going to have the nucleotide that is pairs preferentially. However, the unknown nucleotide N gets written as N because the sequencer could not tell what it was.
In next problem run  reverse(myDNASeq) 
Exercise 8
Did you notice what has happened? Did nucletide in DNA sequences got read from back to front and not front to back? Or not. Each complementary strand is usually written from back to front because the complementary strands are anti-parallel. This because the sign of each strand is opposite. This is basically because the strand which basically is a polymer of nucleotide either has 5’ or 3’ end (more about it later). So the +ve strand runs from 5’-3’ and -ve strand runs from 3’-5’. In the next problem run  reverseComplement(myDNASeq) 
Exercise 9
Did you notice what happened. The DNA sequences not only got complementary but also reversed. This is how double stranded DNA exists in nature.
Run  alphabetFrequency(myDNASeq) 
Exercise 10
The values in problem 9 gave you the frequency of occurrence of each nucleotide in this DNA. This is an important thing to know when analyzing the DNA.
Finally run translate(myDNASeq) . This would yield the hypothetical protein sequence that myDNASeq would produce. Afterall that is one of the important role of DNA, to code for the protein. What did you get?
In our next exercises we will work little more with Biostrings to analyze the DNA at little more. Happy learning.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
