Extracting upstream regions of a RefSeq human gene list in R using Bioconductor

[This article was first published on Computational Biology Blog in fasta format, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Suppose that you want to do local mapping of upstream regions of a given RefSeq IDs in a particular genome in R using Bioconductor. Download the script here.

In this case, you may take a look at the Bioconductor AnnotationData Packages here: http://www.bioconductor.org/packages/release/data/annotation/

The goal of this post is that for example I have the following RefSeq IDs and want to extract 250 bases upstream of each gene in a single list with another useful information such the entrez.id, the symbol and the gene description.

# RefSeqs IDs:
gene.list.refseq <- c("NM_003588","NM_001145436", "NM_001135188","NM_020760","NM_173362", "NM_198393","NM_022736","NM_025074","NM_033449","NM_015726", "NM_022110","NM_016478","NM_020634","NM_002291","NM_000418", "NM_001862","NM_017752","NM_006591","NM_000124","NM_144610") # How many bases upstream of each gene:     bases.upstream <- 250

Before starting, please download the following packages in R:

       source("http://bioconductor.org/biocLite.R")
       biocLite("BSgenome.Hsapiens.UCSC.hg19")                              
       biocLite("Biostrings")
       biocLite("org.Hs.eg.db")

Load the function using:

source("extract.five.utr.sequence.R")

And finally make the computations:

output.sequences <- extract.five.utr.sequence(gene.list.refseq,bases.upstream)

CODE:

<code style="color: white; word-wrap: normal;"># ******************************************************************************<br />#       FUNCTION: extract.five.utr.sequence | 29/JULY/2012 | BENJAMIN TOVAR<br /># ******************************************************************************<br />extract.five.utr.sequence <- function(gene.list.refseq,<br />                                      number.bases.upstream=1000){<br />    <br />    <br />    # *****************************************************************<br />    # RUNNING NOTES: Please download this packages from Bioconductor<br />    # http://www.bioconductor.org/packages/release/data/annotation/<br />    # ***************************************************************** <br />    #   source("http://bioconductor.org/biocLite.R")<br />    #   biocLite("BSgenome.Hsapiens.UCSC.hg19")                                <br />    #   biocLite("Biostrings")<br />    #   biocLite("org.Hs.eg.db")<br />                                      <br />    # *****************************************************************<br />    # RUNNING EXAMPLE<br />    # *****************************************************************                              <br />    #<br />    ## Extract 250 bases upstream of each gene in gene.list.refseq <br />    #<br />    # gene.list.refseq <- c("NM_003588","NM_001145436", "NM_001135188","NM_020760","NM_173362",   <br />    #                    "NM_198393","NM_022736","NM_025074","NM_033449","NM_015726",   <br />    #                    "NM_022110","NM_016478","NM_020634","NM_002291","NM_000418",   <br />    #                    "NM_001862","NM_017752","NM_006591","NM_000124","NM_144610") <br />    #<br />    # bases.upstream <- 250<br />    #  <br />    # result <- extract.five.utr.sequence(gene.list.refseq,bases.upstream)                             <br />                             <br />    # *****************************************************************<br />    # LOAD THE LIBRARIES<br />    # ***************************************************************** <br />    cat("Loading libraries",date(),"\n")   <br />    # human genome DNA sequences<br />        library(BSgenome.Hsapiens.UCSC.hg19) <br />    # human genome wide annotations <br />        library(org.Hs.eg.db) <br />    # load IDs, symbol and gene descriptions<br />        refseq.id <- toTable(org.Hs.egREFSEQ)<br />        symbol <- toTable(org.Hs.egSYMBOL)<br />        gene.description <- toTable(org.Hs.egGENENAME)<br />    <br />    # *****************************************************************<br />    # LOAD THE UPSTREAM SEQUENCES<br />    # *****************************************************************<br /><br />    if(number.bases.upstream <= 1000){<br />        # load the 1000 bases upstream of all genes in the human genome <br />        # in the BSgenome.Hsapiens.UCSC.hg19 package<br />        upstream <- Hsapiens$upstream1000<br />    }<br />    <br />    if(number.bases.upstream > 1000 && number.bases.upstream <= 2000){<br />        # load the 1000 bases upstream of all genes in the human genome <br />        # in the BSgenome.Hsapiens.UCSC.hg19 package<br />        upstream <- Hsapiens$upstream2000<br />    }<br />    <br />    if(number.bases.upstream > 2000 && number.bases.upstream <= 5000){<br />        # load the 1000 bases upstream of all genes in the human genome <br />        # in the BSgenome.Hsapiens.UCSC.hg19 package<br />        upstream <- Hsapiens$upstream5000<br />    }        <br />    <br />    if(number.bases.upstream > 5000){<br />        cat("ERROR: The number of bases upstream of 5'UTR is too large (max value = 5000)\n")<br />        return(0);<br />    } <br />    <br />    if(number.bases.upstream < 1){<br />        cat("ERROR: The number of bases upstream of 5'UTR cannot be less than 1 nucleotide (min value = 1)\n")<br />        return(0);<br />    }                   <br />        <br />    # extract the names of each gene<br />    names.genes.upstream <- names(upstream)<br /><br />    # extract the refseq of each gene  <br />    refseq.upstream.id <- vector("character",length(names.genes.upstream))<br />    for(i in 1:length(names.genes.upstream)){<br />        refseq.upstream.id[i] <- gsub(", ","_",toString(unlist(strsplit(names.genes.upstream[i],<br />                                      "\\_"))[c language="(1,2)"][/c],sep=""))<br />    }    <br />    names(refseq.upstream.id) <- 1:length(refseq.upstream.id)<br /><br />    # *****************************************************************<br />    # CREATE OUTPUT OBJECT<br />    # *****************************************************************    <br />    output.params <- c("entrez.id","symbol","gene.description","five.UTR.sequence")<br />                        <br />    output.list <- list()<br />    for(i in 1:length(gene.list.refseq)){<br />        output.list[[i]] <- list()<br />        for(j in 1:length(output.params)){<br />            output.list[[i]][[j]] <- list()<br />        }<br />        names(output.list[[i]]) <- output.params<br />    }<br />    names(output.list) <- gene.list.refseq<br /><br />    # *****************************************************************<br />    # LET THE COMPUTER DECIDE<br />    # *****************************************************************     <br />    cat("Extracting data",date(),"\n")<br />    <br />    for(i in 1:length(gene.list.refseq)){<br />        cat(rep(".",1))<br /><br />        # extract the entrez.id of each gene in geneList <br />        output.list[[i]]$entrez.id <- refseq.id[which(refseq.id$accession==gene.list.refseq[i]),]$gene_id<br /><br />        # extract the symbol of each gene in geneList <br />        output.list[[i]]$symbol <- symbol[which(symbol==output.list[[i]]$entrez.id),]$symbol<br />        <br />        # extract the gene description of each gene in geneList <br />        output.list[[i]]$gene.description <- gene.description[which(gene.description==output.list[[i]]$entrez.id),]$gene_name  <br />        <br />        # Return the index of the target genes<br />        index.target.gene <- sequence <- NA;<br />        index.target.gene <- as.numeric(names(refseq.upstream.id[which(refseq.upstream.id==gene.list.refseq[i])]))<br />        <br />        # NOTE: some genes have repeated RefSeq.id which corresponds to the same gene in other Chromosomes.<br />        # by using index.target.gene[1] we are selecting only the first RefSeq entry.<br />        sequence <- upstream[index.target.gene[1]]<br />        sequence.length <- nchar(sequence)<br />        start.position <- ((sequence.length-number.bases.upstream)+1)<br />        sequence <- substring(toString(sequence),1:sequence.length,1:sequence.length)<br />        sequence <- sequence[start.position:sequence.length]<br />        output.list[[i]]$five.UTR.sequence <- sequence<br />    }<br />    cat("\n")<br />    cat("Computations DONE",date(),"\n")<br />    return(output.list)<br />}<br /></code>

Benjamin.

To leave a comment for the author, please follow the link and comment on their blog: Computational Biology Blog in fasta format.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)