(more on) Pattern Matching for Transcription Factor Binding Sites

February 2, 2011
By

(This article was first published on rforcancer, and kindly contributed to R-bloggers)

I have published some initial script scribblings on this task about a week ago. After another week I'm posting some better formed and annotated code. The Biostrings and BSGenomes packages are new to me and I've gone through many many iterations and experimenations to arrive at the as yet incomplete code below. Whilst the packages seem powerful I'm yet to get the hang of the object structures and methods for the various DNAStrings, DNAStringSets, Views etc etc..-- and work out how best to functionally program with them. Indeed my effort to avoid simple loops has probably caused me all sorts of problems....ho hum.

1. The task is take a list of promoters (human upstream 1000 from BSGenomes).

2. Annotate these with accessible HGNC symbols using biomaRt

3. Create a list of DNAStrings which are putative transcription factor binding sites

4. Match the list of patterns against the list of promoters.

5. return the matches in an intelligible form.

Presently point 5 still needs a lot of work... And I need to do more work on different mismatch and insertion (indel) parameters for the match functions. 

## I am firstly downloading upstream sequences using BSGenome and the latest human sequence build ##
library(BSgenome)
library(BSgenome.Hsapiens.UCSC.hg19)
ls('package:BSgenome.Hsapiens.UCSC.hg19')
## this creates subject a so-called DNAStringSet of upstream gene sequences i.e. promoters##
subject=Hsapiens$upstream1000
 
## I use biomaRt here to pull out the annotation data for the refseq DNAs that are extracted from
## the BSGenome upstream sequences- especially the familiar HGNC symbol
 
mart <- useMart('ensembl')
mart<-useDataset("hsapiens_gene_ensembl",mart)
atts=c('chromosome_name', 'start_position', 'strand', 'hgnc_symbol', 'ens_hs_gene', 'refseq_dna')
 
## this gobbledeygook just extracts a name string splits it into a list 
## then gets the first element (the refseq) back and unlists it to vector nm
nm=unlist(lapply(strsplit(names(subject), split='_up'), function(x)x[1]))
gb= getBM(attributes=atts, filters='refseq_dna', values=nm, mart=mart)
 
## NB Lookout! biomaRt neither returns answers (i.e rows) for all the query values (refseqs)
## nor does it return them in the same order as the query (they are alphabetic) 
## -- so you cant just reorder results and use the same index
 
## match loks for a first match this gets all matches OFC
match.all=function(query, set)
	{
	wh=vector()
	for(i in query) 
		wh=c(wh, which(set==i))
	return(wh)	
	}
 
## these are genes of particular interest TO ME- ie. gene symbols
my_genes=c('PROX1', 'FOXC2', 'SOX18', 'ANPEP', 'NR2F2')
 
## so for the right index we go hgnc -> refseq -> position of refseq in BioStringSet
my_refseqs=gb$refseq[match.all(my_genes, gb$hgnc_symbol)]
my_refseqs_index=match.all(my_refseqs, nm)
 
## I make a smaller subest of interesting gene promoters 'subject2'
subject2=subject[my_refseqs_index]
 
 
## the query string and its reverse complement - for now: will experiment later - ##
CMARE = DNAString('TGCTGACGTCAGCA')
CMARE_complement = complement(CMARE)
TRE= DNAString('TGACTCA')
TRE_complement= complement(TRE)
CRE= DNAString('TGACGTCA')
CRE_complement= complement(TRE)
TMARE=DNAString('TGCTGACTCAGCA')
TMARE_complement=complement(TMARE)
 
## this is a Degenerate MARE where IUPAC DNA code N is any base
DMARE=DNAString('TGCNNNNNNNNGCA')
DMARE_complement=complement(DMARE)
 
## I am using the IUPAC DNA code 'W' for A/T there are 3 'W' but this is just an approx guess from the lit.
## hence the name ATMARE
ATMARE=DNAString('WWWTGCTGAC')
ATMARE_complement=complement(ATMARE)
 
## make  big NAMED list of these DNAStrings
DNASTRLIST=list(CMARE=CMARE, CMARE_complement=CMARE_complement, 
TRE=TRE, TRE_complement=TRE_complement,
CRE=CRE, CRE_complement=CRE_complement,
TMARE=TMARE, TMARE_complement=TMARE_complement,
DMARE=DMARE, DMARE_complement=DMARE_complement,
ATMARE=ATMARE, ATMARE_complement=ATMARE_complement)
 
 
 
 
 
## extract the accession name with a bit of messing about ##
nm=names(subject)
nm=strsplit(nm, '_up')
nm=unlist(lapply(nm, function(x)x[1]))
 
 
matching=function(myDNAString, subject, max.mismatch=2)
{
 
 
vmatch=vmatchPattern(myDNAString, subject, max.mismatch=max.mismatch)
wh=which(countIndex(vmatch)>0)
 
for(i in wh)
	{
	if(exists('match_list'))
		{
		match_list=c(match_list, list(Views(subject[[i]], vmatch[[i]])))
		names(match_list)[length(match_list)]=names(subject[i])
		}
	if(!exists('match_list'))
		{
		match_list= list(Views(subject[[i]], vmatch[[i]]))
		names(match_list)[length(match_list)]=names(subject[i])
		}
 
	}
return(match_list)
}
 
 
lapply(DNASTRLIST, function(eachMARE) matching(eachMARE, subject2))

Created by Pretty R at inside-R.org

 

This code gives a list of results that looks something like this (attenuated) :

> test[3]
$TRE
$TRE$CMARE
MIndex object of length 8
 
$TRE$CMARE_complement
MIndex object of length 8
 
$TRE$TRE
MIndex object of length 8
 
$TRE$TRE_complement
MIndex object of length 8
 
$TRE$CRE
MIndex object of length 8
 
$TRE$CRE_complement
MIndex object of length 8
 
$TRE$TMARE
MIndex object of length 8
 
$TRE$TMARE_complement
MIndex object of length 8
 
$TRE$DMARE
MIndex object of length 8
 
$TRE$DMARE_complement
MIndex object of length 8
 
$TRE$ATMARE
MIndex object of length 8
 
$TRE$ATMARE_complement
MIndex object of length 8
 
$TRE$`NM_002763_up_1000_chr1_214160860_f chr1:214160860-214161859`
  Views on a 1000-letter DNAString subject
subject: CGGCGAACTCGATCAGCTGTATCTGGGAAATGAAAAAAGAAAAAGAAAAAAAAAA...GTCCCACTGCTCCCTGCACCGCGTAAGTATCTTCTTCTTCCCCTCGTGAGTCCCT
views:
    start end width
[1]   136 142     7 [TGGCTCG]
[2]   333 339     7 [TGAATCC]
[3]   420 426     7 [TGTCTTA]
[4]   789 795     7 [TGTCGCA]
[5]   797 803     7 [AGACTCA]
[6]   992 998     7 [TGAGTCC]
 
$TRE$`NM_005251_up_1000_chr16_86599857_f chr16:86599857-86600856`
  Views on a 1000-letter DNAString subject
subject: CCTCTAAAACCTCGATAGGTTATCCTTGACGACCCCGAGCCTGGAAACTCCCTGT...GCCAGGAGCCCGGGGCCGCCCCTCCCGCTCCCCTCCTCTCCCCCTCTGGCTCTCT
views:
     start end width
 [1]    59  65     7 [TGATTAA]
 [2]    71  77     7 [TGATTAA]
 [3]   131 137     7 [TGAATCC]
 [4]   184 190     7 [TCACTCA]
 [5]   204 210     7 [TGTCCCA]
 [6]   249 255     7 [TTACTCT]
 [7]   485 491     7 [CGACTCC]
 [8]   510 516     7 [TGGCTCA]
 [9]   622 628     7 [GGGCTCA]
[10]   658 664     7 [TGACCCT]
[11]   992 998     7 [TGGCTCT]
 
$TRE$`NM_018419_up_1000_chr20_62680980_r chr20:62680980-62681979`
  Views on a 1000-letter DNAString subject
subject: CCTTCTCCAGTAGAACCATGCGTGGCAGTGGGAGGGGAGGACCTCCAGAGGGAGA...GCCCTGAGCCGCTATATCTGGGCGCCCGCCCGGCCCGAGGCCACCGCCGTCCCCA
views:
     start end width
 [1]   186 192     7 [GGACCCA]
 [2]   202 208     7 [TGCCTCT]
 [3]   374 380     7 [TGTGTCA]
 [4]   431 437     7 [TGTCCCA]
 [5]   443 449     7 [GCACTCA]
 [6]   469 475     7 [TTACACA]
 [7]   509 515     7 [GGACTCA]
 [8]   567 573     7 [GGACTTA]
 [9]   712 718     7 [CGACTCC]
[10]   849 855     7 [GGCCTCA]
 
and so on and on.....

Created by Pretty R at inside-R.org

To leave a comment for the author, please follow the link and comment on his blog: rforcancer.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.