# Pattern Matching for Transcription Factor Binding Sites

January 24, 2011
By

(This article was first published on rforcancer, and kindly contributed to R-bloggers)

I'm trying to search for binding sites for the transcription factor MAF (i.e. TFBS for MAF) in the promoter regions of various genes. I initially started out looking at a precomputed database of binding sites MAPPER. However the TFBS models that have been used from TRANSFAC or JASPAR are quite unlike the human binding sites  shown in the recent literature (the sites in the recent literature are palindromic sequences called MAREs).

Unfortunately the palindromic sites shown in the literature do not have alignment or position-weight-matrices (PWM) and they are quite variable. It would appear that these binding sites have not been derived by an unbiased screen i.e. ChIP with an antibody against a genome. Rather they have mutated a known site in a single gene promoter and tested which ones still work. Jeez!

So I cannot create my own alignments or position weight matrices. Instead I will try some cruder pattern matching and test a few mismatch and indel options (possibly with IUPAC coding later). Here is scripting for a first effort:

## I am firstly downloading upstream sequences using BSGenome and the latest human sequence build ##
library(BSgenome)
library(BSgenome.Hsapiens.UCSC.hg19)
ls('package:BSgenome.Hsapiens.UCSC.hg19')
## this creates subject a so-called DNAStringSet of upstream gene sequences i.e. promoters##
subject=Hsapiens$upstream1000 ## the query string and its reverse complement - for now: will experiment later - ## CMARE <- DNAString('TGCTGACGTCAGCA') CMARE_complement <- complement(CMARE) ## extract the accession name with a bit of messing about ## nm=names(subject) nm=strsplit(nm, '_up') nm=unlist(lapply(nm, function(x)x[1])) ## this is matching forward and complement CMARE DNAStrings to the promoters DNAStringSet ## mindexF <- vmatchPattern(CMARE, subject, max.mismatch=1) mindexR <- vmatchPattern(CMARE_complement, subject, max.mismatch=1) ## I'm using 1 mismatch here but will probably adapt later ## ## create indices whF + whR of the matches count_indexF <- countIndex(mindexF) count_indexR <- countIndex(mindexR) total_matches= sum(count_indexF)+ sum(count_indexR) # Total number of matches. whF <- which(count_indexF > 0) whR <- which(count_indexR > 0) ## next we extract the data for each and put it in a sort of list view ## result_listF=list() result_listR=list() for(i in 1: length(whF)) { result_listF[[i]]= Views(subject[[whF[i]]], mindexF[[whF[i]]]) } for(i in 1: length(whR)) { result_listR[[i]]= Views(subject[[whR[i]]], mindexR[[whR[i]]]) } ## and add the original accesion names to the correct list index names(result_listF)=nm[whF] names(result_listR)=nm[whR] result_list=c(result_listF, result_listR) Now... I don't like this script much as it seems a faff to do everything twice for forward and reverse strands... but I was in a hurry an didn't have time to really understand the object classes and combine them properly. I also never like using for loops but in this case I couldn't see any way to make it more functional stylee.. Anyway that gives a result something like this: > result_list[1:2]$NM_181575
Views on a 1000-letter DNAString subject
subject: CAGACCTGGCCTACCCCAGCCCCCACCTTAGTCTGAATCCTCAGCGTTGCGATG...AGAAGTGCACGCCGGGACCAGGATTCCGGGAGGCCGACTCCTCCCTGCCCCAC
views:
start end width
[1]   716 729    14 [TCCTGACGTCAGCA]

\$NM_173826
Views on a 1000-letter DNAString subject
subject: CACTTGGGGGATTTTCTGAGGACTTTGTACTAGTGGCATATGTAACTCCTGTGC...CAGGGCACCGCGGAAGGTTGAATCCACGCCCTGGCTGTGCGCAGGCGCCGTAG
views:
start end width
[1]   929 942    14 [TGCTGACGTCAGCT]

Next I need to use biomart and Ensembl to try and get gene symbols, chromosomes, start positions, strand etc.. for the promoters. There is an R interface package biomaRt for this service. Once again however Ensembl/biomart is having a nervous breakdown-- so I'll probably continue this tomorrow.