[This article was first published on R – What You're Doing Is Rather Desperate, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Chains of amino acids strung together make up proteins and since each amino acid has a 1-letter abbreviation, we can find words (English and otherwise) in protein sequences. I imagine this pursuit began as soon as proteins were first sequenced, but the first reference to protein word-finding as a sport is, to my knowledge, “Price’s Protein Puzzle”, a letter to Trends in Biochemical Sciences in September 1987 [1].

Price wrote:

It occurred to me that TIBS could organise a competition to find the longest word […] contained within any known protein sequence.

The journal took up the challenge and published the winning entries in February 1988 [2]. The 7-letter winner was RERATED, with two 6-letter runners-up: LEADER and LIVELY. The sub-genre “biological words in protein sequences” was introduced almost one year later [3] with the discovery of ALLELE, then no more was heard until 1993 with Gonnet and Benner’s Nature correspondence “A Word in Your Protein” [4].

Noting that “none of the extensive literature devoted to this problem has taken a truly systematic approach” (it’s in Nature so one must declare superiority), this work is notable for two reasons. First, it discovered two 9-letter words: HIDALGISM and ENSILISTS. Second, it mentions the technique: a Patricia tree data structure, and that the search took 23 minutes.

Comments on this letter noted one protein sequence that ends with END [5] and the discovery of 10-letter, but non-English words ANNIDAVATE, WALLAWALLA and TARIEFKLAS [6].

I last visited this topic at my blog in 2008 and at someone else’s blog in 2015. So why am I here again? Because the Aho-Corasick algorithm in R, that’s why!

Here’s how I approached things this time around.

1. Get protein sequences

The easy part. Head on over to UniProt and grab the “Reviewed (Swiss-Prot)” dataset in fasta format. It’s a gzipped-file and you can leave it that way. We’ll assume any downloaded files are in ~/Downloads.

2. Get English words

The hard part, since words appear be owned by companies who sell dictionaries. Fortunately a kind individual has put some word datasets on Github. I grabbed the “alpha” words, which contains 370 099 entries.

3. Read files, run search and process output

I’ve always liked the R package seqinr for working with sequences. It stores sequences in a list format which will prove very useful, as we’ll see in a moment.

library(seqinr)

as.string = TRUE,
seqtype = "AA")


Any variant of readlines can be used to read the file of English words and we’ll convert to upper case:

library(readr)

toupper()


Now the fun part. Gonnet & Benner referred to a Patricia Tree, which is a special case of a data structure called a radix trie (yes, with an i). I found two implementations of radix trees in R. The first, triebeard, only has prefix search (start of a word) or fuzzy search functions. The second, AhoCorasickTrie, implements the Aho-Corasick string search algorithm and proved to be very useful since it includes a function to search lists which can be used directly on our stored protein sequences.

library(AhoCorasickTrie)

# search sequences for words of 8 or more letters
results <- AhoCorasickSearchList(words[which(nchar(words) > 7)], sp)

# retain only those results with a match
results <- results[which(sapply(results, function(x) length(x[[1]]) > 0))]

# results with no match look like this
# List of 1
#  $sp|Q6GZX4|001R_FRG3G:List of 1 # ..$ : list()

# and those with a match like this (may be > 1)
# List of 1
#  $sp|Q5FTU6|2KGR_GLUOX:List of 1 # ..$ :List of 1
#   .. ..$:List of 2 # .. .. ..$ Keyword: chr "SHARPEST"
@   .. .. ..\$ Offset : num 172


I didn’t benchmark this, but it took around 30-45 minutes on my iMac. That probably compares favourably with Gonnet & Benner, given the increased protein database size since 1993 and the likelihood that they weren’t using a desktop machine.

Some ugly code to transform the list into a dataframe and here’s what you’ve all been waiting for: a CSV file with those words of 8 or more letters found in a protein sequence.

So what do we have? Still nothing longer than nine letters, but we did find ten 9-letter words. This includes two identified previously – HIDALGISM and TARGETEER – but omits ENSILISTS, which was not in our English word list.

.id Keyword Offset sp|Q2TAC2|CCD57_HUMAN SLAVERERS 410 sp|B7ZRM8|EVI1B_XENLA NIDERINGS 861 sp|A9BDD9|PYRG_PROM4 HACIENDAS 315 sp|P17284|VIF_SIVCZ ALKALISER 150 sp|A8AY34|ADDB_STRGC SLYNESSES 820 sp|Q96LP6|CL042_HUMAN TARGETEER 145 sp|Q8L5Z1|GDL17_ARATH TETANILLA 341 sp|Q8TV85|METK_METKA DELLENITE 376 sp|P25202|RPC1_GIAIN CHAPSTICK 163 sp|P03700|VINT_LAMBD HIDALGISM 247

There are 206 words of eight letters. This list includes ENSILIST and a number of fun examples including DISASTER, VILLAINY and the very biological VIVIPARY.

The holy grail of course is the sequence which contains a word with relevance to its function or the organism in which the protein is found. That can be, as they say, an exercise for the reader.

I’m afraid none of the literature cited in this post is open access. But you know how to get around that, right? 😉

[1] Price, N.C. (1987). Price’s Protein Puzzle. Trends Biochem Sci 12:349.

[2] Purton, M. (1988) Price’s Protein Puzzle: winners. Trends Biochem Sci 13:48.

[3] Jimenez, A. (1989). Response to Price’s Protein Puzzle. Trends Biochem Sci 14:14.

[4] Gonnet, G.H. and Benner, S.A. (1993). A word in your protein. Nature 361(6408):121.

[5] Andersson, L.C.G. (1993). Inside information. Nature 362(6416):120.

[6] Jones, D. (1993). More protein talk. Nature 361(6414):694.