
Price’s Protein Puzzle: 2023 update

[This article was first published on R – What You're Doing Is Rather Desperate, and kindly contributed to R-bloggers.]

One of the joys (?) of having been online for…quite some time now…is watching topics reappear every few years or so.

What is the longest coherent word or phrase present in the amino acid sequence of a real protein?

— Dr. Caroline Bartman (@Caroline_Bartma) July 21, 2023
Yes, it’s Price’s Protein Puzzle, which I last wrote about back in 2019. The good news is that my code still runs, so I’ve updated the results of an English word search versus the UniProt Reviewed (Swiss-Prot) protein database. Just for fun, I threw in a few other languages too.
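For anyone new to the puzzle, the search itself is just substring matching of dictionary words against protein sequences. Here is a minimal Python sketch of the idea, not the code used for this post. The function name `find_words`, the toy protein IDs and the toy sequences are all made up for illustration; a real run would read sequences from a Swiss-Prot FASTA file and a full word list.

```python
from typing import Dict, Iterable, List, Tuple


def find_words(
    proteins: Dict[str, str],
    words: Iterable[str],
    min_len: int = 8,
) -> List[Tuple[str, str, int]]:
    """Return (word, protein_id, 1-based position) for the first
    occurrence of each word in each sequence."""
    # Upper-case and de-duplicate the word list, dropping short words.
    candidates = sorted({w.upper() for w in words if len(w) >= min_len})
    hits = []
    for prot_id, seq in proteins.items():
        seq = seq.upper()
        for word in candidates:
            pos = seq.find(word)  # -1 when the word is absent
            if pos != -1:
                hits.append((word, prot_id, pos + 1))
    return hits


# Hypothetical toy sequences, NOT real Swiss-Prot entries.
proteins = {
    "TOY1": "MKTAGGRAVATESLLQ",
    "TOY2": "PARALEGALSMKV",
}
words = ["aggravates", "paralegals", "protein"]

hits = find_words(proteins, words)
for word, prot_id, pos in hits:
    print(word, prot_id, pos)
```

A position of 1 means the word sits at the very start of the sequence. Looping over every word for every sequence is fine for a toy example but slow for a large database, where an indexed or multi-pattern search would be a better fit.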

So what’s new?

In terms of English word matches: not much. Some new proteins but no new 9-letter words. The Twitter thread, above, contains an interesting reply about an approach using generative AI:

I found PARALEGALS in addition to AGGRAVATES for 10 with the algorithm:
(1) Ask chat-gpt for a list of x-letter words likely to be found in a coding sequence
(2) blastp and check it's not an obvious artifact
(3) check chat-gpt didn't hallucinate a word (TRAVELGAGS?)

— Zach Hensel (@alchemytoday) July 22, 2023
Note that the match is to the NR protein database. I’d like to work with this locally, but I believe it’s on the order of 150 GB now, so it would take some work to optimise.

Other languages are somewhat constrained by (1) the quality of the word lists that I could find online (on GitHub) and (2) for some languages, the presence of characters not found in the English alphabet, which reduces the viable word list even further. That said, there are a few fun matches. I am not a linguist, so I’m relying on Google Translate and other online translators here.
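To illustrate why some word lists shrink so much, here is a rough Python sketch of the filtering step. It assumes that only the 20 standard one-letter amino acid codes count as usable, which is my assumption for this example; the names `AMINO_ACIDS`, `ALLOWED` and `usable_words` are hypothetical.

```python
# Assumption: only the 20 standard amino acid one-letter codes are usable.
# B, J, O, U, X and Z are excluded, as are all accented characters.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
ALLOWED = frozenset(AMINO_ACIDS + AMINO_ACIDS.lower())


def usable_words(words):
    """Keep words made up entirely of amino acid letters, upper-cased."""
    kept = []
    for raw in words:
        word = raw.strip()
        # Check characters before upper-casing: in Python, "ß".upper()
        # becomes "SS", which would let such words slip through.
        if word and all(ch in ALLOWED for ch in word):
            kept.append(word.upper())
    return kept


sample = ["stallare", "annidavate", "smörgås", "jazz", "bok", "straße"]
usable = usable_words(sample)
print(usable)
```

Only the first two sample words survive. Every word containing B, J, O, U, X, Z or an accented character is dropped, which is why languages that rely on such characters end up with shorter viable lists.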

In addition to the previously-noted 10-letter Italian word ANNIDAVATE we have:

All of the languages have 9-letter matches except Swedish (maximum 8 letters, for example STALLARE – stabler). Spanish was a rich source of hits (452 distinct words > 7 letters), although that’s probably due in large part to the size of the Spanish word list used. Swedish had the fewest (26 distinct words > 7 letters), perhaps due to the large number of unusable words containing characters outside the amino acid alphabet.
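The "distinct words > 7 letters" figures count each word once per language, no matter how many proteins it turns up in. A small Python sketch of that tally, assuming hits are stored as (language, word) pairs; the function name `distinct_long_words` and the toy hit list are invented for illustration.

```python
from collections import defaultdict


def distinct_long_words(hits, min_len=8):
    """Count distinct matched words of at least min_len letters per language."""
    by_lang = defaultdict(set)
    for lang, word in hits:
        if len(word) >= min_len:
            by_lang[lang].add(word)  # a set ignores repeat matches
    return {lang: len(found) for lang, found in by_lang.items()}


# Hypothetical toy hits; the same word matching two proteins appears twice.
toy_hits = [
    ("Swedish", "STALLARE"),
    ("Swedish", "STALLARE"),
    ("Italian", "ANNIDAVATE"),
    ("Italian", "VITA"),
]

counts = distinct_long_words(toy_hits)
print(counts)
```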

There are 9 hits to the start of a protein. Some of these are:

And so ends the update for another year.
