Price’s Protein Puzzle: 2023 update

[This article was first published on R – What You're Doing Is Rather Desperate, and kindly contributed to R-bloggers].

One of the joys (?) of having been online for…quite some time now…is watching topics reappear every few years or so.

Yes, it’s Price’s Protein Puzzle which I last wrote about back in 2019. The good news is that my code still runs, so I’ve updated the results of an English word search versus the UniProt Reviewed (Swiss-Prot) protein database. Just for fun I threw in a few other languages too.

So what’s new?

In terms of English word matches: not much. Some new proteins but no new 9-letter words. The Twitter thread, above, contains an interesting reply about an approach using generative AI:

Note that the match is to the NR protein database. I’d like to work with this locally but I believe it’s in the order of 150 GB now, so it would take some work to optimise.

Other languages are somewhat constrained by (1) the quality of the word lists that I could find online (on GitHub) and (2) for some languages, the presence of characters not found in the English alphabet, which reduces the viable word list even further. That said, there are a few fun matches. I am not a linguist so I’m relying on Google Translate and other online translators here.
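The second constraint can be made concrete with a small filter. This is a minimal Python sketch, not the code behind this post: it assumes the usable alphabet is the 20 standard one-letter amino acid codes, and `usable_words` and the sample words are hypothetical, chosen only for illustration.

```python
# Assumption: only the 20 standard one-letter amino acid codes are usable.
# That excludes B, J, O, U, X, Z and any accented character.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def usable_words(words, min_length=8):
    """Keep words of at least min_length letters that use only amino acid codes."""
    kept = set()
    for word in words:
        candidate = word.strip().upper()
        if len(candidate) >= min_length and set(candidate) <= AMINO_ACIDS:
            kept.add(candidate)
    return sorted(kept)


# Illustrative input only: two words with unusable letters, one too short.
sample = ["gangarilla", "smörgåsbord", "boulevard", "annidavate", "cat"]
print(usable_words(sample))
```

Words containing accented characters drop out at this step, which is one way a word list can shrink sharply before any searching happens.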

In addition to the previously-noted 10-letter Italian word ANNIDAVATE we have:

  • sp|B2II34|KATG_BEII9 – GANGARILLA (Spanish) – a company of strolling players
  • sp|P40069|IMB4_YEAST – FERRAILLAI (French) – je ferraillai (I scrapped)

All of the languages have 9-letter matches except Swedish (maximum 8 letters, for example STALLARE – stabler). Spanish was a rich source of hits (452 distinct words > 7 letters), although that’s probably due in large part to the large size of the Spanish word list used. Swedish was the lowest (26 distinct words > 7 letters), perhaps due to the large number of unusable words with non-amino acid alphabet characters.
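For readers wondering how counts like these can be produced, here is a rough Python sketch of a brute-force substring search followed by a distinct-word tally. It is an illustration under stated assumptions rather than the method used for this post: `find_hits`, `distinct_long_words` and the toy sequences are all hypothetical, and the sequences are invented, not UniProt entries.

```python
def find_hits(words, proteins):
    """Return (protein_id, word, position) for each word found in a sequence."""
    hits = []
    for protein_id, sequence in proteins.items():
        for word in words:
            position = sequence.find(word)  # -1 when the word is absent
            if position != -1:
                hits.append((protein_id, word, position))
    return hits


def distinct_long_words(hits, min_length=8):
    """Distinct matched words longer than 7 letters, sorted alphabetically."""
    return sorted({word for _, word, _ in hits if len(word) >= min_length})


# Invented toy sequences, for illustration only.
toy_proteins = {
    "toy1": "MKVSTALLAREGANGARILLAQ",
    "toy2": "MAGNETIETKLSTALLARE",
}
words = ["STALLARE", "GANGARILLA", "MAGNETIET", "METTESTE"]

hits = find_hits(words, toy_proteins)
print(len(hits), distinct_long_words(hits))
```

A word found in several proteins counts once in the distinct tally, so the number of hits and the number of distinct words can differ. A nested loop like this is fine for a toy example but would be slow against a full database.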

There are 9 hits to the start of a protein. Some of these are:

  • sp|O49997|1433E_TOBAC – MAESTREEN (Spanish) – you direct
  • sp|Q3V0Q6|SPAG8_MOUSE – METTESTE (Italian) – you put
  • sp|Q49135|FCHA_METEA – MAGNETIET (Dutch) – magnetite
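Restricting the search to the start of a sequence is a small variation on a substring search. The Python sketch below is illustrative only: `start_hits` is a hypothetical helper and the toy sequences are invented.

```python
def start_hits(words, proteins):
    """Return sorted (protein_id, word) pairs where the sequence begins with the word."""
    return sorted(
        (protein_id, word)
        for protein_id, sequence in proteins.items()
        for word in words
        if sequence.startswith(word)
    )


# Invented toy sequences, for illustration only.
toy_proteins = {
    "toy1": "MAGNETIETKLSTALLARE",  # begins with MAGNETIET
    "toy2": "AKMETTESTEGH",         # contains METTESTE, but not at the start
}
print(start_hits(["MAGNETIET", "METTESTE", "STALLARE"], toy_proteins))
```

Protein sequences typically begin with methionine (M), which is consistent with all three examples above starting with M.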

And so ends the update for another year.
