Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Ever wondered what M184V, K65R actually mean? I learnt it from rebuilding Stanford’s HIV resistance algorithm from scratch to find out. Spoiler: it took tons of code to match their 3-line tool. But the lesson was worth it
Motivations: < svg class="anchor-symbol" aria-hidden="true" height="26" width="26" viewBox="0 0 22 22" xmlns="http://www.w3.org/2000/svg"> < path d="M0 0h24v24H0z" fill="currentColor"> < path d="M3.9 12c0-1.71 1.39-3.1 3.1-3.1h4V7H7c-2.76.0-5 2.24-5 5s2.24 5 5 5h4v-1.9H7c-1.71.0-3.1-1.39-3.1-3.1zM8 13h8v-2H8v2zm9-6h-4v1.9h4c1.71.0 3.1 1.39 3.1 3.1s-1.39 3.1-3.1 3.1h-4V17h4c2.76.0 5-2.24 5-5s-2.24-5-5-5z">
We’ve all learnt and memorize what those letters and numbers mean when it comes to antiretroviral resistance. Since we’ve been exploring genomics lately, let’s take another look at HIV genotype. Stanford University HIVdb is an amazing resource! I’ve always been confused with all these letters and have difficulty understanding how to even check for genotype resistance because all these numbers and letters are quite confusing and intimidating. Let’s put on our bioconductor hat and revisit this topic and see if we can at least get a better understanding on what these letters and numbers mean. Better yet, use this opportunity to try to reproduce the algorithm that tells us the susceptibility of the ART! Hang tight on this one, it’s going to be a bumpy road!
Objectives < svg class="anchor-symbol" aria-hidden="true" height="26" width="26" viewBox="0 0 22 22" xmlns="http://www.w3.org/2000/svg"> < path d="M0 0h24v24H0z" fill="currentColor"> < path d="M3.9 12c0-1.71 1.39-3.1 3.1-3.1h4V7H7c-2.76.0-5 2.24-5 5s2.24 5 5 5h4v-1.9H7c-1.71.0-3.1-1.39-3.1-3.1zM8 13h8v-2H8v2zm9-6h-4v1.9h4c1.71.0 3.1 1.39 3.1 3.1s-1.39 3.1-3.1 3.1h-4V17h4c2.76.0 5-2.24 5-5s-2.24-5-5-5z">
- Find HIV reference gene
- Find Resistance
- Workflow
- Final Thoughts
- Opportunities for improvement
- Lessons Learnt
Find HIV Reference Gene < svg class="anchor-symbol" aria-hidden="true" height="26" width="26" viewBox="0 0 22 22" xmlns="http://www.w3.org/2000/svg"> < path d="M0 0h24v24H0z" fill="currentColor"> < path d="M3.9 12c0-1.71 1.39-3.1 3.1-3.1h4V7H7c-2.76.0-5 2.24-5 5s2.24 5 5 5h4v-1.9H7c-1.71.0-3.1-1.39-3.1-3.1zM8 13h8v-2H8v2zm9-6h-4v1.9h4c1.71.0 3.1 1.39 3.1 3.1s-1.39 3.1-3.1 3.1h-4V17h4c2.76.0 5-2.24 5-5s-2.24-5-5-5z">
From most of the searching, the reference gene appears to be HXB2. We know that pol gene encodes protease (PR), reverse transcriptase (RT) and integrase (IN). Below is directly from NCBI. source. You can click on it and hover over those regions to see what they are.
library(Biostrings) library(DECIPHER) library(tidyverse) ## We downloaded a bunch of HIV genomes, including the reference known as K03455 (aka HXB2) (all_hiv_genome <- readDNAStringSet("all_hiv_whole_genome.fasta"))
## load reference hxb2_idx <- grep("K03455", names(all_hiv_genome), ignore.case = TRUE) hxb2_genome <- all_hiv_genome[hxb2_idx] ## locate the genes rt_sequence <- subseq(hxb2_genome, start = 2550, end = 4229) pi_sequence <- subseq(hxb2_genome, start = 2253, end = 2549) int_sequence <- subseq(hxb2_genome, start = 4230, end = 5093) # Rename sequences with informative names names(pi_sequence) <- "HIV_PR" names(rt_sequence) <- "HIV_RT" names(int_sequence) <- "HIV_INT" # Combine all three into one DNAStringSet pol_regions <- c(pi_sequence, rt_sequence, int_sequence) # Write to a single FASTA file writeXStringSet(pol_regions, "hiv_pol_regions.fasta") # make it into blast database system("makeblastdb -in hiv_pol_regions.fasta -dbtype nucl -out /path/to/hiv/hiv_pol_db -parse_seqids")
With the above we’re basically setting up a databse for the reference pol gene of PR, RT, and IN. We will then find a genome of interest and use blast to identify where these genes are located on the sample sequence like so.
## use blast to find where these genes positions are system('blastn -query all_hiv_whole_genome.fasta -db /path/to/hiv/hiv_pol_db -word_size 7 -evalue 0.0000001 -outfmt 6 -out hiv_sample_blast_results.txt') ## read it colnames <- c("qseqid", "sseqid", "pident", "length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send", "evalue", "bitscore") (all_hiv_sample <- read_tsv("hiv_sample_blast_results.txt", col_names = colnames))
Notice sseqid
section, we have HIV_RT
, HIV_INT
, HIV_PR
. We just need to filter these genes, extract the genes via their positions qstart
and qend
. Let’s take a look at U63632.1
sample1 <- all_hiv_sample |> filter(str_detect(qseqid, "U63632.1")) |> filter(sseqid == "HIV_RT") |> select(qseqid, qstart, qend)
OK, how do we go from here to knowing what’s mutated?
Find The Resistance < svg class="anchor-symbol" aria-hidden="true" height="26" width="26" viewBox="0 0 22 22" xmlns="http://www.w3.org/2000/svg"> < path d="M0 0h24v24H0z" fill="currentColor"> < path d="M3.9 12c0-1.71 1.39-3.1 3.1-3.1h4V7H7c-2.76.0-5 2.24-5 5s2.24 5 5 5h4v-1.9H7c-1.71.0-3.1-1.39-3.1-3.1zM8 13h8v-2H8v2zm9-6h-4v1.9h4c1.71.0 3.1 1.39 3.1 3.1s-1.39 3.1-3.1 3.1h-4V17h4c2.76.0 5-2.24 5-5s-2.24-5-5-5z">
All those letters, M184V
, K65R
etc, the letters and numbers must mean something right? You’re right! The numbers are position, but what about the letters? They don’t really look like ATCG, do they? They’re amino acids! For example, the M
on M184
stands for Methionine, the V
stands for Valine. So M184V
means at position 184, Methionine has mutated to Valine. Similarly, K65R
means at position 65, Lysine has mutated to Arginine.
Amino acids and their letters:
A
: Alanine, C
: Cysteine, D
: Aspartic Acid, E
: Glutamic Acid, F
: Phenylalanine, G
: Glycine, H
: Histidine, I
: Isoleucine, K
: Lysine, L
: Leucine, M
: Methionine, N
: Asparagine, P
: Proline, Q
: Glutamine, R
: Arginine, S
: Serine, T
: Threonine, V
: Valine, W
: Tryptophan, Y
: Tyrosine
his also means that our translated RT reference HIV gene HXB2 at position 184, we should expect the amino acid to be M. Let’s take a look
(rt_t <- translate(rt_sequence))
Wow! We’re used to seeing DNA sequence colors but now take a look at the different amino acid colors! How pretty! Now let’s look at position 184.
subseq(rt_t, 184,184)
Alright! M
indeed! Now, let’s go through the workflow of using this information to run through our sample
Workflow < svg class="anchor-symbol" aria-hidden="true" height="26" width="26" viewBox="0 0 22 22" xmlns="http://www.w3.org/2000/svg"> < path d="M0 0h24v24H0z" fill="currentColor"> < path d="M3.9 12c0-1.71 1.39-3.1 3.1-3.1h4V7H7c-2.76.0-5 2.24-5 5s2.24 5 5 5h4v-1.9H7c-1.71.0-3.1-1.39-3.1-3.1zM8 13h8v-2H8v2zm9-6h-4v1.9h4c1.71.0 3.1 1.39 3.1 3.1s-1.39 3.1-3.1 3.1h-4V17h4c2.76.0 5-2.24 5-5s-2.24-5-5-5z">
TL;DR Extract region -> Align Translation of AA with Reference & Assess Mutation -> Calculate Mutation Score
Identify/Extract Regions < svg class="anchor-symbol" aria-hidden="true" height="26" width="26" viewBox="0 0 22 22" xmlns="http://www.w3.org/2000/svg"> < path d="M0 0h24v24H0z" fill="currentColor"> < path d="M3.9 12c0-1.71 1.39-3.1 3.1-3.1h4V7H7c-2.76.0-5 2.24-5 5s2.24 5 5 5h4v-1.9H7c-1.71.0-3.1-1.39-3.1-3.1zM8 13h8v-2H8v2zm9-6h-4v1.9h4c1.71.0 3.1 1.39 3.1 3.1s-1.39 3.1-3.1 3.1h-4V17h4c2.76.0 5-2.24 5-5s-2.24-5-5-5z">
sample1 <- all_hiv_sample |> filter(str_detect(qseqid, "U63632.1")) |> filter(sseqid == "HIV_RT") |> select(qseqid, qstart, qend) sample_rt <- subseq( all_hiv_genome[str_detect(names(all_hiv_genome), sample1$qseqid)], start = sample1$qstart, end = sample1$qend )
We’ve done this process above, but for other genes, we’d need to write a function to change HIV_RT
to the others.
Translate < svg class="anchor-symbol" aria-hidden="true" height="26" width="26" viewBox="0 0 22 22" xmlns="http://www.w3.org/2000/svg"> < path d="M0 0h24v24H0z" fill="currentColor"> < path d="M3.9 12c0-1.71 1.39-3.1 3.1-3.1h4V7H7c-2.76.0-5 2.24-5 5s2.24 5 5 5h4v-1.9H7c-1.71.0-3.1-1.39-3.1-3.1zM8 13h8v-2H8v2zm9-6h-4v1.9h4c1.71.0 3.1 1.39 3.1 3.1s-1.39 3.1-3.1 3.1h-4V17h4c2.76.0 5-2.24 5-5s-2.24-5-5-5z">
Just for demonstration purposes:
#### Take note, this entire code chunk is not needed !!! seq_length <- sample1$qend - sample1$qstart + 1 adjusted_length <- seq_length - (seq_length %% 3) sample_rt <- subseq( all_hiv_genome[str_detect(names(all_hiv_genome), sample1$qseqid)], start = sample1$qstart, end = sample1$qstart + adjusted_length - 1 ) (sample_rt_t <- translate(sample_rt, if.fuzzy.codon = "solve"))
solve
to handle any fuzzy codons that may arise due to sequencing errors or ambiguities in the nucleotide sequence. This ensures that the translation process can still proceed even if there are some uncertainties in the input sequence.
But in reality, we can align them directly from our DNA sequence with the help of DECIPHER, it will align the translated AA!
Align Translation & Assess Mutation < svg class="anchor-symbol" aria-hidden="true" height="26" width="26" viewBox="0 0 22 22" xmlns="http://www.w3.org/2000/svg"> < path d="M0 0h24v24H0z" fill="currentColor"> < path d="M3.9 12c0-1.71 1.39-3.1 3.1-3.1h4V7H7c-2.76.0-5 2.24-5 5s2.24 5 5 5h4v-1.9H7c-1.71.0-3.1-1.39-3.1-3.1zM8 13h8v-2H8v2zm9-6h-4v1.9h4c1.71.0 3.1 1.39 3.1 3.1s-1.39 3.1-3.1 3.1h-4V17h4c2.76.0 5-2.24 5-5s-2.24-5-5-5z">
# this automatically translate aligned seq into aligned AA, sweet !!! alignseq <- AlignTranslation(c(rt_sequence,sample_rt), type="AAStringSet") # turn alignment into matrix align_matrix <- as.matrix(alignseq) # extract alignment on both ref and sample ref_seq <- align_matrix[1,] sample_seq <- align_matrix[2,] # find position where there is mutation mutation_positions <- which(ref_seq != sample_seq & ref_seq != "-" & sample_seq != "-") # load into dataframe mutations <- data.frame( position = mutation_positions, reference = ref_seq[mutation_positions], sample = sample_seq[mutation_positions], mutation = paste0(ref_seq[mutation_positions], mutation_positions, sample_seq[mutation_positions]) ) |> mutate(position_replace = paste0(position,sample)) mutations$mutation
Lots of mutations. HIV viruses are notorious for SNPs which may or may not be clinically significant. And with just a single nucleotide mutation, as you can see the translated amino acid could be different from reference. OK, since we know how to change for mutations, how do we know if this is the same as Stanford HIVdb? Let’s copy and paste the entire genome of this sample and paste it here
## easy way of copying to system so we can paste on Stanford website (link above) all_hiv_genome[str_detect(names(all_hiv_genome),"U63632.1")] |> as.character() |> clipr::write_clip()
Not too shabby! The one mutation that is significant is M348I
. We were able to capture that, not too shabby! Now, how do we go from here to the inferred susceptibility of NRTIs and NNRTIs? In come the mutation score algorithm!
Calculate Mutation Score < svg class="anchor-symbol" aria-hidden="true" height="26" width="26" viewBox="0 0 22 22" xmlns="http://www.w3.org/2000/svg"> < path d="M0 0h24v24H0z" fill="currentColor"> < path d="M3.9 12c0-1.71 1.39-3.1 3.1-3.1h4V7H7c-2.76.0-5 2.24-5 5s2.24 5 5 5h4v-1.9H7c-1.71.0-3.1-1.39-3.1-3.1zM8 13h8v-2H8v2zm9-6h-4v1.9h4c1.71.0 3.1 1.39 3.1 3.1s-1.39 3.1-3.1 3.1h-4V17h4c2.76.0 5-2.24 5-5s-2.24-5-5-5z">
This part is very interesting. My initial trial was using their [genotype-phenotype DRMcv model]((
https://hivdb.stanford.edu/download/GenoPhenoDatasets/DRMcv.R) but I couldn’t reproduce most of the results. And realized that
Stanford HIVdb Sequance Analysis uses
another algorithm. And this algorithm, at least for the final product, is more straight forward than trying to use a model to reproduce a prediction, which was what the DRMcv was using glmnet
and lasso
. There is also
extensive documentation. All of the csv
files below were obtained from the links provided from the documentation. Let’s code!
load_hivdb <- function(dataset, mutations){ if (dataset=="NRTI") { mut_score <- read_csv("hivdb_nrti_single.csv") mut_score_combo <- read_csv("hivdb_nrti_combo.csv") n <- 2 m <- 4 } if (dataset=="NNRTI") { mut_score <- read_csv("hivdb_nnrti_single.csv") mut_score_combo <- read_csv("hivdb_nnrti_combo.csv") n <- 2 m <- 2 } if (dataset=="PI") { mut_score <- read_csv("hivdb_pi_single.csv") mut_score_combo <- read_csv("hivdb_pi_combo.csv") n <- 2 m <- 3 } if (dataset=="INSTI") { mut_score <- read_csv("hivdb_insti_single.csv") mut_score_combo <- read_csv("hivdb_insti_combo.csv") n <- 2 m <- 3 } mut_interest <- c() for (mut_i in mut_score |> pull(Rule)) { add_mut <- F add_mut <- mut_i %in% (mutations |> pull(mutation)) if (add_mut) { mut_interest <- c(mut_interest, mut_i) } } if (length(mut_interest)==0) { return(mut_score |> slice_sample(n = 0) |> # Get 0 rows but keep structure select(-1) |> summarise(across(everything(), ~0)) |> pivot_longer(cols = everything(), names_to = "ART", values_to = "score") |> mutate(levels = case_when( score <= 9 ~ 1, score >= 10 & score <= 14 ~ 2, score >= 15 & score <= 29 ~ 3, score >= 30 & score <= 59 ~ 4, score >= 60 ~ 5 )) |> mutate(interpretation = case_when( levels == 1 ~ "susceptible", levels == 2 ~ "potential low-level resistance", levels == 3 ~ "low-level resistance", levels == 4 ~ "intermediate resistance", levels == 5 ~ "high-level resistance" ))) } mut_combo_vec <- mut_score_combo |> pull(combination_rule) |> paste(collapse = " + ") |> str_split(" \\+ ") |> unlist() |> unique() mut_combo_idx <- mut_combo_vec %in% (mutations |> pull(mutation)) # Function to sort mutations by position number sort_mutations <- function(mutation_string) { # Split the string by " + " positions <- str_extract(mutation_string, "\\d+") |> as.numeric() # Sort mutations by their positions sorted_mutations <- mutation_string[order(positions)] return(sorted_mutations) } mut_combo_seq <- sort_mutations(mut_combo_vec[mut_combo_idx]) # Function to create combinations of a given size create_combinations <- function(mutations_, size) { combn(mutations_, size, FUN = function(x) paste(x, collapse = " + "), simplify = TRUE) } if (length(mut_combo_seq)==1) { mut_score_sum <- mut_score |> filter(Rule %in% mut_interest) |> select(-1) |> summarize(across(everything(), sum)) |> pivot_longer(cols = everything(), names_to = "ART", values_to = "score") |> mutate(levels = case_when( score <= 9 ~ 1, score >= 10 & score <= 14 ~ 2, score >= 15 & score <= 29 ~ 3, score >= 30 & score <= 59 ~ 4, score >= 60 ~ 5 )) |> mutate(interpretation = case_when( levels == 1 ~ "susceptible", levels == 2 ~ "potential low-level resistance", levels == 3 ~ "low-level resistance", levels == 4 ~ "intermediate resistance", levels == 5 ~ "high-level resistance" )) } else { # Generate combinations of size 2, 3, and 4 all_combinations <- map(n:m, ~create_combinations(mut_combo_seq, .x)) |> unlist() # filter from mut_score_combo mut_score_combo_df <- mut_score_combo |> filter(combination_rule %in% all_combinations) |> rename(Rule = combination_rule) # sum single + combo mut_score_sum <- mut_score |> filter(Rule %in% mut_interest) |> bind_rows(mut_score_combo_df) |> select(-1) |> summarize(across(everything(), sum)) |> pivot_longer(cols = everything(), names_to = "ART", values_to = "score") |> mutate(levels = case_when( score <= 9 ~ 1, score >= 10 & score <= 14 ~ 2, score >= 15 & score <= 29 ~ 3, score >= 30 & score <= 59 ~ 4, score >= 60 ~ 5 )) |> mutate(interpretation = case_when( levels == 1 ~ "susceptible", levels == 2 ~ "potential low-level resistance", levels == 3 ~ "low-level resistance", levels == 4 ~ "intermediate resistance", levels == 5 ~ "high-level resistance" )) } return(mut_score_sum) } (nrti <- load_hivdb("NRTI", mutations)) (nnrti <- load_hivdb("NNRTI", mutations))
Wow, long code but we were able to reproduce the same score as Stanford HIVdb! Not too shabby. Take note that if certain mutations exist together, there is another table that shows additional penalty score! For example, if our sequence contain M41L + T215FY
mutation, the total mutation score for TDF is 5+10+10=25
, it’s not just 5+10
.
see here for full table.
Now, we just need to repeat this for PR and IN, write it in a pipeline and there you have it! Let’s look at their official web service via sierrapy
and then see if we can reproduce with our algorithm!
Stanford Sierrapy < svg class="anchor-symbol" aria-hidden="true" height="26" width="26" viewBox="0 0 22 22" xmlns="http://www.w3.org/2000/svg"> < path d="M0 0h24v24H0z" fill="currentColor"> < path d="M3.9 12c0-1.71 1.39-3.1 3.1-3.1h4V7H7c-2.76.0-5 2.24-5 5s2.24 5 5 5h4v-1.9H7c-1.71.0-3.1-1.39-3.1-3.1zM8 13h8v-2H8v2zm9-6h-4v1.9h4c1.71.0 3.1 1.39 3.1 3.1s-1.39 3.1-3.1 3.1h-4V17h4c2.76.0 5-2.24 5-5s-2.24-5-5-5z">
Let’s explore a much simpler and expert-driven approach. Let’s try Stanford HIVdb’s SierraPy, their python equivalent. This uses their web service, hence will need internet and also need to send sequence as well. Very fast.
# install pip install sierrapy # write fasta of sample of interest, let's take a look at KJ849778.1 all_hiv_genome[all_hiv_genome |> names() |> str_detect("KJ849778.1")] |> writeXStringSet("hiv_test.fasta") # run sierrapy input fasta system("sierrapy fasta hiv_test.fasta -o hiv_test_output.json") # read json output jsonlite::read_json("hiv_test_output.0.json", simplifyVector = T)$drugResistance[[1]][3]$drugScores
Alright. 🤔 with 3 lines of code after you have an assembled HIV whole genome, Sierra web service via
Sierrapy
will provide you the susceptibility interpretation within seconds! The json also provides you with the alignments as well! You do need internet for this thought, if you’re looking for local interpretation then can look into
Sierra Local
. There are lots of ways to get to this without having the have heavy codes like we did! lol, but it’s great to know how it works under the hood!
Now, since we know what Stanford HIVdb shows, let’s see if our algorithm will return the same result!
< details> < summary>codeextract_align <- function(sample,class) { if (!exists("hxb2_genome")) { stop("you need load/enter hxb2_genome") } sample1 <- all_hiv_sample |> filter(str_detect(qseqid, sample)) |> filter(sseqid == class) |> select(qseqid, qstart, qend) sample_rt <- subseq( all_hiv_genome[str_detect(names(all_hiv_genome), sample1$qseqid)], start = sample1$qstart, end = sample1$qend ) ## locate the genes if (class == "HIV_RT") { rt_sequence <- subseq(hxb2_genome, start = 2550, end = 4229) } if (class == "HIV_PR") { rt_sequence <- subseq(hxb2_genome, start = 2253, end = 2549) } if (class == "HIV_INT") { rt_sequence <- subseq(hxb2_genome, start = 4230, end = 5093) } # this automatically translate aligned seq into aligned AA, sweet !!! alignseq <- AlignTranslation(c(rt_sequence,sample_rt), type="AAStringSet", verbose = F) # turn alignment into matrix align_matrix <- as.matrix(alignseq) # extract alignment on both ref and sample ref_seq <- align_matrix[1,] sample_seq <- align_matrix[2,] # find position where there is mutation mutation_positions <- which(ref_seq != sample_seq & ref_seq != "-" & sample_seq != "-") # load into dataframe mutations <- data.frame( position = mutation_positions, reference = ref_seq[mutation_positions], sample = sample_seq[mutation_positions], mutation = paste0(ref_seq[mutation_positions], mutation_positions, sample_seq[mutation_positions]) ) |> mutate(position_replace = paste0(position,sample)) return(mutations) } load_hivdb <- function(dataset, mutations){ if (dataset=="NRTI") { mut_score <- read_csv("hivdb_nrti_single.csv") mut_score_combo <- read_csv("hivdb_nrti_combo.csv") n <- 2 m <- 4 } if (dataset=="NNRTI") { mut_score <- read_csv("hivdb_nnrti_single.csv",show_col_types = FALSE) mut_score_combo <- read_csv("hivdb_nnrti_combo.csv",show_col_types = FALSE) n <- 2 m <- 2 } if (dataset=="PI") { mut_score <- read_csv("hivdb_pi_single.csv",show_col_types = FALSE) mut_score_combo <- read_csv("hivdb_pi_combo.csv",show_col_types = FALSE) n <- 2 m <- 3 } if (dataset=="INSTI") { mut_score <- read_csv("hivdb_insti_single.csv",show_col_types = FALSE) mut_score_combo <- read_csv("hivdb_insti_combo.csv",show_col_types = FALSE) n <- 2 m <- 3 } mut_interest <- c() for (mut_i in mut_score |> pull(Rule)) { add_mut <- F add_mut <- mut_i %in% (mutations |> pull(mutation)) if (add_mut) { mut_interest <- c(mut_interest, mut_i) } } if (length(mut_interest)==0) { return(mut_score |> slice_sample(n = 0) |> # Get 0 rows but keep structure select(-1) |> summarise(across(everything(), ~0)) |> pivot_longer(cols = everything(), names_to = "ART", values_to = "score") |> mutate(levels = case_when( score <= 9 ~ 1, score >= 10 & score <= 14 ~ 2, score >= 15 & score <= 29 ~ 3, score >= 30 & score <= 59 ~ 4, score >= 60 ~ 5 )) |> mutate(interpretation = case_when( levels == 1 ~ "susceptible", levels == 2 ~ "potential low-level resistance", levels == 3 ~ "low-level resistance", levels == 4 ~ "intermediate resistance", levels == 5 ~ "high-level resistance" ))) } mut_combo_vec <- mut_score_combo |> pull(combination_rule) |> paste(collapse = " + ") |> str_split(" \\+ ") |> unlist() |> unique() mut_combo_idx <- mut_combo_vec %in% (mutations |> pull(mutation)) # Function to sort mutations by position number sort_mutations <- function(mutation_string) { # Split the string by " + " positions <- str_extract(mutation_string, "\\d+") |> as.numeric() # Sort mutations by their positions sorted_mutations <- mutation_string[order(positions)] return(sorted_mutations) } mut_combo_seq <- sort_mutations(mut_combo_vec[mut_combo_idx]) # Function to create combinations of a given size create_combinations <- function(mutations_, size) { combn(mutations_, size, FUN = function(x) paste(x, collapse = " + "), simplify = TRUE) } if (length(mut_combo_seq)==1) { mut_score_sum <- mut_score |> filter(Rule %in% mut_interest) |> select(-1) |> summarize(across(everything(), sum)) |> pivot_longer(cols = everything(), names_to = "ART", values_to = "score") |> mutate(levels = case_when( score <= 9 ~ 1, score >= 10 & score <= 14 ~ 2, score >= 15 & score <= 29 ~ 3, score >= 30 & score <= 59 ~ 4, score >= 60 ~ 5 )) |> mutate(interpretation = case_when( levels == 1 ~ "susceptible", levels == 2 ~ "potential low-level resistance", levels == 3 ~ "low-level resistance", levels == 4 ~ "intermediate resistance", levels == 5 ~ "high-level resistance" )) } else { # Generate combinations of size 2, 3, and 4 all_combinations <- map(n:m, ~create_combinations(mut_combo_seq, .x)) |> unlist() # filter from mut_score_combo mut_score_combo_df <- mut_score_combo |> filter(combination_rule %in% all_combinations) |> rename(Rule = combination_rule) # sum single + combo mut_score_sum <- mut_score |> filter(Rule %in% mut_interest) |> bind_rows(mut_score_combo_df) |> select(-1) |> summarize(across(everything(), sum)) |> pivot_longer(cols = everything(), names_to = "ART", values_to = "score") |> mutate(levels = case_when( score <= 9 ~ 1, score >= 10 & score <= 14 ~ 2, score >= 15 & score <= 29 ~ 3, score >= 30 & score <= 59 ~ 4, score >= 60 ~ 5 )) |> mutate(interpretation = case_when( levels == 1 ~ "susceptible", levels == 2 ~ "potential low-level resistance", levels == 3 ~ "low-level resistance", levels == 4 ~ "intermediate resistance", levels == 5 ~ "high-level resistance" )) } return(mut_score_sum) } hiv_genotype <- function(sample="KJ849778.1") { class_group <- c("HIV_RT","HIV_PR","HIV_INT") for (class in class_group) { mutations <- extract_align(sample, class) if (class == "HIV_RT") { print(load_hivdb("NRTI",mutations)) print(load_hivdb("NNRTI",mutations)) } if (class == "HIV_PR") { print(load_hivdb("PI",mutations)) } if (class == "HIV_INT") { print(load_hivdb("INSTI",mutations)) } } }
WHAT !?!?! My algorithm failed 💔 !!! ❌❌❌ Noooo…!!! Ths iis very odd. Let’s inspect!
Let’s pick RT and inspect.
sample <- "KJ849778.1" mutations <- extract_align(class = "HIV_RT", sample = sample) mutations |> pull(mutation)
Let’s look at Stanford’s
X
’s and these X’s coincide with Stanford’s. For example M184X
on ours is Stanford’s M184MI
. Our G190X
is their G190EKR
. 🤔 And some of the mutations are the same, so it’s not a frame issue. Oh wait !!! All X
’s because we cannot assess what exactly the amino acid is, we can’t tell if there is mutation at all, hence we assume it could be any !?! That total makes sense! That means we’ll have to incorporate this into our algorithm! And also, not shown here, if there is missing, algorithm should also choose the max mutation score as penalty. Now let’s implement that!
load_hivdb <- function(dataset, mutations){ if (dataset=="NRTI") { mut_score <- read_csv("hivdb_nrti_single.csv") mut_score_combo <- read_csv("hivdb_nrti_combo.csv") n <- 2 m <- 4 } if (dataset=="NNRTI") { mut_score <- read_csv("hivdb_nnrti_single.csv",show_col_types = FALSE) mut_score_combo <- read_csv("hivdb_nnrti_combo.csv",show_col_types = FALSE) n <- 2 m <- 2 } if (dataset=="PI") { mut_score <- read_csv("hivdb_pi_single.csv",show_col_types = FALSE) mut_score_combo <- read_csv("hivdb_pi_combo.csv",show_col_types = FALSE) n <- 2 m <- 3 } if (dataset=="INSTI") { mut_score <- read_csv("hivdb_insti_single.csv",show_col_types = FALSE) mut_score_combo <- read_csv("hivdb_insti_combo.csv",show_col_types = FALSE) n <- 2 m <- 3 } mut_interest <- c() # check if there is X if (sum(mutations |> pull(mutation) |> str_detect("X$")) > 0) { mutations_vec <- str_replace(mutations |> pull(mutation), pattern = "X", replacement = "") } else { mutations_vec <- mutations |> pull(mutation)} for (mut_i in mut_score |> pull(Rule)) { add_mut <- F add_mut <- str_detect(mut_i, paste(mutations_vec,collapse="|")) if (add_mut) { mut_interest <- c(mut_interest, mut_i) } } if (length(mut_interest)==0) { return(mut_score |> slice_sample(n = 0) |> # Get 0 rows but keep structure select(-1) |> summarise(across(everything(), ~0)) |> pivot_longer(cols = everything(), names_to = "ART", values_to = "score") |> mutate(levels = case_when( score <= 9 ~ 1, score >= 10 & score <= 14 ~ 2, score >= 15 & score <= 29 ~ 3, score >= 30 & score <= 59 ~ 4, score >= 60 ~ 5 )) |> mutate(interpretation = case_when( levels == 1 ~ "susceptible", levels == 2 ~ "potential low-level resistance", levels == 3 ~ "low-level resistance", levels == 4 ~ "intermediate resistance", levels == 5 ~ "high-level resistance" ))) } mut_combo_vec <- mut_score_combo |> pull(combination_rule) |> paste(collapse = " + ") |> str_split(" \\+ ") |> unlist() |> unique() mut_combo_idx <- mut_combo_vec %in% (mutations |> pull(mutation)) # Function to sort mutations by position number sort_mutations <- function(mutation_string) { # Split the string by " + " positions <- str_extract(mutation_string, "\\d+") |> as.numeric() # Sort mutations by their positions sorted_mutations <- mutation_string[order(positions)] return(sorted_mutations) } mut_combo_seq <- sort_mutations(mut_combo_vec[mut_combo_idx]) # Function to create combinations of a given size create_combinations <- function(mutations_, size) { combn(mutations_, size, FUN = function(x) paste(x, collapse = " + "), simplify = TRUE) } if (length(mut_combo_seq)==1|sum(mutations |> pull(mutation) |> str_detect("X$")) > 0) { mut_score_sum <- mut_score |> filter(Rule %in% mut_interest) |> mutate(Rule_x = str_extract(Rule, "^[A-Z]{1}[0-9]+")) |> # distinct(Rule_x, .keep_all = T) |> group_by(Rule_x) |> summarize(across(everything(),max)) |> select(-Rule,-Rule_x) |> summarize(across(everything(), sum)) |> pivot_longer(cols = everything(), names_to = "ART", values_to = "score") |> mutate(levels = case_when( score <= 9 ~ 1, score >= 10 & score <= 14 ~ 2, score >= 15 & score <= 29 ~ 3, score >= 30 & score <= 59 ~ 4, score >= 60 ~ 5 )) |> mutate(interpretation = case_when( levels == 1 ~ "susceptible", levels == 2 ~ "potential low-level resistance", levels == 3 ~ "low-level resistance", levels == 4 ~ "intermediate resistance", levels == 5 ~ "high-level resistance" )) } else { # Generate combinations of size 2, 3, and 4 if (m > length(mut_combo_seq)) { m <- length(mut_combo_seq) } all_combinations <- map(n:m, ~create_combinations(mut_combo_seq, .x)) |> unlist() # filter from mut_score_combo mut_score_combo_df <- mut_score_combo |> filter(combination_rule %in% all_combinations) |> rename(Rule = combination_rule) # sum single + combo mut_score_sum <- mut_score |> filter(Rule %in% mut_interest) |> bind_rows(mut_score_combo_df) |> select(-1) |> summarize(across(everything(), sum)) |> pivot_longer(cols = everything(), names_to = "ART", values_to = "score") |> mutate(levels = case_when( score <= 9 ~ 1, score >= 10 & score <= 14 ~ 2, score >= 15 & score <= 29 ~ 3, score >= 30 & score <= 59 ~ 4, score >= 60 ~ 5 )) |> mutate(interpretation = case_when( levels == 1 ~ "susceptible", levels == 2 ~ "potential low-level resistance", levels == 3 ~ "low-level resistance", levels == 4 ~ "intermediate resistance", levels == 5 ~ "high-level resistance" )) } return(mut_score_sum) } hiv_genotype()
Not too shabby! Now our NNRTI and PI look exactly the same! But our INSTI is not great! Mainly because there are a quite a few deletions that our alignment and their alignments don’t agree.
Let’s try another genome and see if our algorithm agrees with Stanford’s
Trial 1 ✅ < svg class="anchor-symbol" aria-hidden="true" height="26" width="26" viewBox="0 0 22 22" xmlns="http://www.w3.org/2000/svg"> < path d="M0 0h24v24H0z" fill="currentColor"> < path d="M3.9 12c0-1.71 1.39-3.1 3.1-3.1h4V7H7c-2.76.0-5 2.24-5 5s2.24 5 5 5h4v-1.9H7c-1.71.0-3.1-1.39-3.1-3.1zM8 13h8v-2H8v2zm9-6h-4v1.9h4c1.71.0 3.1 1.39 3.1 3.1s-1.39 3.1-3.1 3.1h-4V17h4c2.76.0 5-2.24 5-5s-2.24-5-5-5z">
Our code
sample <- sample(all_hiv_genome,1) |> names() |> str_split(" ") |> unlist() |> _[1] hiv_genotype(sample = sample)
Stanford’s
all_hiv_genome[all_hiv_genome |> names() |> str_detect(sample)] |> writeXStringSet("hiv_test.fasta") # run sierrapy input fasta system("sierrapy fasta hiv_test.fasta -o hiv_test_output.json") # read json output jsonlite::read_json("hiv_test_output.0.json", simplifyVector = T)$drugResistance[[1]][3]$drugScores
AY900571.2
accession.
Trial 2 ✅ < svg class="anchor-symbol" aria-hidden="true" height="26" width="26" viewBox="0 0 22 22" xmlns="http://www.w3.org/2000/svg"> < path d="M0 0h24v24H0z" fill="currentColor"> < path d="M3.9 12c0-1.71 1.39-3.1 3.1-3.1h4V7H7c-2.76.0-5 2.24-5 5s2.24 5 5 5h4v-1.9H7c-1.71.0-3.1-1.39-3.1-3.1zM8 13h8v-2H8v2zm9-6h-4v1.9h4c1.71.0 3.1 1.39 3.1 3.1s-1.39 3.1-3.1 3.1h-4V17h4c2.76.0 5-2.24 5-5s-2.24-5-5-5z">
Let’s try another one.
Our code
< details> < summary>codesample <- sample(all_hiv_genome,1) |> names() |> str_split(" ") |> unlist() |> _[1] hiv_genotype(sample = sample)
Stanford’s
< details> < summary>codeall_hiv_genome[all_hiv_genome |> names() |> str_detect(sample)] |> writeXStringSet("hiv_test.fasta") # run sierrapy input fasta system("sierrapy fasta hiv_test.fasta -o hiv_test_output.json") # read json output jsonlite::read_json("hiv_test_output.0.json", simplifyVector = T)$drugResistance[[1]][3]$drugScores
Awesome! No resistance at all! That was FM877777.1
. Let’s give another a go!
Trial 3 ✅ < svg class="anchor-symbol" aria-hidden="true" height="26" width="26" viewBox="0 0 22 22" xmlns="http://www.w3.org/2000/svg"> < path d="M0 0h24v24H0z" fill="currentColor"> < path d="M3.9 12c0-1.71 1.39-3.1 3.1-3.1h4V7H7c-2.76.0-5 2.24-5 5s2.24 5 5 5h4v-1.9H7c-1.71.0-3.1-1.39-3.1-3.1zM8 13h8v-2H8v2zm9-6h-4v1.9h4c1.71.0 3.1 1.39 3.1 3.1s-1.39 3.1-3.1 3.1h-4V17h4c2.76.0 5-2.24 5-5s-2.24-5-5-5z">
Our code
< details> < summary>codesample <- sample(all_hiv_genome,1) |> names() |> str_split(" ") |> unlist() |> _[1] hiv_genotype(sample = sample)
Stanford’s
< details> < summary>codeall_hiv_genome[all_hiv_genome |> names() |> str_detect(sample)] |> writeXStringSet("hiv_test.fasta") # run sierrapy input fasta system("sierrapy fasta hiv_test.fasta -o hiv_test_output.json") # read json output jsonlite::read_json("hiv_test_output.0.json", simplifyVector = T)$drugResistance[[1]][3]$drugScores
There you go! Our algorithm works! That was EU448295.1
. Let’s do another one!
Trial 4 ✅ < svg class="anchor-symbol" aria-hidden="true" height="26" width="26" viewBox="0 0 22 22" xmlns="http://www.w3.org/2000/svg"> < path d="M0 0h24v24H0z" fill="currentColor"> < path d="M3.9 12c0-1.71 1.39-3.1 3.1-3.1h4V7H7c-2.76.0-5 2.24-5 5s2.24 5 5 5h4v-1.9H7c-1.71.0-3.1-1.39-3.1-3.1zM8 13h8v-2H8v2zm9-6h-4v1.9h4c1.71.0 3.1 1.39 3.1 3.1s-1.39 3.1-3.1 3.1h-4V17h4c2.76.0 5-2.24 5-5s-2.24-5-5-5z">
Our code
< details> < summary>codesample <- sample(all_hiv_genome,1) |> names() |> str_split(" ") |> unlist() |> _[1] hiv_genotype(sample = sample)
Stanford’s
< details> < summary>codeall_hiv_genome[all_hiv_genome |> names() |> str_detect(sample)] |> writeXStringSet("hiv_test.fasta") # run sierrapy input fasta system("sierrapy fasta hiv_test.fasta -o hiv_test_output.json") # read json output jsonlite::read_json("hiv_test_output.0.json", simplifyVector = T)$drugResistance[[1]][3]$drugScores
There you go! Our algorithm works again! That was EU735539.1
. Let’s do our last one.
Trial 5 ✅ < svg class="anchor-symbol" aria-hidden="true" height="26" width="26" viewBox="0 0 22 22" xmlns="http://www.w3.org/2000/svg"> < path d="M0 0h24v24H0z" fill="currentColor"> < path d="M3.9 12c0-1.71 1.39-3.1 3.1-3.1h4V7H7c-2.76.0-5 2.24-5 5s2.24 5 5 5h4v-1.9H7c-1.71.0-3.1-1.39-3.1-3.1zM8 13h8v-2H8v2zm9-6h-4v1.9h4c1.71.0 3.1 1.39 3.1 3.1s-1.39 3.1-3.1 3.1h-4V17h4c2.76.0 5-2.24 5-5s-2.24-5-5-5z">
Our code
< details> < summary>codesample <- sample(all_hiv_genome,1) |> names() |> str_split(" ") |> unlist() |> _[1] hiv_genotype(sample = sample)
Stanford’s
< details> < summary>codeall_hiv_genome[all_hiv_genome |> names() |> str_detect(sample)] |> writeXStringSet("hiv_test.fasta") # run sierrapy input fasta system("sierrapy fasta hiv_test.fasta -o hiv_test_output.json") # read json output jsonlite::read_json("hiv_test_output.0.json", simplifyVector = T)$drugResistance[[1]][3]$drugScores
That was AF443091.1
. And 5 of 5 correct ✅ ! Hurray !!!!
Final Thoughts < svg class="anchor-symbol" aria-hidden="true" height="26" width="26" viewBox="0 0 22 22" xmlns="http://www.w3.org/2000/svg"> < path d="M0 0h24v24H0z" fill="currentColor"> < path d="M3.9 12c0-1.71 1.39-3.1 3.1-3.1h4V7H7c-2.76.0-5 2.24-5 5s2.24 5 5 5h4v-1.9H7c-1.71.0-3.1-1.39-3.1-3.1zM8 13h8v-2H8v2zm9-6h-4v1.9h4c1.71.0 3.1 1.39 3.1 3.1s-1.39 3.1-3.1 3.1h-4V17h4c2.76.0 5-2.24 5-5s-2.24-5-5-5z">
Hands down using SierraPy
or Sierra
-type app is the way to go, if we want reproducibility. There are options for local apps out there (see below). But learning to reproduce this really helps me understand the labeling of mutation and the way to get to susceptibility scoring better! It was a bumpy road, lots of trials, and lots more error and failure, but it ultimately it was a great learning experience. My respect for all these developers and researchers who have worked on this for years, really goes up! ❤️ It’s not easy at all! Even with the help of LLM!
Opportunities for Improvement < svg class="anchor-symbol" aria-hidden="true" height="26" width="26" viewBox="0 0 22 22" xmlns="http://www.w3.org/2000/svg"> < path d="M0 0h24v24H0z" fill="currentColor"> < path d="M3.9 12c0-1.71 1.39-3.1 3.1-3.1h4V7H7c-2.76.0-5 2.24-5 5s2.24 5 5 5h4v-1.9H7c-1.71.0-3.1-1.39-3.1-3.1zM8 13h8v-2H8v2zm9-6h-4v1.9h4c1.71.0 3.1 1.39 3.1 3.1s-1.39 3.1-3.1 3.1h-4V17h4c2.76.0 5-2.24 5-5s-2.24-5-5-5z">
- Need to rewrite the interpretation code into a function so that we don’t have to paste 3 different times or modify 3 different times on the
load_hivdb
function. - Learn the exact algorithm sierra uses for alignment, especially when it comes to deletion/insertion
- Look at
sierra-local
to see how they usenucamino
to align and find mutations/indels - prettify the results with
gt
- include more ARTs
- look into Stanford HIVdb NGS analysis
Lessons Learnt < svg class="anchor-symbol" aria-hidden="true" height="26" width="26" viewBox="0 0 22 22" xmlns="http://www.w3.org/2000/svg"> < path d="M0 0h24v24H0z" fill="currentColor"> < path d="M3.9 12c0-1.71 1.39-3.1 3.1-3.1h4V7H7c-2.76.0-5 2.24-5 5s2.24 5 5 5h4v-1.9H7c-1.71.0-3.1-1.39-3.1-3.1zM8 13h8v-2H8v2zm9-6h-4v1.9h4c1.71.0 3.1 1.39 3.1 3.1s-1.39 3.1-3.1 3.1h-4V17h4c2.76.0 5-2.24 5-5s-2.24-5-5-5z">
- We did use genotype-phenotype DRMcv but found the results not reproducible and also does not match the mutation interpretation results, probably user error.
- After multiple attempts to align seq, map to reference, then translate, just found out that
DECIPHER::AlignTranslation()
does a better job at this! Don’t useAlignSeq
for aligning amino acids! - learnt that the position of M184V, K65R etc are RT, PR, IN gene specific!
- learnt that Stanford HIVdb assumes mutation of different variants if
X
. For example, ifM184X
then it could beM184I
orM184V
, assume the worst case scenario and assign the max mutation score for that position.
If you like this article:
- please feel free to send me a comment or visit my other blogs
- please feel free to follow me on BlueSky, twitter, GitHub or Mastodon
- if you would like collaborate please feel free to contact me
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.