My Attempt To Reproduce Stanford HIVdb Sequence and Mutation Analysis From Scratch

r on Everyday Is A School Day

1 day ago

[This article was first published on r on Everyday Is A School Day, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Ever wondered what M184V, K65R actually mean? I learnt it from rebuilding Stanford’s HIV resistance algorithm from scratch to find out. Spoiler: it took tons of code to match their 3-line tool. But the lesson was worth it

Motivations: < svg class="anchor-symbol" aria-hidden="true" height="26" width="26" viewBox="0 0 22 22" xmlns="http://www.w3.org/2000/svg"> < path d="M0 0h24v24H0z" fill="currentColor"> < path d="M3.9 12c0-1.71 1.39-3.1 3.1-3.1h4V7H7c-2.76.0-5 2.24-5 5s2.24 5 5 5h4v-1.9H7c-1.71.0-3.1-1.39-3.1-3.1zM8 13h8v-2H8v2zm9-6h-4v1.9h4c1.71.0 3.1 1.39 3.1 3.1s-1.39 3.1-3.1 3.1h-4V17h4c2.76.0 5-2.24 5-5s-2.24-5-5-5z">

We’ve all learnt and memorize what those letters and numbers mean when it comes to antiretroviral resistance. Since we’ve been exploring genomics lately, let’s take another look at HIV genotype. Stanford University HIVdb is an amazing resource! I’ve always been confused with all these letters and have difficulty understanding how to even check for genotype resistance because all these numbers and letters are quite confusing and intimidating. Let’s put on our bioconductor hat and revisit this topic and see if we can at least get a better understanding on what these letters and numbers mean. Better yet, use this opportunity to try to reproduce the algorithm that tells us the susceptibility of the ART! Hang tight on this one, it’s going to be a bumpy road!

Objectives < svg class="anchor-symbol" aria-hidden="true" height="26" width="26" viewBox="0 0 22 22" xmlns="http://www.w3.org/2000/svg"> < path d="M0 0h24v24H0z" fill="currentColor"> < path d="M3.9 12c0-1.71 1.39-3.1 3.1-3.1h4V7H7c-2.76.0-5 2.24-5 5s2.24 5 5 5h4v-1.9H7c-1.71.0-3.1-1.39-3.1-3.1zM8 13h8v-2H8v2zm9-6h-4v1.9h4c1.71.0 3.1 1.39 3.1 3.1s-1.39 3.1-3.1 3.1h-4V17h4c2.76.0 5-2.24 5-5s-2.24-5-5-5z">

Find HIV Reference Gene < svg class="anchor-symbol" aria-hidden="true" height="26" width="26" viewBox="0 0 22 22" xmlns="http://www.w3.org/2000/svg"> < path d="M0 0h24v24H0z" fill="currentColor"> < path d="M3.9 12c0-1.71 1.39-3.1 3.1-3.1h4V7H7c-2.76.0-5 2.24-5 5s2.24 5 5 5h4v-1.9H7c-1.71.0-3.1-1.39-3.1-3.1zM8 13h8v-2H8v2zm9-6h-4v1.9h4c1.71.0 3.1 1.39 3.1 3.1s-1.39 3.1-3.1 3.1h-4V17h4c2.76.0 5-2.24 5-5s-2.24-5-5-5z">

From most of the searching, the reference gene appears to be HXB2. We know that pol gene encodes protease (PR), reverse transcriptase (RT) and integrase (IN). Below is directly from NCBI. source. You can click on it and hover over those regions to see what they are.

Take note that all these genes reside in specific position. So the trick is to use a reference, in our case HXB2, locate these 3 genes (RT, PR, IN), then extract them and make them into a database. Some reference is pretty good at letting you know where these positions are, some don’t! As for our reference, it didn’t really say either! We’ll have to look around reference and see if we can get those position.

library(Biostrings)
library(DECIPHER)
library(tidyverse)

## We downloaded a bunch of HIV genomes, including the reference known as K03455 (aka HXB2)
(all_hiv_genome <- readDNAStringSet("all_hiv_whole_genome.fasta"))

Wow, look at that. HIV genome is only around 9000bp! Whereas bacteria such as Ecoli it was about 4 Mb.

## load reference
hxb2_idx <- grep("K03455", names(all_hiv_genome), ignore.case = TRUE)
hxb2_genome <- all_hiv_genome[hxb2_idx]  

## locate the genes
rt_sequence <- subseq(hxb2_genome, start = 2550, end = 4229)
pi_sequence <- subseq(hxb2_genome, start = 2253, end = 2549)
int_sequence <- subseq(hxb2_genome, start = 4230, end = 5093)

# Rename sequences with informative names
names(pi_sequence) <- "HIV_PR"
names(rt_sequence) <- "HIV_RT"
names(int_sequence) <- "HIV_INT"

# Combine all three into one DNAStringSet
pol_regions <- c(pi_sequence, rt_sequence, int_sequence)

# Write to a single FASTA file
writeXStringSet(pol_regions, "hiv_pol_regions.fasta")

# make it into blast database
system("makeblastdb -in hiv_pol_regions.fasta -dbtype nucl -out /path/to/hiv/hiv_pol_db -parse_seqids")

With the above we’re basically setting up a databse for the reference pol gene of PR, RT, and IN. We will then find a genome of interest and use blast to identify where these genes are located on the sample sequence like so.

## use blast to find where these genes positions are
system('blastn -query all_hiv_whole_genome.fasta -db /path/to/hiv/hiv_pol_db -word_size 7 -evalue 0.0000001 -outfmt 6 -out hiv_sample_blast_results.txt')

## read it
colnames <- c("qseqid", "sseqid", "pident", "length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send", "evalue", "bitscore")

(all_hiv_sample <- read_tsv("hiv_sample_blast_results.txt", col_names = colnames))

Notice sseqid section, we have HIV_RT, HIV_INT, HIV_PR. We just need to filter these genes, extract the genes via their positions qstart and qend. Let’s take a look at U63632.1

sample1 <- all_hiv_sample |>
  filter(str_detect(qseqid, "U63632.1")) |> 
  filter(sseqid == "HIV_RT") |> 
  select(qseqid, qstart, qend)

OK, how do we go from here to knowing what’s mutated?

Find The Resistance < svg class="anchor-symbol" aria-hidden="true" height="26" width="26" viewBox="0 0 22 22" xmlns="http://www.w3.org/2000/svg"> < path d="M0 0h24v24H0z" fill="currentColor"> < path d="M3.9 12c0-1.71 1.39-3.1 3.1-3.1h4V7H7c-2.76.0-5 2.24-5 5s2.24 5 5 5h4v-1.9H7c-1.71.0-3.1-1.39-3.1-3.1zM8 13h8v-2H8v2zm9-6h-4v1.9h4c1.71.0 3.1 1.39 3.1 3.1s-1.39 3.1-3.1 3.1h-4V17h4c2.76.0 5-2.24 5-5s-2.24-5-5-5z">

All those letters, M184V, K65R etc, the letters and numbers must mean something right? You’re right! The numbers are position, but what about the letters? They don’t really look like ATCG, do they? They’re amino acids! For example, the M on M184 stands for Methionine, the V stands for Valine. So M184V means at position 184, Methionine has mutated to Valine. Similarly, K65R means at position 65, Lysine has mutated to Arginine.

Amino acids and their letters:
A: Alanine, C: Cysteine, D: Aspartic Acid, E: Glutamic Acid, F: Phenylalanine, G: Glycine, H: Histidine, I: Isoleucine, K: Lysine, L: Leucine, M: Methionine, N: Asparagine, P: Proline, Q: Glutamine, R: Arginine, S: Serine, T: Threonine, V: Valine, W: Tryptophan, Y: Tyrosine

his also means that our translated RT reference HIV gene HXB2 at position 184, we should expect the amino acid to be M. Let’s take a look

(rt_t <- translate(rt_sequence))

Wow! We’re used to seeing DNA sequence colors but now take a look at the different amino acid colors! How pretty! Now let’s look at position 184.

subseq(rt_t, 184,184)

Alright! M indeed! Now, let’s go through the workflow of using this information to run through our sample

Workflow < svg class="anchor-symbol" aria-hidden="true" height="26" width="26" viewBox="0 0 22 22" xmlns="http://www.w3.org/2000/svg"> < path d="M0 0h24v24H0z" fill="currentColor"> < path d="M3.9 12c0-1.71 1.39-3.1 3.1-3.1h4V7H7c-2.76.0-5 2.24-5 5s2.24 5 5 5h4v-1.9H7c-1.71.0-3.1-1.39-3.1-3.1zM8 13h8v-2H8v2zm9-6h-4v1.9h4c1.71.0 3.1 1.39 3.1 3.1s-1.39 3.1-3.1 3.1h-4V17h4c2.76.0 5-2.24 5-5s-2.24-5-5-5z">

TL;DR Extract region -> Align Translation of AA with Reference & Assess Mutation -> Calculate Mutation Score

Identify/Extract Regions < svg class="anchor-symbol" aria-hidden="true" height="26" width="26" viewBox="0 0 22 22" xmlns="http://www.w3.org/2000/svg"> < path d="M0 0h24v24H0z" fill="currentColor"> < path d="M3.9 12c0-1.71 1.39-3.1 3.1-3.1h4V7H7c-2.76.0-5 2.24-5 5s2.24 5 5 5h4v-1.9H7c-1.71.0-3.1-1.39-3.1-3.1zM8 13h8v-2H8v2zm9-6h-4v1.9h4c1.71.0 3.1 1.39 3.1 3.1s-1.39 3.1-3.1 3.1h-4V17h4c2.76.0 5-2.24 5-5s-2.24-5-5-5z">

sample1 <- all_hiv_sample |>
  filter(str_detect(qseqid, "U63632.1")) |> 
  filter(sseqid == "HIV_RT") |> 
  select(qseqid, qstart, qend)

sample_rt <- subseq(
  all_hiv_genome[str_detect(names(all_hiv_genome), sample1$qseqid)],
  start = sample1$qstart,
  end = sample1$qend
)

We’ve done this process above, but for other genes, we’d need to write a function to change HIV_RT to the others.

Translate < svg class="anchor-symbol" aria-hidden="true" height="26" width="26" viewBox="0 0 22 22" xmlns="http://www.w3.org/2000/svg"> < path d="M0 0h24v24H0z" fill="currentColor"> < path d="M3.9 12c0-1.71 1.39-3.1 3.1-3.1h4V7H7c-2.76.0-5 2.24-5 5s2.24 5 5 5h4v-1.9H7c-1.71.0-3.1-1.39-3.1-3.1zM8 13h8v-2H8v2zm9-6h-4v1.9h4c1.71.0 3.1 1.39 3.1 3.1s-1.39 3.1-3.1 3.1h-4V17h4c2.76.0 5-2.24 5-5s-2.24-5-5-5z">

Just for demonstration purposes:

#### Take note, this entire code chunk is not needed !!! 
seq_length <- sample1$qend - sample1$qstart + 1
adjusted_length <- seq_length - (seq_length %% 3)

sample_rt <- subseq(
  all_hiv_genome[str_detect(names(all_hiv_genome), sample1$qseqid)],
  start = sample1$qstart,
  end = sample1$qstart + adjusted_length - 1
)

(sample_rt_t <- translate(sample_rt, if.fuzzy.codon = "solve"))

The reason we had to adjust length, sometimes, is that the sequence may not be perfectly divisible by 3. Since codons are groups of 3 nucleotides, any extra nucleotides that don’t form a complete codon would lead to an incomplete translation. By adjusting the length to be divisible by 3, we ensure that the translation process can proceed without any issues. During translation process, we could also use solve to handle any fuzzy codons that may arise due to sequencing errors or ambiguities in the nucleotide sequence. This ensures that the translation process can still proceed even if there are some uncertainties in the input sequence.

But in reality, we can align them directly from our DNA sequence with the help of DECIPHER, it will align the translated AA!

Align Translation & Assess Mutation < svg class="anchor-symbol" aria-hidden="true" height="26" width="26" viewBox="0 0 22 22" xmlns="http://www.w3.org/2000/svg"> < path d="M0 0h24v24H0z" fill="currentColor"> < path d="M3.9 12c0-1.71 1.39-3.1 3.1-3.1h4V7H7c-2.76.0-5 2.24-5 5s2.24 5 5 5h4v-1.9H7c-1.71.0-3.1-1.39-3.1-3.1zM8 13h8v-2H8v2zm9-6h-4v1.9h4c1.71.0 3.1 1.39 3.1 3.1s-1.39 3.1-3.1 3.1h-4V17h4c2.76.0 5-2.24 5-5s-2.24-5-5-5z">

# this automatically translate aligned seq into aligned AA, sweet !!! 
alignseq <- AlignTranslation(c(rt_sequence,sample_rt), type="AAStringSet")

# turn alignment into matrix
align_matrix <- as.matrix(alignseq)

# extract alignment on both ref and sample
ref_seq <- align_matrix[1,]
sample_seq <- align_matrix[2,]

# find position where there is mutation
mutation_positions <- which(ref_seq != sample_seq & ref_seq != "-" & sample_seq != "-")

# load into dataframe
mutations <- data.frame(
  position = mutation_positions,
  reference = ref_seq[mutation_positions],
  sample = sample_seq[mutation_positions],
  mutation = paste0(ref_seq[mutation_positions], mutation_positions, sample_seq[mutation_positions])
) |>
  mutate(position_replace = paste0(position,sample))

mutations$mutation

Lots of mutations. HIV viruses are notorious for SNPs which may or may not be clinically significant. And with just a single nucleotide mutation, as you can see the translated amino acid could be different from reference. OK, since we know how to change for mutations, how do we know if this is the same as Stanford HIVdb? Let’s copy and paste the entire genome of this sample and paste it here

## easy way of copying to system so we can paste on Stanford website (link above)
all_hiv_genome[str_detect(names(all_hiv_genome),"U63632.1")] |> as.character() |> clipr::write_clip()

Not too shabby! The one mutation that is significant is M348I. We were able to capture that, not too shabby! Now, how do we go from here to the inferred susceptibility of NRTIs and NNRTIs? In come the mutation score algorithm!

Calculate Mutation Score < svg class="anchor-symbol" aria-hidden="true" height="26" width="26" viewBox="0 0 22 22" xmlns="http://www.w3.org/2000/svg"> < path d="M0 0h24v24H0z" fill="currentColor"> < path d="M3.9 12c0-1.71 1.39-3.1 3.1-3.1h4V7H7c-2.76.0-5 2.24-5 5s2.24 5 5 5h4v-1.9H7c-1.71.0-3.1-1.39-3.1-3.1zM8 13h8v-2H8v2zm9-6h-4v1.9h4c1.71.0 3.1 1.39 3.1 3.1s-1.39 3.1-3.1 3.1h-4V17h4c2.76.0 5-2.24 5-5s-2.24-5-5-5z">

This part is very interesting. My initial trial was using their [genotype-phenotype DRMcv model](( https://hivdb.stanford.edu/download/GenoPhenoDatasets/DRMcv.R) but I couldn’t reproduce most of the results. And realized that Stanford HIVdb Sequance Analysis uses another algorithm. And this algorithm, at least for the final product, is more straight forward than trying to use a model to reproduce a prediction, which was what the DRMcv was using glmnet and lasso. There is also extensive documentation. All of the csv files below were obtained from the links provided from the documentation. Let’s code!

< details> < summary>code

load_hivdb <- function(dataset, mutations){
  
if (dataset=="NRTI") {  
mut_score <- read_csv("hivdb_nrti_single.csv")
mut_score_combo <- read_csv("hivdb_nrti_combo.csv")
n <- 2
m <- 4
}

if (dataset=="NNRTI") {
  mut_score <- read_csv("hivdb_nnrti_single.csv")
  mut_score_combo <- read_csv("hivdb_nnrti_combo.csv")
  n <- 2
  m <- 2
}  
  
if (dataset=="PI") {
    mut_score <- read_csv("hivdb_pi_single.csv")
    mut_score_combo <- read_csv("hivdb_pi_combo.csv")
    n <- 2
    m <- 3
}  
  
if (dataset=="INSTI") {
  mut_score <- read_csv("hivdb_insti_single.csv")
  mut_score_combo <- read_csv("hivdb_insti_combo.csv")
  n <- 2
  m <- 3
  
}

mut_interest <- c()
for (mut_i in mut_score |> pull(Rule)) { 
    add_mut <- F
    add_mut <-  mut_i %in% (mutations |> pull(mutation))
    if (add_mut) {
      mut_interest <- c(mut_interest, mut_i)
    }
    
}

if (length(mut_interest)==0) {
  return(mut_score |> 
           slice_sample(n = 0) |>  # Get 0 rows but keep structure
           select(-1) |> 
           summarise(across(everything(), ~0)) |>
            pivot_longer(cols = everything(), names_to = "ART", values_to = "score") |>
            mutate(levels = case_when(
              score <= 9 ~ 1,
              score >= 10 & score <= 14 ~ 2,
              score >= 15 & score <= 29 ~ 3,
              score >= 30 & score <= 59 ~ 4,
              score >= 60 ~ 5
            )) |>
            mutate(interpretation = case_when(
              levels == 1 ~ "susceptible",
              levels == 2 ~ "potential low-level resistance",
              levels == 3 ~ "low-level resistance",
              levels == 4 ~ "intermediate resistance",
              levels == 5 ~ "high-level resistance"
            )))
}

mut_combo_vec <- mut_score_combo |> 
  pull(combination_rule) |>
  paste(collapse = " + ") |>
  str_split(" \\+ ") |> 
  unlist() |>
  unique()

mut_combo_idx <- mut_combo_vec %in% (mutations |> pull(mutation))

# Function to sort mutations by position number
sort_mutations <- function(mutation_string) {
  # Split the string by " + "
  positions <- str_extract(mutation_string, "\\d+") |> as.numeric()
  
  # Sort mutations by their positions
  sorted_mutations <- mutation_string[order(positions)]
  
  return(sorted_mutations)
}

mut_combo_seq <- sort_mutations(mut_combo_vec[mut_combo_idx])

# Function to create combinations of a given size
create_combinations <- function(mutations_, size) {
  combn(mutations_, size, FUN = function(x) paste(x, collapse = " + "), simplify = TRUE)
}

if (length(mut_combo_seq)==1) { 
  mut_score_sum <- mut_score |>
    filter(Rule %in% mut_interest) |>
    select(-1) |>
    summarize(across(everything(), sum)) |>
    pivot_longer(cols = everything(), names_to = "ART", values_to = "score") |>
    mutate(levels = case_when(
      score <= 9 ~ 1,
      score >= 10 & score <= 14 ~ 2,
      score >= 15 & score <= 29 ~ 3,
      score >= 30 & score <= 59 ~ 4,
      score >= 60 ~ 5
    )) |>
    mutate(interpretation = case_when(
      levels == 1 ~ "susceptible",
      levels == 2 ~ "potential low-level resistance",
      levels == 3 ~ "low-level resistance",
      levels == 4 ~ "intermediate resistance",
      levels == 5 ~ "high-level resistance"
    ))
  
  } else {

# Generate combinations of size 2, 3, and 4
all_combinations <- map(n:m, ~create_combinations(mut_combo_seq, .x)) |> 
  unlist()

# filter from mut_score_combo
mut_score_combo_df <- mut_score_combo |>
  filter(combination_rule %in% all_combinations) |>
  rename(Rule = combination_rule)

# sum single + combo
mut_score_sum <- mut_score |>
  filter(Rule %in% mut_interest) |>
  bind_rows(mut_score_combo_df) |>
  select(-1) |>
  summarize(across(everything(), sum)) |>
  pivot_longer(cols = everything(), names_to = "ART", values_to = "score") |>
    mutate(levels = case_when(
      score <= 9 ~ 1,
      score >= 10 & score <= 14 ~ 2,
      score >= 15 & score <= 29 ~ 3,
      score >= 30 & score <= 59 ~ 4,
      score >= 60 ~ 5
    )) |>
    mutate(interpretation = case_when(
      levels == 1 ~ "susceptible",
      levels == 2 ~ "potential low-level resistance",
      levels == 3 ~ "low-level resistance",
      levels == 4 ~ "intermediate resistance",
      levels == 5 ~ "high-level resistance"
    ))
}

return(mut_score_sum)
}

(nrti <- load_hivdb("NRTI", mutations))
(nnrti <- load_hivdb("NNRTI", mutations))

Wow, long code but we were able to reproduce the same score as Stanford HIVdb! Not too shabby. Take note that if certain mutations exist together, there is another table that shows additional penalty score! For example, if our sequence contain M41L + T215FY mutation, the total mutation score for TDF is 5+10+10=25, it’s not just 5+10. see here for full table.

Now, we just need to repeat this for PR and IN, write it in a pipeline and there you have it! Let’s look at their official web service via sierrapy and then see if we can reproduce with our algorithm!

Stanford Sierrapy < svg class="anchor-symbol" aria-hidden="true" height="26" width="26" viewBox="0 0 22 22" xmlns="http://www.w3.org/2000/svg"> < path d="M0 0h24v24H0z" fill="currentColor"> < path d="M3.9 12c0-1.71 1.39-3.1 3.1-3.1h4V7H7c-2.76.0-5 2.24-5 5s2.24 5 5 5h4v-1.9H7c-1.71.0-3.1-1.39-3.1-3.1zM8 13h8v-2H8v2zm9-6h-4v1.9h4c1.71.0 3.1 1.39 3.1 3.1s-1.39 3.1-3.1 3.1h-4V17h4c2.76.0 5-2.24 5-5s-2.24-5-5-5z">

Let’s explore a much simpler and expert-driven approach. Let’s try Stanford HIVdb’s SierraPy, their python equivalent. This uses their web service, hence will need internet and also need to send sequence as well. Very fast.

# install
pip install sierrapy

# write fasta of sample of interest, let's take a look at KJ849778.1
all_hiv_genome[all_hiv_genome |> names() |> str_detect("KJ849778.1")] |> writeXStringSet("hiv_test.fasta")

# run sierrapy input fasta
system("sierrapy fasta hiv_test.fasta -o hiv_test_output.json")

# read json output
jsonlite::read_json("hiv_test_output.0.json", simplifyVector = T)$drugResistance[[1]][3]$drugScores

Alright. 🤔 with 3 lines of code after you have an assembled HIV whole genome, Sierra web service via Sierrapy will provide you the susceptibility interpretation within seconds! The json also provides you with the alignments as well! You do need internet for this thought, if you’re looking for local interpretation then can look into Sierra Local. There are lots of ways to get to this without having the have heavy codes like we did! lol, but it’s great to know how it works under the hood!

Now, since we know what Stanford HIVdb shows, let’s see if our algorithm will return the same result!

< details> < summary>code

extract_align <- function(sample,class) {
  
if (!exists("hxb2_genome")) { stop("you need load/enter hxb2_genome") } 
  
sample1 <- all_hiv_sample |>
  filter(str_detect(qseqid, sample)) |> 
  filter(sseqid == class) |> 
  select(qseqid, qstart, qend)

sample_rt <- subseq(
  all_hiv_genome[str_detect(names(all_hiv_genome), sample1$qseqid)],
  start = sample1$qstart,
  end = sample1$qend
)


## locate the genes
if (class == "HIV_RT") { rt_sequence <- subseq(hxb2_genome, start = 2550, end = 4229) }
if (class == "HIV_PR") { rt_sequence <- subseq(hxb2_genome, start = 2253, end = 2549) }
if (class == "HIV_INT") { rt_sequence <- subseq(hxb2_genome, start = 4230, end = 5093) }

# this automatically translate aligned seq into aligned AA, sweet !!! 
alignseq <- AlignTranslation(c(rt_sequence,sample_rt), type="AAStringSet", verbose = F)

# turn alignment into matrix
align_matrix <- as.matrix(alignseq)

# extract alignment on both ref and sample
ref_seq <- align_matrix[1,]
sample_seq <- align_matrix[2,]

# find position where there is mutation
mutation_positions <- which(ref_seq != sample_seq & ref_seq != "-" & sample_seq != "-")

# load into dataframe
mutations <- data.frame(
  position = mutation_positions,
  reference = ref_seq[mutation_positions],
  sample = sample_seq[mutation_positions],
  mutation = paste0(ref_seq[mutation_positions], mutation_positions, sample_seq[mutation_positions])
) |>
  mutate(position_replace = paste0(position,sample))

return(mutations)
}

load_hivdb <- function(dataset, mutations){
  
if (dataset=="NRTI") {  
mut_score <- read_csv("hivdb_nrti_single.csv")
mut_score_combo <- read_csv("hivdb_nrti_combo.csv")
n <- 2
m <- 4
}

if (dataset=="NNRTI") {
  mut_score <- read_csv("hivdb_nnrti_single.csv",show_col_types = FALSE)
  mut_score_combo <- read_csv("hivdb_nnrti_combo.csv",show_col_types = FALSE)
  n <- 2
  m <- 2
}  
  
if (dataset=="PI") {
    mut_score <- read_csv("hivdb_pi_single.csv",show_col_types = FALSE)
    mut_score_combo <- read_csv("hivdb_pi_combo.csv",show_col_types = FALSE)
    n <- 2
    m <- 3
}  
  
if (dataset=="INSTI") {
  mut_score <- read_csv("hivdb_insti_single.csv",show_col_types = FALSE)
  mut_score_combo <- read_csv("hivdb_insti_combo.csv",show_col_types = FALSE)
  n <- 2
  m <- 3
  
}

mut_interest <- c()
for (mut_i in mut_score |> pull(Rule)) { 
    add_mut <- F
    add_mut <-  mut_i %in% (mutations |> pull(mutation))
    if (add_mut) {
      mut_interest <- c(mut_interest, mut_i)
    }
    
}

if (length(mut_interest)==0) {
  return(mut_score |> 
           slice_sample(n = 0) |>  # Get 0 rows but keep structure
           select(-1) |> 
           summarise(across(everything(), ~0)) |>
            pivot_longer(cols = everything(), names_to = "ART", values_to = "score") |>
            mutate(levels = case_when(
              score <= 9 ~ 1,
              score >= 10 & score <= 14 ~ 2,
              score >= 15 & score <= 29 ~ 3,
              score >= 30 & score <= 59 ~ 4,
              score >= 60 ~ 5
            )) |>
            mutate(interpretation = case_when(
              levels == 1 ~ "susceptible",
              levels == 2 ~ "potential low-level resistance",
              levels == 3 ~ "low-level resistance",
              levels == 4 ~ "intermediate resistance",
              levels == 5 ~ "high-level resistance"
            )))
}

mut_combo_vec <- mut_score_combo |> 
  pull(combination_rule) |>
  paste(collapse = " + ") |>
  str_split(" \\+ ") |> 
  unlist() |>
  unique()

mut_combo_idx <- mut_combo_vec %in% (mutations |> pull(mutation))

# Function to sort mutations by position number
sort_mutations <- function(mutation_string) {
  # Split the string by " + "
  positions <- str_extract(mutation_string, "\\d+") |> as.numeric()
  
  # Sort mutations by their positions
  sorted_mutations <- mutation_string[order(positions)]
  
  return(sorted_mutations)
}

mut_combo_seq <- sort_mutations(mut_combo_vec[mut_combo_idx])

# Function to create combinations of a given size
create_combinations <- function(mutations_, size) {
  combn(mutations_, size, FUN = function(x) paste(x, collapse = " + "), simplify = TRUE)
}

if (length(mut_combo_seq)==1) { 
  mut_score_sum <- mut_score |>
    filter(Rule %in% mut_interest) |>
    select(-1) |>
    summarize(across(everything(), sum)) |>
    pivot_longer(cols = everything(), names_to = "ART", values_to = "score") |>
    mutate(levels = case_when(
      score <= 9 ~ 1,
      score >= 10 & score <= 14 ~ 2,
      score >= 15 & score <= 29 ~ 3,
      score >= 30 & score <= 59 ~ 4,
      score >= 60 ~ 5
    )) |>
    mutate(interpretation = case_when(
      levels == 1 ~ "susceptible",
      levels == 2 ~ "potential low-level resistance",
      levels == 3 ~ "low-level resistance",
      levels == 4 ~ "intermediate resistance",
      levels == 5 ~ "high-level resistance"
    ))
  
  } else {

# Generate combinations of size 2, 3, and 4
all_combinations <- map(n:m, ~create_combinations(mut_combo_seq, .x)) |> 
  unlist()

# filter from mut_score_combo
mut_score_combo_df <- mut_score_combo |>
  filter(combination_rule %in% all_combinations) |>
  rename(Rule = combination_rule)

# sum single + combo
mut_score_sum <- mut_score |>
  filter(Rule %in% mut_interest) |>
  bind_rows(mut_score_combo_df) |>
  select(-1) |>
  summarize(across(everything(), sum)) |>
  pivot_longer(cols = everything(), names_to = "ART", values_to = "score") |>
    mutate(levels = case_when(
      score <= 9 ~ 1,
      score >= 10 & score <= 14 ~ 2,
      score >= 15 & score <= 29 ~ 3,
      score >= 30 & score <= 59 ~ 4,
      score >= 60 ~ 5
    )) |>
    mutate(interpretation = case_when(
      levels == 1 ~ "susceptible",
      levels == 2 ~ "potential low-level resistance",
      levels == 3 ~ "low-level resistance",
      levels == 4 ~ "intermediate resistance",
      levels == 5 ~ "high-level resistance"
    ))
}

return(mut_score_sum)
}

hiv_genotype <- function(sample="KJ849778.1") {
class_group <- c("HIV_RT","HIV_PR","HIV_INT")
for (class in class_group) {
  mutations <- extract_align(sample, class)
  if (class == "HIV_RT") {
    print(load_hivdb("NRTI",mutations))
    print(load_hivdb("NNRTI",mutations))
  }
  if (class == "HIV_PR") {
    print(load_hivdb("PI",mutations))
  }
  if (class == "HIV_INT") {
    print(load_hivdb("INSTI",mutations))
  } 
}
}

WHAT !?!?! My algorithm failed 💔 !!! ❌❌❌ Noooo…!!! Ths iis very odd. Let’s inspect!

Let’s pick RT and inspect.

sample <- "KJ849778.1"
mutations <- extract_align(class = "HIV_RT", sample = sample)
mutations |> pull(mutation)

Let’s look at Stanford’s

Do you see what i’m seeing? Our aligned AA has a lot of X’s and these X’s coincide with Stanford’s. For example M184X on ours is Stanford’s M184MI. Our G190X is their G190EKR. 🤔 And some of the mutations are the same, so it’s not a frame issue. Oh wait !!! All X’s because we cannot assess what exactly the amino acid is, we can’t tell if there is mutation at all, hence we assume it could be any !?! That total makes sense! That means we’ll have to incorporate this into our algorithm! And also, not shown here, if there is missing, algorithm should also choose the max mutation score as penalty. Now let’s implement that!

< details> < summary>code

load_hivdb <- function(dataset, mutations){
  
if (dataset=="NRTI") {  
mut_score <- read_csv("hivdb_nrti_single.csv")
mut_score_combo <- read_csv("hivdb_nrti_combo.csv")
n <- 2
m <- 4
}

if (dataset=="NNRTI") {
  mut_score <- read_csv("hivdb_nnrti_single.csv",show_col_types = FALSE)
  mut_score_combo <- read_csv("hivdb_nnrti_combo.csv",show_col_types = FALSE)
  n <- 2
  m <- 2
}  
  
if (dataset=="PI") {
    mut_score <- read_csv("hivdb_pi_single.csv",show_col_types = FALSE)
    mut_score_combo <- read_csv("hivdb_pi_combo.csv",show_col_types = FALSE)
    n <- 2
    m <- 3
}  
  
if (dataset=="INSTI") {
  mut_score <- read_csv("hivdb_insti_single.csv",show_col_types = FALSE)
  mut_score_combo <- read_csv("hivdb_insti_combo.csv",show_col_types = FALSE)
  n <- 2
  m <- 3
  
}

mut_interest <- c()

# check if there is X
if (sum(mutations |> pull(mutation) |> str_detect("X$")) > 0) {
  mutations_vec <- str_replace(mutations |> pull(mutation), pattern = "X", replacement = "")
} else { 
  mutations_vec <- mutations |> pull(mutation)}

for (mut_i in mut_score |> pull(Rule)) { 
    add_mut <- F
    add_mut <-  str_detect(mut_i, paste(mutations_vec,collapse="|"))
    if (add_mut) {
      mut_interest <- c(mut_interest, mut_i)
    }
    
}

if (length(mut_interest)==0) {
  return(mut_score |> 
           slice_sample(n = 0) |>  # Get 0 rows but keep structure
           select(-1) |> 
           summarise(across(everything(), ~0)) |>
            pivot_longer(cols = everything(), names_to = "ART", values_to = "score") |>
            mutate(levels = case_when(
              score <= 9 ~ 1,
              score >= 10 & score <= 14 ~ 2,
              score >= 15 & score <= 29 ~ 3,
              score >= 30 & score <= 59 ~ 4,
              score >= 60 ~ 5
            )) |>
            mutate(interpretation = case_when(
              levels == 1 ~ "susceptible",
              levels == 2 ~ "potential low-level resistance",
              levels == 3 ~ "low-level resistance",
              levels == 4 ~ "intermediate resistance",
              levels == 5 ~ "high-level resistance"
            )))
}

mut_combo_vec <- mut_score_combo |> 
  pull(combination_rule) |>
  paste(collapse = " + ") |>
  str_split(" \\+ ") |> 
  unlist() |>
  unique()

mut_combo_idx <- mut_combo_vec %in% (mutations |> pull(mutation))

# Function to sort mutations by position number
sort_mutations <- function(mutation_string) {
  # Split the string by " + "
  positions <- str_extract(mutation_string, "\\d+") |> as.numeric()
  
  # Sort mutations by their positions
  sorted_mutations <- mutation_string[order(positions)]
  
  return(sorted_mutations)
}

mut_combo_seq <- sort_mutations(mut_combo_vec[mut_combo_idx])


# Function to create combinations of a given size
create_combinations <- function(mutations_, size) {
  combn(mutations_, size, FUN = function(x) paste(x, collapse = " + "), simplify = TRUE)
}

if (length(mut_combo_seq)==1|sum(mutations |> pull(mutation) |> str_detect("X$")) > 0) { 
  mut_score_sum <- mut_score |>
    filter(Rule %in% mut_interest) |>
    mutate(Rule_x = str_extract(Rule, "^[A-Z]{1}[0-9]+")) |>
    # distinct(Rule_x, .keep_all = T) |>
    group_by(Rule_x) |>
    summarize(across(everything(),max)) |>
    select(-Rule,-Rule_x) |>
    summarize(across(everything(), sum)) |>
    pivot_longer(cols = everything(), names_to = "ART", values_to = "score") |>
    mutate(levels = case_when(
      score <= 9 ~ 1,
      score >= 10 & score <= 14 ~ 2,
      score >= 15 & score <= 29 ~ 3,
      score >= 30 & score <= 59 ~ 4,
      score >= 60 ~ 5
    )) |>
    mutate(interpretation = case_when(
      levels == 1 ~ "susceptible",
      levels == 2 ~ "potential low-level resistance",
      levels == 3 ~ "low-level resistance",
      levels == 4 ~ "intermediate resistance",
      levels == 5 ~ "high-level resistance"
    ))
  
  } else {

# Generate combinations of size 2, 3, and 4
if (m > length(mut_combo_seq)) { m <- length(mut_combo_seq) }
all_combinations <- map(n:m, ~create_combinations(mut_combo_seq, .x)) |> 
  unlist()

# filter from mut_score_combo
mut_score_combo_df <- mut_score_combo |>
  filter(combination_rule %in% all_combinations) |>
  rename(Rule = combination_rule)

# sum single + combo
mut_score_sum <- mut_score |>
  filter(Rule %in% mut_interest) |>
  bind_rows(mut_score_combo_df) |>
  select(-1) |>
  summarize(across(everything(), sum)) |>
  pivot_longer(cols = everything(), names_to = "ART", values_to = "score") |>
    mutate(levels = case_when(
      score <= 9 ~ 1,
      score >= 10 & score <= 14 ~ 2,
      score >= 15 & score <= 29 ~ 3,
      score >= 30 & score <= 59 ~ 4,
      score >= 60 ~ 5
    )) |>
    mutate(interpretation = case_when(
      levels == 1 ~ "susceptible",
      levels == 2 ~ "potential low-level resistance",
      levels == 3 ~ "low-level resistance",
      levels == 4 ~ "intermediate resistance",
      levels == 5 ~ "high-level resistance"
    ))
}

return(mut_score_sum)
}

hiv_genotype()

Not too shabby! Now our NNRTI and PI look exactly the same! But our INSTI is not great! Mainly because there are a quite a few deletions that our alignment and their alignments don’t agree.

Let’s try another genome and see if our algorithm agrees with Stanford’s

Trial 1 ✅ < svg class="anchor-symbol" aria-hidden="true" height="26" width="26" viewBox="0 0 22 22" xmlns="http://www.w3.org/2000/svg"> < path d="M0 0h24v24H0z" fill="currentColor"> < path d="M3.9 12c0-1.71 1.39-3.1 3.1-3.1h4V7H7c-2.76.0-5 2.24-5 5s2.24 5 5 5h4v-1.9H7c-1.71.0-3.1-1.39-3.1-3.1zM8 13h8v-2H8v2zm9-6h-4v1.9h4c1.71.0 3.1 1.39 3.1 3.1s-1.39 3.1-3.1 3.1h-4V17h4c2.76.0 5-2.24 5-5s-2.24-5-5-5z">

Our code

sample <- sample(all_hiv_genome,1) |> names() |> str_split(" ") |> unlist() |> _[1]
hiv_genotype(sample = sample)

Stanford’s

all_hiv_genome[all_hiv_genome |> names() |> str_detect(sample)] |> writeXStringSet("hiv_test.fasta")

# run sierrapy input fasta
system("sierrapy fasta hiv_test.fasta -o hiv_test_output.json")

# read json output
jsonlite::read_json("hiv_test_output.0.json", simplifyVector = T)$drugResistance[[1]][3]$drugScores

Alright! Not too shabby! That was AY900571.2 accession.

Trial 2 ✅ < svg class="anchor-symbol" aria-hidden="true" height="26" width="26" viewBox="0 0 22 22" xmlns="http://www.w3.org/2000/svg"> < path d="M0 0h24v24H0z" fill="currentColor"> < path d="M3.9 12c0-1.71 1.39-3.1 3.1-3.1h4V7H7c-2.76.0-5 2.24-5 5s2.24 5 5 5h4v-1.9H7c-1.71.0-3.1-1.39-3.1-3.1zM8 13h8v-2H8v2zm9-6h-4v1.9h4c1.71.0 3.1 1.39 3.1 3.1s-1.39 3.1-3.1 3.1h-4V17h4c2.76.0 5-2.24 5-5s-2.24-5-5-5z">

Let’s try another one.

Our code

< details> < summary>code

sample <- sample(all_hiv_genome,1) |> names() |> str_split(" ") |> unlist() |> _[1]
hiv_genotype(sample = sample)

Stanford’s

< details> < summary>code

all_hiv_genome[all_hiv_genome |> names() |> str_detect(sample)] |> writeXStringSet("hiv_test.fasta")

# run sierrapy input fasta
system("sierrapy fasta hiv_test.fasta -o hiv_test_output.json")

# read json output
jsonlite::read_json("hiv_test_output.0.json", simplifyVector = T)$drugResistance[[1]][3]$drugScores

Awesome! No resistance at all! That was FM877777.1. Let’s give another a go!

Trial 3 ✅ < svg class="anchor-symbol" aria-hidden="true" height="26" width="26" viewBox="0 0 22 22" xmlns="http://www.w3.org/2000/svg"> < path d="M0 0h24v24H0z" fill="currentColor"> < path d="M3.9 12c0-1.71 1.39-3.1 3.1-3.1h4V7H7c-2.76.0-5 2.24-5 5s2.24 5 5 5h4v-1.9H7c-1.71.0-3.1-1.39-3.1-3.1zM8 13h8v-2H8v2zm9-6h-4v1.9h4c1.71.0 3.1 1.39 3.1 3.1s-1.39 3.1-3.1 3.1h-4V17h4c2.76.0 5-2.24 5-5s-2.24-5-5-5z">

Our code

< details> < summary>code

sample <- sample(all_hiv_genome,1) |> names() |> str_split(" ") |> unlist() |> _[1]
hiv_genotype(sample = sample)

Stanford’s

< details> < summary>code

all_hiv_genome[all_hiv_genome |> names() |> str_detect(sample)] |> writeXStringSet("hiv_test.fasta")

# run sierrapy input fasta
system("sierrapy fasta hiv_test.fasta -o hiv_test_output.json")

# read json output
jsonlite::read_json("hiv_test_output.0.json", simplifyVector = T)$drugResistance[[1]][3]$drugScores

There you go! Our algorithm works! That was EU448295.1. Let’s do another one!

Trial 4 ✅ < svg class="anchor-symbol" aria-hidden="true" height="26" width="26" viewBox="0 0 22 22" xmlns="http://www.w3.org/2000/svg"> < path d="M0 0h24v24H0z" fill="currentColor"> < path d="M3.9 12c0-1.71 1.39-3.1 3.1-3.1h4V7H7c-2.76.0-5 2.24-5 5s2.24 5 5 5h4v-1.9H7c-1.71.0-3.1-1.39-3.1-3.1zM8 13h8v-2H8v2zm9-6h-4v1.9h4c1.71.0 3.1 1.39 3.1 3.1s-1.39 3.1-3.1 3.1h-4V17h4c2.76.0 5-2.24 5-5s-2.24-5-5-5z">

Our code

< details> < summary>code

sample <- sample(all_hiv_genome,1) |> names() |> str_split(" ") |> unlist() |> _[1]
hiv_genotype(sample = sample)

Stanford’s

< details> < summary>code

all_hiv_genome[all_hiv_genome |> names() |> str_detect(sample)] |> writeXStringSet("hiv_test.fasta")

# run sierrapy input fasta
system("sierrapy fasta hiv_test.fasta -o hiv_test_output.json")

# read json output
jsonlite::read_json("hiv_test_output.0.json", simplifyVector = T)$drugResistance[[1]][3]$drugScores

There you go! Our algorithm works again! That was EU735539.1. Let’s do our last one.

Trial 5 ✅ < svg class="anchor-symbol" aria-hidden="true" height="26" width="26" viewBox="0 0 22 22" xmlns="http://www.w3.org/2000/svg"> < path d="M0 0h24v24H0z" fill="currentColor"> < path d="M3.9 12c0-1.71 1.39-3.1 3.1-3.1h4V7H7c-2.76.0-5 2.24-5 5s2.24 5 5 5h4v-1.9H7c-1.71.0-3.1-1.39-3.1-3.1zM8 13h8v-2H8v2zm9-6h-4v1.9h4c1.71.0 3.1 1.39 3.1 3.1s-1.39 3.1-3.1 3.1h-4V17h4c2.76.0 5-2.24 5-5s-2.24-5-5-5z">

Our code

< details> < summary>code

sample <- sample(all_hiv_genome,1) |> names() |> str_split(" ") |> unlist() |> _[1]
hiv_genotype(sample = sample)

Stanford’s

< details> < summary>code

all_hiv_genome[all_hiv_genome |> names() |> str_detect(sample)] |> writeXStringSet("hiv_test.fasta")

# run sierrapy input fasta
system("sierrapy fasta hiv_test.fasta -o hiv_test_output.json")

# read json output
jsonlite::read_json("hiv_test_output.0.json", simplifyVector = T)$drugResistance[[1]][3]$drugScores

That was AF443091.1. And 5 of 5 correct ✅ ! Hurray !!!!

Final Thoughts < svg class="anchor-symbol" aria-hidden="true" height="26" width="26" viewBox="0 0 22 22" xmlns="http://www.w3.org/2000/svg"> < path d="M0 0h24v24H0z" fill="currentColor"> < path d="M3.9 12c0-1.71 1.39-3.1 3.1-3.1h4V7H7c-2.76.0-5 2.24-5 5s2.24 5 5 5h4v-1.9H7c-1.71.0-3.1-1.39-3.1-3.1zM8 13h8v-2H8v2zm9-6h-4v1.9h4c1.71.0 3.1 1.39 3.1 3.1s-1.39 3.1-3.1 3.1h-4V17h4c2.76.0 5-2.24 5-5s-2.24-5-5-5z">

Hands down using SierraPy or Sierra-type app is the way to go, if we want reproducibility. There are options for local apps out there (see below). But learning to reproduce this really helps me understand the labeling of mutation and the way to get to susceptibility scoring better! It was a bumpy road, lots of trials, and lots more error and failure, but it ultimately it was a great learning experience. My respect for all these developers and researchers who have worked on this for years, really goes up! ❤️ It’s not easy at all! Even with the help of LLM!

Opportunities for Improvement < svg class="anchor-symbol" aria-hidden="true" height="26" width="26" viewBox="0 0 22 22" xmlns="http://www.w3.org/2000/svg"> < path d="M0 0h24v24H0z" fill="currentColor"> < path d="M3.9 12c0-1.71 1.39-3.1 3.1-3.1h4V7H7c-2.76.0-5 2.24-5 5s2.24 5 5 5h4v-1.9H7c-1.71.0-3.1-1.39-3.1-3.1zM8 13h8v-2H8v2zm9-6h-4v1.9h4c1.71.0 3.1 1.39 3.1 3.1s-1.39 3.1-3.1 3.1h-4V17h4c2.76.0 5-2.24 5-5s-2.24-5-5-5z">

Need to rewrite the interpretation code into a function so that we don’t have to paste 3 different times or modify 3 different times on the load_hivdb function.
Learn the exact algorithm sierra uses for alignment, especially when it comes to deletion/insertion
Look at sierra-local to see how they use nucamino to align and find mutations/indels
prettify the results with gt
include more ARTs
look into Stanford HIVdb NGS analysis

Lessons Learnt < svg class="anchor-symbol" aria-hidden="true" height="26" width="26" viewBox="0 0 22 22" xmlns="http://www.w3.org/2000/svg"> < path d="M0 0h24v24H0z" fill="currentColor"> < path d="M3.9 12c0-1.71 1.39-3.1 3.1-3.1h4V7H7c-2.76.0-5 2.24-5 5s2.24 5 5 5h4v-1.9H7c-1.71.0-3.1-1.39-3.1-3.1zM8 13h8v-2H8v2zm9-6h-4v1.9h4c1.71.0 3.1 1.39 3.1 3.1s-1.39 3.1-3.1 3.1h-4V17h4c2.76.0 5-2.24 5-5s-2.24-5-5-5z">

We did use genotype-phenotype DRMcv but found the results not reproducible and also does not match the mutation interpretation results, probably user error.
After multiple attempts to align seq, map to reference, then translate, just found out that DECIPHER::AlignTranslation() does a better job at this! Don’t use AlignSeq for aligning amino acids!
learnt that the position of M184V, K65R etc are RT, PR, IN gene specific!
learnt that Stanford HIVdb assumes mutation of different variants if X. For example, if M184X then it could be M184I or M184V, assume the worst case scenario and assign the max mutation score for that position.

If you like this article:

please feel free to send me a comment or visit my other blogs
please feel free to follow me on BlueSky, twitter, GitHub or Mastodon
if you would like collaborate please feel free to contact me

To leave a comment for the author, please follow the link and comment on their blog: r on Everyday Is A School Day.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Our code

Stanford’s

Our code

Stanford’s

Our code

Stanford’s

Our code

Stanford’s

Our code

Stanford’s

Related