An R function to compute Tumor Mutational Burden (TMB)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Tumor mutational (or mutation) burden, TMB, is considered a useful estimation of tumor neoantigenic load (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6336005/) and a biomarker of sensitivity to immune checkpoint inhibitors (https://genomemedicine.biomedcentral.com/articles/10.1186/s13073-017-0424-2).
Computing TMB is relatively easy. Together with a student of mine (Stefano Pezzella), I followed the approach described in the work by Chalmers and colleagues (https://genomemedicine.biomedcentral.com/articles/10.1186/s13073-017-0424-2) and wrote down a simple R function that does the job.
TMB is defined as the number of coding mutations per megabase of coding region. Luckily enough, one recent study computed the length of the coding regions of the human genome. Only considering the coding portions of exons, such length was estimated to be 25,840,698 base pairs, as shown in Table 2 of the work by Piovesan and colleagues (https://bmcresnotes.biomedcentral.com/articles/10.1186/s13104-019-4343-8)
I release here the R function to compute TMB. The function, together with a toy example, is also available on github, on the page I dedicate to the pieces of code I wrote as part of some work I did for the Clingen-fvg project (https://github.com/genomeud/clingen-fvg)
The example file has to include one column named ExonicFunc, in which the function of the variant is annotated. Intergenic variants should be listed as “.” and will be discarded. Be careful, since everything that is not “.” is not discarded, and will enter the computation. If you notice crazily high values of TMB, it is likely that in your input file the intergenic variants are represented with a different string (and you just need to change the string).
# Reference paper for the computation of TMB: #https://genomemedicine.biomedcentral.com/articles/10.1186/s13073-017-0424-2 library(data.table) library(openxlsx) #We use a literature-based proxy for length of the CDS #https://doi.org/10.1186/s13104-019-4343-8 cdslength<-25840698 cdslength<-cdslength/1000000 #convert to Mbases #Files snvtoread<-"Sample.muTect.somatic.snv.annovar.hg19_multianno.xls" vars<-fread(snvtoread, data.table=F) #Important assumption about the input files #1) It has one column named ExonicFunc, in which intergenic mutations are annotated as "." # if this is not the case you will need to change the symbol in the line below exvars<-vars[vars$ExonicFunc!=".",] mut_type<-table(exvars$ExonicFunc) TMB<-sum(mut_type)/cdslength
The final object generated by the function is named TMB and represents the estimate of the Tumor mutational burden.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.