RTCGA factory of R packages – Quick Guide

This article was first published on r-addict.com and kindly contributed to R-bloggers.

Yesterday saw the release of a new version of R: R 3.3.0 (codename Supposedly Educational). This allowed Bioconductor (yes, not all packages are distributed on CRAN) to release its new version, 3.3. As a result, all packages hosted on Bioconductor that had been under rapid development have moved to stable-release versions and can now be installed easily. This happens once or twice a year. On that occasion I finished work on the RTCGA package and released the RTCGA Factory of R Packages on Bioconductor. Read this quick guide to find out more about this R toolkit for biostatistics using data from The Cancer Genome Atlas study.

About The Cancer Genome Atlas

The Cancer Genome Atlas (TCGA) is a comprehensive and coordinated effort to accelerate our understanding of the molecular basis of cancer through the application of genome analysis technologies, including large-scale genome sequencing – http://cancergenome.nih.gov/.

Our team converted selected datasets from this study into a few separate packages that are hosted on Bioconductor. These R packages make the selected datasets easier to access and manage. The datasets in the RTCGA packages are large and cover complex relations between clinical outcomes and genetic background.

To use RTCGA, install the package following the instructions on its Bioconductor home page:

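## try http:// if https:// URLs are not supported
## biocLite() was the Bioconductor installer when this was written (Bioconductor 3.3);
## on R >= 3.5 use BiocManager::install("RTCGA") instead
source("https://bioconductor.org/biocLite.R")
biocLite("RTCGA")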

Check, Download and Read Data

Packages from the RTCGA factory will be useful for at least three audiences: biostatisticians who work with cancer data; researchers working on large-scale algorithms, for whom RTCGA data will be a perfect testing ground; and teachers who present data analysis methods on real data problems.

library(RTCGA)

TCGA releases various datasets over time for different cohorts, which are determined by cancer type. One can check:

  • infoTCGA() – the cohort codes and the counts for each cohort in the TCGA project,
  • checkTCGA('Dates') – the release dates of the TCGA datasets,
  • checkTCGA('DataSets', cancerType = "BRCA") – the names of the TCGA datasets for the current release date and the given cohort.

With that knowledge we can download specific datasets from the TCGA study. The following command downloads the datasets that have the string Merge_Clinical.Level_1 in their name for the BRCA cohort (breast invasive carcinoma) and the 2015-11-01 release date.

downloadTCGA(cancerTypes = "BRCA",
             dataSet = "Merge_Clinical.Level_1",
             destDir = "output_dir",
             date = "2015-11-01")

For specific datasets (8 types) we have prepared the readTCGA function, which reads a dataset into a tidy format using the data.table::fread function. For expression datasets we also convert the column types to numeric.

readTCGA(path = file.path("output_dir",
                          grep("clinical_clin_format.txt",
                               list.files("output_dir/",
                                          recursive = TRUE),
                               value = TRUE)
                          ),
         dataType = "clinical") -> BRCA.clinical.20151101
dim(BRCA.clinical.20151101)
[1] 1098 1494
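
The nested file.path / grep / list.files call above can be unpacked into separate steps. Here is a minimal sketch in base R that uses a temporary directory and a made-up file layout; the folder and file names are assumptions for illustration, not the real names produced by downloadTCGA.

```r
# Build a throwaway directory that mimics a downloaded, untarred dataset.
# The sub-folder and file names below are hypothetical.
out_dir <- file.path(tempdir(), "output_dir_demo")
sub_dir <- file.path(out_dir, "some_untarred_folder")
dir.create(sub_dir, recursive = TRUE, showWarnings = FALSE)
file.create(file.path(sub_dir, "DEMO.clinical_clin_format.txt"))
file.create(file.path(sub_dir, "MANIFEST.txt"))

# Step 1: list every file below out_dir, as paths relative to out_dir.
all_files <- list.files(out_dir, recursive = TRUE)

# Step 2: keep only the names that contain the wanted string.
# fixed = TRUE treats the pattern literally, so "." is not a regex wildcard.
hit <- grep("clinical_clin_format.txt", all_files, value = TRUE, fixed = TRUE)

# Step 3: prepend the directory to get a path a reader function can open.
clinical_path <- file.path(out_dir, hit)
file.exists(clinical_path)
```

If more than one file matches, hit is a vector and so is clinical_path, so it is worth checking length(hit) before passing the path on to a reader such as readTCGA.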

Prepared Available Datasets

For the most popular dataset types we have prepared data packages that provide various genetic information for the 2015-11-01 TCGA release. You can read about those datasets and install them with

?datasetsTCGA
?installTCGA

Those datasets can be converted to Bioconductor formats with the convertTCGA function. The full documentation, prepared with staticdocs, is available at http://rtcga.github.io/RTCGA/staticdocs/.

Manipulate and Visualize Data

For the prepared datasets we provide functions to manipulate the data and to visualize the results of statistical procedures such as Principal Component Analysis (based on ggbiplot) or Kaplan-Meier estimates of survival curves (based on the elegant survminer package). A few examples follow.

Survival Curves

library(RTCGA.clinical)
survivalTCGA(BRCA.clinical,
             OV.clinical,
             extract.cols = "admin.disease_code") -> BRCAOV.survInfo
## Kaplan-Meier Survival Curves
kmTCGA(BRCAOV.survInfo,
       explanatory.names = "admin.disease_code",
       pval = TRUE,
       xlim = c(0,2000),
       break.time.by = 500)

[Figure: Kaplan-Meier survival curves for the BRCA and OV cohorts]
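
For readers who want to see a Kaplan-Meier estimate without the RTCGA wrappers, here is a minimal sketch using the survival package and its bundled lung dataset (not TCGA data). Whether kmTCGA performs exactly these calls internally is an assumption; the sketch only illustrates the technique, including the log-rank test that a p-value on such a plot typically comes from.

```r
library(survival)

# lung ships with the survival package: time is follow-up in days,
# status is 1 = censored, 2 = dead, sex is 1 = male, 2 = female.
fit <- survfit(Surv(time, status) ~ sex, data = lung)

# Number of events and median survival per group.
print(fit)

# Log-rank test comparing the two curves.
lr <- survdiff(Surv(time, status) ~ sex, data = lung)
print(lr)

# Base-graphics Kaplan-Meier plot, one curve per group.
plot(fit, col = c("steelblue", "firebrick"),
     xlab = "Days", ylab = "Survival probability")
legend("topright", legend = names(fit$strata),
       col = c("steelblue", "firebrick"), lty = 1)
```

The grouping variable on the right-hand side of the formula plays the same role as explanatory.names in the kmTCGA call.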

PCA Biplot

library(dplyr)
## RNASeq expressions
library(RTCGA.rnaseq)
expressionsTCGA(BRCA.rnaseq, OV.rnaseq, HNSC.rnaseq) %>%
   rename(cohort = dataset) %>%  
   filter(substr(bcr_patient_barcode, 14, 15) == "01") -> 
   BRCA.OV.HNSC.rnaseq.cancer
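
The filter on characters 14 and 15 of bcr_patient_barcode relies on the TCGA barcode layout, where those two characters hold the sample type code and "01" denotes a primary solid tumor. A small base R sketch with made-up barcodes (hypothetical values, not real patients) shows what the filter keeps:

```r
# Hypothetical barcodes in the TCGA layout; they are invented for this demo.
demo <- data.frame(
  bcr_patient_barcode = c("TCGA-AA-0001-01A-11R-0000-07",
                          "TCGA-AA-0001-11A-11R-0000-07",
                          "TCGA-AA-0002-01A-11R-0000-07",
                          "TCGA-AA-0003-06A-11R-0000-07"),
  stringsAsFactors = FALSE
)

# Characters 14-15 encode the sample type:
# "01" primary solid tumor, "06" metastatic, "11" solid tissue normal.
demo$sample_type <- substr(demo$bcr_patient_barcode, 14, 15)

# Same condition as in the dplyr pipeline, written with base R subsetting.
tumor_only <- demo[demo$sample_type == "01", ]
tumor_only$bcr_patient_barcode
```

Dropping the normal-tissue samples this way keeps the comparison between cohorts restricted to tumor samples.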

pcaTCGA(BRCA.OV.HNSC.rnaseq.cancer,
        group.names = "cohort",
        title = "Genes expressions vs cohort types")

[Figure: PCA biplot, "Genes expressions vs cohort types"]
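
To illustrate the underlying technique without the RTCGA data packages, here is a minimal PCA sketch in base R on the built-in iris dataset, used as a stand-in for an expression table. That pcaTCGA computes its components in the same way is an assumption; the sketch uses prcomp and base graphics rather than ggbiplot.

```r
# iris is a built-in dataset used here as a stand-in for an expression table:
# four numeric columns (features) and one grouping column (Species).
features <- iris[, 1:4]

# Center and scale each column so no feature dominates through its units.
pca <- prcomp(features, center = TRUE, scale. = TRUE)

# Share of total variance captured by each principal component.
var_share <- pca$sdev^2 / sum(pca$sdev^2)
round(var_share, 3)

# Scores on the first two components, coloured by group.
plot(pca$x[, 1], pca$x[, 2], col = as.integer(iris$Species),
     xlab = "PC1", ylab = "PC2", pch = 19)
legend("topright", legend = levels(iris$Species),
       col = seq_along(levels(iris$Species)), pch = 19)
```

The Species column plays the role that the cohort column plays in the pcaTCGA call: it is excluded from the decomposition and used only to colour the points.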

For more visualization examples, visit the RTCGA project website. If you notice any bugs or have any comments, please open an issue in the project's repository or post a comment in the Disqus panel below.
