# Determine optimal cutpoints for numerical variables in survival plots

This article was first published on **http://r-addict.com** and kindly contributed to R-bloggers.

A frequent demand in biostatistical research is to group patients according to continuous explanatory variables. In some cases the requirement is to test overall survival of subjects who carry a mutation in a specific gene and have high expression (overexpression) of another given gene. To visualize differences in the Kaplan-Meier estimates of survival curves between groups, the continuous variable first has to be discretized. The problems caused by categorizing continuous variables are well known and widely discussed (Harrell, 2015), but in this case the discretization is needed as a simplification. In this post I present the maxstat (maximally selected rank statistics) statistic for determining the optimal cutpoint for continuous variables, which was provided in the survminer package by Alboukadel Kassambara (kassambara).

In this post I will use data from the TCGA study, provided in the RTCGA package, together with the survminer package to determine the optimal cutpoint for a continuous variable.

- Data preparation
- maxstat – maximally selected rank statistics
- Fit and visualize Kaplan-Meier estimates of survival curves

# Data preparation

I wrote about TCGA datasets and their preprocessing in my earlier posts: RTCGA factory of R packages – Quick Guide and BioC 2016 Conference Overview and Few Ways of Downloading TCGA Data. If you are not familiar with the RTCGA family of data packages, you can visit the RTCGA website. Below I join survival information with `ABCD4|5826` gene expression for patients suffering from BRCA (breast cancer) and HNSC (head and neck cancer). This is possible thanks to the `bcr_patient_barcode` column, which identifies each patient.

Joining survival times and `ABCD4` gene expression.

13 patients have clinical information but no expression information, so I remove them from the analysis.

The complete data used for further analysis are printed below.
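The preparation steps above can be sketched as follows. This is a minimal sketch rather than a definitive implementation: it assumes the Bioconductor data packages RTCGA.clinical and RTCGA.rnaseq are installed, and the object names (`BRCA_HNSC.surv`, `BRCA_HNSC.rnaseq`, `BRCA_HNSC.surv_rnaseq`), the renamed columns `cohort` and `ABCD4`, and the restriction to primary tumor samples are my assumptions, not taken from the text.

```r
library(RTCGA)
library(RTCGA.clinical)  # BRCA.clinical, HNSC.clinical
library(RTCGA.rnaseq)    # BRCA.rnaseq, HNSC.rnaseq
library(dplyr)

# Survival times and vital status for both cohorts
BRCA_HNSC.surv <- survivalTCGA(BRCA.clinical, HNSC.clinical)

# ABCD4 expression for both cohorts
BRCA_HNSC.rnaseq <- expressionsTCGA(BRCA.rnaseq, HNSC.rnaseq,
                                    extract.cols = "ABCD4|5826") %>%
  rename(cohort = dataset, ABCD4 = `ABCD4|5826`) %>%
  # assumption: keep only primary tumor samples (sample type code "01")
  filter(substr(bcr_patient_barcode, 14, 15) == "01") %>%
  # shorten the sample barcode to the 12-character patient barcode
  mutate(bcr_patient_barcode = substr(bcr_patient_barcode, 1, 12))

# Join on the patient identifier
BRCA_HNSC.surv_rnaseq <- left_join(BRCA_HNSC.surv, BRCA_HNSC.rnaseq,
                                   by = "bcr_patient_barcode")

# Patients without expression data end up with NA in `cohort`
table(BRCA_HNSC.surv_rnaseq$cohort, useNA = "always")
BRCA_HNSC.surv_rnaseq <- filter(BRCA_HNSC.surv_rnaseq, !is.na(cohort))

head(BRCA_HNSC.surv_rnaseq)
```

The `table()` call is a quick way to count how many patients have no expression record before they are dropped.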

# maxstat – maximally selected rank statistics

kassambara prepared functionality that uses the maxstat (maximally selected rank statistics) statistic to determine the optimal cutpoint for continuous variables and provided it in the survminer package. The development process is described on the survminer issue tracker.

Articles in which the maxstat statistic was used:

- http://www.haematologica.org/content/99/9/1410
- http://www.bloodjournal.org/content/bloodjournal/120/5/1087.full.pdf
- http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4058021/pdf/oncotarget-05-2487.pdf
- http://www.impactjournals.com/oncotarget/index.php?journal=oncotarget&page=article&op=view&path[]=3237&path[]=6227

The selection process for the cutpoint is explained in this vignette, chapter 2.
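To make the selection process more concrete, here is a minimal sketch that calls `maxstat.test()` from the maxstat package directly on simulated data. The data frame `d` and its columns are hypothetical, and choosing `smethod = "LogRank"` with `pmethod = "none"` is my assumption of a reasonable setup for right-censored survival data, not something stated in the post.

```r
library(survival)
library(maxstat)

# Simulated data (hypothetical): the hazard doubles when the marker exceeds 0
set.seed(3)
n <- 300
marker <- rnorm(n)
time <- rexp(n, rate = ifelse(marker > 0, 2, 1))
event <- rbinom(n, size = 1, prob = 0.8)
d <- data.frame(time, event, marker)

# Standardized log-rank statistic evaluated at every candidate cutpoint;
# minprop/maxprop restrict candidates to the 10%-90% quantile range of the marker
mt <- maxstat.test(Surv(time, event) ~ marker, data = d,
                   smethod = "LogRank", pmethod = "none",
                   minprop = 0.1, maxprop = 0.9)

mt$estimate                   # estimated cutpoint
mt$cuts[which.max(mt$stats)]  # candidate with the largest statistic
```

The idea is that every admissible split of the marker into "low" and "high" is scored, and the split with the largest standardized statistic is reported as the cutpoint.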

Determining the optimal cutpoint for ABCD4 gene expression

Plot the cutpoint for gene ABCD4

Categorize ABCD4 variable
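The three steps above map onto `surv_cutpoint()`, its `plot()` method and `surv_categorize()` from survminer. The following is a hedged sketch: so that it runs on its own, it uses a small simulated stand-in data frame (`sim_data`, a hypothetical name) with columns `times`, `patient.vital_status` and `ABCD4`. With the real data you would pass the joined TCGA data frame instead.

```r
library(survminer)

# Simulated stand-in for the joined survival + expression data (not TCGA data)
set.seed(1)
n <- 300
sim_data <- data.frame(
  times = round(rexp(n, rate = 1 / 1000)) + 1,
  patient.vital_status = rbinom(n, size = 1, prob = 0.4),
  ABCD4 = rnorm(n, mean = 300, sd = 80)
)

# Determine the optimal cutpoint for ABCD4
abcd4_cut <- surv_cutpoint(
  sim_data,
  time = "times",
  event = "patient.vital_status",
  variables = "ABCD4"
)
summary(abcd4_cut)  # cutpoint and value of the statistic

# Plot the cutpoint for ABCD4
plot(abcd4_cut, "ABCD4")

# Categorize ABCD4 into "high" / "low" at the estimated cutpoint
abcd4_cat <- surv_categorize(abcd4_cut)
head(abcd4_cat)
```

Because the simulated expression values are unrelated to survival, the cutpoint found here carries no meaning; the block only shows the sequence of calls.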

# Fit and visualize Kaplan-Meier estimates of survival curves

Below I divide patients into 4 groups, defined by cancer type and by the ABCD4 expression level (high or low) of each patient.

RTCGA way

survminer way
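A minimal sketch of the survminer route, assuming the categorized data contain a two-level `ABCD4` column and a `cohort` column. The data frame `sim_cat` is a simulated, hypothetical stand-in so that the block runs on its own, and the plotting options are illustrative choices rather than the settings used for the original figure.

```r
library(survival)
library(survminer)

# Simulated stand-in for the categorized data (not TCGA data)
set.seed(2)
n <- 400
sim_cat <- data.frame(
  times = round(rexp(n, rate = 1 / 1500)) + 1,
  patient.vital_status = rbinom(n, size = 1, prob = 0.4),
  ABCD4 = sample(c("high", "low"), n, replace = TRUE),
  cohort = sample(c("BRCA", "HNSC"), n, replace = TRUE)
)

# Kaplan-Meier estimates for the 4 groups (expression level x cancer type)
fit <- survfit(Surv(times, patient.vital_status) ~ ABCD4 + cohort,
               data = sim_cat)

# Survival curves with log-rank p-value, confidence bands and a risk table
km_plot <- ggsurvplot(fit, data = sim_cat,
                      pval = TRUE, conf.int = TRUE, risk.table = TRUE,
                      xlim = c(0, 3000), break.time.by = 1000)
print(km_plot)
```

The RTCGA route wraps the same steps in one call: `kmTCGA()` takes the data frame and the grouping columns through its `explanatory.names` argument, for example `explanatory.names = c("ABCD4", "cohort")`.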
