Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

## Survival Analysis

Survival analysis deals with estimating probability of continuation of a particular status-quo at given point in time. Naturally, it also estimates the probability of discontinuation of the status quo i.e. occurance of an event or a hazard. It finds application in several fields. For e.g. in medicine to estimate probability of survival of a patient under treatment of a fatal disease. Or in engineering to estimate reliability or predicting failure. Even in marketing and sales, for e.g. to decrease churn rate.

## Cricket

In this piece, application of similar method in the game of cricket has been illustrated. Survival in this case means ‘not getting out or dismissed’ while batting. Naturally hazard means getting out. ODI Batting statistics of two Indian batsmen have been used. They are Sachin Tendulkar and Sourav Ganguly.
Data has been sourced from ESPN.

```#data frames named ganguly and tendulkar are loaded
#function to manipulate data
mani<-function(x){
x<-x%>%
filter(Pos != "-")%>%
mutate(diss=ifelse((Dismissal=="not out"),0,1))%>% #This is the event of interest
mutate(Runs=str_remove(Runs,"[*]"))%>%
mutate(BF=as.integer(BF), Runs=as.integer(Runs),
Mins=as.integer(Mins),Fours=as.integer(Fours),Sixes=as.integer(Sixes),
SR=as.numeric(SR),Pos=as.integer(Pos),
Dismissal=as.factor(Dismissal), Inns=as.factor(Inns),
Date=dmy(Date))%>%
separate(Opposition,c(NA,"Opposition"),extra = "merge", fill = "left")%>%
separate(Ground,c(NA,"Ground"),extra = "drop", fill = "left")%>%
mutate(Runs=ifelse(is.na(Runs),round(BF*SR,0),Runs))%>% #BF or Balls Faced will be used as unit of time
mutate(h_cen=ifelse((Runs>=50 & Runs<100),1,0))%>%
mutate(cen=ifelse(Runs>=100,1,0))
#Adding dummy columns for possible future requirement
#x<-dummy_cols(x,select_columns = c("Opposition","Ground"))
return(x)
}
#Adding player names in their data sets
ganguly\$player<-c(rep("ganguly",nrow(ganguly)))
tendulkar\$player<-c(rep("tendulkar",nrow(tendulkar)))
#Combining data of the three players
data.raw<-rbind(ganguly,tendulkar)
#structuring the data using the function created
data.df<-mani(data.raw)
## Runs Mins BF Fours Sixes SR Pos Dismissal Inns X Opposition Ground
## 1 3 33 13 0 0 23.07 6 lbw 1 NA West Indies Brisbane
## 2 46 117 83 3 0 55.42 3 stumped 1 NA England Manchester
## 3 16 NA 41 3 0 39.02 3 caught 1 NA Sri Lanka RPS
## 4 36 59 52 3 1 69.23 3 caught 2 NA Zimbabwe SSC
## 5 59 102 75 7 0 78.66 7 lbw 1 NA Australia SSC
## 6 11 NA 8 1 0 137.50 8 not out 1 NA Pakistan Toronto
## Date player diss h_cen cen
## 1 1992-01-11 ganguly 1 0 0
## 2 1996-05-26 ganguly 1 0 0
## 3 1996-08-28 ganguly 1 0 0
## 4 1996-09-01 ganguly 1 0 0
## 5 1996-09-06 ganguly 1 1 0
## 6 1996-09-17 ganguly 0 0 0```

## Estimating probability of staying ‘not out’

Kaplan Meier Analysis is a non-parametric analysis. This means that only the event and the time is used in the analysis i.e. for e.g. possible effect of Opposition will not be considered. Unless, of course the data is grouped based on a parameter. Just like we will group the data based on players so that we observe the survival statistics for the individual players. We can further group the data based on Opposition or Ground.

```#model fitting
kap.mr.fit<-survfit(Surv(BF,diss) ~ player, data=data.df)
#summary of the fitted model
summary(kap.mr.fit)\$table
## records n.max n.start events *rmean *se(rmean) median
## player=ganguly 300 300 300 279 53.31482 2.526322 41
## player=tendulkar 318 318 318 278 52.39229 2.131667 44
## 0.95LCL 0.95UCL
## player=ganguly 32 45
## player=tendulkar 38 56
#detailed summary can viewed by running summary(kap.mr.fit)```

The results do not show any significant difference in the survival tendencies of the two batsmen, yet. Plotting the data makes this clear.

```ggsurvplot(
kap.mr.fit,
pval = TRUE, # show p-value
break.time.by = 25, #break X axis by 25 balls
#risk.table = "abs_pct", # absolute number and percentage at risk
#risk.table.y.text = FALSE,# show bars instead of names in text annotations
linetype = "strata",
# Change line type by groups
conf.int = TRUE,
# show confidence intervals for
#conf.int.style = "step", # customize style of confidence intervals
surv.median.line = "hv",
# Specify median survival
ggtheme = theme_bw(),
# Change ggplot2 theme
legend.labs =
c("Ganguly", "Tendulkar"),
# change legend labels
ncensor.plot = TRUE,
# plot the number of censored subjects (outs) at time t
#palette = c("#000000", "#2E9FDF","#FF0000")
)+
labs(x="Balls")``` Survival probabilities of both are almost similar. The survival probability of Ganguly at the beginning of the innings is slightly lower than that of Tendulkar. However, after around 75 balls, it increases and exceeds than that of Teldulkar. But the difference is insignificant.

# Estimating probability of Sixers

Similar approach was used to estimate probability of at least one sixer. The event, or hazard (for the fielding team) is the batsman hitting a sixer. The hazard plot reveals interesting insight.

```#creating new data set with result of sixes
data.new<-data.df%>%
mutate(six=ifelse(Sixes>0,1,0))
#fitting the model
kap.six.fit<-survfit(Surv(BF,six) ~ player, data=data.new)
#summary(kap.six.fit)\$table
ggsurvplot(
kap.six.fit,
pval = TRUE,
break.time.by = 25,
linetype = "strata",
conf.int = TRUE,
surv.median.line = "hv",
ggtheme = theme_bw(),
legend.labs =
c("Ganguly", "Tendulkar"),
ncensor.plot = TRUE,
fun = "event"
)+
labs(x="Balls", y="Sixer")``` The result clearly shows why Ganguly was a ‘Hazard’ for the fielding team after he had set in. The probability of him hitting at least one sixer is substantially higher than that of Tendulkar, esp. after 40 balls or so. The narrow strip of confidence interval indicates higher accuracy in the estimate, in case of Ganguly. The confidence interval spreads wide in case of Tendulkar, indicating higher uncertainty in estimates.

Similar analysis was conducted using Cox Proportional Hazard model. This model is capable to consider effect of other parameters. ‘Runs’ was used as another parameter in the new model and the hazard plot was plotted again.

```cox.fit <- survfit(coxph(Surv(BF, six) ~ Runs + strata(player),
data = data.new))
autoplot(
cox.fit,
fun = 'event',
censor = FALSE,
conf.int = TRUE,
surv.linetype = 'strata'
)+ labs(x = "Balls", y = "Sixer")``` The results are similar. The probability of sixer by Ganguly is substantially higher than that of Tendulkar esp. from around 25 balls to around 120 balls. The confidence interval widens in the right end of the curve, indicating immense uncertainty.

# Final Analyses

The final stage of analysis includes a subset of the data previously used. Specifically, a few ‘Opposition’ was selected with whom the number of matches were substantially higher than the rest.

```#Defining the opposition teams
teams<-c("Pakistan","Sri Lanka", "Australia", "West Indies", "South Africa", "New Zealand",
#Creating new data set with by filtering the teams
data.new<-data.new%>%filter(Opposition %in% teams)
# Fitting Cox models
#Sixes
cox.six<-survfit(coxph(Surv(BF,six) ~ Runs+strata(player)+factor(Opposition),
data = data.new))
sixerplot<-autoplot(cox.six,conf.int = TRUE,fun='event',pval=TRUE)+labs(x="Balls",y="Sixer Chance")+theme(legend.position = "top")
#Dismissals
cox.diss<-survfit(coxph(Surv(BF,diss) ~ Runs+strata(player)+factor(Opposition),
data = data.new))
dissplot<-autoplot(cox.diss,conf.int = TRUE,pval=TRUE)+labs(x="Balls",y="Survival Chance")+theme(legend.position = "top")
#Half centuries
cox.hcen<-survfit(coxph(Surv(BF,Runs>=50) ~ strata(player)+factor(Opposition),
data = data.new))
hcenplot<-autoplot(cox.hcen,conf.int = TRUE, fun='event',
pval=TRUE,plotTable = TRUE, divideTime = 25,)+labs(x="Balls",y="Chance of half century")+xlim(50,100)+theme(legend.position = "top")
#Result of Dismissal
cox.diss
## Call: survfit(formula = coxph(Surv(BF, diss) ~ Runs + strata(player) +
## factor(Opposition), data = data.new))
##
## n events median 0.95LCL 0.95UCL
## ganguly 292 273 45 44 51
## tendulkar 313 276 57 53 60
#Result of Sixes
cox.six
## Call: survfit(formula = coxph(Surv(BF, six) ~ Runs + strata(player) +
## factor(Opposition), data = data.new))
##
## n events median 0.95LCL 0.95UCL
## ganguly 292 92 75 70 83
## tendulkar 313 32 95 87 NA
#Result of Half Century
cox.hcen
## Call: survfit(formula = coxph(Surv(BF, Runs >= 50) ~ strata(player) +
## factor(Opposition), data = data.new))
##
## n events median 0.95LCL 0.95UCL
## ganguly 292 89 103 96 111
## tendulkar 313 93 89 85 99```

# Results:

#### Tendulkar will remain longer on crease than Ganguly

1. Ganguly will be dismissed by the time he faces 44 to 51 balls, about 50% of the time. This should be true 95% of the time.
2. Tendulkar will be dismissed by the time he faces 53 to 60 balls, about 50% of the time. This should be true 95% of the time.

Conclusion: Chances are higher that Tendulkar will remain on crease longer than Ganguly.

#### Ganguly will sixes earlier than Tendulkar

1. Ganguly will hit at least one six by the time he faces 70 to 83 balls, 50% of the time. This should be true 95% of the time.
2. Tendulkar will hit at least one six by the has faces 87 balls. This should be true 95% of the time.

Conclusion: Chances are higher that on average, Ganguly will hit six much before Tendulkar does, in an innings.

#### Ganguly will score slower till a point, than Tendulkar

1. If Ganguly faces 96 to 111 balls, 50% of the time he will score a half century. This should be true 95% of the time.
2. If Tendulkar faces 85 to 99 balls, 50% of the time he will score a half century. This should be true 95% of the time.

Conclusion: Chances are higher that Ganguly will score slower on average that Tendulkar, at least till reaching half century.

The plots of the result are as follows.

```#Plotting and arranging
ggarrange(dissplot,sixerplot,hcenplot,ncol=3)``` ## Further possibilities

After building any model, the model should be tested on a test data to check for accuracy. Validation of the model is required to understand the robustness of the model. If the results are not satisfactory, changes are made to the model to improve accuracy. The process is repeated a few times on requirement.

The predictions are based on certain parameters. There will be more parameters on which the outcome will depend. For e.g. the above analysis considers opposition team as a parameter. However, the ground and weather will also impact the outcome, which has not been considered. There is always a possibility to add more parameter to improve results.

The effect of parameters may also change over time. For e.g. the opposition team may change members in their team, say, add a fast bowler. Also, the batsmen may evolve themselves to tackle a difficult opponent. Then the effect of the opponent on the outcome will not remain same throughout time. Further complex methods like accelerated failure models can be applied to improve the model.

These are beyond the scope of this article. Feel free to try them out and don’t forget to share your results. Feel free to reach out if you need help.

Survival Analysis can be used to help reduce churn, reduce warranty cost and improve pricing, predict failure in machinery and more. If you are curious about how it can help you achieve your business goals, do not hesitate to contact me.