# S-shaped data: Smoothing with quasibinomial distribution

January 16, 2016
By

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

S-shaped distributed data can be found in many applications. Such data can be approximated with logistic distribution function .  Cumulative distribution function of logistic distribution function is a logistic function, i.e., logit.

To demonstrate this, in this short example, after generating a synthetic data, we will fit quasibinomial regression model to different observations.

ggplot , an implementation of grammar of graphics , provides capability to apply regression or customised smoothing onto a raw data during plotting.

Generating Synthetic Data

Let generate set of $n$ observation over time $t$, denoted, $X_{1}, X_{2}, …, X_{n}$ for $k$ observation $X=(x_{1}, x_{2}, …, x_{k})$. We will use cumulative function for logistic distribution ,
$$F(x;\mu,s) = \frac{1}{2} + \frac{1}{2} \tanh((x-\mu)/2s)$$, adding some random noise to make
it realistic.

Let’s say there are $k=6$ observations with the following parameter sets, $\mu = \{9,2,3,5,7,5\}$  and $s=\{2,2,4,3,4,2\}$, we will utilise mapply  in generating a syntetic data frame.

generate_logit_cdf <- function(mu, s,                                sigma_y=0.1,                                x=seq(-5,20,0.1)) {  x_ms <- (x-mu)/s  y    <- 0.5 + 0.5 * tanh(x_ms)    y    <- abs(y + rnorm(length(x), 0, sigma_y))  ix   <- which(y>=1.0)  if(length(ix)>=1) {     y[ix] <- 1.0  }  return(y)}set.seed(424242)x      <- seq(-5,20,0.025) # 1001 observationmu_vec <- c(1,2,3,5,7,8)   # 6 variabless_vec  <- c(2,2,4,3,4,2)# Syntetic variablesobservations_df<- mapply(generate_logit_cdf,                               mu_vec,                               s_vec,                               MoreArgs = list(x=x))# Give them namescolnames(observations_df) <- c("Var1", "Var2", "Var3", "Var4", "Var5", "Var6")head(observations_df)

Smoothing of observations

Using the syntetic data we have generated, observations_df,
we can noq use ggplot and quasibinomial glm to visualise
and smooth the variables.

library(ggplot2) library(reshape2) df_all <- reshape2:::melt(observations_df) colnames(df_all) <- c("x", "observation", "y") df_all$observation <- as.factor(df_all$observation) p1<-ggplot(df_all, aes(x=x, y=y, colour=observation)) + geom_point() +         scale_color_brewer(palette = "Reds") +        theme(              panel.background = element_blank(),               axis.text.x      = element_text(face="bold", color="#000000", size=11),              axis.text.y      = element_text(face="bold", color="#000000", size=11),              axis.title.x     = element_text(face="bold", color="#000000", size=11),              axis.title.y     = element_text(face="bold", color="#000000", size=11)#              legend.position = "none"              ) l1<-ggplot(df_all, aes(x=x, y=y, colour=observation)) +        geom_point(size=3) + scale_color_brewer(palette = "Reds") +         scale_color_brewer(palette = "Reds") +         #geom_smooth(method="loess", se = FALSE, size=1.5) +        geom_smooth(aes(group=observation),method="glm", family=quasibinomial(), formula="y~x",                       se = FALSE, size=1.5) +        xlab("x") +         ylab("y") +        #scale_y_continuous(breaks=seq(0.0,1,0.1)) +        #scale_x_continuous(breaks=seq(0.0,230,20)) +        #ggtitle("")  +         theme(              panel.background = element_blank(),               axis.text.x      = element_text(face="bold", color="#000000", size=11),              axis.text.y      = element_text(face="bold", color="#000000", size=11),              axis.title.x     = element_text(face="bold", color="#000000", size=11),              axis.title.y     = element_text(face="bold", color="#000000", size=11)              ) library(gridExtra) gridExtra:::grid.arrange(p1,l1)

References
 https://en.wikipedia.org/wiki/Logistic_distribution#Applications
 http://www.ggplot.org
 The Grammar of Graphics, L. Wilkinson, http://www.amzn.com/038724544
 http://en.wikipedia.org/wiki/Logistic_distribution#Cumulative_distribution_function.
 https://stat.ethz.ch/R-manual/R-devel/library/base/html/mapply.html

Notes:
* Quasibinomial distribution is as a term used in R’s GLM implementation context, see here.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.