# Chapter 2 Solutions – Statistical Methods in Bioinformatics

August 14, 2012
(This article was first published on John Ramey, and kindly contributed to R-bloggers)

As I have mentioned previously, I have begun reading Statistical Methods in Bioinformatics by Ewens and Grant and working selected problems for each chapter. In this post, I will give my solutions to two problems. The first problem is pretty straightforward.

## Problem 2.20

Suppose that a parent of genetic type Mm has three children. Then the parent transmits the M gene to each child with probability 1/2, and the genes that are transmitted to each of the three children are independent. Let $I_1 = 1$ if children 1 and 2 had the same gene transmitted, and $I_1 = 0$ otherwise. Similarly, let $I_2 = 1$ if children 1 and 3 had the same gene transmitted, $I_2 = 0$ otherwise, and let $I_3 = 1$ if children 2 and 3 had the same gene transmitted, $I_3 = 0$ otherwise.

The question first asks us to show that the three random variables are pairwise independent but not independent. Denote by $p_j$ the probability that $I_j = 1$, $j = 1, 2, 3$; by symmetry, each $p_j = 1/2$. Pairwise independence follows from the independence of the three transmissions: conditional on the gene transmitted to child 1, children 2 and 3 each match it independently with probability 1/2, so, for example, $P(I_1 = 1, I_2 = 1) = 1/4 = p_1 p_2$, and the other two pairs are handled identically. Now, if the three random variables were independent, then the following statement would be true:

$$P(I_1 = 1, I_2 = 1, I_3 = 0) = p_1 p_2 (1 - p_3) = \frac{1}{2} \cdot \frac{1}{2} \cdot \frac{1}{2} = \frac{1}{8}.$$
However, notice that the event on the left-hand side can never occur: if $I_1 = 1$ and $I_2 = 1$, then all three children received the same gene, so $I_3$ must be 1. Hence, the left-hand side equals 0, while the right-hand side equals 1/8. Therefore, the three random variables are not independent.

The question also asks us to discuss why the variance of $I_1 + I_2 + I_3$ is equal to the sum of the individual variances. In general, the variance of a sum equals the sum of the variances plus twice the sum of the pairwise covariances, so the equality is usually guaranteed only when the random variables are independent. But pairwise independence is enough here: it already forces each covariance to be 0, so the equality holds.
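As a quick sanity check (not part of the book's solution), a short simulation confirms the pairwise independence, the impossibility of the event $\{I_1 = 1, I_2 = 1, I_3 = 0\}$, and the variance identity; the variable names are my own:

```python
import random

random.seed(0)
trials = 200_000

p1 = p12 = impossible = 0
s_sum = s_sq = 0.0

for _ in range(trials):
    # Gene transmitted to each of the three children: 1 = "M", 0 = "m".
    g = [random.randint(0, 1) for _ in range(3)]
    i1 = int(g[0] == g[1])  # children 1 and 2 match
    i2 = int(g[0] == g[2])  # children 1 and 3 match
    i3 = int(g[1] == g[2])  # children 2 and 3 match
    p1 += i1
    p12 += i1 * i2
    impossible += i1 * i2 * (1 - i3)  # I1 = I2 = 1 but I3 = 0
    s = i1 + i2 + i3
    s_sum += s
    s_sq += s * s

mean = s_sum / trials
var = s_sq / trials - mean**2
print("P(I1 = 1)            ~", p1 / trials)            # ~ 1/2
print("P(I1 = 1, I2 = 1)    ~", p12 / trials)           # ~ 1/4 = p1 * p2
print("P(I1 = I2 = 1, I3 = 0):", impossible / trials)   # exactly 0
print("Var(I1 + I2 + I3)    ~", var)                    # ~ 3/4, the sum of the variances
```

The `impossible` counter is always 0 by the same logic as the argument above: two matches force the third.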

## Problems 2.23 - 2.27

While I worked the above problem because of its emphasis on genetics, the following set of problems is more fun mathematically because of its use of approximations.

For $i = 1, \ldots, n$, let $X_i$ be the lifetime until degradation of the $i$th of $n$ cellular proteins. We assume that $X_1, \ldots, X_n$ are iid random variables, each of which is exponentially distributed with rate parameter $\lambda > 0$. Furthermore, let $n = 2m + 1$ be an odd integer.

This set of questions is concerned with the mean and variance of the sample median, $X_{(m + 1)}$, where $X_{(i)}$ denotes the $i$th order statistic. First, note that the minimum $X_{(1)}$ of $n$ iid exponential lifetimes is itself exponentially distributed with rate $n\lambda$, so its mean and variance are $1/(n\lambda)$ and $1/(n\lambda)^2$, respectively. By the memoryless property of the exponential distribution, the additional waiting time $X_{(2)} - X_{(1)}$ until the next degradation is independent of $X_{(1)}$ and is the minimum of the $n - 1$ remaining exponential lifetimes. Thus, the mean and variance of $X_{(2)}$ are $1/(n\lambda) + 1/((n-1)\lambda)$ and $1/(n\lambda)^2 + 1/((n-1)\lambda)^2$, respectively. Continuing in this manner, we have

$$E[X_{(m+1)}] = \frac{1}{\lambda}\left(\frac{1}{n} + \frac{1}{n-1} + \cdots + \frac{1}{m+1}\right)$$

and

$$Var[X_{(m+1)}] = \frac{1}{\lambda^2}\left(\frac{1}{n^2} + \frac{1}{(n-1)^2} + \cdots + \frac{1}{(m+1)^2}\right).$$
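These two expressions are easy to check numerically. The sketch below (my own, with arbitrary illustrative choices of $\lambda$ and $m$) compares the exact sums against a Monte Carlo estimate of the sample median's mean and variance:

```python
import random
import statistics

random.seed(1)
lam = 2.0   # rate parameter lambda; arbitrary illustrative value
m = 10
n = 2 * m + 1

# Exact mean and variance of the sample median from the sums above.
exact_mean = sum(1.0 / i for i in range(m + 1, n + 1)) / lam
exact_var = sum(1.0 / i ** 2 for i in range(m + 1, n + 1)) / lam ** 2

# Monte Carlo: the median is the (m+1)st order statistic of n iid Exp(lam) lifetimes.
reps = 100_000
medians = [sorted(random.expovariate(lam) for _ in range(n))[m]
           for _ in range(reps)]
mc_mean = statistics.fmean(medians)
mc_var = statistics.pvariance(medians)

print(f"mean: exact {exact_mean:.4f}  simulated {mc_mean:.4f}")
print(f"var:  exact {exact_var:.5f}  simulated {mc_var:.5f}")
```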

### Approximation of $E[X_{(m + 1)}]$

Now, we wish to approximate the mean with a much simpler formula. First, from (B.7) in Appendix B, we have

$$1 + \frac{1}{2} + \cdots + \frac{1}{n} \approx \log n + \gamma,$$

where $\gamma$ is Euler’s constant. Then, we can write the expected sample median as

$$E[X_{(m+1)}] = \frac{1}{\lambda}\left(\sum_{i=1}^{n} \frac{1}{i} - \sum_{i=1}^{m} \frac{1}{i}\right) \approx \frac{1}{\lambda}\left[(\log n + \gamma) - (\log m + \gamma)\right] = \frac{1}{\lambda} \log \frac{2m+1}{m}.$$

Hence, as $n \rightarrow \infty$, this approximation converges to $\frac{\log 2}{\lambda}$, which is the median of an exponentially distributed random variable with rate $\lambda$. Specifically, the median is the solution to $F_X(x) = 1/2$, where $F_X$ denotes the cumulative distribution function of the random variable $X$.

### Improved Approximation of $E[X_{(m + 1)}]$

It turns out that we can improve this approximation with the following two results:

$$1 + \frac{1}{2} + \cdots + \frac{1}{n} \approx \log n + \gamma + \frac{1}{2n}$$

and, for small $x$,

$$\log(1 + x) \approx x.$$

Following the derivation of our above approximation, we have that

$$E[X_{(m+1)}] \approx \frac{1}{\lambda}\left[\log\frac{2m+1}{m} + \frac{1}{2(2m+1)} - \frac{1}{2m}\right] = \frac{1}{\lambda}\left[\log 2 + \log\left(1 + \frac{1}{2m}\right) + \frac{1}{2n} - \frac{1}{2m}\right] \approx \frac{1}{\lambda}\left(\log 2 + \frac{1}{2n}\right).$$
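The payoff of the refinement is easy to see numerically. This sketch (my own, taking $\lambda = 1$ for illustration) compares both approximations to the exact sum:

```python
import math

lam = 1.0  # rate parameter; illustrative value
rows = []
for m in (5, 50, 500):
    n = 2 * m + 1
    exact = sum(1.0 / i for i in range(m + 1, n + 1)) / lam   # true E[X_(m+1)]
    crude = math.log(n / m) / lam                             # first approximation
    improved = (math.log(2) + 1 / (2 * n)) / lam              # refined approximation
    rows.append((n, exact, crude, improved))
    print(f"n = {n:>4}: exact {exact:.6f}  crude {crude:.6f}  improved {improved:.6f}")
```

For every $m$ in the table, the refined approximation lands closer to the exact value than the cruder one.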

### Approximation of $Var[X_{(m + 1)}]$

We can also approximate $Var[X_{(m + 1)}]$ using the approximation

$$\sum_{i=a}^{b} \frac{1}{i^2} \approx \frac{1}{a - \frac{1}{2}} - \frac{1}{b + \frac{1}{2}},$$

which follows from $\frac{1}{i^2} \approx \frac{1}{(i - \frac{1}{2})(i + \frac{1}{2})} = \frac{1}{i - \frac{1}{2}} - \frac{1}{i + \frac{1}{2}}$ and telescoping. With $a = m+1$ and $b = 2m + 1$, we have

$$Var[X_{(m+1)}] \approx \frac{1}{\lambda^2}\left(\frac{1}{m + \frac{1}{2}} - \frac{1}{2m + \frac{3}{2}}\right) = \frac{2}{\lambda^2}\left(\frac{1}{n} - \frac{1}{2n+1}\right) = \frac{2(n+1)}{\lambda^2\, n(2n+1)} \approx \frac{1}{n\lambda^2}.$$
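Again, a quick numerical check (my own sketch, with $\lambda = 1$) shows that the telescoping approximation tracks the exact sum closely and behaves like $1/(n\lambda^2)$:

```python
lam = 1.0  # rate parameter; illustrative value
rows = []
for m in (5, 50, 500):
    n = 2 * m + 1
    # Exact variance of the sample median vs. the telescoping approximation.
    exact = sum(1.0 / i ** 2 for i in range(m + 1, n + 1)) / lam ** 2
    approx = (1 / (m + 0.5) - 1 / (2 * m + 1.5)) / lam ** 2
    rows.append((exact, approx))
    print(f"n = {n:>4}: exact {exact:.6f}  approx {approx:.6f}  1/(n lam^2) {1 / (n * lam ** 2):.6f}")
```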