Python and R: Basic Sampling Problem

April 16, 2015
By

(This article was first published on Analysis with Programming, and kindly contributed to R-bloggers)

In this post, I would like to share a simple problem about sampling analysis. And I will demonstrate how to solve this using Python and R. The first two problems are originally from Sampling: Design and Analysis book by Sharon Lohr.

Problems

  1. Let $N=6$ and $n=3$. For purposes of studying sampling distributions, assume that all population values are known.


    $y_1 = 98$ $y_2 = 102$ $y_3=154$
    $y_4 = 133$ $y_5 = 190$ $y_6=175$


    We are interested in $bar{y}_U$, the population mean. Consider eight possible samples chosen.


    Sample No. Sample, $mathcal{S}$ $P(mathcal{S})$
    1 ${1,3,5}$ $1/8$
    2 ${1,3,6}$ $1/8$
    3 ${1,4,5}$ $1/8$
    4 ${1,4,6}$ $1/8$
    5 ${2,3,5}$ $1/8$
    6 ${2,3,6}$ $1/8$
    7 ${2,4,5}$ $1/8$
    8 ${2,4,6}$ $1/8$


    1. What is the value of $bar{y}_U$?
    2. Let $bar{y}$ be the mean of the sample values. For each sampling plan, find
      1. $mathrm{E}bar{y}$;
      2. $mathrm{Var}bar{y}$;
      3. $mathrm{Bias}(bar{y})$;
      4. $mathrm{MSE}(bar{y})$;
  2. Mayr et al. (1994) took an SRS of 240 children who visisted their pediatric outpatient clinic. They found the following frequency distribution for the age (in months) of free (unassisted) walking among the children:


    Age (months) 9 10 11 12 13 14 15 16 17 18 19 20
    Number of Children 13 35 44 69 36 24 7 3 2 5 1 1


    Find the mean and SE of the age for onset of free walking.
  3. Table 1 gives the cultivated area in acres in 1981 for 40 villages in a region. (Theory and Method of Survey) Using the arrangement (random) of data in the table, draw systematic sample of size 8. Use r ((random start) = 2,


    Village $Y_j$ Village $Y_j$ Village $Y_j$ Village $Y_j$
    1 105 11 319 21 70 31 16
    2 625 12 72 22 249 32 439
    3 47 13 109 23 384 33 123
    4 312 14 91 24 482 34 207
    5 327 15 152 25 378 35 145
    6 230 16 189 26 111 36 666
    7 240 17 365 27 534 37 338
    8 203 18 70 28 306 38 624
    9 535 9 249 29 655 39 501
    10 275 20 384 30 102 40 962

Solutions

In order to appreciate the codes, I will share some theoretical part of the solution. But our main focus here is to solve this problem computationally using Python and R.

    1. The value of $bar{y}_U$ is coded as follows:

      Python Code R Code

    2. To obtain the sample using the sample index given in the table in the above question, we do a combination of population index of three elements, ${6choose 3}$, first. Where the first two combinations are the samples, ${1,2,3}$ and ${1,2,4}$, and so on. Then from this list of all possible combinations of three elements, we draw those that are listed in the above table as our samples, with first sample index ${1,3,5}$, having population units, ${98, 154, 190}$. So that the following is the code of this sampling design:

      Python Code R Code

      1. Now to obtain the expected value of the average of the sample data, we compute it using $mathrm{E}bar{y}=sum_{k}bar{y}_kmathrm{P}(bar{y}_k)=sum_{k}bar{y_k}mathrm{P}(mathcal{S}_k)$, $forall kin{1,cdots,8}$. So for $k = 1$, $$ begin{aligned} bar{y}_1mathrm{P}(mathcal{S}_1)&=frac{98+154+190}{3}mathrm{P}(mathcal{S}_1)\ &=frac{98+154+190}{3}left(frac{1}{8}right)=18.41667. end{aligned} $$ Applying this to the remaining $n-1$ $k$s, and summing up the terms gives us the answer to $mathrm{E}bar{y}$. So that the following is the equivalent of this:

        Python Code R Code From the above code, the output tells us that $mathrm{E}bar{y}=140$.

      2. Next is to compute for the variance of $bar{y}$, which is $mathrm{Var}bar{y}=mathrm{E}bar{y}^{2}-(mathrm{E}bar{y})^2$. So we need a function for $mathrm{E}bar{y}^2$, where the first term of this, $k=1$, is $bar{y}_1^2mathrm{P}(mathcal{S}_1)=left(frac{98+154+190}{3}right)^2mathrm{P}(mathcal{S}_1)=left(frac{98+154+190}{3}right)^2(frac{1}{8})=2713.3889$. Applying this to other terms and summing them up, we have following code:

        Python Code R Code So that using the above output, 20182.94, and subtracting $(mathrm{E}bar{y})^2$ to it, will give us the variance. And hence the succeeding code:

        Python Code: R Code: So the variance of the $bar{y}$ is $18.9444$.

    3. The $mathrm{Bias}$ is just the difference between the estimate and the true value. And since the estimate is unbiased ($mathrm{E}bar{y}=142$), so $mathrm{Bias}=142-142=0$.
    4. $mathrm{MSE}=mathrm{Var}bar{y}-(mathrm{Bias}bar{y})^2$, and since the $mathrm{Bias}bar{y}=0$. So $mathrm{MSE}=mathrm{Var}bar{y}$.
  1. First we need to obtain the probability of each Age, that is by dividing the Number of Children with the total sum of it. That is why, we have p_s function defined below. After obtaining the probabilities, we can then compute the expected value using the expectation function we defined earlier.

    Python Code R Code It should be clear in the data that the average age is about 12 months old, where the plot of it is shown below, For the code of the above plot please click here. Next is to compute the standard error which is just the square root of the variance of the sample,

    Python Code R Code So the standard variability of the Age is 1.920824.

  2. Let me give you a brief discussion on the systematic sampling to help you understand the code. The idea in systematic sampling is that, given the population units numbered from 1 to $N$, we compute for the sampling interval, given by $k = frac{N}{n}$, where $n$ is the number of units needed for the sample. After that, we choose for the random start, number between $1$ and $k$. This random start will be the first sample, and then the second unit in the sample is obtained by adding the sampling interval to the random start, and so on. There are two types of systematic sampling namely, Linear and Circular Systematic Samplings. Circular systematic sampling treats the population units numbered from $1$ to $N$ in circular form, so that if the increment step is more than the number of $N$ units, say $N+2$, the sample unit is the $2^{nd}$ element in the population, and so on. The code that I will be sharing can be used both for linear and circular, but for this particular problem only. Since there are rules in linear that are not satisfied in the function, one of which is if $k$ is not a whole number, despite that, however, you can always extend it to a more general function.

    Python Code R Code You may notice in the output above, that the index returned in Python is not the same with the index returned in R. This is because Python index starts with 0, while that in R starts with 1. So that’s why we have the same population units sampled between the two language despite the differences between the index returned.

Reference

  1. Lohr, Sharon (2009). Sampling: Design and Analysis. Cengage Learning.

To leave a comment for the author, please follow the link and comment on their blog: Analysis with Programming.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)