Statistics is often taught in school by and for people who like mathematics. As a consequence, those classes put the emphasis on learning equations, solving calculus problems and creating mathematical models instead of building an intuition for probabilistic problems. But if you are reading this, you know a bit of R programming and have access to a computer that is really good at computing stuff! So let’s learn how we can tackle useful statistical problems by writing simple R code and how to think in probabilistic terms.
Until now, we have used random variable simulation and bootstrapping to test hypotheses and compute statistics of a single sample. In today’s set, we’ll learn how to use permutation to test hypotheses about two different samples and how to adapt bootstrapping to this situation.
Answers to the exercises are available here.
- Generate 500 points from a beta distribution with parameters a=2 and b=1.5, then store the result in a vector named beta1.
- Generate 500 points from the same distribution and store those points in a vector named beta2.
- Concatenate both vectors to create a vector called beta.data.
- Plot the ecdf of beta1 and beta2.
- Sample 500 points from beta.data and plot the ecdf of this sample. Repeat this process 5 times.
- Do all those samples share the same distribution? If so, what is that distribution?
When we test a hypothesis, we suppose that the hypothesis is true, simulate what would happen if that were the case, and reject the hypothesis if a result at least as extreme as our initial observation happens less than α percent of the time. Now, from the first exercise, we know that if two samples share the same distribution, any sample drawn from the pooled observations will follow that same distribution. In particular, if we pool the observations from a sample of size n1 and those of a sample of size n2, shuffle them and draw two new samples of size n1 and n2, they should all have a similar CDF. We can use this fact to test the hypothesis that two samples have the same distribution. This process is called a permutation test.
Load this dataset, where each column represents a variable; we want to know if the two variables are identically distributed. Each exercise below follows a step of a permutation test.
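To make the idea concrete, here is a minimal sketch of a permutation test run on simulated data rather than on the exercise dataset. The helper `ecdf.gap` and the choice of test statistic (the largest vertical gap between the two ECDFs) are assumptions made for illustration; other statistics would work as well.

```r
# Simulated data: two samples drawn from the same beta distribution.
set.seed(42)
x <- rbeta(200, 2, 1.5)   # first sample, size n1
y <- rbeta(150, 2, 1.5)   # second sample, size n2

# Hypothetical test statistic: largest vertical gap between the two ECDFs.
ecdf.gap <- function(a, b) {
  grid <- sort(c(a, b))
  max(abs(ecdf(a)(grid) - ecdf(b)(grid)))
}

observed <- ecdf.gap(x, y)
pooled <- c(x, y)
n1 <- length(x)

# Shuffle the pooled data, split it into sizes n1 and n2,
# and recompute the statistic for each shuffle.
perm.stats <- replicate(2000, {
  shuffled <- sample(pooled)
  ecdf.gap(shuffled[1:n1], shuffled[-(1:n1)])
})

# Share of permutations with a gap at least as large as the observed one.
p.value <- mean(perm.stats >= observed)
p.value
```

If `p.value` falls below the chosen significance level, the observed gap would be unusual under the hypothesis that both samples share one distribution.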
- What are the null and alternative hypotheses for this test?
- Concatenate both samples into a new vector called data.ex.2.
- Write a function that takes data.ex.2 and the sizes of both samples as arguments, creates a temporary vector by permuting data.ex.2 and returns two new samples. The first sample has the same number of observations as the first column of the dataset; the second is made from the rest of the observations. Name this function permutation.sample (we will use it in the next exercise). Why do we want the function to return samples of those sizes?
- Plot the ECDF of both initial variables in black.
- Use the function permutation.sample 100 times to generate permuted samples, then compute the ECDF of those samples and add those curves to the previous plot. Use the color red for the first batch of samples and green for the second batch.
- By looking at the plot, can you tell if the null hypothesis is true?
A business analyst thinks that the daily returns of Apple stock follow a normal distribution with a mean of 0 and a standard deviation of 0.1. Use this dataset of the daily returns of the stock over the last 10 years to test this hypothesis.
A permutation test can help us verify whether two samples come from the same distribution, and if they do, we can conclude that both samples share the same statistics. As a consequence, a permutation test can also be used to test whether a statistic of two samples is the same. One really useful application is to test whether two means are the same or significantly different (as you have probably realized by now, statisticians are obsessed with the mean and love to spend time studying it!). In this situation, the question is whether the difference between the means of the two samples is due to chance or is a consequence of a difference in distribution.
You should be quite familiar with tests by now, so how would you proceed to do a permutation test to verify if two means are equal? Use that process to test the equality of the means of both samples in this dataset.
Looking at the average annual wage in the United States and Switzerland, both countries have roughly the same level of wealth, since those statistics are 60154 and 60124 US dollars respectively. In this dataset, you will find simulated annual wages of citizens of both countries. Test the hypothesis that Americans and the Swiss have the same average annual wage based on those samples at a level of 5%.
To test if two samples from different distributions have the same statistics, we cannot use the permutation test; we will use bootstrapping instead. To test if two samples have the same mean, for example, you should follow these steps:
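One possible way to carry out a permutation test on means is sketched below, using simulated data instead of the exercise dataset. The sample sizes, the number of permutations and the two-sided p-value are assumptions chosen for illustration.

```r
# Simulated data: two samples from the same normal distribution.
set.seed(1)
a <- rnorm(100, mean = 5, sd = 2)
b <- rnorm(120, mean = 5, sd = 2)

# Reference value: the observed difference of means.
obs.diff <- mean(a) - mean(b)

pooled <- c(a, b)
n.a <- length(a)

# For each permutation, pick n.a observations at random for the first group
# and give the rest to the second group, then record the difference of means.
perm.diff <- replicate(5000, {
  idx <- sample(length(pooled), n.a)
  mean(pooled[idx]) - mean(pooled[-idx])
})

# Two-sided p-value: how often a permuted difference is at least
# as far from zero as the observed one.
p.value <- mean(abs(perm.diff) >= abs(obs.diff))
p.value
```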
- Formulate a null and an alternative hypothesis.
- Set a significance level.
- Compute the difference between the means of both samples. This will be the reference value we will use to compute the p-value.
- Concatenate both samples and compute the mean of this new dataset.
- Shift both samples so that they share the mean of the concatenated dataset.
- Use bootstrap to generate an estimate of the mean of each shifted sample.
- Compute the difference between both means.
- Repeat the last two steps at least 1000 times.
- Compute the p-value and draw a conclusion.
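The steps above can be sketched as follows on simulated data. The two distributions, the sample sizes and the two-sided p-value are assumptions made for this illustration and are not taken from the exercise datasets.

```r
# Simulated data: same mean (10) but clearly different distributions.
set.seed(7)
s1 <- rnorm(80, mean = 10, sd = 1)
s2 <- rexp(100, rate = 1 / 10)

# Reference value: observed difference of means.
obs.diff <- mean(s1) - mean(s2)

# Mean of the concatenated data.
pooled.mean <- mean(c(s1, s2))

# Shift each sample so that both share the pooled mean
# (this enforces the null hypothesis while keeping each sample's shape).
s1.shift <- s1 - mean(s1) + pooled.mean
s2.shift <- s2 - mean(s2) + pooled.mean

# Resample each shifted sample with replacement and
# record the difference of the bootstrap means.
boot.diff <- replicate(2000, {
  mean(sample(s1.shift, replace = TRUE)) -
    mean(sample(s2.shift, replace = TRUE))
})

# Two-sided p-value against the observed difference.
p.value <- mean(abs(boot.diff) >= abs(obs.diff))
p.value
```

Shifting the samples, rather than pooling and shuffling them, is what lets this approach handle samples whose distributions differ in shape or spread.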
Use the dataset from the last exercise to see if the USA and Switzerland have the same average wage at a level of 5%.
Test the hypothesis that both samples in this dataset have the same mean.
R has functions that use analytic methods to test if two samples have equal means.
- Use the t.test() function to test the equality of the means of the samples from the last exercise.
- Use this function to test the hypothesis that the average wage in the US is higher than in Switzerland.
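As a rough illustration of how t.test() is called for two samples, here is a sketch on simulated wages. The vectors `us` and `ch` and their parameters are hypothetical stand-ins for the exercise data.

```r
# Simulated wages: same mean, different spread (hypothetical values).
set.seed(3)
us <- rnorm(200, mean = 60000, sd = 15000)
ch <- rnorm(200, mean = 60000, sd = 9000)

# Two-sided test of equal means (Welch's test, which is the default
# and does not assume equal variances).
two.sided <- t.test(us, ch)

# One-sided test: alternative hypothesis that mean(us) > mean(ch).
one.sided <- t.test(us, ch, alternative = "greater")

two.sided$p.value
one.sided$p.value
```

The `mu` argument of t.test() sets the hypothesized difference between the two means, which is useful when the null hypothesis is a difference other than zero.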
The globular cluster luminosity dataset lists measurements of the luminosity of clusters of stars in different regions of the Milky Way galaxy and the Andromeda galaxy. Test the hypothesis that the average luminosities in the two galaxies differ by 24.78.
A company that molds aluminum for auto parts has bought a smaller company to increase the number of parts it can produce each year. In its factory, the smaller company used the standard equipment, but used a different factory layout, had a different supply line and managed its employees' work schedules in a completely different manner than its new parent company. Before changing the company culture, the engineers of the parent company want to know which approach is more effective. To do so, they measured the time it took to make an auto part in each factory 150 times and created this dataset, where the first column represents the sample from the small factory.
- Is the average time it takes to make a part the same in both factories?
- Does the production time follow the same distribution in both factories?
- If the engineers want to minimize the percentage of parts that take more than one hour to make, which setup should they implement in both factories: that of the parent company or that of the smaller company?