Statistics is often taught in school by and for people who like mathematics. As a consequence, those classes put the emphasis on learning equations, solving calculus problems and creating mathematical models instead of building an intuition for probabilistic problems. But if you are reading this, you know a bit of R programming and have access to a computer that is really good at computing stuff! So let’s learn how to tackle useful statistical problems by writing simple R queries, and how to think in probabilistic terms.
In the previous set, we saw how to compute probabilities from certain density distributions, how to simulate situations to compute their probability, and how to use that knowledge to make decisions in obvious situations. But what is a probability? Is there a more scientific way to make those decisions? What is the p-value xkcd keeps talking about? In this exercise set, we will learn the answers to most of those questions and more!
One simple definition of the probability that an event will occur is that it’s the frequency of this event in a data set, that is, the number of times it is observed, divided by the total number of observations in this set. For example, if you have a survey where 2 respondents out of 816 say that they are interested in a potential partner only if they are dressed in an animal costume, you can say that the probability that someone in the population is a furry is about 2/816, or 1/408, or 0.00245…, or 0.245%.
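As a minimal sketch of that definition, here is the survey example rebuilt in R. The vector `responses` is a made-up stand-in for the survey data (only the counts 2 and 816 come from the text), and the variable names are just for illustration.

```r
# Hypothetical survey data: 816 respondents, 2 of whom answered "yes".
responses <- c(rep(TRUE, 2), rep(FALSE, 814))

# Probability estimate = frequency of the event / total number of observations.
p_hat <- sum(responses) / length(responses)

# mean() of a logical vector gives the same proportion in one step.
p_hat_mean <- mean(responses)

print(p_hat)        # as a number between 0 and 1
print(100 * p_hat)  # as a percentage
```

Using `mean()` on a logical vector is the shortcut you will probably use most often when estimating a probability from data.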
Answers to the exercises are available here.
The average height of males in the USA is about 5 foot 9 inches with a standard deviation of 2.94 inches. If this measure follows a normal distribution, write a function that takes a sample size as input and computes the probability of having a subject taller than 5 foot 8 and shorter than 5 foot 9 in a sample of that size. Then, set the seed to 42 and compute the probability for a sample size of 200.
We can deduce a lot from that definition. First, a probability is always a fraction, but since we are usually not used to large numbers and have a hard time doing division in our heads, 3968/17849 is not a really useful probability. As a consequence, we will usually use a percentage or a real number between 0 and 1 to represent a probability. Why 0 and 1? If an event is not present in the data set, its frequency is 0, so whatever the total number of observations, its probability is 0; and if all the observations are the same event, the fraction is equal to 1. Also, if you think about the example of the furries in the survey, maybe you think that there’s a chance that there are only two furries in the entire population and they both took the survey, so the probability that an individual is a furry is in reality a lot lower than 0.245%. Or maybe there are a lot more furries in the population and only two were surveyed, which makes the real probability much higher. You are right, token reader! In a survey, we estimate the real probability, and we can never know the real probability exactly from a small sample (that’s why, if you are against the national survey in your country, all the statisticians hate you in silence). However, the larger the sample size of a survey, the less often those rare occurrences happen.
- Compute the probability that an American male is taller than 5 foot 8 and shorter than 5 foot 9 with the pnorm function.
- Write a function that draws a sample of subjects from this distribution, computes the probability of observing a male of this height and computes the percentage difference between that estimate and the real value. Make sure that you can repeat this process for all sample sizes between two values.
- Use this function to draw samples of size 1 to 10000 and store the results in a matrix.
- Plot the difference between the estimate of the probability and the real value.
This plot shows that the bigger the sample size, the smaller the estimation error, but the difference in error between a sample of size 1000 and one of size 10000 is quite small.
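To see the same effect on a different problem, here is a rough sketch that estimates a probability we can compute exactly and tracks the average error for a few sample sizes. The event (a standard normal value between -1 and 1), the seed, the sample sizes and the helper `estimate_p()` are all arbitrary choices made for this illustration.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# True probability that a standard normal value falls between -1 and 1.
true_p <- pnorm(1) - pnorm(-1)

# Hypothetical helper: estimate that probability from a simulated sample of size n.
estimate_p <- function(n) {
  x <- rnorm(n)
  mean(x > -1 & x < 1)
}

sample_sizes <- c(10, 100, 1000, 10000)

# Repeat each estimate 200 times and average the absolute error.
mean_abs_error <- sapply(sample_sizes, function(n) {
  mean(abs(replicate(200, estimate_p(n)) - true_p))
})

print(data.frame(n = sample_sizes, mean_abs_error = mean_abs_error))
```

The error should shrink as n grows, but each tenfold increase in sample size buys a smaller absolute improvement than the previous one.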
We have already seen that a probability density can be used to compute probabilities, but how?
For a standard normal distribution:
- Compute the probability that x is smaller than or equal to zero, then plot the distribution and draw a vertical line at 0.
- Compute the probability that x is greater than zero.
- Compute the probability that x is less than -0.25, then plot the distribution and draw a vertical line at -0.25.
- Compute the probability that x is smaller than zero and greater than -0.25.
Yeah, the area under the curve of a density function between two points is equal to the probability that an event takes a value in this interval. That’s why densities are really useful: they help us to easily compute the probability of an event by doing calculus. Often we will use the cumulative distribution function (cdf), which is an antiderivative of the density function, to compute directly the probability of an event on an interval. The function pnorm(), for example, computes the value of the cdf at x, which is the area under the normal density between minus infinity and x. Note that a cdf returns the probability that a random variable takes a value smaller than or equal to x.
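If you want to convince yourself that the area under the density and the cdf really agree, a quick check along these lines can help. The endpoints `a` and `b` are arbitrary values picked for this sketch, and `integrate()` is used here only as a numerical stand-in for doing the calculus by hand.

```r
# Arbitrary interval endpoints for the illustration.
a <- -1
b <- 0.5

# Area under the standard normal density between a and b, by numerical integration.
area <- integrate(dnorm, lower = a, upper = b)$value

# Same probability from the cdf: P(a < X <= b) = F(b) - F(a).
prob <- pnorm(b) - pnorm(a)

print(c(area = area, prob = prob))

# pnorm() returns the lower tail by default; lower.tail = FALSE gives P(X > b).
upper <- pnorm(b, lower.tail = FALSE)
print(upper)
```

Both numbers should match up to the numerical tolerance of `integrate()`, which is why in practice we just subtract two `pnorm()` values.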
For a standard normal distribution, find the values x such that:
- 99% of the observations are smaller than x.
- 97.5% of the observations are smaller than x.
- 95% of the observations are smaller than x.
- 99% of the observations are greater than x.
- 97.5% of the observations are greater than x.
- 95% of the observations are greater than x.
Since probabilities are often estimated, it is useful to measure how good the estimate is and to report that measure with the estimate. That’s why you often hear surveys reported in the form of “x% of the population, with a margin of error of y%, 19 times out of 20”. In practice, the size of the survey and the variance of the results are the two most important factors that influence the estimation of a probability. Simulation and bootstrap methods are great ways to find the margin of error of an estimate.
Load this dataset and use bootstrapping to compute the interval that has a 95% (19/20) chance of containing the real probability of getting a value between 5 and 10. What is the margin of error of this estimate?
This process can be applied to any statistic that is estimated, like a mean, a proportion, etc.
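Here is one possible sketch of a bootstrap margin of error, under stated assumptions: the data `x` is simulated rather than real, the event of interest (a value above 9) and the helper `prop_above()` are invented for the example, and taking half the width of the percentile interval is only one simple way to summarise the margin.

```r
set.seed(123)  # arbitrary seed, only for reproducibility

# Hypothetical data standing in for a real sample.
x <- rnorm(500, mean = 8, sd = 3)

# Hypothetical statistic of interest: proportion of values above 9.
prop_above <- function(v) mean(v > 9)
estimate <- prop_above(x)

# Bootstrap: resample with replacement and recompute the statistic each time.
boot_props <- replicate(5000, prop_above(sample(x, replace = TRUE)))

# Percentile interval meant to cover the real value about 19 times out of 20.
ci <- quantile(boot_props, probs = c(0.025, 0.975))

# One simple summary of the margin of error: half the interval width.
margin <- unname(diff(ci)) / 2

print(estimate)
print(ci)
print(margin)
```

You could then report the result in the survey style: the estimate, plus or minus the margin, 19 times out of 20.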
When doing estimation, we can use a statistical test to draw conclusions about our estimate and eventually make decisions based on it. For example, if we estimate from a survey that the average number of miles traveled by car each week by Americans is 361.47, we could be interested in knowing whether the real average is bigger than 360. To do so, we start by formulating a null and an alternative hypothesis to test. In our scenario, the null hypothesis would be that the mean is equal to or less than 360. We will follow the steps of the test and if, at the end, we cannot support this hypothesis, we will conclude that the alternative hypothesis is probably true. In our scenario, that hypothesis is that the mean is bigger than 360.
Then we choose the percentage of times we can afford to be wrong. This value, called the significance level (α), determines the range of possible values for which we will not reject the null hypothesis.
Then we can use a mathematical formula or a bootstrap method to estimate the probability that a sample from this population would produce an estimate at least as extreme as 361.47 if the null hypothesis were true. If this probability is less than the significance level, we reject the null hypothesis and go with the alternative hypothesis. If not, we cannot reject the null hypothesis.
So basically, we look at how often our estimate should happen if the null hypothesis is true, and if it’s rare enough for our taste (the significance level), we conclude that it’s not a random occurrence but a sign that the null hypothesis is false.
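The steps above can be sketched roughly as follows. Everything specific here is an assumption made for illustration: the data is simulated, the reference value of 50 is made up, and the test is a simple percentile-style bootstrap check rather than a formula-based test.

```r
set.seed(2024)  # arbitrary seed, only for reproducibility

# Hypothetical sample; the values and the reference value are made up.
x <- rnorm(200, mean = 52, sd = 10)
reference <- 50  # H0: mean <= 50, H1: mean > 50
alpha <- 0.05    # significance level

# Bootstrap distribution of the sample mean.
boot_means <- replicate(10000, mean(sample(x, replace = TRUE)))

# Share of bootstrapped means at or below the reference value.
p_below <- mean(boot_means <= reference)

# Closely related check: the alpha-quantile of the bootstrapped means.
critical_value <- unname(quantile(boot_means, probs = alpha))

# Reject H0 when bootstrapped means at or below the reference are rare enough.
reject_null <- p_below < alpha

print(c(p_below = p_below, critical_value = critical_value))
print(reject_null)
```

Comparing `p_below` to `alpha` and comparing `critical_value` to `reference` are two views of the same decision, apart from ties right at the boundary.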
This dataset represents the survey of the situation above.
- Estimate the mean of this dataset.
- Use the bootstrap method to find 10000 estimates of the mean from this dataset.
- Find the value from this bootstrap sample that is bigger than 5% of all the other values. This value is called the critical value of the test and corresponds to α.
- From the data we have, should we conclude that the mean of the population is bigger than 360? What is the significance level of this test?
We can represent the test visually. Since we reject the null hypothesis only if the percentage of bootstrapped means smaller than 360 is less than 5%, we can simply look at where the fifth percentile lies on the histogram of the bootstrapped means. If it’s to the left of the 360 value, we know that more than 5% of the bootstrapped means are smaller than 360 and we don’t reject the null hypothesis.
Draw the histogram of the bootstrapped means and draw two vertical lines: one at 360 and one at the fifth percentile.
There are two ways that a mean can be different from a value: the mean can be bigger than the value or smaller than it. So if we want to test whether the mean is equal to a specific value, we must verify whether most of our estimates lie around this value or whether a lot of them are far from it. To do so, we create an interval centered on the tested value, whose endpoints are our estimated mean and the point at the same distance on the other side of the tested value. Then we can compute the probability of getting an estimate outside this interval. This way, we test in both directions at once: if that probability is less than α, we reject the hypothesis that the mean is equal to the value.
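A two-direction test of this kind might look like the sketch below. The sample is simulated and the tested value of 100 is invented for the example; shifting the data so that its mean matches the tested value is one common way to simulate the null hypothesis with a bootstrap.

```r
set.seed(7)  # arbitrary seed, only for reproducibility

# Hypothetical sample and tested value; both are made up for illustration.
x <- rnorm(150, mean = 101, sd = 8)
tested_value <- 100  # H0: mean == 100
alpha <- 0.05

observed_mean <- mean(x)

# Shift the data so that its mean equals the tested value (simulates H0).
x_null <- x - observed_mean + tested_value

# Bootstrap distribution of the mean under H0.
boot_means <- replicate(10000, mean(sample(x_null, replace = TRUE)))

# Interval centered on the tested value, with the observed mean as one endpoint.
distance <- abs(observed_mean - tested_value)
interval <- c(tested_value - distance, tested_value + distance)

# Probability of an estimate at least as far from the tested value as ours.
p_outside <- mean(boot_means <= interval[1] | boot_means >= interval[2])

print(interval)
print(p_outside)
print(p_outside < alpha)  # TRUE means H0 is rejected at level alpha
```

Because both tails count as "outside", the same observed mean needs to be further from the tested value to be rejected here than in a one-direction test.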
Here are the steps to test the hypothesis that the mean of the dataset of exercise 6 is equal to 363:
- To simulate that our distribution has a mean of 363, shift the dataset so that this value becomes the mean.
- Generate 10000 bootstrapped means from this distribution.
- Compute the endpoints of the test interval.
- Compute the probability that the mean is outside this interval.
- What conclusion can we draw with an α of 5%?
Repeat the steps of exercise 8, but this time test if the mean is smaller than 363.
This shows that a one-direction test is more powerful than a two-direction test in this situation, since there’s less wiggle room between the reference value and the critical region of the test. So if you have prior knowledge that makes you believe that an estimate is bigger or smaller than a value, testing for that would give you more assurance of the validity of your results.
The p-value of a test is the probability that we would observe a random estimate at least as extreme as the one we made if the null hypothesis is true. This value is often used in scientific reports since it’s a concise way to express statistical findings. If we know the p-value of a test and the significance level α, we can deduce the result of the test, since the null hypothesis is rejected when p<α. In other words: you have been using the p-value all this time to draw conclusions!
Load the dataset of exercise 5 and compute the p-value associated with the test that the mean is equal to 13 when α is equal to 5%.
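To make the link between the p-value and α concrete, here is a small sketch on simulated data. The sample, the null value of 20 and the one-direction alternative are assumptions chosen for the example; the classical `t.test()` is included only as a point of comparison with the bootstrap estimate.

```r
set.seed(99)  # arbitrary seed, only for reproducibility

# Hypothetical data and null value, made up for illustration.
x <- rnorm(100, mean = 21, sd = 5)
null_mean <- 20  # H0: mean <= 20, H1: mean > 20
alpha <- 0.05

observed_mean <- mean(x)

# Simulate the null hypothesis by shifting the data so its mean is null_mean.
x_null <- x - observed_mean + null_mean
boot_means <- replicate(10000, mean(sample(x_null, replace = TRUE)))

# p-value: share of null-world estimates at least as large as the observed one.
p_value <- mean(boot_means >= observed_mean)

# Classical one-sample t-test, for comparison only.
t_p_value <- t.test(x, mu = null_mean, alternative = "greater")$p.value

print(c(bootstrap = p_value, t_test = t_p_value))
print(p_value < alpha)  # TRUE means H0 is rejected at level alpha
```

The two p-values will not be identical, since one comes from resampling and the other from a formula, but on well-behaved data like this they should tell a similar story.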