Statistics is often taught in school by and for people who like mathematics. As a consequence, in those classes the emphasis is put on learning equations, solving calculus problems and creating mathematical models instead of building an intuition for probabilistic problems. But if you are reading this, you know a bit of R programming and have access to a computer that is really good at computing stuff! So let’s learn how we can tackle useful statistical problems by writing simple R queries and how to think in probabilistic terms.
In today’s set we take a break from hypothesis testing and come back to the fundamentals of statistics: probability. More precisely, in this set you will see how to compute the probability of complex events, use conditional and marginal distribution functions, and learn to sample from and plot a multivariate distribution function.
Answers to the exercises are available here.
So far we know that probabilities take a value between 0 and 1. We also know that the probabilities of the single events that form a set can be added together to compute the probability that any event in that set happens. For example, the probability of getting hit by a bus or bitten by a shark on a given day is equal to the sum of those two probabilities. However, we can only say that because it’s almost impossible for both events to happen on the same day (if you know somebody who got bitten by a shark, survived, then got hit by a bus, please stay far from them for your own safety!). Events of that kind are called mutually exclusive and can be identified by looking at the Venn diagram of the outcomes. More info here. For those interested, when two events are not mutually exclusive, we can still add their probabilities together to get the probability that one event or the other happens, but we must subtract the probability of getting both events from the total. The next exercise should give you an idea of why we must subtract this value from the total.
The quality assurance department of a video game studio classifies the bugs it finds into two categories: graphic issues and collision bugs. One of the testers created this dataset compiling the bugs he found during an average workday.
- Use the VennDiagram package to draw the Venn diagram of the dataset.
- What is the probability of finding a bug that is only a graphic issue? Of finding one that is only a collision bug?
- What is the probability that the tester finds a graphic bug that is also a collision bug?
- What is the probability that the tester finds a graphic bug or a collision bug?
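The addition rule for events that are not mutually exclusive can be sketched on simulated data. The vectors graphic and collision below are hypothetical (they are not the tester's dataset, and the 0.6 and 0.5 rates are arbitrary): each element says whether a simulated bug shows that kind of problem.

```r
# Hypothetical data: 1000 simulated bugs, NOT the tester's dataset.
set.seed(42)
n <- 1000
graphic   <- runif(n) < 0.6   # TRUE if the bug is a graphic issue
collision <- runif(n) < 0.5   # TRUE if the bug is a collision bug

p_graphic   <- mean(graphic)               # P(A)
p_collision <- mean(collision)             # P(B)
p_both      <- mean(graphic & collision)   # P(A and B)
p_either    <- mean(graphic | collision)   # P(A or B), counted directly

# Adding P(A) and P(B) counts the overlap twice,
# so it is subtracted once to recover P(A or B).
p_addition_rule <- p_graphic + p_collision - p_both

print(c(direct = p_either, addition_rule = p_addition_rule))
```

Both numbers agree because every bug that belongs to the two categories appears once in each of the first two counts.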
If we have two events A and B, we know how to compute the probability that A or B happens. Now, if you want to know the probability of observing both A and B, there are two possible scenarios: the one where the realization of A influences the probability of B, and the one where the probability of B stays the same whether A happens or not. In the latter case the events are said to be independent, and it is the easiest one to compute: we just have to multiply both probabilities to get the probability that both events happen.
This result can be extended to more than two events. For example, if you flip a coin three times and want to know the probability of getting three heads, you know that the result of each coin flip doesn’t influence the next one. As a consequence, you can just multiply the probabilities of each event, in this case 0.5*0.5*0.5=0.125, to get the probability of this particular result.
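The coin example can be checked with a quick simulation. This is only a sketch: the number of experiments and the seed are arbitrary choices, and the empirical value will move a little if you change them.

```r
# Simulate 10000 experiments of three fair coin flips each.
set.seed(123)
n_experiments <- 10000
flips <- matrix(sample(c("H", "T"), size = 3 * n_experiments, replace = TRUE),
                ncol = 3)

# An experiment is a success when all three flips are heads.
three_heads <- rowSums(flips == "H") == 3

empirical   <- mean(three_heads)
theoretical <- 0.5 * 0.5 * 0.5   # multiplication rule for independent events

print(c(empirical = empirical, theoretical = theoretical))
```

The empirical proportion should land close to 0.125, and it gets closer as the number of experiments grows.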
- Sample with replacement 500 integers between 1 and 10 and store the result in a vector called Event.A.
- Sample with replacement 500 integers between 1 and 5 and store the result in a vector called Event.B.
- Suppose each pair of elements at the same position in both vectors represents the result of a simultaneous draw. Empirically, what is the probability of drawing the number 5 in both vectors at the same time? What is the probability of drawing a 1 in the first vector and a number bigger than 3 in the second?
- Use the multiplication rule to compute the probability of those events and compare the results with those of the last exercise.
When the realization of an event A changes the probability of an event B, we estimate what is called a conditional probability. To do so, we use the same process that we used to estimate a probability, but since the event A changes the possible outcomes of event B, we use the number of those possible outcomes as the denominator in our formula. So the general formula for the estimation of a probability, (number of observations of B) / (total number of observations), becomes (number of observations of B when A happens) / (total number of observations when A happens). A more formal definition is available here.
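The change of denominator can be sketched with a simulated die, which is a hypothetical example and not one of the datasets of this set. Here A is the event that the roll is even and B the event that the roll is greater than 3, so in theory P(B) = 1/2 and P(B given A) = 2/3.

```r
set.seed(2024)
rolls <- sample(1:6, size = 6000, replace = TRUE)  # hypothetical fair die

A <- rolls %% 2 == 0   # event A: the roll is even
B <- rolls > 3         # event B: the roll is greater than 3

# Ordinary estimate: observations of B / all observations
p_B <- sum(B) / length(rolls)

# Conditional estimate: observations of B when A happens / observations of A
p_B_given_A <- sum(A & B) / sum(A)

print(c(p_B = p_B, p_B_given_A = p_B_given_A))
```

Only the rolls where A happened are kept in the denominator, which is why the second estimate differs from the first.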
- Load this dataset and explore it (make a histogram and list the unique observed values).
- Compute the probability of observing each value.
- There seem to be two sub-processes that compose those random events. Let’s assume that this dataset represents a lottery where you have 1 chance out of 100 to get a bonus that multiplies your prize by 10, and that this bonus appears only in a winning situation. In this case, we could be interested in knowing the probability of winning and not having the bonus. Use the dataset to estimate the probability of those individual events.
- Compute the probabilities of winning each amount when the bonus is applied.
At a rural fair, people can pay 5 dollars to play a game where they choose to open one of three doors and pick a plastic ball from a closed box that sits behind the door. If the ball is red, they win 50 dollars, and if the ball is blue, they win nothing. Each box contains 50 balls, but the number of red balls changes from one box to the other. A bored statistician has spent an afternoon recording which door was chosen by each of 450 players and whether they won.
- Load this dataset.
- Estimate the probability of winning at this game.
- Estimate the probability of winning at this game, if you choose the first door, the second door or the third door.
- Create a contingency table of this situation.
- Use the table to compute the conditional probability of winning if someone chose the first door, the second or the third door.
Just as for ordinary probabilities, we can create a distribution from the conditional probabilities to better understand how a random process behaves. The easiest way to compute such a distribution is to use a contingency table, where all the outcomes of two events are listed in the margins and the elements are the number of observations of each combination of outcomes. The conditional distribution given that an outcome Ai happened corresponds to the ECDF computed by using the observations in the row or column of Ai.
Another useful distribution is the marginal distribution, which is the distribution of the individual events A and B. The name marginal comes from the fact that, when using a contingency table to estimate it, we must use the totals of each row and column to compute the ECDF, and those values are often put in the margins. The next exercise should help you get familiar with those concepts.
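Before the exercise, here is a minimal sketch of both ideas in base R. The colour and size variables are hypothetical data invented for the illustration; only table(), margin.table(), prop.table() and addmargins() are used.

```r
# Hypothetical data: 200 observations of two categorical events.
set.seed(7)
colour <- sample(c("red", "blue"), 200, replace = TRUE, prob = c(0.3, 0.7))
size   <- sample(c("small", "medium", "large"), 200, replace = TRUE)

counts <- table(colour, size)   # contingency table of counts

# Marginal distributions: row and column totals divided by the grand total.
marginal_colour <- margin.table(counts, 1) / sum(counts)
marginal_size   <- margin.table(counts, 2) / sum(counts)

# Conditional distribution of size given each colour: every row sums to 1.
size_given_colour <- prop.table(counts, margin = 1)

print(addmargins(counts))   # counts with the totals shown in the margins
print(marginal_colour)
print(size_given_colour)
```

Using margin = 2 in prop.table() would give the conditional distribution of colour given each size instead.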
A sample of 50 articles from three websites on the same subject has been analyzed by a professional fact checker to assess the quality of their news coverage. The articles have been classified into three categories: factually correct, mostly correct and fake news. The following dataset shows the result of his work.
- What is the probability of getting a factually correct, a mostly correct or a fake news article when looking at a random article from one of those sites?
- What is the probability of reading a fake news article from the first website?
- What is the probability of reading the second website if you are reading a factually correct article?
- What is the marginal distribution in this situation?
- What is the conditional distribution for the mostly correct news?
Let’s look at the multivariate normal distribution and how the marginal and the conditional distributions are used in this case. Basically, a multivariate normal distribution is a distribution of dimension higher than 1 whose components are normally distributed.
- Generate 2000 points from a standard normal distribution and store the results in a vector called x.
- Generate 2000 points from a normal distribution of mean 10 and a standard deviation of 5 and store the result in a vector called y.
- Create a matrix with two columns x and y which will be the coordinates of 2000 points.
- Make a basic plot of the points in the last matrix and draw the histograms of both x and y.
We know the marginal distributions of the multivariate normal distribution of the last exercise: they are the distributions of the x and y variables. Fun fact: the projection of the multivariate normal distribution on the x-z plane is identical to the distribution of the variable x, i.e. if we look at the 3D histogram of those points by putting our eye over the x axis, the shape of the curve looks like the distribution of x. The same is true for the projection of the curve on the y-z plane.
Create a histogram of the points of the matrix from the last exercise whose x coordinate is bigger than 1.3 but smaller than 1.5. Then do the same thing for the points whose y coordinate is between 10 and 11.
Those are the conditional distributions for some fixed value of x or y. We can see that those conditional distributions are also normally distributed!
We made a basic plot of the points from this multivariate distribution earlier, but that plot didn’t show the shape of the distribution. We can do better. Use the plot3D package and the hist3D() function (more detail here) to draw the 3D histogram of the dataset from the last exercise.
Another way to represent a 3D distribution in 2D is to use a heatmap. Draw the heatmap of your sample by using:
- the image2D function
- the hist2d() function
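To see what a heatmap does under the hood, the binning can be sketched with base R alone. This is not the method asked for in the exercise: it only uses cut(), table() and image(), and the choice of 20 bins per axis is an arbitrary assumption.

```r
set.seed(1)
x <- rnorm(2000)                      # same kind of sample as in the exercise
y <- rnorm(2000, mean = 10, sd = 5)

# Cut each axis into 20 bins that cover the whole sample.
x_breaks <- seq(floor(min(x)), ceiling(max(x)), length.out = 21)
y_breaks <- seq(floor(min(y)), ceiling(max(y)), length.out = 21)

# Count the points falling in each cell of the 20 x 20 grid.
counts <- table(cut(x, x_breaks, include.lowest = TRUE),
                cut(y, y_breaks, include.lowest = TRUE))

# image() colours each cell according to its count.
x_mid <- (head(x_breaks, -1) + tail(x_breaks, -1)) / 2
y_mid <- (head(y_breaks, -1) + tail(y_breaks, -1)) / 2
image(x_mid, y_mid, matrix(as.numeric(counts), nrow = 20),
      xlab = "x", ylab = "y")
```

The dedicated heatmap functions perform the same counting step and add conveniences such as a colour key.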
The variables x and y of our multivariate normal distribution are independent, meaning that the value of one doesn’t influence the value of the other. To create a more realistic sample, you should use the mvrnorm function from the MASS package, which lets you pass as an argument a matrix containing the covariance between each pair of variables. The covariance is a measure of the linear dependence between the variables; unlike the correlation, which always lies between -1 and 1, it is not restricted to a fixed range. You can read more about it here.
Use the mvrnorm function to sample 500 points from a multivariate normal distribution of dimension two. The marginal distribution of the first variable is a normal distribution with a mean of 5 and a standard deviation of 3, while the marginal distribution of the second has a mean of 9 and a standard deviation of 1.5. The covariance between both variables is 0.6.
Then draw the heatmap of this distribution.
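The part that usually causes trouble is building the covariance matrix, so here is a sketch of that step with mvrnorm() from the MASS package. The means, standard deviations and covariance below are hypothetical values chosen for the illustration, not the ones from the exercise.

```r
library(MASS)  # provides mvrnorm()

set.seed(99)
mu     <- c(0, 2)   # hypothetical means
sds    <- c(1, 2)   # hypothetical standard deviations
cov_xy <- 1.2       # hypothetical covariance between the two variables

# Variances (sd^2) go on the diagonal, the covariance off the diagonal.
Sigma <- matrix(c(sds[1]^2, cov_xy,
                  cov_xy,   sds[2]^2), nrow = 2, byrow = TRUE)

draws <- mvrnorm(n = 5000, mu = mu, Sigma = Sigma)

print(colMeans(draws))   # should be close to mu
print(cov(draws))        # should be close to Sigma
print(cor(draws)[1, 2])  # correlation = covariance / (sd1 * sd2)
```

Note that mvrnorm() expects variances, not standard deviations, on the diagonal of Sigma.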