If there are two (can be generalized to n) classes and both follow the same distribution (but with different parameters) it is possible to predict which class an observations comes from.
Here I’ll try to predict a sample’s gender based on their height. The distribution of a person’s height is more or less normal. There are two parameters of a normal distribution. I’ll consider the easy case in this post: males and females have different average heights, but the distributions have the same standard deviation.
For the graphs and subscripts, male = 1, female = 0.
There are 3 things to do:
- Make sure our data is roughly normal. If our prediction is predicated on the data being normal, it data better be normal.
- Derive the decision rule.
- Test how well our rule works.
The data set we’ll be using is from the Journal of Statistics Education. I’ve stripped out most the the information except for height, and gender.
Again this looks good.
Deriving the Decision Rule
Great, so the the data is normal, but what’s next. We’ll make the decision to classify a case to a gender if the probability of that case being male is greater than that case being female. Or, formally,
Because we’ve assumed normality let’s put the pdf’s the inequality.
Remember, we assumed that the standard deviations were the same.
It’s fairly obvious form the equations that when,
the original inequality will hold. Now if we do some algebra we can see that when
the case will be classified as male. To visualize this, it would be a vertical line through the average of the means. Anything on the right male, on the left female.
To see how well our decision rule works the data needs to be split into a training set – to put actual numbers to the rule – then a testing set to see how well the prediction works.
I’ll be using R to do the analysis. All the data is in a data.frame
First we’ll split the data.frame into the training and testing sets.
> nr <- nrow(hw)
> hw.shuffle <- hw[sample.int(nr),]
> hw.train <- hw.shuffle[1:as.integer(nr*.7),]
> hw.test <- hw.shuffle[as.integer(nr*.7):nr,]
So now that the data is split into the two separate sets the mean of the training set can be tested against the test set.
> tapply(hw.train$height, hw.train$gender, mean)
Which means from decision rule derived above anything larger than the average of the 164.79 and 177.77, which is 171.28, will be classified as male, and under will be classified as female.
Now to set up the classification.
> hw.train.mean <- mean(c(164.79,177.77)) > hw.test$classify <- rep(0, (nrows(hw.test)) > hw.test$classify <- ifelse(hw.test$height > hw.train.mean, 1, 0)
> hw.test$classify <- as.factor(hw.test$classify) > hw.test$classify <- as.factor(hw.test$classify) > tab <- table(hw.test$gender, hw.test$classify) > tab
0 62 12
1 17 61
The table shows the number predicted vs the actual number. Meaning there are 74 females in our test, and we correctly predicted 62 of them.
This is pretty good, of 152 test cases, the decision rule correctly predicted 123 correct or ~81%. It could be made potentially better by assuming a different standard deviations between factors.
And to wrap it up, and nice graph showing the rule overlaid with polynomial density.
- There are much better ways to check for normality, but this’ll do there.
- Remember that when you multiply both sides of an inequality by a negative number you switch the inequality.