Predicting Gender

November 28, 2011

(This article was first published on Lambda Omega Lambda » R, and kindly contributed to R-bloggers)

If there are two classes (this generalizes to n) and both follow the same distribution but with different parameters, it is possible to predict which class an observation comes from.

Here I’ll try to predict a person’s gender based on their height. The distribution of a person’s height is more or less normal. A normal distribution has two parameters, the mean and the standard deviation. I’ll consider the easy case in this post: males and females have different average heights, but the distributions have the same standard deviation.

For the graphs and subscripts, male = 1, female = 0.

There are 3 things to do:

  1. Make sure our data is roughly normal. If our prediction is predicated on the data being normal, the data had better be normal.
  2. Derive the decision rule.
  3. Test how well our rule works.

The data set we’ll be using is from the Journal of Statistics Education. I’ve stripped out most of the information except for height and gender.

Is Our Data Normal?

Have a look at this graph:

Histogram Genders Combined

Looks more or less normal, like we thought, but what about the genders by themselves?

Male vs Female


Again this looks good.[1]
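
For a check that goes beyond eyeballing histograms, a Shapiro-Wilk test or a normal Q-Q plot per gender is one option. The sketch below runs on simulated heights, since the real data set isn't reproduced here: the data frame `hw.sim`, its group means and the common standard deviation of 7 are made up for illustration.

```r
# Simulated stand-in for the real data (hypothetical values, not the article's data set)
set.seed(1)
hw.sim <- data.frame(
  gender = rep(c(0, 1), each = 250),
  height = c(rnorm(250, mean = 165, sd = 7), rnorm(250, mean = 178, sd = 7))
)

# Shapiro-Wilk test per gender; a small p-value is evidence against normality
sw <- tapply(hw.sim$height, hw.sim$gender, function(h) shapiro.test(h)$p.value)
print(sw)

# Normal Q-Q plot per gender; points close to the line suggest normality
par(mfrow = c(1, 2))
for (g in c(0, 1)) {
  h <- hw.sim$height[hw.sim$gender == g]
  qqnorm(h, main = paste("gender =", g))
  qqline(h)
}
```

On the real data, `hw.sim` would simply be replaced by `hw`.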

Deriving the Decision Rule

Great, so the data is normal, but what’s next? We’ll classify a case as male if the probability of that case being male is at least as large as the probability of it being female. Or, formally,
P(case = male | height = x) \geq P(case = female | height = x)
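Because we’ve assumed normality, let’s put the pdfs into the inequality (treating both genders as equally likely beforehand, so that comparing the two probabilities comes down to comparing the two densities).
\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(\frac{-(x - \mu_1)^2}{2\sigma^2}\right) \geq \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(\frac{-(x - \mu_0)^2}{2\sigma^2}\right)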
Remember, we assumed that the standard deviations were the same.
It’s fairly obvious from the equations[2] that when
(x - \mu_1)^2 \leq (x - \mu_0)^2
the original inequality will hold. Now if we do some algebra (using \mu_1 > \mu_0) we can see that when
x \geq \frac{\mu_1 + \mu_0}{2}
the case will be classified as male. To visualize this, picture a vertical line through the average of the two means: anything to the right is classified as male, anything to the left as female.
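
Spelled out, the algebra between the last two inequalities runs as follows. The final step divides by 2(\mu_1 - \mu_0), which is positive as long as males are taller on average, so the inequality keeps its direction.

```latex
\begin{aligned}
(x - \mu_1)^2 &\leq (x - \mu_0)^2 \\
x^2 - 2x\mu_1 + \mu_1^2 &\leq x^2 - 2x\mu_0 + \mu_0^2 \\
2x(\mu_1 - \mu_0) &\geq \mu_1^2 - \mu_0^2 = (\mu_1 - \mu_0)(\mu_1 + \mu_0) \\
x &\geq \frac{\mu_1 + \mu_0}{2}
\end{aligned}
```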

Testing

To see how well our decision rule works, the data needs to be split into a training set, to put actual numbers to the rule, and a testing set, to see how well the prediction works.

I’ll be using R to do the analysis. All the data is in a data.frame hw.

First we’ll split the data.frame into the training and testing sets.

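> nr <- nrow(hw)
> n.train <- as.integer(nr*.7)             # size of the training set (70%)
> hw.shuffle <- hw[sample.int(nr),]        # shuffle the rows
> hw.train <- hw.shuffle[1:n.train,]
> hw.test <- hw.shuffle[(n.train+1):nr,]   # start one row later so the sets don't overlap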
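
Because `sample.int` draws a fresh permutation on every run, the split (and every number that follows from it) changes from run to run. Here is a minimal sketch of a reproducible 70/30 split in which the two sets share no rows, shown on a made-up data frame `hw.sim` rather than the real `hw`; the seed value is arbitrary.

```r
# Hypothetical stand-in for hw: 10 rows so the split is easy to inspect
hw.sim <- data.frame(height = 161:170, gender = rep(c(0, 1), 5))

set.seed(42)                          # arbitrary seed, fixes the shuffle
nr <- nrow(hw.sim)
n.train <- as.integer(nr * 0.7)       # 70% of the rows, rounded down
idx <- sample.int(nr)                 # random permutation of the row numbers

sim.train <- hw.sim[idx[1:n.train], ]
sim.test  <- hw.sim[idx[(n.train + 1):nr], ]

# The two sets share no rows and together cover the whole data frame
n.shared <- length(intersect(rownames(sim.train), rownames(sim.test)))
n.shared                                   # 0
nrow(sim.train) + nrow(sim.test) == nr     # TRUE
```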

Now that the data is split into two separate sets, the mean height of each gender can be computed from the training set.

> tapply(hw.train$height, hw.train$gender, mean)
       0        1
164.7908 177.7703

Which means that, from the decision rule derived above, anything larger than the average of 164.79 and 177.77, which is 171.28, will be classified as male, and anything under it will be classified as female.
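
The cutoff can also be computed straight from the group means instead of being typed in by hand. This is a minimal sketch using the two training means reported above; the vector `grp.means` is a hand-built stand-in for the `tapply` output, and the example heights are hypothetical.

```r
# Training-set means as reported above, named by gender code (0 = female, 1 = male)
grp.means <- c("0" = 164.7908, "1" = 177.7703)

# Decision threshold: midpoint of the two group means
cutoff <- mean(grp.means)
cutoff

# Classify a few example heights (hypothetical values): 1 = male, 0 = female
heights <- c(160, 171, 172, 185)
pred <- ifelse(heights >= cutoff, 1, 0)
pred   # 0 0 1 1
```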

Now to set up the classification.

> hw.train.mean <- mean(c(164.79,177.77))   # cutoff: midpoint of the two training means
> hw.test$classify <- ifelse(hw.test$height > hw.train.mean, 1, 0)   # 1 = male, 0 = female
> hw.test$classify <- as.factor(hw.test$classify)
> tab <- table(hw.test$gender, hw.test$classify)
> tab
     0  1
  0 62 12
  1 17 61

The table shows the actual gender (rows) against the predicted gender (columns). So there are 74 females in our test set, and we correctly predicted 62 of them.
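
The same counts can be turned into rates. The sketch below rebuilds the table by hand from the numbers printed above (rows are actual gender, columns are predicted gender) and computes the per-gender and overall hit rates.

```r
# Confusion table from the output above: rows = actual, columns = predicted
tab <- matrix(c(62, 17, 12, 61), nrow = 2,
              dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

# Share of each actual gender that was classified correctly
per.class <- diag(tab) / rowSums(tab)
round(per.class, 3)   # female: 62/74, male: 61/78

# Overall accuracy: correct predictions over all test cases
accuracy <- sum(diag(tab)) / sum(tab)
round(accuracy, 3)    # 123/152
```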

This is pretty good: of 152 test cases, the decision rule correctly predicted 123, or ~81%. It could potentially be made better by allowing a different standard deviation for each gender.
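
As a rough sketch of that last idea: with a separate standard deviation per gender the rule is no longer a single midpoint, so one option is to compare the two fitted normal densities directly. The means below are the training means from above; the standard deviations are invented for illustration, since none are reported here, and both genders are assumed equally likely beforehand.

```r
# Training means from above; the standard deviations are hypothetical placeholders
mu0 <- 164.79; sd0 <- 6.5   # female
mu1 <- 177.77; sd1 <- 7.5   # male

# Classify as male (1) when the male density is at least the female density
classify <- function(x, mu0, sd0, mu1, sd1) {
  ifelse(dnorm(x, mu1, sd1) >= dnorm(x, mu0, sd0), 1, 0)
}

pred <- classify(c(160, 170, 175, 190), mu0, sd0, mu1, sd1)
pred   # 0 0 1 1

# With equal standard deviations the comparison reduces to the midpoint rule
x <- seq(150, 195, by = 0.5)
same <- all(classify(x, mu0, sd0, mu1, sd0) == ifelse(x >= (mu0 + mu1) / 2, 1, 0))
same   # TRUE
```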

And to wrap it up, a nice graph showing the rule overlaid with polynomial density.

Decision Rule
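
A figure along these lines can be drawn with base R graphics. The sketch below is not the code behind the figure above: it overlays two normal curves rather than densities estimated from the data, and the common standard deviation of 7 is an assumption, since none is reported here.

```r
mu0 <- 164.79; mu1 <- 177.77   # training means from above
s <- 7                         # hypothetical common standard deviation
cutoff <- (mu0 + mu1) / 2      # decision rule: midpoint of the two means

# Female curve first, then the male curve on the same axes
curve(dnorm(x, mu0, s), from = 140, to = 205,
      xlab = "height", ylab = "density", col = "red")
curve(dnorm(x, mu1, s), add = TRUE, col = "blue")

# Vertical line at the cutoff: right of it is classified male, left female
abline(v = cutoff, lty = 2)
legend("topright", legend = c("female (0)", "male (1)", "cutoff"),
       col = c("red", "blue", "black"), lty = c(1, 1, 2))
```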

Notes:

  1. There are much better ways to check for normality, but this’ll do here.
  2. Remember that when you multiply both sides of an inequality by a negative number you switch the inequality.

