Improve Your Training Set with Unsupervised Learning

This article was first published on R – Daniel Oehm | Gradient Descending, and kindly contributed to R-bloggers.

In my previous post, Advanced Survey Design and Application to Big Data, I mentioned that unsupervised learning can be used to generate a stratification variable. In this post I want to elaborate on that point and show how the two can work together to improve estimates and training data for predictive models.

SRS and stratified samples

Consider the estimators of the total from an SRS and stratified sample.

    \[ \begin{array}{l l} \hat{T}_{\text{srs}} & =  \sum^n_{i = 1} {\frac{N}{n}} y_{i} \\ \hat{T}_{\text{str}} & =  \sum^H_{h = 1} \sum^{n_h}_{i \in S_h} {\frac{N_h}{n_h}} y_{ih} \end{array} \]

The variances of these estimators are given by

    \[ \begin{array}{l l} \text{Var} \left( \hat{T}_{\text{srs}} \right) & = \left( 1 - \frac{n}{N} \right) N^2 \frac{s^2}{n} \\ \text{Var} \left( \hat{T}_{\text{str}} \right) & = \sum^H_{h = 1}{ \left( 1 - \frac{n_h}{N_h} \right)} N^2_h \frac{s^2_h}{n_h} \end{array} \]
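As a quick illustrative sketch (the two-stratum population here is simulated, not from any real survey), these variance formulas can be computed directly in R:

```r
set.seed(42)

# Hypothetical population: two strata with well-separated means
N_h <- c(600, 400)
N   <- sum(N_h)
y   <- c(rnorm(N_h[1], mean = 10, sd = 2),
         rnorm(N_h[2], mean = 30, sd = 2))
stratum <- rep(1:2, times = N_h)

# Proportionally allocated sample of n = 100
n   <- 100
n_h <- round(n * N_h / N)

# Variance of the SRS estimator of the total
var_srs <- (1 - n / N) * N^2 * var(y) / n

# Variance of the stratified estimator of the total
s2_h    <- tapply(y, stratum, var)
var_str <- sum((1 - n_h / N_h) * N_h^2 * s2_h / n_h)

c(var_srs = var_srs, var_str = var_str)
```

Because the strata means are far apart relative to the within-strata spread, the stratified variance comes out far smaller than the SRS variance.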

These variances can be related through two components of the total sum of squares: the between- and within-strata sums of squares.

    \[ \begin{array}{r l} SSB & = \sum^H_{h = 1} \sum^{N_h}_{i \in S_h}{\left( \bar{y}_h - \bar{y} \right)^2} = \sum^H_{h = 1}{N_h \left( \bar{y}_h - \bar{y} \right)^2} \\ SSW & = \sum^H_{h = 1} \sum^{N_h}_{i \in S_h}{\left( \bar{y}_{ih} - \bar{y}_h \right)^2} = \sum^H_{h = 1} \left( N_h - 1 \right) s^2_h \\ SSTO & = SSB + SSW = \left( N - 1 \right) s^2 \end{array} \]

With some algebra (assuming proportional allocation, so that n_h / N_h = n / N in every stratum) it can be shown that

    \[ \begin{array}{r l} \text{Var}\left( \hat{T}_{\text{srs}} \right) & = \left( 1 - \frac{n}{N} \right) N^2 \frac{s^2}{n} \\ & = \left( 1 - \frac{n}{N} \right) \frac{N^2}{n} \frac{SSTO}{N - 1} \\ & = \left( 1 - \frac{n}{N} \right) \frac{N^2}{n \left(N - 1 \right)} \left( SSB + SSW \right) \\ & = \text{Var} \left( \hat{T}_{\text{str}} \right) + \left( 1 - \frac{n}{N} \right) \frac{N}{n \left(N - 1 \right)} \left[ N \left( SSB \right) - \sum^H_{h = 1}{\left( N - N_h \right)s^2_h} \right] \\ & = \text{Var} \left( \hat{T}_{\text{str}} \right) + \left( 1 - \frac{n}{N} \right) \frac{N^2}{n \left(N - 1 \right)} \left[ SSB - \sum^H_{h = 1}{\left( 1 - \frac{N_h}{N} \right)s^2_h} \right] \\ \end{array} \]

This result shows that \text{Var} \left( \hat{T}_{\text{str}} \right) < \text{Var} \left( \hat{T}_{\text{srs}} \right) whenever SSB > \sum^H_{h = 1}{\left( 1 - \frac{N_h}{N} \right) s^2_h}, and as SSB increases the stratified estimator increasingly improves on the SRS estimator.
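The decomposition and the identity above can be verified numerically in R. The population below is simulated (a hypothetical example, not from the original analysis), and the allocation is kept exactly proportional so the algebra applies:

```r
set.seed(1)

# Hypothetical two-stratum population
N_h <- c(600, 400)
N   <- sum(N_h)
y   <- c(rnorm(N_h[1], 10, 2), rnorm(N_h[2], 30, 2))
stratum <- rep(1:2, times = N_h)

# Exactly proportional allocation: n_h / N_h = n / N
n   <- 100
n_h <- n * N_h / N

ybar_h <- tapply(y, stratum, mean)
s2_h   <- tapply(y, stratum, var)

SSB <- sum(N_h * (ybar_h - mean(y))^2)
SSW <- sum((N_h - 1) * s2_h)

# SSTO = SSB + SSW = (N - 1) s^2
all.equal(SSB + SSW, (N - 1) * var(y))

var_srs <- (1 - n / N) * N^2 * var(y) / n
var_str <- sum((1 - n_h / N_h) * N_h^2 * s2_h / n_h)

# Var(T_srs) = Var(T_str) + correction term from the derivation
correction <- (1 - n / N) * N^2 / (n * (N - 1)) *
  (SSB - sum((1 - N_h / N) * s2_h))
all.equal(var_srs, var_str + correction)
```

Both `all.equal` checks return TRUE, confirming the sum-of-squares decomposition and the variance identity for this population.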

Unsupervised Learning

Unsupervised learning attempts to uncover hidden structure in the observed data by sorting the observations into a chosen number of clusters. The simplest algorithm for this is k-means, which proceeds as follows:

  1. Choose K (number of clusters)
  2. Choose K random points and assign as centers c_k
  3. Compute the distance between each point and each center
  4. Assign each observation to the center it is closest to
  5. Compute the new centers given the cluster allocation c_k = \frac{1}{n_k} \sum^{n_k}_{i \in S_k}{x_i} where S_k contains the points allocated to cluster k
  6. Compute the between and within sum of squares
  7. Repeat steps 3-6 until the cluster assignments no longer change, a specified tolerance is met, or the maximum number of iterations is reached
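The steps above can be sketched in a few lines of R. This is a minimal illustration only (it skips step 6 and omits empty-cluster handling); in practice the built-in kmeans() does all of this and more:

```r
simple_kmeans <- function(x, K, max_iter = 100, tol = 1e-8) {
  # Step 2: pick K random observations as initial centers
  centers <- x[sample(nrow(x), K), , drop = FALSE]
  cluster <- rep(0L, nrow(x))
  for (iter in seq_len(max_iter)) {
    # Steps 3-4: squared Euclidean distance to each center; assign nearest
    d <- sapply(seq_len(K), function(k)
      rowSums(sweep(x, 2, centers[k, ])^2))
    new_cluster <- max.col(-d)
    # Step 5: recompute each center as the mean of its cluster
    # (note: a cluster left empty would produce NaN centers here)
    new_centers <- t(sapply(seq_len(K), function(k)
      colMeans(x[new_cluster == k, , drop = FALSE])))
    # Step 7: stop when assignments stabilise or centers barely move
    if (all(new_cluster == cluster) ||
        sum((new_centers - centers)^2) < tol) {
      cluster <- new_cluster
      centers <- new_centers
      break
    }
    cluster <- new_cluster
    centers <- new_centers
  }
  list(cluster = cluster, centers = centers)
}

fit <- simple_kmeans(as.matrix(iris[, 1:4]), K = 3)
table(fit$cluster)
```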

The algorithm will minimise the within sum of squares and maximise the between sum of squares.

    \[ \begin{array}{l l} SSW & = \sum^K_{k = 1} \sum^{n_k}_{i \in S_k} (x_i - c_k)^2 \\ SSB & = \sum^K_{k = 1} n_k(c_k - \bar{c})^2 \\ SSTO & = SSW + SSB \end{array} \]
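R's built-in kmeans() reports exactly these quantities, so the decomposition is easy to inspect on any dataset (iris is used here purely as an example):

```r
# Fit k-means and inspect the sums of squares it reports
x <- scale(iris[, 1:4])   # standardise so no variable dominates the distances
fit <- kmeans(x, centers = 3, nstart = 25)

fit$tot.withinss   # SSW: total within-cluster sum of squares
fit$betweenss      # SSB: between-cluster sum of squares
all.equal(fit$tot.withinss + fit$betweenss, fit$totss)  # SSTO = SSW + SSB
```

Setting nstart to a value greater than 1 reruns the algorithm from multiple random starts and keeps the best solution, which guards against a poor random initialisation.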

As we saw from the formula above, the estimator under a stratified sample outperforms an SRS when

    \[ SSB > \sum^H_{h = 1} \left( 1 - \frac{N_h}{N} \right) s^2_h \]

From here it’s easy to see that if we construct a stratification variable which aims to minimise SSW and maximise SSB, the estimator from the corresponding sample will also outperform one based on a less efficient variable. There may be practical reasons why this isn’t possible, and it may make more sense to use a natural stratification variable; however, there are many cases where using unsupervised learning to construct a stratification variable can improve the estimator, or the training set to be used for modelling. This isn’t limited to k-means: most clustering algorithms aim to do the same thing in different ways, and each has its benefits given the structure of the data. The idea also extends to more sophisticated sampling techniques and is not confined to simple stratified samples.
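Putting the pieces together, here is a closing sketch of the whole idea on simulated data (the population, the auxiliary variable, and the sample size are all hypothetical): cluster on an auxiliary variable correlated with the outcome, treat the clusters as strata, and compare the two variance formulas.

```r
set.seed(2024)

# Hypothetical population where an auxiliary variable x predicts y
N <- 1000
x <- runif(N, 0, 10)
y <- 5 + 3 * x + rnorm(N, sd = 2)

# Construct a stratification variable by clustering on x
K <- 4
stratum <- kmeans(matrix(x), centers = K, nstart = 25)$cluster

# Proportionally allocate n = 100 across the k-means strata
n   <- 100
N_h <- as.vector(table(stratum))
n_h <- pmax(2, round(n * N_h / N))   # at least 2 per stratum to estimate s2_h

s2_h    <- tapply(y, stratum, var)
var_srs <- (1 - n / N) * N^2 * var(y) / n
var_str <- sum((1 - n_h / N_h) * N_h^2 * s2_h / n_h)

c(var_srs = var_srs, var_str = var_str, ratio = var_str / var_srs)
```

Because the clusters capture most of the variation in y through x, the variance ratio comes out well below one, i.e. the cluster-based stratification delivers a substantially more precise estimator than an SRS of the same size.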

