Feature standardization considered harmful

[This article was first published on R – David's blog, and kindly contributed to R-bloggers.]

Many statistical learning algorithms perform better when the covariates are on similar scales. For example, it is common practice to standardize the features used by an artificial neural network so that the gradient of its objective function doesn’t depend on the physical units in which the features are described.

The same advice is frequently given for K-means clustering, but there’s a great counter-example given in The Elements of Statistical Learning that I try to reproduce here.

Consider two point clouds (50 points each, $n=100$ in total), randomly drawn around two centers located 3 units on either side of the origin along the first feature:

set.seed(495)
n <- 100
d <- 3
x <- matrix(rnorm(n * 2, sd = 1), ncol = 2)
x[1:(n/2), 1] <- x[1:(n/2), 1] - d
x[(n/2 + 1):n, 1] <- x[(n/2 + 1):n, 1] + d

The K-means algorithm has no problem separating these points into the two clouds:

km <- kmeans(x, centers = 2)
km$centers


##        [,1]         [,2]
## 1  2.922143  0.098422541
## 2 -2.991026 -0.003131757
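One way to check the clustering above, rather than reading it off the centers, is to cross-tabulate the cluster labels against the cloud each point was drawn from. The following is a minimal sketch; the `truth` vector and the `agreement` score are names introduced here for illustration and are not part of the original analysis.

```r
set.seed(495)
n <- 100
d <- 3
x <- matrix(rnorm(n * 2, sd = 1), ncol = 2)
x[1:(n/2), 1] <- x[1:(n/2), 1] - d
x[(n/2 + 1):n, 1] <- x[(n/2 + 1):n, 1] + d

# Hypothetical helper: the cloud each row was drawn from
# (first half shifted left, second half shifted right)
truth <- rep(c("left", "right"), each = n / 2)

km <- kmeans(x, centers = 2)

# Rows: true cloud; columns: K-means label (the label numbering is arbitrary)
confusion <- table(truth, km$cluster)
print(confusion)

# Share of points assigned consistently with their cloud, taking the
# better of the two possible matchings between labels and clouds
matched <- sum(diag(confusion))
agreement <- max(matched, n - matched) / n
print(agreement)
```

An `agreement` close to 1 means the clusters coincide with the clouds; a value near 0.5 means the partition is unrelated to them.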

Let’s now see what happens when we standardize each feature. Since their means are already close to zero, we merely divide each feature by its standard deviation:

x_scaled <- x
x_scaled[, 1] <- x_scaled[, 1] / sd(x_scaled[, 1])
x_scaled[, 2] <- x_scaled[, 2] / sd(x_scaled[, 2])
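The column-wise division above can also be written with base R's `scale()`, as sketched below. This assumes the same simulated data as before; `sds` and `x_scaled_alt` are names introduced here for illustration. Note that `scale(x, center = FALSE)` with the default `scale = TRUE` divides by the root mean square rather than the standard deviation, so the standard deviations are passed explicitly.

```r
set.seed(495)
n <- 100
d <- 3
x <- matrix(rnorm(n * 2, sd = 1), ncol = 2)
x[1:(n/2), 1] <- x[1:(n/2), 1] - d
x[(n/2 + 1):n, 1] <- x[(n/2 + 1):n, 1] + d

# Per-feature standard deviations
sds <- apply(x, 2, sd)
print(sds)

# Manual version, as in the text: divide each column by its standard deviation
x_scaled <- x
x_scaled[, 1] <- x_scaled[, 1] / sds[1]
x_scaled[, 2] <- x_scaled[, 2] / sds[2]

# Same operation with scale(): no centering, explicit per-column divisors
x_scaled_alt <- scale(x, center = FALSE, scale = sds)

# The two results agree up to the attributes that scale() attaches
print(all.equal(x_scaled, x_scaled_alt, check.attributes = FALSE))
```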

We then run the K-means algorithm again on the rescaled data:

km_scaled <- kmeans(x_scaled, centers = 2)

We see that K-means has completely failed to identify the clusters. ‘Standardizing’ has shrunk the first feature, whose large variance came precisely from the gap between the two clouds, and has thereby destroyed the clear separation between the clusters.
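The loss of separation described above can be put into numbers. The sketch below compares the distance between the two cloud means along the first feature before and after rescaling. It assumes the same simulated data; `truth`, `gap`, `gap_raw` and `gap_scaled` are hypothetical names introduced for this illustration. In theory the first feature has a standard deviation of about $\sqrt{1 + d^2} \approx 3.2$, because it mixes two unit-variance clouds centered at $\pm d$, so the gap should shrink by roughly that factor while the second feature is left almost unchanged.

```r
set.seed(495)
n <- 100
d <- 3
x <- matrix(rnorm(n * 2, sd = 1), ncol = 2)
x[1:(n/2), 1] <- x[1:(n/2), 1] - d
x[(n/2 + 1):n, 1] <- x[(n/2 + 1):n, 1] + d

# Hypothetical helper: which cloud each row was drawn from
truth <- rep(c(1, 2), each = n / 2)

# Divide each column by its standard deviation, as in the text
sds <- apply(x, 2, sd)
x_scaled <- sweep(x, 2, sds, "/")

# Distance between the two cloud means along the first feature
gap <- function(m) abs(mean(m[truth == 1, 1]) - mean(m[truth == 2, 1]))

gap_raw <- gap(x)
gap_scaled <- gap(x_scaled)

# Gap along feature 1 next to the overall spread of feature 2
print(c(gap = gap_raw, sd_feature2 = sd(x[, 2])))
print(c(gap = gap_scaled, sd_feature2 = sd(x_scaled[, 2])))
```

Before rescaling, the gap is several times the spread of the second feature; afterwards the two are of comparable size, which is why the split along the first feature no longer stands out.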

So what’s the lesson here? For K-means, you should not blindly standardize the features unless there are clear reasons to do so. In this toy example we don’t know what the features represent, so it’s impossible to say whether standardizing them was the right thing to do. Perhaps the clusters seen before standardization were mere artefacts of our choice of units! As a rule of thumb, I suggest that features expressed in the same units and representing the same ‘stuff’ (such as width and length) should not be standardized. If you have deeper insights into this, I’d love to hear your comments.
