
A guide to Data Transformation

[This article was first published on Stories by Tim M. Schendzielorz on Medium, and kindly contributed to R-bloggers.]

Learn when and how to transform your variables for better insights.


What is Data Transformation?

Data Transformation in a statistics context means the application of a mathematical expression to each point in the data. In contrast, in a Data Engineering context, transformation can also mean converting data from one format to another in the Extract, Transform, Load (ETL) process.

Why should I transform my data?

Important distinctions

It is important to know what we are talking about when we use the term transformation. Transformation, normalization and standardization are often used interchangeably, and wrongly so.

Standardization transforms the data to follow a standard normal distribution, i.e. a normal distribution with mean 0 and standard deviation 1 (left graph). Normalization and standardization can be seen as special cases of transformation. To demonstrate the difference between a standard normal distribution and a normal distribution with arbitrary mean and standard deviation, we simulate data and graph it:

The R code for the Plotly graphs is embedded in the original post. Interactive Plotly graphs are embedded via plot.ly hosting, but could also be embedded from Github Pages via an iframe; see this article for instructions.
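Those embeds are not reproduced here; below is a minimal sketch of how such a comparison could be simulated and plotted. The sample sizes, means and standard deviations are arbitrary choices for illustration, and ggplot2 is used for the static version (plotly::ggplotly() would make it interactive).

```r
library(ggplot2)

set.seed(42)

# Simulate a standard normal distribution (mean 0, sd 1) and an
# arbitrary normal distribution (mean 10, sd 3) for comparison
sim <- data.frame(
  value = c(rnorm(10000, mean = 0, sd = 1),
            rnorm(10000, mean = 10, sd = 3)),
  distribution = rep(c("standard normal", "normal(10, 3)"), each = 10000)
)

# Overlaid density plots of the two simulated distributions
ggplot(sim, aes(x = value, fill = distribution)) +
  geom_density(alpha = 0.5) +
  labs(title = "Standard normal vs. normal distribution")
```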

How to transform data?

To get insights, data is most often transformed to follow a normal distribution more closely, either to meet statistical assumptions or to detect linear relationships with other variables. One of the first steps for those techniques is to check how closely the variables already follow a normal distribution.

How to check if your data follows a normal distribution?

It is common to inspect your data visually and/or check the assumption of normality with a statistical test.

Variable distribution histogram and corresponding QQ-plot with reference line of a perfect normal distribution. From UCD

To visually explore the distribution of your data, we will look at the density plot as well as a simple QQ-plot. The QQ-plot is an excellent tool for inspecting various properties of your data distribution and assessing if and how you need to transform your data. Here the quantiles of a perfect normal distribution are plotted against the quantiles of your data. A quantile marks the data point below which a certain percentage of the data falls. For example, the data point of the 0.2 quantile is the point where 20% of the data is below and 80% is above. A reference line indicates how the plot would look if your variable followed a perfect normal distribution. The closer the points in the QQ-plot are to this line, the more likely it is that your data follows a normal distribution and does not need additional transformation.
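A minimal sketch of this visual check with base R graphics, where the variable x stands in for your own data:

```r
set.seed(1)
x <- rnorm(500)  # illustrative sample; replace with your variable

# Density plot to inspect the overall shape of the distribution
plot(density(x), main = "Density of x")

# QQ-plot: sample quantiles against theoretical normal quantiles,
# with a reference line for a perfect normal distribution
qqnorm(x)
qqline(x, col = "red")
```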

For a statistical analysis of the normality of your data, commonly used tests are the Shapiro-Wilk test and the Kolmogorov-Smirnov test. The Shapiro-Wilk test generally has higher detection power; the non-parametric Kolmogorov-Smirnov test should be used with a high number of observations. Loosely speaking, these tests assess how compatible your data is with a normal distribution (technically, they compute a p-value under the null hypothesis H0 that the data is normally distributed). These tests, however, suffer from the well-known problems of frequentist null hypothesis testing, which are beyond the scope of this article, e.g. being overly sensitive with a huge number of observations. The KS test is also more sensitive to points in the middle of the distribution than to the more important tails. Additionally, these tests cannot tell you how problematic non-normality would be for getting insights from your data. Because of this, I would advise an exploratory, visual approach to checking your data distribution and forgoing statistical tests unless you need them for an automated script.
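If you do need a formal test, for example in an automated script, both tests are available in base R's stats package. This is a sketch on the illustrative sample from above; note that shapiro.test() is limited to 5000 observations and that estimating the mean and standard deviation from the data makes ks.test() conservative:

```r
set.seed(1)
x <- rnorm(500)

# Shapiro-Wilk test of normality (3 to 5000 observations)
shapiro.test(x)

# Kolmogorov-Smirnov test against a normal distribution with the
# sample's own mean and standard deviation
ks.test(x, "pnorm", mean = mean(x), sd = sd(x))
```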

Data distributions and their corresponding QQ-plots

The following diagrams show simulated data with the density distribution and the corresponding QQ-plot. Four strong and typical deviations from a normal distribution are shown. Only for the normally distributed data is an additional statistical test for normality shown in the code snippet, for completeness.
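The simulated data and plotting code are embedded in the original post; the following is a sketch of how four such deviations could be generated and inspected. The specific distributions (log-normal, mirrored log-normal, t with few degrees of freedom, uniform) are assumptions for illustration, not necessarily those used in the original.

```r
set.seed(123)
n <- 1000

deviations <- list(
  "right skewed" = rlnorm(n),      # log-normal: long right tail
  "left skewed"  = -rlnorm(n),     # mirrored log-normal: long left tail
  "heavy tailed" = rt(n, df = 3),  # t distribution with few degrees of freedom
  "light tailed" = runif(n)        # uniform: thinner tails than the normal
)

# Density plot and QQ-plot for each simulated deviation
op <- par(mfrow = c(4, 2))
for (name in names(deviations)) {
  x <- deviations[[name]]
  plot(density(x), main = paste("Density:", name))
  qqnorm(x, main = paste("QQ-plot:", name))
  qqline(x, col = "red")
}
par(op)
```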

For playing around with distributions and their corresponding QQ-plots, I can recommend this nice little R Shiny app from Cross Validated user Zhanxiong.

Which transformation to pick?

If you decide that your data should follow a normal distribution and needs transformation, there are simple and widely used power transformations we will have a look at. They transform your data to follow a normal distribution more closely. It is, however, important to note that when transforming data you lose information about the data generation process, and you lose interpretability of the values, too. You might consider back-transforming the variable at a certain step in your analysis. Generally speaking, the transformation that matches the data generation process is suited best: the logarithm should be used if the data-generating effects were multiplicative and the data spans orders of magnitude; roots should be used if the data generation involved squared effects.

Simple Transformations

For the transformation, apply one of the following expressions to every data point. The expressions are sorted from the weakest to the strongest effect. If your transformation of choice is too strong, you will end up with data skewed in the other direction. A code sketch follows the list below.

Right (positive) skewed data:

Left (negative) skewed data:

Light & heavy tailed data:
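The original lists of expressions under these headings are embedded in the original post and not reproduced here. As a sketch, a commonly used ladder (weakest to strongest) is square root, logarithm and reciprocal for right-skewed data, and squares or cubes for left-skewed data; the example data below is simulated and the treatment of tailed data is omitted.

```r
set.seed(7)

# Right-skewed, strictly positive example data
x <- rlnorm(1000)
x_sqrt <- sqrt(x)   # weakest
x_log  <- log(x)    # requires x > 0; use log(x + 1) if zeros are present
x_recp <- 1 / x     # strongest

# Left-skewed example data on (0, 1)
y <- rbeta(1000, 5, 1)
y_sq <- y^2   # weaker
y_cu <- y^3   # stronger

# Quick visual check of the effect of one transformation
op <- par(mfrow = c(1, 2))
qqnorm(x, main = "Original"); qqline(x, col = "red")
qqnorm(x_log, main = "Log-transformed"); qqline(x_log, col = "red")
par(op)
```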

Automatic Transformations

There are various implementations of automatic transformations in R that choose the optimal transformation expression for you. They determine a lambda value, which is the power coefficient used to transform your data as closely as possible to a normal distribution.

Tukey’s Ladder of Powers lambda values and corresponding power transforms. Lambda values can be decimal. Source
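The implementation used in the original post is not reproduced here. One option, as a sketch, is MASS::boxcox(), which profiles the log-likelihood of the Box-Cox power transformation over a grid of lambda values; packages such as rcompanion (transformTukey) or bestNormalize offer similar automatic choices.

```r
library(MASS)

set.seed(11)
x <- rlnorm(500)  # strictly positive, right-skewed example data

# Profile the Box-Cox log-likelihood over a grid of lambda values
bc <- boxcox(lm(x ~ 1), lambda = seq(-2, 2, by = 0.1))

# Lambda with the highest log-likelihood
lambda <- bc$x[which.max(bc$y)]
lambda

# Apply the corresponding power transform (log when lambda is 0)
x_trans <- if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
```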

Concluding remarks

This guide provides an overview of an important data preprocessing technique, data transformation. It demonstrates why you might want to transform your data during analysis. It explains how you can detect whether your data needs transformation to meet the most common distributional requirement, normality, and how to transform it accordingly. It shows which mathematical expression to use for stereotypical cases of non-normality and how to automate this. There are a few advanced cases of transformation, e.g. for multimodal distributions, which are not covered here.

A word of caution must be given, however: there are no definitive rules for when and how to transform your data. It depends on how the data was generated (and how much you know about this), what insights you want to generate from it, how important interpretability is, and how much the data distribution deviates from your desired distribution, which in the majority of cases will be a normal distribution. Hence, some closing advice for data transformation:

This article was also published on http://www.r-bloggers.com/.



