Learn when and how to transform your variables for better insights.
What is Data Transformation?
Data Transformation in a statistics context means the application of a mathematical expression to each point in the data. In contrast, in a Data Engineering context Transformation can also mean transforming data from one format to another in the Extract Transform Load (ETL) process.
Why should I transform my data?
- Improve interpretability. Some variables are not in the format we need for a certain question, e.g. car manufacturers supply miles/gallon values for fuel consumption; however, for comparing car models we are more interested in the reciprocal gallons/mile.
- De-clutter graphs. If you visualize two or more variables that are not evenly distributed across their range, many data points end up crowded together. For a better visualization it might be a good idea to transform the data so it is more evenly distributed across the graph. Another approach could be to use a different scale on your graph axes.
- To get insight into the relationship between variables. The relationship between variables is often not linear but of a different type. A common example is taking the log of income to compare it to another variable, as the utility of more income diminishes with higher income. (See this excellent discussion about the highly utilized log-transform on Cross Validated.) Another example is the exponential growth of money in a bank account with compound interest over time. A simple (Pearson) correlation coefficient only measures the linear relationship between variables. To meet this criterion, you might be able to transform one or both variables.
- To meet assumptions for statistical inference. When constructing simple confidence intervals, the assumption is that the data is normally distributed and not skewed left or right. For linear regression analysis an important assumption is homoscedasticity, meaning that the error variance of your outcome variable does not depend on your predictor variables. An assumption for many statistical tests such as the t-test is that the errors of a model (the values of a measurement sampled from a population) are normally distributed.
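The point about linear relationships can be illustrated with the bank account example. The following is a minimal sketch in plain Python (standard library only); the starting balance, interest rate and time span are made-up values, and `pearson` is a helper written for this illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical account: 1000 units of money at 5 % yearly interest for 50 years.
years = list(range(51))
balance = [1000 * 1.05 ** t for t in years]

r_raw = pearson(years, balance)                         # curved relationship
r_log = pearson(years, [math.log(b) for b in balance])  # linear after the log

print(f"raw: {r_raw:.3f}, log-transformed: {r_log:.3f}")
```

The raw correlation understates a relationship that is in fact perfectly deterministic, while the log-transformed balance is an exact linear function of time.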
It is important to know what we are talking about when we use the term transformation. Transformation, normalization and standardization are often used interchangeably and wrongly so.
- Normalization is the process of scaling with respect to the entire data range so that the data has a range from 0 to 1.
- Standardization is the process of shifting and scaling with respect to the entire data so that the data has a mean of 0 and a standard deviation of 1. It does not change the shape of the distribution: only data that was normally distributed to begin with follows a Standard Normal Distribution afterwards.
- Transformation is the application of the same calculation to every point of the data separately.
Standardizing normally distributed data yields a Standard Normal Distribution (left graph). Normalization and Standardization can be seen as special cases of Transformation. To demonstrate the difference between standardized and normalized data, we simulate data and graph it.
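The simulation can be sketched without the plotting step in plain Python (standard library only). The mean of 50, the standard deviation of 10, the sample size and the seed are arbitrary assumptions, and `normalize` and `standardize` are helper names chosen for this illustration.

```python
import random
import statistics

def normalize(values):
    """Min-max scaling: map the data onto the range 0 to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Z-score scaling: mean 0 and (sample) standard deviation 1."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

random.seed(42)  # arbitrary seed, only for reproducibility
simulated = [random.gauss(50, 10) for _ in range(1000)]

normalized = normalize(simulated)
standardized = standardize(simulated)

print("normalized range:", min(normalized), max(normalized))
print("standardized mean and sd:",
      round(statistics.fmean(standardized), 6),
      round(statistics.stdev(standardized), 6))
```

Both results keep the shape of the simulated data; only the location and the scale of the values differ.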
How to transform data?
To get insights, data is most often transformed to follow a normal distribution more closely, either to meet statistical assumptions or to detect linear relationships with other variables. One of the first steps for those techniques is to check how closely the variables already follow a normal distribution.
How to check if your data follows a normal distribution?
It is common to inspect your data visually and/or check the assumption of normality with a statistical test.
To visually explore the distribution of your data, we will look at the density plot as well as a simple QQ-plot. The QQ-plot is an excellent tool for inspecting various properties of your data distribution and assessing if and how you need to transform your data. Here the quantiles of a perfect normal distribution are plotted against the quantiles of your data. A quantile is the data point below which a certain percentage of the data lies. For example, the 0.2 quantile is the point where 20% of the data is below and 80% is above. A reference line is drawn which indicates how the plot would look if your variable followed a perfect normal distribution. The closer the points in the QQ-plot are to this line, the more likely it is that your data follows a normal distribution and does not need additional transformation.
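How the points of a QQ-plot are obtained can be sketched in plain Python (standard library only). The plotting positions (i - 0.5) / n are one common convention and an assumption of this sketch, as are the helper name `qq_points` and the made-up sample. The reference distribution is fitted to the sample's mean and standard deviation, so the reference line is simply y = x.

```python
from statistics import NormalDist, fmean, stdev

def qq_points(values):
    """Return (theoretical, sample) quantile pairs for a normal QQ-plot."""
    ordered = sorted(values)
    n = len(ordered)
    reference = NormalDist(fmean(ordered), stdev(ordered))
    # Plotting positions (i - 0.5) / n avoid the undefined 0 % and 100 % quantiles.
    theoretical = [reference.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))

# Made-up right-skewed sample.
skewed = [1, 1, 2, 2, 2, 3, 3, 4, 5, 7, 9, 14, 22, 40]

for theo, obs in qq_points(skewed):
    print(f"{theo:8.2f} {obs:8.2f}")
```

For this sample both the smallest and the largest observations lie above the reference line, which is the curved pattern typical of right-skewed data.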
For a statistical analysis of the normality of your data, commonly used tests are the Shapiro-Wilk test and the Kolmogorov-Smirnov test. The SW test generally has a higher detection power; the non-parametric KS test should be used with a high number of observations. Generally speaking, those tests calculate how compatible your data is with a normal distribution (technically, the p-value is the probability of observing data that deviates at least this much from normality if H0, the hypothesis that the data is normally distributed, is true). These tests, however, have the well-known problems of frequentist null hypothesis testing, a full discussion of which is beyond the scope of this article, e.g. the problem of being too sensitive with a huge number of observations. The KS test is generally too sensitive to points in the middle of the data distribution in comparison to the more important tails. Additionally, those tests cannot tell you how problematic a non-normality would be for getting insights from your data. Because of this, I would advise using an exploratory, visual approach to check your data distribution and forgoing any statistical testing if you do not need it for an automated script.
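To make concrete what the KS test measures, the following plain-Python sketch (standard library only) computes only the test statistic: the largest gap between the empirical distribution of the data and a normal distribution fitted to it. It deliberately computes no p-value, because with a mean and standard deviation estimated from the same data the standard KS tables do not apply directly. The helper name `ks_statistic` and both samples are assumptions of this illustration.

```python
import math
from statistics import NormalDist, fmean, stdev

def ks_statistic(values):
    """Largest gap between the empirical CDF and a normal CDF fitted to the data."""
    ordered = sorted(values)
    n = len(ordered)
    fitted = NormalDist(fmean(ordered), stdev(ordered))
    d = 0.0
    for i, v in enumerate(ordered, start=1):
        cdf = fitted.cdf(v)
        # The empirical CDF jumps from (i - 1) / n to i / n at each data point.
        d = max(d, i / n - cdf, cdf - (i - 1) / n)
    return d

# Idealized sample: evenly spaced quantiles of a standard normal distribution.
near_normal = [NormalDist().inv_cdf((i - 0.5) / 200) for i in range(1, 201)]
# Exponentiating it gives a strongly right-skewed sample.
skewed = [math.exp(z) for z in near_normal]

print(f"near normal: {ks_statistic(near_normal):.4f}")
print(f"skewed:      {ks_statistic(skewed):.4f}")
```

A larger statistic means a larger deviation from normality somewhere along the distribution, but it says nothing about whether that deviation matters for the analysis at hand.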
Data distributions and their corresponding QQ-plots
The following diagrams show simulated data with the density distribution and the corresponding QQ-plot. Four strong and typical deviations from a normal distribution are shown. Only for the normally distributed data an additional statistical test for normality is shown in the code snippet for completeness.
Which transformation to pick?
If you decide that your data should follow a normal distribution and needs transformation, there are simple and highly utilized power transformations we will have a look at. They transform your data to follow a normal distribution more closely. It is, however, important to note that when transforming data you will lose information about the data generation process and you will lose interpretability of the values, too. You might consider back-transforming the variable at a certain step in your analysis. Generally speaking, the transformation expression that matches the data generation process is suited best. The logarithm should be used if data generation effects were multiplicative and the data spans several orders of magnitude. Roots should be used if the data generation involved squared effects.
For transformation, apply one of the following expressions to every data point. The expressions are sorted from weakest effect to strongest. If your transformation of choice is too strong, you will end up with data skewed in the other direction.
Right (positive) skewed data:
- Root ⁿ√x. Weakest transformation, stronger with higher order root. For negative numbers, special care needs to be taken with the sign, e.g. by taking the root of the absolute value and restoring the sign afterwards.
- Logarithm log(x). Commonly used transformation. Changing the base of the logarithm only rescales the transformed values and does not change the shape of their distribution. It cannot be used on negative numbers or 0; here you need to shift the entire data by adding at least |min(x)|+1.
- Reciprocal 1/x. Strongest transformation; the transformation is stronger with higher exponents, e.g. 1/x³. Note that the reciprocal reverses the order of positive values. This transformation should not be done with negative numbers and numbers close to zero, hence the data should be shifted similarly to the log transform.
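The three transformations for right skew can be compared with a plain-Python sketch (standard library only). The sample is made up, the moment-based `skewness` helper is an assumption of this sketch used to quantify the effect, and the reciprocal is negated here so that the order of the data is preserved, which is a choice made for this illustration.

```python
import math
from statistics import fmean

def skewness(values):
    """Moment-based skewness: positive for right skew, negative for left skew."""
    n = len(values)
    mean = fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def signed_root(x, order=2):
    """n-th root of the absolute value with the original sign restored."""
    return math.copysign(abs(x) ** (1 / order), x)

def shift_to_positive(values):
    """Shift the data so that its minimum is 1 when it contains zero or negative values."""
    lowest = min(values)
    offset = 1 - lowest if lowest <= 0 else 0
    return [v + offset for v in values]

# Made-up right-skewed sample that contains a zero.
right_skewed = [0, 1, 1, 2, 2, 3, 4, 6, 9, 15, 30, 80]

rooted = [signed_root(v) for v in right_skewed]
logged = [math.log(v) for v in shift_to_positive(right_skewed)]
# Negated so that small values stay small and large values stay large.
neg_reciprocal = [-1 / v for v in shift_to_positive(right_skewed)]

for name, data in [("raw", right_skewed), ("root", rooted),
                   ("log", logged), ("reciprocal", neg_reciprocal)]:
    print(f"{name:>10}: skewness {skewness(data):6.2f}")
```

On this sample the skewness shrinks from root to log, and the reciprocal overshoots into left skew, which illustrates the ordering from weakest to strongest and the risk of picking a transformation that is too strong.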
Left (negative) skewed data
- Reflect Data and use the appropriate transformation for right skew. Reflect every data point by subtracting it from the maximum value. Add 1 to every data point to avoid having one or multiple 0 in your data.
- Square x². Stronger with higher power. Cannot be used with negative values, because squaring does not preserve their order.
- Exponential eˣ. Strongest transformation and can be used with negative values. Stronger with higher base.
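Reflecting left-skewed data and then applying a right-skew transformation can be sketched in plain Python (standard library only). The scores are made up, and `reflect` and `skewness` are helper names chosen for this illustration.

```python
import math
from statistics import fmean

def skewness(values):
    """Moment-based skewness: positive for right skew, negative for left skew."""
    n = len(values)
    mean = fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def reflect(values):
    """Subtract every point from the maximum and add 1 so that no value is 0."""
    top = max(values)
    return [top - v + 1 for v in values]

# Made-up left-skewed sample, e.g. test scores with a ceiling at 100.
left_skewed = [20, 70, 85, 91, 94, 96, 97, 98, 98, 99, 99, 100]

reflected = reflect(left_skewed)           # now right-skewed, minimum is 1
logged = [math.log(v) for v in reflected]  # right-skew transformation

print(f"raw:       {skewness(left_skewed):6.2f}")
print(f"reflected: {skewness(reflected):6.2f}")
print(f"logged:    {skewness(logged):6.2f}")
```

Keep in mind that reflection reverses the order of the data: large transformed values now correspond to small original values, which matters when interpreting results.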
Light & heavy tailed data
- Subtract the median from the data points and transform. Deviations of the tails from normality are usually less critical than skewness and might not need transformation after all. Subtracting the median sets the median of your data to 0. After that, apply an appropriate transformation for skewed data to the absolute deviations from 0 on either side, keeping their sign. For heavy-tailed data use transformations for right skew to pull the data in toward the median, and for light-tailed data use transformations for left skew to push the data away from the median.
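The procedure for heavy tails can be sketched in plain Python (standard library only). The sample is made up, the square root is just one possible right-skew transformation, and excess kurtosis is used here as a rough measure of tail weight, which is an assumption of this sketch rather than something the procedure prescribes.

```python
import math
from statistics import fmean, median

def excess_kurtosis(values):
    """Moment-based excess kurtosis: about 0 for normal data, positive for heavy tails."""
    n = len(values)
    mean = fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / m2 ** 2 - 3

def pull_in_tails(values, order=2):
    """Center on the median, then take a signed root of the absolute deviations."""
    center = median(values)
    return [math.copysign(abs(v - center) ** (1 / order), v - center)
            for v in values]

# Made-up heavy-tailed sample: most points near 0, a few far out on both sides.
heavy_tailed = [-40, -9, -3, -2, -1, -1, 0, 0, 1, 1, 2, 3, 8, 35]

pulled = pull_in_tails(heavy_tailed)

print(f"before: {excess_kurtosis(heavy_tailed):5.2f}")
print(f"after:  {excess_kurtosis(pulled):5.2f}")
```

For light-tailed data the same structure applies with a left-skew transformation, e.g. squaring the absolute deviations instead of taking their root.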
There are various implementations of automatic transformations in R that choose the optimal transformation expression for you. They determine a lambda value which is the power coefficient used to transform your data closest to a normal distribution.
- Lambert W × Gaussian transform. The R package LambertW has an implementation for automatically transforming heavy or light tailed data with Gaussianize().
- Tukey’s Ladder of Powers. For skewed data, the implementation transformTukey() from the R package rcompanion uses Shapiro-Wilk tests iteratively to find the lambda value at which the data is closest to normality and transforms it. Left skewed data should be reflected to right skew and there should be no negative values.
- Box-Cox Transformation. The implementation BoxCox.lambda() from the R package forecast iteratively finds a lambda value which maximizes the log-likelihood of a linear model. It can, however, also be used on a single variable with the model formula x~1. The transformation with the resulting lambda value can be done via the forecast function BoxCox(). There is also an implementation in the R package MASS. Standard Box-Cox cannot be used with negative values; two-parameter Box-Cox, however, can.
- Yeo-Johnson Transformation. This can be seen as a useful extension to the Box-Cox. For non-negative values it is equivalent to a Box-Cox transformation of x + 1, and it handles negative values as well. There are various implementations in R via the packages car, VGAM and recipes in the meta machine-learning framework tidymodels.
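The idea of choosing lambda by maximizing a log-likelihood can be sketched for the one-parameter Box-Cox transformation in plain Python (standard library only). This is a simple grid search written for illustration, not the algorithm used by the R packages above; the search range from -2 to 2, the grid resolution and the function names are assumptions.

```python
import math
from statistics import NormalDist, fmean

def box_cox(values, lam):
    """One-parameter Box-Cox transform; all values must be positive."""
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1) / lam for v in values]

def box_cox_log_likelihood(values, lam):
    """Profile log-likelihood of lambda under a normal model for the transformed data."""
    n = len(values)
    transformed = box_cox(values, lam)
    mean = fmean(transformed)
    variance = sum((t - mean) ** 2 for t in transformed) / n
    return -n / 2 * math.log(variance) + (lam - 1) * sum(math.log(v) for v in values)

def best_lambda(values, lower=-2.0, upper=2.0, steps=81):
    """Grid search over lambda; range and resolution are arbitrary choices."""
    grid = [lower + i * (upper - lower) / (steps - 1) for i in range(steps)]
    return max(grid, key=lambda lam: box_cox_log_likelihood(values, lam))

# Idealized log-normal sample: exponentiated, evenly spaced normal quantiles.
quantiles = [NormalDist().inv_cdf((i - 0.5) / 200) for i in range(1, 201)]
log_normal_like = [math.exp(z) for z in quantiles]

lam = best_lambda(log_normal_like)
transformed = box_cox(log_normal_like, lam)
print("chosen lambda:", lam)
```

For data generated this way the search should settle on a lambda at or very close to 0, i.e. the log transformation. On real data it is still worth checking the result with a QQ-plot.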
This guide provides an overview of an important data preprocessing technique, data transformation. It demonstrates why you want to transform your data during analysis. It explains how you can detect if your data needs transformation to meet the most common distributional requirement, normality, and how to transform it accordingly. It shows which mathematical expression to use for transformation in stereotypical cases of non-normality and how to automate this. There are a few advanced cases for transformation, e.g. for multimodal distributions, which are not covered here.
A word of caution must be given, however. There are no definitive rules for when and how to transform your data. It depends on how the data was generated (and how much you know about this), what insights you want to generate from it, how important interpretability is and how much the data distribution deviates from your desired distribution, which will in the majority of cases be a normal distribution. Hence, some closing advice for data transformation:
- Decide if the insights you will get from transforming are worth the downsides. E.g. decide if being able to do statistical modelling, applying a geometric technique such as k-means clustering, being able to better compare ratios or just de-cluttering your graphs is worth losing direct interpretability.
- Decide if an alternative approach satisfies your analysis instead. For example, you can use non-parametric models or weighted least squares regression instead of standard linear regression if your data does not meet normality assumptions. Alternatively you could remove outliers; however, you should remember that you need a very good reason to delete measurements.
- Before and after transformation, check your distribution with a QQ-plot, even with an automatic transformation approach.
- Do not overwrite your original values with your transformed values in your data set.
This article was also published on http://www.r-bloggers.com/.