Transforming Your Data: A Guide to Popular Methods and How to Implement Them with {healthyR.ai}

[This article was first published on Steve's Data Tips and Tricks, and kindly contributed to R-bloggers].

Introduction

Transforming data refers to the process of changing the scale or distribution of a variable in order to make it more suitable for analysis. There are many different methods for transforming data, and each has its own specific use case.

  1. Box-Cox: This is a power transformation for strictly positive data that is positively skewed (i.e., has a long tail to the right). It estimates a power parameter, lambda, that brings the distribution closer to normal.
  2. Basis Spline: This expands a variable into a set of B-spline basis columns (piecewise polynomials). Those columns can then be used in a regression to model a non-linear relationship between a dependent variable and one or more independent variables.
  3. Log: This is a method for transforming positive data that is positively skewed (i.e., has a long tail to the right) into a more normal distribution. It uses the logarithm function to compress large values, and an offset can be added first if the data contains zeros.
  4. Logit: This is a method for mapping proportions or probabilities (i.e., values strictly between 0 and 1) onto the whole real line. It uses log(p / (1 - p)), the inverse of the logistic function, so values of exactly 0 or 1 need an offset first.
  5. Natural Spline: Like the basis spline, this expands a variable into piecewise cubic polynomial columns, but the splines are constrained to be linear beyond the boundary knots, which keeps the fit stable at the edges of the data.
  6. Rectified Linear Unit (ReLU): This is a type of activation function used in artificial neural networks, defined as max(0, x). As a data transformation it sets values below a hinge point to zero, which introduces a non-linearity.
  7. Square Root: This is a method for transforming non-negative data that is positively skewed (i.e., has a long tail to the right) into a more normal distribution. It uses the square root function, which is a milder correction than the log.
  8. Yeo-Johnson: This is a power transformation that works well for data that is positively or negatively skewed. It is a generalization of the Box-Cox transformation and handles zero and negative data.
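
To make the definitions above concrete, here is a minimal base R sketch of the closed-form transforms. The helper names (box_cox, yeo_johnson, logit, relu) are made up for illustration and are not functions from {healthyR.ai} or {recipes}, and lambda is passed in by hand rather than estimated from the data.

```r
# Illustrative helpers only; these names are hypothetical, not package functions.

# Box-Cox: defined for strictly positive x
box_cox <- function(x, lambda) {
  stopifnot(all(x > 0))
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

# Yeo-Johnson: extends Box-Cox to zero and negative values
yeo_johnson <- function(x, lambda) {
  out <- numeric(length(x))
  pos <- x >= 0
  out[pos] <- if (abs(lambda) < 1e-8) {
    log1p(x[pos])
  } else {
    ((x[pos] + 1)^lambda - 1) / lambda
  }
  out[!pos] <- if (abs(lambda - 2) < 1e-8) {
    -log1p(-x[!pos])
  } else {
    -((1 - x[!pos])^(2 - lambda) - 1) / (2 - lambda)
  }
  out
}

# Logit: maps a proportion in (0, 1) to the real line
logit <- function(p) log(p / (1 - p))

# ReLU: zero below the hinge, identity above it
relu <- function(x) pmax(0, x)

x <- c(1, 2, 4, 8)
box_cox(x, lambda = 0)        # same as log(x)
box_cox(x, lambda = 0.5)      # a shifted and scaled square root
yeo_johnson(c(-2, 0, 3), 1)   # lambda = 1 leaves the data unchanged
logit(c(0.1, 0.5, 0.9))
relu(c(-1.5, 0, 2))
```

Writing the formulas out this way shows why Box-Cox and log need positive input while Yeo-Johnson does not: the negative branch of Yeo-Johnson reflects the values before applying the power.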

The R package {healthyR.ai} provides a function called hai_data_transform() that allows users to easily apply any of these transforms to their data. The function takes a {recipes} recipe object, the columns to transform, and the type of transformation as arguments, and returns a list whose scale_rec_obj element holds the recipe with the transformation step added. This makes it easy for users to experiment with different transformations and see which one works best for their data.

Function

Let’s take a look at the full function call.

hai_data_transform(
  .recipe_object = NULL,
  ...,
  .type_of_scale = "log",
  .bc_limits = c(-5, 5),
  .bc_num_unique = 5,
  .bs_deg_free = NULL,
  .bs_degree = 3,
  .log_base = exp(1),
  .log_offset = 0,
  .logit_offset = 0,
  .ns_deg_free = 2,
  .rel_shift = 0,
  .rel_reverse = FALSE,
  .rel_smooth = FALSE,
  .yj_limits = c(-5, 5),
  .yj_num_unique = 5
)

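Now let’s go over the arguments. .recipe_object is the recipe to work on, ... selects the columns to transform, and .type_of_scale picks the transformation. The remaining arguments are prefixed by the transformation they tune: .bc_ for Box-Cox, .bs_ for basis spline, .log_ for log, .logit_ for logit, .ns_ for natural spline, .rel_ for ReLU, and .yj_ for Yeo-Johnson.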

Examples

Let’s look over some examples. As an example data set we will use the mpg column of mtcars: its histogram is right-skewed, which makes it a good candidate for testing these transformations.

install.packages("healthyR.ai")

Now that we have {healthyR.ai} installed we can get to work. It uses the {recipes} package underneath, so you will need to have that installed as well. Let’s look at the histogram and density of mtcars$mpg now.

mpg_vec <- mtcars$mpg

hist(mpg_vec)

plot(density(mpg_vec))
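
A quick numeric check can back up what the plots show. The sketch below computes a simple moment-based sample skewness in base R. The skewness() helper is defined here for illustration only and is not a function from {healthyR.ai}. A positive value means a longer right tail.

```r
# Hypothetical helper: moment-based sample skewness
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

mpg_vec <- mtcars$mpg

skewness(mpg_vec)        # positive: right-skewed
skewness(log(mpg_vec))   # closer to zero after a log transform
```

The same helper can be applied to any of the transformed vectors created below to compare how much of the skew each transformation removes.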

First up, Box-Cox

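library(healthyR.ai)
library(recipes) # also attaches dplyr, which provides pull()

# Recipe with mpg as the outcome and wt as the predictor
ro <- recipe(mpg ~ wt, data = mtcars)

# Apply Box-Cox to mpg, then pull the transformed column out as a vector
boxcox_vec <- hai_data_transform(
  .recipe_object = ro,
  mpg,
  .type_of_scale = "boxcox"
)$scale_rec_obj %>%
  get_juiced_data() %>%
  pull(mpg)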

plot(density(boxcox_vec))
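
The .bc_limits argument suggests that lambda is searched for within a fixed range. As a rough illustration of that idea, the base R sketch below maximises the standard Box-Cox profile log-likelihood with optimize(). This is not necessarily the exact routine that {recipes} or {healthyR.ai} runs, and the function name bc_loglik is made up for this example.

```r
# Profile log-likelihood of the Box-Cox model for a single positive variable
bc_loglik <- function(lambda, x) {
  n <- length(x)
  y <- if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
  -n / 2 * log(mean((y - mean(y))^2)) + (lambda - 1) * sum(log(x))
}

x <- mtcars$mpg

# Search for the maximising lambda inside the default limits, c(-5, 5)
fit <- optimize(bc_loglik, interval = c(-5, 5), x = x, maximum = TRUE)
lambda_hat <- fit$maximum

# Apply the transform with the estimated lambda
bc_manual <- if (abs(lambda_hat) < 1e-8) {
  log(x)
} else {
  (x^lambda_hat - 1) / lambda_hat
}

plot(density(bc_manual))
```

Narrowing the interval restricts how aggressive the power transform is allowed to be, which is presumably what tightening .bc_limits would do.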

Basis Spline

bs_vec <- hai_data_transform(
  .recipe_object = ro,
  mpg,
  .type_of_scale = "bs"
)$scale_rec_obj %>%
  get_juiced_data()

plot(density(bs_vec$mpg_bs_1))

plot(density(bs_vec$mpg_bs_2))

plot(density(bs_vec$mpg_bs_3))
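
The three mpg_bs_* columns plotted above match what a cubic B-spline basis with no extra knots produces. The base R sketch below uses bs() from the {splines} package that ships with R. That {healthyR.ai} arrives at its columns this way is an assumption based on the .bs_degree = 3 and .bs_deg_free = NULL defaults, not something confirmed here.

```r
library(splines)

# Cubic B-spline basis with default knots: one column per basis function
basis <- bs(mtcars$mpg, degree = 3)

dim(basis)     # 32 rows, 3 columns
range(basis)   # basis values lie between 0 and 1
```

Because one variable becomes several columns, a spline expansion is better thought of as feature engineering for a model than as a way to make a single variable look more normal.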

Log

log_vec <- hai_data_transform(
  .recipe_object = ro,
  mpg,
  .type_of_scale = "log"
)$scale_rec_obj %>%
  get_juiced_data() %>%
  pull(mpg)

plot(density(log_vec))
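
Since {healthyR.ai} builds on {recipes}, the same result should be reachable with {recipes} alone. The sketch below assumes that .type_of_scale = "log" corresponds to step_log(), with base and offset mirroring .log_base and .log_offset. That mapping is inferred from the argument names rather than taken from the package source.

```r
library(recipes)

# Same recipe as above, with the log step added directly
rec <- recipe(mpg ~ wt, data = mtcars)
rec <- step_log(rec, mpg, base = exp(1), offset = 0)

# prep() estimates the recipe; bake(new_data = NULL) returns the processed training data
log_direct <- bake(prep(rec), new_data = NULL)$mpg

# Compare against a plain base R log
all.equal(log_direct, log(mtcars$mpg))
```

Comparing log_direct with log_vec from the example above is a simple way to check whether the assumed mapping holds on your installed versions.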

Yeo-Johnson

yj_vec <- hai_data_transform(
  .recipe_object = ro,
  mpg,
  .type_of_scale = "yeojohnson"
)$scale_rec_obj %>%
  get_juiced_data() %>%
  pull(mpg)

plot(density(yj_vec))

Voila!
