Adjusting variable distribution and exploring data using mass linear regression

[This article was first published on Ronan's #TidyTuesday blog, and kindly contributed to R-bloggers.]

Introduction

In this post, the BEA Infrastructure Investment data set from the #TidyTuesday project is used to illustrate variable transformation and the exploreR::masslm() function. The variable for gross infrastructure investment adjusted for inflation is transformed to make it less skewed. Using these transformed investment values, a separate linear model is then fitted for each of the other variables to quickly see which of them are most strongly associated with infrastructure investment.

Setup

Loading the R libraries and data set.

# Loading libraries
library(tidyverse)
library(tidytuesdayR)
library(exploreR)

# Loading data
tt <- tt_load("2021-08-10")

    Downloading file 1 of 3: `ipd.csv`
    Downloading file 2 of 3: `chain_investment.csv`
    Downloading file 3 of 3: `investment.csv`

Plotting distribution of inflation-adjusted infrastructure investments

In this section, gross infrastructure investment (chained 2012 dollars), in millions of USD, is plotted with and without a \(\log_{10}\) transformation. The histograms below show that applying a \(\log_{10}\) transformation gives the variable a much less skewed distribution, so this transformation should be considered before running statistical tests on inflation-adjusted infrastructure investment.

# Creating tbl_df with gross_inv_chain values
untransformed_tbl_df <- tibble(
  gross_inv_chain = tt$chain_investment$gross_inv_chain,
  transformation = "Untransformed"
  )

# Creating tbl_df with log10(gross_inv_chain) values
log10_tbl_df <- tibble(
  gross_inv_chain = log10(tt$chain_investment$gross_inv_chain),
  transformation = "Log10"
)

# Combining the above tibbles into one tbl_df
gross_inv_chain_tbl_df <- rbind(untransformed_tbl_df, log10_tbl_df)

# Plotting distribution of inflation-adjusted infrastructure investments
gross_inv_chain_tbl_df %>%
  ggplot(aes(x = gross_inv_chain, fill = transformation)) +
  geom_histogram(show.legend = FALSE, position = "identity",
                 bins = 12, colour = "black") +
  facet_wrap(~transformation, scales = "free") +
  labs(y = NULL,
       x = "Gross infrastructure investments adjusted for inflation (millions USD)",
       title = "Distributions of untransformed and log transformed infrastructure investments",
       subtitle = "Log transformed investments are more normally distributed") +
  scale_fill_brewer(palette = "Set1") +
  theme_classic()

Figure 1: The transformed variable is more appropriate for parametric statistical tests.
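
The visual impression from the histograms can also be checked numerically. The following is a minimal sketch rather than part of the original analysis: it uses simulated log-normal values as a stand-in for gross_inv_chain (an assumption, not the BEA data), and sample_skewness() is a hypothetical helper written for this illustration. A skewness near zero indicates a roughly symmetric distribution.

```r
# Simulated right-skewed values standing in for gross_inv_chain
# (an assumption for illustration, not the BEA data)
set.seed(42)
simulated_investment <- rlnorm(1000, meanlog = 7, sdlog = 1.5)

# Hypothetical helper: sample skewness (third standardised moment),
# which is close to 0 for a symmetric sample
sample_skewness <- function(x) {
  centred <- x - mean(x)
  mean(centred^3) / mean(centred^2)^1.5
}

skew_untransformed <- sample_skewness(simulated_investment)
skew_log10 <- sample_skewness(log10(simulated_investment))

# Compare skewness before and after the log10 transformation
print(c(untransformed = skew_untransformed, log10 = skew_log10))
```

The same helper could be applied to tt$chain_investment$gross_inv_chain and its \(\log_{10}\) transform, after removing non-finite values, to compare the two panels of the figure.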

Exploring a data set using mass linear regression

In this section, exploreR::masslm() is applied to a copy of the data set with \(\log_{10}\) transformed investment values. The masslm() function from the exploreR package fits a separate linear model of the dependent variable against each of the other variables in the data set. It then returns a data frame containing the statistics from each model that are useful when selecting predictor variables:

  • R squared: the proportion of variation in the dependent (response) variable that is explained by the independent (predictor) variable.
  • p-value: the statistical significance of the model. A p-value \(\lt 0.05\) is typically considered significant.

This function is useful for quickly screening which variables are worth considering for predictive models. Note that the data should satisfy the assumptions of linear models, including approximately normally distributed residuals, which a strongly skewed response variable often violates. In this case, the \(\log_{10}\) transformed investment variable is close to normal.
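
To make the idea of mass linear regression concrete, here is a minimal base R sketch of the general technique: one simple linear model per predictor, with R squared and the overall p-value collected into a data frame. It is not exploreR's implementation, and details such as the handling of categorical predictors may differ. The function name mass_lm_sketch() is hypothetical, and the built-in mtcars data set is used only so that the example runs without any download.

```r
# Hypothetical helper (not part of exploreR): fit one simple linear model
# per predictor and collect R squared and the overall F-test p-value
mass_lm_sketch <- function(data, response) {
  predictors <- setdiff(names(data), response)
  results <- lapply(predictors, function(predictor) {
    # Build the formula response ~ predictor and fit the model
    model <- lm(reformulate(predictor, response = response), data = data)
    model_summary <- summary(model)
    f_stat <- model_summary$fstatistic
    data.frame(
      IV = predictor,
      R.squared = model_summary$r.squared,
      P.value = unname(pf(f_stat["value"], f_stat["numdf"], f_stat["dendf"],
                          lower.tail = FALSE)),
      stringsAsFactors = FALSE
    )
  })
  results <- do.call(rbind, results)
  # Order by decreasing R squared
  results[order(-results$R.squared), ]
}

# Example on a built-in data set: which variable best explains mpg?
mtcars_subset <- mtcars[, c("mpg", "wt", "hp", "qsec")]
mtcars_results <- mass_lm_sketch(mtcars_subset, response = "mpg")
print(mtcars_results)
```

Because each model contains a single predictor, the results describe marginal associations only. Predictors that look strong individually may overlap in the variation they explain once combined in one model.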

From the mass linear regression results, we can see that investment category is the single variable that explains the largest proportion of variation in \(\log_{10}\) investment (R squared of about 0.64), while the linear model with group number has the smallest p-value, followed by year.

# Creating a copy of the chain_investment data set with log10 transformed
# gross investment values
chain_investment_df <- tt$chain_investment %>%
  # Creating a log10 transformed copy of gross_inv_chain
  mutate(gross_inv_transformed = log10(gross_inv_chain)) %>%
  # Removing -Inf values
  filter(gross_inv_transformed != -Inf) %>%
  # Selecting variables to include in the data frame
  select(category, meta_cat, group_num, year, gross_inv_transformed)

# Applying mass linear regression
transformed_investment_masslm <- masslm(chain_investment_df,
                                        dv.var = "gross_inv_transformed")

# Printing the masslm results in order of R squared values (decreasing)
transformed_investment_masslm %>%
  arrange(-R.squared)

             IV Coefficient    P.value  R.squared
    1  category   -0.579900  8.471e-10 0.63754622
    2  meta_cat    0.349300  7.848e-10 0.37782201
    3 group_num   -0.058750 3.625e-204 0.14695670
    4      year    0.009507  7.007e-59 0.04377399

# Printing the masslm results in order of p-values
transformed_investment_masslm %>%
  arrange(P.value)

             IV Coefficient    P.value  R.squared
    1 group_num   -0.058750 3.625e-204 0.14695670
    2      year    0.009507  7.007e-59 0.04377399
    3  meta_cat    0.349300  7.848e-10 0.37782201
    4  category   -0.579900  8.471e-10 0.63754622
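
The filter step in the pipeline above is needed because \(\log_{10}(0)\) evaluates to -Inf in R, and lm() does not accept infinite values in the response. The short sketch below, using made-up values rather than the investment data, shows the behaviour and one alternative way to drop such values with is.finite(), which also removes NA and NaN.

```r
# Hypothetical investment values, including a zero
example_investment <- c(0, 10, 1000)

# log10(0) is -Inf, so the first transformed value is not usable in lm()
example_log10 <- log10(example_investment)
print(example_log10)

# is.finite() is FALSE for -Inf, Inf, NA and NaN, so this keeps only
# values that can be used as a response in a linear model
finite_log10 <- example_log10[is.finite(example_log10)]
print(finite_log10)
```

Inside a dplyr pipeline, the equivalent step would be filter(is.finite(gross_inv_transformed)).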
