The Problem
You have two predictors in your model. One seems to have a stronger coefficient than the other. But is the difference significant?
Example: when predicting a worker's salary, is the standardized coefficient of the number of extra hours worked (xtra_hours) really larger than that of the number of compliments given to the boss (n_comps)?
```r
library(parameters)
library(effectsize)

data("hardlyworking", package = "effectsize")

hardlyworkingZ <- standardize(hardlyworking)

m <- lm(salary ~ xtra_hours + n_comps, data = hardlyworkingZ)

model_parameters(m)
#> Parameter   | Coefficient |   SE |         95% CI |    t(497) |      p
#> ---------------------------------------------------------------------
#> (Intercept) |   -7.19e-17 | 0.02 | [-0.03,  0.03] | -4.14e-15 | > .999
#> xtra_hours  |        0.81 | 0.02 | [ 0.78,  0.84] |     46.60 | < .001
#> n_comps     |        0.41 | 0.02 | [ 0.37,  0.44] |     23.51 | < .001
```
Here are 6 methods to test coefficient equality in R.
Notes:
– If we were interested in the unstandardized coefficients, we would not need to standardize the data first.
– If one parameter were positive and the other negative, one of the terms would first need to be reversed (-X) for this to work (see the sketch below).
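For example, a minimal sketch of such a reversal (purely hypothetical here, since both slopes in this data are positive):

```r
# Hypothetical: if n_comps had a negative slope, flip its sign so that
# both coefficients are positive before comparing their magnitudes
m_rev <- lm(salary ~ xtra_hours + I(-n_comps), data = hardlyworkingZ)
```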
Method 1: As Model Comparisons
Based on this awesome tweet.
Since:
\(\hat{Y} = a \times \text{xtra\_hours} + a \times \text{n\_comps} \\ = a \times (\text{xtra\_hours} + \text{n\_comps})\)
We can essentially force the two coefficients to be equal by replacing the two predictors with a single composite variable.
```r
m0 <- lm(salary ~ I(xtra_hours + n_comps), data = hardlyworkingZ)

model_parameters(m0)
#> Parameter            | Coefficient |   SE |        95% CI |    t(498) |      p
#> ------------------------------------------------------------------------------
#> (Intercept)          |   -2.05e-17 | 0.02 | [-0.04, 0.04] | -9.57e-16 | > .999
#> xtra_hours + n_comps |        0.61 | 0.01 | [ 0.58, 0.64] |     41.09 | < .001
```
We can then compare this constrained model to our original, unconstrained model:
```r
anova(m0, m)
#> Analysis of Variance Table
#>
#> Model 1: salary ~ I(xtra_hours + n_comps)
#> Model 2: salary ~ xtra_hours + n_comps
#>   Res.Df    RSS Df Sum of Sq      F    Pr(>F)
#> 1    498 113.67
#> 2    497  74.95  1    38.716 256.73 < 2.2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
We can conclude that the unconstrained model fits significantly better than the constrained model: the two coefficients differ, and, given the estimates, \(\beta_{\text{xtra\_hours}} > \beta_{\text{n\_comps}}\).
Method 1b: Composite Variable + Difference
Edit (Feb-17, 2021): Thanks to @joejps84 for pointing this out!
We can achieve the same thing in a single model, if we express the slope of n_comps as equal to the slope of xtra_hours plus some change:
\(\hat{Y} = a \times \text{xtra\_hours} + (a + b) \times \text{n\_comps} \\ = a \times \text{xtra\_hours} + a \times \text{n\_comps} + b \times \text{n\_comps} \\ = a \times (\text{xtra\_hours} + \text{n\_comps}) + b \times \text{n\_comps}\)
So the slope of n_comps in such a model is the difference between the slopes of n_comps and xtra_hours:
```r
m1 <- lm(salary ~ I(xtra_hours + n_comps) + n_comps, data = hardlyworkingZ)

model_parameters(m1)
#> Parameter            | Coefficient |   SE |         95% CI |    t(497) |      p
#> -------------------------------------------------------------------------------
#> (Intercept)          |   -2.99e-17 | 0.02 | [-0.03,  0.03] | -1.72e-15 | > .999
#> xtra_hours + n_comps |        0.81 | 0.02 | [ 0.78,  0.84] |     46.60 | < .001
#> n_comps              |       -0.40 | 0.03 | [-0.45, -0.35] |    -16.02 | < .001
```
This gives the exact same result (\(t^2 = (-16.02)^2 \approx 256.7\), the same as the F-value from the anova() test, up to rounding)!
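As a quick numeric check (reusing the m1 fit from above):

```r
# The squared t-statistic of the n_comps (difference) term in m1
# should match the F statistic from anova(m0, m), up to rounding
t_val <- coef(summary(m1))["n_comps", "t value"]
t_val^2  # ~256.7
```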
Method 2: Paternoster et al. (1998)
According to Paternoster et al. (1998), we can compute a t-test to compare the coefficients:
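\(t = \dfrac{b_1 - b_2}{\sqrt{SE_{b_1}^2 + SE_{b_2}^2 - 2 \cdot \text{Cov}(b_1, b_2)}}\)

(The covariance term is included here because both estimates come from the same model; this is the form computed in the code below.)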
```r
b <- coef(m)
V <- vcov(m)

tibble::tibble(
  diff_estim = b[2] - b[3],
  diff_SE    = sqrt(V[2, 2] + V[3, 3] - 2 * V[2, 3]),
  t_stat     = diff_estim / diff_SE,
  df         = df.residual(m),
  p_value    = 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)
)
#> # A tibble: 1 x 5
#>   diff_estim diff_SE t_stat    df  p_value
#>        <dbl>   <dbl>  <dbl> <int>    <dbl>
#> 1      0.402  0.0251   16.0   497 6.96e-47
```
This gives the exact same results as the first method!
Edit (Feb-17, 2021): As @bmwiernik pointed out, this can also be done with some fancy matrix multiplication:
```r
contr <- c(0, 1, -1)  # weights for (Intercept), xtra_hours, n_comps

tibble::tibble(
  diff_estim = b %*% contr,
  diff_SE    = sqrt(contr %*% V %*% contr),
  t_stat     = diff_estim / diff_SE,
  df         = df.residual(m),
  p_value    = 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)
)
#> # A tibble: 1 x 5
#>   diff_estim[,1] diff_SE[,1] t_stat[,1]    df p_value[,1]
#>            <dbl>       <dbl>      <dbl> <int>       <dbl>
#> 1          0.402      0.0251       16.0   497    6.96e-47
```
All of the following solutions are essentially this method, wrapped in some nice functions.
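For instance, a minimal sketch of such a wrapper (a hypothetical helper, not taken from any package):

```r
# Hypothetical helper: Wald t-test for the difference between two
# coefficients of the same model (the computation shown above)
compare_coefs <- function(model, name1, name2) {
  b <- coef(model)
  V <- vcov(model)
  est <- unname(b[name1] - b[name2])
  se  <- sqrt(V[name1, name1] + V[name2, name2] - 2 * V[name1, name2])
  df  <- df.residual(model)
  t   <- est / se
  c(diff = est, SE = se, t = t, df = df,
    p = 2 * pt(abs(t), df = df, lower.tail = FALSE))
}

compare_coefs(m, "xtra_hours", "n_comps")
```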
Method 3: emmeans <3
We can estimate the slopes from the model using emmeans, and then, with some trickery, compare them!
```r
library(emmeans)

trends <- rbind(
  emtrends(m, ~1, "xtra_hours"),
  emtrends(m, ~1, "n_comps")
)

# clean up so it does not error later
trends@grid$`1` <- c("xtra_hours", "n_comps")
trends@misc$estName <- "trend"

trends
#> 1          trend     SE  df lower.CL upper.CL
#> xtra_hours 0.811 0.0174 497    0.772    0.850
#> n_comps    0.409 0.0174 497    0.370    0.448
#>
#> Confidence level used: 0.95
#> Conf-level adjustment: bonferroni method for 2 estimates
```
Compare them:
```r
pairs(trends)
#> contrast             estimate     SE  df t.ratio p.value
#> xtra_hours - n_comps    0.402 0.0251 497 16.023  <.0001
```
Once again – exact same results!
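As a side note, the same comparison can also be written as a custom contrast, which is equivalent to pairs() here (the weights c(1, -1) assume the row order of trends):

```r
# Equivalent custom contrast: 1 * xtra_hours - 1 * n_comps
contrast(trends, method = list("xtra_hours - n_comps" = c(1, -1)))
```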
Method 4: lavaan
The lavaan package for latent variable analysis and structural equation modeling allows for constraining parameters. So let's do just that:
```r
library(lavaan)

L  <- sem("salary ~ xtra_hours + n_comps", data = hardlyworkingZ)
L0 <- sem("salary ~ a * xtra_hours + a * n_comps", data = hardlyworkingZ)

anova(L0, L)
#> Chi-Squared Difference Test
#>
#>    Df    AIC    BIC  Chisq Chisq diff Df diff Pr(>Chisq)
#> L   0 476.04 488.69   0.00
#> L0  1 682.26 690.69 208.22     208.22       1  < 2.2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
This too yields similar results! (Only slightly different due to different estimation methods.)
We can also directly estimate the difference using lavaan's "defined parameters", defined with the := operator:
L <- sem("salary ~ b1 * xtra_hours + b2 * n_comps diff := b1 - b2", data = hardlyworkingZ) parameterEstimates(L, output = "text")[7,] #> #> Defined Parameters: #> Estimate Std.Err z-value P(>|z|) ci.lower ci.upper #> diff 0.402 0.025 16.071 0.000 0.353 0.451
Which, again, gives the same results.
Method 5: car
Edit (Feb-17, 2021): Thanks to @DouglasKGAraujo for his suggestion!
Even more methods!
```r
library(car)

linearHypothesis(m, c("xtra_hours - n_comps"))
#> Linear hypothesis test
#>
#> Hypothesis:
#> xtra_hours - n_comps = 0
#>
#> Model 1: restricted model
#> Model 2: salary ~ xtra_hours + n_comps
#>
#>   Res.Df    RSS Df Sum of Sq      F    Pr(>F)
#> 1    498 113.67
#> 2    497  74.95  1    38.716 256.73 < 2.2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
Which once again gives the same result!
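As a side note, linearHypothesis() also accepts the hypothesis written as an equality, which reads a bit more naturally:

```r
# Same test, written as an equality
linearHypothesis(m, "xtra_hours = n_comps")
```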
Method 6: multcomp
Edit (Feb-17, 2021): Thanks to Stefan Th. Gries for his suggestion!
Even more more methods!
```r
library(multcomp)

# contrast matrix: 0 * (Intercept) + 1 * xtra_hours - 1 * n_comps
summary(glht(m, matrix(c(0, 1, -1), nrow = 1)))
#>
#>   Simultaneous Tests for General Linear Hypotheses
#>
#> Fit: lm(formula = salary ~ xtra_hours + n_comps, data = hardlyworkingZ)
#>
#> Linear Hypotheses:
#>        Estimate Std. Error t value Pr(>|t|)
#> 1 == 0  0.40168    0.02507   16.02   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> (Adjusted p values reported -- single-step method)
```
Which once again again gives the same result!
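A nice bonus: glht objects also support confint(), giving a confidence interval for the difference directly:

```r
# 95% CI for the difference between the two slopes
confint(glht(m, matrix(c(0, 1, -1), nrow = 1)))
```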
Summary
That’s it really – these 6 simple methods have wide applications to GL(M)Ms, SEM, and more.
Enjoy!