Log transform or log link? And confounding variables. by @ellis2013nz
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Last week I wrote about the relationship between weight and height in US adults, as seen in the US Centers for Disease Control and prevention (CDC) Behavioral Risk Factor Surveillance System, an annual telephone survey of around 400,000 interviews per year. In particular, I tested the widely-circulated claim that Body Mass Index (BMI) exaggerates the “fatness” of tall people because empirically (it is claimed), weight is proportional to height to the power of 2.5, not to 2.0 as implied by the formula for calculating BMI. I showed that in 2018 in the BRFSS at least, weight is actually proportional to height to the power of around 1.6 (after adjusting for race and sex). So if anything, the BMI calculation exaggerates the weight of short people, not tall people, on the logic of the original critique.
I have a couple of follow-up points today:
- whether it is best to to logarithm-transform a response variable in a particular model, or model it on its original scale and use a logarithm link function in a generalized linear model
- whether socioeconomic class is a confounding variable (richer people are taller, and also healthier in general, so perhaps the relatively low weights of tall people measured by the BMI are caused by the better health of rich people, not by a problem in the measurement).
Let’s deal with those one at a time.
Log transform or log link?
Our height and weight data looks a bit like this (limiting ourselves to a random sub-sample of the full survey to better illustrate a point without over-plotting):
There is a very slight curvature (not really visible) in the relationship between height and weight, but more importantly (if one is considering estimating a model by ordinary least squares) there is a clear “fanning out” effect – as the expected value of weight increases, so does its variance. In intro stats and econometrics courses, one is trained to recognise this effect as heteroskedasticity, which violates one of the assumptions needed for ordinary least squares to be an optimal fitting algorithm and certain inferences from its results to be valid.
The workaround usually taught in those same courses is to take logarithmic transforms of both variables. With many metrics common in the social sciences, this will turn a skewed distribution into one that at least resembles a Gaussian (‘Normal’) one. At the same time this conveniently turns your underlying model into one where the linear coefficients are elasticities, which have a nice interpretation (as exploited by me last week in examining how closely BMI matches real weight-height relationship):
Here’s the code for those two charts (assumes you have run last week’s code first), and doing some setup for later:
In fact, this log transformation approach, in my opinion, is fine. But an alternative approach, which some argue we should always follow, is to model the response variable in its original metric (kg, not log kg), explicitly dealing with the increasing variance (ie dropping the “constant variance” assumption behind the usual rationale of ordinary least squares). We can do this while keeping the functional relationship of linearity in the logarithms of the two variables (and hence the elasticity interpretation of coefficients) by moving from a model of:
E(log(y)) = Xb
(which is the “log transform” approach), to:
log(E(y)) = Xb
(which is the “log link function” approach, as used in a Generalized Linear Model).
Where X is a matrix of explanatory variables that includes (in this case) the logarithm of height.
In both those formulae, E() represents the “Expected value”. Another way of thinking about this is that in the first version, the random residuals have a multiplicative relationship to y, whereas in the second they have an additive relationship. That may or may not help.
I’m not sure where I come down on this, although I like the elegance of always modelling your variable in its own units (ie the second approach, with a link function). In my blog post last week I used the transformation approach, and it worried me enough that I wanted to re-specify and fit my model as a generalized linear model with a a logarithm link function and variance proportional to the expected value of the response.
Here’s how I refit these two versions of a model of weight on height (plus sex, race and age):
Which gets us these results for the point estimates of the various coefficients in the linear predictor:
|Variable||Log - Log Gaussian||Quasi GLM with log link and mu variance|
|raceAmerican Indian or Alaskan Native||0.0410786||0.0433961|
|raceBlack or African American||0.0512117||0.0514042|
|raceHispanic, Latino/a, or Spanish origin||0.0128204||0.0122439|
|ageAge 65 or older||-0.0108261||-0.0148626|
There are some small differences, but not material (at least for my purposes). The coefficients in the “Log-Log Gaussian” column differ slightly from those reported last week because I am now explicitly coding some people as “missing” data on particular variables and keeping them in the data (with an extra coefficient for each “missing” effect) rather than dropping them as I did last week.
Note that even my log-log transform model isn’t actually using ordinary least squares, but inverse-probability weighting (this is what
svyglm() does under the hood in both cases) to take into account the complex survey origin of the data. That’s an important issue for working with this data, but not for the general point about obtaining similar results when modelling E(log(y)) to what one gets when modelling log(E(y)).
Socioeconomic class as a confounding variable
Having disposed of that model specification and estimation issue, what of the second point that was worrying me? Here’s what Nicholas Erskine wrote in response to last week’s post:
I smell confounding here. Height is correlated with socioeconomic status. Taller people on average having lower BMI scores could just be telling you that taller people are healthier on average— Nicholas Erskine (@nerskin95) February 23, 2020
Is he right? Well, I’d actually spotted some issues related this and thought about it myself beforehand, and the evidence that I’m not making that up is that my original post included code that prepared variables on household income and level of education for the survey respondents. Analysis of that issue didn’t make the cut for the blog post though. Let’s look at this whole height-weight-socioeconomic status relationship thing now.
I’m going to do this by fitting a couple of models to understand the relationship of my two key socioeconomic variables (household income and level of education) and:
- height (controlling for demographics)
- weight (controlling for demographics and height)
The second of these models will let me see if my estimate of the “elasticity” of height and weight stays at around 1.6, as per my previous blog post, rather than the 2.5 claimed by Trevethen.
Socioeconomic variables relationship to height
Here are the 95% confidence intervals for the ‘impact’ on height of various variables. The scare quotes around ‘impact’ are because I believe my socioeconomic measurements of status today for my respondents are standing in as proxies for a confounder such as “socioeconomic status while growing up”. I don’t believe I there is information available in my dataset on parents’ income and education level.
What we see is that people with higher incomes and education levels are indeed taller, after controlling for age, race and sex. I’ve deliberately fit this model on the original scale with an identity link function for interpretability, so we can say (for example) that when controlling for all other factors, men are nearly 15cm on average taller than women; Americans aged 65+ around 2cm shorter than those age 18-64; etc.
Socioeconomic variables and height’s relationship to weight
So, what happens when we use our best available proxy for socioeconomic status in our regression of weight on height, which I used last week to estimate an elasticity of weight on height of around 1.6? We find that as Nicholas suggested, higher socioeconomic status is associated with lower weight (which in today’s USA, can be seen as a proxy for better health).
But after controlling for all these things, weight is still proportional to height to the power of 1.6, not 2 and definitely not 2.5. So I’m pretty confident to say that what we’re seeing (with lower BMIs of tall people, by the standard calculation) is not a result of confounding socioeconomic status. In other words, last week’s conclusions stand up.
There’s some other interesting interpretive points from both these models about how these various factors interrelate but I’ll not go into them just now.
Here’s the code for fitting those two models and drawing those charts. Features to note include:
- use of Thomas Lumley’s
surveypackage and in particular
svyglm()so we get estimates and standard errors appropriate for the complex survey nature of the data
- use of the “quasi” family and explicitly chosen link and variance functions to fit my various model structure and interpetability needs
- using thick segments and no dot points, to focus the eye on confidence intervals rather than point estimates
- a bit of mucking around to make the different levels of factors (such as age) have visual location and colour in common while still retaining their natural sequencing
- explicit coding of the missing levels of variables with significant observations who otherwise would be dropped out; and use of transparency to avoid these estimates taking too much attention in the charts
- use of a function for some repetitive transformation work to turn model output into something needed for the chart
That’s all for today.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.