Weight loss in the U.S. – An analysis of NHANES data with tidyverse

[This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Based on a paper published in JAMA last year, the weight gain is increasing among US adults while there is no difference in the percentage of people that were trying to lose weight. The authors used the data from the National Health and Nutrition Examination Survey NHANES from 1988 to 2014 and calculated the proportion of people tried to lose weight during the past 12 months.

Logistic regression analysis was used to estimate the odds of trying to lose weight for different periods from 1988 to 2014. I would like to reproduce these findings with R, given that NHANES data is publically available.

Libraries and Datasets

Load libraries

library(tidyverse)
library(RNHANES)
library(weights)
library(ggsci)
library(ggthemes)

The authors utilized the data from 1988 to 2014, but since the RNHANES package enables the data starting from 1999, I will include data from 1999 to 2014. Therefore, I cannot fully reproduce the results of the paper publish at JAMA.

Load the data using nhanes_load_data function from RNHANES package.

d99 = nhanes_load_data("DEMO", "1999-2000") %>%
  select(SEQN, cycle, RIAGENDR, RIDAGEYR, RIDRETH1, RIDEXPRG, INDFMINC, WTINT2YR, WTMEC2YR) %>%
  transmute(SEQN=SEQN, wave=cycle, RIAGENDR, RIDAGEYR, RIDRETH1, RIDEXPRG, INDFMINC, WTINT2YR, WTMEC2YR) %>%
  left_join(nhanes_load_data("BMX", "1999-2000"), by="SEQN") %>%
  select(SEQN, wave, RIAGENDR, RIDAGEYR, RIDRETH1, RIDEXPRG, INDFMINC, WTINT2YR, WTMEC2YR, BMXBMI) %>%
  left_join(nhanes_load_data("WHQ", "1999-2000"), by="SEQN") %>%
  select(SEQN, wave, RIAGENDR, RIDAGEYR, RIDRETH1, RIDEXPRG, INDFMINC, WTINT2YR, WTMEC2YR, BMXBMI, WHQ070)

d01 = nhanes_load_data("DEMO_B", "2001-2002") %>% 
  select(SEQN, cycle, RIAGENDR, RIDAGEYR, RIDRETH1, RIDEXPRG, INDFMINC, WTINT2YR, WTMEC2YR) %>%
  transmute(SEQN=SEQN, wave=cycle, RIAGENDR, RIDAGEYR, RIDRETH1, RIDEXPRG, INDFMINC, WTINT2YR, WTMEC2YR) %>%
  left_join(nhanes_load_data("BMX_B", "2001-2002"), by="SEQN") %>%
  select(SEQN, wave, RIAGENDR, RIDAGEYR, RIDRETH1, RIDEXPRG, INDFMINC, WTINT2YR, WTMEC2YR, BMXBMI) %>%
  left_join(nhanes_load_data("WHQ_B", "2001-2002"), by="SEQN") %>%
  select(SEQN, wave, RIAGENDR, RIDAGEYR, RIDRETH1, RIDEXPRG, INDFMINC, WTINT2YR, WTMEC2YR, BMXBMI, WHQ070)

d03 = nhanes_load_data("DEMO_C", "2003-2004") %>%
  select(SEQN, cycle, RIAGENDR, RIDAGEYR, RIDRETH1, RIDEXPRG, INDFMINC, WTINT2YR, WTMEC2YR) %>%
  transmute(SEQN=SEQN, wave=cycle, RIAGENDR, RIDAGEYR, RIDRETH1, RIDEXPRG, INDFMINC, WTINT2YR, WTMEC2YR) %>%
  left_join(nhanes_load_data("BMX_C", "2003-2004"), by="SEQN") %>%
  select(SEQN, wave, RIAGENDR, RIDAGEYR, RIDRETH1, RIDEXPRG, INDFMINC, WTINT2YR, WTMEC2YR, BMXBMI) %>%
  left_join(nhanes_load_data("WHQ_C", "2003-2004"), by="SEQN") %>%
  select(SEQN, wave, RIAGENDR, RIDAGEYR, RIDRETH1, RIDEXPRG, INDFMINC, WTINT2YR, WTMEC2YR, BMXBMI, WHQ070)

d09 = nhanes_load_data("DEMO_F", "2009-2010") %>%
  select(SEQN, cycle, RIAGENDR, RIDAGEYR, RIDRETH1, RIDEXPRG, INDFMIN2, WTINT2YR, WTMEC2YR) %>%
  transmute(SEQN=SEQN, wave=cycle, RIAGENDR, RIDAGEYR, RIDRETH1, RIDEXPRG, INDFMINC=INDFMIN2, WTINT2YR, WTMEC2YR) %>%
  left_join(nhanes_load_data("BMX_F", "2009-2010"), by="SEQN") %>%
  select(SEQN, wave, RIAGENDR, RIDAGEYR, RIDRETH1, RIDEXPRG, INDFMINC, WTINT2YR, WTMEC2YR, BMXBMI) %>%
  left_join(nhanes_load_data("WHQ_F", "2009-2010"), by="SEQN") %>%
  select(SEQN, wave, RIAGENDR, RIDAGEYR, RIDRETH1, RIDEXPRG, INDFMINC, WTINT2YR, WTMEC2YR, BMXBMI, WHQ070)

d11 = nhanes_load_data("DEMO_G", "2011-2012") %>%
  select(SEQN, cycle, RIAGENDR, RIDAGEYR, RIDRETH1, RIDEXPRG, INDFMIN2, WTINT2YR, WTMEC2YR) %>%
  transmute(SEQN=SEQN, wave=cycle, RIAGENDR, RIDAGEYR, RIDRETH1, RIDEXPRG, INDFMINC=INDFMIN2, WTINT2YR, WTMEC2YR) %>%
  left_join(nhanes_load_data("BMX_G", "2011-2012"), by="SEQN") %>%
  select(SEQN, wave, RIAGENDR, RIDAGEYR, RIDRETH1, RIDEXPRG, INDFMINC, WTINT2YR, WTMEC2YR, BMXBMI) %>%
  left_join(nhanes_load_data("WHQ_G", "2011-2012"), by="SEQN") %>%
  select(SEQN, wave, RIAGENDR, RIDAGEYR, RIDRETH1, RIDEXPRG, INDFMINC, WTINT2YR, WTMEC2YR, BMXBMI, WHQ070)

d13 = nhanes_load_data("DEMO_H", "2013-2014") %>%
  select(SEQN, cycle, RIAGENDR, RIDAGEYR, RIDRETH1, RIDEXPRG, INDFMIN2, WTINT2YR, WTMEC2YR) %>%
  transmute(SEQN=SEQN, wave=cycle, RIAGENDR, RIDAGEYR, RIDRETH1, RIDEXPRG, INDFMINC=INDFMIN2, WTINT2YR, WTMEC2YR) %>%
  left_join(nhanes_load_data("BMX_H", "2013-2014"), by="SEQN") %>%
  select(SEQN, wave, RIAGENDR, RIDAGEYR, RIDRETH1, RIDEXPRG, INDFMINC, WTINT2YR, WTMEC2YR, BMXBMI) %>%
  left_join(nhanes_load_data("WHQ_H", "2013-2014"), by="SEQN") %>%
  select(SEQN, wave, RIAGENDR, RIDAGEYR, RIDRETH1, RIDEXPRG, INDFMINC, WTINT2YR, WTMEC2YR, BMXBMI, WHQ070)

The list of variables I will use in analysis:

  • SEQN: id
  • RIAGENDR: gender
  • RIDAGEYR: age in years
  • BMXBMI: body mass index
  • INDFMINC: family income
  • RIDRETH1: race/ethnicity
  • RIDEXPRG: pregnancy
  • WHQ070: intention to lose weight
  • WTINT2YR: weighted variable for questionnaires
  • WTMEC2YR: weighted variable for laboratory measurements

Merging the datasets, creating new variables, and excluding pregnant women.

dat = rbind(d99, d01, d03, d09, d11, d13) %>%
  mutate(
    race = recode_factor(RIDRETH1,
                         `1` = "Mexian American",
                         `2` = "Hispanic",
                         `3` = "Non-Hispanic, White",
                         `4` = "Non-Hispanic, Black",
                         `5` = "Others"),
    bmi = if_else(BMXBMI >= 30, "Obese", "Overweight"),
    pregnancy = if_else(RIDEXPRG == 1, "Yes", "No", "No"),
    tryweloss = if_else(WHQ070 == 1, "Yes", "No")
    ) %>%
  filter(BMXBMI >= 25, RIDAGEYR >=20, RIDAGEYR < 60, race != "Others", pregnancy == "No", WHQ070 %in% c(1,2))

I excluded pregnant women from the analysis as they are overweight or obese due to pregnancy and might have no intention to lose weight during pregnancy. The R code below calculates the weighted proportion of overweight and obesity as well as intention to lose weight for the total population.

with(dat, wpct(bmi, weight =  WTMEC2YR))
##      Obese Overweight 
##  0.4997831  0.5002169

with(dat, wpct(tryweloss, weight =  WTINT2YR))
##        No       Yes 
## 0.5235062 0.4764938

The proportion of obese and overweight is 50%, as I excluded the normal weight individuals before. The intention to lose weight in the total population is 48%. Now, I will evaluate the “intention to lose weight” among overweight and obese people in different periods. (I will repeat the same code for each period studied. If you know how to short the code below please let me know. I don't know how to weight the proportion with the tidyverse package)

with(dat %>% filter(wave == "1999-2000", bmi == "Obese"),
     wpct(tryweloss, weight =  WTINT2YR))
##        No       Yes 
## 0.4555818 0.5444182

with(dat %>% filter(wave == "1999-2000", bmi == "Overweight"),
     wpct(tryweloss, weight =  WTINT2YR))
##        No       Yes 
## 0.6000436 0.3999564

There are 54% obese and 40% overweight people who are trying to lose weight in 1999-2000. Using the above, I calculated the proportions for other periods and made a table. I will focus only on those who lost weight to see how the trend goes over years in overweight and obese.

trend <- read.table(header=TRUE, text='
year bmi weightloss
1999-2000 Obese 54.44182
1999-2000 Overweight 39.99564
2001-2002 Obese 54.14932
2001-2002 Overweight 38.60543
2003-2004 Obese 53.6699
2003-2004 Overweight 42.97784
2009-2010 Obese 53.62087
2009-2010 Overweight 35.6233
2011-2012 Obese 57.5303
2011-2012 Overweight 41.99084
2013-2014 Obese 58.32048
2013-2014 Overweight 40.19653
')

trend %>% 
  ggplot(aes(year, weightloss, fill = bmi, group = bmi)) +
  geom_line(color = "black", size = 0.3) +
  geom_point(colour="black", pch=21, size = 3) +
  theme(text = element_text(family = "serif", size = 11), legend.position="bottom") +
  scale_fill_jama() +
  theme_hc() +
  labs(
    title = "Trends of 'trying to lose weight' in the United States",
    caption = "NHANES 1999-2014 survey\ndatascienceplus.com",
    x = "Survey Cycle Years",
    y = "Percentage, %",
    fill = ""
    ) +
  scale_y_continuous(limits=c(30,80), breaks=seq(30,80, by = 10))

Gives this plot:

There is a small increasing trend to lose weight in people with obesity in recent years.

Logistic Regression

A logistic regression model with Poisson distribution is used to evaluate the odds of intention to lose weight tryweloss in different time periods. In this analysis, several confounders were considered such as BMI status, gender, age, family income, and race.

mylogit <- glm(tryweloss2 ~ wave + bmi + RIAGENDR + RIDAGEYR + INDFMINC + race,
  data = dat,
  family = "poisson")
summary(mylogit)
## 
## Call:
## glm(formula = tryweloss2 ~ wave + bmi + RIAGENDR + RIDAGEYR + 
##     INDFMINC + race, family = "poisson", data = dat)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.2590  -0.9171  -0.7429   0.6490   1.1270  
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)             -1.2545203  0.0886399 -14.153  < 2e-16 ***
## wave2001-2002            0.0028610  0.0553019   0.052  0.95874    
## wave2003-2004            0.0277646  0.0562594   0.494  0.62165    
## wave2009-2010           -0.0434281  0.0519876  -0.835  0.40352    
## wave2011-2012            0.0377786  0.0539726   0.700  0.48395    
## wave2013-2014            0.0929779  0.0523326   1.777  0.07562 .  
## bmiOverweight           -0.3519162  0.0306745 -11.473  < 2e-16 ***
## RIAGENDR                 0.4184619  0.0303948  13.768  < 2e-16 ***
## RIDAGEYR                -0.0028538  0.0013459  -2.120  0.03398 *  
## INDFMINC                 0.0008816  0.0010161   0.868  0.38559    
## raceHispanic             0.0847514  0.0584763   1.449  0.14725    
## raceNon-Hispanic, White  0.1199298  0.0387668   3.094  0.00198 ** 
## raceNon-Hispanic, Black -0.0016056  0.0436180  -0.037  0.97064    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 7330.3  on 10232  degrees of freedom
## Residual deviance: 6923.2  on 10220  degrees of freedom
##   (88 observations deleted due to missingness)
## AIC: 16273
## 
## Number of Fisher Scoring iterations: 5

The results show that there are no significant odds (p>0.05) on the intention to lose weight when I compare different periods to 1999-2000. The small sample size could explain these findings for each survey.

However, let me get the exponential of these coefficients and confidence intervals.

exp(coef(mylogit))
##             (Intercept)           wave2001-2002           wave2003-2004 
##               0.2852126               1.0028651               1.0281537 
##           wave2009-2010           wave2011-2012           wave2013-2014 
##               0.9575014               1.0385013               1.0974375 
##           bmiOverweight                RIAGENDR                RIDAGEYR 
##               0.7033391               1.5196224               0.9971503 
##                INDFMINC            raceHispanic raceNon-Hispanic, White 
##               1.0008820               1.0884464               1.1274177 
## raceNon-Hispanic, Black 
##               0.9983957

exp(confint(mylogit))
##                             2.5 %    97.5 %
## (Intercept)             0.2395745 0.3391149
## wave2001-2002           0.8999514 1.1178640
## wave2003-2004           0.9208583 1.1481368
## wave2009-2010           0.8650406 1.0606215
## wave2011-2012           0.9344651 1.1546960
## wave2013-2014           0.9907765 1.2164297
## bmiOverweight           0.6622089 0.7468264
## RIAGENDR                1.4318982 1.6130974
## RIDAGEYR                0.9945248 0.9997859
## INDFMINC                0.9988443 1.0028327
## raceHispanic            0.9697129 1.2196128
## raceNon-Hispanic, White 1.0452581 1.2168270
## raceNon-Hispanic, Black 0.9166538 1.0876073

The odds ratio and 95% confidence intervals (95%CI) of losing weight in the period 2013-2014 is 1.10 (95% CI 0.99-1.22) compare to 1999-2000. The odds ratio (95%CI) of losing weight in overweight compared to obese individuals is significantly lower 0.70 (95% CI 0.66-0.75) indicating that obese individuals are trying more to lose weight than overweight (as we can expect).

My results were different from the paper published in JAMA. There are some differences from this analysis compared to the article. My reference is the period 1999-2000, whereas in the paper is 1988-1994. Moreover, I excluded pregnant women, when the paper keeps those in the analysis.

    Related Post

    1. Machine Learning Results in R: one plot to rule them all! (Part 2 – Regression Models)
    2. Story of pairs, ggpairs, and the linear regression
    3. Extract FRED Data for OLS Regression Analysis: A Complete R Tutorial
    4. MNIST For Machine Learning Beginners With Softmax Regression
    5. Time-dependent ROC for Survival Prediction Models in R

    To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+.

    R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
    Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

    Never miss an update!
    Subscribe to R-bloggers to receive
    e-mails with the latest R posts.
    (You will not see this message again.)

    Click here to close (This popup will not appear again)