How to create a loop to run multiple regression models

Posted on February 6, 2017 by Klodian Dhana in R bloggers | 0 Comments

[This article was first published on DataScience+, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

A friend asked me whether I can create a loop which will run multiple regression models. She wanted to evaluate the association between 100 dependent variables (outcome) and 100 independent variable (exposure), which means 10,000 regression models. Regression models with multiple dependent (outcome) and independent (exposure) variables are common in genetics.

So models will be something like this: (dx is dependent and ix is independent variable, v are other variables)

dx1 = ix1 + v1 + v2 + v3
dx1 = ix2 + v1 + v2 + v3
dx1 = ix3 + v1 + v2 + v3
...
dx1 = ix100 + v1 + v2 + v3
dx2 = ix1 + v1 + v2 + v3
...
dx100 = ix100 + v1 + v2 + v3

The output should be a data frame with 5 columns, including dependent variable, independent variable, beta estimate, standard error and the p-value.
Something like this (those numbers are just for illustration purposes):

  d   i   beta se    pvalue
1 dx1 ix1 0.1  0.002 0.950
2 dx2 ix2 0.2  0.002 0.826
3 dx3 ix3 0.3  0.005 0.123

OK, now lets begin: the dataset that I received had all the variables in columns and observations in rows (the data is not real, just random numbers for illustration purposes):

id dx1 dx2 ... dx100 ix1 ... 1x100 v1 v2 v3
10 324 124 ... 214   32 ...  32    ax b4 c3
11 431 982 ... 114   12 ...  77    ce b2 c5
12 545 123 ... 104   34 ...  11    ar c2 a5
....

Position of variables

Create vectors for the position of the dependent and independent variables in your dataset.

# outcome
out_start=2
out_end= 101
out_nvar=out_end-out_start+1

out_variable=rep(NA, out_nvar)
out_beta=rep(NA, out_nvar)
out_se = rep(NA, out_nvar)
out_pvalue=rep(NA, out_nvar)

# exposure
exp_start=102
exp_end=203
exp_nvar=exp_end-exp_start+1

exp_variable=rep(NA, exp_nvar)
exp_beta=rep(NA, exp_nvar)
exp_se = rep(NA, out_nvar)
exp_pvalue=rep(NA, exp_nvar)

number=1

For Loop

I used linear mixed effect model and therefore I loaded the lme4 library. The loop should work with other regression analysis (i.e. linear regression), if you modify it according to your regression model. If you don’t know which part to modify, leave a comment below and I will try to help.

As other loops, this call variables of interest one by one and for each of them extract and store the betas, standard error and p value. Remember, this code is specific for linear mixed effect models.

library(lme4)
for (i in out_start:out_end){
  outcome = colnames(dat)[i]
  for (j in exp_start:exp_end){
    exposure = colnames(dat)[j]
    model <- lmer(get(outcome) ~ get(exposure) + v1 + (1|v2) + (1|v3),
      na.action = na.exclude,
      data=dat)

    Vcov <- vcov(model, useScale = FALSE)
    beta <- fixef(model)
    se <- sqrt(diag(Vcov))
    zval <- beta / se
    pval <- 2 * pnorm(abs(zval), lower.tail = FALSE)
    
    out_beta[number] = as.numeric(beta[2])
    out_se[number] = as.numeric(se[2])
    out_pvalue[number] = as.numeric(pval[2])
    out_variable[number] = outcome
    number = number + 1
    
    exp_beta[number] = as.numeric(beta[2])
    exp_se[number] = as.numeric(se[2])
    exp_pvalue[number] = as.numeric(pval[2])
    exp_variable[number] = exposure
    number = number + 1
  }
}

Create a dataframe with results:

outcome = data.frame(out_variable, out_beta, out_se, out_pvalue)
exposure = data.frame(exp_variable, exp_beta, exp_se, exp_pvalue)

Management of the dataframe

We have 2 different dataframes with our results and we need to combine in one. With the help of tidyverse package this is a simple task. Basically we rename variables by giving the same name and after we merge both dataframes together.

library(tidyverse)
outcome = outcome %>% 
  rename(
    variable = out_variable,
    beta = out_beta,
    se = out_se,
    pvalue = out_pvalue,
    obs = out_nobs
    )
exposure = exposure %>% 
  rename(
    variable = exp_variable,
    beta = exp_beta,
    se = exp_se,
    pvalue = exp_pvalue,
    obs = exp_nobs
    )
all = rbind(outcome, exposure)
all = na.omit(all)

head(all)
     variable beta se    pvalue
1    dx1      0.1  0.002 0.950
3    dx2      0.2  0.002 0.826
........
2    ix1      0.1  0.002 0.950
4    ix2      0.2  0.002 0.826
........

Yet, this is not a dataframe that we are looking for. We need a dataframe to have both dependent and independent variables in one row. Therefore, we do the final transformation as follows:

data = all %>% 
  mutate(
    type = substr(variable, 1, 2)
  ) %>% 
  spread(type, variable) %>% 
  rename(
    d = dx,
    i = ix
  ) %>% 
  mutate (
    beta = round(beta, 5),
    se = round(se, 5),
    pvalue = round(pvalue, 5)
  ) %>% 
  select(d, i, beta, se, pvalue)

head(data)
  d   i   beta se    pvalue
1 dx1 ix1 0.1  0.002 0.950
2 dx2 ix2 0.2  0.002 0.826
3 dx3 ix3 0.3  0.005 0.123

I hope you find this post useful for your research and data analysis!

Related Post

To leave a comment for the author, please follow the link and comment on their blog: DataScience+.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Position of variables

For Loop

Management of the dataframe

Related

Never miss an update! Subscribe to R-bloggers to receive e-mails with the latest R posts. (You will not see this message again.)

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)