IPTW with missing data
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
The tutorial is based on R and StatsNotebook, a graphical interface for R.
This is a follow-up tutorial built on our tutorial on inverse probability treatment weight. In this tutorial, we use the same example, but with some missing data in the dataset. It is recommended that you read our tutorial on inverse probability treatment weight first.
The example data can be downloaded from here .
If your dataset has missing data, multiple imputation can be firstly used to impute the missing value. StatsNotebook provides a shortcut for incorporating multiple imputation with IPTW. The shortcut method will only use variables for the IPTW calculation to impute the missing value. If more auxiliary variables or interaction between variables are needed for the imputation, we can conduct the imputation manually first. This tutorial will demonstrate the shortcut method.
In the presence of missing data,
- Multiple imputation will be used to impute missing values first. As a rule of thumb, the number of imputations needed is roughly the same as the percentage of missing data.
- IPTW is computed for each imputed dataset.
- The outcome model will be run on each imputed dataset.
- Results from different imputed datasets are combined using Rubin’s formula.
StatsNotebook uses the mice
package for multiple imputation, and then uses the mitools
package to combine results from the outcome models to form final estimates using Rubin’s formula.
The section below will provide a step-by-step guide on using IPTW with multiple imputations to answer the research question
- Does cannabis use in follow-up wave 1 to wave 3 (in Grade 8, 9 and 10) cause use of illicit drug at age 25?
In this example dataset, the key variables are
- Academic grade (failed; 0: no failed subjects; failed at least one subject)
- Peers’ cannabis use (peer_can: 0: None; 1: 1-2 friends used cannabis; 2: 3 or more friends used cannabis)
- Antisocial behaviour (antisocial: 0: No antisocial behaviour; 1: engaged in antisocial behaviour)
- Cannabis use (can; 0: No use; 1: used)
- Sex (sex; 0: male; 1: female)
- Family history of drug use (family_drug_use; 0: no; 1: at least one immediate family member used drug. This is only measured at baseline.)
- Illicit drug use at age 25 (illicit: 0: No use; 1: used)
The suffix of each variable indicates the time point the data is collected. For example, peer_can_0 represents the number of peers who used cannabis at baseline (wave 0).
For the detailed setup of this analysis, please see our tutorial on inverse probability treatment weight, in which we used a complete dataset instead of one with missing data.
Using StatsNotebook – Calculating IPTW
Prior to calculating the IPTW, we will need to conduct a descriptive analysis and it is always good practice to visualise the data.
To perform multiple imputation and then to calculate the IPTW for each imputed dataset,
- Click Analysis at the top
- Click Causal in the top menu
- Click Inverse Probability Treatment Weighting (IPTW) in the pop-up menu
- In the left panel, select can_1 into Exposure, and select failed_0, peer_can_0 and antisocial_0 into Time-varying covariates. As mentioned before, since baseline confounders/covariates can be regarded as a special subset of Time-varying covariates at Wave 1, we further select family_drug_use, sex and can_0 (baseline cannabis use) into Time-varying covariates.
- To ensure the temporal order of the variables, we use covariates from baseline to calculate IPTW for the expsoure variable at follow-up wave 1.
- Click the top + button on top of Exposure to add an extra time point (extra wave).
- Select can_2 into Exposure, and select failed_1, peer_can_1 and antisocial_1 into Time-varying covariates.
-
Repeat step 5 and 6 for can_3.
-
Select illicit into Outcome.
- This step is not necessary when there is no missing data. The variable in Outcome is only used for the multiple imputation but not for calculating IPTW.
-
Expand the panel Analysis Setting at the below the variable selection panel.
-
For Distribution family, select Binomial.
- Since the exposure is a dichotomised variable (cannabis use or no cannabis use), logistic regression will be used to estimate the IPTW.
-
Click Impute missing data. In this example, we specify 20 imputations.
- As a rule of thumb, the number of imputations should be roughly the same as the percentage of missing data.
- Click Code and Run. The codes below will be generated and executed.
R codes – Calculating IPTW
library(ipw) currentDataset <- currentDataset[which(currentDataset$.imp == 0),] currentDataset <- currentDataset[!(names(currentDataset) %in% c(".id", ".imp"))] "Impute missing data" library(mice) library(mitools) formulas <- make.formulas(currentDataset) formulas$illicit =illicit ~ can_1 + failed_0 + peer_can_0 + antisocial_0 + can_0 + family_drug_use + sex + can_2 + failed_1 + peer_can_1 + antisocial_1 + can_3 + failed_2 + peer_can_2 + antisocial_2 formulas$can_1 =can_1 ~ illicit + failed_0 + peer_can_0 + antisocial_0 + can_0 + family_drug_use + sex + can_2 + failed_1 + peer_can_1 + antisocial_1 + can_3 + failed_2 + peer_can_2 + antisocial_2 formulas$failed_0 =failed_0 ~ illicit + can_1 + peer_can_0 + antisocial_0 + can_0 + family_drug_use + sex + can_2 + failed_1 + peer_can_1 + antisocial_1 + can_3 + failed_2 + peer_can_2 + antisocial_2 formulas$peer_can_0 =peer_can_0 ~ illicit + can_1 + failed_0 + antisocial_0 + can_0 + family_drug_use + sex + can_2 + failed_1 + peer_can_1 + antisocial_1 + can_3 + failed_2 + peer_can_2 + antisocial_2 formulas$antisocial_0 =antisocial_0 ~ illicit + can_1 + failed_0 + peer_can_0 + can_0 + family_drug_use + sex + can_2 + failed_1 + peer_can_1 + antisocial_1 + can_3 + failed_2 + peer_can_2 + antisocial_2 formulas$can_0 =can_0 ~ illicit + can_1 + failed_0 + peer_can_0 + antisocial_0 + family_drug_use + sex + can_2 + failed_1 + peer_can_1 + antisocial_1 + can_3 + failed_2 + peer_can_2 + antisocial_2 formulas$family_drug_use =family_drug_use ~ illicit + can_1 + failed_0 + peer_can_0 + antisocial_0 + can_0 + sex + can_2 + failed_1 + peer_can_1 + antisocial_1 + can_3 + failed_2 + peer_can_2 + antisocial_2 formulas$sex =sex ~ illicit + can_1 + failed_0 + peer_can_0 + antisocial_0 + can_0 + family_drug_use + can_2 + failed_1 + peer_can_1 + antisocial_1 + can_3 + failed_2 + peer_can_2 + antisocial_2 formulas$can_2 =can_2 ~ illicit + can_1 + failed_0 + peer_can_0 + antisocial_0 + can_0 + family_drug_use + sex + failed_1 + peer_can_1 + antisocial_1 + can_3 + failed_2 + peer_can_2 + antisocial_2 formulas$failed_1 =failed_1 ~ illicit + can_1 + failed_0 + peer_can_0 + antisocial_0 + can_0 + family_drug_use + sex + can_2 + peer_can_1 + antisocial_1 + can_3 + failed_2 + peer_can_2 + antisocial_2 formulas$peer_can_1 =peer_can_1 ~ illicit + can_1 + failed_0 + peer_can_0 + antisocial_0 + can_0 + family_drug_use + sex + can_2 + failed_1 + antisocial_1 + can_3 + failed_2 + peer_can_2 + antisocial_2 formulas$antisocial_1 =antisocial_1 ~ illicit + can_1 + failed_0 + peer_can_0 + antisocial_0 + can_0 + family_drug_use + sex + can_2 + failed_1 + peer_can_1 + can_3 + failed_2 + peer_can_2 + antisocial_2 formulas$can_3 =can_3 ~ illicit + can_1 + failed_0 + peer_can_0 + antisocial_0 + can_0 + family_drug_use + sex + can_2 + failed_1 + peer_can_1 + antisocial_1 + failed_2 + peer_can_2 + antisocial_2 formulas$failed_2 =failed_2 ~ illicit + can_1 + failed_0 + peer_can_0 + antisocial_0 + can_0 + family_drug_use + sex + can_2 + failed_1 + peer_can_1 + antisocial_1 + can_3 + peer_can_2 + antisocial_2 formulas$peer_can_2 =peer_can_2 ~ illicit + can_1 + failed_0 + peer_can_0 + antisocial_0 + can_0 + family_drug_use + sex + can_2 + failed_1 + peer_can_1 + antisocial_1 + can_3 + failed_2 + antisocial_2 formulas$antisocial_2 =antisocial_2 ~ illicit + can_1 + failed_0 + peer_can_0 + antisocial_0 + can_0 + family_drug_use + sex + can_2 + failed_1 + peer_can_1 + antisocial_1 + can_3 + failed_2 + peer_can_2 meth <- make.method(currentDataset) meth["failed_3"] <- "" meth["family_alc_supp_0"] <- "" meth["family_alc_supp_1"] <- "" meth["family_alc_supp_2"] <- "" meth["family_alc_supp_3"] <- "" meth["peer_can_3"] <- "" meth["peer_alc_0"] <- "" meth["peer_alc_1"] <- "" meth["peer_alc_2"] <- "" meth["peer_alc_3"] <- "" meth["antisocial_3"] <- "" meth["alc_0"] <- "" meth["alc_1"] <- "" meth["alc_2"] <- "" meth["alc_3"] <- "" meth["distress"] <- "" meth["alc_dep"] <- "" meth[".ipw0"] <- "" meth[".ipw1"] <- "" meth[".ipw2"] <- "" meth[".final_weight"] <- "" imputedDataset <- parlmice(currentDataset, method = meth, formulas = formulas, m = 20, n.core = 15, n.imp.core = 2) plot(imputedDataset) currentDataset <- complete(imputedDataset, action = "long", include = TRUE) "Calculate IPTW" split_imp <- currentDataset$.imp mi_dataList <- split(currentDataset, split_imp) for(i in 2:length(mi_dataList)) { weight <- ipwpoint(exposure = can_1, family = "binomial", link = "logit", numerator =~ 1, denominator =~ failed_0+peer_can_0+antisocial_0+can_0+family_drug_use+sex, trunc = 0.01, data = as.data.frame(mi_dataList[[i]])) mi_dataList[[i]]$.ipw0 = weight$weights.trunc weight <- ipwpoint(exposure = can_2, family = "binomial", link = "logit", numerator =~ can_1, denominator =~ can_1+failed_0+peer_can_0+antisocial_0+can_0+family_drug_use+sex+failed_1+peer_can_1+antisocial_1, trunc = 0.01, data = as.data.frame(mi_dataList[[i]])) mi_dataList[[i]]$.ipw1 = weight$weights.trunc weight <- ipwpoint(exposure = can_3, family = "binomial", link = "logit", numerator =~ can_1+can_2, denominator =~ can_1+can_2+failed_0+peer_can_0+antisocial_0+can_0+family_drug_use+sex+failed_1+peer_can_1+antisocial_1+failed_2+peer_can_2+antisocial_2, trunc = 0.01, data = as.data.frame(mi_dataList[[i]])) mi_dataList[[i]]$.ipw2 = weight$weights.trunc mi_dataList[[i]]$.final_weight <- mi_dataList[[i]]$.ipw0*mi_dataList[[i]]$.ipw1*mi_dataList[[i]]$.ipw2 } mi_dataList[[1]]$.ipw0 <- NA mi_dataList[[1]]$.ipw1 <- NA mi_dataList[[1]]$.ipw2 <- NA mi_dataList[[1]]$.final_weight <- NA currentDataset <- unsplit(mi_dataList, split_imp) "Chan, G. and StatsNotebook Team (2020). StatsNotebook. (Version 0.1.0) [Computer Software]. Retrieved from https://www.statsnotebook.io" "R Core Team (2020). The R Project for Statistical Computing. [Computer software]. Retrieved from https://r-project.org" "van der Wal, W. M. and R. B. Geskus (2011). ipw: an R package for inverse probability weighting. J Stat Softw 43(13): 1-23."
Given the complexity of the codes, we will only give a conceptual explanation. For a detailed explanation on the specification of the multiple imputation model, see our tutorial on multiple imputation. In brief, the top section of the codes from above specify the variables used to impute each of the variables with missing data, perform the multiple imputation and generate the diagnostic plots for the imputation.
In the second section, we compute the IPTW for each of the imputed dataset using an iterative for
loop.
Six new variables will be created. .imp is the identifier for each of the imputed datasets. The original data will have a value of 0. The .id is the identifier for each observation in each imputed dataset. .ipw0, .ipw1 and .ipw2
Using StatsNotebook - Weighted analysis
After calculating the IPTW, confounding due to included variables in the IPTW calculation will be removed in a weighted analysis. To estimate the causal effect of the cumulative exposure (measured as the number of waves an individual reported using cannabis between follow-up wave 1 and wave 3), we first create a new variable (cumulative) by summing up can_1, can_2 and can_3.
To create this new variable in StatsNotebook,
- Click Data at the top
- Click Compute at the top menu
- Type cumulative in the Target Variable field.
- You can select variable in the Variable list and use + to add the corresponding variables up.
The R code for this operation is simple.
currentDataset$cumulative <- currentDataset$can_1+currentDataset$can_2+currentDataset$can_3
Once we have created the cumulative variable, we can use the survey
package to conduct a weighted analysis to estimate the causal effect of cumulative cannabis use during adolescence on illicit drug use in young adulthood (age 25) to estimate the model parameters from each imputed dataset separately. The mitools
package is used to pool the results to form a set of single estimates using Rubin’s formula. Logistic regression is used because the outcome, illicit drug use, is a dichotomized variable (0: No; 1: Yes).
To conduct a weighted analysis,
- Click Analysis at the top
- Click Regression at the top menu
- Click Logistic regression (linear regression can be used if the outcome is a numeric variable.)
- In the left panel, select illicit into Outcome, cumulative into Covariate and .final_weight into Weight.
- Expand the Analysis setting panel, and check the box This is an Imputed Dataset.
R codes - Weighted analysis
library(mice) library(mitools) library(survey) "Weighted regression" mi_dataList <- currentDataset[currentDataset$.imp != 0,] mi_dataList <- split(mi_dataList, mi_dataList$.imp) mi_dataList <- imputationList(mi_dataList) clus <- svydesign(id =~ 1, weights =~ .final_weight, data = mi_dataList) res <- with(clus, svyglm(illicit ~ cumulative, family = binomial)) summary(MIcombine(res), alpha = 0.05, logeffect = TRUE) res1 <- res[[1]] car::infIndexPlot(res1) "Chan, G. and StatsNotebook Team (2020). StatsNotebook. (Version 0.1.0) [Computer Software]. Retrieved from https://www.statsnotebook.io" "R Core Team (2020). The R Project for Statistical Computing. [Computer software]. Retrieved from https://r-project.org"
The code above will produce multiple warning message. This is expected and not of any concern because we have fractional weights.
###################################################### Warning message: non-integer #successes in a binomial glm! ######################################################
The key results from the code above are as follow.
Multiple imputation results: with(clus, svyglm(illicit ~ cumulative, family = binomial)) MIcombine.default(res) results se (lower upper) missInfo (Intercept) 0.06951018 0.01636046 0.04381953 0.1102628 7 % cumulative 2.15334681 0.48705061 1.38223761 3.3546349 2 %
Interpretation
Results from the weighted logistic regression indicates that cannabis use was strongly associated with future illicit drug use, b = 2.15, 95% CI = [1.38, 3.36] (Your results will be slightly different because multiple imputation is not deterministic - every time the missing data is imputed, the value may be different). For every additional wave an individual used cannabis, the odds of using illicit drugs at age 25 increases by 2.15 times. It is likely that we can interpret this association as causal, as we have adjusted for key confounding (and time-varying) variables from several important domains such as family, school and peer group. However, it is very likely that the true effect will be smaller due to residual confounding (confounding effect not captured by the variables we used in the analysis.)
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.