simstudy update: ordinal data generation that violates proportionality


Version 0.4.0 of simstudy is now available on CRAN and GitHub. This update includes two enhancements (and at least one major bug fix): genOrdCat now includes an argument to generate ordinal data without the assumption of cumulative proportional odds, and two new functions, defRepeat and defRepeatAdd, make it a bit easier to define multiple variables that share the same distribution assumptions.

Ordinal data

In simstudy, it is relatively easy to specify multinomial distributions that characterize categorical data. Order becomes relevant when the categories take on meanings related to strength of opinion or agreement (as in a Likert-type response) or frequency. A motivating example is a response variable that takes on four possible values: (1) strongly disagree, (2) disagree, (3) agree, (4) strongly agree. There is a natural order to the response possibilities.

If we are interested in comparing responses for two groups (say an exposed group and an unexposed group), we can look at the cumulative odds of the response at each level \(x\):

\[\small{\frac{P(response > x|exposed)}{P(response \le x|exposed)} \ \ vs. \ \frac{P(response > x|unexposed)}{P(response \le x|unexposed)}}.\]

The comparison is often expressed as the ratio of those two odds – the cumulative odds ratio (cOR) – or as the log of the odds ratio.
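
As a small illustration, using the same probabilities that serve as the base probabilities below, the cumulative odds can be computed directly:

p <- c(0.35, 0.25, 0.20, 0.20) # P(response = x) for x = 1, ..., 4
cump <- cumsum(p)              # P(response <= x)
(1 - cump) / cump              # cumulative odds at each level
## [1] 1.8571429 0.6666667 0.2500000 0.0000000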

genOrdCat

The way to generate ordered categorical data in simstudy is with the function genOrdCat. The probabilities of the response categories are specified for the reference group – in this case the unexposed. The effect of exposure (and any other covariates) is expressed through an adjustment variable (here z). In the data generating process defined below, we are saying that the cumulative odds for the exposed are about half the odds for the unexposed at each level of response x. This is the proportional odds assumption, and on the log(OR) scale this is \(log(0.5) \approx -0.7\).

baseprobs <- c(0.35, 0.25, 0.20, 0.20)

defA <- defData(varname = "exposed", formula = "1;1", dist = "trtAssign")
defA <- defData(defA, varname = "z", formula = "-0.7 * exposed", dist = "nonrandom")

set.seed(230)

dd <- genData(25000, defA)
dx <- genOrdCat(dd, adjVar = "z", baseprobs, catVar = "response")

Here is a manual calculation of the observed probabilities and odds (for a more visual interpretation of all of this, see this description):

dp <- dx[, .(n = .N), keyby = .(exposed, response)]   # counts by group and level
dp[, p := n/sum(n), keyby = .(exposed)]               # P(response = x)
dp[, cump := round(cumsum(p), 3), keyby = .(exposed)] # P(response <= x)
dp[, codds := (1 - cump)/cump]                        # cumulative odds
dp[, lcodds := log(codds)]                            # log cumulative odds
dp
##    exposed response    n    p cump codds lcodds
## 1:       0        1 4406 0.35 0.35  1.84   0.61
## 2:       0        2 3168 0.25 0.61  0.65  -0.43
## 3:       0        3 2471 0.20 0.80  0.24  -1.41
## 4:       0        4 2455 0.20 1.00  0.00   -Inf
## 5:       1        1 6616 0.53 0.53  0.89  -0.12
## 6:       1        2 2860 0.23 0.76  0.32  -1.14
## 7:       1        3 1638 0.13 0.89  0.12  -2.08
## 8:       1        4 1386 0.11 1.00  0.00   -Inf

We can calculate the cumulative odds ratio at each response level …

dc <- dcast(dp, response ~ exposed, value.var = "codds")
dc[, cOR := `1`/`0`]
dc
##    response    0    1  cOR
## 1:        1 1.84 0.89 0.48
## 2:        2 0.65 0.32 0.49
## 3:        3 0.24 0.12 0.51
## 4:        4 0.00 0.00  NaN

and the log(cOR):

dc <- dcast(dp, response ~ exposed, value.var = "lcodds")
dc[, lcOR := `1` - `0`]
dc
##    response     0     1  lcOR
## 1:        1  0.61 -0.12 -0.73
## 2:        2 -0.43 -1.14 -0.71
## 3:        3 -1.41 -2.08 -0.67
## 4:        4  -Inf  -Inf   NaN

Estimating the parameters of the model using the clm function in the ordinal package, we can recover the original parameters quite well. Note that the threshold coefficients are the log odds of \(P(response \le x)/P(response > x)\) at each response level for the reference group (the unexposed) – the negative of the log cumulative odds calculated above.

library(ordinal)
clmFit <- clm(response ~ exposed, data = dx)
summary(clmFit)
## formula: response ~ exposed
## data:    dx
## 
##  link  threshold nobs  logLik    AIC      niter max.grad cond.H 
##  logit flexible  25000 -31750.20 63508.39 4(0)  6.05e-07 1.9e+01
## 
## Coefficients:
##         Estimate Std. Error z value Pr(>|z|)    
## exposed  -0.7146     0.0236   -30.3   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Threshold coefficients:
##     Estimate Std. Error z value
## 1|2  -0.6017     0.0176   -34.3
## 2|3   0.4305     0.0173    24.9
## 3|4   1.3941     0.0201    69.3

A plot of the modeled cumulative probabilities (the lines) shows that the proportional odds assumption fits the observed data (the points) quite well.
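
A minimal sketch of how such a plot could be drawn from the fit (the plotting code here is an assumption, not the original post's code): under the clm parameterization, \(logit(P(response \le x)) = \theta_x - \beta \cdot exposed\).

library(ggplot2)

theta <- coef(clmFit)[c("1|2", "2|3", "3|4")] # threshold coefficients
beta <- coef(clmFit)["exposed"]               # exposure effect

# modeled cumulative probabilities for each group and response level
fitdat <- data.table(expand.grid(response = 1:3, exposed = 0:1))
fitdat[, cump := plogis(theta[response] - beta * exposed)]

ggplot() +
  geom_line(aes(x = response, y = cump, color = factor(exposed)), data = fitdat) +
  geom_point(aes(x = as.integer(response), y = cump, color = factor(exposed)),
             data = dp[as.integer(response) < 4]) +
  labs(y = "P(response <= x)", color = "exposed")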

Non-proportional odds

With the recent update, it is now possible to generate data that violate the proportionality assumption using the new arguments npVar and npAdj. npVar indicates the variable(s) for which the proportional odds assumption is violated, and npAdj is a vector that specifies the extent and direction of the violation at each level of the response (on the logit scale). (Since the cumulative odds ratio at the highest response level is undefined, the final value in the vector has no impact.)

dx <- genOrdCat(dd, baseprobs = baseprobs, catVar = "response", adjVar = "z",
                npVar = "exposed", npAdj = c(1.0, 0, -1.0, 0))
dp <- dx[, .(n = .N), keyby = .(exposed, response)]
dp[, p := n/sum(n), keyby = .(exposed)]
dp[, cump := round(cumsum(p),3), keyby = .(exposed)]
dp[, codds := (1-cump)/cump]
dp[, lcodds := log(codds)]
dp
##    exposed response    n     p cump codds lcodds
## 1:       0        1 4351 0.348 0.35 1.874   0.63
## 2:       0        2 3145 0.252 0.60 0.667  -0.41
## 3:       0        3 2444 0.196 0.80 0.258  -1.36
## 4:       0        4 2560 0.205 1.00 0.000   -Inf
## 5:       1        1 3506 0.280 0.28 2.571   0.94
## 6:       1        2 5784 0.463 0.74 0.346  -1.06
## 7:       1        3 2629 0.210 0.95 0.048  -3.03
## 8:       1        4  581 0.046 1.00 0.000   -Inf

We can see that the cumulative OR at response level 2 remains close to 0.5, but the cORs shift away from 0.5 at response levels 1 and 3.

dc <- dcast(dp, response ~ exposed, value.var = "codds")
dc[, cOR := `1`/`0`]
dc
##    response    0     1  cOR
## 1:        1 1.87 2.571 1.37
## 2:        2 0.67 0.346 0.52
## 3:        3 0.26 0.048 0.19
## 4:        4 0.00 0.000  NaN

On the log odds scale, it is possible to see the direct effect of the values specified in the adjustment vector npAdj. The observed log cumulative OR at response level 1 is \(1.0 - 0.7 = 0.3\), and at level 3 it is \(-1.0 - 0.7 = -1.7\):

dc <- dcast(dp, response ~ exposed, value.var = "lcodds")
dc[, lcOR := `1` - `0`]
dc
##    response     0     1  lcOR
## 1:        1  0.63  0.94  0.32
## 2:        2 -0.41 -1.06 -0.66
## 3:        3 -1.36 -3.03 -1.68
## 4:        4  -Inf  -Inf   NaN

The lack of proportionality is confirmed by a plot of the model fit under a proportional odds assumption along with the observed cumulative proportions. Since the model imposes proportionality, the observed points no longer lie along the prediction lines.
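
This can also be checked formally; for example, the ordinal package's nominal_test function compares the proportional odds fit with one that allows threshold-specific exposure effects (a sketch, not part of the original analysis):

npFit <- clm(response ~ exposed, data = dx) # proportional odds fit to the new data
nominal_test(npFit)                         # LR test of threshold-specific effects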

Generating multiple variables with a single definition

defRepeat is a new function that allows us to specify multiple versions of a variable based on a single set of distribution assumptions. (There is a similar function, defRepeatAdd, to be used for definitions when adding data to an existing data set.) The function adds nVars variables to the data definition table, all sharing the same set of distribution assumptions. The names of the variables are based on the prefix argument, and the distribution assumptions are specified just as they are in defData. Calls to defRepeat can be integrated with calls to defData.

def <- defRepeat(nVars = 4, prefix = "g", formula = "1/3;1/3;1/3", 
   variance = 0, dist = "categorical")
def <- defData(def, varname = "a", formula = "1;1", dist = "trtAssign")
def <- defRepeat(def, 3, "b", formula = "5 + a", variance = 3, dist = "normal")
def <- defData(def, "y", formula = "0.10", dist = "binary")

def
##    varname     formula variance        dist     link
## 1:      g1 1/3;1/3;1/3        0 categorical identity
## 2:      g2 1/3;1/3;1/3        0 categorical identity
## 3:      g3 1/3;1/3;1/3        0 categorical identity
## 4:      g4 1/3;1/3;1/3        0 categorical identity
## 5:       a         1;1        0   trtAssign identity
## 6:      b1       5 + a        3      normal identity
## 7:      b2       5 + a        3      normal identity
## 8:      b3       5 + a        3      normal identity
## 9:       y        0.10        0      binary identity
genData(5, def)
##    id g1 g2 g3 g4 a  b1  b2  b3 y
## 1:  1  1  3  3  1 1 8.7 8.2 7.2 0
## 2:  2  1  2  2  1 0 2.0 4.6 2.8 0
## 3:  3  2  2  2  1 1 3.2 6.7 4.8 0
## 4:  4  1  2  2  1 1 6.3 6.6 8.5 0
## 5:  5  2  3  1  2 0 5.2 5.1 9.5 0
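
defRepeatAdd works the same way when defining variables to be added to an existing data set; here is a brief sketch (assuming its arguments mirror defRepeat, with addColumns doing the generation):

dn <- genData(5, def)
defAdd <- defRepeatAdd(nVars = 2, prefix = "x", formula = "a + 3",
                       variance = 2, dist = "normal")
addColumns(defAdd, dn)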
