Question and Answer: Generating Binary and Discrete Response Data

August 19, 2013

(This article was first published on Econometrics by Simulation, and kindly contributed to R-bloggers)

I was recently contacted by a reader with two very specific questions, and I thought this would be a good topic to respond to publicly. He would like to simulate his data:
"I have firm level data and the model is discrete choice with the main explanatory variable also a binary choice. First question is how can I calibrate the data generation model?"


This is a fundamental question for any kind of econometric model.  How you calibrate your data determines the inherent structure of your data, which in turn determines what method you should use to attempt to recover your parameters.  Some data generating processes do not yet have econometric solutions, but many do.

In general you can calibrate your data by modifying i. the parameters, ii. the distribution of the explanatory variables, or iii. the distribution of the errors.
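As a rough illustration of these three levers, here is a minimal sketch for a probit with a single regressor. The helper `sim_rate`, its default values, and the 30% target are all hypothetical choices made for this example, not part of the reader's model. It shows that moving the intercept shifts the share of successes, and that with a normal regressor and normal errors the intercept can be solved for in closed form.

```r
set.seed(101)  # assumption: fixed seed so the sketch is reproducible

# Hypothetical helper: simulate a one-regressor probit and return the share of successes
sim_rate <- function(b0, b1 = 0.5, x_sd = 1, n = 1e5) {
  x <- rnorm(n, sd = x_sd)   # (ii) distribution of the explanatory variable
  p <- pnorm(b0 + b1 * x)    # (i) parameters; (iii) normal errors enter through pnorm
  mean(rbinom(n, 1, p))
}

rate_low  <- sim_rate(b0 = -1)  # low intercept: few successes
rate_high <- sim_rate(b0 =  1)  # high intercept: many successes

# Calibrate the intercept to hit a target success rate.
# With x ~ N(0, x_sd^2) and standard normal errors,
# P(Y = 1) = pnorm(b0 / sqrt(1 + b1^2 * x_sd^2)).
target  <- 0.30
b0_star <- qnorm(target) * sqrt(1 + 0.5^2 * 1^2)
rate_star <- sim_rate(b0 = b0_star)
c(low = rate_low, high = rate_high, calibrated = rate_star)
```

When no closed form is available, the same idea works by searching over the parameter numerically until the simulated moment matches the one in your real data.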

In the binary response case the most common models are the probit and the logit. To simulate data for either one, you generate the underlying linear index (X times the coefficients) and pass it through the appropriate CDF (normal for the probit, logistic for the logit), which gives you the probability of a success for each observation.  Finally you make a random Bernoulli draw based on those probabilities for each outcome being simulated.

I have several code examples demonstrating this:
Stata: (Reverse Engineering a Probit) (Probit vs Logit)
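R:
Nobs <- 10^4
X <- cbind(cons = 1, X1 = rnorm(Nobs), X2 = rnorm(Nobs), X3 = rnorm(Nobs))
B <- c(B0 = -.2, B1 = -.1, B2 = 0, B3 = -.2)
P <- pnorm(X %*% B)  # Probability of a success under the probit
# Draw the binary outcome and combine it with the explanatory variables
SData <- as.data.frame(cbind(Y = rbinom(Nobs, 1, P), X))
summary(glm(Y ~ X1 + X2 + X3, family = binomial(link = "probit"), data = SData))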
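The logit case follows the same steps with the logistic CDF in place of the normal one. The sketch below is an assumption-laden variant of the probit code above: it reuses the same coefficient values, swaps `pnorm` for `plogis`, and uses the illustrative names `LData` and `fit`, which do not appear in the original.

```r
set.seed(202)  # assumption: fixed seed so the sketch is reproducible
Nobs <- 10^4
X <- cbind(cons = 1, X1 = rnorm(Nobs), X2 = rnorm(Nobs), X3 = rnorm(Nobs))
B <- c(B0 = -.2, B1 = -.1, B2 = 0, B3 = -.2)
P <- plogis(X %*% B)  # logistic CDF instead of the normal CDF

# Bernoulli draw for each observation, combined with the regressors
LData <- as.data.frame(cbind(Y = rbinom(Nobs, 1, P), X))

# Estimate with the matching link and compare against the true coefficients
fit <- glm(Y ~ X1 + X2 + X3, family = binomial(link = "logit"), data = LData)
round(cbind(true = B, estimate = coef(fit)), 3)
```

With this many observations the estimates should land close to the true values, which is a quick way to confirm that the data generating process and the estimator agree.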

Discrete Data
As for discrete data, it is less clear what the optimal choice is. I prefer the multinomial logit, which is basically an extension of the logit model with a few interesting caveats.

Stata: (Simulating Multinomial Logit)
R: (here is an article dealing specifically with using R to create discrete response data)

Nobs <- 10^4
X <- cbind(cons = 1, X1 = rnorm(Nobs), X2 = rnorm(Nobs), X3 = rnorm(Nobs))
# Coefficients: each column of B is associated with a different outcome
B <- cbind(0, c(B0 = -.2, B1 = -.1, B2 = 0, B3 = -.2), c(B0 = .3, B1 = 0, B2 = .6, B3 = .4))
# Everything is relative to option 1 (Y = 0), the base outcome, whose coefficients are all 0
num <- exp(X %*% B)  # Numerator
den <- rowSums(num)  # Denominator
P <- num / den  # Probability of each outcome
CP <- cbind(P[, 1], P[, 1] + P[, 2])  # Cumulative probabilities
U <- runif(Nobs)  # Draw from the uniform distribution
# Calculate outcome
Y <- rep(0, Nobs)
Y[U > CP[, 1]] <- 1
Y[U > CP[, 2]] <- 2

SData <- as.data.frame(cbind(Y, X))  # Combine data
library(nnet)  # Provides multinom()
summary(Mlogit <- multinom(Y ~ X1 + X2 + X3, data = SData))
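The threshold approach in the code above can be written so that it works for any number of outcomes, and the simulated shares can be checked against the average probabilities. This is only a sketch under the same coefficient values as above; the name `shares` and the use of `rowSums` to count thresholds passed are choices made for this example.

```r
set.seed(303)  # assumption: fixed seed so the sketch is reproducible
Nobs <- 10^4
X <- cbind(cons = 1, X1 = rnorm(Nobs), X2 = rnorm(Nobs), X3 = rnorm(Nobs))
B <- cbind(0, c(-.2, -.1, 0, -.2), c(.3, 0, .6, .4))  # first column is the base outcome
num <- exp(X %*% B)
P <- num / rowSums(num)  # each row of P sums to 1

# Cumulative probabilities by row; drop the last column, which is always 1
CP <- t(apply(P, 1, cumsum))
CP <- CP[, -ncol(CP), drop = FALSE]

U <- runif(Nobs)
Y <- rowSums(U > CP)  # number of thresholds passed: 0, 1, ..., K - 1

# Simulated outcome shares versus the average choice probabilities
shares <- as.numeric(table(factor(Y, levels = 0:(ncol(P) - 1)))) / Nobs
round(rbind(simulated = shares, expected = colMeans(P)), 3)
```

If the two rows differ by more than sampling noise, the draw step is not consistent with the probabilities, which is worth ruling out before interpreting any estimates.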
