# Question and Answer: Generating Binary and Discrete Response Data

August 19, 2013
(This article was first published on Econometrics by Simulation, and kindly contributed to R-bloggers)

I was recently contacted by a reader with two very specific questions, and I thought this would be a good topic to respond to publicly. He would like to simulate his data:
"I have firm-level data, and the model is discrete choice with the main explanatory variable also a binary choice. The first question is: how can I calibrate the data generation model?"

This is a fundamental question for any kind of econometric model. How you calibrate your data determines its inherent structure, which in turn determines what method you should use to recover your parameters. Some data generating processes do not yet have econometric solutions, but many do.

In general, you can calibrate your data by modifying (i) the parameters, (ii) the distribution of the explanatory variables, or (iii) the distribution of the errors.
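As a rough illustration of the first lever (the parameters), the sketch below searches for the intercept that produces a chosen share of successes in a probit model. The 30% target, the slope of 0.5, the seed, and the names `target`, `share_gap`, and `B0` are assumptions made up for this example, not values from the reader's data.

```r
set.seed(101)
Nobs <- 10^4
X1 <- rnorm(Nobs)          # explanatory variable (assumed standard normal)
B1 <- 0.5                  # assumed slope
target <- 0.30             # assumed target share of successes

# Gap between the average success probability and the target for a given intercept
share_gap <- function(B0) mean(pnorm(B0 + B1 * X1)) - target

# Solve for the intercept that closes the gap
B0 <- uniroot(share_gap, interval = c(-5, 5))$root

# Simulate outcomes with the calibrated intercept
Y <- rbinom(Nobs, 1, pnorm(B0 + B1 * X1))
mean(Y)                    # should land close to the target share
```

The same idea carries over to the other two levers: change how `X1` is drawn, or swap `pnorm` for the CDF of a different error distribution, and re-solve for the intercept.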

In the binary response case, the most common models are probit and logit. To simulate data for either one, you generate the underlying linear model and pass it through the appropriate CDF (the standard normal for probit, the logistic for logit), which gives you the probability of a success for each observation. Finally, you make a random draw based on those probabilities for each outcome being simulated.

I have several code examples demonstrating this:
Stata: (Reverse Engineering a Probit) (Probit vs Logit)

R:
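Nobs <- 10^4
# Explanatory variables, including a constant
X <- cbind(cons=1, X1=rnorm(Nobs), X2=rnorm(Nobs), X3=rnorm(Nobs))
B <- c(B0=-.2, B1=-.1, B2=0, B3=-.2)
# Probability of success from the normal CDF
P <- pnorm(X %*% B)
# Draw the binary outcomes and combine them with the explanatory variables
SData <- as.data.frame(cbind(Y=rbinom(Nobs, 1, P), X))
summary(glm(Y ~ X1 + X2 + X3, family = binomial(link = "probit"), data = SData))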
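The same data can also be written in latent-variable form, which makes the role of the error distribution explicit: a success occurs whenever the linear index plus a symmetric error is positive. The sketch below is a minimal illustration of that equivalence, reusing the coefficient values above; a standard normal error gives the probit and a standard logistic error gives the logit. The seed and object names such as `XB`, `Ystar_probit`, and `LData` are my own choices for this example.

```r
set.seed(101)
Nobs <- 10^4
X <- cbind(cons=1, X1=rnorm(Nobs), X2=rnorm(Nobs), X3=rnorm(Nobs))
B <- c(B0=-.2, B1=-.1, B2=0, B3=-.2)
XB <- drop(X %*% B)                     # linear index as a plain vector

# Probit: latent index plus a standard normal error
Ystar_probit <- XB + rnorm(Nobs)
Y_probit <- as.numeric(Ystar_probit > 0)

# Logit: same index, standard logistic error
Ystar_logit <- XB + rlogis(Nobs)
Y_logit <- as.numeric(Ystar_logit > 0)

# Average outcome versus average implied probability
c(mean(Y_probit), mean(pnorm(XB)))
c(mean(Y_logit),  mean(plogis(XB)))

# Recover the logit coefficients from the simulated data
LData <- data.frame(Y=Y_logit, X[, -1])
coef(glm(Y ~ X1 + X2 + X3, family = binomial(link = "logit"), data = LData))
```

Drawing the error directly is convenient when you want to experiment with the third calibration lever, since you can replace `rnorm` or `rlogis` with another distribution and see how a misspecified estimator behaves.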

## Discrete Data
As for discrete data, it is less clear what the optimal choice is. I prefer the multinomial logistic regression, which is basically an extension of the logit model to more than two outcomes, with a few interesting caveats.

Stata: (Simulating Multinomial Logit)
R: (here is an article dealing specifically with using R to create discrete response data http://works.bepress.com/joseph_hilbe/3/)

Nobs <- 10^4
X <- cbind(cons=1, X1=rnorm(Nobs), X2=rnorm(Nobs), X3=rnorm(Nobs))
# Coefficients: each column of B is associated with a different outcome
B <- cbind(0, c(B0=-.2, B1=-.1, B2=0, B3=-.2), c(B0=.3, B1=0, B2=.6, B3=.4))
# Everything is relative to option 1, the default, whose coefficients are fixed at 0
num <- exp(X %*% B)                   # Numerator
den <- rowSums(num)                   # Denominator
P <- num / den                        # Probability of each outcome
CP <- cbind(P[,1], P[,1] + P[,2])     # Cumulative probabilities
U <- runif(Nobs)                      # Draw from the uniform distribution
Y <- rep(0, Nobs); Y[U > CP[,1]] <- 1; Y[U > CP[,2]] <- 2  # Calculate outcome

SData <- as.data.frame(cbind(Y=Y, X)) # Combine data
require("nnet")                       # Provides multinom()
summary(Mlogit <- multinom(Y ~ X1 + X2 + X3, data = SData))
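A quick sanity check on simulated multinomial outcomes is to compare the observed share of each outcome with the average probability the model assigns to it. The sketch below rebuilds the probabilities from the same coefficients and draws the outcomes row by row with `sample()`, an alternative to the cumulative-probability comparison. The seed and the names `shares` and `avgP` are assumptions of this example.

```r
set.seed(101)
Nobs <- 10^4
X <- cbind(cons=1, X1=rnorm(Nobs), X2=rnorm(Nobs), X3=rnorm(Nobs))
B <- cbind(0, c(B0=-.2, B1=-.1, B2=0, B3=-.2), c(B0=.3, B1=0, B2=.6, B3=.4))

num <- exp(X %*% B)
P <- num / rowSums(num)                 # each row sums to one

# Draw one outcome (0, 1 or 2) per row using that row's probabilities
Y <- apply(P, 1, function(p) sample(0:2, size = 1, prob = p))

# Observed outcome shares versus average model probabilities
shares <- as.numeric(table(factor(Y, levels = 0:2))) / Nobs
avgP <- colMeans(P)
round(rbind(shares, avgP), 3)
```

If the two rows differ by more than sampling noise would suggest, the draw step is the first place to look for a bug.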
