Sampling Arbitrary Data

January 3, 2016

(This article was first published on sweissblaug, and kindly contributed to R-bloggers)

Introduction:
Simulating data usually starts from a variance-covariance matrix, which restricts the simulation to linear relationships between the variables. That linear assumption can miss important non-linear relationships. This post uses quantile random forests to simulate an arbitrary dataset.
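To illustrate the point about linearity, here is a minimal sketch in Python (chosen only for illustration; it is not the code behind this post). The toy dataset is an assumption: y is fully determined by x, yet their covariance is essentially zero, so a simulation driven only by a variance-covariance matrix would treat the two variables as unrelated.

```python
# Toy data (an assumption for illustration): y depends on x exactly, but not linearly.
xs = [i / 50 - 1 for i in range(101)]   # evenly spaced on [-1, 1]
ys = [x * x for x in xs]                # y = x^2

def mean(values):
    return sum(values) / len(values)

def covariance(a, b):
    """Population covariance of two equally long sequences."""
    ma, mb = mean(a), mean(b)
    return sum((u - ma) * (v - mb) for u, v in zip(a, b)) / len(a)

cov_xy = covariance(xs, ys)
print(f"cov(x, y) = {cov_xy:.2e}")  # essentially zero despite the exact dependence
```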

MCMC Sampling:
MCMC sampling can simulate from any multivariate density as long as the full conditionals are available. A full conditional here is a model of one variable given all the other variables. To make this concrete, suppose we have a dataset with n variables (x1, x2, …, xn). The estimated full conditionals are:

f(x1 | x2, x3, …, xn)
f(x2 | x1, x3, …, xn)
.
.
f(xn | x1, x2, …, x(n-1))
where f() is a machine learning algorithm of choice that can provide a distribution of predicted values. In this case I use quantregForest() for continuous variables and randomForest() for categorical variables.
Once the models are built, the algorithm is as follows:
– Choose a random observation as the starting point
– For each of N iterations
     – For each variable
           – Sample a proposal for that variable from its predicted distribution and compare its likelihood with that of the current value. Accept the proposal with probability min(p(proposal)/p(current), 1)
Code can be found here.
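Separately from the linked code, the loop above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the implementation used for this post: the random forest models are replaced by a hypothetical nearest-neighbour stand-in (`neighbour_values`), p(.) is assumed to be a Gaussian kernel density over the values that stand-in returns, and every function name, parameter, and the toy dataset is made up for the example.

```python
import math
import random

def neighbour_values(data, state, j, k):
    """Values of variable j in the k rows closest to `state` on all other variables.

    Hypothetical stand-in for a fitted full conditional f(x_j | all other x).
    Distances are unscaled, which is only reasonable for comparable variables.
    """
    def distance(row):
        return sum((row[i] - state[i]) ** 2 for i in range(len(state)) if i != j)
    return [row[j] for row in sorted(data, key=distance)[:k]]

def density(values, x, bandwidth):
    """Gaussian kernel density estimate at x, used here as p(.) (an assumption)."""
    total = sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values)
    return total / (len(values) * bandwidth * math.sqrt(2 * math.pi))

def simulate(data, n_iter, k=10, bandwidth=0.3, seed=0):
    rng = random.Random(seed)
    state = list(rng.choice(data))          # random observation as starting point
    draws = []
    for _ in range(n_iter):                 # each iteration ...
        for j in range(len(state)):         # ... visits each variable in turn
            values = neighbour_values(data, state, j, k)
            proposal = rng.choice(values)   # sample from the predicted distribution
            p_new = density(values, proposal, bandwidth)
            p_old = density(values, state[j], bandwidth)
            # Accept with probability min(p(proposal) / p(current), 1).
            accept = 1.0 if p_old == 0.0 else min(p_new / p_old, 1.0)
            if rng.random() < accept:
                state[j] = proposal
        draws.append(tuple(state))
    return draws

# Toy two-variable dataset (an assumption): a noisy parabola.
toy_rng = random.Random(1)
toy = []
for _ in range(200):
    x = toy_rng.uniform(-2, 2)
    toy.append((x, x * x + toy_rng.gauss(0, 0.1)))

simulated = simulate(toy, n_iter=300)
print(simulated[:3])
```

Because every proposal is drawn from observed values, each simulated coordinate is always a value that occurs in the original data; a model that predicts a smoother distribution would not have that property.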
Results:
A random forest model was trained to predict whether an observation came from the simulated or the original data. The resulting out-of-sample KS test was not significant, so we retain the null hypothesis that the model can’t distinguish between original and simulated data.
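The post does not say exactly which quantities the KS test compared. One plausible reading, assumed here, is that it compared the classifier's held-out predicted probabilities for original rows against those for simulated rows. A minimal Python sketch of a two-sample KS test under that assumption follows; the function names are hypothetical, the p-value uses the asymptotic Kolmogorov approximation, and the scores are made-up placeholders rather than results from this post.

```python
import math
from bisect import bisect_right

def ks_statistic(a, b):
    """Largest gap between the empirical CDFs of samples a and b."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    gap = 0.0
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap

def ks_p_value(d, n, m, terms=100):
    """Asymptotic two-sided p-value; only a rough guide for small samples."""
    lam = d * math.sqrt(n * m / (n + m))
    if lam < 0.3:
        # The p-value is essentially 1 here and the series converges slowly.
        return 1.0
    series = sum((-1) ** (j - 1) * math.exp(-2.0 * (j * lam) ** 2)
                 for j in range(1, terms + 1))
    return max(0.0, min(1.0, 2.0 * series))

# Placeholder held-out scores (made up for illustration only).
scores_original = [0.48, 0.52, 0.45, 0.55, 0.50, 0.47]
scores_simulated = [0.51, 0.49, 0.53, 0.46, 0.50, 0.54]

d = ks_statistic(scores_original, scores_simulated)
p = ks_p_value(d, len(scores_original), len(scores_simulated))
print(f"D = {d:.3f}, approximate p = {p:.3f}")
```

A large p-value means the two score distributions look alike, which is the sense in which the classifier fails to separate the datasets.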

That is a good result, but visually we can still see a difference between the two datasets, shown below.
Original Iris Data
Simulated Iris Data Set

The simulated dataset shows sharper boundaries between points than the original dataset does. For example, the scatterplot of Sepal.Length against Sepal.Width shows a sharp boundary at a Sepal.Length of approximately 5 in the simulated data, whereas the transition is more gradual in the original data.

Conclusion:
This post discussed a method to simulate an arbitrary dataset. While I could not build a model that distinguishes the two datasets, they are visually distinguishable.
