
How to generate data from a model – Part 2

[This article was first published on R-posts.com, and kindly contributed to R-bloggers.]


Summary

Traditionally, data scientists have built models from data. This article shows how to do the opposite, i.e. generate data from a model. It is the second in a series of articles on generating data from a model.
You can find part 1 here


Broadly speaking, generating data from a model takes the three steps given below.

Step 1: Register for the API


Step 2: Install R Package Conjurer

Install the latest version of the package from CRAN as follows.
install.packages("conjurer")

Step 3: Generate data from model

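The function used to generate data from a model is buildModelData(numOfObs, numOfVars, key, modelObj).
As the examples in this article use them, the components of buildModelData are as follows.
numOfObs: the number of observations (rows) to generate.
numOfVars: the number of independent variables to generate (1 in the regression example below, which yields the single column iv1).
key: the subscription key obtained when registering for the API in Step 1.
modelObj: the model object to generate data from, such as the lm object in the regression example. It is left out when completely random data is wanted.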


To generate completely random data, call the function without a model object.
library(conjurer)
uncovrJson <- buildModelData(numOfObs = 1000, numOfVars = 3, key = "input your subscription key here")
df <- extractDf(uncovrJson = uncovrJson)

Generate data based on the model object provided. For this example, a simple linear regression model is used.
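library(conjurer)
library(datasets)
data(cars)
# Fit a simple linear regression of stopping distance on speed
m <- lm(formula = dist ~ speed, data = cars)
# Request 100 synthetic observations based on the fitted model
uncovrJson <- buildModelData(numOfObs = 100, numOfVars = 1, key = "insert subscription key here", modelObj = m)
df <- extractDf(uncovrJson = uncovrJson)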
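Both examples pass the subscription key as a literal string. A minimal sketch of one way to keep the key out of the script is shown below. The helper getSubscriptionKey and the environment variable name UNCOVR_KEY are assumptions made for illustration; neither is defined by conjurer or by the uncovr API.

```r
# Hypothetical helper (not part of conjurer): read the subscription key
# from an environment variable instead of hard-coding it in the script.
# The default variable name "UNCOVR_KEY" is an assumption.
getSubscriptionKey <- function(envVar = "UNCOVR_KEY") {
  key <- Sys.getenv(envVar, unset = "")
  if (!nzchar(key)) {
    stop("Environment variable ", envVar, " is not set.")
  }
  key
}

# Usage (requires the conjurer package and a valid key, so it is not run here):
# uncovrJson <- buildModelData(numOfObs = 100, numOfVars = 1,
#                              key = getSubscriptionKey(), modelObj = m)
```

The variable can be set for the current session with Sys.setenv(UNCOVR_KEY = "...") or in a local .Renviron file that is not shared with the script.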

Interpretation of results

The data frame df (in the code above) has two columns named iv1 and dv. Columns with the prefix iv are independent variables, while dv is the dependent variable. You can rename them to suit your needs. In the example above, iv1 corresponds to speed and dv to dist (stopping distance).

The details of the model formula and its estimated performance can also be inspected. A simple comparison of the synthetic data with the original data can be made by calling summary() on both data frames, as shown below.

summary(cars)
     speed           dist
 Min.   : 4.0   Min.   :  2.00
 1st Qu.:12.0   1st Qu.: 26.00
 Median :15.0   Median : 36.00
 Mean   :15.4   Mean   : 42.98
 3rd Qu.:19.0   3rd Qu.: 56.00
 Max.   :25.0   Max.   :120.00

summary(df)
      iv1               dv
 Min.   : 4.080   Min.   :-38.76
 1st Qu.: 8.915   1st Qu.: 29.35
 Median :16.844   Median : 47.03
 Mean   :15.405   Mean   : 46.13
 3rd Qu.:20.461   3rd Qu.: 75.83
 Max.   :24.958   Max.   :127.66
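Beyond summary(), the generic columns can be renamed and the same formula refitted on the synthetic data to see how close the coefficients are to the original model. The sketch below is an illustration under one loud assumption: the data frame df is simulated locally as a stand-in for the output of extractDf(), so that the code runs without a subscription key. The simulation is not the algorithm used by conjurer or uncovr.

```r
library(datasets)
data(cars)
m <- lm(dist ~ speed, data = cars)

# ASSUMPTION: stand-in for the data frame returned by extractDf().
# Simulated locally so this runs without an API key; this is NOT how
# conjurer/uncovr generates data.
set.seed(1)
iv1 <- runif(100, min = min(cars$speed), max = max(cars$speed))
df <- data.frame(
  iv1 = iv1,
  dv  = unname(predict(m, newdata = data.frame(speed = iv1))) +
    rnorm(100, sd = sigma(m))
)

# Rename the generic columns to match the original variables
names(df)[names(df) == "iv1"] <- "speed"
names(df)[names(df) == "dv"] <- "dist"

# Refit the same formula on the synthetic data and compare coefficients
mSynth <- lm(dist ~ speed, data = df)
comparison <- rbind(original = coef(m), synthetic = coef(mSynth))
print(comparison)
```

With a real df from extractDf(), only the renaming and refitting steps are needed; the simulation block can be dropped.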

Limitations and Future Work

Some of the known limitations of this algorithm are as follows.
These limitations will be addressed in future versions. More specifically, the distributions of the independent variables and error terms will be engineered further.
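The summary of df shown earlier has a minimum dv of -38.76 and a maximum of 127.66, while dist in cars ranges from 2 to 120. One simple way to flag such values is to measure the share of synthetic values that fall outside the range of the original variable. The helper shareOutsideRange below is hypothetical and not part of conjurer. The input vector contains only the five quantile values reported in summary(df), not the full synthetic data, so the result is illustrative.

```r
library(datasets)
data(cars)

# Hypothetical helper (not part of conjurer): share of synthetic values
# that fall outside the range observed in the original variable.
shareOutsideRange <- function(synthetic, original) {
  lims <- range(original, na.rm = TRUE)
  mean(synthetic < lims[1] | synthetic > lims[2], na.rm = TRUE)
}

# Illustrative input: min, quartiles and max of dv from summary(df)
illustrativeDv <- c(-38.76, 29.35, 47.03, 75.83, 127.66)
shareOutsideRange(illustrativeDv, cars$dist)  # 0.4: the min and max lie outside [2, 120]
```

In practice the full dv column would be passed as the first argument.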

Concluding Remarks

The underlying API, uncovr, is under development by FOYI. As new functionality is released, the R package conjurer will be updated to reflect those changes. Your feedback is valuable. For feature requests or bug reports, please follow the contribution guidelines in the GitHub repository. To keep up with future releases and news, please follow our LinkedIn page.
How to generate data from a model – Part 2 was first posted on January 21, 2023 at 4:47 am.