Introducing the HCmodelSets Package


By Henrique Helfer Hoeltgebaum

Introduction

I am happy to introduce the package HCmodelSets, which is now available on CRAN. This package implements the methods proposed by Cox and Battey (2017). In particular, it performs the reduction, exploratory and model selection phases described in that reference. The software supports linear regression, likelihood-based fitting of generalized linear regression models and the proportional hazards model fitted by partial likelihood.

The standard approach to sparse regression is the LASSO of Tibshirani (1996), which assumes sparsity of the effects. The LASSO, however, returns a single model, leaving open the possibility that other sparse choices of explanatory variables fit essentially as well. To explore these alternatives, Cox and Battey (2017) provide methods that produce sets of models that are essentially as descriptive as the one fitted by the LASSO. Their key idea is to arrange the variable indices in a hypercube and run regressions over its rows and columns (and, equivalently, over the traversals of higher-dimensional arrangements). Variables found significant are retained and rearranged into a lower-dimensional hypercube, and the process is repeated down to a user-specified low dimension, recording how often each variable is judged significant at each dimension across all fitted models. This strategy explores a large number of sparse candidate models and allows subsets of explanatory variables to be analysed separately; a toy illustration of one such set of traversals is sketched below.
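To fix ideas, here is a toy sketch of one set of traversals (an illustration only, not the package's internal implementation): 1000 variable indices are placed in a 10 x 10 x 10 hypercube and a small regression is fitted along each traversal of one dimension, recording which variables appear significant.

### Toy illustration of the hypercube idea (not the package's internal code)
set.seed(1)
n <- 100; p <- 1000
X <- matrix(rnorm(n * p), n, p)
y <- X[, 1] + X[, 2] + rnorm(n)                 # only variables 1 and 2 matter
cube <- array(1:p, dim = c(10, 10, 10))         # variable indices on a hypercube
signif.count <- integer(p)
for (i in 1:10) for (j in 1:10) {
  idx  <- cube[i, j, ]                          # one traversal: 10 variables
  pval <- summary(lm(y ~ X[, idx]))$coefficients[-1, 4]
  hits <- idx[pval < 0.01]                      # variables significant in this fit
  signif.count[hits] <- signif.count[hits] + 1
}
which(signif.count > 0)                         # candidates retained from this direction

The package repeats such traversals over every direction of the hypercube, rearranges the retained variables into a lower-dimensional hypercube, and iterates.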

Constructing sets of well-fitting models with the HCmodelSets package

Let us now make use of the R package HCmodelSets; its functions are detailed below and used to estimate sets of models when dealing with a large number of explanatory variables. Note that different types of response variable, namely binary and continuous, have distinct statistical properties, which, under some mild conditions, were recently discussed in Battey and Cox (2018).

We demonstrate the procedure of Cox and Battey (2017) on simulated data, checking whether the true model is recovered. Consider a continuous response y_t \in \mathbb{R} that depends on only a small number of the available variables. First we simulate from this data generating process (DGP), assuming continuous values, using the DGP function.

library("HCmodelSets")
dgp = DGP(s=5, a=3, sigStrength=1, rho=0.9, n=100, intercept=5, noise=1,
          var=1, d=1000, DGP.seed = 2019)
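The object returned by DGP contains, among other things, the simulated design matrix, the response and the indices of the truly relevant variables:

dim(dgp$X)      # 100 observations on 1000 candidate variables
dgp$TRUE.idx    # indices of the 5 truly relevant variables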

The DGP sample size is 100 and, out of 1000 candidate variables, only 5 are relevant; their indices are stored in the TRUE.idx element, as shown above. Next we perform the reduction phase on the first 70 observations, which successively discards variables according to appropriate decision rules and, at the end, reports the number and indices of the variables selected at each stage.

outcome.Reduction.Phase =  Reduction.Phase(X=dgp$X[1:70,],Y=dgp$Y[1:70],
                                           family=gaussian, seed.HC = 1093)

One can also use the vector.signif argument to supply a vector of decision rules (significance levels), one for each reduction of the hypercube. If it is not provided, as in the example above, the default selects the two most significant variables in each traversal of the highest-dimensional hypercube and then adopts p-values of 0.01 for the subsequent lower dimensions. Also, since y_t \in \mathbb{R}, family = gaussian must be specified. In the example above we did not provide the dimension of the hypercube to be used in the first-stage reduction, but this can easily be done through the dmHC argument.
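As a hedged illustration (the exact form expected for vector.signif is documented in ?Reduction.Phase), such a call might look as follows, fixing a three-dimensional first-stage hypercube and mirroring the default decision rule described above:

### Sketch only: dmHC sets the dimension of the first-stage hypercube,
### vector.signif supplies one decision rule per reduction stage
outcome.Reduction.Phase.alt = Reduction.Phase(X=dgp$X[1:70,], Y=dgp$Y[1:70],
                                              family=gaussian, dmHC=3,
                                              vector.signif=c(2, 0.01),
                                              seed.HC = 1093)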

The next step is to perform the exploratory phase on the variables retained by the reduction phase, using the same 70 observations; it returns any significant squared and/or interaction terms. We opt to use the variables in the hypercube of dimension 2 that were selected at least once, which retains 25 variables.

idxs = outcome.Reduction.Phase$List.Selection$`Hypercube with dim 2`$numSelected1
outcome.Exploratory.Phase =  Exploratory.Phase(X=dgp$X[1:70,],Y=dgp$Y[1:70],
                                               list.reduction = idxs,
                                               family=gaussian, signif=0.01)
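As a quick check, the number of variables carried into the exploratory phase matches the 25 mentioned above:

length(idxs)

## [1] 25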

In the end, only one squared term turns out to be relevant; together with the retained variables, it will be used to construct sets of well-fitting models on the remaining 30 observations.

sq.terms = outcome.Exploratory.Phase$mat.select.SQ
in.terms = outcome.Exploratory.Phase$mat.select.INTER

MS = ModelSelection.Phase(X=dgp$X[71:100,],Y=dgp$Y[71:100], list.reduction = idxs,
                          sq.terms = sq.terms,in.terms = in.terms, signif=0.01)

Note that since we did not specify the modelSize argument, the procedure adopts the minimum of the total number of candidate terms, namely 25 (the variables retained in the reduction phase) plus 1 (the squared term from the exploratory phase), and 5, i.e. min(26, 5) = 5. Since we set s = 5 in the DGP function, i.e. the true number of variables generating the process, we will analyse the results for models of size 5.
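For completeness, the modelSize argument can also be supplied explicitly; as a sketch, the call below sets it to 5, the value the default resolves to in this example (see ?ModelSelection.Phase for details):

### Equivalent call with the model size given explicitly (sketch)
MS2 = ModelSelection.Phase(X=dgp$X[71:100,],Y=dgp$Y[71:100], list.reduction = idxs,
                           sq.terms = sq.terms, in.terms = in.terms, signif=0.01,
                           modelSize = 5)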

TRUE.idx = dgp$TRUE.idx
sets = sort(table(MS$goodModels$`Model Size 5`), decreasing=TRUE)

### frequency table of variables selected
print(sets)

##
##    379    479    732    438    613    152    901    333    596    408
##  12650  12650  12649  11710  10753   7810   7809   7787   7716   7491
##    756     24    123    366    986    670 333 ^2    335    200    661
##   7283   7276   7257   7233   7193   7160   7119   7108   7099   7094
##    634    171    844     20    225    909
##   7066   7048   7032   7017   7002   6978

### intersection between the true model and the variables obtained in the Model Selection phase
intersect(TRUE.idx,names(sets))

## [1] "152" "379" "479" "613" "901"

Note that the true variables are fully recovered, together with a few other variables and non-linear effects that might also be relevant for fitting the observed process y_t.

References

Cox, D. R., & Battey, H. S. (2017). Large numbers of explanatory variables, a semi-descriptive analysis. Proceedings of the National Academy of Sciences, 114(32), 8592-8595.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), 58(1), 267-288.

Battey, H. S., & Cox, D. R. (2018). Large numbers of explanatory variables: a probabilistic assessment. Proc. R. Soc. A, 474(2215), 20170631.

Hoeltgebaum HH (2018). HCmodelSets: Regression with a Large Number of Potential Explanatory Variables. R package version 0.1.1, URL https://CRAN.R-project.org/package=HCmodelSets.
