**R – insightR**, and kindly contributed to R-bloggers)

**By Henrique Helfer Hoeltgebaum**

## Introduction

I am happy to introduce the package HCmodelSets, which is now available on CRAN. This package implements the methods proposed by Cox, D.R. and Battey, H.S. (2017). In particular it performs the reduction, exploratory and model selection phases given in the aforementioned reference. The software supports linear regression, likelihood-based fitting of generalized linear regression models and the proportional hazards model fitted by partial likelihood.

The standard method described in the literature to deal with sparse regression is the LASSO proposed by Tibshirani (1996), which assumes sparsity of the effects. This results in only one single model, leaving open the possibility that other sparse choices of explanatory features fit equally as well. To explore these other sparse possibilities, Cox, D.R. and Battey, H.S. (2017) provide methods that specify alternative models that are essentially equally as descriptive as the models fitted using LASSO. The key idea of Cox, D.R. and Battey, H.S. (2017) is to organize the variable indexes into a hypercube and then run regressions over rows and columns (and equivalently in higher dimensions). Further, significant variables in the hypercube are retained and then organized into a lower dimensional one. This process is repeated until a user specified low dimensional hypercube and returns the number of times each variable is considered significant at each dimension across all models fitted. Such a strategy conducts numerous combinations of a sparse models which allows for the separate analysis of subsets of explanatory variables.

## Constructing sets of well-fitting models with HCmodelSets package

Now let us make use of the R package HCmodelSets, its functions will be used and detailed to estimate model sets when dealing with a high number of exploratory variables. Note that different response variables, namely binary and continuous, have distinct statistical proprieties which, under some mild conditions, were recently discussed in Cox, D.R. and Battey, H.S. (2018).

We demonstrate the procedure of Cox, D.R. and Battey, H.S. (2017) using simulated data to recover the true model. Consider a continuous variable which depends on only a sparse number of variables. Firstly we create this Data Generator Process (DGP) assuming continuous values using the **dgp** function.

library("HCmodelSets") dgp = DGP(s=5, a=3, sigStrength=1, rho=0.9, n=100, intercept=5, noise=1, var=1, d=1000, DGP.seed = 2019)

The DGP sample size is 100 and out of 1000 variables, only 5 are relevant. To access which ones are relevant, just use the **TRUE.idx** value. Next we perform the reduction phase using the first 70 observations, which successively discard variables according to appropriate decision rules. At the end, it provides the number and indexes of variables selected at each stage.

outcome.Reduction.Phase = Reduction.Phase(X=dgp$X[1:70,],Y=dgp$Y[1:70], family=gaussian, seed.HC = 1093)

Also, one can use the **vector.signif** argument to create a vector of p-values proposing its own decision rule for each reduction of the hypercube. If it is not provided, as we did in the example above, the default option selects the two most significant variables in each transverse of the hypercube for the highest dimensional one and then adopts pvalues of 0.01 for the sub-sequential lower dimensions. Also, since , **family=”gaussian”** must be specified. In the former example we did not provided the value of dimension of the hypercube to be used in the first-stage reduction, however this could be easily done by using the **dmHC** argument.

The next step is to perform the exploratory phase on the variables retained through the reduction phase in the same 70 observations, which will return any significant squared and/or interaction terms. We opt to use the variables in the hypercube of dimension 2 that were selected at least once as this retains 25 variables.

idxs = outcome.Reduction.Phase$List.Selection$`Hypercube with dim 2`$numSelected1 outcome.Exploratory.Phase = Exploratory.Phase(X=dgp$X[1:70,],Y=dgp$Y[1:70], list.reduction = idxs, family=gaussian, signif=0.01)

At the end, only one squared variable is relevant, and will be used to construct sets of well-fitting models using the remaining 30 observations.

sq.terms = outcome.Exploratory.Phase$mat.select.SQ in.terms = outcome.Exploratory.Phase$mat.select.INTER MS = ModelSelection.Phase(X=dgp$X[71:100,],Y=dgp$Y[71:100], list.reduction = idxs, sq.terms = sq.terms,in.terms = in.terms, signif=0.01)

Note that since we did not specified the **modelSize** argument, the procedure will adopt the minimum value between the sum of 25 (the number of variables in the reduction phase) and 1 (the number of variables in the exploratory phase) with 5 (i.e. min(26,5)). Since we specified in the **DPG** function **s=5**, i.e., the true number of variables that generates the true process, we will analyze the results of models of size 5.

TRUE.idx = dgp$TRUE.idx sets = sort(table(MS$goodModels$`Model Size 5`), decreasing=TRUE) ### frequency table of variables selected print(sets)

## ## 379 479 732 438 613 152 901 333 596 408 ## 12650 12650 12649 11710 10753 7810 7809 7787 7716 7491 ## 756 24 123 366 986 670 333 ^2 335 200 661 ## 7283 7276 7257 7233 7193 7160 7119 7108 7099 7094 ## 634 171 844 20 225 909 ## 7066 7048 7032 7017 7002 6978

### intersection between the true model and the variables obtained in the Model Selection phase intersect(TRUE.idx,names(sets))

## [1] "152" "379" "479" "613" "901"

Note that the true variables are fully recovered at the end with also a few other variables and non-linear effects which might also be relevant to fit the observed process .

## References

Cox, D. R., & Battey, H. S. (2017). Large numbers of explanatory variables, a semi-descriptive analysis. Proceedings of the National Academy of Sciences, 114(32), 8592-8595.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), 267-288.

Battey, H. S., & Cox, D. R. (2018). Large numbers of explanatory variables: a probabilistic assessment. Proc. R. Soc. A, 474(2215), 20170631.

Hoeltgebaum HH (2018). HCmodelSets: Regression with a Large Number of Potential Explanatory Variables. R package version 0.1.1, URL https://CRAN.R-project.org/package=HCmodelSets.

**leave a comment**for the author, please follow the link and comment on their blog:

**R – insightR**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...