Many statistical modeling problems reduce to a minimization problem of the general form

$$\hat{\beta} = \arg\min_{\beta} \, L(\beta; y) + P(\beta), \tag{1}$$

or to a constrained problem of the form

$$\hat{\beta} = \arg\min_{\beta} \, L(\beta; y) \tag{2}$$

$$\text{subject to} \quad P(\beta) \leq t, \tag{3}$$

where $L$ is some type of loss function, $y$ denotes the data, and $P$ is a penalty, also referred to by other names, such as "regularization term" (problems (1) and (2)–(3) are often equivalent, by the way). Of course both $L$ and $P$ may depend on further parameters.
There are multiple reasons why it can be helpful to examine the contours of such penalty functions $P$:

- When $\beta$ is two-dimensional, the solution of problem (2)–(3) can be found by simply looking at the contours of $L$ and $P$.
- That builds intuition for what happens in more than two dimensions, and in other, more general cases.
- From a Bayesian point of view, problem (1) can often be interpreted as an MAP estimator, in which case the contours of $P$ are also contours of the prior distribution of $\beta$.
Therefore, it is meaningful to visualize the set of points that $P$ maps onto the unit ball in $\mathbb{R}$, i.e., the set

$$\{\beta \in \mathbb{R}^2 : P(\beta) \leq 1\}.$$
Below you see GIF images of such sets for various penalty functions in 2D, capturing the effect of varying certain parameters in $P$. The covered penalty functions include the family of $p$-norms, the elastic net penalty, the fused penalty, the sorted $\ell_1$ norm, and several others.
:white_check_mark: R code to reproduce the GIFs is provided.
p-norms in 2D
First we consider the $p$-norm,

$$\|\beta\|_p = \left( \sum_{i} |\beta_i|^p \right)^{1/p},$$

with a varying parameter $p > 0$ (which actually isn't a proper norm for $p < 1$). Many statistical methods, such as the LASSO (Tibshirani 1996) and Ridge Regression (Hoerl and Kennard 1970), employ $p$-norm penalties. To find all $\beta$ on the boundary of the 2D unit $p$-norm ball, note that, given $\beta_1$ (the first entry of $\beta$), the second entry is easily obtained as

$$\beta_2 = \pm \left( 1 - |\beta_1|^p \right)^{1/p}.$$
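As a minimal sketch of this construction in base R (my own illustration, not the post's original code; the function name `pnorm_ball_boundary` is mine):

```r
# Boundary of the 2D unit p-norm ball: for beta1 in [-1, 1], solve
# |beta1|^p + |beta2|^p = 1 for beta2 (two sign-symmetric solutions).
pnorm_ball_boundary <- function(p, n = 401) {
  beta1 <- seq(-1, 1, length.out = n)
  beta2 <- (1 - abs(beta1)^p)^(1 / p)
  # upper and lower halves, traversed as one closed curve
  list(x = c(beta1, rev(beta1)), y = c(beta2, -rev(beta2)))
}

# Draw the unit "balls" for several values of p (p < 1 is not a proper norm)
plot(NULL, xlim = c(-1, 1), ylim = c(-1, 1), asp = 1,
     xlab = expression(beta[1]), ylab = expression(beta[2]))
for (p in c(0.5, 1, 2, 4)) {
  b <- pnorm_ball_boundary(p)
  lines(b$x, b$y)
}
```

Looping over a finer sequence of `p` values and saving one frame per value is all it takes to turn this into a GIF.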
Elastic net penalty in 2D
The elastic net penalty can be written in the form

$$P_\alpha(\beta) = \alpha \|\beta\|_1 + (1 - \alpha) \|\beta\|_2^2$$

for $\alpha \in (0, 1)$. It is quite popular with a variety of regression-based methods (such as the Elastic Net, of course). We obtain the corresponding 2D unit "ball" by calculating $|\beta_2|$ from a given $\beta_1$ as the non-negative root of a quadratic, i.e.,

$$|\beta_2| = \frac{-\alpha + \sqrt{\alpha^2 + 4 (1 - \alpha) \left( 1 - \alpha |\beta_1| - (1 - \alpha) \beta_1^2 \right)}}{2 (1 - \alpha)}.$$
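A short sketch of that root-finding step in base R (again my own illustration; `enet_beta2` is a hypothetical helper name):

```r
# Elastic net unit "ball": alpha*||beta||_1 + (1 - alpha)*||beta||_2^2 = 1.
# Given beta1, |beta2| is the non-negative root of a quadratic
# (the alpha = 1 case degenerates to the lasso diamond).
enet_beta2 <- function(beta1, alpha) {
  cc <- 1 - alpha * abs(beta1) - (1 - alpha) * beta1^2
  if (alpha == 1) return(pmax(cc, 0))
  disc <- alpha^2 + 4 * (1 - alpha) * cc
  (-alpha + sqrt(pmax(disc, 0))) / (2 * (1 - alpha))
}

alpha <- 0.5
# largest |beta1| on the boundary solves alpha*t + (1 - alpha)*t^2 = 1
tmax <- (-alpha + sqrt(alpha^2 + 4 * (1 - alpha))) / (2 * (1 - alpha))
beta1 <- seq(-tmax, tmax, length.out = 401)
beta2 <- enet_beta2(beta1, alpha)
plot(c(beta1, rev(beta1)), c(beta2, -rev(beta2)), type = "l", asp = 1,
     xlab = expression(beta[1]), ylab = expression(beta[2]))
```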
Fused penalty in 2D
The fused penalty can be written in the form

$$P_{\lambda_1, \lambda_2}(\beta) = \lambda_1 \sum_{i=1}^{m} |\beta_i| + \lambda_2 \sum_{i=2}^{m} |\beta_i - \beta_{i-1}|.$$

It encourages neighboring coefficients to have similar values, and is utilized by the fused LASSO (Tibshirani et al. 2005) and similar methods.
(Here I have simply evaluated the fused penalty function on a grid of points in $\mathbb{R}^2$, because figuring out equations in parametric form for the above polygons was too painful for my taste… :stuck_out_tongue:)
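The grid-evaluation trick can be sketched as follows (my own minimal version, assuming the 2D fused form $\lambda_1 (|\beta_1| + |\beta_2|) + \lambda_2 |\beta_2 - \beta_1|$; `fused_penalty` is a name I made up):

```r
# Fused penalty on a grid (assumed 2D form:
# lambda1 * (|b1| + |b2|) + lambda2 * |b2 - b1|);
# the unit "ball" boundary is drawn as the level-1 contour.
fused_penalty <- function(b1, b2, lambda1 = 1, lambda2 = 1) {
  lambda1 * (abs(b1) + abs(b2)) + lambda2 * abs(b2 - b1)
}

xs <- seq(-1.5, 1.5, length.out = 301)
z <- outer(xs, xs, fused_penalty)
contour(xs, xs, z, levels = 1, asp = 1, drawlabels = FALSE,
        xlab = expression(beta[1]), ylab = expression(beta[2]))
```

The same `outer` + `contour` pattern works for any penalty that is awkward to parametrize in closed form.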
Sorted L1 penalty in 2D
The Sorted $\ell_1$ penalty is used in a number of regression-based methods, such as SLOPE (Bogdan et al. 2015) and OSCAR (Bondell and Reich 2008). It has the form

$$P_\lambda(\beta) = \sum_{i=1}^{m} \lambda_i |\beta|_{(i)},$$

where $\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_m \geq 0$, and $|\beta|_{(1)} \geq |\beta|_{(2)} \geq \dots \geq |\beta|_{(m)}$ are the absolute values of the entries of $\beta$ arranged in decreasing order. In 2D this reduces to

$$P_\lambda(\beta) = \lambda_1 \max(|\beta_1|, |\beta_2|) + \lambda_2 \min(|\beta_1|, |\beta_2|).$$
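The 2D max/min reduction translates directly into a few lines of R (a sketch of mine, not the original code; `sorted_l1` and the example $\lambda$ values are my choices):

```r
# Sorted L1 penalty in 2D:
# lambda1 * max(|b1|, |b2|) + lambda2 * min(|b1|, |b2|), lambda1 >= lambda2 >= 0
sorted_l1 <- function(b1, b2, lambda1 = 1, lambda2 = 0.5) {
  lambda1 * pmax(abs(b1), abs(b2)) + lambda2 * pmin(abs(b1), abs(b2))
}

xs <- seq(-1.2, 1.2, length.out = 301)
contour(xs, xs, outer(xs, xs, sorted_l1), levels = 1,
        asp = 1, drawlabels = FALSE,
        xlab = expression(beta[1]), ylab = expression(beta[2]))
```

Varying the ratio $\lambda_2 / \lambda_1$ between 0 and 1 morphs the contour between the $\ell_\infty$ square and the $\ell_1$ diamond, which is what the GIF animates.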
Difference of p-norms
It holds that

$$\|\beta\|_2 \leq \|\beta\|_1,$$

or more generally, for all $p$-norms it holds that

$$\|\beta\|_q \leq \|\beta\|_p \quad \text{whenever} \quad p \leq q.$$

Thus, it is meaningful to define a penalty function of the form

$$P_{p,q}(\beta) = \|\beta\|_p - \|\beta\|_q$$

for $p \leq q$, which results in the following.
We visualize the same for varying $q$ while fixing $p = 1$, i.e., we define

$$P_q(\beta) = \|\beta\|_1 - \|\beta\|_q,$$
and we obtain the following GIF.
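Since the difference-of-norms contours are also easiest to get from a grid, here is a minimal sketch (my illustration, shown for $p = 1$, $q = 2$; `pq_diff_penalty` is a hypothetical name):

```r
# Difference-of-norms penalty ||beta||_p - ||beta||_q with p <= q,
# evaluated on a grid; the level-1 set is drawn as a contour.
pq_diff_penalty <- function(b1, b2, p = 1, q = 2) {
  (abs(b1)^p + abs(b2)^p)^(1 / p) - (abs(b1)^q + abs(b2)^q)^(1 / q)
}

xs <- seq(-4, 4, length.out = 401)
contour(xs, xs, outer(xs, xs, pq_diff_penalty), levels = 1,
        asp = 1, drawlabels = FALSE,
        xlab = expression(beta[1]), ylab = expression(beta[2]))
```

Note that this penalty vanishes on the coordinate axes (where only one entry is nonzero), so the level set is unbounded along them, unlike the norm balls above.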
Hyperbolic tangent penalty in 2D
The hyperbolic tangent penalty, which is for example used in the method of variable selection via subtle uprooting (Su 2015), has the form
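A grid-based sketch of this one as well (my own illustration, *assuming* the coordinate-wise form $\sum_i \tanh(\gamma |\beta_i|)$ with a scale parameter $\gamma > 0$; `tanh_penalty` is a name I made up):

```r
# Hyperbolic tangent penalty (assumed form: sum_i tanh(gamma * |beta_i|)).
# Its supremum in 2D is 2, so level 1 gives a non-empty bounded contour.
tanh_penalty <- function(b1, b2, gamma = 2) {
  tanh(gamma * abs(b1)) + tanh(gamma * abs(b2))
}

xs <- seq(-2, 2, length.out = 301)
contour(xs, xs, outer(xs, xs, tanh_penalty), levels = 1,
        asp = 1, drawlabels = FALSE,
        xlab = expression(beta[1]), ylab = expression(beta[2]))
```

Varying `gamma` over a sequence of frames reproduces the kind of animation shown in the GIF.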