# Contours of statistical penalty functions as GIF images

March 17, 2017

Many statistical modeling problems reduce to a minimization problem of the general form:

$\hat{\boldsymbol{\beta}} = \underset{\boldsymbol{\beta}}{\arg\min}\, f(\boldsymbol{\beta}; \mathbf{X}) + \lambda g(\boldsymbol{\beta}), \quad (1)$

or

$\hat{\boldsymbol{\beta}} = \underset{\boldsymbol{\beta}}{\arg\min}\, f(\boldsymbol{\beta}; \mathbf{X}) \quad (2)$

$\text{subject to}\quad g(\boldsymbol{\beta}) \leq t, \quad (3)$

where $f$ is some type of loss function, $\mathbf{X}$ denotes the data, $g$ is a penalty, also referred to by other names such as "regularization term", and $\lambda$ and $t$ are tuning parameters (problems (1) and (2-3) are often equivalent, by the way). Of course, both $f$ and $g$ may depend on further parameters.

There are multiple reasons why it can be helpful to check out the contours of such penalty functions $g$:

1. When $\boldsymbol{\beta}$ is two-dimensional, the solution of problem (2-3) can be found by simply taking a look at the contours of $f$ and $g$.
2. That builds intuition for what happens in more than two dimensions, and in other more general cases.
3. From a Bayesian point of view, problem (1) can often be interpreted as a MAP estimator, in which case the contours of $g$ are also contours of the prior distribution of $\boldsymbol{\beta}$.

Therefore, it is meaningful to visualize the unit ball of $g$ in $\mathbb{R}^2$, i.e., the set

$B_{g} := \{ \mathbf{x}\in\mathbb{R}^2 : g(\mathbf{x}) \leq 1 \}.$

Below you see GIF images of such sets $B_{g}$ for various penalty functions $g$ in 2D, capturing the effect of varying certain parameters in $g$. The covered penalty functions include the family of $p$-norms, the elastic net penalty, the fused penalty, the sorted $\ell_1$ norm, and several others.

:white_check_mark: R code to reproduce the GIFs is provided.
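
As a rough sketch of how frames can be stitched into a GIF in R (my own minimal illustration, not necessarily the code the post refers to; it assumes the `animation` package and an ImageMagick installation, and `draw_unit_ball()` is a hypothetical helper that plots one frame):

```r
# A minimal sketch, assuming the "animation" package and ImageMagick.
# draw_unit_ball() is a hypothetical helper that plots the unit ball
# of the penalty for one parameter value (one GIF frame).
library(animation)

saveGIF({
  for (a in c(0.5, 0.7, 1, 1.5, 2, 4)) {  # parameter values, one frame each
    draw_unit_ball(a)  # hypothetical frame-drawing helper
  }
}, movie.name = "penalty_contours.gif", interval = 0.5)
```

Each section below then only needs a function that draws a single frame for a given parameter value.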

## p-norms in 2D

First we consider the $p$-norm,

$g_{p}(\boldsymbol{\beta}) = \lVert\boldsymbol{\beta}\rVert_{p}^{p} = \lvert\beta_{1}\rvert^p + \lvert\beta_{2}\rvert^p,$

with a varying parameter $p \in (0, \infty]$ (which actually isn't a proper norm for $p < 1$). Many statistical methods, such as the LASSO (Tibshirani 1996) and Ridge Regression (Hoerl and Kennard 1970), employ $p$-norm penalties. To find all $\boldsymbol{\beta}$ on the boundary of the 2D unit $p$-norm ball, given $\beta_1$ (the first entry of $\boldsymbol{\beta}$), we easily obtain $\beta_2$ as

$\beta_2 = \pm (1-|\beta_1|^p)^{1/p}, \quad \forall\beta_1\in[-1, 1].$
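
A minimal sketch of how this boundary can be traced and plotted in base R (my own illustration; the function name is made up):

```r
# Trace the boundary of the 2D unit p-norm ball using the formula above:
# beta2 = +/- (1 - |beta1|^p)^(1/p) for beta1 in [-1, 1].
pnorm_ball_boundary <- function(p, n = 401) {
  beta1 <- seq(-1, 1, length.out = n)
  beta2 <- (1 - abs(beta1)^p)^(1 / p)
  # upper half from left to right, then lower half back
  data.frame(x = c(beta1, rev(beta1)), y = c(beta2, -rev(beta2)))
}

# Example frame: the non-convex "ball" for p = 0.5
b <- pnorm_ball_boundary(p = 0.5)
plot(b, type = "l", asp = 1, xlab = expression(beta[1]),
     ylab = expression(beta[2]), main = "p = 0.5")
```

Looping this plot over a sequence of $p$ values inside `saveGIF()` yields the animation.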

## Elastic net penalty in 2D

The elastic net penalty can be written in the form

$g_{\alpha}(\boldsymbol{\beta}) = \alpha \lVert \boldsymbol{\beta} \rVert_{1} + (1 - \alpha) \lVert \boldsymbol{\beta} \rVert_{2}^{2},$

for $\alpha\in(0,1)$. It is quite popular with a variety of regression-based methods (such as the Elastic Net, of course). We obtain the corresponding 2D unit "ball" by calculating $\beta_{2}$ from a given $\beta_{1}\in[-1,1]$ as

$\beta_{2} = \pm \frac{-\alpha + \sqrt{\alpha^2 - 4 (1 - \alpha) ((1 - \alpha) \beta_{1}^2 + \alpha \lvert\beta_{1}\rvert - 1)}}{2 - 2 \alpha}.$
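
A corresponding sketch in R (again my own illustration), using the closed-form solution above for the upper half of the boundary and mirroring it for the lower half:

```r
# Boundary of the 2D elastic net unit "ball": solve the quadratic
# (1 - a) * b2^2 + a * b2 + ((1 - a) * b1^2 + a * |b1| - 1) = 0 for b2 >= 0.
enet_ball_boundary <- function(alpha, n = 401) {
  beta1 <- seq(-1, 1, length.out = n)
  disc  <- alpha^2 -
    4 * (1 - alpha) * ((1 - alpha) * beta1^2 + alpha * abs(beta1) - 1)
  beta2 <- (-alpha + sqrt(disc)) / (2 - 2 * alpha)
  data.frame(x = c(beta1, rev(beta1)), y = c(beta2, -rev(beta2)))
}

plot(enet_ball_boundary(alpha = 0.5), type = "l", asp = 1,
     xlab = expression(beta[1]), ylab = expression(beta[2]))
```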

## Fused penalty in 2D

The fused penalty can be written in the form

$g_{\alpha}(\boldsymbol{\beta}) = \alpha \lVert \boldsymbol{\beta} \rVert_{1} + (1 - \alpha) \sum_{i = 2}^m \lvert \beta_{i} - \beta_{i-1} \rvert.$

It encourages neighboring coefficients $\beta_{i}$ to have similar values, and is utilized by the fused LASSO (Tibshirani et al. 2005) and similar methods.

(Here I have simply evaluated the fused penalty function on a grid of points in $[-2,2]^2$, because figuring out equations in parametric form for the above polygons was too painful for my taste… :stuck_out_tongue:)
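
In that spirit, a grid-based sketch in R (my own illustration of the approach just described), drawing the level set $g_{\alpha}(\boldsymbol{\beta}) = 1$ with base R's `contour()`:

```r
# Evaluate the 2D fused penalty on a grid over [-2, 2]^2 and draw
# the level set g(beta) = 1; for m = 2 the penalty is
# alpha * (|b1| + |b2|) + (1 - alpha) * |b2 - b1|.
fused_contour <- function(alpha, lim = 2, n = 301) {
  b <- seq(-lim, lim, length.out = n)
  g <- outer(b, b, function(b1, b2) {
    alpha * (abs(b1) + abs(b2)) + (1 - alpha) * abs(b2 - b1)
  })
  contour(b, b, g, levels = 1, drawlabels = FALSE, asp = 1,
          xlab = expression(beta[1]), ylab = expression(beta[2]))
}

fused_contour(alpha = 0.3)
```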

## Sorted L1 penalty in 2D

The Sorted $\ell_{1}$ penalty is used in a number of regression-based methods, such as SLOPE (Bogdan et al. 2015) and OSCAR (Bondell and Reich 2008). It has the form

$g_{\boldsymbol{\lambda}}(\boldsymbol{\beta}) = \sum_{i = 1}^m \lambda_{i} \lvert \beta \rvert_{(i)},$

where $\lvert \beta \rvert_{(1)} \geq \lvert \beta \rvert_{(2)} \geq \ldots \geq \lvert \beta \rvert_{(m)}$ are the absolute values of the entries of $\boldsymbol{\beta}$ arranged in decreasing order. In 2D this reduces to

$g_{\boldsymbol{\lambda}}(\boldsymbol{\beta}) = \lambda_{1} \max\{|\beta_{1}|, |\beta_{2}|\} + \lambda_{2} \min\{|\beta_{1}|, |\beta_{2}|\}.$
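
The same grid-based trick works here; a sketch (my own, with a made-up function name):

```r
# Level set of the 2D sorted L1 penalty
# lambda1 * max(|b1|, |b2|) + lambda2 * min(|b1|, |b2|) = 1,
# assuming lambda1 >= lambda2 >= 0.
sorted_l1_contour <- function(lambda1, lambda2, lim = 1.5, n = 301) {
  stopifnot(lambda1 >= lambda2, lambda2 >= 0)
  b <- seq(-lim, lim, length.out = n)
  g <- outer(b, b, function(b1, b2) {
    lambda1 * pmax(abs(b1), abs(b2)) + lambda2 * pmin(abs(b1), abs(b2))
  })
  contour(b, b, g, levels = 1, drawlabels = FALSE, asp = 1,
          xlab = expression(beta[1]), ylab = expression(beta[2]))
}

sorted_l1_contour(lambda1 = 1, lambda2 = 0.5)
```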

## Difference of p-norms

It holds that

$\lVert \boldsymbol{\beta} \rVert_{1} \geq \lVert \boldsymbol{\beta} \rVert_{2},$

or more generally, for all $p$-norms it holds that

$(\forall p \leq q) : \lVert \boldsymbol{\beta} \rVert_{p} \geq \lVert \boldsymbol{\beta} \rVert_{q}.$

Thus, it is meaningful to define a penalty function of the form

$g_{\alpha}(\boldsymbol{\beta}) = \lVert \boldsymbol{\beta} \rVert_{1} - \alpha \lVert \boldsymbol{\beta} \rVert_{2},$

for $\alpha\in[0,1]$, which results in the following.

We visualize the same for varying $p \geq 1$, fixing $\alpha = 0.6$, i.e., we define

$g_{p}(\boldsymbol{\beta}) = \lVert \boldsymbol{\beta} \rVert_{1} - 0.6 \lVert \boldsymbol{\beta} \rVert_{p},$

and we obtain the following GIF.
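
Both difference-of-norm variants fit one grid-based sketch (my own illustration); note that the level set stretches out along the coordinate axes as $\alpha$ grows, so the grid needs to be wide enough:

```r
# Level set of g(beta) = ||beta||_1 - alpha * ||beta||_p on a grid.
# Along the axes ||beta||_1 = ||beta||_p, so the level set reaches
# |beta_i| = 1 / (1 - alpha) there; hence the generous default lim.
diff_norms_contour <- function(alpha, p = 2, lim = 3, n = 301) {
  b <- seq(-lim, lim, length.out = n)
  g <- outer(b, b, function(b1, b2) {
    (abs(b1) + abs(b2)) - alpha * (abs(b1)^p + abs(b2)^p)^(1 / p)
  })
  contour(b, b, g, levels = 1, drawlabels = FALSE, asp = 1,
          xlab = expression(beta[1]), ylab = expression(beta[2]))
}

diff_norms_contour(alpha = 0.6, p = 2)   # first GIF: vary alpha, p = 2
diff_norms_contour(alpha = 0.6, p = 4)   # second GIF: fix alpha, vary p
```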

## Hyperbolic tangent penalty in 2D

The hyperbolic tangent penalty, which is used, for example, in the method of variable selection via subtle uprooting (Su 2015), has the form

$g_{a}(\boldsymbol{\beta}) = \sum_{i = 1}^m \tanh(a \beta_{i}^2), \quad a>0.$
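
A final grid-based sketch for this penalty (again my own illustration, in the same style as the ones above):

```r
# Level set of the 2D hyperbolic tangent penalty
# g_a(beta) = tanh(a * b1^2) + tanh(a * b2^2) = 1.
tanh_contour <- function(a, lim = 2, n = 301) {
  b <- seq(-lim, lim, length.out = n)
  g <- outer(b, b, function(b1, b2) tanh(a * b1^2) + tanh(a * b2^2))
  contour(b, b, g, levels = 1, drawlabels = FALSE, asp = 1,
          xlab = expression(beta[1]), ylab = expression(beta[2]))
}

tanh_contour(a = 2)
```

Since $\tanh(a \beta_i^2) < 1$ for any finite $\beta_i$, the level curve never crosses the axes; it hugs them, which gives these balls their star-like arms.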