
Priors in GradVI

GradVI can be used with a wide range of prior distributions. It is designed to take the prior family as an input module from the user. Before running any task, the user needs to specify which prior to use: either one of the priors already provided with the software, or a prior of their own definition.

Using prior distributions

In our software, we have already implemented a few prior distributions. In the following, we discuss how to use those prior distributions and the available options.

Adaptive Shrinkage (Ash)

The adaptive shrinkage (ash) prior was introduced by Matthew Stephens in this paper. It is defined as a scale mixture of normals, specifically,

$$\mathcal{G}(\sigma_1^2, \ldots, \sigma_K^2) \triangleq \left\{g(\cdot) = \sum_{k=1}^{K}w_k\mathcal{N}(\cdot \mid 0, \sigma_k^2) : w \in S\right\}\,,$$ $$S \triangleq \left\{w \in \mathbb{R}_{+}^{K} : \sum_{k=1}^{K}w_k = 1\right\}\,.$$

Here, $0 \le \sigma_1^2 \le \ldots \le \sigma_K^2 < \infty$ are the component variances, which need to be specified by the user. The unknown mixture proportions, $w_1, \ldots, w_K$, are estimated from the data. Here is an example that defines a prior family with $K=20$ pre-specified components and $\sigma_k^2 = (2^{(k-1)/K} - 1)$:

import numpy as np
from gradvi.priors import Ash

# grid of K = 20 pre-specified components; the first component has zero variance
sk = np.abs(np.power(2.0, np.arange(20) / 20) - 1)
prior = Ash(sk)

There are several optional flags which can be used to modify the prior family, as discussed in the following subsections.

Initialization. The solution may depend on the initialization of $w_k$. By default, the software initializes all weights to be equal, $w_k = 1 / K$. Since $w_0$, the weight on the zero-variance component, controls the sparsity of the model (intuitively, one can think of it as the fraction of null / zero coefficients in the regression model), it is often helpful to initialize $w_0$ at a desired sparsity level. For example, if we guess that 90% of the coefficients are zero, we can initialize $w_0 = 0.9$:

import numpy as np
from gradvi.priors import Ash

k = 20
wk = np.zeros(k)
wk[0] = 0.9  # weight on the null (zero-variance) component
wk[1:(k-1)] = np.repeat((1 - wk[0])/(k-1), (k - 2))
wk[k-1] = 1 - np.sum(wk)  # guard against floating-point rounding so the weights sum exactly to 1
sk = np.abs(np.power(2.0, np.arange(k) / k) - 1)
prior = Ash(sk, wk = wk)

This recipe is so common that the code above can be replaced by simply setting the sparsity flag:

import numpy as np
from gradvi.priors import Ash

sk = np.abs(np.power(2.0, np.arange(20) / 20) - 1)
prior = Ash(sk, sparsity = 0.9)

In general, one can supply any set of non-negative initial values $w_k$ with the constraint that they must sum to 1.
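For example, here is a sketch that initializes the weights with a geometric decay (the decay scheme is only an illustration; any non-negative vector summing to 1 will do):

import numpy as np
from gradvi.priors import Ash

k = 20
wk = np.power(0.5, np.arange(k))  # geometrically decaying raw weights
wk /= np.sum(wk)                  # normalize so the weights sum to 1
sk = np.abs(np.power(2.0, np.arange(k) / k) - 1)
prior = Ash(sk, wk = wk)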

Scaling. While solving a multiple regression problem $\mathbf{Y} = \mathbf{Xb} + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \sigma^2)$, the coefficients $b_j$ are assumed to have a prior distribution $b_j \sim g(\cdot)$. We call this the unscaled prior. In this paper, Kim et al. discuss the computational benefits of assuming a scaled prior, in which the scaled coefficients $b_j / \sigma$ are i.i.d. from $g$, that is, $b_j \mid g, \sigma^2 \sim g_{\sigma}$, where $g_{\sigma}(b_j) \triangleq g(b_j / \sigma) / \sigma$. The user can switch between the two variants using the scaled flag:

prior = Ash(sk, scaled = True) # for scaled prior
prior = Ash(sk, scaled = False) # for unscaled prior

By default, we use scaled = True. For regression, we suggest using the scaled prior; for trend filtering, we suggest using the unscaled prior.
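To make the scaled density concrete, here is a minimal numpy sketch (independent of GradVI) that evaluates $g_{\sigma}(b) = g(b / \sigma) / \sigma$ for an ash-style mixture of normals with standard deviations sk:

import numpy as np
from scipy.stats import norm

def g(b, wk, sk):
    # unscaled ash density: sum_k w_k N(b | 0, sigma_k^2)
    return sum(w * norm.pdf(b, scale = s) for w, s in zip(wk, sk))

def g_scaled(b, wk, sk, sigma):
    # scaled density: g(b / sigma) / sigma
    return g(b / sigma, wk, sk) / sigma

wk = np.array([0.5, 0.3, 0.2])
sk = np.array([0.1, 1.0, 2.0])
print(g_scaled(1.0, wk, sk, sigma = 2.0))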

Softmax parametrization. In order to impose the constraints $w_k \ge 0$ and $\sum_k w_k = 1$, we use a softmax representation of $w_k$, that is,

$$w_k = \frac{\exp(a_k)}{\sum_k \exp(a_k)}\,,$$

and update $a_k$ instead of $w_k$ in the gradient descent iterations. In general, a different base $b > 0$ can be used instead of $\exp(\cdot)$; it can be set using the smbase flag:

prior = Ash(sk, smbase = 2.0)
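As a quick sanity check (plain numpy, not GradVI code), the softmax map with any base $b > 0$ always yields valid mixture weights:

import numpy as np

def softmax(a, base = np.e):
    # subtract the maximum for numerical stability before exponentiating
    z = np.power(base, a - np.max(a))
    return z / np.sum(z)

a = np.array([0.0, 1.0, -2.0, 0.5])
w = softmax(a, base = 2.0)
print(w, w.sum())  # non-negative weights that sum to 1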

Point-Normal (Spike-and-slab)

The point-normal distribution is another popular choice of prior family for sparse variable selection; see, for example, this paper. It is defined as

$$g(\cdot) = w\delta_0 + (1 - w) \mathcal{N}(\cdot \mid 0, \sigma_1^2)\,,$$

where $\delta_0$ is the Dirac delta function, a point mass at zero. Here, $w$ and $\sigma_1^2$ are unknown parameters, which are estimated from the data. In our software, this prior can be chosen using:

from gradvi.priors import PointNormal
prior = PointNormal()

Initialization. By default, the software initializes $w = 0.5$ and $\sigma_1^2 = 1.0$. Here, $w$ controls the sparsity of the model (intuitively one can think of it as a fraction of null / zero coefficients in the regression model). It is often helpful to initialize with more weight on the delta function. For example, if we guess that 90% of the coefficients are zero, we can initialize $w=0.9$. In our software, this can be set using the sparsity flag, that is,

prior = PointNormal(sparsity = 0.9)

Similarly, the initial value of $\sigma_1^2$ can be set using the s2 flag, for example,

prior = PointNormal(sparsity = 0.9, s2 = 2.0)
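For intuition, here is a small numpy sketch (independent of GradVI) that draws coefficients from the point-normal prior; with $w = 0.9$, roughly 90% of the draws are exactly zero:

import numpy as np

rng = np.random.default_rng(42)
n, w, s2 = 10000, 0.9, 2.0

# draw from g = w * delta_0 + (1 - w) * N(0, s2)
is_null = rng.random(n) < w
b = np.where(is_null, 0.0, rng.normal(0.0, np.sqrt(s2), size = n))
print(np.mean(b == 0))  # close to 0.9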

Custom class for prior distributions

The user can define and use any prior distribution, provided it is analytically tractable under the Normal Means model. If you are defining a prior family, please reach out to us so that we can test it and include it in the software, saving time and resources for other users in the future. The Python class for the prior family primarily includes (see the sketch after this list):

  1. the number of parameters that need to be estimated,
  2. a real bound for each of those parameters, and
  3. a function for the gradient descent step for those parameters.
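The exact base class and method names vary with the GradVI version, so the following is only a hypothetical sketch of the three ingredients above; the class and method names (CustomPrior, bounds, gradient_step) are illustrative and not the actual GradVI interface. Please consult the source code of the built-in priors (for example, Ash and PointNormal) for the real contract.

import numpy as np

class CustomPrior:
    # hypothetical skeleton; the real GradVI base class may differ

    def __init__(self, theta_init):
        self.theta = np.asarray(theta_init)  # 1. parameters to be estimated

    @property
    def bounds(self):
        # 2. a real bound for each parameter, e.g. variances must be positive
        return [(1e-8, None) for _ in self.theta]

    def gradient_step(self, theta, grad, step_size):
        # 3. one gradient descent update for the parameters
        return theta - step_size * grad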