Why doesn't qUpperConfidenceBound use ReLU of the deviation instead of the absolute deviation from the posterior mean? #1672

AlexStreicher · 2023-02-12T00:29:59Z

AlexStreicher
Feb 12, 2023

I was looking at the definition of qUpperConfidenceBound and I was getting a little nervous about the fact that it uses absolute deviation even while sampling q correlated points simultaneously.

To review, in the referenced paper (Wilson et al., Appendix A) they start with the q=1 UpperConfidenceBound
$$UCB(x; \beta) \equiv \mu(x)+\beta \sigma(x),\quad y|x\sim\mathcal{N}(\mu(x),\sigma^2(x)).$$ For reference, when q=1, $y, \mu, \sigma$ are all scalars, and we drop the explicit dependence on x from here on.
To prepare for Monte Carlo sampling and reparametrization, they reformulate it by rewriting the standard deviation as an integral over the distribution of $y|x$: $$UCB(x; \beta) \equiv \mu+\beta \sigma=\intop dy \left(\mu+\beta\sqrt{\frac{\pi}{2}}|y-\mu|\right) \rho(y),\quad y|x\sim\mathcal{N}(\mu,\sigma^2),$$ using the fact that for a Gaussian variable, $\mathbb{E}[|y-\mu|]=\sigma\sqrt{2/{\pi}}$.
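
For completeness, the half-normal identity used here follows from a short integral. This is a standard result, written with the substitution $z=(y-\mu)/\sigma$, where $\sigma$ is the standard deviation:
$$\mathbb{E}[|y-\mu|]=\sigma\intop_{-\infty}^{\infty} dz\,|z|\,\frac{e^{-z^2/2}}{\sqrt{2\pi}}=\frac{2\sigma}{\sqrt{2\pi}}\intop_{0}^{\infty} dz\,z\,e^{-z^2/2}=\frac{2\sigma}{\sqrt{2\pi}}=\sigma\sqrt{\frac{2}{\pi}},$$ so that $\beta\sqrt{\pi/2}\;\mathbb{E}[|y-\mu|]=\beta\sigma$.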

They then propose a parallel version of UpperConfidenceBound for q>1 simultaneous points:
$$qUCB(\mathbf{x}; \beta) \coloneqq \intop d\mathbf{y} \; \max_{q} \left(\boldsymbol{\mu}+\beta\sqrt{\frac{\pi}{2}}|\mathbf{y}-\boldsymbol{\mu}|\right) \rho(\mathbf{y}),\quad \mathbf{y}|\mathbf{x}\sim\mathcal{N}(\boldsymbol{\mu},\boldsymbol{\Sigma})$$ $$\mathbf{y}=(y^{(1)},\ldots,y^{(q)}),\quad \boldsymbol{\mu}=(\mu^{(1)},\ldots,\mu^{(q)}), \quad \boldsymbol{\Sigma}_{ij}=\mathrm{Cov}(y^{(i)},y^{(j)}),\quad i,j=1,\ldots,q$$ The idea here is that we're integrating this max-quantity over our multivariate posterior distribution of q simultaneous points.
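
To make the integral concrete, here is a minimal Monte Carlo sketch of the qUCB expression in plain Python. It is not the BoTorch implementation: `cholesky` and `q_ucb_mc` are hypothetical helper names, and `beta` follows the convention of the formula in this post ($UCB=\mu+\beta\sigma$), which may differ from how a library parameterizes it.

```python
import math
import random


def cholesky(cov):
    """Lower-triangular L with L @ L^T == cov (cov must be positive definite)."""
    q = len(cov)
    L = [[0.0] * q for _ in range(q)]
    for i in range(q):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(cov[i][i] - s)
            else:
                L[i][j] = (cov[i][j] - s) / L[j][j]
    return L


def q_ucb_mc(mean, cov, beta, n_samples=100_000, seed=0):
    """Monte Carlo estimate of E[max_j(mu_j + beta*sqrt(pi/2)*|y_j - mu_j|)]."""
    rng = random.Random(seed)
    L = cholesky(cov)
    q = len(mean)
    scale = beta * math.sqrt(math.pi / 2)
    total = 0.0
    for _ in range(n_samples):
        z = [rng.gauss(0.0, 1.0) for _ in range(q)]
        # Reparameterization: the deviation y - mu equals L z.
        dev = [sum(L[i][k] * z[k] for k in range(i + 1)) for i in range(q)]
        total += max(m + scale * abs(d) for m, d in zip(mean, dev))
    return total / n_samples


# For q = 1 the estimate should be close to mu + beta * sigma = 1.0 + 2.0 * 0.5.
print(q_ucb_mc([1.0], [[0.25]], beta=2.0))
```

The max is taken inside the expectation, per sample, which is exactly the point the next paragraph examines.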

However, note what we're integrating. At any point in the $\mathbf{y}$-domain, we're not taking the value of the 1-UCB for whichever of the q points has the max UCB. Instead, we're taking the max of a quantity that becomes the 1-UCB only after it is marginalized and integrated over. For the toy example where the q points are completely uncorrelated, there's no difference between those two statements. However, when the multivariate posterior being considered has strong (anti)correlation between the outcomes (e.g. q=2 and the $y$'s are strongly (anti)correlated), it is not clear to me that the above quantity should be referred to as "UpperConfidenceBound".

Why not replace $\beta\sqrt{{\pi}/{2}}\,|y-\mu|$ with $\beta\sqrt{2\pi}\,\mathrm{ReLU}(y-\mu)$? In the code, this would amount to replacing the line `ucb_samples = mean + self.beta_prime * (obj - mean).abs()` with `ucb_samples = mean + 2 * self.beta_prime * (obj - mean).clamp_min(0)`.
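
As a sanity check that the two integrands agree for q=1, here is a small plain-Python sketch. The function name `ucb_q1_estimates` is hypothetical, and `beta_prime` is taken to be $\beta\sqrt{\pi/2}$ as in the formulas above. Both estimates should land near $\mu+\beta\sigma$ up to Monte Carlo error.

```python
import math
import random


def ucb_q1_estimates(mu, sigma, beta, n_samples=200_000, seed=0):
    """Return MC estimates of the abs-based and ReLU-based q = 1 integrands."""
    rng = random.Random(seed)
    beta_prime = beta * math.sqrt(math.pi / 2)
    abs_total = 0.0
    relu_total = 0.0
    for _ in range(n_samples):
        dev = sigma * rng.gauss(0.0, 1.0)  # one sample of y - mu
        abs_total += mu + beta_prime * abs(dev)
        relu_total += mu + 2 * beta_prime * max(dev, 0.0)  # 2 * ReLU(y - mu)
    return abs_total / n_samples, relu_total / n_samples


abs_est, relu_est = ucb_q1_estimates(mu=0.3, sigma=1.5, beta=2.0)
# Analytic target for comparison: mu + beta * sigma.
print(abs_est, relu_est, 0.3 + 2.0 * 1.5)
```

The factor of 2 compensates for ReLU discarding the lower half of a symmetric distribution, so the two variants only start to differ once a max over q > 1 correlated outcomes is involved.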

Balandat · 2023-02-12T16:55:38Z

Balandat
Feb 12, 2023
Collaborator

Hmm this is a good point. @j-wilson is the right person for this question :)

1 reply

Balandat Feb 12, 2023
Collaborator

Also, @SebastianAment

eytan · 2023-02-12T17:28:11Z

eytan
Feb 12, 2023
Collaborator

@AlexStreicher thank you for the suggestion. Have you done any benchmarks with this alternative formulation?

0 replies

j-wilson · 2023-02-12T21:25:27Z

j-wilson
Feb 12, 2023

Hi @AlexStreicher,

This is an interesting question. The main reason for introducing q-UCB in the paper you referenced was to argue that reinterpreting acquisition functions as expectations can facilitate the design of batch variants. We introduced a UCB-like batch acquisition function, but this is not necessarily the best one. If you think that it can be improved, then I'd love to hear more! Also, it sounds like you're less than thrilled with our loose usage of the term UCB, and I agree that we could have done a better job here.

As I recall it, we primarily considered two candidates for q-UCB:

1. $\mathbb{E}[\max(\mathbf{y}) \vert \mathbf{y} \ge \boldsymbol{\mu}]$
2. $\mathbb{E}[\boldsymbol{\mu} + \operatorname{Abs}(\mathbf{y} - \boldsymbol{\mu})]$.

Both of these were intended to retain the spirit of optimism that UCB is known for. In (1), this is accomplished by explicitly assuming that each outcome will be favorable. Unfortunately, this formulation strongly discourages batches of anti-correlated queries (since $\mathbf{y} \vert \mathbf{y} \ge \boldsymbol{\mu}$ becomes heavily concentrated around $\boldsymbol{\mu}$). The basic idea for (2) was to simply pretend that "any deviation is a good deviation".
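
The effect of the conditioning in (1) can be illustrated for q=2 with a rough rejection-sampling sketch in plain Python. This assumes standardized outcomes ($\boldsymbol{\mu}=0$, unit variances) and uses a hypothetical helper name `cond_max_mc`; it is only meant to show the trend, not to reproduce anything from the paper.

```python
import math
import random


def cond_max_mc(rho, n_samples=200_000, seed=0):
    """Estimate E[max(z1, z2) | z1 >= 0, z2 >= 0] by rejection sampling.

    z1, z2 are standard normal with correlation rho.
    Returns (estimate, acceptance rate).
    """
    rng = random.Random(seed)
    c = math.sqrt(1.0 - rho * rho)
    total, kept = 0.0, 0
    for _ in range(n_samples):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + c * rng.gauss(0.0, 1.0)
        if z1 >= 0.0 and z2 >= 0.0:  # keep only samples with y >= mu
            total += max(z1, z2)
            kept += 1
    return total / kept, kept / n_samples


for rho in (-0.9, 0.0, 0.9):
    value, rate = cond_max_mc(rho)
    print(f"rho={rho:+.1f}  E[max | y >= mu] ~ {value:.3f}  acceptance ~ {rate:.3f}")
```

For strongly anticorrelated pairs, few samples satisfy the condition, and those that do sit close to the mean, so the conditional expectation shrinks.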

I'm curious to learn more about your suggestion of using ReLU. Why do you think this makes (more) sense? At a glance, the two seem similar, but I would expect ReLU to place greater emphasis on exploitation?

3 replies

AlexStreicher Feb 16, 2023
Author

> As I recall it, we primarily considered two candidates for q-UCB:
>
> 1. $\mathbb{E}[\max(\mathbf{y}) \vert \mathbf{y} \ge \boldsymbol{\mu}]$

One post-hoc observation is that this conditioning puts one in a very rare (weird?) subset of the posterior domain by restricting all the posterior values to be simultaneously fluctuating above their means, especially if any of them are anticorrelated.

> Both of these were intended to retain the spirit of optimism that UCB is known for. In (1), this is accomplished by explicitly assuming that each outcome will be favorable. Unfortunately, this formulation strongly discourages batches of anti-correlated queries (since $\mathbf{y} \vert \mathbf{y} \ge \boldsymbol{\mu}$ becomes heavily concentrated around $\boldsymbol{\mu}$). The basic idea for (2) was to simply pretend that "any deviation is a good deviation".
>
> I'm curious to learn more about your suggestion of using ReLU. Why do you think this makes (more) sense? At a glance, the two seem similar, but I would expect ReLU to place greater emphasis on exploitation?

I was explaining to a colleague how a seemingly simple quantity such as $qSimpleRegret(\mathbf{x})\equiv\mathbb{E}[\max_q(\mathbf{y})]$ is actually amazing. It takes the max over the $q$ posterior values, so attempting to exploit it will unknowingly cause exploration, and even better, it will "hedge" its suggested points. After all, two copies of the same point (perfect correlation) are a wasteful suggestion, and anticorrelated points guarantee at least one value fluctuating positively. When I tried to transfer these statements to qUpperConfidenceBound, I realized that it wouldn't work.

What made it sink in was the q=2 toy example where the two posterior values have the same marginal statistics, $\mu_1=\mu_2=\mu, \; \sigma_1=\sigma_2=\sigma$. One can then see how qUCB varies as a function of the points' one remaining statistic, the correlation coefficient $\rho_{12}\in[-1,1]$. One finds that qUCB reports its maximum value when the points are completely uncorrelated, $\rho_{12}=0$. Furthermore, it is symmetric about this point, reporting its worst values for $\rho_{12}=\pm1$.
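
This toy example can be reproduced with a short plain-Python sketch. The helper name `q2_ucb_abs` is hypothetical, and `beta` follows the $\mu+\beta\sigma$ convention used in this thread.

```python
import math
import random


def q2_ucb_abs(rho, mu=0.0, sigma=1.0, beta=1.0, n_samples=100_000, seed=0):
    """MC estimate of E[max_j(mu + beta*sqrt(pi/2)*|y_j - mu|)] for q = 2."""
    rng = random.Random(seed)
    beta_prime = beta * math.sqrt(math.pi / 2)
    c = math.sqrt(1.0 - rho * rho)
    total = 0.0
    for _ in range(n_samples):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + c * rng.gauss(0.0, 1.0)  # corr(z1, z2) = rho
        total += mu + beta_prime * sigma * max(abs(z1), abs(z2))
    return total / n_samples


for rho in (-1.0, -0.5, 0.0, 0.5, 1.0):
    print(f"rho={rho:+.1f}  qUCB(abs) ~ {q2_ucb_abs(rho):.3f}")
```

If the algebra is right, the estimates should track the closed form $\mu+\beta\sigma\,(\sqrt{1+\rho_{12}}+\sqrt{1-\rho_{12}})/\sqrt{2}$, which follows from $\max(|a|,|b|)=(|a+b|+|a-b|)/2$. That expression is symmetric in $\rho_{12}$ and peaks at $\rho_{12}=0$.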

I noticed that this issue was related to the absolute value in qUCB rewarding downward fluctuations in addition to upward fluctuations. I figured the simplest fix was to replace $\mathrm{abs} \rightarrow 2 \, \mathrm{ReLU}$, thereby "optimistically" ignoring any individual underperformance.

The main motivations/factors leading me to this choice:

1. Require the acquisition function to become UpperConfidenceBound when $q \rightarrow 1$.
2. Have at least one parameter (i.e. $\beta$) that can be dialed to enhance/diminish wanderlust.
3. For the q=2 case, require that with $\mu_1=\mu_2=\mu, \; \sigma_1=\sigma_2=\sigma$ held fixed, the acquisition value (for the ReLU variant, the expectation of $\max(\boldsymbol{\mu} + 2\beta\sqrt{\pi/2} \; \mathrm{ReLU}(\mathbf{y}-\boldsymbol{\mu}))$) is minimized when the two points are perfectly correlated and increases monotonically as $\rho$ moves towards perfect anticorrelation. Examples of functions with this q=2 "hedging" property are qSimpleRegret, qProbabilityOfImprovement, qExpectedImprovement, and the ReLU version of qUCB. Examples of functions without this property are qUpperConfidenceBound as well as $\mathbb{E}[\max(\mathbf{y}) \vert \mathbf{y} \ge \boldsymbol{\mu}]$ and $\mathbb{E}[\max(\mathbf{y}) \vert \mathrm{any}(\mathbf{y} \ge \boldsymbol{\mu})]$ (my first attempt to come up with a fix).

So, qUCB with abs favors points with completely uncorrelated outcomes, while qUCB with ReLU favors points with outcomes that hedge against each other's negative fluctuations, resulting in a behavior similar to qSimpleRegret, qProbabilityOfImprovement, or qExpectedImprovement, but now with a parameter dial of $\beta$ to control how exploitative it is.
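
The contrast between the two variants can be sketched side by side for q=2 with equal marginals. This is plain Python with a hypothetical helper `q2_ucb`, not BoTorch code, and `beta` again follows the $\mu+\beta\sigma$ convention of this thread.

```python
import math
import random


def q2_ucb(rho, variant, mu=0.0, sigma=1.0, beta=1.0, n_samples=100_000, seed=0):
    """MC estimate of a q = 2 batch UCB with equal marginals and correlation rho.

    variant="abs":  integrand max_j(mu + beta' * |y_j - mu|)
    variant="relu": integrand max_j(mu + 2 * beta' * ReLU(y_j - mu))
    """
    if variant not in ("abs", "relu"):
        raise ValueError("variant must be 'abs' or 'relu'")
    rng = random.Random(seed)
    beta_prime = beta * math.sqrt(math.pi / 2)
    c = math.sqrt(1.0 - rho * rho)
    total = 0.0
    for _ in range(n_samples):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + c * rng.gauss(0.0, 1.0)  # corr(z1, z2) = rho
        if variant == "abs":
            bonus = beta_prime * max(abs(z1), abs(z2))
        else:
            # max(ReLU(z1), ReLU(z2)) is the same as max(z1, z2, 0).
            bonus = 2 * beta_prime * max(z1, z2, 0.0)
        total += mu + sigma * bonus
    return total / n_samples


for rho in (-1.0, -0.5, 0.0, 0.5, 1.0):
    print(f"rho={rho:+.1f}  abs ~ {q2_ucb(rho, 'abs'):.3f}  relu ~ {q2_ucb(rho, 'relu'):.3f}")
```

Under these assumptions both variants should approach $\mu+\beta\sigma$ at $\rho=+1$, while at $\rho=-1$ the ReLU variant should approach $\mu+2\beta\sigma$ and the abs variant falls back to $\mu+\beta\sigma$.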

j-wilson Feb 21, 2023

I'm not sure if/when we'll get around to investigating the ReLU variant you've proposed (since we do not currently use qUCB), but I think that what you've said makes sense intuitively. Time allowing, it would be very interesting to see how the ReLU-based qUCB acquisition function performs in comparison to, e.g., qEI.

Balandat Feb 22, 2023
Collaborator

@SebastianAment, @sdaulton if we're running benchmarks in the context of the log(qN)EI work, maybe that's something we can do with minimal overhead?
