---
title: "Boundary Time Warping (new)"
bibliography: ../inst/REFERENCES.bib
header-includes:
- \usepackage{bm}
output:
pdf_document:
number_sections: true
toc: true
toc_depth: 6
df_print: kable
html_document:
number_sections: true
toc: true
toc_depth: 6
toc_float: true
df_print: kable
md_document:
toc: true
toc_depth: 6
df_print: kable
word_document: default
---
```{r echo=FALSE, warning=FALSE}
source(file = "../helpers.R")
source(file = "./common-funcs.R")
source(file = "../models/modelsR6.R")
source(file = "../models/BTW.R")
library(ggplot2)
library(ggpubr)
```
# Boundary Time Warping (final)
Note that we have abandoned the previous BTW notebook (`boundary-time-warping_new.Rmd`), as we want to attempt a final reformulation of BTW using the initial translate-and-scale approach. The reason behind this is simple: in our previous approach we reduced the required parameters to only one, $\tau$. While that simplifies many things, it makes it impossible to deduce the gradient-delta required for adjusting the boundaries. So in this notebook, we return to the initial approach and make an explicit formulation.
Our formulation here is as explicit as possible, to avoid any confusion. The overall model
* subdivides the query-signal into a set of intervals, $\mathcal{I}$ (indexed by $q$), and
* in each interval there are __two__ parameters, the absolute offset (begin) and the end of it.
The translation and scale parameters result from putting the original- and query-interval into relation. The original interval is the one the query-signal was divided into initially; it is constant, never changed, and given by $b_{\text{start}}^{\text{org}},b_{\text{end}}^{\text{org}}$ (or short: $b_{o_s},b_{o_e}$). The query-interval is the one delimited by the hyperparameters, given by $b_{\text{start}}^{\text{query}},b_{\text{end}}^{\text{query}}$ (short: $b_{q_s},b_{q_e}$). Note that from here on, "start", "end", "org" and "query" will often be denoted by "s", "e", "o" and "q", respectively.
## Formal description of the model
The model is now much simpler, and we follow the structure of the original notebook here. Note that almost all of the remarks of the previous notebook apply; only the way we calculate BTW has changed.
Given a vector of boundaries, $\bm{\theta}_b$, the first step is to establish the two vectors of absolute start- and end-parameters, i.e.,
$$
\begin{aligned}
\bm{\theta}\;\dots&\;\text{ordered vector of boundaries, such that}
\\[0ex]
&\;\bm{\theta}_j\prec\bm{\theta}_{j+1}\;\text{,}
\\[0ex]
i=&\;\lvert\,\bm{\theta}\,\rvert\;\text{, the length of that vector,}
\\[1ex]
\bm{\theta}^{\text{start}}=&\;\{\bm{\theta}_{1},\;\dots\;,\bm{\theta}_{i-1}\}\;\text{(all but the last boundary),}
\\[0ex]
\bm{\theta}^{\text{end}}=&\;\{\bm{\theta}_{2},\;\dots\;,\bm{\theta}_{i}\}\;\text{(all but the first boundary),}
\\[1ex]
q\in Q;\;Q=&\;\{1,\;\dots\;,i-1\}\;\text{, where}\;Q\;\text{is ordered low to high, i.e.,}\;q_i\prec q_{i+1}\text{,}
\\[0ex]
&\;\mathcal{I}\;\text{, the set of intervals, of cardinality}\;\max{Q}\text{, with each interval delimited by}
\\[0ex]
I_q=&\;\Big[\,\bm{\theta}_q^{\text{start}},\bm{\theta}_q^{\text{end}}\,\Big)\;\text{, resp.}
\\[0ex]
I_{q}=&\;\Big[\,\bm{\theta}_q^{\text{start}},\bm{\theta}_q^{\text{end}}\,\Big]\;\text{, for the last interval (}q=\max{Q}\text{).}
\end{aligned}
$$
Our overall model hence is a __Multi-task learning__ model. The relation between intervals (and hence the sub-models for each) is:
* Each interval $I_q,\;q<\max{Q}$ is defined as $I_q=[\bm{\theta}_q^s,\bm{\theta}_q^e\,)$, i.e., it does __not__ include the end-boundary. Each last interval, i.e., $I_q,\;q=\max{Q}$ is defined as $I_q=[\bm{\theta}_q^s,\bm{\theta}_q^e\,]$, i.e., it includes __both__ start- and end-boundary.
* That leads to two consecutive intervals, $I_q,I_{q+1}$, having no space between them: the end-boundary of interval $q$ and the start-boundary of interval $q+1$ coincide, $\bm{\theta}_q^e = \bm{\theta}_{q+1}^s$, and due to the half-open definition the shared boundary belongs to $I_{q+1}$ only.
* This implies that for all but the last interval, the end-boundary $\bm{\theta}_q^e$ carries no information beyond $\bm{\theta}_{q+1}^s$. This is important later for the gradient calculations, as we do have partial derivatives for $\bm{\theta}_q^s,\bm{\theta}_q^e$, but only need to calculate $\partial\,\bm{\theta}_q^e$ for the last interval, while we calculate $\partial\,\bm{\theta}_q^s$ for all intervals.
The last property is how we exploit Boundary Time Warping as a Multi-task learning model. The overall model, $\mathsf{M}$, produces $\hat{\bm{y}}$ piece-wise and then concatenates the pieces, where each piece is an interval together with its translation and scaling.
Now recall that signals are represented as functions, and for the query-signal, we'll also have the derivative and optionally the 2nd derivative:
$$
\begin{aligned}
r(x) &= \dots\;\text{, a function over the reference signal,}
\\
f(x),f\prime(x),f\prime\prime(x) &= \dots\;\text{, functions over the query signal (and its 1st and 2nd derivatives).}
\end{aligned}
$$
There may be cases where the derivatives of $f$ can be determined analytically. If that is not the case, one may choose to compute a numerical gradient by, e.g., a finite difference approximation. Formally, each sub-model $m_q$ and the overall model $\mathsf{M}$ are defined as:
$$
\begin{aligned}
\bm{\theta}^s,\bm{\theta}^e,\bm{\vartheta}^s,\bm{\vartheta}^e\;\dots&\;\text{ordered sets (same length) of start- and end original- and query-boundaries,}
\\[1ex]
\mathcal{X}^r=&\;[\min{\bm{\theta}^s}\,,\,\max{\bm{\theta}^e}]\;\text{, the support of the reference-signal,}
\\[1ex]
\mathcal{X}^q=&\;[\min{\bm{\vartheta}^s}\,,\,\max{\bm{\vartheta}^e}]\;\text{, the support of the query-signal,}
\\[1ex]
\mathbf{x}_q\subset\mathcal{X}^q=&\;[\bm{\vartheta}^s_q\,,\,\bm{\vartheta}^e_q]\text{, proper sub-support for model}\;m_q\;\text{and its interval}\;I_q\text{, such that}
\\[1ex]
\mathbf{x}_q\prec&\;\mathbf{x}_{q+1}\;\text{, i.e., sub-supports are ordered,}
\\[1ex]
m_q(b_{o_s}, b_{o_e}, b_{q_s}, b_{q_e},\;f,x)=&\;f\Bigg(\frac{(x - b_{q_s})\times(b_{o_e} - b_{o_s})}{b_{q_e}-b_{q_s}} + b_{o_s}\Bigg)\;\text{, model for interval}\;q\text{,}
\\[1ex]
\mathbf{y}=&\;r(\mathcal{X}^r)\;\text{, the reference-signal,}
\\[1ex]
\hat{\mathbf{y}}=&\;\Big\{\;m_1(\dots)\,\frown\,\dots\,\frown m_q(\bm{\theta}^s_q,\bm{\theta}^e_q,\bm{\vartheta}^s_q,\bm{\vartheta}^e_q,f,\mathbf{x}_q)\;\Big\},\;\forall\,q\in Q
\\
&\;\text{(concatenate all models' results), and finally}
\\[1ex]
\mathsf{M}(\bm{\theta}^s,\bm{\theta}^e,\bm{\vartheta}^s,\bm{\vartheta}^e, r,f)=&\;\Big[\,\mathbf{y}^\top,\hat{\mathbf{y}}^\top\Big]\;\text{, compute the reference- and transformed query-signal.}
\end{aligned}
$$
In some of the subsequent computations we require the 1st and 2nd derivative of $f$, both of which can be supplied to $\mathsf{M}$ instead of $f$ for computing $\hat{\mathbf{y}}\prime$ and $\hat{\mathbf{y}}\prime\prime$, respectively.
Now the problem we previously faced was to get from a set of boundaries to a (finite) set of model-parameters, and back again. In our previous notebook it turned out that for $\bm{\tau}$ no direct path back exists. So before we move any further, we need to fully describe both ways.
$$
\begin{aligned}
&\;b_{q_s}, b_{q_e}\;\text{are the start- and end-boundaries delimiting the interval}\;q\text{.}
\end{aligned}
$$
Let's use our ubiquitous example of scaling and translating a portion of the sine signal again:
```{r}
b_os <- .75
b_oe <- 1
b_qs <- .4
b_qe <- .7
temp <- sin(seq(0, pi * 2/3, length.out = 1e2))
tempf <- approxfun(x = seq(0, 1, length.out = 1e2), y = temp)
curve(tempf, 0, 1)
abline(v = .75)
abline(v = 1)
```
The goal is to determine $s,t$ from the boundary-delimited interval $[0.75,1]$ (the two vertical lines above), if we were to move this interval to $[0.4, 0.7]$.
```{r}
s_q <- (b_qe - b_qs) / (b_oe - b_os)
t_q <- b_os - b_qs
# Note: despite its name, tempfPrime is not a derivative; it is the
# translated-and-scaled (warped) version of tempf over the query-interval.
tempfPrime <- function(x) {
  tempf((x - b_qs) / (b_qe - b_qs) * (b_oe - b_os) + b_os)
}
curve(tempfPrime, 0, 1)
abline(v = b_qs)
abline(v = b_qe)
```
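Going the other way, from $s,t$ back to the query-boundaries, follows directly from the two definitions above. A minimal sketch (the `_rec` names are ours, only for this check):
```{r}
# Recover the query-boundaries from the scale s_q, the translation t_q and
# the (constant) original interval; this should reproduce 0.4 and 0.7.
b_qs_rec <- b_os - t_q
b_qe_rec <- b_qs_rec + s_q * (b_oe - b_os)
c(b_qs = b_qs_rec, b_qe = b_qe_rec)
```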
## Gradient for the sub-models
Each sub-model can be defined as a function over the extents of the original- and query-interval:
$$
\begin{aligned}
m(b_{o_s}, b_{o_e}, b_{q_s}, b_{q_e},\;f,x) =&\;f\Big((x - b_{q_s})\times(b_{q_e}-b_{q_s})^{-1}\times(b_{o_e} - b_{o_s}) + b_{o_s}\Big)\;\text{, with}\;b_{o_s},b_{o_e}\;\text{being constant,}
\\
\delta_o =&\;b_{o_e}-b_{o_s}\;\text{,}\;m\;\text{becomes}
\\[1ex]
m(b_{o_s}, \delta_o, b_{q_s}, b_{q_e},\;f,x) =&\;f\Big((x - b_{q_s})\times\frac{\delta_o}{b_{q_e}-b_{q_s}} + b_{o_s}\Big)\;\text{.}
\\[1ex]
\nabla\,m(b_{o_s}, \delta_o, b_{q_s}, b_{q_e},\;f,x) =&\;\Bigg[\frac{\partial\,m}{\partial\,b_{q_s}}\;,\;\frac{\partial\,m}{\partial\,b_{q_e}}\Bigg]\;\text{,}
\\[1ex]
=&\;\Bigg[-\frac{\delta_o\times(b_{q_e}-x)\times f'\Big(\frac{\delta_o\times(b_{q_s}-x)}{b_{q_s}-b_{q_e}}+b_{o_s}\Big)}{\big(b_{q_e}-b_{q_s}\big)^2}\;,\;\frac{\delta_o\times(b_{q_s}-x)\times f'\Big(\frac{\delta_o\times(b_{q_s}-x)}{b_{q_s}-b_{q_e}}+b_{o_s}\Big)}{\big(b_{q_e}-b_{q_s}\big)^2}\Bigg]\;\text{,}
\\[1ex]
=&\;\frac{\delta_o\times f'\Big(\frac{\delta_o\times(b_{q_s}-x)}{b_{q_s}-b_{q_e}}+b_{o_s}\Big)}{\big(b_{q_e}-b_{q_s}\big)^2}\times\Bigg[x-b_{q_e}\;,\;b_{q_s}-x\;\Bigg]\;\text{.}
\end{aligned}
$$
So we have a very pleasing gradient and a clear path between the pairs of boundaries and the models' parameters. In fact, we did not derive extra parameters; we plug in the boundaries directly. Now, for example, to calculate the required change for $b_{q_s}$, we simply compute the partial derivative for it. That, however, needs to be done when we plug $\mathsf{M}$ into a cost-function. Again, we will be using the residual sum of squares (RSS) for that.
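Before the demonstration, a quick sanity check may help. The following is only a sketch (the helpers `m_q_demo`, `grad_m_q_demo` and `tempf_p` are hypothetical names, not part of the model code): it compares the analytic partial derivatives of a single sub-model against central finite differences, using the sine example and boundaries from above. Both rows should agree up to numerical error.
```{r}
# A single sub-model over the example boundaries from above.
m_q_demo <- function(b_qs, b_qe, x) {
  tempf((x - b_qs) / (b_qe - b_qs) * (b_oe - b_os) + b_os)
}
# The analytic gradient as derived above, using a supplied derivative f_p.
grad_m_q_demo <- function(b_qs, b_qe, x, f_p) {
  delta_o <- b_oe - b_os
  psi <- delta_o * (b_qs - x) / (b_qs - b_qe) + b_os
  fac <- delta_o * f_p(psi) / (b_qe - b_qs)^2
  c(d_b_qs = fac * (x - b_qe), d_b_qe = fac * (b_qs - x))
}
# Numerical derivative of the interpolated query-signal.
tempf_p <- function(x) (tempf(x + 1e-5) - tempf(x - 1e-5)) / 2e-5
x0 <- .55
h <- 1e-6
rbind(
  analytic = grad_m_q_demo(b_qs, b_qe, x0, tempf_p),
  numeric = c(
    (m_q_demo(b_qs + h, b_qe, x0) - m_q_demo(b_qs - h, b_qe, x0)) / (2 * h),
    (m_q_demo(b_qs, b_qe + h, x0) - m_q_demo(b_qs, b_qe - h, x0)) / (2 * h)))
```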
Let's show a short demonstration:
```{r echo=FALSE}
btwRef <- data.frame(
x = seq(0, 1, length.out = 1e3),
y = sin(seq(0, 2 * pi, length.out = 1e3))
)
btwQueryBounds <- c(.1, .2, .5, .6, .8)
btwQuery <- data.frame(
x = c(
seq(0, btwQueryBounds[1], length.out = 75),
seq(btwQueryBounds[1], btwQueryBounds[2], length.out = 25),
seq(btwQueryBounds[2], btwQueryBounds[3], length.out = 150),
seq(btwQueryBounds[3], btwQueryBounds[4], length.out = 300),
seq(btwQueryBounds[4], btwQueryBounds[5], length.out = 50),
seq(btwQueryBounds[5], 1, length.out = 400)
),
y = btwRef$y
)
plotBtw <- function(df, bounds = c()) {
g <- ggplot(data = df, aes(x = x, y = y)) + theme_light() + geom_line()
for (i in seq_along(bounds)) {
g <- g + geom_vline(xintercept = bounds[i])
}
g
}
signal_ref <- stats::approxfun(x = btwRef$x, y = btwRef$y, ties = mean)
signal_query <- stats::approxfun(x = btwQuery$x, y = btwQuery$y, ties = mean)
query_bounds <- seq(0, 1, by = 0.1)
```
```{r}
temp <- M_final(
theta_b_org = query_bounds,
theta_b = query_bounds,
r = signal_ref,
f = signal_query)
# We know how the reference signal was distorted previously. Let's make
# a test where we manually undo this by moving the boundaries.
temp2_bounds <- c(
0,
.075, .1, .15, .2, .25,
.55, .575, .6, .8, 1)
temp2 <- M_final(
theta_b_org = query_bounds,
theta_b = temp2_bounds,
r = signal_ref,
f = signal_query)
ggarrange(
ncol = 1,
plotBtw(data.frame(x = temp$X, y = temp$y)),
plotBtw(data.frame(x = temp$X, y = temp$y_hat), bounds = query_bounds),
plotBtw(data.frame(x = temp2$X, y = temp2$y_hat), bounds = temp2_bounds)
)
```
Great, this model works! The third plot is a manually corrected version of the distorted query signal shown in the second plot. The goal of `BTW` is to provide a differentiable algorithm that optimizes the boundaries such that a cost, usually a distortion, is minimized.
## Compute the loss
For the scope of this notebook we will mostly stick with the residual sum of squares loss function.
$$
\begin{aligned}
m, \mathbf{x}\;\dots&\;\text{a given sub-model and its sub-support (for brevity, we omit}\;q\text{),}
\\[1ex]
\text{RSS}_m =&\;\sum_{i=1}^{\lvert\mathbf{x}\rvert}\;\big(\mathbf{y}_i - m(b_{o_s}, b_{o_e}, b_{q_s}, b_{q_e},\;f,\mathbf{x}_i)\big)^2\;\text{, recall}\;\delta_o\;\text{as}
\\[1ex]
\delta_o =&\;b_{o_e}-b_{o_s}\;\text{,}
\\[1ex]
\nabla\,\text{RSS}_m =&\;\Bigg[\frac{\partial\,\text{RSS}_m}{\partial\,b_{q_s}}\;,\;\frac{\partial\,\text{RSS}_m}{\partial\,b_{q_e}}\Bigg]\;\text{,}
\\[1ex]
=&\;\sum_{i=1}^{\lvert\mathbf{x}\rvert}\;\frac{2\,\delta_o\times\Big(\mathbf{y}_i-f\Big(\frac{\delta_o\times(b_{q_s}-\mathbf{x}_i)}{b_{q_s}-b_{q_e}}+b_{o_s}\Big)\Big)\times f'\Big(\frac{\delta_o\times(b_{q_s}-\mathbf{x}_i)}{b_{q_s}-b_{q_e}}+b_{o_s}\Big)}{\big(b_{q_s}-b_{q_e}\big)^2}\times\bigg[b_{q_e}-\mathbf{x}_i\;,\;\mathbf{x}_i-b_{q_s}\bigg]\;\text{,}
\\[1ex]
=&\;\sum_{i=1}^{\lvert\mathbf{x}\rvert}\;\frac{2\,\delta_o\times\Big(\mathbf{y}_i-f\big(\psi_i\big)\Big)\times f'\big(\psi_i\big)}{\big(b_{q_s}-b_{q_e}\big)^2}\times\bigg[b_{q_e}-\mathbf{x}_i\;,\;\mathbf{x}_i-b_{q_s}\bigg]\;\text{, using}\;\psi_i\;\text{as}
\\[1ex]
\psi_i =&\;\frac{\delta_o\times(b_{q_s}-\mathbf{x}_i)}{b_{q_s}-b_{q_e}}+b_{o_s}\;\text{.}
\end{aligned}
$$
We now have the gradient of the loss (or cost) function.
Note: As in the previous attempts, recall that an interval is defined such that it includes the starting boundary but excludes the ending boundary, except for the last interval, which includes both, i.e., $I_{\forall\,q<\max{Q}}=[b_{q_s}, b_{q_e})$ and $I_{\max{Q}}=[b_{q_s}, b_{q_e}]$. That means that for all but the last sub-model, we will compute the gradient only for $\partial\,b_{q_s}$, but not $\partial\,b_{q_e}$, as that would be redundant (and slightly wrong).
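To make this concrete, a minimal numeric sketch (the name `rss_demo` is ours, not part of the model code): the plain RSS between the reference and the un-corrected transformed query, computed directly from the output of `M_final` above.
```{r}
# RSS over the entire support, i.e., summed over all sub-models.
rss_demo <- sum((temp$y - temp$y_hat)^2)
rss_demo
```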
### Loss with logarithm
It turns out we made a mistake by not taking the logarithm of the loss function, because the two regularizers do so. We therefore need to define an alternate version that uses the logarithm, which we do below. We add $1$ to the RSS before taking the logarithm, such that the result will always be $\geq 0$.
$$
\begin{aligned}
\text{RSS}_{\mathsf{M}}^{\text{log}}=&\;\log{\Bigg(1+\sum_{q=1}^{\max{Q}}\sum_{i=1}^{\lvert\mathbf{x}_q\rvert}\;\big(\mathbf{y}_i - m_q(b_{o_s}, b_{o_e}, b_{q_s}, b_{q_e},\;f,\mathbf{x}_{q,i})\big)^2\Bigg)}\;\text{using}
\\[1ex]
\gamma_i=&\;\delta_o\times(\mathbf{x}_{q,i}-b_{q_s})\;\text{,}
\\[1ex]
\delta_q=&\;b_{q_e}-b_{q_s}\;\text{, }\;\psi_i\;\text{, and}
\\[1ex]
\psi_i=&\;\frac{\gamma_i}{\delta_q}+b_{o_s}\;\text{, the gradient now becomes:}
\\[1em]
\nabla\,\text{RSS}_{\mathsf{M}}^{\text{log}}=&\;\Bigg[\frac{\partial\,\text{RSS}_m^{\text{log}}}{\partial\,b_{q_s}}\;,\;\frac{\partial\,\text{RSS}_m^{\text{log}}}{\partial\,b_{q_e}}\Bigg]\;\text{,}
\\[1ex]
=&\;\sum_{q=1}^{\max{Q}}\sum_{i=1}^{\lvert\mathbf{x}_q\rvert}\;\Bigg[-1\times\frac{2\Big(\frac{\gamma_i}{{\delta_q}^2}-\frac{\delta_o}{\delta_q}\Big)\times\Big(\mathbf{y}_i-f\big(\psi_i\big)\Big)\times f'\big(\psi_i\big)}{1+\Big(\mathbf{y}_i-f(\psi_i)\Big)^2}\;,
\\[1ex]
&\;\;\;\;\;\;\;\;\;\;\;\;\frac{2\,\gamma_i\,\times\Big(\mathbf{y}_i-f\big(\psi_i\big)\Big)\times f'\big(\psi_i\big)}{{\delta_q}^2\times\Big(1+\big(\mathbf{y}_i-f(\psi_i)\big)^2\Big)}\Bigg]\;\text{, can be shortened to}
\\[1ex]
=&\;\sum_{q=1}^{\max{Q}}\sum_{i=1}^{\lvert\mathbf{x}_q\rvert}\;\frac{2\Big(\mathbf{y}_i-f\big(\psi_i\big)\Big)\times f'\big(\psi_i\big)}{1+\big(\mathbf{y}_i-f(\psi_i)\big)^2}\times\Bigg[-\bigg(\frac{\gamma_i}{{\delta_q}^2}-\frac{\delta_o}{\delta_q}\bigg)\;,\;\frac{\gamma_i}{{\delta_q}^2}\Bigg]\;\text{.}
\end{aligned}
$$
When combining losses, all of them should operate on the same scale. So, in our case, the regularizers return a logarithmic result, and with the above definition, we now have a logarithmic loss, too.
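Continuing the small sketch from above, the logarithmic version of the RSS lives on the same scale as the logarithmic regularizers that follow:
```{r}
# Plain vs. logarithmic RSS (rss_demo was computed in the sketch above).
c(RSS = rss_demo, RSS_log = log(1 + rss_demo))
```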
## Regularizing
We need to think briefly about regularizing such a model. I suggest we add two different regularizers: one that imposes a penalty on overlapping boundaries, and one that penalizes selecting too little of the data (i.e., when the sum of the lengths of all intervals is small).
### Penalize overlapping boundaries
For each boundary $b_i$, we need to ascertain that all subsequent boundaries are greater than $b_i$, i.e., $b_k,\,k>i,\,\forall\,b_k>b_i$. Similarly, all preceding boundaries must be less than $b_i$, i.e., $b_j,\,j<i,\,\forall\,b_j<b_i$.
The first kind of regularizer could be similar to this:
$$
\begin{aligned}
R_1(\bm{\theta},\bm{\sigma}) &=\Bigg(1+\sum_{i=1}^{\lvert\,\bm{\theta}\,\rvert-1}\,\sum_{j=i+1}^{\lvert\,\bm{\theta}\,\rvert}\;\mathcal{H}(\bm{\theta}_i-\bm{\theta}_j-\bm{\sigma}_i)\times(\bm{\theta}_i-\bm{\theta}_j-\bm{\sigma}_i)\Bigg)^{\sum_{i=1}^{\lvert\,\bm{\theta}\,\rvert-1}\,\sum_{j=i+1}^{\lvert\,\bm{\theta}\,\rvert}\;\mathcal{H}(\bm{\theta}_i-\bm{\theta}_j-\bm{\sigma}_i)}-1\;\text{,}
\\[1ex]
&\;\text{where}\;\mathcal{H}\;\text{is the Heaviside step function, and}
\\[0ex]
&\;\bm{\sigma}\;\text{is an optional vector holding the minimum distance between two boundaries.}
\end{aligned}
$$
This regularizer does two things: it adds up all violations of each boundary against each subsequent boundary, and then exponentiates the sum by the number of violations.
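A minimal sketch of this first regularizer (the helper `R1_demo` is ours, only for illustration), with the Heaviside step function implemented as an indicator:
```{r}
# Sum up all forward violations (a later boundary lying at or below an
# earlier one, minus a minimum distance sigma), then exponentiate by the
# number of violations.
R1_demo <- function(theta, sigma = rep(0, length(theta))) {
  H <- function(x) as.numeric(x > 0)
  viol <- 0
  n_viol <- 0
  for (i in 1:(length(theta) - 1)) {
    for (j in (i + 1):length(theta)) {
      d <- theta[i] - theta[j] - sigma[i]
      viol <- viol + H(d) * d
      n_viol <- n_viol + H(d)
    }
  }
  (1 + viol)^n_viol - 1
}
R1_demo(c(.1, .3, .2, .4))  # one overlapping pair --> positive penalty
R1_demo(c(.1, .2, .3, .4))  # properly ordered --> 0
```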
The first version does not take the second condition into account. We define an alternative that does:
$$
\begin{aligned}
\bm{\theta}\;\dots&\;\text{vector of boundaries,}
\\[0ex]
\mathcal{E}=&\;\text{matrix that defines pair-wise minimum distances between two boundaries,}
\\[1ex]
n=&\;\sum_{j=1}^{i-1}\mathcal{H}(\bm{\theta}_j-\bm{\theta}_i-\mathcal{E}_{i,j}) + \sum_{j=i+1}^{\lvert\,\bm{\theta}\,\rvert}\mathcal{H}(\bm{\theta}_i-\bm{\theta}_j-\mathcal{E}_{i,j})\;\text{,}
\\[0ex]
&\;\text{the number of mutual violations,}
\\[1ex]
k=&\;\sum_{j=1}^{i-1}\mathcal{H}(\bm{\theta}_j-\bm{\theta}_i-\mathcal{E}_{i,j})\times(\bm{\theta}_j-\bm{\theta}_i-\mathcal{E}_{i,j})
\\[0ex]
&\;+\;\sum_{j=i+1}^{\lvert\,\bm{\theta}\,\rvert}\mathcal{H}(\bm{\theta}_i-\bm{\theta}_j-\mathcal{E}_{i,j})\times(\bm{\theta}_i-\bm{\theta}_j-\mathcal{E}_{i,j})\;\text{,}
\\[0ex]
&\;\text{the sum of the degrees of all mutual violations,}
\\[1ex]
R_1^{\text{alt}}(k,n)=&\;(1+k)^n-1\;\text{, and its gradient}
\\[1ex]
\nabla\,R_1^{\text{alt}}(k,n)=&\;0\;\text{, as}
\\[1ex]
\nabla\,\mathcal{H}(x)=&\;\mathcal{H}'(x)\times x'\;\text{and}\;\mathcal{H}'(x)=0\;\text{.}
\end{aligned}
$$
This regularizer is similar to the first alternative. However, it also considers backward (hence mutual) violations. Also note that the gradient for $R_1$ is always $0$, as the derivative of the Heaviside step function is $0$, thus introducing a multiplier of $0$.
For inclusion in our loss function later, we need to reformulate $R_1^{\text{alt}}$ using $b_{q_s},b_{q_e}$. However, we also simplify the $R_1$ loss and define:
$$
\begin{aligned}
\bm{\vartheta}^s\prime,\bm{\vartheta}^e\prime\;\dots&\;\text{ordered versions of the query start-/end boundaries,}
\\[1ex]
\lambda\;\dots&\;\text{constant that allows by how much an order may be violated,}
\\[1ex]
\mathcal{H},\mathcal{R}=&\;\text{the Heaviside and Ramp functions,}
\\[1ex]
\bm{\phi}=&\;\mathcal{H}(\mathcal{R}(\bm{\vartheta}^s-\bm{\vartheta}^s\prime-\lambda))\times\mathcal{R}(\bm{\vartheta}^s-\bm{\vartheta}^s\prime-\lambda)
\\[1ex]
&\;+\mathcal{H}(\mathcal{R}(\bm{\vartheta}^e\prime-\bm{\vartheta}^e-\lambda))\times\mathcal{R}(\bm{\vartheta}^e\prime-\bm{\vartheta}^e-\lambda)\;\text{, reformulated using}\;b_{q_s},b_{q_e}\text{:}
\\[1ex]
=&\;\mathcal{H}(\mathcal{R}(b_{q_s}-\bm{\vartheta}^s_q\prime-\lambda))\times\mathcal{R}(b_{q_s}-\bm{\vartheta}^s_q\prime-\lambda)
\\[1ex]
&\;+\mathcal{H}(\mathcal{R}(\bm{\vartheta}^e_q\prime-b_{q_e}-\lambda))\times\mathcal{R}(\bm{\vartheta}^e_q\prime-b_{q_e}-\lambda)\;\text{,}
\\[1ex]
R_1^{\text{alt}}\prime(\bm{\vartheta}^s,\bm{\vartheta}^e,\bm{\vartheta}^s\prime,\bm{\vartheta}^e\prime)=&\;\Big(1+\big[\Sigma\,\bm{\phi}\big]\Big)^{\big[\Sigma\,\mathcal{H}(\bm{\phi})\big]}-1\;\text{.}
\end{aligned}
$$
It turns out that we can detect "forward"-violations by subtracting the (potentially unordered) $\bm{\vartheta}$ from its ordered version, $\bm{\vartheta}\prime$, and "backward"-violations by interchanging the summands. Then we can subtract an optional offset, $\lambda$, that allows some degree of violation (typically this is $0$). Any value remaining that is greater than $0$ will remain so after we pass the resulting vector through the Ramp function. The resulting vector, $\bm{\phi}$, can now be used in two ways: first, we can sum up all violations; second, we exponentiate that sum by the number of violations.
Hints: We assume all boundaries to be positive. The first term is $1+\dots$ to ascertain that we do not exponentiate values $<1$. We finally subtract $1$, as this regularizer otherwise has a lower bound of $1$ (for the case of no violations we exponentiate with $0$). The above expression for $\bm{\phi}$ uses the Ramp function to retain violations, and a violation is always positive. This is multiplied with the Heaviside vector of it, such that forward-/backward-violations cannot cancel each other out.
The gradient for this version is also $0$, as the exponent will become $0$ through the derivative of the Heaviside step function.
Consider the following example in `R`:
```{r}
# R: the Ramp function; H: the Heaviside step function (both element-wise).
R <- function(x) {
  x[x < 0] <- 0
  x
}
H <- function(x) {
  x[x < 0] <- 0
  x[x > 0] <- 1
  x
}
vt <- c(1, 3, 2, 4, 6, 7, 5, 8, 9) / 10
vtp <- sort(vt)
vt - vtp
vtp - vt
phi <- H(R(vt - vtp)) * R(vt - vtp) + H(R(vtp - vt)) * R(vtp - vt)
phi
sum(phi)
sum(H(phi))
((1 + sum(phi))^sum(H(phi))) - 1
```
While this alternative regularizer works, it would be great if we could formulate it similarly to the second regularizer, using a logarithm and a bounded expression. Given the previous example, it appears that we can compute an upper bound for the sum (and number) of violations: the worst possible boundary constellation is reached when we compare the ordered vector against its ordered reverse, and vice versa (for mutual violations).
$$
\begin{aligned}
\overleftarrow{\bm{\vartheta}^s\prime},\overleftarrow{\bm{\vartheta}^e\prime}\;\dots&\;\text{reversed and ordered versions of the query start-/end boundaries,}
\\[1ex]
\mathbf{p}(\mathbf{v}_1,\mathbf{v_2},\lambda)=&\;\mathcal{H}(\mathcal{R}(\mathbf{v}_1-\mathbf{v}_2-\lambda))\times \mathcal{R}(\mathbf{v}_1-\mathbf{v}_2-\lambda)
\\[0ex]
&\;+\mathcal{H}(\mathcal{R}(\mathbf{v}_2-\mathbf{v}_1-\lambda))\times\mathcal{R}(\mathbf{v}_2-\mathbf{v}_1-\lambda)\;\text{,}
\\[1ex]
&\;\text{the mutual difference between vectors}\;\mathbf{v}_1,\mathbf{v}_2,
\\[1ex]
\mathbf{v}_s\;,\;\mathbf{v}_e=&\;\mathbf{p}(\bm{\vartheta}^s,\bm{\vartheta}^s\prime,\;k)\;,\;\mathbf{p}(\bm{\vartheta}^e,\bm{\vartheta}^e\prime,\;k)\;\text{, (}k\;\text{is an arbitrary value),}
\\[1ex]
\mathbf{u}_s\;,\;\mathbf{u}_e=&\;\mathbf{p}(\bm{\vartheta}^s\prime,\overleftarrow{\bm{\vartheta}^s\prime},\;k)\;,\;\mathbf{p}(\bm{\vartheta}^e\prime,\overleftarrow{\bm{\vartheta}^e\prime},\;k)\;\text{,}
\\[1ex]
&\;\text{vectors for violations (}\mathbf{v}_{s,e}\text{) and upper bounds (}\mathbf{u}_{s,e}\text{),}
\\[1ex]
\bm{\phi}_v\;,\;\bm{\phi}_u=&\;\mathbf{v}_s+\mathbf{v}_e\;,\;\mathbf{u}_s+\mathbf{u}_e\;\text{,}
\\[1ex]
R_1^{\text{new}}=&\;-\log{\bigg(1-\frac{\bm{\phi}_v}{\bm{\phi}_u}\bigg)}\;\text{, fully-expanded, using}\;b_{q_s},b_{q_e}\;\text{:}
\\[1ex]
=&\;-\log{\Bigg(1-\frac{\sum_{q=1}^{\max{Q}}\,\begin{matrix}\;\;\;\mathcal{H}(\mathcal{R}(b_{q_s}-\bm{\vartheta}^s_q\prime-\lambda))\times \mathcal{R}(b_{q_s}-\bm{\vartheta}^s_q\prime-\lambda)&\\+\;\mathcal{H}(\mathcal{R}(\bm{\vartheta}^s_q\prime-b_{q_s}-\lambda))\times\mathcal{R}(\bm{\vartheta}^s_q\prime-b_{q_s}-\lambda)&\\+\;\mathcal{H}(\mathcal{R}(b_{q_e}-\bm{\vartheta}^e_q\prime-\lambda))\times \mathcal{R}(b_{q_e}-\bm{\vartheta}^e_q\prime-\lambda)&\\+\;\mathcal{H}(\mathcal{R}(\bm{\vartheta}^e_q\prime-b_{q_e}-\lambda))\times\mathcal{R}(\bm{\vartheta}^e_q\prime-b_{q_e}-\lambda)\end{matrix}}{\sum_{q=1}^{\max{Q}}\,\begin{matrix}\;\;\;\mathcal{H}(\mathcal{R}(\overleftarrow{\bm{\vartheta}^s_q\prime}-\bm{\vartheta}^s_q\prime-\lambda))\times \mathcal{R}(\overleftarrow{\bm{\vartheta}^s_q\prime}-\bm{\vartheta}^s_q\prime-\lambda)&\\+\;\mathcal{H}(\mathcal{R}(\bm{\vartheta}^s_q\prime-\overleftarrow{\bm{\vartheta}^s_q\prime}-\lambda))\times\mathcal{R}(\bm{\vartheta}^s_q\prime-\overleftarrow{\bm{\vartheta}^s_q\prime}-\lambda)&\\+\;\mathcal{H}(\mathcal{R}(\overleftarrow{\bm{\vartheta}^e_q\prime}-\bm{\vartheta}^e_q\prime-\lambda))\times \mathcal{R}(\overleftarrow{\bm{\vartheta}^e_q\prime}-\bm{\vartheta}^e_q\prime-\lambda)&\\+\;\mathcal{H}(\mathcal{R}(\bm{\vartheta}^e_q\prime-\overleftarrow{\bm{\vartheta}^e_q\prime}-\lambda))\times\mathcal{R}(\bm{\vartheta}^e_q\prime-\overleftarrow{\bm{\vartheta}^e_q\prime}-\lambda)\end{matrix}}\Bigg)}\;\text{.}
\end{aligned}
$$
However large the above definition, we can (and should!) break it down significantly before deriving the gradient. First of all, the entire denominator becomes constant the moment the query-boundaries are known, so we can pre-compute it and replace it again with $\bm{\phi}_u$. Second, subtracting some $\lambda$ can also be pre-computed, which cleans things up further:
$$
\begin{aligned}
\text{Using}\;\bm{\vartheta}^{s,\lambda}_q\prime,\bm{\vartheta}^{e,\lambda}_q\prime\;\dots&\;\text{vectors that already subtracted some}\;\lambda\text{,}
\\[1ex]
R_1^{\text{new}}=&\;\sum_{q=1}^{\max{Q}}-\log{\Bigg(1-\frac{
\begin{matrix}
\;\;\;\mathcal{H}(\mathcal{R}(b_{q_s}-\bm{\vartheta}^{s,\lambda}_q\prime))\times \mathcal{R}(b_{q_s}-\bm{\vartheta}^{s,\lambda}_q\prime)
&\\+\mathcal{H}(\mathcal{R}(\bm{\vartheta}^{s,\lambda}_q\prime-b_{q_s}))\times\mathcal{R}(\bm{\vartheta}^{s,\lambda}_q\prime-b_{q_s})
&\\+\mathcal{H}(\mathcal{R}(b_{q_e}-\bm{\vartheta}^{e,\lambda}_q\prime))\times \mathcal{R}(b_{q_e}-\bm{\vartheta}^{e,\lambda}_q\prime)
&\\+\mathcal{H}(\mathcal{R}(\bm{\vartheta}^{e,\lambda}_q\prime-b_{q_e}))\times\mathcal{R}(\bm{\vartheta}^{e,\lambda}_q\prime-b_{q_e})
\end{matrix}
}{\bm{\phi}_{u_q}}\Bigg)}\;\text{, note that}
\\[1ex]
\frac{\partial\,\mathcal{R}}{\partial\;x}=&\;\mathcal{H}\;\text{, and}\;\frac{\partial\,\mathcal{H}}{\partial\,x}=0\;\text{, and the gradient will become}
\\[1ex]
\nabla\,R_1^{\text{new}}=&\;\Bigg[\frac{\partial\,R_1^{\text{new}}}{\partial\,b_{q_s}}\;,\;\frac{\partial\,R_1^{\text{new}}}{\partial\,b_{q_e}}\Bigg]\;\text{,}
\\[1ex]
=&\;\sum_{q=1}^{\max{Q}}\Bigg[\frac{-\mathcal{H}(\mathcal{R}(\bm{\vartheta}^{s,\lambda}_q\prime-b_{q_s}))\times\mathcal{H}(\bm{\vartheta}^{s,\lambda}_q\prime-b_{q_s})+\mathcal{H}(\mathcal{R}(b_{q_s}-\bm{\vartheta}^{s,\lambda}_q\prime))\times\mathcal{H}(b_{q_s}-\bm{\vartheta}^{s,\lambda}_q\prime)}{
\begin{matrix}
\bm{\phi}_{u_q}-\Big(\mathcal{R}(\bm{\vartheta}^{s,\lambda}_q\prime-b_{q_s})\times\mathcal{H}(\mathcal{R}(\bm{\vartheta}^{s,\lambda}_q\prime-b_{q_s}))+\mathcal{R}(b_{q_s}-\bm{\vartheta}^{s,\lambda}_q\prime)
&\\\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\times\;\mathcal{H}(\mathcal{R}(b_{q_s}-\bm{\vartheta}^{s,\lambda}_q\prime))+\mathcal{R}(\bm{\vartheta}^{e,\lambda}_q\prime-b_{q_e})\times\mathcal{H}(\mathcal{R}(\bm{\vartheta}^{e,\lambda}_q\prime-b_{q_e}))
&\\+\mathcal{R}(b_{q_e}-\bm{\vartheta}^{e,\lambda}_q\prime)\times\mathcal{H}(\mathcal{R}(b_{q_e}-\bm{\vartheta}^{e,\lambda}_q\prime))\Big)\end{matrix}}\;\text{,}
\\[1ex]
&\;\frac{-\mathcal{H}(\mathcal{R}(\bm{\vartheta}^{e,\lambda}_q\prime-b_{q_e}))\times\mathcal{H}(\bm{\vartheta}^{e,\lambda}_q\prime-b_{q_e})+\mathcal{H}(\mathcal{R}(b_{q_e}-\bm{\vartheta}^{e,\lambda}_q\prime))\times\mathcal{H}(b_{q_e}-\bm{\vartheta}^{e,\lambda}_q\prime)}{
\begin{matrix}
\bm{\phi}_{u_q}-(\mathcal{R}(\bm{\vartheta}^{s,\lambda}_q\prime-b_{q_s})\times \mathcal{H}(\mathcal{R}(\bm{\vartheta}^{s,\lambda}_q\prime-b_{q_s}))+\mathcal{R}(b_{q_s}-\bm{\vartheta}^{s,\lambda}_q\prime)
&\\\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\times \mathcal{H}(\mathcal{R}(b_{q_s}-\bm{\vartheta}^{s,\lambda}_q\prime))+\mathcal{R}(\bm{\vartheta}^{e,\lambda}_q\prime-b_{q_e})\times \mathcal{H}(\mathcal{R}(\bm{\vartheta}^{e,\lambda}_q\prime-b_{q_e}))
&\\+\mathcal{R}(b_{q_e}-\bm{\vartheta}^{e,\lambda}_q\prime)\times \mathcal{H}(\mathcal{R}(b_{q_e}-\bm{\vartheta}^{e,\lambda}_q\prime)))
\end{matrix}
}\Bigg]\;\text{.}
\end{aligned}
$$
Well, at least the denominators are the same for the partial derivatives. Note that we have already removed all summands that would be zero due to them being multiplied with the derivative of the Heaviside step function, which is $0$.
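A minimal numeric sketch of $R_1^{\text{new}}$ itself (not its gradient), reusing `R()`, `H()`, `vt` and `vtp` from the example above; for simplicity we use a single boundary vector and $\lambda=0$:
```{r}
# Violations of the unordered vector against its ordered version, and the
# upper bound: the ordered vector against its reverse.
phi_v <- H(R(vt - vtp)) * R(vt - vtp) + H(R(vtp - vt)) * R(vtp - vt)
vtr <- rev(vtp)
phi_u <- H(R(vtp - vtr)) * R(vtp - vtr) + H(R(vtr - vtp)) * R(vtr - vtp)
-log(1 - sum(phi_v) / sum(phi_u))
```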
### Penalize supports that are too short
The second kind of regularizer would, in its simplest form, put the sum of the lengths of the selected query-intervals into relation with the total length of the original interval:
$$
\begin{aligned}
R_2(\bm{\theta}^s,\bm{\theta}^e,\bm{\vartheta}^s,\bm{\vartheta}^e) =&\;-\log\Bigg(\Bigg[\sum_{i=1}^{\lvert\,\bm{\vartheta}^s\,\rvert}\,\bm{\vartheta}^e_i-\bm{\vartheta}^s_i\Bigg]\;\bigg/\;\Bigg[\sum_{i=1}^{\lvert\,\bm{\theta}^s\,\rvert}\,\bm{\theta}^e_i-\bm{\theta}^s_i\Bigg]\Bigg)
\end{aligned}
$$
It is important to recall that the query-support cannot extend beyond the original-support in any direction, resulting in the first partial sum being less than or equal to the second partial sum, hence the ratio always lies within $[0,1]$.
The gradient for this regularizer is not $0$. It can further be simplified as we do know the length of the reference-support, which is constant:
$$
\begin{aligned}
\eta=&\;\max{\mathcal{X}^r}-\min{\mathcal{X}^r}\;\text{, the extent of the reference-interval,}
\\[1ex]
\equiv&\;\sum_{i=1}^{\lvert\,\bm{\theta}^s\,\rvert}\,\bm{\theta}^e_i-\bm{\theta}^s_i\;\text{, so}\;R_2\;\text{becomes}
\\[1ex]
R_2(\bm{\vartheta}^s,\bm{\vartheta}^e)=&\;-\log\Bigg(\Bigg[\sum_{i=1}^{\lvert\,\bm{\vartheta}^s\,\rvert}\,\bm{\vartheta}^e_i-\bm{\vartheta}^s_i\Bigg]\;\Big/\;\eta\Bigg)\;\text{, with}\;b_{q_s},b_{q_e}\text{:}
\\[1ex]
=&\;-\log\Bigg(\Bigg[\sum_{q=1}^{\max{Q}}\,b_{q_e}-b_{q_s}\Bigg]\;\Big/\;\eta\Bigg)\;\text{, with gradient}
\\[1ex]
\nabla\,R_2=&\;\Bigg[\frac{\partial\,R_2}{\partial\,\bm{\vartheta}^s_i}\;,\;\frac{\partial\,R_2}{\partial\,\bm{\vartheta}^e_i}\Bigg]\;\text{,}
\\[1ex]
=&\;\sum_{i=1}^{\lvert\,\bm{\vartheta}^s\,\rvert}\Bigg[\frac{1}{\bm{\vartheta}^e_i-\bm{\vartheta}^s_i}\;,\;\frac{1}{\bm{\vartheta}^s_i-\bm{\vartheta}^e_i}\Bigg]\;\text{, reformulated using}\;b_{q_s},b_{q_e}\;\text{:}
\\[1ex]
=&\;\sum_{q=1}^{\max{Q}}\Bigg[\frac{1}{b_{q_e}-b_{q_s}}\;,\;\frac{1}{b_{q_s}-b_{q_e}}\Bigg]\;\text{.}
\end{aligned}
$$
The last reformulation is important, as it makes the regularizer compatible with our existing formulations for the residual sum of squares loss function and its gradient.
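A minimal sketch of $R_2$ (the helper `R2_demo` is ours): with the example boundaries from above the query-intervals cover the whole reference, so the penalty is $0$, while a shrunken selection yields a positive penalty.
```{r}
# Negative log of the ratio between the summed query-interval lengths and
# the reference extent eta.
R2_demo <- function(vartheta_s, vartheta_e, eta = 1) {
  -log(sum(vartheta_e - vartheta_s) / eta)
}
R2_demo(head(query_bounds, -1), tail(query_bounds, -1))  # covers [0,1] --> 0
R2_demo(c(.1, .3), c(.2, .5))                            # covers only 0.3 --> > 0
```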
### Penalize extreme intervals
The second regularizer is very simple and straightforward, but it does not penalize extreme cases. For example, if all but one interval are of length zero and the remaining interval covers the whole reference, the second regularizer would not impose a penalty. Another extreme case we have observed is intervals of negative length. So not only do all intervals together need to cover a feasible range; the intervals also need to retain a somewhat healthy relation among themselves. At the same time, we must allow very short and very long intervals, but those should come with a cost that is proportional to their disagreement with the other intervals. One measure that describes all intervals is their mean length. The following regularizer sums up the squared differences from that mean. Note that we do not take the absolute value of $b_{q_e}-b_{q_s}$: as described previously, the optimization may choose intervals of negative length, and those will result in an even larger difference from the mean.
$$
\begin{aligned}
\mu=&\;\overline{\mathcal{I}}\;\text{, the mean length of all reference-intervals,}
\\[1ex]
=&\;\frac{\max{\mathcal{X}^r} - \min{\mathcal{X}^r}}{\lvert\,\bm{\theta}_b\,\rvert-1}
\\[1ex]
R_3=&\;\log\Bigg({1+\sum_{q=1}^{\max{Q}}}\;(b_{q_e}-b_{q_s}-\mu)^2\Bigg)\;\text{, with gradient}
\\[1ex]
\nabla\,R_3=&\;\Bigg[\frac{\partial\,R_3}{\partial\,b_{q_s}}\;,\;\frac{\partial\,R_3}{\partial\,b_{q_e}}\Bigg]\;\text{,}
\\[1ex]
=&\;\sum_{q=1}^{\max{Q}}\;\Bigg[-1\times\frac{2(b_{q_e}-b_{q_s}-\mu)}{1+(b_{q_e}-b_{q_s}-\mu)^2}\;,\;\frac{2(b_{q_e}-b_{q_s}-\mu)}{1+(b_{q_e}-b_{q_s}-\mu)^2}\Bigg]\;\text{, or shorter}
\\[1ex]
=&\;\sum_{q=1}^{\max{Q}}\;\frac{2(b_{q_e}-b_{q_s}-\mu)}{1+(b_{q_e}-b_{q_s}-\mu)^2}\times\Big[-1\;,\;1\Big]\;\text{.}
\end{aligned}
$$
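A minimal sketch of $R_3$ (the helper names are ours): squared deviations of the interval lengths from the mean reference-interval length, on a log scale.
```{r}
R3_demo <- function(b_qs, b_qe, mu) {
  log(1 + sum((b_qe - b_qs - mu)^2))
}
# The reference extent is [0, 1], divided into length(query_bounds) - 1 intervals.
mu_demo <- 1 / (length(query_bounds) - 1)
R3_demo(head(query_bounds, -1), tail(query_bounds, -1), mu_demo)  # equal lengths --> 0
R3_demo(head(temp2_bounds, -1), tail(temp2_bounds, -1), mu_demo)  # uneven lengths --> > 0
```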
### Constrain Box-bounds
Similar to DTW, the adjusted query must not extend beyond the reference signal, i.e., it must not be less or greater than the support of the reference. We introduce a 4th kind of regularizer that can be added to enforce these constraints. The idea is similar to the 3rd kind of regularizer: we will sum up all transgressions and return the logarithm of that sum. However, a transgression should be penalized based on its boundaries, not the interval delimited by them. In other words, we penalize every boundary according to whether it lies outside the box, as the other regularizers already take care of the other properties, such as whether the starting boundary comes after the end. Each boundary is checked for whether it is larger than (or equal to) the lower bound, and less than (or equal to) the upper bound:
$$
\begin{aligned}
\nu_l,\nu_u\;\dots&\;\text{the lower and upper box-bounds, respectively,}
\\[1ex]
\mathcal{R},\mathcal{H}\;\dots&\;\text{, the ramp- and Heaviside step functions,}
\\[1ex]
R_4=&\;\log{\Bigg(1+\sum_{q=1}^{\max{Q}}\;\mathcal{R}(\nu_l-b_{q_s})+\mathcal{R}(b_{q_s}-\nu_u)+\mathcal{R}(\nu_l-b_{q_e})+\mathcal{R}(b_{q_e}-\nu_u)\Bigg)}\;\text{, with gradient}
\\[1ex]
\nabla\,R_4=&\;\Bigg[\frac{\partial\,R_4}{\partial\,b_{q_s}}\;,\;\frac{\partial\,R_4}{\partial\,b_{q_e}}\Bigg]\;\text{,}
\\[1ex]
=&\;\sum_{q=1}^{\max{Q}}\;\Bigg[\frac{\mathcal{H}(b_{q_s}-\nu_u)-\mathcal{H}(\nu_l-b_{q_s})}{1+\mathcal{R}(\nu_l-b_{q_s})+\mathcal{R}(b_{q_s}-\nu_u)+\mathcal{R}(\nu_l-b_{q_e})+\mathcal{R}(b_{q_e}-\nu_u)}\;,
\\[1ex]
&\;\;\;\;\;\;\;\;\;\;\;\;\;\frac{\mathcal{H}(b_{q_e}-\nu_u)-\mathcal{H}(\nu_l-b_{q_e})}{1+\mathcal{R}(\nu_l-b_{q_s})+\mathcal{R}(b_{q_s}-\nu_u)+\mathcal{R}(\nu_l-b_{q_e})+\mathcal{R}(b_{q_e}-\nu_u)}\Bigg]\;\text{, or shorter}
\\[1em]
=&\;\sum_{q=1}^{\max{Q}}\;\frac{\bigg[\mathcal{H}(b_{q_s}-\nu_u)-\mathcal{H}(\nu_l-b_{q_s})\;,\;\mathcal{H}(b_{q_e}-\nu_u)-\mathcal{H}(\nu_l-b_{q_e})\bigg]}{1+\mathcal{R}(\nu_l-b_{q_s})+\mathcal{R}(b_{q_s}-\nu_u)+\mathcal{R}(\nu_l-b_{q_e})+\mathcal{R}(b_{q_e}-\nu_u)}\;\text{.}
\end{aligned}
$$
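A minimal sketch of $R_4$ (the helper `R4_demo` is ours): ramp-penalties for boundaries lying outside the box $[\nu_l,\nu_u]$, on a log scale.
```{r}
R4_demo <- function(b_qs, b_qe, nu_l = 0, nu_u = 1) {
  ramp <- function(x) pmax(x, 0)
  log(1 + sum(ramp(nu_l - b_qs) + ramp(b_qs - nu_u) +
                ramp(nu_l - b_qe) + ramp(b_qe - nu_u)))
}
R4_demo(head(query_bounds, -1), tail(query_bounds, -1))  # inside the box --> 0
R4_demo(c(-.2, .5), c(.4, 1.3))                          # transgressions --> > 0
```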
# Optimization
We have previously introduced the Boundary Time Warping model and shown how to compute its gradient and loss (and the gradient thereof), as well as presented four kinds of regularizers for this model. The goal now is to bring everything together, such that we can train the model to minimize an objective function.
```{r}
# Stabilize clamps the input to [lb, ub], so that the interpolated signal
# functions can safely be evaluated (slightly) outside their support.
Stabilize <- function(f, lb, ub) {
  Vectorize(function(x) {
    if (x < lb) f(lb) else if (x > ub) f(ub) else f(x)
  })
}
r <- Stabilize(signal_ref, 0, 1)
f <- Stabilize(signal_query, 0, 1)
```
## Objective function
The objective function is used to calculate a cost, loss or error, given some reference-data as well as query-data, together with some parameters that set up the model such that the cost, loss or error between the reference- and the transformed query-data becomes minimal. The objective function wraps the model, its data and parameters:
* Model: Given the reference- and query-data, and parameters that determine how to transform the query-data, it computes a derived version of the query-data that is supposedly closer to the reference-data than the non-derived query-data.
* Loss function: Provides the means to aggregate the differences between the reference-data and the model-derived query-data. The loss function usually aggregates the entirety of all differences to a single value.
* Regularizer: One or more additional terms optionally added to the loss function. Regularizers are used to penalize extreme, albeit seemingly good, solutions, as per Occam's razor. They are added to guide the training of the involved parameters such that over-fitting is avoided.
* Gradient of loss function: Vector of all partial derivatives of the loss function. Computing the gradient of the loss function results in a suggested change for each parameter, along with how to change it, i.e., whether it should be increased or decreased.
* Hessian of loss function: Matrix with all 2nd-order partial derivatives of the loss function. This matrix can be used for Newton-Raphson optimization and should be preferred whenever possible over purely gradient-based optimization methods.
We have defined all of the above (except for the Hessian, which we will do later). Our loss function is given by:
$$
\begin{aligned}
\text{Using}\;\bm{\vartheta}^s\prime,\bm{\vartheta}^e\prime,\bm{\phi}_v,\bm{\phi}_u,\mu,\nu_l,\nu_u,\mathcal{X}^r\;\dots&\;\text{as defined previously,}
\\[1em]
\mathcal{L}(\bm{\theta}^s,\bm{\theta}^e,\bm{\vartheta}^s,\bm{\vartheta}^e,r,f)=&\;\text{RSS}_{\mathsf{M}}^{\text{log}} + R_1^{\text{new}} + R_2 + R_3 + R_4\;\text{, with}
\\[1ex]
\mathbf{y},\hat{\mathbf{y}}=&\;\mathsf{M}(\bm{\theta}^s,\bm{\theta}^e,\bm{\vartheta}^s,\bm{\vartheta}^e, r,f)\text{, }\mathcal{L}\;\text{becomes}
\\[1ex]
=&\;\;\log{\Bigg(1+\sum_{q=1}^{\max{Q}}\sum_{i=1}^{\lvert\mathbf{x}_q\rvert}\;\big(\mathbf{y}_i - m_q(b_{o_s}, b_{o_e}, b_{q_s}, b_{q_e},\;f,\mathbf{x}_{q,i})\big)^2\Bigg)}
\\
&\;-\log{\bigg(1-\frac{\bm{\phi}_v}{\bm{\phi}_u}\bigg)}
\\
&\;-\log\Bigg(\Bigg[\sum_{q=1}^{\max{Q}}\,b_{q_e}-b_{q_s}\Bigg]\;\Big/\;\Big(\max{\mathcal{X}^r}-\min{\mathcal{X}^r}\Big)\Bigg)
\\[1ex]
&\;+\log\Bigg({1+\sum_{q=1}^{\max{Q}}}\;(b_{q_e}-b_{q_s}-\mu)^2\Bigg)
\\[1ex]
&\;+\log{\Bigg(1+\sum_{q=1}^{\max{Q}}\;\mathcal{R}(\nu_l-b_{q_s})+\mathcal{R}(b_{q_s}-\nu_u)+\mathcal{R}(\nu_l-b_{q_e})+\mathcal{R}(b_{q_e}-\nu_u)\Bigg)}
\end{aligned}
$$
The gradient of this loss function is:
$$
\begin{aligned}
\delta_o,\psi_i,\mathbf{x}_{q,i}\;\dots&\;\text{as defined previously,}
\\[1ex]
\nabla\,\mathcal{L}(\dots)=&\;\Big[\frac{\partial\,\mathcal{L}}{\partial\,b_{q_s}}\;,\;\frac{\partial\,\mathcal{L}}{\partial\,b_{q_e}}\Big]\;\text{,}
\\[1ex]
=&\;\nabla\,\text{RSS}_{\mathsf{M}}^{\text{log}}\;+\;\nabla\,R_1^{\text{new}}\;+\;\nabla\,R_2\;+\;\nabla\,R_3\;+\;\nabla\,R_4\text{,}
\\[1ex]
=&\;\Bigg[\sum_{q=1}^{\max{Q}}\sum_{i=1}^{\lvert\mathbf{x}_{q}\rvert}\;\frac{2\,\delta_o\times\Big(\mathbf{y}_i-f\big(\psi_i\big)\Big)\times f'\big(\psi_i\big)}{\big(b_{q_s}-b_{q_e}\big)^2}\times\bigg[b_{q_e}-\mathbf{x}_{q_i}\;,\;\mathbf{x}_{q_i}-b_{q_s}\bigg]\Bigg]\;\text{,}
\\[1ex]
&\;+\nabla\,R_1^{\text{new}}=\;\dots\;\text{(too large to repeat here)}
\\[1ex]
&\;+\sum_{q=1}^{\max{Q}}\Bigg[\frac{1}{b_{q_e}-b_{q_s}}\;,\;\frac{1}{b_{q_s}-b_{q_e}}\Bigg]\;\text{(the gradient of}\;R_2\;\text{is no longer negative)}
\\[1ex]
&\;+\sum_{q=1}^{\max{Q}}\;\frac{2(b_{q_e}-b_{q_s}-\mu)}{1+(b_{q_e}-b_{q_s}-\mu)^2}\times\Big[-1\;,\;1\Big]
\\[1ex]
&\;+\sum_{q=1}^{\max{Q}}\;\frac{\bigg[\mathcal{H}(b_{q_s}-\nu_u)-\mathcal{H}(\nu_l-b_{q_s})\;,\;\mathcal{H}(b_{q_e}-\nu_u)-\mathcal{H}(\nu_l-b_{q_e})\bigg]}{1+\mathcal{R}(\nu_l-b_{q_s})+\mathcal{R}(b_{q_s}-\nu_u)+\mathcal{R}(\nu_l-b_{q_e})+\mathcal{R}(b_{q_e}-\nu_u)}\;\text{,}
\\[1ex]
=&\;\sum_{q=1}^{\max{Q}}\Bigg[\sum_{i=1}^{\lvert\mathbf{x}_q\rvert}\;\frac{2\,\delta_o\times\Big(\mathbf{y}_i-f\big(\psi_i\big)\Big)\times f'\big(\psi_i\big)}{\big(b_{q_s}-b_{q_e}\big)^2}\times\bigg[b_{q_e}-\mathbf{x}_{q_i}\;,\;\mathbf{x}_{q_i}-b_{q_s}\bigg]\Bigg]
\\[1ex]
&\;+\nabla\,R_1^{\text{new}}\dots
\\[1ex]
&\;+\Bigg[\frac{1}{b_{q_e}-b_{q_s}}\;,\;\frac{1}{b_{q_s}-b_{q_e}}\Bigg]\;\text{,}
\\[1ex]
&\;+\frac{2(b_{q_e}-b_{q_s}-\mu)}{1+(b_{q_e}-b_{q_s}-\mu)^2}\times\Big[-1\;,\;1\Big]
\\[1ex]
&\;+\frac{\bigg[\mathcal{H}(b_{q_s}-\nu_u)-\mathcal{H}(\nu_l-b_{q_s})\;,\;\mathcal{H}(b_{q_e}-\nu_u)-\mathcal{H}(\nu_l-b_{q_e})\bigg]}{1+\mathcal{R}(\nu_l-b_{q_s})+\mathcal{R}(b_{q_s}-\nu_u)+\mathcal{R}(\nu_l-b_{q_e})+\mathcal{R}(b_{q_e}-\nu_u)}
\end{aligned}
$$
Now we have a loss and its gradient, and we will implement and test this next, before going further to the Hessian.
## Implementation
Here we implement the loss function with its regularizers and gradient.
### Using optim's gradient
First we test how well the loss function works by having `optim()` use its own approximation of a gradient.
```{r}
temp <- loadResultsOrCompute(file = "../results/btw-final_test-loss.rds", computeExpr = {
pNames <- c(paste0("b", 1:length(query_bounds)), "loss")
fitResult <- FitResult$new(paramNames = pNames)$start()
optR <- optim(
par = query_bounds,
method = "L-BFGS-B",
lower = rep(0, length(query_bounds)),
upper = rep(1, length(query_bounds)),
fn = function(x) {
loss <- L_final(theta_b_org = query_bounds, theta_b = x, r = r, f = f)
fitResult$stopStep(resultParams = c(x, loss), verbose = interactive())
loss
}
)
list(
fr = fitResult$finish(),
optR = optR
)
})
plot(log(temp$fr$getFitHist()[, "loss"]), type="l")
temp2 <- M_final(
theta_b_org = query_bounds,
theta_b = temp$optR$par,
r = signal_ref,
f = signal_query)
plotBtw(data.frame(x = temp2$X, y = temp2$y_hat), bounds = temp$optR$par)
```
The loss function apparently works. Note that the above plot also includes the gradient iterations, as we cannot distinguish these from the loss evaluations using the `optim()` approach.
### Implementing the analytical gradient
We have previously derived the gradient for our loss function, including regularizers. Now we should implement it in code and try to use `optim()` with that gradient instead. For the analytical gradient, we also need to provide a derivative of the query-signal. While it is actually a sine, we take the numerical route here and approximate a function for the derivative.
```{r}
library(pracma)
#' Note that here we use the stabilized version of the query-signal.
#' TODO: We should supply helper functions to the user that use the
#' best possible method of derivation, like we do here.
f_deriv1 <- Vectorize(function(x) {
m <- if (x == 0) "forward" else if (x == 1) "backward" else "central"
pracma::fderiv(f = f, x = x, method = m)
})
curve(f, 0, 1)
curve(f_deriv1, 0, 1)
```
Let's make a test without the regularizers:
```{r}
temp <- loadResultsOrCompute(file = "../results/btw-final_test-loss_no-reg.rds", computeExpr = {
pNames <- c(paste0("b", 1:length(query_bounds)), "loss")
fitResult <- FitResult$new(paramNames = pNames)$start()
optR <- stats::optim(
par = query_bounds,
method = "L-BFGS-B",
lower = rep(0, length(query_bounds)),
upper = rep(1, length(query_bounds)),
fn = function(x) {
loss <- L_final(theta_b_org = query_bounds, theta_b = x, r = r, f = f, useR1 = FALSE, useR2 = FALSE)
fitResult$stopStep(resultParams = c(x, loss), verbose = interactive())
loss
},
gr = function(x) {
L_final_grad(theta_b_org = query_bounds, theta_b = x, r = r, f = f, f_p = f_deriv1, useR1 = FALSE, useR2 = FALSE)
}
)
list(
fr = fitResult$finish(),
optR = optR
)
})
print(round(temp$optR$par, 5))
plot(log(temp$fr$getFitHist()[, "loss"]), type="l")
temp2 <- M_final(
theta_b_org = query_bounds,
theta_b = temp$optR$par,
r = signal_ref,
f = signal_query)
plotBtw(data.frame(x = temp2$X, y = temp2$y_hat), bounds = temp$optR$par)
```
This works somewhat, the major problem being overlapping boundaries, which BTW cannot resolve without the regularizers (the $R_1$ regularizer addresses this). However, it seems to cover the whole reference in this example, even without the $R_2$ regularizer.
```{r}
set.seed(1337)
useBounds <- c(0, sort(runif(23)), 1)
#useBounds <- query_bounds
fitResult <- FitResult$new(paramNames = "loss")$start()
optR <- stats::optim(
par = useBounds,
method = "BFGS",
# lower = rep(0, length(useBounds)),
# upper = rep(1, length(useBounds)),
fn = function(x) {
#print(format(x, digits = 3, nsmall = 3, scientific = FALSE))
fitResult$startStep()
loss <- L_final_log(
theta_b_org = useBounds, theta_b = x, r = r, f = f, zNormalize = FALSE,
weightErr = 2/3, weightR1 = 1, weightR2 = .2, weightR3 = .5, weightR4 = 1)
#print(loss)
fitResult$stopStep(resultParams = loss, verbose = TRUE)
loss
# },
# gr = function(x) {
# pracma::grad(x0 = x, f = function(x) {
# L_final_log(theta_b_org = useBounds, theta_b = x, r = r, f = f, weightErr = .5, weightR2 = .5)
# })
# # grad <- L_final_log_grad(
# # theta_b_org = useBounds, theta_b = x, r = r, f = f, f_p = f_deriv1,
# # weightErr = .5, weightR1 = 1, weightR2 = .5, weightR3 = 1, weightR4 = 1)
# #
# # print(format(grad, digits = 3, nsmall = 3, scientific = FALSE))
# # grad
}
)
fitResult$finish()
```
```{r}
temp3 <- M_final(
theta_b_org = useBounds,
theta_b = optR$par,
r = signal_ref,
f = signal_query)
plotBtw(data.frame(x = temp3$X, y = temp3$y_hat), optR$par)
```