---
title: "Boundary Time Warping (new)"
bibliography: ../inst/REFERENCES.bib
header-includes:
- \usepackage{bm}
output:
pdf_document:
number_sections: true
toc: true
toc_depth: 6
df_print: kable
html_document:
number_sections: true
toc: true
toc_depth: 6
toc_float: true
df_print: kable
md_document:
toc: true
toc_depth: 6
df_print: kable
word_document: default
---
```{r echo=FALSE, warning=FALSE}
source(file = "../helpers.R")
source(file = "./common-funcs.R")
source(file = "../models/modelsR6.R")
source(file = "../models/BTW.R")
library(ggplot2)
library(ggpubr)
```
# Boundary Time Warping (final)
Note that we have abandoned the previous BTW notebook (`boundary-time-warping_new.Rmd`), as we want to attempt a final reformulation of BTW using the initial translate-and-scale approach. The reason behind this is simple: in our previous approach we reduced the required parameters to only one, $\tau$. While that simplifies many things, it makes it impossible to deduce the gradient-delta required for adjusting the boundaries. So in this notebook, we return to the initial approach and make an explicit formulation.
Our formulation here is as explicit as possible, to avoid any confusion. The overall model
* subdivides the query-signal into a set of intervals, $\mathcal{I}$ (indexed by $q$), and
* in each interval there are __two__ parameters, the absolute offset (begin) and the end of it.
The translation and scale parameters result from putting the original- and query-interval into relation. The original interval is the one the query-signal was divided into initially; it is constant, never changed, and given by $b_{\text{start}}^{\text{org}},b_{\text{end}}^{\text{org}}$ (or short: $b_{o_s},b_{o_e}$). The query-interval is the one delimited by the hyperparameters, given by $b_{\text{start}}^{\text{query}},b_{\text{end}}^{\text{query}}$ (short: $b_{q_s},b_{q_e}$). Note that from here on, "start", "end", "org" and "query" will often be denoted by "s", "e", "o" and "q", respectively.
## Formal description of the model
The model is now much simpler, and we follow the structure of the original notebook here. Note that almost all of the remarks of the previous notebook apply; only the way we calculate BTW has changed.
Given a vector of boundaries, $\bm{\theta}_b$, the first step is to establish the two vectors of absolute start- and end-parameters, i.e.,
$$
\begin{aligned}
\bm{\theta}\;\dots&\;\text{ordered vector of boundaries, such that}
\\[0ex]
&\;\bm{\theta}_j\prec\bm{\theta}_{j+1}\;\text{,}
\\[0ex]
i=&\;\lvert\,\bm{\theta}\,\rvert\;\text{, the length of that vector,}
\\[1ex]
\bm{\theta}^{\text{start}}=&\;\{\bm{\theta}_{1},\;\dots\;,\bm{\theta}_{i-1}\}\;\text{(all but the last boundary),}
\\[0ex]
\bm{\theta}^{\text{end}}=&\;\{\bm{\theta}_{2},\;\dots\;,\bm{\theta}_{i}\}\;\text{(all but the first boundary),}
\\[1ex]
q\in Q;\;Q=&\;\{1,\;\dots\;,i-1\}\;\text{, where}\;Q\;\text{is ordered low to high, i.e.,}\;q_i\prec q_{i+1}\text{,}
\\[0ex]
&\;\mathcal{I}\;\text{, the set of intervals, of cardinality}\;\max{Q}\text{, with each interval delimited by}
\\[0ex]
I_q=&\;\Big[\,\bm{\theta}_q^{\text{start}},\bm{\theta}_q^{\text{end}}\,\Big)\;\text{, resp.}
\\[0ex]
I_{q}=&\;\Big[\,\bm{\theta}_q^{\text{start}},\bm{\theta}_q^{\text{end}}\,\Big]\;\text{, for the last interval (}q=\max{Q}\text{).}
\end{aligned}
$$
Our overall model hence is a __Multi-task learning__ model. The relation between intervals (and hence the sub-models for each) is:
* Each interval $I_q,\;q<\max{Q}$ is defined as $I_q=[\bm{\theta}_q^s,\bm{\theta}_q^e\,)$, i.e., it does __not__ include the end-boundary. Each last interval, i.e., $I_q,\;q=\max{Q}$ is defined as $I_q=[\bm{\theta}_q^s,\bm{\theta}_q^e\,]$, i.e., it includes __both__ start- and end-boundary.
* That leads to two consecutive intervals, $I_q,I_{q+1}$, having no space between them: the end-boundary of interval $q$ and the start-boundary of interval $q+1$ coincide, $\bm{\theta}_q^e = \bm{\theta}_{q+1}^s$, and due to the half-open definition the shared boundary belongs to $I_{q+1}$ only.
* This implies that for all but the last interval, the end-boundary $\bm{\theta}_q^e$ carries no information beyond $\bm{\theta}_{q+1}^s$. This is important later for the gradient calculations, as we do have partial derivatives for $\bm{\theta}_q^s,\bm{\theta}_q^e$, but only need to calculate $\partial\,\bm{\theta}_q^e$ for the last interval, while we calculate $\partial\,\bm{\theta}_q^s$ for all intervals.
The last property is how we exploit Boundary Time Warping as a Multi-task learning model. The overall model, $\mathsf{M}$, produces $\hat{\bm{y}}$ piece-wise and then concatenates the pieces, where each piece is an interval together with its translation and scaling.
Now recall that signals are represented as functions, and for the query-signal, we'll also have the derivative and optionally the 2nd derivative:
$$
\begin{aligned}
r(x) &= \dots\;\text{, a function over the reference signal,}
\\
f(x),f\prime(x),f\prime\prime(x) &= \dots\;\text{, functions over the query signal (and its 1st and 2nd derivatives).}
\end{aligned}
$$
There may be cases where the derivatives of $f$ can be determined analytically. If that is not the case, one may choose to compute a numerical gradient by, e.g., a finite difference approximation. Formally, each sub-model $m_q$ and the overall model $\mathsf{M}$ are defined as:
$$
\begin{aligned}
\bm{\theta}^s,\bm{\theta}^e,\bm{\vartheta}^s,\bm{\vartheta}^e\;\dots&\;\text{ordered sets (same length) of start- and end original- and query-boundaries,}
\\[1ex]
\mathcal{X}^r=&\;[\min{\bm{\theta}^s}\,,\,\max{\bm{\theta}^e}]\;\text{, the support of the reference-signal,}
\\[1ex]
\mathcal{X}^q=&\;[\min{\bm{\vartheta}^s}\,,\,\max{\bm{\vartheta}^e}]\;\text{, the support of the query-signal,}
\\[1ex]
\mathbf{x}_q\subset\mathcal{X}^q=&\;[\bm{\vartheta}^s_q\,,\,\bm{\vartheta}^e_q]\text{, proper sub-support for model}\;m_q\;\text{and its interval}\;I_q\text{, such that}
\\[1ex]
\mathbf{x}_q\prec&\;\mathbf{x}_{q+1}\;\text{, i.e., sub-supports are ordered,}
\\[1ex]
m_q(b_{o_s}, b_{o_e}, b_{q_s}, b_{q_e},\;f,x)=&\;f\Bigg(\frac{(x - b_{q_s})\times(b_{o_e} - b_{o_s})}{b_{q_e}-b_{q_s}} + b_{o_s}\Bigg)\;\text{, model for interval}\;q\text{,}
\\[1ex]
\mathbf{y}=&\;r(\mathcal{X}^r)\;\text{, the reference-signal,}
\\[1ex]
\hat{\mathbf{y}}=&\;\Big\{\;m_1(\dots)\,\frown\,\dots\,\frown m_q(\bm{\theta}^s_q,\bm{\theta}^e_q,\bm{\vartheta}^s_q,\bm{\vartheta}^e_q,f,\mathbf{x}_q)\;\Big\},\;\forall\,q\in Q
\\
&\;\text{(concatenate all models' results), and finally}
\\[1ex]
\mathsf{M}(\bm{\theta}^s,\bm{\theta}^e,\bm{\vartheta}^s,\bm{\vartheta}^e, r,f)=&\;\Big[\,\mathbf{y}^\top,\hat{\mathbf{y}}^\top\Big]\;\text{, compute the reference- and transformed query-signal.}
\end{aligned}
$$
In some of the subsequent computations we require the 1st and 2nd derivative of $f$, both of which can be supplied to $\mathsf{M}$ instead of $f$ for computing $\hat{\mathbf{y}}\prime$ and $\hat{\mathbf{y}}\prime\prime$, respectively.
Now the problem we previously faced was to get from a set of boundaries to a (finite) set of model-parameters, and back again. In our previous notebook it turned out that for $\bm{\tau}$ no direct path back exists. So before we move any further, we need to fully describe both ways.
$$
\begin{aligned}
&\;b_{q_s}, b_{q_e}\;\text{are the start- and end-boundaries delimiting the interval}\;q\text{.}
\end{aligned}
$$
Let's use our ubiquitous example of scaling and translating a portion of the sine signal again:
```{r}
b_os <- .75
b_oe <- 1
b_qs <- .4
b_qe <- .7
temp <- sin(seq(0, pi * 2/3, length.out = 1e2))
tempf <- approxfun(x = seq(0, 1, length.out = 1e2), y = temp)
curve(tempf, 0, 1)
abline(v = .75)
abline(v = 1)
```
The goal is to determine $s,t$ from the boundary-delimited interval $[0.75,1]$ (the two vertical lines above), if we were to move this interval to $[0.4, 0.7]$.
```{r}
s_q <- (b_qe - b_qs) / (b_oe - b_os)
t_q <- b_os - b_qs
# Note: despite its name, tempfPrime is not a derivative; it is the
# translated-and-scaled (warped) version of tempf over the query-interval.
tempfPrime <- function(x) {
  tempf((x - b_qs) / (b_qe - b_qs) * (b_oe - b_os) + b_os)
}
curve(tempfPrime, 0, 1)
abline(v = b_qs)
abline(v = b_qe)
```
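Going the other way, from $s,t$ back to the query-boundaries, follows directly from the two definitions above. A minimal sketch (the `_rec` names are ours, only for this check):
```{r}
# Recover the query-boundaries from the scale s_q, the translation t_q and
# the (constant) original interval; this should reproduce 0.4 and 0.7.
b_qs_rec <- b_os - t_q
b_qe_rec <- b_qs_rec + s_q * (b_oe - b_os)
c(b_qs = b_qs_rec, b_qe = b_qe_rec)
```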
## Gradient for the sub-models
Each sub-model can be defined as a function over the extents of the original- and query-interval:
$$
\begin{aligned}
m(b_{o_s}, b_{o_e}, b_{q_s}, b_{q_e},\;f,x) =&\;f\Big((x - b_{q_s})\times(b_{q_e}-b_{q_s})^{-1}\times(b_{o_e} - b_{o_s}) + b_{o_s}\Big)\;\text{, with}\;b_{o_s},b_{o_e}\;\text{being constant,}
\\
\delta_o =&\;b_{o_e}-b_{o_s}\;\text{,}\;m\;\text{becomes}
\\[1ex]
m(b_{o_s}, \delta_o, b_{q_s}, b_{q_e},\;f,x) =&\;f\Big((x - b_{q_s})\times\frac{\delta_o}{b_{q_e}-b_{q_s}} + b_{o_s}\Big)\;\text{.}
\\[1ex]
\nabla\,m(b_{o_s}, \delta_o, b_{q_s}, b_{q_e},\;f,x) =&\;\Bigg[\frac{\partial\,m}{\partial\,b_{q_s}}\;,\;\frac{\partial\,m}{\partial\,b_{q_e}}\Bigg]\;\text{,}
\\[1ex]
=&\;\Bigg[-\frac{\delta_o\times(b_{q_e}-x)\times f'\Big(\frac{\delta_o\times(b_{q_s}-x)}{b_{q_s}-b_{q_e}}+b_{o_s}\Big)}{\big(b_{q_e}-b_{q_s}\big)^2}\;,\;\frac{\delta_o\times(b_{q_s}-x)\times f'\Big(\frac{\delta_o\times(b_{q_s}-x)}{b_{q_s}-b_{q_e}}+b_{o_s}\Big)}{\big(b_{q_e}-b_{q_s}\big)^2}\Bigg]\;\text{,}
\\[1ex]
=&\;\frac{\delta_o\times f'\Big(\frac{\delta_o\times(b_{q_s}-x)}{b_{q_s}-b_{q_e}}+b_{o_s}\Big)}{\big(b_{q_e}-b_{q_s}\big)^2}\times\Bigg[x-b_{q_e}\;,\;b_{q_s}-x\;\Bigg]\;\text{.}
\end{aligned}
$$
So we have a very pleasing gradient and a clear path between the pairs of boundaries and the models' parameters. In fact, we did not derive extra parameters; we plug in the boundaries directly. Now, for example, to calculate the required change for $b_{q_s}$, we simply compute the partial derivative for it. That, however, needs to be done when we plug $\mathsf{M}$ into a cost-function. Again, we will be using the residual sum of squares (RSS) for that.
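Before the demonstration, a quick sanity check may help. The following is only a sketch (the helpers `m_q_demo`, `grad_m_q_demo` and `tempf_p` are hypothetical names, not part of the model code): it compares the analytic partial derivatives of a single sub-model against central finite differences, using the sine example and boundaries from above. Both rows should agree up to numerical error.
```{r}
# A single sub-model over the example boundaries from above.
m_q_demo <- function(b_qs, b_qe, x) {
  tempf((x - b_qs) / (b_qe - b_qs) * (b_oe - b_os) + b_os)
}
# The analytic gradient as derived above, using a supplied derivative f_p.
grad_m_q_demo <- function(b_qs, b_qe, x, f_p) {
  delta_o <- b_oe - b_os
  psi <- delta_o * (b_qs - x) / (b_qs - b_qe) + b_os
  fac <- delta_o * f_p(psi) / (b_qe - b_qs)^2
  c(d_b_qs = fac * (x - b_qe), d_b_qe = fac * (b_qs - x))
}
# Numerical derivative of the interpolated query-signal.
tempf_p <- function(x) (tempf(x + 1e-5) - tempf(x - 1e-5)) / 2e-5
x0 <- .55
h <- 1e-6
rbind(
  analytic = grad_m_q_demo(b_qs, b_qe, x0, tempf_p),
  numeric = c(
    (m_q_demo(b_qs + h, b_qe, x0) - m_q_demo(b_qs - h, b_qe, x0)) / (2 * h),
    (m_q_demo(b_qs, b_qe + h, x0) - m_q_demo(b_qs, b_qe - h, x0)) / (2 * h)))
```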
Let's show a short demonstration:
```{r echo=FALSE}
btwRef <- data.frame(
x = seq(0, 1, length.out = 1e3),
y = sin(seq(0, 2 * pi, length.out = 1e3))
)
btwQueryBounds <- c(.1, .2, .5, .6, .8)
btwQuery <- data.frame(
x = c(
seq(0, btwQueryBounds[1], length.out = 75),
seq(btwQueryBounds[1], btwQueryBounds[2], length.out = 25),
seq(btwQueryBounds[2], btwQueryBounds[3], length.out = 150),
seq(btwQueryBounds[3], btwQueryBounds[4], length.out = 300),
seq(btwQueryBounds[4], btwQueryBounds[5], length.out = 50),
seq(btwQueryBounds[5], 1, length.out = 400)
),
y = btwRef$y
)
plotBtw <- function(df, bounds = c()) {
g <- ggplot(data = df, aes(x = x, y = y)) + theme_light() + geom_line()
for (i in seq_along(bounds)) {
g <- g + geom_vline(xintercept = bounds[i])
}
g
}
signal_ref <- stats::approxfun(x = btwRef$x, y = btwRef$y, ties = mean)
signal_query <- stats::approxfun(x = btwQuery$x, y = btwQuery$y, ties = mean)
query_bounds <- seq(0, 1, by = 0.1)
```
```{r}
temp <- M_final(
theta_b_org = query_bounds,
theta_b = query_bounds,
r = signal_ref,
f = signal_query)
# We know how the reference signal was distorted previously. Let's make
# a test where we manually undo this by moving the boundaries.
temp2_bounds <- c(
0,
.075, .1, .15, .2, .25,
.55, .575, .6, .8, 1)
temp2 <- M_final(
theta_b_org = query_bounds,
theta_b = temp2_bounds,
r = signal_ref,
f = signal_query)
ggarrange(
ncol = 1,
plotBtw(data.frame(x = temp$X, y = temp$y)),
plotBtw(data.frame(x = temp$X, y = temp$y_hat), bounds = query_bounds),
plotBtw(data.frame(x = temp2$X, y = temp2$y_hat), bounds = temp2_bounds)
)
```
Great, this model works! The third plot is a manually corrected version of the distorted query signal shown in the second plot. The goal of `BTW` is to provide a differentiable algorithm that optimizes the boundaries such that a cost, usually a distortion, is minimized.
## Compute the loss
For the scope of this notebook we will mostly stick with the residual sum of squares loss function.
$$
\begin{aligned}
m, \mathbf{x}\;\dots&\;\text{a given sub-model and its sub-support (for brevity, we omit}\;q\text{),}
\\[1ex]
\text{RSS}_m =&\;\sum_{i=1}^{\lvert\mathbf{x}\rvert}\;\big(\mathbf{y}_i - m(b_{o_s}, b_{o_e}, b_{q_s}, b_{q_e},\;f,\mathbf{x}_i)\big)^2\;\text{, recall}\;\delta_o\;\text{as}
\\[1ex]
\delta_o =&\;b_{o_e}-b_{o_s}\;\text{,}
\\[1ex]
\nabla\,\text{RSS}_m =&\;\Bigg[\frac{\partial\,\text{RSS}_m}{\partial\,b_{q_s}}\;,\;\frac{\partial\,\text{RSS}_m}{\partial\,b_{q_e}}\Bigg]\;\text{,}
\\[1ex]
=&\;\sum_{i=1}^{\lvert\mathbf{x}\rvert}\;\frac{2\,\delta_o\times\Big(\mathbf{y}_i-f\Big(\frac{\delta_o\times(b_{q_s}-\mathbf{x}_i)}{b_{q_s}-b_{q_e}}+b_{o_s}\Big)\Big)\times f'\Big(\frac{\delta_o\times(b_{q_s}-\mathbf{x}_i)}{b_{q_s}-b_{q_e}}+b_{o_s}\Big)}{\big(b_{q_s}-b_{q_e}\big)^2}\times\bigg[b_{q_e}-\mathbf{x}_i\;,\;\mathbf{x}_i-b_{q_s}\bigg]\;\text{,}
\\[1ex]
=&\;\sum_{i=1}^{\lvert\mathbf{x}\rvert}\;\frac{2\,\delta_o\times\Big(\mathbf{y}_i-f\big(\psi_i\big)\Big)\times f'\big(\psi_i\big)}{\big(b_{q_s}-b_{q_e}\big)^2}\times\bigg[b_{q_e}-\mathbf{x}_i\;,\;\mathbf{x}_i-b_{q_s}\bigg]\;\text{, using}\;\psi_i\;\text{as}
\\[1ex]
\psi_i =&\;\frac{\delta_o\times(b_{q_s}-\mathbf{x}_i)}{b_{q_s}-b_{q_e}}+b_{o_s}\;\text{.}
\end{aligned}
$$
We now have the gradient of the loss (or cost) function.
Note: As in the previous attempts, recall that an interval is defined such that it includes the starting boundary but excludes the ending boundary, except for the last interval, which includes both, i.e., $I_{\forall\,q<\max{Q}}=[b_{q_s}, b_{q_e})$ and $I_{\max{Q}}=[b_{q_s}, b_{q_e}]$. That means that for all but the last sub-model, we will compute the gradient only for $\partial\,b_{q_s}$, but not $\partial\,b_{q_e}$, as that would be redundant (and slightly wrong).
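To make this concrete, a minimal numeric sketch (the name `rss_demo` is ours, not part of the model code): the plain RSS between the reference and the un-corrected transformed query, computed directly from the output of `M_final` above.
```{r}
# RSS over the entire support, i.e., summed over all sub-models.
rss_demo <- sum((temp$y - temp$y_hat)^2)
rss_demo
```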
### Loss with logarithm
It turns out we made a mistake by not taking the logarithm of the loss function, because the two regularizers do so. We therefore need to define an alternate version that uses the logarithm, which we do below. We add $1$ to the RSS before taking the logarithm, such that the result will always be $\geq 0$.
$$
\begin{aligned}
\text{RSS}_{\mathsf{M}}^{\text{log}}=&\;\log{\Bigg(1+\sum_{q=1}^{\max{Q}}\sum_{i=1}^{\lvert\mathbf{x}_q\rvert}\;\big(\mathbf{y}_i - m_q(b_{o_s}, b_{o_e}, b_{q_s}, b_{q_e},\;f,\mathbf{x}_{q,i})\big)^2\Bigg)}\;\text{using}
\\[1ex]
\gamma_i=&\;\delta_o\times(\mathbf{x}_{q,i}-b_{q_s})\;\text{,}
\\[1ex]
\delta_q=&\;b_{q_e}-b_{q_s}\;\text{, }\;\psi_i\;\text{, and}
\\[1ex]
\psi_i=&\;\frac{\gamma_i}{\delta_q}+b_{o_s}\;\text{, the gradient now becomes:}
\\[1em]
\nabla\,\text{RSS}_{\mathsf{M}}^{\text{log}}=&\;\Bigg[\frac{\partial\,\text{RSS}_m^{\text{log}}}{\partial\,b_{q_s}}\;,\;\frac{\partial\,\text{RSS}_m^{\text{log}}}{\partial\,b_{q_e}}\Bigg]\;\text{,}
\\[1ex]
=&\;\sum_{q=1}^{\max{Q}}\sum_{i=1}^{\lvert\mathbf{x}_q\rvert}\;\Bigg[-1\times\frac{2\Big(\frac{\gamma_i}{{\delta_q}^2}-\frac{\delta_o}{\delta_q}\Big)\times\Big(\mathbf{y}_i-f\big(\psi_i\big)\Big)\times f'\big(\psi_i\big)}{1+\Big(\mathbf{y}_i-f(\psi_i)\Big)^2}\;,
\\[1ex]
&\;\;\;\;\;\;\;\;\;\;\;\;\frac{2\,\gamma_i\,\times\Big(\mathbf{y}_i-f\big(\psi_i\big)\Big)\times f'\big(\psi_i\big)}{{\delta_q}^2\times\Big(1+\big(\mathbf{y}_i-f(\psi_i)\big)^2\Big)}\Bigg]\;\text{, can be shortened to}
\\[1ex]
=&\;\sum_{q=1}^{\max{Q}}\sum_{i=1}^{\lvert\mathbf{x}_q\rvert}\;\frac{2\Big(\mathbf{y}_i-f\big(\psi_i\big)\Big)\times f'\big(\psi_i\big)}{1+\big(\mathbf{y}_i-f(\psi_i)\big)^2}\times\Bigg[-\bigg(\frac{\gamma_i}{{\delta_q}^2}-\frac{\delta_o}{\delta_q}\bigg)\;,\;\frac{\gamma_i}{{\delta_q}^2}\Bigg]\;\text{.}
\end{aligned}
$$
When combining losses, all of them should operate on the same scale. So, in our case, the regularizers return a logarithmic result, and with the above definition, we now have a logarithmic loss, too.
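Continuing the small sketch from above, the logarithmic version of the RSS lives on the same scale as the logarithmic regularizers that follow:
```{r}
# Plain vs. logarithmic RSS (rss_demo was computed in the sketch above).
c(RSS = rss_demo, RSS_log = log(1 + rss_demo))
```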
## Regularizing
We need to think briefly about regularizing such a model. I suggest we add two different regularizers: one that imposes a penalty on overlapping boundaries, and one that penalizes selecting too little of the data (i.e., when the sum of the lengths of all intervals is small).
### Penalize overlapping boundaries
For each boundary $b_i$, we need to ascertain that all subsequent boundaries are greater than $b_i$, i.e., $b_k,\,k>i,\,\forall\,b_k>b_i$. Similarly, all preceding boundaries must be less than $b_i$, i.e., $b_j,\,j<i,\,\forall\,b_j<b_i$.
The first kind of regularizer could be similar to this:
$$
\begin{aligned}
R_1(\bm{\theta},\bm{\sigma}) &=\Bigg(1+\sum_{i=1}^{\lvert\,\bm{\theta}\,\rvert-1}\,\sum_{j=i+1}^{\lvert\,\bm{\theta}\,\rvert}\;\mathcal{H}(\bm{\theta}_i-\bm{\theta}_j-\bm{\sigma}_i)\times(\bm{\theta}_i-\bm{\theta}_j-\bm{\sigma}_i)\Bigg)^{\sum_{i=1}^{\lvert\,\bm{\theta}\,\rvert-1}\,\sum_{j=i+1}^{\lvert\,\bm{\theta}\,\rvert}\;\mathcal{H}(\bm{\theta}_i-\bm{\theta}_j-\bm{\sigma}_i)}-1\;\text{,}
\\[1ex]
&\;\text{where}\;\mathcal{H}\;\text{is the Heaviside step function, and}
\\[0ex]
&\;\bm{\sigma}\;\text{is an optional vector holding the minimum distance between two boundaries.}
\end{aligned}
$$
This regularizer does two things: it adds up all violations of each boundary against each subsequent boundary, and then exponentiates the sum by the number of violations.
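A minimal sketch of this first regularizer (the helper `R1_demo` is ours, only for illustration), with the Heaviside step function implemented as an indicator:
```{r}
# Sum up all forward violations (a later boundary lying at or below an
# earlier one, minus a minimum distance sigma), then exponentiate by the
# number of violations.
R1_demo <- function(theta, sigma = rep(0, length(theta))) {
  H <- function(x) as.numeric(x > 0)
  viol <- 0
  n_viol <- 0
  for (i in 1:(length(theta) - 1)) {
    for (j in (i + 1):length(theta)) {
      d <- theta[i] - theta[j] - sigma[i]
      viol <- viol + H(d) * d
      n_viol <- n_viol + H(d)
    }
  }
  (1 + viol)^n_viol - 1
}
R1_demo(c(.1, .3, .2, .4))  # one overlapping pair --> positive penalty
R1_demo(c(.1, .2, .3, .4))  # properly ordered --> 0
```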
The first version does not take the second condition into account. We define an alternative that does:
$$
\begin{aligned}
\bm{\theta}\;\dots&\;\text{vector of boundaries,}
\\[0ex]
\mathcal{E}=&\;\text{matrix that defines pair-wise minimum distances between two boundaries,}
\\[1ex]
n=&\;\sum_{j=1}^{i-1}\mathcal{H}(\bm{\theta}_j-\bm{\theta}_i-\mathcal{E}_{i,j}) + \sum_{j=i+1}^{\lvert\,\bm{\theta}\,\rvert}\mathcal{H}(\bm{\theta}_i-\bm{\theta}_j-\mathcal{E}_{i,j})\;\text{,}
\\[0ex]
&\;\text{the number of mutual violations,}
\\[1ex]
k=&\;\sum_{j=1}^{i-1}\mathcal{H}(\bm{\theta}_j-\bm{\theta}_i-\mathcal{E}_{i,j})\times(\bm{\theta}_j-\bm{\theta}_i-\mathcal{E}_{i,j})
\\[0ex]
&\;+\;\sum_{j=i+1}^{\lvert\,\bm{\theta}\,\rvert}\mathcal{H}(\bm{\theta}_i-\bm{\theta}_j-\mathcal{E}_{i,j})\times(\bm{\theta}_i-\bm{\theta}_j-\mathcal{E}_{i,j})\;\text{,}
\\[0ex]
&\;\text{the sum of the degrees of all mutual violations,}
\\[1ex]
R_1^{\text{alt}}(k,n)=&\;(1+k)^n-1\;\text{, and its gradient}
\\[1ex]
\nabla\,R_1^{\text{alt}}(k,n)=&\;0\;\text{, as}
\\[1ex]
\nabla\,\mathcal{H}(x)=&\;\mathcal{H}'(x)\times x'\;\text{and}\;\mathcal{H}'(x)=0\;\text{.}
\end{aligned}
$$
This regularizer is similar to the first alternative. However, it also considers backward (hence mutual) violations. Also note that the gradient for $R_1$ is always $0$, as the derivative of the Heaviside step function is $0$, thus introducing a multiplier of $0$.
For inclusion in our loss function later, we need to reformulate $R_1^{\text{alt}}$ using $b_{q_s},b_{q_e}$. However, we also simplify the $R_1$ loss and define:
$$
\begin{aligned}
\bm{\vartheta}^s\prime,\bm{\vartheta}^e\prime\;\dots&\;\text{ordered versions of the query start-/end boundaries,}
\\[1ex]
\lambda\;\dots&\;\text{constant that allows by how much an order may be violated,}
\\[1ex]
\mathcal{H},\mathcal{R}=&\;\text{the Heaviside and Ramp functions,}
\\[1ex]
\bm{\phi}=&\;\mathcal{H}(\mathcal{R}(\bm{\vartheta}^s-\bm{\vartheta}^s\prime-\lambda))\times\mathcal{R}(\bm{\vartheta}^s-\bm{\vartheta}^s\prime-\lambda)
\\[1ex]
&\;+\mathcal{H}(\mathcal{R}(\bm{\vartheta}^e\prime-\bm{\vartheta}^e-\lambda))\times\mathcal{R}(\bm{\vartheta}^e\prime-\bm{\vartheta}^e-\lambda)\;\text{, reformulated using}\;b_{q_s},b_{q_e}\text{:}
\\[1ex]
=&\;\mathcal{H}(\mathcal{R}(b_{q_s}-\bm{\vartheta}^s_q\prime-\lambda))\times\mathcal{R}(b_{q_s}-\bm{\vartheta}^s_q\prime-\lambda)
\\[1ex]
&\;+\mathcal{H}(\mathcal{R}(\bm{\vartheta}^e_q\prime-b_{q_e}-\lambda))\times\mathcal{R}(\bm{\vartheta}^e_q\prime-b_{q_e}-\lambda)\;\text{,}
\\[1ex]
R_1^{\text{alt}}\prime(\bm{\vartheta}^s,\bm{\vartheta}^e,\bm{\vartheta}^s\prime,\bm{\vartheta}^e\prime)=&\;\Big(1+\big[\Sigma\,\bm{\phi}\big]\Big)^{\big[\Sigma\,\mathcal{H}(\bm{\phi})\big]}-1\;\text{.}
\end{aligned}
$$
It turns out that we can detect "forward"-violations by subtracting the (potentially unordered) $\bm{\vartheta}$ from its ordered version, $\bm{\vartheta}\prime$, and "backward"-violations by interchanging the summands. Then we can subtract an optional offset, $\lambda$, that allows some degree of violation (typically this is $0$). Any value remaining that is greater than $0$ will remain so after we pass the resulting vector through the Ramp function. The resulting vector, $\bm{\phi}$, can now be used in two ways: first, we can sum up all violations; second, we exponentiate that sum by the number of violations.
Hints: We assume all boundaries to be positive. The first term is $1+\dots$ to ascertain that we do not exponentiate values $<1$. We finally subtract $1$, as this regularizer otherwise has a lower bound of $1$ (for the case of no violations we exponentiate with $0$). The above expression for $\bm{\phi}$ uses the Ramp function to retain violations, and a violation is always positive. This is multiplied with the Heaviside vector of it, such that forward-/backward-violations cannot cancel each other out.
The gradient for this version is also $0$, as the exponent will become $0$ through the derivative of the Heaviside step function.
Consider the following example in `R`:
```{r}
# R: the Ramp function; H: the Heaviside step function (both element-wise).
R <- function(x) {
  x[x < 0] <- 0
  x
}
H <- function(x) {
  x[x < 0] <- 0
  x[x > 0] <- 1
  x
}
vt <- c(1, 3, 2, 4, 6, 7, 5, 8, 9) / 10
vtp <- sort(vt)
vt - vtp
vtp - vt
phi <- H(R(vt - vtp)) * R(vt - vtp) + H(R(vtp - vt)) * R(vtp - vt)
phi
sum(phi)
sum(H(phi))
((1 + sum(phi))^sum(H(phi))) - 1
```
While this alternative regularizer works, it would be great if we could formulate it similarly to the second regularizer, using a logarithm and a bounded expression. Given the previous example, it appears that we can compute an upper bound for the sum (and number) of violations: the worst possible boundary constellation is reached when we compare the ordered vector against its ordered reverse, and vice versa (for mutual violations).
$$
\begin{aligned}
\overleftarrow{\bm{\vartheta}^s\prime},\overleftarrow{\bm{\vartheta}^e\prime}\;\dots&\;\text{reversed and ordered versions of the query start-/end boundaries,}
\\[1ex]
\mathbf{p}(\mathbf{v}_1,\mathbf{v_2},\lambda)=&\;\mathcal{H}(\mathcal{R}(\mathbf{v}_1-\mathbf{v}_2-\lambda))\times \mathcal{R}(\mathbf{v}_1-\mathbf{v}_2-\lambda)
\\[0ex]
&\;+\mathcal{H}(\mathcal{R}(\mathbf{v}_2-\mathbf{v}_1-\lambda))\times\mathcal{R}(\mathbf{v}_2-\mathbf{v}_1-\lambda)\;\text{,}
\\[1ex]
&\;\text{the mutual difference between vectors}\;\mathbf{v}_1,\mathbf{v}_2,
\\[1ex]
\mathbf{v}_s\;,\;\mathbf{v}_e=&\;\mathbf{p}(\bm{\vartheta}^s,\bm{\vartheta}^s\prime,\;k)\;,\;\mathbf{p}(\bm{\vartheta}^e,\bm{\vartheta}^e\prime,\;k)\;\text{, (}k\;\text{is an arbitrary value),}
\\[1ex]
\mathbf{u}_s\;,\;\mathbf{u}_e=&\;\mathbf{p}(\bm{\vartheta}^s\prime,\overleftarrow{\bm{\vartheta}^s\prime},\;k)\;,\;\mathbf{p}(\bm{\vartheta}^e\prime,\overleftarrow{\bm{\vartheta}^e\prime},\;k)\;\text{,}
\\[1ex]
&\;\text{vectors for violations (}\mathbf{v}_{s,e}\text{) and upper bounds (}\mathbf{u}_{s,e}\text{),}
\\[1ex]
\bm{\phi}_v\;,\;\bm{\phi}_u=&\;\mathbf{v}_s+\mathbf{v}_e\;,\;\mathbf{u}_s+\mathbf{u}_e\;\text{,}
\\[1ex]
R_1^{\text{new}}=&\;-\log{\bigg(1-\frac{\bm{\phi}_v}{\bm{\phi}_u}\bigg)}\;\text{, fully-expanded, using}\;b_{q_s},b_{q_e}\;\text{:}
\\[1ex]
=&\;-\log{\Bigg(1-\frac{\sum_{q=1}^{\max{Q}}\,\begin{matrix}\;\;\;\mathcal{H}(\mathcal{R}(b_{q_s}-\bm{\vartheta}^s_q\prime-\lambda))\times \mathcal{R}(b_{q_s}-\bm{\vartheta}^s_q\prime-\lambda)&\\+\;\mathcal{H}(\mathcal{R}(\bm{\vartheta}^s_q\prime-b_{q_s}-\lambda))\times\mathcal{R}(\bm{\vartheta}^s_q\prime-b_{q_s}-\lambda)&\\+\;\mathcal{H}(\mathcal{R}(b_{q_e}-\bm{\vartheta}^e_q\prime-\lambda))\times \mathcal{R}(b_{q_e}-\bm{\vartheta}^e_q\prime-\lambda)&\\+\;\mathcal{H}(\mathcal{R}(\bm{\vartheta}^e_q\prime-b_{q_e}-\lambda))\times\mathcal{R}(\bm{\vartheta}^e_q\prime-b_{q_e}-\lambda)\end{matrix}}{\sum_{q=1}^{\max{Q}}\,\begin{matrix}\;\;\;\mathcal{H}(\mathcal{R}(\overleftarrow{\bm{\vartheta}^s_q\prime}-\bm{\vartheta}^s_q\prime-\lambda))\times \mathcal{R}(\overleftarrow{\bm{\vartheta}^s_q\prime}-\bm{\vartheta}^s_q\prime-\lambda)&\\+\;\mathcal{H}(\mathcal{R}(\bm{\vartheta}^s_q\prime-\overleftarrow{\bm{\vartheta}^s_q\prime}-\lambda))\times\mathcal{R}(\bm{\vartheta}^s_q\prime-\overleftarrow{\bm{\vartheta}^s_q\prime}-\lambda)&\\+\;\mathcal{H}(\mathcal{R}(\overleftarrow{\bm{\vartheta}^e_q\prime}-\bm{\vartheta}^e_q\prime-\lambda))\times \mathcal{R}(\overleftarrow{\bm{\vartheta}^e_q\prime}-\bm{\vartheta}^e_q\prime-\lambda)&\\+\;\mathcal{H}(\mathcal{R}(\bm{\vartheta}^e_q\prime-\overleftarrow{\bm{\vartheta}^e_q\prime}-\lambda))\times\mathcal{R}(\bm{\vartheta}^e_q\prime-\overleftarrow{\bm{\vartheta}^e_q\prime}-\lambda)\end{matrix}}\Bigg)}\;\text{.}
\end{aligned}
$$
However large the above definition, we can (and should!) break it down significantly before deriving the gradient. First of all, the entire denominator becomes constant the moment the query-boundaries are known, so we can pre-compute it and replace it again with $\bm{\phi}_u$. Second, subtracting some $\lambda$ can also be pre-computed, which cleans things up further:
$$
\begin{aligned}
\text{Using}\;\bm{\vartheta}^{s,\lambda}_q\prime,\bm{\vartheta}^{e,\lambda}_q\prime\;\dots&\;\text{vectors that already subtracted some}\;\lambda\text{,}
\\[1ex]
R_1^{\text{new}}=&\;\sum_{q=1}^{\max{Q}}-\log{\Bigg(1-\frac{
\begin{matrix}
\;\;\;\mathcal{H}(\mathcal{R}(b_{q_s}-\bm{\vartheta}^{s,\lambda}_q\prime))\times \mathcal{R}(b_{q_s}-\bm{\vartheta}^{s,\lambda}_q\prime)
&\\+\mathcal{H}(\mathcal{R}(\bm{\vartheta}^{s,\lambda}_q\prime-b_{q_s}))\times\mathcal{R}(\bm{\vartheta}^{s,\lambda}_q\prime-b_{q_s})
&\\+\mathcal{H}(\mathcal{R}(b_{q_e}-\bm{\vartheta}^{e,\lambda}_q\prime))\times \mathcal{R}(b_{q_e}-\bm{\vartheta}^{e,\lambda}_q\prime)
&\\+\mathcal{H}(\mathcal{R}(\bm{\vartheta}^{e,\lambda}_q\prime-b_{q_e}))\times\mathcal{R}(\bm{\vartheta}^{e,\lambda}_q\prime-b_{q_e})
\end{matrix}
}{\bm{\phi}_{u_q}}\Bigg)}\;\text{, note that}
\\[1ex]
\frac{\partial\,\mathcal{R}}{\partial\;x}=&\;\mathcal{H}\;\text{, and}\;\frac{\partial\,\mathcal{H}}{\partial\,x}=0\;\text{, and the gradient will become}
\\[1ex]
\nabla\,R_1^{\text{new}}=&\;\Bigg[\frac{\partial\,R_1^{\text{new}}}{\partial\,b_{q_s}}\;,\;\frac{\partial\,R_1^{\text{new}}}{\partial\,b_{q_e}}\Bigg]\;\text{,}
\\[1ex]
=&\;\sum_{q=1}^{\max{Q}}\Bigg[\frac{-\mathcal{H}(\mathcal{R}(\bm{\vartheta}^{s,\lambda}_q\prime-b_{q_s}))\times\mathcal{H}(\bm{\vartheta}^{s,\lambda}_q\prime-b_{q_s})+\mathcal{H}(\mathcal{R}(b_{q_s}-\bm{\vartheta}^{s,\lambda}_q\prime))\times\mathcal{H}(b_{q_s}-\bm{\vartheta}^{s,\lambda}_q\prime)}{
\begin{matrix}
\bm{\phi}_{u_q}-\Big(\mathcal{R}(\bm{\vartheta}^{s,\lambda}_q\prime-b_{q_s})\times\mathcal{H}(\mathcal{R}(\bm{\vartheta}^{s,\lambda}_q\prime-b_{q_s}))+\mathcal{R}(b_{q_s}-\bm{\vartheta}^{s,\lambda}_q\prime)
&\\\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\times\;\mathcal{H}(\mathcal{R}(b_{q_s}-\bm{\vartheta}^{s,\lambda}_q\prime))+\mathcal{R}(\bm{\vartheta}^{e,\lambda}_q\prime-b_{q_e})\times\mathcal{H}(\mathcal{R}(\bm{\vartheta}^{e,\lambda}_q\prime-b_{q_e}))
&\\+\mathcal{R}(b_{q_e}-\bm{\vartheta}^{e,\lambda}_q\prime)\times\mathcal{H}(\mathcal{R}(b_{q_e}-\bm{\vartheta}^{e,\lambda}_q\prime))\Big)\end{matrix}}\;\text{,}
\\[1ex]
&\;\frac{-\mathcal{H}(\mathcal{R}(\bm{\vartheta}^{e,\lambda}_q\prime-b_{q_e}))\times\mathcal{H}(\bm{\vartheta}^{e,\lambda}_q\prime-b_{q_e})+\mathcal{H}(\mathcal{R}(b_{q_e}-\bm{\vartheta}^{e,\lambda}_q\prime))\times\mathcal{H}(b_{q_e}-\bm{\vartheta}^{e,\lambda}_q\prime)}{
\begin{matrix}
\bm{\phi}_{u_q}-(\mathcal{R}(\bm{\vartheta}^{s,\lambda}_q\prime-b_{q_s})\times \mathcal{H}(\mathcal{R}(\bm{\vartheta}^{s,\lambda}_q\prime-b_{q_s}))+\mathcal{R}(b_{q_s}-\bm{\vartheta}^{s,\lambda}_q\prime)
&\\\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\times \mathcal{H}(\mathcal{R}(b_{q_s}-\bm{\vartheta}^{s,\lambda}_q\prime))+\mathcal{R}(\bm{\vartheta}^{e,\lambda}_q\prime-b_{q_e})\times \mathcal{H}(\mathcal{R}(\bm{\vartheta}^{e,\lambda}_q\prime-b_{q_e}))
&\\+\mathcal{R}(b_{q_e}-\bm{\vartheta}^{e,\lambda}_q\prime)\times \mathcal{H}(\mathcal{R}(b_{q_e}-\bm{\vartheta}^{e,\lambda}_q\prime)))
\end{matrix}
}\Bigg]\;\text{.}
\end{aligned}
$$
Well, at least the denominators are the same for the partial derivatives. Note that we have already removed all summands that would be zero due to them being multiplied with the derivative of the Heaviside step function, which is $0$.
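A minimal numeric sketch of $R_1^{\text{new}}$ itself (not its gradient), reusing `R()`, `H()`, `vt` and `vtp` from the example above; for simplicity we use a single boundary vector and $\lambda=0$:
```{r}
# Violations of the unordered vector against its ordered version, and the
# upper bound: the ordered vector against its reverse.
phi_v <- H(R(vt - vtp)) * R(vt - vtp) + H(R(vtp - vt)) * R(vtp - vt)
vtr <- rev(vtp)
phi_u <- H(R(vtp - vtr)) * R(vtp - vtr) + H(R(vtr - vtp)) * R(vtr - vtp)
-log(1 - sum(phi_v) / sum(phi_u))
```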
### Penalize supports that are too short
The second kind of regularizer would, in its simplest form, put the sum of the lengths of the selected query-intervals into relation with the total length of the original interval:
$$
\begin{aligned}
R_2(\bm{\theta}^s,\bm{\theta}^e,\bm{\vartheta}^s,\bm{\vartheta}^e) =&\;-\log\Bigg(\Bigg[\sum_{i=1}^{\lvert\,\bm{\vartheta}^s\,\rvert}\,\bm{\vartheta}^e_i-\bm{\vartheta}^s_i\Bigg]\;\bigg/\;\Bigg[\sum_{i=1}^{\lvert\,\bm{\theta}^s\,\rvert}\,\bm{\theta}^e_i-\bm{\theta}^s_i\Bigg]\Bigg)
\end{aligned}
$$
It is important to recall that the query-support cannot extend beyond the original-support in any direction, resulting in the first partial sum being less than or equal to the second partial sum, hence the ratio always lies within $[0,1]$.
The gradient for this regularizer is not $0$. It can further be simplified as we do know the length of the reference-support, which is constant:
$$
\begin{aligned}
\eta=&\;\max{\mathcal{X}^r}-\min{\mathcal{X}^r}\;\text{, the extent of the reference-interval,}
\\[1ex]
\equiv&\;\sum_{i=1}^{\lvert\,\bm{\theta}^s\,\rvert}\,\bm{\theta}^e_i-\bm{\theta}^s_i\;\text{, so}\;R_2\;\text{becomes}
\\[1ex]
R_2(\bm{\vartheta}^s,\bm{\vartheta}^e)=&\;-\log\Bigg(\Bigg[\sum_{i=1}^{\lvert\,\bm{\vartheta}^s\,\rvert}\,\bm{\vartheta}^e_i-\bm{\vartheta}^s_i\Bigg]\;\Big/\;\eta\Bigg)\;\text{, with}\;b_{q_s},b_{q_e}\text{:}
\\[1ex]
=&\;-\log\Bigg(\Bigg[\sum_{q=1}^{\max{Q}}\,b_{q_e}-b_{q_s}\Bigg]\;\Big/\;\eta\Bigg)\;\text{, with gradient}
\\[1ex]
\nabla\,R_2=&\;\Bigg[\frac{\partial\,R_2}{\partial\,\bm{\vartheta}^s_i}\;,\;\frac{\partial\,R_2}{\partial\,\bm{\vartheta}^e_i}\Bigg]\;\text{,}
\\[1ex]
=&\;\sum_{i=1}^{\lvert\,\bm{\vartheta}^s\,\rvert}\Bigg[\frac{1}{\bm{\vartheta}^e_i-\bm{\vartheta}^s_i}\;,\;\frac{1}{\bm{\vartheta}^s_i-\bm{\vartheta}^e_i}\Bigg]\;\text{, reformulated using}\;b_{q_s},b_{q_e}\;\text{:}
\\[1ex]
=&\;\sum_{q=1}^{\max{Q}}\Bigg[\frac{1}{b_{q_e}-b_{q_s}}\;,\;\frac{1}{b_{q_s}-b_{q_e}}\Bigg]\;\text{.}
\end{aligned}
$$
The last reformulation is important, as it makes the regularizer compatible with our existing formulations for the residual sum of squares loss function and its gradient.
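A minimal sketch of $R_2$ (the helper `R2_demo` is ours): with the example boundaries from above the query-intervals cover the whole reference, so the penalty is $0$, while a shrunken selection yields a positive penalty.
```{r}
# Negative log of the ratio between the summed query-interval lengths and
# the reference extent eta.
R2_demo <- function(vartheta_s, vartheta_e, eta = 1) {
  -log(sum(vartheta_e - vartheta_s) / eta)
}
R2_demo(head(query_bounds, -1), tail(query_bounds, -1))  # covers [0,1] --> 0
R2_demo(c(.1, .3), c(.2, .5))                            # covers only 0.3 --> > 0
```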
### Penalize extreme intervals
The second regularizer is very simple and straightforward, but it does not penalize extreme cases. For example, if all but one interval are of length zero and the remaining interval covers the whole reference, the second regularizer would not impose a penalty. Another extreme case we have observed is intervals of negative length. So not only do all intervals together need to cover a feasible range; the intervals also need to retain a somewhat healthy relation among themselves. At the same time, we must allow very short and very long intervals, but those should come with a cost that is proportional to their disagreement with the other intervals. One measure that describes all intervals is their mean length. The following regularizer sums up the squared differences from that mean. Note that we do not take the absolute value of $b_{q_e}-b_{q_s}$: as described previously, the optimization may choose intervals of negative length, and those will result in an even larger difference from the mean.
$$
\begin{aligned}
\mu=&\;\overline{\mathcal{I}}\;\text{, the mean length of all reference-intervals,}
\\[1ex]
=&\;\frac{\max{\mathcal{X}^r} - \min{\mathcal{X}^r}}{\lvert\,\bm{\theta}_b\,\rvert-1}
\\[1ex]
R_3=&\;\log\Bigg({1+\sum_{q=1}^{\max{Q}}}\;(b_{q_e}-b_{q_s}-\mu)^2\Bigg)\;\text{, with gradient}
\\[1ex]
\nabla\,R_3=&\;\Bigg[\frac{\partial\,R_3}{\partial\,b_{q_s}}\;,\;\frac{\partial\,R_3}{\partial\,b_{q_e}}\Bigg]\;\text{,}
\\[1ex]
=&\;\sum_{q=1}^{\max{Q}}\;\Bigg[-1\times\frac{2(b_{q_e}-b_{q_s}-\mu)}{1+(b_{q_e}-b_{q_s}-\mu)^2}\;,\;\frac{2(b_{q_e}-b_{q_s}-\mu)}{1+(b_{q_e}-b_{q_s}-\mu)^2}\Bigg]\;\text{, or shorter}
\\[1ex]
=&\;\sum_{q=1}^{\max{Q}}\;\frac{2(b_{q_e}-b_{q_s}-\mu)}{1+(b_{q_e}-b_{q_s}-\mu)^2}\times\Big[-1\;,\;1\Big]\;\text{.}
\end{aligned}
$$
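A minimal sketch of $R_3$ (the helper names are ours): squared deviations of the interval lengths from the mean reference-interval length, on a log scale.
```{r}
R3_demo <- function(b_qs, b_qe, mu) {
  log(1 + sum((b_qe - b_qs - mu)^2))
}
# The reference extent is [0, 1], divided into length(query_bounds) - 1 intervals.
mu_demo <- 1 / (length(query_bounds) - 1)
R3_demo(head(query_bounds, -1), tail(query_bounds, -1), mu_demo)  # equal lengths --> 0
R3_demo(head(temp2_bounds, -1), tail(temp2_bounds, -1), mu_demo)  # uneven lengths --> > 0
```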
### Constrain Box-bounds
Similar to DTW, the adjusted query must not extend beyond the reference signal, i.e., it must not be less or greater than the support of the reference. We introduce a 4th kind of regularizer that can be added to enforce these constraints. The idea is similar to the 3rd kind of regularizer: we will sum up all transgressions and return the logarithm of that sum. However, a transgression should be penalized based on its boundaries, not the interval delimited by them. In other words, we penalize every boundary according to whether it lies outside the box, as the other regularizers already take care of the other properties, such as whether the starting boundary comes after the end. Each boundary is checked for whether it is larger than (or equal to) the lower bound, and less than (or equal to) the upper bound:
$$
\begin{aligned}
\nu_l,\nu_u\;\dots&\;\text{the lower and upper box-bounds, respectively,}
\\[1ex]
\mathcal{R},\mathcal{H}\;\dots&\;\text{, the ramp- and Heaviside step functions,}
\\[1ex]
R_4=&\;\log{\Bigg(1+\sum_{q=1}^{\max{Q}}\;\mathcal{R}(\nu_l-b_{q_s})+\mathcal{R}(b_{q_s}-\nu_u)+\mathcal{R}(\nu_l-b_{q_e})+\mathcal{R}(b_{q_e}-\nu_u)\Bigg)}\;\text{, with gradient}
\\[1ex]
\nabla\,R_4=&\;\Bigg[\frac{\partial\,R_4}{\partial\,b_{q_s}}\;,\;\frac{\partial\,R_4}{\partial\,b_{q_e}}\Bigg]\;\text{,}
\\[1ex]
=&\;\sum_{q=1}^{\max{Q}}\;\Bigg[\frac{\mathcal{H}(b_{q_s}-\nu_u)-\mathcal{H}(\nu_l-b_{q_s})}{1+\mathcal{R}(\nu_l-b_{q_s})+\mathcal{R}(b_{q_s}-\nu_u)+\mathcal{R}(\nu_l-b_{q_e})+\mathcal{R}(b_{q_e}-\nu_u)}\;,
\\[1ex]
&\;\;\;\;\;\;\;\;\;\;\;\;\;\frac{\mathcal{H}(b_{q_e}-\nu_u)-\mathcal{H}(\nu_l-b_{q_e})}{1+\mathcal{R}(\nu_l-b_{q_s})+\mathcal{R}(b_{q_s}-\nu_u)+\mathcal{R}(\nu_l-b_{q_e})+\mathcal{R}(b_{q_e}-\nu_u)}\Bigg]\;\text{, or shorter}
\\[1em]
=&\;\sum_{q=1}^{\max{Q}}\;\frac{\bigg[\mathcal{H}(b_{q_s}-\nu_u)-\mathcal{H}(\nu_l-b_{q_s})\;,\;\mathcal{H}(b_{q_e}-\nu_u)-\mathcal{H}(\nu_l-b_{q_e})\bigg]}{1+\mathcal{R}(\nu_l-b_{q_s})+\mathcal{R}(b_{q_s}-\nu_u)+\mathcal{R}(\nu_l-b_{q_e})+\mathcal{R}(b_{q_e}-\nu_u)}\;\text{.}
\end{aligned}
$$
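A minimal sketch of $R_4$ (the helper `R4_demo` is ours): ramp-penalties for boundaries lying outside the box $[\nu_l,\nu_u]$, on a log scale.
```{r}
R4_demo <- function(b_qs, b_qe, nu_l = 0, nu_u = 1) {
  ramp <- function(x) pmax(x, 0)
  log(1 + sum(ramp(nu_l - b_qs) + ramp(b_qs - nu_u) +
                ramp(nu_l - b_qe) + ramp(b_qe - nu_u)))
}
R4_demo(head(query_bounds, -1), tail(query_bounds, -1))  # inside the box --> 0
R4_demo(c(-.2, .5), c(.4, 1.3))                          # transgressions --> > 0
```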
# Optimization
We have previously introduced the Boundary Time Warping model and shown how to compute its gradient and loss (and the gradient thereof), as well as presented four kinds of regularizers for this model. The goal now is to bring everything together, such that we can train the model to minimize an objective function.
```{r}
# Stabilize clamps the input to [lb, ub], so that the interpolated signal
# functions can safely be evaluated (slightly) outside their support.
Stabilize <- function(f, lb, ub) {
  Vectorize(function(x) {
    if (x < lb) f(lb) else if (x > ub) f(ub) else f(x)
  })
}
r <- Stabilize(signal_ref, 0, 1)
f <- Stabilize(signal_query, 0, 1)
```
## Objective function
The objective function is used to calculate a cost, loss or error, given some reference-data as well as query-data, together with some parameters that set up the model such that the cost, loss or error between the reference- and the transformed query-data becomes minimal. The objective function wraps the model, its data and parameters:
* Model: Given the reference- and query-data, and parameters that determine how to transform the query-data, it computes a derived version of the query-data that is supposedly closer to the reference-data than the non-derived query-data.
* Loss function: Provides the means to aggregate the differences between the reference-data and the model-derived query-data. The loss function usually aggregates the entirety of all differences to a single value.
* Regularizer: One or more additional terms optionally added to the loss function. Regularizers are used to penalize extreme, albeit seemingly good, solutions, as per Occam's razor. They are added to guide the training of the involved parameters such that over-fitting is avoided.
* Gradient of loss function: Vector of all partial derivatives of the loss function. Computing the gradient of the loss function results in a suggested change for each parameter, along with how to change it, i.e., whether it should be increased or decreased.
* Hessian of loss function: Matrix with all 2nd-order partial derivatives of the loss function. This matrix can be used for Newton-Raphson optimization and should be preferred whenever possible over purely gradient-based optimization methods.
We have defined all of the above (except for the Hessian, which we will do later). Our loss function is given by:
$$
\begin{aligned}
\text{Using}\;\bm{\vartheta}^s\prime,\bm{\vartheta}^e\prime,\bm{\phi}_v,\bm{\phi}_u,\mu,\nu_l,\nu_u,\mathcal{X}^r\;\dots&\;\text{as defined previously,}
\\[1em]
\mathcal{L}(\bm{\theta}^s,\bm{\theta}^e,\bm{\vartheta}^s,\bm{\vartheta}^e,r,f)=&\;\text{RSS}_{\mathsf{M}}^{\text{log}} + R_1^{\text{new}} + R_2 + R_3 + R_4\;\text{, with}
\\[1ex]
\mathbf{y},\hat{\mathbf{y}}=&\;\mathsf{M}(\bm{\theta}^s,\bm{\theta}^e,\bm{\vartheta}^s,\bm{\vartheta}^e, r,f)\text{, }\mathcal{L}\;\text{becomes}
\\[1ex]
=&\;\;\log{\Bigg(1+\sum_{q=1}^{\max{Q}}\sum_{i=1}^{\lvert\mathbf{x}_q\rvert}\;\big(\mathbf{y}_i - m_q(b_{o_s}, b_{o_e}, b_{q_s}, b_{q_e},\;f,\mathbf{x}_{q,i})\big)^2\Bigg)}
\\
&\;-\log{\bigg(1-\frac{\bm{\phi}_v}{\bm{\phi}_u}\bigg)}
\\
&\;-\log\Bigg(\Bigg[\sum_{q=1}^{\max{Q}}\,b_{q_e}-b_{q_s}\Bigg]\;\Big/\;\Big(\max{\mathcal{X}^r}-\min{\mathcal{X}^r}\Big)\Bigg)
\\[1ex]
&\;+\log\Bigg({1+\sum_{q=1}^{\max{Q}}}\;(b_{q_e}-b_{q_s}-\mu)^2\Bigg)
\\[1ex]
&\;+\log{\Bigg(1+\sum_{q=1}^{\max{Q}}\;\mathcal{R}(\nu_l-b_{q_s})+\mathcal{R}(b_{q_s}-\nu_u)+\mathcal{R}(\nu_l-b_{q_e})+\mathcal{R}(b_{q_e}-\nu_u)\Bigg)}
\end{aligned}
$$
The gradient of this loss function is:
$$
\begin{aligned}
\delta_o,\psi_i,\mathbf{x}_{q,i}\;\dots&\;\text{as defined previously,}
\\[1ex]
\nabla\,\mathcal{L}(\dots)=&\;\Big[\frac{\partial\,\mathcal{L}}{\partial\,b_{q_s}}\;,\;\frac{\partial\,\mathcal{L}}{\partial\,b_{q_e}}\Big]\;\text{,}
\\[1ex]
=&\;\nabla\,\text{RSS}_{\mathsf{M}}^{\text{log}}\;+\;\nabla\,R_1^{\text{new}}\;+\;\nabla\,R_2\;+\;\nabla\,R_3\;+\;\nabla\,R_4\text{,}
\\[1ex]
=&\;\Bigg[\sum_{q=1}^{\max{Q}}\sum_{i=1}^{\lvert\mathbf{x}_{q}\rvert}\;\frac{2\,\delta_o\times\Big(\mathbf{y}_i-f\big(\psi_i\big)\Big)\times f'\big(\psi_i\big)}{\big(b_{q_s}-b_{q_e}\big)^2}\times\bigg[b_{q_e}-\mathbf{x}_{q_i}\;,\;\mathbf{x}_{q_i}-b_{q_s}\bigg]\Bigg]\;\text{,}
\\[1ex]
&\;+\nabla\,R_1^{\text{new}}=\;\dots\;\text{(too large to repeat here)}
\\[1ex]
&\;+\sum_{q=1}^{\max{Q}}\Bigg[\frac{1}{b_{q_e}-b_{q_s}}\;,\;\frac{1}{b_{q_s}-b_{q_e}}\Bigg]\;\text{(the gradient of}\;R_2\;\text{is no longer negative)}
\\[1ex]
&\;+\sum_{q=1}^{\max{Q}}\;\frac{2(b_{q_e}-b_{q_s}-\mu)}{1+(b_{q_e}-b_{q_s}-\mu)^2}\times\Big[-1\;,\;1\Big]
\\[1ex]
&\;+\sum_{q=1}^{\max{Q}}\;\frac{\bigg[\mathcal{H}(b_{q_s}-\nu_u)-\mathcal{H}(\nu_l-b_{q_s})\;,\;\mathcal{H}(b_{q_e}-\nu_u)-\mathcal{H}(\nu_l-b_{q_e})\bigg]}{1+\mathcal{R}(\nu_l-b_{q_s})+\mathcal{R}(b_{q_s}-\nu_u)+\mathcal{R}(\nu_l-b_{q_e})+\mathcal{R}(b_{q_e}-\nu_u)}\;\text{,}
\\[1ex]
=&\;\sum_{q=1}^{\max{Q}}\Bigg[\sum_{i=1}^{\lvert\mathbf{x}_q\rvert}\;\frac{2\,\delta_o\times\Big(\mathbf{y}_i-f\big(\psi_i\big)\Big)\times f'\big(\psi_i\big)}{\big(b_{q_s}-b_{q_e}\big)^2}\times\bigg[b_{q_e}-\mathbf{x}_{q_i}\;,\;\mathbf{x}_{q_i}-b_{q_s}\bigg]\Bigg]
\\[1ex]
&\;+\nabla\,R_1^{\text{new}}\dots
\\[1ex]
&\;+\Bigg[\frac{1}{b_{q_e}-b_{q_s}}\;,\;\frac{1}{b_{q_s}-b_{q_e}}\Bigg]\;\text{,}
\\[1ex]
&\;+\frac{2(b_{q_e}-b_{q_s}-\mu)}{1+(b_{q_e}-b_{q_s}-\mu)^2}\times\Big[-1\;,\;1\Big]
\\[1ex]
&\;+\frac{\bigg[\mathcal{H}(b_{q_s}-\nu_u)-\mathcal{H}(\nu_l-b_{q_s})\;,\;\mathcal{H}(b_{q_e}-\nu_u)-\mathcal{H}(\nu_l-b_{q_e})\bigg]}{1+\mathcal{R}(\nu_l-b_{q_s})+\mathcal{R}(b_{q_s}-\nu_u)+\mathcal{R}(\nu_l-b_{q_e})+\mathcal{R}(b_{q_e}-\nu_u)}
\end{aligned}
$$
Now we have a loss and its gradient, and we will implement and test this next, before going further to the Hessian.
## Implementation
Here we implement the loss function with its regularizers and gradient.
### Using optim's gradient
First we test how well the loss function works by having `optim()` use its own approximation of a gradient.
```{r}
temp <- loadResultsOrCompute(file = "../results/btw-final_test-loss.rds", computeExpr = {
pNames <- c(paste0("b", 1:length(query_bounds)), "loss")
fitResult <- FitResult$new(paramNames = pNames)$start()
optR <- optim(
par = query_bounds,
method = "L-BFGS-B",
lower = rep(0, length(query_bounds)),
upper = rep(1, length(query_bounds)),
fn = function(x) {
loss <- L_final(theta_b_org = query_bounds, theta_b = x, r = r, f = f)
fitResult$stopStep(resultParams = c(x, loss), verbose = interactive())
loss
}
)
list(
fr = fitResult$finish(),
optR = optR
)
})
plot(log(temp$fr$getFitHist()[, "loss"]), type="l")
temp2 <- M_final(
theta_b_org = query_bounds,
theta_b = temp$optR$par,
r = signal_ref,
f = signal_query)
plotBtw(data.frame(x = temp2$X, y = temp2$y_hat), bounds = temp$optR$par)
```
The loss function apparently works. Note that the above plot also includes the gradient iterations, as we cannot distinguish these from the loss evaluations using the `optim()` approach.
### Implementing the analytical gradient
We have previously derived the gradient for our loss function, including regularizers. Now we should implement it in code and try to use `optim()` with that gradient instead. For the analytical gradient, we also need to provide a derivative of the query-signal. While it is actually a sine, we take the numerical route here and approximate a function for the derivative.
```{r}
library(pracma)
#' Note that here we use the stabilized version of the query-signal.
#' TODO: We should supply helper functions to the user that use the
#' best possible method of derivation, like we do here.
f_deriv1 <- Vectorize(function(x) {
m <- if (x == 0) "forward" else if (x == 1) "backward" else "central"
pracma::fderiv(f = f, x = x, method = m)
})
curve(f, 0, 1)
curve(f_deriv1, 0, 1)
```
Let's make a test without the regularizers:
```{r}
temp <- loadResultsOrCompute(file = "../results/btw-final_test-loss_no-reg.rds", computeExpr = {
pNames <- c(paste0("b", 1:length(query_bounds)), "loss")
fitResult <- FitResult$new(paramNames = pNames)$start()
optR <- stats::optim(
par = query_bounds,
method = "L-BFGS-B",
lower = rep(0, length(query_bounds)),
upper = rep(1, length(query_bounds)),
fn = function(x) {
loss <- L_final(theta_b_org = query_bounds, theta_b = x, r = r, f = f, useR1 = FALSE, useR2 = FALSE)
fitResult$stopStep(resultParams = c(x, loss), verbose = interactive())
loss
},
gr = function(x) {
L_final_grad(theta_b_org = query_bounds, theta_b = x, r = r, f = f, f_p = f_deriv1, useR1 = FALSE, useR2 = FALSE)
}
)
list(
fr = fitResult$finish(),
optR = optR
)
})
print(round(temp$optR$par, 5))
plot(log(temp$fr$getFitHist()[, "loss"]), type="l")
temp2 <- M_final(
theta_b_org = query_bounds,
theta_b = temp$optR$par,
r = signal_ref,
f = signal_query)
plotBtw(data.frame(x = temp2$X, y = temp2$y_hat), bounds = temp$optR$par)
```
This works somewhat, the major problem being overlapping boundaries, which BTW cannot resolve without the regularizers (the $R_1$ regularizer addresses this). However, it seems to cover the whole reference in this example, even without the $R_2$ regularizer.
```{r}
set.seed(1337)
useBounds <- c(0, sort(runif(23)), 1)
#useBounds <- query_bounds
fitResult <- FitResult$new(paramNames = "loss")$start()
optR <- stats::optim(
par = useBounds,
method = "BFGS",
# lower = rep(0, length(useBounds)),
# upper = rep(1, length(useBounds)),
fn = function(x) {
#print(format(x, digits = 3, nsmall = 3, scientific = FALSE))
fitResult$startStep()
loss <- L_final_log(
theta_b_org = useBounds, theta_b = x, r = r, f = f, zNormalize = FALSE,
weightErr = 2/3, weightR1 = 1, weightR2 = .2, weightR3 = .5, weightR4 = 1)
#print(loss)
fitResult$stopStep(resultParams = loss, verbose = TRUE)
loss
# },
# gr = function(x) {
# pracma::grad(x0 = x, f = function(x) {
# L_final_log(theta_b_org = useBounds, theta_b = x, r = r, f = f, weightErr = .5, weightR2 = .5)
# })
# # grad <- L_final_log_grad(
# # theta_b_org = useBounds, theta_b = x, r = r, f = f, f_p = f_deriv1,
# # weightErr = .5, weightR1 = 1, weightR2 = .5, weightR3 = 1, weightR4 = 1)
# #
# # print(format(grad, digits = 3, nsmall = 3, scientific = FALSE))
# # grad
}
)
fitResult$finish()
```
```{r}
temp3 <- M_final(
theta_b_org = useBounds,
theta_b = optR$par,
r = signal_ref,
f = signal_query)
plotBtw(data.frame(x = temp3$X, y = temp3$y_hat), optR$par)
```