At least in LongRope, NTK uses two groups:
1. A low-frequency group for shorter positions (smaller scaling factor).
2. A high-frequency group for longer positions (larger scaling factor).

### YaRN (Yet another RoPE ExtensioN method)
YaRN is also an extension of RoPE and builds upon the NTK idea.

The notation that the YaRN paper uses is a bit different from the one I have
used above:
```
f'w(x_m, m, Θ_d) = fw(x_m, g(m), h(Θ_d))
```
So the function `fw` is parameterized by W (W for weights); you can think of
this as a field/member of a struct/class that the function is also a member of.
The other parameters are the inputs to the function: `x_m` is the embedding for
a token in the sequence, `m` is that token's position in the sequence, and Θ_d
is the set of angles for the dimensions of the embedding space.

For PI the functions become:
```
g(m) = m/s
h(Θ_d) = Θ_d
s = L'/L    (L = original context length, L' = extended context length)
```
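
To make this concrete, here is a minimal Python sketch of the PI scaling above
(the 4096/8192 values and the helper `scaled_position` are only illustrative,
not from the paper):
```python
# Minimal sketch of Position Interpolation (PI): positions are squeezed
# linearly so that the extended context maps back into the trained range.
orig_ctx = 4096              # L : context length seen during training
ext_ctx  = 8192              # L': extended context length we want to support
s = ext_ctx / orig_ctx       # s = L'/L = 2.0

def scaled_position(m: int) -> float:
    # g(m) = m / s, while h(Θ_d) = Θ_d (the angles themselves are unchanged).
    return m / s

print(scaled_position(8191))  # 4095.5, fits inside the original trained range
```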

YaRN introduces the wavelength λ_d of the RoPE rotation in the d-th dimension,
which is defined as:
```
       2π
λ_d = ---- = 2πb^(2d/|D|)
      Θ_d

Θ_d = the rotation angle for the d-th dimension.
b = the base frequency.
|D| = the total number of dimensions in the embedding space.
```
The wavelength λ_d specifies how far along the sequence of input tokens we need
to go before the positional embedding for a particular dimension repeats.
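
Here is a minimal Python sketch of how Θ_d and λ_d could be computed per
dimension, assuming the common RoPE base of 10000 and a 128-dimensional head
(both are just example values):
```python
import math

b = 10000.0   # base frequency
D = 128       # |D|: number of dimensions (per attention head)

# RoPE rotates pairs of dimensions, so there are |D|/2 angles.
for d in range(D // 2):
    theta_d  = b ** (-2.0 * d / D)     # Θ_d = b^(-2d/|D|)
    lambda_d = 2 * math.pi / theta_d   # λ_d = 2π/Θ_d = 2πb^(2d/|D|)
    print(f"d={d:3d}  Θ_d={theta_d:.8f}  λ_d={lambda_d:12.1f} tokens")
```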

Each dimension in the embedding space has its own value of Θ_d. For a given
token at position m, the positional encoding for the d-th dimension is derived
from m scaled by Θ_d, that is m * Θ_d.

Let's try to understand this a little better. Take the following table which
shows the values for λ_d=4:
```
Token position   λ_d   Rotation angle (radians)   Rotation angle (degrees)
0                4     0                          0
1                4     π/2                        90
2                4     π                          180
3                4     3π/2                       270
4                4     2π   (same as 0)           360 (same as 0)
5                4     5π/2 (same as π/2)         450 (same as 90)
6                4     3π   (same as π)           540 (same as 180)
7                4     7π/2 (same as 3π/2)        630 (same as 270)
8                4     4π   (same as 0)           720 (same as 0)
9                4     9π/2 (same as π/2)         810 (same as 90)
```
Notice that after a full cycle (2π) the rotation angle resets to 0, so there
are only 4 unique values for the rotation angle. Recall that sine and cosine
are periodic (think of the unit circle): going around the circle multiple times
and landing on the same place gives the same rotation angle.

So for this dimension the model will only be able to distinguish between 4
different token positions. Recall that Θ_d is the rotation angle for a specific
dimension, and after 4 tokens this rotation angle repeats, so for this dimension
the model can at most tell apart positions that are within a span of 4 tokens.

The positional encoding allows the model to recognize patterns and relationships
within each span of 4 tokens uniquely. However, beyond this span, the same
values repeat, providing a way to capture periodic structures.
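
This repetition can be verified with a short Python sketch (λ_d = 4 means a
rotation step of 2π/4 = π/2 per token):
```python
import math

lambda_d = 4                        # wavelength in tokens for this dimension
theta_d  = 2 * math.pi / lambda_d   # rotation step per token: π/2

for m in range(10):
    angle = m * theta_d
    # The effective rotation only depends on the angle modulo 2π.
    print(f"token {m}: sin={math.sin(angle):+.1f}  cos={math.cos(angle):+.1f}")
# Tokens 0, 4 and 8 print the same sin/cos pair, as do 1, 5 and 9, etc.
```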

```
emb₀ emb₁ emb₂ emb₃ emb₄
token₀:
token₁:
token₂:
token₃:
token₄:
...
```
Each embedding dimension has its own theta, namely Θ_d. And recall that each
dimension represents a feature, so every feature has its own rotation angle.
```
Θ₀ Θ₁ Θ₂ Θ₃ Θ₄
token₀:
token₁:
token₂:
token₃:
token₄:
...
```
In YaRN every dimension has a λ_d value which specifies how many tokens we can
advance before the rotation cycle repeats. So if we have a λ_d value of 4 then
the rotation angle will repeat every 4 tokens and there will be 4 unique
rotations.


```
Sequence: "Dan loves ice cream"
Θ₁
token₀ (Dan) sin(0)=0, cos(0)=1
token₁ (loves) sin(π/2)=1, cos(π/2)=0
token₂ (ice) sin(π)=0, cos(π)=-1
token₃ (cream) sin(3π/2)=-1, cos(3π/2)=0
token₄ (Dan) sin(2π)=0, cos(2π)=1 (same as token₀)
token₅ (loves) sin(5π/2)=1, cos(5π/2)=0 (same as token₁)
token₆ (ice) sin(3π)=0, cos(3π)=-1 (same as token₂)
token₇ (cream) sin(7π/2)=-1, cos(7π/2)=0 (same as token₃)
```
Now, keep in mind that each dimension represents a feature of some kind that
the model has learned. We are only dealing with positional encodings here, so
when a rotation angle repeats it does so for this specific dimension, and each
dimension has its own lambda value.

#### NTK-by-parts:
In PI and NTK-aware interpolation all RoPE dimensions are scaled by the same
factor. One thing that was observed is that for a given context length L there
are dimensions that end up with a wavelength greater than the maximum context
length seen during training (λ_d > L).
The YaRN paper also mentions that when stretching the RoPE dimensions, either
by a scale `s` or by changing the base frequency `b`, all tokens become closer
to each other.
To address these issues they choose not to interpolate the higher frequencies
at all, while always interpolating the lower frequencies:
* If the wavelength λ_d is much smaller than the context size L, no
  interpolation is done.
* If the wavelength λ_d is equal to or larger than the context size L, the
  dimension is interpolated (not extrapolated).
* For dimensions in between the two cases above they do a bit of both, similar
  to the NTK-aware approach.

So there is a need to distinguish between where we don't interpolate, where we
do interpolate, and the range in between where we do a bit of both.

The number of full rotations a specific dimension makes over the context
length L is given by:
```
        L          L
r(d) = --- = -------------
       λ_d   2πb'^(2d/|D|)
```
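
A quick Python sketch (again with example values: base 10000, |D| = 128,
L = 4096) that computes r(d) for every dimension and flags the ones whose
wavelength exceeds the trained context length:
```python
import math

b, D, L = 10000.0, 128, 4096   # base, |D|, trained context length (examples)

for d in range(D // 2):
    theta_d  = b ** (-2.0 * d / D)
    lambda_d = 2 * math.pi / theta_d
    r = L / lambda_d               # r(d): full rotations over the context
    if r < 1.0:                    # equivalent to λ_d > L
        print(f"d={d:3d}  λ_d={lambda_d:10.1f} > L, only {r:.2f} rotations")
```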

They introduce two extra parameters `α` and `β` which are used to determine
these ranges.
Where r(d) < `α` we linearly interpolate by the scale `s` (same as PI).
Where r(d) > `β` we don't interpolate at all (the ramp value below is 1).
```
y(r) = { 0,                 if r < α
       { 1,                 if r > β
       { (r - α)/(β - α),   otherwise
```
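
To tie this together, here is a minimal Python sketch of the ramp y(r) and of
how it can be used to blend the PI-interpolated angle Θ_d/s with the unmodified
Θ_d per dimension. The defaults α = 1 and β = 32 are the values the YaRN paper
suggests for the LLaMA family; treat the whole thing as an illustration rather
than a reference implementation:
```python
import math

def ramp(r: float, alpha: float, beta: float) -> float:
    # y(r): 0 -> fully interpolate (PI), 1 -> leave the dimension untouched.
    if r < alpha:
        return 0.0
    if r > beta:
        return 1.0
    return (r - alpha) / (beta - alpha)

def ntk_by_parts_theta(theta_d: float, L: int, s: float,
                       alpha: float = 1.0, beta: float = 32.0) -> float:
    lambda_d = 2 * math.pi / theta_d   # wavelength of this dimension
    r = L / lambda_d                   # r(d): full rotations over the context
    y = ramp(r, alpha, beta)
    # Blend: interpolated angle Θ_d/s where y=0, original angle Θ_d where y=1.
    return (1.0 - y) * (theta_d / s) + y * theta_d

# Example: a low-frequency dimension (r(d) < α) gets fully divided by s.
print(ntk_by_parts_theta(theta_d=1e-4, L=4096, s=2.0))  # 5e-05
```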


### Theta calculation