Added RMSNorm #545
Conversation
Nice! This would be good to have. Also, what should the default value of eps be? In my quick skim of the papers I didn't see it mentioned, i.e. eps=0. In practice I imagine having something here is a good idea, I'm just wary of divergences from canonical implementations.
Ah! Good catch on the numerics! If this looks right I'll rebase into 1 commit to merge
$$\frac{x}{\sqrt{\varepsilon + \frac{1}{n}\Vert x \Vert^2_2}} \gamma + \beta$$

where $\Vert \cdot \Vert_2$ is the 2-norm, $n = \dim(x)$, and $\gamma$ is a
I'd probably write it out explicitly, to avoid ambiguity over whether the "2-norm" is a mean or a sum.
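For reference, the fully written-out form would read something like the following (a sketch of the suggested rewrite, expanding the norm as an explicit sum over the $n$ entries of $x$; not necessarily the exact wording that ended up in the docstring):

$$\frac{x}{\sqrt{\varepsilon + \frac{1}{n}\sum_{i=1}^{n} x_i^2}} \, \gamma + \beta$$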
where $\Vert \cdot \Vert_2$ is the 2-norm, $n = \dim(x)$, and $\gamma$ is a
learned array with the same shape as $x$ if `use_weight=True`, or
$\gamma = 1/\sqrt{n}$ if `use_weight=False`, as proposed in
Not true as per discussion?
- `shape`: Shape of the input.
- `eps`: Value added to denominator for numerical stability.
- `use_weight`: Whether the module has learnable affine weights.
`use_bias` is missing?
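A possible entry to add alongside the existing bullets (hypothetical wording, simply mirroring the `use_weight` line above):

- `use_bias`: Whether the module has learnable affine biases.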
shape: tuple[int] = field(static=True)
This annotation should be `tuple[int, ...]`.
"to replace `rms_norm(x)` with `jax.vmap(rms_norm)(x)`.\n" | ||
) | ||
inv_rms = jax.lax.rsqrt(jnp.sum(x**2) + self.eps) | ||
out = jnp.sqrt(self.dim) * inv_rms * x |
Note that `inv_rms` is badly named here, as it's actually an inv-sum-of-squares right now. Can you just replace `jnp.sum` with `jnp.mean`?
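A minimal standalone sketch of what the corrected computation might look like, assuming the suggested switch to `jnp.mean` (the `_rms_normalise` helper name is mine, for illustration only):

```python
import jax
import jax.numpy as jnp


def _rms_normalise(x, eps):
    # With the mean of squares, `inv_rms` really is the inverse
    # root-mean-square of x, so no extra sqrt(n) factor is needed.
    inv_rms = jax.lax.rsqrt(jnp.mean(x**2) + eps)
    return inv_rms * x
```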
else:
    shape = tuple(shape)
self.shape = shape
self.dim = prod(self.shape)
As this is easily derivable from existing attributes (and, after my suggested change in `__call__`, unused), I think it can be removed.
self.use_weight = use_weight
self.use_bias = use_bias
self.weight = jnp.ones(shape) if use_weight else None
self.bias = jnp.ones(shape) if use_bias else None
`bias` should be initialized with `jnp.zeros` so that at initialization the elementwise affine is a no-op.
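In other words, something along these lines (a sketch with placeholder values; only the `bias` line changes from the diff above):

```python
import jax.numpy as jnp

shape = (4,)  # placeholder shape, for illustration only
use_weight, use_bias = True, True
weight = jnp.ones(shape) if use_weight else None
# Zeros rather than ones, so that `weight * out + bias` is the identity at init.
bias = jnp.zeros(shape) if use_bias else None
```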
Oh, hey Jon!
It looks like this PR isn't going anywhere at the moment -- if you or anyone else wants to pick it up then I'd still be happy to see this merged.
Hey Patrick :)
I'd be happy to pick this up. Would you/Jason prefer that I make a new PR against `main`, or should I create a PR against `packquickly:rmsnorm`?
No strong feelings! As you've done it is fine.
Closing in favour of #629.
Added RMS normalisation, a simplified variant of layer norm which computes

$$\frac{x}{\sqrt{\varepsilon + \frac{1}{n}\Vert x \Vert^2_2}} \gamma + \beta$$

where $n = \dim(x)$, $\gamma$ is a learnable array if `use_weight=True` or $\sqrt{n}$ if `use_weight=False`, and $\beta$ is an optional bias term. This has become somewhat popular in transformers for being simpler than layer norm but having similar/better performance.

Originally proposed in this paper; see Lucidrain's discussion about this for more details.

The defaults `use_weight=False` and `use_bias=False` are intentional. This is meant to be a faster and simpler version of layer norm, so leaving these off by default made the most sense to me.
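For anyone picking this up, here is a rough sketch of how the module might look after folding in the review comments above. It is only a sketch: the `eps` default, field names, and error message wording are my own guesses, not the API that was eventually merged in #629.

```python
from typing import Optional

import equinox as eqx
import jax
import jax.numpy as jnp
from jaxtyping import Array


class RMSNorm(eqx.Module):
    """Sketch only; see #629 for the merged implementation."""

    shape: tuple[int, ...] = eqx.field(static=True)
    eps: float = eqx.field(static=True)
    use_weight: bool = eqx.field(static=True)
    use_bias: bool = eqx.field(static=True)
    weight: Optional[Array]
    bias: Optional[Array]

    def __init__(self, shape, eps=1e-6, use_weight=False, use_bias=False):
        if isinstance(shape, int):
            shape = (shape,)
        self.shape = tuple(shape)
        self.eps = eps
        self.use_weight = use_weight
        self.use_bias = use_bias
        self.weight = jnp.ones(self.shape) if use_weight else None
        # Zeros, so the elementwise affine is a no-op at initialisation.
        self.bias = jnp.zeros(self.shape) if use_bias else None

    def __call__(self, x: Array) -> Array:
        if x.shape != self.shape:
            raise ValueError(
                "`RMSNorm(shape)(x)` expects `x.shape == shape`. To normalise a "
                "batch, replace `rms_norm(x)` with `jax.vmap(rms_norm)(x)`."
            )
        # Mean of squares: `inv_rms` is the inverse root-mean-square of x.
        inv_rms = jax.lax.rsqrt(jnp.mean(x**2) + self.eps)
        out = inv_rms * x
        if self.use_weight:
            out = self.weight * out
        if self.use_bias:
            out = out + self.bias
        return out
```

With this shape-per-instance design, a batch of vectors would be handled with something like `jax.vmap(RMSNorm(64))(xs)`, matching the error message above.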