This repository has been archived by the owner on Aug 1, 2024. It is now read-only.

Why is lm_head in ESM-2 (35M) (480, 480)? #416

Answered by tomsercu
jasperhyp asked this question in General

When you look at the definition of lm_head, you'll see that the forward pass is dense -> nonlinearities -> output layer. So the (480, 480) weight is the first layer of a 2-layer MLP.
The output layer shares its weight with the transformer's input token embedding, which is why it doesn't show up as a separate vocabulary-sized matrix in lm_head.
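
To make the shapes concrete, here is a rough dependency-free sketch of that structure. It is not the actual ESM code: the names (`embed_tokens`, `dense_weight`, `lm_head`), the vocabulary size of 33, the choice of GELU followed by a layer norm without learnable scale/shift, and the random initial values are all assumptions for illustration. Only the 480 hidden size and the dense -> nonlinearities -> tied output layer ordering come from the discussion above.

```python
import math
import random

EMBED_DIM = 480   # hidden size of ESM-2 (35M), from the question
VOCAB_SIZE = 33   # assumed alphabet size, for illustration only


def linear(x, weight, bias):
    """Compute weight @ x + bias, with weight stored as (out_features, in_features)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def gelu(x):
    """Exact GELU, applied elementwise."""
    return [0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0))) for v in x]


def layer_norm(x, eps=1e-5):
    """Layer norm without learnable scale/shift (a simplification)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)

# Input token embedding table, reused below as the output projection.
embed_tokens = random_matrix(VOCAB_SIZE, EMBED_DIM, rng)   # (33, 480)

# First layer of the 2-layer MLP: this is the (480, 480) weight.
dense_weight = random_matrix(EMBED_DIM, EMBED_DIM, rng)    # (480, 480)
dense_bias = [0.0] * EMBED_DIM

# The output layer only owns a bias; its weight is embed_tokens.
output_bias = [0.0] * VOCAB_SIZE


def lm_head(hidden):
    x = linear(hidden, dense_weight, dense_bias)   # 480 -> 480
    x = layer_norm(gelu(x))                        # nonlinearities
    return linear(x, embed_tokens, output_bias)    # 480 -> 33, tied weight


hidden = [rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)]
logits = lm_head(hidden)

dense_shape = (len(dense_weight), len(dense_weight[0]))
output_shape = (len(embed_tokens), len(embed_tokens[0]))
print(dense_shape, output_shape, len(logits))
```

In this sketch the only square matrix owned by the head is `dense_weight`; the vocabulary-sized projection is the same object as the embedding table, so listing the head's own parameters would show the (480, 480) weight plus biases and nothing shaped like (vocab, 480).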

Answer selected by jasperhyp