Why is `lm_head` in ESM-2 (35M) (480, 480)? #416

jasperhyp · 2022-12-06T16:32:49Z

Hi! Please correct me if I am wrong, but I saw here that `lm_head` is (480, 33). However, when I load the model with `model, alphabet = esm.pretrained.esm2_t12_35M_UR50D()`, `model.lm_head` contains a dense layer of shape (480, 480). Why the discrepancy?

tomsercu · 2022-12-06T22:29:40Z

When you look at the definition of `lm_head`, you'll see that the forward pass is dense -> nonlinearities -> output layer. So the (480, 480) layer is the first layer of a two-layer MLP.
The output layer shares its weights with the transformer's input token embedding.
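To make the shapes concrete, here is a minimal PyTorch sketch of such a head. It is not the `esm` source: the class name `TiedLMHead` is hypothetical, and the choice of GELU followed by LayerNorm for the "nonlinearities" step is an assumption. The sizes 480 and 33 are taken from the question above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TiedLMHead(nn.Module):
    """Illustrative head: dense -> nonlinearities -> tied output layer.

    Hypothetical class for illustration; GELU + LayerNorm are assumed here.
    """

    def __init__(self, embed_dim: int, embed_weight: torch.Tensor):
        super().__init__()
        # First layer of the MLP: its weight has shape (embed_dim, embed_dim).
        self.dense = nn.Linear(embed_dim, embed_dim)
        self.layer_norm = nn.LayerNorm(embed_dim)
        # Output projection reuses the token embedding matrix,
        # which has shape (vocab_size, embed_dim).
        self.weight = embed_weight
        self.bias = nn.Parameter(torch.zeros(embed_weight.shape[0]))

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        x = self.dense(features)
        x = F.gelu(x)
        x = self.layer_norm(x)
        # Project hidden states onto the vocabulary with the shared weights.
        return F.linear(x, self.weight) + self.bias


embed_dim, vocab_size = 480, 33
embed_tokens = nn.Embedding(vocab_size, embed_dim)
head = TiedLMHead(embed_dim, embed_tokens.weight)

hidden = torch.randn(2, 7, embed_dim)  # (batch, sequence length, embed_dim)
logits = head(hidden)

print(tuple(head.dense.weight.shape))  # shape of the first (dense) layer
print(tuple(head.weight.shape))        # shape of the shared output weights
print(tuple(logits.shape))             # per-token scores over the vocabulary
```

In this sketch, printing the module shows only the (480, 480) dense layer as a `Linear` submodule. The projection to 33 tokens lives in a bare `weight` parameter that is the same tensor as the embedding matrix, which may be why it is easy to overlook when inspecting the head.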

jasperhyp Dec 6, 2022
Author

Thank you!
