This repository has been archived by the owner on Aug 1, 2024. It is now read-only.
Hi! Please correct me if I am wrong, but I saw here that
Answered by tomsercu on Dec 6, 2022
Answer selected by jasperhyp
If you look at the definition of `lm_head`, you'll see that its forward pass is dense -> nonlinearities -> output layer. So the (480, 480) weight is the first layer of a 2-layer MLP.
The output layer's weight is shared with the transformer's input token embedding.
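
To make the shapes concrete, here is a minimal PyTorch sketch of a head with that structure. It is not the repository's actual code: the class name `TiedLMHead`, the choice of GELU followed by LayerNorm as the "nonlinearities", and the vocabulary size of 33 are all assumptions for illustration. Only the 480 hidden size and the weight sharing come from the answer above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TiedLMHead(nn.Module):
    """Hypothetical LM head: dense -> nonlinearities -> tied output layer."""

    def __init__(self, embed_dim: int, embed_weight: nn.Parameter):
        super().__init__()
        # First layer of the MLP: its weight has shape (embed_dim, embed_dim).
        self.dense = nn.Linear(embed_dim, embed_dim)
        # Assumed nonlinearities: GELU followed by LayerNorm.
        self.layer_norm = nn.LayerNorm(embed_dim)
        # Output layer reuses the embedding matrix, shape (vocab_size, embed_dim),
        # so it adds no new weight matrix, only a bias.
        self.weight = embed_weight
        self.bias = nn.Parameter(torch.zeros(embed_weight.shape[0]))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.dense(x)
        x = F.gelu(x)
        x = self.layer_norm(x)
        # Project back to vocabulary logits with the shared embedding weight.
        return F.linear(x, self.weight) + self.bias


embed_dim, vocab_size = 480, 33  # vocab_size is an assumed value
embed_tokens = nn.Embedding(vocab_size, embed_dim)
lm_head = TiedLMHead(embed_dim, embed_tokens.weight)

tokens = torch.randint(0, vocab_size, (2, 7))
hidden = embed_tokens(tokens)  # stand-in for the transformer's output
logits = lm_head(hidden)

print(tuple(lm_head.dense.weight.shape))       # the square first-layer weight
print(lm_head.weight is embed_tokens.weight)   # output layer is tied to the embedding
print(tuple(logits.shape))                     # (batch, sequence, vocab_size)
```

Under these assumptions, the only square weight in the head belongs to `dense`; the projection to the vocabulary is the same parameter object as the embedding table, which is why it does not show up as a separate (vocab_size, 480) matrix.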