[Question] There is a difference in the logic of Figure 3 and Equation 22 at the part of MTP in the technical report. #655

Open
ShareLer opened this issue Feb 13, 2025 · 0 comments

After reading the MTP section and deriving the formulas myself, I found that Equation 22 and Figure 3 are logically inconsistent. Equation 22 uses the subscript [1:T−k], so the (k−1)-th layer hidden states that each MTP module takes as input start at the first token and end at token T−k. In the example from the report, Equation 22 therefore implies that the main model's input should be [t1, t2, t3, t4, t5] (and perhaps t6 as well); only then can the first MTP module obtain the previous layer's hidden state at the 5th position (T−k with k=1).
Equation 24, the MTP loss, points the same way: every MTP module's loss runs up to the (T+1)-th token. I therefore conclude that Equation 22 is correct and that Figure 3 contains an error.
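
To make the index argument concrete, here is a minimal sketch that only enumerates positions; it is not taken from the DeepSeek code. It assumes my reading of the report: at position i, the k-th MTP module combines the previous layer's hidden state at i with the embedding of token t_{i+k} and predicts token t_{i+k+1}, for i in [1:T−k]. The function name `mtp_index_ranges` is hypothetical, T=6 is inferred from the "5th (T−k, k=1)" position above, and k=0 is used here to stand for the main model.

```python
def mtp_index_ranges(T, k):
    """Enumerate 1-based indices used by the k-th MTP module.

    Assumption (my reading of the report, not the official code):
    position i in [1, T-k] uses the previous layer's hidden state at i,
    the embedding of token i+k, and is trained to predict token i+k+1.
    """
    positions = list(range(1, T - k + 1))
    return {
        "hidden_positions": positions,                 # subscript [1:T-k]
        "input_tokens": [i + k for i in positions],    # embedded tokens
        "target_tokens": [i + k + 1 for i in positions],  # loss targets
    }


if __name__ == "__main__":
    T = 6  # assumed sequence length for the example
    for k in range(0, 3):  # k=0 stands for the main model in this sketch
        r = mtp_index_ranges(T, k)
        print(f"k={k}: hidden {r['hidden_positions']}, "
              f"inputs {r['input_tokens']}, targets {r['target_tokens']}")
```

Under these assumptions, the targets of every module span 2+k through T+1, which matches the range I read off Equation 24, and the first MTP module needs previous-layer hidden states at positions 1 through 5.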

I hope the maintainers can reply. Thanks.
