[Question] There is a difference in the logic of Figure 3 and Equation 22 at the part of MTP in the technical report. #655

ShareLer · 2025-02-13T10:10:46Z

After reading and deriving the formulas in the MTP section, I found that Equation 22 and Figure 3 are logically inconsistent. Because Equation 22 uses the subscript [1:T−k], the hidden states from layer (k−1) that feed each MTP module run from the first token to the (T−k)-th token. For the example in the report, Equation 22 implies that the input to the main model should be [t1, t2, t3, t4, t5] (perhaps t6 should also be part of the main model's input); only then can the first MTP module receive the previous layer's hidden state for the 5th token (T−k with k=1).
Combined with the MTP loss in Equation 24, under which every MTP module predicts the (T+1)-th token when computing its loss, I conclude that Equation 22 is correct and that Figure 3 contains an error.
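
The index bookkeeping behind this argument can be checked with a small sketch. It assumes that the k-th MTP module, at position i, combines the previous layer's hidden state at i with the embedding of token t_{i+k} and predicts token t_{i+k+1}; those offsets are an assumption about the report's formulation, not something stated in this issue. The function name `mtp_index_ranges` and the value T = 6 are purely illustrative.

```python
def mtp_index_ranges(T: int, k: int) -> dict:
    """Return 1-based index ranges for the k-th MTP module on a length-T sequence.

    Assumption (not taken from this issue): position i of module k embeds
    token t_{i+k} and predicts token t_{i+k+1}.
    """
    if not 1 <= k < T:
        raise ValueError("need 1 <= k < T")
    # Positions covered by the subscript [1:T-k] of Equation 22.
    positions = list(range(1, T - k + 1))
    # Token indices whose embeddings are fed in at each position (assumed offset k).
    embedded = [i + k for i in positions]
    # Token indices predicted at each position (assumed offset k + 1).
    targets = [i + k + 1 for i in positions]
    return {"positions": positions, "embedded": embedded, "targets": targets}


if __name__ == "__main__":
    T = 6  # illustrative sequence length
    for k in (1, 2):
        r = mtp_index_ranges(T, k)
        print(f"k={k}: positions={r['positions']} "
              f"embedded={r['embedded']} targets={r['targets']}")
```

Under these assumptions the last target index is (T−k)+k+1 = T+1 for every k, while the number of positions shrinks by one with each additional module. That matches the reading of Equations 22 and 24 given above.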

I hope to get a reply from the maintainers. Thanks.
