After reading the MTP section and working through the formulas, I found that Equation 22 and Figure 3 are inconsistent. Equation 22 uses the subscript [1:T−k], so the hidden states that each MTP module takes from depth k−1 start at the first token and run through position T−k. In the example in the paper, Equation 22 implies that the input to the main model should be [t1, t2, t3, t4, t5] (perhaps t6 should also be an input to the main model); only then can the first MTP module (k = 1) receive the previous depth's hidden state at position 5 (T−k).
Taken together with the MTP loss in Equation 24, where every MTP module predicts up to the (T+1)-th token when computing its loss, this confirms that Equation 22 is correct and that the error is in Figure 3.
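To make the index bookkeeping concrete, here is a minimal sketch of how I read the ranges. The function name `mtp_index_ranges` is hypothetical, and the offsets `i + k` (token embedding fed to depth k) and `i + k + 1` (prediction target) are my assumptions about the surrounding equations; only the range [1:T−k] and the final target T+1 are taken from Equations 22 and 24 as described above. T = 6 is inferred from "T−k = 5 with k = 1".

```python
def mtp_index_ranges(T: int, k: int) -> dict:
    """List the 1-based indices used by the k-th MTP module.

    Assumption (not quoted in this issue): at position i the module combines
    the depth-(k-1) hidden state h_i with the embedding of token t_{i+k}
    and predicts token t_{i+k+1}.
    """
    if not 0 < k < T:
        raise ValueError("expected 0 < k < T")
    positions = list(range(1, T - k + 1))  # the [1:T-k] subscript of Equation 22
    return {
        "hidden_positions": positions,                      # h_i from depth k-1
        "embedding_tokens": [i + k for i in positions],     # assumed offset
        "target_tokens": [i + k + 1 for i in positions],    # assumed offset
    }


if __name__ == "__main__":
    T = 6  # inferred from the example: T - k = 5 when k = 1
    for k in (1, 2):
        ranges = mtp_index_ranges(T, k)
        print(f"k={k}: hidden {ranges['hidden_positions']}, "
              f"targets {ranges['target_tokens']}")
```

Under these assumptions, the first MTP module needs hidden states at positions 1 through 5 from the main model, and its last target is token T+1 = 7, which matches the reading of Equations 22 and 24 above.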
I would appreciate a reply from the maintainers. Thanks.