
About the experiments in Section 4.2 (Performance on Synthetic Long Context Tasks) in the paper #49

Open
zhlgg opened this issue Dec 28, 2024 · 1 comment

Comments


zhlgg commented Dec 28, 2024

Dear authors,
Regarding the experimental results in Section 4.2, I noticed that you compared the performance of models using SWA with models using the SelfExtend method on the passkey retrieval task. Although SWA limits the attention window between tokens, an LLM has many layers. Even if the last token does not attend to the tokens where the passkey is located in the first layer, the passkey tokens can still pass their information to later tokens within the SWA window, and this propagation can continue layer by layer up to the final layer. Why, then, can't the passkey information reach the tokens that need to be generated? I am really curious about this question and look forward to your response!
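
To make the layer-by-layer propagation argument concrete, here is a minimal sketch; the window size, layer count, and token positions are assumed values for illustration, not the paper's settings:

```python
# Rough receptive-field estimate for sliding window attention (SWA).
# With a causal window of W tokens, each layer lets information move at most
# W - 1 positions forward, so after L layers the last token can in principle
# be reached from about L * (W - 1) positions back.

def theoretical_reach(window_size: int, num_layers: int) -> int:
    return num_layers * (window_size - 1)

window_size = 4096    # assumed SWA window
num_layers = 32       # assumed number of transformer layers
passkey_pos = 20_000  # assumed position of the passkey in the prompt
last_pos = 100_000    # assumed position of the last (generating) token

reach = theoretical_reach(window_size, num_layers)
print(reach)                            # 131040
print(last_pos - passkey_pos <= reach)  # True: in principle the passkey
                                        # info could flow to the last token
```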


zhlgg commented Dec 28, 2024

I also observed that with the SelfExtend method, when the passkey is placed far from the last token, the passkey tokens all end up with nearly the same relative distance to the last token. Why doesn't this cause the order of the passkey tokens to be reversed in the output?
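
For reference, here is a rough sketch of the grouped position mapping I am referring to; the group size, neighbor window, and offset are my own assumptions for illustration and may differ from the exact values used in SelfExtend:

```python
# SelfExtend-style grouped relative positions (illustrative sketch only).
# Nearby keys keep their exact relative position; distant keys are mapped
# through floor division by a group size, so many distinct distances
# collapse onto the same mapped value.

def grouped_relative_position(q_pos: int, k_pos: int,
                              group_size: int = 8,
                              neighbor_window: int = 512) -> int:
    dist = q_pos - k_pos
    if dist <= neighbor_window:
        return dist
    # Offset keeps the grouped positions roughly continuous with the
    # neighbor region (my assumption about how the shift is chosen).
    shift = neighbor_window - neighbor_window // group_size
    return (q_pos // group_size) - (k_pos // group_size) + shift

# Two adjacent passkey tokens placed far from the query position:
q = 10_000
print(grouped_relative_position(q, 2_000))  # 1448
print(grouped_relative_position(q, 2_001))  # 1448 -> same mapped distance
```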
