
About the experiments in Section 4.2 (Performance on Synthetic Long Context Tasks) in the paper #49

Open
zhlgg opened this issue Dec 28, 2024 · 1 comment

Comments


zhlgg commented Dec 28, 2024

Dear authors,
Regarding the experimental results in Section 4.2, I noticed that you compared the performance of models using SWA with models using the SelfExtend method on the passkey retrieval task. Although SWA limits the attention window between tokens, an LLM has many layers. Even if the last token does not attend to the tokens where the passkey is located in the first layer, the passkey tokens can still pass their information to later tokens within the SWA window, and this propagation can continue layer by layer up to the final layer. Why, then, can't the passkey information reach the tokens that need to be generated? I am really curious about this question and look forward to your response!
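
To make the layer-by-layer propagation argument concrete, here is a minimal sketch; the window size, layer count, and token positions are assumed values for illustration, not the paper's settings:

```python
# Rough receptive-field estimate for sliding window attention (SWA).
# With a causal window of W tokens, each layer lets information move at most
# W - 1 positions forward, so after L layers the last token can in principle
# be reached from about L * (W - 1) positions back.

def theoretical_reach(window_size: int, num_layers: int) -> int:
    return num_layers * (window_size - 1)

window_size = 4096    # assumed SWA window
num_layers = 32       # assumed number of transformer layers
passkey_pos = 20_000  # assumed position of the passkey in the prompt
last_pos = 100_000    # assumed position of the last (generating) token

reach = theoretical_reach(window_size, num_layers)
print(reach)                            # 131040
print(last_pos - passkey_pos <= reach)  # True: in principle the passkey
                                        # info could flow to the last token
```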


zhlgg commented Dec 28, 2024

I also observed that with the SelfExtend method, when the passkey is placed far from the last token, the passkey tokens all end up with nearly the same relative distance to the last token. Why doesn't this cause the order of the passkey tokens to be reversed in the output?
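
For reference, here is a rough sketch of the grouped position mapping I am referring to; the group size, neighbor window, and offset are my own assumptions for illustration and may differ from the exact values used in SelfExtend:

```python
# SelfExtend-style grouped relative positions (illustrative sketch only).
# Nearby keys keep their exact relative position; distant keys are mapped
# through floor division by a group size, so many distinct distances
# collapse onto the same mapped value.

def grouped_relative_position(q_pos: int, k_pos: int,
                              group_size: int = 8,
                              neighbor_window: int = 512) -> int:
    dist = q_pos - k_pos
    if dist <= neighbor_window:
        return dist
    # Offset keeps the grouped positions roughly continuous with the
    # neighbor region (my assumption about how the shift is chosen).
    shift = neighbor_window - neighbor_window // group_size
    return (q_pos // group_size) - (k_pos // group_size) + shift

# Two adjacent passkey tokens placed far from the query position:
q = 10_000
print(grouped_relative_position(q, 2_000))  # 1448
print(grouped_relative_position(q, 2_001))  # 1448 -> same mapped distance
```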
