FoT attention and the scaling trick #4
I think it's due to the applied FoT attention, which does not use scaled positional embeddings by summing in the additional parts.
Hi, thanks for the question. Briefly speaking, we have not tried using scaled positional encodings together with FoT attention, so we cannot comment on the performance. Originally, FoT was designed to allow the model to handle large databases consisting of millions of keys and values coming from multiple unrelated documents. In such a setup, it is not clear how to apply positional encodings. This is reflected in our experiments with smaller models, where we disable positional encodings in the memory layers (the other layers keep their positional encodings). In other words, memory keys are encoded as if they were at the beginning of the local context; we do this in order to maintain compatibility with the LLaMA code.
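To make the "as if they were at the beginning of the local context" part concrete, here is a minimal sketch (not the repository's actual implementation) assuming LLaMA-style rotary embeddings: local keys are rotated by their true positions, while retrieved memory keys are all rotated as if they sat at position 0, so they carry no positional signal.

```python
import torch

def apply_rope(x: torch.Tensor, positions: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate consecutive feature pairs of x (seq, dim) by RoPE angles given by `positions`."""
    dim = x.shape[-1]
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    angles = torch.outer(positions.float(), inv_freq)           # (seq, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]                         # (seq, dim/2) each
    out = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return out.flatten(-2)                                      # (seq, dim)

dim, local_len, mem_len = 64, 16, 8
local_keys = torch.randn(local_len, dim)
memory_keys = torch.randn(mem_len, dim)   # keys retrieved from the external memory

# Local keys keep their true positions inside the current context window.
local_keys_pe = apply_rope(local_keys, torch.arange(local_len))

# Memory keys are encoded as if they were at the beginning of the local
# context (all at position 0), i.e. the rotation is the identity for them.
memory_keys_pe = apply_rope(memory_keys, torch.zeros(mem_len))

# In a memory layer, queries then attend over both key sets together.
keys = torch.cat([memory_keys_pe, local_keys_pe], dim=0)        # (mem_len + local_len, dim)
```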
I figured as much after re-reading the respective parts of the paper, but the whole "they encode them as if they were at the beginning of the local context" part wasn't very clear to me until your explanation, so thanks for that.
In your paper, you say
Does that mean that one can't use both scaled positional embeddings and FoT attention?
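In case it helps later readers: assuming "scaled positional embeddings" here refers to the RoPE position-interpolation trick (the function name and numbers below are illustrative, not from the paper or the repository), the scaling itself is just a division of positions before the rotary embedding is applied:

```python
import torch

# Position interpolation in a nutshell: positions for an extended context are
# divided by a scale factor so they stay within the range seen during training,
# and these scaled positions are then fed to the rotary embedding.
def scaled_positions(seq_len: int, scale: float) -> torch.Tensor:
    return torch.arange(seq_len, dtype=torch.float32) / scale

# e.g. running a model trained on 2048 positions with an 8192-token context:
positions = scaled_positions(8192, scale=4.0)   # all values lie in [0, 2048)
```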