
FoT attention and the scaling trick #4

Open
StrangeTcy opened this issue Jul 10, 2023 · 3 comments

@StrangeTcy

In your paper, you say

Position Interpolation (PI, [Chen
et al., 2023] and [kaiokendev, 2023]) introduces a modification to the rotary positional encoding
scheme that enables fine-tuning for 32K context. In contrast to this work, our method does not
rely on positional encodings, following the findings from [Haviv et al., 2022]. Removing positional
encoding in memory allows us to extrapolate to 256k tokens, although the model was only trained on
sequences up to 8K, yielding theoretically unbounded context length.

Does that mean that one can't use both scaled positional embeddings and FoT attention?
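
For concreteness, my mental picture of the PI trick quoted above is simply rescaling positions before the rotary encoding is applied, so that a longer context is squeezed into the position range seen during training. A minimal sketch of that reading (not the authors' code; `rope_angles`, the head dimension, and the context lengths are just illustrative):

```python
import torch

def rope_angles(positions: torch.Tensor, dim: int, base: float = 10000.0) -> torch.Tensor:
    """Standard RoPE angles: one frequency per pair of channels."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    return positions.float()[:, None] * inv_freq[None, :]  # (seq_len, dim/2)

train_ctx, target_ctx = 2048, 8192
positions = torch.arange(target_ctx)

# Plain RoPE extrapolates to angles never seen during training...
plain = rope_angles(positions, dim=128)
# ...while Position Interpolation rescales positions into the trained range.
interp = rope_angles(positions * (train_ctx / target_ctx), dim=128)
```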

@soacker

soacker commented Jul 11, 2023

I think it's due to how FoT attention is applied: it does not use scaled positional embeddings, i.e. the additional positional parts are not summed in.

@CStanKonrad
Owner

Hi, thanks for the question. Briefly speaking, we have not tried using scaled positional encodings and FoT attention, so we cannot comment on performance.

Originally, FoT was designed to allow the model to handle large databases consisting of millions of keys and values from multiple unrelated documents. In such a setup, it is not clear how to apply positional encodings. This is reflected in our experiments with smaller models, where we disable positional encodings in the memory layers (the other layers keep their positional encodings).
There is a slight difference in the LongLLaMA models. Namely, all layers except the memory layers use positional encodings in the standard way. The memory layers use positional encodings for the local context in the standard way, whereas the memory keys are encoded as if they were at the beginning of the local context.

In other words, let
$$t_0, t_1, t_2, t_3, \ldots t_{2047}, t_{2048}, \ldots, t_{4095}, \ldots$$
be some input.
LongLLaMA will process it in context windows. First, it will process
$$t_0, t_1, t_2, t_3, \ldots t_{2047}$$
and move the (key, value) pairs from the memory layers to the memory cache.
Then it will process
$$t_{2048}, \ldots, t_{4095}$$
In this step, the non-memory layers process only 2048 embeddings,
whereas the memory layers also see the previous embeddings (keys and values), but as if they were located at the same position as $t_{2048}$.

We do this in order to maintain compatibility with the LLaMA code.
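
For illustration, here is a rough sketch of that positioning scheme (a simplified paraphrase, not the actual LongLLaMA implementation; `apply_rope`, the window size, and the cache handling are assumptions for this example):

```python
import torch

def apply_rope(x: torch.Tensor, positions: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Standard RoPE: rotate channel pairs of x by position-dependent angles."""
    d = x.shape[-1]
    inv_freq = 1.0 / (base ** (torch.arange(0, d, 2).float() / d))
    angles = positions.float()[:, None] * inv_freq[None, :]  # (n, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

window, d = 2048, 128
keys = torch.randn(3 * window, d)  # stand-in for key projections of a long input
memory_cache = []                  # keys moved here after each window (values omitted)

for w in range(0, keys.shape[0], window):
    local_keys = keys[w:w + window]
    local_pos = torch.arange(w, w + local_keys.shape[0])

    # Local keys get their usual rotary positions.
    rotated_local = apply_rope(local_keys, local_pos)

    if memory_cache:
        cached = torch.cat(memory_cache, dim=0)
        # Cached keys are rotated as if they all sat at the first position of the
        # current window, e.g. at the position of t_2048 for the second window.
        rotated_memory = apply_rope(cached, torch.full((cached.shape[0],), w))
        keys_for_attention = torch.cat([rotated_memory, rotated_local], dim=0)
    else:
        keys_for_attention = rotated_local

    # ... attention in the memory layer would run over keys_for_attention here ...

    # After the window is processed, its (unrotated) keys go into the memory cache.
    memory_cache.append(local_keys)
```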

@StrangeTcy
Author

I figured as much after re-reading the relevant parts of the paper, but "they encode them as if they were at the beginning of the local context" wasn't very clear to me until your explanation, so thanks for that.
