Does each token require a kNN search during inference? #7
If I use Faiss as the memory, then during inference, does generating each token require 3 kNN searches (because there are 3 memory attention layers)? Will the generation speed become very slow?

Comments

@CStanKonrad Is there a practical example of using external memory?
Regarding the question: the suggested kNN implementation retrieves, for each query in a memory layer, the k best-matching keys from the memory cache. In the 3B model there are 3 memory layers, each with 32 heads, which gives 96 retrievals per token. In general, we recommend the brute-force approach (full attention, no kNN; an example of this approach is implemented in this repository) for memories that fit on the GPU. However, if you want to use Faiss, you will need to tune the index manually (note that the faster Faiss indexes have a training stage and let you balance speed against retrieval accuracy). We currently do not provide practical examples with Faiss. Example times obtained on a 40GB A100 GPU with bfloat16 precision using code from this repository:
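To make the two options above concrete, here is a minimal, CPU-only sketch, not code from this repository: the head dimension, memory length, and top-k values are illustrative assumptions, while the 3-layers-by-32-heads figure comes from the comment above. It contrasts brute-force attention over the full memory cache with a per-head Faiss lookup.

```python
import numpy as np
import torch
import faiss

# Sizes below are illustrative assumptions, except that 3 memory layers
# x 32 heads matches the 3B model described above (3 * 32 = 96 searches
# per generated token in the kNN variant).
NUM_MEM_LAYERS = 3
NUM_HEADS = 32
HEAD_DIM = 128        # assumed head dimension
MEM_LEN = 16384       # assumed number of cached (key, value) pairs per head
TOP_K = 128           # assumed number of neighbours retrieved per query

def brute_force_memory_attention(q, mem_k, mem_v):
    """Full attention over the whole memory cache (the approach recommended
    above when the memory fits on GPU).
    q: (heads, head_dim); mem_k, mem_v: (heads, mem_len, head_dim)."""
    scores = torch.einsum("hd,hmd->hm", q, mem_k) / HEAD_DIM ** 0.5
    return torch.einsum("hm,hmd->hd", scores.softmax(dim=-1), mem_v)

def build_indexes(mem_k):
    """One exact inner-product Faiss index per head."""
    indexes = []
    for h in range(NUM_HEADS):
        index = faiss.IndexFlatIP(HEAD_DIM)
        index.add(mem_k[h])   # float32 array of shape (mem_len, head_dim)
        indexes.append(index)
    return indexes

def knn_memory_attention(q, indexes, mem_v):
    """kNN variant for one memory layer: one Faiss search per head, i.e.
    NUM_MEM_LAYERS * NUM_HEADS = 96 searches per generated token."""
    outs = []
    for h in range(NUM_HEADS):
        # With IndexFlatIP the returned scores are the inner products q . k,
        # so they can be reused directly as (unscaled) attention logits.
        scores, idx = indexes[h].search(q[h:h + 1], TOP_K)
        weights = torch.from_numpy(scores[0] / HEAD_DIM ** 0.5).softmax(dim=-1)
        retrieved_v = torch.from_numpy(mem_v[h, idx[0]])  # (TOP_K, head_dim)
        outs.append(weights @ retrieved_v)
    return torch.stack(outs)

# One layer's memory and one token's queries (float32, as Faiss requires).
mem_k = np.random.randn(NUM_HEADS, MEM_LEN, HEAD_DIM).astype("float32")
mem_v = np.random.randn(NUM_HEADS, MEM_LEN, HEAD_DIM).astype("float32")
q = np.random.randn(NUM_HEADS, HEAD_DIM).astype("float32")

exact = brute_force_memory_attention(
    torch.from_numpy(q), torch.from_numpy(mem_k), torch.from_numpy(mem_v))
approx = knn_memory_attention(q, build_indexes(mem_k), mem_v)
print(exact.shape, approx.shape)  # both (32, 128)
```

Note that IndexFlatIP is itself an exact (brute-force) index; the faster approximate indexes (e.g. faiss.IndexIVFFlat) are the ones that require a training stage and trade retrieval accuracy for speed, which is why the comment above says the index must be tuned manually.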
Got it, thanks!