
Utilizing LongLLaMA with the Mojo framework, applying 4-bit quantization, possibility of using Flash Attention 2, and thoughts on speculative execution for LLMs #17

Open
myname36 opened this issue Oct 9, 2023 · 1 comment

Comments


myname36 commented Oct 9, 2023

I am interested in loading LongLLaMA with the Mojo framework, as described in https://github.com/tairov/llama2.mojo, to increase inference speed, while also applying 4-bit quantization for model compression. Could you provide guidance or examples on how this can be achieved? In particular, I am curious about how to maintain model quality while reducing the model size with 4-bit quantization. Is it also possible to use Flash Attention 2? And what do you think about using LongLLaMA 3B together with LongLLaMA-Code for speculative execution for LLMs, as described here: https://twitter.com/karpathy/status/1697318534555336961
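For reference, here is a minimal sketch of what 4-bit loading combined with Flash Attention 2 looks like in the Hugging Face `transformers` + `bitsandbytes` stack rather than the Mojo path (llama2.mojo would need its own quantization support). The checkpoint name `syzymon/long_llama_3b` and the assumption that the installed `transformers` version accepts the `attn_implementation` argument are mine; LongLLaMA's custom attention code may not support the Flash Attention 2 path at all.

```python
# Sketch (not a verified recipe): NF4 4-bit quantized load of a LongLLaMA checkpoint
# with Flash Attention 2 requested, using transformers + bitsandbytes.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "syzymon/long_llama_3b"  # assumed LongLLaMA checkpoint on the HF Hub

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # compress weights to 4 bits
    bnb_4bit_quant_type="nf4",              # NF4 tends to preserve quality better than plain int4
    bnb_4bit_compute_dtype=torch.bfloat16,  # do the matmuls in bf16 to limit accuracy loss
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    attn_implementation="flash_attention_2",  # needs flash-attn installed; the custom
                                              # LongLLaMA attention may ignore or reject this
    device_map="auto",
    trust_remote_code=True,  # LongLLaMA ships custom modeling code for its long-context attention
)

prompt = "The Focused Transformer extends the effective context by"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

On the speculative-execution question: recent `transformers` versions expose a related mechanism as assisted generation, where a smaller draft model passed via `model.generate(..., assistant_model=small_model)` proposes tokens that the larger model then verifies, which is the pattern described in the linked tweet; whether the 3B and Code variants pair well as draft/target is an open question.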


myname36 commented Oct 9, 2023

Also, I wonder what you think about the LongLoRA project: https://huggingface.co/Yukang/Llama-2-7b-longlora-100k-ft
