
Utilizing LongLLaMA with the Mojo framework, applying 4-bit quantization, possibility of using Flash Attention 2, and thoughts on speculative execution for LLMs #17

Open
myname36 opened this issue Oct 9, 2023 · 1 comment

Comments


myname36 commented Oct 9, 2023

I am interested in loading LongLLaMA with the Mojo framework, as described in https://github.com/tairov/llama2.mojo, to increase inference speed, while also applying 4-bit quantization for model compression. Could you provide guidance or examples on how this can be achieved? In particular, I am curious about how to maintain model quality while reducing the model size with 4-bit quantization. Is it also possible to use Flash Attention 2? And what do you think about using LongLLaMA 3B together with LongLLaMA-Code for speculative execution for LLMs, as described here: https://twitter.com/karpathy/status/1697318534555336961
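For reference, here is a minimal sketch of what 4-bit loading combined with Flash Attention 2 looks like in the Hugging Face `transformers` + `bitsandbytes` stack rather than the Mojo path (llama2.mojo would need its own quantization support). The checkpoint name `syzymon/long_llama_3b` and the assumption that the installed `transformers` version accepts the `attn_implementation` argument are mine; LongLLaMA's custom attention code may not support the Flash Attention 2 path at all.

```python
# Sketch (not a verified recipe): NF4 4-bit quantized load of a LongLLaMA checkpoint
# with Flash Attention 2 requested, using transformers + bitsandbytes.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "syzymon/long_llama_3b"  # assumed LongLLaMA checkpoint on the HF Hub

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # compress weights to 4 bits
    bnb_4bit_quant_type="nf4",              # NF4 tends to preserve quality better than plain int4
    bnb_4bit_compute_dtype=torch.bfloat16,  # do the matmuls in bf16 to limit accuracy loss
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    attn_implementation="flash_attention_2",  # needs flash-attn installed; the custom
                                              # LongLLaMA attention may ignore or reject this
    device_map="auto",
    trust_remote_code=True,  # LongLLaMA ships custom modeling code for its long-context attention
)

prompt = "The Focused Transformer extends the effective context by"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

On the speculative-execution question: recent `transformers` versions expose a related mechanism as assisted generation, where a smaller draft model passed via `model.generate(..., assistant_model=small_model)` proposes tokens that the larger model then verifies, which is the pattern described in the linked tweet; whether the 3B and Code variants pair well as draft/target is an open question.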


myname36 commented Oct 9, 2023

Also, I wonder what you think about the LongLoRA project: https://huggingface.co/Yukang/Llama-2-7b-longlora-100k-ft
