Hi, I have been actively reading about MoE adaptation from dense models as an enthusiast hobbyist. Apologies if I misinterpret some concepts, but as you know, this domain is expanding faster than anything we have seen before, and it is hard to process it all at once.
I would like to thank Arcee-ai for providing the great tool that is Mergekit, as well as their datasets, models, and other tools.
I would also like to share my own findings in the hope that they could lead to some future improvements (my wish list):
Using more flexible LoRAs
Per-layer LoRA experts, as proposed in AlphaLoRA.
This could allow more adapters on some layers and fewer on others, possibly concentrating the LoRA weights in the places that matter most, giving something like a 32x??B model. Parameter-Efficient Sparsity is also a good read on LoRA optimization for MoE; see Qwen2idae-16x14B-v1.0 and sparsetral-16x7B-v2.
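To make the idea concrete, here is a minimal sketch of a frozen linear layer augmented with a layer-specific number of LoRA experts and a small router, in the spirit of AlphaLoRA. The class names, ranks, top-k value, and per-layer expert counts are illustrative assumptions, not mergekit APIs.

```python
# Minimal sketch: per-layer LoRA experts with a varying expert count per layer.
import torch
import torch.nn as nn


class LoRAExpert(nn.Module):
    def __init__(self, d_in, d_out, rank=16):
        super().__init__()
        self.A = nn.Linear(d_in, rank, bias=False)
        self.B = nn.Linear(rank, d_out, bias=False)
        nn.init.zeros_(self.B.weight)  # start as a no-op delta

    def forward(self, x):
        return self.B(self.A(x))


class LoRAMoELayer(nn.Module):
    """Frozen base linear plus a router over a layer-specific number of LoRA experts."""

    def __init__(self, base_linear, num_experts, rank=16, top_k=2):
        super().__init__()
        self.base = base_linear.requires_grad_(False)
        d_in, d_out = base_linear.in_features, base_linear.out_features
        self.experts = nn.ModuleList(LoRAExpert(d_in, d_out, rank) for _ in range(num_experts))
        self.router = nn.Linear(d_in, num_experts, bias=False)
        self.top_k = min(top_k, num_experts)

    def forward(self, x):
        gates = torch.softmax(self.router(x), dim=-1)   # (..., num_experts)
        weights, idx = gates.topk(self.top_k, dim=-1)   # keep only the top-k experts
        out = self.base(x)
        for k in range(self.top_k):
            for e in range(len(self.experts)):
                # Add each selected expert's low-rank delta, scaled by its gate weight.
                mask = (idx[..., k] == e).unsqueeze(-1)
                if mask.any():
                    out = out + mask * weights[..., k:k + 1] * self.experts[e](x)
        return out


# Illustrative allocation: more experts on "important" middle layers, fewer elsewhere.
experts_per_layer = [2, 2, 4, 8, 8, 4, 2, 2]
```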
Pre-existing LoRAs from other mergekit merges
There are already quite a few full fine-tunes and QLoRAs out there; it could be worth importing them as ready-made LoRAs. I think having a mix of 33%-75% pre-fine-tuned LoRAs that stay frozen during training could accelerate things a lot. Those adapters are already trained for some task, so training only the remaining layers and the router could help with knowledge retention and lower training time. The remaining question is figuring out what each frozen adapter is best at (see planner below).
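As a rough sketch of the freezing scheme, assuming the illustrative `LoRAMoELayer` from the previous snippet and an arbitrary 50% split within the 33%-75% range suggested above:

```python
# Minimal sketch: freeze a chosen fraction of imported LoRA experts,
# training only the remaining experts plus the router.
import random


def freeze_imported_experts(layer, frozen_fraction=0.5, seed=0):
    rng = random.Random(seed)
    n = len(layer.experts)
    frozen_ids = set(rng.sample(range(n), int(n * frozen_fraction)))
    for i, expert in enumerate(layer.experts):
        trainable = i not in frozen_ids
        for p in expert.parameters():
            p.requires_grad_(trainable)
    # The router always stays trainable so it can learn to use the frozen experts.
    for p in layer.router.parameters():
        p.requires_grad_(True)
    return sorted(frozen_ids)
```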
Mixtral 8x22B and DeepSeek V3 as MoE + LoRA?
I can also imagine Mixtral 8x22B having each of its 8 experts extracted, merged into a common base, and LoRAs (r16-64?) then generated between that base and each extracted expert. This would make the model much smaller and could enable running it on a single 3090 at q4 with plenty of room for context. Could it possibly run without retraining? The same could apply to DeepSeek V3.
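One generic way to obtain such a LoRA is a truncated SVD of the difference between an expert weight and the shared base weight. This is only a sketch of that idea, not necessarily how mergekit's LoRA extraction works; the rank of 64 is an arbitrary choice from the r16-64 range above.

```python
# Minimal sketch: turn an expert weight into a low-rank LoRA delta
# against a shared base weight via truncated SVD.
import torch


def extract_lora_delta(expert_weight, base_weight, rank=64):
    delta = (expert_weight - base_weight).float()        # (d_out, d_in)
    U, S, Vh = torch.linalg.svd(delta, full_matrices=False)
    U, S, Vh = U[:, :rank], S[:rank], Vh[:rank, :]
    B = U * S.sqrt()                                      # (d_out, rank)
    A = S.sqrt().unsqueeze(1) * Vh                        # (rank, d_in)
    # base_weight + B @ A approximates the original expert weight.
    return A, B
```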
Frozen experts
By adopting the per-layer LoRA concept and importing already-trained LoRAs, it would be possible to designate some experts to be trained as glue between frozen layers. A rule could state that no more than 1-x frozen layers can be passed through before a dynamic (trainable) layer is reached. This works in concert with the previous points, and the frozen layers could be unfrozen at a late training stage once the dynamic layers approach saturation.
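A small sketch of that rule, assuming a boolean frozen/trainable mask per layer; the function name and the `max_frozen_run` parameter are illustrative, not an existing mergekit option.

```python
# Minimal sketch: enforce "no more than N consecutive frozen layers" by
# forcing a layer back to trainable whenever a frozen run gets too long.
def enforce_frozen_run_limit(preferred_frozen, max_frozen_run=2):
    frozen = list(preferred_frozen)
    run = 0
    for i, is_frozen in enumerate(frozen):
        if is_frozen:
            run += 1
            if run > max_frozen_run:
                frozen[i] = False   # insert a dynamic "glue" layer here
                run = 0
        else:
            run = 0
    return frozen


# Example: a 12-layer stack where we would prefer to freeze every layer.
print(enforce_frozen_run_limit([True] * 12, max_frozen_run=2))
```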
Domain-specific experts and a sophisticated planner
At the moment there seem to be no true domain-specific experts, except via mergekit's hidden-expert routing to a selected merged model in the MoE, and that seems like a bit of overkill. I think a tiny 0.5-3B planner could (see the sketch after this list):
be faster than the traditional hidden-expert routing
analyze the incoming task for what is asked (summarization, code, question answering, creative writing, ...)
estimate the knowledge, languages, or subjects needed to answer the question (helps with selecting experts); see the NVIDIA classifier
select the optimal LoRAs or deactivate some of them, and provide some sort of possible path, e.g. if the task is about coding, use LoRAs 16-20 and 24-28 (this could also be used during training)
directly instruct later routers or tag tensors?
draft a plan of the output's content, or pass this job to a drafter
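Here is a minimal sketch of what such a planner's output could look like. The task labels, LoRA index ranges, and the keyword-based classification are placeholder assumptions; a real planner would be a small fine-tuned 0.5-3B model or classifier rather than this stub.

```python
# Minimal sketch: a tiny planner that classifies a request and maps it
# to a group of LoRA experts to activate, plus a rough output outline.
from dataclasses import dataclass

# Hypothetical mapping from task type to the LoRA experts worth activating.
TASK_TO_LORAS = {
    "code":     list(range(16, 21)) + list(range(24, 29)),
    "summary":  list(range(0, 8)),
    "creative": list(range(8, 16)),
    "qa":       list(range(0, 32)),   # fall back to everything
}


@dataclass
class Plan:
    task: str
    active_loras: list
    outline: str


def plan_request(prompt: str) -> Plan:
    # Stand-in for the tiny planner model: a real version would classify
    # the prompt with a small LLM/classifier instead of keyword matching.
    lowered = prompt.lower()
    if "def " in lowered or "code" in lowered:
        task = "code"
    elif "summarize" in lowered or "summary" in lowered:
        task = "summary"
    elif "story" in lowered or "poem" in lowered:
        task = "creative"
    else:
        task = "qa"
    outline = f"1. understand the {task} request  2. gather facts  3. produce answer"
    return Plan(task=task, active_loras=TASK_TO_LORAS[task], outline=outline)


print(plan_request("Summarize this article about APUs"))
```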
Dynamic model depth
The model could repeat middle layers to process a task that is expected to require more work, something like this. In conjunction with per-layer LoRA, this could be a viable way to expand a model with minimal overhead, since the layer is already in memory and only needs to be passed through again, though some work still seems to be needed on the KV cache. The reverse could also be possible, skipping layers when the task is easy, e.g. 22B -> 8B -> 4B.
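A minimal sketch of the re-looping idea, assuming a generic stack of layers and an externally supplied difficulty estimate; note that a real implementation would also need to handle positions and the KV cache correctly, as mentioned above.

```python
# Minimal sketch: dynamic depth by re-running a block of middle layers a
# variable number of extra times for harder tasks.
import torch.nn as nn


class DynamicDepthStack(nn.Module):
    def __init__(self, layers, loop_start, loop_end):
        super().__init__()
        self.layers = nn.ModuleList(layers)
        self.loop_start, self.loop_end = loop_start, loop_end

    def forward(self, hidden, extra_passes=0):
        for i, layer in enumerate(self.layers):
            hidden = layer(hidden)
            # After finishing the middle block once, optionally run it
            # again `extra_passes` times; the weights are already in memory.
            if i == self.loop_end:
                for _ in range(extra_passes):
                    for again in self.layers[self.loop_start:self.loop_end + 1]:
                        hidden = again(hidden)
        return hidden
```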
Make big and quantize small
Large models seem to handle quantization better. Even better, QLoRA is known to help regain most of what was lost. DeepSeek V3 at q2-q3 seems to give excellent results. I would recommend QTIP for its fast throughput compared to other quantization methods.
Train by exception (for LoRA only)
Suppose a 22B dense model is used to generate the MoE; use that reference model to answer the dataset's questions. Compare the answers and mark all those that deviate far from the official answer or are simply wrong. Then train only on that part of the dataset, skipping everything the LLM already gets mostly right. Repeat if needed. This could also serve as a later benchmark to measure loss after multiple training iterations, a sort of per-dataset performance. Additionally, augmenting the dense model's answer with the dataset answer, without changing too much of its structure, might help minimize loss during training. I think this would be great for CoT training too.
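A sketch of the filtering step, assuming a generic `generate` callable for the reference model and a simple string-similarity score as the deviation measure; a real pipeline would likely use embedding similarity, exact match for math, or an LLM judge instead.

```python
# Minimal sketch: "train by exception" -- keep only the examples where the
# reference dense model's answer deviates far from the dataset answer.
from difflib import SequenceMatcher


def needs_training(model_answer: str, reference_answer: str, threshold=0.6) -> bool:
    similarity = SequenceMatcher(None, model_answer, reference_answer).ratio()
    return similarity < threshold        # far deviation -> keep for training


def filter_dataset(dataset, generate, threshold=0.6):
    """dataset: iterable of {'question': str, 'answer': str}; generate: prompt -> str."""
    kept = []
    for example in dataset:
        model_answer = generate(example["question"])
        if needs_training(model_answer, example["answer"], threshold):
            kept.append(example)
    return kept
```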
The future of local LLMs will be APUs.
Large, slow RAM (128 GB?) paired with an APU/CPU will be the way forward for personal AI. MoE with the proposed enhancements could empower local AI without needing a 5090, while providing near-SOTA performance with complete privacy. If I ran a business, some data would never leave my building, and a 200k NVIDIA product would be overkill if something close to o1, DeepSeek V3/R1, or a bit more than 405B could be run locally without taking up a full rack.
Feel free to implement these if you wish; a personal citation would be appreciated if it leads to some open-source innovation.
Cheers and keep it up you all!