MoE wish list and future implementation exploration #492

Open
johnr14 opened this issue Jan 23, 2025 · 0 comments
johnr14 commented Jan 23, 2025

Hi, I am actively reading up on MoE adaptation from dense models as an enthusiast hobbyist, so sorry if I misinterpret some concepts; as you know, this domain is expanding at a speed way beyond anything we have ever seen, and it's quite hard to process it all at once.

I would like to thank Arcee AI for providing the great tool that is Mergekit, as well as their datasets, other tools, models...

Also, I would like to share my own findings in the hope that they could lead to some future improvements (my wish list):

Using more flexible LoRAs

Per-layer LoRA experts as proposed in AlphaLoRA.
This could allow more adapters on some layers and fewer on others, possibly concentrating the LoRA weights in more important places. Something like 32x??B.
Parameter-Efficient Sparsity is also a good read on LoRA optimization of MoE. See Qwen2idae-16x14B-v1.0 and sparsetral-16x7B-v2.
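A minimal sketch (plain PyTorch, not mergekit code; all names are illustrative) of what a per-layer LoRA-expert block could look like: each layer carries its own number of low-rank adapters over a frozen base projection, with a small top-k router mixing them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRAExpert(nn.Module):
    """One low-rank adapter acting as a tiny 'expert' on top of a frozen base weight."""
    def __init__(self, in_features: int, out_features: int, rank: int = 16):
        super().__init__()
        self.down = nn.Linear(in_features, rank, bias=False)
        self.up = nn.Linear(rank, out_features, bias=False)
        nn.init.zeros_(self.up.weight)  # start as a no-op delta

    def forward(self, x):
        return self.up(self.down(x))

class PerLayerLoRAMoE(nn.Module):
    """Frozen base projection plus a layer-specific number of LoRA experts."""
    def __init__(self, base: nn.Linear, num_experts: int, rank: int = 16, top_k: int = 2):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)
        self.experts = nn.ModuleList(
            LoRAExpert(base.in_features, base.out_features, rank) for _ in range(num_experts)
        )
        self.router = nn.Linear(base.in_features, num_experts, bias=False)
        self.top_k = min(top_k, num_experts)

    def forward(self, x):
        out = self.base(x)
        logits = self.router(x)
        top_v, top_i = logits.topk(self.top_k, dim=-1)
        masked = torch.full_like(logits, float("-inf")).scatter(-1, top_i, top_v)
        gate = F.softmax(masked, dim=-1)                  # zero weight off the top-k
        for e, expert in enumerate(self.experts):
            out = out + gate[..., e:e + 1] * expert(x)
        return out

# Example: more adapters on the (presumably more important) middle layers.
experts_per_layer = [2, 2, 4, 8, 8, 4, 2, 2]
layers = [PerLayerLoRAMoE(nn.Linear(1024, 1024), n) for n in experts_per_layer]
```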

Pre-existing LoRAs from other mergekit merges

There are quite a few full fine-tunes or QLoRA adapters existing already; it could be an idea to import them as already-fine-tuned LoRAs. I think that having a mix of 33%-75% originally fine-tuned LoRAs that are frozen during training could accelerate things a lot. Those layers are already trained for some kind of work, so training only the next layers and the router could help with knowledge retention and lower training time. We just have to figure out what those frozen layers are best at (see planner).
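For context, a minimal sketch of how an existing full fine-tune could be folded in as a frozen LoRA expert: take the weight delta against the shared base and compress it with a truncated SVD. This only illustrates the underlying math; the tensors are placeholders, and mergekit's own extraction tooling would be the real workflow.

```python
import torch

def delta_to_lora(w_base: torch.Tensor, w_tuned: torch.Tensor, rank: int = 32):
    """Return (A, B) with B @ A ~= w_tuned - w_base; A: (rank, in), B: (out, rank)."""
    delta = (w_tuned - w_base).float()
    U, S, Vh = torch.linalg.svd(delta, full_matrices=False)
    A = S[:rank].sqrt().unsqueeze(1) * Vh[:rank]      # (rank, in_features)
    B = U[:, :rank] * S[:rank].sqrt().unsqueeze(0)    # (out_features, rank)
    return A, B

# Placeholder weights standing in for one projection of a base model and a fine-tune.
w_base = torch.randn(4096, 4096)
w_tuned = w_base + 0.01 * torch.randn(4096, 4096)
A, B = delta_to_lora(w_base, w_tuned, rank=32)
frozen_expert_delta = B @ A   # load as a LoRA expert and keep requires_grad=False
```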

Mixtral 8x22B and DeepSeek V3 as MoE+LoRA?

I also foresee Mixtral 8x22B getting each expert extracted, merged into a base, and then LoRAs (r16-64?) generated between the base and the 8x extracted experts. This would make it so much smaller and could enable running it on a single 3090 at q4 with lots of room for context. Could it possibly run without retraining? Same for DeepSeek V3?
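For a rough sense of the size savings, a back-of-the-envelope estimate under assumed Mixtral-8x22B-like FFN shapes (the dimensions and layer count below are assumptions for illustration, not verified against the actual checkpoint):

```python
# Assumed shapes for a Mixtral-style 8x22B MoE FFN stack.
d_model, d_ff, n_experts, n_layers, rank = 6144, 16384, 8, 56, 64

full_ffn = n_layers * n_experts * 3 * d_model * d_ff                  # w1, w2, w3 per expert
lora_ffn = n_layers * (3 * d_model * d_ff                             # one shared base expert
                       + n_experts * 3 * rank * (d_model + d_ff))     # per-expert low-rank deltas

print(f"full experts : {full_ffn / 1e9:.1f} B params")
print(f"base + LoRAs : {lora_ffn / 1e9:.1f} B params")
```

How much quality survives without retraining would depend entirely on how low-rank the expert deltas really are.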

Frozen experts

By adopting the per-layer LoRA concept and importing already-trained LoRAs, it would be possible to elect some experts to be trained as glue between frozen layers. A rule could state that no more than some number (1 to x) of frozen layers can be passed through before a dynamic layer is reached. This works in concert with the previous points, and the frozen layers could be unfrozen at a late training stage when reaching some saturation in the dynamic layers.
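A tiny illustrative helper (not part of mergekit) for that rule, checking that a freeze plan never stacks more than a given number of frozen layers in a row:

```python
def check_freeze_plan(frozen: list[bool], max_frozen_run: int = 2) -> bool:
    """True if no run of consecutive frozen layers exceeds max_frozen_run."""
    run = 0
    for is_frozen in frozen:
        run = run + 1 if is_frozen else 0
        if run > max_frozen_run:
            return False
    return True

# Example plan: True = frozen (imported, pre-trained) layer, False = trainable "glue" layer.
plan = [True, True, False, True, True, False, False, True]
assert check_freeze_plan(plan, max_frozen_run=2)
```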

Domain-specific experts and a sophisticated planner

At the moment, there seem to be no true domain-specific experts except via mergekit's hidden-expert gating that routes to a selected merged model in the MoE, and that seems like a little overkill. I think that a 0.5-3B tiny planner could (a rough sketch follows the list below):

  • be faster than the traditional hidden-expert gating
  • analyze the incoming task for what is asked (summarize, code, answer a question, creative writing...)
  • estimate the knowledge, language, or subjects needed to answer the question (helps with selecting experts), e.g. with the NVIDIA classifier
  • estimate the effort needed for the task (light to heavy) (activate CoT, dynamic model depth), e.g. with the Prompt Task/Complexity Classifier
  • select the optimal LoRAs or deactivate some, and provide some sort of possible path, e.g. if it's about coding, use LoRAs 16-20 and 24-28 (this could also be used during training)
  • directly instruct later routers or tag tensors?
  • draft a plan of the output's content, or pass this job to a drafter
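A rough sketch of the planner idea, using a generic Hugging Face zero-shot classification pipeline as a stand-in planner and a hypothetical table mapping task type to active LoRA indices; neither is an existing mergekit component, and the classifiers linked above would be drop-in replacements.

```python
from transformers import pipeline

# Hypothetical routing table: task type -> indices of LoRA experts to activate.
TASK_TO_LORAS = {
    "coding": list(range(16, 21)) + list(range(24, 29)),
    "summarization": list(range(4, 10)),
    "creative writing": list(range(10, 16)),
    "question answering": list(range(0, 8)),
}

# Small NLI model used as a stand-in for the 0.5-3B planner.
planner = pipeline("zero-shot-classification")

def plan(prompt: str):
    result = planner(prompt, candidate_labels=list(TASK_TO_LORAS))
    task = result["labels"][0]                 # highest-scoring task type
    return task, TASK_TO_LORAS[task]

task, active_loras = plan("Write a Python function that parses a CSV file.")
print(task, active_loras)
```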

Dynamic model depth

The model could duplicate middle layers to process tasks that are expected to require more work, something like this. In conjunction with per-layer LoRA, this could be a viable way to expand a model with minimal overhead, since the layers are already in memory and just need to be passed through again, although there still seems to be work needed on the KV cache. The other direction could also work: skipping layers when the task is easy, like 22B->8B->4B.
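A minimal, illustrative PyTorch sketch of the dynamic-depth idea: re-run the middle block for hard prompts and skip part of it for easy ones. The difficulty score would come from the planner above, and KV-cache handling for repeated layers is deliberately ignored here.

```python
import torch.nn as nn

class DynamicDepthStack(nn.Module):
    """Wraps a list of transformer layers and varies effective depth per request."""
    def __init__(self, layers: nn.ModuleList, middle: slice = slice(8, 24)):
        super().__init__()
        self.layers = layers
        self.middle = middle

    def forward(self, x, difficulty: float = 0.5):
        early = self.layers[: self.middle.start]
        mid = self.layers[self.middle]
        late = self.layers[self.middle.stop :]

        for layer in early:
            x = layer(x)

        repeats = 2 if difficulty > 0.8 else (1 if difficulty > 0.3 else 0)
        if repeats == 0:
            mid = mid[::2]      # easy prompt: skip every other middle layer
            repeats = 1
        for _ in range(repeats):
            for layer in mid:   # hard prompt: pass through the middle block again
                x = layer(x)

        for layer in late:
            x = layer(x)
        return x
```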

Make big and quantize small

Large models seem to handle quantization better. Even better, QLoRA is known to help regain most of what was lost. DeepSeek V3 at q2-q3 seems to give excellent results. I would recommend QTIP for its fast throughput compared to other quantization methods.
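A hedged sketch of the "quantize big, repair with QLoRA" recipe using transformers + peft as stand-ins (QTIP integration would look different, and the model id below is a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "some-org/some-large-model"   # placeholder

# Load the large model in 4-bit NF4.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# Attach a small LoRA and train it to recover what quantization lost.
lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
```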

Train by exception (for LoRA only)

Suppose a 22B dense model is used to generate the MoE: use that reference model to answer a dataset's questions, compare the answers, and mark all those that deviate far from the official answer or are just wrong. Then train only on that part of the dataset, skipping everything the LLM already gets mostly right. Repeat if needed. This could also serve as a later benchmark to calculate loss after multiple training iterations, some sort of per-dataset performance measure. Also, could augmenting the dense model's answer with the dataset answer, without changing too much of its structure, help minimize loss during training? I think this would be great for CoT training too.
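A small sketch of the selection step: score the reference model's answers against the dataset's gold answers and keep only the examples it gets wrong or drifts far from. The `generate_answer` callable and the similarity threshold are assumptions; any string or embedding similarity could stand in here.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Cheap lexical similarity in [0, 1]; swap for an embedding metric if preferred."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def select_hard_examples(dataset, generate_answer, threshold: float = 0.6):
    """dataset: iterable of {'question': ..., 'answer': ...} dicts;
    generate_answer: callable running the reference (dense) model."""
    hard = []
    for ex in dataset:
        model_answer = generate_answer(ex["question"])
        if similarity(model_answer, ex["answer"]) < threshold:
            hard.append(ex)   # train only on what the model gets wrong
    return hard
```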

The future of local LLMs will be APUs.

Large and slow RAM (128 GB?) paired with an APU/CPU will be the way forward for personal AI. MoE with the proposed enhancements could empower local AI without the need for a 5090, while providing near-SOTA performance with complete privacy. If I ran a business, some data would never leave my building, and a $200k NVIDIA product would be overkill if something close to o1, DeepSeek V3/R1, or a bit more than 405B could be run locally without taking up a full rack.

Feel free to implement these if you wish; a personal citation would be appreciated if it leads to some open source innovation.

Cheers and keep it up you all!
