Mixtral batching support #108
Conversation
Fantastic job, @vinx13! 🚀🚀🚀
To bring things up to speed and start fast iteration, I'll merge this PR first.
Please follow up.
w = topi.reshape(w, (num_experts, red, num_shards, spatial // num_shards))
w = topi.transpose(w, (2, 0, 1, 3))
func = te.create_prim_func([a, w])
return func
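For context, a self-contained sketch of what the quoted function appears to do (the names `a`, `num_experts`, `red`, `spatial`, and `num_shards` are assumptions taken from the diff context): it reshapes a stacked MoE weight so the shard axis comes first, letting each of `num_shards` devices take one contiguous slice of the spatial dimension.

```python
from tvm import te, topi

def moe_shard_sketch(num_experts: int, red: int, spatial: int,
                     num_shards: int, dtype: str = "float16"):
    # Stacked expert weight: one (red, spatial) matrix per expert.
    a = te.placeholder((num_experts, red, spatial), dtype=dtype, name="a")
    # Split the spatial dimension into num_shards contiguous chunks...
    w = topi.reshape(a, (num_experts, red, num_shards, spatial // num_shards))
    # ...and move the shard axis to the front so device i reads w[i].
    w = topi.transpose(w, (2, 0, 1, 3))
    return te.create_prim_func([a, w])
```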
I think these functions are not used for FT + multi-GPU. Can you confirm?
Will double-check. Doesn't it depend on disco sharding?
For FT quantization + disco, we need to use https://github.com/vinx13/mlc-llm/blob/113bd1873cb563151ed5675730be0e53560c7ab2/mlc_llm/relax_model/commons.py#L124
It hasn't been upstreamed yet (we should).
I see, q4 multi-GPU is probably broken. This is still needed for q0f16 (we are using the FT kernel for MoE), though.
scores, is_ascend=False, k=self.num_experts_per_tok, index_dtype="int32"
)  # (num_tokens, top_k), (num_tokens, top_k)
expert_weights = nn.emit(expert_weights / R.sum(expert_weights, axis=-1, keepdims=True))
flattened_indices = nn.emit(relax.op.flatten(expert_indices))
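For reference, a minimal NumPy sketch (not the PR's Relax code; `route` and its argument names are illustrative) of the routing math in the quoted lines: select the top-k experts per token, renormalize their gate weights over the chosen k, and flatten the (num_tokens, top_k) index matrix so the expert matmuls can be batched over all selected (token, expert) pairs at once.

```python
import numpy as np

def route(scores: np.ndarray, top_k: int):
    """scores: (num_tokens, num_experts) gating probabilities."""
    # Indices of the top_k highest-scoring experts per token.
    expert_indices = np.argsort(-scores, axis=-1)[:, :top_k]         # (num_tokens, top_k)
    expert_weights = np.take_along_axis(scores, expert_indices, -1)  # (num_tokens, top_k)
    # Renormalize so the selected experts' weights sum to 1 per token.
    expert_weights = expert_weights / expert_weights.sum(axis=-1, keepdims=True)
    # Flatten to (num_tokens * top_k,) so one grouped matmul covers all pairs.
    flattened_indices = expert_indices.reshape(-1)
    return expert_weights, flattened_indices
```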
This flattening causes the peak VRAM footprint of the intermediate activation to be multiplied by the number of experts, correct? For a large batch size and a large number of experts, this can be problematic.
Although that allows one matmul to compute all top-k experts' results, I wonder if it could be beneficial to compute them sequentially (looping over top-k) to reduce VRAM usage and memory traffic, and how much performance would drop if we did that.
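To make the alternative concrete, here is a hedged NumPy sketch (names are illustrative, and a real kernel would group tokens by expert rather than gathering per-token weight slices) of the sequential variant: loop over the k routing slots so only a (num_tokens, out) intermediate is live at a time, instead of materializing activations for all num_tokens * top_k pairs.

```python
import numpy as np

def moe_sequential(x, w_experts, expert_indices, expert_weights):
    # x: (num_tokens, hidden); w_experts: (num_experts, hidden, out)
    # expert_indices, expert_weights: (num_tokens, top_k)
    out = np.zeros((x.shape[0], w_experts.shape[2]), dtype=x.dtype)
    for slot in range(expert_indices.shape[1]):      # loop over the top-k slots
        idx = expert_indices[:, slot]                # (num_tokens,)
        # Per-token matmul against that token's slot-th expert.
        y = np.einsum("th,tho->to", x, w_experts[idx])
        out += expert_weights[:, slot:slot + 1] * y  # weighted accumulate
    return out
```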
This PR contains various fixes and workarounds to support the Mixtral model. Follow-up tasks will be needed to clean this up.
cc @masahi @sunggg