Great work!
I noticed in your blog that multi-node inference is implemented via TP and PP:
> While challenging, this can be achieved with two-node inference using a combination of system optimizations such as FP8 weights, split-fuse and continuous batching, tensor parallelism within a node and pipeline parallelism across nodes.
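To make sure I understand the layout you describe, here is a minimal sketch (my own illustration, not code from your blog) of how I picture the rank mapping: tensor parallelism spans the GPUs within a node, and each node forms one pipeline stage. The function name and arguments are hypothetical.

```python
# Hypothetical sketch of the TP-within-node / PP-across-nodes layout:
# each node is one pipeline stage, and the GPUs inside a node form a
# tensor-parallel group.
def tp_pp_coords(rank: int, gpus_per_node: int) -> tuple[int, int]:
    pp_stage = rank // gpus_per_node  # node index doubles as pipeline stage
    tp_rank = rank % gpus_per_node    # position within the node's TP group
    return pp_stage, tp_rank

# Two nodes of 8 GPUs: ranks 0-7 are TP group of stage 0,
# ranks 8-15 are TP group of stage 1.
print(tp_pp_coords(11, 8))  # (1, 3)
```

Is this roughly the decomposition used for the two-node setup?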
I was wondering whether you have tried DP + TP + EP as described in the DeepSpeed-MoE paper. Also, what is the best practice for scaling such a giant model across multiple nodes to achieve the best inference efficiency?
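For reference, the kind of three-way decomposition I have in mind could be sketched as below. This is only one possible rank layout under my own assumptions (TP innermost, expert-parallel groups carved out of the data-parallel replicas); it is not necessarily the exact mapping DeepSpeed-MoE uses, and all names here are hypothetical.

```python
# Hypothetical DP + TP + EP rank layout: TP is the innermost dimension,
# and expert-parallel (EP) groups are formed among data-parallel replicas,
# so each expert shard is replicated dp_size / ep_size times.
def dp_tp_ep_coords(rank: int, tp_size: int, ep_size: int) -> tuple[int, int, int]:
    tp_rank = rank % tp_size       # position within a tensor-parallel group
    dp_rank = rank // tp_size      # data-parallel replica index
    ep_rank = dp_rank % ep_size    # expert-parallel group within the replicas
    return dp_rank, tp_rank, ep_rank

# 24 GPUs, tp_size=8, ep_size=2: rank 19 sits in DP replica 2,
# TP slot 3, EP group 0.
print(dp_tp_ep_coords(19, 8, 2))  # (2, 3, 0)
```

If the actual grouping differs (e.g. EP spanning nodes rather than replicas), I'd be very interested in the reasoning behind that choice.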