Great work!
I noticed in your blog that multi-node inference is implemented via TP and PP:
> While challenging, this can be achieved with two-node inference using a combination of system optimizations such as FP8 weights, split-fuse and continuous batching, tensor parallelism within a node and pipeline parallelism across nodes.
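To make sure I understand the layout you describe, here is a minimal sketch (my own illustration, not code from your blog) of how I picture the rank mapping: tensor parallelism spans the GPUs within a node, and each node forms one pipeline stage. The function name and arguments are hypothetical.

```python
# Hypothetical sketch of the TP-within-node / PP-across-nodes layout:
# each node is one pipeline stage, and the GPUs inside a node form a
# tensor-parallel group.
def tp_pp_coords(rank: int, gpus_per_node: int) -> tuple[int, int]:
    pp_stage = rank // gpus_per_node  # node index doubles as pipeline stage
    tp_rank = rank % gpus_per_node    # position within the node's TP group
    return pp_stage, tp_rank

# Two nodes of 8 GPUs: ranks 0-7 are TP group of stage 0,
# ranks 8-15 are TP group of stage 1.
print(tp_pp_coords(11, 8))  # (1, 3)
```

Is this roughly the decomposition used for the two-node setup?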
I was wondering whether you have tried DP + TP + EP as described in the DeepSpeed-MoE paper. Also, what is the best practice for scaling such a giant model across multiple nodes to achieve the best inference efficiency?
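For reference, the kind of three-way decomposition I have in mind could be sketched as below. This is only one possible rank layout under my own assumptions (TP innermost, expert-parallel groups carved out of the data-parallel replicas); it is not necessarily the exact mapping DeepSpeed-MoE uses, and all names here are hypothetical.

```python
# Hypothetical DP + TP + EP rank layout: TP is the innermost dimension,
# and expert-parallel (EP) groups are formed among data-parallel replicas,
# so each expert shard is replicated dp_size / ep_size times.
def dp_tp_ep_coords(rank: int, tp_size: int, ep_size: int) -> tuple[int, int, int]:
    tp_rank = rank % tp_size       # position within a tensor-parallel group
    dp_rank = rank // tp_size      # data-parallel replica index
    ep_rank = dp_rank % ep_size    # expert-parallel group within the replicas
    return dp_rank, tp_rank, ep_rank

# 24 GPUs, tp_size=8, ep_size=2: rank 19 sits in DP replica 2,
# TP slot 3, EP group 0.
print(dp_tp_ep_coords(19, 8, 2))  # (2, 3, 0)
```

If the actual grouping differs (e.g. EP spanning nodes rather than replicas), I'd be very interested in the reasoning behind that choice.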