Update moe branch for torchrun and fix error while loading trainer.ds_module #7

pnunna93 · 2023-11-01T18:40:35Z

What does this PR do?

Fixes error while loading trainer.ds_module
Changes torch.distributed.launch to torchrun
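For context on the launcher change: `torch.distributed.launch` passes the local rank to the script as a `--local_rank` argument by default, while `torchrun` exports it through the `LOCAL_RANK` environment variable. The sketch below shows one way a training script could accept both. The helper name `resolve_local_rank` is hypothetical and not taken from this PR.

```python
import argparse
import os


def resolve_local_rank(argv=None, environ=None):
    """Return the local rank, preferring torchrun's LOCAL_RANK variable.

    Hypothetical helper for illustration; not the code changed in this PR.
    """
    environ = os.environ if environ is None else environ

    parser = argparse.ArgumentParser()
    # The legacy launcher passes the rank as a flag; accept both spellings.
    parser.add_argument("--local_rank", "--local-rank", type=int, default=0)
    # parse_known_args ignores the script's other, unrelated arguments.
    args, _ = parser.parse_known_args(argv)

    # torchrun sets LOCAL_RANK in the environment of every worker process.
    if "LOCAL_RANK" in environ:
        return int(environ["LOCAL_RANK"])
    return args.local_rank
```

On the command line, the change typically means replacing `python -m torch.distributed.launch --nproc_per_node=N train.py` with `torchrun --nproc_per_node=N train.py`, where `train.py` and `N` are placeholders.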

Fix the issue "[rank0]: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!"

In DeepSpeed 0.15.0, commit 7260890452eb89185f9ab1e09550938f78ea91db changed the exp_counts tensor returned when calling deepspeed.moe.layer.MoE() from 'cpu' to the compute device. This change reduces CPU host overhead when using MoE.

For DeepSpeed >= 0.15.0, the device of the self.expert_counts tensor in the Fairseq transformer_moe_layer module therefore needs to be changed from cpu to cuda.

Signed-off-by: Jagadish Krishnamoorthy <[email protected]>
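The version-dependent device choice described in this commit message can be sketched as follows. This is a minimal illustration, not the code from the commit: the helper names `parse_version` and `expert_counts_device` are hypothetical, and the 0.15.0 threshold is taken from the message above.

```python
import re


def parse_version(version):
    """Parse the leading 'major.minor[.patch]' of a version string into a tuple."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    # A missing patch component is treated as 0.
    return tuple(int(part or 0) for part in match.groups())


def expert_counts_device(ds_version, compute_device):
    """Pick the device for the expert_counts buffer (hypothetical helper).

    Assumption from the commit message: DeepSpeed >= 0.15.0 returns
    exp_counts on the compute device, older releases return it on the CPU.
    """
    if parse_version(ds_version) >= (0, 15, 0):
        return compute_device
    return "cpu"
```

In a real module, the result would presumably be passed as the `device` argument when allocating the buffer, for example `expert_counts_device(deepspeed.__version__, "cuda")`, so that adding the returned counts no longer mixes `cuda:0` and `cpu` tensors.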

transformer_moe_layer: Fix Runtime error

pnunna93 and others added 4 commits October 27, 2023 16:11

Remove module for trainer.model

d1d81fa

Changes for torchrun

4b34146

Merge pull request #1 from jagadish-amd/fix-SWDEV-485020

9c68be1

transformer_moe_layer: Fix Runtime error
