How can I deploy mmpose dist_train.sh on Kubernetes clusters for distributed training? #2427
Unanswered
ThomaswellY asked this question in General
Replies: 1 comment 2 replies
-
Hi @ThomaswellY, I think training with k8s is supported, but sorry, we do not have a ready-made script for you (because we do not use a k8s cluster). You may need to refer to how distributed training with PyTorch is usually set up on k8s. If you succeed, would you like to create a PR to contribute the script to us?
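For reference, here is a minimal two-node launch sketch, assuming your copy of tools/dist_train.sh reads NNODES, NODE_RANK, MASTER_ADDR and PORT from the environment and forwards them to torch.distributed.launch (recent mmpose versions do; the config path, IP address and GPU count below are placeholders):

```bash
# Hypothetical two-node launch of mmpose; run one command on each node.
# Node 0 (acts as the rendezvous master at 10.0.0.1):
NNODES=2 NODE_RANK=0 MASTER_ADDR=10.0.0.1 PORT=29500 \
    bash tools/dist_train.sh configs/your_config.py 8

# Node 1:
NNODES=2 NODE_RANK=1 MASTER_ADDR=10.0.0.1 PORT=29500 \
    bash tools/dist_train.sh configs/your_config.py 8
```

The same pair of commands should work whether the two nodes are bare metal or pods, as long as they can reach each other on the chosen port.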
-
Hi, I have been using mmpose smoothly and successfully, and distributed training with multiple GPUs on a single node is easy with dist_train.sh.
Now, in my case, I have two nodes in a Kubernetes cluster. I wonder whether there is any way to run the training script on my Kubernetes cluster for multi-node, multi-GPU training.
I have successfully run https://github.com/microsoft/DeepSpeedExamples/ with the mpi-operator for distributed training.
Since mmpose uses torch.distributed.launch to start distributed training, which is not supported by the mpi-operator and is most likely supported by the pytorch-operator, I did several tests with the pytorch-operator and am still struggling with bugs.
So I wonder, is there any method to achieve distributed training on a Kubernetes cluster? A method based on https://github.com/kubeflow/training-operator/ would suit me best (see the sketch after this post).
Thank you in advance ~
@Ben-Louis
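To make the pytorch-operator route concrete: as far as I understand, a PyTorchJob (kubeflow/training-operator) injects MASTER_ADDR, MASTER_PORT, WORLD_SIZE and RANK into each replica pod, with WORLD_SIZE and RANK counted per pod rather than per process. A minimal sketch of a container entrypoint that remaps these to what dist_train.sh expects, assuming the script reads NNODES, NODE_RANK and PORT from the environment (the GPU count and config path are placeholders):

```bash
#!/usr/bin/env bash
# Hypothetical entrypoint for each PyTorchJob replica (Master and Worker alike).
# Assumes the operator injects MASTER_ADDR, MASTER_PORT, WORLD_SIZE, RANK per pod.
GPUS_PER_NODE=8                # placeholder: GPUs visible in this pod

export NNODES=${WORLD_SIZE}    # number of pods == number of nodes
export NODE_RANK=${RANK}       # this pod's index among the pods
export PORT=${MASTER_PORT}     # MASTER_ADDR is already set by the operator

bash tools/dist_train.sh configs/your_config.py ${GPUS_PER_NODE}
```

The key point is the unit mismatch: the operator counts ranks in pods, while torch.distributed.launch expects a node rank plus one process per GPU, so the pod-level values are passed through as NNODES/NODE_RANK and launch recomputes the per-process RANK and WORLD_SIZE itself.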