How can I deploy mmpose dist_train.sh on Kubernetes clusters for distributed training? #2427
Unanswered
ThomaswellY asked this question in General
Replies: 1 comment 2 replies
-
Hi @ThomaswellY, I think training with k8s is supported, but sorry, we do not have a ready-made script for you (because we do not use a k8s cluster). You may need to refer to how distributed training with PyTorch is usually set up on k8s. If you succeed, would you like to create a PR to contribute the script to us?
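For reference, here is a minimal two-node launch sketch, assuming your copy of tools/dist_train.sh reads NNODES, NODE_RANK, MASTER_ADDR and PORT from the environment and forwards them to torch.distributed.launch (recent mmpose versions do; the config path, IP address and GPU count below are placeholders):

```bash
# Hypothetical two-node launch of mmpose; run one command on each node.
# Node 0 (acts as the rendezvous master at 10.0.0.1):
NNODES=2 NODE_RANK=0 MASTER_ADDR=10.0.0.1 PORT=29500 \
    bash tools/dist_train.sh configs/your_config.py 8

# Node 1:
NNODES=2 NODE_RANK=1 MASTER_ADDR=10.0.0.1 PORT=29500 \
    bash tools/dist_train.sh configs/your_config.py 8
```

The same pair of commands should work whether the two nodes are bare metal or pods, as long as they can reach each other on the chosen port.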
-
Hi, I have been using mmpose smoothly and successfully, and distributed training with multiple GPUs on a single node is easy with dist_train.sh.
Now, in my case, I have two nodes in a Kubernetes cluster. I wonder whether there is any way to run the training script on my Kubernetes cluster for multi-node, multi-GPU training.
I have successfully run https://github.com/microsoft/DeepSpeedExamples/ with the mpi-operator for distributed training.
Since mmpose uses torch.distributed.launch to start distributed training, which is not supported by the mpi-operator and is most likely supported by the pytorch-operator, I did several tests with the pytorch-operator and am still struggling with bugs.
So I wonder, is there any method to achieve distributed training on a Kubernetes cluster? A method based on https://github.com/kubeflow/training-operator/ would suit me best (see the sketch after this post).
Thank you in advance ~
@Ben-Louis
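To make the pytorch-operator route concrete: as far as I understand, a PyTorchJob (kubeflow/training-operator) injects MASTER_ADDR, MASTER_PORT, WORLD_SIZE and RANK into each replica pod, with WORLD_SIZE and RANK counted per pod rather than per process. A minimal sketch of a container entrypoint that remaps these to what dist_train.sh expects, assuming the script reads NNODES, NODE_RANK and PORT from the environment (the GPU count and config path are placeholders):

```bash
#!/usr/bin/env bash
# Hypothetical entrypoint for each PyTorchJob replica (Master and Worker alike).
# Assumes the operator injects MASTER_ADDR, MASTER_PORT, WORLD_SIZE, RANK per pod.
GPUS_PER_NODE=8                # placeholder: GPUs visible in this pod

export NNODES=${WORLD_SIZE}    # number of pods == number of nodes
export NODE_RANK=${RANK}       # this pod's index among the pods
export PORT=${MASTER_PORT}     # MASTER_ADDR is already set by the operator

bash tools/dist_train.sh configs/your_config.py ${GPUS_PER_NODE}
```

The key point is the unit mismatch: the operator counts ranks in pods, while torch.distributed.launch expects a node rank plus one process per GPU, so the pod-level values are passed through as NNODES/NODE_RANK and launch recomputes the per-process RANK and WORLD_SIZE itself.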