The startup commands in examples and benchmark are different and confusing #228
Comments
@kurisusnowdeng, can you fix the benchmark README?
The commands for ViT Hybrid Parallel have been updated in this PR.
Hi @kurisusnowdeng, can you fix this issue in the benchmark repository? I have fixed the one in the example repository.
@FrankLeeeee, in the benchmarks it's basically consistent. Add
I think what @binmakeswell means is that we should provide sample commands for different launchers for clarity. I am OK with Docker if the goal is to give users an environment with pre-installed dependencies. The problem with Docker is that it can only run on a single node if we provide a pre-defined entrypoint command. In a multi-node environment, we still need to use
It seems @binmakeswell is mainly concerned that users don't know how to use the Python commands with Slurm. But I think Docker may already be the most convenient way for users to run our code. Also, we already have a tutorial that shows how to use Slurm, so maybe what we need to do is make that tutorial cover more cases, rather than explain how to run Slurm everywhere.
I think adding a link on how to launch Colossal-AI will do. We already provide a Dockerfile in the Colossal-AI repository. Do you mean changing the Docker entrypoint command for the examples?
Yes. Maybe we can provide a Dockerfile to package each example. Then users just build and run the image.
OK, my opinion is that a Dockerfile is usually for complex environment setups. If an example requires a complicated setup, then a Dockerfile would be good.
🐛 Describe the bug
The startup commands in the examples and the benchmark are different and confusing. They should be unified; in their current form they can easily confuse newcomers.
Not clear:
https://github.com/hpcaitech/ColossalAI-Benchmark/tree/62904e4ff2f3261c5469c773faa3d9307b6f16f4
More details in hpcaitech/ColossalAI-Benchmark#5
Only a command for more than 64 GPUs using srun is given. How should it be run with fewer GPUs or on a local machine?
https://github.com/hpcaitech/ColossalAI-Examples/tree/main/image/vision_transformer/hybrid_parallel
Clear command:
https://github.com/hpcaitech/ColossalAI-Examples/tree/main/image/vision_transformer/data_parallel
Possible error:
Since the command we give includes '--master_port 29500', users may hit the error 'RuntimeError: Address already in use' when that port is already taken. In that case they need to pass a different port number.
https://github.com/hpcaitech/ColossalAI-Examples/tree/main/image/resnet
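One way to sidestep the port clash described above is to check the port before launching. The snippet below is only a minimal sketch using the Python standard library; the helper names `is_port_free` and `find_free_port` are hypothetical and are not part of Colossal-AI or PyTorch. It also cannot rule out a race: another process may still grab the port between the check and the actual launch.

```python
import socket


def is_port_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if we can bind to (host, port) right now (hypothetical helper)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically "Address already in use".
            return False
        return True


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for an unused TCP port (hypothetical helper)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))  # port 0 lets the OS choose any free port
        return sock.getsockname()[1]


if __name__ == "__main__":
    default_port = 29500  # the port used in the example commands
    port = default_port if is_port_free(default_port) else find_free_port()
    print(port)
```

The printed value can then be passed as the `--master_port` argument in place of the hard-coded 29500.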
Environment
No response