
The startup commands in examples and benchmark are different and confusing #228

Closed
binmakeswell opened this issue Feb 14, 2022 · 9 comments
Labels: documentation (Improvements or additions to documentation), enhancement (New feature or request)

Comments

binmakeswell (Member) commented Feb 14, 2022

🐛 Describe the bug

The startup commands in the examples and benchmark are different and confusing. They should be unified; the current form easily confuses newcomers.

Not clear:
https://github.com/hpcaitech/ColossalAI-Benchmark/tree/62904e4ff2f3261c5469c773faa3d9307b6f16f4
More detail in hpcaitech/ColossalAI-Benchmark#5

The only command shown assumes more than 64 GPUs with srun; how should users run with fewer GPUs or on a local machine?
https://github.com/hpcaitech/ColossalAI-Examples/tree/main/image/vision_transformer/hybrid_parallel

Clear command:
https://github.com/hpcaitech/ColossalAI-Examples/tree/main/image/vision_transformer/data_parallel

Possible error:
Since the command hard-codes '--master_port 29500', users may hit the error 'RuntimeError: Address already in use' and need to switch to another port number.
https://github.com/hpcaitech/ColossalAI-Examples/tree/main/image/resnet
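As a minimal sketch of one possible fix (the `find_free_port` helper is hypothetical, not part of the linked example), a launch script could ask the OS for an unused port instead of hard-coding 29500:

```python
import socket

def find_free_port() -> int:
    """Hypothetical helper: ask the kernel for an unused TCP port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))  # port 0 -> the kernel picks a free ephemeral port
        return s.getsockname()[1]

port = find_free_port()
# The chosen port can then be passed to the launcher instead of 29500, e.g.:
print(f"torchrun --nproc_per_node 1 --master_port {port} train.py")
```

This avoids the collision when 29500 is already taken, at the cost of the port differing between runs (all ranks must still agree on the master port, so in multi-node setups the chosen port has to be broadcast to every node).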

Environment

No response

@binmakeswell binmakeswell added the enhancement New feature or request label Feb 14, 2022
@FrankLeeeee FrankLeeeee added the documentation Improvements or additions to documentation label Feb 14, 2022
@FrankLeeeee (Contributor)

@kurisusnowdeng can you fix the benchmark readme?

@FrankLeeeee (Contributor)

Commands for ViT Hybrid Parallel updated in this PR

@FrankLeeeee (Contributor)

Hi, @kurisusnowdeng , can you fix this issue in benchmark repository? I have fixed the one in the example repository.

@kurisusnowdeng (Member)

> Hi, @kurisusnowdeng , can you fix this issue in benchmark repository? I have fixed the one in the example repository.

@FrankLeeeee in the benchmarks it's basically consistent: add `--from_torch` if using torchrun; otherwise Colossal-AI launches in the standard way. However, in my opinion, we'd better make Docker the first choice for running benchmarks and examples, so that it is also easier to keep the environment consistent. What do you think?
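For clarity, the dispatch implied by the `--from_torch` flag might look roughly like this sketch (the `build_launch_cmd` helper and returned names are ours, not the actual benchmark code; Colossal-AI's real entry points may differ by version):

```python
import argparse

def build_launch_cmd(args_list):
    """Hypothetical sketch: choose a launcher based on --from_torch."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--from_torch", action="store_true",
                        help="set when the script is started via torchrun")
    args = parser.parse_args(args_list)
    if args.from_torch:
        # torchrun already sets RANK / WORLD_SIZE / MASTER_ADDR in the env,
        # so the script only needs to read them back
        return "colossalai.launch_from_torch"
    # otherwise fall back to the standard Colossal-AI launch path
    return "colossalai.launch"

print(build_launch_cmd(["--from_torch"]))  # prints colossalai.launch_from_torch
print(build_launch_cmd([]))                # prints colossalai.launch
```

The point of the flag, as described above, is that the same training script can be started either by torchrun or by Colossal-AI's own launcher, with one branch per environment.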

FrankLeeeee (Contributor) commented Mar 1, 2022

I think what @binmakeswell means is that we should provide sample commands for different launchers for clarity. I am OK with Docker if it is meant to give the user an environment with pre-installed dependencies. The problem with Docker is that it can only run on a single node if we provide a pre-defined entry-point command. In a multi-node environment, we still need to use srun or mpirun to start the container, and this may conflict with the entry-point command.

kurisusnowdeng (Member) commented Mar 1, 2022

> I think what @binmakeswell means is that we should provide sample commands for different launchers for clarity. I am OK with Docker if it is meant to give the user an environment with pre-installed dependencies. The problem with Docker is that it can only run on a single node if we provide a pre-defined entry-point command. In a multi-node environment, we still need to use srun or mpirun to start the container, and this may conflict with the entry-point command.

It seems @binmakeswell's main concern is that users don't know how to use the python commands with slurm. But I think Docker may already be the most convenient way for users to run our code. Also, we already have a tutorial that shows the usage of slurm; maybe what we need to do is make that tutorial cover more cases, rather than explain how to run slurm everywhere.

@FrankLeeeee (Contributor)

I think putting a link to the Colossal-AI launch documentation will do. We have provided a Dockerfile in the Colossal-AI repository; do you mean to change the Docker entry-point command for the examples?

@kurisusnowdeng (Member)

> I think putting a link to the Colossal-AI launch documentation will do. We have provided a Dockerfile in the Colossal-AI repository; do you mean to change the Docker entry-point command for the examples?

Yes. Maybe we can provide a Dockerfile that packages each single example; then users just build and run the image.

@FrankLeeeee (Contributor)

OK. My opinion is that a Dockerfile is usually for complex environment setup; if an example requires a complicated setup, then a Dockerfile will be good.


3 participants