The startup commands in examples and benchmark are different and confusing #228

binmakeswell · 2022-02-14T10:10:33Z

🐛 Describe the bug

The startup commands in the examples and the benchmark repository differ from one another and are confusing. They should be unified; in their current form they can easily confuse newcomers.

Not clear:
https://github.com/hpcaitech/ColossalAI-Benchmark/tree/62904e4ff2f3261c5469c773faa3d9307b6f16f4
More detail in hpcaitech/ColossalAI-Benchmark#5

The only command shown uses srun with more than 64 GPUs. How should users run the example with fewer GPUs or on a local machine?
https://github.com/hpcaitech/ColossalAI-Examples/tree/main/image/vision_transformer/hybrid_parallel

Clear command:
https://github.com/hpcaitech/ColossalAI-Examples/tree/main/image/vision_transformer/data_parallel

Possible error:
Since the command we give uses '--master_port 29500', users may hit 'RuntimeError: Address already in use' when that port is already taken, in which case they need to choose another port number.
https://github.com/hpcaitech/ColossalAI-Examples/tree/main/image/resnet
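One way to sidestep the 'Address already in use' error on a single machine is to ask the operating system for a free port before launching. The following is a minimal sketch, not a command taken from either repository: `find_free_port` and `build_torchrun_command` are hypothetical helper names, `train.py` is a placeholder script name, and the trailing `--from_torch` flag assumes the script accepts it, as described for the benchmarks later in this thread.

```python
import socket
from typing import List


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for a TCP port that is currently unused on `host`."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))  # port 0 lets the OS pick any free port
        return sock.getsockname()[1]


def build_torchrun_command(script: str, nproc_per_node: int, master_port: int) -> List[str]:
    """Assemble a single-node torchrun command line (hypothetical helper).

    Assumes the target script accepts a --from_torch flag.
    """
    return [
        "torchrun",
        f"--nproc_per_node={nproc_per_node}",
        f"--master_port={master_port}",
        script,
        "--from_torch",
    ]


if __name__ == "__main__":
    port = find_free_port()
    # "train.py" is a placeholder; substitute the example's real entry script.
    print(" ".join(build_torchrun_command("train.py", nproc_per_node=2, master_port=port)))
```

The port is released before the launcher starts, so another process could still claim it in between; on a busy machine it may be necessary to retry.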

Environment

No response

FrankLeeeee · 2022-02-15T03:45:50Z

@kurisusnowdeng can you fix the benchmark readme?

FrankLeeeee · 2022-02-18T07:58:03Z

Commands for ViT Hybrid Parallel updated in this PR

FrankLeeeee · 2022-03-01T01:34:39Z

Hi @kurisusnowdeng, can you fix this issue in the benchmark repository? I have fixed the one in the example repository.

kurisusnowdeng · 2022-03-01T02:54:41Z

> Hi @kurisusnowdeng, can you fix this issue in the benchmark repository? I have fixed the one in the example repository.

@FrankLeeeee in the benchmarks it's basically consistent: add --from_torch if using torchrun; otherwise, Colossal-AI launches in the standard way. However, in my opinion, we'd better use Docker as the first choice for running benchmarks and examples, so that it is also easier to keep the environment consistent. What do you think?
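To illustrate why a flag such as --from_torch is needed at all: each launcher exposes a process's rank through different environment variables. The sketch below is an assumption-laden illustration, not Colossal-AI's own code. `read_rank_info` is a hypothetical helper, and it assumes the non-torchrun path is Slurm's srun, as mentioned elsewhere in this thread. The variable names themselves are the ones torchrun (`RANK`, `LOCAL_RANK`, `WORLD_SIZE`) and Slurm (`SLURM_PROCID`, `SLURM_LOCALID`, `SLURM_NPROCS`) export.

```python
import os
from typing import Mapping, Tuple


def read_rank_info(from_torch: bool, env: Mapping[str, str] = os.environ) -> Tuple[int, int, int]:
    """Return (rank, local_rank, world_size) from the launcher's environment.

    Hypothetical helper: the non-torchrun branch assumes Slurm's srun.
    """
    if from_torch:
        # torchrun exports these variables to every worker process.
        return int(env["RANK"]), int(env["LOCAL_RANK"]), int(env["WORLD_SIZE"])
    # srun exports these variables to every task it starts.
    return int(env["SLURM_PROCID"]), int(env["SLURM_LOCALID"]), int(env["SLURM_NPROCS"])


if __name__ == "__main__":
    # Simulated environments, so the sketch runs without either launcher.
    torch_env = {"RANK": "3", "LOCAL_RANK": "1", "WORLD_SIZE": "4"}
    slurm_env = {"SLURM_PROCID": "5", "SLURM_LOCALID": "1", "SLURM_NPROCS": "8"}
    print(read_rank_info(True, torch_env))
    print(read_rank_info(False, slurm_env))
```

Passing the environment as a parameter keeps the helper testable without actually starting torchrun or srun.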

FrankLeeeee · 2022-03-01T03:03:33Z

I think what @binmakeswell means is that we should provide sample commands for the different launchers for clarity. I am OK with Docker if the goal is to give users an environment with pre-installed dependencies. The problem with Docker is that it can only run on a single node if we provide a pre-defined entry-point command. In a multi-node environment, we still need to use srun or mpirun to start the container, and this may conflict with the entry-point command.

kurisusnowdeng · 2022-03-01T03:19:06Z

> I think what @binmakeswell means is that we should provide sample commands for the different launchers for clarity. I am OK with Docker if the goal is to give users an environment with pre-installed dependencies. The problem with Docker is that it can only run on a single node if we provide a pre-defined entry-point command. In a multi-node environment, we still need to use srun or mpirun to start the container, and this may conflict with the entry-point command.

It seems @binmakeswell is mainly concerned that users don't know how to use the Python commands with Slurm. But I think Docker may already be the most convenient way for users to run our code. Also, we already have a tutorial that shows how to use Slurm, so maybe what we need to do is make that tutorial cover more cases, rather than explain how to run Slurm everywhere.

FrankLeeeee · 2022-03-01T03:24:05Z

I think putting a link to the tutorial on launching Colossal-AI will do. We have provided a Dockerfile in the Colossal-AI repository; do you mean changing the Docker entrypoint command for the examples?

kurisusnowdeng · 2022-03-01T03:41:29Z

> I think putting a link to the tutorial on launching Colossal-AI will do. We have provided a Dockerfile in the Colossal-AI repository; do you mean changing the Docker entrypoint command for the examples?

Yes. Maybe we can provide a Dockerfile that packages each example. Then users just build and run the image.

FrankLeeeee · 2022-03-01T03:44:39Z

OK, my opinion is that a Dockerfile is usually for complex environment setups. If an example requires a complicated setup, then a Dockerfile will be good.

binmakeswell added the enhancement New feature or request label Feb 14, 2022

FrankLeeeee added the documentation Improvements or additions to documentation label Feb 14, 2022

FrankLeeeee closed this as completed Mar 2, 2022
