Unstable GPU utilization #1325

Open
xiayouran opened this issue Jan 10, 2025 · 1 comment

Comments

xiayouran commented Jan 10, 2025

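I am using an older version of the code to pretrain an embedding model, testing on two A100s. nvitop shows CPU utilization at 100%, but GPU utilization is at 100% only a small fraction of the time and at 0 the rest of the time. I tried adjusting dataloader_num_workers, but it had no effect. My script is as follows: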

CUDA_VISIBLE_DEVICES=0,1 torchrun --master_port 20036 --nproc_per_node 2 \
-m FlagEmbedding.baai_general_embedding.retromae_pretrain.run \
--output_dir /root/data1/bge-large-zh-v1.5-test \
--model_name_or_path /root/data1/huggingface/BAAI/bge-large-zh-v1.5 \
--train_data /root/data1/BAAI_DATA/PreTrain-Data \
--learning_rate 2e-5 \
--num_train_epochs 2 \
--per_device_train_batch_size 16 \
--dataloader_drop_last True \
--max_seq_length 512 \
--logging_steps 10 \
--dataloader_num_workers 12

Do you have any suggestions for this kind of problem? Thanks.

@545999961
Collaborator

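If CPU utilization is maxed out, the CPU is already fully utilized. If GPU utilization is low, the bottleneck is most likely still the data processing speed.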
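
One way to check whether data processing is the bottleneck is to time how long each step waits for the next batch versus how long the training step itself takes. The following is only a minimal sketch under stated assumptions: `data_wait_fraction` and `slow_batches` are hypothetical helper names, not part of FlagEmbedding, and `slow_batches` is a toy stand-in for a CPU-bound data pipeline. Any iterable of batches can be passed in, including a PyTorch DataLoader.

```python
import time
from typing import Any, Callable, Iterable, Iterator


def data_wait_fraction(batches: Iterable[Any],
                       train_step: Callable[[Any], None],
                       max_steps: int = 50) -> float:
    """Return the share of measured wall-clock time spent waiting for batches.

    Caveat: CUDA kernels launch asynchronously, so for GPU training the
    train_step callable should synchronize (e.g. torch.cuda.synchronize())
    before returning; otherwise compute time leaks into the next wait.
    """
    wait = 0.0
    compute = 0.0
    it = iter(batches)
    for _ in range(max_steps):
        t0 = time.perf_counter()
        try:
            batch = next(it)  # blocks while the data pipeline produces a batch
        except StopIteration:
            break
        t1 = time.perf_counter()
        train_step(batch)  # forward/backward/optimizer step in real training
        t2 = time.perf_counter()
        wait += t1 - t0
        compute += t2 - t1
    total = wait + compute
    return wait / total if total > 0 else 0.0


def slow_batches(n: int, delay_s: float) -> Iterator[int]:
    """Hypothetical stand-in for a slow, CPU-bound data pipeline."""
    for i in range(n):
        time.sleep(delay_s)  # simulates tokenization / masking work
        yield i
```

A fraction close to 1 would suggest the GPUs are mostly idle waiting for data, which matches the 0%/100% utilization pattern described above. In that case, raising `--dataloader_num_workers` is unlikely to help once the CPU cores are already saturated, since the two torchrun processes with 12 workers each already amount to 24 worker processes.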
