I am using an older version of the code to pretrain an embedding model, testing on two A100s. Monitoring with nvitop, CPU utilization sits at 100%, but GPU utilization hits 100% only a small fraction of the time and is 0 the rest of the time. I tried adjusting dataloader_num_workers, but it had no effect. My script is as follows:
```shell
CUDA_VISIBLE_DEVICES=0,1 torchrun --master_port 20036 --nproc_per_node 2 \
  -m FlagEmbedding.baai_general_embedding.retromae_pretrain.run \
  --output_dir /root/data1/bge-large-zh-v1.5-test \
  --model_name_or_path /root/data1/huggingface/BAAI/bge-large-zh-v1.5 \
  --train_data /root/data1/BAAI_DATA/PreTrain-Data \
  --learning_rate 2e-5 \
  --num_train_epochs 2 \
  --per_device_train_batch_size 16 \
  --dataloader_drop_last True \
  --max_seq_length 512 \
  --logging_steps 10 \
  --dataloader_num_workers 12
```
Any suggestions for this kind of problem? Thanks.
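One way to confirm that input processing (rather than the model) is the bottleneck is to time the DataLoader in isolation, with no forward pass. A minimal sketch follows; `SlowDataset` is a hypothetical stand-in whose `__getitem__` simulates per-sample CPU work (e.g. on-the-fly tokenization), and in practice you would substitute the real pretraining dataset:

```python
import time
from torch.utils.data import Dataset, DataLoader

class SlowDataset(Dataset):
    """Hypothetical stand-in: __getitem__ burns CPU time per sample,
    the way on-the-fly tokenization does."""
    def __len__(self):
        return 512

    def __getitem__(self, idx):
        time.sleep(0.005)  # simulate ~5 ms of per-sample CPU work
        return idx

def time_loader(num_workers, batch_size=16, n_batches=8):
    """Average seconds per batch when iterating the loader alone."""
    loader = DataLoader(SlowDataset(), batch_size=batch_size,
                        num_workers=num_workers)
    it = iter(loader)
    next(it)  # warm up worker processes before timing
    start = time.perf_counter()
    for _ in range(n_batches):
        next(it)
    return (time.perf_counter() - start) / n_batches

for w in (0, 4):
    print(f"num_workers={w}: {time_loader(w) * 1000:.1f} ms/batch")
```

If the per-batch time here is close to your observed training step time, the GPUs are starving on data, not compute.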
If CPU utilization is maxed out, the CPU is already fully loaded; if GPU utilization is still low, data processing is most likely the bottleneck.