Accelerate fails to initialize on Cloud TPUs #3304
Comments
Hi @tengyifei,
@tengomucho Can you update the accelerate launch code for TPU VMs to the following?
@radna0 We can change our code, but I think the error is on the torch_xla side: according to its documentation, it should allow nprocs to be set to the number of devices.
Yeah, I thought this too when I started using xmp.spawn from torch_xla, but this restriction has been carried over to their newer and cleaner API as well: nprocs must be either 1 or None, and None simply uses all available devices.
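For reference, a minimal sketch of the xmp.spawn call being discussed. This assumes a TPU VM with torch_xla installed; the function name `_mp_fn` is a placeholder, not part of the original report:

```python
# Sketch: spawning a function across TPU devices with torch_xla.
# As noted above, current torch_xla releases only accept nprocs=1
# (single-process debug mode) or nprocs=None (use all available
# devices); other values raise an error.
import torch_xla.distributed.xla_multiprocessing as xmp


def _mp_fn(index):
    # index is the ordinal of the spawned process
    print(f"process {index} started")


if __name__ == "__main__":
    xmp.spawn(_mp_fn, args=(), nprocs=None)  # None -> all TPU devices
```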
@tengomucho I think this looks good and should work fine on any TPU VMs.
System Info
# Make sure user-installed scripts (e.g. pip's accelerate entry point) are on PATH
if [ -d "$HOME/.local/bin" ] ; then
export PATH="$HOME/.local/bin:$PATH"
fi
# pytest is a dependency of accelerate's test suite; unfortunately there is no requirements.txt in accelerate.
pip install pytest
git clone https://github.com/huggingface/accelerate.git
pip install ./accelerate
mkdir -p ~/.cache/huggingface/accelerate/
cat > ~/.cache/huggingface/accelerate/default_config.yaml << 'HF_CONFIG_EOF'
compute_environment: LOCAL_MACHINE
distributed_type: XLA
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
HF_CONFIG_EOF
accelerate env
accelerate test