
Accelerate fails to initialize on Cloud TPUs #3304

Open

tengyifei opened this issue Dec 18, 2024 · 6 comments · May be fixed by #3324

@tengyifei

System Info

We have a CI test in PyTorch/XLA that runs `accelerate test`. When `accelerate` is installed from the main branch, the command fails with `ValueError: Unsupported nprocs (8). Please use the environment variable for the hardware you are using (X_NUM_DEVICES where X is CPU, GPU, TPU, NEURONCORE, etc)`.

```
[2024-12-18, 14:28:43 UTC] {taskinstance.py:1547} INFO - Exporting env vars: AIRFLOW_CTX_DAG_OWNER='unowned' AIRFLOW_CTX_DAG_ID='pytorchxla-nightly' AIRFLOW_CTX_TASK_ID='pt-nightly-accelerate-smoke-v2-8-1vm.run_model' AIRFLOW_CTX_EXECUTION_DATE='2024-12-17T14:00:00+00:00' AIRFLOW_CTX_TRY_NUMBER='3' AIRFLOW_CTX_DAG_RUN_ID='scheduled__2024-12-17T14:00:00+00:00'
[2024-12-18, 14:28:43 UTC] {tpu.py:375} INFO - Connecting to IP addresses of workers: ['10.128.0.130']
[2024-12-18, 14:28:43 UTC] {transport.py:1909} INFO - Connected (version 2.0, client OpenSSH_8.9p1)
[2024-12-18, 14:28:43 UTC] {transport.py:1909} INFO - Authentication (publickey) successful!
[2024-12-18, 14:28:44 UTC] {logging_mixin.py:150} WARNING - + sudo echo 'accelerator_type=${1}
  if [[ ${accelerator_type} =~ ^v5.* ]]
  then
    device_name=vfio/*
  else
    device_name=accel*
  fi
  echo "Terminating all processes utilizing the TPU (if any)."
  sudo lsof -t /dev/${device_name} | xargs -r kill -9
  '
[2024-12-18, 14:28:44 UTC] {logging_mixin.py:150} WARNING - + bash /tmp/kill_process.sh v2-8
[2024-12-18, 14:28:44 UTC] {logging_mixin.py:150} INFO - Terminating all processes utilizing the TPU (if any).
[2024-12-18, 14:28:45 UTC] {logging_mixin.py:150} WARNING - + export PJRT_DEVICE=TPU
+ PJRT_DEVICE=TPU
[2024-12-18, 14:28:45 UTC] {logging_mixin.py:150} WARNING - + export PATH=/home/ml-auto-solutions/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
[2024-12-18, 14:28:45 UTC] {logging_mixin.py:150} WARNING - + PATH=/home/ml-auto-solutions/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
+ accelerate test
[2024-12-18, 14:28:59 UTC] {logging_mixin.py:150} WARNING - stderr: Traceback (most recent call last):
stderr:   File "/home/ml-auto-solutions/.local/bin/accelerate-launch", line 8, in <module>
stderr:     sys.exit(main())
stderr:   File "/home/ml-auto-solutions/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1174, in main
stderr:     launch_command(args)
[2024-12-18, 14:28:59 UTC] {logging_mixin.py:150} WARNING - stderr:   File "/home/ml-auto-solutions/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1164, in launch_command
[2024-12-18, 14:28:59 UTC] {logging_mixin.py:150} WARNING - stderr:     tpu_launcher(args)
stderr:   File "/home/ml-auto-solutions/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 884, in tpu_launcher
stderr:     xmp.spawn(PrepareForLaunch(main_function), args=(), nprocs=args.num_processes)
stderr:   File "/home/ml-auto-solutions/.local/lib/python3.10/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 39, in spawn
stderr:     return pjrt.spawn(fn, nprocs, start_method, args)
[2024-12-18, 14:28:59 UTC] {logging_mixin.py:150} WARNING - stderr:   File "/home/ml-auto-solutions/.local/lib/python3.10/site-packages/torch_xla/_internal/pjrt.py", line 209, in spawn
stderr:     raise ValueError(
stderr: ValueError: Unsupported nprocs (8). Please use the environment variable for the hardware you are using (X_NUM_DEVICES where X is CPU, GPU, TPU, NEURONCORE, etc).
[2024-12-18, 14:29:00 UTC] {logging_mixin.py:150} INFO - 
Running:  accelerate-launch /home/ml-auto-solutions/.local/lib/python3.10/site-packages/accelerate/test_utils/scripts/test_script.py
[2024-12-18, 14:29:00 UTC] {logging_mixin.py:150} WARNING - Traceback (most recent call last):
  File "/home/ml-auto-solutions/.local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/ml-auto-solutions/.local/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/home/ml-auto-solutions/.local/lib/python3.10/site-packages/accelerate/commands/test.py", line 53, in test_command
[2024-12-18, 14:29:00 UTC] {logging_mixin.py:150} WARNING -     result = execute_subprocess_async(cmd)
  File "/home/ml-auto-solutions/.local/lib/python3.10/site-packages/accelerate/test_utils/testing.py", line 607, in execute_subprocess_async
    raise RuntimeError(
[2024-12-18, 14:29:00 UTC] {logging_mixin.py:150} WARNING - RuntimeError: 'accelerate-launch /home/ml-auto-solutions/.local/lib/python3.10/site-packages/accelerate/test_utils/scripts/test_script.py' failed with returncode 1

The combined stderr from workers follows:
Traceback (most recent call last):
  File "/home/ml-auto-solutions/.local/bin/accelerate-launch", line 8, in <module>
    sys.exit(main())
  File "/home/ml-auto-solutions/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1174, in main
    launch_command(args)
  File "/home/ml-auto-solutions/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1164, in launch_command
    tpu_launcher(args)
  File "/home/ml-auto-solutions/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 884, in tpu_launcher
    xmp.spawn(PrepareForLaunch(main_function), args=(), nprocs=args.num_processes)
  File "/home/ml-auto-solutions/.local/lib/python3.10/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 39, in spawn
    return pjrt.
[2024-12-18, 14:29:00 UTC] {logging_mixin.py:150} WARNING - spawn(fn, nprocs, start_method, args)
  File "/home/ml-auto-solutions/.local/lib/python3.10/site-packages/torch_xla/_internal/pjrt.py", line 209, in spawn
    raise ValueError(
ValueError: Unsupported nprocs (8). Please use the environment variable for the hardware you are using (X_NUM_DEVICES where X is CPU, GPU, TPU, NEURONCORE, etc).
[2024-12-18, 14:29:01 UTC] {taskinstance.py:1826} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/opt/python3.11/lib/python3.11/site-packages/airflow/decorators/base.py", line 220, in execute
    return_value = super().execute(context)
                   ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/python3.11/lib/python3.11/site-packages/airflow/operators/python.py", line 181, in execute
    return_value = self.execute_callable()
                   ^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/python3.11/lib/python3.11/site-packages/airflow/operators/python.py", line 198, in execute_callable
    return self.python_callable(*self.op_args, **self.op_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/gcs/dags/xlml/utils/tpu.py", line 404, in ssh_tpu
    ssh_group.run(cmds, env=env)
  File "/opt/python3.11/lib/python3.11/site-packages/fabric/group.py", line 116, in run
    return self._do("run", *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/python3.11/lib/python3.11/site-packages/fabric/group.py", line 282, in _do
    raise GroupException(results)
fabric.exceptions.GroupException: {<Connection host=10.128.0.130>: <UnexpectedExit: cmd='set -xue\nexport PJRT_DEVICE=TPU\nexport PATH=~/.local/bin:$PATH\n\naccelerate test' exited=1>}
[2024-12-18, 14:29:01 UTC] {taskinstance.py:1346} INFO - Marking task as FAILED. dag_id=pytorchxla-nightly, task_id=pt-nightly-accelerate-smoke-v2-8-1vm.run_model, execution_date=20241217T140000, start_date=20241218T142842, end_date=20241218T142901
[2024-12-18, 14:29:01 UTC] {standard_task_runner.py:104} ERROR - Failed to execute job 6266661 for task pt-nightly-accelerate-smoke-v2-8-1vm.run_model ({<Connection host=10.128.0.130>: <UnexpectedExit: cmd='set -xue\nexport PJRT_DEVICE=TPU\nexport PATH=~/.local/bin:$PATH\n\naccelerate test' exited=1>}; 3375752)
[2024-12-18, 14:29:02 UTC] {local_task_job_runner.py:225} INFO - Task exited with return code 1
[2024-12-18, 14:29:03 UTC] {taskinstance.py:2656} INFO - 1 downstream tasks scheduled from follow-on schedule check
```

### Information

- [ ] The official example scripts
- [ ] My own modified scripts

### Tasks

- [ ] One of the scripts in the examples/ folder of Accelerate or an officially supported `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)
- [ ] My own task or dataset (give details below)

### Reproduction

```bash
if [ -d "$HOME/.local/bin" ] ; then
  export PATH="$HOME/.local/bin:$PATH"
fi

# pytest is a test dependency of accelerate; unfortunately there is no
# requirements.txt in the accelerate repo, so install it explicitly.
pip install pytest
git clone https://github.com/huggingface/accelerate.git
pip install ./accelerate

mkdir -p ~/.cache/huggingface/accelerate/
cat > ~/.cache/huggingface/accelerate/default_config.yaml << 'HF_CONFIG_EOF'
compute_environment: LOCAL_MACHINE
distributed_type: XLA
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
HF_CONFIG_EOF

accelerate env

accelerate test
```


### Expected behavior

Test passes
@tengomucho

Hi @tengyifei,
I think the error comes from torch_xla's usage of xmp.spawn, in particular from here.
The error says the nprocs argument should be set to either 1 or the number of devices, but the code raises an error when it is not None.
I would check whether setting it to the number of devices makes it work; otherwise, try setting it to None. I do not know if this is something that has changed in torch_xla.
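
For reference, here is a minimal sketch of the constraint being described, assuming a current torch_xla build on a TPU VM (`_mp_fn` is just an illustrative entry point, not part of Accelerate):

```python
# Illustrative only: how torch_xla's xmp.spawn reacts to different nprocs
# values on a TPU VM (assumes PJRT_DEVICE=TPU and a recent torch_xla).
import torch_xla.distributed.xla_multiprocessing as xmp

def _mp_fn(index):
    # index is the local ordinal of the spawned process
    print(f"process {index} started")

if __name__ == "__main__":
    xmp.spawn(_mp_fn, args=(), nprocs=None)  # OK: one process per visible TPU device
    # xmp.spawn(_mp_fn, args=(), nprocs=1)   # OK: single-process mode
    # xmp.spawn(_mp_fn, args=(), nprocs=8)   # ValueError: Unsupported nprocs (8) ...
```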

@radna0

radna0 commented Dec 21, 2024

@tengomucho Can you update the accelerate launch code for TPU VMs to the following? xmp.spawn only accepts an nprocs of either 1 or None; None uses all of the devices.

```python
xmp.spawn(PrepareForLaunch(main_function), args=(), nprocs=1 if args.num_processes == 1 else None)
```
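
For completeness, the error message also hints at an environment-variable route. A rough, untested sketch of what that could look like inside accelerate's tpu_launcher (the TPU_NUM_DEVICES name is taken from the error message's X_NUM_DEVICES pattern and has not been verified against torch_xla's source):

```python
# Untested sketch of the env-var alternative suggested by the error message:
# keep args.num_processes, but expose it through TPU_NUM_DEVICES and let
# xmp.spawn decide the process count itself (nprocs=None).
import os

os.environ.setdefault("TPU_NUM_DEVICES", str(args.num_processes))  # e.g. "8" on a v2-8
xmp.spawn(PrepareForLaunch(main_function), args=(), nprocs=None)
```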

@tengomucho

@radna0 we can change our code, but I think the error is on the torch_xla side: according to the documentation, it should allow nprocs to be set to the number of devices.

tengomucho linked pull request #3324 on Jan 6, 2025 that will close this issue
@radna0

radna0 commented Jan 6, 2025

Yeah, I thought this too when I started using xmp.spawn from torch_xla, but the behavior has been carried over to their newer and cleaner API as well: nprocs is either 1 or None, and None just uses all devices.
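
For context, a rough sketch of that newer entry point, assuming a recent torch_xla release that exposes `torch_xla.launch` (verify against the version you have installed):

```python
# Rough sketch, assuming a recent torch_xla that provides torch_xla.launch.
# As with xmp.spawn, no explicit process count is passed; by default one
# process is started per visible TPU device.
import torch_xla

def _mp_fn(index):
    # index is the local ordinal of the spawned process
    print(f"process {index} started")

if __name__ == "__main__":
    torch_xla.launch(_mp_fn, args=())
```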

@tengomucho

@radna0 I still think they should at least update their documentation. Anyway, I opened #3324 to try to fix this, feel free to provide feedback.

@radna0

radna0 commented Jan 6, 2025

@tengomucho I think this looks good and should work fine on any TPU VMs.
