
Accelerate fails to initialize on Cloud TPUs #3304

Open

tengyifei opened this issue Dec 18, 2024 · 6 comments · May be fixed by #3324

@tengyifei

System Info

We have a CI test in PyTorch/XLA that runs `accelerate test`. When `accelerate` is installed from the main branch, the command fails with `ValueError: Unsupported nprocs (8). Please use the environment variable for the hardware you are using (X_NUM_DEVICES where X is CPU, GPU, TPU, NEURONCORE, etc)`.

```
[2024-12-18, 14:28:43 UTC] {taskinstance.py:1547} INFO - Exporting env vars: AIRFLOW_CTX_DAG_OWNER='unowned' AIRFLOW_CTX_DAG_ID='pytorchxla-nightly' AIRFLOW_CTX_TASK_ID='pt-nightly-accelerate-smoke-v2-8-1vm.run_model' AIRFLOW_CTX_EXECUTION_DATE='2024-12-17T14:00:00+00:00' AIRFLOW_CTX_TRY_NUMBER='3' AIRFLOW_CTX_DAG_RUN_ID='scheduled__2024-12-17T14:00:00+00:00'
[2024-12-18, 14:28:43 UTC] {tpu.py:375} INFO - Connecting to IP addresses of workers: ['10.128.0.130']
[2024-12-18, 14:28:43 UTC] {transport.py:1909} INFO - Connected (version 2.0, client OpenSSH_8.9p1)
[2024-12-18, 14:28:43 UTC] {transport.py:1909} INFO - Authentication (publickey) successful!
[2024-12-18, 14:28:44 UTC] {logging_mixin.py:150} WARNING - + sudo echo 'accelerator_type=${1}
  if [[ ${accelerator_type} =~ ^v5.* ]]
  then
    device_name=vfio/*
  else
    device_name=accel*
  fi
  echo "Terminating all processes utilizing the TPU (if any)."
  sudo lsof -t /dev/${device_name} | xargs -r kill -9
  '
[2024-12-18, 14:28:44 UTC] {logging_mixin.py:150} WARNING - + bash /tmp/kill_process.sh v2-8
[2024-12-18, 14:28:44 UTC] {logging_mixin.py:150} INFO - Terminating all processes utilizing the TPU (if any).
[2024-12-18, 14:28:45 UTC] {logging_mixin.py:150} WARNING - + export PJRT_DEVICE=TPU
+ PJRT_DEVICE=TPU
[2024-12-18, 14:28:45 UTC] {logging_mixin.py:150} WARNING - + export PATH=/home/ml-auto-solutions/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
[2024-12-18, 14:28:45 UTC] {logging_mixin.py:150} WARNING - + PATH=/home/ml-auto-solutions/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
+ accelerate test
[2024-12-18, 14:28:59 UTC] {logging_mixin.py:150} WARNING - stderr: Traceback (most recent call last):
stderr:   File "/home/ml-auto-solutions/.local/bin/accelerate-launch", line 8, in <module>
stderr:     sys.exit(main())
stderr:   File "/home/ml-auto-solutions/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1174, in main
stderr:     launch_command(args)
[2024-12-18, 14:28:59 UTC] {logging_mixin.py:150} WARNING - stderr:   File "/home/ml-auto-solutions/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1164, in launch_command
[2024-12-18, 14:28:59 UTC] {logging_mixin.py:150} WARNING - stderr:     tpu_launcher(args)
stderr:   File "/home/ml-auto-solutions/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 884, in tpu_launcher
stderr:     xmp.spawn(PrepareForLaunch(main_function), args=(), nprocs=args.num_processes)
stderr:   File "/home/ml-auto-solutions/.local/lib/python3.10/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 39, in spawn
stderr:     return pjrt.spawn(fn, nprocs, start_method, args)
[2024-12-18, 14:28:59 UTC] {logging_mixin.py:150} WARNING - stderr:   File "/home/ml-auto-solutions/.local/lib/python3.10/site-packages/torch_xla/_internal/pjrt.py", line 209, in spawn
stderr:     raise ValueError(
stderr: ValueError: Unsupported nprocs (8). Please use the environment variable for the hardware you are using (X_NUM_DEVICES where X is CPU, GPU, TPU, NEURONCORE, etc).
[2024-12-18, 14:29:00 UTC] {logging_mixin.py:150} INFO - 
Running:  accelerate-launch /home/ml-auto-solutions/.local/lib/python3.10/site-packages/accelerate/test_utils/scripts/test_script.py
[2024-12-18, 14:29:00 UTC] {logging_mixin.py:150} WARNING - Traceback (most recent call last):
  File "/home/ml-auto-solutions/.local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/ml-auto-solutions/.local/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/home/ml-auto-solutions/.local/lib/python3.10/site-packages/accelerate/commands/test.py", line 53, in test_command
[2024-12-18, 14:29:00 UTC] {logging_mixin.py:150} WARNING -     result = execute_subprocess_async(cmd)
  File "/home/ml-auto-solutions/.local/lib/python3.10/site-packages/accelerate/test_utils/testing.py", line 607, in execute_subprocess_async
    raise RuntimeError(
[2024-12-18, 14:29:00 UTC] {logging_mixin.py:150} WARNING - RuntimeError: 'accelerate-launch /home/ml-auto-solutions/.local/lib/python3.10/site-packages/accelerate/test_utils/scripts/test_script.py' failed with returncode 1

The combined stderr from workers follows:
Traceback (most recent call last):
  File "/home/ml-auto-solutions/.local/bin/accelerate-launch", line 8, in <module>
    sys.exit(main())
  File "/home/ml-auto-solutions/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1174, in main
    launch_command(args)
  File "/home/ml-auto-solutions/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1164, in launch_command
    tpu_launcher(args)
  File "/home/ml-auto-solutions/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 884, in tpu_launcher
    xmp.spawn(PrepareForLaunch(main_function), args=(), nprocs=args.num_processes)
  File "/home/ml-auto-solutions/.local/lib/python3.10/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 39, in spawn
    return pjrt.
[2024-12-18, 14:29:00 UTC] {logging_mixin.py:150} WARNING - spawn(fn, nprocs, start_method, args)
  File "/home/ml-auto-solutions/.local/lib/python3.10/site-packages/torch_xla/_internal/pjrt.py", line 209, in spawn
    raise ValueError(
ValueError: Unsupported nprocs (8). Please use the environment variable for the hardware you are using (X_NUM_DEVICES where X is CPU, GPU, TPU, NEURONCORE, etc).
[2024-12-18, 14:29:01 UTC] {taskinstance.py:1826} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/opt/python3.11/lib/python3.11/site-packages/airflow/decorators/base.py", line 220, in execute
    return_value = super().execute(context)
                   ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/python3.11/lib/python3.11/site-packages/airflow/operators/python.py", line 181, in execute
    return_value = self.execute_callable()
                   ^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/python3.11/lib/python3.11/site-packages/airflow/operators/python.py", line 198, in execute_callable
    return self.python_callable(*self.op_args, **self.op_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/gcs/dags/xlml/utils/tpu.py", line 404, in ssh_tpu
    ssh_group.run(cmds, env=env)
  File "/opt/python3.11/lib/python3.11/site-packages/fabric/group.py", line 116, in run
    return self._do("run", *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/python3.11/lib/python3.11/site-packages/fabric/group.py", line 282, in _do
    raise GroupException(results)
fabric.exceptions.GroupException: {<Connection host=10.128.0.130>: <UnexpectedExit: cmd='set -xue\nexport PJRT_DEVICE=TPU\nexport PATH=~/.local/bin:$PATH\n\naccelerate test' exited=1>}
[2024-12-18, 14:29:01 UTC] {taskinstance.py:1346} INFO - Marking task as FAILED. dag_id=pytorchxla-nightly, task_id=pt-nightly-accelerate-smoke-v2-8-1vm.run_model, execution_date=20241217T140000, start_date=20241218T142842, end_date=20241218T142901
[2024-12-18, 14:29:01 UTC] {standard_task_runner.py:104} ERROR - Failed to execute job 6266661 for task pt-nightly-accelerate-smoke-v2-8-1vm.run_model ({<Connection host=10.128.0.130>: <UnexpectedExit: cmd='set -xue\nexport PJRT_DEVICE=TPU\nexport PATH=~/.local/bin:$PATH\n\naccelerate test' exited=1>}; 3375752)
[2024-12-18, 14:29:02 UTC] {local_task_job_runner.py:225} INFO - Task exited with return code 1
[2024-12-18, 14:29:03 UTC] {taskinstance.py:2656} INFO - 1 downstream tasks scheduled from follow-on schedule check
```

### Information

- [ ] The official example scripts
- [ ] My own modified scripts

### Tasks

- [ ] One of the scripts in the examples/ folder of Accelerate or an officially supported `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)
- [ ] My own task or dataset (give details below)

### Reproduction

```bash
if [ -d "$HOME/.local/bin" ] ; then
  export PATH="$HOME/.local/bin:$PATH"
fi

# pytest is a test dependency of accelerate; unfortunately there is no
# requirements.txt in the accelerate repo, so install it explicitly.
pip install pytest
git clone https://github.com/huggingface/accelerate.git
pip install ./accelerate

mkdir -p ~/.cache/huggingface/accelerate/
cat > ~/.cache/huggingface/accelerate/default_config.yaml << 'HF_CONFIG_EOF'
compute_environment: LOCAL_MACHINE
distributed_type: XLA
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
HF_CONFIG_EOF

accelerate env

accelerate test
```


### Expected behavior

Test passes
@tengomucho

Hi @tengyifei,
I think the error comes from torch_xla's usage of xmp.spawn, in particular from here.
The error says the nprocs argument should be set to either 1 or the number of devices, but the code raises an error when it is not None.
I would check whether setting it to the number of devices makes it work; otherwise, try setting it to None. I do not know if this is something that has changed in torch_xla.
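
For reference, here is a minimal sketch of the constraint being described, assuming a current torch_xla build on a TPU VM (`_mp_fn` is just an illustrative entry point, not part of Accelerate):

```python
# Illustrative only: how torch_xla's xmp.spawn reacts to different nprocs
# values on a TPU VM (assumes PJRT_DEVICE=TPU and a recent torch_xla).
import torch_xla.distributed.xla_multiprocessing as xmp

def _mp_fn(index):
    # index is the local ordinal of the spawned process
    print(f"process {index} started")

if __name__ == "__main__":
    xmp.spawn(_mp_fn, args=(), nprocs=None)  # OK: one process per visible TPU device
    # xmp.spawn(_mp_fn, args=(), nprocs=1)   # OK: single-process mode
    # xmp.spawn(_mp_fn, args=(), nprocs=8)   # ValueError: Unsupported nprocs (8) ...
```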

@radna0

radna0 commented Dec 21, 2024

@tengomucho Can you update the accelerate launch code for TPU VMs to the following? xmp.spawn only accepts an nprocs of either 1 or None; None uses all of the devices.

```python
xmp.spawn(PrepareForLaunch(main_function), args=(), nprocs=1 if args.num_processes == 1 else None)
```
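
For completeness, the error message also hints at an environment-variable route. A rough, untested sketch of what that could look like inside accelerate's tpu_launcher (the TPU_NUM_DEVICES name is taken from the error message's X_NUM_DEVICES pattern and has not been verified against torch_xla's source):

```python
# Untested sketch of the env-var alternative suggested by the error message:
# keep args.num_processes, but expose it through TPU_NUM_DEVICES and let
# xmp.spawn decide the process count itself (nprocs=None).
import os

os.environ.setdefault("TPU_NUM_DEVICES", str(args.num_processes))  # e.g. "8" on a v2-8
xmp.spawn(PrepareForLaunch(main_function), args=(), nprocs=None)
```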

@tengomucho

@radna0 we can change our code, but I think the error is on the torch_xla side: according to the documentation, it should allow nprocs to be set to the number of devices.

tengomucho linked pull request #3324 on Jan 6, 2025 that will close this issue
@radna0

radna0 commented Jan 6, 2025

Yeah, I thought this too when I started using xmp.spawn from torch_xla, but the behavior has been carried over to their newer and cleaner API as well: nprocs is either 1 or None, and None just uses all devices.
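
For context, a rough sketch of that newer entry point, assuming a recent torch_xla release that exposes `torch_xla.launch` (verify against the version you have installed):

```python
# Rough sketch, assuming a recent torch_xla that provides torch_xla.launch.
# As with xmp.spawn, no explicit process count is passed; by default one
# process is started per visible TPU device.
import torch_xla

def _mp_fn(index):
    # index is the local ordinal of the spawned process
    print(f"process {index} started")

if __name__ == "__main__":
    torch_xla.launch(_mp_fn, args=())
```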

@tengomucho

@radna0 I still think they should at least update their documentation. Anyway, I opened #3324 to try to fix this, feel free to provide feedback.

@radna0

radna0 commented Jan 6, 2025

@tengomucho I think this looks good and should work fine on any TPU VMs.
