LLM model_parallelism_size #45

Open
namansaxena9 opened this issue Feb 14, 2025 · 3 comments

namansaxena9 commented Feb 14, 2025

I am setting model_parallelism_size to 2, but the LLM is still being allocated to a single GPU and I am getting a CUDA out-of-memory error. What is wrong with my configuration file?

lamorel_args:
  log_level: info
  allow_subgraph_use_whith_gradient: true
  distributed_setup_args:
    n_rl_processes: 1
    n_llm_processes: 1
  accelerate_args:
    config_file: accelerate/default_config.yaml
    machine_rank: 0
    num_machines: 1 #2
  llm_args:
    model_type: seq2seq
    model_path: t5-base
    pretrained: true
    minibatch_size: 4
    pre_encode_inputs: true
    load_in_4bit: false
    parallelism:
      use_gpu: true #false
      model_parallelism_size: 2
      synchronize_gpus_after_scoring: false
      empty_cuda_cache_after_scoring: false
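
As a quick, lamorel-independent sanity check, something like the sketch below can confirm whether the process on this machine actually sees two GPUs; it uses only standard torch introspection and is not part of lamorel's API.

import os
import torch

# Standalone check (no lamorel involved): list the GPUs this process can see.
# If only one device shows up here, model_parallelism_size: 2 has nothing to split across.
print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("Visible GPUs:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"  cuda:{i} -> {props.name}, {props.total_memory / 2**30:.1f} GiB")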

@ClementRomac
Collaborator

Hello @namansaxena9,

Could you please share how you launched your code and the stack trace?

ClementRomac self-assigned this Feb 14, 2025
@namansaxena9
Author

Stack Trace:

Error executing job with overrides: ['rl_script_args.path=/home/naman/Grounding_LLMs_with_online_RL/experiments/train_language_agent.py']
Traceback (most recent call last):
File "/home/naman/Grounding_LLMs_with_online_RL/experiments/train_language_agent.py", line 397, in main
lm_server = Caller(config_args.lamorel_args, custom_updater=PPOUpdater(),
File "/home/naman/Grounding_LLMs_with_online_RL/babyai-text/lamorel/lamorel/src/lamorel/caller.py", line 66, in init
Server(
File "/home/naman/Grounding_LLMs_with_online_RL/babyai-text/lamorel/lamorel/src/lamorel/server/server.py", line 68, in init
self.run()
File "/home/naman/Grounding_LLMs_with_online_RL/babyai-text/lamorel/lamorel/src/lamorel/server/server.py", line 134, in run
current_process_results = self._process_calls(calls_to_process)
File "/home/naman/Grounding_LLMs_with_online_RL/babyai-text/lamorel/lamorel/src/lamorel/server/server.py", line 114, in _process_calls
llm_results.append([self._updater.perform_update(**_call)])
File "/home/naman/Grounding_LLMs_with_online_RL/experiments/train_language_agent.py", line 247, in perform_update
output = self._llm_module([kwargs["scoring_module_key"], 'value'],
File "/home/naman/anaconda3/envs/dlp/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/naman/anaconda3/envs/dlp/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/home/naman/anaconda3/envs/dlp/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1643, in forward
else self._run_ddp_forward(*inputs, **kwargs)
File "/home/naman/anaconda3/envs/dlp/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1459, in _run_ddp_forward
return self.module(*inputs, **kwargs) # type: ignore[index]
File "/home/naman/anaconda3/envs/dlp/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/naman/anaconda3/envs/dlp/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/home/naman/Grounding_LLMs_with_online_RL/babyai-text/lamorel/lamorel/src/lamorel/server/llms/hf_llm.py", line 340, in forward
_outputs = self._LLM_model(**minibatch) # Get scores before softmax
File "/home/naman/anaconda3/envs/dlp/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/naman/anaconda3/envs/dlp/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/home/naman/anaconda3/envs/dlp/lib/python3.10/site-packages/transformers/models/t5/modeling_t5.py", line 1891, in forward
decoder_outputs = self.decoder(
File "/home/naman/anaconda3/envs/dlp/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/naman/anaconda3/envs/dlp/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/home/naman/anaconda3/envs/dlp/lib/python3.10/site-packages/transformers/models/t5/modeling_t5.py", line 1124, in forward
layer_outputs = layer_module(
File "/home/naman/anaconda3/envs/dlp/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/naman/anaconda3/envs/dlp/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/home/naman/anaconda3/envs/dlp/lib/python3.10/site-packages/transformers/models/t5/modeling_t5.py", line 699, in forward
cross_attention_outputs = self.layer[1](
File "/home/naman/anaconda3/envs/dlp/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/naman/anaconda3/envs/dlp/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/home/naman/anaconda3/envs/dlp/lib/python3.10/site-packages/transformers/models/t5/modeling_t5.py", line 629, in forward
attention_output = self.EncDecAttention(
File "/home/naman/anaconda3/envs/dlp/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/naman/anaconda3/envs/dlp/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/home/naman/anaconda3/envs/dlp/lib/python3.10/site-packages/transformers/models/t5/modeling_t5.py", line 505, in forward
value_states = self.v(current_states)
File "/home/naman/anaconda3/envs/dlp/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/naman/anaconda3/envs/dlp/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/home/naman/anaconda3/envs/dlp/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 125, in forward
return F.linear(input, self.weight, self.bias)
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 23.68 GiB of which 10.06 MiB is free. Process 1808195 has 792.00 MiB memory in use. Process 318585 has 3.28 GiB memory in use. Including non-PyTorch memory, this process has 19.55 GiB memory in use. Of the allocated memory 18.73 GiB is allocated by PyTorch, and 572.71 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

0%| | 0/4 [00:46<?, ?it/s]
Error executing job with overrides: ['rl_script_args.path=/home/naman/Grounding_LLMs_with_online_RL/experiments/train_language_agent.py']
Traceback (most recent call last):
File "/home/naman/Grounding_LLMs_with_online_RL/experiments/train_language_agent.py", line 498, in main
run_agent(config_args.rl_script_args, algo, id_expe)
File "/home/naman/Grounding_LLMs_with_online_RL/experiments/train_language_agent.py", line 343, in run_agent
logs = algo.update_parameters()
File "/home/naman/Grounding_LLMs_with_online_RL/experiments/agents/ppo/llm_ppo_agent.py", line 280, in update_parameters
list_dict_return = self.lm_server.update(exps_batch.prompt,
File "/home/naman/Grounding_LLMs_with_online_RL/babyai-text/lamorel/lamorel/src/lamorel/caller.py", line 116, in update
result = self.__call_model(InstructionsEnum.UPDATE, True, contexts=contexts, candidates=candidates, **kwargs)
File "/home/naman/Grounding_LLMs_with_online_RL/babyai-text/lamorel/lamorel/src/lamorel/caller.py", line 138, in __call_model
dist.broadcast_object_list(object_list=results, src=self._llm_master_process, group=self._rl_llm_group)
File "/home/naman/anaconda3/envs/dlp/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
return func(*args, **kwargs)
File "/home/naman/anaconda3/envs/dlp/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3129, in broadcast_object_list
broadcast(object_sizes_tensor, src=src, group=group)
File "/home/naman/anaconda3/envs/dlp/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
return func(*args, **kwargs)
File "/home/naman/anaconda3/envs/dlp/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2425, in broadcast
work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [127.0.1.1]:33698

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
[rank0]:[W220 19:08:30.686106283 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
W0220 19:08:30.931000 716347 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 716451 closing signal SIGTERM
E0220 19:08:31.546000 716347 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 716450) of binary: /home/naman/anaconda3/envs/dlp/bin/python
Error executing job with overrides: ['rl_script_args.path=/home/naman/Grounding_LLMs_with_online_RL/experiments/train_language_agent.py']
Traceback (most recent call last):
File "/home/naman/Grounding_LLMs_with_online_RL/babyai-text/lamorel/lamorel/src/lamorel_launcher/launch.py", line 46, in main
launch_command(accelerate_args)
File "/home/naman/anaconda3/envs/dlp/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1159, in launch_command
multi_gpu_launcher(args)
File "/home/naman/anaconda3/envs/dlp/lib/python3.10/site-packages/accelerate/commands/launch.py", line 793, in multi_gpu_launcher
distrib_run.run(args)
File "/home/naman/anaconda3/envs/dlp/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run
elastic_launch(
File "/home/naman/anaconda3/envs/dlp/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/naman/anaconda3/envs/dlp/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

/home/naman/Grounding_LLMs_with_online_RL/experiments/train_language_agent.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2025-02-20_19:08:30
host : gaoqitong-exxact
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 716450)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

Command used to launch code:

nohup python -m lamorel_launcher.launch --config-path /home/naman/Grounding_LLMs_with_online_RL/experiments/configs/ --config-name local_gpu_config rl_script_args.path=/home/naman/Grounding_LLMs_with_online_RL/experiments/train_language_agent.py > output_train.out
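
As a side note, the two environment variables suggested in the traces can be set before anything touches CUDA; a minimal sketch, assuming it is placed at the very top of train_language_agent.py (exporting the variables in the shell before the nohup command works just as well):

import os

# Suggested by the error messages above. Note that expandable_segments only
# mitigates allocator fragmentation; it does not fix the model being placed
# on a single GPU.
os.environ.setdefault("HYDRA_FULL_ERROR", "1")
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")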

@ClementRomac
Collaborator

Hey, thank you.

Your crash seems to happen during the forward pass with require_grad=True.
Lamorel should have logged information in your output_train.out; would you mind sharing it? It should show the devices used by each process and explain why the model was put on a single GPU instead of two.
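
For comparison only, and independent of lamorel's own placement logic, the sketch below shows what a two-GPU split of t5-base looks like in plain Hugging Face terms; the max_memory cap on GPU 0 is an artificial assumption used to force a split that is visible even for a small model (requires accelerate to be installed).

from transformers import T5ForConditionalGeneration

# Standalone illustration: spread t5-base over two GPUs via device_map="auto".
model = T5ForConditionalGeneration.from_pretrained(
    "t5-base",
    device_map="auto",
    max_memory={0: "300MiB", 1: "20GiB"},  # cap GPU 0 so part of the model lands on GPU 1
)
print(model.hf_device_map)                     # module -> device mapping
print({p.device for p in model.parameters()})  # should include both cuda:0 and cuda:1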
