LLM model_parallelism_size #45
Comments
Hello @namansaxena9, could you please share how you launched your code, along with the stack trace?
Stack Trace:
Error executing job with overrides: ['rl_script_args.path=/home/naman/Grounding_LLMs_with_online_RL/experiments/train_language_agent.py']
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
0%| | 0/4 [00:46<?, ?it/s]
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
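The trace itself suggests re-running with HYDRA_FULL_ERROR=1 to get the complete traceback. A minimal sketch of such a re-run, assuming the standard lamorel launcher; the config path and config name below are placeholders, not values taken from this issue:

```bash
# Assumption: standard lamorel launcher; --config-path and --config-name are placeholders.
HYDRA_FULL_ERROR=1 python -m lamorel_launcher.launch \
    --config-path /path/to/experiments/configs \
    --config-name local_gpu_config \
    rl_script_args.path=/home/naman/Grounding_LLMs_with_online_RL/experiments/train_language_agent.py
```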
Hey, thank you. Your crash seems to happen when doing the forward pass with
I am setting model_parallelism_size to 2, but the LLM is still being allocated to a single GPU and I am getting a CUDA out-of-memory error. What is wrong with my configuration file?
lamorel_args:
  log_level: info
  allow_subgraph_use_whith_gradient: true
  distributed_setup_args:
    n_rl_processes: 1
    n_llm_processes: 1
  accelerate_args:
    config_file: accelerate/default_config.yaml
    machine_rank: 0
    num_machines: 1 #2
  llm_args:
    model_type: seq2seq
    model_path: t5-base
    pretrained: true
    minibatch_size: 4
    pre_encode_inputs: true
    load_in_4bit: false
    parallelism:
      use_gpu: true #false
      model_parallelism_size: 2
      synchronize_gpus_after_scoring: false
      empty_cuda_cache_after_scoring: false
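One way to see whether model parallelism actually took effect is to check which CUDA devices the model's weights end up on. Below is a minimal sketch, assuming t5-base is loaded via transformers with a "balanced" device map that splits layers across the visible GPUs; this loading call is an illustration, not lamorel's internal code path:

```python
# Sketch: check how many CUDA devices the LLM's weights actually occupy.
# Assumptions: t5-base loaded via transformers with a "balanced" device map
# (splits layers across visible GPUs); lamorel's own loading code may differ.
import torch
from transformers import AutoModelForSeq2SeqLM

print("Visible GPUs:", torch.cuda.device_count())

model = AutoModelForSeq2SeqLM.from_pretrained("t5-base", device_map="balanced")

devices = {str(p.device) for p in model.parameters()}
print("Weights placed on:", devices)  # more than one device => the model is split across GPUs
```

If this prints only one device while two GPUs are visible, the out-of-memory error is coming from a single-GPU placement rather than from the model itself being too large for the combined memory.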