Hello,
I am encountering an out-of-memory (OOM) error while setting up a model for inference with DeepSpeed's ZeRO Stage 3 and CPU offloading enabled. Even though I am running on 2 GPUs, both ranks still hit OOM.
This is the script:
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig
import deepspeed
from accelerate.utils.deepspeed import HfDeepSpeedConfig
import gc
import os
import torch.distributed as dist
# Load the model and tokenizer
model_name = "meta-llama/Llama-2-7b-hf"
generate_batch_size = 100
no_batching = False
zero_optimization = True
offload_cpu = True
torch.cuda.empty_cache()
gc.collect()
local_rank = int(os.environ.get('LOCAL_RANK', "0"))
world_size = int(os.environ.get('WORLD_SIZE', "1"))
#torch.distributed.init_process_group(backend='nccl', init_method='env://')
deepspeed.init_distributed(dist_backend='nccl')
rank = dist.get_rank()
print(f"Rank: {rank}, Local Rank: {local_rank}, World Size: {world_size}")
print(f"Loading model {model_name}")
tokenizer = AutoTokenizer.from_pretrained(model_name)
dtype = torch.float16  # use torch.bfloat16 with CodeLlama
config = AutoConfig.from_pretrained(model_name)
if hasattr(config, 'hidden_size'):
    model_hidden_size = config.hidden_size
else:
    model_hidden_size = 2048
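# For Llama-2-7b, config.hidden_size is 4096, so the bucket sizes derived from
# it below come out to reduce_bucket_size = 4096 * 4096 ≈ 16.8M elements.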
ds_config = {
    "fp16": {
        "enabled": dtype == torch.float16,
    },
    "steps_per_print": 2000,
    "train_batch_size": 1 * world_size,
    "train_micro_batch_size_per_gpu": 1,
    "wall_clock_breakdown": False,
}
if zero_optimization:
    ds_config["zero_optimization"] = {
        "stage": 3,
        "overlap_comm": True,
        "contiguous_gradients": True,
        "sub_group_size": 1e9,
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "gather_16bit_weights_on_model_save": True,
        "reduce_bucket_size": model_hidden_size * model_hidden_size,
        "stage3_prefetch_bucket_size": int(0.9 * model_hidden_size * model_hidden_size),
        "stage3_param_persistence_threshold": 100 * model_hidden_size,
    }
if offload_cpu:
    ds_config["zero_optimization"]["offload_param"] = dict(device="cpu", pin_memory=True)
    # Optimizer offload disabled: it is only needed for training.
    # ds_config["zero_optimization"]["offload_optimizer"] = dict(device="cpu", pin_memory=True)
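# Keeping the HfDeepSpeedConfig object alive *before* from_pretrained is what
# tells transformers to load the checkpoint straight into ZeRO-3 partitions
# (zero.Init) instead of materializing the full model on every rank.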
dschf = HfDeepSpeedConfig(ds_config)
torch.cuda.empty_cache()
gc.collect()
deepspeed.runtime.utils.see_memory_usage("pre-from-pretrained", force=True)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=dtype, config=config)
deepspeed.runtime.utils.see_memory_usage("post-from-pretrained", force=True)
ds_engine = deepspeed.initialize(model=model, config_params=ds_config)[0]
dist.destroy_process_group()
```
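The traceback below points at line 83 of llama_inference.py, so the snippet above omits the generation code that follows deepspeed.initialize. For context, here is a minimal sketch of how the returned engine would typically be used for ZeRO-3 inference; it is illustrative only (the prompt and generation arguments are hypothetical, not taken from the original script) and would sit before dist.destroy_process_group():

```python
# Hypothetical usage sketch; not part of the reported script.
ds_engine.module.eval()
prompt = "Hello, my name is"  # placeholder input
inputs = tokenizer(prompt, return_tensors="pt").to(f"cuda:{local_rank}")
with torch.no_grad():
    # synced_gpus=True keeps all ranks stepping through generate() together,
    # which ZeRO-3 needs because parameters are gathered collectively.
    outputs = ds_engine.module.generate(**inputs, max_new_tokens=32, synced_gpus=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Note that in the reported run the failure happens before any of this executes, inside deepspeed.initialize itself.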
Packages and versions used
torch: 2.5.1
transformers: 4.33.0.dev0
deepspeed: 0.16.1
accelerate: 1.1.1
Running the script
deepspeed --num_gpus 2 llama_inference.py
ds_report output
DeepSpeed C++/CUDA extension op report
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
JIT compiled ops requires ninja
ninja .................. [OKAY]
op name ................ installed .. compatible
async_io ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
[WARNING] NVIDIA Inference is only supported on Ampere and newer architectures
[WARNING] FP Quantizer is using an untested triton version (3.1.0), only 2.3.(0, 1) and 3.0.0 are known to be compatible with these kernels
fp_quantizer ........... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
x86_64-linux-gnu-gcc -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -c /tmp/tmp9xahzd5h/test.c -o /tmp/tmp9xahzd5h/test.o
x86_64-linux-gnu-gcc /tmp/tmp9xahzd5h/test.o -L/usr/local/cuda -L/usr/local/cuda/lib64 -lcufile -o /tmp/tmp9xahzd5h/a.out
x86_64-linux-gnu-gcc -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -c /tmp/tmpr9qj7tb_/test.c -o /tmp/tmpr9qj7tb_/test.o
x86_64-linux-gnu-gcc /tmp/tmpr9qj7tb_/test.o -laio -o /tmp/tmpr9qj7tb_/a.out
gds .................... [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.5
[WARNING] using untested triton version (3.1.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
DeepSpeed general environment info:
torch install path ............... ['/home/lorenaromerom/TFM/env/lib/python3.10/site-packages/torch']
torch version .................... 2.5.1+cu124
deepspeed install path ........... ['/home/lorenaromerom/TFM/env/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.16.1, unknown, unknown
torch cuda version ............... 12.4
torch hip version ................ None
nvcc version ..................... 12.4
deepspeed wheel compiled w. ...... torch 2.5, cuda 12.4
shared memory (/dev/shm) size .... 62.82 GB
System info:
OS: Ubuntu 22.04.5 LTS (Jammy Jellyfish)
GPU count and types: 2 × NVIDIA GeForce RTX 2070
Python version: 3.10.12
Output
[2025-02-10 07:33:16,718] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/home/lorenaromerom/TFM/env/lib/python3.10/site-packages/transformers/utils/generic.py:260: FutureWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  torch.utils._pytree._register_pytree_node(
/home/lorenaromerom/TFM/env/lib/python3.10/site-packages/transformers/utils/generic.py:260: FutureWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  torch.utils._pytree._register_pytree_node(
[2025-02-10 07:33:18,745] [WARNING] [runner.py:215:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2025-02-10 07:33:18,745] [INFO] [runner.py:607:main] cmd = /home/lorenaromerom/TFM/env/bin/python3 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None llama_inference.py
[2025-02-10 07:33:20,277] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/home/lorenaromerom/TFM/env/lib/python3.10/site-packages/transformers/utils/generic.py:260: FutureWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  torch.utils._pytree._register_pytree_node(
/home/lorenaromerom/TFM/env/lib/python3.10/site-packages/transformers/utils/generic.py:260: FutureWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  torch.utils._pytree._register_pytree_node(
[2025-02-10 07:33:22,275] [INFO] [launch.py:146:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2025-02-10 07:33:22,275] [INFO] [launch.py:152:main] nnodes=1, num_local_procs=2, node_rank=0
[2025-02-10 07:33:22,275] [INFO] [launch.py:163:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2025-02-10 07:33:22,275] [INFO] [launch.py:164:main] dist_world_size=2
[2025-02-10 07:33:22,275] [INFO] [launch.py:168:main] Setting CUDA_VISIBLE_DEVICES=0,1
[2025-02-10 07:33:22,276] [INFO] [launch.py:256:main] process 2116656 spawned with command: ['/home/lorenaromerom/TFM/env/bin/python3', '-u', 'llama_inference.py', '--local_rank=0']
[2025-02-10 07:33:22,277] [INFO] [launch.py:256:main] process 2116657 spawned with command: ['/home/lorenaromerom/TFM/env/bin/python3', '-u', 'llama_inference.py', '--local_rank=1']
[2025-02-10 07:33:24,137] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-02-10 07:33:24,197] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/home/lorenaromerom/TFM/env/lib/python3.10/site-packages/transformers/utils/generic.py:260: FutureWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  torch.utils._pytree._register_pytree_node(
/home/lorenaromerom/TFM/env/lib/python3.10/site-packages/transformers/utils/generic.py:260: FutureWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  torch.utils._pytree._register_pytree_node(
/home/lorenaromerom/TFM/env/lib/python3.10/site-packages/transformers/utils/generic.py:260: FutureWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  torch.utils._pytree._register_pytree_node(
/home/lorenaromerom/TFM/env/lib/python3.10/site-packages/transformers/utils/generic.py:260: FutureWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  torch.utils._pytree._register_pytree_node(
[2025-02-10 07:33:25,978] [INFO] [comm.py:652:init_distributed] cdb=None
[2025-02-10 07:33:26,075] [INFO] [comm.py:652:init_distributed] cdb=None
[2025-02-10 07:33:26,076] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Rank: 1, Local Rank: 1, World Size: 2
Loading model meta-llama/Llama-2-7b-hf
/home/lorenaromerom/TFM/env/lib/python3.10/site-packages/huggingface_hub/file_download.py:797: FutureWarning: resume_download is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use force_download=True.
  warnings.warn(
Rank: 0, Local Rank: 0, World Size: 2
Loading model meta-llama/Llama-2-7b-hf
/home/lorenaromerom/TFM/env/lib/python3.10/site-packages/huggingface_hub/file_download.py:797: FutureWarning: resume_download is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use force_download=True.
  warnings.warn(
[2025-02-10 07:33:30,491] [INFO] [utils.py:781:see_memory_usage] pre-from-pretrained
[2025-02-10 07:33:30,492] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.0 GB Max_CA 0 GB
[2025-02-10 07:33:30,492] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 5.58 GB, percent = 4.4%
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00, 1.07it/s]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00, 1.01it/s]
[2025-02-10 07:34:25,339] [INFO] [utils.py:781:see_memory_usage] post-from-pretrained
[2025-02-10 07:34:25,340] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.0 GB Max_CA 0 GB
[2025-02-10 07:34:25,340] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 31.0 GB, percent = 24.7%
[2025-02-10 07:34:25,340] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed info: version=0.16.1, git-hash=unknown, git-branch=unknown
[2025-02-10 07:34:25,340] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 2
[2025-02-10 07:34:25,384] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 2
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/lorenaromerom/TFM/llama_inference.py", line 83, in <module>
[rank0]: ds_engine = deepspeed.initialize(model=model, config_params = ds_config)[0]
[rank0]: File "/home/lorenaromerom/TFM/env/lib/python3.10/site-packages/deepspeed/__init__.py", line 193, in initialize
[rank0]: engine = DeepSpeedEngine(args=args,
[rank0]: File "/home/lorenaromerom/TFM/env/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 271, in __init__
[rank0]: self._configure_distributed_model(model)
[rank0]: File "/home/lorenaromerom/TFM/env/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1163, in _configure_distributed_model
[rank0]: self.module.to(self.device)
[rank0]: File "/home/lorenaromerom/TFM/env/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2052, in to
[rank0]: return super().to(*args, **kwargs)
[rank0]: File "/home/lorenaromerom/TFM/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1340, in to
[rank0]: return self._apply(convert)
[rank0]: File "/home/lorenaromerom/TFM/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 900, in _apply
[rank0]: module._apply(fn)
[rank0]: File "/home/lorenaromerom/TFM/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 900, in _apply
[rank0]: module._apply(fn)
[rank0]: File "/home/lorenaromerom/TFM/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 900, in _apply
[rank0]: module._apply(fn)
[rank0]: [Previous line repeated 2 more times]
[rank0]: File "/home/lorenaromerom/TFM/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 927, in _apply
[rank0]: param_applied = fn(param)
[rank0]: File "/home/lorenaromerom/TFM/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1326, in convert
[rank0]: return t.to(
[rank0]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 86.00 MiB. GPU 0 has a total capacity of 7.61 GiB of which 19.38 MiB is free. Process 2116657 has 114.00 MiB memory in use. Including non-PyTorch memory, this process has 7.47 GiB memory in use. Of the allocated memory 7.36 GiB is allocated by PyTorch, and 1.71 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank1]: Traceback (most recent call last):
[rank1]: File "/home/lorenaromerom/TFM/llama_inference.py", line 83, in <module>
[rank1]: ds_engine = deepspeed.initialize(model=model, config_params = ds_config)[0]
[rank1]: File "/home/lorenaromerom/TFM/env/lib/python3.10/site-packages/deepspeed/__init__.py", line 193, in initialize
[rank1]: engine = DeepSpeedEngine(args=args,
[rank1]: File "/home/lorenaromerom/TFM/env/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 271, in __init__
[rank1]: self._configure_distributed_model(model)
[rank1]: File "/home/lorenaromerom/TFM/env/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1163, in _configure_distributed_model
[rank1]: self.module.to(self.device)
[rank1]: File "/home/lorenaromerom/TFM/env/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2052, in to
[rank1]: return super().to(*args, **kwargs)
[rank1]: File "/home/lorenaromerom/TFM/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1340, in to
[rank1]: return self._apply(convert)
[rank1]: File "/home/lorenaromerom/TFM/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 900, in _apply
[rank1]: module._apply(fn)
[rank1]: File "/home/lorenaromerom/TFM/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 900, in _apply
[rank1]: module._apply(fn)
[rank1]: File "/home/lorenaromerom/TFM/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 900, in _apply
[rank1]: module._apply(fn)
[rank1]: [Previous line repeated 2 more times]
[rank1]: File "/home/lorenaromerom/TFM/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 927, in _apply
[rank1]: param_applied = fn(param)
[rank1]: File "/home/lorenaromerom/TFM/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1326, in convert
[rank1]: return t.to(
[rank1]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 32.00 MiB. GPU 1 has a total capacity of 7.61 GiB of which 2.12 MiB is free. Including non-PyTorch memory, this process has 7.60 GiB memory in use. Of the allocated memory 7.51 GiB is allocated by PyTorch, and 1.69 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank0]:[W210 07:34:27.428226840 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
[2025-02-10 07:34:28,351] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 2116656
[2025-02-10 07:34:28,351] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 2116657
[2025-02-10 07:34:28,459] [ERROR] [launch.py:325:sigkill_handler] ['/home/lorenaromerom/TFM/env/bin/python3', '-u', 'llama_inference.py', '--local_rank=1'] exits with return code = 1