Out-of-Memory (OOM) Error with CPU Offload Using ZeRO Stage 3 #7021

Open
lorenaromerom02 opened this issue Feb 10, 2025 · 0 comments
Labels
bug (Something isn't working), inference

Comments

@lorenaromerom02
lorenaromerom02 commented Feb 10, 2025

Hello,
I am encountering an out-of-memory (OOM) error while training a model with DeepSpeed's ZeRO Stage 3 and CPU offloading enabled. I am using 2 GPUs, but I still run out of memory.

This is the script:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig
import deepspeed
from accelerate.utils.deepspeed import HfDeepSpeedConfig
import gc
import os
import torch.distributed as dist

# Load the model and tokenizer
model_name = "meta-llama/Llama-2-7b-hf"
generate_batch_size = 100
no_batching = False
zero_optimization = True
offload_cpu = True

torch.cuda.empty_cache()
gc.collect()

local_rank = int(os.environ.get('LOCAL_RANK', "0"))
world_size = int(os.environ.get('WORLD_SIZE', "1"))

# torch.distributed.init_process_group(backend='nccl', init_method='env://')
deepspeed.init_distributed(dist_backend='nccl')

rank = dist.get_rank()

print(f"Rank: {rank}, Local Rank: {local_rank}, World Size: {world_size}")
print(f"Loading model {model_name}")

tokenizer = AutoTokenizer.from_pretrained(model_name)
dtype = torch.float16  # with codellama, use torch.bfloat16

config = AutoConfig.from_pretrained(model_name)

if hasattr(config, 'hidden_size'):
    model_hidden_size = config.hidden_size
else:
    model_hidden_size = 2048

ds_config = {
    "fp16": {
        "enabled": dtype == torch.float16,
    },
    "steps_per_print": 2000,
    "train_batch_size": 1 * world_size,
    "train_micro_batch_size_per_gpu": 1,
    "wall_clock_breakdown": False,
}

if zero_optimization:
    ds_config["zero_optimization"] = {
        "stage": 3,
        "overlap_comm": True,
        "contiguous_gradients": True,
        "sub_group_size": 1e9,
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "gather_16bit_weights_on_model_save": True,
        "reduce_bucket_size": model_hidden_size * model_hidden_size,
        "stage3_prefetch_bucket_size": int(0.9 * model_hidden_size * model_hidden_size),
        "stage3_param_persistence_threshold": 100 * model_hidden_size,
    }

if offload_cpu:
    ds_config["zero_optimization"]["offload_param"] = dict(device="cpu", pin_memory=True)
    # optimizer offload is only needed for training
    ds_config["zero_optimization"]["offload_optimizer"] = dict(device="cpu", pin_memory=True)

# create the DeepSpeed config object before loading the model
dschf = HfDeepSpeedConfig(ds_config)

torch.cuda.empty_cache()
gc.collect()
deepspeed.runtime.utils.see_memory_usage("pre-from-pretrained", force=True)

model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=dtype, config=config)

deepspeed.runtime.utils.see_memory_usage("post-from-pretrained", force=True)

ds_engine = deepspeed.initialize(model=model, config_params=ds_config)[0]

dist.destroy_process_group()
```
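
For reference, here is a minimal diagnostic sketch (not part of the script above) to check whether the weights are actually partitioned after `from_pretrained`. It assumes the usual ZeRO-3 behaviour, where partitioned parameters carry DeepSpeed metadata such as `ds_id` and keep an empty local tensor:

```python
# Hypothetical diagnostic, not part of the original script. Assumes ZeRO-3
# attaches a ds_id attribute to partitioned parameters and leaves their local
# storage empty (numel() == 0) while the full weights live off-GPU.
def report_partitioning(model):
    params = list(model.named_parameters())
    with_ds_meta = sum(hasattr(p, "ds_id") for _, p in params)
    empty_local = sum(p.numel() == 0 for _, p in params)
    print(f"{with_ds_meta}/{len(params)} parameters have DeepSpeed metadata, "
          f"{empty_local}/{len(params)} have empty local storage")

# e.g. call right after AutoModelForCausalLM.from_pretrained(...):
# report_partitioning(model)
```

If ZeRO-3 partitioning and CPU offload were taking effect at load time, most parameters should show empty local storage here.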

Packages and versions used
torch: 2.5.1
transformers: 4.33.0.dev0
deepspeed: 0.16.1
accelerate: 1.1.1

Running the script
deepspeed --num_gpus 2 llama_inference.py

ds_report output
DeepSpeed C++/CUDA extension op report

NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.

JIT compiled ops requires ninja
ninja .................. [OKAY]

op name ................ installed .. compatible

async_io ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
[WARNING] NVIDIA Inference is only supported on Ampere and newer architectures
[WARNING] FP Quantizer is using an untested triton version (3.1.0), only 2.3.(0, 1) and 3.0.0 are known to be compatible with these kernels
fp_quantizer ........... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
x86_64-linux-gnu-gcc -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -c /tmp/tmp9xahzd5h/test.c -o /tmp/tmp9xahzd5h/test.o
x86_64-linux-gnu-gcc /tmp/tmp9xahzd5h/test.o -L/usr/local/cuda -L/usr/local/cuda/lib64 -lcufile -o /tmp/tmp9xahzd5h/a.out
x86_64-linux-gnu-gcc -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -c /tmp/tmpr9qj7tb_/test.c -o /tmp/tmpr9qj7tb_/test.o
x86_64-linux-gnu-gcc /tmp/tmpr9qj7tb_/test.o -laio -o /tmp/tmpr9qj7tb_/a.out
gds .................... [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.5
[WARNING] using untested triton version (3.1.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]

DeepSpeed general environment info:
torch install path ............... ['/home/lorenaromerom/TFM/env/lib/python3.10/site-packages/torch']
torch version .................... 2.5.1+cu124
deepspeed install path ........... ['/home/lorenaromerom/TFM/env/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.16.1, unknown, unknown
torch cuda version ............... 12.4
torch hip version ................ None
nvcc version ..................... 12.4
deepspeed wheel compiled w. ...... torch 2.5, cuda 12.4
shared memory (/dev/shm) size .... 62.82 GB

System info:

  • OS: Ubuntu 22.04.5 LTS (Jammy Jellyfish)
  • GPU count and types: 2 × NVIDIA GeForce RTX 2070
  • Python version: 3.10.12

Output

[2025-02-10 07:33:16,718] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/home/lorenaromerom/TFM/env/lib/python3.10/site-packages/transformers/utils/generic.py:260: FutureWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
torch.utils._pytree._register_pytree_node(
/home/lorenaromerom/TFM/env/lib/python3.10/site-packages/transformers/utils/generic.py:260: FutureWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
torch.utils._pytree._register_pytree_node(
[2025-02-10 07:33:18,745] [WARNING] [runner.py:215:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2025-02-10 07:33:18,745] [INFO] [runner.py:607:main] cmd = /home/lorenaromerom/TFM/env/bin/python3 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None llama_inference.py
[2025-02-10 07:33:20,277] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/home/lorenaromerom/TFM/env/lib/python3.10/site-packages/transformers/utils/generic.py:260: FutureWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
torch.utils._pytree._register_pytree_node(
/home/lorenaromerom/TFM/env/lib/python3.10/site-packages/transformers/utils/generic.py:260: FutureWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
torch.utils._pytree._register_pytree_node(
[2025-02-10 07:33:22,275] [INFO] [launch.py:146:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2025-02-10 07:33:22,275] [INFO] [launch.py:152:main] nnodes=1, num_local_procs=2, node_rank=0
[2025-02-10 07:33:22,275] [INFO] [launch.py:163:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2025-02-10 07:33:22,275] [INFO] [launch.py:164:main] dist_world_size=2
[2025-02-10 07:33:22,275] [INFO] [launch.py:168:main] Setting CUDA_VISIBLE_DEVICES=0,1
[2025-02-10 07:33:22,276] [INFO] [launch.py:256:main] process 2116656 spawned with command: ['/home/lorenaromerom/TFM/env/bin/python3', '-u', 'llama_inference.py', '--local_rank=0']
[2025-02-10 07:33:22,277] [INFO] [launch.py:256:main] process 2116657 spawned with command: ['/home/lorenaromerom/TFM/env/bin/python3', '-u', 'llama_inference.py', '--local_rank=1']
[2025-02-10 07:33:24,137] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-02-10 07:33:24,197] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/home/lorenaromerom/TFM/env/lib/python3.10/site-packages/transformers/utils/generic.py:260: FutureWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
torch.utils._pytree._register_pytree_node(
/home/lorenaromerom/TFM/env/lib/python3.10/site-packages/transformers/utils/generic.py:260: FutureWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
torch.utils._pytree._register_pytree_node(
/home/lorenaromerom/TFM/env/lib/python3.10/site-packages/transformers/utils/generic.py:260: FutureWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
torch.utils._pytree._register_pytree_node(
/home/lorenaromerom/TFM/env/lib/python3.10/site-packages/transformers/utils/generic.py:260: FutureWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
torch.utils._pytree._register_pytree_node(
[2025-02-10 07:33:25,978] [INFO] [comm.py:652:init_distributed] cdb=None
[2025-02-10 07:33:26,075] [INFO] [comm.py:652:init_distributed] cdb=None
[2025-02-10 07:33:26,076] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Rank: 1, Local Rank: 1, World Size: 2
Loading model meta-llama/Llama-2-7b-hf
/home/lorenaromerom/TFM/env/lib/python3.10/site-packages/huggingface_hub/file_download.py:797: FutureWarning: resume_download is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use force_download=True.
warnings.warn(
Rank: 0, Local Rank: 0, World Size: 2
Loading model meta-llama/Llama-2-7b-hf
/home/lorenaromerom/TFM/env/lib/python3.10/site-packages/huggingface_hub/file_download.py:797: FutureWarning: resume_download is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use force_download=True.
warnings.warn(
[2025-02-10 07:33:30,491] [INFO] [utils.py:781:see_memory_usage] pre-from-pretrained
[2025-02-10 07:33:30,492] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.0 GB Max_CA 0 GB
[2025-02-10 07:33:30,492] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 5.58 GB, percent = 4.4%
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00, 1.07it/s]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00, 1.01it/s]
[2025-02-10 07:34:25,339] [INFO] [utils.py:781:see_memory_usage] post-from-pretrained
[2025-02-10 07:34:25,340] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.0 GB Max_CA 0 GB
[2025-02-10 07:34:25,340] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 31.0 GB, percent = 24.7%
[2025-02-10 07:34:25,340] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed info: version=0.16.1, git-hash=unknown, git-branch=unknown
[2025-02-10 07:34:25,340] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 2
[2025-02-10 07:34:25,384] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 2
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/lorenaromerom/TFM/llama_inference.py", line 83, in
[rank0]: ds_engine = deepspeed.initialize(model=model, config_params = ds_config)[0]
[rank0]: File "/home/lorenaromerom/TFM/env/lib/python3.10/site-packages/deepspeed/init.py", line 193, in initialize
[rank0]: engine = DeepSpeedEngine(args=args,
[rank0]: File "/home/lorenaromerom/TFM/env/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 271, in init
[rank0]: self._configure_distributed_model(model)
[rank0]: File "/home/lorenaromerom/TFM/env/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1163, in _configure_distributed_model
[rank0]: self.module.to(self.device)
[rank0]: File "/home/lorenaromerom/TFM/env/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2052, in to
[rank0]: return super().to(*args, **kwargs)
[rank0]: File "/home/lorenaromerom/TFM/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1340, in to
[rank0]: return self._apply(convert)
[rank0]: File "/home/lorenaromerom/TFM/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 900, in _apply
[rank0]: module._apply(fn)
[rank0]: File "/home/lorenaromerom/TFM/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 900, in _apply
[rank0]: module._apply(fn)
[rank0]: File "/home/lorenaromerom/TFM/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 900, in _apply
[rank0]: module._apply(fn)
[rank0]: [Previous line repeated 2 more times]
[rank0]: File "/home/lorenaromerom/TFM/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 927, in _apply
[rank0]: param_applied = fn(param)
[rank0]: File "/home/lorenaromerom/TFM/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1326, in convert
[rank0]: return t.to(
[rank0]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 86.00 MiB. GPU 0 has a total capacity of 7.61 GiB of which 19.38 MiB is free. Process 2116657 has 114.00 MiB memory in use. Including non-PyTorch memory, this process has 7.47 GiB memory in use. Of the allocated memory 7.36 GiB is allocated by PyTorch, and 1.71 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank1]: Traceback (most recent call last):
[rank1]: File "/home/lorenaromerom/TFM/llama_inference.py", line 83, in
[rank1]: ds_engine = deepspeed.initialize(model=model, config_params = ds_config)[0]
[rank1]: File "/home/lorenaromerom/TFM/env/lib/python3.10/site-packages/deepspeed/init.py", line 193, in initialize
[rank1]: engine = DeepSpeedEngine(args=args,
[rank1]: File "/home/lorenaromerom/TFM/env/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 271, in init
[rank1]: self._configure_distributed_model(model)
[rank1]: File "/home/lorenaromerom/TFM/env/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1163, in _configure_distributed_model
[rank1]: self.module.to(self.device)
[rank1]: File "/home/lorenaromerom/TFM/env/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2052, in to
[rank1]: return super().to(*args, **kwargs)
[rank1]: File "/home/lorenaromerom/TFM/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1340, in to
[rank1]: return self._apply(convert)
[rank1]: File "/home/lorenaromerom/TFM/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 900, in _apply
[rank1]: module._apply(fn)
[rank1]: File "/home/lorenaromerom/TFM/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 900, in _apply
[rank1]: module._apply(fn)
[rank1]: File "/home/lorenaromerom/TFM/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 900, in _apply
[rank1]: module._apply(fn)
[rank1]: [Previous line repeated 2 more times]
[rank1]: File "/home/lorenaromerom/TFM/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 927, in _apply
[rank1]: param_applied = fn(param)
[rank1]: File "/home/lorenaromerom/TFM/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1326, in convert
[rank1]: return t.to(
[rank1]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 32.00 MiB. GPU 1 has a total capacity of 7.61 GiB of which 2.12 MiB is free. Including non-PyTorch memory, this process has 7.60 GiB memory in use. Of the allocated memory 7.51 GiB is allocated by PyTorch, and 1.69 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank0]:[W210 07:34:27.428226840 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
[2025-02-10 07:34:28,351] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 2116656
[2025-02-10 07:34:28,351] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 2116657
[2025-02-10 07:34:28,459] [ERROR] [launch.py:325:sigkill_handler] ['/home/lorenaromerom/TFM/env/bin/python3', '-u', 'llama_inference.py', '--local_rank=1'] exits with return code = 1

lorenaromerom02 added the bug (Something isn't working) and inference labels on Feb 10, 2025