another reformat
jomayeri committed Aug 28, 2024
1 parent 058bf37 commit d17f5a5
Showing 1 changed file with 3 additions and 3 deletions.
6 changes: 3 additions & 3 deletions deepnvme/zero_inference/README.md
@@ -1,7 +1,7 @@
# Using DeepNVMe for ZeRO-Inference
ZeRO-Inference is an ideal use case for the DeepNVMe technology. When you have a model that exceeds the size of available GPU memory, the [DeepNVMe](https://github.com/microsoft/DeepSpeed/blob/master/blogs/deepspeed-gds/README.md) library along with ZeRO-Inference can be leveraged for high-throughput offline inference.

-Maximizing inference in this scenario has two parts. First offloading the model parameters to fast Non-Volatile Memory, either a single device or several devices RAIDed together to further increase the effective bandwidth of the system. These parameters are then swapped into the GPU memory layer by layer to compute the forward pass for inference. This allows for the second part of the process, maximizing the batch size. By swapping in parameters layer by layer the remaining GPU memory can be used by the computational batch which leads to a maximizing of total inference throughput.
+Maximizing inference throughput (measured in tokens/sec) in this scenario has two parts. First offloading the model parameters to fast Non-Volatile Memory, either a single device or several devices RAIDed together to further increase the effective bandwidth of the system. These parameters are then swapped into the GPU memory layer by layer to compute the forward pass for inference. This allows for the second part of the process, maximizing the batch size. By swapping in parameters layer by layer the remaining GPU memory can be used by the computational batch which leads to a maximizing of total inference throughput.

## Testing Environment
The environment for these tests was a VM with NVIDIA Magnum IO<sup>TM</sup> GPUDirect® Storage (GDS) installed along with a single NVIDIA H100 GPU containing 96 GB of memory. The VM also had two NVMes each with a read bandwidth of ~6 GB/sec. The two NVMes were put into a RAID0 configuration, bringing the effective read bandwidth up to ~12 GB/sec.
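
As a rough illustration of why the RAID0 read bandwidth matters here, the sketch below estimates how long it takes to read one full copy of a model's parameters from NVMe. The helper names and the 100 GB parameter size are hypothetical and chosen only for the arithmetic; the per-device bandwidth is the approximate figure quoted above. The estimate assumes ideal RAID0 scaling and ignores any overlap between I/O and compute.

```python
def raid0_read_bandwidth(per_device_gb_s: float, num_devices: int) -> float:
    """Ideal aggregate read bandwidth of a RAID0 array, in GB/sec."""
    return per_device_gb_s * num_devices


def full_read_seconds(param_gb: float, read_gb_s: float) -> float:
    """Time in seconds to read every parameter once at the given bandwidth."""
    return param_gb / read_gb_s


# ~6 GB/sec per NVMe, two devices striped together (figures from the text above).
bandwidth = raid0_read_bandwidth(per_device_gb_s=6.0, num_devices=2)

# Hypothetical model whose parameters occupy 100 GB on disk.
seconds = full_read_seconds(param_gb=100.0, read_gb_s=bandwidth)

print(f"{bandwidth:.0f} GB/sec -> {seconds:.2f} s per full parameter read")
```

Because the parameters are swapped in layer by layer for each forward pass, a read cost of this kind is shared by every sample in the batch, which is one way to see why a larger batch size tends to raise tokens/sec.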
@@ -10,7 +10,7 @@ The environment for these tests was a VM with NVIDIA Magnum IO<sup>TM</sup> GPUD
</div>

## Initial Results
-The following models where run from the folder DeepSpeedExamples/inference/huggingface/zero_inference using disk-offload of parameters via the following command:
+The following models were run from the folder DeepSpeedExamples/inference/huggingface/zero_inference using disk-offload of parameters via the following command:

```bash
deepspeed --num_gpus 1 run_model.py --model $model_name --batch_size $bsz --prompt-len 512 --gen-len 32 --disk-offload $path_to_foler --use_gds
@@ -28,4 +28,4 @@ True | 2.321(bsz=24) | 8.876(bsz=96) | 0.293(bsz=8) |
Throughput measured in tokens/sec.

## Batch Size Differences in OPT-66B
-In 2 of the 3 model scenarios above GDS outperformed the CPU bounce buffer on throughput. In the OPT-66B scenario the CPU buffer performed better because it was able to accommodate a larger batch size (32 vs 24). This is a result of how parameter swapping is implemented when using GDS. The CPU keeps its bounce buffer for parameters in CPU DRAM, GDS must also keep a bounce buffer in GPU VRAM. This extra space taken up in GPU VRAM by the GDS bounce buffer has the possibility of causing an Out-of-Memory error when scaling to larger batch sizes.
+In 2 of the 3 model scenarios above GDS outperformed the CPU bounce buffer on throughput. In the OPT-66B scenario the CPU buffer performed better because it was able to accommodate a larger batch size (32 vs 24). This is a result of how parameter swapping is implemented when using GDS. The CPU keeps its bounce buffer for parameters in CPU DRAM; GDS also keeps a bounce buffer to swap parameters into, and it keeps that buffer in GPU memory. This extra space taken up in GPU VRAM by the GDS bounce buffer has the possibility of causing an Out-of-Memory error when scaling to larger batch sizes.
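
The batch-size trade-off described above can be sketched as a simple GPU memory budget. This is only an illustration: `max_batch_size` is a hypothetical helper, not a DeepSpeed API, and the resident-layer, per-sample, and bounce-buffer sizes are made-up placeholders rather than measurements from these runs. Only the 96 GB GPU memory figure comes from the test environment.

```python
def max_batch_size(gpu_gb: float,
                   resident_gb: float,
                   per_sample_gb: float,
                   gpu_bounce_buffer_gb: float = 0.0) -> int:
    """Largest batch that fits after fixed GPU allocations are subtracted.

    gpu_gb:               total GPU memory
    resident_gb:          memory held by the layer(s) currently swapped in
    per_sample_gb:        activation / KV-cache memory per sample in the batch
    gpu_bounce_buffer_gb: extra buffer kept in GPU memory (0 when the
                          bounce buffer lives in CPU DRAM instead)
    """
    free_gb = gpu_gb - resident_gb - gpu_bounce_buffer_gb
    return max(0, int(free_gb // per_sample_gb))


# Hypothetical sizes, for illustration only.
cpu_bounce = max_batch_size(gpu_gb=96.0, resident_gb=8.0, per_sample_gb=2.5)
gds = max_batch_size(gpu_gb=96.0, resident_gb=8.0, per_sample_gb=2.5,
                     gpu_bounce_buffer_gb=16.0)

print(f"CPU bounce buffer: bsz <= {cpu_bounce}")
print(f"GDS bounce buffer: bsz <= {gds}")
```

With these placeholder values the GPU-resident bounce buffer lowers the largest batch that fits, which is the same direction as the 32 vs 24 difference reported for OPT-66B; the actual sizes involved are not given in this README.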
