Add Splitwise implementation to vLLM #1

aashaka · 2024-02-03T01:35:19Z

No description provided.

Initialize MSCCL++ communication group if sep_prompt_token is set in ParallelConfig. Also add documentation for MSCCL++ installation.

- Add `worker_type` to differentiate prompt, token, and mixed workers
- Set a driver for the prompt machines and a driver for the token machines
- Allow broadcasts to take a group
- Set up KV cache communication using MSCCL++
- Add a test for KV cache communication

- Obtain `blocks_to_nw` when creating batches in scheduler. Coalesce blocks where possible for fast network transfers.
- Use a Sequence to Semaphore Mapper to allow for fine-grained waiting for kv-cache transfer per sequence
- Separately run prompt and token workers using the `_run_stage_workers` helper
- Populate KVCacheCommunicator for all PagedAttention modules, which allows implementation of layer-wise sends from within `attention.py`
- Populate destination rank for Sampler, which will be used as root in `gather` operations.
- Fix `tensor_model_parallel_gather`: use global rank instead of group local rank.
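The block coalescing mentioned above could look roughly like the sketch below. This is not the code from this PR: the function name `coalesce_blocks` and the `(start_block, num_blocks)` output format are assumptions made for illustration, and it assumes that a plain list of block numbers is enough to describe what needs to be sent. The idea is only that runs of consecutive block numbers can be sent as one larger transfer instead of many small ones.

```python
from typing import List, Tuple


def coalesce_blocks(block_numbers: List[int]) -> List[Tuple[int, int]]:
    """Merge block numbers into (start_block, num_blocks) runs.

    Hypothetical helper for illustration only; the scheduler in this PR
    may represent `blocks_to_nw` differently.
    """
    runs: List[Tuple[int, int]] = []
    # Sort and de-duplicate so that consecutive blocks end up adjacent.
    for block in sorted(set(block_numbers)):
        if runs and block == runs[-1][0] + runs[-1][1]:
            # The block directly follows the last run, so extend that run.
            start, count = runs[-1]
            runs[-1] = (start, count + 1)
        else:
            # There is a gap, so start a new run of length 1.
            runs.append((block, 1))
    return runs


if __name__ == "__main__":
    print(coalesce_blocks([3, 4, 5, 9, 10, 20]))
```

With the input above, blocks 3 to 5 and 9 to 10 each collapse into a single run, so six block transfers become three.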

goiri · 2024-02-03T01:37:15Z

tests/distributed/test_kvcache_comm.py

@@ -0,0 +1,41 @@
+"""Test the KV cache communication operators.
+
+Run `python test_kvcache_comm.py`.


I'm guessing there's no CI infra or test running?

Actually, they do have one. I will have to figure out how to add this test to the CI infra. There are a couple of things to work out: how to use pytest with argparse for LLMEngine, and how to set up the MSCCL++ environment in the CI infra.
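One possible way to combine pytest with argparse, sketched under assumptions: build the parser in a helper and pass it an explicit argument list, so argparse never reads the command line that pytest was started with. The helper names and the flags below are hypothetical and chosen for illustration; they are not taken from `test_kvcache_comm.py` or from the actual engine arguments.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical flags for illustration; not the real engine arguments.
    parser = argparse.ArgumentParser(description="KV cache communication test")
    parser.add_argument("--tensor-parallel-size", type=int, default=2)
    parser.add_argument("--sep-prompt-token", action="store_true")
    return parser


def make_test_args(overrides: Optional[List[str]] = None) -> argparse.Namespace:
    # Pass an explicit list so argparse does not fall back to sys.argv,
    # which under pytest holds pytest's own command-line options.
    return build_parser().parse_args(overrides or [])


def test_default_args() -> None:
    args = make_test_args(["--sep-prompt-token"])
    assert args.sep_prompt_token is True
    assert args.tensor_parallel_size == 2
```

The same helper can still back a `python test_kvcache_comm.py` style entry point by calling `build_parser().parse_args()` without a list, which reads the real command line.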

vllm/engine/llm_engine.py

vllm/worker/comm_utils.py

vllm/worker/worker.py

aashaka added 4 commits January 22, 2024 22:48

Add MSCCL++ for KV cache communication

e0a7e1c

Documentation update for Splitwise

fa05c4f

goiri reviewed Feb 3, 2024

Add comments and minor changes for code clarity

778290f

goiri approved these changes Feb 4, 2024
