[DOCS][BUG] Cannot run the examples following vllm-integration-v0.2.md and vllm-integration.md #124

maobaolong opened this issue Feb 28, 2025 · 5 comments


maobaolong commented Feb 28, 2025

Help wanted!

I tried to run the vLLM and Mooncake examples following vllm-integration-v0.2.md and vllm-integration.md, but I failed. I suspect something is wrong with either my setup or the documents.

Below is what I did and what I encountered.

Running the vLLM image with the Mooncake kv_connector config --- mooncake_vllm_adaptor is missing.

I ran the vLLM Docker image from Docker Hub with a Mooncake kv_connector config, but there is no mooncake_vllm_adaptor module in the vLLM container.

$ MOONCAKE_CONFIG_PATH=./mooncake.json VLLM_USE_MODELSCOPE=True  python3 -m vllm.entrypoints.openai.api_server --model  /disc/data1/Qwen/Qwen2.5-1.5B-Instruct --max-model-len 32768            -tp 1 --enforce-eager --served-model-name Qwen2.5-1.5B-Instruct  --port 8100 --max-model-len 10000 --gpu-memory-utilization 0.8 --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_producer","kv_rank":0,"kv_parallel_size":2,"kv_buffer_size":2e9}'
INFO 02-26 07:53:33 __init__.py:207] Automatically detected platform cuda.
INFO 02-26 07:53:33 api_server.py:911] vLLM API server version 0.7.4.dev55+ge7ef74e2.d20250224
INFO 02-26 07:53:33 api_server.py:912] args: Namespace(host=None, port=8100, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, enable_reasoning=False, reasoning_parser=None, tool_call_parser=None, tool_parser_plugin='', model='/disc/data1/Qwen/Qwen2.5-1.5B-Instruct', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=10000, guided_decoding_backend='xgrammar', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=True, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.8, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=True, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['Qwen2.5-1.5B-Instruct'], qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=KVTransferConfig(kv_connector='MooncakeConnector', kv_buffer_device='cuda', kv_buffer_size=2000000000.0, kv_role='kv_producer', kv_rank=0, 
kv_parallel_size=2, kv_ip='127.0.0.1', kv_port=14579), worker_cls='auto', generation_config=None, override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, additional_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False)
INFO 02-26 07:53:33 api_server.py:208] Started engine process with PID 6852
INFO 02-26 07:53:37 __init__.py:207] Automatically detected platform cuda.
INFO 02-26 07:53:38 config.py:560] This model supports multiple tasks: {'reward', 'embed', 'classify', 'score', 'generate'}. Defaulting to 'generate'.
WARNING 02-26 07:53:38 cuda.py:95] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
WARNING 02-26 07:53:38 config.py:696] Async output processing is not supported on the current platform type cuda.
INFO 02-26 07:53:41 config.py:560] This model supports multiple tasks: {'generate', 'embed', 'score', 'reward', 'classify'}. Defaulting to 'generate'.
WARNING 02-26 07:53:41 cuda.py:95] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
WARNING 02-26 07:53:41 config.py:696] Async output processing is not supported on the current platform type cuda.
INFO 02-26 07:53:41 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.4.dev55+ge7ef74e2.d20250224) with config: model='/disc/data1/Qwen/Qwen2.5-1.5B-Instruct', speculative_config=None, tokenizer='/disc/data1/Qwen/Qwen2.5-1.5B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=10000, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=Qwen2.5-1.5B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=False, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[],"max_capture_size":0}, use_cached_outputs=True, 
INFO 02-26 07:53:42 cuda.py:229] Using Flash Attention backend.
INFO 02-26 07:53:44 parallel_state.py:948] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 02-26 07:53:44 simple_connector.py:60] Initializing MooncakeConfig under kv_transfer_config kv_connector='MooncakeConnector' kv_buffer_device='cuda' kv_buffer_size=2000000000.0 kv_role='kv_producer' kv_rank=0 kv_parallel_size=2 kv_ip='127.0.0.1' kv_port=14579
INFO 02-26 07:53:44 mooncake_pipe.py:229] Selecting device: cuda
ERROR 02-26 07:53:44 engine.py:400] Please install mooncake by following the instructions at https://github.com/kvcache-ai/Mooncake/blob/main/doc/en/build.md to run vLLM with MooncakeConnector.
ERROR 02-26 07:53:44 engine.py:400] Traceback (most recent call last):
ERROR 02-26 07:53:44 engine.py:400]   File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/kv_transfer/kv_pipe/mooncake_pipe.py", line 59, in __init__
ERROR 02-26 07:53:44 engine.py:400]     import mooncake_vllm_adaptor as mva
ERROR 02-26 07:53:44 engine.py:400] ModuleNotFoundError: No module named 'mooncake_vllm_adaptor'
ERROR 02-26 07:53:44 engine.py:400] 
ERROR 02-26 07:53:44 engine.py:400] The above exception was the direct cause of the following exception:
ERROR 02-26 07:53:44 engine.py:400] 
ERROR 02-26 07:53:44 engine.py:400] Traceback (most recent call last):
ERROR 02-26 07:53:44 engine.py:400]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 391, in run_mp_engine
ERROR 02-26 07:53:44 engine.py:400]     engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
ERROR 02-26 07:53:44 engine.py:400]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Next, I realized I should install mooncake_vllm_adaptor, so I tried pip install mooncake_vllm_adaptor, but that failed too.
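
To rule out an environment mismatch after building from source, a quick import check in the same Python interpreter that runs vLLM can help. This is only a sketch; the module name comes from the traceback above, and the file name is illustrative.

# check_adaptor.py -- sketch: verify the adaptor is importable by the same
# python3 interpreter that launches the vLLM server.
try:
    import mooncake_vllm_adaptor  # noqa: F401
    print("mooncake_vllm_adaptor is importable")
except ModuleNotFoundError as exc:
    print(f"mooncake_vllm_adaptor is NOT installed: {exc}")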

Had to build a Docker image that can run vLLM with the Mooncake connector and mooncake_vllm_adaptor

I searched Docker Hub hoping to find an image repository maintained by the Mooncake community, but I could not find one.

So I built an image from Mooncake/Dockerfile, but it contains neither vLLM nor some of the necessary dependencies. I also could not build vLLM from source inside it, because some necessary dependencies are missing.

However, I was able to install the latest version of vLLM via pip install vllm, along with the dependencies I discovered to be missing on each attempt.

In any case, after much effort I ended up with a container that contains both Mooncake and vLLM.

Cannot run the examples successfully following vllm-integration-v0.2.md and vllm-integration.md

Running the steps from vllm-integration-v0.2.md in the Docker container.

  • Ran etcd as a single node successfully
root@TENCENT64:/# nohup etcd --listen-client-urls http://0.0.0.0:2379 --advertise-client-urls http://localhost:2379 & > /disc/data1/baoloongmao/etcd.log 

# Verified that etcd is healthy and can put and list keys

[root@TENCENT64 baoloongmao]#  curl http://192.168.1.206:2379/v2/keys/foo_dir/foo -XPUT -d value=bar
{"action":"set","node":{"key":"/foo_dir/foo","value":"bar","modifiedIndex":6,"createdIndex":6}}
[root@TENCENT64 baoloongmao]#  curl http://192.168.1.206:2379/v2/keys/
{"action":"get","node":{"dir":true,"nodes":[{"key":"/foo_dir","dir":true,"modifiedIndex":6,"createdIndex":6}]}}

  • Prepared a mooncake.json file (a connectivity sanity check is sketched after the config below)
{
  "prefill_url": "192.168.1.206:13003",
  "decode_url": "192.168.1.208:13003",
  "metadata_server": "192.168.1.206:2379",
  "metadata_backend": "etcd",
  "protocol": "tcp",
  "device_name": ""
}
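
Before starting the vLLM servers, it is worth sanity-checking that every endpoint in mooncake.json is reachable from inside each container. The snippet below is only a sketch assuming the JSON layout above (the file name and timeout are illustrative); note that prefill_url/decode_url only accept connections once the corresponding processes are up.

# check_mooncake_config.py -- sketch: read mooncake.json and check that each
# endpoint accepts TCP connections from this container.
import json
import os
import socket

def check(endpoint: str) -> None:
    host, port = endpoint.rsplit(":", 1)
    try:
        with socket.create_connection((host, int(port)), timeout=3):
            print(f"OK   {endpoint}")
    except OSError as exc:
        print(f"FAIL {endpoint}: {exc}")

with open(os.environ.get("MOONCAKE_CONFIG_PATH", "mooncake.json")) as f:
    cfg = json.load(f)

for key in ("metadata_server", "prefill_url", "decode_url"):
    check(cfg[key])
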
  • Ran vLLM processes on two nodes as producer and consumer

first node

root@TENCENT64:/# MOONCAKE_CONFIG_PATH=/disc/data1/baoloongmao/mooncake.json VLLM_USE_MODELSCOPE=True \
python3 -m vllm.entrypoints.openai.api_server \
 --model /disc/data1/Qwen/Qwen2.5-1.5B-Instruct \
           --trust-remote-code \
           --served-model-name Qwen2.5-1.5B-Instruct \
           --max-model-len 32768 \
           -tp 1 --enforce-eager \
           --port 8100 --max-model-len 10000 --gpu-memory-utilization 0.8 --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_producer","kv_rank":0,"kv_parallel_size":2,"kv_buffer_size":2e9}'

INFO 02-27 19:50:53 __init__.py:207] Automatically detected platform cuda.
INFO 02-27 19:50:53 api_server.py:912] vLLM API server version 0.7.3
INFO 02-27 19:50:53 api_server.py:913] args: Namespace(host=None, port=8100, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, enable_reasoning=False, reasoning_parser=None, tool_call_parser=None, tool_parser_plugin='', model='/disc/data1/Qwen/Qwen2.5-1.5B-Instruct', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=10000, guided_decoding_backend='xgrammar', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.8, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=True, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['Qwen2.5-1.5B-Instruct'], qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=KVTransferConfig(kv_connector='MooncakeConnector', kv_buffer_device='cuda', kv_buffer_size=2000000000.0, kv_role='kv_producer', kv_rank=0, kv_parallel_size=2, kv_ip='127.0.0.1', kv_port=14579), 
worker_cls='auto', generation_config=None, override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, additional_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False)
INFO 02-27 19:50:53 api_server.py:209] Started engine process with PID 572
INFO 02-27 19:50:56 __init__.py:207] Automatically detected platform cuda.
INFO 02-27 19:50:58 config.py:549] This model supports multiple tasks: {'score', 'embed', 'generate', 'reward', 'classify'}. Defaulting to 'generate'.
WARNING 02-27 19:50:58 cuda.py:95] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
WARNING 02-27 19:50:58 config.py:685] Async output processing is not supported on the current platform type cuda.
INFO 02-27 19:51:01 config.py:549] This model supports multiple tasks: {'reward', 'embed', 'generate', 'score', 'classify'}. Defaulting to 'generate'.
WARNING 02-27 19:51:01 cuda.py:95] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
WARNING 02-27 19:51:01 config.py:685] Async output processing is not supported on the current platform type cuda.
INFO 02-27 19:51:01 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.3) with config: model='/disc/data1/Qwen/Qwen2.5-1.5B-Instruct', speculative_config=None, tokenizer='/disc/data1/Qwen/Qwen2.5-1.5B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=10000, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=Qwen2.5-1.5B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[],"max_capture_size":0}, use_cached_outputs=True, 
INFO 02-27 19:51:02 cuda.py:229] Using Flash Attention backend.
INFO 02-27 19:51:08 simple_connector.py:60] Initializing MooncakeConfig under kv_transfer_config kv_connector='MooncakeConnector' kv_buffer_device='cuda' kv_buffer_size=2000000000.0 kv_role='kv_producer' kv_rank=0 kv_parallel_size=2 kv_ip='127.0.0.1' kv_port=14579
INFO 02-27 19:51:08 mooncake_pipe.py:229] Selecting device: cuda
INFO 02-27 19:51:08 mooncake_pipe.py:71] Mooncake Configuration loaded successfully.
INFO 02-27 19:51:08 model_runner.py:1110] Starting to load model /disc/data1/Qwen/Qwen2.5-1.5B-Instruct...
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.73it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.73it/s]

INFO 02-27 19:51:09 model_runner.py:1115] Loading model weights took 2.8875 GB
INFO 02-27 19:51:10 worker.py:267] Memory profiling takes 0.72 seconds
INFO 02-27 19:51:10 worker.py:267] the current vLLM instance can use total_gpu_memory (95.00GiB) x gpu_memory_utilization (0.80) = 76.00GiB
INFO 02-27 19:51:10 worker.py:267] model weights take 2.89GiB; non_torch_memory takes 0.12GiB; PyTorch activation peak memory takes 1.44GiB; the rest of the memory reserved for KV Cache is 71.56GiB.
INFO 02-27 19:51:10 executor_base.py:111] # cuda blocks: 167489, # CPU blocks: 9362
INFO 02-27 19:51:10 executor_base.py:116] Maximum concurrency for 10000 tokens per request: 267.98x
INFO 02-27 19:51:11 llm_engine.py:436] init engine (profile, create kv cache, warmup model) took 2.45 seconds
INFO 02-27 19:51:13 api_server.py:958] Starting vLLM API server on http://0.0.0.0:8100
INFO 02-27 19:51:13 launcher.py:23] Available routes are:
INFO 02-27 19:51:13 launcher.py:31] Route: /openapi.json, Methods: HEAD, GET
INFO 02-27 19:51:13 launcher.py:31] Route: /docs, Methods: HEAD, GET
INFO 02-27 19:51:13 launcher.py:31] Route: /docs/oauth2-redirect, Methods: HEAD, GET
INFO 02-27 19:51:13 launcher.py:31] Route: /redoc, Methods: HEAD, GET
INFO 02-27 19:51:13 launcher.py:31] Route: /health, Methods: GET
INFO 02-27 19:51:13 launcher.py:31] Route: /ping, Methods: POST, GET
INFO 02-27 19:51:13 launcher.py:31] Route: /tokenize, Methods: POST
INFO 02-27 19:51:13 launcher.py:31] Route: /detokenize, Methods: POST
INFO 02-27 19:51:13 launcher.py:31] Route: /v1/models, Methods: GET
INFO 02-27 19:51:13 launcher.py:31] Route: /version, Methods: GET
INFO 02-27 19:51:13 launcher.py:31] Route: /v1/chat/completions, Methods: POST
INFO 02-27 19:51:13 launcher.py:31] Route: /v1/completions, Methods: POST
INFO 02-27 19:51:13 launcher.py:31] Route: /v1/embeddings, Methods: POST
INFO 02-27 19:51:13 launcher.py:31] Route: /pooling, Methods: POST
INFO 02-27 19:51:13 launcher.py:31] Route: /score, Methods: POST
INFO 02-27 19:51:13 launcher.py:31] Route: /v1/score, Methods: POST
INFO 02-27 19:51:13 launcher.py:31] Route: /v1/audio/transcriptions, Methods: POST
INFO 02-27 19:51:13 launcher.py:31] Route: /rerank, Methods: POST
INFO 02-27 19:51:13 launcher.py:31] Route: /v1/rerank, Methods: POST
INFO 02-27 19:51:13 launcher.py:31] Route: /v2/rerank, Methods: POST
INFO 02-27 19:51:13 launcher.py:31] Route: /invocations, Methods: POST
INFO:     Started server process [502]
INFO:     Waiting for application startup.
INFO:     Application startup complete.

second node

root@TENCENT64:/# MOONCAKE_CONFIG_PATH=/disc/data1/baoloongmao/mooncake.json VLLM_USE_MODELSCOPE=True \
python3 -m vllm.entrypoints.openai.api_server \
 --model /disc/data1/Qwen/Qwen2.5-1.5B-Instruct \
           --trust-remote-code \
           --served-model-name Qwen2.5-1.5B-Instruct \
           --max-model-len 32768 \
           -tp 1 --enforce-eager \
           --port 8100 --max-model-len 10000 --gpu-memory-utilization 0.8 --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_consumer","kv_rank":0,"kv_parallel_size":2,"kv_buffer_size":2e9}'
INFO 02-27 19:39:56 __init__.py:207] Automatically detected platform cuda.
INFO 02-27 19:39:56 api_server.py:912] vLLM API server version 0.7.3
INFO 02-27 19:39:56 api_server.py:913] args: Namespace(host=None, port=8100, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, enable_reasoning=False, reasoning_parser=None, tool_call_parser=None, tool_parser_plugin='', model='/disc/data1/Qwen/Qwen2.5-1.5B-Instruct', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=10000, guided_decoding_backend='xgrammar', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.8, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=True, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['Qwen2.5-1.5B-Instruct'], qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=KVTransferConfig(kv_connector='MooncakeConnector', kv_buffer_device='cuda', kv_buffer_size=2000000000.0, kv_role='kv_consumer', kv_rank=0, kv_parallel_size=2, kv_ip='127.0.0.1', kv_port=14579), 
worker_cls='auto', generation_config=None, override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, additional_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False)
INFO 02-27 19:39:56 api_server.py:209] Started engine process with PID 675
INFO 02-27 19:39:59 __init__.py:207] Automatically detected platform cuda.
INFO 02-27 19:40:01 config.py:549] This model supports multiple tasks: {'classify', 'embed', 'generate', 'score', 'reward'}. Defaulting to 'generate'.
WARNING 02-27 19:40:01 cuda.py:95] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
WARNING 02-27 19:40:01 config.py:685] Async output processing is not supported on the current platform type cuda.
INFO 02-27 19:40:05 config.py:549] This model supports multiple tasks: {'score', 'generate', 'embed', 'reward', 'classify'}. Defaulting to 'generate'.
WARNING 02-27 19:40:05 cuda.py:95] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
WARNING 02-27 19:40:05 config.py:685] Async output processing is not supported on the current platform type cuda.
INFO 02-27 19:40:05 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.3) with config: model='/disc/data1/Qwen/Qwen2.5-1.5B-Instruct', speculative_config=None, tokenizer='/disc/data1/Qwen/Qwen2.5-1.5B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=10000, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=Qwen2.5-1.5B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[],"max_capture_size":0}, use_cached_outputs=True, 
INFO 02-27 19:40:06 cuda.py:229] Using Flash Attention backend.
INFO 02-27 19:40:09 simple_connector.py:60] Initializing MooncakeConfig under kv_transfer_config kv_connector='MooncakeConnector' kv_buffer_device='cuda' kv_buffer_size=2000000000.0 kv_role='kv_consumer' kv_rank=0 kv_parallel_size=2 kv_ip='127.0.0.1' kv_port=14579
INFO 02-27 19:40:09 mooncake_pipe.py:229] Selecting device: cuda
INFO 02-27 19:40:09 mooncake_pipe.py:71] Mooncake Configuration loaded successfully.
INFO 02-27 19:40:09 model_runner.py:1110] Starting to load model /disc/data1/Qwen/Qwen2.5-1.5B-Instruct...
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.80it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.80it/s]

INFO 02-27 19:40:10 model_runner.py:1115] Loading model weights took 2.8875 GB
INFO 02-27 19:40:11 worker.py:267] Memory profiling takes 0.71 seconds
INFO 02-27 19:40:11 worker.py:267] the current vLLM instance can use total_gpu_memory (95.00GiB) x gpu_memory_utilization (0.80) = 76.00GiB
INFO 02-27 19:40:11 worker.py:267] model weights take 2.89GiB; non_torch_memory takes 0.12GiB; PyTorch activation peak memory takes 1.44GiB; the rest of the memory reserved for KV Cache is 71.56GiB.
INFO 02-27 19:40:11 executor_base.py:111] # cuda blocks: 167489, # CPU blocks: 9362
INFO 02-27 19:40:11 executor_base.py:116] Maximum concurrency for 10000 tokens per request: 267.98x
INFO 02-27 19:40:12 llm_engine.py:436] init engine (profile, create kv cache, warmup model) took 2.35 seconds
INFO 02-27 19:40:15 api_server.py:958] Starting vLLM API server on http://0.0.0.0:8100
INFO 02-27 19:40:15 launcher.py:23] Available routes are:
INFO 02-27 19:40:15 launcher.py:31] Route: /openapi.json, Methods: HEAD, GET
INFO 02-27 19:40:15 launcher.py:31] Route: /docs, Methods: HEAD, GET
INFO 02-27 19:40:15 launcher.py:31] Route: /docs/oauth2-redirect, Methods: HEAD, GET
INFO 02-27 19:40:15 launcher.py:31] Route: /redoc, Methods: HEAD, GET
INFO 02-27 19:40:15 launcher.py:31] Route: /health, Methods: GET
INFO 02-27 19:40:15 launcher.py:31] Route: /ping, Methods: POST, GET
INFO 02-27 19:40:15 launcher.py:31] Route: /tokenize, Methods: POST
INFO 02-27 19:40:15 launcher.py:31] Route: /detokenize, Methods: POST
INFO 02-27 19:40:15 launcher.py:31] Route: /v1/models, Methods: GET
INFO 02-27 19:40:15 launcher.py:31] Route: /version, Methods: GET
INFO 02-27 19:40:15 launcher.py:31] Route: /v1/chat/completions, Methods: POST
INFO 02-27 19:40:15 launcher.py:31] Route: /v1/completions, Methods: POST
INFO 02-27 19:40:15 launcher.py:31] Route: /v1/embeddings, Methods: POST
INFO 02-27 19:40:15 launcher.py:31] Route: /pooling, Methods: POST
INFO 02-27 19:40:15 launcher.py:31] Route: /score, Methods: POST
INFO 02-27 19:40:15 launcher.py:31] Route: /v1/score, Methods: POST
INFO 02-27 19:40:15 launcher.py:31] Route: /v1/audio/transcriptions, Methods: POST
INFO 02-27 19:40:15 launcher.py:31] Route: /rerank, Methods: POST
INFO 02-27 19:40:15 launcher.py:31] Route: /v1/rerank, Methods: POST
INFO 02-27 19:40:15 launcher.py:31] Route: /v2/rerank, Methods: POST
INFO 02-27 19:40:15 launcher.py:31] Route: /invocations, Methods: POST
INFO:     Started server process [605]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
  • Ran proxy_server.py on the first node, copied from vllm-integration.md (a minimal sketch of such a proxy follows the command below)
# I modified `proxy_server.py` only to update the necessary node IPs and ports.
root@TENCENT64:/# python3 /disc/data1/baoloongmao/proxy_server.py 
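
For context, the behavior visible in the logs below (a max_tokens=1 request hitting the producer, then the full request hitting the consumer) matches a proxy that works roughly like the following sketch. This is not the actual proxy_server.py from vllm-integration.md, just an illustration assuming Quart and httpx; the node addresses are the ones used in this setup.

# proxy_sketch.py -- illustrative stand-in for proxy_server.py (NOT the real
# script). It sends each completion request to the prefill (kv_producer) node
# with max_tokens=1 so only the KV cache is built, then forwards the full
# request to the decode (kv_consumer) node, which reuses the transferred cache.
import copy

import httpx
from quart import Quart, Response, request

PREFILL_URL = "http://192.168.1.206:8100/v1/completions"  # producer node
DECODE_URL = "http://192.168.1.208:8100/v1/completions"   # consumer node

app = Quart(__name__)

@app.route("/v1/completions", methods=["POST"])
async def completions():
    body = await request.get_json()

    async with httpx.AsyncClient(timeout=None) as client:
        # Prefill only: the producer builds the KV cache for the prompt.
        prefill_body = copy.deepcopy(body)
        prefill_body["max_tokens"] = 1
        await client.post(PREFILL_URL, json=prefill_body)

        # Full generation on the decode node.
        resp = await client.post(DECODE_URL, json=body)

    return Response(resp.content, status=resp.status_code,
                    mimetype="application/json")

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8001)
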
  • Ran the test by sending a POST request to the proxy_server.py process on its node
[root@TENCENT64 baoloongmao]# curl -s http://localhost:8001/v1/completions -H "Content-Type: application/json" -d '{
  "model": "Qwen2.5-1.5B-Instruct",
  "prompt": "San Francisco is a",
  "max_tokens": 1000
}'
<!doctype html>
<html lang=en>
<title>500 Internal Server Error</title>
<h1>Internal Server Error</h1>
<p>The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application.</p>
  • I found that both vLLM servers crash and shut down once the request arrives.
  1. First vLLM node logs
INFO 02-27 19:59:49 logger.py:39] Received request cmpl-f75096c299e24ba89ffbe6ab1af0d16c-0: prompt: 'San Francisco is a', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=1000, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None), prompt_token_ids: [23729, 12879, 374, 264], lora_request: None, prompt_adapter_request: None.
INFO 02-27 19:59:49 engine.py:280] Added request cmpl-f75096c299e24ba89ffbe6ab1af0d16c-0.
INFO 02-27 19:59:49 metrics.py:455] Avg prompt throughput: 0.8 tokens/s, Avg generation throughput: 2.2 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO:     192.168.1.208:36196 - "POST /v1/completions HTTP/1.1" 200 OK
INFO 02-27 20:00:03 metrics.py:455] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 31.5 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 02-27 20:00:13 metrics.py:455] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 02-27 20:00:22 logger.py:39] Received request cmpl-f06f62f3924c455895d33683ae28d22d-0: prompt: 'San Francisco is a', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=1, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None), prompt_token_ids: [23729, 12879, 374, 264], lora_request: None, prompt_adapter_request: None.
INFO 02-27 20:00:22 engine.py:280] Added request cmpl-f06f62f3924c455895d33683ae28d22d-0.
INFO 02-27 20:00:22 metrics.py:455] Avg prompt throughput: 0.4 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO:     192.168.1.206:57566 - "POST /v1/completions HTTP/1.1" 200 OK
WARNING: Logging before InitGoogleLogging() is written to STDERR
E0227 20:00:22.206605  1563 transfer_metadata_plugin.cpp:275] EtcdStoragePlugin: unable to get mooncake/ram/192.168.1.208:13003 from 192.168.1.206:2379: etcd-cpp-apiv3: key not found
W0227 20:00:22.206632  1563 transfer_metadata.cpp:153] Failed to retrieve segment descriptor, name 192.168.1.208:13003
ERROR 02-27 20:00:22 mooncake_pipe.py:163] Transfer Return Error
Exception in thread Thread-3 (drop_select_handler):
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/kv_transfer/kv_lookup_buffer/simple_buffer.py", line 140, in drop_select_handler
    signal = self.signal_pipe.recv_tensor()
  File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/kv_transfer/kv_pipe/mooncake_pipe.py", line 261, in recv_tensor
    tensor = self.transport_thread.submit(self._recv_impl).result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/kv_transfer/kv_pipe/mooncake_pipe.py", line 246, in _recv_impl
    data = self.transfer_engine.recv_bytes()
  File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/kv_transfer/kv_pipe/mooncake_pipe.py", line 196, in recv_bytes
    self.transfer_sync(dst_ptr, src_ptr, length)
  File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/kv_transfer/kv_pipe/mooncake_pipe.py", line 164, in transfer_sync
    raise Exception("Transfer Return Error")
Exception: Transfer Return Error
INFO 02-27 20:00:32 metrics.py:455] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.

^CINFO 02-27 20:18:06 launcher.py:62] Shutting down FastAPI HTTP server.
ERROR 02-27 20:18:06 engine.py:400] MQLLMEngine terminated
ERROR 02-27 20:18:06 engine.py:400] Traceback (most recent call last):
ERROR 02-27 20:18:06 engine.py:400]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/engine.py", line 397, in run_mp_engine
ERROR 02-27 20:18:06 engine.py:400]     engine.start()
ERROR 02-27 20:18:06 engine.py:400]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/engine.py", line 145, in start
ERROR 02-27 20:18:06 engine.py:400]     self.cleanup()
ERROR 02-27 20:18:06 engine.py:400]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/engine.py", line 385, in signal_handler
ERROR 02-27 20:18:06 engine.py:400]     raise KeyboardInterrupt("MQLLMEngine terminated")
ERROR 02-27 20:18:06 engine.py:400] KeyboardInterrupt: MQLLMEngine terminated
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/engine.py", line 402, in run_mp_engine
    raise e
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/engine.py", line 397, in run_mp_engine
    engine.start()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/engine.py", line 145, in start
    self.cleanup()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/engine.py", line 385, in signal_handler
    raise KeyboardInterrupt("MQLLMEngine terminated")
KeyboardInterrupt: MQLLMEngine terminated
[rank0]:[W227 20:18:07.804461464 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())
INFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.
  2. Second vLLM node logs
INFO:     192.168.1:47126 - "POST /v1/completions HTTP/1.1" 404 Not Found
INFO 02-27 20:00:22 logger.py:39] Received request cmpl-05d4d13293324f12a52c95c9f08724c3-0: prompt: 'San Francisco is a', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=1000, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None), prompt_token_ids: [23729, 12879, 374, 264], lora_request: None, prompt_adapter_request: None.
INFO 02-27 20:00:22 engine.py:280] Added request cmpl-05d4d13293324f12a52c95c9f08724c3-0.
WARNING: Logging before InitGoogleLogging() is written to STDERR
E0227 20:00:22.206334  1503 transfer_metadata_plugin.cpp:275] EtcdStoragePlugin: unable to get mooncake/ram/192.168.1.208:13003 from 192.168.1.206:2379: etcd-cpp-apiv3: key not found
W0227 20:00:22.206357  1503 transfer_metadata.cpp:153] Failed to retrieve segment descriptor, name 192.168.1.208:13003
ERROR 02-27 20:00:22 mooncake_pipe.py:163] Transfer Return Error
CRITICAL 02-27 20:00:22 launcher.py:104] MQLLMEngine is already dead, terminating server process
INFO:     192.168.1.206:41086 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
ERROR 02-27 20:00:22 engine.py:140] Exception('Transfer Return Error')
ERROR 02-27 20:00:22 engine.py:140] Traceback (most recent call last):
ERROR 02-27 20:00:22 engine.py:140]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/engine.py", line 138, in start
ERROR 02-27 20:00:22 engine.py:140]     self.run_engine_loop()
ERROR 02-27 20:00:22 engine.py:140]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/engine.py", line 201, in run_engine_loop
ERROR 02-27 20:00:22 engine.py:140]     request_outputs = self.engine_step()
ERROR 02-27 20:00:22 engine.py:140]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/engine.py", line 219, in engine_step
ERROR 02-27 20:00:22 engine.py:140]     raise e
ERROR 02-27 20:00:22 engine.py:140]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/engine.py", line 210, in engine_step
ERROR 02-27 20:00:22 engine.py:140]     return self.engine.step()
ERROR 02-27 20:00:22 engine.py:140]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 1391, in step
ERROR 02-27 20:00:22 engine.py:140]     outputs = self.model_executor.execute_model(
ERROR 02-27 20:00:22 engine.py:140]   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 139, in execute_model
ERROR 02-27 20:00:22 engine.py:140]     output = self.collective_rpc("execute_model",
ERROR 02-27 20:00:22 engine.py:140]   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
ERROR 02-27 20:00:22 engine.py:140]     answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 02-27 20:00:22 engine.py:140]   File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 2196, in run_method
ERROR 02-27 20:00:22 engine.py:140]     return func(*args, **kwargs)
ERROR 02-27 20:00:22 engine.py:140]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 420, in execute_model
ERROR 02-27 20:00:22 engine.py:140]     output = self.model_runner.execute_model(
ERROR 02-27 20:00:22 engine.py:140]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 02-27 20:00:22 engine.py:140]     return func(*args, **kwargs)
ERROR 02-27 20:00:22 engine.py:140]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1697, in execute_model
ERROR 02-27 20:00:22 engine.py:140]     get_kv_transfer_group().recv_kv_caches_and_hidden_states(
ERROR 02-27 20:00:22 engine.py:140]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/kv_transfer/kv_transfer_agent.py", line 75, in recv_kv_caches_and_hidden_states
ERROR 02-27 20:00:22 engine.py:140]     return self.connector.recv_kv_caches_and_hidden_states(
ERROR 02-27 20:00:22 engine.py:140]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/kv_transfer/kv_connector/simple_connector.py", line 238, in recv_kv_caches_and_hidden_states
ERROR 02-27 20:00:22 engine.py:140]     ret = self.select(current_tokens,
ERROR 02-27 20:00:22 engine.py:140]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/kv_transfer/kv_connector/simple_connector.py", line 139, in select
ERROR 02-27 20:00:22 engine.py:140]     return self.consumer_buffer.drop_select(input_tokens, roi)
ERROR 02-27 20:00:22 engine.py:140]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/kv_transfer/kv_lookup_buffer/simple_buffer.py", line 202, in drop_select
ERROR 02-27 20:00:22 engine.py:140]     input_tokens = self.data_pipe.recv_tensor()
ERROR 02-27 20:00:22 engine.py:140]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/kv_transfer/kv_pipe/mooncake_pipe.py", line 261, in recv_tensor
ERROR 02-27 20:00:22 engine.py:140]     tensor = self.transport_thread.submit(self._recv_impl).result()
ERROR 02-27 20:00:22 engine.py:140]   File "/usr/lib/python3.10/concurrent/futures/_base.py", line 458, in result
ERROR 02-27 20:00:22 engine.py:140]     return self.__get_result()
ERROR 02-27 20:00:22 engine.py:140]   File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
ERROR 02-27 20:00:22 engine.py:140]     raise self._exception
ERROR 02-27 20:00:22 engine.py:140]   File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
ERROR 02-27 20:00:22 engine.py:140]     result = self.fn(*self.args, **self.kwargs)
ERROR 02-27 20:00:22 engine.py:140]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/kv_transfer/kv_pipe/mooncake_pipe.py", line 246, in _recv_impl
ERROR 02-27 20:00:22 engine.py:140]     data = self.transfer_engine.recv_bytes()
ERROR 02-27 20:00:22 engine.py:140]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/kv_transfer/kv_pipe/mooncake_pipe.py", line 196, in recv_bytes
ERROR 02-27 20:00:22 engine.py:140]     self.transfer_sync(dst_ptr, src_ptr, length)
ERROR 02-27 20:00:22 engine.py:140]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/kv_transfer/kv_pipe/mooncake_pipe.py", line 164, in transfer_sync
ERROR 02-27 20:00:22 engine.py:140]     raise Exception("Transfer Return Error")
ERROR 02-27 20:00:22 engine.py:140] Exception: Transfer Return Error
INFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.
INFO:     Finished server process [605]
[rank0]: Traceback (most recent call last):
[rank0]:   File "<string>", line 1, in <module>
[rank0]:   File "/usr/lib/python3.10/multiprocessing/spawn.py", line 116, in spawn_main
[rank0]:     exitcode = _main(fd, parent_sentinel)
[rank0]:   File "/usr/lib/python3.10/multiprocessing/spawn.py", line 129, in _main
[rank0]:     return self._bootstrap(parent_sentinel)
[rank0]:   File "/usr/lib/python3.10/multiprocessing/process.py", line 332, in _bootstrap
[rank0]:     threading._shutdown()
[rank0]:   File "/usr/lib/python3.10/threading.py", line 1537, in _shutdown
[rank0]:     atexit_call()
[rank0]:   File "/usr/lib/python3.10/concurrent/futures/thread.py", line 31, in _python_exit
[rank0]:     t.join()
[rank0]:   File "/usr/lib/python3.10/threading.py", line 1096, in join
[rank0]:     self._wait_for_tstate_lock()
[rank0]:   File "/usr/lib/python3.10/threading.py", line 1116, in _wait_for_tstate_lock
[rank0]:     if lock.acquire(block, timeout):
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/engine.py", line 385, in signal_handler
[rank0]:     raise KeyboardInterrupt("MQLLMEngine terminated")
[rank0]: KeyboardInterrupt: MQLLMEngine terminated
[rank0]:[W227 20:00:23.698135103 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())
  3. proxy_server.py log
root@TENCENT64:/# python3 /disc/data1/baoloongmao/proxy_server.py 
 * Serving Quart app 'proxy_server'
 * Debug mode: False
 * Please use an ASGI server (e.g. Hypercorn) directly in production
 * Running on http://0.0.0.0:8001 (CTRL + C to quit)
[2025-02-27 19:55:03 +0000] [1172] [INFO] Running on http://0.0.0.0:8001 (CTRL + C to quit)
[2025-02-27 19:56:08 +0000] [1172] [INFO] 192.168.1.206:50388 POST /v1/completions 1.1 200 - 25535
[2025-02-27 20:00:22 +0000] [1172] [INFO] 192.168.1.206:50088 POST /v1/completions 1.1 200 - 36948
  • Tried some troubleshooting
  1. etcd is healthy, but there are no keys created by vLLM or Mooncake.
  2. Checked each vLLM node log; there are no further lines matching mooncake/ram, EtcdStoragePlugin: set: key= or EtcdStoragePlugin: unable to set. My guess is that neither vLLM server asked EtcdStoragePlugin to create the key mooncake/ram/192.168.1.208:13003 (see the etcd v3 check sketched below).
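
One detail worth noting when checking etcd: the error messages mention etcd-cpp-apiv3, so the Transfer Engine registers its segments through the etcd v3 API, whose keyspace is separate from the v2 /v2/keys endpoint used in the curl checks above; a v2 listing cannot show Mooncake's keys even if they exist. Below is a sketch of a v3 lookup for the mooncake/ prefix, assuming etcd >= 3.4 (JSON gateway at /v3/kv/range); the script name is illustrative.

# list_mooncake_keys.py -- sketch: list etcd v3 keys under the "mooncake/"
# prefix through the JSON gateway (requires etcd >= 3.4).
import base64

import requests

ETCD = "http://192.168.1.206:2379"
prefix = b"mooncake/"
# A range that covers the whole prefix: last byte of the prefix incremented.
range_end = prefix[:-1] + bytes([prefix[-1] + 1])

resp = requests.post(
    f"{ETCD}/v3/kv/range",
    json={
        "key": base64.b64encode(prefix).decode(),
        "range_end": base64.b64encode(range_end).decode(),
        "keys_only": True,
    },
)
resp.raise_for_status()
for kv in resp.json().get("kvs", []):
    print(base64.b64decode(kv["key"]).decode())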

Failed to run the example from vllm-integration.md in the Docker container --- Mooncake is not used

# run this in the first node
VLLM_HOST_IP="192.168.1.206" VLLM_PORT="51000" MASTER_ADDR="192.168.1.206" MASTER_PORT="54324"  MOONCAKE_CONFIG_PATH=/disc/data1/baoloongmao/mooncake.json VLLM_DISTRIBUTED_KV_ROLE=producer  VLLM_USE_MODELSCOPE=True \
python3 -m vllm.entrypoints.openai.api_server \
 --model /disc/data1/Qwen/Qwen2.5-1.5B-Instruct \
           --trust-remote-code \
           --served-model-name Qwen2.5-1.5B-Instruct \
           --port 8100
           --max-model-len 10000 \
           -tp 1 --enforce-eager 
           
# run this in the second node
VLLM_HOST_IP="192.168.1.206" VLLM_PORT="51000" MASTER_ADDR="192.168.1.206" MASTER_PORT="54324"  MOONCAKE_CONFIG_PATH=/disc/data1/baoloongmao/mooncake.json VLLM_DISTRIBUTED_KV_ROLE=consumer  VLLM_USE_MODELSCOPE=True \
python3 -m vllm.entrypoints.openai.api_server \
 --model /disc/data1/Qwen/Qwen2.5-1.5B-Instruct \
           --trust-remote-code \
           --served-model-name Qwen2.5-1.5B-Instruct \
           --port 8200
           --max-model-len 10000 \
           -tp 1 --enforce-eager 
  • Troubleshooting attempts
  1. I guess something is wrong with the values of the VLLM_HOST_IP, VLLM_PORT, MASTER_ADDR, and MASTER_PORT environment variables; in fact, I don't understand these variables well.
  2. I cannot find any mention of Mooncake in the logs, so I guess setting VLLM_DISTRIBUTED_KV_ROLE alone is not enough to enable Mooncake.
maobaolong (Author) commented:

@alogfans Hi, would you be willing to help me get vLLM + Mooncake running? Any help would be appreciated!

alogfans (Collaborator) commented Feb 28, 2025

Running the vLLM image with the Mooncake kv_connector config --- mooncake_vllm_adaptor is missing.

The Dockerfile was contributed by the community. We will publish a Docker image to make deployment more convenient in the near future.

Cannot run the examples successfully following vllm-integration-v0.2.md and vllm-integration.md

I looked at the two vLLM node logs you provided. It seems there is a problem during the initialization of the Transfer Engine (that output appears before the section you provided). You can use export MC_VERBOSE=1 to make the output more verbose. You also need to check that the IP addresses are reachable between the Docker instances.

maobaolong (Author) commented:

@alogfans Thanks for your reply! The section "Failed to run the example from vllm-integration.md in the Docker container --- Mooncake is not used" shows that the two nodes can reach each other: that setup runs successfully, it just does not use Mooncake.

The key question is: why does the producer not record its metadata key in etcd?

alogfans (Collaborator) commented Mar 3, 2025

According to the logs, both the prefill and the decode nodes tried to get the segment "192.168.1.208:13003" (i.e., the decode machine in the JSON file). The decode node retrieves segment information locally, without needing etcd. Therefore the problem is most likely that the decode node failed to create its segment (transport init failed, memory allocation failed, ...). So try printing verbose logs on the decode node.

ShangmingCai (Collaborator) commented:

Next, I realized I should install mooncake_vllm_adaptor, so I tried pip install mooncake_vllm_adaptor, but that failed too.

You have to install Mooncake manually by following the build doc. The package is not on PyPI yet.
