
[evals] Add support for scaling evals and inference with ray #63

Merged · 33 commits · Feb 6, 2025

Conversation

@erictang000 (Collaborator) commented on Feb 3, 2025:

What does this PR do?

This PR adds support for using ray to speed up evals and data generation. Currently we are using a preliminary version built on ray data + vllm while we wait for the code at ray.llm to be fully open sourced (coming in the next 1-2 weeks), after which we will migrate over.

Speedups

[Benchmark figures: eval and data-generation speedups] Example speedups using `ray`: we found that for a single 8xA100 or 8xH100 node, setting `tp=4` with 2 replicas or `tp=2` with 4 replicas can be much faster than a single `tp=8` replica, especially for longer evals like MMLU.

We also see faster data generation when sampling n parallel generations (shown above for the DeepSeek-distilled Qwen-7B). For comparison, the same inference setup (32k max tokens on AIME with n=128) using the Qwen math repo and a single `tp=4` replica takes ~10 hours.

How to Use

To use the new path, simply add `--use_ray` to existing commands and set the relevant scaling parameters in `--ray_config`.

Reasonable defaults and examples of how to set advanced vllm engine arguments are provided in ray_configs/ray_config.yaml. For example, to run the Math-500 eval with Sky-T1-32B-Preview, you can use the following command:

python inference_and_check.py --model NovaSky-AI/Sky-T1-32B-Preview --task math500  --split test --max_tokens 8192 --use_ray --ray_config ray_configs/ray_config.yaml

where the ray_config looks like:

llm_engine: vllm # currently only vllm supported
accelerator_type: A100-80G # accelerator name as specified here: https://docs.ray.io/en/master/ray-core/accelerator-types.html#accelerator-types
engine_kwargs: # vllm engine kwargs 
  tensor_parallel_size: 2
  gpu_memory_utilization: 0.9
runtime_env:
  env_vars:
    VLLM_ATTENTION_BACKEND: "FLASH_ATTN"
env_config:
  num_replicas: 4 # number of vllm replicas 
  batch_size: 128 # ray data internal batch size (used for map_batches call internally). Should usually be set to a value in [64, 128, 256] for best performance.
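For intuition, here is a minimal sketch (not the repo's actual code) of the ray data + vllm pattern this config drives: each replica is a Ray actor holding one vLLM engine, and ray data fans batches out across the actor pool via map_batches. Model name, class name, and placeholder prompts are illustrative; tensor-parallel engines inside Ray actors may additionally need the multiprocessing executor or explicit placement groups depending on your vllm/ray versions.

import ray
from vllm import LLM, SamplingParams

class VLLMPredictor:
    def __init__(self):
        # One vLLM engine per Ray actor, mirroring engine_kwargs above.
        self.llm = LLM(
            model="NovaSky-AI/Sky-T1-32B-Preview",
            tensor_parallel_size=2,
            gpu_memory_utilization=0.9,
        )

    def __call__(self, batch):
        outputs = self.llm.generate(
            list(batch["prompt"]), SamplingParams(max_tokens=8192)
        )
        batch["response"] = [out.outputs[0].text for out in outputs]
        return batch

prompts = [{"prompt": "What is 1 + 1?"}]  # placeholder inputs
ds = ray.data.from_items(prompts)
ds = ds.map_batches(
    VLLMPredictor,
    batch_size=128,  # env_config.batch_size
    concurrency=4,   # env_config.num_replicas -> actor pool size
    num_gpus=2,      # GPUs per replica = tensor_parallel_size
)
results = ds.take_all()

The key design point is that replicas and tensor parallelism trade off against each other: num_replicas × tensor_parallel_size should equal the GPUs on the node (e.g. 4 × 2 = 8 on an 8xA100 node), and as the speedup numbers above suggest, more smaller replicas often win for throughput-bound evals.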

@lynnliu030 self-requested a review on February 3, 2025 04:20
@kouroshHakha (Collaborator) left a comment:


We don't have any unit tests in this repo, so some stuff to verify manually:

  • does the regular single node cli still work after these changes?
  • does it still work with OpenAI models?

There is a bit of extra stuff in workload.py that we can trim down to avoid confusing people.

@erictang000 (Collaborator, Author) replied:

  • does the regular single node cli still work after these changes?
  • does it still work with OpenAI models?

Yep, checked with the new updates and made sure everything works end-to-end for both inference_and_check and inference_and_save, with n = 1 and n > 1, for all 3 paths for getting completions (oai, vllm, ray + vllm).

Comment on lines 95 to 96

responses = copy.deepcopy(responses)
Collaborator:

Could you add a NOTE + TODO comment here for now explaining the issue we saw?

@erictang000 (Collaborator, Author) commented on Feb 4, 2025:

Bug details (a sketch of the requested NOTE + TODO comment follows this list):

  • a new Response object (a plain Python dataclass with str, int, and int attributes) is initialized from the values of ds.iter_rows() on a ray dataset
  • these responses are processed in a ProcessPoolExecutor, but when we exit the executor's context and it tries to clean up the response objects, we run into a SIGSEGV at the ray object store level (see the traceback below)
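For example, the requested comment might look something like this (hypothetical wording, summarizing the findings above):

# NOTE: these Response objects are built from rows of ds.iter_rows(), whose
# underlying buffers live in the Ray object store. Handing them to a
# ProcessPoolExecutor and letting the executor tear them down segfaulted the
# raylet in plasma::ReadReleaseRequest (see the traceback in PR #63).
# Deep-copying detaches the data from the object store before the executor
# touches it.
# TODO: remove this workaround once the underlying object-store release bug
# is fixed upstream.
responses = copy.deepcopy(responses)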

Traceback for posterity

(raylet) *** SIGSEGV received at time=1738696160 on cpu 214 ***                                                                                                                                                                     
(raylet) PC: @     0x56080cd7f1ae  (unknown)  plasma::ReadReleaseRequest()
(raylet)     @     0x7fcb9fbaf520       4656  (unknown)
(raylet)     @     0x56080cd5715f       1456  plasma::PlasmaStore::ProcessMessage()
(raylet)     @     0x56080cd50f15         32  std::_Function_handler<>::_M_invoke()
(raylet)     @     0x56080cd86d33       1280  plasma::Client::Create()::{lambda()#1}::operator()()
(raylet)     @     0x56080cf56aad       1376  ray::ClientConnection::ProcessMessage()
(raylet)     @     0x56080cf6de98       1168  EventTracker::RecordExecution()
(raylet)     @     0x56080cf58fb8        400  boost::asio::detail::reactive_socket_recv_op<>::do_complete()
(raylet)     @     0x56080d557f9b        128  boost::asio::detail::scheduler::do_run_one()
(raylet)     @     0x56080d55a529        288  boost::asio::detail::scheduler::run()
(raylet)     @     0x56080d55aa42         96  boost::asio::io_context::run()
(raylet)     @     0x56080cd50b20       1424  plasma::PlasmaStoreRunner::Start()
(raylet)     @     0x56080ccc4b05        208  std::thread::_State_impl<>::_M_run()
(raylet)     @     0x56080d6bafb0  258531312  execute_native_thread_routine
(raylet)     @ ... and at least 3 more frames
(raylet) {"asctime":"2025-02-04 19:09:20,137","levelname":"E","message":"*** SIGSEGV received at time=1738696160 on cpu 214 ***","component":"raylet","filename":"logging.cc","lineno":447}
(raylet) {"asctime":"2025-02-04 19:09:20,137","levelname":"E","message":"    @ ... and at least 3 more frames","component":"raylet","filename":"logging.cc","lineno":447}
(raylet) {"asctime":"2025-02-04 19:09:20,137","levelname":"E","message":"PC: @     0x56080cd7f1ae  (unknown)  plasma::ReadReleaseRequest()","component":"raylet","filename":"logging.cc","lineno":447}
(raylet)     @     0x56080cd873f8         48  std::_Function_handler<>::_M_invoke()
(raylet) {"asctime":"2025-02-04 19:09:20,137","levelname":"E","message":"    @     0x56080d6bafb0  258531312  execute_native_thread_routine","component":"raylet","filename":"logging.cc","lineno":447} [repeated 16x across cluster]

@SumanthRH (Collaborator) left a comment:


LGTM. Thanks!!

@SumanthRH merged commit 806f09c into NovaSky-AI:main on Feb 6, 2025
2 checks passed