
[Inference] Add debug mode #257

Open · wants to merge 6 commits into main

Conversation

@Deegue (Contributor) commented Jun 20, 2024

Debug mode with logs and some improvements & questions.
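A minimal sketch of what a debug switch like this could look like, purely for illustration (the --debug flag and logger setup below are assumptions, not the exact changes in this PR):

import argparse
import logging

logger = logging.getLogger("llm_on_ray.inference")

def setup_logging(debug: bool) -> None:
    # Debug mode switches on verbose logs (prompts, generation config, worker
    # dispatch); otherwise the default INFO level is kept.
    logging.basicConfig(
        level=logging.DEBUG if debug else logging.INFO,
        format="%(asctime)s %(name)s %(levelname)s %(message)s",
    )

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--debug", action="store_true", help="enable verbose inference logs")
    args = parser.parse_args()
    setup_logging(args.debug)
    logger.debug("Debug mode enabled.")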

@@ -219,7 +225,9 @@ def generate(self, input: GenerateInput, **config) -> GenerateOutput:

    def streaming_generate(self, prompt, streamer, **config):
        self._process_config(config)
        # Q1: Why is this handled here when using both deepspeed and hpu?
        if self.infer_conf.deepspeed:
Contributor Author (@Deegue):
Here in hpu_predictor.py, it is a little bit confusing since we have another predictor called deepspeed_predictor.
The two predictors are for hpu and cpu; maybe we can rename deepspeed_predictor to something like cpu or base predictor.

Contributor:
There is a TODO comment to consolidate these two predictors.

@@ -196,6 +200,8 @@ def generate(self, input: GenerateInput, **config) -> GenerateOutput:

        self._process_config(config)

        # TODO: Maybe we should get real-time load info for all cards, set a healthy
        # usage ratio, and pick the usable cards for serving, so that errors like OOM
        # can be prevented and the server becomes more robust.
        if self.infer_conf.deepspeed:
            return ray.get(
                [worker.generate.remote(prompt, **config) for worker in self.deepspeed_workers]
Contributor Author (@Deegue):
Instead of always using a fixed worker, maybe we should spread the load across all cards when deepspeed is enabled.

Contributor (@xwu99), Jul 2, 2024:

Currently tensor parallelism is used to process a single request; one process per card is the industry best practice, given that the load is balanced. You might think each request is sent to a different card, but that is not the case.
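To make the distinction concrete, here is a hedged sketch of the two dispatch styles (the helper names are made up; deepspeed_workers and replica_workers stand in for the Ray actor handles from the diff above, and the data-parallel variant is hypothetical, not what the code does):

import itertools

import ray

def generate_tensor_parallel(deepspeed_workers, prompt, **config):
    # Tensor parallelism (as in the diff above): every request goes to all
    # workers, each holding a shard of the model, one process per card;
    # they compute the result cooperatively.
    return ray.get(
        [worker.generate.remote(prompt, **config) for worker in deepspeed_workers]
    )

def make_data_parallel_generate(replica_workers):
    # Data parallelism ("spread the load to all cards"): each request is
    # routed to a single worker holding a full model replica. This is NOT
    # how the deepspeed path here works.
    cycle = itertools.cycle(replica_workers)

    def generate(prompt, **config):
        worker = next(cycle)
        return ray.get(worker.generate.remote(prompt, **config))

    return generate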

@xwu99 requested a review from KepingYan, July 1, 2024 02:21
@xwu99 (Contributor) commented Jul 1, 2024

@KepingYan could you help review this and discuss with @Deegue?

Comment on lines +90 to +94
# optimize transformers for gaudi
from optimum.habana.transformers.modeling_utils import adapt_transformers_to_gaudi

adapt_transformers_to_gaudi()

Contributor:
Why move this function here?

Contributor Author (@Deegue):
This moves the function out of the if block, so it is executed both with and without deepspeed.

Contributor:
As I understand from the PR title, this PR is to add a debug mode, so why touch other code? Could you submit a separate PR to address the other issues?

Comment on lines -287 to -290
# optimize transformers for gaudi
from optimum.habana.transformers.modeling_utils import adapt_transformers_to_gaudi

adapt_transformers_to_gaudi()
Contributor:
If this function is not executed in every worker, will it work as expected?

Contributor Author (@Deegue):
Same as above, this function will be executed earlier.
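For context, a hedged sketch of the concern in this thread, assuming the deepspeed workers are Ray actors running in separate processes (the HpuWorkerSketch class is hypothetical, not the PR's actual predictor): if the patch has to take effect in the process that loads the model, one common pattern is to apply it inside each worker's __init__.

import ray

@ray.remote
class HpuWorkerSketch:
    # Illustrative only: applies the Gaudi patch inside each worker process.

    def __init__(self):
        # Each Ray actor runs in its own process, so a monkey-patch applied
        # only in the driver is not automatically visible here.
        from optimum.habana.transformers.modeling_utils import (
            adapt_transformers_to_gaudi,
        )

        adapt_transformers_to_gaudi()
        # ... load the model after the patch is in place ...

    def generate(self, prompt, **config):
        # ... run generation with the patched transformers ...
        return prompt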

Comment on lines 187 to 188
# Q2: Why always use the first worker?
return ray.get(self.deepspeed_workers[0].get_streamer.remote())
Contributor:
Hi @xwu99, could you please help explain this question? I think it is the same idea as in deepspeed_predictor.py.

Contributor (@xwu99), Jul 2, 2024:

This is distributed inference involving a group of worker processes. Worker 0 is assigned rank 0 (following the standard MPI model), which is the main rank that returns the result; the other ranks only take part in the computation and do not return anything. In fact, all ranks hold the same result in this case. You can think of rank 0 as the head process of the distributed worker group, which is conventionally used to return the result.
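A minimal sketch of the rank-0 convention described above, assuming the workers are Ray actors whose generate calls cooperate through collectives (the function name is illustrative):

import ray

def distributed_generate(deepspeed_workers, prompt, **config):
    # All ranks run the same forward pass on their shard of the model and end
    # up holding the same output, so any rank could return it.
    futures = [w.generate.remote(prompt, **config) for w in deepspeed_workers]
    results = ray.get(futures)
    # By convention rank 0 (the first worker) is the head of the group, so
    # its result is the one handed back to the caller.
    return results[0]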

Comment on lines 230 to 231
# Q2: Why always use the first worker?
self.deepspeed_workers[0].streaming_generate.remote(prompt, streamer, **config)
Contributor:
@xwu99 Same here.

@xwu99 requested a review from yutianchen666, July 5, 2024 01:31
llm_on_ray/inference/serve.py: outdated review comment (resolved)