
[Inference] Add debug mode #257

Open · wants to merge 6 commits into main

Conversation

@Deegue (Contributor) commented Jun 20, 2024

Debug mode with logs and some improvements & questions.
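A minimal sketch of what a debug switch like this could look like, purely for illustration (the --debug flag and logger setup below are assumptions, not the exact changes in this PR):

import argparse
import logging

logger = logging.getLogger("llm_on_ray.inference")

def setup_logging(debug: bool) -> None:
    # Debug mode switches on verbose logs (prompts, generation config, worker
    # dispatch); otherwise the default INFO level is kept.
    logging.basicConfig(
        level=logging.DEBUG if debug else logging.INFO,
        format="%(asctime)s %(name)s %(levelname)s %(message)s",
    )

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--debug", action="store_true", help="enable verbose inference logs")
    args = parser.parse_args()
    setup_logging(args.debug)
    logger.debug("Debug mode enabled.")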

@@ -219,7 +225,9 @@ def generate(self, input: GenerateInput, **config) -> GenerateOutput:

    def streaming_generate(self, prompt, streamer, **config):
        self._process_config(config)
        # Q1: Why is this handled here when using both deepspeed and hpu?
        if self.infer_conf.deepspeed:
Contributor Author (@Deegue):
Here in hpu_predictor.py, it is a little bit confusing since we have another predictor called deepspeed_predictor.
The two predictors are for hpu and cpu; maybe we can rename deepspeed_predictor to something like cpu or base predictor.

Contributor:
There is a TODO comment to consolidate these two predictors.

@@ -196,6 +200,8 @@ def generate(self, input: GenerateInput, **config) -> GenerateOutput:

        self._process_config(config)

        # TODO: Maybe we should get real-time load info for all cards, set a healthy
        # usage ratio, and pick the usable cards for serving, so that errors like OOM
        # can be prevented and the server becomes more robust.
        if self.infer_conf.deepspeed:
            return ray.get(
                [worker.generate.remote(prompt, **config) for worker in self.deepspeed_workers]
Contributor Author (@Deegue):
Instead of always using a fixed worker, maybe we should spread the load across all cards when deepspeed is enabled.

Contributor (@xwu99), Jul 2, 2024:

Currently tensor parallelism is used to process a single request; one process per card is the industry best practice, given that the load is balanced. You might think each request is sent to a different card, but that is not the case.
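To make the distinction concrete, here is a hedged sketch of the two dispatch styles (the helper names are made up; deepspeed_workers and replica_workers stand in for the Ray actor handles from the diff above, and the data-parallel variant is hypothetical, not what the code does):

import itertools

import ray

def generate_tensor_parallel(deepspeed_workers, prompt, **config):
    # Tensor parallelism (as in the diff above): every request goes to all
    # workers, each holding a shard of the model, one process per card;
    # they compute the result cooperatively.
    return ray.get(
        [worker.generate.remote(prompt, **config) for worker in deepspeed_workers]
    )

def make_data_parallel_generate(replica_workers):
    # Data parallelism ("spread the load to all cards"): each request is
    # routed to a single worker holding a full model replica. This is NOT
    # how the deepspeed path here works.
    cycle = itertools.cycle(replica_workers)

    def generate(prompt, **config):
        worker = next(cycle)
        return ray.get(worker.generate.remote(prompt, **config))

    return generate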

@xwu99 requested a review from KepingYan, July 1, 2024 02:21
@xwu99 (Contributor) commented Jul 1, 2024

@KepingYan could you help review this and discuss with @Deegue?

Comment on lines +90 to +94
# optimize transformers for gaudi
from optimum.habana.transformers.modeling_utils import adapt_transformers_to_gaudi

adapt_transformers_to_gaudi()

Contributor:
Why move this function here?

Contributor Author (@Deegue):
This moves the function out of the if block, so it is executed both with and without deepspeed.

Contributor:
As I understand from the PR title, this PR is to add a debug mode, so why touch other code? Could you submit a separate PR to address the other issues?

Comment on lines -287 to -290
# optimize transformers for gaudi
from optimum.habana.transformers.modeling_utils import adapt_transformers_to_gaudi

adapt_transformers_to_gaudi()
Contributor:
If this function is not executed in every worker, will it work as expected?

Contributor Author (@Deegue):
Same as above, this function will be executed earlier.
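For context, a hedged sketch of the concern in this thread, assuming the deepspeed workers are Ray actors running in separate processes (the HpuWorkerSketch class is hypothetical, not the PR's actual predictor): if the patch has to take effect in the process that loads the model, one common pattern is to apply it inside each worker's __init__.

import ray

@ray.remote
class HpuWorkerSketch:
    # Illustrative only: applies the Gaudi patch inside each worker process.

    def __init__(self):
        # Each Ray actor runs in its own process, so a monkey-patch applied
        # only in the driver is not automatically visible here.
        from optimum.habana.transformers.modeling_utils import (
            adapt_transformers_to_gaudi,
        )

        adapt_transformers_to_gaudi()
        # ... load the model after the patch is in place ...

    def generate(self, prompt, **config):
        # ... run generation with the patched transformers ...
        return prompt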

Comment on lines 187 to 188
# Q2: Why always use the first worker?
return ray.get(self.deepspeed_workers[0].get_streamer.remote())
Contributor:
Hi @xwu99, could you please help explain this question? I think it is the same idea as in deepspeed_predictor.py.

Contributor (@xwu99), Jul 2, 2024:

This is distributed inference involving a group of worker processes. Worker 0 is assigned rank 0 (following the standard MPI model), which is the main rank that returns the result; the other ranks only take part in the computation and do not return anything. In fact, all ranks hold the same result in this case. You can think of rank 0 as the head process of the distributed worker group, which is conventionally used to return the result.
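A minimal sketch of the rank-0 convention described above, assuming the workers are Ray actors whose generate calls cooperate through collectives (the function name is illustrative):

import ray

def distributed_generate(deepspeed_workers, prompt, **config):
    # All ranks run the same forward pass on their shard of the model and end
    # up holding the same output, so any rank could return it.
    futures = [w.generate.remote(prompt, **config) for w in deepspeed_workers]
    results = ray.get(futures)
    # By convention rank 0 (the first worker) is the head of the group, so
    # its result is the one handed back to the caller.
    return results[0]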

Comment on lines 230 to 231
# Q2: Why always use the first worker?
self.deepspeed_workers[0].streaming_generate.remote(prompt, streamer, **config)
Contributor:
@xwu99 Same here.

@xwu99 requested a review from yutianchen666, July 5, 2024 01:31
llm_on_ray/inference/serve.py: outdated review comment (resolved)