[Inference] Add debug mode #257
base: main
Conversation
@@ -219,7 +225,9 @@ def generate(self, input: GenerateInput, **config) -> GenerateOutput:

    def streaming_generate(self, prompt, streamer, **config):
        self._process_config(config)
        # Q1: Why is this handled here when using both deepspeed and hpu?
        if self.infer_conf.deepspeed:
Here in hpu_predictor.py it is a little confusing, since we have another predictor called deepspeed_predictor.
The two predictors are for HPU and CPU; maybe we can rename deepspeed_predictor to something like cpu_predictor or base_predictor.
there is a TODO comment to consolidate these two predictors.
@@ -196,6 +200,8 @@ def generate(self, input: GenerateInput, **config) -> GenerateOutput:

        self._process_config(config)

        # TODO: Maybe we should get realtime load info of all cards, set a healthy usage ratio
        # and pick the usable cards for serving, so that errors like OOM can be prevented
        # and the server will be more robust.
        if self.infer_conf.deepspeed:
            return ray.get(
                [worker.generate.remote(prompt, **config) for worker in self.deepspeed_workers]
Instead of using a fixed worker, maybe we should spread the load across all cards when deepspeed is enabled.
Currently tensor parallelism is used to process a single request; one process per card is the industry best practice, given that the load is balanced. You might think each request is sent to a different card, but that is not the case.
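A minimal sketch of that dispatch pattern, assuming Ray actor handles like the deepspeed_workers list in the diff above (the helper function below is illustrative, not code from this repo):

import ray

def tensor_parallel_generate(deepspeed_workers, prompt, **config):
    # Every rank must receive the identical request, because each worker only
    # holds a shard of the model and the forward pass is a collective operation.
    futures = [worker.generate.remote(prompt, **config) for worker in deepspeed_workers]
    results = ray.get(futures)
    # After the collective ops every rank holds the same decoded output,
    # so taking the first element (rank 0) is enough.
    return results[0]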
Signed-off-by: Yizhong Zhang <[email protected]>
@KepingYan could you help review this and discuss with @Deegue?
# optimize transformers for gaudi
from optimum.habana.transformers.modeling_utils import adapt_transformers_to_gaudi

adapt_transformers_to_gaudi()
Why move this function here?
Move this function out of this if:
if infer_conf.deepspeed:
This function should be executed both with and without deepspeed.
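A sketch of the suggested restructuring, with the surrounding predictor code reduced to a stub (init_model and the infer_conf argument are placeholders for the real setup code, not this repo's actual function):

from optimum.habana.transformers.modeling_utils import adapt_transformers_to_gaudi

def init_model(infer_conf):
    # optimize transformers for gaudi before branching, so both the deepspeed
    # and the non-deepspeed paths see the patched transformers
    adapt_transformers_to_gaudi()
    if infer_conf.deepspeed:
        pass  # deepspeed initialization path (unchanged)
    else:
        pass  # single-card HPU initialization path (unchanged)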
As I understand from the PR title, this PR is to add a debug mode, so why touch other code? Could you submit a separate PR to address the other issues?
# optimize transformers for gaudi
from optimum.habana.transformers.modeling_utils import adapt_transformers_to_gaudi

adapt_transformers_to_gaudi()
If this function is not executed in every worker, will it work as expected?
Same as above, this function will be executed earlier.
# Q2: Why always use the first worker?
return ray.get(self.deepspeed_workers[0].get_streamer.remote())
Hi @xwu99, could you please help explain this question? I think it is the same idea as in deepspeed_predictor.py.
This is distributed inference involving a group of worker processes. Worker 0 is assigned rank 0 (following the standard MPI model), which is the main rank that returns the result; the other ranks only take part in the computation and do not return results. In fact, all ranks hold the same result in this case. You can think of rank 0 as the head process of the distributed worker group, which is conventionally used to return the result.
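A self-contained sketch of that head-rank convention using plain Ray actors (InferenceWorker is a stand-in for this repo's workers, not its actual implementation):

import ray

ray.init(ignore_reinit_error=True)

@ray.remote
class InferenceWorker:
    """Stand-in for one tensor-parallel rank."""

    def __init__(self, rank: int):
        self.rank = rank
        self.tokens = []

    def streaming_generate(self, prompt: str) -> None:
        # Every rank runs the sharded forward pass; every rank could stream
        # the same tokens, but only rank 0's stream is consumed.
        self.tokens = [f"token-{i}" for i in range(3)]

    def get_streamer(self):
        return list(self.tokens)

workers = [InferenceWorker.remote(rank) for rank in range(4)]

# All ranks participate in generation so the collective ops can complete...
ray.get([w.streaming_generate.remote("hello") for w in workers])

# ...but the stream is read back from rank 0 only, the head of the group,
# as with deepspeed_workers[0] in the hunks quoted above.
print(ray.get(workers[0].get_streamer.remote()))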
# Q2: Why always use the first worker?
self.deepspeed_workers[0].streaming_generate.remote(prompt, streamer, **config)
@xwu99 Same here.
Adds a debug mode with logs, plus some improvements and questions.