-
Notifications
You must be signed in to change notification settings - Fork 63
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DO-NOT-MERGE: discussion of common helm chart's probe/resources #431
Conversation
for more information, see https://pre-commit.ci
|
||
- Common components' value files include different probe timing sections for CPU and for accelerators | ||
|
||
- Their deployment templates select one based on .Values.accelDevice value (empty for CPU) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agreed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll come up with a PR which selects better probe timings (and custom metrics) based on *.accelDevice
values (hopefully early next week).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree that we need to set different values for different device, especially for CPU some models startup are really slow.
What's the problem if we don't introduce the extra *.accelDevice, and use current variables .Values.livenessProbe/.Values.readinessProbe/.Values.startupProbe in each chart?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the problem if we don't introduce the extra *.accelDevice, and use current variables .Values.livenessProbe/.Values.readinessProbe/.Values.startupProbe in each chart?
Intent of the probes
The intent of startup and liveness probes is to restart deadlocked (or otherwise frozen) pods, on assumption that restart gets them working again. I.e. startup and liveness probes make sense if there are deadlock/freezing problems with the service, and startup actually gets rid of the problem (eventually scheduler backs off from re-starting the failing instance, which at least gets rid of its resource usage).
Readiness probes are intended to avoid routing traffic to services that are temporarily down.
Issues with too short timings
Issues that I have seen with current OPEA probe values, when multiple instances get scheduled on same Xeon:
- Startup: Scaled up instances never serve any queries, they get restarted because their startup does not finish in time. Meaning that they just consume CPU on re-startups and slow down the already working inference service instances instead of doing useful work
- Liveness: Stressed services that are working fine, just a bit slow, get restarted, which similarly to startup probes, just makes the situation worse
- Readiness: Probe failures keep all instances mostly in non-Ready state i.e. Service object does not route any queries to them, instead they get buffered to single overworked instance, which increases the latencies
Situation was noticeably improved by increasing the probe timings.
I haven't seem any deadlocks, so at least in my setup both startup & liveness probes are just harmful. Readiness probes can be useful though, if they're fine-tuned to balance traffic from overworked (non-Ready) instances to ones with free capacity (in Ready state).
Issues with too long timings
If restart can fix the issue, and timings are too long, fixing the issue is unnecessarily delayed. But if no traffic is routed to Pod due to its non-Ready state, that's just potential performance decrease, not a functional issue.
However, if pod keeps in Ready state despite Liveness probe failing, it means more queries being routed to non-working pod and being lost when it is eventually restarted => Liveness test should not be something that can fail although readiness test succeeds, and as a subset of that requirement, Liveness probe timeouts should not be shorter than Readiness ones ( they could be longer though)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when multiple instances get scheduled on same Xeon:
Partly this is a problem of current OPEA Helm charts missing suitable per-model resource requests / affinity / scheduler topology / NRI policies. If each model + service arg + device combo would specify well enough their resource needs, and they were isolated enough from each other when running on CPU, there would be less need for increasing the default probe timings for CPU so much.
|
||
- GMC device variant manifests are generated for all relevant components, not just TGI | ||
|
||
(I don't think probe timings would need to be fine-tuned based on which model is used.) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this true? A large model may need more time to be ready, compared to a small model
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While there naturally is difference between them, I was thinking that for models fitting to a single Gaudi, current default probe values could be enough.
If warmup/startup is extraordinarily slow with some larger model, probe timings can always be overridden in the <component>/<device>/<model>.yaml
file, but that can come later, after the separate per-model resource usage files are there.
PS. CPU side is more problematic, because there are so many additional variables affecting the perf (starting from underlying node HW differences, isolation and scheduling policies in effect), but there's not much that can be done. Having separate files for every different config would not practical (too many files), but there can be some additional documentation on what kind of (manual) fine-tuning user may need to do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The startup time varies a lot, for different CPU models combined with different AI models/Size. And the tgi version matters too per my limited observation.
However, for a dedicated AIExample, we'll have default model/tgi, the default probe timing can be set accordingly and remind in the docs for the possibility of tuning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The startup time varies a lot, for different CPU models combined with different AI models/Size. And the tgi version matters too per my limited observation.
AFAIK there's a magnitude of performance difference between accelerator devices (real ones, not e.g. iGPUs and small NPUs) and CPUs for inferencing.
@yongfengdu Are you saying that if you use largest model (fitting to single device) listed in OPEA docs on Gaudi TGI, and smallest model listed in OPEA docs on on Xeon TGI, latter starts faster because there's such a large difference between those models?
|
||
- Their deployment templates select one based on .Values.accelDevice value (empty for CPU) | ||
|
||
- All <device>-values.yaml files set appropriate <subchart>.accelDevice value (not just ChatQnA) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So we can use .accelDevice to control all the k8s resource spec variant, and top level subchart only have to set that .accelDevice ?
-f common/teirerank/gaudi/bge-reranker-base.yaml | ||
-f common/tei/cpu/bge-base.yaml | ||
-f common/data-prep/gpu/values.yaml | ||
(These would provide values with subchart prefix/heading so they can be used from top-level charts) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From top level chart's perspective, a common tgi chart can be used twice, i.e. one for chat completion, another for guardrails, so providing subchart prefix/heading in common/tgi/gaudi/neural-chat-7b.yaml may not be enough, we may need to ask end user to use '--set-file subchart_prefix= common/tgi/gaudi/neural-chat-7b.yaml' in the top level chart deployment.
One more question @yongfengdu , will this affect repo chart if the end users try to install helm chart not from source code, but from chart repo?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There can just be two files that are identical except for the subchart name (separate files in case somebody wants to use different models for them:
- common/tgi/gaudi/neural-chat-7b.yaml
- common/tgi/gaudi/neural-chat-7b-guardrails.yaml
As they're otherwise identical, generating/updating the extra files should be automated:
for dev in cpu gaudi; do
models=$(ls $dev/ | grep -v guardrails | sed s/.yaml//)
for model in $models; do
sed s/tgi:/tgi-guardrails:/ $dev/$model.yaml > $dev/$model-guardrails.yaml
done
done
Btw. Are there other components than TGI, that may be used under multiple names within the same chart?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, we actually reuse the the llm-uservice chart for both codegen/docsum with different images set in the top lelve chart. Also we're thinking of reusing llm-uservice chart for all the different variants for llm test-generation/. But since those services are quite simple, we can make them follow the accelDevice way to have cmmon chart's value files contains all the variants' config, and top level chart only have to specify the variant name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, so only two components (tgi & llm-uservice) are currently used with multiple aliases.
GMC needs separate manifest files to be generated for each of these subservices, so that it can apply correct resources requests when their models differ and/or are changed.
=> The script generating the GMC manifests files could compose their names based on the subchart name used within the Helm yaml override file, and what model is specified there. But for consistency it may better if every file name (at least for these components) includes the subchart name used within it (not just for some of the files). For example:
- common/tgi/gaudi/tgi.neural-chat-7b.yaml
- common/tgi/gaudi/tgi-guardrails.neural-chat-7b.yaml
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, so only two components (tgi & llm-uservice) are currently used with multiple aliases.
That's just 2 examples. Multiple services will be llm-uservice alike, but most of them will not depends on any specific model, they're intermediate services which talks to multiple different backends which is related to specific models.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using helm repo usage should be tested, as I know, helm package will package all the files in the directory, but with tar.gz format, not sure if we can specify files in a tar ball.
Besides, the proposed values.yaml structure looks too complex to me(As a developer), not mention to end users. I would prefer to define dedicated scenario and include all OPT values in one single values.yaml file (Like Gaudi-4node-neural-7b.yaml).
Keep the default values.yaml as simple as it could be, for users to start using quickly with suboptimal settings.
BTW, besides tgi/llm-uservice aliases, the tei/teirerank actually are the same with different chart name, they can be merged with different aliases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Besides, the proposed values.yaml structure looks too complex to me(As a developer), not mention to end users.
Currently user needs to:
- Change models to ones he prefers
- Run the current non-working OPEA Helm chart
- Profile resource usage, and test suitable probe timings
- Update resource usage + probe timing values to Helm charts to get working OPEA install
With this proposal, OPEA project would do those steps for the user, he just needs to select files matching the models he's interested on. How that would be harder for the user?
(Above process is not needed for models that fit into single Gaudi, because Gaudi cannot be shared, and Gaudi drivers do not have significant CPU side usage. However, OPEA Helm is supposed to support also CPU installs.)
I would prefer to define dedicated scenario and include all OPT values in one single values.yaml file (Like Gaudi-4node-neural-7b.yaml). Keep the default values.yaml as simple as it could be, for users to start using quickly with suboptimal settings.
Top level chart can naturally have some default <device>-values.yaml
duplicating device/model files content for its subcharts. However, IMHO they should be autogenerated from component files, so that there's only single place that needs to be updated whenever given (model version and corresponding service args, resource values) etc are changed.
BTW, besides tgi/llm-uservice aliases, the tei/teirerank actually are the same with different chart name, they can be merged with different aliases.
I see that they use the same image, just with with different model, but why teirerank
one does not have Gaudi values file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer to define dedicated scenario and include all OPT values in one single values.yaml file (Like Gaudi-4node-neural-7b.yaml).
@yongfengdu That won't work for GMC. Because GMC is used for changing the model, it will need also to have information about other things that need to be changed with the model (args, resource requests etc).
Mismatching or missing pod resources mean:
- Too small resource limits: pods will be CPU throttled and outright killed as they go over their memory limits
- Too large resource requests: pods won't fit to nodes, or nodes are badly utilized (good utilization is important production cluster criteria)
- No requests/limits: pod has worst QoS (other pods have precedence over it): https://kubernetes.io/docs/concepts/workloads/pods/pod-qos/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Above process is not needed for models that fit into single Gaudi, because Gaudi cannot be shared, and Gaudi drivers do not have significant CPU side usage. However, OPEA Helm is supposed to support also CPU installs.)
Note that at least TEI can use much more CPU side memory, that what it uses on device side: huggingface/text-embeddings-inference#280
In next TEI release, it should be possible to affect this by limiting number of CPUs available for TEI: huggingface/text-embeddings-inference#410
Btw. Here are parameters used by NIM Helm installations: https://github.com/NVIDIA/nim-deploy/blob/main/helm/nim-llm/README.md#parameters |
Description
The summary of the proposed changes as long as the relevant motivation and context.
Issues
List the issue or RFC link this PR is working on. If there is no such link, please mark it as
n/a
.Type of change
List the type of change like below. Please delete options that are not relevant.
Dependencies
List the newly introduced 3rd party dependency if exists.
Tests
Describe the tests that you ran to verify your changes.