Support alternative metrics on accelerated TGI / TEI instances #454
Conversation
@eero-t could you double check the CI failure on xeon with
Failure in the CI CPU values file test is for the TGI service response check: https://github.com/opea-project/GenAIInfra/actions/runs/10991943008/job/30556825567?pr=451 This PR changes only HPA files and CI skips HPA testing, so the reason for the failure is some change done before this PR.

What the CPU values file does

The CPU values file overrides deployment probe timings and resources from the defaults: https://github.com/opea-project/GenAIInfra/blob/main/helm-charts/common/tgi/values.yaml#L58 This is so that the ChatQnA services can be scaled better without failing, as the scheduler knows how much resources they need, and kubelet uses a higher QoS class for them: https://kubernetes.io/docs/tasks/configure-pod-container/quality-service-pod/

Increased probe timings cannot be the cause for the failure, and memory limits should be large enough not to be an issue either. CPU limits are quite tight though (8 CPUs), to allow running multiple TGI instances on the same node, e.g. with NRI policies.

Reason for the failure

The CI test checks TGI log output, but appears to be doing it too soon. CPU limits can explain a slowdown from the default (no resources specified), compared to a situation where only a single TGI instance is running / the node has plenty of resources free. But the CPU values file passed CI earlier, so something else has changed since it was merged; either in the CI setup, or in the TGI setup (other than the values specified in the CPU file).

Potential workarounds

Things that can be done (in some other PR):
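To illustrate the values-file description above: a minimal sketch of the kind of overrides such a CPU values file applies. The real values are in the linked values.yaml; apart from the 8-CPU limit mentioned above, the key names and numbers below are illustrative assumptions.

```bash
# Illustrative only: write an example override file and apply it to the TGI chart.
# Probe timings and memory sizes are placeholders, not the repository's values.
cat > cpu-values-example.yaml <<'EOF'
resources:
  requests:
    cpu: 8
    memory: 64Gi        # placeholder
  limits:
    cpu: 8              # tight limit, so several TGI instances fit on one node
    memory: 64Gi        # placeholder
livenessProbe:
  timeoutSeconds: 8     # placeholder for the increased probe timings
readinessProbe:
  timeoutSeconds: 8     # placeholder
EOF
helm install tgi ./helm-charts/common/tgi -f cpu-values-example.yaml
```

Setting requests equal to limits like this is also what moves the pods into the Guaranteed QoS class mentioned in the linked Kubernetes documentation.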
Another possibility is the latest image updated from Docker Hub, which might cause different behavior.
Reviewed the tests that were run for all PRs merged since => this and #451 are the first PRs for which CI tested it.

Additionally, when looking at the passing Helm E2E tests in that merged #386 PR, e.g. for "xeon, values, common": https://github.com/opea-project/GenAIInfra/actions/runs/10800184120/job/29957611425?pr=386 there are a lot of "Couldn't connect to server" errors (e.g. for data-prep), but the test passed despite failures in functionality not touched by that PR. Then, when looking at the same passing test in later PRs, the tests no longer log what they are doing: https://github.com/opea-project/GenAIInfra/actions/runs/10940026240/job/30371535827?pr=444

I.e. it seems that the CI tests cannot be trusted to guarantee that any of the functionality is working, or to catch when that functionality actually broke. => Somebody (not me) needs to bisect repo content in the CI environment to see when things actually broke.
Additionally, the E2E tests pass even if the log file content was wrong, when the log file had the wrong name; see:
This is a workaround in the testpod for the case where some OPEA services' probe health check status doesn't get reflected in the K8s service availability quickly enough. Sleeping for a short period of time between `helm install --wait` and `helm test` could be another option.
What the test does is in each Helm chart's
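A minimal sketch of the sleep option mentioned above, assuming a hypothetical release named `chatqna` installed from this repo's chart (the names and the sleep length are assumptions):

```bash
# Install, wait for pods, then pause briefly so service endpoints catch up before testing.
helm install chatqna ./helm-charts/chatqna --wait --timeout 10m
sleep 60          # arbitrary grace period between install and test
helm test chatqna
```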
OK, I root-caused this issue. GenAIComps PR 608 plus cpu-values significantly increases the test run time, which triggers the default 5m timeout of
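Assuming the 5m timeout referred to is Helm's own default, it can be raised per invocation; a sketch with a hypothetical release name:

```bash
# Helm's default per-operation timeout is 5m0s; give slow TGI startups more headroom.
helm test chatqna --timeout 15m
```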
DocSum defaults to the same model as ChatQnA, and the default model used by CodeGen + CodeTrans is also a 7b one, so the tgi.accelDevice impact is assumed to be close enough. Signed-off-by: Eero Tamminen <[email protected]>
How much (The values I've used may be too small. At that time I had only a couple of Xeon nodes for scaling testing, and smaller values allowed stuffing more instances onto a single Xeon node. However, I also used NRI to isolate pods better, which is not the default k8s setup...)
Tests should be predictable and fail on errors, not paper over them.
The script should wait until the pod is ready to accept input before trying to use it, or fail the test if the wait times out. For example, this fails unless there are matching TGI pods, and all of them are in
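Presumably a wait along these lines; the label selector and timeout below are assumptions:

```bash
# Fails when no pods match the selector, or when they do not all become Ready in time.
kubectl wait pod --for=condition=Ready -l app.kubernetes.io/name=tgi --timeout=300s
```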
That's not really good enough. One would need to dig out all the relevant test template files from the given PR branch, go through all the relevant values files, and deduce the results for all variables & conditionals in the template. That's way too much work and error prone. Tests should log what they're actually testing, otherwise they're not worth much. Without logging it's a nightmare to verify that they correctly test everything that's relevant (+ preferably skip completely irrelevant things), and to debug failures.
the current
That can be done in another PR.
Oh, that waits also for Ready state. Yes, that would work:
EDIT: compared to Helm [1], as a service might fail due to another service it uses not being ready yet, and those dependencies, or their names, (potentially) changing based on which Helm chart is being tested.
It looks to me like we shouldn't use `helm install -f cpu-values.yaml` for the test, since cpu-values.yaml is for HPA usage.
Maybe we can consider renaming it to cpu-hpa-values.yaml or something.
No. @yongfengdu
In addition to pods not getting scheduled properly without appropriate resource requests, without resources they also go to the lowest QoS class, i.e. they get throttled & evicted first from the nodes. PS.
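Regarding the QoS point above, the class a pod ended up in can be checked directly; a small sketch, assuming TGI pods carry an `app.kubernetes.io/name=tgi` label:

```bash
# Shows BestEffort for pods with no resources set, Burstable/Guaranteed otherwise.
kubectl get pods -l app.kubernetes.io/name=tgi \
  -o custom-columns='NAME:.metadata.name,QOS:.status.qosClass'
```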
Description
Changes:
- `accelDevice` Helm variables also in other charts using accelerated TGI / TEI
- Queue size metric is a bit better for scaling TGI Gaudi instances than the one used for Xeon scaling, and queries for it are simpler. The HPA rule updates also act as examples of using the different HPA algorithm types (`Value` vs. `AverageValue`); a sketch of both follows below.
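For illustration, a minimal sketch of the two target types in an autoscaling/v2 HPA metrics block; the metric name, Service name, and target numbers are placeholders, not this PR's actual rules:

```bash
# Illustrative only: two ways to express the scaling target for a custom metric.
cat > hpa-metric-sketch.yaml <<'EOF'
metrics:
# "Value": compare the metric of one object (here a Service) against an absolute target.
- type: Object
  object:
    metric:
      name: tgi_queue_size       # placeholder metric name
    describedObject:
      apiVersion: v1
      kind: Service
      name: tgi
    target:
      type: Value
      value: "10"
# "AverageValue": divide the metric by the current replica count before comparing.
- type: Pods
  pods:
    metric:
      name: tgi_queue_size
    target:
      type: AverageValue
      averageValue: "10"
EOF
```

With `Value` the comparison is against the metric as-is (e.g. total queue size behind a Service), while `AverageValue` divides it by the current replica count, which changes how aggressively the HPA scales.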
Issues
n/a.

Type of change
Dependencies
n/a.

Tests
Did manual testing of HPA scaling of 1-4 TGI Gaudi instances (4 = maxReplicas value), using different TGI-provided metrics, and batch sizes.
PS. None of the TGI / TEI metrics seem that good for scaling; their values mostly indicate just whether the given component is stressed, not how slow the resulting (e.g. ChatQnA) queries are. Meaning that the HPA rule causes scale-up when the components get loaded, and drops the extra instances when the load goes almost completely away, but that scaling can be quite far from a really optimal one.
Something like opea-project/GenAIExamples#391 (and fine-tuning the HPA values for a given model, data type, OPEA version etc.) would be needed for more optimal scaling.