Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ChatQnA: accelerate also teirerank with Gaudi #475

Merged
merged 4 commits into from
Oct 25, 2024

Conversation

eero-t
Copy link
Contributor

@eero-t eero-t commented Oct 15, 2024

Description

Accelerate also teirerank with Gaudi, not just tei.

When reranking is used, it does not make sense (performance-wise) to accelerate just tei, as reranking is a larger bottleneck.

Issues

#486

Type of change

  • Bug fix (non-breaking change which fixes an issue)

Dependencies

n/a.

Tests

Manually checked the ChatQnA throughput.

@eero-t eero-t marked this pull request as draft October 15, 2024 14:10
@eero-t
Copy link
Contributor Author

eero-t commented Oct 15, 2024

Marking as draft because I'm not sure in which other places this should be added.

(I can add common/teirerank/gaudi-values.yaml, but IMHO guardrails-gaudi-values.yaml files should be something added on top of gaudi-values.yaml, not independent.)

And because I'm not sure what the recent changes in there other GenAI repos imply on reranking use...

@eero-t
Copy link
Contributor Author

eero-t commented Oct 15, 2024

Looking at the CI fails, CI seems to be currently in rather broken state:

  • pre-commit.ci fails due to CI check for Python doc formatting not being configured properly
  • Change to gaudi-values.yaml cannot be cause for the Xeon failures
  • PR does not touch guardrail files yet either

@yongfengdu
Copy link
Collaborator

I've fixed the pre-commit failure.
Guardrail failure should be caused by other changes(Docker or Python), I'm debugging it but maybe we can disable it first to make the CI work.
Please hold a while until this one merged.#473
Besides the guardrail fail, there is another relative path check failed, I'm sure it's not due to the code itself, but https://github.com/yongfengdu/GenAIInfra/blob/ci-fix/.github/workflows/pr-path-detection.yml

@yongfengdu
Copy link
Collaborator

The Xeon failure is caused by opea-project/GenAIExamples#891, which removed the use of microservice layers for LLM and Embedding. #474 is the follow up change for helm-charts.

For the guardrails-gaudi-values.yaml, unfortunately there is no way to include gaudi-values.yaml in helm chart, so it's ok to do duplicate changes there, or just keep it as is(Guardrail case still use CPU for reranking).

You'll have to rebase with latest change to continue.

@eero-t
Copy link
Contributor Author

eero-t commented Oct 16, 2024

For the guardrails-gaudi-values.yaml, unfortunately there is no way to include gaudi-values.yaml in helm chart, so it's ok to do duplicate changes there, or just keep it as is(Guardrail case still use CPU for reranking).

I added gaudi-values.yaml to common/teirerank/, but did not touch guardrails, as I haven't done any performance testing for it.

You'll have to rebase with latest change to continue.

Thanks, done!

@eero-t eero-t marked this pull request as ready for review October 16, 2024 12:46
@eero-t
Copy link
Contributor Author

eero-t commented Oct 16, 2024

I dropped the PR draft status, but I haven't tested this yet with the ChatQnA "nowrapper" changes that were merged after I filed this. I would expect rerank perf to be even more important after its wrapper service is not providing extra buffering / slowdown though...

@eero-t
Copy link
Contributor Author

eero-t commented Oct 16, 2024

As Gaudi rerank worked fine for me, CI failure for it could be result of the later nowrapper changes:

2024-10-16T13:00:32.007226Z  INFO text_embeddings_backend_python::management: backends/python/src/management.rs:136: Python backend process terminated
Error: Error when doing warmup

Caused by:
    Could not start backend: max_warmup_length (1024) exceeds model's max_input_length (512), you can modify this value adding `-e MAX_WARMUP_SEQUENCE_LENGTH=<new_warmup_length>` to your Docker run command

I think it's another bug in CI though, of only specifying one of the 2 related TEI options.

@eero-t
Copy link
Contributor Author

eero-t commented Oct 18, 2024

    Could not start backend: max_warmup_length (1024) exceeds model's max_input_length (512), you can modify this value adding `-e MAX_WARMUP_SEQUENCE_LENGTH=<new_warmup_length>` to your Docker run command

I think it's another bug in CI though, of only specifying one of the 2 related TEI options.

This is not a bug in CI, but with git HEAD, max warmup (matching specified max input length) is given only for tei, not teirerank: #483

@eero-t
Copy link
Contributor Author

eero-t commented Oct 18, 2024

Rebased teirerank config fix as first one, so that every commit works.

This PR does not change anything related to guardrails, but still that test fails, due to another CI bug:

#kubectl logs chatqna20241018105227-testpod -n chatqna-20241018105227
curl: (22) The requested URL returned error: 500
Internal Server Errorcurl failed with code 22

Fail is due to ChatQnA timeouting on guardrails:

-----------Pod: chatqna20241018105227-6897bff66b-cm9sv---------
#kubectl describe pod chatqna20241018105227-6897bff66b-cm9sv -n chatqna-20241018105227
...
     Image:           100.83.111.229:5000/opea/chatqna-guardrails:latest
...
INFO:     10.244.164.179:36050 - "POST /v1/chatqna HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/aiohttp/connector.py", line 1091, in _wrap_create_connection
    sock = await aiohappyeyeballs.start_connection(
...
 TimeoutError: [Errno 110] Connect call failed ('10.103.241.79', 80)
...
 aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host chatqna20241018105227-guardrails-usvc:80 ssl:default [Connect call failed ('10.103.241.79', 80)]

Although that service gets (eventually) to Ready state and its log shows now errors:

NAME                                                     READY   STATUS      RESTARTS       AGE     IP               NODE                NOMINATED NODE   READINESS GATES
...
chatqna20241018105227-guardrails-usvc-7658584b7c-4d25z   1/1     Running     0              3m22s   10.244.164.175   aise-cluster-01-3   <none>           <none>
...
-----------Pod: chatqna20241018105227-guardrails-usvc-7658584b7c-4d25z---------
#kubectl describe pod chatqna20241018105227-guardrails-usvc-7658584b7c-4d25z -n chatqna-20241018105227
...
    Liveness:        http-get http://:guardrails-usvc/v1/health_check delay=5s timeout=1s period=5s #success=1 #failure=24
    Readiness:       http-get http://:guardrails-usvc/v1/health_check delay=5s timeout=1s period=5s #success=1 #failure=3
    Startup:         http-get http://:guardrails-usvc/v1/health_check delay=5s timeout=1s period=5s #success=1 #failure=120
...
  Type     Reason     Age    From               Message
  ----     ------     ----   ----               -------
  Normal   Created    3m11s  kubelet            Created container chatqna20241018105227
  Normal   Started    3m11s  kubelet            Started container chatqna20241018105227
  Warning  Unhealthy  3m5s   kubelet            Startup probe failed: Get "http://10.244.164.175:9090/v1/health_check": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
...
-----------Pod: chatqna20241018105227-guardrails-usvc-7658584b7c-4d25z---------
#kubectl describe pod chatqna20241018105227-guardrails-usvc-7658584b7c-4d25z -n chatqna-20241018105227
...
#kubectl logs chatqna20241018105227-guardrails-usvc-7658584b7c-4d25z -n chatqna-20241018105227
...
INFO:     100.83.111.245:34224 - "GET /v1/health_check HTTP/1.1" 200 OK
INFO:     10.244.164.137:50580 - "POST /v1/guardrails HTTP/1.1" 200 OK
INFO:     100.83.111.245:52736 - "GET /v1/health_check HTTP/1.1" 200 OK

=> CI runs the query before verifying that all necessary backend pods (at least TGI & TEI) have reached Ready state?

@eero-t
Copy link
Contributor Author

eero-t commented Oct 18, 2024

=> CI runs the query before verifying that all necessary backend pods (at least TGI & TEI) have reached Ready state?

@lianhao Has this not been fixed in CI yet: #454 (comment) ?

@yongfengdu
Copy link
Collaborator

The CI failure is caused by this commit opea-project/GenAIExamples#977
which changed the default port of guardrails, but we didn't specify the GUARDRAIL_SERVICE_PORT in our helm chart.

@lianhao
Copy link
Collaborator

lianhao commented Oct 21, 2024

=> CI runs the query before verifying that all necessary backend pods (at least TGI & TEI) have reached Ready state?

@lianhao Has this not been fixed in CI yet: #454 (comment) ?

This failures is caused by a recent PR #977 in GenAIExample

@lianhao
Copy link
Collaborator

lianhao commented Oct 21, 2024

let's wait for PR #489 to land-in first

@eero-t
Copy link
Contributor Author

eero-t commented Oct 21, 2024

The CI failure is caused by this commit opea-project/GenAIExamples#977 which changed the default port of guardrails, but we didn't specify the GUARDRAIL_SERVICE_PORT in our helm chart.

Any idea why that regression was not caught by CI?

@lianhao
Copy link
Collaborator

lianhao commented Oct 22, 2024

The CI failure is caused by this commit opea-project/GenAIExamples#977 which changed the default port of guardrails, but we didn't specify the GUARDRAIL_SERVICE_PORT in our helm chart.

Any idea why that regression was not caught by CI?

Because it's a change in GenAIExamples repo, not in this GenAIInfra repo.

@eero-t
Copy link
Contributor Author

eero-t commented Oct 22, 2024

Because it's a change in GenAIExamples repo, not in this GenAIInfra repo.

Hm. Such breakage seems to happen often enough that I think PRs in those repos could trigger (e.g. after merging) also GenAIInfra CI tests...

EDIT: Or if cross-repo triggers are not possible, maybe there could be automated GenAIInfra CI test runs at some specific interval (e.g. weekly or more often), to catch regressions caused by changes in external dependencies. It would be nice if such tests could file the detected issues as bugs ("[Bug] CI external dependency test 2024 WW43 - failed").

@lianhao
Copy link
Collaborator

lianhao commented Oct 23, 2024

Because it's a change in GenAIExamples repo, not in this GenAIInfra repo.

Hm. Such breakage seems to happen often enough that I think PRs in those repos could trigger (e.g. after merging) also GenAIInfra CI tests...

EDIT: Or if cross-repo triggers are not possible, maybe there could be automated GenAIInfra CI test runs at some specific interval (e.g. weekly or more often), to catch regressions caused by changes in external dependencies. It would be nice if such tests could file the detected issues as bugs ("[Bug] CI external dependency test 2024 WW43 - failed").

Cross-repo trigger seems not available right now. We have an automated trigger to create issue in GenAIInfra if any GenAIExamples' docker compose files get changed(In this specific case, it's issue #484). I believe @daisy-ycguo has already enabled the CD test for GenAIInfra which we might be able to leverage for periodical CD test.

Max input length applies to both, so teirerank needs also max warmup length.

Fixes: opea-project#483

Signed-off-by: Eero Tamminen <[email protected]>
helm-charts/common/teirerank/gaudi-values.yaml Outdated Show resolved Hide resolved
helm-charts/chatqna/gaudi-values.yaml Outdated Show resolved Hide resolved
Currently it's the same image.

Signed-off-by: Eero Tamminen <[email protected]>
@lianhao lianhao merged commit 620963f into opea-project:main Oct 25, 2024
12 checks passed
@eero-t eero-t deleted the gaudi-accel branch December 5, 2024 10:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants