Python backend with multiple instances causes unexpected and non-deterministic results #7907

Open
NadavShmayo opened this issue Dec 25, 2024 · 3 comments

NadavShmayo commented Dec 25, 2024

Description
When using a Python backend with multiple model instances and running inference with many identical requests, the results are not deterministic and not even close to the expected result.

Triton Information
24.09

Are you using the Triton container or did you build it yourself?
Triton container (with additional Python libraries)

To Reproduce
Clone the following repository and follow the steps in the README.md file:
https://github.com/NadavShmayo/fairseq-triton-example

Expected behavior
I expect the Python model to return identical outputs for requests with identical input values.
The Locust script in the example repository I created prints the output every time it differs from the expected output.
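
For illustration, a client-side determinism check can look like the sketch below. This is not the Locust script from the repository; the model name and tensor names (`bls`, `INPUT0`, `OUTPUT0`) are placeholders, and the actual reproducer sends the requests concurrently rather than in a sequential loop.

```python
import numpy as np
import tritonclient.http as httpclient

# Placeholder model/tensor names; the real names come from the model config.
MODEL_NAME = "bls"

client = httpclient.InferenceServerClient(url="localhost:8000")

# One fixed input, sent repeatedly.
data = np.array([[1, 2, 3, 4]], dtype=np.int64)
inp = httpclient.InferInput("INPUT0", data.shape, "INT64")
inp.set_data_from_numpy(data)

reference = client.infer(MODEL_NAME, inputs=[inp]).as_numpy("OUTPUT0")
for i in range(100):
    out = client.infer(MODEL_NAME, inputs=[inp]).as_numpy("OUTPUT0")
    if not np.array_equal(out, reference):
        print(f"request {i}: output differs from the reference")
```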

Additional Information

  • I believe this is an issue with Triton and not with my models, since the error does not reproduce with an instance count of 1.
  • I tried to avoid using multiple instances and instead used decoupled mode with a ThreadPoolExecutor (see the sketch after this list), which led to the same problem, even when moving every object initialization inside the thread worker to avoid non-thread-safe behavior.
  • When debugging with print statements in the compiled models and the Python model, I noticed that the encoder output sometimes has strange values after being transferred to the Python model, but the problem seems to reproduce even when this is not the case.
  • The issue seems to be less reproducible when using a dynamic batcher with a queue delay, which leads me to believe it might be related to a race condition in some shared memory between the BLS instances.
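
For context, the decoupled-mode workaround from the second bullet roughly followed the pattern below. This is a simplified sketch rather than the repository's actual model.py; `build_generator` and the tensor names stand in for the Fairseq-specific pieces.

```python
from concurrent.futures import ThreadPoolExecutor
import triton_python_backend_utils as pb_utils

class TritonPythonModel:
    def initialize(self, args):
        self.executor = ThreadPoolExecutor(max_workers=4)

    def execute(self, requests):
        # Decoupled mode: responses go through response senders, so
        # execute() returns None and the work runs on the thread pool.
        for request in requests:
            sender = request.get_response_sender()
            self.executor.submit(self._handle, request, sender)
        return None

    def _handle(self, request, sender):
        # Everything the worker needs is created inside the thread,
        # so no state is shared between concurrent requests.
        generator = build_generator()  # hypothetical per-request setup
        tokens = pb_utils.get_input_tensor_by_name(request, "INPUT0").as_numpy()
        output = generator(tokens)
        response = pb_utils.InferenceResponse(
            output_tensors=[pb_utils.Tensor("OUTPUT0", output)])
        sender.send(response, flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL)

    def finalize(self):
        self.executor.shutdown()
```
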
tanmayv25 (Contributor) commented Jan 25, 2025

@NadavShmayo Thanks for sharing this concise reproducer. Looking at the instructions at a high level, it seems that you are using a PyTorch model. In that case, can you review this section of the document?
Determinism is not guaranteed across runs. My guess is that under higher request load, different CUDA kernels are being selected, leading to differences in the results.
There are some suggestions on how to make the results more reproducible: https://pytorch.org/docs/stable/notes/randomness.html
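
For reference, the relevant knobs from that page boil down to roughly the following. This is a minimal sketch; which of these actually matters depends on the ops the model uses, and `use_deterministic_algorithms` will raise an error for ops that have no deterministic implementation.

```python
import os
import torch

# Needed for deterministic cuBLAS behaviour; must be set before CUDA is initialized.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

torch.manual_seed(0)                        # fix RNG state
torch.use_deterministic_algorithms(True)    # fail loudly on non-deterministic ops
torch.backends.cudnn.deterministic = True   # force deterministic cuDNN kernels
torch.backends.cudnn.benchmark = False      # disable cuDNN autotuning
```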

tanmayv25 added the bug label on Jan 25, 2025
pskiran1 self-assigned this on Jan 27, 2025
pskiran1 (Member) commented:

@NadavShmayo, we were able to reproduce similar behavior: when running inference with an instance count greater than 1, the following error occurred randomly and frequently. Have you experienced the same error on your end?

E0130 16:32:48.407374 2509 pb_stub.cc:721] "Failed to process the request(s) for model 'bls_0_3', message: AssertionError: 14 < 14\n\nAt:\n  /usr/local/lib/python3.10/dist-packages/fairseq/sequence_generator.py(470): _generate\n  /usr/local/lib/python3.10/dist-packages/fairseq/sequence_generator.py(153): forward\n  /usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py(116): decorate_context\n  /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py(1562): _call_impl\n  /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py(1553): _wrapped_call_impl\n  /my_workspace/fairseq-triton-example/./latest_models/bls/1/model.py(359): execute\n"

NadavShmayo (Author) commented:

Thank you for looking into this!
Yes, I have run into this error. It is not very indicative, but the underlying cause is a never-ending translation in the Fairseq model, which is exactly the problem I opened this issue about: the translations become gibberish or simply never finish (as you encountered) when using multiple instances.
