Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dont retry on connection failure #103

Merged

Conversation

danmcp
Copy link
Member

@danmcp danmcp commented Aug 16, 2024

This is a continuation/replacement of #80

Resolves: #77

The approach of this separates errors into 3 buckets:

Fatal errors we retry and fail after max attempts (Ex: APIConnectionError, 404). These cases don't have much hope of fixing themselves.
Errors which we record after max attempts (Ex: RateLimitError). Failure on one try might not mean they are all going to fail.
Errors which we record after 1 occurrence (Ex: 400). Failure is request specific and retries aren't going to help.

Which allows the logic to fail fast when possible, not retry when it wouldn't help, and not catastrophically fail on isolated failures.

Copy link
Member

@jaideepr97 jaideepr97 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @danmcp @booxter
I really like how the errors have been broken down by their codes - it helps to further my understanding of good error handling in general

we could probably start focusing on expanding the unit testing infrastructure on the project but that should probably be in a separate PR

src/instructlab/eval/mt_bench_common.py Show resolved Hide resolved
src/instructlab/eval/exceptions.py Outdated Show resolved Hide resolved
@danmcp danmcp force-pushed the dont-retry-on-connection-failure branch from a356211 to 1561eee Compare August 19, 2024 16:59
booxter and others added 4 commits August 19, 2024 17:57
The `python` symlink may be missing on a system; or even point to py2.
We should use `python3` to be sure. (It's ok to use `python` inside a
virtualenv though.)

Signed-off-by: Ihar Hrachyshka <[email protected]>
Before the patch, all errors from openai were handled by retrying up to
API_MAX_RETRY times and returning $ERROR$ message at the last attempt.

With this patch, if all attempts result in APIConnectionError, we raise
a new EvalError exception. (If at least one of the previous attempts
result in a different error, then we return $ERROR$ as usual.)

Also, several errors are not expected to recover with a retry (400-404,
422). This patch makes them return $ERROR$ immediately without retrying.

Closes: instructlab#77

Signed-off-by: Ihar Hrachyshka <[email protected]>
Before the patch, we were calculating them on every retry attempt. The
function is pure, so there is no good reason to repeat the calculation.

This also simplifies the function a bit.

Signed-off-by: Ihar Hrachyshka <[email protected]>
@danmcp danmcp force-pushed the dont-retry-on-connection-failure branch from 1561eee to 7fbd87e Compare August 19, 2024 22:00
Copy link
Contributor

@booxter booxter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine now. Not sure I can approve it since most of the PR is (co)authored by myself.

openai.PermissionDeniedError, # 403
openai.NotFoundError, # 404
# General catch-all
openai.OpenAIError,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So AFAIU all the explicit types listed above are for documentation purposes only? Because all of them are OpenAIErrors already.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I thought they were really helpful to see elaborated. Although it's a great point that we should add a comment stating that logic explicitly.

@nathan-weinberg nathan-weinberg merged commit ba6fe0e into instructlab:main Aug 21, 2024
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

test_branch_gen_answers does not fail when no model is being served
4 participants