Dont retry on connection failure #103

danmcp · 2024-08-16T03:30:33Z

This is a continuation/replacement of #80

Resolves: #77

This approach separates errors into three buckets:

1. Fatal errors, which we retry and then fail on after max attempts (e.g. APIConnectionError, 404). These cases have little hope of fixing themselves.
2. Errors we record after max attempts (e.g. RateLimitError). A failure on one try doesn't mean every try will fail.
3. Errors we record after one occurrence (e.g. 400). The failure is request-specific, so retries won't help.

This allows the logic to fail fast when possible, skip retries that wouldn't help, and avoid failing catastrophically on isolated failures.
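As a rough illustration of the three buckets, here is a minimal sketch of a classifier. It is not the code in this PR: the exception classes are hypothetical stand-ins for the openai types named above, and `Bucket` and `classify` are names made up for this example.

```python
from enum import Enum


# Hypothetical stand-ins for the openai exception types named in this PR.
class OpenAIError(Exception):
    pass


class APIConnectionError(OpenAIError):
    pass


class NotFoundError(OpenAIError):  # 404
    pass


class RateLimitError(OpenAIError):  # 429
    pass


class BadRequestError(OpenAIError):  # 400
    pass


class Bucket(Enum):
    FATAL = "retry, then fail the run after max attempts"
    RECORD_AFTER_RETRIES = "retry, then record an error after max attempts"
    RECORD_IMMEDIATELY = "record an error after one occurrence"


def classify(exc: Exception) -> Bucket:
    """Map an exception to one of the three buckets described above."""
    if isinstance(exc, (APIConnectionError, NotFoundError)):
        return Bucket.FATAL
    if isinstance(exc, BadRequestError):
        return Bucket.RECORD_IMMEDIATELY
    # Assumption: anything else (for example RateLimitError) may succeed later.
    return Bucket.RECORD_AFTER_RETRIES
```

Keeping the classification separate from the retry loop is only one possible layout; it makes each bucket easy to test on its own.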

jaideepr97

Thanks @danmcp @booxter
I really like how the errors have been broken down by their codes; it helps further my understanding of good error handling in general.

We could probably start focusing on expanding the project's unit testing infrastructure, but that should probably be in a separate PR.

src/instructlab/eval/mt_bench_common.py

src/instructlab/eval/exceptions.py

The `python` symlink may be missing on a system, or may even point to py2. We should use `python3` to be sure. (It's OK to use `python` inside a virtualenv, though.)

Signed-off-by: Ihar Hrachyshka <[email protected]>

Before the patch, all errors from openai were handled by retrying up to API_MAX_RETRY times and returning the $ERROR$ message on the last attempt.

With this patch, if all attempts result in APIConnectionError, we raise a new EvalError exception. (If at least one of the previous attempts results in a different error, we return $ERROR$ as usual.)

Also, several errors are not expected to recover with a retry (400-404, 422). This patch makes them return $ERROR$ immediately without retrying.

Closes: instructlab#77

Signed-off-by: Ihar Hrachyshka <[email protected]>
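The behaviour this commit message describes could be sketched roughly as follows. This is an illustration under assumptions, not the patch itself: the exception classes are stand-ins, `call_with_retries` and `NonRetryableError` are hypothetical names, the value of `API_MAX_RETRY` is made up, and any delay between attempts is left out for brevity.

```python
API_MAX_RETRY = 3  # hypothetical value, for illustration only
ERROR_OUTPUT = "$ERROR$"


class EvalError(Exception):
    """Stand-in for the new EvalError exception."""


class APIConnectionError(Exception):
    """Stand-in for openai.APIConnectionError."""


class NonRetryableError(Exception):
    """Stand-in for request-specific errors such as 400-404 and 422."""


def call_with_retries(request, max_retry=API_MAX_RETRY):
    """Call request() up to max_retry times, following the rules above."""
    only_connection_errors = True
    for _ in range(max_retry):
        try:
            return request()
        except APIConnectionError:
            continue  # retry; the server may simply be unreachable
        except NonRetryableError:
            return ERROR_OUTPUT  # retrying the same request would not help
        except Exception:
            only_connection_errors = False  # a different error; keep retrying
    if only_connection_errors:
        raise EvalError("all attempts failed with a connection error")
    return ERROR_OUTPUT
```

With this shape, a server that is never reachable surfaces as an exception, while a single bad request only produces one `$ERROR$` result.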

Before the patch, we were calculating them on every retry attempt. The function is pure, so there is no good reason to repeat the calculation. This also simplifies the function a bit.

Signed-off-by: Ihar Hrachyshka <[email protected]>
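The idea of computing the messages once can be sketched like this. The helper names (`build_messages`, `complete`, `send`) and the message shape are assumptions for the example, and the built-in `ConnectionError` stands in for whatever retryable error the real code catches.

```python
def build_messages(conversation):
    """Pure helper: the same input always gives the same output."""
    return [{"role": role, "content": text} for role, text in conversation]


def complete(conversation, send, max_retry=3):
    """Send the conversation, retrying on connection errors."""
    messages = build_messages(conversation)  # computed once, outside the loop
    for _ in range(max_retry):
        try:
            return send(messages)
        except ConnectionError:
            continue  # reuse the same messages on the next attempt
    return "$ERROR$"
```

Because `build_messages` has no side effects, hoisting it out of the loop changes nothing about the result; it only avoids repeated work.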

Signed-off-by: Dan McPherson <[email protected]>

booxter

This is fine now. Not sure I can approve it since most of the PR is (co)authored by myself.

booxter · 2024-08-21T15:26:05Z

src/instructlab/eval/mt_bench_common.py

+            openai.PermissionDeniedError,  # 403
+            openai.NotFoundError,  # 404
+            # General catch-all
+            openai.OpenAIError,


So AFAIU all the explicit types listed above are for documentation purposes only? Because all of them are OpenAIErrors already.

danmcp

Yes. I thought it was really helpful to see them spelled out, although it's a great point that we should add a comment stating that explicitly.
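To make the point in this exchange concrete, here is a small sketch showing why listing the subclasses next to the catch-all does not change behaviour. The classes are stand-ins that only mirror the inheritance described above, and `FATAL_ERRORS`, `is_fatal`, and `is_fatal_short` are hypothetical names.

```python
class OpenAIError(Exception):
    """Stand-in for openai.OpenAIError, the general catch-all."""


class PermissionDeniedError(OpenAIError):  # 403
    pass


class NotFoundError(OpenAIError):  # 404
    pass


# The explicit entries document intent; OpenAIError alone already matches
# every subclass listed before it.
FATAL_ERRORS = (PermissionDeniedError, NotFoundError, OpenAIError)


def is_fatal(exc):
    """Check against the documented, explicit tuple."""
    return isinstance(exc, FATAL_ERRORS)


def is_fatal_short(exc):
    """Check against the base class only."""
    return isinstance(exc, OpenAIError)
```

Both helpers return the same answer for any exception, which is why a comment explaining that the explicit list is documentation is worth adding.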

jaideepr97 approved these changes Aug 19, 2024

danmcp force-pushed the dont-retry-on-connection-failure branch from a356211 to 1561eee Compare August 19, 2024 16:59

nathan-weinberg requested a review from booxter August 19, 2024 17:34

booxter and others added 4 commits August 19, 2024 17:57

Use python3 in README instructions

ad9ad83

Calculate messages for openai completion once

abf0f41

Add to fatal exceptions and handle OpenAIError

7fbd87e

danmcp force-pushed the dont-retry-on-connection-failure branch from 1561eee to 7fbd87e Compare August 19, 2024 22:00

nathan-weinberg approved these changes Aug 20, 2024

booxter reviewed Aug 21, 2024

nathan-weinberg merged commit ba6fe0e into instructlab:main Aug 21, 2024
9 checks passed

nathan-weinberg mentioned this pull request Aug 22, 2024

Don't retry on connection error #80

Closed

danmcp mentioned this pull request Sep 9, 2024

tests: Fail when model is not served #85

Closed
