retry improvements (#1430)
* rename is_rate_limit to should_retry

* lint

* fix typo

* cleanup status codes

* improved should_retry for stainless clients

* more retry work (azureai)

* improved cloudflare retry filter

* improve mistral retry logic

* goodfire should_retry

* improved should_retry for bedrock

* improved vertex should_retry

* improve google should_retry

* don't pass max retries and timeout directly to model interfaces

* retry controller, docs, don't retry connection errors

* some renaming + exception not baseexception

* hooks for retry tracing

* retry counter

* trace retries

* display current retry count

* show number of retries for generate in running samples view

* simplify log to transcript

* simplify log capture code

* rename tracker to hooks

* http logging hooks for google

* some hooks cleanup

* narrow scope of inspect log (don't propagate)

* refine retry log

* tooltip on http requests

* docs on inspect trace http

* various review tweaks

* changelog

* propagate for test

* fix typos in changelog

---------

Co-authored-by: jjallaire-aisi <[email protected]>
jjallaire and jjallaire-aisi authored Mar 3, 2025
1 parent c79c08d commit 588e4b9
Showing 37 changed files with 734 additions and 576 deletions.
9 changes: 9 additions & 0 deletions CHANGELOG.md
@@ -1,3 +1,12 @@
## Unreleased

- New "HTTP Retries" display (replacing the "HTTP Rate Limits" display) which counts all retries and does so much more consistently and accurately across providers.
- The `ModelAPI` class now has a `should_retry()` method that replaces the deprecated `is_rate_limit()` method.
- The "Generate..." progress message in the Running Samples view now shows the number of retries for the active call to `generate()`.
- New `inspect trace http` command which will show all HTTP requests for a run.
- More consistent use of `max_retries` and `timeout` configuration options. These options now exclusively control Inspect's outer retry handler; model providers use their default behaviour for the inner request, which is typically 2-4 retries and a service-appropriate timeout.
- Logging: Inspect no longer sets the global log level nor does it allow its own messages to propagate to the global handler (eliminating the possibility of duplicate display). This should improve compatibility with applications that have their own custom logging configured.

## v0.3.72 (03 March 2025)

- Computer: Updated tool definition to match improvements in Claude Sonnet 3.7.
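As a rough illustration of the new `should_retry()` hook (a sketch only — the exact `ModelAPI` signature and import path should be checked against the Inspect source), a provider override might look like this:

``` python
# Illustrative sketch, not Inspect's actual provider code.
import httpx

from inspect_ai.model import ModelAPI


class MyProviderAPI(ModelAPI):
    def should_retry(self, ex: Exception) -> bool:
        # retry rate limits and transient server errors, but not
        # client errors such as 400/401/404
        if isinstance(ex, httpx.HTTPStatusError):
            return ex.response.status_code in (429, 500, 502, 503, 504)
        return False
```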
2 changes: 1 addition & 1 deletion docs/errors-and-limits.qmd
@@ -116,7 +116,7 @@ def intercode_ctf():
Working time is computed based on total clock time minus time spent on (a) unsuccessful model generations (e.g. rate limited requests); and (b) waiting on shared resources (e.g. Docker containers or subprocess execution).
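In other words (descriptive names only, not Inspect's actual internals):

``` python
# pseudo-formula for working time; variable names are illustrative
working_time = total_clock_time - (retried_generation_time + shared_resource_wait_time)
```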

::: {.callout-note appearance="simple"}
-In order to distinguish successful generate requests from rate limited and retried requests, Inspect installs hooks into the HTTP client of various model packages. This is not possible for some models (`google`, `vertex`, `azureai`, and `goodfire`), and in these cases the `working_time` will include any internal retries that the model client performs.
+In order to distinguish successful generate requests from rate limited and retried requests, Inspect installs hooks into the HTTP client of various model packages. This is not possible for some models (`vertex`, `azureai`, and `goodfire`), and in these cases the `working_time` will include any internal retries that the model client performs.
:::
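For model packages built on `httpx`, this kind of instrumentation can be done with client event hooks — a minimal sketch of the general technique (not Inspect's actual implementation; `record_retry()` is a hypothetical counter):

``` python
import httpx


async def count_retries(response: httpx.Response) -> None:
    # responses that will be retried (e.g. rate limits) surface here
    if response.status_code == 429:
        record_retry()  # hypothetical counter update


client = httpx.AsyncClient(event_hooks={"response": [count_retries]})
```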


4 changes: 2 additions & 2 deletions docs/options.qmd
@@ -86,8 +86,8 @@ Below are sections for the various categories of options supported by `inspect eval`
| `--parallel-tool-calls` | Whether to enable calling multiple functions during tool use (defaults to True) OpenAI and Groq only. |
| `--max-tool-output` | Maximum size of tool output (in bytes). Defaults to 16 \* 1024. |
| `--internal-tools` | Whether to automatically map tools to model internal implementations (e.g. 'computer' for Anthropic). |
-| `--max-retries` | Maximum number of times to retry request (defaults to 5) |
-| `--timeout` | Request timeout (in seconds). |
+| `--max-retries` | Maximum number of times to retry generate request (defaults to unlimited) |
+| `--timeout` | Generate timeout in seconds (defaults to no timeout) |

## Tasks and Solvers

35 changes: 30 additions & 5 deletions docs/parallelism.qmd
@@ -28,9 +28,13 @@ $ inspect eval --model openai/gpt-4 --max-connections 20

The default value for max connections is 10. By increasing it we might get better performance due to higher parallelism; however, we might get *worse* performance if this causes us to frequently hit rate limits (which are retried with exponential backoff). The "correct" max connections for your evaluations will vary based on your actual rate limit and the size and complexity of your evaluations.
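The same option is available from Python — a sketch assuming `eval()` accepts generation config options as keyword arguments (`my_task.py` is a placeholder):

``` python
from inspect_ai import eval

# equivalent of: inspect eval my_task.py --model openai/gpt-4 --max-connections 20
eval("my_task.py", model="openai/gpt-4", max_connections=20)
```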

::: {.callout-note appearance="simple"}
Note that max connections is applied per-model. This means that if you use a grader model from a provider distinct from the one you are evaluating you will get extra concurrency (as each model will enforce its own max connections).
:::

### Rate Limits

-When you run an eval you'll see information reported on the current active connection usage as well as the number of HTTP rate limit errors that have been encountered (note that Inspect will automatically retry on rate limits and other errors likely to be transient):
+When you run an eval you'll see information reported on the current active connection usage as well as the number of HTTP retries that have occurred (Inspect will automatically retry on rate limits and other errors likely to be transient):

![](images/rate-limit.png){fig-alt="The Inspect task results displayed in the terminal. The number of HTTP rate limit errors that have occurred (25) is printed in the bottom right of the task results."}

@@ -40,20 +44,41 @@ You should experiment with various values for max connections at different times

### Limiting Retries

-By default, Inspect will continue to retry model API calls (with exponential backoff) indefinitely when a rate limit error (HTTP status 429) is returned. You can limit these retries by using the `max_retries` and `timeout` eval options. For example:
+By default, Inspect will retry model API calls indefinitely (with exponential backoff) when a recoverable HTTP error occurs. The initial backoff is 3 seconds, and the exponential increase results in a roughly 25-minute wait for the 10th request (then 30 minutes for the 11th and subsequent requests). You can limit Inspect's retries using the `--max-retries` option:

``` bash
inspect eval --model openai/gpt-4 --max-retries 10
```
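The quoted wait times follow from a 3 second base that doubles with each attempt, capped at 30 minutes (jitter ignored) — a quick check under those assumptions:

``` python
# assumes: 3s base, doubling per attempt, 30 minute cap (jitter ignored)
for attempt in range(1, 12):
    wait = min(3 * 2 ** (attempt - 1), 30 * 60)
    print(f"attempt {attempt}: wait {wait / 60:.1f} minutes")
# attempt 10 waits 3 * 2**9 = 1536s (~25.6 minutes); attempts 11+ hit the 30 minute cap
```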

Note that model interfaces themselves may have internal retry behavior (for example, the `openai` and `anthropic` packages both retry twice by default).

You can put a limit on the total time for retries using the `--timeout` option:

``` bash
-$ inspect eval --model openai/gpt-4 --max-retries 10 --timeout 600
+inspect eval --model openai/gpt-4 --timeout 600
```

### Debugging Retries

If you want more insight into Model API connections and retries, specify `log_level=http`. For example:

``` bash
-$ inspect eval --model openai/gpt-4 --log-level=http
+inspect eval --model openai/gpt-4 --log-level=http
```

You can also view all of the HTTP requests for the current (or most recent) evaluation run using the `inspect trace http` command. For example:

``` bash
inspect trace http # show all http requests
inspect trace http --failed # show only failed requests
```

::: {.callout-note appearance="simple"}
-Note that max connections is applied per-model. This means that if you use a grader model from a provider distinct from the one you are evaluating you will get extra concurrency (as each model will enforce its own max connections).
Note that the `inspect trace http` command is currently available only in the development version of Inspect. To install the development version from GitHub:

``` bash
pip install git+https://github.com/UKGovernmentBEIS/inspect_ai
```
:::

## Multiple Models {#sec-multiple-models}
25 changes: 25 additions & 0 deletions docs/tracing.qmd
@@ -69,6 +69,31 @@ As with the `inspect trace dump` command, you can apply a filter when listing an
inspect trace anomalies --filter model
```

## HTTP Requests

::: {.callout-note appearance="simple"}
Note that the `inspect trace http` command described below is currently available only in the development version of Inspect. To install the development version from GitHub:

``` bash
pip install git+https://github.com/UKGovernmentBEIS/inspect_ai
```
:::

You can view all of the HTTP requests for the current (or most recent) evaluation run using the `inspect trace http` command. For example:

``` bash
inspect trace http # show all http requests
inspect trace http --failed # show only failed requests
```

The `--filter` parameter also works here, for example:

``` bash
inspect trace http --failed --filter bedrock
```



## Tracing API {#tracing-api}

In addition to the standard set of actions which are trace logged, you can do your own custom trace logging using the `trace_action()` and `trace_message()` APIs. Trace logging is a great way to make sure that logging context is *always captured* (since the last 10 trace logs are always available) without cluttering up the console or eval transcripts.
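For example — a sketch in which `run_query()` is a placeholder for your own code (check the current docs for exact signatures):

``` python
from logging import getLogger

from inspect_ai.util import trace_action, trace_message

logger = getLogger(__name__)

# a trace action records begin/end (including errors and cancellation) with timing
with trace_action(logger, "database", "querying results db"):
    results = run_query()  # placeholder

# a trace message writes a one-shot entry to the trace log
trace_message(logger, "database", "results db query complete")
```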
5 changes: 2 additions & 3 deletions src/inspect_ai/_cli/eval.py
@@ -11,7 +11,6 @@
    DEFAULT_EPOCHS,
    DEFAULT_LOG_LEVEL_TRANSCRIPT,
    DEFAULT_MAX_CONNECTIONS,
-    DEFAULT_MAX_RETRIES,
)
from inspect_ai._util.file import filesystem
from inspect_ai._util.samples import parse_sample_id, parse_samples_limit
@@ -47,9 +46,9 @@
NO_SCORE_DISPLAY = "Do not display scoring metrics in realtime."
MAX_CONNECTIONS_HELP = f"Maximum number of concurrent connections to Model API (defaults to {DEFAULT_MAX_CONNECTIONS})"
MAX_RETRIES_HELP = (
-    f"Maximum number of times to retry request (defaults to {DEFAULT_MAX_RETRIES})"
+    "Maximum number of times to retry model API requests (defaults to unlimited)"
)
-TIMEOUT_HELP = "Request timeout (in seconds)."
+TIMEOUT_HELP = "Model API request timeout in seconds (defaults to no timeout)"


def eval_options(func: Callable[..., Any]) -> Callable[..., click.Context]:
59 changes: 53 additions & 6 deletions src/inspect_ai/_cli/trace.py
@@ -15,6 +15,7 @@
from inspect_ai._util.error import PrerequisiteError
from inspect_ai._util.trace import (
    ActionTraceRecord,
    TraceRecord,
    inspect_trace_dir,
    list_trace_files,
    read_trace_file,
@@ -84,6 +85,41 @@ def dump_command(trace_file: str | None, filter: str | None) -> None:
)


@trace_command.command("http")
@click.argument("trace-file", type=str, required=False)
@click.option(
"--filter",
type=str,
help="Filter (applied to trace message field).",
)
@click.option(
"--failed",
type=bool,
is_flag=True,
default=False,
help="Show only failed HTTP requests (non-200 status)",
)
def http_command(trace_file: str | None, filter: str | None, failed: bool) -> None:
"""View all HTTP requests in the trace log."""
_, traces = _read_traces(trace_file, "HTTP", filter)

last_timestamp = ""
table = Table(Column(), Column(), box=None)
for trace in traces:
if failed and "200 OK" in trace.message:
continue
timestamp = trace.timestamp.split(".")[0]
if timestamp == last_timestamp:
timestamp = ""
else:
last_timestamp = timestamp
timestamp = f"[{timestamp}]"
table.add_row(timestamp, trace.message)

if table.row_count > 0:
r_print(table)


@trace_command.command("anomalies")
@click.argument("trace-file", type=str, required=False)
@click.option(
@@ -99,12 +135,7 @@ def dump_command(trace_file: str | None, filter: str | None) -> None:
)
def anomolies_command(trace_file: str | None, filter: str | None, all: bool) -> None:
    """Look for anomalies in a trace file (never completed or cancelled actions)."""
-    trace_file_path = _resolve_trace_file_path(trace_file)
-    traces = read_trace_file(trace_file_path)
-
-    if filter:
-        filter = filter.lower()
-        traces = [trace for trace in traces if filter in trace.message.lower()]
+    trace_file_path, traces = _read_traces(trace_file, None, filter)

    # Track started actions
    running_actions: dict[str, ActionTraceRecord] = {}
@@ -199,6 +230,22 @@ def print_fn(o: RenderableType) -> None:
    print(console.export_text(styles=True).strip())


def _read_traces(
    trace_file: str | None, level: str | None = None, filter: str | None = None
) -> tuple[Path, list[TraceRecord]]:
    trace_file_path = _resolve_trace_file_path(trace_file)
    traces = read_trace_file(trace_file_path)

    if level:
        traces = [trace for trace in traces if trace.level == level]

    if filter:
        filter = filter.lower()
        traces = [trace for trace in traces if filter in trace.message.lower()]

    return (trace_file_path, traces)


def _print_bucket(
    print_fn: Callable[[RenderableType], None],
    label: str,
12 changes: 6 additions & 6 deletions src/inspect_ai/_display/core/footer.py
@@ -1,7 +1,7 @@
from rich.console import RenderableType
from rich.text import Text

-from inspect_ai._util.logger import http_rate_limit_count
+from inspect_ai._util.retry import http_retries_count
from inspect_ai.util._concurrency import concurrency_status
from inspect_ai.util._throttle import throttle

@@ -26,12 +26,12 @@ def task_resources() -> str:


def task_counters(counters: dict[str, str]) -> str:
-    return task_dict(counters | task_http_rate_limits())
+    return task_dict(counters | task_http_retries())


-def task_http_rate_limits() -> dict[str, str]:
-    return {"HTTP rate limits": f"{http_rate_limit_count():,}"}
+def task_http_retries() -> dict[str, str]:
+    return {"HTTP retries": f"{http_retries_count():,}"}


-def task_http_rate_limits_str() -> str:
-    return f"HTTP rate limits: {http_rate_limit_count():,}"
+def task_http_retries_str() -> str:
+    return f"HTTP retries: {http_retries_count():,}"
4 changes: 2 additions & 2 deletions src/inspect_ai/_display/plain/display.py
@@ -22,7 +22,7 @@
    TaskSpec,
    TaskWithResult,
)
-from ..core.footer import task_http_rate_limits_str
+from ..core.footer import task_http_retries_str
from ..core.panel import task_panel, task_targets
from ..core.results import task_metric, tasks_results

@@ -182,7 +182,7 @@ def _print_status(self) -> None:
            status_parts.append(resources)

        # Add rate limits
-        rate_limits = task_http_rate_limits_str()
+        rate_limits = task_http_retries_str()
        if rate_limits:
            status_parts.append(rate_limits)

4 changes: 4 additions & 0 deletions src/inspect_ai/_display/textual/widgets/footer.py
@@ -36,3 +36,7 @@ def watch_left(self, new_left: RenderableType) -> None:
    def watch_right(self, new_right: RenderableType) -> None:
        footer_right = cast(Static, self.query_one("#footer-right"))
        footer_right.update(new_right)
        if footer_right.tooltip is None:
            footer_right.tooltip = (
                "Execute 'inspect trace http' for a log of all HTTP requests."
            )
19 changes: 14 additions & 5 deletions src/inspect_ai/_display/textual/widgets/samples.py
@@ -506,6 +506,7 @@ async def sync_sample(self, sample: ActiveSample | None) -> None:
        # track the sample
        self.sample = sample

+        status_group = self.query_one("#" + self.STATUS_GROUP)
        pending_status = self.query_one("#" + self.PENDING_STATUS)
        timeout_tool = self.query_one("#" + self.TIMEOUT_TOOL_CALL)
        clock = self.query_one(Clock)
@@ -537,11 +538,19 @@ async def sync_sample(self, sample: ActiveSample | None) -> None:
                pending_caption = cast(
                    Static, self.query_one("#" + self.PENDING_CAPTION)
                )
-                pending_caption_text = (
-                    "Generating..."
-                    if isinstance(last_event, ModelEvent)
-                    else "Executing..."
-                )
+                if isinstance(last_event, ModelEvent):
+                    # see if there are retries in play
+                    if sample.retry_count > 0:
+                        suffix = "retry" if sample.retry_count == 1 else "retries"
+                        pending_caption_text = (
+                            f"Generating ({sample.retry_count:,} {suffix})..."
+                        )
+                    else:
+                        pending_caption_text = "Generating..."
+                else:
+                    pending_caption_text = "Executing..."
+                status_group.styles.width = max(22, len(pending_caption_text))

                pending_caption.update(
                    Text.from_markup(f"[italic]{pending_caption_text}[/italic]")
                )
3 changes: 1 addition & 2 deletions src/inspect_ai/_eval/context.py
@@ -1,6 +1,6 @@
from inspect_ai._util.dotenv import init_dotenv
from inspect_ai._util.hooks import init_hooks
-from inspect_ai._util.logger import init_http_rate_limit_count, init_logger
+from inspect_ai._util.logger import init_logger
from inspect_ai.approval._apply import have_tool_approval, init_tool_approval
from inspect_ai.approval._human.manager import init_human_approval_manager
from inspect_ai.approval._policy import ApprovalPolicy
Expand All @@ -20,7 +20,6 @@ def init_eval_context(
    init_logger(log_level, log_level_transcript)
    init_concurrency()
    init_max_subprocesses(max_subprocesses)
-    init_http_rate_limit_count()
    init_hooks()
    init_active_samples()
    init_human_approval_manager()
9 changes: 4 additions & 5 deletions src/inspect_ai/_eval/task/sandbox.py
@@ -15,10 +15,9 @@

from inspect_ai._eval.task.task import Task
from inspect_ai._eval.task.util import task_run_dir
-from inspect_ai._util.constants import DEFAULT_MAX_RETRIES, DEFAULT_TIMEOUT
from inspect_ai._util.file import file, filesystem
+from inspect_ai._util.httpx import httpx_should_retry, log_httpx_retry_attempt
from inspect_ai._util.registry import registry_unqualified_name
-from inspect_ai._util.retry import httpx_should_retry, log_retry_attempt
from inspect_ai._util.url import data_uri_to_base64, is_data_uri, is_http_url
from inspect_ai.dataset import Sample
from inspect_ai.util._concurrency import concurrency
@@ -186,14 +185,14 @@ async def _retrying_httpx_get(
    url: str,
    client: httpx.AsyncClient = httpx.AsyncClient(),
    timeout: int = 30,  # per-attempt timeout
-    max_retries: int = DEFAULT_MAX_RETRIES,
-    total_timeout: int = DEFAULT_TIMEOUT,  # timeout for the whole retry loop. not for an individual attempt
+    max_retries: int = 10,
+    total_timeout: int = 120,  # timeout for the whole retry loop. not for an individual attempt
) -> bytes:
    @retry(
        wait=wait_exponential_jitter(),
        stop=(stop_after_attempt(max_retries) | stop_after_delay(total_timeout)),
        retry=retry_if_exception(httpx_should_retry),
-        before_sleep=log_retry_attempt(url),
+        before_sleep=log_httpx_retry_attempt(url),
    )
    async def do_get() -> bytes:
        response = await client.get(
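With the new defaults above, the tenacity retry loop stops at whichever limit is reached first (10 attempts or 120 seconds overall) and retries only errors that `httpx_should_retry` considers transient. A hypothetical call (the URL is a placeholder):

``` python
# hypothetical usage sketch
data = await _retrying_httpx_get("https://example.com/files/archive.tar.gz")
```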
2 changes: 0 additions & 2 deletions src/inspect_ai/_util/constants.py
@@ -6,8 +6,6 @@
PKG_NAME = Path(__file__).parent.parent.stem
PKG_PATH = Path(__file__).parent.parent
DEFAULT_EPOCHS = 1
-DEFAULT_MAX_RETRIES = 5
-DEFAULT_TIMEOUT = 120
DEFAULT_MAX_CONNECTIONS = 10
DEFAULT_MAX_TOKENS = 2048
DEFAULT_VIEW_PORT = 7575
