retry improvements (#1430)
* rename is_rate_limit to should_retry

* lint

* fix typo

* cleanup status codes

* improved should_retry for stainless clients

* more retry work (azureai)

* improved cloudflare retry filter

* improve mistral retry logic

* goodfire should_retry

* improved should_retry for bedrock

* improved vertex should_retry

* improve google should_retry

* don't pass max retries and timeout directly to model interfaces

* retry controller, docs, don't retry connection errors

* some renaming + exception not baseexception

* hooks for retry tracing

* retry counter

* trace retries

* display current retry count

* show number of retries for generate in running samples view

* simplify log to transcript

* simplify log capture code

* rename tracker to hooks

* http logging hooks for google

* some hooks cleanup

* narrow scope of inspect log (don't propagate)

* refine retry log

* tooltip on http requests

* docs on inspect trace http

* various review tweaks

* changelog

* propagate for test

* fix typos in changelog

---------

Co-authored-by: jjallaire-aisi <[email protected]>
jjallaire and jjallaire-aisi authored Mar 3, 2025
1 parent c79c08d commit 588e4b9
Showing 37 changed files with 734 additions and 576 deletions.
9 changes: 9 additions & 0 deletions CHANGELOG.md
@@ -1,3 +1,12 @@
## Unreleased

- New "HTTP Retries" display (replacing the "HTTP Rate Limits" display) which counts all retries and does so much more consistently and accurately across providers.
- The `ModelAPI` class now has a `should_retry()` method that replaces the deprecated `is_rate_limit()` method.
- The "Generate..." progress message in the Running Samples view now shows the number of retries for the active call to `generate()`.
- New `inspect trace http` command which will show all HTTP requests for a run.
- More consistent use of `max_retries` and `timeout` configuration options. These options now exclusively control Inspect's outer retry handler; model providers use their default behaviour for the inner request, which is typically 2-4 retries and a service-appropriate timeout.
- Logging: Inspect no longer sets the global log level nor does it allow its own messages to propagate to the global handler (eliminating the possibility of duplicate display). This should improve compatibility with applications that have their own custom logging configured.

## v0.3.72 (03 March 2025)

- Computer: Updated tool definition to match improvements in Claude Sonnet 3.7.
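As a rough illustration of the new `should_retry()` hook (a sketch only — the exact `ModelAPI` signature and import path should be checked against the Inspect source), a provider override might look like this:

``` python
# Illustrative sketch, not Inspect's actual provider code.
import httpx

from inspect_ai.model import ModelAPI


class MyProviderAPI(ModelAPI):
    def should_retry(self, ex: Exception) -> bool:
        # retry rate limits and transient server errors, but not
        # client errors such as 400/401/404
        if isinstance(ex, httpx.HTTPStatusError):
            return ex.response.status_code in (429, 500, 502, 503, 504)
        return False
```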
2 changes: 1 addition & 1 deletion docs/errors-and-limits.qmd
@@ -116,7 +116,7 @@ def intercode_ctf():
Working time is computed based on total clock time minus time spent on (a) unsuccessful model generations (e.g. rate limited requests); and (b) waiting on shared resources (e.g. Docker containers or subprocess execution).
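In other words (descriptive names only, not Inspect's actual internals):

``` python
# pseudo-formula for working time; variable names are illustrative
working_time = total_clock_time - (retried_generation_time + shared_resource_wait_time)
```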

::: {.callout-note appearance="simple"}
-In order to distinguish successful generate requests from rate limited and retried requests, Inspect installs hooks into the HTTP client of various model packages. This is not possible for some models (`google`, `vertex`, `azureai`, and `goodfire`), and in these cases the `working_time` will include any internal retries that the model client performs.
+In order to distinguish successful generate requests from rate limited and retried requests, Inspect installs hooks into the HTTP client of various model packages. This is not possible for some models (`vertex`, `azureai`, and `goodfire`), and in these cases the `working_time` will include any internal retries that the model client performs.
:::
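For model packages built on `httpx`, this kind of instrumentation can be done with client event hooks — a minimal sketch of the general technique (not Inspect's actual implementation; `record_retry()` is a hypothetical counter):

``` python
import httpx


async def count_retries(response: httpx.Response) -> None:
    # responses that will be retried (e.g. rate limits) surface here
    if response.status_code == 429:
        record_retry()  # hypothetical counter update


client = httpx.AsyncClient(event_hooks={"response": [count_retries]})
```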


4 changes: 2 additions & 2 deletions docs/options.qmd
@@ -86,8 +86,8 @@ Below are sections for the various categories of options supported by `inspect eval`
| `--parallel-tool-calls` | Whether to enable calling multiple functions during tool use (defaults to True) OpenAI and Groq only. |
| `--max-tool-output` | Maximum size of tool output (in bytes). Defaults to 16 \* 1024. |
| `--internal-tools` | Whether to automatically map tools to model internal implementations (e.g. 'computer' for Anthropic). |
-| `--max-retries` | Maximum number of times to retry request (defaults to 5) |
-| `--timeout` | Request timeout (in seconds). |
+| `--max-retries` | Maximum number of times to retry generate request (defaults to unlimited) |
+| `--timeout` | Generate timeout in seconds (defaults to no timeout) |

## Tasks and Solvers

35 changes: 30 additions & 5 deletions docs/parallelism.qmd
@@ -28,9 +28,13 @@ $ inspect eval --model openai/gpt-4 --max-connections 20

The default value for max connections is 10. By increasing it we might get better performance due to higher parallelism; however, we might get *worse* performance if this causes us to frequently hit rate limits (which are retried with exponential backoff). The "correct" max connections for your evaluations will vary based on your actual rate limit and the size and complexity of your evaluations.
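The same option is available from Python — a sketch assuming `eval()` accepts generation config options as keyword arguments (`my_task.py` is a placeholder):

``` python
from inspect_ai import eval

# equivalent of: inspect eval my_task.py --model openai/gpt-4 --max-connections 20
eval("my_task.py", model="openai/gpt-4", max_connections=20)
```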

::: {.callout-note appearance="simple"}
Note that max connections is applied per-model. This means that if you use a grader model from a provider distinct from the one you are evaluating you will get extra concurrency (as each model will enforce its own max connections).
:::

### Rate Limits

-When you run an eval you'll see information reported on the current active connection usage as well as the number of HTTP rate limit errors that have been encountered (note that Inspect will automatically retry on rate limits and other errors likely to be transient):
+When you run an eval you'll see information reported on the current active connection usage as well as the number of HTTP retries that have occurred (Inspect will automatically retry on rate limits and other errors likely to be transient):

![](images/rate-limit.png){fig-alt="The Inspect task results displayed in the terminal. The number of HTTP rate limit errors that have occurred (25) is printed in the bottom right of the task results."}

@@ -40,20 +44,41 @@ You should experiment with various values for max connections at different times

### Limiting Retries

-By default, Inspect will continue to retry model API calls (with exponential backoff) indefinitely when a rate limit error (HTTP status 429) is returned. You can limit these retries by using the `max_retries` and `timeout` eval options. For example:
+By default, Inspect will retry model API calls indefinitely (with exponential backoff) when a recoverable HTTP error occurs. The initial backoff is 3 seconds, and the exponential increase results in a roughly 25-minute wait for the 10th request (then 30 minutes for the 11th and subsequent requests). You can limit Inspect's retries using the `--max-retries` option:

``` bash
inspect eval --model openai/gpt-4 --max-retries 10
```
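The quoted wait times follow from a 3 second base that doubles with each attempt, capped at 30 minutes (jitter ignored) — a quick check under those assumptions:

``` python
# assumes: 3s base, doubling per attempt, 30 minute cap (jitter ignored)
for attempt in range(1, 12):
    wait = min(3 * 2 ** (attempt - 1), 30 * 60)
    print(f"attempt {attempt}: wait {wait / 60:.1f} minutes")
# attempt 10 waits 3 * 2**9 = 1536s (~25.6 minutes); attempts 11+ hit the 30 minute cap
```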

Note that model interfaces themselves may have internal retry behavior (for example, the `openai` and `anthropic` packages both retry twice by default).

You can put a limit on the total time for retries using the `--timeout` option:

``` bash
-$ inspect eval --model openai/gpt-4 --max-retries 10 --timeout 600
+inspect eval --model openai/gpt-4 --timeout 600
```

### Debugging Retries

If you want more insight into Model API connections and retries, specify `log_level=http`. For example:

``` bash
-$ inspect eval --model openai/gpt-4 --log-level=http
+inspect eval --model openai/gpt-4 --log-level=http
```

You can also view all of the HTTP requests for the current (or most recent) evaluation run using the `inspect trace http` command. For example:

``` bash
inspect trace http # show all http requests
inspect trace http --failed # show only failed requests
```

::: {.callout-note appearance="simple"}
-Note that max connections is applied per-model. This means that if you use a grader model from a provider distinct from the one you are evaluating you will get extra concurrency (as each model will enforce its own max connections).
Note that the `inspect trace http` command is currently available only in the development version of Inspect. To install the development version from GitHub:

``` bash
pip install git+https://github.com/UKGovernmentBEIS/inspect_ai
```
:::

## Multiple Models {#sec-multiple-models}
25 changes: 25 additions & 0 deletions docs/tracing.qmd
@@ -69,6 +69,31 @@ As with the `inspect trace dump` command, you can apply a filter when listing an
inspect trace anomalies --filter model
```

## HTTP Requests

::: {.callout-note appearance="simple"}
Note that the `inspect trace http` command described below is currently available only in the development version of Inspect. To install the development version from GitHub:

``` bash
pip install git+https://github.com/UKGovernmentBEIS/inspect_ai
```
:::

You can view all of the HTTP requests for the current (or most recent) evaluation run using the `inspect trace http` command. For example:

``` bash
inspect trace http # show all http requests
inspect trace http --failed # show only failed requests
```

The `--filter` parameter also works here, for example:

``` bash
inspect trace http --failed --filter bedrock
```



## Tracing API {#tracing-api}

In addition to the standard set of actions which are trace logged, you can do your own custom trace logging using the `trace_action()` and `trace_message()` APIs. Trace logging is a great way to make sure that logging context is *always captured* (since the last 10 trace logs are always available) without cluttering up the console or eval transcripts.
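For example — a sketch in which `run_query()` is a placeholder for your own code (check the current docs for exact signatures):

``` python
from logging import getLogger

from inspect_ai.util import trace_action, trace_message

logger = getLogger(__name__)

# a trace action records begin/end (including errors and cancellation) with timing
with trace_action(logger, "database", "querying results db"):
    results = run_query()  # placeholder

# a trace message writes a one-shot entry to the trace log
trace_message(logger, "database", "results db query complete")
```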
5 changes: 2 additions & 3 deletions src/inspect_ai/_cli/eval.py
@@ -11,7 +11,6 @@
    DEFAULT_EPOCHS,
    DEFAULT_LOG_LEVEL_TRANSCRIPT,
    DEFAULT_MAX_CONNECTIONS,
-    DEFAULT_MAX_RETRIES,
)
from inspect_ai._util.file import filesystem
from inspect_ai._util.samples import parse_sample_id, parse_samples_limit
@@ -47,9 +46,9 @@
NO_SCORE_DISPLAY = "Do not display scoring metrics in realtime."
MAX_CONNECTIONS_HELP = f"Maximum number of concurrent connections to Model API (defaults to {DEFAULT_MAX_CONNECTIONS})"
MAX_RETRIES_HELP = (
-    f"Maximum number of times to retry request (defaults to {DEFAULT_MAX_RETRIES})"
+    "Maximum number of times to retry model API requests (defaults to unlimited)"
)
-TIMEOUT_HELP = "Request timeout (in seconds)."
+TIMEOUT_HELP = "Model API request timeout in seconds (defaults to no timeout)"


def eval_options(func: Callable[..., Any]) -> Callable[..., click.Context]:
59 changes: 53 additions & 6 deletions src/inspect_ai/_cli/trace.py
@@ -15,6 +15,7 @@
from inspect_ai._util.error import PrerequisiteError
from inspect_ai._util.trace import (
    ActionTraceRecord,
    TraceRecord,
    inspect_trace_dir,
    list_trace_files,
    read_trace_file,
@@ -84,6 +85,41 @@ def dump_command(trace_file: str | None, filter: str | None) -> None:
)


@trace_command.command("http")
@click.argument("trace-file", type=str, required=False)
@click.option(
"--filter",
type=str,
help="Filter (applied to trace message field).",
)
@click.option(
"--failed",
type=bool,
is_flag=True,
default=False,
help="Show only failed HTTP requests (non-200 status)",
)
def http_command(trace_file: str | None, filter: str | None, failed: bool) -> None:
"""View all HTTP requests in the trace log."""
_, traces = _read_traces(trace_file, "HTTP", filter)

last_timestamp = ""
table = Table(Column(), Column(), box=None)
for trace in traces:
if failed and "200 OK" in trace.message:
continue
timestamp = trace.timestamp.split(".")[0]
if timestamp == last_timestamp:
timestamp = ""
else:
last_timestamp = timestamp
timestamp = f"[{timestamp}]"
table.add_row(timestamp, trace.message)

if table.row_count > 0:
r_print(table)


@trace_command.command("anomalies")
@click.argument("trace-file", type=str, required=False)
@click.option(
@@ -99,12 +135,7 @@ def dump_command(trace_file: str | None, filter: str | None) -> None:
)
def anomolies_command(trace_file: str | None, filter: str | None, all: bool) -> None:
    """Look for anomalies in a trace file (never completed or cancelled actions)."""
-    trace_file_path = _resolve_trace_file_path(trace_file)
-    traces = read_trace_file(trace_file_path)
-
-    if filter:
-        filter = filter.lower()
-        traces = [trace for trace in traces if filter in trace.message.lower()]
+    trace_file_path, traces = _read_traces(trace_file, None, filter)

    # Track started actions
    running_actions: dict[str, ActionTraceRecord] = {}
@@ -199,6 +230,22 @@ def print_fn(o: RenderableType) -> None:
    print(console.export_text(styles=True).strip())


def _read_traces(
    trace_file: str | None, level: str | None = None, filter: str | None = None
) -> tuple[Path, list[TraceRecord]]:
    trace_file_path = _resolve_trace_file_path(trace_file)
    traces = read_trace_file(trace_file_path)

    if level:
        traces = [trace for trace in traces if trace.level == level]

    if filter:
        filter = filter.lower()
        traces = [trace for trace in traces if filter in trace.message.lower()]

    return (trace_file_path, traces)


def _print_bucket(
    print_fn: Callable[[RenderableType], None],
    label: str,
12 changes: 6 additions & 6 deletions src/inspect_ai/_display/core/footer.py
@@ -1,7 +1,7 @@
from rich.console import RenderableType
from rich.text import Text

-from inspect_ai._util.logger import http_rate_limit_count
+from inspect_ai._util.retry import http_retries_count
from inspect_ai.util._concurrency import concurrency_status
from inspect_ai.util._throttle import throttle

@@ -26,12 +26,12 @@ def task_resources() -> str:


def task_counters(counters: dict[str, str]) -> str:
-    return task_dict(counters | task_http_rate_limits())
+    return task_dict(counters | task_http_retries())


-def task_http_rate_limits() -> dict[str, str]:
-    return {"HTTP rate limits": f"{http_rate_limit_count():,}"}
+def task_http_retries() -> dict[str, str]:
+    return {"HTTP retries": f"{http_retries_count():,}"}


-def task_http_rate_limits_str() -> str:
-    return f"HTTP rate limits: {http_rate_limit_count():,}"
+def task_http_retries_str() -> str:
+    return f"HTTP retries: {http_retries_count():,}"
4 changes: 2 additions & 2 deletions src/inspect_ai/_display/plain/display.py
@@ -22,7 +22,7 @@
    TaskSpec,
    TaskWithResult,
)
-from ..core.footer import task_http_rate_limits_str
+from ..core.footer import task_http_retries_str
from ..core.panel import task_panel, task_targets
from ..core.results import task_metric, tasks_results

@@ -182,7 +182,7 @@ def _print_status(self) -> None:
            status_parts.append(resources)

        # Add rate limits
-        rate_limits = task_http_rate_limits_str()
+        rate_limits = task_http_retries_str()
        if rate_limits:
            status_parts.append(rate_limits)

4 changes: 4 additions & 0 deletions src/inspect_ai/_display/textual/widgets/footer.py
@@ -36,3 +36,7 @@ def watch_left(self, new_left: RenderableType) -> None:
    def watch_right(self, new_right: RenderableType) -> None:
        footer_right = cast(Static, self.query_one("#footer-right"))
        footer_right.update(new_right)
        if footer_right.tooltip is None:
            footer_right.tooltip = (
                "Execute 'inspect trace http' for a log of all HTTP requests."
            )
19 changes: 14 additions & 5 deletions src/inspect_ai/_display/textual/widgets/samples.py
@@ -506,6 +506,7 @@ async def sync_sample(self, sample: ActiveSample | None) -> None:
        # track the sample
        self.sample = sample

+        status_group = self.query_one("#" + self.STATUS_GROUP)
        pending_status = self.query_one("#" + self.PENDING_STATUS)
        timeout_tool = self.query_one("#" + self.TIMEOUT_TOOL_CALL)
        clock = self.query_one(Clock)
@@ -537,11 +538,19 @@ async def sync_sample(self, sample: ActiveSample | None) -> None:
                pending_caption = cast(
                    Static, self.query_one("#" + self.PENDING_CAPTION)
                )
-                pending_caption_text = (
-                    "Generating..."
-                    if isinstance(last_event, ModelEvent)
-                    else "Executing..."
-                )
+                if isinstance(last_event, ModelEvent):
+                    # see if there are retries in play
+                    if sample.retry_count > 0:
+                        suffix = "retry" if sample.retry_count == 1 else "retries"
+                        pending_caption_text = (
+                            f"Generating ({sample.retry_count:,} {suffix})..."
+                        )
+                    else:
+                        pending_caption_text = "Generating..."
+                else:
+                    pending_caption_text = "Executing..."
+                status_group.styles.width = max(22, len(pending_caption_text))

                pending_caption.update(
                    Text.from_markup(f"[italic]{pending_caption_text}[/italic]")
                )
3 changes: 1 addition & 2 deletions src/inspect_ai/_eval/context.py
@@ -1,6 +1,6 @@
from inspect_ai._util.dotenv import init_dotenv
from inspect_ai._util.hooks import init_hooks
-from inspect_ai._util.logger import init_http_rate_limit_count, init_logger
+from inspect_ai._util.logger import init_logger
from inspect_ai.approval._apply import have_tool_approval, init_tool_approval
from inspect_ai.approval._human.manager import init_human_approval_manager
from inspect_ai.approval._policy import ApprovalPolicy
Expand All @@ -20,7 +20,6 @@ def init_eval_context(
    init_logger(log_level, log_level_transcript)
    init_concurrency()
    init_max_subprocesses(max_subprocesses)
-    init_http_rate_limit_count()
    init_hooks()
    init_active_samples()
    init_human_approval_manager()
9 changes: 4 additions & 5 deletions src/inspect_ai/_eval/task/sandbox.py
@@ -15,10 +15,9 @@

from inspect_ai._eval.task.task import Task
from inspect_ai._eval.task.util import task_run_dir
-from inspect_ai._util.constants import DEFAULT_MAX_RETRIES, DEFAULT_TIMEOUT
from inspect_ai._util.file import file, filesystem
+from inspect_ai._util.httpx import httpx_should_retry, log_httpx_retry_attempt
from inspect_ai._util.registry import registry_unqualified_name
-from inspect_ai._util.retry import httpx_should_retry, log_retry_attempt
from inspect_ai._util.url import data_uri_to_base64, is_data_uri, is_http_url
from inspect_ai.dataset import Sample
from inspect_ai.util._concurrency import concurrency
@@ -186,14 +185,14 @@ async def _retrying_httpx_get(
    url: str,
    client: httpx.AsyncClient = httpx.AsyncClient(),
    timeout: int = 30,  # per-attempt timeout
-    max_retries: int = DEFAULT_MAX_RETRIES,
-    total_timeout: int = DEFAULT_TIMEOUT,  # timeout for the whole retry loop. not for an individual attempt
+    max_retries: int = 10,
+    total_timeout: int = 120,  # timeout for the whole retry loop. not for an individual attempt
) -> bytes:
    @retry(
        wait=wait_exponential_jitter(),
        stop=(stop_after_attempt(max_retries) | stop_after_delay(total_timeout)),
        retry=retry_if_exception(httpx_should_retry),
-        before_sleep=log_retry_attempt(url),
+        before_sleep=log_httpx_retry_attempt(url),
    )
    async def do_get() -> bytes:
        response = await client.get(
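With the new defaults above, the tenacity retry loop stops at whichever limit is reached first (10 attempts or 120 seconds overall) and retries only errors that `httpx_should_retry` considers transient. A hypothetical call (the URL is a placeholder):

``` python
# hypothetical usage sketch
data = await _retrying_httpx_get("https://example.com/files/archive.tar.gz")
```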
2 changes: 0 additions & 2 deletions src/inspect_ai/_util/constants.py
@@ -6,8 +6,6 @@
PKG_NAME = Path(__file__).parent.parent.stem
PKG_PATH = Path(__file__).parent.parent
DEFAULT_EPOCHS = 1
-DEFAULT_MAX_RETRIES = 5
-DEFAULT_TIMEOUT = 120
DEFAULT_MAX_CONNECTIONS = 10
DEFAULT_MAX_TOKENS = 2048
DEFAULT_VIEW_PORT = 7575
