Non-Compaction Auto-Retry Policy

This document describes the standard API-error retry path in AgentSession.

It explicitly excludes context-overflow recovery via auto-compaction. Overflow is handled by compaction logic and is documented separately in compaction.md.

Retry and compaction are checked from the same agent_end path, but they are intentionally separated:

  1. agent_end inspects the last assistant message.
  2. #isRetryableError(...) runs first.
  3. If retry is initiated, compaction checks are skipped for that turn.
  4. Context-overflow errors are hard-excluded from retry classification (isContextOverflow(...) short-circuits retry).
  5. Overflow therefore falls through to #checkCompaction(...) instead of standard retry.

So: overload/rate/server/network-style failures use this retry policy; context-window overflow uses compaction recovery.
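
The ordering above can be sketched as a small dispatcher. This is a minimal illustration, not the actual AgentSession code: the AssistantMessage and AgentEndHooks shapes and the onAgentEnd name are hypothetical, the real handlers are presumably asynchronous, and the sketch assumes the compaction check runs whenever no retry was started.

```typescript
// Hypothetical shapes for illustration; only the ordering mirrors the text.
interface AssistantMessage {
  stopReason: string;
  errorMessage?: string;
}

interface AgentEndHooks {
  isRetryableError(msg: AssistantMessage): boolean;
  // Returns true only if a retry was actually started
  // (for example, false when retry.enabled is false).
  handleRetryableError(msg: AssistantMessage): boolean;
  checkCompaction(msg: AssistantMessage): void;
}

function onAgentEnd(
  last: AssistantMessage,
  hooks: AgentEndHooks,
): "retry" | "compaction-check" {
  // Retry classification runs first.
  if (hooks.isRetryableError(last) && hooks.handleRetryableError(last)) {
    return "retry"; // compaction checks are skipped for this turn
  }
  // Context overflow is never retry-classified, so it always lands here.
  hooks.checkCompaction(last);
  return "compaction-check";
}
```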

#isRetryableError(...) requires all of the following:

  • assistant stopReason === "error"
  • errorMessage exists
  • message is not context overflow
  • errorMessage matches #isRetryableErrorMessage(...)

Current retryable pattern set (regex-based):

  • overloaded
  • rate limit / usage limit / too many requests
  • HTTP-like server classes: 429, 500, 502, 503, 504
  • service unavailable / server error / internal error
  • connection error / fetch failed
  • retry delay wording

This is string-pattern classification, not typed provider error codes.
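
As a rough sketch of what string-pattern classification looks like, the following combines the four conditions with a single case-insensitive regex. The regex is an approximation assembled from the bullet list above, not the real pattern set (the exact "retry delay" wording in particular is a guess), and the overflow check is passed in as a parameter because its implementation is not described here.

```typescript
// Hypothetical message shape for illustration.
interface LastAssistantMessage {
  stopReason: string;
  errorMessage?: string;
}

// Approximation of the retryable pattern set; the real regex may differ.
const RETRYABLE_PATTERN =
  /overloaded|rate.?limit|usage.?limit|too many requests|\b(?:429|500|502|503|504)\b|service.?unavailable|server error|internal error|connection.?error|fetch failed|retry delay/i;

function isRetryableErrorMessage(text: string): boolean {
  return RETRYABLE_PATTERN.test(text);
}

function isRetryableError(
  msg: LastAssistantMessage,
  isContextOverflow: (m: LastAssistantMessage) => boolean,
): boolean {
  if (msg.stopReason !== "error") return false;     // must be an error stop
  if (msg.errorMessage === undefined) return false; // must carry a message
  if (isContextOverflow(msg)) return false;         // overflow short-circuits retry
  return isRetryableErrorMessage(msg.errorMessage);
}
```

Because the match is on message text, a provider that rewords its error strings can silently change classification; that is the trade-off of not using typed error codes.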

Session state used by retry:

  • #retryAttempt: number (0 means idle)
  • #retryPromise: Promise<void> | undefined (tracks in-progress retry lifecycle)
  • #retryResolve: (() => void) | undefined (resolves #retryPromise)
  • #retryAbortController: AbortController | undefined (cancels backoff sleep)

Flow (#handleRetryableError):

  1. Read retry settings group.
  2. If retry.enabled === false, stop immediately (false, no retry started).
  3. Increment #retryAttempt.
  4. Create #retryPromise once (first attempt in a chain).
  5. If the attempt number exceeds retry.maxRetries, emit the final failure event and stop.
  6. Compute delay: retry.baseDelayMs * 2^(attempt-1).
  7. For usage-limit errors, parse retry hints and call auth storage (markUsageLimitReached(...)); if provider/model switch succeeds, force delay to 0.
  8. Emit auto_retry_start.
  9. Remove the trailing assistant error message from agent runtime state (kept in persisted session history).
  10. Sleep with abort support.
  11. On wake, schedule agent.continue() via setTimeout(..., 0).

#retryAttempt resets to 0 in these cases:

  • first successful non-error, non-aborted assistant message after retries started (emits auto_retry_end { success: true })
  • retry cancellation during backoff sleep
  • max retries exceeded path

#retryPromise is resolved and cleared via #resolveRetry() when the retry chain ends (success, cancellation, or max-exceeded).

Settings:

  • retry.enabled (default true)
  • retry.maxRetries (default 3)
  • retry.baseDelayMs (default 2000)

Attempt numbering:

  • attempt counter is incremented before max-check
  • start events use current attempt (1-based)
  • max-exceeded end event reports attempt: this.#retryAttempt - 1 (last attempted retry count)
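
The increment-then-check ordering is easy to get wrong by one, so here is a minimal sketch of the counter logic. RetryCounter, RetryLimits, and RetryDecision are hypothetical names used only for illustration; the real session keeps this state in #retryAttempt.

```typescript
// Hypothetical types; field names follow the settings listed above.
interface RetryLimits {
  enabled: boolean;
  maxRetries: number;
}

type RetryDecision =
  | { kind: "disabled" }
  | { kind: "exhausted"; reportedAttempt: number }
  | { kind: "retry"; attempt: number };

class RetryCounter {
  attempt = 0; // 0 means idle

  next(limits: RetryLimits): RetryDecision {
    if (!limits.enabled) return { kind: "disabled" };
    this.attempt += 1; // incremented before the max check
    if (this.attempt > limits.maxRetries) {
      // Report the last retry that was actually attempted.
      const reportedAttempt = this.attempt - 1;
      this.attempt = 0; // the max-exceeded path resets the counter
      return { kind: "exhausted", reportedAttempt };
    }
    return { kind: "retry", attempt: this.attempt }; // 1-based for start events
  }
}
```

With maxRetries set to 3, the fourth consecutive retryable error takes the exhausted branch and reports attempt 3.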

Backoff sequence with default settings:

  • attempt 1: 2000 ms
  • attempt 2: 4000 ms
  • attempt 3: 8000 ms
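
The sequence follows directly from retry.baseDelayMs * 2^(attempt-1). A small sketch that reproduces it (the function names are illustrative, not taken from the codebase):

```typescript
// Delay before retry attempt n (1-based): baseDelayMs * 2^(n - 1).
function backoffDelayMs(baseDelayMs: number, attempt: number): number {
  if (!Number.isInteger(attempt) || attempt < 1) {
    throw new RangeError("attempt must be a positive integer");
  }
  return baseDelayMs * 2 ** (attempt - 1);
}

// Full schedule for a chain that uses every allowed retry.
function backoffSchedule(baseDelayMs: number, maxRetries: number): number[] {
  return Array.from({ length: maxRetries }, (_, i) =>
    backoffDelayMs(baseDelayMs, i + 1),
  );
}

const schedule = backoffSchedule(2000, 3); // [2000, 4000, 8000]
const totalSleepMs = schedule.reduce((sum, ms) => sum + ms, 0); // 14000
```

Under the defaults, a chain that exhausts all three retries therefore spends 14 seconds in backoff sleep, not counting request time or any usage-limit path that forces the delay to 0.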

Delay override inputs are used only in the usage-limit handling path, and only to inform the auth-storage decision about switching model or account. They do not alter the backoff itself: in the main non-compaction retry path the delay stays the local exponential value unless a switch succeeds, in which case delayMs = 0.

abortRetry():

  • aborts #retryAbortController (if present)
  • resolves retry promise (#resolveRetry()) so awaiters are unblocked

If the abort lands while the backoff sleep is in progress, the catch path:

  • emits auto_retry_end { success: false, finalError: "Retry cancelled" }
  • resets the attempt counter and the abort controller

abort() calls abortRetry() before aborting the active agent stream. This guarantees that retry backoff is cancelled when the user issues a general abort.
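
A sleep that an AbortController can cancel might look like the sketch below. It uses only the standard AbortController/AbortSignal and timer APIs; sleepWithAbort and backoff are hypothetical names, and the real implementation may be structured differently.

```typescript
// Resolves after `ms`, or rejects early if `signal` aborts.
function sleepWithAbort(ms: number, signal: AbortSignal): Promise<void> {
  return new Promise<void>((resolve, reject) => {
    if (signal.aborted) {
      reject(new Error("Retry cancelled"));
      return;
    }
    const timer = setTimeout(() => {
      signal.removeEventListener("abort", onAbort);
      resolve();
    }, ms);
    function onAbort(): void {
      clearTimeout(timer); // do not leave the timer pending
      reject(new Error("Retry cancelled"));
    }
    signal.addEventListener("abort", onAbort, { once: true });
  });
}

// Returns true if the sleep completed, false if it was cancelled.
async function backoff(ms: number, controller: AbortController): Promise<boolean> {
  try {
    await sleepWithAbort(ms, controller.signal);
    return true;
  } catch {
    // This is where the session would emit
    // auto_retry_end { success: false, finalError: "Retry cancelled" }.
    return false;
  }
}
```

Calling controller.abort() during the sleep makes backoff(...) settle right away instead of waiting out the remaining delay, which matches the cancellation behaviour described above.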

On auto_retry_start, EventController:

  • swaps Esc handler to session.abortRetry()
  • renders loader text: Retrying (attempt/maxAttempts) in Ns… (esc to cancel)

On auto_retry_end, it restores prior Esc handler and clears loader state.

prompt() ultimately waits on #waitForRetry() after agent.prompt(...) returns.

Effect:

  • a prompt call does not fully resolve until any started retry chain finishes (success/failure/cancel)
  • retry lifecycle is part of one logical prompt execution boundary

This prevents callers from treating a retrying turn as complete too early.
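
The promise bookkeeping behind this can be sketched in miniature. RetryGate and the free-standing prompt function are hypothetical stand-ins for #retryPromise, #retryResolve, #resolveRetry() and #waitForRetry(); they illustrate the pattern and are not the real AgentSession code.

```typescript
// Hypothetical miniature of the retry-promise lifecycle.
class RetryGate {
  private promise: Promise<void> | undefined;
  private resolve: (() => void) | undefined;

  // First attempt in a chain creates the promise; later attempts reuse it.
  begin(): void {
    if (this.promise) return;
    this.promise = new Promise<void>((resolve) => {
      this.resolve = resolve;
    });
  }

  // Chain ended: success, cancellation, or max-exceeded.
  end(): void {
    this.resolve?.();
    this.promise = undefined;
    this.resolve = undefined;
  }

  get isRetrying(): boolean {
    return this.promise !== undefined;
  }

  // Resolves immediately when no chain is active.
  async wait(): Promise<void> {
    await this.promise;
  }
}

async function prompt(gate: RetryGate, run: () => Promise<void>): Promise<void> {
  await run();       // stands in for agent.prompt(...)
  await gate.wait(); // stands in for #waitForRetry()
}
```

A caller awaiting prompt(...) stays blocked while gate.isRetrying is true and is released as soon as end() runs.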

Retry settings are defined in the settings schema under the retry group:

  • retry.enabled
  • retry.maxRetries
  • retry.baseDelayMs

Programmatic toggles in session:

  • setAutoRetryEnabled(enabled) writes retry.enabled
  • autoRetryEnabled reads retry.enabled
  • isRetrying reports whether retry lifecycle promise is active

RPC command surface:

  • set_auto_retry → session.setAutoRetryEnabled(command.enabled)
  • abort_retry → session.abortRetry()

Client helpers:

  • RpcClient.setAutoRetry(enabled)
  • RpcClient.abortRetry()

Both commands return success responses; retry progress/failure details come from streamed session events, not command response payloads.

Session-level retry events:

  • auto_retry_start { attempt, maxAttempts, delayMs, errorMessage }
  • auto_retry_end { success, attempt, finalError? }

Propagation:

  • emitted through AgentSession.subscribe(...)
  • forwarded to extension runner as extension events
  • in RPC mode, forwarded directly as JSON event objects (session.subscribe(event => output(event)))
  • in TUI, consumed by EventController for loader/error UI

Final failure surfacing:

  • On max-exceeded or cancellation, auto_retry_end.success === false
  • TUI shows: Retry failed after N attempts: <finalError>
  • Extensions/hooks receive auto_retry_end with same fields
  • RPC consumers receive same event object on stdout stream
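
For consumers of these events, a discriminated union keeps handling exhaustive. The sketch below assumes the event name is carried in a `type` field, which this document does not confirm; the payload fields are the ones listed above, and describeRetryEvent is a hypothetical helper that mimics the TUI strings.

```typescript
// Assumed event shapes; the `type` discriminator is an assumption.
type AutoRetryStart = {
  type: "auto_retry_start";
  attempt: number;
  maxAttempts: number;
  delayMs: number;
  errorMessage: string;
};

type AutoRetryEnd = {
  type: "auto_retry_end";
  success: boolean;
  attempt: number;
  finalError?: string;
};

type RetryEvent = AutoRetryStart | AutoRetryEnd;

// Returns display text, or undefined when nothing needs to be shown.
function describeRetryEvent(event: RetryEvent): string | undefined {
  switch (event.type) {
    case "auto_retry_start": {
      const seconds = Math.ceil(event.delayMs / 1000);
      return `Retrying (${event.attempt}/${event.maxAttempts}) in ${seconds}s… (esc to cancel)`;
    }
    case "auto_retry_end":
      return event.success
        ? undefined
        : `Retry failed after ${event.attempt} attempts: ${event.finalError ?? "unknown error"}`;
  }
}
```

An RPC client or extension could pass each streamed event through a function like this and ignore events of other types.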

Retry stops and will not auto-continue when any of these occur:

  • retry.enabled is false
  • error is not retry-classified
  • error is context overflow (delegated to compaction path)
  • max retries exceeded
  • user cancels retry (abort_retry or Esc during retry loader)
  • global abort (abort) cancels retry first

A new retry chain can still start later on a future retryable error after counters reset.

  • Classification is regex text matching; provider-specific structured errors are not used here.
  • Retry strips the failing assistant error from runtime context before re-continue, but session history still keeps that error entry.
  • RpcSessionState currently exposes autoCompactionEnabled but not an autoRetryEnabled field; RPC callers must track their own toggle state or query settings through other APIs.