
fix(executor): add exponential backoff retry for transient LLM provider errors #3773

Open

Rabba-Meghana wants to merge 5 commits into simstudioai:main from Rabba-Meghana:fix/executor-llm-retry-backoff

Conversation

@Rabba-Meghana

Problem

When LLM providers (OpenAI, Anthropic, Gemini) return a 429 rate limit or 503 service unavailable, the workflow execution crashes immediately with no retry attempt. This silently breaks production agent workflows for users.

Solution

  • Added withRetry() utility with exponential backoff and jitter
  • Wrapped handler.execute() in block-executor so all agent blocks automatically retry on transient provider errors
  • Respects Retry-After headers from OpenAI and Anthropic
  • Never retries user errors (400, 401, 404) — only transient failures
  • Full test coverage: success path, 429 retry, maxRetries exceeded, non-retryable errors, Retry-After parsing
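As a rough sketch of the approach described above (option names, defaults, and the status-extraction logic here are illustrative assumptions, not necessarily the PR's exact API):

```typescript
// Illustrative sketch of a withRetry() utility with exponential backoff and
// full jitter; option names and defaults are assumptions, not the PR's code.
interface RetryOptions {
  maxRetries?: number     // additional attempts after the first call
  initialDelayMs?: number
  maxDelayMs?: number
}

// Transient provider errors worth retrying (rate limit / unavailable / overloaded).
const RETRYABLE_STATUS_CODES = [429, 503, 529]

function getStatus(err: unknown): number | null {
  const e = err as { status?: number; statusCode?: number }
  return e?.status ?? e?.statusCode ?? null
}

async function withRetry<T>(
  fn: () => Promise<T>,
  { maxRetries = 3, initialDelayMs = 500, maxDelayMs = 30_000 }: RetryOptions = {}
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn()
    } catch (err) {
      const status = getStatus(err)
      // User errors (400, 401, 404), unknown errors, and an exhausted retry
      // budget all rethrow immediately.
      if (status === null || !RETRYABLE_STATUS_CODES.includes(status) || attempt >= maxRetries) {
        throw err
      }
      // Exponential backoff with full jitter, capped at maxDelayMs.
      const exponential = Math.min(maxDelayMs, initialDelayMs * 2 ** attempt)
      const delayMs = Math.floor(Math.random() * exponential)
      await new Promise((resolve) => setTimeout(resolve, delayMs))
    }
  }
}
```

A Retry-After-aware delay, as the PR describes, would take precedence over the backoff calculation whenever the provider supplies that header.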

Type of Change

  • Bug fix

Testing

  • retry.test.ts covers 7 scenarios: immediate success, 429 retry, maxRetries exhausted, non-retryable 400/401, network error, Retry-After header
  • Run with: bun test apps/sim/executor/utils/retry.test.ts

Checklist

  • Code follows project style guidelines
  • Self-reviewed my changes
  • Tests added/updated and passing
  • No new warnings introduced
  • I confirm that I have read and agree to the terms outlined in the Contributor License Agreement (CLA)

fix(executor): add exponential backoff retry for transient LLM provider errors

- Add withRetry() utility with exponential backoff and jitter
- Wrap handler.execute() in block-executor so all agent blocks retry on transient provider errors
- Respects Retry-After headers from OpenAI and Anthropic
- Never retries user errors (400, 401, 404); only transient failures
- Full test coverage: success path, 429 retry, maxRetries exceeded, non-retryable errors, Retry-After parsing
@vercel

vercel bot commented Mar 25, 2026


1 Skipped Deployment

Project   Deployment   Actions   Updated (UTC)
docs      Skipped      Skipped   Mar 25, 2026 11:20pm

@cursor

cursor bot commented Mar 25, 2026

PR Summary

Medium Risk
Adds automatic retries around LLM-related block execution, which can change workflow timing/cost and may amplify side effects if provider calls aren’t fully idempotent.

Overview
Prevents workflow runs from failing immediately on transient LLM provider errors by adding a shared withRetry utility (exponential backoff + jitter, optional Retry-After support, retries only for 429/503/529 and specific network errors).

BlockExecutor now wraps execution of LLM-like blocks (AGENT, EVALUATOR, ROUTER, ROUTER_V2) with withRetry, leaving other blocks unchanged, and introduces focused unit tests covering success, retry/limit behavior, non-retryable 4xx, network errors, and Retry-After handling.

Written by Cursor Bugbot for commit d6e45a2.

@Rabba-Meghana
Author

Retries are intentionally scoped to transient provider/network errors only (429, 503, 529). These status codes indicate the request never reached or was not processed by the provider, making retries safe. Non-idempotent side effects from tool calls within agent blocks are not affected; those execute only after the provider call succeeds. Happy to add an idempotent: false opt-out flag to RetryOptions if the team prefers explicit control per block type.

@greptile-apps
Contributor

greptile-apps bot commented Mar 25, 2026

Greptile Summary

This PR adds exponential backoff retry logic (withRetry) to the block executor to handle transient LLM provider errors (429, 503, 529), preventing workflow crashes on rate-limit or service-unavailable responses. The utility is well-structured with jitter, Retry-After header support, and good test coverage.

Two meaningful behavioral issues need attention before merge:

  • Retry scope is too broad: withRetry is applied at the BlockExecutor level and wraps every block type — not just LLM/AI blocks. HTTP request blocks, webhook blocks, database blocks, and any other block that performs a non-idempotent external operation will now also be silently retried up to 3 times on a 429 or 503. This can produce duplicate side effects (duplicate API calls, emails, database writes) that are hard to diagnose.
  • All statusless errors are retried: Errors with no .status/.statusCode property are treated as retryable (status === null → isRetryable = true). This includes ordinary JavaScript runtime errors and validation exceptions, which will now incur up to 30 seconds of unnecessary delay before surfacing to the user.

Confidence Score: 3/5

  • Not safe to merge as-is — retry scope covering all block types can cause duplicate side effects on non-idempotent operations.
  • Two P1 issues both relate to over-broad retry behavior: (1) the retry wraps every block type at the executor level instead of being scoped to LLM blocks, risking duplicate side effects; (2) statusless errors (including JS runtime errors) are unconditionally retried, masking bugs and adding silent latency. Both issues are concrete behavioral problems that can surface in production workflows. The implementation approach is otherwise sound and the test coverage is solid.
  • apps/sim/executor/execution/block-executor.ts (retry scope), apps/sim/executor/utils/retry.ts (statusless error handling)

Important Files Changed

apps/sim/executor/utils/retry.ts
New utility implementing exponential backoff retry with jitter and Retry-After header support. Two issues: all statusless errors are retried (including non-transient JS errors), and the Retry-After header is not respected for 503 responses.

apps/sim/executor/execution/block-executor.ts
Wraps all block handler executions with withRetry, applying retry to every block type (HTTP, database, webhook, email, etc.), not just LLM blocks, risking duplicate side effects on non-idempotent operations.

apps/sim/executor/utils/retry.test.ts
Good test coverage for the happy path, 429 retry, maxRetries exhaustion, non-retryable 400/401, network error, and Retry-After; missing afterEach to restore real timers.

Sequence Diagram

sequenceDiagram
    participant BE as BlockExecutor
    participant WR as withRetry
    participant H as BlockHandler
    participant LLM as LLM Provider

    BE->>WR: withRetry(() => handler.execute(...))
    loop Up to maxRetries (3)
        WR->>H: handler.execute(ctx, block, inputs)
        H->>LLM: API Request
        alt Success
            LLM-->>H: 200 OK
            H-->>WR: output
            WR-->>BE: output
        else 429 / 503 / 529 (transient)
            LLM-->>H: Error (status 429/503/529)
            H-->>WR: throw error
            WR->>WR: Check Retry-After header (429 only)
            WR->>WR: calculateBackoffDelay(attempt, initialDelayMs, maxDelayMs)
            WR->>WR: await setTimeout(delayMs)
        else 400 / 401 / 404 (non-retryable)
            LLM-->>H: Error (status 4xx)
            H-->>WR: throw error
            WR-->>BE: rethrow immediately
        else No status (network / JS error)
            H-->>WR: throw error (no status)
            WR->>WR: treated as retryable (isRetryable = true)
            WR->>WR: await setTimeout(delayMs)
        end
    end
    WR-->>BE: throw lastError (maxRetries exceeded)


Comment on lines +124 to +128
const output = await withRetry(() =>
  handler.executeWithNode
    ? handler.executeWithNode(ctx, block, resolvedInputs, nodeMetadata)
    : handler.execute(ctx, block, resolvedInputs)
)
Contributor


P1 Retry applied to all block types, not just LLM blocks

The withRetry wrapper is placed around every block's handler execution — not just AI/LLM blocks. This means any block that calls an external service (HTTP request blocks, webhook blocks, database blocks, email blocks, etc.) will also be retried up to 3 times on a 429 or 503 response.

For non-idempotent operations — e.g. sending an email, creating a database row, triggering a webhook — this can silently create 2–4 duplicate operations per execution. The PR description says "all agent blocks automatically retry," but the implementation actually retries all block types.

Consider either:

  1. Only retrying within the LLM/AI provider layer (not at the block executor level), or
  2. Checking the block type before applying retry logic:
const isLLMBlock = /* check block.metadata?.id against known LLM block types */
const output = isLLMBlock
  ? await withRetry(() => handler.executeWithNode ? ... : ...)
  : await (handler.executeWithNode ? ... : ...)
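Option 2 could look roughly like the following; the block type names and the minimal withRetry stand-in are assumptions drawn from this thread, not the repository's actual types:

```typescript
// Sketch of gating retries by block type, per option 2 above; block type
// names are assumptions based on this review thread.
const LLM_BLOCK_TYPES = new Set(['agent', 'evaluator', 'router', 'router_v2'])

type Executor<T> = () => Promise<T>

// Minimal stand-in for the PR's withRetry (backoff delay elided for brevity).
async function withRetry<T>(fn: Executor<T>, maxRetries = 3): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn()
    } catch (err) {
      const status = (err as { status?: number }).status ?? null
      const transient = status !== null && [429, 503, 529].includes(status)
      if (!transient || attempt >= maxRetries) throw err
    }
  }
}

async function runBlock<T>(blockType: string, execute: Executor<T>): Promise<T> {
  // Only LLM-style blocks get automatic retries; HTTP, email, webhook, and
  // database blocks execute exactly once to avoid duplicate side effects.
  return LLM_BLOCK_TYPES.has(blockType) ? withRetry(execute) : execute()
}
```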

Author


Fixed in 70ce97f — addressed all review feedback: scoped retries to LLM blocks only, added isNetworkError() for statusless errors, added Retry-After support for 503, and added afterEach vi.useRealTimers() to tests.

Comment on lines +66 to +68
const isRetryable =
status === null ||
retryableStatusCodes.includes(status)
Contributor


P1 All statusless errors trigger retries, including non-transient JS errors

When status === null (no .status or .statusCode on the thrown error), isRetryable is true. This catches genuine network errors — but it also retries any JavaScript runtime error, validation error, or unexpected exception that doesn't carry a status code.

For example, a TypeError, RangeError, or any internal error thrown by a block handler will be silently retried 3 times with up to 30 seconds of total delay before finally failing. This masks bugs and significantly degrades the user-facing error latency for logic errors.

Consider reversing the default: treat unknown errors as non-retryable unless the error is explicitly recognized as transient (e.g., it is a known network error type, or it matches a recognizable pattern):

const isRetryable =
  status !== null
    ? retryableStatusCodes.includes(status)
    : isNetworkError(error) // helper that checks error.code, error.name, etc.
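A sketch of such a helper; the list of transient socket error codes is an assumption (Node's common ones), not taken from the PR:

```typescript
// Sketch of the isNetworkError() helper the reviewer proposes; the code list
// below is an assumption covering Node's common transient socket errors.
const TRANSIENT_NETWORK_CODES = new Set([
  'ECONNRESET', 'ECONNREFUSED', 'ETIMEDOUT', 'ENOTFOUND', 'EAI_AGAIN', 'EPIPE',
])

function isNetworkError(error: unknown): boolean {
  const e = error as { code?: string; cause?: { code?: string } }
  // fetch/undici often wraps the underlying socket error in error.cause
  const code = e?.code ?? e?.cause?.code
  return code !== undefined && TRANSIENT_NETWORK_CODES.has(code)
}
```

With this default, a TypeError or validation error (no recognized code) surfaces immediately instead of burning through the retry budget.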


Comment on lines +76 to +81
if (responseHeaders && status === 429) {
  const retryAfterMs = parseRetryAfterHeader(responseHeaders)
  delayMs = retryAfterMs ?? calculateBackoffDelay(attempt, initialDelayMs, maxDelayMs)
} else {
  delayMs = calculateBackoffDelay(attempt, initialDelayMs, maxDelayMs)
}
Contributor


P2 Retry-After header not respected for 503 responses

The Retry-After header is parsed and used only when status === 429. However, RFC 7231 §7.1.3 specifies that servers sending 503 Service Unavailable may also include a Retry-After header to signal when the service will recover. Both OpenAI and Anthropic can send 503s with Retry-After.

if (responseHeaders && (status === 429 || status === 503)) {
  const retryAfterMs = parseRetryAfterHeader(responseHeaders)
  delayMs = retryAfterMs ?? calculateBackoffDelay(attempt, initialDelayMs, maxDelayMs)
} else {
  delayMs = calculateBackoffDelay(attempt, initialDelayMs, maxDelayMs)
}
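The snippet relies on parseRetryAfterHeader, whose body isn't shown in this thread; a plausible sketch handling both RFC 7231 forms (delta-seconds and HTTP-date) might be:

```typescript
// Hedged sketch of a parseRetryAfterHeader() helper; the real PR's signature
// and header-lookup details are not shown in this thread.
function parseRetryAfterHeader(headers: Record<string, string>): number | null {
  const raw = headers['retry-after'] ?? headers['Retry-After']
  if (!raw) return null
  // Form 1: delta-seconds, e.g. "Retry-After: 2"
  const seconds = Number(raw)
  if (Number.isFinite(seconds) && seconds >= 0) return seconds * 1000
  // Form 2: HTTP-date, e.g. "Retry-After: Fri, 31 Dec 1999 23:59:59 GMT"
  const date = Date.parse(raw)
  if (!Number.isNaN(date)) return Math.max(0, date - Date.now())
  return null
}
```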


Comment on lines +6 to +9
vi.useFakeTimers()
})

it('returns result immediately on success', async () => {
Contributor


P2 Missing afterEach to restore real timers

vi.useFakeTimers() is called in beforeEach but there is no corresponding vi.useRealTimers() in an afterEach. If Vitest runs these tests in the same worker as other test files that depend on real timers (e.g. setTimeout, Date.now()), the leaked fake timer state can cause subtle test failures elsewhere.

afterEach(() => {
  vi.useRealTimers()
})


- Scope retries to LLM blocks only (agent, evaluator, router) to avoid
  duplicate side effects on non-idempotent blocks like HTTP, email, webhook
- Treat statusless errors as non-retryable by default — only retry known
  network errors (ECONNRESET, ETIMEDOUT, etc.), not JS runtime errors
- Respect Retry-After header for 503 responses in addition to 429 (RFC 7231)
- Add afterEach vi.useRealTimers() to prevent timer leak between tests
@Rabba-Meghana
Author

Fixed in 70ce97f — addressed all review feedback: scoped retries to LLM blocks only, added isNetworkError() for statusless errors, added Retry-After support for 503, and added afterEach vi.useRealTimers() to tests.


@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 2 potential issues.


const capped = Math.min(maxDelayMs, exponential)
const jitter = Math.random() * capped * 0.2
return Math.floor(Math.min(capped + jitter, maxDelayMs))
}


Jitter eliminated when backoff delay reaches max cap

Low Severity

In calculateBackoffDelay, capped is already clamped to maxDelayMs, but then jitter is added to capped and the result is clamped again by Math.min(..., maxDelayMs). When exponential >= maxDelayMs, the jitter is always clamped back to zero, producing a deterministic maxDelayMs with no randomization. This defeats the purpose of jitter (preventing thundering herd) once the exponential component reaches the cap. The jitter needs to be subtracted from (or applied within) capped rather than added on top of it.
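One way to apply the suggested fix, as a sketch rather than the committed code: randomize within the capped window instead of adding jitter on top of it, so randomization survives even once the exponential component reaches the cap.

```typescript
// Sketch of a jitter-preserving backoff calculation, per the note above;
// the 20% jitter factor mirrors the snippet but is otherwise an assumption.
function calculateBackoffDelay(
  attempt: number,
  initialDelayMs: number,
  maxDelayMs: number
): number {
  const exponential = initialDelayMs * 2 ** attempt
  const capped = Math.min(maxDelayMs, exponential)
  // Subtracting jitter keeps the result in [0.8 * capped, capped], so delays
  // stay randomized even when capped === maxDelayMs (no re-clamping to zero).
  const jitter = Math.random() * capped * 0.2
  return Math.floor(capped - jitter)
}
```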

