
fix(agents): probe single-provider billing cooldowns #41422

Merged
altaywtf merged 4 commits into main from fix/40226-billing-recovery
Mar 9, 2026

Conversation

@altaywtf
Member

@altaywtf altaywtf commented Mar 9, 2026

Summary

  • Problem: after a billing error, single-provider setups stay stuck returning billing failures until openclaw gateway restart, even after the user tops up credits.
  • Why it matters: the current billing cooldown probe logic only retries when fallback models exist, so a lone Anthropic provider has no recovery path.
  • What changed: allow the primary model to be probed on the existing 30-second throttle when the cooldown reason is billing and there are no fallback candidates.
  • What did NOT change (scope boundary): no broadening of no-fallback retries for rate_limit or auth; fallback-bearing providers still keep the near-expiry probe behavior.
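The scope described above can be sketched as a small decision function. This is a simplified illustration, not the code in src/agents/model-fallback.ts: the type names, the `decideCooldown` function, and the `nearExpiryProbe` and `throttleOpen` flags are assumptions made for the example, and both flags are assumed to be computed by the caller.

```typescript
// Hypothetical, simplified types; the real ones may differ.
type CooldownReason = "billing" | "rate_limit" | "overloaded" | "auth";

interface CooldownParams {
  reason: CooldownReason;
  isPrimary: boolean;
  hasFallbackCandidates: boolean;
  // Assumed to be computed elsewhere: the existing near-expiry probe,
  // which only applies when fallbacks exist.
  nearExpiryProbe: boolean;
  // Assumed to be computed elsewhere: true when the 30-second probe
  // throttle allows another attempt.
  throttleOpen: boolean;
}

type CooldownDecision =
  | { type: "attempt"; reason: CooldownReason; markProbe: true }
  | { type: "skip"; reason: CooldownReason };

function decideCooldown(p: CooldownParams): CooldownDecision {
  // New path: a lone primary in a billing cooldown may be probed
  // whenever the throttle is open.
  const singleProviderBillingProbe =
    p.reason === "billing" &&
    p.isPrimary &&
    !p.hasFallbackCandidates &&
    p.throttleOpen;

  if (p.isPrimary && (p.nearExpiryProbe || singleProviderBillingProbe)) {
    return { type: "attempt", reason: p.reason, markProbe: true };
  }
  // Every other no-fallback case (rate_limit, auth, ...) is still skipped.
  return { type: "skip", reason: p.reason };
}
```

In this sketch, a rate_limit cooldown with no fallbacks still yields a skip, which mirrors the scope boundary stated in the last bullet.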

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

User-visible / Behavior Changes

Single-provider billing cooldowns can now recover automatically on the normal probe throttle after credits are restored, instead of staying stuck until a gateway restart.

Security Impact (required)

  • New permissions/capabilities? (No)
  • Secrets/tokens handling changed? (No)
  • New/changed network calls? (No)
  • Command/tool execution surface changed? (No)
  • Data access scope changed? (No)

Repro + Verification

Environment

  • OS: macOS
  • Runtime/container: Node 22 + pnpm
  • Model/provider: single-provider billing cooldown path
  • Integration/channel (if any): n/a
  • Relevant config (redacted): primary model with no configured fallbacks

Steps

  1. Put the primary provider into a billing cooldown state.
  2. Configure the model chain with no fallback candidates.
  3. Retry after credits are restored.

Expected

  • The primary can be retried on the probe throttle and recover without gateway restart.

Actual

  • Before this change, the primary was skipped indefinitely and the run ended with "All models failed".

Evidence

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

Human Verification (required)

What you personally verified (not just CI), and how:

  • Verified scenarios:
    • Reproducing test for single-provider billing cooldown now probes instead of failing immediately.
    • Existing fallback probe behavior still passes for rate-limit, overload, and billing-with-fallback paths.
  • Edge cases checked:
    • No-fallback probe path remains limited to billing.
    • Existing 30-second probe throttle is preserved.
  • What you did not verify:
    • Live provider top-up against Anthropic.

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

  • Backward compatible? (Yes)
  • Config/env changes? (No)
  • Migration needed? (No)

Failure Recovery (if this breaks)

  • How to disable/revert this change quickly: revert commit 00e96296c
  • Files/config to restore: src/agents/model-fallback.ts, src/agents/model-fallback.probe.test.ts
  • Known bad symptoms reviewers should watch for: unexpected no-fallback retries for non-billing cooldown reasons

Risks and Mitigations

  • Risk: more frequent retry attempts could widen single-provider churn during active billing outages.
    • Mitigation: reuse the existing 30-second throttle and keep the no-fallback path limited to billing only.
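To make the mitigation concrete, here is a minimal sketch of a 30-second probe throttle. Only the name `isProbeThrottleOpen` and its `(now, key)` argument order come from this PR; the in-memory `Map`, the `markProbeAttempt` helper, and the choice to reopen at exactly 30 seconds are assumptions for illustration.

```typescript
// The 30-second throttle named in the PR description.
const PROBE_THROTTLE_MS = 30_000;

// Hypothetical in-memory store: throttle key -> timestamp (ms) of the last probe.
const lastProbeAt = new Map<string, number>();

// True when no probe has been recorded for the key, or when at least
// PROBE_THROTTLE_MS has elapsed since the last recorded probe.
function isProbeThrottleOpen(now: number, key: string): boolean {
  const last = lastProbeAt.get(key);
  return last === undefined || now - last >= PROBE_THROTTLE_MS;
}

// Hypothetical helper: record that a probe was attempted at `now`.
function markProbeAttempt(now: number, key: string): void {
  lastProbeAt.set(key, now);
}
```

With a throttle of this shape, a provider that is still out of credits sees at most one probe per key per 30 seconds, no matter how often runs are retried.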

@openclaw-barnacle bot added the agents (Agent runtime and tooling), size: S, and maintainer (Maintainer-authored PR) labels Mar 9, 2026
@altaywtf altaywtf self-assigned this Mar 9, 2026
@greptile-apps
Contributor

greptile-apps bot commented Mar 9, 2026

Greptile Summary

This PR correctly fixes the stuck-recovery bug where a single-provider setup in a billing cooldown could never auto-recover. The fix adds a shouldProbeSingleProviderBilling branch in resolveCooldownDecision that allows the primary model to be retried on the existing 30-second probe throttle when inferredReason === "billing" and there are no fallback candidates.

What's verified:

  • The change is tightly scoped to billing only; rate_limit, overloaded, and auth no-fallback paths are untouched
  • The extracted isProbeThrottleOpen helper eliminates duplication and makes the new call site clean
  • The probe throttle (markProbe: true) is set on entry to this new path, bounding churn during an active billing outage to one attempt per 30s
  • Test coverage includes the happy path (probe succeeds) and regression paths (billing + with-fallbacks near/far expiry, rate-limit + no-fallbacks)
  • The logic correctly gates the new path to billing-only and no-fallback scenarios, with existing throttle guards in place

Confidence Score: 5/5

  • Safe to merge — the fix is tightly scoped to billing + no-fallback recovery and reuses the proven 30-second throttle mechanism.
  • The change is minimal and well-contained: it adds a single branch in resolveCooldownDecision to enable probing for single-provider billing cooldowns. The logic correctly gates the new path to billing only and verifies the throttle is open before probing. The extracted isProbeThrottleOpen helper is a clean refactor with no side-effects. All existing tests pass, and the new test accurately exercises the bug-fix path. The scope boundary (billing-only, no-fallback-only) is enforced in code and verified by tests.
  • No files require special attention.

Last reviewed commit: 00e9629

@altaywtf force-pushed the fix/40226-billing-recovery branch 2 times, most recently from 91dca00 to 8b6e118, March 9, 2026 21:11
@altaywtf force-pushed the fix/40226-billing-recovery branch from f96c75e to bbc4254, March 9, 2026 21:55
@altaywtf altaywtf merged commit 0669b0d into main Mar 9, 2026
27 of 29 checks passed
@altaywtf altaywtf deleted the fix/40226-billing-recovery branch March 9, 2026 21:58
@altaywtf
Member Author

altaywtf commented Mar 9, 2026

Merged via squash.

Thanks @altaywtf!

@aisle-research-bot

aisle-research-bot bot commented Mar 9, 2026

🔒 Aisle Security Analysis

We found 1 potential security issue(s) in this PR:

# | Severity | Title
1 | 🟡 Medium | Billing cooldown bypass via single-provider probing can trigger outbound provider calls during disabled window

1. 🟡 Billing cooldown bypass via single-provider probing can trigger outbound provider calls during disabled window

Property | Value
Severity | Medium
CWE | CWE-840
Location | src/agents/model-fallback.ts:474-486

Description

The new billing-cooldown probing logic in runWithModelFallback() now attempts the primary model even when all auth profiles are billing-disabled and there are no fallback models, as long as the in-memory probe throttle allows it.

This is security-relevant in hosted/multi-user deployments where untrusted users can trigger runs using shared provider credentials:

  • Previously (per the removed comment/behavior), billing-disabled primary with no fallbacks would be skipped, preventing outbound calls while the auth profiles were disabled.
  • Now, for inferredReason === "billing" and !hasFallbackCandidates, the code will return an attempt decision and set allowTransientCooldownProbe: true.
  • Downstream, allowTransientCooldownProbe is used by the embedded runner to probe one cooldowned auth profile even when all profiles are in cooldown, including when the inferred reason is billing (see src/agents/pi-embedded-runner/run.ts around the allowTransientCooldownProbe / didTransientCooldownProbe loop).

Impact:

  • An untrusted actor repeatedly invoking agent runs can force periodic outbound calls to a provider even during a billing-disabled window (cost/abuse/traffic amplification), rather than failing fast.
  • The throttle is process-memory and non-atomic (check-then-set), so concurrent requests can still cause a burst of probe attempts before the throttle is marked.

Vulnerable code (new branch enabling the probe):

if (inferredReason === "billing") {
  const shouldProbeSingleProviderBilling =
    params.isPrimary &&
    !params.hasFallbackCandidates &&
    isProbeThrottleOpen(params.now, params.probeThrottleKey);
  if (params.isPrimary && (shouldProbe || shouldProbeSingleProviderBilling)) {
    return { type: "attempt", reason: inferredReason, markProbe: true };
  }
  return { type: "skip", ... };
}

And the probe option that enables attempting a cooldowned profile:

if (decision.reason === "rate_limit" ||
    decision.reason === "overloaded" ||
    decision.reason === "billing") {
  runOptions = { allowTransientCooldownProbe: true };
}

Recommendation

Treat billing disablement as a hard block for untrusted/user-triggered runs, or gate billing probes behind explicit server-side policy.

Options:

  1. Do not probe billing-disabled profiles for untrusted triggers (user traffic), only for trusted operators/cron health checks.
    • Add a new parameter like trustedCooldownProbe derived from the request trust level (senderIsOwner/internal triggers) and require it for billing probes.
  2. Stronger rate limiting for probes (to prevent outbound call abuse):
    • Make the throttle persistent (per agent/provider) across process restarts.
    • Make the throttle atomic by marking the probe before checking eligibility (or using an in-flight lock/promise per throttleKey).
    • Add an additional limiter keyed by (sessionKey, provider) or request origin (IP/account/user) for ingress-triggered runs.
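One way to read the "atomic throttle" option is to fold the check and the mark into a single synchronous step. The sketch below assumes a single Node process and an in-memory map. `tryAcquireProbe`, `probeStartedAt`, and `PROBE_INTERVAL_MS` are hypothetical names, not part of the PR.

```typescript
const PROBE_INTERVAL_MS = 30_000;

// Hypothetical in-memory store: throttle key -> timestamp (ms) of the last granted probe.
const probeStartedAt = new Map<string, number>();

// Check and mark in one synchronous step. There is no await between the
// read and the write, so two async callers sharing one event loop cannot
// both be granted a probe for the same key within the interval.
function tryAcquireProbe(now: number, key: string): boolean {
  const last = probeStartedAt.get(key);
  if (last !== undefined && now - last < PROBE_INTERVAL_MS) {
    return false; // another caller already holds the probe slot
  }
  probeStartedAt.set(key, now);
  return true;
}
```

This only closes the check-then-set gap inside one process. It does not make the throttle persistent across restarts or shared between multiple gateway instances, which would need an external store.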

Example gating (conceptual):

// plumb this from caller context (e.g., senderIsOwner/trigger)
const allowBillingProbe = params.trustedCaller === true;

if (inferredReason === "billing" && !allowBillingProbe) {
  return {
    type: "skip",
    reason: inferredReason,
    error: `Provider ${params.candidate.provider} has billing issue (skipping all models)`,
  };
}

If billing probes are required for recovery, consider probing via a dedicated, low-rate background job rather than user-triggered requests.


Analyzed PR: #41422 at commit bbc4254

Last updated on: 2026-03-09T22:34:01Z

mrosmarin added a commit to mrosmarin/openclaw that referenced this pull request Mar 9, 2026
* main: (33 commits)
  Exec: mark child command env with OPENCLAW_CLI (openclaw#41411)
  fix(plugins): expose model auth API to context-engine plugins (openclaw#41090)
  Add HTTP 499 to transient error codes for model fallback (openclaw#41468)
  Logging: harden probe suppression for observations (openclaw#41338)
  fix(discord): apply effective maxLinesPerMessage in live replies (openclaw#40133)
  build(protocol): regenerate Swift models after pending node work schemas (openclaw#41477)
  Agents: add fallback error observations (openclaw#41337)
  acp: harden follow-up reliability and attachments (openclaw#41464)
  fix(agents): probe single-provider billing cooldowns (openclaw#41422)
  acp: add regression coverage and smoke-test docs (openclaw#41456)
  acp: forward attachments into ACP runtime sessions (openclaw#41427)
  acp: enrich streaming updates for ide clients (openclaw#41442)
  Sandbox: import STATE_DIR from paths directly (openclaw#41439)
  acp: restore session context and controls (openclaw#41425)
  acp: fail honestly in bridge mode (openclaw#41424)
  Gateway: tighten node pending drain semantics (openclaw#41429)
  Gateway: add pending node work primitives (openclaw#41409)
  fix(auth): reset cooldown error counters on expiry to prevent infinite escalation (openclaw#41028)
  fix(cron): do not misclassify empty/NO_REPLY as interim acknowledgement (openclaw#41401)
  iOS: reconnect gateway on foreground return (openclaw#41384)
  ...
ademczuk pushed a commit to ademczuk/openclaw that referenced this pull request Mar 10, 2026
Merged via squash.

Prepared head SHA: bbc4254
Co-authored-by: altaywtf <[email protected]>
Co-authored-by: altaywtf <[email protected]>
Reviewed-by: @altaywtf
mukhtharcm pushed a commit to hnykda/openclaw that referenced this pull request Mar 10, 2026
Merged via squash.

Prepared head SHA: bbc4254
Co-authored-by: altaywtf <[email protected]>
Co-authored-by: altaywtf <[email protected]>
Reviewed-by: @altaywtf
jenawant pushed a commit to jenawant/openclaw that referenced this pull request Mar 10, 2026
Merged via squash.

Prepared head SHA: bbc4254
Co-authored-by: altaywtf <[email protected]>
Co-authored-by: altaywtf <[email protected]>
Reviewed-by: @altaywtf
aiwatching pushed a commit to aiwatching/openclaw that referenced this pull request Mar 10, 2026
Merged via squash.

Prepared head SHA: bbc4254
Co-authored-by: altaywtf <[email protected]>
Co-authored-by: altaywtf <[email protected]>
Reviewed-by: @altaywtf
Moshiii pushed a commit to Moshiii/openclaw that referenced this pull request Mar 11, 2026
Merged via squash.

Prepared head SHA: bbc4254
Co-authored-by: altaywtf <[email protected]>
Co-authored-by: altaywtf <[email protected]>
Reviewed-by: @altaywtf
dominicnunez pushed a commit to dominicnunez/openclaw that referenced this pull request Mar 11, 2026
Merged via squash.

Prepared head SHA: bbc4254
Co-authored-by: altaywtf <[email protected]>
Co-authored-by: altaywtf <[email protected]>
Reviewed-by: @altaywtf
dhoman pushed a commit to dhoman/chrono-claw that referenced this pull request Mar 11, 2026
Merged via squash.

Prepared head SHA: bbc4254
Co-authored-by: altaywtf <[email protected]>
Co-authored-by: altaywtf <[email protected]>
Reviewed-by: @altaywtf
Ruijie-Ysp pushed a commit to Ruijie-Ysp/clawdbot that referenced this pull request Mar 12, 2026
Merged via squash.

Prepared head SHA: bbc4254
Co-authored-by: altaywtf <[email protected]>
Co-authored-by: altaywtf <[email protected]>
Reviewed-by: @altaywtf
qipyle pushed a commit to qipyle/openclaw that referenced this pull request Mar 12, 2026
Merged via squash.

Prepared head SHA: bbc4254
Co-authored-by: altaywtf <[email protected]>
Co-authored-by: altaywtf <[email protected]>
Reviewed-by: @altaywtf
GGzili pushed a commit to GGzili/moltbot that referenced this pull request Mar 12, 2026
Merged via squash.

Prepared head SHA: bbc4254
Co-authored-by: altaywtf <[email protected]>
Co-authored-by: altaywtf <[email protected]>
Reviewed-by: @altaywtf
Interstellar-code pushed a commit to Interstellar-code/operator1 that referenced this pull request Mar 16, 2026
Merged via squash.

Prepared head SHA: bbc4254
Co-authored-by: altaywtf <[email protected]>
Co-authored-by: altaywtf <[email protected]>
Reviewed-by: @altaywtf

(cherry picked from commit 0669b0d)

Labels

agents (Agent runtime and tooling), maintainer (Maintainer-authored PR), size: S

Development

Successfully merging this pull request may close these issues.

[Bug]: Gateway session caches billing error state — renewing credits does not recover without gateway restart

1 participant