fix: Retry on 404 in run details to handle read replica lag#440
fix: Retry on 404 in run details to handle read replica lag#440olivermeyer merged 2 commits intomainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
Adds resiliency to Run.details() in the Platform client by retrying NotFoundException (HTTP 404) to better tolerate short-lived read-replica lag immediately after run submission.
Changes:
- Include
NotFoundExceptioninRun.details()retry conditions. - Switch
Run.details()retry stop condition tostop_after_delayand adjust the wait strategy.
Codecov Report✅ All modified and coverable lines are covered by tests.
|
cb5b1a8 to
f4a9199
Compare
| stop=stop_after_delay(5), | ||
| wait=wait_exponential_jitter(initial=0.5, max=3), |
There was a problem hiding this comment.
The hardcoded retry parameters (5 seconds max delay, 0.5-3s jitter) for NotFoundException retries are not configurable through settings, unlike the inner retry which uses settings().run_retry_attempts, settings().run_retry_wait_min, and settings().run_retry_wait_max. For consistency and maintainability, consider adding configuration options for these read replica lag retry parameters, such as 'run_replica_lag_retry_max_delay', 'run_replica_lag_retry_wait_min', and 'run_replica_lag_retry_wait_max'. This would allow operators to adjust these values based on their infrastructure's read replica lag characteristics without code changes.
There was a problem hiding this comment.
This is the only place where these values are used. We don't know for sure if the change would work. And we don't know where else we need these values yet. We can add some comments but there's little value to premature abstraction here.
f4a9199 to
957ac91
Compare
|
|
||
|
|
||
| @pytest.mark.unit | ||
| def test_run_details_retries_on_not_found_then_succeeds(app_run, mock_api) -> None: |
There was a problem hiding this comment.
BLOCKING: Missing timeout marker
These tests trigger retry logic with up to 5 seconds of delay, exceeding the default 10s pytest timeout. All three new tests need:
@pytest.mark.timeout(timeout=15)
@pytest.mark.unit
def test_run_details_retries_on_not_found_then_succeeds(app_run, mock_api) -> None:Reference: CLAUDE.md Section "Testing Workflow" - "Default timeout: 10 seconds per test. Override in test: @pytest.mark.timeout(timeout=60)"
Verify fix:
uv run pytest tests/aignostics/platform/resources/runs_test.py::test_run_details_retries_on_not_found_then_succeeds -v|
|
||
|
|
||
| @pytest.mark.unit | ||
| def test_run_details_does_not_retry_other_exceptions(app_run, mock_api) -> None: |
There was a problem hiding this comment.
BLOCKING: Missing timeout marker
Add timeout marker for consistency:
@pytest.mark.timeout(timeout=15)
@pytest.mark.unitWhile this test should fail fast (1 call), it's better to be explicit given the retry context.
| _headers={"User-Agent": user_agent()}, | ||
| ) | ||
| ) | ||
|
|
There was a problem hiding this comment.
SUGGESTION: Consider retry budget implications
The nested retry structure means:
- Inner retry: Up to 4 attempts for transient errors (network, 5xx)
- Outer retry: Up to 5 seconds for
NotFoundException
Potential issue: If the inner retry succeeds but returns a NotFoundException, the outer retry will keep calling _fetch(), which will do up to 4 more retries on each attempt. This could lead to many API calls.
Example scenario:
- Outer retry attempt 1 →
_fetch()calls API 4 times (all return 404) → raisesNotFoundException - Wait ~0.5s
- Outer retry attempt 2 →
_fetch()calls API 4 times again → raisesNotFoundException - Continue until 5 seconds elapsed
Recommendation: Consider if the inner retry should even run when we get a clean NotFoundException. The read replica lag issue is deterministic (run doesn't exist yet), not a transient network error.
Possible alternative:
def _fetch() -> RunData:
# For NotFoundException, skip inner retry since it's not a transient error
return self._api.get_run_v1_runs_run_id_get(
run_id,
_request_timeout=settings().run_timeout,
_headers={"User-Agent": user_agent()},
)
# Outer retry handles NotFoundException (read replica lag)
# AND regular retryable exceptions as fallback
return Retrying(
retry=retry_if_exception_type(
exception_types=(NotFoundException, *RETRYABLE_EXCEPTIONS)
),
stop=stop_after_delay(5),
wait=wait_exponential_jitter(initial=0.5, max=3),
before_sleep=_log_retry_attempt,
reraise=True,
)(_fetch)This would give you a single retry loop that handles both 404s (up to 5s) and transient errors (up to 5s), avoiding nested retries.
However, the current approach is functionally correct - just potentially more API calls than necessary. Consider discussing with the team.
|
Claude finished @nahua-aignx's task in 2m 51s —— View job Post-Merge Review: PR #440 - Retry on 404 for Read Replica LagI've analyzed this merged PR that fixes read replica lag handling in ✅ What Was FixedThe PR successfully addresses the original issue where tests failed due to a 125ms gap between run submission and retrieval, hitting read replica lag. The solution adds an outer retry layer specifically for 🔴 CRITICAL ISSUE: Missing Test Timeout MarkersAll three new tests are timing out! They exceed the default 10-second pytest timeout because they involve retry logic with up to 5 seconds of delay. Location: Tests affected:
Required fix (per CLAUDE.md "Testing Workflow" section): @pytest.mark.timeout(timeout=15)
@pytest.mark.unit
def test_run_details_retries_on_not_found_then_succeeds(app_run, mock_api) -> None:Why this matters:
Reference: CLAUDE.md Section "Testing Workflow" states: "Default timeout: 10 seconds per test. Override in test: Verification command: uv run pytest tests/aignostics/platform/resources/runs_test.py::test_run_details_retries_on_not_found_then_succeeds -v
|
|



Yesterday an hourly test on staging failed most likely due to read replica lag where the gap between the test submitting a run and the test trying to read the test was 125ms.
I'd like to propose a small fix to
details_with_retryso that we include theNotFoundExceptionas a retry condition and usestop_after_delaywith a different wait condition. Let me know if this is fine.