feat: add SRE incident response agent by neelay-aign · Pull Request #541 · aignostics/python-sdk

neelay-aign · 2026-04-14T12:39:56Z

Summary

Add a background SRE agent that triages BetterStack incidents for the Python SDK using the Anthropic Managed Agents API
Agent uses GitHub MCP server + web search + built-in tools (zero custom tools) to read workflow run logs, diagnose failures, and create draft fix PRs
Orchestrator runs as a GitHub Actions workflow triggered by repository_dispatch (zero separate infrastructure)
Includes 15 unit tests for incident filtering and prompt construction

Architecture

BetterStack incident → GitHub repository_dispatch → GH Actions workflow
  → Python orchestrator fetches incident from BetterStack API
  → Creates Managed Agent session on Anthropic infra
  → Agent triages using GitHub MCP + web search + mounted repo
  → Creates draft PR or issue with findings

Files

File	Purpose
`sre-agent/src/sre_agent/main.py`	Orchestrator: fetch incident, filter, create session, stream
`sre-agent/src/sre_agent/_config.py`	Pydantic Settings for env vars
`sre-agent/src/sre_agent/_setup.py`	One-time script to create agent/environment/skill/vault on Anthropic
`sre-agent/skills/sre-runbook/SKILL.md`	Repo-specific triage context for the agent
`.github/workflows/sre-incident-response.yml`	GH Actions workflow (dispatch + manual trigger)
`sre-agent/tests/test_main.py`	15 unit tests

Setup steps (post-merge)

1. Run one-time setup to create Anthropic resources

cd sre-agent
ANTHROPIC_API_KEY=<key> SRE_GITHUB_PAT=<fine-grained-pat> uv run python -m sre_agent._setup

The fine-grained PAT needs write access to Contents, Pull requests, and Issues on this repo (contents:write, pull-requests:write, issues:write). PRs will appear under the PAT owner's GitHub identity.

2. Store output as GitHub Actions secrets

The setup script prints three IDs. Add these as repo secrets:

SRE_AGENT_ID
SRE_ENVIRONMENT_ID
SRE_VAULT_ID

Also add:

BETTERSTACK_API_TOKEN — BetterStack API token (Settings → API tokens)
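
A quick pre-flight check can catch a missing secret before the orchestrator starts. The sketch below is only an illustration: the four names come from the steps above, while `REQUIRED_SECRETS` and `missing_secrets` are hypothetical helpers that are not part of this PR (the PR itself loads these values through Pydantic Settings in `_config.py`).

```python
import os
from typing import Mapping

# Secret names taken from the setup steps above; the helper itself is hypothetical.
REQUIRED_SECRETS = (
    "SRE_AGENT_ID",
    "SRE_ENVIRONMENT_ID",
    "SRE_VAULT_ID",
    "BETTERSTACK_API_TOKEN",
)


def missing_secrets(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required names that are unset or blank in env."""
    return [name for name in REQUIRED_SECRETS if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_secrets()
    if missing:
        raise SystemExit(f"Missing required secrets: {', '.join(missing)}")
```

Running this as the first workflow step would fail the job with a readable message instead of a settings validation traceback.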

3. Configure BetterStack webhook

Set up a webhook integration in BetterStack that POSTs to:

URL: https://api.github.com/repos/aignostics/python-sdk/dispatches
Headers: Authorization: token <GITHUB_PAT>, Accept: application/vnd.github+json
Body:

{
  "event_type": "betterstack-incident",
  "client_payload": {
    "incident_id": "{{incident_id}}"
  }
}
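
For reference, the same request can be built in Python to check the payload shape before wiring up BetterStack. This is a minimal sketch using only the standard library; `build_dispatch_request` is a hypothetical helper, and the `Content-Type` header is an addition not listed in the webhook settings above.

```python
import json
import urllib.request

DISPATCH_URL = "https://api.github.com/repos/aignostics/python-sdk/dispatches"


def build_dispatch_request(incident_id: str, github_pat: str) -> urllib.request.Request:
    """Build (but do not send) the repository_dispatch request described above."""
    body = {
        "event_type": "betterstack-incident",
        # client_payload must be a JSON object, not a JSON-encoded string
        "client_payload": {"incident_id": incident_id},
    }
    return urllib.request.Request(
        DISPATCH_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"token {github_pat}",
            "Accept": "application/vnd.github+json",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` sends the event; GitHub answers a successful dispatch with 204 No Content.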

Testing

Simulated incident (no external deps needed)

gh workflow run sre-incident-response.yml -f simulate=true

Real BetterStack incident

gh workflow run sre-incident-response.yml -f incident_id=949981259 -f simulate=false

Simulated repository_dispatch (mimics BetterStack webhook)

gh api repos/aignostics/python-sdk/dispatches \
  -f event_type=betterstack-incident \
  -f 'client_payload[incident_id]=949981259'

Unit tests

cd sre-agent && uv sync --extra dev && uv run pytest -v

Test plan

Run unit tests locally (uv run pytest -v in sre-agent/)
Run setup script to create Anthropic resources
Test with simulated incident via workflow_dispatch
Test with real incident ID via workflow_dispatch
Configure BetterStack webhook and test end-to-end

🤖 Generated with Claude Code

Add a background SRE agent that triages BetterStack incidents for the Python SDK and creates fix PRs via the GitHub MCP server. Architecture: BetterStack webhook -> GitHub repository_dispatch -> GH Actions workflow -> Managed Agent session on Anthropic infra. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

sonarqubecloud · 2026-04-14T12:41:13Z

Quality Gate failed

Failed conditions
1 Security Hotspot

See analysis details on SonarQube Cloud

sentry · 2026-04-14T12:43:45Z

+    ]
+
+    if attrs.get("response_content"):
+        ctx = json.loads(attrs["response_content"])


Bug: The json.loads() call on response_content is not wrapped in a try-except block, which can cause a crash if the API returns malformed JSON.
Severity: HIGH

Suggested Fix

Wrap the json.loads(attrs["response_content"]) call in a try-except json.JSONDecodeError block. Log the error and gracefully handle the case where response_content cannot be parsed, for example, by proceeding with an empty context ctx = {}.
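
One way the suggested fix could look is sketched below. `parse_response_content` is a hypothetical helper name, not code from the PR, and the extra guard for non-object JSON goes slightly beyond what the review comment asks for.

```python
import json
import logging
from typing import Any

logger = logging.getLogger(__name__)


def parse_response_content(attrs: dict[str, Any]) -> dict[str, Any]:
    """Parse attrs['response_content'] as JSON, falling back to an empty context."""
    raw = attrs.get("response_content")
    if not raw:
        return {}
    try:
        ctx = json.loads(raw)
    except (json.JSONDecodeError, TypeError) as exc:
        # Malformed or non-string payload: log it and continue without context
        logger.warning("Could not parse response_content: %s", exc)
        return {}
    # Valid JSON that is not an object (e.g. a list) would break ctx.get(...) later
    return ctx if isinstance(ctx, dict) else {}
```

With this in place the prompt builder can call `ctx = parse_response_content(attrs)` and keep its existing `ctx.get("github", {})` lookup unchanged.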

Prompt for AI Agent

Review the code at the location below. A potential bug has been identified by an AI agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not valid. Location: sre-agent/src/sre_agent/main.py#L84 Potential issue: The code at `sre-agent/src/sre_agent/main.py:84` calls `json.loads(attrs["response_content"])` without any error handling. The `response_content` is fetched from the external BetterStack API. If this API returns a non-empty but invalid JSON string due to an API bug, network issue, or other edge case, the `json.loads()` call will raise an unhandled `json.JSONDecodeError`. This will crash the orchestrator, preventing it from triaging the incident.


sentry · 2026-04-14T12:43:45Z

+                f"**Run URL**: {gh['run_url']}",
+                f"**Workflow**: {gh.get('workflow', 'unknown')}",
+                f"**Commit**: {gh.get('sha', 'unknown')}",
+                f"**Job**: {gh.get('job', 'unknown')}",


Bug: The code incorrectly looks for the job key within the github sub-dictionary instead of the top-level context, causing the job status to always be 'unknown'.
Severity: MEDIUM

Suggested Fix

Modify the line to extract the job status from the correct location in the context dictionary. Change gh.get('job', 'unknown') to ctx.get('job', {}).get('status', 'unknown') to correctly access the nested status field.
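
A sketch of the prompt fragment with that change applied follows. It assumes the payload shape described in this review comment (a top-level `job` object with a `status` field next to `github`), which has not been verified against real BetterStack output; `github_run_lines` is a hypothetical function name.

```python
from typing import Any


def github_run_lines(ctx: dict[str, Any]) -> list[str]:
    """Build the 'Failed GitHub Actions Run' prompt lines from the parsed context."""
    gh = ctx.get("github") or {}
    if not gh.get("run_url"):
        return []
    # Assumption: job lives at the top level of ctx, as the review comment states
    job = ctx.get("job") or {}
    status = job.get("status", "unknown") if isinstance(job, dict) else str(job)
    return [
        "\n## Failed GitHub Actions Run",
        f"**Run URL**: {gh['run_url']}",
        f"**Workflow**: {gh.get('workflow', 'unknown')}",
        f"**Commit**: {gh.get('sha', 'unknown')}",
        f"**Job**: {status}",
    ]
```

The `isinstance` check keeps the function from raising if `job` turns out to be a plain string rather than an object.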

Prompt for AI Agent

Review the code at the location below. A potential bug has been identified by an AI agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not valid. Location: sre-agent/src/sre_agent/main.py#L92 Potential issue: The code attempts to extract job status using `gh.get('job', 'unknown')`. However, the `gh` dictionary only contains the `github` sub-dictionary from the API response. The `job` key actually exists at the top level of the response context. As a result, the code will always fail to find the job status, and the prompt sent to the agent will incorrectly state `**Job**: unknown`, even when the status is available. This deprives the agent of potentially critical diagnostic information.


Copilot

Pull request overview

Adds a new sre-agent/ subproject and a GitHub Actions workflow to automatically triage BetterStack incidents for the Python SDK using Anthropic Managed Agents (with GitHub MCP + web search), including a runbook skill and unit tests for incident filtering/prompt building.

Changes:

Introduce a standalone SRE incident-response orchestrator (sre_agent.main) plus one-time Anthropic resource setup script (sre_agent._setup).
Add a repo runbook skill (skills/sre-runbook/SKILL.md) and a GitHub Actions workflow to trigger triage via repository_dispatch or manual dispatch.
Add unit tests for incident relevance filtering and prompt construction; add a dedicated uv.lock for the subproject.

Reviewed changes

Copilot reviewed 9 out of 10 changed files in this pull request and generated 7 comments.


File	Description
`.github/workflows/sre-incident-response.yml`	New workflow to run the SRE agent on BetterStack dispatch / manual trigger.
`sre-agent/pyproject.toml`	Defines the standalone `sre-agent` Python project (deps, build, pytest config).
`sre-agent/uv.lock`	Lockfile for the new subproject’s dependency resolution.
`sre-agent/src/sre_agent/__init__.py`	Package initializer for the SRE agent module.
`sre-agent/src/sre_agent/__main__.py`	Enables `python -m sre_agent` execution entrypoint.
`sre-agent/src/sre_agent/_config.py`	Pydantic settings model for agent IDs, vault/environment IDs, BetterStack token, repo mount target.
`sre-agent/src/sre_agent/_setup.py`	One-time setup script to create Anthropic agent/environment/skill/vault resources.
`sre-agent/src/sre_agent/main.py`	Orchestrator: fetch/simulate incident, filter, build prompt, run Managed Agent session and stream output.
`sre-agent/skills/sre-runbook/SKILL.md`	Runbook/triage guidance provided to the agent as a skill file.
`sre-agent/tests/test_main.py`	Unit tests for `is_python_sdk_incident` and `build_prompt`.

+        print("Using simulated incident for testing.")
+        incident = SAMPLE_INCIDENT
+    elif incident_id:
+        settings = SREAgentSettings()  # type: ignore[call-arg]
+        incident = fetch_incident(incident_id, settings.betterstack_api_token.get_secret_value())
+    else:
+        print("No INCIDENT_ID provided and SIMULATE is not true. Exiting.")
+        sys.exit(0)
+
+    if not is_python_sdk_incident(incident):
+        print(f"Skipping non-Python-SDK incident: {incident.get('attributes', {}).get('name', 'unknown')}")
+        sys.exit(0)
+
+    settings = SREAgentSettings()  # type: ignore[call-arg]


+    req = urllib.request.Request(
+        f"https://uptime.betterstack.com/api/v2/incidents/{incident_id}",
+        headers={"Authorization": f"Bearer {token}"},
+    )
+    with urllib.request.urlopen(req) as resp:
+        return json.loads(resp.read())["data"]  # type: ignore[no-any-return]


+    if attrs.get("response_content"):
+        ctx = json.loads(attrs["response_content"])
+        gh = ctx.get("github", {})
+        if gh.get("run_url"):
+            parts.extend([
+                "\n## Failed GitHub Actions Run",
+                f"**Run URL**: {gh['run_url']}",
+                f"**Workflow**: {gh.get('workflow', 'unknown')}",
+                f"**Commit**: {gh.get('sha', 'unknown')}",
+                f"**Job**: {gh.get('job', 'unknown')}",
+            ])




+      - uses: actions/checkout@v4
+
+      - uses: astral-sh/setup-uv@v6


+
+      - name: Install dependencies
+        working-directory: sre-agent
+        run: uv sync


+
+### "Scheduled Testing" incidents (staging)
+- Cause: Unit, integration, or e2e tests failed against staging.
+- Runs every 6 hours via .github/workflows/_scheduled-test-hourly.yml.


codecov · 2026-04-14T13:02:32Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ All tests successful. No failed tests found.

❌ Your project check has failed because the head coverage (63.78%) is below the target coverage (70.00%). You can increase the head coverage or adjust the target coverage.

❗ There is a different number of reports uploaded between BASE (1b1b4b6) and HEAD (678bb69): HEAD has 10 fewer uploads than BASE.

Flag	BASE (1b1b4b6)	HEAD (678bb69)
	11	1

see 24 files with indirect coverage changes

neelay-aign requested review from a team and helmut-hoffer-von-ankershoffen as code owners April 14, 2026 12:39

neelay-aign added the skip:test:long_running Skip long-running tests (≥5min) label Apr 14, 2026

Copilot AI review requested due to automatic review settings April 14, 2026 12:39

Copilot started reviewing on behalf of neelay-aign April 14, 2026 12:40

sentry bot reviewed Apr 14, 2026


Copilot AI reviewed Apr 14, 2026


neelay-aign removed the skip:test:long_running Skip long-running tests (≥5min) label Apr 15, 2026

neelay-aign wants to merge 1 commit into main from feat/sre-incident-response-agent
