Add a background SRE agent that triages BetterStack incidents for the Python SDK and creates fix PRs via the GitHub MCP server.

Architecture: BetterStack webhook -> GitHub repository_dispatch -> GH Actions workflow -> Managed Agent session on Anthropic infra.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
    ]

    if attrs.get("response_content"):
        ctx = json.loads(attrs["response_content"])
Bug: The json.loads() call on response_content is not wrapped in a try-except block, which can cause a crash if the API returns malformed JSON.
Severity: HIGH
Suggested Fix
Wrap the json.loads(attrs["response_content"]) call in a try-except json.JSONDecodeError block. Log the error and gracefully handle the case where response_content cannot be parsed, for example, by proceeding with an empty context ctx = {}.
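A minimal sketch of this suggested fix, for illustration only. The helper name `parse_response_content` is hypothetical (the PR inlines this logic in `main.py`), and it assumes that proceeding with an empty context is acceptable when parsing fails:

```python
import json
import logging

logger = logging.getLogger(__name__)


def parse_response_content(attrs: dict) -> dict:
    """Return the parsed response_content, or {} if it is missing or unusable.

    Hypothetical helper; not part of the PR.
    """
    raw = attrs.get("response_content")
    if not raw:
        return {}
    try:
        ctx = json.loads(raw)
    except json.JSONDecodeError as exc:
        # Malformed JSON from the external API should not crash the orchestrator.
        logger.warning("Could not parse response_content: %s", exc)
        return {}
    # Valid JSON that is not an object (a list, a string) cannot be used as a context.
    return ctx if isinstance(ctx, dict) else {}
```

With a helper like this, the prompt builder could call `ctx = parse_response_content(attrs)` and continue with the later `ctx.get("github", {})` lookup unchanged.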
Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent.
Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not
valid.
Location: sre-agent/src/sre_agent/main.py#L84
Potential issue: The code at `sre-agent/src/sre_agent/main.py:84` calls
`json.loads(attrs["response_content"])` without any error handling. The
`response_content` is fetched from the external BetterStack API. If this API returns a
non-empty but invalid JSON string due to an API bug, network issue, or other edge case,
the `json.loads()` call will raise an unhandled `json.JSONDecodeError`. This will crash
the orchestrator, preventing it from triaging the incident.
    f"**Run URL**: {gh['run_url']}",
    f"**Workflow**: {gh.get('workflow', 'unknown')}",
    f"**Commit**: {gh.get('sha', 'unknown')}",
    f"**Job**: {gh.get('job', 'unknown')}",
Bug: The code incorrectly looks for the job key within the github sub-dictionary instead of the top-level context, causing the job status to always be 'unknown'.
Severity: MEDIUM
Suggested Fix
Modify the line to extract the job status from the correct location in the context dictionary. Change gh.get('job', 'unknown') to ctx.get('job', {}).get('status', 'unknown') to correctly access the nested status field.
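A sketch of the corrected lookup, assuming the context has the shape this comment describes: a top-level `job` object with a `status` field, next to the `github` object. The helper name `github_run_lines` and the sample values in the usage note are hypothetical:

```python
def github_run_lines(ctx: dict) -> list[str]:
    """Build the 'Failed GitHub Actions Run' prompt lines from a parsed context."""
    gh = ctx.get("github", {})
    if not gh.get("run_url"):
        return []
    # Assumption: job status lives at the top level of ctx, not inside ctx["github"].
    job = ctx.get("job")
    job_status = job.get("status", "unknown") if isinstance(job, dict) else "unknown"
    return [
        "\n## Failed GitHub Actions Run",
        f"**Run URL**: {gh['run_url']}",
        f"**Workflow**: {gh.get('workflow', 'unknown')}",
        f"**Commit**: {gh.get('sha', 'unknown')}",
        f"**Job**: {job_status}",
    ]
```

For example, a context of `{"github": {"run_url": "https://example.invalid/run/1"}, "job": {"status": "failure"}}` would produce a final line of `**Job**: failure`, while a context without `job` still falls back to `unknown`.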
Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent.
Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not
valid.
Location: sre-agent/src/sre_agent/main.py#L92
Potential issue: The code attempts to extract job status using `gh.get('job',
'unknown')`. However, the `gh` dictionary only contains the `github` sub-dictionary from
the API response. The `job` key actually exists at the top level of the response
context. As a result, the code will always fail to find the job status, and the prompt
sent to the agent will incorrectly state `**Job**: unknown`, even when the status is
available. This deprives the agent of potentially critical diagnostic information.
Pull request overview
Adds a new sre-agent/ subproject and a GitHub Actions workflow to automatically triage BetterStack incidents for the Python SDK using Anthropic Managed Agents (with GitHub MCP + web search), including a runbook skill and unit tests for incident filtering/prompt building.
Changes:
- Introduce a standalone SRE incident-response orchestrator (`sre_agent.main`) plus a one-time Anthropic resource setup script (`sre_agent._setup`).
- Add a repo runbook skill (`skills/sre-runbook/SKILL.md`) and a GitHub Actions workflow to trigger triage via `repository_dispatch` or manual dispatch.
- Add unit tests for incident relevance filtering and prompt construction; add a dedicated `uv.lock` for the subproject.
Reviewed changes
Copilot reviewed 9 out of 10 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| `.github/workflows/sre-incident-response.yml` | New workflow to run the SRE agent on BetterStack dispatch / manual trigger. |
| `sre-agent/pyproject.toml` | Defines the standalone sre-agent Python project (deps, build, pytest config). |
| `sre-agent/uv.lock` | Lockfile for the new subproject’s dependency resolution. |
| `sre-agent/src/sre_agent/__init__.py` | Package initializer for the SRE agent module. |
| `sre-agent/src/sre_agent/__main__.py` | Enables `python -m sre_agent` execution entrypoint. |
| `sre-agent/src/sre_agent/_config.py` | Pydantic settings model for agent IDs, vault/environment IDs, BetterStack token, repo mount target. |
| `sre-agent/src/sre_agent/_setup.py` | One-time setup script to create Anthropic agent/environment/skill/vault resources. |
| `sre-agent/src/sre_agent/main.py` | Orchestrator: fetch/simulate incident, filter, build prompt, run Managed Agent session and stream output. |
| `sre-agent/skills/sre-runbook/SKILL.md` | Runbook/triage guidance provided to the agent as a skill file. |
| `sre-agent/tests/test_main.py` | Unit tests for `is_python_sdk_incident` and `build_prompt`. |
        print("Using simulated incident for testing.")
        incident = SAMPLE_INCIDENT
    elif incident_id:
        settings = SREAgentSettings()  # type: ignore[call-arg]
        incident = fetch_incident(incident_id, settings.betterstack_api_token.get_secret_value())
    else:
        print("No INCIDENT_ID provided and SIMULATE is not true. Exiting.")
        sys.exit(0)

    if not is_python_sdk_incident(incident):
        print(f"Skipping non-Python-SDK incident: {incident.get('attributes', {}).get('name', 'unknown')}")
        sys.exit(0)

    settings = SREAgentSettings()  # type: ignore[call-arg]

    req = urllib.request.Request(
        f"https://uptime.betterstack.com/api/v2/incidents/{incident_id}",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["data"]  # type: ignore[no-any-return]

    if attrs.get("response_content"):
        ctx = json.loads(attrs["response_content"])
        gh = ctx.get("github", {})
        if gh.get("run_url"):
            parts.extend([
                "\n## Failed GitHub Actions Run",
                f"**Run URL**: {gh['run_url']}",
                f"**Workflow**: {gh.get('workflow', 'unknown')}",
                f"**Commit**: {gh.get('sha', 'unknown')}",
                f"**Job**: {gh.get('job', 'unknown')}",
            ])

    - uses: actions/checkout@v4

    - uses: astral-sh/setup-uv@v6

    - name: Install dependencies
      working-directory: sre-agent
      run: uv sync

### "Scheduled Testing" incidents (staging)
- Cause: Unit, integration, or e2e tests failed against staging.
- Runs every 6 hours via .github/workflows/_scheduled-test-hourly.yml.

Codecov Report
✅ All modified and coverable lines are covered by tests.
❌ Your project check has failed because the head coverage (63.78%) is below the target coverage (70.00%). You can increase the head coverage or adjust the target coverage.
Summary
repository_dispatch (zero separate infrastructure)

Architecture

Files
- `sre-agent/src/sre_agent/main.py`
- `sre-agent/src/sre_agent/_config.py`
- `sre-agent/src/sre_agent/_setup.py`
- `sre-agent/skills/sre-runbook/SKILL.md`
- `.github/workflows/sre-incident-response.yml`
- `sre-agent/tests/test_main.py`

Setup steps (post-merge)

1. Run one-time setup to create Anthropic resources
The PAT needs `contents:write`, `pull-requests:write`, and `issues:write` scopes on this repo. PRs will appear under the PAT owner's GitHub identity.

2. Store output as GitHub Actions secrets
The setup script prints three IDs. Add these as repo secrets:
- `SRE_AGENT_ID`
- `SRE_ENVIRONMENT_ID`
- `SRE_VAULT_ID`

Also add:
- `BETTERSTACK_API_TOKEN`: BetterStack API token (Settings → API tokens)

3. Configure BetterStack webhook
Set up a webhook integration in BetterStack that POSTs to:
- URL: `https://api.github.com/repos/aignostics/python-sdk/dispatches`
- Headers: `Authorization: token <GITHUB_PAT>`, `Accept: application/vnd.github+json`
- Body: `{ "event_type": "betterstack-incident", "client_payload": { "incident_id": "{{incident_id}}" } }`

Testing

Simulated incident (no external deps needed)

Real BetterStack incident

Simulated repository_dispatch (mimics BetterStack webhook)

    gh api repos/aignostics/python-sdk/dispatches \
      -f event_type=betterstack-incident \
      -f 'client_payload={"incident_id":"949981259"}'

Unit tests

Test plan
- `uv run pytest -v` in `sre-agent/`)
- `workflow_dispatch`
- `workflow_dispatch`

🤖 Generated with Claude Code
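
The webhook request described under "Configure BetterStack webhook" can also be built from Python, which may help when testing the dispatch path without BetterStack. This is a sketch only: the URL, headers, and body mirror the setup step, while `build_dispatch_request` and `DISPATCH_URL` are hypothetical names that do not appear in the PR.

```python
import json
import urllib.request

# URL taken from the webhook setup step in the PR description.
DISPATCH_URL = "https://api.github.com/repos/aignostics/python-sdk/dispatches"


def build_dispatch_request(incident_id: str, github_pat: str) -> urllib.request.Request:
    """Build (but do not send) the repository_dispatch POST for one incident.

    Hypothetical helper for illustration; not part of the PR.
    """
    payload = {
        "event_type": "betterstack-incident",
        # client_payload is sent as a JSON object, matching the webhook body.
        "client_payload": {"incident_id": incident_id},
    }
    return urllib.request.Request(
        DISPATCH_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"token {github_pat}",
            "Accept": "application/vnd.github+json",
        },
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen(...)` would send the event. Keeping construction separate from sending lets the payload be inspected in a unit test without network access.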