---
title: SRE Autopilot Environment
emoji: 🚨
colorFrom: red
colorTo: yellow
sdk: docker
app_port: 8000
base_path: /web
tags:
  - openenv
  - reinforcement-learning
  - sre
  - incident-response
---

# 🚨 SRE Autopilot: Production Incident Response

This environment simulates a microservices system under failure for the Meta-PyTorch Hackathon.

The agent acts as an on-call Site Reliability Engineer (SRE) receiving live, noisy telemetry across a dependency graph. Its goal is to diagnose the root cause and execute remediation actions to restore Service Level Agreement (SLA) compliance over a 30-step episode.


## 🧠 Why This Is an Elite-Tier RL Environment

Unlike typical LLM "agentic" benchmarks where the optimal policy can be deduced through pure zero-shot reasoning, this environment structurally demands a learned Reinforcement Learning policy.

Why naive LLMs fail here:

1. **Hidden failure modes:** The true failure mode (e.g., `db_connection_leak`, `flapping_network`) is hidden latent state. Only probing and observing the system over time reveals the true root cause.
2. **Delayed & compound effects:** Restarts cause an immediate 1-step downtime penalty but fix the long-term cascade. Scaling up takes 3 steps to take effect.
3. **Irreversible consequences:** Actions cost resources (restarts are budgeted). Trap actions temporarily improve metrics but do not fix the root cause.
4. **Stochastic observability:** Telemetry contains Gaussian noise, dropped spans, and lag.
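The stochastic observability above can be sketched with a simple noise model. This is illustrative only: the field names mirror the observation schema, but the noise and drop parameters are assumptions, not the environment's actual values.

```python
import random

def noisy_metrics(true_metrics, noise_pct=0.05, drop_prob=0.1):
    """Illustrative noise model (an assumption, not the environment's code):
    Gaussian noise on every metric, plus randomly dropped spans."""
    observed = {}
    for service, metrics in true_metrics.items():
        if random.random() < drop_prob:
            continue  # span dropped: this service reports nothing this step
        observed[service] = {
            name: max(0.0, random.gauss(value, value * noise_pct))
            for name, value in metrics.items()
        }
    return observed
```

Because no single reading can be trusted, an agent must aggregate observations across steps before acting.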

## 🚀 Tasks

The environment implements three difficulty tiers, each suited to continuous grading:

1. **Easy (`single_oom`):** A single service hits an out-of-memory fault with a clear error spike; one restart fixes it 80% of the time. Tests basic pattern recognition and categorical action execution.
2. **Medium (`db_connection_leak`):** A cascading failure: a DB connection leak slowly drives latency up, cascading to the API Gateway. Fixing the leaf nodes doesn't help; the agent must traverse the trace graph and restart the true root.
3. **Hard (`flapping_network`):** Intermittent flapping: the network on the payment gateway drops momentarily, so metrics recover and regress. The agent must learn this specific fingerprint through trial and error across episodes.

## 📊 State & Action Space

### Observation (`SREObservation`)

- `metrics`: Dictionary mapping service IDs to noisy telemetry (`latency_p99`, `error_rate`, `cpu_pct`, `memory_pct`).
- `dependency_graph`: Adjacency list of microservice traces.
- `action_history`: The last 5 actions taken.
- `sla_status`: Per-service boolean flags indicating whether each service currently meets the 200 ms latency / 1% error SLA.
- `reward_hint`: Dense reward signal for the current step.
- `reward_breakdown`: Structured breakdown (SLA compliance, cost penalty, downtime penalty, resolution bonus).
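A client can turn these fields into a triage signal. A minimal sketch, assuming `metrics` is shaped as described above; the severity formula is illustrative and not part of the environment:

```python
# SLA thresholds taken from the observation description above.
SLA_LATENCY_MS = 200.0
SLA_ERROR_RATE = 0.01

def worst_offenders(metrics):
    """Rank services by how badly they violate the SLA, worst first."""
    def severity(m):
        return (m.get("latency_p99", 0.0) / SLA_LATENCY_MS
                + m.get("error_rate", 0.0) / SLA_ERROR_RATE)
    return sorted(metrics, key=lambda s: severity(metrics[s]), reverse=True)

obs_metrics = {
    "api_gateway": {"latency_p99": 450.0, "error_rate": 0.02},
    "user_db": {"latency_p99": 900.0, "error_rate": 0.15},
}
print(worst_offenders(obs_metrics))  # → ['user_db', 'api_gateway']
```

Note that a ranking like this only surfaces symptoms; per the Medium tier above, the true root cause may sit upstream in `dependency_graph`.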

### Actions (`SREAction`)

- `restart`: Restarts a service. Costs 1 from the restart budget and induces one step of 100% error-rate downtime, but can fix faults.
- `scale_up`: Doubles capacity. Costs money and takes 3 steps to take effect.
- `circuit_break`: Opens the circuit breaker on a service. Stops fault propagation but forces 100% errors locally for 3 steps.
- `rollback`: Instant fix, but carries a very high cost penalty.
- `wait`: Passes 1 step, observing without acting.
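Assuming the `SREAction` schema shown in the Quick Start below (an `action` string plus a `service_id`), the five action types look roughly like this on the wire. These payloads are illustrative:

```python
# Hypothetical payloads, one per action type; trade-off comments
# restate the action descriptions above.
actions = [
    {"action": "restart", "service_id": "user_db"},         # budget -1, 1-step downtime
    {"action": "scale_up", "service_id": "order_service"},  # effect lands after 3 steps
    {"action": "circuit_break", "service_id": "payment_gateway"},  # 100% local errors, 3 steps
    {"action": "rollback", "service_id": "api_gateway"},    # instant but very costly
    {"action": "wait", "service_id": None},                 # observe only
]
```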

### Services

`api_gateway`, `auth_service`, `product_service`, `order_service`, `user_db`, `inventory_db`, `payment_gateway`


## 💰 Evaluation & Reward Function

To comply with the Phase 2 deep-validation checks, episodic rewards are reported as an overall evaluation score (bounded strictly between 0.01 and 0.99) rather than as raw, unbounded per-step feedback:

- **Intermediate steps:** A reward of 0.0 is emitted for every non-terminal step.
- **Final score (0.01–0.99):** On the terminal step (`done=True`), the environment automatically invokes `grader.py`. The terminal reward combines SLA compliance, step efficiency, budget spent, and tier-specific root-cause resolution bonuses, and always falls within the Phase 2-valid range.
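A client-side sketch of consuming this reward scheme, assuming a `result` object that exposes `.reward` and `.done` as in the Quick Start below; the loop itself is illustrative:

```python
def run_episode(env, policy, task_tier="easy", max_steps=30):
    """Run one episode and return its score: intermediate steps return 0.0,
    and only the terminal step (done=True) carries the bounded grader score."""
    result = env.reset(task_tier=task_tier)
    for _ in range(max_steps):
        result = env.step(policy(result))
        if result.done:
            break
    assert 0.01 <= result.reward <= 0.99  # bounded terminal score
    return result.reward
```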

## 🛠 Usage

### Environment Variables (Required for Inference)

Before running `inference.py`, set the following environment variables:

```bash
export API_BASE_URL="https://api.groq.com/openai/v1" # Free Groq API (OpenAI-compatible)
export MODEL_NAME="llama-3.1-8b-instant"             # Groq's specific model string
export HF_TOKEN="gsk_your_groq_api_key_here"         # Paste your Groq API key (variable name kept for compatibility)
```
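A minimal sketch of collecting these variables on the Python side. The fallback defaults are just the documented example values; treating them as defaults is an assumption, since `inference.py` may read the variables differently:

```python
import os

def load_llm_config(env=None):
    """Gather the three required variables into one dict.
    Note: HF_TOKEN carries the Groq key; the name is kept for compatibility."""
    env = os.environ if env is None else env
    return {
        "base_url": env.get("API_BASE_URL", "https://api.groq.com/openai/v1"),
        "model": env.get("MODEL_NAME", "llama-3.1-8b-instant"),
        "api_key": env.get("HF_TOKEN", ""),
    }
```

The resulting dict maps directly onto an OpenAI-compatible client constructor (`base_url`, `api_key`) plus the model string passed per request.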

### Quick Start — Python Client

```python
from client import SREEnv
from models import SREAction

env = SREEnv(base_url="http://localhost:8000")

# Test the easiest failure tier
result = env.reset(task_tier="easy")

# Restart the suspected root cause
action = SREAction(action="restart", service_id="api_gateway")
result = env.step(action)

print(f"Reward: {result.reward}")
```
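Extending the Quick Start, a trivial greedy policy over the observed metrics might look like this. It is a sketch: the 1% threshold comes from the SLA above, and the returned tuple must be wrapped in `SREAction` before calling `env.step`:

```python
def choose_action(metrics, error_sla=0.01):
    """Pick (action, service_id): restart the worst offender if any
    service breaches the error SLA, otherwise wait and observe."""
    worst = max(metrics, key=lambda s: metrics[s].get("error_rate", 0.0))
    if metrics[worst].get("error_rate", 0.0) > error_sla:
        return ("restart", worst)
    return ("wait", None)
```

Usage: `act, svc = choose_action(result.metrics); env.step(SREAction(action=act, service_id=svc))`. A policy this myopic will fall into the Medium tier's trap of restarting symptomatic leaf services, which is exactly the behavior a learned policy must improve on.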

### Running Inference (Submission Script)

The inference script is the required submission entry point. It uses the OpenAI client:

```bash
# 1. Start the environment server
uvicorn server.app:app --host 0.0.0.0 --port 8000

# 2. In another terminal, run inference
python inference.py
```

This will run all 3 task tiers and output rewards + grader scores in the 0.0–1.0 range.

### Running the Baseline Agent

The baseline uses a local Qwen model for experimentation (not required for submission):

```bash
python baseline.py
```

## 🐳 Docker Deployment

```bash
docker build -t sre-env:latest .
docker run -p 8000:8000 sre-env:latest
```

## 📡 API Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/health` | GET | Health check (returns 200) |
| `/reset` | POST | Reset the environment (accepts `{"task_tier": "easy\|medium\|hard"}`) |
| `/step` | POST | Take an action (accepts `{"action": "...", "service_id": "..."}`) |
| `/state` | GET | Get the current hidden state |
| `/tasks` | GET | List available tasks and the action schema |
| `/grader?tier=easy` | GET | Run the heuristic agent and grade it (returns a score in 0.0–1.0) |
| `/baseline` | GET | Run the baseline on all 3 tiers and return scores |
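The same endpoints can be exercised without the Python client, e.g. with the standard library. A sketch; the request bodies mirror the schemas in the table above:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"

def post_json(path, payload):
    """POST a JSON body to the environment server and decode the reply."""
    req = urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Bodies mirror the /reset and /step schemas above:
reset_body = {"task_tier": "easy"}
step_body = {"action": "restart", "service_id": "user_db"}
# With the server running:
#   post_json("/reset", reset_body)
#   post_json("/step", step_body)
```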

## ⚙️ Infra Requirements

- **Runtime:** Inference must complete in under 20 minutes.
- **Resources:** Must run on `vcpu=2`, `memory=8GB`.
- **LLM calls:** Must use the OpenAI client via `API_BASE_URL`, `MODEL_NAME`, `HF_TOKEN`.
