ops

POLLN Operations

Complete operational readiness for POLLN Colony production deployment

Overview

This directory contains all operational procedures, monitoring configurations, and disaster recovery documentation required to operate POLLN Colony in production.

Status: ✅ PRODUCTION READY Last Updated: 2026-03-08 Version: 1.0

Directory Structure

ops/
├── runbooks/                    # Incident response runbooks
│   ├── 01-colony-crash.md
│   ├── 02-dependency-failure.md
│   ├── 03-data-corruption.md
│   ├── 04-performance-degradation.md
│   └── 05-disaster-recovery.md
│
├── monitoring/                  # Monitoring configurations
│   ├── prometheus-rules.yaml    # Alert rules
│   └── grafana-dashboard.json   # Operations dashboard
│
├── alerting/                    # Alerting policies
│   └── escalation-policy.md     # On-call & escalation
│
├── slos/                        # Service Level Objectives
│   └── service-level-objectives.md
│
├── tests/                       # Operational validation tests
│   ├── dr-validation.test.ts
│   └── operational-readiness.test.ts
│
├── disaster-recovery.md         # Complete DR plan
├── backup-restore-procedures.md # Backup & restore procedures
└── README.md                    # This file

Quick Reference

Critical Contacts

Role	Contact	Hours
On-Call Engineer	[email protected]	24/7
On-Call Manager	[email protected]	24/7
SRE Lead	[email protected]	Business hours
Engineering Director	[email protected]	Business hours
CTO	[email protected]	Emergency

Emergency Procedures

Colony Down
- Runbook: 01-colony-crash.md
- Escalation: P0
- Target MTTR: < 15 min
Data Corruption
- Runbook: 03-data-corruption.md
- Escalation: P0
- Target MTTR: < 60 min
Disaster Recovery
- Runbook: 05-disaster-recovery.md
- Escalation: P0
- Target MTTR: < 60 min

Key Metrics

Metric	Target	Current	Alert
Availability	99.9%	99.95%	✅
P99 Latency	< 1s	500ms	✅
Error Budget	0.1%	0.05% used	✅
RTO	< 60 min	45 min	✅
RPO	< 5 min	2 min	✅

Runbooks

Incident response procedures for common operational issues.

Available Runbooks

Colony Crash
- Colony process not responding
- All agents showing unhealthy
- Health checks failing
Dependency Failure
- LMCache backend down
- Federated learning coordinator down
- World model service down
- Meadow disconnected
Data Corruption
- Agent state corruption
- Synapse weight corruption
- World model corruption
- KV-anchor corruption
Performance Degradation
- High latency
- Memory leak
- Agent explosion
- Dream cycle saturation
Disaster Recovery
- Complete region failure
- Catastrophic data corruption
- Security breach
- Natural disaster

Using Runbooks

Identify the incident type
Follow the diagnosis steps
Implement the resolution
Verify the fix
Complete post-incident actions

Monitoring

Prometheus Alerting

Location: monitoring/prometheus-rules.yaml

Alert Groups:

colony_health - Colony health and agent status
performance - Latency, memory, CPU
kv_cache - KV-cache performance
federated - Federated learning sync
world_model - World model and dreaming
data_integrity - Data corruption detection
security - Security incidents
sla - SLA compliance

Deploy Alert Rules:

kubectl apply -f ops/monitoring/prometheus-rules.yaml

Grafana Dashboard

Location: monitoring/grafana-dashboard.json

Panels:

Colony health overview
Active agents
Decision latency (P99)
Memory usage
CPU usage
Dream cycle status
KV-cache performance
Federated learning
World model
Synapse health
SLA status

Import Dashboard:

# Via Grafana UI
1. Go to Dashboards → Import
2. Upload ops/monitoring/grafana-dashboard.json
3. Select Prometheus datasource
4. Click Import

Alerting

Escalation Policy

Location: alerting/escalation-policy.md

Severity Levels:

Severity	Name	Response Time	Example
P0	Critical	< 5 min	Colony down, data corruption
P1	High	< 15 min	Dependency failure
P2	Medium	< 30 min	Single feature broken
P3	Low	< 1 hour	Performance below baseline

Escalation Path:

P0: on-call → manager → director → CTO → CEO
P1: on-call → manager → director → CTO
P2: on-call → manager → director
P3: on-call → manager

On-Call Rotation

Primary: 24/7 coverage, weekly rotation
Secondary: Business hours, shadow primary
Handoff: Monday 9:00 AM UTC

Notification Channels

P0: Phone, SMS, Slack (@here), Email
P1: SMS, Slack (@on-call), Email
P2: Slack (@on-call), Email
P3: Slack (@on-call), Email digest

Service Level Objectives

Location: slos/service-level-objectives.md

Core SLOs

SLO	Target	Status
Availability	99.9%	✅ 99.95%
P99 Latency	< 1000ms	✅ 500ms
Throughput	> 1000 RPS	✅ 1200 RPS
Durability	99.9999%	✅ 99.9999%
Freshness	< 1 hour	✅ 30 min
Correctness	> 95%	✅ 98%

Error Budget

Target: 99.9% availability
Error Budget: 0.1% (43.2 minutes/month)
Current Usage: 0.05% (21.6 minutes remaining)
Status: ✅ Healthy

Disaster Recovery

DR Plan

Location: disaster-recovery.md

Recovery Objectives:

RTO: < 60 minutes (actual: 45 min)
RPO: < 5 minutes (actual: 2 min)

DR Architecture:

Primary Region: us-east-1
DR Region: us-west-2
Replication: Continuous (S3), Real-time (RDS)
Testing: Quarterly

Backup Procedures

Location: backup-restore-procedures.md

Backup Types:

Continuous: WAL logs, 24h retention
Frequent: Every 5 minutes, 30d retention
Daily: Daily snapshots, 90d retention
Weekly: Weekly archives, 1y retention

Backup Components:

Colony state (S3 versioning)
Agent topology (DB dump)
Synapse weights (DB dump)
World model (S3 versioning)
KV anchors (S3 versioning)
Federated state (DB dump)

Testing

Operational Validation

Run operational readiness tests:

# DR validation
npm run test:ops:dr

# Operational readiness
npm run test:ops:readiness

# All ops tests
npm run test:ops

DR Testing Schedule

Weekly: Backup verification
Monthly: Tabletop exercise
Quarterly: Full DR drill
Annually: Complete review

Common Procedures

Creating Emergency Backup

kubectl exec -it deploy/polln-colony -n production \
  -- npm run cli backup --emergency

Restoring from Backup

kubectl exec -it deploy/polln-colony -n production \
  -- npm run cli restore --full --backup=latest

Validating Colony Health

kubectl exec -it deploy/polln-colony -n production \
  -- npm run cli validate --full

Checking Agent Status

kubectl exec -it deploy/polln-colony -n production \
  -- npm run cli agents --status

Viewing Metrics

kubectl exec -it deploy/polln-colony -n production \
  -- npm run cli metrics --live

Incident Response

Declaring an Incident

P0/P1 Incidents: Create incident channel immediately
P2 Incidents: Create incident channel within 15 min
P3 Incidents: Track in issue tracker

Incident Channel Template

# Incident #XXX - [Brief Description]

**Severity**: P0/P1/P2/P3
**Status**: Investigating / Identified / Monitoring / Resolved
**IC**: @username
**Started**: [Timestamp]

## Timeline
- [Timestamp] - Incident detected
- [Timestamp] - Status update

## Actions Taken
- [Action item]

## Next Steps
- [ ] [Action]

Post-Incident Actions

Complete incident document
Schedule postmortem (within 5 days)
Create action items
Update runbooks
Present to stakeholders

Maintenance

Daily Operations

Review alerts (morning)
Check backup status
Review error budget
Monitor system health

Weekly Operations

Review SLO performance
Check DR readiness
Review runup logs
Update metrics

Monthly Operations

SLO review meeting
On-call rotation review
Runbook updates
Training session

Quarterly Operations

DR drill
SLO target review
Architecture review
Process improvements

Documentation Updates

When to Update

After every incident
Quarterly reviews
Architecture changes
Process improvements

Update Process

Edit relevant runbook/procedure
Update version number
Add changelog entry
Review with SRE team
Merge to main branch

Support

Getting Help

Urgent Issues: Contact on-call (24/7)
Non-Urgent: Create GitHub issue
Questions: Slack #polln-ops
Documentation: See relevant runbook

Contributing

To improve operational procedures:

Edit the relevant document
Test changes in staging
Submit PR with clear description
Request review from SRE team
Participate in review process

Status

Overall Status: ✅ PRODUCTION READY

Readiness Checklist:

Next Review: 2026-06-08

Maintained By: SRE Team Contact: [email protected] Last Updated: 2026-03-08

Name		Name	Last commit message	Last commit date
parent directory ..
alerting		alerting
monitoring		monitoring
runbooks		runbooks
slos		slos
tests		tests
README.md		README.md
backup-restore-procedures.md		backup-restore-procedures.md
disaster-recovery.md		disaster-recovery.md

FilesExpand file tree

ops

Directory actions

More options