Skip to content

Latest commit

 

History

History

README.md

POLLN Operations

Complete operational readiness for POLLN Colony production deployment


Overview

This directory contains all operational procedures, monitoring configurations, and disaster recovery documentation required to operate POLLN Colony in production.

Status: ✅ PRODUCTION READY Last Updated: 2026-03-08 Version: 1.0


Directory Structure

ops/
├── runbooks/                    # Incident response runbooks
│   ├── 01-colony-crash.md
│   ├── 02-dependency-failure.md
│   ├── 03-data-corruption.md
│   ├── 04-performance-degradation.md
│   └── 05-disaster-recovery.md
│
├── monitoring/                  # Monitoring configurations
│   ├── prometheus-rules.yaml    # Alert rules
│   └── grafana-dashboard.json   # Operations dashboard
│
├── alerting/                    # Alerting policies
│   └── escalation-policy.md     # On-call & escalation
│
├── slos/                        # Service Level Objectives
│   └── service-level-objectives.md
│
├── tests/                       # Operational validation tests
│   ├── dr-validation.test.ts
│   └── operational-readiness.test.ts
│
├── disaster-recovery.md         # Complete DR plan
├── backup-restore-procedures.md # Backup & restore procedures
└── README.md                    # This file

Quick Reference

Critical Contacts

Role Contact Hours
On-Call Engineer [email protected] 24/7
On-Call Manager [email protected] 24/7
SRE Lead [email protected] Business hours
Engineering Director [email protected] Business hours
CTO [email protected] Emergency

Emergency Procedures

  1. Colony Down

  2. Data Corruption

  3. Disaster Recovery

Key Metrics

Metric Target Current Alert
Availability 99.9% 99.95%
P99 Latency < 1s 500ms
Error Budget 0.1% 0.05% used
RTO < 60 min 45 min
RPO < 5 min 2 min

Runbooks

Incident response procedures for common operational issues.

Available Runbooks

  1. Colony Crash

    • Colony process not responding
    • All agents showing unhealthy
    • Health checks failing
  2. Dependency Failure

    • LMCache backend down
    • Federated learning coordinator down
    • World model service down
    • Meadow disconnected
  3. Data Corruption

    • Agent state corruption
    • Synapse weight corruption
    • World model corruption
    • KV-anchor corruption
  4. Performance Degradation

    • High latency
    • Memory leak
    • Agent explosion
    • Dream cycle saturation
  5. Disaster Recovery

    • Complete region failure
    • Catastrophic data corruption
    • Security breach
    • Natural disaster

Using Runbooks

  1. Identify the incident type
  2. Follow the diagnosis steps
  3. Implement the resolution
  4. Verify the fix
  5. Complete post-incident actions

Monitoring

Prometheus Alerting

Location: monitoring/prometheus-rules.yaml

Alert Groups:

  • colony_health - Colony health and agent status
  • performance - Latency, memory, CPU
  • kv_cache - KV-cache performance
  • federated - Federated learning sync
  • world_model - World model and dreaming
  • data_integrity - Data corruption detection
  • security - Security incidents
  • sla - SLA compliance

Deploy Alert Rules:

kubectl apply -f ops/monitoring/prometheus-rules.yaml

Grafana Dashboard

Location: monitoring/grafana-dashboard.json

Panels:

  • Colony health overview
  • Active agents
  • Decision latency (P99)
  • Memory usage
  • CPU usage
  • Dream cycle status
  • KV-cache performance
  • Federated learning
  • World model
  • Synapse health
  • SLA status

Import Dashboard:

# Via Grafana UI
1. Go to Dashboards → Import
2. Upload ops/monitoring/grafana-dashboard.json
3. Select Prometheus datasource
4. Click Import

Alerting

Escalation Policy

Location: alerting/escalation-policy.md

Severity Levels:

Severity Name Response Time Example
P0 Critical < 5 min Colony down, data corruption
P1 High < 15 min Dependency failure
P2 Medium < 30 min Single feature broken
P3 Low < 1 hour Performance below baseline

Escalation Path:

P0: on-call → manager → director → CTO → CEO
P1: on-call → manager → director → CTO
P2: on-call → manager → director
P3: on-call → manager

On-Call Rotation

  • Primary: 24/7 coverage, weekly rotation
  • Secondary: Business hours, shadow primary
  • Handoff: Monday 9:00 AM UTC

Notification Channels

  • P0: Phone, SMS, Slack (@here), Email
  • P1: SMS, Slack (@on-call), Email
  • P2: Slack (@on-call), Email
  • P3: Slack (@on-call), Email digest

Service Level Objectives

Location: slos/service-level-objectives.md

Core SLOs

SLO Target Status
Availability 99.9% ✅ 99.95%
P99 Latency < 1000ms ✅ 500ms
Throughput > 1000 RPS ✅ 1200 RPS
Durability 99.9999% ✅ 99.9999%
Freshness < 1 hour ✅ 30 min
Correctness > 95% ✅ 98%

Error Budget

  • Target: 99.9% availability
  • Error Budget: 0.1% (43.2 minutes/month)
  • Current Usage: 0.05% (21.6 minutes remaining)
  • Status: ✅ Healthy

Disaster Recovery

DR Plan

Location: disaster-recovery.md

Recovery Objectives:

  • RTO: < 60 minutes (actual: 45 min)
  • RPO: < 5 minutes (actual: 2 min)

DR Architecture:

  • Primary Region: us-east-1
  • DR Region: us-west-2
  • Replication: Continuous (S3), Real-time (RDS)
  • Testing: Quarterly

Backup Procedures

Location: backup-restore-procedures.md

Backup Types:

  • Continuous: WAL logs, 24h retention
  • Frequent: Every 5 minutes, 30d retention
  • Daily: Daily snapshots, 90d retention
  • Weekly: Weekly archives, 1y retention

Backup Components:

  • Colony state (S3 versioning)
  • Agent topology (DB dump)
  • Synapse weights (DB dump)
  • World model (S3 versioning)
  • KV anchors (S3 versioning)
  • Federated state (DB dump)

Testing

Operational Validation

Run operational readiness tests:

# DR validation
npm run test:ops:dr

# Operational readiness
npm run test:ops:readiness

# All ops tests
npm run test:ops

DR Testing Schedule

  • Weekly: Backup verification
  • Monthly: Tabletop exercise
  • Quarterly: Full DR drill
  • Annually: Complete review

Common Procedures

Creating Emergency Backup

kubectl exec -it deploy/polln-colony -n production \
  -- npm run cli backup --emergency

Restoring from Backup

kubectl exec -it deploy/polln-colony -n production \
  -- npm run cli restore --full --backup=latest

Validating Colony Health

kubectl exec -it deploy/polln-colony -n production \
  -- npm run cli validate --full

Checking Agent Status

kubectl exec -it deploy/polln-colony -n production \
  -- npm run cli agents --status

Viewing Metrics

kubectl exec -it deploy/polln-colony -n production \
  -- npm run cli metrics --live

Incident Response

Declaring an Incident

  1. P0/P1 Incidents: Create incident channel immediately
  2. P2 Incidents: Create incident channel within 15 min
  3. P3 Incidents: Track in issue tracker

Incident Channel Template

# Incident #XXX - [Brief Description]

**Severity**: P0/P1/P2/P3
**Status**: Investigating / Identified / Monitoring / Resolved
**IC**: @username
**Started**: [Timestamp]

## Timeline
- [Timestamp] - Incident detected
- [Timestamp] - Status update

## Actions Taken
- [Action item]

## Next Steps
- [ ] [Action]

Post-Incident Actions

  1. Complete incident document
  2. Schedule postmortem (within 5 days)
  3. Create action items
  4. Update runbooks
  5. Present to stakeholders

Maintenance

Daily Operations

  • Review alerts (morning)
  • Check backup status
  • Review error budget
  • Monitor system health

Weekly Operations

  • Review SLO performance
  • Check DR readiness
  • Review runup logs
  • Update metrics

Monthly Operations

  • SLO review meeting
  • On-call rotation review
  • Runbook updates
  • Training session

Quarterly Operations

  • DR drill
  • SLO target review
  • Architecture review
  • Process improvements

Documentation Updates

When to Update

  • After every incident
  • Quarterly reviews
  • Architecture changes
  • Process improvements

Update Process

  1. Edit relevant runbook/procedure
  2. Update version number
  3. Add changelog entry
  4. Review with SRE team
  5. Merge to main branch

Related Documentation


Support

Getting Help

  1. Urgent Issues: Contact on-call (24/7)
  2. Non-Urgent: Create GitHub issue
  3. Questions: Slack #polln-ops
  4. Documentation: See relevant runbook

Contributing

To improve operational procedures:

  1. Edit the relevant document
  2. Test changes in staging
  3. Submit PR with clear description
  4. Request review from SRE team
  5. Participate in review process

Status

Overall Status: ✅ PRODUCTION READY

Readiness Checklist:

  • Runbooks complete (5 runbooks)
  • Monitoring configured (Prometheus + Grafana)
  • Alerting configured (4 severity levels)
  • SLOs defined (6 core SLOs)
  • DR plan complete (RTO < 60min, RPO < 5min)
  • Backup procedures documented
  • Restore procedures documented
  • Validation tests written
  • Escalation policies defined
  • On-call procedures established

Next Review: 2026-06-08


Maintained By: SRE Team Contact: [email protected] Last Updated: 2026-03-08