operations.md

Operations Guide

1. Observability Stack Startup

docker compose -f docker-compose.observability.yml up -d

Access endpoints:

Prometheus: http://localhost:9090
Loki: http://localhost:3100
Tempo: http://localhost:3200
Alertmanager: http://localhost:9093

2. Core Alerts

HighHttpP95Latency: p95 > 1.5s for 5m
IngestionFailureRateHigh: ingestion failed ratio > 5%

3. Queue Backend Modes

Redis Stream: APP_INGESTION_QUEUE_BACKEND=redis_stream
RabbitMQ: APP_INGESTION_QUEUE_BACKEND=rabbitmq
DB polling fallback: APP_INGESTION_QUEUE_BACKEND=db_polling

Terminal failures enter DLQ stream/queue.

4. Log Shipping

Application log file: logs/knowledgeops-agent.log
Promtail scrapes logs/*.log and pushes to Loki
Trace and request correlation fields: trace_id, request_id, chat_id

5. Nightly Regression

python3 scripts/generate_eval_dataset.py
python3 scripts/generate_eval_predictions.py
python3 scripts/run_regression.py --dataset evaluation/dataset.large.json --predictions evaluation/predictions.generated.json --threshold 0.75

6. Performance Validation

k6 run performance/k6/chat_ingestion_load.js -e BASE_URL=http://localhost:8080
k6 run performance/k6/distributed_chat_ingestion.js -e BASE_URL=http://localhost:8080
python3 performance/k6/generate_report.py --summary reports/performance/distributed-k6-summary.json

7. Incident Triage Playbook

Verify app health endpoint and dependency availability.
Check ingestion queue lag and failed jobs.
Correlate logs by trace_id.
Review p95 latency and error spikes in Prometheus.
Trigger fallback or rollback if SLA continues to degrade.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Operations Guide

1. Observability Stack Startup

2. Core Alerts

3. Queue Backend Modes

4. Log Shipping

5. Nightly Regression

6. Performance Validation

7. Incident Triage Playbook

FilesExpand file tree

operations.md

Latest commit

History

operations.md

File metadata and controls

Operations Guide

1. Observability Stack Startup

2. Core Alerts

3. Queue Backend Modes

4. Log Shipping

5. Nightly Regression

6. Performance Validation

7. Incident Triage Playbook