docker compose -f docker-compose.observability.yml up -d

Access endpoints:
- Prometheus: http://localhost:9090
- Loki: http://localhost:3100
- Tempo: http://localhost:3200
- Alertmanager: http://localhost:9093
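
Once the stack is up, a quick probe can confirm that each service answers. The sketch below is illustrative only: the base URLs come from the list above, while the readiness paths (`/-/ready` for Prometheus and Alertmanager, `/ready` for Loki and Tempo) are assumed to be exposed unchanged by this deployment, and the helper names `is_ready` and `check_all` are hypothetical.

```python
import urllib.error
import urllib.request

# Base URLs come from the endpoint list above. The readiness paths are an
# assumption about this deployment (default paths of each project).
ENDPOINTS = {
    "Prometheus": "http://localhost:9090/-/ready",
    "Loki": "http://localhost:3100/ready",
    "Tempo": "http://localhost:3200/ready",
    "Alertmanager": "http://localhost:9093/-/ready",
}


def is_ready(url: str, timeout: float = 2.0) -> bool:
    """Return True if the URL answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status all count as not ready.
        return False


def check_all(endpoints=ENDPOINTS, probe=is_ready) -> dict:
    """Probe every endpoint and map its name to a readiness flag."""
    return {name: probe(url) for name, url in endpoints.items()}
```

Calling `check_all()` returns a dictionary such as `{"Prometheus": True, ...}`; a service that is still starting simply reports `False`.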
- HighHttpP95Latency: p95 > 1.5s for 5m
- IngestionFailureRateHigh: ingestion failed ratio > 5%
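
To make the two alert conditions concrete, here is a minimal Python sketch of the checks they describe. It is not the actual alert rule definition. It assumes one p95 sample per minute (so five consecutive samples stand in for "for 5m") and that the failed ratio means failed jobs divided by total jobs; the function names are hypothetical.

```python
from typing import Sequence

P95_THRESHOLD_SECONDS = 1.5     # from HighHttpP95Latency
SUSTAINED_SAMPLES = 5           # assumption: one p95 sample per minute, so 5 samples = 5m
FAILURE_RATIO_THRESHOLD = 0.05  # from IngestionFailureRateHigh (5%)


def high_http_p95_latency(p95_samples: Sequence[float]) -> bool:
    """True when the five most recent p95 samples all exceed 1.5s."""
    recent = p95_samples[-SUSTAINED_SAMPLES:]
    return len(recent) == SUSTAINED_SAMPLES and all(
        sample > P95_THRESHOLD_SECONDS for sample in recent
    )


def ingestion_failure_rate_high(failed: int, total: int) -> bool:
    """True when failed / total exceeds 5%; no jobs at all means no alert."""
    if total == 0:
        return False
    return failed / total > FAILURE_RATIO_THRESHOLD
```

A single slow minute does not fire the latency alert under this reading: every sample in the window has to be above the threshold.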
- Redis Stream: APP_INGESTION_QUEUE_BACKEND=redis_stream
- RabbitMQ: APP_INGESTION_QUEUE_BACKEND=rabbitmq
- DB polling fallback: APP_INGESTION_QUEUE_BACKEND=db_polling

Jobs that fail terminally are sent to the DLQ (dead-letter) stream or queue.
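
As an illustration of how the backend setting might be read and validated, consider the sketch below. The variable name and its three values come from the list above; everything else is an assumption, including the helper name `resolve_queue_backend` and the choice to treat an unset variable as `db_polling`. The application's real default may differ.

```python
import os

# The three values listed above; anything else is rejected.
SUPPORTED_BACKENDS = ("redis_stream", "rabbitmq", "db_polling")


def resolve_queue_backend(environ=None) -> str:
    """Read APP_INGESTION_QUEUE_BACKEND and validate it.

    Assumption: an unset variable falls back to db_polling. The real
    application's default may differ.
    """
    environ = os.environ if environ is None else environ
    value = environ.get("APP_INGESTION_QUEUE_BACKEND", "db_polling").strip().lower()
    if value not in SUPPORTED_BACKENDS:
        raise ValueError(
            f"Unsupported APP_INGESTION_QUEUE_BACKEND={value!r}; "
            f"expected one of {', '.join(SUPPORTED_BACKENDS)}"
        )
    return value
```

Failing fast on an unknown value surfaces a typo in the environment at startup rather than as jobs that are silently never consumed.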
- Application log file: logs/knowledgeops-agent.log
- Promtail scrapes logs/*.log and pushes to Loki
- Trace and request correlation fields: trace_id, request_id, chat_id
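
The correlation fields are only useful if every log line carries them. The following sketch shows one way to attach them with Python's standard `logging` module. It assumes a JSON-lines log format, which the text above does not confirm, and the formatter class and logger name are hypothetical.

```python
import json
import logging

# Field names taken from the list above.
CORRELATION_FIELDS = ("trace_id", "request_id", "chat_id")


class JsonCorrelationFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Values passed via `extra=` become attributes on the record;
        # missing fields are emitted as null so the schema stays stable.
        for field in CORRELATION_FIELDS:
            payload[field] = getattr(record, field, None)
        return json.dumps(payload)


def build_logger(handler: logging.Handler) -> logging.Logger:
    """Attach the JSON formatter to a handler and return a ready logger."""
    handler.setFormatter(JsonCorrelationFormatter())
    logger = logging.getLogger("knowledgeops.example")  # hypothetical logger name
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

For example, `build_logger(logging.FileHandler("logs/knowledgeops-agent.log"))` followed by `logger.info("ingestion started", extra={"trace_id": "t-1", "request_id": "r-1", "chat_id": "c-1"})` writes a single line that Promtail can ship as-is.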
python3 scripts/generate_eval_dataset.py
python3 scripts/generate_eval_predictions.py
python3 scripts/run_regression.py --dataset evaluation/dataset.large.json --predictions evaluation/predictions.generated.json --threshold 0.75

k6 run performance/k6/chat_ingestion_load.js -e BASE_URL=http://localhost:8080
k6 run performance/k6/distributed_chat_ingestion.js -e BASE_URL=http://localhost:8080
python3 performance/k6/generate_report.py --summary reports/performance/distributed-k6-summary.json

- Verify app health endpoint and dependency availability.
- Check ingestion queue lag and failed jobs.
- Correlate logs by trace_id.
- Review p95 latency and error spikes in Prometheus.
- Trigger fallback or rollback if SLA continues to degrade.
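
For the log-correlation step in the checklist above, a small local filter can be handy when Loki itself is unavailable. This is a sketch under the assumption that the application log is written as one JSON object per line with a `trace_id` key; the function name is hypothetical and the actual log format may differ.

```python
import json
from typing import Iterable, Iterator


def filter_by_trace_id(lines: Iterable[str], trace_id: str) -> Iterator[dict]:
    """Yield parsed log entries whose trace_id matches; skip non-JSON lines."""
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # plain-text or truncated lines are ignored
        if isinstance(entry, dict) and entry.get("trace_id") == trace_id:
            yield entry
```

Passing an open file handle for logs/knowledgeops-agent.log as `lines` streams the file line by line, so the filter also works on large logs.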