Full-stack observability platform combining Prometheus, Grafana, the ELK Stack, and a custom ML anomaly detection engine, shifting operations from reactive firefighting to predictive incident management.

    ┌──────────────────────────────────────────────────────────────────────────┐
    │                          Observability Platform                          │
    │                                                                          │
    │  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐                │
    │  │  Your App    │    │  Linux Host  │    │  Containers  │                │
    │  │  (any stack) │    │              │    │  (Docker)    │                │
    │  └──────┬───────┘    └──────┬───────┘    └──────┬───────┘                │
    │         │                   │                   │                        │
    │    Logs │           Metrics │           Metrics │ Logs                   │
    │         ▼                   ▼                   ▼                        │
    │  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐                │
    │  │  Filebeat    │    │ Node Exporter│    │  cAdvisor    │                │
    │  └──────┬───────┘    └──────┬───────┘    └──────┬───────┘                │
    │         │                   │                   │                        │
    │         ▼                   └─────────┬─────────┘                        │
    │  ┌──────────────┐                     ▼                                  │
    │  │  Logstash    │           ┌───────────────────┐                        │
    │  │  (parse,     │           │    Prometheus     │                        │
    │  │   enrich,    │           │  (metrics store)  │                        │
    │  │   route)     │           └─────────┬─────────┘                        │
    │  └──────┬───────┘                     │                                  │
    │         │                             │ scrape                           │
    │         ▼                             ▼                                  │
    │  ┌──────────────┐           ┌───────────────────┐                        │
    │  │Elasticsearch │◄──────────│   ML Detector     │                        │
    │  │ (log store,  │   index   │                   │                        │
    │  │  anomaly     │           │  • Z-Score        │                        │
    │  │  results)    │           │  • EWMA           │                        │
    │  └──────┬───────┘           │  • Isolation      │                        │
    │         │                   │    Forest (ML)    │──► Prometheus metrics  │
    │         ▼                   │  • Outage Prob    │──► Alertmanager        │
    │  ┌──────────────┐           └─────────┬─────────┘──► Elasticsearch       │
    │  │   Kibana     │                     │                                  │
    │  │ (log search, │          ┌──────────▼──────────┐                       │
    │  │  dashboards) │          │    Alertmanager     │                       │
    │  └──────────────┘          │  (route, dedupe,    │                       │
    │                            │   inhibit, notify)  │                       │
    │  ┌──────────────┐          └─────────────────────┘                       │
    │  │   Grafana    │◄── Prometheus + Elasticsearch                          │
    │  │ (dashboards, │                                                        │
    │  │  alerting)   │                                                        │
    │  └──────────────┘                                                        │
    └──────────────────────────────────────────────────────────────────────────┘


Detection pipeline inside the ML detector:

              Prometheus metrics (60s interval)
                              │
                              ▼
    ┌────────────────────────────────────────────────────┐
    │                ML Anomaly Detector                 │
    │                                                    │
    │  For each metric × instance:                       │
    │                                                    │
    │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
    │ │   Z-Score    │ │     EWMA     │ │   IForest    │ │
    │ │ (40% weight) │ │ (30% weight) │ │ (30% weight) │ │
    │ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ │
    │        └────────────────┴────────────────┘         │
    │                         │                          │
    │                  Composite Score                   │
    │                         │                          │
    │               > 2.5 ──► Warning alert              │
    │               > 4.0 ──► Critical alert             │
    │                         │                          │
    │       All scores ──► Sigmoid ──► Outage Prob       │
    │                         │                          │
    │               > 75% ──► Predictive alert           │
    └────────────────────────────────────────────────────┘

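
As a rough illustration of the scoring flow in the diagram above, the sketch below combines three detector scores with the 40/30/30 weights, applies the 2.5 and 4.0 alert thresholds, and maps the composite score to an outage probability. The function names and the sigmoid midpoint and steepness are assumptions made for this example; the real values live in docker/ml-detector/detector.py and may differ. It also assumes the three detector scores are already on a comparable scale.

```python
import math

# Weights and thresholds as shown in the pipeline diagram.
WEIGHTS = {"zscore": 0.40, "ewma": 0.30, "iforest": 0.30}
WARNING_THRESHOLD = 2.5
CRITICAL_THRESHOLD = 4.0
PREDICTIVE_THRESHOLD = 0.75  # outage probability above which a predictive alert fires

# Assumed sigmoid shape; this README does not state these values.
SIGMOID_MIDPOINT = 3.0
SIGMOID_STEEPNESS = 1.5


def composite_score(scores):
    """Weighted sum of the per-detector anomaly scores."""
    return sum(weight * scores[name] for name, weight in WEIGHTS.items())


def severity(score):
    """Return 'critical', 'warning', or None for a composite score."""
    if score > CRITICAL_THRESHOLD:
        return "critical"
    if score > WARNING_THRESHOLD:
        return "warning"
    return None


def outage_probability(score):
    """Squash a composite score into the range (0, 1) with a logistic curve."""
    return 1.0 / (1.0 + math.exp(-SIGMOID_STEEPNESS * (score - SIGMOID_MIDPOINT)))


if __name__ == "__main__":
    sample = {"zscore": 5.0, "ewma": 4.0, "iforest": 3.0}  # made-up detector outputs
    score = composite_score(sample)
    prob = outage_probability(score)
    print(f"composite={score:.2f} severity={severity(score)} outage_probability={prob:.0%}")
    if prob > PREDICTIVE_THRESHOLD:
        print("predictive alert")
```

With these assumed parameters, a composite score equal to the midpoint (3.0) maps to a probability of exactly 0.5, and steeper settings make the probability jump more sharply around that point.
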
| Component | Role | Port |
|---|---|---|
| Prometheus | Metrics collection & storage | 9090 |
| Alertmanager | Alert routing, deduplication, inhibition | 9093 |
| Node Exporter | Host metrics (CPU, mem, disk, net) | 9100 |
| cAdvisor | Container metrics | 8080 |
| Grafana | Dashboards & visualisation | 3000 |
| Elasticsearch | Log storage & anomaly indexing | 9200 |
| Logstash | Log ingestion, parsing, enrichment | 5044 |
| Kibana | Log search & exploration | 5601 |
| Filebeat | Log shipper (Docker containers) | n/a |
| ML Detector | Custom anomaly detection engine | 8888 |

Prerequisites:

- Docker & Docker Compose
- 4 GB RAM minimum (8 GB recommended for ELK)
- Python 3.8+ (for Terraform)

Clone and start the full stack:

    git clone https://github.com/thomasasamba-bot/02-observability-ml-alerting
    cd 02-observability-ml-alerting

    # Start everything
    docker-compose up -d

    # Check all services are healthy
    docker-compose ps

| Service | URL | Credentials |
|---|---|---|
| Grafana | http://localhost:3000 | admin / observability123 |
| Kibana | http://localhost:5601 | none (no auth in demo) |
| Prometheus | http://localhost:9090 | none |
| Alertmanager | http://localhost:9093 | none |
| ML Detector API | http://localhost:8888 | none |
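
For a quick reachability check after `docker-compose up -d`, a small script along these lines can probe the ports from the component table earlier in this README. This is a minimal sketch: it assumes every port is published on `localhost` (which depends on docker-compose.yml), and a successful TCP connection only shows that something is listening, not that the service is healthy. The names `SERVICES`, `is_port_open`, and `unreachable_services` are hypothetical.

```python
import socket

# Ports copied from the component table; publishing them on the host is an assumption.
SERVICES = {
    "Prometheus": 9090,
    "Alertmanager": 9093,
    "Node Exporter": 9100,
    "cAdvisor": 8080,
    "Grafana": 3000,
    "Elasticsearch": 9200,
    "Logstash": 5044,
    "Kibana": 5601,
    "ML Detector": 8888,
}


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def unreachable_services(host="localhost", services=SERVICES):
    """Names of services whose port does not accept a TCP connection."""
    return [name for name, port in services.items() if not is_port_open(host, port)]


if __name__ == "__main__":
    missing = unreachable_services()
    if missing:
        print("Not reachable:", ", ".join(missing))
    else:
        print("All service ports are accepting connections")
```
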

The custom ML detector (docker/ml-detector/detector.py) runs three algorithms in parallel:

- Z-Score: statistical baseline detection. Flags metrics that deviate more than 2.5 standard deviations from their rolling mean. Fast and interpretable.
- EWMA (Exponentially Weighted Moving Average): trend-aware detection. Gives more weight to recent observations, catching gradual drift that Z-Score misses.
- Isolation Forest: unsupervised ML. Trains on historical metric windows and scores new observations by how easy they are to isolate. Catches complex multi-dimensional anomalies.

The three scores are weighted (40% Z-Score, 30% EWMA, 30% Isolation Forest) and combined into a composite anomaly score, which feeds a sigmoid function to produce an outage probability (0–100%).
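
The first two detectors are simple enough to sketch with the standard library. The classes below are a minimal illustration and not the code in detector.py: the class names, the window size, and the smoothing factor `alpha` are hypothetical. Isolation Forest is left out here because it needs a third-party library.

```python
import math
import statistics
from collections import deque


class RollingZScore:
    """Score each new value by its distance from the rolling mean, in standard deviations."""

    def __init__(self, window=60):
        self.values = deque(maxlen=window)

    def update(self, x):
        score = 0.0
        if len(self.values) >= 2:
            mean = statistics.fmean(self.values)
            std = statistics.pstdev(self.values)
            if std > 0:
                score = abs(x - mean) / std
        self.values.append(x)  # the new value joins the baseline after scoring
        return score


class EwmaDetector:
    """Score each new value against an exponentially weighted mean and variance."""

    def __init__(self, alpha=0.3):
        self.alpha = alpha
        self.mean = None
        self.var = 0.0

    def update(self, x):
        if self.mean is None:
            self.mean = x  # first observation seeds the mean
            return 0.0
        diff = x - self.mean
        std = math.sqrt(self.var)
        score = abs(diff) / std if std > 0 else 0.0
        # Incremental update of the exponentially weighted mean and variance.
        incr = self.alpha * diff
        self.mean += incr
        self.var = (1 - self.alpha) * (self.var + diff * incr)
        return score


if __name__ == "__main__":
    zscore, ewma = RollingZScore(window=10), EwmaDetector(alpha=0.3)
    for value in [10, 11, 9, 10, 11, 9, 10, 10, 30]:  # made-up samples ending in a spike
        print(value, round(zscore.update(value), 2), round(ewma.update(value), 2))
```

Both detectors score a value before folding it into their baseline, so a spike is compared against history that does not yet include it. A larger `alpha` makes the EWMA baseline follow recent samples more closely.
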

    # Query current anomaly scores
    curl http://localhost:8888/anomalies | python -m json.tool

    # Check outage probability
    curl "http://localhost:9090/api/v1/query?query=ml_outage_probability" | python -m json.tool

Deploy the Terraform module to extend observability into AWS:

    cd terraform
    terraform init
    terraform apply -var="[email protected]"

This provisions CloudWatch Anomaly Detection alarms, log metric filters, and a composite alarm that fires when multiple signals are degraded simultaneously.
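
Back on the local stack, the `ml_outage_probability` query shown with curl above returns Prometheus's standard instant-query JSON, so it can also be consumed from a script. The sketch below is one way to do that with the standard library. The function names are hypothetical, and it assumes each series carries an `instance` label, which this README suggests ("for each metric × instance") but does not confirm. Whether the value is on a 0 to 1 or 0 to 100 scale is also not stated here.

```python
import json
import urllib.parse
import urllib.request

PROMETHEUS_URL = "http://localhost:9090"  # Prometheus URL from the services table


def parse_instant_vector(payload):
    """Map each series' `instance` label to its sample value from an instant-query response."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector result, got {data['resultType']}")
    # Each sample is [unix_timestamp, "value-as-string"].
    return {
        series["metric"].get("instance", "unknown"): float(series["value"][1])
        for series in data["result"]
    }


def query_outage_probability(base_url=PROMETHEUS_URL, timeout=5.0):
    """Fetch ml_outage_probability from Prometheus and return {instance: value}."""
    params = urllib.parse.urlencode({"query": "ml_outage_probability"})
    url = f"{base_url}/api/v1/query?{params}"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_instant_vector(json.load(response))
```

Calling `query_outage_probability()` against a running stack returns a dictionary keyed by instance; an empty dictionary means Prometheus has no current samples for the metric.
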

    02-observability-ml-alerting/
    ├── docker-compose.yml
    ├── docker/
    │   ├── prometheus/
    │   │   ├── prometheus.yml              # Scrape config
    │   │   └── alert_rules.yml             # 15+ alert rules with ML anomaly alerts
    │   ├── alertmanager/
    │   │   └── alertmanager.yml            # Routing tree, inhibitions, receivers
    │   ├── logstash/
    │   │   └── pipeline/logstash.conf      # Parsing, enrichment, routing
    │   ├── ml-detector/
    │   │   ├── detector.py                 # Z-Score + EWMA + Isolation Forest
    │   │   ├── Dockerfile
    │   │   └── requirements.txt
    │   └── filebeat/
    │       └── filebeat.yml
    ├── dashboards/
    │   └── infrastructure-overview.json    # Grafana dashboard with ML panels
    ├── terraform/
    │   └── main.tf                         # AWS CloudWatch complement
    └── README.md

To tear everything down:

    docker-compose down -v   # Remove containers and volumes

    # AWS cleanup
    cd terraform && terraform destroy

Other projects in this series:

- 01-self-healing-infrastructure: AIOps self-healing with Lambda/SSM
- 03-secure-aws-infrastructure: IaC with KMS and IAM hardening
- 04-kubernetes-orchestration: EKS zero-downtime deployments
- 05-devsecops-pipeline: CI/CD with SonarQube and Trivy

Built by Thomas Asamba | github.com/thomasasamba-bot