thomasasamba-bot/02-observability-ml-alerting

📊 Observability & ML-Based Alerting Platform

Full-stack observability platform combining Prometheus, Grafana, ELK Stack, and a custom ML anomaly detection engine, shifting operations from reactive firefighting to predictive incident management.


Architecture

┌──────────────────────────────────────────────────────────────────────────┐
│                    Observability Platform                                │
│                                                                          │
│  ┌──────────────┐   ┌──────────────┐   ┌──────────────┐                  │
│  │   Your App   │   │  Linux Host  │   │  Containers  │                  │
│  │  (any stack) │   │              │   │   (Docker)   │                  │
│  └──────┬───────┘   └──────┬───────┘   └──────┬───────┘                  │
│         │                  │                  │                          │
│  Logs   │          Metrics │          Metrics │  Logs                    │
│         ▼                  ▼                  ▼                          │
│  ┌─────────────┐   ┌───────────────┐  ┌──────────────┐                   │
│  │  Filebeat   │   │ Node Exporter │  │  cAdvisor    │                   │
│  └──────┬──────┘   └───────┬───────┘  └───────┬──────┘                   │
│         │                  │                  │                          │
│         ▼                  └─────────┬────────┘                          │
│  ┌─────────────┐                     ▼                                   │
│  │  Logstash   │           ┌───────────────────┐                         │
│  │  (parse,    │           │    Prometheus     │                         │
│  │   enrich,   │           │  (metrics store)  │                         │
│  │   route)    │           └─────────┬─────────┘                         │
│  └──────┬──────┘                     │                                   │
│         │                            │  scrape                           │
│         ▼                            ▼                                   │
│  ┌──────────────┐         ┌──────────────────┐                           │
│  │Elasticsearch │◄────────│  ML Detector     │                           │
│  │  (log store, │  index  │                  │                           │
│  │   anomaly    │         │ • Z-Score        │                           │
│  │   results)   │         │ • EWMA           │                           │
│  └──────┬───────┘         │ • Isolation      │                           │
│         │                 │   Forest (ML)    │──► Prometheus metrics     │
│         ▼                 │ • Outage Prob    │──► Alertmanager           │
│  ┌──────────────┐         └──────────┬───────┘──► Elasticsearch          │
│  │   Kibana     │                    │                                   │
│  │ (log search, │         ┌──────────▼───────────┐                       │
│  │  dashboards) │         │    Alertmanager      │                       │
│  └──────────────┘         │  (route, dedupe,     │                       │
│                           │   inhibit, notify)   │                       │
│  ┌──────────────┐         └──────────────────────┘                       │
│  │   Grafana    │◄── Prometheus + Elasticsearch                          │
│  │ (dashboards, │                                                        │
│  │  alerting)   │                                                        │
│  └──────────────┘                                                        │
└──────────────────────────────────────────────────────────────────────────┘

ML Detection Pipeline

Prometheus metrics (60s interval)
         │
         ▼
┌─────────────────────────────────────────────┐
│           ML Anomaly Detector               │
│                                             │
│  For each metric × instance:                │
│                                             │
│  ┌─────────────┐  ┌──────────┐  ┌─────────┐ │
│  │   Z-Score   │  │  EWMA    │  │ IForest │ │
│  │ (40% weight)│  │(30% wt.) │  │(30% wt.)│ │
│  └──────┬──────┘  └────┬─────┘  └────┬────┘ │
│         └──────────────┴─────────────┘      │
│                        │                    │
│              Composite Score                │
│                        │                    │
│              > 2.5  ──► Warning alert       │
│              > 4.0  ──► Critical alert      │
│                        │                    │
│  All scores ──► Sigmoid ──► Outage Prob     │
│                        │                    │
│              > 75%  ──► Predictive alert    │
└─────────────────────────────────────────────┘
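
The thresholds in the diagram above can be read as a small decision function. The following is a minimal sketch, not the code in detector.py: the name `classify_alerts` and the returned labels are hypothetical, it assumes the outage probability is expressed as a percentage, and it assumes a critical alert replaces the warning rather than firing alongside it.

```python
# Thresholds taken from the pipeline diagram.
WARNING_SCORE = 2.5
CRITICAL_SCORE = 4.0
PREDICTIVE_PROBABILITY_PCT = 75.0


def classify_alerts(composite_score, outage_probability_pct):
    """Return the alert labels raised for one metric/instance pair.

    Hypothetical helper: label names are illustrative only.
    """
    alerts = []
    # Assumption: the higher severity supersedes the lower one.
    if composite_score > CRITICAL_SCORE:
        alerts.append("critical")
    elif composite_score > WARNING_SCORE:
        alerts.append("warning")
    # The predictive alert is evaluated independently of the severity.
    if outage_probability_pct > PREDICTIVE_PROBABILITY_PCT:
        alerts.append("predictive")
    return alerts
```

The comparisons are strict, so a composite score of exactly 2.5 or a probability of exactly 75% raises nothing under this reading of the diagram.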

Stack Components

| Component | Role | Port |
|---|---|---|
| Prometheus | Metrics collection & storage | 9090 |
| Alertmanager | Alert routing, deduplication, inhibition | 9093 |
| Node Exporter | Host metrics (CPU, mem, disk, net) | 9100 |
| cAdvisor | Container metrics | 8080 |
| Grafana | Dashboards & visualisation | 3000 |
| Elasticsearch | Log storage & anomaly indexing | 9200 |
| Logstash | Log ingestion, parsing, enrichment | 5044 |
| Kibana | Log search & exploration | 5601 |
| Filebeat | Log shipper (Docker containers) | - |
| ML Detector | Custom anomaly detection engine | 8888 |

Quick Start

Prerequisites

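  • Docker & Docker Compose
  • 4 GB RAM minimum (8 GB recommended for ELK)
  • Python 3.8+ (used to pretty-print JSON in the examples below)
  • Terraform (optional, only for the AWS CloudWatch integration)

# Clone and start the full stack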
git clone https://github.com/thomasasamba-bot/02-observability-ml-alerting
cd 02-observability-ml-alerting

# Start everything
docker-compose up -d

# Check all services are healthy
docker-compose ps

Access the UIs

| Service | URL | Credentials |
|---|---|---|
| Grafana | http://localhost:3000 | admin / observability123 |
| Kibana | http://localhost:5601 | none (no auth in demo) |
| Prometheus | http://localhost:9090 | - |
| Alertmanager | http://localhost:9093 | - |
| ML Detector API | http://localhost:8888 | - |
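
Before opening the UIs, it can help to confirm that each port accepts connections. The sketch below probes the ports listed under Stack Components with a plain TCP connect. It assumes every port is published to the host by docker-compose.yml, which this README does not confirm for all services; the names `port_open` and `check_stack` are hypothetical.

```python
import socket

# Ports as listed in the Stack Components table (Filebeat exposes none).
SERVICE_PORTS = {
    "Prometheus": 9090,
    "Alertmanager": 9093,
    "Node Exporter": 9100,
    "cAdvisor": 8080,
    "Grafana": 3000,
    "Elasticsearch": 9200,
    "Logstash": 5044,
    "Kibana": 5601,
    "ML Detector": 8888,
}


def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_stack(host="localhost", ports=SERVICE_PORTS):
    """Map each service name to whether its port is reachable."""
    return {name: port_open(host, port) for name, port in ports.items()}
```

Calling `check_stack()` returns a dictionary such as `{"Prometheus": True, ...}`. An open port only shows that something is listening; it does not replace the health status reported by `docker-compose ps`.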

ML Anomaly Detection

The custom ML detector (docker/ml-detector/detector.py) runs three algorithms in parallel:

Z-Score: statistical baseline detection. Flags metrics that deviate more than 2.5 standard deviations from their rolling mean. Fast and interpretable.
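
As an illustration of the Z-Score check, the sketch below scores a new observation against a rolling window of recent values. It is a minimal example under assumptions, not the detector's code: the function names are hypothetical, and the use of the sample standard deviation is a choice made here.

```python
import statistics

Z_THRESHOLD = 2.5  # threshold stated in the description above


def rolling_z_score(window, value):
    """Distance of `value` from the window mean, in standard deviations."""
    if len(window) < 2:
        return 0.0  # not enough history to estimate spread
    mean = statistics.fmean(window)
    std = statistics.stdev(window)
    if std == 0:
        return 0.0  # flat history: avoid dividing by zero
    return (value - mean) / std


def is_z_anomaly(window, value, threshold=Z_THRESHOLD):
    """True when the observation deviates by more than `threshold` sigmas."""
    return abs(rolling_z_score(window, value)) > threshold
```

With a window such as `[10, 12, 11, 9, 10, 11, 10, 12, 9, 11]`, a new value of 20 lies far more than 2.5 standard deviations from the mean and is flagged, while a value of 10.5 is not.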

EWMA (Exponentially Weighted Moving Average): trend-aware detection. Gives more weight to recent observations, catching gradual drift that Z-Score misses.
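
One common way to turn an EWMA into a drift detector is the EWMA control chart, sketched below. This formulation is an assumption: the README does not say which variant detector.py uses. The smoothing factor `alpha`, the baseline window, and the function name are all hypothetical.

```python
import math
import statistics


def ewma_scores(baseline, observations, alpha=0.2):
    """Score each observation by how far the EWMA has drifted from the baseline.

    The score is |ewma - baseline mean| divided by the asymptotic standard
    deviation of the EWMA statistic, sigma * sqrt(alpha / (2 - alpha)).
    """
    mu = statistics.fmean(baseline)
    sigma = statistics.stdev(baseline)
    if sigma == 0:
        raise ValueError("baseline has zero variance")
    ewma_sigma = sigma * math.sqrt(alpha / (2 - alpha))

    ewma = mu  # start the moving average at the baseline mean
    scores = []
    for x in observations:
        # Recent observations get weight alpha, history gets (1 - alpha).
        ewma = alpha * x + (1 - alpha) * ewma
        scores.append(abs(ewma - mu) / ewma_sigma)
    return scores
```

With a baseline alternating between 9 and 11, a sustained shift to 11.5 is only about 1.5 standard deviations per point, so a plain Z-Score at 2.5 would not fire. The EWMA score keeps growing as the shift persists and crosses 2.5 after a few observations.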

Isolation Forest: unsupervised ML. Trains on historical metric windows and scores new observations by how easy they are to isolate. Catches complex multi-dimensional anomalies.
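
To show what "easy to isolate" means, here is a deliberately simplified, standard-library-only isolation forest over a single metric. It is a one-dimensional teaching sketch, not the detector's implementation, which may well use a library and multiple features. All function names and parameters (`n_trees`, `sample_size`, `seed`) are hypothetical.

```python
import math
import random

EULER_GAMMA = 0.5772156649


def average_path_length(n):
    """Expected path length of an unsuccessful binary-search-tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def build_tree(sample, rng, depth, max_depth):
    """Split the sample at random cut points until points are isolated."""
    if depth >= max_depth or len(sample) <= 1 or min(sample) == max(sample):
        return {"size": len(sample)}  # leaf
    split = rng.uniform(min(sample), max(sample))
    return {
        "split": split,
        "left": build_tree([v for v in sample if v < split], rng, depth + 1, max_depth),
        "right": build_tree([v for v in sample if v >= split], rng, depth + 1, max_depth),
    }


def path_length(value, node, depth=0):
    """Splits needed to reach a leaf, plus a correction for unsplit leaves."""
    if "split" not in node:
        return depth + average_path_length(node["size"])
    child = node["left"] if value < node["split"] else node["right"]
    return path_length(value, child, depth + 1)


def fit_forest(history, n_trees=100, sample_size=64, seed=0):
    """Train on a historical window; returns (trees, subsample size)."""
    if len(history) < 2:
        raise ValueError("need at least two historical points")
    rng = random.Random(seed)
    size = min(sample_size, len(history))
    max_depth = math.ceil(math.log2(size))
    trees = [build_tree(rng.sample(history, size), rng, 0, max_depth)
             for _ in range(n_trees)]
    return trees, size


def anomaly_score(value, trees, size):
    """Score between 0 and 1; higher means easier to isolate, i.e. more anomalous."""
    mean_path = sum(path_length(value, tree) for tree in trees) / len(trees)
    return 2.0 ** (-mean_path / average_path_length(size))
```

A value far outside the historical range is separated from the rest after very few random splits, so its average path is short and its score is higher than that of a value in the dense middle of the history.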

These three scores are weighted and combined into a composite anomaly score, which feeds a sigmoid function to produce an outage probability (0–100%).
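
The combination step can be sketched as follows. The 40/30/30 weights come from the ML Detection Pipeline diagram. Everything else is an assumption: the three inputs are taken to be on a comparable scale already, and the sigmoid's `midpoint` and `steepness` are hypothetical parameters, since the README does not give the values used in detector.py.

```python
import math

# Weights from the pipeline diagram: Z-Score 40%, EWMA 30%, Isolation Forest 30%.
WEIGHTS = {"z_score": 0.4, "ewma": 0.3, "iforest": 0.3}


def composite_score(z_score, ewma, iforest):
    """Weighted sum of the three detector scores (assumed comparable in scale)."""
    return (WEIGHTS["z_score"] * z_score
            + WEIGHTS["ewma"] * ewma
            + WEIGHTS["iforest"] * iforest)


def outage_probability(score, midpoint=3.0, steepness=1.0):
    """Map a composite score to a 0-100% probability with a logistic sigmoid.

    `midpoint` is the score that maps to 50%; both parameters are
    illustrative assumptions, not values from the project.
    """
    return 100.0 / (1.0 + math.exp(-steepness * (score - midpoint)))
```

Under these illustrative parameters, a composite score equal to the midpoint gives 50%, and the probability rises smoothly toward 100% as the score grows, which is what lets a fixed cut-off such as 75% act as a predictive alert.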

# Query current anomaly scores
curl http://localhost:8888/anomalies | python -m json.tool

# Check outage probability
curl "http://localhost:9090/api/v1/query?query=ml_outage_probability" | python -m json.tool
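
The Prometheus query above returns JSON in the standard instant-query format. A minimal sketch of reading it in Python follows. It assumes `ml_outage_probability` carries an `instance` label and is reported as a percentage, neither of which this README confirms; the function names are hypothetical.

```python
import json


def outage_probabilities(payload):
    """Parse a Prometheus instant-query response into {instance: value}."""
    body = json.loads(payload)
    if body.get("status") != "success":
        raise ValueError("Prometheus query failed: %s" % body.get("error", "unknown"))
    result = {}
    for series in body["data"]["result"]:
        # Assumption: each series is labelled with the instance it describes.
        instance = series["metric"].get("instance", "unknown")
        _timestamp, value = series["value"]  # the value arrives as a string
        result[instance] = float(value)
    return result


def at_risk(probabilities, threshold=75.0):
    """Instances whose probability exceeds the predictive-alert threshold."""
    return sorted(name for name, p in probabilities.items() if p > threshold)
```

The payload can come from any HTTP client, for example the body returned by the `curl` command above.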

AWS CloudWatch Integration

Deploy the Terraform module to extend observability into AWS:

cd terraform
terraform init
terraform apply -var="[email protected]"

This provisions CloudWatch Anomaly Detection alarms, log metric filters, and a composite alarm that fires when multiple signals are degraded simultaneously.


Project Structure

02-observability-ml-alerting/
├── docker-compose.yml
├── docker/
│   ├── prometheus/
│   │   ├── prometheus.yml        # Scrape config
│   │   └── alert_rules.yml       # 15+ alert rules with ML anomaly alerts
│   ├── alertmanager/
│   │   └── alertmanager.yml      # Routing tree, inhibitions, receivers
│   ├── logstash/
│   │   └── pipeline/logstash.conf # Parsing, enrichment, routing
│   ├── ml-detector/
│   │   ├── detector.py           # Z-Score + EWMA + Isolation Forest
│   │   ├── Dockerfile
│   │   └── requirements.txt
│   └── filebeat/
│       └── filebeat.yml
├── dashboards/
│   └── infrastructure-overview.json  # Grafana dashboard with ML panels
├── terraform/
│   └── main.tf                   # AWS CloudWatch complement
└── README.md

Teardown

docker-compose down -v   # Remove containers and volumes

# AWS cleanup
cd terraform && terraform destroy

Related Projects


Built by Thomas Asamba | github.com/thomasasamba-bot
