📊 Observability & ML-Based Alerting Platform

A full-stack observability platform that combines Prometheus, Grafana, the ELK Stack, and a custom ML anomaly detection engine. The goal is to move operations from reactive firefighting toward predictive incident management.

Architecture

┌──────────────────────────────────────────────────────────────────────────┐
│                    Observability Platform                                 │
│                                                                          │
│  ┌──────────────┐   ┌──────────────┐   ┌──────────────┐                │
│  │   Your App   │   │  Linux Host  │   │  Containers  │                │
│  │  (any stack) │   │              │   │   (Docker)   │                │
│  └──────┬───────┘   └──────┬───────┘   └──────┬───────┘                │
│         │                  │                   │                         │
│  Logs   │          Metrics │           Metrics │  Logs                  │
│         ▼                  ▼                   ▼                         │
│  ┌─────────────┐   ┌───────────────┐  ┌──────────────┐                 │
│  │  Filebeat   │   │ Node Exporter │  │  cAdvisor    │                 │
│  └──────┬──────┘   └───────┬───────┘  └──────┬───────┘                 │
│         │                  │                   │                         │
│         ▼                  └─────────┬─────────┘                        │
│  ┌─────────────┐                     ▼                                  │
│  │  Logstash   │           ┌──────────────────┐                         │
│  │  (parse,    │           │    Prometheus     │                         │
│  │   enrich,   │           │  (metrics store)  │                         │
│  │   route)    │           └────────┬──────────┘                         │
│  └──────┬──────┘                    │                                    │
│         │                           │  scrape                            │
│         ▼                           ▼                                    │
│  ┌──────────────┐         ┌──────────────────┐                          │
│  │Elasticsearch │◄────────│  ML Detector     │                          │
│  │  (log store, │  index  │                  │                          │
│  │   anomaly    │         │ • Z-Score        │                          │
│  │   results)   │         │ • EWMA           │                          │
│  └──────┬───────┘         │ • Isolation      │                          │
│         │                 │   Forest (ML)    │──► Prometheus metrics     │
│         ▼                 │ • Outage Prob    │──► Alertmanager           │
│  ┌──────────────┐         └──────────────────┘──► Elasticsearch          │
│  │   Kibana     │                    │                                    │
│  │ (log search, │         ┌──────────▼──────────┐                        │
│  │  dashboards) │         │    Alertmanager      │                        │
│  └──────────────┘         │  (route, dedupe,     │                        │
│                           │   inhibit, notify)   │                        │
│  ┌──────────────┐         └──────────────────────┘                        │
│  │   Grafana    │◄── Prometheus + Elasticsearch                           │
│  │ (dashboards, │                                                          │
│  │  alerting)   │                                                          │
│  └──────────────┘                                                          │
└──────────────────────────────────────────────────────────────────────────┘

ML Detection Pipeline

Prometheus metrics (60s interval)
         │
         ▼
┌─────────────────────────────────────────────┐
│           ML Anomaly Detector               │
│                                             │
│  For each metric × instance:                │
│                                             │
│  ┌─────────────┐  ┌──────────┐  ┌────────┐ │
│  │   Z-Score   │  │  EWMA    │  │  IForest│ │
│  │ (40% weight)│  │(30% wt.) │  │(30% wt.)│ │
│  └──────┬──────┘  └────┬─────┘  └────┬───┘ │
│         └──────────────┴──────────────┘     │
│                         │                   │
│              Composite Score                │
│                         │                   │
│              > 2.5  ──► Warning alert       │
│              > 4.0  ──► Critical alert      │
│                         │                   │
│  All scores ──► Sigmoid ──► Outage Prob     │
│                         │                   │
│              > 75%  ──► Predictive alert    │
└─────────────────────────────────────────────┘
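
The thresholds in the diagram can be turned into an alert payload. The following is a minimal sketch, not the code in detector.py: it assumes the detector posts directly to Alertmanager's v2 HTTP API (POST /api/v2/alerts), and the alert name MLAnomalyDetected, the label set, and the function names are hypothetical.

```python
import json
import urllib.request
from datetime import datetime, timezone

ALERTMANAGER_URL = "http://localhost:9093/api/v2/alerts"

def build_alert(metric, instance, score, warning=2.5, critical=4.0):
    """Map a composite score to an Alertmanager alert dict, or None if below the warning threshold."""
    if score <= warning:
        return None
    severity = "critical" if score > critical else "warning"
    return {
        # Label names and alertname are illustrative assumptions
        "labels": {
            "alertname": "MLAnomalyDetected",
            "severity": severity,
            "metric": metric,
            "instance": instance,
        },
        "annotations": {
            "summary": f"Composite anomaly score {score:.2f} for {metric} on {instance}",
        },
        "startsAt": datetime.now(timezone.utc).isoformat(),
    }

def send_alert(alert, url=ALERTMANAGER_URL, timeout=5):
    """POST a single alert; the v2 API expects a JSON array of alerts."""
    request = urllib.request.Request(
        url,
        data=json.dumps([alert]).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Calling build_alert("cpu_usage", "host-1:9100", 4.5) yields an alert with severity "critical", which send_alert can then deliver to a running Alertmanager.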

Stack Components

Component	Role	Port
Prometheus	Metrics collection & storage	9090
Alertmanager	Alert routing, deduplication, inhibition	9093
Node Exporter	Host metrics (CPU, mem, disk, net)	9100
cAdvisor	Container metrics	8080
Grafana	Dashboards & visualisation	3000
Elasticsearch	Log storage & anomaly indexing	9200
Logstash	Log ingestion, parsing, enrichment	5044
Kibana	Log search & exploration	5601
Filebeat	Log shipper (Docker containers)	—
ML Detector	Custom anomaly detection engine	8888

Quick Start

Prerequisites

Docker & Docker Compose
4GB RAM minimum (8GB recommended for ELK)
Python 3.8+ (used by the python -m json.tool commands below)
Terraform (only for the AWS CloudWatch integration)

# Clone the repository
git clone https://github.com/thomasasamba-bot/02-observability-ml-alerting
cd 02-observability-ml-alerting

# Start everything
docker-compose up -d

# Check all services are healthy
docker-compose ps

Access the UIs

Service	URL	Credentials
Grafana	http://localhost:3000	admin / observability123
Kibana	http://localhost:5601	— (no auth in demo)
Prometheus	http://localhost:9090	—
Alertmanager	http://localhost:9093	—
ML Detector API	http://localhost:8888	—
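
Before opening the UIs, it can help to confirm that each service answers on its port. The sketch below is an assumption-laden helper, not part of the repository: it probes the health endpoints these tools document (/-/healthy for Prometheus and Alertmanager, /api/health for Grafana, /_cluster/health for Elasticsearch, /api/status for Kibana). The ML Detector is left out because this README does not state a health path for it.

```python
import http.client
import urllib.error
import urllib.request

HEALTH_URLS = {
    "Prometheus": "http://localhost:9090/-/healthy",
    "Alertmanager": "http://localhost:9093/-/healthy",
    "Grafana": "http://localhost:3000/api/health",
    "Elasticsearch": "http://localhost:9200/_cluster/health",
    "Kibana": "http://localhost:5601/api/status",
}

def http_ok(url, timeout=3):
    """Return True if the URL answers with a 2xx status, False on any connection or HTTP error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError, http.client.HTTPException):
        return False

def check_services(urls=HEALTH_URLS, probe=http_ok):
    """Probe every service once and return a {name: is_healthy} mapping."""
    return {name: probe(url) for name, url in urls.items()}

def report(results):
    """Format the results as one line per service."""
    return "\n".join(f"{name}: {'up' if ok else 'DOWN'}" for name, ok in results.items())
```

Running print(report(check_services())) after docker-compose up -d gives a quick up/down summary. The probe parameter exists so the check can be exercised without a running stack.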

ML Anomaly Detection

The custom ML detector (docker/ml-detector/detector.py) runs three algorithms in parallel:

Z-Score — statistical baseline detection. Flags metrics that deviate more than 2.5 standard deviations from their rolling mean. Fast and interpretable.
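
As an illustration of the idea (not the code in detector.py), a rolling Z-Score check might look like the sketch below. The window size of 60 samples and the function name are assumptions; the 2.5 threshold comes from the description above.

```python
from collections import deque
from statistics import mean, pstdev

Z_THRESHOLD = 2.5  # standard deviations, as stated above

def rolling_z_score(history, value):
    """Distance of value from the window mean, in units of the window's standard deviation."""
    if len(history) < 2:
        return 0.0  # not enough data to estimate spread
    sigma = pstdev(history)
    if sigma == 0:
        return 0.0  # flat history: avoid division by zero
    return abs(value - mean(history)) / sigma

# Rolling window of recent CPU readings (synthetic values; 60-sample window is an assumption)
window = deque([50.0, 52.0, 48.0, 51.0, 49.0], maxlen=60)

score = rolling_z_score(window, 75.0)
is_anomaly = score > Z_THRESHOLD
window.append(75.0)  # slide the window forward after scoring
```

Scoring the new value before appending it keeps a spike from inflating its own baseline.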

EWMA (Exponentially Weighted Moving Average) — trend-aware detection. Gives more weight to recent observations, catching gradual drift that Z-Score misses.
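
One way to sketch EWMA-based scoring is to track an exponentially weighted mean and variance and measure how far the newest value sits from that baseline. The smoothing factor alpha=0.3 and the function name are assumptions for illustration; the detector's actual parameters are not given here.

```python
def ewma_deviation(values, alpha=0.3):
    """Score the last value against the EWMA of the earlier ones.

    Uses the standard incremental update for an exponentially weighted
    mean and variance, so recent samples dominate the baseline.
    """
    if len(values) < 3:
        return 0.0
    ewma = values[0]
    ewm_var = 0.0
    for x in values[1:-1]:
        diff = x - ewma
        ewma += alpha * diff
        ewm_var = (1 - alpha) * (ewm_var + alpha * diff * diff)
    std = ewm_var ** 0.5
    if std == 0:
        return 0.0  # flat baseline: no spread to normalise by
    return abs(values[-1] - ewma) / std

# Synthetic latency samples: stable oscillation, then a jump
series = [10.0, 11.0, 10.0, 11.0, 10.0, 11.0, 30.0]
drift_score = ewma_deviation(series)
```

A larger alpha makes the baseline follow recent values more closely, which shortens reaction time at the cost of more noise.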

Isolation Forest — unsupervised ML. Trains on historical metric windows and scores new observations by how easy they are to isolate. Catches complex multi-dimensional anomalies.
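
A minimal sketch of this step using scikit-learn's IsolationForest is shown below. It assumes scikit-learn and NumPy are installed, and the two-feature layout (CPU %, memory %), the synthetic training data, and the hyperparameters are all illustrative rather than taken from detector.py.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# Synthetic history: rows of [cpu %, memory %] clustered around normal load
history = rng.normal(loc=[50.0, 40.0], scale=[5.0, 4.0], size=(500, 2))

model = IsolationForest(n_estimators=100, contamination="auto", random_state=42)
model.fit(history)

# One typical observation and one far outside the training cluster
new_points = np.array([[51.0, 41.0], [98.0, 95.0]])

scores = model.score_samples(new_points)  # lower score = easier to isolate = more anomalous
labels = model.predict(new_points)        # 1 = inlier, -1 = outlier
```

Because the forest scores whole rows, it can flag combinations of values that look unremarkable when each metric is inspected on its own.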

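These three scores are combined with weights of 40% (Z-Score), 30% (EWMA), and 30% (Isolation Forest) into a composite anomaly score. A composite above 2.5 raises a warning alert and above 4.0 a critical alert. The score also feeds a sigmoid function that produces an outage probability (0–100%); a probability above 75% raises a predictive alert.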
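
The weighting and sigmoid steps can be sketched as follows. The weights and alert thresholds are taken from the pipeline diagram; the sigmoid's midpoint and steepness are hypothetical values chosen only to make the example concrete.

```python
import math

WEIGHTS = {"z_score": 0.4, "ewma": 0.3, "iforest": 0.3}  # from the pipeline diagram
WARNING, CRITICAL = 2.5, 4.0                             # composite score thresholds
PREDICTIVE = 75.0                                        # outage probability threshold (%)

def composite_score(scores):
    """Weighted sum of the three detector scores."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)

def severity(score):
    """Map a composite score to an alert level."""
    if score > CRITICAL:
        return "critical"
    if score > WARNING:
        return "warning"
    return "ok"

def outage_probability(score, midpoint=3.0, steepness=1.5):
    """Squash a composite score into 0-100% with a logistic curve.

    midpoint and steepness are assumed values, not the detector's.
    """
    return 100.0 / (1.0 + math.exp(-steepness * (score - midpoint)))

sample = {"z_score": 5.0, "ewma": 3.0, "iforest": 4.0}
total = composite_score(sample)              # 0.4*5 + 0.3*3 + 0.3*4, about 4.1
level = severity(total)
probability = outage_probability(total)
predictive_alert = probability > PREDICTIVE
```

With these assumed sigmoid parameters, a composite score equal to the midpoint maps to exactly 50%, and scores well above the critical threshold approach 100%.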

# Query current anomaly scores
curl http://localhost:8888/anomalies | python -m json.tool

# Check outage probability
curl "http://localhost:9090/api/v1/query?query=ml_outage_probability" | python -m json.tool
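
The same query can be issued from Python through the Prometheus HTTP API. This sketch relies only on the standard library; the instance label value in the sample payload is made up, and the payload is hand-written in the shape Prometheus documents for instant-vector responses rather than captured from a running stack.

```python
import json
import urllib.parse
import urllib.request

PROMETHEUS_URL = "http://localhost:9090"

def parse_instant_vector(payload):
    """Turn an instant-query response into {instance: value}."""
    if payload.get("status") != "success":
        raise ValueError(f"Prometheus query failed: {payload.get('error', 'unknown error')}")
    return {
        sample["metric"].get("instance", "unknown"): float(sample["value"][1])
        for sample in payload["data"]["result"]
    }

def query_prometheus(expr, base_url=PROMETHEUS_URL, timeout=5):
    """Run an instant query against /api/v1/query and parse the result."""
    url = f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": expr})
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_instant_vector(json.load(response))

# Hand-written sample response (hypothetical instance and value)
sample_payload = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {
                "metric": {"__name__": "ml_outage_probability", "instance": "node-exporter:9100"},
                "value": [1700000000.0, "12.5"],
            }
        ],
    },
}
parsed = parse_instant_vector(sample_payload)
```

Against a running stack, query_prometheus("ml_outage_probability") returns the current value per instance.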

AWS CloudWatch Integration

Deploy the Terraform module to extend observability into AWS:

cd terraform
terraform init
terraform apply -var="[email protected]"

This provisions CloudWatch Anomaly Detection alarms, log metric filters, and a composite alarm that fires when multiple signals are degraded simultaneously.

Project Structure

02-observability-ml-alerting/
├── docker-compose.yml
├── docker/
│   ├── prometheus/
│   │   ├── prometheus.yml        # Scrape config
│   │   └── alert_rules.yml       # 15+ alert rules with ML anomaly alerts
│   ├── alertmanager/
│   │   └── alertmanager.yml      # Routing tree, inhibitions, receivers
│   ├── logstash/
│   │   └── pipeline/logstash.conf # Parsing, enrichment, routing
│   ├── ml-detector/
│   │   ├── detector.py           # Z-Score + EWMA + Isolation Forest
│   │   ├── Dockerfile
│   │   └── requirements.txt
│   └── filebeat/
│       └── filebeat.yml
├── dashboards/
│   └── infrastructure-overview.json  # Grafana dashboard with ML panels
├── terraform/
│   └── main.tf                   # AWS CloudWatch complement
└── README.md

Teardown

docker-compose down -v   # Remove containers and volumes

# AWS cleanup
cd terraform && terraform destroy

Related Projects

01-self-healing-infrastructure — AIOps self-healing with Lambda/SSM
03-secure-aws-infrastructure — IaC with KMS and IAM hardening
04-kubernetes-orchestration — EKS zero-downtime deployments
05-devsecops-pipeline — CI/CD with SonarQube and Trivy

Built by Thomas Asamba | github.com/thomasasamba-bot
