henriquebonfim/o11y-stack-template

O11y Stack Template

A production-ready observability stack template with OpenTelemetry, Prometheus, Loki, Tempo, and Grafana, all routed through Traefik with automatic HTTPS.

📋 Overview

Figure: Website screenshot

This template provides a complete observability stack that collects, stores, and visualizes:

  • Metrics (Prometheus)
  • Logs (Loki)
  • Traces (Tempo)
  • Dashboards (Grafana)

All services are secured behind the Traefik reverse proxy, with automatic HTTP-to-HTTPS redirection and TLS certificates.

Figure: System Health Overview dashboard, a real-time monitoring dashboard showing system status, resource utilization, and container metrics

🏗️ Architecture

┌─────────────────────────────────────────────────────────┐
│                        Traefik                          │
│            (Reverse Proxy & Load Balancer)              │
│     Routes: *.localhost with HTTPS/TLS support          │
└─────────────────────────────────────────────────────────┘
                            │
        ┌───────────────────┼─────────────────────┐
        │                   │                     │
        ▼                   ▼                     ▼
   ┌─────────┐        ┌──────────┐         ┌──────────┐
   │   Web   │        │ Grafana  │         │Portainer │
   │  :80    │        │  :3000   │         │  :9000   │
   │(Nginx)  │        └──────────┘         └──────────┘
   └─────────┘              │
                            │
        ┌──────────────────────────────────────┐
        │                                      │
        ▼                                      ▼
┌──────────────┐                      ┌──────────────┐
│OpenTelemetry │◄────────────────────►│  Prometheus  │
│  Collector   │  (Metrics & Traces)  │    :9090     │
│   :4318      │                      └──────────────┘
└──────────────┘                              │
        │                                     │
    ┌───┴────┬──────────────────┐            │
    ▼        ▼                  ▼            ▼
┌──────┐ ┌───────┐      ┌──────────┐  ┌──────────┐
│ Loki │ │ Tempo │      │Node Exp. │  │ cAdvisor │
│:3100 │ │ :3200 │      │  :9100   │  │  :8080   │
└──────┘ └───────┘      └──────────┘  └──────────┘
 (Logs)   (Traces)      (Host Metrics) (Container)

🚀 Quick Start

Prerequisites

  • Docker & Docker Compose
  • mkcert (for local HTTPS certificates)

1. Generate SSL Certificates

# Install mkcert (if not already installed)
# macOS
brew install mkcert

# Linux
sudo apt install mkcert  # or your distribution's package manager

# Generate certificates
cd infrastructure/traefik/certs
mkcert -install
mkcert -cert-file local-dev.crt -key-file local-dev.key \
  "localhost" \
  "*.localhost" \
  "traefik.localhost" \
  "portainer.localhost" \
  "grafana.localhost" \
  "prometheus.localhost" \
  "cadvisor.localhost"
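
The hostname list passed to mkcert needs to match the base domain and subdomains you configure in .env. As a minimal sketch, assuming a POSIX shell, the hypothetical helper `cert_hosts` (not part of this template) prints that list from a base domain and its subdomains:

```shell
# Hypothetical helper: print one hostname per line for a base domain,
# its wildcard, and each subdomain passed as an extra argument.
# The body runs in a subshell, so its variables do not leak.
cert_hosts() (
  domain="$1"
  shift
  printf '%s\n' "$domain" "*.$domain"
  for sub in "$@"; do
    printf '%s\n' "$sub.$domain"
  done
)

# Prints the same hostnames used in the mkcert command above.
cert_hosts localhost traefik portainer grafana prometheus cadvisor
```

The output can be handed to mkcert with xargs, for example `cert_hosts localhost traefik grafana | xargs mkcert -cert-file local-dev.crt -key-file local-dev.key`.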

2. Configure Environment

# From the repository root, copy the example environment file
cp example.env .env

# Edit .env with your preferences
nano .env

3. Start the Stack

# Start all services
docker compose up -d

# Check service status
docker compose ps

# View logs
docker compose logs -f
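
Containers can take a few seconds to become ready after `docker compose up -d`. The sketch below shows one way to block until a service responds. `wait_for` is a hypothetical helper, not part of this template, and the usage line assumes Grafana is reachable at https://grafana.localhost and answers on its `/api/health` endpoint.

```shell
# Hypothetical helper: retry a command once per second until it succeeds
# or the timeout (in seconds) is reached.
# Returns 0 on success, 1 on timeout.
wait_for() {
  wf_timeout="$1"
  shift
  wf_elapsed=0
  until "$@" >/dev/null 2>&1; do
    wf_elapsed=$((wf_elapsed + 1))
    [ "$wf_elapsed" -ge "$wf_timeout" ] && return 1
    sleep 1
  done
  return 0
}

# Usage (assumes the stack is running and the mkcert root CA exists):
#   wait_for 60 curl -fsS --cacert "$(mkcert -CAROOT)/rootCA.pem" \
#     https://grafana.localhost/api/health && echo "Grafana is up"
```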

🌐 Service URLs

Once running, access services at:

Service      URL                           Description
Web          https://localhost             Landing page (index.html)
Grafana      https://grafana.localhost     Main dashboard & visualization
Traefik      https://traefik.localhost     Reverse proxy dashboard
Portainer    https://portainer.localhost   Docker management UI
Prometheus   Internal only                 Metrics database

📊 Stack Components

Web (Landing Page)

  • Purpose: Serves the main landing page at the root domain
  • Technology: Nginx Alpine
  • Features:
    • Lightweight static file server
    • Hosts index.html with animated clock
    • Accessible at the root domain (e.g., https://localhost)
  • Resources: 32MB memory, 0.1 CPU

Traefik (Reverse Proxy)

  • Purpose: Routes all traffic with automatic HTTPS
  • Features:
    • Automatic service discovery via Docker labels
    • HTTP to HTTPS redirection
    • TLS certificate management
    • Access logs, metrics, and traces sent to the OpenTelemetry Collector

Example routing configuration:

labels:
  - traefik.enable=true
  - traefik.http.routers.myapp.rule=Host(`myapp.${DOMAIN_NAME}`)
  - traefik.http.routers.myapp.entrypoints=websecure
  - traefik.http.routers.myapp.tls=true
  - traefik.http.services.myapp.loadbalancer.server.port=8080

OpenTelemetry Collector

  • Purpose: Central telemetry data collector
  • Receives: Logs, metrics, and traces from Traefik and applications
  • Exports to: Loki (logs), Tempo (traces), Prometheus (metrics)
  • Endpoints:
    • HTTP: http://otel-collector:4318
    • gRPC: http://otel-collector:4317

Prometheus

  • Purpose: Time-series metrics database
  • Scrapes:
    • Prometheus itself
    • OpenTelemetry Collector
    • Node Exporter (host metrics)
    • cAdvisor (container metrics)
  • Retention: Configurable via environment variables

Loki

  • Purpose: Log aggregation system
  • Receives: Logs from OpenTelemetry Collector
  • Query: Via Grafana with LogQL

Tempo

  • Purpose: Distributed tracing backend
  • Receives: Traces from OpenTelemetry Collector
  • Query: Via Grafana with TraceQL

Grafana

  • Purpose: Visualization and dashboards
  • Pre-configured datasources:
    • Prometheus (metrics)
    • Loki (logs)
    • Tempo (traces)
  • Pre-loaded dashboards:
    • System Health Overview
    • Traefik Dashboard
    • Node Exporter
    • cAdvisor
    • Loki Logs

Node Exporter

  • Purpose: Host machine metrics
  • Exports: CPU, memory, disk, network stats

cAdvisor

  • Purpose: Container metrics
  • Exports: Container resource usage and performance

Portainer

  • Purpose: Docker container management
  • Features: Web UI for managing containers, images, volumes, networks

⚙️ Configuration

Environment Variables (.env)

# Domain configuration
DOMAIN_NAME=localhost                    # Base domain
TRAEFIK_SUBDOMAIN=traefik               # Traefik dashboard subdomain
PORTAINER_SUBDOMAIN=portainer           # Portainer subdomain
GRAFANA_SUBDOMAIN=grafana               # Grafana subdomain
PROMETHEUS_SUBDOMAIN=prometheus         # Prometheus subdomain
CADVISOR_SUBDOMAIN=cadvisor            # cAdvisor subdomain

# Prometheus retention
PROMETHEUS_RETENTION_TIME=30d           # How long to keep metrics
PROMETHEUS_RETENTION_SIZE=10GB          # Max storage size

# Grafana credentials
GRAFANA_ADMIN_USER=admin                # Grafana admin username
GRAFANA_ADMIN_PASSWORD=secret           # Grafana admin password
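
A missing or empty variable in .env usually only shows up once a container fails to start. As a rough sketch, the hypothetical helper `check_env` (not part of this template) reports required variables that are absent or have no value. It assumes simple `NAME=value` lines like the ones above.

```shell
# Hypothetical helper: report variables that are missing or empty in an env file.
# Arguments: path to the env file, then the variable names to require.
# Exit status is 0 when every variable has a value, non-zero otherwise.
check_env() (
  file="$1"
  shift
  missing=0
  for name in "$@"; do
    # Require NAME= followed by a character that is not whitespace or '#'.
    if ! grep -Eq "^${name}=[^#[:space:]]" "$file"; then
      echo "missing or empty: $name" >&2
      missing=1
    fi
  done
  [ "$missing" -eq 0 ]
)

# Usage:
#   check_env .env DOMAIN_NAME GRAFANA_ADMIN_USER GRAFANA_ADMIN_PASSWORD \
#     && docker compose up -d
```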

Adding Your Application

To add your own application to this stack with Traefik routing and observability:

services:
  myapp:
    image: myapp:latest
    container_name: myapp
    restart: unless-stopped
    networks:
      - frontend_network
    labels:
      # Enable Traefik
      - traefik.enable=true

      # Configure routing
      - traefik.http.routers.myapp.rule=Host(`myapp.${DOMAIN_NAME}`)
      - traefik.http.routers.myapp.entrypoints=websecure
      - traefik.http.routers.myapp.tls=true

      # Configure service
      - traefik.http.services.myapp.loadbalancer.server.port=8080
      - traefik.docker.network=frontend_network

    # Optional: Send telemetry to OpenTelemetry Collector
    environment:
      - OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318
      - OTEL_SERVICE_NAME=myapp

📈 Grafana Dashboards

The stack includes pre-configured dashboards:

  1. System Health Overview - Overall system health and metrics with real-time stats
    • System status indicator
    • Running containers count
    • CPU, Memory, and Disk usage at a glance
    • HTTP request rate and error monitoring
    • Average response time tracking
    • Detailed resource utilization graphs (CPU by core, Memory breakdown, Disk I/O)
    • Container-level metrics
  2. Traefik Dashboard - Request rates, latency, status codes
  3. Node Exporter - Host machine metrics
  4. cAdvisor - Container resource usage
  5. Loki Logs - Log exploration and analysis

Access at: https://grafana.localhost

Credentials (as configured in .env):

  • Username: ${GRAFANA_ADMIN_USER}
  • Password: ${GRAFANA_ADMIN_PASSWORD}

🔍 Observability Features

Metrics

  • Collected by Prometheus every 15 seconds
  • Available in Grafana for visualization
  • Pre-configured scrape configs for all services

Logs

  • Traefik access logs sent to Loki via OpenTelemetry
  • JSON format for easy parsing
  • Queryable via LogQL in Grafana

Traces

  • Traefik distributed tracing enabled
  • Traces sent to Tempo via OpenTelemetry
  • Queryable via TraceQL in Grafana
  • Correlations with logs and metrics

Alerts

  • Prometheus alert rules in infrastructure/prometheus/alerts/
  • System alerts (CPU, memory, disk)
  • Traefik alerts (error rates, latency)

🔒 Security Features

  • All services run with no-new-privileges security option
  • Backend services isolated in internal network
  • Frontend services accessible only via Traefik
  • Read-only root filesystems where applicable
  • Non-root users for Loki, Tempo, Prometheus, Grafana
  • TLS encryption for all external traffic
  • Docker socket mounted as read-only

📝 Resource Limits

All services have configured resource limits:

Service         Memory   CPU
Web             32MB     0.1
Traefik         128MB    0.1
OpenTelemetry   128MB    0.1
Prometheus      128MB    0.1
Grafana         128MB    0.1
Loki            128MB    0.1
Tempo           64MB     0.1
Node Exporter   16MB     0.1
cAdvisor        64MB     0.1

Adjust in compose.yml based on your needs.

🛠️ Maintenance

Viewing Logs

# All services
docker compose logs -f

# Specific service
docker compose logs -f grafana

Restarting Services

# All services
docker compose restart

# Specific service
docker compose restart prometheus

Updating Services

# Pull latest images
docker compose pull

# Recreate containers
docker compose up -d

Backup Data

# Backup all volumes
docker compose down
sudo tar -czf o11y-backup.tar.gz \
  /var/lib/docker/volumes/o11y-stack-template_*
docker compose up -d
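
The tar command above reads Docker's data directory directly, so it needs sudo and assumes the default `/var/lib/docker` location. An alternative sketch, assuming the `alpine` image can be pulled, archives each named volume through a throwaway container instead. `backup_volume` and the overridable `DOCKER` variable are hypothetical names, not part of this template.

```shell
# DOCKER can be overridden (for example DOCKER=echo) to preview the
# command without running it.
DOCKER="${DOCKER:-docker}"

# Hypothetical helper: archive one named volume into <dest_dir>/<volume>.tar.gz
# by mounting it read-only into a temporary Alpine container.
# Arguments: volume name, absolute path of the destination directory.
backup_volume() {
  $DOCKER run --rm \
    -v "$1":/data:ro \
    -v "$2":/backup \
    alpine tar -czf "/backup/$1.tar.gz" -C /data .
}

# Usage (stop the stack first so nothing writes during the backup):
#   mkdir -p "$PWD/backups"
#   for v in $(docker volume ls -q --filter name=o11y-stack-template_); do
#     backup_volume "$v" "$PWD/backups"
#   done
```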

🐛 Troubleshooting

Service won't start

# Check logs
docker compose logs <service-name>

# Check health status
docker compose ps

Can't access via HTTPS

  • Ensure certificates are generated correctly
  • Check if mkcert root CA is installed: mkcert -install
  • Verify DNS resolution: nslookup grafana.localhost

High resource usage

  • Adjust resource limits in compose.yml
  • Reduce Prometheus retention time
  • Increase scrape intervals (scrape less often)
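
Retention time and scrape interval both feed into Prometheus disk usage. The Prometheus storage documentation gives a rough formula: needed disk space = retention time in seconds × ingested samples per second × bytes per sample, at roughly 1 to 2 bytes per sample. The sketch below applies it. `estimate_tsdb_bytes` is a hypothetical helper, and the 10,000 active series is an illustrative assumption, not a measurement of this stack.

```shell
# Hypothetical helper: rough Prometheus disk estimate in bytes.
# Arguments: active series, scrape interval (seconds),
#            retention (days), bytes per sample.
estimate_tsdb_bytes() {
  echo $(( $1 * ($3 * 86400 / $2) * $4 ))
}

# 10,000 series scraped every 15 s, kept for 30 days, at 2 bytes per sample:
estimate_tsdb_bytes 10000 15 30 2   # prints 3456000000 (about 3.5 GB)

# The same series scraped every 60 s:
estimate_tsdb_bytes 10000 60 30 2   # prints 864000000 (about 0.9 GB)
```

Under these assumptions, scraping four times less often cuts the estimate to a quarter, and halving the retention time halves it.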

📄 License

This template is provided as-is for use in your projects.

🤝 Contributing

Feel free to submit issues and enhancement requests!


Built with ❤️ for modern observability
