
last9/last9-k8s-observability


Last9 OpenTelemetry Operator Setup

Automated setup script for deploying OpenTelemetry Operator, Collector, Kubernetes monitoring, and Events collection to your Kubernetes cluster with Last9 integration.

Features

  • ✅ One-command installation - Deploy everything with a single command
  • ✅ Flexible deployment options - Install only what you need (logs, traces, metrics, events)
  • ✅ Auto-instrumentation - Automatic instrumentation for Java, Python, Node.js, and more
  • ✅ Kubernetes monitoring - Full cluster observability with kube-prometheus-stack
  • ✅ Events collection - Capture and forward Kubernetes events
  • ✅ Cluster identification - Automatic cluster name detection and attribution
  • ✅ Tolerations support - Deploy on tainted nodes (control-plane, spot instances, etc.)
  • ✅ Environment customization - Override deployment environment and cluster name

Quick Start

Prerequisites

  • kubectl configured to access your Kubernetes cluster
  • helm (v3+) installed

Option 1: Install Everything (Recommended)

Installs OpenTelemetry Operator, Collector, Kubernetes monitoring stack, and Events agent:

./last9-otel-setup.sh \
  token="Basic <your-base64-token>" \
  endpoint="<your-otlp-endpoint>" \
  monitoring-endpoint="<your-metrics-endpoint>" \
  username="<your-username>" \
  password="<your-password>"

Quick Install (One-liner)

curl -fsSL https://raw.githubusercontent.com/last9/l9-otel-operator/main/last9-otel-setup.sh | bash -s -- \
  token="Basic <your-token>" \
  endpoint="<your-otlp-endpoint>" \
  monitoring-endpoint="<your-metrics-endpoint>" \
  username="<user>" \
  password="<pass>"

Installation Options

Option 2: Traces Only (Operator + Collector)

For applications that need distributed tracing:

./last9-otel-setup.sh operator-only \
  token="Basic <your-token>" \
  endpoint="<your-otlp-endpoint>"

Option 3: Logs Only (Collector without Operator)

For log collection use cases:

./last9-otel-setup.sh logs-only \
  token="Basic <your-token>" \
  endpoint="<your-otlp-endpoint>"

Option 4: Metrics Only (Kubernetes Monitoring)

For cluster metrics and monitoring:

./last9-otel-setup.sh monitoring-only \
  monitoring-endpoint="<your-metrics-endpoint>" \
  username="<your-username>" \
  password="<your-password>"

Option 5: Kubernetes Events Only

For Kubernetes events collection:

./last9-otel-setup.sh events-only \
  endpoint="<your-otlp-endpoint>" \
  token="Basic <your-base64-token>" \
  monitoring-endpoint="<your-metrics-endpoint>"

Advanced Configuration

Override Cluster Name

./last9-otel-setup.sh \
  token="..." \
  endpoint="..." \
  cluster="prod-us-east-1"

If not provided, the cluster name is auto-detected from kubectl config current-context.

Set Deployment Environment

./last9-otel-setup.sh \
  token="..." \
  endpoint="..." \
  env="production"

Default: staging for collector, local for auto-instrumentation.

Deploy with Tolerations

For deploying on nodes with taints (e.g., control-plane, monitoring nodes):

./last9-otel-setup.sh \
  token="..." \
  endpoint="..." \
  tolerations-file=/path/to/tolerations.yaml

Example tolerations files are provided in the examples/ directory:

  • tolerations-all-nodes.yaml - Deploy on all nodes including control-plane
  • tolerations-monitoring-nodes.yaml - Deploy on dedicated monitoring nodes
  • tolerations-spot-instances.yaml - Deploy on spot/preemptible instances
  • tolerations-multi-taint.yaml - Handle multiple taints
  • tolerations-nodeSelector-only.yaml - Use nodeSelector without tolerations
  • tolerations-gpu-nodes.yaml - Deploy on GPU nodes with nvidia.com/gpu taints
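
For reference, a tolerations file for control-plane and spot nodes might look like the following sketch. The key names follow standard Kubernetes taints; the files in examples/ are authoritative for the exact schema the script expects:

```yaml
# Hypothetical sketch only; see examples/ for the shipped files.
tolerations:
  # Allow scheduling on control-plane nodes
  - key: node-role.kubernetes.io/control-plane
    operator: Exists
    effect: NoSchedule
  # Allow scheduling on GKE spot instances
  - key: cloud.google.com/gke-spot
    operator: Equal
    value: "true"
    effect: NoSchedule
```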

Configuration Files

| File | Description |
|------|-------------|
| last9-otel-collector-values.yaml | OpenTelemetry Collector configuration for logs and traces |
| last9-otel-collector-metrics-values.yaml | Optional: application metrics scraping (Prometheus SD) |
| last9-otel-collector-gpu-values.yaml | Optional: GPU (DCGM) + Ray metrics scraping (includes app metrics) |
| k8s-monitoring-values.yaml | kube-prometheus-stack configuration for metrics |
| last9-kube-events-agent-values.yaml | Events collection agent configuration |
| collector-svc.yaml | Collector service for application instrumentation |
| instrumentation.yaml | Auto-instrumentation configuration |
| deploy.yaml | Sample application deployment with auto-instrumentation |
| tolerations.yaml | Sample tolerations configuration |

Placeholders

The following placeholders are automatically replaced during installation:

  • {{AUTH_TOKEN}} - Your Last9 authorization token
  • {{OTEL_ENDPOINT}} - Your OTEL endpoint URL
  • {{MONITORING_ENDPOINT}} - Your metrics endpoint URL
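
As a minimal sketch of how this substitution could work (the file content, endpoint, and token below are invented for illustration; the setup script's actual mechanism is authoritative):

```shell
# Write a tiny values file containing the documented placeholders.
cat > /tmp/last9-demo-values.yaml <<'EOF'
exporters:
  otlphttp:
    endpoint: "{{OTEL_ENDPOINT}}"
    headers:
      Authorization: "{{AUTH_TOKEN}}"
EOF

# Render it by substituting each placeholder with a concrete value,
# which is conceptually what the installer does (example values only).
sed -e 's|{{OTEL_ENDPOINT}}|https://otlp.example.last9.io|' \
    -e 's|{{AUTH_TOKEN}}|Basic abc123|' \
    /tmp/last9-demo-values.yaml > /tmp/last9-demo-rendered.yaml

cat /tmp/last9-demo-rendered.yaml
```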

Uninstallation

Uninstall Everything

./last9-otel-setup.sh uninstall-all

Uninstall Specific Components

# Uninstall only monitoring stack
./last9-otel-setup.sh uninstall function="uninstall_last9_monitoring"

# Uninstall only events agent
./last9-otel-setup.sh uninstall function="uninstall_events_agent"

# Uninstall OpenTelemetry components (operator + collector)
./last9-otel-setup.sh uninstall

Verification

After installation, verify the deployment:

# Check all pods in last9 namespace
kubectl get pods -n last9

# Check collector logs
kubectl logs -n last9 -l app.kubernetes.io/name=opentelemetry-collector

# Check monitoring stack
kubectl get prometheus -n last9

# Check events agent
kubectl get pods -n last9 -l app.kubernetes.io/name=last9-kube-events-agent

Auto-Instrumentation

The script automatically sets up instrumentation for:

  • ☕ Java - Automatic OTLP export
  • 🐍 Python - Automatic OTLP export
  • 🟢 Node.js - Automatic OTLP export
  • 🔵 Go - Manual instrumentation supported
  • 💎 Ruby - Coming soon

Application Metrics Scraping (Optional)

The OpenTelemetry Collector can automatically discover and scrape application metrics using Kubernetes service discovery with Prometheus-compatible scraping.

Note: This is an optional feature. Use last9-otel-collector-metrics-values.yaml to enable metrics scraping.

Enable Metrics Scraping

To enable application metrics scraping, deploy with the additional metrics configuration file:

# Deploy with metrics scraping enabled
helm upgrade --install last9-opentelemetry-collector open-telemetry/opentelemetry-collector \
  --namespace last9 \
  --version 0.125.0 \
  --values last9-otel-collector-values.yaml \
  --values last9-otel-collector-metrics-values.yaml

Configure Last9 Metrics Endpoint:

Before deploying, update these placeholders in last9-otel-collector-metrics-values.yaml:

  • {{LAST9_METRICS_ENDPOINT}} - Your Last9 Prometheus remote write URL
  • {{LAST9_METRICS_USERNAME}} - Your Last9 metrics username
  • {{LAST9_METRICS_PASSWORD}} - Your Last9 metrics password
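
For orientation, a Collector config using these placeholders typically pairs a basicauth extension with a prometheusremotewrite exporter. This is an illustrative shape only, not the shipped file:

```yaml
# Sketch of how the placeholders slot into the Collector config;
# last9-otel-collector-metrics-values.yaml is the source of truth.
extensions:
  basicauth/last9:
    client_auth:
      username: "{{LAST9_METRICS_USERNAME}}"
      password: "{{LAST9_METRICS_PASSWORD}}"
exporters:
  prometheusremotewrite:
    endpoint: "{{LAST9_METRICS_ENDPOINT}}"
    auth:
      authenticator: basicauth/last9
```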

Quick Start

Add these annotations to your pod template or service to enable automatic metrics scraping:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"  # Optional, defaults to /metrics

That's it! Your application metrics will be automatically:

  • Discovered - No manual configuration needed
  • Scraped - Every 30 seconds by default
  • Enriched - With pod, namespace, node labels
  • Exported - To Last9 via Prometheus remote write

How It Works

  1. Automatic Discovery - OTel Collector watches Kubernetes API for all pods/services
  2. Annotation-Based Filtering - Only scrapes resources with prometheus.io/scrape: "true"
  3. Metadata Enrichment - Adds Kubernetes labels automatically (pod, namespace, node, app)
  4. Direct Export - Sends metrics to Last9 Prometheus endpoint
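
Under the hood, steps 1-2 correspond to a Prometheus receiver with Kubernetes service discovery. The snippet below is an illustrative sketch only; the shipped last9-otel-collector-metrics-values.yaml is authoritative:

```yaml
# Sketch of annotation-based pod discovery and filtering.
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: kubernetes-pods
          scrape_interval: 30s
          kubernetes_sd_configs:
            - role: pod
          relabel_configs:
            # Keep only pods that opt in via the annotation
            - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
              action: keep
              regex: "true"
```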

Supported Annotations

| Annotation | Required | Default | Description |
|------------|----------|---------|-------------|
| prometheus.io/scrape | Yes | - | Set to "true" to enable scraping |
| prometheus.io/port | Yes | - | Port number exposing /metrics |
| prometheus.io/path | No | /metrics | HTTP path for metrics endpoint |
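
Since discovery watches both pods and services, the same annotations can be set on a Service object. A hypothetical example:

```yaml
# Illustrative Service-level opt-in; names are placeholders.
apiVersion: v1
kind: Service
metadata:
  name: my-app
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
spec:
  selector:
    app: my-app
  ports:
    - port: 8080
```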

Scaling

This setup scales automatically:

  • 1 service → Automatically scraped
  • 1000 services → Automatically scraped
  • No configuration changes needed when adding new services

Configuration Files

Base Configuration: last9-otel-collector-values.yaml

  • Traces and logs collection
  • Basic OTLP receiver
  • No metrics scraping

App Metrics Configuration: last9-otel-collector-metrics-values.yaml

  • Prometheus receiver with kubernetes_sd_configs for auto-discovery
  • prometheusremotewrite exporter for sending to Last9
  • RBAC for Kubernetes API access
  • Increased resource limits for collector pods
  • BasicAuth extension for Last9 metrics endpoint

GPU + App Metrics Configuration: last9-otel-collector-gpu-values.yaml

  • Everything in the app metrics configuration, plus:
  • DCGM GPU metrics scraping (NVIDIA GPU Operator or GKE-managed DCGM; see variants in the values file)
  • Ray head/worker metrics scraping (KubeRay Operator)
  • Cardinality control via metric keep-list for DCGM

Choose ONE metrics overlay:

  • App metrics only: --values last9-otel-collector-values.yaml --values last9-otel-collector-metrics-values.yaml
  • App + GPU metrics: --values last9-otel-collector-values.yaml --values last9-otel-collector-gpu-values.yaml

Verification

Check if metrics are being scraped:

# Check collector logs for scraping
kubectl logs -n last9 -l app.kubernetes.io/name=last9-otel-collector | grep kubernetes-pods

# Port-forward to collector metrics endpoint
kubectl port-forward -n last9 daemonset/last9-otel-collector 8888:8888

# Check scrape status
curl http://localhost:8888/metrics | grep scrape_samples_scraped

GPU Metrics (DCGM) & Ray Metrics Scraping

GPU and Ray metrics collection is opt-in. Use last9-otel-collector-gpu-values.yaml instead of the base metrics file to enable these scrape jobs. They use label-based discovery, so no annotation changes are needed on DCGM or Ray pods.

Enable GPU Metrics

helm upgrade --install last9-opentelemetry-collector open-telemetry/opentelemetry-collector \
  --namespace last9 \
  --version 0.125.0 \
  --values last9-otel-collector-values.yaml \
  --values last9-otel-collector-gpu-values.yaml

Note: Use last9-otel-collector-gpu-values.yaml instead of last9-otel-collector-metrics-values.yaml; the GPU file already includes all application metrics scrape jobs.

Prerequisites

  • DCGM metrics (pick one):
    • Self-managed (EKS, AKS, bare-metal): NVIDIA GPU Operator installed (includes DCGM Exporter with app.kubernetes.io/name=nvidia-dcgm-exporter label)
    • GKE: GPU node pools enabled; GKE auto-deploys the DCGM exporter in the gke-managed-system namespace (label: app.kubernetes.io/name=gke-managed-dcgm-exporter)
  • Ray metrics: KubeRay Operator installed (Ray pods carry ray.io/node-type and ray.io/cluster labels)

Important: The values file ships with two DCGM variants (A and B). Variant A (self-managed NVIDIA GPU Operator) is enabled by default. If you're on GKE, comment out Variant A and uncomment Variant B. See the inline comments in last9-otel-collector-gpu-values.yaml for details.

Scrape Jobs

| Job | Target | Label Selector | Namespace | Port | Interval |
|-----|--------|----------------|-----------|------|----------|
| dcgm-gpu-metrics (Variant A) | DCGM Exporter pods | app.kubernetes.io/name=nvidia-dcgm-exporter | All (auto-discovered) | 9400 | 15s |
| dcgm-gpu-metrics (Variant B) | GKE DCGM Exporter pods | app.kubernetes.io/name=gke-managed-dcgm-exporter | gke-managed-system | 9400 | 15s |
| ray-head | Ray head nodes | ray.io/node-type=head | All | 8080 | 30s |
| ray-workers | Ray worker nodes | ray.io/node-type=worker | All | 8080 | 30s |

DCGM Metrics Collected

The DCGM job includes a cardinality keep-list limiting collection to 18 key metrics:

  • Utilization: DCGM_FI_DEV_GPU_UTIL, DCGM_FI_DEV_MEM_COPY_UTIL, DCGM_FI_DEV_ENC_UTIL, DCGM_FI_DEV_DEC_UTIL
  • Memory: DCGM_FI_DEV_FB_FREE, DCGM_FI_DEV_FB_USED, DCGM_FI_DEV_FB_TOTAL
  • Temperature & Power: DCGM_FI_DEV_GPU_TEMP, DCGM_FI_DEV_MEMORY_TEMP, DCGM_FI_DEV_POWER_USAGE
  • Errors: DCGM_FI_DEV_XID_ERRORS, DCGM_FI_DEV_ECC_SBE_VOL_TOTAL, DCGM_FI_DEV_ECC_DBE_VOL_TOTAL
  • PCIe: DCGM_FI_DEV_PCIE_TX_THROUGHPUT, DCGM_FI_DEV_PCIE_RX_THROUGHPUT
  • Clock & Performance: DCGM_FI_DEV_SM_CLOCK, DCGM_FI_DEV_MEM_CLOCK, DCGM_FI_DEV_PSTATE

To add more DCGM metrics, extend the metric_relabel_configs regex in last9-otel-collector-gpu-values.yaml.
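
For example, extending the keep-list might look like the sketch below (DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL is just an illustrative addition; the full regex lives in the shipped values file):

```yaml
# Append extra metric names to the alternation in the shipped file
# rather than copying this fragment verbatim.
metric_relabel_configs:
  - source_labels: [__name__]
    action: keep
    regex: "DCGM_FI_DEV_GPU_UTIL|DCGM_FI_DEV_FB_USED|DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL"
```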

GPU Node Tolerations

If your GPU nodes have nvidia.com/gpu taints, the OTel Collector and node exporter need tolerations to schedule on those nodes:

./last9-otel-setup.sh \
  token="..." \
  endpoint="..." \
  monitoring-endpoint="..." \
  username="..." \
  password="..." \
  tolerations-file=examples/tolerations-gpu-nodes.yaml

See examples/tolerations-gpu-nodes.yaml for the toleration configuration.

Resource Scaling for Large GPU Fleets

For clusters with many GPU nodes, the collector must handle correspondingly more scrape targets. Suggested resource sizing:

| GPU Nodes | CPU Request / Limit | Memory Request / Limit |
|-----------|---------------------|------------------------|
| 1-10 | 250m / 500m | 512Mi / 1Gi |
| 10-50 | 500m / 1000m | 1Gi / 2Gi |
| 50-100 | 1000m / 2000m | 2Gi / 4Gi |

Override resources in your Helm values or pass a custom values file.
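
For example, a custom values overlay for the 10-50 node tier might set (this assumes the upstream opentelemetry-collector chart's top-level resources key):

```yaml
# Illustrative overlay; pass with an extra --values flag on helm upgrade.
resources:
  requests:
    cpu: 500m
    memory: 1Gi
  limits:
    cpu: 1000m
    memory: 2Gi
```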

Detect Your DCGM Variant

Before deploying, check which DCGM exporter is running in your cluster to pick the right variant:

# Check for GKE-managed DCGM exporter (Variant B)
kubectl get pods -n gke-managed-system -l app.kubernetes.io/name=gke-managed-dcgm-exporter

# Check for self-managed NVIDIA GPU Operator DCGM exporter (Variant A)
kubectl get pods -A -l app.kubernetes.io/name=nvidia-dcgm-exporter

  • If the first command returns pods → use Variant B (uncomment it, comment out Variant A)
  • If the second command returns pods → use Variant A (the default; no changes needed)
  • If neither returns pods → your DCGM exporter is not yet installed (see Prerequisites)

GPU & Ray Verification

# Verify DCGM exporter pods are discovered
kubectl logs -n last9 -l app.kubernetes.io/name=last9-otel-collector | grep dcgm-gpu-metrics

# Verify Ray head/worker pods are discovered
kubectl logs -n last9 -l app.kubernetes.io/name=last9-otel-collector | grep ray-head
kubectl logs -n last9 -l app.kubernetes.io/name=last9-otel-collector | grep ray-workers

# Check DCGM metrics are being scraped
kubectl port-forward -n last9 daemonset/last9-otel-collector 8888:8888
curl -s http://localhost:8888/metrics | grep -c "DCGM_FI_DEV"

# Check Ray metrics are being scraped
curl -s http://localhost:8888/metrics | grep scrape_samples_scraped | grep ray
