
last9/last9-k8s-observability


Last9 OpenTelemetry Operator Setup

Automated setup script for deploying OpenTelemetry Operator, Collector, Kubernetes monitoring, and Events collection to your Kubernetes cluster with Last9 integration.

Features

  • ✅ One-command installation - Deploy everything with a single command
  • ✅ Flexible deployment options - Install only what you need (logs, traces, metrics, events)
  • ✅ Auto-instrumentation - Automatic instrumentation for Java, Python, Node.js, and more
  • ✅ Kubernetes monitoring - Full cluster observability with kube-prometheus-stack
  • ✅ Events collection - Capture and forward Kubernetes events
  • ✅ Cluster identification - Automatic cluster name detection and attribution
  • ✅ Tolerations support - Deploy on tainted nodes (control-plane, spot instances, etc.)
  • ✅ Environment customization - Override deployment environment and cluster name

Quick Start

Prerequisites

  • kubectl configured to access your Kubernetes cluster
  • helm (v3+) installed

Option 1: Install Everything (Recommended)

Installs OpenTelemetry Operator, Collector, Kubernetes monitoring stack, and Events agent:

./last9-otel-setup.sh \
  token="Basic <your-base64-token>" \
  endpoint="<your-otlp-endpoint>" \
  monitoring-endpoint="<your-metrics-endpoint>" \
  username="<your-username>" \
  password="<your-password>"

Quick Install (One-liner)

curl -fsSL https://raw.githubusercontent.com/last9/l9-otel-operator/main/last9-otel-setup.sh | bash -s -- \
  token="Basic <your-token>" \
  endpoint="<your-otlp-endpoint>" \
  monitoring-endpoint="<your-metrics-endpoint>" \
  username="<user>" \
  password="<pass>"

Installation Options

Option 2: Traces Only (Operator + Collector)

For applications that need distributed tracing:

./last9-otel-setup.sh operator-only \
  token="Basic <your-token>" \
  endpoint="<your-otlp-endpoint>"

Option 3: Logs Only (Collector without Operator)

For log collection use cases:

./last9-otel-setup.sh logs-only \
  token="Basic <your-token>" \
  endpoint="<your-otlp-endpoint>"

Option 4: Metrics Only (Kubernetes Monitoring)

For cluster metrics and monitoring:

./last9-otel-setup.sh monitoring-only \
  monitoring-endpoint="<your-metrics-endpoint>" \
  username="<your-username>" \
  password="<your-password>"

Option 5: Kubernetes Events Only

For Kubernetes events collection:

./last9-otel-setup.sh events-only \
  endpoint="<your-otlp-endpoint>" \
  token="Basic <your-base64-token>" \
  monitoring-endpoint="<your-metrics-endpoint>"

Advanced Configuration

Override Cluster Name

./last9-otel-setup.sh \
  token="..." \
  endpoint="..." \
  cluster="prod-us-east-1"

If not provided, the cluster name is auto-detected from kubectl config current-context.

Set Deployment Environment

./last9-otel-setup.sh \
  token="..." \
  endpoint="..." \
  env="production"

Default: staging for collector, local for auto-instrumentation.

Deploy with Tolerations

For deploying on nodes with taints (e.g., control-plane, monitoring nodes):

./last9-otel-setup.sh \
  token="..." \
  endpoint="..." \
  tolerations-file=/path/to/tolerations.yaml

Example tolerations files are provided in the examples/ directory:

  • tolerations-all-nodes.yaml - Deploy on all nodes including control-plane
  • tolerations-monitoring-nodes.yaml - Deploy on dedicated monitoring nodes
  • tolerations-spot-instances.yaml - Deploy on spot/preemptible instances
  • tolerations-multi-taint.yaml - Handle multiple taints
  • tolerations-nodeSelector-only.yaml - Use nodeSelector without tolerations
  • tolerations-gpu-nodes.yaml - Deploy on GPU nodes with nvidia.com/gpu taints
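
For reference, a tolerations file for control-plane and spot nodes might look like the following sketch. The key names follow standard Kubernetes taints; the files in examples/ are authoritative for the exact schema the script expects:

```yaml
# Hypothetical sketch only; see examples/ for the shipped files.
tolerations:
  # Allow scheduling on control-plane nodes
  - key: node-role.kubernetes.io/control-plane
    operator: Exists
    effect: NoSchedule
  # Allow scheduling on GKE spot instances
  - key: cloud.google.com/gke-spot
    operator: Equal
    value: "true"
    effect: NoSchedule
```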

Configuration Files

| File | Description |
|------|-------------|
| last9-otel-collector-values.yaml | OpenTelemetry Collector configuration for logs and traces |
| last9-otel-collector-metrics-values.yaml | Optional: application metrics scraping (Prometheus SD) |
| last9-otel-collector-gpu-values.yaml | Optional: GPU (DCGM) + Ray metrics scraping (includes app metrics) |
| k8s-monitoring-values.yaml | kube-prometheus-stack configuration for metrics |
| last9-kube-events-agent-values.yaml | Events collection agent configuration |
| collector-svc.yaml | Collector service for application instrumentation |
| instrumentation.yaml | Auto-instrumentation configuration |
| deploy.yaml | Sample application deployment with auto-instrumentation |
| tolerations.yaml | Sample tolerations configuration |

Placeholders

The following placeholders are automatically replaced during installation:

  • {{AUTH_TOKEN}} - Your Last9 authorization token
  • {{OTEL_ENDPOINT}} - Your OTEL endpoint URL
  • {{MONITORING_ENDPOINT}} - Your metrics endpoint URL
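
As a minimal sketch of how this substitution could work (the file content, endpoint, and token below are invented for illustration; the setup script's actual mechanism is authoritative):

```shell
# Write a tiny values file containing the documented placeholders.
cat > /tmp/last9-demo-values.yaml <<'EOF'
exporters:
  otlphttp:
    endpoint: "{{OTEL_ENDPOINT}}"
    headers:
      Authorization: "{{AUTH_TOKEN}}"
EOF

# Render it by substituting each placeholder with a concrete value,
# which is conceptually what the installer does (example values only).
sed -e 's|{{OTEL_ENDPOINT}}|https://otlp.example.last9.io|' \
    -e 's|{{AUTH_TOKEN}}|Basic abc123|' \
    /tmp/last9-demo-values.yaml > /tmp/last9-demo-rendered.yaml

cat /tmp/last9-demo-rendered.yaml
```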

Uninstallation

Uninstall Everything

./last9-otel-setup.sh uninstall-all

Uninstall Specific Components

# Uninstall only monitoring stack
./last9-otel-setup.sh uninstall function="uninstall_last9_monitoring"

# Uninstall only events agent
./last9-otel-setup.sh uninstall function="uninstall_events_agent"

# Uninstall OpenTelemetry components (operator + collector)
./last9-otel-setup.sh uninstall

Verification

After installation, verify the deployment:

# Check all pods in last9 namespace
kubectl get pods -n last9

# Check collector logs
kubectl logs -n last9 -l app.kubernetes.io/name=opentelemetry-collector

# Check monitoring stack
kubectl get prometheus -n last9

# Check events agent
kubectl get pods -n last9 -l app.kubernetes.io/name=last9-kube-events-agent

Auto-Instrumentation

The script automatically sets up instrumentation for:

  • ☕ Java - Automatic OTLP export
  • 🐍 Python - Automatic OTLP export
  • 🟢 Node.js - Automatic OTLP export
  • 🔵 Go - Manual instrumentation supported
  • 💎 Ruby - Coming soon

Application Metrics Scraping (Optional)

The OpenTelemetry Collector can automatically discover and scrape application metrics using Kubernetes service discovery with Prometheus-compatible scraping.

Note: This is an optional feature. Use last9-otel-collector-metrics-values.yaml to enable metrics scraping.

Enable Metrics Scraping

To enable application metrics scraping, deploy with the additional metrics configuration file:

# Deploy with metrics scraping enabled
helm upgrade --install last9-opentelemetry-collector open-telemetry/opentelemetry-collector \
  --namespace last9 \
  --version 0.125.0 \
  --values last9-otel-collector-values.yaml \
  --values last9-otel-collector-metrics-values.yaml

Configure Last9 Metrics Endpoint:

Before deploying, update these placeholders in last9-otel-collector-metrics-values.yaml:

  • {{LAST9_METRICS_ENDPOINT}} - Your Last9 Prometheus remote write URL
  • {{LAST9_METRICS_USERNAME}} - Your Last9 metrics username
  • {{LAST9_METRICS_PASSWORD}} - Your Last9 metrics password
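
For orientation, a Collector config using these placeholders typically pairs a basicauth extension with a prometheusremotewrite exporter. This is an illustrative shape only, not the shipped file:

```yaml
# Sketch of how the placeholders slot into the Collector config;
# last9-otel-collector-metrics-values.yaml is the source of truth.
extensions:
  basicauth/last9:
    client_auth:
      username: "{{LAST9_METRICS_USERNAME}}"
      password: "{{LAST9_METRICS_PASSWORD}}"
exporters:
  prometheusremotewrite:
    endpoint: "{{LAST9_METRICS_ENDPOINT}}"
    auth:
      authenticator: basicauth/last9
```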

Quick Start

Add these annotations to your pod template or service to enable automatic metrics scraping:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"  # Optional, defaults to /metrics

That's it! Your application metrics will be automatically:

  • Discovered - No manual configuration needed
  • Scraped - Every 30 seconds by default
  • Enriched - With pod, namespace, node labels
  • Exported - To Last9 via Prometheus remote write

How It Works

  1. Automatic Discovery - OTel Collector watches Kubernetes API for all pods/services
  2. Annotation-Based Filtering - Only scrapes resources with prometheus.io/scrape: "true"
  3. Metadata Enrichment - Adds Kubernetes labels automatically (pod, namespace, node, app)
  4. Direct Export - Sends metrics to Last9 Prometheus endpoint
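
Under the hood, steps 1-2 correspond to a Prometheus receiver with Kubernetes service discovery. The snippet below is an illustrative sketch only; the shipped last9-otel-collector-metrics-values.yaml is authoritative:

```yaml
# Sketch of annotation-based pod discovery and filtering.
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: kubernetes-pods
          scrape_interval: 30s
          kubernetes_sd_configs:
            - role: pod
          relabel_configs:
            # Keep only pods that opt in via the annotation
            - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
              action: keep
              regex: "true"
```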

Supported Annotations

| Annotation | Required | Default | Description |
|------------|----------|---------|-------------|
| prometheus.io/scrape | Yes | - | Set to "true" to enable scraping |
| prometheus.io/port | Yes | - | Port number exposing /metrics |
| prometheus.io/path | No | /metrics | HTTP path for metrics endpoint |
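
Since discovery watches both pods and services, the same annotations can be set on a Service object. A hypothetical example:

```yaml
# Illustrative Service-level opt-in; names are placeholders.
apiVersion: v1
kind: Service
metadata:
  name: my-app
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
spec:
  selector:
    app: my-app
  ports:
    - port: 8080
```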

Scaling

This setup scales automatically:

  • 1 service → Automatically scraped
  • 1000 services → Automatically scraped
  • No configuration changes needed when adding new services

Configuration Files

Base Configuration: last9-otel-collector-values.yaml

  • Traces and logs collection
  • Basic OTLP receiver
  • No metrics scraping

App Metrics Configuration: last9-otel-collector-metrics-values.yaml

  • Prometheus receiver with kubernetes_sd_configs for auto-discovery
  • prometheusremotewrite exporter for sending to Last9
  • RBAC for Kubernetes API access
  • Increased resource limits for collector pods
  • BasicAuth extension for Last9 metrics endpoint

GPU + App Metrics Configuration: last9-otel-collector-gpu-values.yaml

  • Everything in the app metrics configuration, plus:
  • DCGM GPU metrics scraping (NVIDIA GPU Operator or GKE-managed DCGM; see variants in the values file)
  • Ray head/worker metrics scraping (KubeRay Operator)
  • Cardinality control via metric keep-list for DCGM

Choose ONE metrics overlay:

  • App metrics only: --values last9-otel-collector-values.yaml --values last9-otel-collector-metrics-values.yaml
  • App + GPU metrics: --values last9-otel-collector-values.yaml --values last9-otel-collector-gpu-values.yaml

Verification

Check if metrics are being scraped:

# Check collector logs for scraping
kubectl logs -n last9 -l app.kubernetes.io/name=last9-otel-collector | grep kubernetes-pods

# Port-forward to collector metrics endpoint
kubectl port-forward -n last9 daemonset/last9-otel-collector 8888:8888

# Check scrape status
curl http://localhost:8888/metrics | grep scrape_samples_scraped

GPU Metrics (DCGM) & Ray Metrics Scraping

GPU and Ray metrics collection is opt-in. Use last9-otel-collector-gpu-values.yaml instead of the base metrics file to enable these scrape jobs. They use label-based discovery, so no annotation changes are needed on DCGM or Ray pods.

Enable GPU Metrics

helm upgrade --install last9-opentelemetry-collector open-telemetry/opentelemetry-collector \
  --namespace last9 \
  --version 0.125.0 \
  --values last9-otel-collector-values.yaml \
  --values last9-otel-collector-gpu-values.yaml

Note: Use last9-otel-collector-gpu-values.yaml instead of last9-otel-collector-metrics-values.yaml; the GPU file already includes all application metrics scrape jobs.

Prerequisites

  • DCGM metrics (pick one):
    • Self-managed (EKS, AKS, bare-metal): NVIDIA GPU Operator installed (includes DCGM Exporter with app.kubernetes.io/name=nvidia-dcgm-exporter label)
    • GKE: GPU node pools enabled; GKE auto-deploys the DCGM exporter in the gke-managed-system namespace (label: app.kubernetes.io/name=gke-managed-dcgm-exporter)
  • Ray metrics: KubeRay Operator installed (Ray pods carry ray.io/node-type and ray.io/cluster labels)

Important: The values file ships with two DCGM variants (A and B). Variant A (self-managed NVIDIA GPU Operator) is enabled by default. If you're on GKE, comment out Variant A and uncomment Variant B. See the inline comments in last9-otel-collector-gpu-values.yaml for details.

Scrape Jobs

| Job | Target | Label Selector | Namespace | Port | Interval |
|-----|--------|----------------|-----------|------|----------|
| dcgm-gpu-metrics (Variant A) | DCGM Exporter pods | app.kubernetes.io/name=nvidia-dcgm-exporter | All (auto-discovered) | 9400 | 15s |
| dcgm-gpu-metrics (Variant B) | GKE DCGM Exporter pods | app.kubernetes.io/name=gke-managed-dcgm-exporter | gke-managed-system | 9400 | 15s |
| ray-head | Ray head nodes | ray.io/node-type=head | All | 8080 | 30s |
| ray-workers | Ray worker nodes | ray.io/node-type=worker | All | 8080 | 30s |

DCGM Metrics Collected

The DCGM job includes a cardinality keep-list limiting collection to 18 key metrics:

  • Utilization: DCGM_FI_DEV_GPU_UTIL, DCGM_FI_DEV_MEM_COPY_UTIL, DCGM_FI_DEV_ENC_UTIL, DCGM_FI_DEV_DEC_UTIL
  • Memory: DCGM_FI_DEV_FB_FREE, DCGM_FI_DEV_FB_USED, DCGM_FI_DEV_FB_TOTAL
  • Temperature & Power: DCGM_FI_DEV_GPU_TEMP, DCGM_FI_DEV_MEMORY_TEMP, DCGM_FI_DEV_POWER_USAGE
  • Errors: DCGM_FI_DEV_XID_ERRORS, DCGM_FI_DEV_ECC_SBE_VOL_TOTAL, DCGM_FI_DEV_ECC_DBE_VOL_TOTAL
  • PCIe: DCGM_FI_DEV_PCIE_TX_THROUGHPUT, DCGM_FI_DEV_PCIE_RX_THROUGHPUT
  • Clock & Performance: DCGM_FI_DEV_SM_CLOCK, DCGM_FI_DEV_MEM_CLOCK, DCGM_FI_DEV_PSTATE

To add more DCGM metrics, extend the metric_relabel_configs regex in last9-otel-collector-gpu-values.yaml.
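
For example, extending the keep-list might look like the sketch below (DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL is just an illustrative addition; the full regex lives in the shipped values file):

```yaml
# Append extra metric names to the alternation in the shipped file
# rather than copying this fragment verbatim.
metric_relabel_configs:
  - source_labels: [__name__]
    action: keep
    regex: "DCGM_FI_DEV_GPU_UTIL|DCGM_FI_DEV_FB_USED|DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL"
```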

GPU Node Tolerations

If your GPU nodes have nvidia.com/gpu taints, the OTel Collector and node exporter need tolerations to schedule on those nodes:

./last9-otel-setup.sh \
  token="..." \
  endpoint="..." \
  monitoring-endpoint="..." \
  username="..." \
  password="..." \
  tolerations-file=examples/tolerations-gpu-nodes.yaml

See examples/tolerations-gpu-nodes.yaml for the toleration configuration.

Resource Scaling for Large GPU Fleets

For clusters with many GPU nodes, the collector must handle correspondingly more scrape targets. Suggested resource sizing:

| GPU Nodes | CPU Request / Limit | Memory Request / Limit |
|-----------|---------------------|------------------------|
| 1-10 | 250m / 500m | 512Mi / 1Gi |
| 10-50 | 500m / 1000m | 1Gi / 2Gi |
| 50-100 | 1000m / 2000m | 2Gi / 4Gi |

Override resources in your Helm values or pass a custom values file.
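
For example, a custom values overlay for the 10-50 node tier might set (this assumes the upstream opentelemetry-collector chart's top-level resources key):

```yaml
# Illustrative overlay; pass with an extra --values flag on helm upgrade.
resources:
  requests:
    cpu: 500m
    memory: 1Gi
  limits:
    cpu: 1000m
    memory: 2Gi
```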

Detect Your DCGM Variant

Before deploying, check which DCGM exporter is running in your cluster to pick the right variant:

# Check for GKE-managed DCGM exporter (Variant B)
kubectl get pods -n gke-managed-system -l app.kubernetes.io/name=gke-managed-dcgm-exporter

# Check for self-managed NVIDIA GPU Operator DCGM exporter (Variant A)
kubectl get pods -A -l app.kubernetes.io/name=nvidia-dcgm-exporter

  • If the first command returns pods → use Variant B (uncomment it, comment out Variant A)
  • If the second command returns pods → use Variant A (the default; no changes needed)
  • If neither returns pods → your DCGM exporter is not yet installed (see Prerequisites)

GPU & Ray Verification

# Verify DCGM exporter pods are discovered
kubectl logs -n last9 -l app.kubernetes.io/name=last9-otel-collector | grep dcgm-gpu-metrics

# Verify Ray head/worker pods are discovered
kubectl logs -n last9 -l app.kubernetes.io/name=last9-otel-collector | grep ray-head
kubectl logs -n last9 -l app.kubernetes.io/name=last9-otel-collector | grep ray-workers

# Check DCGM metrics are being scraped
kubectl port-forward -n last9 daemonset/last9-otel-collector 8888:8888
curl -s http://localhost:8888/metrics | grep -c "DCGM_FI_DEV"

# Check Ray metrics are being scraped
curl -s http://localhost:8888/metrics | grep scrape_samples_scraped | grep ray
