Automated setup script for deploying OpenTelemetry Operator, Collector, Kubernetes monitoring, and Events collection to your Kubernetes cluster with Last9 integration.
- ✅ **One-command installation** - Deploy everything with a single command
- ✅ **Flexible deployment options** - Install only what you need (logs, traces, metrics, events)
- ✅ **Auto-instrumentation** - Automatic instrumentation for Java, Python, Node.js, and more
- ✅ **Kubernetes monitoring** - Full cluster observability with kube-prometheus-stack
- ✅ **Events collection** - Capture and forward Kubernetes events
- ✅ **Cluster identification** - Automatic cluster name detection and attribution
- ✅ **Tolerations support** - Deploy on tainted nodes (control-plane, spot instances, etc.)
- ✅ **Environment customization** - Override deployment environment and cluster name
- `kubectl` configured to access your Kubernetes cluster
- `helm` (v3+) installed
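A quick preflight check before running the script (a sketch; the setup script may perform its own checks as well):

```shell
# Report whether each required CLI is on PATH (kubectl and helm v3+ are needed)
check() { command -v "$1" >/dev/null 2>&1 && echo "$1: ok" || echo "$1: MISSING"; }
check kubectl
check helm
```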
Installs OpenTelemetry Operator, Collector, Kubernetes monitoring stack, and Events agent:
```bash
./last9-otel-setup.sh \
  token="Basic <your-base64-token>" \
  endpoint="<your-otlp-endpoint>" \
  monitoring-endpoint="<your-metrics-endpoint>" \
  username="<your-username>" \
  password="<your-password>"
```

Or run directly from the repository without cloning:

```bash
curl -fsSL https://raw.githubusercontent.com/last9/l9-otel-operator/main/last9-otel-setup.sh | bash -s -- \
  token="Basic <your-token>" \
  endpoint="<your-otlp-endpoint>" \
  monitoring-endpoint="<your-metrics-endpoint>" \
  username="<user>" \
  password="<pass>"
```

For applications that need distributed tracing:

```bash
./last9-otel-setup.sh operator-only \
  token="Basic <your-token>" \
  endpoint="<your-otlp-endpoint>"
```

For log collection use cases:

```bash
./last9-otel-setup.sh logs-only \
  token="Basic <your-token>" \
  endpoint="<your-otlp-endpoint>"
```

For cluster metrics and monitoring:

```bash
./last9-otel-setup.sh monitoring-only \
  monitoring-endpoint="<your-metrics-endpoint>" \
  username="<your-username>" \
  password="<your-password>"
```

For Kubernetes events collection:

```bash
./last9-otel-setup.sh events-only \
  endpoint="<your-otlp-endpoint>" \
  token="Basic <your-base64-token>" \
  monitoring-endpoint="<your-metrics-endpoint>"
```

To set a custom cluster name, pass `cluster=`:

```bash
./last9-otel-setup.sh \
  token="..." \
  endpoint="..." \
  cluster="prod-us-east-1"
```

If not provided, the cluster name is auto-detected from `kubectl config current-context`.

To override the deployment environment, pass `env=`:

```bash
./last9-otel-setup.sh \
  token="..." \
  endpoint="..." \
  env="production"
```

Default: `staging` for the collector, `local` for auto-instrumentation.

For deploying on nodes with taints (e.g., control-plane, monitoring nodes):

```bash
./last9-otel-setup.sh \
  token="..." \
  endpoint="..." \
  tolerations-file=/path/to/tolerations.yaml
```

Example tolerations files are provided in the `examples/` directory:

- `tolerations-all-nodes.yaml` - Deploy on all nodes including control-plane
- `tolerations-monitoring-nodes.yaml` - Deploy on dedicated monitoring nodes
- `tolerations-spot-instances.yaml` - Deploy on spot/preemptible instances
- `tolerations-multi-taint.yaml` - Handle multiple taints
- `tolerations-nodeSelector-only.yaml` - Use nodeSelector without tolerations
- `tolerations-gpu-nodes.yaml` - Deploy on GPU nodes with `nvidia.com/gpu` taints
| File | Description |
|---|---|
| `last9-otel-collector-values.yaml` | OpenTelemetry Collector configuration for logs and traces |
| `last9-otel-collector-metrics-values.yaml` | Optional: Application metrics scraping (Prometheus SD) |
| `last9-otel-collector-gpu-values.yaml` | Optional: GPU (DCGM) + Ray metrics scraping (includes app metrics) |
| `k8s-monitoring-values.yaml` | Kube-prometheus-stack configuration for metrics |
| `last9-kube-events-agent-values.yaml` | Events collection agent configuration |
| `collector-svc.yaml` | Collector service for application instrumentation |
| `instrumentation.yaml` | Auto-instrumentation configuration |
| `deploy.yaml` | Sample application deployment with auto-instrumentation |
| `tolerations.yaml` | Sample tolerations configuration |
The following placeholders are automatically replaced during installation:
- `{{AUTH_TOKEN}}` - Your Last9 authorization token
- `{{OTEL_ENDPOINT}}` - Your OTEL endpoint URL
- `{{MONITORING_ENDPOINT}}` - Your metrics endpoint URL
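The `{{AUTH_TOKEN}}` value is a standard HTTP Basic header. If you only have a username and password, one way to produce it (the credentials below are placeholders, not real Last9 values):

```shell
# Encode "user:pass" as base64 and prefix with "Basic " (placeholder credentials)
make_basic_token() { printf 'Basic %s' "$(printf '%s:%s' "$1" "$2" | base64)"; }
make_basic_token myuser mypass   # prints: Basic bXl1c2VyOm15cGFzcw==
```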
```bash
# Uninstall everything
./last9-otel-setup.sh uninstall-all

# Uninstall only monitoring stack
./last9-otel-setup.sh uninstall function="uninstall_last9_monitoring"

# Uninstall only events agent
./last9-otel-setup.sh uninstall function="uninstall_events_agent"

# Uninstall OpenTelemetry components (operator + collector)
./last9-otel-setup.sh uninstall
```

After installation, verify the deployment:

```bash
# Check all pods in last9 namespace
kubectl get pods -n last9

# Check collector logs
kubectl logs -n last9 -l app.kubernetes.io/name=opentelemetry-collector

# Check monitoring stack
kubectl get prometheus -n last9

# Check events agent
kubectl get pods -n last9 -l app.kubernetes.io/name=last9-kube-events-agent
```

The script automatically sets up instrumentation for:
- **Java** - Automatic OTLP export
- **Python** - Automatic OTLP export
- **Node.js** - Automatic OTLP export
- **Go** - Manual instrumentation supported
- **Ruby** - Coming soon
The OpenTelemetry Collector can automatically discover and scrape application metrics using Kubernetes service discovery with Prometheus-compatible scraping.
Note: This is an optional feature. Use `last9-otel-collector-metrics-values.yaml` to enable metrics scraping.
To enable application metrics scraping, deploy with the additional metrics configuration file:
```bash
# Deploy with metrics scraping enabled
helm upgrade --install last9-opentelemetry-collector open-telemetry/opentelemetry-collector \
  --namespace last9 \
  --version 0.125.0 \
  --values last9-otel-collector-values.yaml \
  --values last9-otel-collector-metrics-values.yaml
```

**Configure Last9 Metrics Endpoint:**

Before deploying, update these placeholders in `last9-otel-collector-metrics-values.yaml`:

- `{{LAST9_METRICS_ENDPOINT}}` - Your Last9 Prometheus remote write URL
- `{{LAST9_METRICS_USERNAME}}` - Your Last9 metrics username
- `{{LAST9_METRICS_PASSWORD}}` - Your Last9 metrics password
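For a manual install, the placeholders can be filled with `sed`; a sketch with hypothetical values (the setup script performs this substitution for you):

```shell
# Substitute the three metrics placeholders in a values file.
# Arguments: endpoint, username, password, path-to-values-file.
fill_placeholders() {
  sed -e "s|{{LAST9_METRICS_ENDPOINT}}|$1|" \
      -e "s|{{LAST9_METRICS_USERNAME}}|$2|" \
      -e "s|{{LAST9_METRICS_PASSWORD}}|$3|" "$4"
}
# Example (all values are placeholders):
# fill_placeholders "https://example.last9.io/v1/metrics/write" "metrics-user" "metrics-pass" \
#   last9-otel-collector-metrics-values.yaml > filled-values.yaml
```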
Add these annotations to your pod template or service to enable automatic metrics scraping:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"  # Optional, defaults to /metrics
```

That's it! Your application metrics will be automatically:
- **Discovered** - No manual configuration needed
- **Scraped** - Every 30 seconds by default
- **Enriched** - With pod, namespace, and node labels
- **Exported** - To Last9 via Prometheus remote write
- **Automatic Discovery** - OTel Collector watches the Kubernetes API for all pods/services
- **Annotation-Based Filtering** - Only scrapes resources with `prometheus.io/scrape: "true"`
- **Metadata Enrichment** - Adds Kubernetes labels automatically (pod, namespace, node, app)
- **Direct Export** - Sends metrics to the Last9 Prometheus endpoint
| Annotation | Required | Default | Description |
|---|---|---|---|
| `prometheus.io/scrape` | Yes | - | Set to `"true"` to enable scraping |
| `prometheus.io/port` | Yes | - | Port number exposing `/metrics` |
| `prometheus.io/path` | No | `/metrics` | HTTP path for metrics endpoint |
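The same annotations also work at the Service level; a minimal sketch (service name, selector, and port are hypothetical):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app          # hypothetical
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
spec:
  selector:
    app: my-app
  ports:
    - port: 8080
      targetPort: 8080
```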
This setup scales automatically:
- 1 service → Automatically scraped
- 1000 services → Automatically scraped
- No configuration changes needed when adding new services
**Base Configuration:** `last9-otel-collector-values.yaml`

- Traces and logs collection
- Basic OTLP receiver
- No metrics scraping
**App Metrics Configuration:** `last9-otel-collector-metrics-values.yaml`

- Prometheus receiver with `kubernetes_sd_configs` for auto-discovery
- `prometheusremotewrite` exporter for sending to Last9
- RBAC for Kubernetes API access
- Increased resource limits for collector pods
- BasicAuth extension for Last9 metrics endpoint
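As a rough sketch of what that overlay configures (illustrative only; the field names follow standard Prometheus service-discovery conventions, but the actual values file is authoritative):

```yaml
config:
  receivers:
    prometheus:
      config:
        scrape_configs:
          - job_name: kubernetes-pods
            kubernetes_sd_configs:
              - role: pod
            relabel_configs:
              # Keep only pods annotated prometheus.io/scrape: "true"
              - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
                action: keep
                regex: "true"
  exporters:
    prometheusremotewrite:
      endpoint: "{{LAST9_METRICS_ENDPOINT}}"
```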
**GPU + App Metrics Configuration:** `last9-otel-collector-gpu-values.yaml`

- Everything in the app metrics configuration, plus:
- DCGM GPU metrics scraping (NVIDIA GPU Operator or GKE-managed DCGM; see variants in the values file)
- Ray head/worker metrics scraping (KubeRay Operator)
- Cardinality control via metric keep-list for DCGM
Choose ONE metrics overlay:

- App metrics only: `--values last9-otel-collector-values.yaml --values last9-otel-collector-metrics-values.yaml`
- App + GPU metrics: `--values last9-otel-collector-values.yaml --values last9-otel-collector-gpu-values.yaml`
Check if metrics are being scraped:
```bash
# Check collector logs for scraping
kubectl logs -n last9 -l app.kubernetes.io/name=last9-otel-collector | grep kubernetes-pods

# Port-forward to collector metrics endpoint
kubectl port-forward -n last9 daemonset/last9-otel-collector 8888:8888

# Check scrape status
curl http://localhost:8888/metrics | grep scrape_samples_scraped
```

GPU and Ray metrics collection is opt-in. Use `last9-otel-collector-gpu-values.yaml` instead of the base metrics file to enable these scrape jobs. They use label-based discovery, so no annotation changes are needed on DCGM or Ray pods.
```bash
helm upgrade --install last9-opentelemetry-collector open-telemetry/opentelemetry-collector \
  --namespace last9 \
  --version 0.125.0 \
  --values last9-otel-collector-values.yaml \
  --values last9-otel-collector-gpu-values.yaml
```

Note: Use `last9-otel-collector-gpu-values.yaml` instead of `last9-otel-collector-metrics-values.yaml`; the GPU file already includes all application metrics scrape jobs.
- DCGM metrics (pick one):
  - Self-managed (EKS, AKS, bare-metal): NVIDIA GPU Operator installed (includes DCGM Exporter with the `app.kubernetes.io/name=nvidia-dcgm-exporter` label)
  - GKE: GPU node pools enabled; GKE auto-deploys the DCGM exporter in the `gke-managed-system` namespace (label: `app.kubernetes.io/name=gke-managed-dcgm-exporter`)
- Ray metrics: KubeRay Operator installed (Ray pods carry `ray.io/node-type` and `ray.io/cluster` labels)

Important: The values file ships with two DCGM variants (A and B). Variant A (self-managed NVIDIA GPU Operator) is enabled by default. If you're on GKE, comment out Variant A and uncomment Variant B. See the inline comments in `last9-otel-collector-gpu-values.yaml` for details.
| Job | Target | Label Selector | Namespace | Port | Interval |
|---|---|---|---|---|---|
| `dcgm-gpu-metrics` (Variant A) | DCGM Exporter pods | `app.kubernetes.io/name=nvidia-dcgm-exporter` | All (auto-discovered) | 9400 | 15s |
| `dcgm-gpu-metrics` (Variant B) | GKE DCGM Exporter pods | `app.kubernetes.io/name=gke-managed-dcgm-exporter` | `gke-managed-system` | 9400 | 15s |
| `ray-head` | Ray head nodes | `ray.io/node-type=head` | All | 8080 | 30s |
| `ray-workers` | Ray worker nodes | `ray.io/node-type=worker` | All | 8080 | 30s |
The DCGM job includes a cardinality keep-list limiting collection to 18 key metrics:
- Utilization: `DCGM_FI_DEV_GPU_UTIL`, `DCGM_FI_DEV_MEM_COPY_UTIL`, `DCGM_FI_DEV_ENC_UTIL`, `DCGM_FI_DEV_DEC_UTIL`
- Memory: `DCGM_FI_DEV_FB_FREE`, `DCGM_FI_DEV_FB_USED`, `DCGM_FI_DEV_FB_TOTAL`
- Temperature & Power: `DCGM_FI_DEV_GPU_TEMP`, `DCGM_FI_DEV_MEMORY_TEMP`, `DCGM_FI_DEV_POWER_USAGE`
- Errors: `DCGM_FI_DEV_XID_ERRORS`, `DCGM_FI_DEV_ECC_SBE_VOL_TOTAL`, `DCGM_FI_DEV_ECC_DBE_VOL_TOTAL`
- PCIe: `DCGM_FI_DEV_PCIE_TX_THROUGHPUT`, `DCGM_FI_DEV_PCIE_RX_THROUGHPUT`
- Clock & Performance: `DCGM_FI_DEV_SM_CLOCK`, `DCGM_FI_DEV_MEM_CLOCK`, `DCGM_FI_DEV_PSTATE`
To add more DCGM metrics, extend the `metric_relabel_configs` regex in `last9-otel-collector-gpu-values.yaml`.
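Illustratively, such a keep-list is a `metric_relabel_configs` entry whose regex enumerates the metric names (truncated here; the shipped file lists all 18):

```yaml
metric_relabel_configs:
  - source_labels: [__name__]
    action: keep
    # Append new DCGM metric names to this alternation to collect them
    regex: "DCGM_FI_DEV_(GPU_UTIL|FB_USED|FB_TOTAL|GPU_TEMP|POWER_USAGE|...)"
```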
If your GPU nodes have `nvidia.com/gpu` taints, the OTel Collector and node exporter need tolerations to schedule on those nodes:
```bash
./last9-otel-setup.sh \
  token="..." \
  endpoint="..." \
  monitoring-endpoint="..." \
  username="..." \
  password="..." \
  tolerations-file=examples/tolerations-gpu-nodes.yaml
```

See `examples/tolerations-gpu-nodes.yaml` for the toleration configuration.
For clusters with many GPU nodes, the collector must handle additional scrape targets. Suggested resource scaling:
| GPU Nodes | CPU Request/Limit | Memory Request/Limit |
|---|---|---|
| 1-10 | 250m / 500m | 512Mi / 1Gi |
| 10-50 | 500m / 1000m | 1Gi / 2Gi |
| 50-100 | 1000m / 2000m | 2Gi / 4Gi |
Override resources in your Helm values or pass a custom values file.
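Overriding resources takes a small extra values file passed via `--values`; a sketch for the 10-50 GPU node tier (the filename is hypothetical; the keys follow the opentelemetry-collector Helm chart's `resources` block):

```yaml
# collector-resources-values.yaml (hypothetical filename)
resources:
  requests:
    cpu: 500m
    memory: 1Gi
  limits:
    cpu: 1000m
    memory: 2Gi
```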
Before deploying, check which DCGM exporter is running in your cluster to pick the right variant:
```bash
# Check for GKE-managed DCGM exporter (Variant B)
kubectl get pods -n gke-managed-system -l app.kubernetes.io/name=gke-managed-dcgm-exporter

# Check for self-managed NVIDIA GPU Operator DCGM exporter (Variant A)
kubectl get pods -A -l app.kubernetes.io/name=nvidia-dcgm-exporter
```

- If the first command returns pods → use Variant B (uncomment it, comment out Variant A)
- If the second command returns pods → use Variant A (the default, no changes needed)
- If neither returns pods → your DCGM exporter is not yet installed (see Prerequisites)
```bash
# Verify DCGM exporter pods are discovered
kubectl logs -n last9 -l app.kubernetes.io/name=last9-otel-collector | grep dcgm-gpu-metrics

# Verify Ray head/worker pods are discovered
kubectl logs -n last9 -l app.kubernetes.io/name=last9-otel-collector | grep ray-head
kubectl logs -n last9 -l app.kubernetes.io/name=last9-otel-collector | grep ray-workers

# Check DCGM metrics are being scraped
kubectl port-forward -n last9 daemonset/last9-otel-collector 8888:8888
curl -s http://localhost:8888/metrics | grep -c "DCGM_FI_DEV"

# Check Ray metrics are being scraped
curl -s http://localhost:8888/metrics | grep scrape_samples_scraped | grep ray
```