A production-ready, high-performance metrics collector service written in Go that collects system and application metrics and ships them to remote endpoints with enterprise-grade security.
🚀 Features: System metrics (CPU, Memory, Disk, Network) • GPU monitoring (NVIDIA) • Application endpoint scraping • TLS/mTLS support • Prometheus & HTTP JSON shipping • Docker & Kubernetes ready
- Features
- Quick Start
- Architecture
- Installation
- Configuration
- Usage
- Shipper Types
- TLS Configuration
- Collected Metrics
- Security Considerations
- Deployment
- Performance Tuning
- Development
- Troubleshooting
- FAQ
- Contributing
- License
Get metricsd up and running in 5 minutes:
# Clone and build
git clone https://github.com/0x524A/metricsd.git
cd metricsd
go build -o bin/metricsd cmd/metricsd/main.go
# Create configuration
cp config.example.json config.json
# Edit config.json to set your endpoint
# For example, change endpoint to your Prometheus or metrics collector URL
# Run the service
./bin/metricsd -config config.json
# Check health
curl http://localhost:8080/health

With TLS:
# Generate self-signed certificates (for testing)
mkdir -p certs && cd certs
openssl req -x509 -newkey rsa:4096 -keyout client.key -out client.crt -days 365 -nodes \
-subj "/CN=metricsd-client"
cd ..
# Update config.json to enable TLS
# Set shipper.tls.enabled to true
# Set certificate paths in shipper.tls section
# Run with TLS
./bin/metricsd -config config.json

With Docker:
docker build -t metricsd:latest .
docker run -d -p 8080:8080 -v $(pwd)/config.json:/etc/metricsd/config.json:ro metricsd:latest

- **Comprehensive Metrics Collection**
- CPU usage (per-core and total utilization)
- Memory usage (RAM and swap statistics)
- Disk I/O and usage statistics
- Network I/O statistics
- GPU metrics via NVIDIA NVML (optional)
- Custom application endpoint scraping
- **Application Metrics Collection**
- HTTP endpoint scraping for application metrics
- Support for multiple application endpoints
- JSON-based metrics format
- Configurable timeout and retry logic
- **Flexible Shipping Options**
- Prometheus Remote Write protocol with Snappy compression
- HTTP JSON POST
- Advanced TLS/SSL support for secure transmission
- Configurable request timeouts
- **Enterprise-Grade Security**
- Full TLS 1.2/1.3 support with custom configuration
- Client certificate authentication (mTLS)
- Custom CA certificate support
- Configurable cipher suites
- SNI (Server Name Indication) support
- TLS version pinning (min/max)
- Session ticket management
- Optional certificate verification bypass for testing
- **Configurable & Extensible**
- JSON configuration with environment variable overrides
- Adjustable collection intervals
- Enable/disable specific metric collectors
- Health endpoint for monitoring
- Flexible shipper interface for custom backends
- **Plugin System**
- Shell script plugins with JSON output
- Automatic plugin discovery from directory
- Per-plugin timeout and interval scheduling
- Circuit breaker for failing plugins
- Compile-time Go plugin extension point
- Security: path validation, sandboxed execution environment
- **Splunk Integration**
- Splunk HEC (HTTP Event Collector) shipper
- JSON file shipper for Splunk Universal Forwarder
- Single-metric and multi-metric JSON formats
- **Debian Packaging**
- .deb packages for amd64 and arm64
- systemd service with security hardening
- Automatic user/group creation
- **Production-Ready**
- Structured logging with zerolog
- Graceful shutdown with cleanup
- Error handling and resilience
- SOLID design principles
- Resource cleanup and leak prevention
The service follows SOLID principles with a clean architecture:
metricsd/
├── cmd/metricsd/ # Application entry point
├── internal/
│ ├── collector/ # Collector interface, registry, system/GPU/HTTP collectors
│ ├── plugin/ # Plugin manager, exec plugin, discovery, security, Go registry
│ ├── config/ # Configuration management
│ ├── shipper/ # Prometheus, HTTP JSON, Splunk HEC, file shippers
│ ├── orchestrator/ # Collection orchestration (parallel, retry)
│ └── server/ # HTTP health endpoint
├── plugins/ # Shell script plugins + sidecar configs
├── packaging/debian/ # Debian package scripts + systemd service
└── docs/ # Plugin authoring guide, design specs
- Go 1.24 or later
- NVIDIA drivers and CUDA (optional, for GPU metrics)
# Clone the repository
git clone https://github.com/jainri3/metrics-collector.git
cd metrics-collector
# Download dependencies
go mod download
# Build the binary
go build -o bin/metrics-collector cmd/metrics-collector/main.go

Create a config.json file based on the example:
cp config.example.json config.json

{
"server": {
"host": "0.0.0.0",
"port": 8080
},
"collector": {
"interval_seconds": 60,
"enable_cpu": true,
"enable_memory": true,
"enable_disk": true,
"enable_network": true,
"enable_gpu": false,
"plugins": {
"enabled": false,
"plugins_dir": "./plugins",
"default_timeout_seconds": 30,
"validate_on_startup": true
}
},
"shipper": {
"type": "http_json",
"endpoint": "https://collector.example.com:9090/api/v1/metrics",
"timeout": 30000000000,
"tls": {
"enabled": true,
"cert_file": "/path/to/client-cert.pem",
"key_file": "/path/to/client-key.pem",
"ca_file": "/path/to/ca.pem",
"insecure_skip_verify": false,
"server_name": "collector.example.com",
"min_version": "TLS1.2",
"max_version": "TLS1.3",
"cipher_suites": [
"TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256",
"TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384"
],
"session_tickets": true
}
},
"endpoints": [
{
"name": "app1",
"url": "http://localhost:3000/metrics"
}
]
}

| Field | Description | Default |
|---|---|---|
| `server.host` | HTTP server bind address | `0.0.0.0` |
| `server.port` | HTTP server port | `8080` |
| `collector.interval_seconds` | Collection interval in seconds | `60` |
| `collector.enable_cpu` | Enable CPU metrics collection | `true` |
| `collector.enable_memory` | Enable memory metrics collection | `true` |
| `collector.enable_disk` | Enable disk metrics collection | `true` |
| `collector.enable_network` | Enable network metrics collection | `true` |
| `collector.enable_gpu` | Enable GPU metrics collection (requires NVIDIA GPU) | `false` |
| `shipper.type` | Shipper type: `prometheus_remote_write`, `http_json`, or `json_file` | - |
| `shipper.endpoint` | Remote endpoint URL | - |
| `shipper.timeout` | Request timeout in nanoseconds | `30000000000` (30s) |
| `shipper.tls.enabled` | Enable TLS/SSL | `false` |
| `shipper.tls.cert_file` | Path to client certificate file (PEM) | - |
| `shipper.tls.key_file` | Path to client private key file (PEM) | - |
| `shipper.tls.ca_file` | Path to CA certificate file for server verification | - |
| `shipper.tls.insecure_skip_verify` | Skip server certificate verification (not recommended) | `false` |
| `shipper.tls.server_name` | Server name for SNI (overrides hostname from endpoint) | - |
| `shipper.tls.min_version` | Minimum TLS version: `TLS1.0`, `TLS1.1`, `TLS1.2`, `TLS1.3` | `TLS1.2` |
| `shipper.tls.max_version` | Maximum TLS version: `TLS1.0`, `TLS1.1`, `TLS1.2`, `TLS1.3` | `TLS1.3` |
| `shipper.tls.cipher_suites` | Array of allowed cipher suites (see Cipher Suites section) | System defaults |
| `shipper.tls.session_tickets` | Enable TLS session ticket resumption | `true` |
| `endpoints` | Array of application HTTP endpoints to scrape | `[]` |
You can override configuration values using environment variables:
| Environment Variable | Description | Example |
|---|---|---|
| `MC_SERVER_HOST` | Server bind address | `0.0.0.0` |
| `MC_SERVER_PORT` | Server port number | `8080` |
| `MC_COLLECTOR_INTERVAL` | Collection interval in seconds | `60` |
| `MC_SHIPPER_TYPE` | Shipper type | `prometheus_remote_write` |
| `MC_SHIPPER_ENDPOINT` | Shipper endpoint URL | `https://metrics.example.com/write` |
| `MC_TLS_ENABLED` | Enable TLS | `true` |
| `MC_TLS_CERT_FILE` | Client certificate file path | `/etc/metricsd/certs/client.crt` |
| `MC_TLS_KEY_FILE` | Client private key file path | `/etc/metricsd/certs/client.key` |
| `MC_TLS_CA_FILE` | CA certificate file path | `/etc/metricsd/certs/ca.crt` |
| `MC_TLS_SERVER_NAME` | SNI server name | `collector.example.com` |
| `MC_TLS_MIN_VERSION` | Minimum TLS version | `TLS1.2` |
| `MC_TLS_INSECURE_SKIP_VERIFY` | Skip certificate verification | `false` |
| `MC_FILE_PATH` | File shipper output path | `/var/log/metricsd/metrics.json` |
| `MC_FILE_MAX_SIZE_MB` | Maximum file size before rotation (MB) | `100` |
| `MC_FILE_MAX_FILES` | Number of rotated files to keep | `5` |
metricsd supports shell script plugins that output JSON metrics. Plugins are automatically discovered from the configured plugins directory.
See Plugin Authoring Guide for full documentation.
Plugins are executable scripts that output a JSON array:
#!/bin/bash
echo '[{"name": "my_metric", "value": 42.5, "type": "gauge", "labels": {"env": "prod"}}]'

Each plugin can have a sidecar .json config file:
{
"name": "my_plugin",
"timeout": 30,
"enabled": true,
"interval_seconds": 60
}

For compile-time Go plugins, implement the `collector.Collector` interface and register via `plugin.RegisterGoPlugin()`. See the design spec for details.
# Run with default config.json
./bin/metrics-collector
# Run with custom config file
./bin/metrics-collector -config /path/to/config.json
# Set log level
./bin/metrics-collector -log-level debug

Available log levels:

- `debug` - Detailed debugging information
- `info` - General informational messages (default)
- `warn` - Warning messages
- `error` - Error messages only
The service exposes a health endpoint:
curl http://localhost:8080/health

Response:
{
"status": "healthy",
"timestamp": "2025-11-05T12:34:56Z",
"uptime": "1h23m45s"
}

Ships metrics using the Prometheus remote write protocol with Snappy compression.
{
"shipper": {
"type": "prometheus_remote_write",
"endpoint": "http://prometheus:9090/api/v1/write"
}
}

Ships metrics as JSON via HTTP POST.
{
"shipper": {
"type": "http_json",
"endpoint": "http://collector:8080/api/v1/metrics"
}
}

Payload format:
{
"timestamp": 1699185296,
"metrics": [
{
"name": "system_cpu_usage_percent",
"value": 45.2,
"type": "gauge",
"labels": {
"core": "0"
}
}
]
}

Ships metrics as JSON to a local file with automatic rotation. Ideal for Splunk Universal Forwarder integration or local storage.
{
"shipper": {
"type": "json_file",
"file": {
"path": "/var/log/metricsd/metrics.json",
"max_size_mb": 100,
"max_files": 5
}
}
}

Configuration Options:

- `path`: Output file path (required)
- `max_size_mb`: Maximum file size before rotation in MB (default: 100)
- `max_files`: Number of rotated files to keep (default: 5)
Use Cases:
- Integration with Splunk Universal Forwarder
- Local metric storage and backup
- Offline metric collection
- Log aggregation pipelines
The service supports advanced TLS configuration for secure communication with remote endpoints. This includes mutual TLS (mTLS), custom cipher suites, and version pinning.
For simple TLS with server certificate verification:
{
"shipper": {
"type": "prometheus_remote_write",
"endpoint": "https://metrics.example.com/api/v1/write",
"tls": {
"enabled": true,
"ca_file": "/etc/metricsd/certs/ca.pem"
}
}
}

For client certificate authentication:
{
"shipper": {
"type": "http_json",
"endpoint": "https://secure-collector.example.com/metrics",
"tls": {
"enabled": true,
"cert_file": "/etc/metricsd/certs/client.crt",
"key_file": "/etc/metricsd/certs/client.key",
"ca_file": "/etc/metricsd/certs/ca.crt",
"server_name": "secure-collector.example.com"
}
}
}

Full control over TLS parameters:
{
"shipper": {
"tls": {
"enabled": true,
"cert_file": "/etc/metricsd/certs/client.crt",
"key_file": "/etc/metricsd/certs/client.key",
"ca_file": "/etc/metricsd/certs/ca.crt",
"server_name": "metrics.internal.example.com",
"min_version": "TLS1.2",
"max_version": "TLS1.3",
"cipher_suites": [
"TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384",
"TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384",
"TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256",
"TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256"
],
"session_tickets": true,
"insecure_skip_verify": false
}
}
}

| Option | Description | Values |
|---|---|---|
| `enabled` | Enable/disable TLS | `true`, `false` |
| `cert_file` | Client certificate for mTLS | Path to PEM file |
| `key_file` | Client private key for mTLS | Path to PEM file |
| `ca_file` | CA certificate for server verification | Path to PEM file |
| `server_name` | SNI hostname override | Domain name |
| `min_version` | Minimum TLS version | `TLS1.0`, `TLS1.1`, `TLS1.2`, `TLS1.3` |
| `max_version` | Maximum TLS version | `TLS1.0`, `TLS1.1`, `TLS1.2`, `TLS1.3` |
| `cipher_suites` | Allowed cipher suites | Array of suite names |
| `session_tickets` | Enable session resumption | `true`, `false` |
| `insecure_skip_verify` | Skip certificate verification | `true`, `false` (not recommended for production) |
TLS 1.3 Cipher Suites:
- TLS_AES_128_GCM_SHA256
- TLS_AES_256_GCM_SHA384
- TLS_CHACHA20_POLY1305_SHA256
TLS 1.2 Cipher Suites (Recommended):
- TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256
- TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
- TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384
- TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
- TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256
- TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256
Additional TLS 1.2 Cipher Suites:
- TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256
- TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256
- TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA
- TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA
- TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA
- TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA
- TLS_RSA_WITH_AES_128_GCM_SHA256
- TLS_RSA_WITH_AES_256_GCM_SHA384
- TLS_RSA_WITH_AES_128_CBC_SHA256
- TLS_RSA_WITH_AES_128_CBC_SHA
- TLS_RSA_WITH_AES_256_CBC_SHA
Note: If cipher suites are not specified, Go's default secure cipher suite list will be used. TLS 1.3 cipher suites cannot be configured in Go and use the protocol's default settings.
- Use TLS 1.2 or higher - Set `min_version` to `TLS1.2` minimum
- Enable mTLS - Use client certificates for mutual authentication
- Verify certificates - Keep `insecure_skip_verify` as `false` in production
- Use strong cipher suites - Prefer ECDHE and AEAD ciphers
- Configure SNI - Set `server_name` when using name-based virtual hosting
- Rotate certificates - Implement a certificate rotation strategy
- Secure key storage - Protect private keys with appropriate file permissions
Generate self-signed CA:
openssl req -x509 -new -nodes -keyout ca.key -sha256 -days 1825 -out ca.crt \
-subj "/C=US/ST=State/L=City/O=Organization/CN=CA"

Generate client certificate:
# Generate private key
openssl genrsa -out client.key 2048
# Generate certificate signing request
openssl req -new -key client.key -out client.csr \
-subj "/C=US/ST=State/L=City/O=Organization/CN=metricsd-client"
# Sign with CA
openssl x509 -req -in client.csr -CA ca.crt -CAkey ca.key \
-CAcreateserial -out client.crt -days 825 -sha256

Set secure file permissions:
chmod 600 /etc/metricsd/certs/*.key
chmod 644 /etc/metricsd/certs/*.crt
chown metricsd:metricsd /etc/metricsd/certs/*

Certificate verification failed:
- Ensure CA certificate includes the full chain
- Verify `server_name` matches the certificate CN or SAN
- Check certificate expiration dates
Handshake failure:
- Verify cipher suites are compatible with server
- Check TLS version compatibility (min/max versions)
- Ensure client certificate is valid and trusted by server
Enable debug logging:
./bin/metricsd -log-level debug

CPU:
- `system_cpu_usage_percent` - Per-core CPU usage
- `system_cpu_usage_total_percent` - Overall CPU usage
- `system_cpu_count` - Number of CPU cores
Memory:
- `system_memory_total_bytes` - Total memory
- `system_memory_used_bytes` - Used memory
- `system_memory_available_bytes` - Available memory
- `system_memory_usage_percent` - Memory usage percentage
- `system_swap_total_bytes` - Total swap space
- `system_swap_used_bytes` - Used swap space
- `system_swap_usage_percent` - Swap usage percentage
Disk:
- `system_disk_total_bytes` - Total disk space
- `system_disk_used_bytes` - Used disk space
- `system_disk_free_bytes` - Free disk space
- `system_disk_usage_percent` - Disk usage percentage
- `system_disk_read_bytes_total` - Total bytes read
- `system_disk_write_bytes_total` - Total bytes written
- `system_disk_read_count_total` - Total read operations
- `system_disk_write_count_total` - Total write operations
Network:
- `system_network_bytes_sent_total` - Total bytes sent
- `system_network_bytes_recv_total` - Total bytes received
- `system_network_packets_sent_total` - Total packets sent
- `system_network_packets_recv_total` - Total packets received
- `system_network_errors_in_total` - Total input errors
- `system_network_errors_out_total` - Total output errors
- `system_network_drop_in_total` - Total input drops
- `system_network_drop_out_total` - Total output drops
GPU (NVIDIA):
- `system_gpu_count` - Number of GPUs
- `system_gpu_utilization_percent` - GPU utilization
- `system_gpu_memory_utilization_percent` - GPU memory utilization
- `system_gpu_memory_total_bytes` - Total GPU memory
- `system_gpu_memory_used_bytes` - Used GPU memory
- `system_gpu_memory_free_bytes` - Free GPU memory
- `system_gpu_temperature_celsius` - GPU temperature
- `system_gpu_power_usage_milliwatts` - GPU power usage
- `system_gpu_fan_speed_percent` - Fan speed
- `system_gpu_clock_sm_mhz` - SM clock speed
- `system_gpu_clock_memory_mhz` - Memory clock speed
Application metrics are prefixed with app_ and include the endpoint name as a label.
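Illustratively (the collector's exact naming may differ), that convention looks like:

```go
package main

import "fmt"

// appMetricName namespaces a scraped endpoint metric: prefix the raw
// name with "app_" and attach the endpoint name as a label.
// Hypothetical helper for illustration only.
func appMetricName(raw, endpoint string) (string, map[string]string) {
	return "app_" + raw, map[string]string{"endpoint": endpoint}
}

func main() {
	name, labels := appMetricName("requests_total", "app1")
	fmt.Println(name, labels["endpoint"]) // app_requests_total app1
}
```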
Protect sensitive configuration and certificate files:
# Configuration file
chmod 600 /opt/metricsd/config.json
chown metricsd:metricsd /opt/metricsd/config.json
# Certificate directory
chmod 700 /etc/metricsd/certs
chown -R metricsd:metricsd /etc/metricsd/certs
# Private keys
chmod 600 /etc/metricsd/certs/*.key
# Certificates
chmod 644 /etc/metricsd/certs/*.crt

Always run the service as a dedicated non-privileged user:
# Create dedicated user
sudo useradd -r -s /bin/false -d /opt/metricsd metricsd
# Set ownership
sudo chown -R metricsd:metricsd /opt/metricsd

- Use TLS for all remote communications
- Enable mTLS when possible for mutual authentication
- Restrict network access using firewalls
- Use internal/private networks when available
- Regularly update certificates before expiration
- Store sensitive values in environment variables
- Use secrets management tools (HashiCorp Vault, AWS Secrets Manager, etc.)
- Rotate credentials regularly
- Audit configuration changes
- Enable detailed logging for security monitoring
Create /etc/systemd/system/metricsd.service:
[Unit]
Description=Metrics Collector Service (metricsd)
Documentation=https://github.com/0x524A/metricsd
After=network-online.target
Wants=network-online.target
[Service]
Type=simple
User=metricsd
Group=metricsd
WorkingDirectory=/opt/metricsd
ExecStart=/opt/metricsd/bin/metricsd -config /opt/metricsd/config.json -log-level info
ExecReload=/bin/kill -HUP $MAINPID
Restart=on-failure
RestartSec=10
KillMode=process
TimeoutStopSec=30
# Security hardening
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
ProtectHome=true
ReadWritePaths=/opt/metricsd
ProtectKernelTunables=true
ProtectKernelModules=true
ProtectControlGroups=true
# Resource limits
LimitNOFILE=65536
LimitNPROC=512
[Install]
WantedBy=multi-user.target

Install and enable:
# Copy binary and config
sudo mkdir -p /opt/metricsd/{bin,certs}
sudo cp bin/metricsd /opt/metricsd/bin/
sudo cp config.json /opt/metricsd/
# Create user
sudo useradd -r -s /bin/false -d /opt/metricsd metricsd
# Set permissions
sudo chown -R metricsd:metricsd /opt/metricsd
sudo chmod 600 /opt/metricsd/config.json
sudo chmod 755 /opt/metricsd/bin/metricsd
# Install and start service
sudo cp metricsd.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable metricsd
sudo systemctl start metricsd
# Check status
sudo systemctl status metricsd
sudo journalctl -u metricsd -f

Prerequisites:
- Docker installed (version 20.10+ recommended)
- Docker Compose (optional, for easier deployment)
- At least 500MB free disk space for the image
Step 1: Create the Dockerfile
Create a file named Dockerfile in the project root:
FROM golang:1.24-bookworm AS builder
# Install build dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
git \
make \
&& rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY go.mod go.sum ./
RUN go mod download
COPY . .
# Build with all features including GPU support (NVML)
RUN go build -ldflags '-w -s' -o metricsd cmd/metricsd/main.go
FROM debian:bookworm-slim
# Install runtime dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
ca-certificates \
tzdata \
wget \
&& rm -rf /var/lib/apt/lists/*
# Create non-root user
RUN groupadd -g 1000 metricsd && \
useradd -r -u 1000 -g metricsd -s /bin/false metricsd
# Create directories
RUN mkdir -p /etc/metricsd/certs /var/lib/metricsd
RUN chown -R metricsd:metricsd /etc/metricsd /var/lib/metricsd
WORKDIR /home/metricsd
# Copy binary
COPY --from=builder /app/metricsd /usr/local/bin/metricsd
RUN chmod +x /usr/local/bin/metricsd
# Switch to non-root user
USER metricsd
# Health check
HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
CMD wget --no-verbose --tries=1 --spider http://localhost:8080/health || exit 1
EXPOSE 8080
ENTRYPOINT ["/usr/local/bin/metricsd"]
CMD ["-config", "/etc/metricsd/config.json"]

Step 2: Build the Image
# Basic build
docker build -t metricsd:latest .
# Build with custom tag
docker build -t metricsd:v1.0.0 .
# Build with specific platform (for cross-platform)
docker build --platform linux/amd64 -t metricsd:latest .
# Build with build arguments (only if your Dockerfile declares them)
docker build --build-arg GO_VERSION=1.24 -t metricsd:latest .
# Build with no cache (clean build)
docker build --no-cache -t metricsd:latest .
# Build and show build progress
docker build --progress=plain -t metricsd:latest .

Step 3: Verify the Build
# List the image
docker images | grep metricsd
# Check image size (varies with the base image; a debian-slim runtime stage is larger than a scratch-based one)
docker images metricsd:latest --format "{{.Size}}"
# Inspect the image
docker inspect metricsd:latest
# Test run (quick check)
docker run --rm metricsd:latest -help

Step 4: Tag for Registry (Optional)
# Tag for Docker Hub
docker tag metricsd:latest 0x524a/metricsd:latest
docker tag metricsd:latest 0x524a/metricsd:v1.0.0
# Tag for private registry
docker tag metricsd:latest registry.example.com/metricsd:latest
# Push to registry
docker push 0x524a/metricsd:latest

Optimizing the Build
Create a .dockerignore file to exclude unnecessary files:
# .dockerignore
.git
.gitignore
.github
README.md
LICENSE
*.md
.vscode
.idea
bin/
*.log
*.tmp
.env
.DS_Store
Makefile
docker-compose.yml
Build Troubleshooting
Common build issues:
# Issue: "cannot find package"
# Solution: Ensure go.mod and go.sum are present
go mod tidy
docker build -t metricsd:latest .
# Issue: "no space left on device"
# Solution: Clean up Docker
docker system prune -a --volumes
# Issue: Build is slow
# Solution: Use BuildKit (faster builds)
DOCKER_BUILDKIT=1 docker build -t metricsd:latest .
# Issue: Platform mismatch (M1 Mac, ARM)
# Solution: Build for specific platform
docker build --platform linux/amd64 -t metricsd:latest .
# Issue: Can't connect to Docker daemon
# Solution: Start Docker or check permissions
sudo systemctl start docker # Linux
sudo usermod -aG docker $USER # Add user to docker group

docker-compose.yml (for container metrics):
version: '3.8'
services:
metricsd:
build: .
image: metricsd:latest
container_name: metricsd
restart: unless-stopped
ports:
- "8080:8080"
volumes:
- ./config.json:/etc/metricsd/config.json:ro
- ./certs:/etc/metricsd/certs:ro
environment:
- MC_LOG_LEVEL=info
- MC_SHIPPER_ENDPOINT=https://prometheus:9090/api/v1/write
- MC_TLS_ENABLED=true
- MC_TLS_CERT_FILE=/etc/metricsd/certs/client.crt
- MC_TLS_KEY_FILE=/etc/metricsd/certs/client.key
- MC_TLS_CA_FILE=/etc/metricsd/certs/ca.crt
networks:
- metrics
healthcheck:
test: ["CMD", "wget", "--no-verbose", "--tries=1", "--spider", "http://localhost:8080/health"]
interval: 30s
timeout: 5s
retries: 3
start_period: 10s
networks:
metrics:
driver: bridge

docker-compose.yml (for HOST metrics - recommended for production):
version: '3.8'
services:
metricsd:
build: .
image: metricsd:latest
container_name: metricsd
restart: unless-stopped
# Use host network to access host metrics
network_mode: host
# Use host PID namespace to see host processes
pid: host
volumes:
# Mount host filesystems for accurate host metrics
- /:/rootfs:ro
- /proc:/host/proc:ro
- /sys:/host/sys:ro
- /var/run/docker.sock:/var/run/docker.sock:ro
- ./config.json:/etc/metricsd/config.json:ro
- ./certs:/etc/metricsd/certs:ro
environment:
# Tell gopsutil to use host filesystems
- HOST_PROC=/host/proc
- HOST_SYS=/host/sys
- HOST_ROOT=/rootfs
- MC_LOG_LEVEL=info
- MC_SHIPPER_ENDPOINT=https://prometheus:9090/api/v1/write
- MC_TLS_ENABLED=true
- MC_TLS_CERT_FILE=/etc/metricsd/certs/client.crt
- MC_TLS_KEY_FILE=/etc/metricsd/certs/client.key
- MC_TLS_CA_FILE=/etc/metricsd/certs/ca.crt
healthcheck:
test: ["CMD", "wget", "--no-verbose", "--tries=1", "--spider", "http://localhost:8080/health"]
interval: 30s
timeout: 5s
retries: 3
start_period: 10s
# Privileged mode may be needed for full system access
# privileged: true
# Or use specific capabilities
cap_add:
- SYS_PTRACE
- SYS_ADMIN

Prerequisites:
- Built Docker image (see steps above)
- `config.json` file prepared
- TLS certificates (optional, if using TLS)
Option 1: Quick Start (Container Metrics)
# Prepare configuration
cp config.example.json config.json
# Edit config.json with your settings
# Run container
docker run -d \
--name metricsd \
-p 8080:8080 \
-v $(pwd)/config.json:/etc/metricsd/config.json:ro \
-e MC_LOG_LEVEL=info \
metricsd:latest
# Check if it's running
docker ps | grep metricsd
# View logs
docker logs -f metricsd
# Check health
curl http://localhost:8080/health

Option 2: With TLS (Secure)
# Ensure you have certificates
ls -la certs/
# Should have: client.crt, client.key, ca.crt
# Run with TLS
docker run -d \
--name metricsd \
-p 8080:8080 \
-v $(pwd)/config.json:/etc/metricsd/config.json:ro \
-v $(pwd)/certs:/etc/metricsd/certs:ro \
-e MC_LOG_LEVEL=info \
-e MC_TLS_ENABLED=true \
-e MC_TLS_CERT_FILE=/etc/metricsd/certs/client.crt \
-e MC_TLS_KEY_FILE=/etc/metricsd/certs/client.key \
-e MC_TLS_CA_FILE=/etc/metricsd/certs/ca.crt \
metricsd:latest

Option 3: Host Metrics Collection (Recommended for Production)
This mounts host filesystems to collect actual host metrics instead of container metrics:
docker run -d \
--name metricsd-host \
--pid=host \
--network=host \
--restart=unless-stopped \
-v /:/rootfs:ro \
-v /proc:/host/proc:ro \
-v /sys:/host/sys:ro \
-v /var/run/docker.sock:/var/run/docker.sock:ro \
-v $(pwd)/config.json:/etc/metricsd/config.json:ro \
-v $(pwd)/certs:/etc/metricsd/certs:ro \
-e HOST_PROC=/host/proc \
-e HOST_SYS=/host/sys \
-e HOST_ROOT=/rootfs \
-e MC_LOG_LEVEL=info \
metricsd:latest

Option 4: Using Docker Compose (Easiest)
# Build and start
docker-compose up -d
# View logs
docker-compose logs -f metricsd
# Stop
docker-compose down
# Rebuild and restart
docker-compose up -d --build
# View service status
docker-compose ps

Container Management:
# Stop container
docker stop metricsd
# Start container
docker start metricsd
# Restart container
docker restart metricsd
# Remove container
docker rm -f metricsd
# View logs (last 100 lines)
docker logs --tail 100 metricsd
# Follow logs in real-time
docker logs -f metricsd
# Check container health status
docker inspect --format='{{.State.Health.Status}}' metricsd
# Execute command in container
docker exec -it metricsd sh
# View container resource usage
docker stats metricsd
# Export container logs to file
docker logs metricsd > metricsd.log 2>&1

Note: The Deployment below collects pod/container metrics. To collect node/host metrics in Kubernetes, use a DaemonSet instead. See the "Collecting Host Metrics from Docker Container" section for a DaemonSet example.
deployment.yaml (for pod metrics):
apiVersion: v1
kind: Namespace
metadata:
name: monitoring
---
apiVersion: v1
kind: ConfigMap
metadata:
name: metricsd-config
namespace: monitoring
data:
config.json: |
{
"server": {
"host": "0.0.0.0",
"port": 8080
},
"collector": {
"interval_seconds": 60,
"enable_cpu": true,
"enable_memory": true,
"enable_disk": true,
"enable_network": true,
"enable_gpu": false
},
"shipper": {
"type": "prometheus_remote_write",
"endpoint": "https://prometheus.monitoring.svc.cluster.local:9090/api/v1/write",
"timeout": 30000000000,
"tls": {
"enabled": true,
"cert_file": "/etc/metricsd/certs/tls.crt",
"key_file": "/etc/metricsd/certs/tls.key",
"ca_file": "/etc/metricsd/certs/ca.crt",
"server_name": "prometheus.monitoring.svc.cluster.local",
"min_version": "TLS1.2"
}
},
"endpoints": []
}
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: metricsd
namespace: monitoring
labels:
app: metricsd
spec:
replicas: 1
selector:
matchLabels:
app: metricsd
template:
metadata:
labels:
app: metricsd
spec:
serviceAccountName: metricsd
securityContext:
runAsNonRoot: true
runAsUser: 1000
fsGroup: 1000
containers:
- name: metricsd
image: metricsd:latest
imagePullPolicy: IfNotPresent
args:
- "-config"
- "/etc/metricsd/config.json"
- "-log-level"
- "info"
ports:
- name: http
containerPort: 8080
protocol: TCP
livenessProbe:
httpGet:
path: /health
port: http
initialDelaySeconds: 30
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 3
readinessProbe:
httpGet:
path: /health
port: http
initialDelaySeconds: 10
periodSeconds: 5
timeoutSeconds: 3
failureThreshold: 3
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 500m
memory: 512Mi
volumeMounts:
- name: config
mountPath: /etc/metricsd
readOnly: true
- name: certs
mountPath: /etc/metricsd/certs
readOnly: true
securityContext:
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true
capabilities:
drop:
- ALL
volumes:
- name: config
configMap:
name: metricsd-config
- name: certs
secret:
secretName: metricsd-tls
---
apiVersion: v1
kind: Service
metadata:
name: metricsd
namespace: monitoring
labels:
app: metricsd
spec:
type: ClusterIP
ports:
- port: 8080
targetPort: http
protocol: TCP
name: http
selector:
app: metricsd
---
apiVersion: v1
kind: ServiceAccount
metadata:
name: metricsd
namespace: monitoring

Create TLS secret:
kubectl create secret generic metricsd-tls \
--from-file=tls.crt=certs/client.crt \
--from-file=tls.key=certs/client.key \
--from-file=ca.crt=certs/ca.crt \
-n monitoring

Deploy:
kubectl apply -f deployment.yaml
kubectl get pods -n monitoring
kubectl logs -f -n monitoring deployment/metricsdBy default, a containerized application collects metrics from inside the container (container CPU, container memory, etc.). To collect metrics from the host system instead, you need to mount host filesystems into the container.
- Container metrics: Shows resource usage of the container itself (limited by cgroups)
- Host metrics: Shows actual host machine CPU, memory, disk, and network usage
- Use case: Monitoring the physical/virtual machine where Docker is running
Mount these host paths into your container:
| Host Path | Container Mount | Purpose |
|---|---|---|
| `/proc` | `/host/proc:ro` | Process information, CPU stats |
| `/sys` | `/host/sys:ro` | System information, block devices |
| `/` | `/rootfs:ro` | Root filesystem for disk metrics |
| `/var/run/docker.sock` | `/var/run/docker.sock:ro` | Docker socket (optional) |
Set these environment variables to tell the gopsutil library to use host paths:
HOST_PROC=/host/proc
HOST_SYS=/host/sys
HOST_ROOT=/rootfs

docker run -d \
--name metricsd-host-metrics \
--pid=host \
--network=host \
--restart=unless-stopped \
-v /:/rootfs:ro \
-v /proc:/host/proc:ro \
-v /sys:/host/sys:ro \
-v /var/run/docker.sock:/var/run/docker.sock:ro \
-v $(pwd)/config.json:/etc/metricsd/config.json:ro \
-e HOST_PROC=/host/proc \
-e HOST_SYS=/host/sys \
-e HOST_ROOT=/rootfs \
-e MC_LOG_LEVEL=info \
metricsd:latest

version: '3.8'
services:
metricsd-host:
image: metricsd:latest
container_name: metricsd-host-metrics
restart: unless-stopped
network_mode: host # Access host network interfaces
pid: host # Access host processes
volumes:
- /:/rootfs:ro
- /proc:/host/proc:ro
- /sys:/host/sys:ro
- /var/run/docker.sock:/var/run/docker.sock:ro
- ./config.json:/etc/metricsd/config.json:ro
- ./certs:/etc/metricsd/certs:ro
environment:
- HOST_PROC=/host/proc
- HOST_SYS=/host/sys
- HOST_ROOT=/rootfs
cap_add:
- SYS_PTRACE # For process monitoring

When collecting host metrics:
- ✅ Use read-only mounts (`:ro`) for host filesystems
- ✅ Minimize capabilities - only add what's needed (SYS_PTRACE, SYS_ADMIN)
- ⚠️ Avoid `privileged: true` unless absolutely necessary
- ✅ Run as non-root user when possible
- ✅ Review mounted paths - only mount what you need
For Kubernetes, use a DaemonSet to run one pod per node:
```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: metricsd-host
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: metricsd-host
  template:
    metadata:
      labels:
        app: metricsd-host
    spec:
      hostNetwork: true
      hostPID: true
      containers:
        - name: metricsd
          image: metricsd:latest
          env:
            - name: HOST_PROC
              value: /host/proc
            - name: HOST_SYS
              value: /host/sys
            - name: HOST_ROOT
              value: /rootfs
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
          volumeMounts:
            - name: proc
              mountPath: /host/proc
              readOnly: true
            - name: sys
              mountPath: /host/sys
              readOnly: true
            - name: root
              mountPath: /rootfs
              readOnly: true
            - name: config
              mountPath: /etc/metricsd
            - name: certs
              mountPath: /etc/metricsd/certs
          securityContext:
            capabilities:
              add:
                - SYS_PTRACE
      volumes:
        - name: proc
          hostPath:
            path: /proc
        - name: sys
          hostPath:
            path: /sys
        - name: root
          hostPath:
            path: /
        - name: config
          configMap:
            name: metricsd-config
        - name: certs
          secret:
            secretName: metricsd-tls
```

Check the logs to ensure host metrics are being collected:
```
# Check logs
docker logs metricsd-host-metrics

# You should see metrics for ALL host CPUs, not just container limits
# Example: if the host has 16 cores, you should see metrics for all 16

# Test with debug logging
docker run --rm -it \
  --pid=host \
  -v /proc:/host/proc:ro \
  -v /sys:/host/sys:ro \
  -v $(pwd)/config.json:/etc/metricsd/config.json:ro \
  -e HOST_PROC=/host/proc \
  -e HOST_SYS=/host/sys \
  metricsd:latest -config /etc/metricsd/config.json -log-level debug
```

Adjust the collection interval to your needs:
- High-frequency monitoring: 10-30 seconds
- Standard monitoring: 60 seconds (recommended)
- Low-frequency monitoring: 300+ seconds
- Enable session tickets - Reduces TLS handshake overhead
- Use TLS 1.3 - Faster handshake and better performance
- Connection pooling - Automatically handled by the HTTP client
- Keep-alive - Connections are reused between shipments
Typical resource usage:
- CPU: 50-200m (minimal overhead)
- Memory: 50-150 MB RSS
- Network: Depends on metric volume and shipping frequency
Optimize with:
```json
{
  "collector": {
    "interval_seconds": 60,
    "enable_cpu": true,
    "enable_memory": true,
    "enable_disk": false,
    "enable_network": false,
    "enable_gpu": false
  }
}
```

The service exposes its own health endpoint:
- Monitor HTTP response time at `/health`
- Check logs for shipping errors
- Monitor system resource usage
- Set up alerts for service failures
```
# Clone repository
git clone https://github.com/0x524A/metricsd.git
cd metricsd

# Install dependencies
go mod download

# Build
make build

# Run with development config
./bin/metricsd -config config.json -log-level debug
```

```
metricsd/
├── cmd/
│   └── metricsd/            # Main application entry point
│       └── main.go
├── internal/                # Internal packages
│   ├── collector/           # Metric collectors
│   │   ├── collector.go     # Collector interface & registry
│   │   ├── system.go        # System metrics (CPU, memory, disk, network)
│   │   ├── gpu.go           # GPU metrics (NVIDIA NVML)
│   │   └── http.go          # HTTP endpoint scraper
│   ├── config/              # Configuration management
│   │   └── config.go        # Config structs & validation
│   ├── shipper/             # Metric shipping backends
│   │   ├── shipper.go       # Shipper interface
│   │   ├── prometheus.go    # Prometheus remote write protocol
│   │   └── http_json.go     # HTTP JSON POST
│   ├── orchestrator/        # Collection & shipping coordination
│   │   └── orchestrator.go
│   └── server/              # HTTP server (health checks)
│       └── server.go
├── bin/                     # Compiled binaries
├── config.json              # Runtime configuration
├── config.example.json      # Example configuration
├── Makefile                 # Build automation
├── go.mod                   # Go module definition
└── README.md                # This file
```
```
# Run all tests
go test ./...

# Run with coverage
go test -cover ./...

# Generate coverage report
go test -coverprofile=coverage.out ./...
go tool cover -html=coverage.out

# Run specific package tests
go test ./internal/collector/...

# Run with verbose output
go test -v ./...

# Run benchmarks
go test -bench=. ./...
```

```
# Build for current platform
go build -o bin/metricsd cmd/metricsd/main.go

# Build with optimizations
go build -ldflags="-s -w" -o bin/metricsd cmd/metricsd/main.go

# Build for multiple platforms
GOOS=linux GOARCH=amd64 go build -o bin/metricsd-linux-amd64 cmd/metricsd/main.go
GOOS=darwin GOARCH=amd64 go build -o bin/metricsd-darwin-amd64 cmd/metricsd/main.go
GOOS=windows GOARCH=amd64 go build -o bin/metricsd-windows-amd64.exe cmd/metricsd/main.go

# Using Makefile (if available)
make build
make test
make clean
```

Follow standard Go conventions:
- Use `gofmt` for formatting
- Use `golangci-lint` for linting
- Use `go vet` for static analysis
```
# Format code
gofmt -w .

# Run linter
golangci-lint run

# Static analysis
go vet ./...
```

- Create a new collector in `internal/collector/`:
```go
package collector

import "context"

type MyCollector struct {
	// fields
}

func NewMyCollector() *MyCollector {
	return &MyCollector{}
}

func (c *MyCollector) Collect(ctx context.Context) ([]Metric, error) {
	// Implementation
	var metrics []Metric
	return metrics, nil
}

func (c *MyCollector) Name() string {
	return "my_collector"
}
```

- Register it in `cmd/metricsd/main.go`:
```go
myCollector := collector.NewMyCollector()
registry.Register(myCollector)
```

- Create a new shipper in `internal/shipper/`:
```go
package shipper

import (
	"context"
	"crypto/tls"
	"net/http"

	"github.com/0x524A/metricsd/internal/collector" // module path inferred from the repo URL
)

type MyShipper struct {
	endpoint string
	client   *http.Client
}

func NewMyShipper(endpoint string, tlsConfig *tls.Config) (*MyShipper, error) {
	return &MyShipper{
		endpoint: endpoint,
		client: &http.Client{
			Transport: &http.Transport{TLSClientConfig: tlsConfig},
		},
	}, nil
}

func (s *MyShipper) Ship(ctx context.Context, metrics []collector.Metric) error {
	// Implementation
	return nil
}

func (s *MyShipper) Close() error {
	// Cleanup
	return nil
}
```

- Add the shipper type to config validation in `internal/config/config.go`
- Add initialization in `cmd/metricsd/main.go`
The project adheres to SOLID principles:

- Single Responsibility Principle (SRP)
  - Each collector focuses on one metric source
  - Each shipper handles one protocol
  - The orchestrator only coordinates collection and shipping
- Open/Closed Principle (OCP)
  - New collectors can be added without modifying existing code
  - New shippers can be plugged in via the interface
  - Configuration is extensible
- Liskov Substitution Principle (LSP)
  - All collectors implement the `Collector` interface
  - All shippers implement the `Shipper` interface
  - Components are interchangeable
- Interface Segregation Principle (ISP)
  - Small, focused interfaces (`Collector`, `Shipper`)
  - Clients depend only on methods they use
  - No fat interfaces
- Dependency Inversion Principle (DIP)
  - High-level modules depend on abstractions (interfaces)
  - Concrete implementations are injected
  - Loose coupling throughout the codebase
Service won't start:
```
# Check logs
sudo journalctl -u metricsd -n 50

# Verify configuration
./bin/metricsd -config config.json  # Should show validation errors

# Check file permissions
ls -la /opt/metricsd/config.json
ls -la /etc/metricsd/certs/
```

TLS handshake errors:
```
# Test TLS connection
openssl s_client -connect metrics.example.com:443 \
  -cert /etc/metricsd/certs/client.crt \
  -key /etc/metricsd/certs/client.key \
  -CAfile /etc/metricsd/certs/ca.crt

# Verify certificate
openssl x509 -in /etc/metricsd/certs/client.crt -text -noout

# Check certificate expiration
openssl x509 -in /etc/metricsd/certs/client.crt -checkend 0
```

Metrics not shipping:
- Check network connectivity to endpoint
- Verify TLS configuration
- Check endpoint authentication requirements
- Review logs for error messages
- Test endpoint manually with curl
High memory usage:
- Reduce collection frequency
- Disable unused collectors
- Check for memory leaks in logs
- Monitor with pprof if needed
Permission denied errors:
```
# Fix ownership
sudo chown -R metricsd:metricsd /opt/metricsd
sudo chown -R metricsd:metricsd /etc/metricsd

# Fix permissions
sudo chmod 600 /opt/metricsd/config.json
sudo chmod 600 /etc/metricsd/certs/*.key
sudo chmod 644 /etc/metricsd/certs/*.crt
```

Q: Can I use metricsd without TLS?
A: Yes, set shipper.tls.enabled to false. However, TLS is strongly recommended for production.
Q: Does metricsd support custom metrics?
A: Yes, add application endpoints to the endpoints array in the configuration. The HTTP collector will scrape them.
Q: How do I rotate TLS certificates?
A: Update the certificate files, then restart the service. Consider implementing a certificate rotation process with minimal downtime.
Q: Can I ship to multiple endpoints?
A: Currently, one shipper endpoint is supported per instance. Run multiple instances for multiple destinations.
Q: What's the performance impact?
A: Minimal. Typical CPU usage is <1% and memory usage is around 50-150 MB depending on enabled collectors.
Q: How do I monitor metricsd itself?
A: Use the /health endpoint and monitor the service logs. You can also use process monitoring tools.
Q: Does it work on Windows?
A: Yes, but some system metrics may have limited support. GPU metrics require NVIDIA drivers.
Q: Can I use this with Grafana?
A: Yes, ship metrics to Prometheus (using remote write) and configure Grafana to query Prometheus.
Q: How do I debug TLS issues?
A: Enable debug logging with -log-level debug and review the detailed TLS handshake logs.
Q: Is IPv6 supported?
A: Yes, both IPv4 and IPv6 are supported for all network operations.
Q: How do I collect host metrics when running in Docker?
A: Mount the host's /proc, /sys, and / into the container and set environment variables. See the "Collecting Host Metrics from Docker Container" section for complete instructions.
Q: Why are my CPU/memory metrics showing container limits instead of host resources?
A: Without host filesystem mounts, the container only sees its own cgroup limits. Mount host paths and set HOST_PROC=/host/proc and HOST_SYS=/host/sys to collect host metrics.
- Add support for multiple shipper endpoints
- Implement metric aggregation and buffering
- Add support for metric filtering and transformation
- Implement retry logic with exponential backoff
- Add support for custom labels on system metrics
- Implement metric caching for offline scenarios
- Add Datadog, InfluxDB, and other shipper backends
- Add web UI for configuration and monitoring
- Implement metric sampling for high-volume scenarios
- Add support for Windows-specific metrics
- Implement health check with detailed status information
MIT License - see LICENSE file for details
Contributions are welcome! Here's how you can help:
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Make your changes
- Add tests for new functionality
- Ensure tests pass (`go test ./...`)
- Format your code (`gofmt -w .`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
- Follow Go best practices and idioms
- Maintain SOLID design principles
- Add tests for new functionality
- Update documentation as needed
- Keep commits atomic and well-described
- Ensure backward compatibility when possible
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Documentation: This README and inline code comments
When reporting bugs, please include:
- metricsd version
- Operating system and version
- Go version
- Configuration file (sanitized)
- Relevant log output
- Steps to reproduce
Feature requests are welcome! Please:
- Check existing issues first
- Provide detailed use case
- Explain expected behavior
- Consider contributing the feature
Built with:
- zerolog - Fast structured logging
- gopsutil - System metrics collection
- prometheus/client_golang - Prometheus integration
- NVML - GPU metrics
- Your Name - Initial work
See also the list of contributors who participated in this project.
Made with ❤️ by the metricsd team