A production-ready Kubernetes home lab running on Proxmox VE, demonstrating platform engineering skills through hands-on infrastructure operation. This project bridges my automotive systems engineering background with modern cloud-native technologies, showcasing the ability to design, deploy, and maintain production workloads on Kubernetes.
- Status: ✅ Operational (24 epics complete)
- Cluster: 5-node K3s v1.34.3+k3s1 (including GPU worker)
- Workloads: full ML inference stack, document management, workflow automation, Git hosting
- Observability: Prometheus, Grafana, Loki, Alertmanager
After years working on In-Vehicle Infotainment (IVI) and LBS navigation systems, building embedded, mobile, and cloud solutions for online routing, I'm expanding into cloud-native platform engineering. This lab serves as a hands-on portfolio demonstrating Kubernetes infrastructure deployment and operations. It's a working environment that I use daily, monitor, maintain, and continuously improve.
- Real operational complexity: Persistent storage, TLS certificates, load balancing, log aggregation, alerts
- Production practices: GitOps, ADRs, runbooks, backup/restore procedures, upgrade testing
- AI-assisted workflow: Leveraging Claude Code and BMAD methodology for systematic implementation
- Engineering judgment: Every decision documented with trade-off analysis (see ADRs)
- Automotive → K8s bridge: Applying systems thinking from real-time embedded work to distributed systems
┌─────────────────────────────────────────────────────────────────────────────┐
│ Tailscale VPN │
│ (Remote Access Layer) │
└──────────────────────────────────┬──────────────────────────────────────────┘
│
┌──────────────────────────────────▼──────────────────────────────────────────┐
│ Traefik Ingress │
│ (TLS Termination, Routing) │
└───┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────────┘
│ │ │ │ │ │ │
┌───▼───┐ ┌───▼───┐ ┌───▼───┐ ┌───▼───┐ ┌───▼───┐ ┌───▼───┐ ┌───▼───┐
│Grafana│ │Open- │ │Paper- │ │ Gitea │ │ n8n │ │Stirling│ │ K8s │
│ │ │WebUI │ │less │ │ │ │ │ │ PDF │ │Dashbd │
└───┬───┘ └───┬───┘ └───┬───┘ └───┬───┘ └───┬───┘ └───┬───┘ └───┬───┘
│ │ │ │ │ │ │
└─────────┴────┬────┴─────────┴────┬────┴─────────┴─────────┘
│ │
│ (AI + Data + Storage backends)
│ │
┌──────────────────▼───┐ ┌────────────▼───┐ ┌─────────────────────┐
│ LiteLLM │ │ PostgreSQL │ │ Synology DS920+ │
│ Proxy │ │ (data) │ │ (NFS + k3s-nas VM) │
└──────────┬───────────┘ └────────────────┘ └─────────────────────┘
│
┌──────────┼──────────┐
│ │ │
▼ ▼ ▼
┌───────┐ ┌────────┐ ┌────────┐
│ vLLM │ │ Ollama │ │ OpenAI │
│ (GPU) │ │ (CPU) │ │(Cloud) │
└───┬───┘ └────────┘ └────────┘
│
┌───▼────────────────┐
│ RTX 3060 eGPU │
│ (12GB VRAM) │
│ k3s-gpu-worker │
└────────────────────┘
| Node | Role | Specs | Purpose |
|---|---|---|---|
| k3s-master | Control plane | 192.168.2.20 | API server, etcd, scheduler |
| k3s-worker-01 | CPU worker | 192.168.2.21 | General workloads |
| k3s-worker-02 | CPU worker | 192.168.2.22 | General workloads |
| k3s-gpu-worker | GPU worker | Intel NUC + RTX 3060 eGPU (12GB) | ML inference (vLLM) |
| k3s-nas-worker | NAS worker | Synology DS920+ VM | Storage-adjacent workloads |
| Layer | Technology | Decision Rationale |
|---|---|---|
| Orchestration | K3s v1.34.3 | Lightweight, production-ready K8s. Half the memory of k0s, built-in storage/ingress. See ADR-001 |
| Compute | 5x nodes (VMs + bare metal) | Mixed: Proxmox VMs, Intel NUC with eGPU, Synology NAS VM |
| GPU Inference | vLLM + NVIDIA Container Toolkit | High-performance LLM serving with AWQ quantization |
| LLM Proxy | LiteLLM | Unified OpenAI-compatible API with automatic failover |
| Storage | NFS from Synology DS920+ | Existing NAS asset, CSI driver maturity, snapshot support |
| Ingress | Traefik (K3s bundled) | Zero-config LB, native K8s integration, automatic cert renewal |
| TLS | cert-manager + Let's Encrypt | Industry standard, automated renewal, staging/prod environments |
| Load Balancer | MetalLB (Layer 2) | Simple home network setup, no BGP complexity needed |
| Observability | kube-prometheus-stack + Loki | Complete stack (metrics, logs, alerting), Grafana dashboards, mobile alerts |
| GitOps | Git as source of truth | All manifests version-controlled, Helm values files, reproducible deployments |
Key Design Decisions:
- No public exposure: Tailscale VPN-only access (security > convenience)
- External storage: NFS over in-cluster solutions (leverage existing NAS investment, snapshots)
- Three-tier ML inference: vLLM (GPU) → Ollama (CPU) → OpenAI (cloud) with automatic failover
- Helm for apps: values files over `--set` flags (version control, repeatability)
- ADR documentation: every architectural choice captured with context and alternatives
See docs/adrs/ for detailed decision records.
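For context, a cert-manager ClusterIssuer for Let's Encrypt might look like the sketch below. Because the cluster is VPN-only, HTTP-01 challenges cannot reach it, so a DNS-01 solver is assumed; the provider (Cloudflare here), secret names, and email are placeholders rather than this repo's actual configuration:

```shell
kubectl apply -f - <<'EOF'
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-staging
spec:
  acme:
    # Staging endpoint avoids production rate limits while testing.
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    email: [email protected]            # placeholder
    privateKeySecretRef:
      name: letsencrypt-staging-key
    solvers:
      - dns01:
          cloudflare:
            apiTokenSecretRef:
              name: cloudflare-api-token  # placeholder secret
              key: api-token
EOF
```

Switching `server` to the production ACME endpoint is the only change needed once staging certificates issue cleanly.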
The cluster runs a sophisticated ML inference platform with automatic failover:
LiteLLM Proxy → vLLM (GPU, primary) → Ollama (CPU, fallback) → OpenAI (cloud, emergency)
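Because LiteLLM exposes an OpenAI-compatible API, any OpenAI client can use the whole fallback chain transparently. A hedged example (the model alias and API key variable are assumptions; check the LiteLLM values file for the real model names):

```shell
curl -s https://litellm.home.jetzinger.com/v1/chat/completions \
  -H "Authorization: Bearer $LITELLM_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-7b",
    "messages": [{"role": "user", "content": "Summarize this cluster in one line."}]
  }'
```

If vLLM is unavailable (for example, the GPU is in gaming mode), the same request is served by Ollama, and by OpenAI if both local tiers fail.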
GPU Modes (switchable on k3s-gpu-worker):
- `ml` - Qwen 2.5 7B AWQ (general inference, 8K context)
- `r1` - DeepSeek-R1 7B AWQ (reasoning tasks with chain-of-thought)
- `gaming` - vLLM scaled to 0, GPU released for Steam
# Check current mode
ssh k3s-gpu-worker "gpu-mode status"
# Switch modes
ssh k3s-gpu-worker "gpu-mode ml" # General inference
ssh k3s-gpu-worker "gpu-mode r1" # Reasoning model
ssh k3s-gpu-worker "gpu-mode gaming"  # Release GPU

Consumers:
- Open-WebUI: ChatGPT-like interface at https://chat.home.jetzinger.com
- Paperless-AI: Document classification and RAG queries
- n8n: Workflow automation with LLM integration
- OpenClaw: Personal AI assistant (see below)
OpenClaw is a self-hosted AI assistant running on the K3s cluster, accessible via Telegram and Discord. It provides frontier-quality conversational AI with automatic fallback to local inference when cloud APIs are unavailable.
┌─────────────────────────────────────────────────────────────────────────────┐
│ OpenClaw Gateway (apps namespace) │
│ https://openclaw.home.jetzinger.com │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Telegram │ │ Discord │ │ Web UI │ │
│ │ (outbound) │ │ (outbound) │ │ (Traefik) │ │
│ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ │
│ │ │ │ │
│ └────────────────┼────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ Agent Engine │ │
│ │ + LanceDB │ │
│ │ (long-term │ │
│ │ memory) │ │
│ └────────┬────────┘ │
│ │ │
│ ┌────────────────┴────────────────┐ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────────┐ ┌─────────────────────┐ │
│ │ Claude Opus │ │ LiteLLM Fallback │ │
│ │ 4.5 │──────────▶│ (existing stack) │ │
│ │ (PRIMARY) │ on fail │ │ │
│ │ │ │ vLLM → Ollama → │ │
│ │ Anthropic OAuth │ │ OpenAI │ │
│ └─────────────────┘ └─────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
| Feature | Implementation |
|---|---|
| Primary LLM | Claude Opus 4.5 via Anthropic OAuth (frontier reasoning) |
| Fallback Chain | LiteLLM → vLLM (GPU) → Ollama (CPU) → OpenAI (cloud) |
| Messaging | Telegram, Discord with DM allowlist security |
| Memory | LanceDB with OpenAI embeddings for long-term context |
| Voice | ElevenLabs TTS/STT (optional) |
| Monitoring | Grafana dashboard (LogQL), P1 alerts via ntfy.sh |
Unlike typical setups where local models are primary, OpenClaw uses an "inverse fallback" approach:
- Primary: Claude Opus 4.5 (best reasoning quality)
- Fallback: Existing LiteLLM three-tier stack (when cloud unavailable)
This ensures the best possible responses while maintaining high availability.
| Access Method | URL/Details |
|---|---|
| Control UI | https://openclaw.home.jetzinger.com (Tailscale required) |
| Telegram | DM the bot (allowlisted users only) |
| Discord | Bot in private server (allowlisted users only) |
See ADR-011 for detailed architectural decisions.
Prerequisites:
- 3x VMs (2 CPU, 4GB RAM each) or bare metal nodes
- Ubuntu 22.04 LTS installed on all nodes
- Network: Static IPs assigned, nodes can reach each other
- Optional: Tailscale account for remote access
- Optional: NFS server for persistent storage
- Optional: NVIDIA GPU for ML inference
Time to working cluster: ~90 minutes (tested)
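If you opt into Tailscale for remote access, each node can join the tailnet before K3s is installed (official install script; `tailscale up` prints a login URL to authorize the device):

```shell
curl -fsSL https://tailscale.com/install.sh | sh
sudo tailscale up
```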
# On k3s-master node (192.168.2.20)
git clone https://github.com/tjetzinger/home-lab.git
cd home-lab/infrastructure/k3s
chmod +x install-master.sh
sudo ./install-master.sh
# Verify control plane
sudo kubectl get nodes
sudo kubectl get pods -n kube-system

See docs/implementation-artifacts/1-1-create-k3s-control-plane.md for a detailed walkthrough.
# On each worker node (192.168.2.21, 192.168.2.22)
chmod +x install-worker.sh
sudo K3S_URL=https://192.168.2.20:6443 \
K3S_TOKEN=<token-from-master> \
./install-worker.sh
# Verify cluster
kubectl get nodes
# Should show: k3s-master, k3s-worker-01, k3s-worker-02

See docs/implementation-artifacts/1-2-add-first-worker-node.md and 1-3-add-second-worker-node.md.
# Configure local kubectl to access cluster remotely via Tailscale
./infrastructure/k3s/kubeconfig-setup.sh
kubectl get nodes  # Should work from your laptop now

See docs/implementation-artifacts/1-4-configure-remote-kubectl-access.md.
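If you prefer not to use the helper script, the manual equivalent is roughly the sketch below (the script's actual behavior may differ): copy the kubeconfig that K3s writes on the master and point it at an address reachable from your laptop.

```shell
# K3s writes its kubeconfig here on the control-plane node.
ssh k3s-master "sudo cat /etc/rancher/k3s/k3s.yaml" > ~/.kube/config-homelab

# Replace the loopback address with the master's LAN or Tailscale IP.
sed -i 's/127.0.0.1/192.168.2.20/' ~/.kube/config-homelab

export KUBECONFIG=~/.kube/config-homelab
kubectl get nodes
```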
Deploy storage, load balancing, and TLS:
# NFS Storage Provisioner
helm upgrade --install nfs-subdir-external-provisioner \
nfs-subdir-external-provisioner/nfs-subdir-external-provisioner \
-f infrastructure/nfs/values-homelab.yaml \
-n kube-system
# MetalLB Load Balancer
helm upgrade --install metallb metallb/metallb \
-f infrastructure/metallb/values-homelab.yaml \
-n infra --create-namespace
# cert-manager for TLS
helm upgrade --install cert-manager jetstack/cert-manager \
-f infrastructure/cert-manager/values-homelab.yaml \
  -n infra --set installCRDs=true

Detailed procedures: Epic 2 stories (2-1 through 2-4) and Epic 3 stories (3-1 through 3-5).
# kube-prometheus-stack (Prometheus + Grafana + Alertmanager)
helm upgrade --install kube-prometheus-stack \
prometheus-community/kube-prometheus-stack \
-f monitoring/prometheus/values-homelab.yaml \
-n monitoring --create-namespace
# Loki for log aggregation
helm upgrade --install loki grafana/loki-stack \
-f monitoring/loki/values-homelab.yaml \
-n monitoring
# Access Grafana
kubectl get ingress -n monitoring
# Visit https://grafana.home.jetzinger.com

See Epic 4 stories (4-1 through 4-6) for observability setup details.
# Check all nodes healthy
kubectl get nodes
# Check system pods running
kubectl get pods -n kube-system
kubectl get pods -n infra
kubectl get pods -n monitoring
# Verify storage class
kubectl get storageclass
# Check TLS certificates
kubectl get certificate -A
# Test ingress
curl https://grafana.home.jetzinger.com

home-lab/
├── infrastructure/ # Core cluster components
│ ├── k3s/ # Control plane and worker install scripts
│ ├── nfs/ # NFS CSI provisioner Helm values
│ ├── metallb/ # MetalLB load balancer config
│ ├── cert-manager/ # TLS certificate automation
│ └── traefik/ # Ingress controller config (K3s bundled)
│
├── applications/ # Workload deployments
│ ├── vllm/ # GPU inference engine (Qwen, DeepSeek-R1)
│ ├── litellm/ # LLM proxy with fallback chain
│ ├── ollama/ # CPU fallback inference
│ ├── open-webui/ # ChatGPT-like interface
│ ├── paperless/ # Document management (Paperless-ngx)
│ ├── paperless-ai/ # AI-powered document classification
│ ├── gitea/ # Self-hosted Git
│ ├── postgres/ # PostgreSQL database
│ ├── n8n/ # Workflow automation
│ ├── stirling-pdf/ # PDF tools
│ ├── gotenberg/ # Document conversion
│ ├── tika/ # Content extraction
│ ├── nginx/ # Development reverse proxy
│ └── dev-containers/ # Remote development environments
│
├── monitoring/ # Observability stack
│ ├── prometheus/ # kube-prometheus-stack Helm values
│ ├── grafana/ # Grafana dashboards and datasources
│ └── loki/ # Log aggregation Helm values
│
├── scripts/ # Automation and utilities
│ └── gpu-worker/ # GPU mode switching scripts
│
├── docs/ # Documentation
│ ├── VISUAL_TOUR.md # Grafana screenshots & architecture diagrams
│ ├── adrs/ # Architecture Decision Records
│ ├── runbooks/ # Operational procedures
│ ├── planning-artifacts/ # PRD, architecture, epics
│ └── implementation-artifacts/ # Story files, sprint status
│
└── _bmad/ # BMAD AI workflow framework
Key Files:
- `docs/VISUAL_TOUR.md` - Grafana dashboard screenshots and architecture diagrams
- `infrastructure/*/values-homelab.yaml` - Helm chart customizations
- `docs/adrs/ADR-*.md` - architectural decisions with trade-offs
- `docs/runbooks/*.md` - operational procedures
- `docs/implementation-artifacts/*.md` - story-by-story implementation
- `CLAUDE.md` - AI-assisted development instructions
- Metrics: Prometheus scraping all cluster components, custom ServiceMonitors for apps
- Dashboards: Grafana with K8s cluster overview, node metrics, pod resources, GPU utilization
- Logs: Loki aggregating logs from all namespaces, queryable via Grafana
- Alerts: Alertmanager configured for P1 scenarios (disk full, pod crashes, certificate expiry)
- Notifications: Mobile push alerts via ntfy.sh for critical issues
See Epic 4 stories for observability implementation.
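The notification path can be exercised end-to-end by publishing a test message straight to ntfy.sh (the topic name below is an example; use whatever topic the Alertmanager receiver is actually configured with):

```shell
curl -s \
  -H "Title: Homelab test alert" \
  -H "Priority: high" \
  -d "Test notification from the cluster" \
  https://ntfy.sh/homelab-alerts-example
```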
- Cluster state: etcd snapshots via K3s built-in, restored and tested
- PostgreSQL: pg_dump scheduled backups to NFS, restore procedure validated
- NFS volumes: Synology snapshot schedules (hourly, daily, weekly retention)
- Configuration: All manifests and Helm values in Git (infrastructure as code)
See runbooks: cluster-backup.md, cluster-restore.md, postgres-backup.md.
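K3s ships snapshot tooling for its embedded etcd, so an on-demand snapshot before risky changes is a one-liner. A sketch (the snapshot name, PostgreSQL deployment, database, and dump path are illustrative; the scheduled jobs' real settings live in the runbooks):

```shell
# On the control-plane node; snapshots land in
# /var/lib/rancher/k3s/server/db/snapshots by default.
sudo k3s etcd-snapshot save --name pre-upgrade
sudo k3s etcd-snapshot list

# Ad-hoc PostgreSQL dump to the NFS share (names/paths are examples).
kubectl exec -n data deploy/postgres -- \
  pg_dump -U postgres appdb > /mnt/nfs/backups/appdb-$(date +%F).sql
```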
- K3s upgrades: Tested procedure with rollback plan (see k3s-upgrade.md)
- OS patching: Automatic security updates via unattended-upgrades
- Certificate renewal: Automated via cert-manager, manual renewal documented
- GPU mode switching: Documented procedure for ML ↔ Gaming mode transitions
See Epic 8 stories for operational procedures.
This project was built incrementally across 24 epics, each delivering specific infrastructure capabilities. All planning and implementation artifacts are available in docs/.
| Epic | Name | Outcome |
|---|---|---|
| 1 | Foundation - K3s Cluster | Multi-node K3s cluster with Tailscale remote access |
| 2 | Storage & Persistence | NFS storage provisioning from Synology NAS |
| 3 | Ingress, TLS & Exposure | HTTPS services via *.home.jetzinger.com with Let's Encrypt |
| 4 | Observability Stack | Prometheus, Grafana, Loki, Alertmanager with mobile alerts |
| 5 | PostgreSQL Database | Production PostgreSQL with backup/restore capability |
| 6 | AI Inference Platform | Ollama LLM inference + n8n workflow automation |
| 7 | Development Proxy | Nginx proxy for local development servers |
| 8 | Cluster Operations | K3s upgrades, backup/restore, maintenance procedures |
| 9 | Portfolio & Showcase | Public GitHub repo, ADRs, technical blog posts |
| 10 | Document Management | Paperless-ngx with OCR, Tika, Gotenberg, Stirling-PDF |
| 11 | Dev Containers | Remote development via VS Code + Claude Code |
| 12 | GPU/ML Platform | Intel NUC + RTX 3060 eGPU with vLLM/Qwen 2.5 |
| 13 | Steam Gaming | Dual-use GPU: ML inference ↔ Steam gaming mode switching |
| 14 | LiteLLM Proxy | Three-tier fallback: vLLM → Ollama → OpenAI + external providers |
| 15 | Network Enhancement | Tailscale subnet routing for full home network access |
| 16 | NAS Worker Node | Lightweight K3s worker VM on Synology DS920+ |
| 17 | Open-WebUI | ChatGPT-like interface for all LLM models |
| 18 | K8s Dashboard | Cluster visualization and resource monitoring |
| 19 | Self-Hosted Git | Gitea with PostgreSQL backend and SSH access |
| 20 | Reasoning Models | DeepSeek-R1 7B with R1-Mode for chain-of-thought reasoning |
| 21 | OpenClaw Core Gateway | Personal AI assistant with Opus 4.5, LiteLLM fallback, Telegram |
| 22 | OpenClaw Multi-Channel | Discord channel, MCP research tools, cross-channel context |
| 23 | OpenClaw Advanced | Voice interaction, sub-agent routing, browser automation |
| 24 | OpenClaw Observability | Loki log dashboard, Blackbox probes, P1 alerts, ADR documentation |
Detailed Documentation:
- Epic definitions: docs/planning-artifacts/epics.md
- Story files: docs/implementation-artifacts/
- Sprint status: docs/implementation-artifacts/sprint-status.yaml
This project demonstrates systematic AI-assisted infrastructure development using Claude Code and the BMAD (Build-Measure-Achieve-Document) methodology. Every component was implemented through structured planning, execution, and documentation cycles.
Claude Code is Anthropic's official CLI tool that integrates Claude AI directly into the development workflow. Unlike generic AI chat interfaces, Claude Code:
- Understands full project context: Reads architecture docs, PRDs, and codebase structure
- Executes tasks autonomously: Creates files, runs commands, validates changes
- Follows project conventions: adheres to patterns defined in CLAUDE.md and architecture docs
- Maintains conversation context: long-running sessions with automatic summarization
BMAD is a multi-agent AI workflow framework that enforces systematic software delivery. Located in _bmad/, it orchestrates the entire development lifecycle:
- Phase 1: Discovery → product requirements, user journeys, success criteria
- Phase 2: Architecture → technology selection, component design, ADRs
- Phase 3: Planning → epics, user stories, complexity estimation
- Phase 4: Implementation → story execution with gap analysis and code review
# Workflow for each story:
1. /create-story # Generate story file from epic
2. /dev-story # Implement with gap analysis
3. /code-review # Adversarial quality check
4. Document & iterate  # Update ADRs, runbooks

For hiring managers: This isn't a tutorial follow-along. It's a methodology-driven project that produces production-quality infrastructure with systematic documentation. The AI accelerated delivery, but the engineering judgment, architectural decisions, and operational discipline are human.
| Application | Purpose | Namespace | URL |
|---|---|---|---|
| Grafana | Metrics visualization | monitoring | https://grafana.home.jetzinger.com |
| Open-WebUI | ChatGPT-like interface | apps | https://chat.home.jetzinger.com |
| OpenClaw | Personal AI assistant | apps | https://openclaw.home.jetzinger.com |
| Paperless-ngx | Document management | docs | https://paperless.home.jetzinger.com |
| Paperless-AI | AI document classification | docs | https://paperless-ai.home.jetzinger.com |
| Gitea | Self-hosted Git | apps | https://git.home.jetzinger.com |
| n8n | Workflow automation | apps | https://n8n.home.jetzinger.com |
| Stirling-PDF | PDF tools | docs | https://pdf.home.jetzinger.com |
| K8s Dashboard | Cluster visualization | kubernetes-dashboard | https://dashboard.home.jetzinger.com |
| LiteLLM | LLM proxy | ml | https://litellm.home.jetzinger.com |
| vLLM | GPU inference | ml | ClusterIP only |
| Ollama | CPU inference | ml | ClusterIP only |
| PostgreSQL | Relational database | data | ClusterIP only |
| Prometheus | Metrics collection | monitoring | ClusterIP only |
| Alertmanager | Alert routing | monitoring | ClusterIP only |
| Loki | Log aggregation | monitoring | ClusterIP only |
- Portfolio Summary: docs/PORTFOLIO.md - High-level project summary
- Visual Tour: docs/VISUAL_TOUR.md - Screenshots and diagrams
- Architecture Decision Records: docs/adrs/ - Technical choices
- Operational Runbooks: docs/runbooks/ - P1 procedures
- Implementation Stories: docs/implementation-artifacts/ - Build documentation
- PRD and Planning: docs/planning-artifacts/ - Requirements and architecture
- Blog Posts: docs/blog-posts/ - Technical write-ups
Tom Jetzinger Platform Engineering | Kubernetes | Systems Architecture
Questions about this lab, my transition from automotive to cloud-native, or how I implemented specific features? I'm happy to discuss!
- LinkedIn: linkedin.com/in/tjetzinger
- Email: [email protected]
- GitHub: @tjetzinger
This project is shared for educational and portfolio purposes. Configuration files and documentation are MIT licensed. See individual component licenses for third-party software.
Note: Secrets, API keys, and kubeconfig files are excluded via .gitignore. This repository contains no sensitive credentials.