Home Lab: Production-Grade K3s Platform

A production-ready Kubernetes home lab running on Proxmox VE, demonstrating platform engineering skills through hands-on infrastructure operation. This project bridges my automotive systems engineering background with modern cloud-native technologies, showcasing the ability to design, deploy, and maintain production workloads on Kubernetes.

Status: ✅ Operational (24 epics complete)
Cluster: 5-node K3s v1.34.3+k3s1 (including GPU worker)
Workloads: Full ML inference stack, document management, workflow automation, Git hosting
Observability: Prometheus, Grafana, Loki, Alertmanager


Why This Project?

After years working on In-Vehicle Infotainment (IVI) and location-based navigation systems, building embedded, mobile, and cloud solutions for online routing, I'm expanding into cloud-native platform engineering. This lab serves as a hands-on portfolio demonstrating Kubernetes infrastructure deployment and operations. It's a working environment that I use daily, monitor, maintain, and continuously improve.

What Makes This Different

  • Real operational complexity: Persistent storage, TLS certificates, load balancing, log aggregation, alerts
  • Production practices: GitOps, ADRs, runbooks, backup/restore procedures, upgrade testing
  • AI-assisted workflow: Leveraging Claude Code and BMAD methodology for systematic implementation
  • Engineering judgment: Every decision documented with trade-off analysis (see ADRs)
  • Automotive → K8s bridge: Applying systems thinking from real-time embedded work to distributed systems

Architecture Overview

High-Level Design

┌─────────────────────────────────────────────────────────────────────────────┐
│                              Tailscale VPN                                   │
│                         (Remote Access Layer)                                │
└──────────────────────────────────┬──────────────────────────────────────────┘
                                   │
┌──────────────────────────────────▼──────────────────────────────────────────┐
│                            Traefik Ingress                                   │
│                       (TLS Termination, Routing)                             │
└───┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────────┘
    │         │         │         │         │         │         │
┌───▼───┐ ┌───▼───┐ ┌───▼───┐ ┌───▼───┐ ┌───▼───┐ ┌───▼───┐ ┌───▼───┐
│Grafana│ │Open-  │ │Paper- │ │ Gitea │ │  n8n  │ │Stir-  │ │ K8s   │
│       │ │WebUI  │ │less   │ │       │ │       │ │lingPDF│ │Dashbd │
└───┬───┘ └───┬───┘ └───┬───┘ └───┬───┘ └───┬───┘ └───┬───┘ └───┬───┘
    │         │         │         │         │         │         │
    └─────────┴────┬────┴─────────┴────┬────┴─────────┴─────────┘
                   │                   │
                   │ (AI + Data + Storage backends)
                   │                   │
┌──────────────────▼───┐  ┌────────────▼───┐  ┌─────────────────────┐
│      LiteLLM         │  │   PostgreSQL   │  │   Synology DS920+   │
│       Proxy          │  │     (data)     │  │  (NFS + k3s-nas VM) │
└──────────┬───────────┘  └────────────────┘  └─────────────────────┘
           │
┌──────────┼──────────┐
│          │          │
▼          ▼          ▼
┌───────┐ ┌────────┐ ┌────────┐
│ vLLM  │ │ Ollama │ │ OpenAI │
│ (GPU) │ │ (CPU)  │ │(Cloud) │
└───┬───┘ └────────┘ └────────┘
    │
┌───▼────────────────┐
│   RTX 3060 eGPU    │
│    (12GB VRAM)     │
│   k3s-gpu-worker   │
└────────────────────┘

Cluster Nodes

| Node | Role | Address / Hardware | Purpose |
|------|------|--------------------|---------|
| k3s-master | Control plane | 192.168.2.20 | API server, etcd, scheduler |
| k3s-worker-01 | CPU worker | 192.168.2.21 | General workloads |
| k3s-worker-02 | CPU worker | 192.168.2.22 | General workloads |
| k3s-gpu-worker | GPU worker | Intel NUC + RTX 3060 eGPU (12GB) | ML inference (vLLM) |
| k3s-nas-worker | NAS worker | Synology DS920+ VM | Storage-adjacent workloads |

Technology Stack

| Layer | Technology | Decision Rationale |
|-------|------------|--------------------|
| Orchestration | K3s v1.34.3 | Lightweight, production-ready K8s. Half the memory of k0s, built-in storage/ingress. See ADR-001 |
| Compute | 5x nodes (VMs + bare metal) | Mixed: Proxmox VMs, Intel NUC with eGPU, Synology NAS VM |
| GPU Inference | vLLM + NVIDIA Container Toolkit | High-performance LLM serving with AWQ quantization |
| LLM Proxy | LiteLLM | Unified OpenAI-compatible API with automatic failover |
| Storage | NFS from Synology DS920+ | Existing NAS asset, CSI driver maturity, snapshot support |
| Ingress | Traefik (K3s bundled) | Zero-config LB, native K8s integration, automatic cert renewal |
| TLS | cert-manager + Let's Encrypt | Industry standard, automated renewal, staging/prod environments |
| Load Balancer | MetalLB (Layer 2) | Simple home network setup, no BGP complexity needed |
| Observability | kube-prometheus-stack + Loki | Complete stack (metrics, logs, alerting), Grafana dashboards, mobile alerts |
| GitOps | Git as source of truth | All manifests version-controlled, Helm values files, reproducible deployments |

Key Design Decisions:

  • No public exposure: Tailscale VPN-only access (security > convenience)
  • External storage: NFS over in-cluster solutions (leverage existing NAS investment, snapshots)
  • Three-tier ML inference: vLLM (GPU) → Ollama (CPU) → OpenAI (cloud) with automatic failover
  • Helm for apps: Values files over --set flags (version control, repeatability)
  • ADR documentation: Every architectural choice captured with context and alternatives

See docs/adrs/ for detailed decision records.


ML Inference Stack

The cluster runs a three-tier ML inference platform with automatic failover:

LiteLLM Proxy → vLLM (GPU, primary) → Ollama (CPU, fallback) → OpenAI (cloud, emergency)
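
A minimal sketch of how such a chain can be expressed in a LiteLLM proxy config. Everything here (model aliases, cluster DNS names, the exact fallback placement) is an illustrative assumption, not the repo's actual values file:

# Hypothetical LiteLLM proxy config illustrating the three-tier fallback
cat <<'EOF' > litellm-config.yaml
model_list:
  - model_name: qwen-gpu                  # primary: vLLM on the GPU worker
    litellm_params:
      model: hosted_vllm/Qwen/Qwen2.5-7B-Instruct-AWQ
      api_base: http://vllm.ml.svc.cluster.local:8000/v1
  - model_name: qwen-cpu                  # fallback: Ollama on the CPU workers
    litellm_params:
      model: ollama/qwen2.5:7b
      api_base: http://ollama.ml.svc.cluster.local:11434
  - model_name: gpt-4o-mini               # emergency: cloud provider
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY
litellm_settings:
  fallbacks:
    - qwen-gpu: ["qwen-cpu", "gpt-4o-mini"]
EOF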

GPU Modes (switchable on k3s-gpu-worker):

  • ml - Qwen 2.5 7B AWQ (general inference, 8K context)
  • r1 - DeepSeek-R1 7B AWQ (reasoning tasks with chain-of-thought)
  • gaming - vLLM scaled to 0, GPU released for Steam
# Check current mode
ssh k3s-gpu-worker "gpu-mode status"

# Switch modes
ssh k3s-gpu-worker "gpu-mode ml"      # General inference
ssh k3s-gpu-worker "gpu-mode r1"      # Reasoning model
ssh k3s-gpu-worker "gpu-mode gaming"  # Release GPU

Consumers:

  • Open-WebUI: ChatGPT-like interface at https://chat.home.jetzinger.com
  • Paperless-AI: Document classification and RAG queries
  • n8n: Workflow automation with LLM integration
  • OpenClaw: Personal AI assistant (see below)
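
Every consumer talks to the same OpenAI-compatible endpoint, so a quick smoke test from any Tailscale-connected machine is a standard chat-completions call (the model alias and key variable are placeholders):

# Smoke test against the LiteLLM proxy (model alias is a placeholder)
curl -s https://litellm.home.jetzinger.com/v1/chat/completions \
  -H "Authorization: Bearer $LITELLM_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen-gpu", "messages": [{"role": "user", "content": "Hello"}]}'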

OpenClaw Personal AI Assistant

OpenClaw is a self-hosted AI assistant running on the K3s cluster, accessible via Telegram and Discord. It provides frontier-quality conversational AI with automatic fallback to local inference when cloud APIs are unavailable.

Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                        OpenClaw Gateway (apps namespace)                     │
│                     https://openclaw.home.jetzinger.com                      │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐                         │
│  │  Telegram   │  │   Discord   │  │   Web UI    │                         │
│  │  (outbound) │  │  (outbound) │  │  (Traefik)  │                         │
│  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘                         │
│         │                │                │                                 │
│         └────────────────┼────────────────┘                                 │
│                          │                                                  │
│                          ▼                                                  │
│                 ┌─────────────────┐                                         │
│                 │   Agent Engine  │                                         │
│                 │    + LanceDB    │                                         │
│                 │  (long-term     │                                         │
│                 │   memory)       │                                         │
│                 └────────┬────────┘                                         │
│                          │                                                  │
│         ┌────────────────┴────────────────┐                                 │
│         │                                 │                                 │
│         ▼                                 ▼                                 │
│  ┌─────────────────┐           ┌─────────────────────┐                     │
│  │  Claude Opus    │           │   LiteLLM Fallback  │                     │
│  │     4.5         │──────────▶│  (existing stack)   │                     │
│  │   (PRIMARY)     │  on fail  │                     │                     │
│  │                 │           │  vLLM → Ollama →    │                     │
│  │ Anthropic OAuth │           │  OpenAI             │                     │
│  └─────────────────┘           └─────────────────────┘                     │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Key Features

| Feature | Implementation |
|---------|----------------|
| Primary LLM | Claude Opus 4.5 via Anthropic OAuth (frontier reasoning) |
| Fallback Chain | LiteLLM → vLLM (GPU) → Ollama (CPU) → OpenAI (cloud) |
| Messaging | Telegram, Discord with DM allowlist security |
| Memory | LanceDB with OpenAI embeddings for long-term context |
| Voice | ElevenLabs TTS/STT (optional) |
| Monitoring | Grafana dashboard (LogQL), P1 alerts via ntfy.sh |

Inverse Fallback Pattern

Unlike typical setups where local models are primary, OpenClaw uses an "inverse fallback" approach:

  1. Primary: Claude Opus 4.5 (best reasoning quality)
  2. Fallback: Existing LiteLLM three-tier stack (when cloud unavailable)

This ensures the best possible responses while maintaining high availability.

Accessing OpenClaw

| Access Method | URL / Details |
|---------------|---------------|
| Control UI | https://openclaw.home.jetzinger.com (Tailscale required) |
| Telegram | DM the bot (allowlisted users only) |
| Discord | Bot in private server (allowlisted users only) |

See ADR-011 for detailed architectural decisions.


Quick Start

Prerequisites:

  • 3x VMs (2 CPU, 4GB RAM each) or bare metal nodes
  • Ubuntu 22.04 LTS installed on all nodes
  • Network: Static IPs assigned, nodes can reach each other
  • Optional: Tailscale account for remote access
  • Optional: NFS server for persistent storage
  • Optional: NVIDIA GPU for ML inference

Time to working cluster: ~90 minutes (tested)

Step 1: Control Plane Setup

# On k3s-master node (192.168.2.20)
git clone https://github.com/tjetzinger/home-lab.git
cd home-lab/infrastructure/k3s
chmod +x install-master.sh
sudo ./install-master.sh

# Verify control plane
sudo kubectl get nodes
sudo kubectl get pods -n kube-system

See docs/implementation-artifacts/1-1-create-k3s-control-plane.md for detailed walkthrough.

Step 2: Worker Nodes

# On each worker node (192.168.2.21, 192.168.2.22)
chmod +x install-worker.sh
sudo K3S_URL=https://192.168.2.20:6443 \
     K3S_TOKEN=<token-from-master> \
     ./install-worker.sh

# Verify cluster
kubectl get nodes
# Should show: k3s-master, k3s-worker-01, k3s-worker-02
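
The join token referenced as <token-from-master> is generated during the control-plane install; K3s writes it to a fixed path:

# On k3s-master: print the worker join token
sudo cat /var/lib/rancher/k3s/server/node-token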

See docs/implementation-artifacts/1-2-add-first-worker-node.md and 1-3-add-second-worker-node.md.

Step 3: Remote Access (Optional)

# Configure local kubectl to access cluster remotely via Tailscale
./infrastructure/k3s/kubeconfig-setup.sh
kubectl get nodes  # Should work from your laptop now

See docs/implementation-artifacts/1-4-configure-remote-kubectl-access.md.
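
In essence, the script copies the cluster kubeconfig and points it at the master's Tailscale address. A rough sketch of the equivalent manual steps (the script's internals and the Tailscale address are assumptions):

# Pull the kubeconfig from the master and retarget it at the Tailscale address
ssh k3s-master "sudo cat /etc/rancher/k3s/k3s.yaml" > ~/.kube/config-homelab
sed -i 's/127.0.0.1/<master-tailscale-ip>/' ~/.kube/config-homelab
export KUBECONFIG=~/.kube/config-homelab
kubectl get nodes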

Step 4: Core Infrastructure

Deploy storage, load balancing, and TLS:

# NFS Storage Provisioner
helm upgrade --install nfs-subdir-external-provisioner \
  nfs-subdir-external-provisioner/nfs-subdir-external-provisioner \
  -f infrastructure/nfs/values-homelab.yaml \
  -n kube-system

# MetalLB Load Balancer
helm upgrade --install metallb metallb/metallb \
  -f infrastructure/metallb/values-homelab.yaml \
  -n infra --create-namespace

# cert-manager for TLS
helm upgrade --install cert-manager jetstack/cert-manager \
  -f infrastructure/cert-manager/values-homelab.yaml \
  -n infra --set installCRDs=true

Detailed procedures: Epic 2 stories (2-1 through 2-4) and Epic 3 stories (3-1 through 3-5).
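
With no public exposure, HTTP-01 validation can't reach the cluster, so certificates for *.home.jetzinger.com would typically be issued via a DNS-01 challenge. A hedged ClusterIssuer sketch; the DNS provider, email, and secret names are assumptions, not the repo's actual configuration:

kubectl apply -f - <<'EOF'
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: admin@example.com            # assumption
    privateKeySecretRef:
      name: letsencrypt-prod-key
    solvers:
      - dns01:
          cloudflare:                   # assumed DNS provider
            apiTokenSecretRef:
              name: cloudflare-api-token
              key: api-token
EOF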

Step 5: Monitoring Stack

# kube-prometheus-stack (Prometheus + Grafana + Alertmanager)
helm upgrade --install kube-prometheus-stack \
  prometheus-community/kube-prometheus-stack \
  -f monitoring/prometheus/values-homelab.yaml \
  -n monitoring --create-namespace

# Loki for log aggregation
helm upgrade --install loki grafana/loki-stack \
  -f monitoring/loki/values-homelab.yaml \
  -n monitoring

# Access Grafana
kubectl get ingress -n monitoring
# Visit https://grafana.home.jetzinger.com

See Epic 4 stories (4-1 through 4-6) for observability setup details.

Verification

# Check all nodes healthy
kubectl get nodes

# Check system pods running
kubectl get pods -n kube-system
kubectl get pods -n infra
kubectl get pods -n monitoring

# Verify storage class
kubectl get storageclass

# Check TLS certificates
kubectl get certificate -A

# Test ingress
curl https://grafana.home.jetzinger.com

Repository Structure

home-lab/
├── infrastructure/           # Core cluster components
│   ├── k3s/                 # Control plane and worker install scripts
│   ├── nfs/                 # NFS CSI provisioner Helm values
│   ├── metallb/             # MetalLB load balancer config
│   ├── cert-manager/        # TLS certificate automation
│   └── traefik/             # Ingress controller config (K3s bundled)
│
├── applications/            # Workload deployments
│   ├── vllm/                # GPU inference engine (Qwen, DeepSeek-R1)
│   ├── litellm/             # LLM proxy with fallback chain
│   ├── ollama/              # CPU fallback inference
│   ├── open-webui/          # ChatGPT-like interface
│   ├── paperless/           # Document management (Paperless-ngx)
│   ├── paperless-ai/        # AI-powered document classification
│   ├── gitea/               # Self-hosted Git
│   ├── postgres/            # PostgreSQL database
│   ├── n8n/                 # Workflow automation
│   ├── stirling-pdf/        # PDF tools
│   ├── gotenberg/           # Document conversion
│   ├── tika/                # Content extraction
│   ├── nginx/               # Development reverse proxy
│   └── dev-containers/      # Remote development environments
│
├── monitoring/              # Observability stack
│   ├── prometheus/          # kube-prometheus-stack Helm values
│   ├── grafana/             # Grafana dashboards and datasources
│   └── loki/                # Log aggregation Helm values
│
├── scripts/                 # Automation and utilities
│   └── gpu-worker/          # GPU mode switching scripts
│
├── docs/                    # Documentation
│   ├── VISUAL_TOUR.md       # Grafana screenshots & architecture diagrams
│   ├── adrs/                # Architecture Decision Records
│   ├── runbooks/            # Operational procedures
│   ├── planning-artifacts/  # PRD, architecture, epics
│   └── implementation-artifacts/  # Story files, sprint status
│
└── _bmad/                   # BMAD AI workflow framework

Key Files:

  • docs/VISUAL_TOUR.md - Grafana dashboard screenshots and architecture diagrams
  • infrastructure/*/values-homelab.yaml - Helm chart customizations
  • docs/adrs/ADR-*.md - Architectural decisions with trade-offs
  • docs/runbooks/*.md - Operational procedures
  • docs/implementation-artifacts/*.md - Story-by-story implementation
  • CLAUDE.md - AI-assisted development instructions

Operational Excellence

Monitoring and Alerts

  • Metrics: Prometheus scraping all cluster components, custom ServiceMonitors for apps
  • Dashboards: Grafana with K8s cluster overview, node metrics, pod resources, GPU utilization
  • Logs: Loki aggregating logs from all namespaces, queryable via Grafana
  • Alerts: Alertmanager configured for P1 scenarios (disk full, pod crashes, certificate expiry)
  • Notifications: Mobile push alerts via ntfy.sh for critical issues

See Epic 4 stories for observability implementation.
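
The custom ServiceMonitors mentioned above follow the standard Prometheus Operator pattern. A minimal sketch for a hypothetical app; names, labels, and the metrics port are illustrative:

kubectl apply -f - <<'EOF'
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: litellm                          # hypothetical example app
  namespace: monitoring
  labels:
    release: kube-prometheus-stack       # must match the stack's selector
spec:
  namespaceSelector:
    matchNames: ["ml"]
  selector:
    matchLabels:
      app: litellm
  endpoints:
    - port: http
      path: /metrics
      interval: 30s
EOF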

Backup and Recovery

  • Cluster state: etcd snapshots via K3s built-in, restored and tested
  • PostgreSQL: pg_dump scheduled backups to NFS, restore procedure validated
  • NFS volumes: Synology snapshot schedules (hourly, daily, weekly retention)
  • Configuration: All manifests and Helm values in Git (infrastructure as code)

See runbooks: cluster-backup.md, cluster-restore.md, postgres-backup.md.
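
For a flavor of these procedures, both backup paths reduce to short commands; the K3s snapshot subcommands are built in, while the pg_dump target names and NFS path are assumptions:

# On-demand etcd snapshot (lands in /var/lib/rancher/k3s/server/db/snapshots by default)
sudo k3s etcd-snapshot save --name manual-$(date +%F)
sudo k3s etcd-snapshot ls

# Logical PostgreSQL backup to the NFS share (workload and path names are assumptions)
kubectl -n data exec deploy/postgres -- \
  pg_dump -U postgres -d appdb | gzip > /mnt/nas/backups/appdb-$(date +%F).sql.gz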

Maintenance Procedures

  • K3s upgrades: Tested procedure with rollback plan (see k3s-upgrade.md)
  • OS patching: Automatic security updates via unattended-upgrades
  • Certificate renewal: Automated via cert-manager, manual renewal documented
  • GPU mode switching: Documented procedure for ML ↔ Gaming mode transitions
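
As a sketch of the upgrade bullet above: a K3s server can be upgraded in place by re-running the installer pinned to a version, snapshotting first (k3s-upgrade.md documents the full tested procedure):

# On the server node: safety snapshot, then pinned re-install
sudo k3s etcd-snapshot save --name pre-upgrade
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION="<target-version>" sh -
kubectl get nodes -o wide   # confirm the new version before touching workers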

See Epic 8 stories for operational procedures.


Implementation Journey: 24 Epics

This project was built incrementally across 24 epics, each delivering specific infrastructure capabilities. All planning and implementation artifacts are available in docs/.

| Epic | Name | Outcome |
|------|------|---------|
| 1 | Foundation - K3s Cluster | Multi-node K3s cluster with Tailscale remote access |
| 2 | Storage & Persistence | NFS storage provisioning from Synology NAS |
| 3 | Ingress, TLS & Exposure | HTTPS services via *.home.jetzinger.com with Let's Encrypt |
| 4 | Observability Stack | Prometheus, Grafana, Loki, Alertmanager with mobile alerts |
| 5 | PostgreSQL Database | Production PostgreSQL with backup/restore capability |
| 6 | AI Inference Platform | Ollama LLM inference + n8n workflow automation |
| 7 | Development Proxy | Nginx proxy for local development servers |
| 8 | Cluster Operations | K3s upgrades, backup/restore, maintenance procedures |
| 9 | Portfolio & Showcase | Public GitHub repo, ADRs, technical blog posts |
| 10 | Document Management | Paperless-ngx with OCR, Tika, Gotenberg, Stirling-PDF |
| 11 | Dev Containers | Remote development via VS Code + Claude Code |
| 12 | GPU/ML Platform | Intel NUC + RTX 3060 eGPU with vLLM/Qwen 2.5 |
| 13 | Steam Gaming | Dual-use GPU: ML inference ↔ Steam gaming mode switching |
| 14 | LiteLLM Proxy | Three-tier fallback: vLLM → Ollama → OpenAI + external providers |
| 15 | Network Enhancement | Tailscale subnet routing for full home network access |
| 16 | NAS Worker Node | Lightweight K3s worker VM on Synology DS920+ |
| 17 | Open-WebUI | ChatGPT-like interface for all LLM models |
| 18 | K8s Dashboard | Cluster visualization and resource monitoring |
| 19 | Self-Hosted Git | Gitea with PostgreSQL backend and SSH access |
| 20 | Reasoning Models | DeepSeek-R1 7B with R1-Mode for chain-of-thought reasoning |
| 21 | OpenClaw Core Gateway | Personal AI assistant with Opus 4.5, LiteLLM fallback, Telegram |
| 22 | OpenClaw Multi-Channel | Discord channel, MCP research tools, cross-channel context |
| 23 | OpenClaw Advanced | Voice interaction, sub-agent routing, browser automation |
| 24 | OpenClaw Observability | Loki log dashboard, Blackbox probes, P1 alerts, ADR documentation |

Detailed documentation for every epic lives in docs/planning-artifacts/ (PRD, architecture, epics) and docs/implementation-artifacts/ (story files, sprint status).


Development Methodology

This project demonstrates systematic AI-assisted infrastructure development using Claude Code and the BMAD (Build-Measure-Achieve-Document) methodology. Every component was implemented through structured planning, execution, and documentation cycles.

Claude Code: AI-Powered Development

Claude Code is Anthropic's official CLI tool that integrates Claude AI directly into the development workflow. Unlike generic AI chat interfaces, Claude Code:

  • Understands full project context: Reads architecture docs, PRDs, and codebase structure
  • Executes tasks autonomously: Creates files, runs commands, validates changes
  • Follows project conventions: Adheres to patterns defined in CLAUDE.md and architecture docs
  • Maintains conversation context: Long-running sessions with automatic summarization

BMAD Methodology: Structured Implementation

BMAD is a multi-agent AI workflow framework that enforces systematic software delivery. Located in _bmad/, it orchestrates the entire development lifecycle:

Phase 1: Discovery → Product requirements, user journeys, success criteria
Phase 2: Architecture → Technology selection, component design, ADRs
Phase 3: Planning → Epics, user stories, complexity estimation
Phase 4: Implementation → Story execution with gap analysis and code review

# Workflow for each story:
1. /create-story          # Generate story file from epic
2. /dev-story             # Implement with gap analysis
3. /code-review           # Adversarial quality check
4. Document & iterate     # Update ADRs, runbooks

For hiring managers: This isn't a tutorial follow-along. It's a methodology-driven project that produces production-quality infrastructure with systematic documentation. The AI accelerated delivery, but the engineering judgment, architectural decisions, and operational discipline are human.


Current Workloads

| Application | Purpose | Namespace | URL |
|-------------|---------|-----------|-----|
| Grafana | Metrics visualization | monitoring | https://grafana.home.jetzinger.com |
| Open-WebUI | ChatGPT-like interface | apps | https://chat.home.jetzinger.com |
| OpenClaw | Personal AI assistant | apps | https://openclaw.home.jetzinger.com |
| Paperless-ngx | Document management | docs | https://paperless.home.jetzinger.com |
| Paperless-AI | AI document classification | docs | https://paperless-ai.home.jetzinger.com |
| Gitea | Self-hosted Git | apps | https://git.home.jetzinger.com |
| n8n | Workflow automation | apps | https://n8n.home.jetzinger.com |
| Stirling-PDF | PDF tools | docs | https://pdf.home.jetzinger.com |
| K8s Dashboard | Cluster visualization | kubernetes-dashboard | https://dashboard.home.jetzinger.com |
| LiteLLM | LLM proxy | ml | https://litellm.home.jetzinger.com |
| vLLM | GPU inference | ml | ClusterIP only |
| Ollama | CPU inference | ml | ClusterIP only |
| PostgreSQL | Relational database | data | ClusterIP only |
| Prometheus | Metrics collection | monitoring | ClusterIP only |
| Alertmanager | Alert routing | monitoring | ClusterIP only |
| Loki | Log aggregation | monitoring | ClusterIP only |



Contact and Feedback

Tom Jetzinger
Platform Engineering | Kubernetes | Systems Architecture

Questions about this lab, my transition from automotive to cloud-native, or how I implemented specific features? I'm happy to discuss!


License

This project is shared for educational and portfolio purposes. Configuration files and documentation are MIT licensed. See individual component licenses for third-party software.

Note: Secrets, API keys, and kubeconfig files are excluded via .gitignore. This repository contains no sensitive credentials.
