akandr/bc250

 ██████╗  ██████╗       ██████╗ ███████╗ ██████╗
 ██╔══██╗██╔════╝       ╚════██╗██╔════╝██╔═████╗
 ██████╔╝██║      █████╗ █████╔╝███████╗██║██╔██║
 ██╔══██╗██║      ╚════╝██╔═══╝ ╚════██║████╔╝██║
 ██████╔╝╚██████╗       ███████╗███████║╚██████╔╝
 ╚═════╝  ╚═════╝       ╚══════╝╚══════╝ ╚═════╝

GPU-accelerated AI home server on an obscure AMD APU — Vulkan inference, autonomous intelligence, Signal chat

Zen 2 · RDNA 1.5 · 16 GB unified · Vulkan · 14B @ 27 tok/s · 330 autonomous jobs/cycle · 130 dashboard pages

Code: AGPL v3 Docs: CC BY-SA 4.0

BC-250 test platform

The BC-250 is powered by an ATX supply and cooled by a broken AIO radiator with three fans simply resting on top of it. Somehow it has run 24/7 without issues so far.

A complete guide to running a 35B-parameter MoE LLM, FLUX.2 image generation, and 330 autonomous jobs on the AMD BC-250 — an obscure APU (Zen 2 CPU + Cyan Skillfish RDNA 1.5 GPU) found in Samsung's blockchain/distributed-ledger rack appliances. Not a "crypto mining GPU," not a PS5 prototype — it's a custom SoC that Samsung used for private DLT infrastructure, repurposed here as a headless AI server with a community-patched BIOS.

Qwen3.5-35B MoE at 38 tok/s, FLUX.2-klein-9B at best quality, hardware-specific driver workarounds, memory tuning notes, and real-world benchmarks on this niche hardware.

What makes this unusual: The BC-250's Cyan Skillfish GPU (GFX1013) is one of the few documented cases of LLM inference on RDNA 1.5. ROCm doesn't support it. OpenCL doesn't expose it. The only viable compute path is Vulkan — and even that required working around two kernel memory bottlenecks (GTT cap + TTM pages_limit) before 14B models would run.


░░ Contents

§ Section For What you'll find
PART I ─ HARDWARE & SETUP
1 Hardware Overview BC-250 owners Specs, memory architecture, power
2 Driver & Compute Stack BC-250 owners What works (Vulkan), what doesn't (ROCm)
3 Ollama + Vulkan Setup BC-250 owners Install, GPU memory tuning (GTT + TTM)
4 Models & Benchmarks LLM users Model compatibility, speed, memory budget
PART II ─ AI STACK
5 Signal Chat Bot Bot builders Chat, vision analysis, audio transcription, smart routing
6 Image Generation Creative users FLUX.2-klein-9B, synchronous pipeline
PART III ─ MONITORING & INTEL
7 Netscan Ecosystem Home lab admins 330 jobs, queue-runner v7, 130-page dashboard
8 Career Intelligence Job seekers Two-phase scanner, salary, patents
PART IV ─ REFERENCE
9 Repository Structure Contributors File layout, deployment paths
10 Troubleshooting Everyone Common issues and fixes
11 Known Limitations Maintainers What's broken, what to watch out for
12 Software Versions Everyone Pinned versions of all components
13 References Everyone Links to all upstream projects and models
A OpenClaw Archive Historical Original architecture, why we ditched it

PART I — Hardware & Setup

1. Hardware Overview

The AMD BC-250 is a custom APU originally designed for Samsung's blockchain/distributed-ledger rack appliances (not a traditional "mining GPU"). It's a full SoC — Zen 2 CPU and Cyan Skillfish RDNA 1.5 GPU on a single package, with 16 GB of on-package unified memory. Samsung deployed these in rack-mount enclosures for private DLT workloads; decommissioned boards now sell for ~$100–150 on the secondhand market, making them an affordable option for running 14B LLMs on dedicated hardware.

▸ Origin story — Samsung, 5G operators, and AliExpress

What it was built for: Samsung commissioned these custom AMD SoCs to build rack-mount servers for private DLT (Distributed Ledger Technology) infrastructure — not public cryptocurrency mining. The target customers were South Korean 5G operators (SK Telecom and others), who were early adopters of 5G deployment. Private blockchain solved several real problems for 5G telcos:

  • IoT microtransactions: 5G networks connect millions of smart devices. DLT enables cheap, instant machine-to-machine contract settlement without overloading central databases.
  • Digital identity & security: Operators used DLT registries for cryptographic customer authentication and digital identity wallets (e.g. Samsung Pay integration).
  • Inter-operator settlement: Blockchain streamlined real-time roaming fee reconciliation and data exchange between telecom partners.

Who made the hardware: The SoC was designed by AMD (Zen 2 CPU + RDNA 1.5 GPU). Samsung designed the overall system and wrote the factory BIOS. The physical boards were manufactured by ASRock Rack (ASRock's server division) as an OEM contractor — Samsung rack enclosures typically held 12 BC-250 boards each. ASRock Rack is known for producing highly custom designs for large tech companies.

How they ended up on AliExpress: Classic corporate e-waste cycle. As 5G infrastructure evolved, entire Korean server racks were decommissioned. Specialized recycling centers (mostly near Shenzhen, China) buy pallets of retired servers in bulk — often by weight. Workers disassemble the racks, test individual boards, and list working BC-250 modules on AliExpress as all-in-one SBC platforms for $100–150.

Not a PlayStation 5. Despite superficial similarities (both use Zen 2 + 16 GB memory), the BC-250 has nothing to do with the PS5. The PS5's Oberon SoC is RDNA 2 (GFX10.3, gfx1030+); the BC-250's Cyan Skillfish is RDNA 1.5 (GFX10.1, gfx1013) — a hybrid architecture: GFX10.1 instruction set (RDNA 1) but with hardware ray tracing support (full VK_KHR_ray_tracing_pipeline, VK_KHR_acceleration_structure, VK_KHR_ray_query). LLVM's AMDGPU processor table lists GFX1013 as product "TBA" under GFX10.1, confirming it was never a retail part. Samsung also licensed RDNA 2 for mobile (Exynos 2200 / Xclipse 920) — that's a completely separate deal.

Why "RDNA 1.5"? GFX1013 doesn't fit cleanly into AMD's public RDNA generations. It has the RDNA 1 (GFX10.1) ISA and shader compiler target, but includes hardware ray tracing — a feature AMD only shipped publicly with RDNA 2 (GFX10.3). This makes Cyan Skillfish a transitional/custom design, likely built for Samsung's specific workload requirements. We call it "RDNA 1.5" as a practical label.

BIOS is not stock. The board ships with a minimal Samsung BIOS meant for rack operation. A community-patched BIOS (from AMD BC-250 docs) enables standard UEFI features (boot menu, NVMe boot, fan control).

Component Details
CPU Zen 2 — 6c/12t @ 2.0 GHz
GPU Cyan Skillfish — RDNA 1.5, GFX1013, 24 CUs (1536 SPs), ray tracing capable
Memory 16 GB unified (16 × 1 GB on-package), shared CPU/GPU
VRAM 512 MB BIOS-carved framebuffer (same physical UMA pool — see note below)
GTT 16 GiB (tuned via ttm.pages_limit=4194304, default 7.4 GiB)
Vulkan total 16.5 GiB after tuning
Storage 475 GB NVMe
OS Fedora 43, kernel 6.18.9, headless
TDP 220W board (inference: 130–155W, between jobs: 55–60W, true idle w/o model: ~35W)
BIOS Community-patched UEFI (not Samsung stock) — AMD BC-250 docs
CPU governor performance (stock schedutil causes LLM latency spikes)

Unified memory is your friend (but needs tuning)

CPU and GPU share the same 16 GB physical pool (UMA — Unified Memory Architecture). The 512 MB "dedicated framebuffer" reported by mem_info_vram_total is carved from the same physical memory — it's a BIOS reservation, not separate silicon. The rest is accessible as GTT (Graphics Translation Table).

UMA reality: On unified memory, "100% GPU offload" means the model weights and KV cache live in GTT-mapped pages that the GPU accesses directly — there's no PCIe copy. However, it's still the same physical RAM the CPU uses. "Fallback to CPU" on UMA isn't catastrophic like on discrete GPUs (no bus transfer penalty), but GPU ALUs are faster than CPU ALUs for matrix ops.

Two bottlenecks must be fixed:

  1. GTT cap — the amdgpu driver defaults GTT to 50% of RAM (~7.4 GiB). The legacy fix was amdgpu.gttsize=14336 on the kernel cmdline, but it is no longer needed.
  2. TTM pages_limit — kernel TTM memory manager independently caps allocations at ~7.4 GiB. Fix: ttm.pages_limit=4194304 (16 GiB in 4K pages). This is the only tuning needed.

GTT migration complete: amdgpu.gttsize was removed from kernel cmdline. With ttm.pages_limit=4194304 alone, GTT grew from 14→16 GiB and Vulkan available from 14.0→16.5 GiB. The deprecated parameter was actually limiting the allocation.

After tuning: Vulkan sees 16.5 GiB — enough for 14B parameter models at 40K context with Q4_0 KV cache, all inference on GPU.


2. Driver & Compute Stack

The BC-250's GFX1013 sits awkwardly between supported driver tiers.

Layer Status Notes
amdgpu kernel driver ✅ Auto-detected, firmware loaded
Vulkan (RADV/Mesa) ✅ Mesa 25.3.4, Vulkan 1.4.328
ROCm / HIP ❌ rocblas_abort() — GFX1013 not in GPU list
OpenCL (rusticl) ❌ Mesa's rusticl doesn't expose GFX1013

Why ROCm fails: GFX1013 is listed in LLVM as supporting rocm-amdhsa, but AMD's ROCm userspace (rocBLAS/Tensile) doesn't ship GFX1013 solution libraries. Vulkan is the only viable GPU compute path.

▸ Verification commands
vulkaninfo --summary
# → GPU0: AMD BC-250 (RADV GFX1013), Vulkan 1.4.328, INTEGRATED_GPU

cat /sys/class/drm/card1/device/mem_info_vram_total   # → 536870912 (512 MB)
cat /sys/class/drm/card1/device/mem_info_gtt_total    # → 15032385536 (14 GiB with legacy amdgpu.gttsize=14336; 16 GiB after the TTM-only tuning in §3.2)

3. Ollama + Vulkan Setup

3.1 Install and enable Vulkan

curl -fsSL https://ollama.com/install.sh | sh

# Enable Vulkan backend (disabled by default)
sudo mkdir -p /etc/systemd/system/ollama.service.d
cat <<EOF | sudo tee /etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment=OLLAMA_VULKAN=1
Environment=OLLAMA_KEEP_ALIVE=30m
Environment=OLLAMA_MAX_LOADED_MODELS=1
Environment=OLLAMA_FLASH_ATTENTION=1
Environment=OLLAMA_GPU_OVERHEAD=0
Environment=OLLAMA_CONTEXT_LENGTH=16384
Environment=OLLAMA_MAX_QUEUE=4
OOMScoreAdjust=-1000
EOF
sudo systemctl daemon-reload && sudo systemctl restart ollama

OOMScoreAdjust=-1000 protects Ollama from the OOM killer — the model process must survive at all costs (see §3.4).

ROCm will crash during startup — expected and harmless. Ollama catches it and uses Vulkan.

3.2 Tune GTT size

No longer needed. The amdgpu.gttsize parameter has been removed. With ttm.pages_limit=4194304 alone, GTT allocates 16 GiB (more than the old 14 GiB). Verify:

cat /sys/class/drm/card1/device/mem_info_gtt_total  # → 17179869184 (16 GiB)
# If you still have amdgpu.gttsize in cmdline, remove it:
sudo grubby --update-kernel=ALL --remove-args="amdgpu.gttsize=14336"

3.3 Tune TTM pages_limit ← unlocks 14B models

This is the key fix. Without it, 14B models load fine but return HTTP 500 during inference.

# Runtime (immediate)
echo 4194304 | sudo tee /sys/module/ttm/parameters/pages_limit
echo 4194304 | sudo tee /sys/module/ttm/parameters/page_pool_size

# Persistent
echo "options ttm pages_limit=4194304 page_pool_size=4194304" | \
  sudo tee /etc/modprobe.d/ttm-gpu-memory.conf
printf "w /sys/module/ttm/parameters/pages_limit - - - - 4194304\n\
w /sys/module/ttm/parameters/page_pool_size - - - - 4194304\n" | \
  sudo tee /etc/tmpfiles.d/gpu-ttm-memory.conf
sudo dracut -f

3.4 Context window — the main gotcha

Ollama allocates KV cache based on the model's declared context window. Without a cap, large models request more KV cache than the BC-250 can handle, causing TTM fragmentation, OOM kills, or deadlocks on this UMA system.

Fix: Set OLLAMA_CONTEXT_LENGTH=16384 in the Ollama systemd override (see §3.1). This caps all inference to 16K context by default — matching the MoE primary model's limit.

Individual requests can override with {"options": {"num_ctx": 65536}} when using qwen3.5:9b (which handles 65K). The cap only affects the default allocation.
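
A minimal sketch of such a per-request override against Ollama's HTTP API (stdlib only; endpoint and port are Ollama defaults, the prompt is a placeholder):

import json
import urllib.request

# Per-request context override: options.num_ctx beats OLLAMA_CONTEXT_LENGTH
# for this call only. Model and value follow the 9B example above.
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps({
        "model": "qwen3.5:9b",
        "prompt": "Summarize the attached report ...",
        "stream": False,
        "options": {"num_ctx": 65536},
    }).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])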

History of context tuning:

Date Context Cap Primary Model Why
Feb 2026 40960 qwen3:14b Default — caused deadlocks (TTM fragmentation)
Feb 25 24576 qwen3:14b Sweet spot: ~27 tok/s, 26K was 10% slower, 28K+ deadlocked
Mar 14 16384 qwen3.5-35b-a3b MoE MoE maxes at 16K (KV cache exceeds VRAM at 24K+). 9B fallback can go to 65K per-request.

Why 24K → 16K? The 35B MoE's total weight (11 GB GGUF) is larger than qwen3:14b (9.3 GB). At 24K+ context the KV cache can't fit alongside the MoE weights. 16K is the maximum stable context for the MoE with all layers on GPU. See §4.3 for detailed KV cache scaling.

3.5 Swap — NVMe-backed safety net

With the model consuming 11+ GB of the 16 GB unified pool, disk swap is essential for surviving inference peaks.

NVMe wear concern: Swap is a safety net, not an active paging target. In steady state, swap usage is ~400 MB (OS buffers pushed out to make room for model weights). SMART data after months of 24/7 operation: 3% wear, 25.4 TB total written. The model runs entirely in RAM — swap catches transient spikes during model load/unload transitions. Consumer NVMe drives rated for 300–600 TBW will last years at this rate.

# Create 16 GB swap file (btrfs: set +C on an empty file, and use dd, not fallocate)
sudo touch /swapfile
sudo chattr +C /swapfile   # disable btrfs copy-on-write (must be set before writing data)
sudo dd if=/dev/zero of=/swapfile bs=1M count=16384 status=progress
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon -p 10 /swapfile

# Make permanent
echo '/swapfile none swap sw,pri=10 0 0' | sudo tee -a /etc/fstab

Disable/reduce zram — zram compresses pages in physical RAM, competing with the model:

sudo mkdir -p /etc/systemd/zram-generator.conf.d
echo -e '[zram0]\nzram-size = 2048' | sudo tee /etc/systemd/zram-generator.conf.d/small.conf
# Or disable entirely: zram-size = 0

3.6 Verify

sudo journalctl -u ollama -n 20 | grep total
# → total="11.1 GiB" available="11.1 GiB"  (with qwen3-14b-16k)
free -h
# → Swap: 15Gi total, ~1.4Gi used

3.7 Disable GUI (saves ~1 GB)

sudo systemctl set-default multi-user.target && sudo reboot

3.8 CPU governor — lock to performance

The stock schedutil governor down-clocks during idle, causing 50–100ms latency spikes at inference start. Lock all cores to full speed:

# Runtime (immediate)
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

# Persistent (systemd-tmpfiles)
echo 'w /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor - - - - performance' | \
  sudo tee /etc/tmpfiles.d/cpu-governor.conf

Memory layout after tuning

16 GB Unified Memory

Region Size Notes
VRAM carveout 512 MB BIOS-reserved from UMA pool (not separate memory)
GTT 16 GiB Tuned via ttm.pages_limit=4194304 (default 7.4 GiB). amdgpu.gttsize removed — no longer needed.
TTM pages_limit 16 GiB ttm.pages_limit=4194304 — the only memory tuning parameter needed
Vulkan heap Size
Device-local 8.33 GiB
Host-visible 8.17 GiB
Total 16.5 GiB → 14B models fit, all inference on GPU (UMA — same physical pool)
Consumer Usage Notes
Model weights (qwen3:14b) 8.2 GiB GPU + 0.4 GiB CPU Q4_K_M quantization
KV cache (FP16 @ 24K) 3.8 GiB With Q4_0: only 1.8 GiB for 40K context
Compute graph 0.17 GiB GPU-side
signal-cli + queue-runner ~1.0 GiB System RAM
OS + services ~0.9 GiB Headless Fedora 43
NVMe swap 16 GiB (374 MB used) Safety net
zram 0 B (allocated, not active) Device exists but disksize=0
Total loaded 12.5 GiB (FP16) / 10.6 GiB (Q4_0) 3.9–5.9 GiB free

4. Models & Benchmarks

4.1 Compatibility table

Ollama 0.18.0 · Vulkan · RADV Mesa 25.3.4 · 16.5 GiB Vulkan · FP16 KV

Model Params Quant tok/s Prefill Max Ctx VRAM @4K Status
qwen3.5-35b-a3b-iq2m 35B/3B UD-IQ2_M 38 233 16K 12.3 GiB 🏆 Primary — MoE
qwen3.5:9b 9.7B Q4_K_M 32 230 65K 8.6 GiB 🏆 Best context+vision
qwen2.5:3b 3.1B Q4_K_M 104 515 64K 3.4 GiB ✅ Fast, lightweight
qwen2.5:7b 7.6B Q4_K_M 56 248 64K 6.5 GiB ✅ Great quality/speed
qwen2.5-coder:7b 7.6B Q4_K_M 56 246 64K 6.4 GiB ✅ Code-focused
llama3.1:8b 8.0B Q4_K_M 52 246 48K 11.0 GiB ✅ Fast 8B
mannix/llama3.1-8b-lexi 8.0B Q4_0 51 308 48K 10.6 GiB ✅ Uncensored 8B
huihui_ai/seed-coder-abliterate 8.3B Q4_K_M 52 231 64K 9.1 GiB ✅ Code gen, uncensored
qwen3:8b 8.2B Q4_K_M 44 251 64K 9.8 GiB ✅ Thinking mode
huihui_ai/qwen3-abliterated:8b 8.2B Q4_K_M 46 250 64K 9.7 GiB ✅ Abliterated 8B
gemma2:9b 9.2B Q4_0 38 219 48K 9.2 GiB ✅ Fixed! (was 91% before GTT fix)
mistral-nemo:12b 12.2B Q4_0 34 137 24K 10.8 GiB ⚠️ 32K deadlocks
qwen3:14b 14.8B Q4_K_M 27 131 24K 13.5 GiB ✅ Previous primary
huihui_ai/qwen3-abliterated:14b 14.8B Q4_K_M 28 137 24K 11.4 GiB ✅ Abliterated
phi4:14b 14.7B Q4_K_M 29 128 40K 11.8 GiB 🏆 Best 14B context
Qwen3-30B-A3B (Q2_K) 30.5B Q2_K 61 16K 11.5 GiB ⚠️ MoE fast, heavy quant
qwen3.5-27b-iq2m 26.9B IQ2_M 0 13.5 GiB ❌ Non-functional¹

All models run 100% on GPU after GTT tuning (16 GiB). Before the fix, gemma2:9b was only 91% GPU-offloaded (26 tok/s → 38 tok/s after fix).

¹ Why 27B dense fails: The dense architecture requires all 27B parameters in every forward pass. Without matrix cores (GFX1013 has none), each token requires ~27B multiplications through general-purpose shader cores. Result: 0 tokens generated in 5 minutes. The 35B MoE with only 3B active params per token avoids this entirely — compute is ~9× less per token despite having more total knowledge stored.
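
The ratio can be sanity-checked with one line of arithmetic (a first-order sketch: one multiply-accumulate per active weight per token, ignoring attention and routing overhead):

dense_params = 27e9  # qwen3.5-27b: every weight touched each token
moe_active = 3e9     # qwen3.5-35b-a3b: ~3B active parameters per token

print(f"per-token compute ratio: {dense_params / moe_active:.0f}x")  # → 9x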

Prefill column: Measured at ~400 tokens prompt size (warm model, FP16 KV). Prefill rate depends on prompt length — see §4.5 for detailed sweep. Smaller models (3B) saturate the GPU compute and achieve higher prefill. Larger models (14B) are memory-bandwidth-limited at ~128–137 tok/s. MoE and 9B land between at ~230 tok/s — the MoE benefits from only loading 3B active expert weights per token during prefill. Qwen3-30B-A3B and qwen3.5-27b not measured (deprecated/non-functional).

March 14 — Qwen3.5 era: Ollama upgraded 0.16.1→0.18.0 (required for Qwen3.5). The qwen3.5-35b-a3b MoE (35B total, 3B active per token) at IQ2_M quantization is now the primary model on BC-250: 38 tok/s, 233 tok/s prefill, 16K context, multimodal (vision+tools+thinking). The qwen3.5:9b provides 65K context with vision when longer documents are needed. Both are Qwen3.5 architecture — a newer generation than Qwen3.

⚠️ IQ2_M quality tradeoff: The extreme quantization (~2.5 bits per parameter) is a significant quality compromise — perplexity increases and complex mathematical reasoning degrades compared to higher-precision quantizations. For everyday tasks (summarization, JSON extraction, tool use, chat) the quality is adequate. For tasks requiring precise reasoning, the qwen3.5:9b fallback (Q4_K_M, ~4.5 bits) provides substantially better accuracy. This is an informed tradeoff: more knowledge at lower precision vs less knowledge at higher precision.

4.2 Benchmark visualization

Generation speed (tok/s) — higher is better:

Model                    tok/s    Max Ctx   ██ = 10 tok/s
─────────────────────────────────────────────────────────
qwen2.5:3b               104      64K  ██████████▌
Qwen3-30B-A3B Q2_K        61      16K  ██████▏
qwen2.5:7b                56      64K  █████▌
qwen2.5-coder:7b          56      64K  █████▌
llama3.1:8b                52      48K  █████▏
seed-coder-abl:8b          52      64K  █████▏
lexi-8b (uncensored)      51      48K  █████
qwen3-abl:8b              46      64K  ████▌
qwen3:8b                  44      64K  ████▍
★ qwen3.5-35b-a3b MoE     38      16K  ███▊  ← PRIMARY (35B/3B)
gemma2:9b                 38      48K  ███▊
mistral-nemo:12b          34      24K  ███▍
★ qwen3.5:9b               32      65K  ███▏  ← best ctx + vision
phi4:14b                  29      40K  ██▉
qwen3-abl:14b             28      24K  ██▊
qwen3:14b                 27      24K  ██▋
qwen3.5-27b (dense)        0       —   ❌ non-functional

Context ceiling per model (FP16 KV, all GPU):

Model            16K  24K  32K  48K  64K
──────────────────────────────────────────
qwen2.5:3b        ✅   ✅   ✅   ✅   ✅
qwen2.5:7b        ✅   ✅   ✅   ✅   ✅
qwen2.5-coder:7b  ✅   ✅   ✅   ✅   ✅
qwen3:8b          ✅   ✅   ✅   ✅   ✅
qwen3-abl:8b      ✅   ✅   ✅   ✅   ✅
seed-coder:8b     ✅   ✅   ✅   ✅   ✅
★ qwen3.5:9b      ✅   ✅   ✅   ✅   ✅
llama3.1:8b       ✅   ✅   ✅   ✅   ❌
lexi-8b           ✅   ✅   ✅   ✅   ❌
gemma2:9b         ✅   ✅   ✅   ✅   —
mistral-nemo:12b  ✅   ✅   ❌   —    —
qwen3:14b         ✅   ✅   ❌   —    —
qwen3-abl:14b     ✅   ✅   ❌   —    —
phi4:14b          ✅   ✅   ✅   —    —
★ 35B-A3B iq2m    ✅   ❌   —    —    —
30B-A3B Q2_K      ✅   ❌   —    —    —
qwen3.5-27b iq2m  ❌   —    —    —    —

4K and 8K columns omitted — every model passes at those sizes.

✅ = works 100% GPU | ❌ = timeout/deadlock | — = not tested (too large)

Key insight: Speed is constant across context sizes with FP16 KV (speed only degrades when the context is actually filled — see §4.4). The context ceiling is purely a memory constraint: weights + KV cache + compute graph must fit in 16.5 GiB.

Graphical benchmarks (charts in the repo): generation speed, prefill speed, and generation vs prefill for all models side by side.

4.3 Context window experiments

The context window directly controls KV cache size, and on 16 GB unified memory, every megabyte counts. After v7 (OpenClaw removal freed ~700 MB, GTT bumped to 14 GB), we re-tested all context sizes systematically:

Context window vs memory (qwen3:14b Q4_K_M, flash attention, 16 GB GTT)

Context RAM Used Free Swap Speed Status
8192 ~9.5 GB 6.5 GB ~27 t/s ✅ Safe
12288 ~10.3 GB 5.7 GB ~27 t/s ✅ Conservative
16384 ~11.1 GB 4.9 GB ~27 t/s ✅ Comfortable
18432 ~13.2 GB 2.7 GB 0.9 GB 26.8 t/s ✅ Works
20480 ~13.7 GB 2.3 GB 0.9 GB 26.8 t/s ✅ Works
22528 ~14.0 GB 2.0 GB 0.9 GB 26.7 t/s ✅ Works
24576 ~14.4 GB 1.5 GB 0.9 GB 26.7 t/s ✅ Max for qwen3:14b
26624 ~14.6 GB 1.3 GB 1.0 GB 23.9 t/s ⚠️ 10% slower
28672 ~14.2 GB 1.7 GB timeout ❌ Deadlocks
32768 ~15.7 GB 0.2 GB 2.1 GB timeout ❌ Deadlocks
40960 ~16.0 GB 0 💀 TTM fragmentation¹

24K is the sweet spot — full speed (~27 tok/s), leaves ~1.5 GB for OS/services with stable swap at 0.9 GB. 26K works but inference drops 10% due to swap pressure. 28K+ deadlocks under Vulkan.

¹ Why 40K fails isn't raw OOM. The math: 9.3 GB weights + 2 GB KV cache + 1 GB OS ≈ 12.3 GB < 16 GB available. The actual failure is TTM fragmentation — the kernel's TTM memory manager can't allocate a contiguous block large enough for the KV cache because physical pages are fragmented across GPU and CPU consumers. This is a UMA-specific problem: on discrete GPUs with dedicated VRAM, fragmentation doesn't cross the PCIe boundary.

History: The original 24K experiment (Feb 25) deadlocked because OpenClaw gateway consumed ~700 MB. After v7 removed OpenClaw and bumped GTT to 14 GB (Mar 5), 24K became stable. Flash attention (OLLAMA_FLASH_ATTENTION=1) is essential — without it, 24K would not fit.

4.4 KV cache quantization — breaking the context ceiling

UPDATE: KV cache quantization WORKS on Vulkan. Our README previously stated it was a no-op — that was wrong. Tested on Ollama 0.16.1 + RADV Mesa 25.3.4:

KV Type 24K ctx 32K ctx 48K ctx KV Cache Size @24K Gen tok/s Notes
FP16 (default) ⚠️ 10% slow ❌ deadlock ❌ ~3.8 GiB 27.2 Current production
Q8_0 ✅ ✅ ✅ 2.0 GiB 27.3 Conservative upgrade
Q4_0 ✅ ✅ ⚠️ caps at ~40K 1.1 GiB 27.3 ← recommended

KV cache scaling (Q4_0): ~45 MiB per 1K tokens (16K=720M, 24K=1.1G, 40K=1.8G).
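
A quick sketch of that scaling rule (numbers from this section, not a general formula):

KV_MIB_PER_1K_Q4 = 45  # measured on this setup: ~45 MiB per 1K tokens (Q4_0)

for ctx in (16_384, 24_576, 40_960):
    kv_gib = ctx / 1024 * KV_MIB_PER_1K_Q4 / 1024
    print(f"{ctx:>6} tokens -> {kv_gib:.1f} GiB KV")  # 0.7 / 1.1 / 1.8 GiB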

Extreme context tests (Q4_0): Ollama's scheduler auto-sizes KV to what fits in VRAM. With 14.5 GiB available, model weights 8.2 GiB, the maximum KV allocation is ~40K tokens (1.8 GiB). Requesting larger num_ctx is accepted but the runner silently caps and truncates prompts to the actual KV limit.

Generation speed degrades with context fill (Q4_0, all layers on GPU):

Tokens in context Gen tok/s Prefill tok/s Notes
~100 (empty) 27.2 58 Headline number
3,300 24.6 113 Typical Signal chat
10,000 20.7 70 Long job output
30,000 13.4 53 Heavy document analysis
40,960 (max fill) ~10* ~42 Theoretical, near KV limit

* Estimated from degradation curve. One test at 41K showed 1.2 tok/s, but that was caused by model partial offload (21/41 layers spilled to CPU), not normal operation.

Q8_0 ceiling: Fits up to ~64K context on GPU. At 80K, KV cache spills to CPU (7 tok/s — unusable). Non-deterministic — depends on memory state at load time.

Not deploying to production. MoE model (primary) is capped at 16K context — KV quantization provides no benefit (bottleneck is weight size, not KV). Potentially useful for the 9B fallback model at 40K+ context, but not worth the quality risk.

# If ever needed for 9B model at extreme context:
# Environment=OLLAMA_KV_CACHE_TYPE=q4_0
# in /etc/systemd/system/ollama.service.d/override.conf

Current production: FP16 KV (Ollama default). Context capped at 16K for MoE via OLLAMA_CONTEXT_LENGTH=16384.

4.5 Prefill (prompt evaluation) benchmarks

On UMA, prefill and generation contend for the same unified memory bandwidth. Prefill is the time the model spends "reading" the prompt before generating the first token.

For embedded engineers: Think of LLM inference as two phases — like a bootloader and a main loop. Prefill is the "bootloader": the model processes the entire input prompt in one burst (parallel, compute-bound — like DMA-ing a firmware image into SRAM). Token generation is the "main loop": the model produces output tokens one at a time, sequentially (memory-bandwidth-bound — like polling a UART at a fixed baud rate). MoE (Mixture of Experts) is like having 35 specialized ISRs but only routing to 3 of them per interrupt — you get the routing intelligence of knowing all 35, but only pay the execution cost of 3. That's why a 35B-parameter MoE runs faster than a 14B dense model on hardware without matrix cores.

Prefill rate vs prompt size — production models (FP16 KV cache, warm):

qwen3.5-35b-a3b-iq2m (MoE 35B/3B active, UD-IQ2_M):

Prompt Size Tokens Prefill Gen tok/s TTFT (warm)
Tiny 17 53 tok/s 39.3 0.3s
Short 42 68 tok/s 39.6 0.6s
Medium 384 231 tok/s 38.5 1.7s
Long 1,179 228 tok/s 38.3 5.2s

qwen3.5:9b (Q4_K_M, dense 9.7B):

Prompt Size Tokens Prefill Gen tok/s TTFT (warm)
Tiny 17 61 tok/s 33.2 0.3s
Short 42 118 tok/s 33.0 0.4s
Medium 384 229 tok/s 33.0 1.7s
Long 1,179 225 tok/s 32.5 5.2s

Observations: Both production models converge to ~230 tok/s prefill at medium-to-long prompts — the memory bandwidth ceiling. At tiny prompts (<50 tokens), GPU compute overhead dominates and prefill drops to 53–61 tok/s. Generation rate is stable: MoE holds 38–39 tok/s, 9B holds 32–33 tok/s regardless of prompt size. TTFT scales linearly: at 384 tokens it's ~1.7s, at 1.2K tokens it's ~5.2s. For real-world Signal chat (3K system prompt + conversation), expect TTFT of ~15–20s on cold start, <2s when the model is warm (prompt cached via OLLAMA_KEEP_ALIVE=30m).

Historical: qwen3:14b Q4_K_M (previous primary, 24K context)
Prompt Size Tokens Prefill Gen tok/s TTFT (warm)
Tiny 86 88 tok/s 27.2 ~1s
Short 353 67 tok/s 27.2 ~5s
Medium 1,351 128 tok/s 26.1 ~11s
Long 3,354 113 tok/s 24.6 ~30s
XL 6,686 88 tok/s 22.5 ~76s
Massive 10,014 70 tok/s 20.7 ~143s

Generation rate degrades with context: 27.2 tok/s @small → 20.7 tok/s @10K tokens.

Graphical charts (in the repo): prefill and generation rate vs prompt size, and a model-landscape bubble chart — generation speed × prefill speed × max context (bubble size = context window, one color per model).

4.6 Memory budget

qwen3.5-35b-a3b-iq2m · headless server (from Ollama logs)

Component MoE @4K ctx MoE @16K ctx Notes
Model weights (GPU) 10.3 GiB ~8.2 GiB 41/41 layers on Vulkan0; spills to CPU at higher ctx
Model weights (CPU) 0.3 GiB ~0.4 GiB Spilled layers + embeddings
KV cache (GPU) 1.6 GiB ~3.8 GiB Grows ~0.2 GiB per 1K tokens
Compute graph ~0.2 GiB ~0.2 GiB GPU-side
Ollama total 12.3 GiB ~12.5 GiB Ollama dynamically spills weights to make room for KV
OS + services ~0.9 GiB ~0.9 GiB Headless Fedora 43
Free (of 16.5 Vulkan) ~4.2 GiB ~4.0 GiB
NVMe swap 16 GiB Safety net

MoE memory dynamics: As context grows, Ollama intelligently spills weight layers from GPU to CPU to maintain a ~12.5 GiB total. The MoE's total weight (11 GB GGUF) is larger than qwen3:14b (9.3 GB), but only 3B params activate per token — so CPU-spilled layers that aren't selected experts cause zero compute penalty. At 24K+ context, the KV cache exceeds what can fit alongside the weights, causing OOM or timeout.

4.7 Model recommendations

Qwen3.5 is the latest generation — multimodal (vision + tools + thinking), Apache 2.0.

Use Case Recommended Model tok/s Max Ctx Why
🏆 General AI / primary qwen3.5-35b-a3b-iq2m 38 16K 35B knowledge, 3B active, fastest reasoning
🏆 Long context / vision qwen3.5:9b 32 65K Multimodal, stable context scaling, vision
Long context (14B) phi4:14b 29 40K Best 14B model for long context on this hardware
Fast batch jobs qwen2.5:7b 56 64K 2× faster than 14B, 64K context
Code generation qwen2.5-coder:7b 56 64K Same speed as base, code-specialized
Speed-critical qwen2.5:3b 104 64K 4× faster, use for simple tasks
Previous primary qwen3:14b (abliterated) 28 24K Replaced by Qwen3.5 models

Production dual-model config: qwen3.5-35b-a3b-iq2m as primary with OLLAMA_CONTEXT_LENGTH=16384. For tasks needing >16K context or vision (image analysis), switch to qwen3.5:9b which handles 65K context and can process images.

The MoE wins over the 9B dense model in generation speed (38 vs 32 tok/s) because only 3B parameters activate per token on hardware without matrix cores — fewer multiplications wins. Both models achieve similar prefill rates (~230 tok/s at ~400 tokens), but the 9B wins in context capacity (65K vs 16K) because its smaller total weight leaves more room for KV cache.

# Primary model (35B MoE) — custom GGUF via Modelfile
# See tmp/Modelfile-qwen35-35b-a3b for setup
ollama create qwen3.5-35b-a3b-iq2m -f Modelfile-qwen35-35b-a3b

# High-context model (vision+65K, official Ollama)
ollama pull qwen3.5:9b

# Context is capped via OLLAMA_CONTEXT_LENGTH=16384 in systemd (see §3.1, §3.4)
# Individual requests can override with {"options": {"num_ctx": 65536}} when using 9b

Why not a bigger MoE? Even though only 3B params activate per token, all 35B params must reside in memory — the router decides per-token which experts to fire, so every weight must be loaded. At IQ2_M (~2.5 bits per parameter), 35B = 11 GB GGUF. The next MoE up — Qwen3-235B-A22B — would be ~73 GB at IQ2_M (4.4× too large). Mixtral 8×22B (141B) would be ~44 GB. Going below IQ2_M (e.g. IQ1_S at ~1.5 bits) causes quality collapse. The qwen3.5-35b-a3b at IQ2_M is the largest MoE that fits 16 GB with usable quantization on this hardware.
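
The arithmetic behind those estimates, assuming a flat ~2.5 bits per parameter for IQ2_M (real GGUF files mix tensor precisions, so treat these as ballpark figures):

def gguf_gb(params_billions, bits_per_param=2.5):
    # size in GB ≈ params × bits / 8
    return params_billions * bits_per_param / 8

for name, params in [("qwen3.5-35b-a3b", 35),
                     ("Mixtral 8x22B", 141),
                     ("Qwen3-235B-A22B", 235)]:
    print(f"{name:<16} ~{gguf_gb(params):.0f} GB at IQ2_M")
# -> ~11 GB (fits) / ~44 GB / ~73 GB (both far beyond 16 GB)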


PART II — AI Stack

5. Signal Chat Bot

The BC-250 runs a personal AI assistant accessible via Signal messenger — no gateway, no middleware. signal-cli runs as a standalone systemd service exposing a JSON-RPC API, and queue-runner handles all LLM interaction directly.

  Signal --> signal-cli (JSON-RPC :8080) --> queue-runner --> Ollama --> GPU (Vulkan)

Software: signal-cli v0.13.24 (native binary) · Ollama 0.18+ · queue-runner v7

5.1 Why not OpenClaw

OpenClaw was the original gateway (v2026.2.26, Node.js). It was replaced because:

Problem Impact
~700 MB RSS On a 16 GB system, that's 4.4% of RAM wasted on a routing layer
15+ second overhead per job Agent turn setup, tool resolution, system prompt injection — for every cron job
Unreliable model routing Fallback chains and timeout cascades caused 5-min "fetch failed" errors
No subprocess support Couldn't run Python/bash scripts directly — had to shell out through the agent
9.6K system prompt Couldn't be trimmed below ~4K tokens without breaking tool dispatch
Orphan processes signal-cli children survived gateway OOM kills, holding port 8080

The replacement: queue-runner talks to signal-cli and Ollama directly via HTTP APIs. Zero middleware.

See Appendix A for the original OpenClaw configuration.

5.2 signal-cli service

signal-cli runs as a standalone systemd daemon with JSON-RPC:

# /etc/systemd/system/signal-cli.service
[Unit]
Description=signal-cli JSON-RPC daemon
After=network.target

[Service]
Type=simple
ExecStart=/opt/signal-cli/bin/signal-cli --output=json \
  -u +<BOT_PHONE> daemon --http 127.0.0.1:8080
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target

Register a separate phone number for the bot via signal-cli register or signal-cli link.

5.3 Chat architecture

Between every queued job, queue-runner.py polls the signal-cli journal for incoming messages. Messages are routed based on content type:

queue-runner v7 — continuous loop

  job N  →  check Signal inbox  →  route message  →  job N+1
                    |                     |
                    v                     |
            journalctl -u          ┌──────┼──────┐
            signal-cli             │      │      │
                                audio  image   text
                                   │      │      │
                                   v      v      v
                              whisper  qwen3.5  choose_model()
                              -cli     :9b      MoE or 9B
                              (Vulkan) vision   ↓
                                   │      │   Ollama /api/chat
                                   │      │      │
                                   v      v      v
                              signal-cli: send reply


Key parameters:

Setting Value Purpose
SIGNAL_CHAT_CTX 16384 MoE model context window
VISION_MODEL qwen3.5:9b Vision analysis model (multimodal)
VISION_CTX 4096 Vision context (image tokens are large)
ROUTING_TOKEN_THRESHOLD 8000 Switch to 9B for long prompts
SIGNAL_CHAT_MAX_EXEC 3 Max shell commands per message
SIGNAL_EXEC_TIMEOUT_S 30 Per-command timeout
SIGNAL_MAX_REPLY 1800 Signal message character limit

5.4 Tool use — EXEC

The LLM can request shell commands via EXEC(command) in its response. queue-runner intercepts these, runs them, feeds stdout back into the conversation, and lets the LLM synthesize a final answer:

  User: "what's the disk usage?"
  LLM:  [thinking...] EXEC(df -h /)
  Runner: executes → feeds output back
  LLM:  "Root is 67% full, 48G free on your 128GB NVMe."

Supported patterns: web search (ddgr), file reads (cat, head), system diagnostics (journalctl, systemctl, df, free), data queries (jq on JSON files). Up to 3 commands per turn.
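
A minimal sketch of that interception loop (the regex, command cap, and timeout mirror the §5.3 table; the real queue-runner may differ in detail):

import re
import subprocess

EXEC_RE = re.compile(r"EXEC\((.+?)\)")

def run_exec_requests(llm_reply, max_cmds=3, timeout_s=30):
    """Run up to max_cmds EXEC() commands and return their stdout for the LLM."""
    outputs = []
    for cmd in EXEC_RE.findall(llm_reply)[:max_cmds]:
        try:
            result = subprocess.run(cmd, shell=True, capture_output=True,
                                    text=True, timeout=timeout_s)
            outputs.append(result.stdout)
        except subprocess.TimeoutExpired:
            outputs.append(f"[killed after {timeout_s}s]")
    return outputs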

5.5 Image generation via chat

When the LLM detects an image request, it emits EXEC(/opt/stable-diffusion.cpp/generate-and-send "prompt"). queue-runner intercepts this pattern and handles it synchronously:

  1. Stop Ollama (free GPU VRAM)
  2. Run sd-cli with FLUX.2-klein-9B (4 steps, 512×512, ~105s)
  3. Send image as Signal attachment
  4. Restart Ollama

Bot is offline during generation (~2–3 minutes total including model reload).
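
A sketch of that stop → generate → restart sequence, reusing the flags from §6.2's production command (the generate-and-send helper wraps roughly this; the GFX1013 hang workaround from the warning below is omitted here):

import subprocess

def generate_image(prompt, out="/tmp/signal-image.png"):
    subprocess.run(["systemctl", "stop", "ollama"], check=True)  # free GPU memory
    try:
        subprocess.run([
            "/opt/stable-diffusion.cpp/build/bin/sd-cli",
            "--diffusion-model", "models/flux2/flux-2-klein-9b-Q4_0.gguf",
            "--vae", "models/flux2/flux2-vae.safetensors",
            "--llm", "models/flux2/qwen3-8b-Q4_K_M.gguf",
            "-p", prompt, "--cfg-scale", "1.0", "--steps", "4",
            "-H", "512", "-W", "512", "--offload-to-cpu", "--diffusion-fa",
            "-o", out,
        ], cwd="/opt/stable-diffusion.cpp", check=True)
    finally:
        subprocess.run(["systemctl", "start", "ollama"], check=True)  # bot back online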

Image editing (Kontext): Send a photo to Signal with an edit instruction ("make it cyberpunk", "add a hat"). The LLM emits EXEC(/opt/stable-diffusion.cpp/edit-image "instruction"), queue-runner runs FLUX.1-Kontext-dev with the photo as reference, and sends back the edited image (~5–10 min @512²). Input images are automatically resized to 512×512. See §6.2 for a demo (Sonic → Shadow the Hedgehog).

Video generation: Ask for a video/animation. Uses WAN 2.1 T2V 1.3B (~38 min for 17 frames @480×320).

ESRGAN upscale: Every generated image is automatically upscaled 4× with RealESRGAN_x4plus (512²→2048² in ~25s). Both versions sent via Signal — thumbnail + full-res. Also available on-demand via chat.

⚠️ GFX1013 bug: sd-cli hangs after writing the output image (Vulkan cleanup). queue-runner polls for the file and kills the process.
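
A sketch of that workaround: poll for the finished output file, then kill the hung sd-cli. The settle delay and timeout here are assumptions, not measured values.

import os
import time

def wait_and_kill(proc, out_path, settle_s=5, timeout_s=900):
    """proc: subprocess.Popen handle for sd-cli, which never exits on GFX1013."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        if os.path.exists(out_path) and os.path.getsize(out_path) > 0:
            time.sleep(settle_s)  # let the PNG finish flushing to disk
            proc.kill()
            return True
        time.sleep(2)
    proc.kill()  # give up: no output produced in time
    return False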

5.6 Personality — "Clawd"

The system prompt defines a cynical, darkly funny personality ("House MD meets a sysadmin lobster"). Key traits:

  • Direct, no corporate speak, no disclaimers
  • Dark humor about the hardware constraints
  • Full access to /opt/netscan/data/ for grounded answers
  • Knows AK's professional context (embedded Linux, camera drivers, V4L2/libcamera)
  • Uncensored creative generation (abliterated model, no safety filters)

The personality is baked into queue-runner.py's SYSTEM_PROMPT — no external workspace files needed.

5.7 Response times

Scenario Latency
Text reply (warm) 10–30s
Complex reasoning with tool use 30–90s
Image generation (FLUX.2-klein-9B 512²) ~105s
Image generation + auto-upscale 4× ~130s
Image editing (Kontext 512²) ~5 min
Video generation (WAN 2.1 480×320) ~38 min
ESRGAN 4× upscale (on-demand) ~25s
Cold start (model reload) 30–60s
Voice note transcription (≤40s) 3–5s
Vision analysis (photo → description) ~40–80s

5.8 Vision analysis

Send a photo to Signal without an edit keyword (no "draw", "generate", "create") and the bot analyzes it using qwen3.5:9b's native multimodal vision. The 9B model processes base64-encoded images via Ollama's /api/chat endpoint.

  User: [photo of a circuit board] "what chip is this?"
  Router: image + non-edit text → vision analysis (9B)
  9B:    "That's an STM32F407 — the LQFP-100 package, 168 MHz Cortex-M4."

How edit vs. analysis is decided:

Input Keywords detected Action
Photo + "make it cyberpunk" ✓ edit → Kontext image editing (§5.5)
Photo + "what is this?" → qwen3.5:9b vision analysis
Photo (no text) → qwen3.5:9b vision analysis

Key detail: qwen3.5:9b requires "think": false in the API call. With thinking enabled, the model produces only hidden thinking tokens and returns an empty visible response. Discovered via 7 iterative tests (tests 1–6 all returned empty content).

The MoE model (qwen3.5-35b-a3b-iq2m) has no vision capability — it returns HTTP 500 when given images. This is why model routing is essential.
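
A sketch of the vision call (Ollama's /api/chat accepts base64 images per message; "think": false is the flag this section describes):

import base64
import json
import urllib.request

with open("photo.jpg", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode()

req = urllib.request.Request(
    "http://localhost:11434/api/chat",
    data=json.dumps({
        "model": "qwen3.5:9b",
        "think": False,  # without this the reply is empty (hidden thinking only)
        "stream": False,
        "messages": [{"role": "user",
                      "content": "what chip is this?",
                      "images": [img_b64]}],
    }).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["message"]["content"])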

5.9 Audio transcription

Send a voice note to Signal and the bot transcribes it using whisper.cpp with Vulkan GPU acceleration:

  User: [voice note, 15 seconds, Polish]
  Router: audio/* → whisper-cli (auto language detection)
  Whisper: "Hej, sprawdź mi pogodę na jutro" (pl, 15.2s audio)
  Router: → feed transcription to LLM for response
  LLM:   "Jutro 18°C, częściowe zachmurzenie..."

Whisper setup on BC-250:

Component Value
Runtime whisper.cpp (Vulkan, built from source)
Model ggml-large-v3-turbo (1.6 GB)
Binary /opt/whisper.cpp/build/bin/whisper-cli
Threads 6 (all Zen 2 cores)
Language Auto-detect (EN/PL confirmed)
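
A sketch of the transcription step using the paths from the table above (the model filename is assumed from the standard whisper.cpp layout):

import subprocess

def transcribe(wav_path):
    """Run whisper-cli with auto language detection; returns plain text."""
    result = subprocess.run([
        "/opt/whisper.cpp/build/bin/whisper-cli",
        "-m", "/opt/whisper.cpp/models/ggml-large-v3-turbo.bin",  # assumed path
        "-f", wav_path,
        "-l", "auto",   # EN/PL auto-detect
        "-t", "6",      # all Zen 2 cores
        "-nt",          # no timestamps
    ], capture_output=True, text=True, check=True)
    return result.stdout.strip()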

Why large-v3-turbo, not large-v3?

Both models were benchmarked with real English TTS speech (flite) at three durations. The speed difference is modest (~2×), but memory is the dealbreaker — the larger model doesn't fit alongside Ollama in 16 GB.

Speed comparison:


Audio large-v3-turbo large-v3 Speedup
3.6s 3.3s 7.9s 2.4×
18.2s 3.5s 8.9s 2.6×
39.2s 4.3s 8.1s 1.9×

The memory problem:

The BC-250 has 16 GB total (UMA — shared between CPU and GPU). The Ollama MoE model takes 10.6 GB. OS and buffers need ~3.5 GB. That leaves the memory budget looking like this:


Scenario Ollama Whisper OS/buffers Total Fits 16 GB?
Ollama only 10.6 GB 3.5 GB 14.1 GB ✅ 1.9 GB free
+ large-v3-turbo 10.6 GB 1.6 GB 3.5 GB 15.7 GB ✅ 0.3 GB free
+ large-v3 10.6 GB 2.9 GB 3.5 GB 17.0 GB ❌ 1.0 GB overflow → swap

When the total exceeds 16 GB, the kernel pushes pages to NVMe swap. This shows up as a measurable swap delta:


large-v3 pushes ~1 GB into swap on first load. large-v3-turbo causes zero swap. Once pages are evicted, subsequent large-v3 runs may show 0 swap delta (the 39s test) because those pages were already swapped out by earlier runs — but the damage (swap pressure, latency spikes) already happened.

Quality is comparable. Both models tested on a 39s embedded-systems passage (flite TTS). Both made the same synthesis artifacts ("kilobots" for "kilobytes", "Wipcomer" for "libcamera"). Neither is clearly better on robotic TTS.

Verdict: large-v3-turbo — 2× faster, 45% smaller, zero swap pressure. The quality tradeoff is negligible on BC-250's memory budget.

5.10 Smart model routing

queue-runner automatically selects the best model for each message based on content:

def choose_chat_model(user_text, has_image=False):
    if has_image:
        return "qwen3.5:9b", 4096       # only model with vision
    if estimate_tokens(user_text) > 8000:
        return "qwen3.5:9b", 16384      # 9B handles 65K context
    return "qwen3.5-35b-a3b-iq2m", 16384  # MoE — faster, smarter


Route Model Speed When
Default qwen3.5-35b-a3b MoE 37.7 tok/s Normal chat (most messages)
Vision qwen3.5:9b 31.8 tok/s Photo attached (no edit keywords)
Long context qwen3.5:9b 31.8 tok/s Prompt > 8K tokens

The MoE activates only 3B of its 35B parameters per token, giving it faster generation than the dense 9B despite being a "larger" model. Both models are Qwen3.5-family and produce comparable text quality for short exchanges. The 9B is reserved for tasks that require vision or long context — capabilities the MoE lacks.


6. Image Generation

Stable Diffusion via stable-diffusion.cpp with native Vulkan backend.

▸ Build from source
sudo dnf install -y vulkan-headers vulkan-loader-devel glslc git cmake gcc gcc-c++ make
cd /opt && sudo git clone --recursive https://github.com/leejet/stable-diffusion.cpp.git
sudo chown -R $(whoami) /opt/stable-diffusion.cpp && cd stable-diffusion.cpp
mkdir -p build && cd build && cmake .. -DSD_VULKAN=ON -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)

6.1 Models

FLUX.2-klein-9B — recommended, best quality, Apache 2.0:

mkdir -p /opt/stable-diffusion.cpp/models/flux2 && cd /opt/stable-diffusion.cpp/models/flux2
# Diffusion model (9B, Q4_0, 5.3 GB)
curl -L -O "https://huggingface.co/leejet/FLUX.2-klein-9B-GGUF/resolve/main/flux-2-klein-9b-Q4_0.gguf"
# Qwen3-8B text encoder (Q4_K_M, 4.7 GB)
curl -L -o qwen3-8b-Q4_K_M.gguf "https://huggingface.co/unsloth/Qwen3-8B-GGUF/resolve/main/Qwen3-8B-Q4_K_M.gguf"
# FLUX.2 VAE (321 MB) — different from FLUX.1 VAE!
curl -L -o flux2-vae.safetensors "https://huggingface.co/Comfy-Org/vae-text-encorder-for-flux-klein-4b/resolve/main/split_files/vae/flux2-vae.safetensors"

Memory: 5.3 GB VRAM (diffusion) + 6.2 GB VRAM (Qwen3-8B encoder) + 95 MB (VAE) = ~11.8 GB total. Stresses the 16.5 GB Vulkan pool properly. Best quality of all tested models.

FLUX.2-klein-4B — fast alternative, Apache 2.0:

cd /opt/stable-diffusion.cpp/models/flux2
# Diffusion model (4B, Q4_0, 2.3 GB)
curl -L -O "https://huggingface.co/leejet/FLUX.2-klein-4B-GGUF/resolve/main/flux-2-klein-4b-Q4_0.gguf"
# Qwen3-4B text encoder (Q4_K_M, 2.4 GB)
curl -L -o qwen3-4b-Q4_K_M.gguf "https://huggingface.co/unsloth/Qwen3-4B-GGUF/resolve/main/Qwen3-4B-Q4_K_M.gguf"
# Reuses same flux2-vae.safetensors from above

Memory: 2.3 GB VRAM (diffusion) + 3.6 GB VRAM (Qwen3-4B encoder) + 95 MB (VAE) = ~6 GB total. 7× faster than 9B but lower quality. Good for quick previews.

FLUX.1-schnell — previous default, Apache 2.0:

mkdir -p /opt/stable-diffusion.cpp/models/flux && cd /opt/stable-diffusion.cpp/models/flux
curl -L -O "https://huggingface.co/second-state/FLUX.1-schnell-GGUF/resolve/main/flux1-schnell-q4_k.gguf"
curl -L -O "https://huggingface.co/second-state/FLUX.1-schnell-GGUF/resolve/main/ae.safetensors"
curl -L -O "https://huggingface.co/second-state/FLUX.1-schnell-GGUF/resolve/main/clip_l.safetensors"
curl -L -O "https://huggingface.co/city96/t5-v1_1-xxl-encoder-gguf/resolve/main/t5-v1_1-xxl-encoder-Q4_K_M.gguf"

Memory: 6.5 GB VRAM (diffusion) + 2.9 GB RAM (T5-XXL Q4_K_M) = ~10 GB total.

Chroma flash Q4_0 — alternative, open-source:

cd /opt/stable-diffusion.cpp/models/flux
curl -L -o chroma-unlocked-v47-flash-Q4_0.gguf "https://huggingface.co/leejet/Chroma-GGUF/resolve/main/chroma-unlocked-v47-flash-Q4_0.gguf"
# Reuses existing T5-XXL and FLUX.1 ae.safetensors from above

Memory: 5.1 GB VRAM (diffusion) + 3.2 GB RAM (T5-XXL) = ~8.4 GB total.

SD-Turbo — fast fallback, lower quality:

cd /opt/stable-diffusion.cpp/models
curl -L -o sd-turbo.safetensors \
  "https://huggingface.co/stabilityai/sd-turbo/resolve/main/sd_turbo.safetensors"

6.2 Performance

Benchmarked 2026-03-14, sd.cpp master-525-d6dd6d7, Vulkan GFX1013 (16.5 GiB), Ollama stopped.

Important: FLUX GGUF files must use --diffusion-model flag, not -m. The -m flag fails with "get sd version from file failed" because GGUF metadata is empty after tensor name conversion. This applies to all sd.cpp versions.

🏆 FLUX.2-klein-9B Q4_0 — new default (best quality):

Resolution Steps Time s/step Notes
512×512 4 104s 15.4 Default, ~11.8 GB VRAM total
768×768 4 129s 21.3 Best balance of quality vs time

FLUX.2-klein-9B uses a Qwen3-8B LLM as text encoder — richer prompt understanding and finer detail than the 4B variant. Stresses the 16.5 GB Vulkan pool properly (11.8 GB used). The --offload-to-cpu flag is essential (manages UMA allocation pools).

FLUX.2-klein-4B Q4_0 — fast alternative:

Resolution Steps Time s/step Notes
512×512 4 20s 3.95 Fast preview, ~6 GB VRAM total
512×512 8 26s 2.66 Better quality, GPU warm
768×768 4 30s 5.43 Great quality, no tiling
1024×1024 4 63s 10.18 VAE tiling required
1024×1024 4 ❌ FAIL Without --vae-tiling (VAE OOM)

7× faster than 9B but noticeably less detailed. Good for quick previews or batch generation.

FLUX.1-schnell Q4_K — previous default:

Resolution Steps Time Notes
512×512 4 30s ~10 GB VRAM (6.5 diffusion + 3.4 encoders)
768×768 4 91s VAE tiling kicks in
1024×1024 4 146s VAE tiling, good quality
512×512 8 77s More steps, marginal improvement

Chroma flash Q4_0 — quality alternative (reuses T5+VAE from FLUX.1):

Resolution Steps Time Notes
512×512 4 85s Sampling 46s + encoder 37s
512×512 8 130s Sampling 96s
768×768 8 240s Sampling 195s

Chroma uses cfg-based guidance (like FLUX.1-dev) but is fully open. Quality is better than schnell per step, but 4× slower than FLUX.2-klein.

FLUX.1-dev Q4_K_S — high-quality, slow (city96/FLUX.1-dev-gguf, 6.8 GB):

Resolution Steps Time Notes
512×512 20 279s Sampling 253s (12.65 s/step), ~6.6 GB VRAM
768×768 20 ❌ FAIL Guidance model compute graph exceeds VRAM

SD-Turbo — fast fallback:

Resolution Steps Time Notes
512×512 1 11s Minimum viable, ~2 GB VRAM
768×768 4 21s Decent for quick previews

Head-to-head comparison (same prompt, same hardware, back-to-back):

Model 512² @4s 768² @4s VRAM Diffusion Encoder
FLUX.2-klein-9B 104s 129s 11.8 GB 5.3 GB Qwen3-8B (4.7 GB)
FLUX.2-klein-4B 20s 30s 6 GB 2.3 GB Qwen3-4B (2.4 GB)
FLUX.1-schnell 30s 91s 10 GB 6.5 GB CLIP+T5 (3.4 GB)
Chroma flash 85s 240s⁸ 8.4 GB 5.1 GB T5 (3.2 GB)
FLUX.1-dev 279s²⁰ 10 GB 6.8 GB CLIP+T5 (3.4 GB)
SD-Turbo 11s¹ 21s 2 GB 2 GB (built-in)

FLUX.2-klein-9B is the quality winner — more detail, better text understanding, and it actually stresses the 16.5 GB GPU properly (11.8 GB used vs 6 GB for 4B). The 4B version is 7× faster but leaves 10 GB unused.

🔬 Quality shootout — same prompt, same seed (42), 512×512 @4 steps:

All models tested back-to-back on the same prompt: "a cyberpunk cityscape at sunset with neon lights reflecting on wet streets, highly detailed"

Model Time s/step VRAM File Size Quality
FLUX.2-klein-9B 104s 15.4 11.8 GB 709 KB ★★★★ — finest detail, best reflections
FLUX.2-klein-4B 15s 2.7 6.0 GB 704 KB ★★★ — good but less detail
FLUX.1-schnell 31s 6.5 10.1 GB 609 KB ★★ — decent, less coherent
Chroma flash (8 steps) 120s 14.1 8.4 GB 204 KB ★★ — artistic but softer

Example outputs (same prompt, same seed 42, 512×512) — image grid in the repo:
FLUX.2-klein-9B (★★★★, 104s, 11.8 GB VRAM) · FLUX.2-klein-4B (★★★, 15s, 6.0 GB VRAM)
FLUX.1-schnell (★★, 31s, 10.1 GB VRAM) · Chroma flash (★★, 120s, 8.4 GB VRAM)

The 9B model produces visibly more detail in fine structures (neon reflections, wet streets, building facades). The 4B is the speed champion but sacrifices detail. Chroma has a distinctive artistic style but outputs smaller, softer images. FLUX.1-schnell sits in the middle.

Summary: recommended settings for production

Use case Model Resolution Steps Time
Default (Signal) FLUX.2-klein-9B 512×512 4 ~105s
High quality FLUX.2-klein-9B 768×768 4 ~130s
Quick preview FLUX.2-klein-4B 512×512 4 ~20s
Poster/wallpaper FLUX.2-klein-4B 1024×1024 4 ~63s
Best quality (slow) Chroma flash 512×512 8 ~130s
# FLUX.2-klein-9B — recommended production command:
/opt/stable-diffusion.cpp/build/bin/sd-cli \
  --diffusion-model models/flux2/flux-2-klein-9b-Q4_0.gguf \
  --vae models/flux2/flux2-vae.safetensors \
  --llm models/flux2/qwen3-8b-Q4_K_M.gguf \
  -p "your prompt here" \
  --cfg-scale 1.0 --steps 4 -H 512 -W 512 \
  --offload-to-cpu --diffusion-fa -v \
  -o output.png

6.2.1 Upgrade roadmap — beyond the current stack

sd.cpp (master-525+) supports more models. The BC-250 has ~16.5 GB with Ollama stopped (post-GTT migration). All models use --offload-to-cpu (UMA — no PCIe penalty).

Image generation — tested models:

Model Params GGUF Size Total RAM¹ Steps Quality Status
FLUX.2-klein-9B Q4_0 9B 5.3 GB ~11.8 GB 4 ★★★★ ✅ Current default, 104s @512²
FLUX.2-klein-4B Q4_0 4B 2.3 GB ~6 GB 4 ★★★ ✅ Fast alternative, 20s @512²
FLUX.1-schnell Q4_K 12B 6.5 GB ~10 GB 4 ★★ ✅ Previous default, 30s @512²
Chroma flash Q4_0 12B 5.1 GB ~8.4 GB 4–8 ★★★ ✅ Tested — 85s @512², better quality
FLUX.1-dev Q4_K_S 12B 6.8 GB ~10 GB 20 ★★★★ ✅ Tested — 279s @512², ❌768²+
SD-Turbo 1.1B ~2 GB ~2.5 GB 1–4 ✅ Fast preview, 11s @512²
SD3.5-medium Q4_0 2.5B 1.7 GB ~6 GB 28 ★★★ ✅ Tested — 49s @512², needs clip_g+clip_l+T5+F16 VAE²

¹ Total RAM includes diffusion model + text encoder(s) + VAE.

² BF16 VAE gotcha — see SD3.5 section below.

Video generation — tested models:

Model Params GGUF Size Total RAM¹ Frames Time Status
WAN 2.1 T2V 1.3B Q4_0 1.3B 826 MB ~5 GB 17 @480×320 ~38 min ✅ Works on BC-250

WAN requires umt5-xxl text encoder (3.5 GB Q4_K_M) + WAN VAE (243 MB). Outputs raw AVI (MJPEG). No matrix cores = slow but works.

Video generation — tested (OOM):

Model Params GGUF Size Total RAM¹ Notes
WAN 2.2 TI2V 5B Q4_0 5B 2.9 GB ~9 GB ❌ OOM crash at Q4_0. Model (2.9G) + VAE (1.4G) + T5 (4.7G) = 9 GB — exceeds UMA budget during video denoising. May work with Q2_K model + Q2_K T5 (~6 GB) but untested.

Image editing — FLUX.1-Kontext-dev:

Model Params GGUF Size Total RAM¹ Status
FLUX.1-Kontext-dev Q4_0 12B 6.8 GB ~10 GB ✅ Tested — 316s @512² (no swap). 1024² causes swap pressure (40+ min). Uses -r flag, reuses FLUX.1 T5/CLIP/VAE

Kontext is a dedicated image editing model by Black Forest Labs. It takes a reference image via -r and a text instruction to produce an edited version. Uses existing FLUX.1 encoders (T5-XXL, CLIP_L) and VAE (ae.safetensors) from /opt/stable-diffusion.cpp/models/flux/.

# Edit an existing image with Kontext:
sd-cli --diffusion-model models/flux/flux1-kontext-dev-Q4_0.gguf \
  --vae models/flux/ae.safetensors --clip_l models/flux/clip_l.safetensors \
  --t5xxl models/flux/t5-v1_1-xxl-encoder-Q4_K_M.gguf --clip-on-cpu \
  -r input.png -p "change the sky to sunset" --cfg-scale 3.5 --steps 28 \
  --sampling-method euler --offload-to-cpu --diffusion-fa -o output.png

Kontext demo — "turn Sonic into Shadow the Hedgehog":

Input (1200×1600 → resized to 512×512) Output (512×512, 647s) Output + ESRGAN 4× (2048×2048, +25s)

The 4× upscaled version (right) is generated automatically by the ESRGAN auto-upscale pipeline — every generated/edited image gets a 2048×2048 version sent alongside the 512×512 original. Total overhead: ~25s with tile 192. See ESRGAN benchmarks below.

SD3.5-medium benchmark details

Timing breakdown (512×512, 28 steps, seed 42):

Phase Time Notes
CLIP + T5 encoding 3.5s clip_l + clip_g + t5-v1_1-xxl Q4_K_M
Diffusion sampling 43s 28 steps × 1.5s/it (mmdit 2.1 GB on Vulkan)
VAE decode 2.3s F16-converted VAE (94.6 MB)
Total 49s

Model stack on disk:

Component File Size
Diffusion sd3.5_medium-q4_0.gguf 1.7 GB
CLIP-L clip_l.safetensors (shared with FLUX) 246 MB
CLIP-G clip_g.safetensors 1.3 GB
T5-XXL t5-v1_1-xxl-encoder-Q4_K_M.gguf (shared with FLUX) 2.9 GB
VAE sd3_vae_f16.safetensors (converted from BF16) 160 MB
Total on disk ~6.3 GB
# SD3.5-medium generation command:
sd-cli --diffusion-model models/sd3/sd3.5_medium-q4_0.gguf \
  --vae models/sd3/sd3_vae_f16.safetensors \
  --clip_l models/flux/clip_l.safetensors \
  --clip_g models/sd3/clip_g.safetensors \
  --t5xxl models/flux/t5-v1_1-xxl-encoder-Q4_K_M.gguf \
  -p "prompt" --cfg-scale 4.5 --sampling-method euler --steps 28 \
  -W 512 -H 512 --diffusion-fa --offload-to-cpu -o output.png

⚠ BF16 VAE gotcha: The upstream SD3 VAE (diffusion_pytorch_model.safetensors) uses BF16 tensors. GFX1013 Vulkan has no BF16 support — the output is a solid blue/yellow rectangle. Fix: convert to F16 with python3 convert_vae_bf16_to_f16.py input.safetensors output.safetensors (script in /tmp/).

WAN 2.1 T2V 1.3B benchmark details

Timing breakdown (480×320, 17 frames, 50 steps, seed 42):

Phase Time Notes
umt5-xxl encoding ~4s 3.5 GB Q4_K_M text encoder
Diffusion sampling ~35 min 17 frames × 50 steps. No matrix cores → pure scalar Vulkan
VAE decode ~30s WAN VAE (243 MB), decodes all 17 frames
Total ~38 min

Model stack on disk:

Component File Size
Diffusion Wan2.1-T2V-1.3B-Q4_0.gguf 826 MB
Text encoder umt5-xxl-encoder-Q4_K_M.gguf 3.5 GB
VAE wan_2.1_vae.safetensors 243 MB
Total on disk ~4.5 GB
# WAN 2.1 text-to-video generation:
sd-cli -M vid_gen \
  --diffusion-model models/wan/Wan2.1-T2V-1.3B-Q4_0.gguf \
  --vae models/wan/wan_2.1_vae.safetensors \
  --t5xxl models/wan/umt5-xxl-encoder-Q4_K_M.gguf \
  -p "A cat walking across a sunny garden" \
  --cfg-scale 6.0 --sampling-method euler \
  -W 480 -H 320 --diffusion-fa --offload-to-cpu \
  --video-frames 17 --flow-shift 3.0 -o output.mp4

Output format: sd.cpp produces raw AVI (MJPEG) regardless of the -o extension. The 17-frame clip plays at 16 fps (~1 second). Quality is recognizable but noisy — expected at Q4_0 with scalar-only Vulkan compute.

Why so slow? Each video frame is a full diffusion pass through the 1.3B model. With 17 frames × 50 steps × no matrix cores, every multiply is scalar. A GPU with tensor/matrix units (RDNA3+, Turing+) would be 5–10× faster.

WAN 2.1 demo — "A cat walking across a sunny garden":

WAN 2.1 T2V — cat in garden

17 frames @480×320, 50 steps, Q4_0 quantization, euler sampler, cfg-scale 6.0. Generated in ~38 minutes on GFX1013 scalar Vulkan — no matrix/tensor cores. The BC-250 rendered every frame through pure ALU compute. Noisy but recognizable — a real video from a 1.3B parameter model on a secondhand BC-250.

ESRGAN 4× upscale benchmarks

All generated images are automatically upscaled with RealESRGAN_x4plus (64 MB model, 4× scaling). Runs immediately after generation while Ollama is still stopped — zero extra GPU-swap cost.

ESRGAN tile size benchmark (512² input → 2048² output):

Tile Size Time Output Notes
128 (default) 15s 2048×2048, 5.1 MB Fastest, visible seams possible
192 (production) 25s 2048×2048, 5.1 MB Best quality/speed tradeoff
256 41s 2048×2048, 5.1 MB Smoothest seams, 2.7× slower
128 ×2 passes (16×!) 4m 50s 8192×8192, 67 MB 512²→8192² in under 5 min

Production uses tile 192: larger tiles mean fewer seam boundaries → cleaner upscale. The 16× mode (two ESRGAN passes) produces 67-megapixel images from 512² input — available on-demand via EXEC(upscale ...) but not automatic (too large for Signal).


Image/video pipeline timing

End-to-end timing for all generation modes on BC-250:


Phase breakdown — where the time goes in each pipeline:


FLUX.1-schnell resolution scaling — time vs pixel count:



PART III — Monitoring & Intelligence

7. Netscan Ecosystem

A research, monitoring, and data collection system with 330 autonomous jobs running on a GPU-constrained single-board computer. Dashboard at http://<LAN_IP>:8888 — 29 main pages + 101 per-host detail pages.

7.1 Architecture — queue-runner v7

The BC-250 has 16 GB GTT shared with the CPU — only one LLM job can run at a time. queue-runner.py (systemd service) orchestrates all 330 jobs in a continuous loop, with Signal chat between every job:

queue-runner v7 -- Continuous Loop + Signal Chat

Cycle N:
  330 jobs sequential, ordered by category:
  scrape -> infra -> lore -> academic -> repo -> company -> career
         -> think -> csi -> meta -> market -> report
  HA observations interleaved every 50 jobs
  Signal inbox checked between EVERY job
  Chat processed with LLM (EXEC tool use + image gen)
  Crash recovery: resumes from last completed job

Cycle N+1:
  Immediately starts -- no pause, no idle windows
  No nightly/daytime distinction

Key design decisions (v5 → v7):

v5 (OpenClaw era) v7 (current)
Nightly batch + daytime fill Continuous loop, no distinction
354 jobs (including duplicates) 330 jobs (deduped, expanded)
LLM jobs routed through openclaw cron run All jobs run as direct subprocesses
Signal via OpenClaw gateway (~700 MB) signal-cli standalone (~100 MB)
Chat only when gateway available Chat between every job
Async SD pipeline (worker scripts, 45s delay) Synchronous SD (stop Ollama → generate → restart)
GPU idle detection for user chat preemption No preemption needed — chat is interleaved

All jobs run as direct subprocesses (subprocess.Popen for Python/bash scripts, no LLM agent routing). This is 3–10× faster than the old openclaw cron run path and eliminates the gateway dependency entirely.

7.1.1 Queue ordering

The queue prioritizes data diversity — all dashboard tabs get fresh data even if the cycle is interrupted. See §7.3 for the full category breakdown with GPU times. HA observations are interleaved every 50 jobs, and Signal chat is checked between every job.

7.1.2 GPU idle detection

GPU idle detection is used for legacy --daytime mode and Ollama health checks:

# Three-tier detection:
# 1. Ollama /api/ps → no models loaded → definitely idle
# 2. sysfs pp_dpm_sclk → clock < 1200 MHz → model loaded but not computing
# 3. Ollama expires_at → model about to unload → idle for 3+ min

In continuous loop mode (default), GPU detection is only used for pre-flight health checks — not for yielding to user chat, since chat is interleaved between jobs.
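
The same three tiers as a Python sketch — the sysfs path, 1200 MHz threshold, and 3-minute window come from the notes above; the rest is illustrative:

# Sketch: three-tier idle detection (paths/thresholds per the notes above)
import json, re, urllib.request
from datetime import datetime, timezone

SCLK = "/sys/class/drm/card0/device/pp_dpm_sclk"       # card index may differ

def gpu_idle() -> bool:
    ps = json.load(urllib.request.urlopen("http://127.0.0.1:11434/api/ps"))
    if not ps.get("models"):
        return True                                    # tier 1: nothing loaded
    active = next(l for l in open(SCLK) if "*" in l)   # active level, e.g. "1: 2000Mhz *"
    mhz = int(re.search(r"(\d+)\s*mhz", active, re.I).group(1))
    if mhz < 1200:
        return True                                    # tier 2: loaded, not computing
    expires = ps["models"][0]["expires_at"]            # tier 3: about to unload
    left = datetime.fromisoformat(expires) - datetime.now(timezone.utc)
    return left.total_seconds() < 180                  # timestamp parsing simplified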

7.2 Scripts

GPU jobs (queue-runner — sequential, one at a time):

Script Purpose Jobs
career-scan.py Two-phase career scanner (§8) 1
career-think.py Per-company career deep analysis 65
salary-tracker.py Salary intel — NoFluffJobs, career-scan extraction 1
company-intel.py Deep company intel — GoWork, DDG news, layoffs (43 entities) 1
company-think-* Focused company deep-dives 106
patent-watch.py IR/RGB camera patent monitor — Google Patents, EPO OPS, DuckDuckGo 1
event-scout.py Meetup/conference tracker — Poland, Europe 1
leak-monitor.py CTI: 11 OSINT sources — HIBP, Hudson Rock, GitHub dorks, Ahmia dark web, CISA KEV, ransomware, Telegram 1
idle-think.sh Research brain — 8 task types → JSON notes 34
ha-journal.py Home Assistant analysis (climate, sensors, anomalies) 2
ha-correlate.py HA cross-sensor correlation 2
city-watch.py SkyscraperCity local construction tracker 1
csi-sensor-watch.py CSI camera sensor patent/news monitor 1
csi-think.py CSI camera domain analysis (drivers, ISP, GMSL) 6
lore-digest.sh Kernel mailing list digests (8 feeds) 8
repo-watch.sh Upstream repos (GStreamer, libcamera, v4l-utils, FFmpeg, LinuxTV) 8
repo-think.py LLM analysis of repo changes 26
market-think.py Market sector analysis + synthesis 19
life-think.py Cross-domain life advisor 2
system-think.py GPU/security/health system intelligence 3
radio-scan.py Radio hobbyist forum tracker 1
career-digest.py Weekly career digest → Signal (Sunday) 1
daily-summary.py End-of-cycle summary → dashboard + Signal 2
academic-watch.py Academic publication monitor (4 topics × 3 types) 12
book-watch.py Book/publication tracker (11 subjects) 11
news-watch.py Tech news aggregation + RSS 2
weather-watch.py Weather forecast + HA sensor correlation 2
car-tracker.py GPS car tracker (SinoTrack API) 1
frost-guard.py Frost/freeze risk alerter 1

CPU jobs (system crontab — independent of queue-runner):

Script Frequency Purpose
gpu-monitor.sh + .py 1 min GPU utilization sampling (3-state)
presence.sh 5 min Phone presence tracker
syslog.sh 5 min System health logger
watchdog.py 30 min (live), 06:00 (full) Network security — ARP, DNS, TLS, vulnerability scoring
scan.sh + enumerate.sh 04:00 Network scan + enumeration (nmap)
vulnscan.sh Weekly (Sun) Vulnerability scan
repo-watch.sh 08:00, 14:00, 18:00 Upstream repo data collection
report.sh 08:30 Morning report rebuild
generate-html.py After each queue-runner job Dashboard HTML builder (6900+ lines)
gpu-monitor.py chart 22:55 Daily GPU utilization chart

7.3 Job scheduling — queue-runner v7

Job categories (auto-classified by name pattern):

Category Jobs Typical GPU time Examples
scrape 29 0.1h career-scan, salary, patents, book-watch, repo-scan (no LLM)
infra 6 0.6h leak-monitor, netscan, watchdog, frost-guard, radio-scan
lore 8 0.5h lore-digest per mailing list feed
academic 12 academic-watch per topic × type
repo 27 0.3h LLM analysis of repo changes + weekly digest
company 107 0.9h company-intel + competitive/financial/strategy deep-dives
career 66 1.9h career-think per company + weekly digest
think 34 2.0h research, trends, crawl, crossfeed
csi 6 0.3h CSI camera domain analysis
meta 5 life-think, system-think
market 19 0.9h market-think per asset + synthesis
ha 4 1.0h ha-correlate, ha-journal (interleaved)
report 4 daily-summary, news + weather analysis
weekly 3 vulnscan, csi-sensor-discover/improve
Total 330 ~9h

Data flow:

jobs.json (330 jobs)
  |
  v
queue-runner.py
  |
  |-- All jobs -> subprocess.Popen -> python3/bash /opt/netscan/...
  |                                         |
  |       JSON results <--------------------+
  |         |
  |         |-- /opt/netscan/data/{category}/*.json
  |         |
  |         +-- generate-html.py -> /opt/netscan/web/*.html -> nginx :8888
  |
  |-- Signal chat (between every job)
  |     via JSON-RPC http://127.0.0.1:8080/api/v1/rpc
  |
  +-- Signal alerts (career matches, leaks, events, daily summary)

7.4 Data flow & locations

All paths relative to /opt/netscan/:

Data Path Source
Research notes data/think/note-*.json + notes-index.json idle-think.sh
Career scans data/career/scan-*.json + latest-scan.json career-scan.py
Career analysis data/career/think-*.json career-think.py
Salary data/salary/salary-*.json (180-day history) salary-tracker.py
Company intel data/intel/intel-*.json + company-intel-deep.json company-intel.py
Patents data/patents/patents-*.json + patent-db.json patent-watch.py
Events data/events/events-*.json + event-db.json event-scout.py
Leaks / CTI data/leaks/leak-intel.json leak-monitor.py
City watch data/city/city-watch-*.json city-watch.py
CSI sensors data/csi-sensors/csi-sensor-*.json csi-sensor-watch.py
HA correlations data/correlate/correlate-*.json ha-correlate.py
HA journal data/ha-journal-*.json ha-journal.py
Mailing lists data/{lkml,soc,jetson,libcamera,dri,usb,riscv,dt}/ lore-digest.sh
Repos data/repos/ repo-watch.sh, repo-think.py
Market data/market/ market-think.py
Academic data/academic/ academic-watch (LLM)
GPU load data/gpu-load.tsv gpu-monitor.sh
System health data/syslog/health-*.tsv (30-day retention) syslog.sh
Network hosts data/hosts-db.json scan.sh
Presence data/presence-state.json presence.sh
Radio data/radio/ radio-scan.py
Queue state data/queue-runner-state.json queue-runner.py

7.5 Dashboard — 29 main pages + 101 host detail pages

Served by nginx at :8888, generated by generate-html.py (6900+ lines):

Page Content Data source
index.html Overview — hosts, presence, latest notes, status aggregated
home.html Home Assistant — climate, energy, anomalies ha-journal, ha-correlate
career.html Career intelligence — matches, trends career-scan, career-think
market.html Market analysis — sectors, commodities, crypto market-think
advisor.html Life advisor — cross-domain synthesis life-think
notes.html Research brain — all think notes idle-think
leaks.html CTI / leak monitor leak-monitor
issues.html Upstream issue tracking repo-think
events.html Events calendar — Poland, Europe event-scout
lkml.html Linux Media mailing list digest lore-digest (linux-media)
soc.html SoC bringup mailing list lore-digest (soc-bringup)
jetson.html Jetson/Tegra mailing list lore-digest (jetson-tegra)
libcamera.html libcamera mailing list lore-digest (libcamera)
dri.html DRI-devel mailing list lore-digest (dri-devel)
usb.html Linux USB mailing list lore-digest (linux-usb)
riscv.html Linux RISC-V mailing list lore-digest (linux-riscv)
dt.html Devicetree mailing list lore-digest (devicetree)
academic.html Academic publications academic-watch
hosts.html Network device inventory scan.sh
security.html Host security scoring vulnscan.sh
presence.html Phone detection timeline presence.sh
load.html GPU utilization heatmap + schedule gpu-monitor
radio.html Radio hobbyist activity radio-scan.py
car.html Car tracker car-tracker
weather.html Weather forecast + HA sensor correlation weather-watch.py
news.html Tech news aggregation + RSS news-watch.py
health.html System health assessment (services, data freshness, LLM quality) bc250-extended-health.py
history.html Changelog
log.html Raw scan logs
host/*.html Per-host detail pages (101 hosts) scan.sh, enumerate.sh

Mailing list feeds are configured in digest-feeds.json — 8 feeds from lore.kernel.org, each with relevance scoring keywords.

7.6 GPU monitoring — 3-state

Per-minute sampling via pp_dpm_sclk:

State Clock Temp Meaning
generating 2000 MHz ~77°C Active LLM inference
loaded 1000 MHz ~56°C Model in VRAM, idle
idle 1000 MHz <50°C No model loaded
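
A sketch of the classification — the 2000 MHz threshold and the temperature split come straight from the table; the hwmon path varies per system:

# Sketch: 3-state classification from clock + temperature
import glob, re

SCLK = "/sys/class/drm/card0/device/pp_dpm_sclk"

def classify_gpu() -> str:
    active = next(l for l in open(SCLK) if "*" in l)
    mhz = int(re.search(r"(\d+)\s*mhz", active, re.I).group(1))
    temp_path = glob.glob(
        "/sys/class/drm/card0/device/hwmon/hwmon*/temp1_input")[0]
    temp_c = int(open(temp_path).read()) / 1000.0      # millidegrees -> degrees
    if mhz >= 2000:
        return "generating"                            # active inference, ~77°C
    return "loaded" if temp_c >= 50 else "idle"        # same clock; temp splits it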

7.7 Configuration & state files

File Purpose
profile.json Public interests — tracked repos, keywords, technologies
profile-private.json Career context — target companies, salary expectations (gitignored)
watchlist.json Auto-evolving interest tracker
digest-feeds.json Mailing list feed URLs (8 feeds from lore.kernel.org)
repo-feeds.json Repository API endpoints
sensor-watchlist.json CSI camera sensor tracking list
queue-runner-state.json Cycle count, resume index (in data/)
/opt/netscan/data/jobs.json All 330 job definitions

7.8 Resilience

Mechanism Details
Systemd watchdog WatchdogSec=14400 (4h) — queue-runner pings every 30s during job execution
Crash recovery State file records nightly batch progress; on restart, resumes from last completed job
Midnight crossing Resume index valid for both today and yesterday's date (batch starts 23:00 day N, may crash after midnight day N+1)
Atomic state writes Write to .tmp file, fsync(), then rename() — survives SIGABRT/power loss (sketch after this table)
Ollama health checks Pre-flight check before each job; exponential backoff wait if unhealthy
Network down Detects network loss, waits with backoff up to 10min
GPU deadlock protection If GPU busy for > 60min continuously, breaks and moves on
OOM protection Ollama OOMScoreAdjust=-1000, 16 GB NVMe swap, zram limited to 2 GB
Signal delivery --best-effort-deliver flag — delivery failures don't mark job as failed
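
The atomic-write pattern from the table, as a minimal Python sketch:

# Sketch: atomic JSON state write -- tmp file, fsync, then rename
import json, os

def write_state_atomic(path: str, state: dict) -> None:
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())       # data hits disk before the rename
    os.rename(tmp, path)           # atomic: readers see old or new, never half-written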

8. Career Intelligence

Automated career opportunity scanner with a two-phase anti-hallucination architecture.

8.1 Two-phase design

  HTML page
    +-> Phase 1: extract jobs (NO candidate profile) -> raw job list
                                                            |
  Candidate Profile + single job ---------------------------+
    +-> Phase 2: score match -> repeat per job
                                   +-> aggregate -> JSON + Signal alerts

Phase 1 extracts jobs from raw HTML without seeing the candidate profile — prevents the LLM from inventing matching jobs. Phase 2 scores each job individually against the profile.
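
A sketch of the two phases against a local Ollama endpoint — the prompts and the 9B model choice are illustrative, not the production prompts:

# Sketch: two-phase scan; prompts and model are illustrative
import json, urllib.request

def llm_json(prompt: str):
    """Call local Ollama and parse the JSON reply (error handling omitted)."""
    req = urllib.request.Request(
        "http://127.0.0.1:11434/api/generate",
        data=json.dumps({"model": "qwen3.5:9b", "prompt": prompt,
                         "format": "json", "stream": False}).encode(),
        headers={"Content-Type": "application/json"})
    return json.loads(json.load(urllib.request.urlopen(req))["response"])

def scan(html: str, profile: str) -> list:
    # Phase 1: the extraction prompt never sees the candidate profile,
    # so the LLM cannot invent jobs that conveniently match it
    jobs = llm_json("Extract every job posting from this HTML as a JSON list:\n"
                    + html)
    # Phase 2: score each job individually against the profile
    for job in jobs:
        verdict = llm_json("Candidate profile:\n" + profile +
                           "\n\nJob posting:\n" + json.dumps(job) +
                           '\nReturn {"score": <0-100>} as JSON.')
        job["score"] = int(verdict["score"])
    return jobs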

8.2 Alert thresholds

Category Score Alert?
⚡ Hot match ≥70% ✅ (up to 5/scan)
🌍 Worth checking 55–69% + remote ✅ (up to 2/scan)
Good / Weak <55% Dashboard only

Software houses (SII, GlobalLogic, Sysgo…) appear on the dashboard but never trigger alerts.
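
The thresholds above reduce to a few lines — the software-house list is truncated here, and the field names are illustrative:

# Sketch: alert selection per the thresholds above (field names illustrative)
NEVER_ALERT = {"SII", "GlobalLogic", "Sysgo"}   # software houses (list truncated)

def pick_alerts(jobs: list) -> list:
    eligible = [j for j in jobs if j["company"] not in NEVER_ALERT]
    hot = [j for j in eligible if j["score"] >= 70]                        # up to 5/scan
    worth = [j for j in eligible
             if 55 <= j["score"] < 70 and j.get("remote")]                 # up to 2/scan
    return hot[:5] + worth[:2]          # everything else: dashboard only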

8.3 Salary tracker · salary-tracker.py

Nightly at 01:30. Sources: career-scan extraction, NoFluffJobs API, JustJoinIT, Bulldogjob. Tracks embedded Linux / camera driver compensation in Poland. 180-day rolling history.

8.4 Company intelligence · company-intel.py

Nightly at 01:50. Deep-dives into 43 tracked companies across 8 sources: GoWork.pl reviews, DuckDuckGo news, Layoffs.fyi, company pages, 4programmers.net, Reddit, SemiWiki, Hacker News. LLM-scored sentiment (-5 to +5) with cross-company synthesis.

GoWork.pl caveat: the new Next.js SPA breaks scrapers, so the scanner uses the old /opinie_czytaj,{entity_id} URLs, which are still server-rendered.

8.5 Patent watch · patent-watch.py

Nightly at 02:10. Monitors 6 search queries (MIPI CSI, IR/RGB dual camera, ISP pipeline, automotive ADAS, sensor fusion, V4L2/libcamera) across Google Patents, EPO OPS, and DuckDuckGo. Scored by relevance keywords × watched assignee bonus.

8.6 Event scout · event-scout.py

Nightly at 02:30. Discovers tech events with geographic scoring (local 10, nearby 8, Poland 5, Europe 3, Online 9). Sources: Crossweb.pl, Konfeo, Meetup, Eventbrite, DDG, 14 known conference sites.
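
The geographic weights as a simple lookup — region classification itself is the hard part and is omitted here:

# Sketch: geographic scoring weights quoted above; classification omitted
GEO_SCORE = {"local": 10, "online": 9, "nearby": 8, "poland": 5, "europe": 3}

def score_event(region: str) -> int:
    return GEO_SCORE.get(region.lower(), 0)    # unknown regions score zero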


PART IV — Reference

9. Repository Structure

▸ Full tree
bc250/
├── README.md                       ← you are here
├── netscan/                        → /opt/netscan/
│   ├── queue-runner.py             # v7 — continuous loop + Signal chat (330 jobs)
│   ├── career-scan.py              # Two-phase career scanner
│   ├── career-think.py             # Per-company career analysis
│   ├── salary-tracker.py           # Salary intelligence
│   ├── company-intel.py            # Company deep-dive
│   ├── company-think.py            # Per-entity company analysis
│   ├── patent-watch.py             # Patent monitor
│   ├── event-scout.py              # Event tracker
│   ├── city-watch.py               # SkyscraperCity local construction monitor
│   ├── leak-monitor.py             # CTI: 11 OSINT sources + Ahmia dark web
│   ├── ha-journal.py               # Home Assistant journal
│   ├── ha-correlate.py             # HA cross-sensor correlation
│   ├── ha-observe.py               # Quick HA queries
│   ├── csi-sensor-watch.py         # CSI camera sensor patent/news
│   ├── csi-think.py                # CSI camera domain analysis
│   ├── radio-scan.py               # Radio hobbyist forum tracker
│   ├── market-think.py             # Market sector analysis
│   ├── life-think.py               # Cross-domain life advisor
│   ├── system-think.py             # GPU/security/health system intelligence
│   ├── career-digest.py            # Weekly career digest → Signal (Sunday)
│   ├── daily-summary.py            # End-of-cycle Signal summary
│   ├── frost-guard.py              # Frost/freeze risk alerter
│   ├── repo-think.py               # LLM analysis of repo changes
│   ├── academic-watch.py           # Academic publication monitor
│   ├── news-watch.py               # Tech news aggregation + RSS feeds
│   ├── book-watch.py               # Book/publication tracker
│   ├── weather-watch.py            # Weather forecast + HA sensor correlation
│   ├── car-tracker.py              # GPS car tracker (SinoTrack API, trip/stop detection)
│   ├── bc250-extended-health.py    # System health assessment (services, data freshness, LLM quality)
│   ├── llm_sanitize.py             # LLM output sanitizer (thinking tags, JSON repair)
│   ├── generate-html.py            # Dashboard builder (6900+ lines, 29 main + 101 host pages)
│   ├── gpu-monitor.py              # GPU data collector
│   ├── idle-think.sh               # Research brain (8 task types)
│   ├── repo-watch.sh               # Upstream repo monitor
│   ├── lore-digest.sh              # Mailing list digests (8 feeds)
│   ├── bc250-health-check.sh       # Quick health check (systemd timer, triggers extended health)
│   ├── gpu-monitor.sh              # Per-minute GPU sampler
│   ├── scan.sh / enumerate.sh      # Network scanning
│   ├── vulnscan.sh                 # Weekly vulnerability scan
│   ├── presence.sh                 # Phone presence detection
│   ├── syslog.sh                   # System health logger
│   ├── watchdog.py                 # Network security checker
│   ├── report.sh                   # Morning report rebuild
│   ├── profile.json                # Public interests + Signal config
│   ├── profile-private.json        # Career context (gitignored)
│   ├── watchlist.json              # Auto-evolving interest tracker
│   ├── digest-feeds.json           # Feed URLs (8 mailing lists)
│   ├── repo-feeds.json             # Repository endpoints
│   └── sensor-watchlist.json       # CSI sensor tracking list
├── systemd/
│   ├── queue-runner.service        # v7 — continuous loop + Signal chat
│   ├── queue-runner-nightly.service # Nightly batch trigger
│   ├── queue-runner-nightly.timer
│   ├── signal-cli.service          # Standalone JSON-RPC daemon
│   ├── bc250-health.service        # Health check timer
│   ├── bc250-health.timer
│   ├── ollama.service
│   ├── ollama-watchdog.service     # Ollama restart watchdog
│   ├── ollama-watchdog.timer
│   ├── ollama-proxy.service        # LAN proxy for Ollama API
│   └── ollama.service.d/
│       └── override.conf           # Vulkan + memory settings
├── scripts/
│   └── ollama-proxy.py             # Reverse proxy (injects think:false for qwen3)
├── generate-and-send.sh            → /opt/stable-diffusion.cpp/ (legacy EXEC pattern, intercepted by queue-runner)
└── generate-and-send-worker.sh     → legacy async worker (unused in v7, kept for EXEC pattern match)

Deployment

Local → bc250
netscan/* /opt/netscan/
systemd/queue-runner.service /etc/systemd/system/queue-runner.service
systemd/signal-cli.service /etc/systemd/system/signal-cli.service
systemd/ollama.* /etc/systemd/system/ollama.*
generate-and-send*.sh /opt/stable-diffusion.cpp/
# Typical deploy workflow
scp netscan/queue-runner.py bc250:/tmp/
ssh bc250 'sudo cp /tmp/queue-runner.py /opt/netscan/ && sudo systemctl restart queue-runner'

10. Troubleshooting

▸ ROCm crashes in Ollama logs

Expected — Ollama tries ROCm, it crashes on GFX1013, falls back to Vulkan. No action needed.

▸ Only 7.9 GiB GPU memory instead of 14 GiB

GTT tuning not applied. Check: cat /sys/module/ttm/parameters/pages_limit (should be 4194304). See §3.3.

▸ 14B model loads but inference returns HTTP 500

TTM pages_limit bottleneck. Fix: echo 4194304 | sudo tee /sys/module/ttm/parameters/pages_limit (see §3.3).

▸ Model loads on CPU instead of GPU

Check OLLAMA_VULKAN=1: sudo systemctl show ollama | grep Environment

▸ Context window OOM kills (the biggest gotcha on 16 GB)

Ollama allocates KV cache based on num_ctx. Many models default to 32K–40K context, which on a 14B Q4_K model means 14–16 GB for weights plus KV cache — leaving nothing for the OS.

Symptoms: Gateway gets OOM-killed, Ollama journal shows 500 errors, dmesg shows oom-kill.

Root cause: The abliterated Qwen3 14B declares num_ctx 40960 → 16 GB total model memory.

Fix: Create a custom model with context baked in:

cat > /tmp/Modelfile.16k << 'EOF'
FROM huihui_ai/qwen3-abliterated:14b
PARAMETER num_ctx 16384
EOF
ollama create qwen3-14b-16k -f /tmp/Modelfile.16k

This drops memory from ~16 GB → ~11.1 GB. Do not rely on OLLAMA_CONTEXT_LENGTH — it doesn't reliably override API requests from the gateway.
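
For intuition, a rough per-token KV estimate (2 × layers × KV heads × head dim × bytes per element). The layer/head numbers below are illustrative placeholders, not the actual model's — check the model card:

# Sketch: rough KV-cache sizing; architecture numbers are placeholders
def kv_cache_gib(num_ctx: int, n_layers: int, n_kv_heads: int,
                 head_dim: int, bytes_per_elem: int = 2) -> float:
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # K and V
    return num_ctx * per_token / 2**30

print(kv_cache_gib(40960, n_layers=40, n_kv_heads=8, head_dim=128))  # ~6.25 GiB fp16
print(kv_cache_gib(16384, n_layers=40, n_kv_heads=8, head_dim=128))  # ~2.5 GiB fp16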

▸ signal-cli not responding on port 8080

Check the service: systemctl status signal-cli. If it crashed, restart: sudo systemctl restart signal-cli. Verify JSON-RPC:

curl -s http://127.0.0.1:8080/api/v1/rpc \
  -d '{"jsonrpc":"2.0","method":"listAccounts","id":"1"}'
▸ zram competing with model for physical RAM

Fedora defaults to ~8 GB zram. zram compresses pages but stores them in physical RAM — directly competing with the model. On 16 GB systems running 14B models, disable or limit zram and use NVMe file swap instead:

sudo mkdir -p /etc/systemd/zram-generator.conf.d
echo -e '[zram0]\nzram-size = 2048' | sudo tee /etc/systemd/zram-generator.conf.d/small.conf
▸ Python cron scripts produce no output

Stdout is fully buffered under cron (no TTY). Add at script start:

import sys
sys.stdout.reconfigure(line_buffering=True)
sys.stderr.reconfigure(line_buffering=True)
▸ Signal delivery from signal-cli

Signal JSON-RPC API at http://127.0.0.1:8080/api/v1/rpc:

curl -X POST http://127.0.0.1:8080/api/v1/rpc \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","method":"send","params":{
    "account":"+<BOT>","recipient":["+<YOU>"],
    "message":"test"
  },"id":"1"}'

11. Known Limitations

Issue Impact
Shared VRAM Image gen requires stopping Ollama. Bot offline ~2–3 min (FLUX.2-klein-9B) or ~1 min (FLUX.2-klein-4B).
MoE context limit 35B-A3B MoE tops out at 16K context (weights = 10.3 GiB, KV fills rest). Use 9B for >16K.
Signal latency Messages queue during job execution (typical job 2–15 min). Chat checked between every job.
sd-cli hangs on GFX1013 Vulkan cleanup bug → poll + kill workaround.
Cold start latency 30–60s after Ollama restart (model loading).
Chinese thinking leak Qwen3 occasionally outputs Chinese reasoning. Cosmetic.
Prefill rate degrades with context 128 tok/s at 1.3K → 70 tok/s at 10K tokens (UMA bandwidth + attention scaling).
Gen speed degrades with context fill 27 tok/s empty → 13 tok/s at 30K tokens. Partial model offload at KV limit causes cliff drop.
Ollama caps KV auto-size at ~40K (Q4_0) num_ctx > 40960 accepted but silently truncated. Actual limit = VRAM ÷ per-token KV size.
Speculative decoding blocked Ollama 0.18 has no --draft-model. Dual-model loading evicts the draft model.
TTS not feasible CPU-based TTS (Piper, Coqui) competes with GPU for the same 16 GB UMA pool. No Vulkan TTS exists.

12. Software Versions

Pinned versions as of March 2026. All components built/installed on Fedora 43.

Component Version Notes
OS Fedora 43, kernel 6.18.9 Headless, performance governor
Ollama 0.18.0 Vulkan backend, OLLAMA_FLASH_ATTENTION=1
Mesa / RADV 25.3.4 Vulkan 1.4.328, RADV GFX1013
stable-diffusion.cpp master-525 (d6dd6d7) Built with -DSD_VULKAN=ON
whisper.cpp v1.8.3-198 (30c5194c) Built with Vulkan, large-v3-turbo model
signal-cli 0.13.24 Native binary, JSON-RPC at :8080
Qwen3.5-35B-A3B IQ2_M (GGUF, 10.6 GB) Primary MoE model, via unsloth
Qwen3.5:9b Q4_K_M (GGUF, 6.1 GB) Vision + long context model
FLUX.2-klein-9B Q4_0 (GGUF, 5.3 GB) Image generation, via leejet
ggml-large-v3-turbo 1.6 GB Whisper model for audio transcription
ESRGAN RealESRGAN_x4plus (64 MB) 4× image upscaling
Python 3.13 queue-runner, netscan scripts

13. References

Hardware & Drivers

Resource URL
AMD BC-250 community docs (BIOS, setup) https://elektricm.github.io/amd-bc250-docs/
LLVM AMDGPU processor table (GFX1013) https://llvm.org/docs/AMDGPUUsage.html#processors
Mesa RADV Vulkan driver https://docs.mesa3d.org/drivers/radv.html
Linux TTM memory manager https://www.kernel.org/doc/html/latest/gpu/drm-mm.html

LLM Inference

Resource URL
Ollama — local LLM runtime https://github.com/ollama/ollama
Qwen3.5 model family (Alibaba) https://huggingface.co/Qwen
Qwen3.5-35B-A3B GGUF (unsloth) https://huggingface.co/unsloth/Qwen3.5-35B-A3B-GGUF
Qwen3.5-9B (Ollama) https://ollama.com/library/qwen3.5:9b
GGUF quantization format https://github.com/ggerganov/llama.cpp/blob/master/docs/gguf.md

Image & Video Generation

Resource URL
stable-diffusion.cpp (Vulkan) https://github.com/leejet/stable-diffusion.cpp
FLUX.2-klein-9B GGUF https://huggingface.co/leejet/FLUX.2-klein-9B-GGUF
FLUX.2-klein-4B GGUF https://huggingface.co/leejet/FLUX.2-klein-4B-GGUF
FLUX.1-Kontext-dev (image editing) https://huggingface.co/black-forest-labs/FLUX.1-Kontext-dev
Chroma (flash distilled) https://huggingface.co/leejet/Chroma-GGUF
WAN 2.1 T2V (video generation) https://huggingface.co/Wan-AI
Real-ESRGAN (image upscaling) https://github.com/xinntao/Real-ESRGAN

Audio & Speech

Resource URL
whisper.cpp (Vulkan STT) https://github.com/ggerganov/whisper.cpp
Whisper large-v3-turbo model https://huggingface.co/ggerganov/whisper-large-v3-turbo

Messaging & Integration

Resource URL
signal-cli (Signal messenger CLI) https://github.com/AsamK/signal-cli
Signal Protocol https://signal.org/docs/

Appendix A — OpenClaw Archive

▸ Historical: OpenClaw gateway configuration (replaced in v7)

OpenClaw v2026.2.26 was used as the Signal ↔ Ollama gateway from project inception through queue-runner v6. It was a Node.js daemon that managed signal-cli as a child process, routed messages to the LLM, and provided an agent framework with tool dispatch.

Why it was replaced:

  • ~700 MB RSS on a 16 GB system (4.4% of total RAM)
  • 15+ second overhead per agent turn (system prompt injection, tool resolution)
  • Unreliable fallback chains caused "fetch failed" timeout cascades
  • Could not run scripts as direct subprocesses — everything went through the LLM agent
  • signal-cli children survived gateway OOM kills, holding port 8080 as orphans
  • 9.6K system prompt that couldn't be reduced below ~4K without breaking tools

What replaced it: See §5 for the current architecture.

A.1 Installation (historical)

sudo dnf install -y nodejs npm
sudo npm install -g openclaw@latest

openclaw onboard \
  --non-interactive --accept-risk --auth-choice skip \
  --install-daemon --skip-channels --skip-skills --skip-ui --skip-health \
  --daemon-runtime node --gateway-bind loopback

A.2 Model configuration (historical)

~/.openclaw/openclaw.json:

{
  "models": {
    "providers": {
      "ollama": {
        "baseUrl": "http://127.0.0.1:11434",
        "apiKey": "ollama-local",
        "api": "ollama",
        "models": [{
          "id": "qwen3-14b-16k",
          "name": "Qwen 3 14B (16K ctx)",
          "contextWindow": 16384,
          "maxTokens": 8192,
          "reasoning": true
        }]
      }
    }
  },
  "agents": {
    "defaults": {
      "model": {
        "primary": "ollama/qwen3-14b-16k",
        "fallbacks": ["ollama/qwen3-14b-abl-nothink:latest", "ollama/mistral-nemo:12b"]
      },
      "thinkingDefault": "high",
      "timeoutSeconds": 1800
    }
  }
}

A.3 Tool optimization (historical)

{
  "tools": {
    "profile": "coding",
    "alsoAllow": ["message", "group:messaging"],
    "deny": ["browser", "canvas", "nodes", "cron", "gateway"]
  },
  "skills": { "allowBundled": [] }
}

A.4 Agent identity (historical)

Personality lived in workspace markdown files (~/.openclaw/workspace/):

File Purpose Size
SOUL.md Core personality 1.0 KB
IDENTITY.md Name/emoji 550 B
USER.md Human info 1.7 KB
TOOLS.md Tool commands 2.1 KB
AGENTS.md Grounding rules 1.4 KB
WORKFLOW_AUTO.md Cron bypass rules 730 B

A.5 Signal channel (historical)

{
  "channels": {
    "signal": {
      "enabled": true,
      "account": "+<BOT_PHONE>",
      "cliPath": "/usr/local/bin/signal-cli",
      "dmPolicy": "pairing",
      "allowFrom": ["+<YOUR_PHONE>"],
      "sendReadReceipts": true,
      "textChunkLimit": 4000
    }
  }
}

A.6 Service management (historical)

systemctl --user status openclaw-gateway   # status
openclaw logs --follow                     # live logs
openclaw doctor                            # diagnostics
openclaw channels status --probe           # signal health

The gateway service (openclaw-gateway.service) ran as a user-level systemd unit. It has been disabled and masked:

systemctl --user disable --now openclaw-gateway
systemctl --user mask openclaw-gateway

Artur Andrzejczak · [email protected] · March 2026

Development assisted by Claude Opus 4.6.

Code: AGPL-3.0 · Docs: CC BY-SA 4.0
