██████╗ ██████╗ ██████╗ ███████╗ ██████╗
██╔══██╗██╔════╝ ╚════██╗██╔════╝██╔═████╗
██████╔╝██║ █████╗ █████╔╝███████╗██║██╔██║
██╔══██╗██║ ╚════╝██╔═══╝ ╚════██║████╔╝██║
██████╔╝╚██████╗ ███████╗███████║╚██████╔╝
╚═════╝ ╚═════╝ ╚══════╝╚══════╝ ╚═════╝
GPU-accelerated AI home server on an obscure AMD APU — Vulkan inference, autonomous intelligence, Signal chat
Zen 2 · RDNA 1.5 · 16 GB unified · Vulkan · 14B @ 27 tok/s · 330 autonomous jobs/cycle · 130 dashboard pages
The BC-250 powered by an ATX supply, cooled by a broken AIO radiator with 3 fans just sitting on top of it. Somehow runs 24/7 without issues so far.
A complete guide to running a 35B-parameter MoE LLM, FLUX.2 image generation, and 330 autonomous jobs on the AMD BC-250 — an obscure APU (Zen 2 CPU + Cyan Skillfish RDNA 1.5 GPU) found in Samsung's blockchain/distributed-ledger rack appliances. Not a "crypto mining GPU," not a PS5 prototype — it's a custom SoC that Samsung used for private DLT infrastructure, repurposed here as a headless AI server with a community-patched BIOS.
Qwen3.5-35B MoE at 38 tok/s, FLUX.2-klein-9B at best quality, hardware-specific driver workarounds, memory tuning notes, and real-world benchmarks on this niche hardware.
What makes this unusual: The BC-250's Cyan Skillfish GPU (GFX1013) is one of the few documented cases of LLM inference on RDNA 1.5. ROCm doesn't support it. OpenCL doesn't expose it. The only viable compute path is Vulkan — and even that required working around two kernel memory bottlenecks (GTT cap + TTM pages_limit) before 14B models would run.
| § | Section | For | What you'll find |
|---|---|---|---|
| | **PART I ─ HARDWARE & SETUP** | | |
| 1 | Hardware Overview | BC-250 owners | Specs, memory architecture, power |
| 2 | Driver & Compute Stack | BC-250 owners | What works (Vulkan), what doesn't (ROCm) |
| 3 | Ollama + Vulkan Setup | BC-250 owners | Install, GPU memory tuning (GTT + TTM) |
| 4 | Models & Benchmarks | LLM users | Model compatibility, speed, memory budget |
| | **PART II ─ AI STACK** | | |
| 5 | Signal Chat Bot | Bot builders | Chat, vision analysis, audio transcription, smart routing |
| 6 | Image Generation | Creative users | FLUX.2-klein-9B, synchronous pipeline |
| | **PART III ─ MONITORING & INTEL** | | |
| 7 | Netscan Ecosystem | Home lab admins | 330 jobs, queue-runner v7, 130-page dashboard |
| 8 | Career Intelligence | Job seekers | Two-phase scanner, salary, patents |
| | **PART IV ─ REFERENCE** | | |
| 9 | Repository Structure | Contributors | File layout, deployment paths |
| 10 | Troubleshooting | Everyone | Common issues and fixes |
| 11 | Known Limitations | Maintainers | What's broken, what to watch out for |
| 12 | Software Versions | Everyone | Pinned versions of all components |
| 13 | References | Everyone | Links to all upstream projects and models |
| A | OpenClaw Archive | Historical | Original architecture, why we ditched it |
The AMD BC-250 is a custom APU originally designed for Samsung's blockchain/distributed-ledger rack appliances (not a traditional "mining GPU"). It's a full SoC — Zen 2 CPU and Cyan Skillfish RDNA 1.5 GPU on a single package, with 16 GB of on-package unified memory. Samsung deployed these in rack-mount enclosures for private DLT workloads; decommissioned boards now sell for ~$100–150 on the secondhand market, making them an affordable option for running 14B LLMs on dedicated hardware.
▸ Origin story — Samsung, 5G operators, and AliExpress
What it was built for: Samsung commissioned these custom AMD SoCs to build rack-mount servers for private DLT (Distributed Ledger Technology) infrastructure — not public cryptocurrency mining. The target customers were South Korean 5G operators (SK Telecom and others), who were early adopters of 5G deployment. Private blockchain solved several real problems for 5G telcos:
- IoT microtransactions: 5G networks connect millions of smart devices. DLT enables cheap, instant machine-to-machine contract settlement without overloading central databases.
- Digital identity & security: Operators used DLT registries for cryptographic customer authentication and digital identity wallets (e.g. Samsung Pay integration).
- Inter-operator settlement: Blockchain streamlined real-time roaming fee reconciliation and data exchange between telecom partners.
Who made the hardware: The SoC was designed by AMD (Zen 2 CPU + RDNA 1.5 GPU). Samsung designed the overall system and wrote the factory BIOS. The physical boards were manufactured by ASRock Rack (ASRock's server division) as an OEM contractor — Samsung rack enclosures typically held 12 BC-250 boards each. ASRock Rack is known for producing highly custom designs for large tech companies.
How they ended up on AliExpress: Classic corporate e-waste cycle. As 5G infrastructure evolved, entire Korean server racks were decommissioned. Specialized recycling centers (mostly near Shenzhen, China) buy pallets of retired servers in bulk — often by weight. Workers disassemble the racks, test individual boards, and list working BC-250 modules on AliExpress as all-in-one SBC platforms for $100–150.
Not a PlayStation 5. Despite superficial similarities (both use Zen 2 + 16 GB memory), the BC-250 has nothing to do with the PS5. The PS5's Oberon SoC is RDNA 2 (GFX10.3, gfx1030+); the BC-250's Cyan Skillfish is RDNA 1.5 (GFX10.1, gfx1013) — a hybrid architecture: the GFX10.1 instruction set (RDNA 1) but with hardware ray tracing support (full `VK_KHR_ray_tracing_pipeline`, `VK_KHR_acceleration_structure`, `VK_KHR_ray_query`). LLVM's AMDGPU processor table lists GFX1013 as product "TBA" under GFX10.1, confirming it was never a retail part. Samsung also licensed RDNA 2 for mobile (Exynos 2200 / Xclipse 920) — that's a completely separate deal.

Why "RDNA 1.5"? GFX1013 doesn't fit cleanly into AMD's public RDNA generations. It has the RDNA 1 (GFX10.1) ISA and shader compiler target, but includes hardware ray tracing — a feature AMD only shipped publicly with RDNA 2 (GFX10.3). This makes Cyan Skillfish a transitional/custom design, likely built for Samsung's specific workload requirements. We call it "RDNA 1.5" as a practical label.
BIOS is not stock. The board ships with a minimal Samsung BIOS meant for rack operation. A community-patched BIOS (from AMD BC-250 docs) enables standard UEFI features (boot menu, NVMe boot, fan control).
| Component | Details |
|---|---|
| CPU | Zen 2 — 6c/12t @ 2.0 GHz |
| GPU | Cyan Skillfish — RDNA 1.5, GFX1013, 24 CUs (1536 SPs), ray tracing capable |
| Memory | 16 GB unified (16 × 1 GB on-package), shared CPU/GPU |
| VRAM | 512 MB BIOS-carved framebuffer (same physical UMA pool — see note below) |
| GTT | 16 GiB (tuned via ttm.pages_limit=4194304, default 7.4 GiB) |
| Vulkan total | 16.5 GiB after tuning |
| Storage | 475 GB NVMe |
| OS | Fedora 43, kernel 6.18.9, headless |
| TDP | 220W board (inference: 130–155W, between jobs: 55–60W, true idle w/o model: ~35W) |
| BIOS | Community-patched UEFI (not Samsung stock) — AMD BC-250 docs |
| CPU governor | performance (stock schedutil causes LLM latency spikes) |
CPU and GPU share the same 16 GB physical pool (UMA — Unified Memory Architecture). The 512 MB "dedicated framebuffer" reported by mem_info_vram_total is carved from the same physical memory — it's a BIOS reservation, not separate silicon. The rest is accessible as GTT (Graphics Translation Table).
UMA reality: On unified memory, "100% GPU offload" means the model weights and KV cache live in GTT-mapped pages that the GPU accesses directly — there's no PCIe copy. However, it's still the same physical RAM the CPU uses. "Fallback to CPU" on UMA isn't catastrophic like on discrete GPUs (no bus transfer penalty), but GPU ALUs are faster than CPU ALUs for matrix ops.
Two bottlenecks must be fixed:
- GTT cap — the `amdgpu` driver defaults GTT to 50% of RAM (~7.4 GiB). The legacy fix was `amdgpu.gttsize=14336` in the kernel cmdline, but this is no longer needed.
- TTM pages_limit — the kernel's TTM memory manager independently caps allocations at ~7.4 GiB. Fix: `ttm.pages_limit=4194304` (16 GiB in 4K pages). This is the only tuning needed.
✅ GTT migration complete: `amdgpu.gttsize` was removed from the kernel cmdline. With `ttm.pages_limit=4194304` alone, GTT grew from 14→16 GiB and Vulkan-available memory from 14.0→16.5 GiB. The deprecated parameter was actually limiting the allocation.
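The magic number is easy to sanity-check — TTM counts 4 KiB pages, so the cap works out to exactly the GTT size the driver reports:

```shell
# ttm.pages_limit counts 4 KiB pages; convert the target value to GiB.
pages=4194304
echo "$(( pages * 4 / 1024 / 1024 )) GiB"   # → 16 GiB
```

(4194304 × 4096 bytes = 17179869184 — the value `mem_info_gtt_total` reports after tuning.)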
After tuning: Vulkan sees 16.5 GiB — enough for 14B parameter models at 40K context with Q4_0 KV cache, all inference on GPU.
The BC-250's GFX1013 sits awkwardly between supported driver tiers.
| Layer | Status | Notes |
|---|---|---|
| amdgpu kernel driver | ✅ | Auto-detected, firmware loaded |
| Vulkan (RADV/Mesa) | ✅ | Mesa 25.3.4, Vulkan 1.4.328 |
| ROCm / HIP | ❌ | rocblas_abort() — GFX1013 not in GPU list |
| OpenCL (rusticl) | ❌ | Mesa's rusticl doesn't expose GFX1013 |
Why ROCm fails: GFX1013 is listed in LLVM as supporting rocm-amdhsa, but AMD's ROCm userspace (rocBLAS/Tensile) doesn't ship GFX1013 solution libraries. Vulkan is the only viable GPU compute path.
▸ Verification commands
vulkaninfo --summary
# → GPU0: AMD BC-250 (RADV GFX1013), Vulkan 1.4.328, INTEGRATED_GPU
cat /sys/class/drm/card1/device/mem_info_vram_total # → 536870912 (512 MB)
cat /sys/class/drm/card1/device/mem_info_gtt_total  # → 15032385536 (14 GiB with the legacy amdgpu.gttsize; 16 GiB after TTM tuning)

curl -fsSL https://ollama.com/install.sh | sh
# Enable Vulkan backend (disabled by default)
sudo mkdir -p /etc/systemd/system/ollama.service.d
cat <<EOF | sudo tee /etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment=OLLAMA_VULKAN=1
Environment=OLLAMA_KEEP_ALIVE=30m
Environment=OLLAMA_MAX_LOADED_MODELS=1
Environment=OLLAMA_FLASH_ATTENTION=1
Environment=OLLAMA_GPU_OVERHEAD=0
Environment=OLLAMA_CONTEXT_LENGTH=16384
Environment=OLLAMA_MAX_QUEUE=4
OOMScoreAdjust=-1000
EOF
sudo systemctl daemon-reload && sudo systemctl restart ollama
`OOMScoreAdjust=-1000` protects Ollama from the OOM killer — the model process must survive at all costs (see §3.4).
ROCm will crash during startup — expected and harmless. Ollama catches it and uses Vulkan.
✅ No longer needed. The `amdgpu.gttsize` parameter has been removed. With `ttm.pages_limit=4194304` alone, GTT allocates 16 GiB (more than the old 14 GiB). Verify:
cat /sys/class/drm/card1/device/mem_info_gtt_total # → 17179869184 (16 GiB)
# If you still have amdgpu.gttsize in cmdline, remove it:
sudo grubby --update-kernel=ALL --remove-args="amdgpu.gttsize=14336"

TTM pages_limit — this is the key fix. Without it, 14B models load fine but produce HTTP 500 during inference.
# Runtime (immediate)
echo 4194304 | sudo tee /sys/module/ttm/parameters/pages_limit
echo 4194304 | sudo tee /sys/module/ttm/parameters/page_pool_size
# Persistent
echo "options ttm pages_limit=4194304 page_pool_size=4194304" | \
sudo tee /etc/modprobe.d/ttm-gpu-memory.conf
printf "w /sys/module/ttm/parameters/pages_limit - - - - 4194304\n\
w /sys/module/ttm/parameters/page_pool_size - - - - 4194304\n" | \
sudo tee /etc/tmpfiles.d/gpu-ttm-memory.conf
sudo dracut -f

Ollama allocates KV cache based on the model's declared context window. Without a cap, large models request more KV cache than the BC-250 can handle, causing TTM fragmentation, OOM kills, or deadlocks on this UMA system.
Fix: Set OLLAMA_CONTEXT_LENGTH=16384 in the Ollama systemd override (see §3.3). This caps all inference to 16K context by default — matching the MoE primary model's limit.
Individual requests can override with `{"options": {"num_ctx": 65536}}` when using `qwen3.5:9b` (which handles 65K). The cap only affects the default allocation.
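As a concrete sketch (assuming Ollama's default port 11434), a per-request override against the `/api/generate` endpoint looks like this — the payload is built first so it can be inspected before sending:

```shell
# Per-request context override for the 65K-context 9B model.
payload='{"model":"qwen3.5:9b","prompt":"Summarize ...","options":{"num_ctx":65536},"stream":false}'
echo "$payload" | grep -o '"num_ctx":65536'   # sanity check → "num_ctx":65536
# Against a live server:
# curl -s http://localhost:11434/api/generate -d "$payload"
```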
History of context tuning:
| Date | Context Cap | Primary Model | Why |
|---|---|---|---|
| Feb 2026 | 40960 | qwen3:14b | Default — caused deadlocks (TTM fragmentation) |
| Feb 25 | 24576 | qwen3:14b | Sweet spot: ~27 tok/s, 26K was 10% slower, 28K+ deadlocked |
| Mar 14 | 16384 | qwen3.5-35b-a3b MoE | MoE maxes at 16K (KV cache exceeds VRAM at 24K+). 9B fallback can go to 65K per-request. |
Why 24K → 16K? The 35B MoE's total weight (11 GB GGUF) is larger than qwen3:14b (9.3 GB). At 24K+ context the KV cache can't fit alongside the MoE weights. 16K is the maximum stable context for the MoE with all layers on GPU. See §4.3 for detailed KV cache scaling.
With the model consuming 11+ GB on a 14 GB system, disk swap is essential for surviving inference peaks.
NVMe wear concern: Swap is a safety net, not an active paging target. In steady state, swap usage is ~400 MB (OS buffers pushed out to make room for model weights). SMART data after months of 24/7 operation: 3% wear, 25.4 TB total written. The model runs entirely in RAM — swap catches transient spikes during model load/unload transitions. Consumer NVMe drives rated for 300–600 TBW will last years at this rate.
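A rough lifetime estimate from those SMART numbers — the elapsed-months figure below is an assumption; substitute your drive's actual power-on time:

```shell
# Back-of-envelope NVMe endurance check (integer TB, illustrative values).
written_tb=25      # SMART: ~25.4 TB written so far
months=6           # ASSUMED elapsed 24/7 runtime — adjust to your SMART data
rated_tbw=300      # conservative end of the 300–600 TBW consumer range
tb_per_year=$(( written_tb * 12 / months ))
echo "$(( (rated_tbw - written_tb) / tb_per_year )) years of headroom at this write rate"
```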
# Create 16 GB swap file (btrfs requires dd, not fallocate)
sudo dd if=/dev/zero of=/swapfile bs=1M count=16384 status=progress
sudo chattr +C /swapfile # disable btrfs copy-on-write
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon -p 10 /swapfile
# Make permanent
echo '/swapfile none swap sw,pri=10 0 0' | sudo tee -a /etc/fstab

Disable or reduce zram — zram compresses pages in physical RAM, competing with the model:
sudo mkdir -p /etc/systemd/zram-generator.conf.d
echo -e '[zram0]\nzram-size = 2048' | sudo tee /etc/systemd/zram-generator.conf.d/small.conf
# Or disable entirely: zram-size = 0

Verify memory state:

sudo journalctl -u ollama -n 20 | grep total
# → total="11.1 GiB" available="11.1 GiB" (with qwen3-14b-16k)
free -h
# → Swap: 15Gi total, ~1.4Gi used

Boot headless (multi-user target, no desktop):

sudo systemctl set-default multi-user.target && sudo reboot

The stock schedutil governor down-clocks during idle, causing 50–100 ms latency spikes at inference start. Lock all cores to full speed:
# Runtime (immediate)
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
# Persistent (systemd-tmpfiles)
echo 'w /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor - - - - performance' | \
sudo tee /etc/tmpfiles.d/cpu-governor.conf

16 GB Unified Memory
| Region | Size | Notes |
|---|---|---|
| VRAM carveout | 512 MB | BIOS-reserved from UMA pool (not separate memory) |
| GTT | 16 GiB | Tuned via ttm.pages_limit=4194304 (default 7.4 GiB). amdgpu.gttsize removed — no longer needed. |
| TTM pages_limit | 16 GiB | ttm.pages_limit=4194304 — the only memory tuning parameter needed |
| Vulkan heap | Size |
|---|---|
| Device-local | 8.33 GiB |
| Host-visible | 8.17 GiB |
| Total | 16.5 GiB → 14B models fit, all inference on GPU (UMA — same physical pool) |
| Consumer | Usage | Notes |
|---|---|---|
| Model weights (qwen3:14b) | 8.2 GiB GPU + 0.4 GiB CPU | Q4_K_M quantization |
| KV cache (FP16 @ 24K) | 3.8 GiB | With Q4_0: only 1.8 GiB for 40K context |
| Compute graph | 0.17 GiB | GPU-side |
| signal-cli + queue-runner | ~1.0 GiB | System RAM |
| OS + services | ~0.9 GiB | Headless Fedora 43 |
| NVMe swap | 16 GiB (374 MB used) | Safety net |
| zram | 0 B (allocated, not active) | Device exists but disksize=0 |
| Total loaded | 12.5 GiB (FP16) / 10.6 GiB (Q4_0) | 3.9–5.9 GiB free |
Ollama 0.18.0 · Vulkan · RADV Mesa 25.3.4 · 16.5 GiB Vulkan · FP16 KV
| Model | Params | Quant | tok/s | Prefill | Max Ctx | VRAM @4K | Status |
|---|---|---|---|---|---|---|---|
| qwen3.5-35b-a3b-iq2m | 35B/3B | UD-IQ2_M | 38 | 233 | 16K | 12.3 GiB | 🏆 Primary — MoE |
| qwen3.5:9b | 9.7B | Q4_K_M | 32 | 230 | 65K | 8.6 GiB | 🏆 Best context+vision |
| qwen2.5:3b | 3.1B | Q4_K_M | 104 | 515 | 64K | 3.4 GiB | ✅ Fast, lightweight |
| qwen2.5:7b | 7.6B | Q4_K_M | 56 | 248 | 64K | 6.5 GiB | ✅ Great quality/speed |
| qwen2.5-coder:7b | 7.6B | Q4_K_M | 56 | 246 | 64K | 6.4 GiB | ✅ Code-focused |
| llama3.1:8b | 8.0B | Q4_K_M | 52 | 246 | 48K | 11.0 GiB | ✅ Fast 8B |
| mannix/llama3.1-8b-lexi | 8.0B | Q4_0 | 51 | 308 | 48K | 10.6 GiB | ✅ Uncensored 8B |
| huihui_ai/seed-coder-abliterate | 8.3B | Q4_K_M | 52 | 231 | 64K | 9.1 GiB | ✅ Code gen, uncensored |
| qwen3:8b | 8.2B | Q4_K_M | 44 | 251 | 64K | 9.8 GiB | ✅ Thinking mode |
| huihui_ai/qwen3-abliterated:8b | 8.2B | Q4_K_M | 46 | 250 | 64K | 9.7 GiB | ✅ Abliterated 8B |
| gemma2:9b | 9.2B | Q4_0 | 38 | 219 | 48K | 9.2 GiB | ✅ Fixed! (was 91% before GTT fix) |
| mistral-nemo:12b | 12.2B | Q4_0 | 34 | 137 | 24K | 10.8 GiB | |
| qwen3:14b | 14.8B | Q4_K_M | 27 | 131 | 24K | 13.5 GiB | ✅ Previous primary |
| huihui_ai/qwen3-abliterated:14b | 14.8B | Q4_K_M | 28 | 137 | 24K | 11.4 GiB | ✅ Abliterated |
| phi4:14b | 14.7B | Q4_K_M | 29 | 128 | 40K | 11.8 GiB | 🏆 Best 14B context |
| Qwen3-30B-A3B (Q2_K) | 30.5B | Q2_K | 61 | — | 16K | 11.5 GiB | Deprecated |
| qwen3.5-27b-iq2m | 26.9B | IQ2_M | 0 | — | — | 13.5 GiB | ❌ Non-functional¹ |
All models run 100% on GPU after GTT tuning (16 GiB). Before the fix, gemma2:9b was only 91% GPU-offloaded (26 tok/s → 38 tok/s after fix).
¹ Why 27B dense fails: The dense architecture requires all 27B parameters in every forward pass. Without matrix cores (GFX1013 has none), each token requires ~27B multiplications through general-purpose shader cores. Result: 0 tokens generated in 5 minutes. The 35B MoE with only 3B active params per token avoids this entirely — compute is ~9× less per token despite having more total knowledge stored.
Prefill column: Measured at ~400 tokens prompt size (warm model, FP16 KV). Prefill rate depends on prompt length — see §4.5 for detailed sweep. Smaller models (3B) saturate the GPU compute and achieve higher prefill. Larger models (14B) are memory-bandwidth-limited at ~128–137 tok/s. MoE and 9B land between at ~230 tok/s — the MoE benefits from only loading 3B active expert weights per token during prefill. Qwen3-30B-A3B and qwen3.5-27b not measured (deprecated/non-functional).
March 14 — Qwen3.5 era: Ollama upgraded 0.16.1→0.18.0 (required for Qwen3.5). The qwen3.5-35b-a3b MoE (35B total, 3B active per token) at IQ2_M quantization is now the primary model on BC-250: 38 tok/s, 233 tok/s prefill, 16K context, multimodal (vision+tools+thinking). The qwen3.5:9b provides 65K context with vision when longer documents are needed. Both are Qwen3.5 architecture — a newer generation than Qwen3.
⚠️ IQ2_M quality tradeoff: The extreme quantization (~2.5 bits per parameter) is a significant quality compromise — perplexity increases and complex mathematical reasoning degrades compared to higher-precision quantizations. For everyday tasks (summarization, JSON extraction, tool use, chat) the quality is adequate. For tasks requiring precise reasoning, the `qwen3.5:9b` fallback (Q4_K_M, ~4.5 bits) provides substantially better accuracy. This is an informed tradeoff: more knowledge at lower precision vs. less knowledge at higher precision.
Generation speed (tok/s) — higher is better:
Model tok/s Max Ctx ██ = 10 tok/s
─────────────────────────────────────────────────────────
qwen2.5:3b 104 64K ██████████▌
Qwen3-30B-A3B Q2_K 61 16K ██████▏
qwen2.5:7b 56 64K █████▌
qwen2.5-coder:7b 56 64K █████▌
llama3.1:8b 52 48K █████▏
seed-coder-abl:8b 52 64K █████▏
lexi-8b (uncensored) 51 48K █████
qwen3-abl:8b 46 64K ████▌
qwen3:8b 44 64K ████▍
★ qwen3.5-35b-a3b MoE 38 16K ███▊ ← PRIMARY (35B/3B)
gemma2:9b 38 48K ███▊
  mistral-nemo:12b 34 24K ███▍
★ qwen3.5:9b 32 65K ███▏ ← best ctx + vision
phi4:14b 29 40K ██▉
qwen3-abl:14b 28 24K ██▊
qwen3:14b 27 24K ██▋
qwen3.5-27b (dense) 0 — ❌ non-functional
Context ceiling per model (FP16 KV, all GPU):
Model 16K 24K 32K 48K 64K
──────────────────────────────────────────
qwen2.5:3b ✅ ✅ ✅ ✅ ✅
qwen2.5:7b ✅ ✅ ✅ ✅ ✅
qwen2.5-coder:7b ✅ ✅ ✅ ✅ ✅
qwen3:8b ✅ ✅ ✅ ✅ ✅
qwen3-abl:8b ✅ ✅ ✅ ✅ ✅
seed-coder:8b ✅ ✅ ✅ ✅ ✅
★ qwen3.5:9b ✅ ✅ ✅ ✅ ✅
llama3.1:8b ✅ ✅ ✅ ✅ ❌
lexi-8b ✅ ✅ ✅ ✅ ❌
gemma2:9b ✅ ✅ ✅ ✅ —
mistral-nemo:12b ✅ ✅ ❌ — —
qwen3:14b ✅ ✅ ❌ — —
qwen3-abl:14b ✅ ✅ ❌ — —
phi4:14b ✅ ✅ ✅ — —
★ 35B-A3B iq2m ✅ ❌ — — —
30B-A3B Q2_K ✅ ❌ — — —
qwen3.5-27b iq2m ❌ — — — —
4K and 8K columns omitted — every model passes at those sizes.
✅ = works 100% GPU | ❌ = timeout/deadlock | — = not tested (too large)
Key insight: Speed is constant across context sizes with FP16 KV (speed only degrades when the context is actually filled — see §4.4). The context ceiling is purely a memory constraint: weights + KV cache + compute graph must fit in 16.5 GiB.
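That constraint can be checked with the qwen3:14b budget from the table above (weights 8.2 GiB, FP16 KV at 24K ≈ 3.8 GiB, compute graph 0.17 GiB):

```shell
# Does the 14B working set fit the 16.5 GiB Vulkan pool? (awk for float math)
awk 'BEGIN {
  weights = 8.2; kv = 3.8; graph = 0.17; budget = 16.5
  used = weights + kv + graph
  printf "%.2f GiB used, %.2f GiB headroom\n", used, budget - used
}'
```

It fits with room to spare — which is why the ceiling only bites once the KV cache grows past ~24K context.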
Graphical benchmarks: generation-speed and prefill-speed charts.
The context window directly controls KV cache size, and on 16 GB unified memory, every megabyte counts. After v7 (OpenClaw removal freed ~700 MB, GTT bumped to 14 GB), we re-tested all context sizes systematically:
Context window vs memory (qwen3:14b Q4_K_M, flash attention, 16 GB GTT)
| Context | RAM Used | Free | Swap | Speed | Status |
|---|---|---|---|---|---|
| 8192 | ~9.5 GB | 6.5 GB | — | ~27 t/s | ✅ Safe |
| 12288 | ~10.3 GB | 5.7 GB | — | ~27 t/s | ✅ Conservative |
| 16384 | ~11.1 GB | 4.9 GB | — | ~27 t/s | ✅ Comfortable |
| 18432 | ~13.2 GB | 2.7 GB | 0.9 GB | 26.8 t/s | ✅ Works |
| 20480 | ~13.7 GB | 2.3 GB | 0.9 GB | 26.8 t/s | ✅ Works |
| 22528 | ~14.0 GB | 2.0 GB | 0.9 GB | 26.7 t/s | ✅ Works |
| 24576 | ~14.4 GB | 1.5 GB | 0.9 GB | 26.7 t/s | ✅ Max for qwen3:14b |
| 26624 | ~14.6 GB | 1.3 GB | 1.0 GB | 23.9 t/s | ⚠️ Works, 10% slower |
| 28672 | ~14.2 GB | — | 1.7 GB | timeout | ❌ Deadlocks |
| 32768 | ~15.7 GB | 0.2 GB | 2.1 GB | timeout | ❌ Deadlocks |
| 40960 | ~16.0 GB | 0 | — | — | 💀 TTM fragmentation¹ |
24K is the sweet spot — full speed (~27 tok/s), leaves ~1.5 GB for OS/services with stable swap at 0.9 GB. 26K works but inference drops 10% due to swap pressure. 28K+ deadlocks under Vulkan.
¹ Why 40K fails isn't raw OOM. The math: 9.3 GB weights + 2 GB KV cache + 1 GB OS ≈ 12.3 GB < 16 GB available. The actual failure is TTM fragmentation — the kernel's TTM memory manager can't allocate a contiguous block large enough for the KV cache because physical pages are fragmented across GPU and CPU consumers. This is a UMA-specific problem: on discrete GPUs with dedicated VRAM, fragmentation doesn't cross the PCIe boundary.
History: The original 24K experiment (Feb 25) deadlocked because the OpenClaw gateway consumed ~700 MB. After v7 removed OpenClaw and bumped GTT to 14 GB (Mar 5), 24K became stable. Flash attention (`OLLAMA_FLASH_ATTENTION=1`) is essential — without it, 24K would not fit.
UPDATE: KV cache quantization WORKS on Vulkan. Our README previously stated it was a no-op — that was wrong. Tested on Ollama 0.16.1 + RADV Mesa 25.3.4:
| KV Type | 24K ctx | 32K ctx | 48K ctx | KV Cache Size @24K | Gen tok/s | Notes |
|---|---|---|---|---|---|---|
| FP16 (default) | ✅ | ❌ deadlock | — | ~3.8 GiB | 27.2 | Current production |
| Q8_0 | ✅ | ✅ | ✅ | 2.0 GiB | 27.3 | Conservative upgrade |
| Q4_0 | ✅ | ✅ | ✅ | 1.1 GiB | 27.3 | ← recommended |
KV cache scaling (Q4_0): ~45 MiB per 1K tokens (16K=720M, 24K=1.1G, 40K=1.8G).
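That linear rate makes budgeting trivial — a quick check of the figures above:

```shell
# Q4_0 KV cache size at the ~45 MiB per 1K tokens rate quoted above.
for ctx_k in 16 24 40; do
  echo "${ctx_k}K ctx -> $(( ctx_k * 45 )) MiB"
done
```

Output matches the measured sizes: 720 MiB at 16K, ~1.1 GiB at 24K, 1.8 GiB at 40K.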
Extreme context tests (Q4_0): Ollama's scheduler auto-sizes KV to what fits in VRAM. With 14.5 GiB available, model weights 8.2 GiB, the maximum KV allocation is ~40K tokens (1.8 GiB). Requesting larger num_ctx is accepted but the runner silently caps and truncates prompts to the actual KV limit.
Generation speed degrades with context fill (Q4_0, all layers on GPU):
| Tokens in context | Gen tok/s | Prefill tok/s | Notes |
|---|---|---|---|
| ~100 (empty) | 27.2 | 58 | Headline number |
| 3,300 | 24.6 | 113 | Typical Signal chat |
| 10,000 | 20.7 | 70 | Long job output |
| 30,000 | 13.4 | 53 | Heavy document analysis |
| 40,960 (max fill) | ~10* | ~42 | Theoretical, near KV limit |
* Estimated from degradation curve. One test at 41K showed 1.2 tok/s, but that was caused by model partial offload (21/41 layers spilled to CPU), not normal operation.
Q8_0 ceiling: Fits up to ~64K context on GPU. At 80K, KV cache spills to CPU (7 tok/s — unusable). Non-deterministic — depends on memory state at load time.
Not deploying to production. MoE model (primary) is capped at 16K context — KV quantization provides no benefit (bottleneck is weight size, not KV). Potentially useful for the 9B fallback model at 40K+ context, but not worth the quality risk.
# If ever needed for 9B model at extreme context:
# Environment=OLLAMA_KV_CACHE_TYPE=q4_0
# in /etc/systemd/system/ollama.service.d/override.conf

Current production: FP16 KV (Ollama default). Context capped at 16K for the MoE via `OLLAMA_CONTEXT_LENGTH=16384`.
On UMA, both prefill and generation share the same unified-memory bandwidth (the 16 GB on-package pool). Prefill is the time the model spends "reading" the prompt before generating the first token.
For embedded engineers: Think of LLM inference as two phases — like a bootloader and a main loop. Prefill is the "bootloader": the model processes the entire input prompt in one burst (parallel, compute-bound — like DMA-ing a firmware image into SRAM). Token generation is the "main loop": the model produces output tokens one at a time, sequentially (memory-bandwidth-bound — like polling a UART at a fixed baud rate). MoE (Mixture of Experts) is like having 35 specialized ISRs but only routing to 3 of them per interrupt — you get the routing intelligence of knowing all 35, but only pay the execution cost of 3. That's why a 35B-parameter MoE runs faster than a 14B dense model on hardware without matrix cores.
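Under this two-phase model, warm TTFT is simply prompt length divided by prefill rate — e.g. at the MoE's measured ~228 tok/s long-prompt prefill:

```shell
# Warm TTFT ≈ prompt_tokens / prefill_rate (MoE long-prompt measurement).
awk -v tokens=1179 -v prefill=228 'BEGIN { printf "TTFT ≈ %.1f s\n", tokens / prefill }'
```

This reproduces the ~5.2 s TTFT measured for the 1,179-token prompt in the tables below.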
Prefill rate vs prompt size — production models (FP16 KV cache, warm):
qwen3.5-35b-a3b-iq2m (MoE 35B/3B active, UD-IQ2_M):
| Prompt Size | Tokens | Prefill | Gen tok/s | TTFT (warm) |
|---|---|---|---|---|
| Tiny | 17 | 53 tok/s | 39.3 | 0.3s |
| Short | 42 | 68 tok/s | 39.6 | 0.6s |
| Medium | 384 | 231 tok/s | 38.5 | 1.7s |
| Long | 1,179 | 228 tok/s | 38.3 | 5.2s |
qwen3.5:9b (Q4_K_M, dense 9.7B):
| Prompt Size | Tokens | Prefill | Gen tok/s | TTFT (warm) |
|---|---|---|---|---|
| Tiny | 17 | 61 tok/s | 33.2 | 0.3s |
| Short | 42 | 118 tok/s | 33.0 | 0.4s |
| Medium | 384 | 229 tok/s | 33.0 | 1.7s |
| Long | 1,179 | 225 tok/s | 32.5 | 5.2s |
Observations: Both production models converge to ~230 tok/s prefill at medium-to-long prompts — the shared-memory bandwidth ceiling. At tiny prompts (<50 tokens), GPU compute overhead dominates and prefill drops to 53–61 tok/s. Generation rate is stable: the MoE holds 38–39 tok/s and the 9B holds 32–33 tok/s regardless of prompt size. TTFT scales linearly: ~1.7 s at 384 tokens, ~5.2 s at 1.2K tokens. For real-world Signal chat (3K system prompt + conversation), expect TTFT of ~15–20 s on cold start and <2 s when the model is warm (prompt cached via `OLLAMA_KEEP_ALIVE=30m`).
Historical: qwen3:14b Q4_K_M (previous primary, 24K context)
| Prompt Size | Tokens | Prefill | Gen tok/s | TTFT (warm) |
|---|---|---|---|---|
| Tiny | 86 | 88 tok/s | 27.2 | ~1s |
| Short | 353 | 67 tok/s | 27.2 | ~5s |
| Medium | 1,351 | 128 tok/s | 26.1 | ~11s |
| Long | 3,354 | 113 tok/s | 24.6 | ~30s |
| XL | 6,686 | 88 tok/s | 22.5 | ~76s |
| Massive | 10,014 | 70 tok/s | 20.7 | ~143s |
Generation rate degrades with context: 27.2 tok/s @small → 20.7 tok/s @10K tokens.
Graphical: prefill rate and generation rate vs prompt size:
Model Landscape Bubble Chart — generation speed × prefill speed × max context (bubble size = context window, unique color per model):
qwen3.5-35b-a3b-iq2m · headless server (from Ollama logs)
| Component | MoE @4K ctx | MoE @16K ctx | Notes |
|---|---|---|---|
| Model weights (GPU) | 10.3 GiB | ~8.2 GiB | 41/41 layers on Vulkan0; spills to CPU at higher ctx |
| Model weights (CPU) | 0.3 GiB | ~0.4 GiB | Spilled layers + embeddings |
| KV cache (GPU) | 1.6 GiB | ~3.8 GiB | Grows ~0.4 GiB per 1K tokens |
| Compute graph | ~0.2 GiB | ~0.2 GiB | GPU-side |
| Ollama total | 12.3 GiB | ~12.5 GiB | Ollama dynamically spills weights to make room for KV |
| OS + services | ~0.9 GiB | ~0.9 GiB | Headless Fedora 43 |
| Free (of 16.5 Vulkan) | ~4.2 GiB | ~4.0 GiB | |
| NVMe swap | 16 GiB | 16 GiB | Safety net |
MoE memory dynamics: As context grows, Ollama intelligently spills weight layers from GPU to CPU to maintain a ~12.5 GiB total. The MoE's total weight (11 GB GGUF) is larger than qwen3:14b (9.3 GB), but only 3B params activate per token — so CPU-spilled layers that aren't selected experts cause zero compute penalty. At 24K+ context, the KV cache exceeds what can fit alongside the weights, causing OOM or timeout.
Qwen3.5 is the latest generation — multimodal (vision + tools + thinking), Apache 2.0.
| Use Case | Recommended Model | tok/s | Max Ctx | Why |
|---|---|---|---|---|
| 🏆 General AI / primary | qwen3.5-35b-a3b-iq2m | 38 | 16K | 35B knowledge, 3B active, fastest reasoning |
| 🏆 Long context / vision | qwen3.5:9b | 32 | 65K | Multimodal, stable context scaling, vision |
| Long context (14B) | phi4:14b | 29 | 40K | Best 14B model for long context on this hardware |
| Fast batch jobs | qwen2.5:7b | 56 | 64K | 2× faster than 14B, 64K context |
| Code generation | qwen2.5-coder:7b | 56 | 64K | Same speed as base, code-specialized |
| Speed-critical | qwen2.5:3b | 104 | 64K | 4× faster, use for simple tasks |
| Previous primary | qwen3:14b (abliterated) | 28 | 24K | Replaced by Qwen3.5 models |
Production dual-model config:
`qwen3.5-35b-a3b-iq2m` as primary with `OLLAMA_CONTEXT_LENGTH=16384`. For tasks needing >16K context or vision (image analysis), switch to `qwen3.5:9b`, which handles 65K context and can process images.

The MoE beats the 9B dense model in generation speed (38 vs 32 tok/s) because only 3B parameters activate per token — on hardware without matrix cores, fewer multiplications win. Both achieve similar prefill rates (~230 tok/s at ~400-token prompts), but the 9B wins on context capacity (65K vs 16K) because its smaller total weight leaves more room for KV cache.
# Primary model (35B MoE) — custom GGUF via Modelfile
# See tmp/Modelfile-qwen35-35b-a3b for setup
ollama create qwen3.5-35b-a3b-iq2m -f Modelfile-qwen35-35b-a3b
# High-context model (vision+65K, official Ollama)
ollama pull qwen3.5:9b
# Context is capped via OLLAMA_CONTEXT_LENGTH=16384 in systemd (see §3.3, §3.4)
# Individual requests can override with {"options": {"num_ctx": 65536}} when using 9b

Why not a bigger MoE? Even though only 3B params activate per token, all 35B params must reside in memory — the router decides per-token which experts to fire, so every weight must be loaded. At IQ2_M (~2.5 bits per parameter), 35B ≈ 11 GB GGUF. The next MoE up — Qwen3-235B-A22B — would be ~44 GB at IQ2_M (2.7× too large). Mixtral 8×22B (141B) would be ~35 GB. Going below IQ2_M (e.g. IQ1_S at ~1.5 bits) causes quality collapse. qwen3.5-35b-a3b at IQ2_M is the largest MoE that fits in 16 GB with usable quantization on this hardware.
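The ~11 GB figure follows directly from the bit-width — a rough estimate only, since real GGUFs mix quantization types per tensor:

```shell
# Approximate GGUF size: params × bits-per-weight / 8.
awk -v params=35e9 -v bpw=2.5 'BEGIN { printf "IQ2_M 35B ≈ %.1f GB\n", params * bpw / 8 / 1e9 }'
```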
The BC-250 runs a personal AI assistant accessible via Signal messenger — no gateway, no middleware. signal-cli runs as a standalone systemd service exposing a JSON-RPC API, and queue-runner handles all LLM interaction directly.
Signal --> signal-cli (JSON-RPC :8080) --> queue-runner --> Ollama --> GPU (Vulkan)
Software: signal-cli v0.13.24 (native binary) · Ollama 0.18+ · queue-runner v7
OpenClaw was the original gateway (v2026.2.26, Node.js). It was replaced because:
| Problem | Impact |
|---|---|
| ~700 MB RSS | On a 16 GB system, that's 4.4% of RAM wasted on a routing layer |
| 15+ second overhead per job | Agent turn setup, tool resolution, system prompt injection — for every cron job |
| Unreliable model routing | Fallback chains and timeout cascades caused 5-min "fetch failed" errors |
| No subprocess support | Couldn't run Python/bash scripts directly — had to shell out through the agent |
| 9.6K system prompt | Couldn't be trimmed below ~4K tokens without breaking tool dispatch |
| Orphan processes | signal-cli children survived gateway OOM kills, holding port 8080 |
The replacement: queue-runner talks to signal-cli and Ollama directly via HTTP APIs. Zero middleware.
See Appendix A for the original OpenClaw configuration.
signal-cli runs as a standalone systemd daemon with JSON-RPC:
# /etc/systemd/system/signal-cli.service
[Unit]
Description=signal-cli JSON-RPC daemon
After=network.target
[Service]
Type=simple
ExecStart=/opt/signal-cli/bin/signal-cli --output=json \
-u +<BOT_PHONE> jsonRpc --socket http://127.0.0.1:8080
Restart=always
RestartSec=5
[Install]
WantedBy=multi-user.target

Register a separate phone number for the bot via `signal-cli register` or `signal-cli link`.
Between every queued job, queue-runner.py polls the signal-cli journal for incoming messages. Messages are routed based on content type:
queue-runner v7 — continuous loop
job N → check Signal inbox → route message → job N+1
| |
v |
journalctl -u ┌──────┼──────┐
signal-cli │ │ │
audio image text
│ │ │
v v v
whisper qwen3.5 choose_model()
-cli :9b MoE or 9B
(Vulkan) vision ↓
│ │ Ollama /api/chat
│ │ │
v v v
signal-cli: send reply
Key parameters:
| Setting | Value | Purpose |
|---|---|---|
| `SIGNAL_CHAT_CTX` | 16384 | MoE model context window |
| `VISION_MODEL` | qwen3.5:9b | Vision analysis model (multimodal) |
| `VISION_CTX` | 4096 | Vision context (image tokens are large) |
| `ROUTING_TOKEN_THRESHOLD` | 8000 | Switch to 9B for long prompts |
| `SIGNAL_CHAT_MAX_EXEC` | 3 | Max shell commands per message |
| `SIGNAL_EXEC_TIMEOUT_S` | 30 | Per-command timeout |
| `SIGNAL_MAX_REPLY` | 1800 | Signal message character limit |
The LLM can request shell commands via EXEC(command) in its response. queue-runner intercepts these, runs them, feeds stdout back into the conversation, and lets the LLM synthesize a final answer:
User: "what's the disk usage?"
LLM: [thinking...] EXEC(df -h /)
Runner: executes → feeds output back
LLM: "Root is 67% full, 48G free on your 128GB NVMe."
Supported patterns: web search (ddgr), file reads (cat, head), system diagnostics (journalctl, systemctl, df, free), data queries (jq on JSON files). Up to 3 commands per turn.
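A simplified sketch of the interception loop (hypothetical helper names, not the actual queue-runner.py code; a command containing a closing paren would need a smarter parser than this regex):

```python
import re
import subprocess

EXEC_RE = re.compile(r"EXEC\(([^)]*)\)")
MAX_EXEC = 3        # SIGNAL_CHAT_MAX_EXEC
TIMEOUT_S = 30      # SIGNAL_EXEC_TIMEOUT_S

def extract_execs(reply: str) -> list[str]:
    """Pull up to MAX_EXEC shell commands the LLM requested via EXEC(...)."""
    return EXEC_RE.findall(reply)[:MAX_EXEC]

def run_execs(reply: str) -> list[str]:
    """Run each command, capturing stdout to feed back into the conversation."""
    outputs = []
    for cmd in extract_execs(reply):
        try:
            r = subprocess.run(cmd, shell=True, capture_output=True,
                               text=True, timeout=TIMEOUT_S)
            outputs.append(r.stdout or r.stderr)
        except subprocess.TimeoutExpired:
            outputs.append(f"[timeout after {TIMEOUT_S}s]")
    return outputs
```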
When the LLM detects an image request, it emits EXEC(/opt/stable-diffusion.cpp/generate-and-send "prompt"). queue-runner intercepts this pattern and handles it synchronously:
- Stop Ollama (free GPU VRAM)
- Run sd-cli with FLUX.2-klein-9B (4 steps, 512×512, ~105s)
- Send image as Signal attachment
- Restart Ollama
Bot is offline during generation (~2–3 minutes total including model reload).
Image editing (Kontext): Send a photo to Signal with an edit instruction ("make it cyberpunk", "add a hat"). The LLM emits EXEC(/opt/stable-diffusion.cpp/edit-image "instruction"), queue-runner runs FLUX.1-Kontext-dev with the photo as reference, and sends back the edited image (~5–10 min @512²). Input images are automatically resized to 512×512. See §6.2 for a demo (Sonic → Shadow the Hedgehog).
Video generation: Ask for a video/animation. Uses WAN 2.1 T2V 1.3B (~38 min for 17 frames @480×320).
ESRGAN upscale: Every generated image is automatically upscaled 4× with RealESRGAN_x4plus (512²→2048² in ~25s). Both versions sent via Signal — thumbnail + full-res. Also available on-demand via chat.
⚠️ GFX1013 bug: sd-cli hangs after writing the output image (Vulkan cleanup). queue-runner polls for the file and kills the process.
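A sketch of that workaround, assuming the generate script wraps sd-cli roughly like this (function and parameter names are illustrative):

```python
import os
import subprocess
import time

def run_sd_with_watchdog(cmd: list[str], out_path: str,
                         settle_s: float = 5.0, timeout_s: float = 600.0) -> bool:
    """GFX1013 workaround: sd-cli writes the PNG, then hangs in Vulkan
    teardown. Poll for the output file; once it exists, let the final
    write settle, then kill the stuck process."""
    proc = subprocess.Popen(cmd)
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if os.path.exists(out_path) and os.path.getsize(out_path) > 0:
            time.sleep(settle_s)          # let the final write flush
            if proc.poll() is None:
                proc.kill()               # reap the hung process
            proc.wait()
            return True
        if proc.poll() is not None:       # exited on its own (no hang)
            return os.path.exists(out_path)
        time.sleep(2)
    proc.kill()                           # hard timeout: no image produced
    proc.wait()
    return False
```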
The system prompt defines a cynical, darkly funny personality ("House MD meets a sysadmin lobster"). Key traits:
- Direct, no corporate speak, no disclaimers
- Dark humor about the hardware constraints
- Full access to /opt/netscan/data/ for grounded answers
- Knows AK's professional context (embedded Linux, camera drivers, V4L2/libcamera)
- Uncensored creative generation (abliterated model, no safety filters)
The personality is baked into queue-runner.py's SYSTEM_PROMPT — no external workspace files needed.
| Scenario | Latency |
|---|---|
| Text reply (warm) | 10–30s |
| Complex reasoning with tool use | 30–90s |
| Image generation (FLUX.2-klein-9B 512²) | ~105s |
| Image generation + auto-upscale 4× | ~130s |
| Image editing (Kontext 512²) | ~5 min |
| Video generation (WAN 2.1 480×320) | ~38 min |
| ESRGAN 4× upscale (on-demand) | ~25s |
| Cold start (model reload) | 30–60s |
| Voice note transcription (≤40s) | 3–5s |
| Vision analysis (photo → description) | ~40–80s |
Send a photo to Signal without an edit keyword (no "draw", "generate", "create") and the bot analyzes it using qwen3.5:9b's native multimodal vision. The 9B model processes base64-encoded images via Ollama's /api/chat endpoint.
User: [photo of a circuit board] "what chip is this?"
Router: image + non-edit text → vision analysis (9B)
9B: "That's an STM32F407 — the LQFP-100 package, 168 MHz Cortex-M4."
How edit vs. analysis is decided:
| Input | Keywords detected | Action |
|---|---|---|
| Photo + "make it cyberpunk" | ✓ edit | → Kontext image editing (§5.5) |
| Photo + "what is this?" | ✗ | → qwen3.5:9b vision analysis |
| Photo (no text) | ✗ | → qwen3.5:9b vision analysis |
Key detail: qwen3.5:9b requires "think": false in the API call. With thinking enabled, the model produces only hidden thinking tokens and returns an empty visible response. Discovered via 7 iterative tests (tests 1–6 all returned empty content).
The MoE model (qwen3.5-35b-a3b-iq2m) has no vision capability — it returns HTTP 500 when given images. This is why model routing is essential.
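A sketch of the vision call that works — the body follows Ollama's /api/chat shape with thinking disabled; endpoint and helper names are illustrative:

```python
import base64
import json
import urllib.request

OLLAMA_CHAT = "http://127.0.0.1:11434/api/chat"

def vision_body(image_bytes: bytes, question: str,
                model: str = "qwen3.5:9b", num_ctx: int = 4096) -> dict:
    """Request body for vision analysis. 'think' must be False: with
    thinking on, this model emits only hidden tokens and the visible
    reply comes back empty."""
    return {
        "model": model,
        "stream": False,
        "think": False,
        "options": {"num_ctx": num_ctx},       # VISION_CTX
        "messages": [{
            "role": "user",
            "content": question,
            "images": [base64.b64encode(image_bytes).decode()],
        }],
    }

def ask_vision(image_bytes: bytes, question: str) -> str:
    req = urllib.request.Request(
        OLLAMA_CHAT,
        data=json.dumps(vision_body(image_bytes, question)).encode(),
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.load(resp)["message"]["content"]
```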
Send a voice note to Signal and the bot transcribes it using whisper.cpp with Vulkan GPU acceleration:
User: [voice note, 15 seconds, Polish]
Router: audio/* → whisper-cli (auto language detection)
Whisper: "Hej, sprawdź mi pogodę na jutro" (pl, 15.2s audio)
Router: → feed transcription to LLM for response
LLM: "Jutro 18°C, częściowe zachmurzenie..."
Whisper setup on BC-250:
| Component | Value |
|---|---|
| Runtime | whisper.cpp (Vulkan, built from source) |
| Model | ggml-large-v3-turbo (1.6 GB) |
| Binary | /opt/whisper.cpp/build/bin/whisper-cli |
| Threads | 6 (all Zen 2 cores) |
| Language | Auto-detect (EN/PL confirmed) |
Both models were benchmarked with real English TTS speech (flite) at three durations. The speed difference is modest (~2×), but memory is the dealbreaker — the larger model doesn't fit alongside Ollama in 16 GB.
Speed comparison:
| Audio | large-v3-turbo | large-v3 | Speedup |
|---|---|---|---|
| 3.6s | 3.3s | 7.9s | 2.4× |
| 18.2s | 3.5s | 8.9s | 2.6× |
| 39.2s | 4.3s | 8.1s | 1.9× |
The memory problem:
The BC-250 has 16 GB total (UMA — shared between CPU and GPU). The Ollama MoE model takes 10.6 GB. OS and buffers need ~3.5 GB. That leaves the memory budget looking like this:
| Scenario | Ollama | Whisper | OS/buffers | Total | Fits 16 GB? |
|---|---|---|---|---|---|
| Ollama only | 10.6 GB | — | 3.5 GB | 14.1 GB | ✅ 1.9 GB free |
| + large-v3-turbo | 10.6 GB | 1.6 GB | 3.5 GB | 15.7 GB | ✅ 0.3 GB free |
| + large-v3 | 10.6 GB | 2.9 GB | 3.5 GB | 17.0 GB | ❌ 1.0 GB overflow → swap |
When the total exceeds 16 GB, the kernel pushes pages to NVMe swap. This shows up as a measurable swap delta:
large-v3 pushes ~1 GB into swap on first load. large-v3-turbo causes zero swap. Once pages are evicted, subsequent large-v3 runs may show 0 swap delta (the 39s test) because those pages were already swapped out by earlier runs — but the damage (swap pressure, latency spikes) already happened.
Quality is comparable. Both models tested on a 39s embedded-systems passage (flite TTS). Both made the same synthesis artifacts ("kilobots" for "kilobytes", "Wipcomer" for "libcamera"). Neither is clearly better on robotic TTS.
Verdict: large-v3-turbo — 2× faster, 45% smaller, zero swap pressure. The quality tradeoff is negligible on BC-250's memory budget.
queue-runner automatically selects the best model for each message based on content:
def choose_chat_model(user_text, has_image=False):
    if has_image:
        return "qwen3.5:9b", 4096            # only model with vision
    if estimate_tokens(user_text) > 8000:
        return "qwen3.5:9b", 16384           # 9B handles 65K context
    return "qwen3.5-35b-a3b-iq2m", 16384     # MoE — faster, smarter

| Route | Model | Speed | When |
|---|---|---|---|
| Default | qwen3.5-35b-a3b MoE | 37.7 tok/s | Normal chat (most messages) |
| Vision | qwen3.5:9b | 31.8 tok/s | Photo attached (no edit keywords) |
| Long context | qwen3.5:9b | 31.8 tok/s | Prompt > 8K tokens |
The MoE activates only 3B of its 35B parameters per token, giving it faster generation than the dense 9B despite being a "larger" model. Both models are Qwen3.5-family and produce comparable text quality for short exchanges. The 9B is reserved for tasks that require vision or long context — capabilities the MoE lacks.
Stable Diffusion via stable-diffusion.cpp with native Vulkan backend.
▸ Build from source
sudo dnf install -y vulkan-headers vulkan-loader-devel glslc git cmake gcc g++ make
cd /opt && sudo git clone --recursive https://github.com/leejet/stable-diffusion.cpp.git
sudo chown -R $(whoami) /opt/stable-diffusion.cpp && cd stable-diffusion.cpp
mkdir -p build && cd build && cmake .. -DSD_VULKAN=ON -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)

FLUX.2-klein-9B — recommended, best quality, Apache 2.0:
mkdir -p /opt/stable-diffusion.cpp/models/flux2 && cd /opt/stable-diffusion.cpp/models/flux2
# Diffusion model (9B, Q4_0, 5.3 GB)
curl -L -O "https://huggingface.co/leejet/FLUX.2-klein-9B-GGUF/resolve/main/flux-2-klein-9b-Q4_0.gguf"
# Qwen3-8B text encoder (Q4_K_M, 4.7 GB)
curl -L -o qwen3-8b-Q4_K_M.gguf "https://huggingface.co/unsloth/Qwen3-8B-GGUF/resolve/main/Qwen3-8B-Q4_K_M.gguf"
# FLUX.2 VAE (321 MB) — different from FLUX.1 VAE!
curl -L -o flux2-vae.safetensors "https://huggingface.co/Comfy-Org/vae-text-encorder-for-flux-klein-4b/resolve/main/split_files/vae/flux2-vae.safetensors"

Memory: 5.3 GB VRAM (diffusion) + 6.2 GB VRAM (Qwen3-8B encoder) + 95 MB (VAE) = ~11.8 GB total. Stresses the 16.5 GB Vulkan pool properly. Best quality of all tested models.
FLUX.2-klein-4B — fast alternative, Apache 2.0:
cd /opt/stable-diffusion.cpp/models/flux2
# Diffusion model (4B, Q4_0, 2.3 GB)
curl -L -O "https://huggingface.co/leejet/FLUX.2-klein-4B-GGUF/resolve/main/flux-2-klein-4b-Q4_0.gguf"
# Qwen3-4B text encoder (Q4_K_M, 2.4 GB)
curl -L -o qwen3-4b-Q4_K_M.gguf "https://huggingface.co/unsloth/Qwen3-4B-GGUF/resolve/main/Qwen3-4B-Q4_K_M.gguf"
# Reuses same flux2-vae.safetensors from above

Memory: 2.3 GB VRAM (diffusion) + 3.6 GB VRAM (Qwen3-4B encoder) + 95 MB (VAE) = ~6 GB total. 7× faster than 9B but lower quality. Good for quick previews.
FLUX.1-schnell — previous default, Apache 2.0:
mkdir -p /opt/stable-diffusion.cpp/models/flux && cd /opt/stable-diffusion.cpp/models/flux
curl -L -O "https://huggingface.co/second-state/FLUX.1-schnell-GGUF/resolve/main/flux1-schnell-q4_k.gguf"
curl -L -O "https://huggingface.co/second-state/FLUX.1-schnell-GGUF/resolve/main/ae.safetensors"
curl -L -O "https://huggingface.co/second-state/FLUX.1-schnell-GGUF/resolve/main/clip_l.safetensors"
curl -L -O "https://huggingface.co/city96/t5-v1_1-xxl-encoder-gguf/resolve/main/t5-v1_1-xxl-encoder-Q4_K_M.gguf"

Memory: 6.5 GB VRAM (diffusion) + 2.9 GB RAM (T5-XXL Q4_K_M) = ~10 GB total.
Chroma flash Q4_0 — alternative, open-source:
cd /opt/stable-diffusion.cpp/models/flux
curl -L -o chroma-unlocked-v47-flash-Q4_0.gguf "https://huggingface.co/leejet/Chroma-GGUF/resolve/main/chroma-unlocked-v47-flash-Q4_0.gguf"
# Reuses existing T5-XXL and FLUX.1 ae.safetensors from above

Memory: 5.1 GB VRAM (diffusion) + 3.2 GB RAM (T5-XXL) = ~8.4 GB total.
SD-Turbo — fast fallback, lower quality:
cd /opt/stable-diffusion.cpp/models
curl -L -o sd-turbo.safetensors \
"https://huggingface.co/stabilityai/sd-turbo/resolve/main/sd_turbo.safetensors"Benchmarked 2026-03-14, sd.cpp master-525-d6dd6d7, Vulkan GFX1013 (16.5 GiB), Ollama stopped.
Important: FLUX GGUF files must use the `--diffusion-model` flag, not `-m`. The `-m` flag fails with "get sd version from file failed" because GGUF metadata is empty after tensor name conversion. This applies to all sd.cpp versions.
🏆 FLUX.2-klein-9B Q4_0 — new default (best quality):
| Resolution | Steps | Time | s/step | Notes |
|---|---|---|---|---|
| 512×512 | 4 | 104s | 15.4 | Default, ~11.8 GB VRAM total |
| 768×768 | 4 | 129s | 21.3 | Best balance of quality vs time |
FLUX.2-klein-9B uses a Qwen3-8B LLM as text encoder — richer prompt understanding and finer detail than the 4B variant. Stresses the 16.5 GB Vulkan pool properly (11.8 GB used). The `--offload-to-cpu` flag is essential (it manages the UMA allocation pools).
FLUX.2-klein-4B Q4_0 — fast alternative:
| Resolution | Steps | Time | s/step | Notes |
|---|---|---|---|---|
| 512×512 | 4 | 20s | 3.95 | Fast preview, ~6 GB VRAM total |
| 512×512 | 8 | 26s | 2.66 | Better quality, GPU warm |
| 768×768 | 4 | 30s | 5.43 | Great quality, no tiling |
| 1024×1024 | 4 | 63s | 10.18 | VAE tiling required |
| 1024×1024 | 4 | ❌ FAIL | — | Without --vae-tiling (VAE OOM) |
7× faster than 9B but noticeably less detailed. Good for quick previews or batch generation.
FLUX.1-schnell Q4_K — previous default:
| Resolution | Steps | Time | Notes |
|---|---|---|---|
| 512×512 | 4 | 30s | ~10 GB VRAM (6.5 diffusion + 3.4 encoders) |
| 768×768 | 4 | 91s | VAE tiling kicks in |
| 1024×1024 | 4 | 146s | VAE tiling, good quality |
| 512×512 | 8 | 77s | More steps, marginal improvement |
Chroma flash Q4_0 — quality alternative (reuses T5+VAE from FLUX.1):
| Resolution | Steps | Time | Notes |
|---|---|---|---|
| 512×512 | 4 | 85s | Sampling 46s + encoder 37s |
| 512×512 | 8 | 130s | Sampling 96s |
| 768×768 | 8 | 240s | Sampling 195s |
Chroma uses cfg-based guidance (like FLUX.1-dev) but is fully open. Quality is better than schnell per step, but 4× slower than FLUX.2-klein.
FLUX.1-dev Q4_K_S — high-quality, slow (city96/FLUX.1-dev-gguf, 6.8 GB):
| Resolution | Steps | Time | Notes |
|---|---|---|---|
| 512×512 | 20 | 279s | Sampling 253s (12.65 s/step), ~6.6 GB VRAM |
| 768×768 | 20 | ❌ FAIL | Guidance model compute graph exceeds VRAM |
SD-Turbo — fast fallback:
| Resolution | Steps | Time | Notes |
|---|---|---|---|
| 512×512 | 1 | 11s | Minimum viable, ~2 GB VRAM |
| 768×768 | 4 | 21s | Decent for quick previews |
Head-to-head comparison (same prompt, same hardware, back-to-back):
| Model | 512² @4s | 768² @4s | VRAM | Diffusion | Encoder |
|---|---|---|---|---|---|
| FLUX.2-klein-9B | 104s | 129s | 11.8 GB | 5.3 GB | Qwen3-8B (4.7 GB) |
| FLUX.2-klein-4B | 20s | 30s | 6 GB | 2.3 GB | Qwen3-4B (2.4 GB) |
| FLUX.1-schnell | 30s | 91s | 10 GB | 6.5 GB | CLIP+T5 (3.4 GB) |
| Chroma flash | 85s | 240s⁸ | 8.4 GB | 5.1 GB | T5 (3.2 GB) |
| FLUX.1-dev | 279s²⁰ | ❌ | 10 GB | 6.8 GB | CLIP+T5 (3.4 GB) |
| SD-Turbo | 11s¹ | 21s | 2 GB | 2 GB | (built-in) |
FLUX.2-klein-9B is the quality winner — more detail, better text understanding, and it actually stresses the 16.5 GB GPU properly (11.8 GB used vs 6 GB for 4B). The 4B version is 7× faster but leaves 10 GB unused.
🔬 Quality shootout — same prompt, same seed (42), 512×512 @4 steps:
All models tested back-to-back on the same prompt: "a cyberpunk cityscape at sunset with neon lights reflecting on wet streets, highly detailed"
| Model | Time | s/step | VRAM | File Size | Quality |
|---|---|---|---|---|---|
| FLUX.2-klein-9B | 104s | 15.4 | 11.8 GB | 709 KB | ★★★★ — finest detail, best reflections |
| FLUX.2-klein-4B | 15s | 2.7 | 6.0 GB | 704 KB | ★★★ — good but less detail |
| FLUX.1-schnell | 31s | 6.5 | 10.1 GB | 609 KB | ★★ — decent, less coherent |
| Chroma flash (8 steps) | 120s | 14.1 | 8.4 GB | 204 KB | ★★ — artistic but softer |
Example outputs (same prompt, same seed 42, 512×512):
| FLUX.2-klein-9B (★★★★) | FLUX.2-klein-4B (★★★) |
|---|---|
| ![]() | ![]() |
| 104s, 11.8 GB VRAM | 15s, 6.0 GB VRAM |
| FLUX.1-schnell (★★) | Chroma flash (★★) |
|---|---|
| ![]() | ![]() |
| 31s, 10.1 GB VRAM | 120s, 8.4 GB VRAM |
The 9B model produces visibly more detail in fine structures (neon reflections, wet streets, building facades). The 4B is the speed champion but sacrifices detail. Chroma has a distinctive artistic style but outputs smaller, softer images. FLUX.1-schnell sits in the middle.
Summary: recommended settings for production
| Use case | Model | Resolution | Steps | Time |
|---|---|---|---|---|
| Default (Signal) | FLUX.2-klein-9B | 512×512 | 4 | ~105s |
| High quality | FLUX.2-klein-9B | 768×768 | 4 | ~130s |
| Quick preview | FLUX.2-klein-4B | 512×512 | 4 | ~20s |
| Poster/wallpaper | FLUX.2-klein-4B | 1024×1024 | 4 | ~63s |
| Best quality (slow) | Chroma flash | 512×512 | 8 | ~130s |
# FLUX.2-klein-9B — recommended production command:
/opt/stable-diffusion.cpp/build/bin/sd-cli \
--diffusion-model models/flux2/flux-2-klein-9b-Q4_0.gguf \
--vae models/flux2/flux2-vae.safetensors \
--llm models/flux2/qwen3-8b-Q4_K_M.gguf \
-p "your prompt here" \
--cfg-scale 1.0 --steps 4 -H 512 -W 512 \
--offload-to-cpu --diffusion-fa -v \
  -o output.png

sd.cpp (master-525+) supports more models. The BC-250 has ~16.5 GB with Ollama stopped (post-GTT migration). All models use --offload-to-cpu (UMA — no PCIe penalty).
Image generation — tested models:
| Model | Params | GGUF Size | Total RAM¹ | Steps | Quality | Status |
|---|---|---|---|---|---|---|
| FLUX.2-klein-9B Q4_0 | 9B | 5.3 GB | ~11.8 GB | 4 | ★★★★ | ✅ Current default, 104s @512² |
| FLUX.2-klein-4B Q4_0 | 4B | 2.3 GB | ~6 GB | 4 | ★★★ | ✅ Fast alternative, 20s @512² |
| FLUX.1-schnell Q4_K | 12B | 6.5 GB | ~10 GB | 4 | ★★ | ✅ Previous default, 30s @512² |
| Chroma flash Q4_0 | 12B | 5.1 GB | ~8.4 GB | 4–8 | ★★★ | ✅ Tested — 85s @512², better quality |
| FLUX.1-dev Q4_K_S | 12B | 6.8 GB | ~10 GB | 20 | ★★★★ | ✅ Tested — 279s @512², ❌768²+ |
| SD-Turbo | 1.1B | ~2 GB | ~2.5 GB | 1–4 | ★ | ✅ Fast preview, 11s @512² |
| SD3.5-medium Q4_0 | 2.5B | 1.7 GB | ~6 GB | 28 | ★★★ | ✅ Tested — 49s @512², needs clip_g+clip_l+T5+F16 VAE³ |
¹ Total RAM includes diffusion model + text encoder(s) + VAE.
³ BF16 VAE gotcha — see SD3.5 section below.
Video generation — tested models:
| Model | Params | GGUF Size | Total RAM¹ | Frames | Time | Status |
|---|---|---|---|---|---|---|
| WAN 2.1 T2V 1.3B Q4_0 | 1.3B | 826 MB | ~5 GB | 17 @480×320 | ~38 min | ✅ Works on BC-250 |
WAN requires umt5-xxl text encoder (3.5 GB Q4_K_M) + WAN VAE (243 MB). Outputs raw AVI (MJPEG). No matrix cores = slow but works.
Video generation — tested (OOM):
| Model | Params | GGUF Size | Total RAM¹ | Notes |
|---|---|---|---|---|
| WAN 2.2 TI2V 5B Q4_0 | 5B | 2.9 GB | ~9 GB | ❌ OOM crash at Q4_0. Model (2.9G) + VAE (1.4G) + T5 (4.7G) = 9 GB — exceeds UMA budget during video denoising. May work with Q2_K model + Q2_K T5 (~6 GB) but untested. |
Image editing — FLUX.1-Kontext-dev:
| Model | Params | GGUF Size | Total RAM¹ | Status |
|---|---|---|---|---|
| FLUX.1-Kontext-dev Q4_0 | 12B | 6.8 GB | ~10 GB | ✅ Tested — 316s @512² (no swap). 1024² causes swap pressure (40+ min). Uses -r flag, reuses FLUX.1 T5/CLIP/VAE |
Kontext is a dedicated image editing model by Black Forest Labs. It takes a reference image via `-r` and a text instruction to produce an edited version. Uses existing FLUX.1 encoders (T5-XXL, CLIP_L) and VAE (ae.safetensors) from /opt/stable-diffusion.cpp/models/flux/.

# Edit an existing image with Kontext:
sd-cli --diffusion-model models/flux/flux1-kontext-dev-Q4_0.gguf \
  --vae models/flux/ae.safetensors --clip_l models/flux/clip_l.safetensors \
  --t5xxl models/flux/t5-v1_1-xxl-encoder-Q4_K_M.gguf --clip-on-cpu \
  -r input.png -p "change the sky to sunset" --cfg-scale 3.5 --steps 28 \
  --sampling-method euler --offload-to-cpu --diffusion-fa -o output.png
Kontext demo — "turn Sonic into Shadow the Hedgehog":
| Input (1200×1600 → resized to 512×512) | Output (512×512, 647s) | Output + ESRGAN 4× (2048×2048, +25s) |
|---|---|---|
| ![]() | ![]() | ![]() |
The 4× upscaled version (right) is generated automatically by the ESRGAN auto-upscale pipeline — every generated/edited image gets a 2048×2048 version sent alongside the 512×512 original. Total overhead: ~25s with tile 192. See ESRGAN benchmarks below.
Timing breakdown (512×512, 28 steps, seed 42):
| Phase | Time | Notes |
|---|---|---|
| CLIP + T5 encoding | 3.5s | clip_l + clip_g + t5-v1_1-xxl Q4_K_M |
| Diffusion sampling | 43s | 28 steps × 1.5s/it (mmdit 2.1 GB on Vulkan) |
| VAE decode | 2.3s | F16-converted VAE (94.6 MB) |
| Total | 49s |
Model stack on disk:
| Component | File | Size |
|---|---|---|
| Diffusion | sd3.5_medium-q4_0.gguf | 1.7 GB |
| CLIP-L | clip_l.safetensors (shared with FLUX) | 246 MB |
| CLIP-G | clip_g.safetensors | 1.3 GB |
| T5-XXL | t5-v1_1-xxl-encoder-Q4_K_M.gguf (shared with FLUX) | 2.9 GB |
| VAE | sd3_vae_f16.safetensors (converted from BF16) | 160 MB |
| Total on disk | ~6.3 GB |
# SD3.5-medium generation command:
sd-cli --diffusion-model models/sd3/sd3.5_medium-q4_0.gguf \
--vae models/sd3/sd3_vae_f16.safetensors \
--clip_l models/flux/clip_l.safetensors \
--clip_g models/sd3/clip_g.safetensors \
--t5xxl models/flux/t5-v1_1-xxl-encoder-Q4_K_M.gguf \
-p "prompt" --cfg-scale 4.5 --sampling-method euler --steps 28 \
-W 512 -H 512 --diffusion-fa --offload-to-cpu -o output.png⚠ BF16 VAE gotcha: The upstream SD3 VAE (
diffusion_pytorch_model.safetensors) uses BF16 tensors. GFX1013 Vulkan has no BF16 support — the output is a solid blue/yellow rectangle. Fix: convert to F16 withpython3 convert_vae_bf16_to_f16.py input.safetensors output.safetensors(script in/tmp/).
Timing breakdown (480×320, 17 frames, 50 steps, seed 42):
| Phase | Time | Notes |
|---|---|---|
| umt5-xxl encoding | ~4s | 3.5 GB Q4_K_M text encoder |
| Diffusion sampling | ~35 min | 17 frames × 50 steps. No matrix cores → pure scalar Vulkan |
| VAE decode | ~30s | WAN VAE (243 MB), decodes all 17 frames |
| Total | ~38 min |
Model stack on disk:
| Component | File | Size |
|---|---|---|
| Diffusion | Wan2.1-T2V-1.3B-Q4_0.gguf | 826 MB |
| Text encoder | umt5-xxl-encoder-Q4_K_M.gguf | 3.5 GB |
| VAE | wan_2.1_vae.safetensors | 243 MB |
| Total on disk | ~4.5 GB |
# WAN 2.1 text-to-video generation:
sd-cli -M vid_gen \
--diffusion-model models/wan/Wan2.1-T2V-1.3B-Q4_0.gguf \
--vae models/wan/wan_2.1_vae.safetensors \
--t5xxl models/wan/umt5-xxl-encoder-Q4_K_M.gguf \
-p "A cat walking across a sunny garden" \
--cfg-scale 6.0 --sampling-method euler \
-W 480 -H 320 --diffusion-fa --offload-to-cpu \
  --video-frames 17 --flow-shift 3.0 -o output.mp4

Output format: sd.cpp produces raw AVI (MJPEG) regardless of the `-o` extension. The 17-frame clip plays at 16 fps (~1 second). Quality is recognizable but noisy — expected at Q4_0 with scalar-only Vulkan compute.

Why so slow? Each video frame is a full diffusion pass through the 1.3B model. With 17 frames × 50 steps and no matrix cores, every multiply is scalar. A GPU with tensor/matrix units (RDNA3+, Turing+) would be 5–10× faster.
WAN 2.1 demo — "A cat walking across a sunny garden":
17 frames @480×320, 50 steps, Q4_0 quantization, euler scheduler, cfg-scale 6.0. Generated in ~38 minutes on GFX1013 scalar Vulkan — no matrix/tensor cores. The BC-250 rendered every frame through pure ALU compute. Noisy but recognizable — a real video from a 1.3B-parameter model on a secondhand BC-250.
All generated images are automatically upscaled with RealESRGAN_x4plus (64 MB model, 4× scaling). Runs immediately after generation while Ollama is still stopped — zero extra GPU-swap cost.
ESRGAN tile size benchmark (512² input → 2048² output):
| Tile Size | Time | Output | Notes |
|---|---|---|---|
| 128 (default) | 15s | 2048×2048, 5.1 MB | Fastest, visible seams possible |
| 192 (production) | 25s | 2048×2048, 5.1 MB | Best quality/speed tradeoff |
| 256 | 41s | 2048×2048, 5.1 MB | Smoothest seams, 2.7× slower |
| 128 ×2 passes (16×!) | 4m 50s | 8192×8192, 67 MB | 512²→8192² in under 5 min |
Production uses tile 192: larger tiles mean fewer seam boundaries → cleaner upscale. The 16× mode (two ESRGAN passes) produces 67-megapixel images from 512² input — available on-demand via `EXEC(upscale ...)` but not automatic (too large for Signal).
End-to-end timing for all generation modes on BC-250:
Phase breakdown — where the time goes in each pipeline:
FLUX.1-schnell resolution scaling — time vs pixel count:
A research, monitoring, and data collection system with 330 autonomous jobs running on a GPU-constrained single-board computer. Dashboard at http://<LAN_IP>:8888 — 29 main pages + 101 per-host detail pages.
The BC-250 has 16 GB GTT shared with the CPU — only one LLM job can run at a time. queue-runner.py (systemd service) orchestrates all 330 jobs in a continuous loop, with Signal chat between every job:
queue-runner v7 -- Continuous Loop + Signal Chat
Cycle N:
330 jobs sequential, ordered by category:
scrape -> infra -> lore -> academic -> repo -> company -> career
-> think -> csi -> meta -> market -> report
HA observations interleaved every 50 jobs
Signal inbox checked between EVERY job
Chat processed with LLM (EXEC tool use + image gen)
Crash recovery: resumes from last completed job
Cycle N+1:
Immediately starts -- no pause, no idle windows
No nightly/daytime distinction
Key design decisions (v5 → v7):
| v5 (OpenClaw era) | v7 (current) |
|---|---|
| Nightly batch + daytime fill | Continuous loop, no distinction |
| 354 jobs (including duplicates) | 330 jobs (deduped, expanded) |
| LLM jobs routed through `openclaw cron run` | All jobs run as direct subprocesses |
| Signal via OpenClaw gateway (~700 MB) | signal-cli standalone (~100 MB) |
| Chat only when gateway available | Chat between every job |
| Async SD pipeline (worker scripts, 45s delay) | Synchronous SD (stop Ollama → generate → restart) |
| GPU idle detection for user chat preemption | No preemption needed — chat is interleaved |
All jobs run as direct subprocesses — subprocess.Popen for Python/bash scripts, no LLM agent routing. This is 3–10× faster than the old openclaw cron run path and eliminates the gateway dependency entirely.
The queue prioritizes data diversity — all dashboard tabs get fresh data even if the cycle is interrupted. See §7.3 for the full category breakdown with GPU times. HA observations are interleaved every 50 jobs, and Signal chat is checked between every job.
GPU idle detection is used for legacy --daytime mode and Ollama health checks:
# Three-tier detection:
# 1. Ollama /api/ps → no models loaded → definitely idle
# 2. sysfs pp_dpm_sclk → clock < 1200 MHz → model loaded but not computing
# 3. Ollama expires_at → model about to unload → idle for 3+ min

In continuous loop mode (default), GPU detection is only used for pre-flight health checks — not for yielding to user chat, since chat is interleaved between jobs.
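A sketch of tiers 1–2 of that check, assuming the standard amdgpu sysfs layout (the card index may differ per system); tier 3 (`expires_at`) is omitted for brevity:

```python
import json
import urllib.request

SCLK_PATH = "/sys/class/drm/card0/device/pp_dpm_sclk"   # card index may differ
BUSY_MHZ = 1200

def active_sclk_mhz(text: str) -> int:
    """Parse the '*'-marked (active) state from pp_dpm_sclk output,
    e.g. '1: 2000Mhz *' → 2000."""
    for line in text.splitlines():
        if line.strip().endswith("*"):
            return int(line.split(":")[1].strip().split("Mhz")[0])
    return 0

def gpu_idle() -> bool:
    """Tier 1: no model loaded. Tier 2: model loaded but clocked down."""
    try:
        with urllib.request.urlopen("http://127.0.0.1:11434/api/ps", timeout=5) as r:
            models = json.load(r).get("models", [])
    except OSError:
        return True                      # Ollama down → GPU is free
    if not models:
        return True                      # tier 1: nothing loaded
    with open(SCLK_PATH) as f:
        return active_sclk_mhz(f.read()) < BUSY_MHZ   # tier 2: not computing
```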
GPU jobs (queue-runner — sequential, one at a time):
| Script | Purpose | Jobs |
|---|---|---|
| `career-scan.py` | Two-phase career scanner (§8) | 1 |
| `career-think.py` | Per-company career deep analysis | 65 |
| `salary-tracker.py` | Salary intel — NoFluffJobs, career-scan extraction | 1 |
| `company-intel.py` | Deep company intel — GoWork, DDG news, layoffs (43 entities) | 1 |
| `company-think-*` | Focused company deep-dives | 106 |
| `patent-watch.py` | IR/RGB camera patent monitor — Google Patents, EPO OPS, DuckDuckGo | 1 |
| `event-scout.py` | Meetup/conference tracker — Poland, Europe | 1 |
| `leak-monitor.py` | CTI: 11 OSINT sources — HIBP, Hudson Rock, GitHub dorks, Ahmia dark web, CISA KEV, ransomware, Telegram | 1 |
| `idle-think.sh` | Research brain — 8 task types → JSON notes | 34 |
| `ha-journal.py` | Home Assistant analysis (climate, sensors, anomalies) | 2 |
| `ha-correlate.py` | HA cross-sensor correlation | 2 |
| `city-watch.py` | SkyscraperCity local construction tracker | 1 |
| `csi-sensor-watch.py` | CSI camera sensor patent/news monitor | 1 |
| `csi-think.py` | CSI camera domain analysis (drivers, ISP, GMSL) | 6 |
| `lore-digest.sh` | Kernel mailing list digests (8 feeds) | 8 |
| `repo-watch.sh` | Upstream repos (GStreamer, libcamera, v4l-utils, FFmpeg, LinuxTV) | 8 |
| `repo-think.py` | LLM analysis of repo changes | 26 |
| `market-think.py` | Market sector analysis + synthesis | 19 |
| `life-think.py` | Cross-domain life advisor | 2 |
| `system-think.py` | GPU/security/health system intelligence | 3 |
| `radio-scan.py` | Radio hobbyist forum tracker | 1 |
| `career-digest.py` | Weekly career digest → Signal (Sunday) | 1 |
| `daily-summary.py` | End-of-cycle summary → dashboard + Signal | 2 |
| `academic-watch.py` | Academic publication monitor (4 topics × 3 types) | 12 |
| `book-watch.py` | Book/publication tracker (11 subjects) | 11 |
| `news-watch.py` | Tech news aggregation + RSS | 2 |
| `weather-watch.py` | Weather forecast + HA sensor correlation | 2 |
| `car-tracker.py` | GPS car tracker (SinoTrack API) | 1 |
| `frost-guard.py` | Frost/freeze risk alerter | 1 |
CPU jobs (system crontab — independent of queue-runner):
| Script | Frequency | Purpose |
|---|---|---|
| `gpu-monitor.sh` + `.py` | 1 min | GPU utilization sampling (3-state) |
| `presence.sh` | 5 min | Phone presence tracker |
| `syslog.sh` | 5 min | System health logger |
| `watchdog.py` | 30 min (live), 06:00 (full) | Network security — ARP, DNS, TLS, vulnerability scoring |
| `scan.sh` + `enumerate.sh` | 04:00 | Network scan + enumeration (nmap) |
| `vulnscan.sh` | Weekly (Sun) | Vulnerability scan |
| `repo-watch.sh` | 08:00, 14:00, 18:00 | Upstream repo data collection |
| `report.sh` | 08:30 | Morning report rebuild |
| `generate-html.py` | After each queue-runner job | Dashboard HTML builder (6900+ lines) |
| `gpu-monitor.py chart` | 22:55 | Daily GPU utilization chart |
Job categories (auto-classified by name pattern):
| Category | Jobs | Typical GPU time | Examples |
|---|---|---|---|
| `scrape` | 29 | 0.1h | career-scan, salary, patents, book-watch, repo-scan (no LLM) |
| `infra` | 6 | 0.6h | leak-monitor, netscan, watchdog, frost-guard, radio-scan |
| `lore` | 8 | 0.5h | lore-digest per mailing list feed |
| `academic` | 12 | — | academic-watch per topic × type |
| `repo` | 27 | 0.3h | LLM analysis of repo changes + weekly digest |
| `company` | 107 | 0.9h | company-intel + competitive/financial/strategy deep-dives |
| `career` | 66 | 1.9h | career-think per company + weekly digest |
| `think` | 34 | 2.0h | research, trends, crawl, crossfeed |
| `csi` | 6 | 0.3h | CSI camera domain analysis |
| `meta` | 5 | — | life-think, system-think |
| `market` | 19 | 0.9h | market-think per asset + synthesis |
| `ha` | 4 | 1.0h | ha-correlate, ha-journal (interleaved) |
| `report` | 4 | — | daily-summary, news + weather analysis |
| `weekly` | 3 | — | vulnscan, csi-sensor-discover/improve |
| Total | 330 | ~9h | |
Data flow:
jobs.json (330 jobs)
|
v
queue-runner.py
|
|-- All jobs -> subprocess.Popen -> python3/bash /opt/netscan/...
| |
| JSON results <--------------------+
| |
| |-- /opt/netscan/data/{category}/*.json
| |
| +-- generate-html.py -> /opt/netscan/web/*.html -> nginx :8888
|
|-- Signal chat (between every job)
| via JSON-RPC http://127.0.0.1:8080/api/v1/rpc
|
+-- Signal alerts (career matches, leaks, events, daily summary)
All paths relative to /opt/netscan/:
| Data | Path | Source |
|---|---|---|
| Research notes | data/think/note-*.json + notes-index.json | `idle-think.sh` |
| Career scans | data/career/scan-*.json + latest-scan.json | `career-scan.py` |
| Career analysis | data/career/think-*.json | `career-think.py` |
| Salary | data/salary/salary-*.json (180-day history) | `salary-tracker.py` |
| Company intel | data/intel/intel-*.json + company-intel-deep.json | `company-intel.py` |
| Patents | data/patents/patents-*.json + patent-db.json | `patent-watch.py` |
| Events | data/events/events-*.json + event-db.json | `event-scout.py` |
| Leaks / CTI | data/leaks/leak-intel.json | `leak-monitor.py` |
| City watch | data/city/city-watch-*.json | `city-watch.py` |
| CSI sensors | data/csi-sensors/csi-sensor-*.json | `csi-sensor-watch.py` |
| HA correlations | data/correlate/correlate-*.json | `ha-correlate.py` |
| HA journal | data/ha-journal-*.json | `ha-journal.py` |
| Mailing lists | data/{lkml,soc,jetson,libcamera,dri,usb,riscv,dt}/ | `lore-digest.sh` |
| Repos | data/repos/ | `repo-watch.sh`, `repo-think.py` |
| Market | data/market/ | `market-think.py` |
| Academic | data/academic/ | `academic-watch` (LLM) |
| GPU load | data/gpu-load.tsv | `gpu-monitor.sh` |
| System health | data/syslog/health-*.tsv (30-day retention) | `syslog.sh` |
| Network hosts | data/hosts-db.json | `scan.sh` |
| Presence | data/presence-state.json | `presence.sh` |
| Radio | data/radio/ | `radio-scan.py` |
| Queue state | data/queue-runner-state.json | `queue-runner.py` |
Served by nginx at :8888, generated by generate-html.py (6900+ lines):
| Page | Content | Data source |
|---|---|---|
| `index.html` | Overview — hosts, presence, latest notes, status | aggregated |
| `home.html` | Home Assistant — climate, energy, anomalies | ha-journal, ha-correlate |
| `career.html` | Career intelligence — matches, trends | career-scan, career-think |
| `market.html` | Market analysis — sectors, commodities, crypto | market-think |
| `advisor.html` | Life advisor — cross-domain synthesis | life-think |
| `notes.html` | Research brain — all think notes | idle-think |
| `leaks.html` | CTI / leak monitor | leak-monitor |
| `issues.html` | Upstream issue tracking | repo-think |
| `events.html` | Events calendar — Poland, Europe | event-scout |
| `lkml.html` | Linux Media mailing list digest | lore-digest (linux-media) |
| `soc.html` | SoC bringup mailing list | lore-digest (soc-bringup) |
| `jetson.html` | Jetson/Tegra mailing list | lore-digest (jetson-tegra) |
| `libcamera.html` | libcamera mailing list | lore-digest (libcamera) |
| `dri.html` | DRI-devel mailing list | lore-digest (dri-devel) |
| `usb.html` | Linux USB mailing list | lore-digest (linux-usb) |
| `riscv.html` | Linux RISC-V mailing list | lore-digest (linux-riscv) |
| `dt.html` | Devicetree mailing list | lore-digest (devicetree) |
| `academic.html` | Academic publications | academic-watch |
| `hosts.html` | Network device inventory | scan.sh |
| `security.html` | Host security scoring | vulnscan.sh |
| `presence.html` | Phone detection timeline | presence.sh |
| `load.html` | GPU utilization heatmap + schedule | gpu-monitor |
| `radio.html` | Radio hobbyist activity | radio-scan.py |
| `car.html` | Car tracker | car-tracker |
| `weather.html` | Weather forecast + HA sensor correlation | weather-watch.py |
| `news.html` | Tech news aggregation + RSS | news-watch.py |
| `health.html` | System health assessment (services, data freshness, LLM quality) | bc250-extended-health.py |
| `history.html` | Changelog | — |
| `log.html` | Raw scan logs | — |
| `host/*.html` | Per-host detail pages (101 hosts) | scan.sh, enumerate.sh |
Mailing list feeds are configured in `digest-feeds.json` — 8 feeds from lore.kernel.org, each with relevance-scoring keywords.
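The relevance pass can be sketched as a weighted keyword hit count over each message subject — a minimal sketch assuming a flat keyword→weight map, which is not necessarily the real `digest-feeds.json` schema:

```python
def relevance_score(subject: str, keywords: dict[str, int]) -> int:
    """Sum the weight of every keyword found in a message subject.

    `keywords` maps lowercase keyword -> weight (illustrative schema).
    """
    subject_lower = subject.lower()
    return sum(weight for kw, weight in keywords.items() if kw in subject_lower)

# Hypothetical weights for a linux-media style feed
weights = {"libcamera": 3, "csi": 2, "imx": 2, "v4l2": 3}
print(relevance_score("[PATCH] media: i2c: imx219: add V4L2 controls", weights))  # → 5
```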
Per-minute sampling via pp_dpm_sclk:
| State | Clock | Temp | Meaning |
|---|---|---|---|
| `generating` | 2000 MHz | ~77°C | Active LLM inference |
| `loaded` | 1000 MHz | ~56°C | Model in VRAM, idle |
| `idle` | 1000 MHz | <50°C | No model loaded |
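The state detection can be sketched in a few lines: `pp_dpm_sclk` lists the available shader-clock states and marks the active one with `*`. The path and exact line format vary by card, and this is a sketch rather than the actual `gpu-monitor.sh` logic:

```python
import re

# Path on this system; may be a different cardN elsewhere
SCLK = "/sys/class/drm/card0/device/pp_dpm_sclk"

def active_sclk_mhz(text: str) -> int:
    """Parse pp_dpm_sclk contents; '*' marks the currently active clock state."""
    m = re.search(r"(\d+)\s*Mhz\s*\*", text, re.IGNORECASE)
    if not m:
        raise ValueError("no active state marked")
    return int(m.group(1))

def classify(mhz: int, model_loaded: bool) -> str:
    """Map clock + model state onto the three dashboard states above."""
    if mhz >= 2000:
        return "generating"
    return "loaded" if model_loaded else "idle"

# Usage: classify(active_sclk_mhz(open(SCLK).read()), model_loaded=True)
```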
| File | Purpose |
|---|---|
| `profile.json` | Public interests — tracked repos, keywords, technologies |
| `profile-private.json` | Career context — target companies, salary expectations (gitignored) |
| `watchlist.json` | Auto-evolving interest tracker |
| `digest-feeds.json` | Mailing list feed URLs (8 feeds from lore.kernel.org) |
| `repo-feeds.json` | Repository API endpoints |
| `sensor-watchlist.json` | CSI camera sensor tracking list |
| `queue-runner-state.json` | Cycle count, resume index (in `data/`) |
| `/opt/netscan/data/jobs.json` | All 330 job definitions |
| Mechanism | Details |
|---|---|
| Systemd watchdog | WatchdogSec=14400 (4h) — queue-runner pings every 30s during job execution |
| Crash recovery | State file records nightly batch progress; on restart, resumes from last completed job |
| Midnight crossing | Resume index valid for both today and yesterday's date (batch starts 23:00 day N, may crash after midnight day N+1) |
| Atomic state writes | Write to .tmp file, fsync(), then rename() — survives SIGABRT/power loss |
| Ollama health checks | Pre-flight check before each job; exponential backoff wait if unhealthy |
| Network down | Detects network loss, waits with backoff up to 10min |
| GPU deadlock protection | If GPU busy for > 60min continuously, breaks and moves on |
| OOM protection | Ollama OOMScoreAdjust=-1000, 16 GB NVMe swap, zram limited to 2 GB |
| Signal delivery | --best-effort-deliver flag — delivery failures don't mark job as failed |
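The atomic-state-write pattern from the table (write to a temp file, fsync, then rename over the target) looks like this in Python — a generic sketch, not the literal `queue-runner.py` code:

```python
import json
import os
import tempfile

def atomic_write_json(path: str, state: dict) -> None:
    """Crash-safe JSON write: tmp file + fsync + rename.

    rename() is atomic on POSIX filesystems, so a crash or power loss
    leaves either the complete old file or the complete new one —
    never a torn, half-written state file.
    """
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirname, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())   # force data to disk before the rename
        os.replace(tmp, path)      # atomic on POSIX
    except BaseException:
        os.unlink(tmp)
        raise
```

Writing the temp file in the same directory as the target matters: `rename()` is only atomic within one filesystem.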
Automated career opportunity scanner with a two-phase anti-hallucination architecture.
```
HTML page
  +-> Phase 1: extract jobs (NO candidate profile) -> raw job list
                                                          |
  Candidate Profile + single job -------------------------+
  +-> Phase 2: score match (repeated per job)
       +-> aggregate -> JSON + Signal alerts
```
Phase 1 extracts jobs from raw HTML without seeing the candidate profile — prevents the LLM from inventing matching jobs. Phase 2 scores each job individually against the profile.
| Category | Score | Alert? |
|---|---|---|
| ⚡ Hot match | ≥70% | ✅ (up to 5/scan) |
| 🌍 Worth checking | 55–69% + remote | ✅ (up to 2/scan) |
| Good / Weak | <55% | Dashboard only |
Software houses (SII, GlobalLogic, Sysgo…) appear on the dashboard but never trigger alerts.
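The tier table reduces to a small pure function — a sketch of the decision logic with illustrative names, not the real `career-scan.py` internals (the per-scan alert caps of 5 and 2 are omitted here):

```python
def categorize(score: int, remote: bool, company: str,
               blocklist: set[str]) -> tuple[str, bool]:
    """Map a match score to (category, alert?) per the tier table.

    Blocklisted software houses never alert, regardless of score.
    """
    if score >= 70:
        cat, alert = "hot", True
    elif score >= 55 and remote:
        cat, alert = "worth-checking", True
    elif score >= 55:
        cat, alert = "good", False        # 55-69 on-site: dashboard only
    else:
        cat, alert = "weak", False
    if company in blocklist:
        alert = False
    return cat, alert
```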
Nightly at 01:30. Sources: career-scan extraction, NoFluffJobs API, JustJoinIT, Bulldogjob. Tracks embedded Linux / camera driver compensation in Poland. 180-day rolling history.
Nightly at 01:50. Deep-dives into 43 tracked companies across 8 sources: GoWork.pl reviews, DuckDuckGo news, Layoffs.fyi, company pages, 4programmers.net, Reddit, SemiWiki, Hacker News. LLM-scored sentiment (-5 to +5) with cross-company synthesis.
GoWork.pl: the new Next.js SPA breaks scrapers. The scanner uses the old `/opinie_czytaj,{entity_id}` URLs, which are still server-rendered.
Nightly at 02:10. Monitors 6 search queries (MIPI CSI, IR/RGB dual camera, ISP pipeline, automotive ADAS, sensor fusion, V4L2/libcamera) across Google Patents, EPO OPS, and DuckDuckGo. Scored by relevance keywords × watched assignee bonus.
Nightly at 02:30. Discovers tech events with geographic scoring (local 10, nearby 8, Poland 5, Europe 3, Online 9). Sources: Crossweb.pl, Konfeo, Meetup, Eventbrite, DDG, 14 known conference sites.
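The geographic weighting combines with topical relevance roughly like this — a sketch where the multiplication and the `scope` labels are assumptions; only the weights come from the scoring description above:

```python
# Geographic weights from event-scout's scoring tiers
GEO_SCORE = {"local": 10, "online": 9, "nearby": 8, "poland": 5, "europe": 3}

def event_score(scope: str, relevance: float) -> float:
    """Final rank = geographic weight x topical relevance (illustrative formula).

    Unknown scopes score 0 and drop out of the ranking.
    """
    return GEO_SCORE.get(scope, 0) * relevance
```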
▸ Full tree
```
bc250/
├── README.md                      ← you are here
├── netscan/                       → /opt/netscan/
│   ├── queue-runner.py            # v7 — continuous loop + Signal chat (330 jobs)
│   ├── career-scan.py             # Two-phase career scanner
│   ├── career-think.py            # Per-company career analysis
│   ├── salary-tracker.py          # Salary intelligence
│   ├── company-intel.py           # Company deep-dive
│   ├── company-think.py           # Per-entity company analysis
│   ├── patent-watch.py            # Patent monitor
│   ├── event-scout.py             # Event tracker
│   ├── city-watch.py              # SkyscraperCity local construction monitor
│   ├── leak-monitor.py            # CTI: 11 OSINT sources + Ahmia dark web
│   ├── ha-journal.py              # Home Assistant journal
│   ├── ha-correlate.py            # HA cross-sensor correlation
│   ├── ha-observe.py              # Quick HA queries
│   ├── csi-sensor-watch.py        # CSI camera sensor patent/news
│   ├── csi-think.py               # CSI camera domain analysis
│   ├── radio-scan.py              # Radio hobbyist forum tracker
│   ├── market-think.py            # Market sector analysis
│   ├── life-think.py              # Cross-domain life advisor
│   ├── system-think.py            # GPU/security/health system intelligence
│   ├── career-digest.py           # Weekly career digest → Signal (Sunday)
│   ├── daily-summary.py           # End-of-cycle Signal summary
│   ├── frost-guard.py             # Frost/freeze risk alerter
│   ├── repo-think.py              # LLM analysis of repo changes
│   ├── academic-watch.py          # Academic publication monitor
│   ├── news-watch.py              # Tech news aggregation + RSS feeds
│   ├── book-watch.py              # Book/publication tracker
│   ├── weather-watch.py           # Weather forecast + HA sensor correlation
│   ├── car-tracker.py             # GPS car tracker (SinoTrack API, trip/stop detection)
│   ├── bc250-extended-health.py   # System health assessment (services, data freshness, LLM quality)
│   ├── llm_sanitize.py            # LLM output sanitizer (thinking tags, JSON repair)
│   ├── generate-html.py           # Dashboard builder (6900+ lines, 29 main + 101 host pages)
│   ├── gpu-monitor.py             # GPU data collector
│   ├── idle-think.sh              # Research brain (8 task types)
│   ├── repo-watch.sh              # Upstream repo monitor
│   ├── lore-digest.sh             # Mailing list digests (8 feeds)
│   ├── bc250-health-check.sh      # Quick health check (systemd timer, triggers extended health)
│   ├── gpu-monitor.sh             # Per-minute GPU sampler
│   ├── scan.sh / enumerate.sh     # Network scanning
│   ├── vulnscan.sh                # Weekly vulnerability scan
│   ├── presence.sh                # Phone presence detection
│   ├── syslog.sh                  # System health logger
│   ├── watchdog.py                # Network security checker
│   ├── report.sh                  # Morning report rebuild
│   ├── profile.json               # Public interests + Signal config
│   ├── profile-private.json       # Career context (gitignored)
│   ├── watchlist.json             # Auto-evolving interest tracker
│   ├── digest-feeds.json          # Feed URLs (8 mailing lists)
│   ├── repo-feeds.json            # Repository endpoints
│   └── sensor-watchlist.json      # CSI sensor tracking list
├── systemd/
│   ├── queue-runner.service           # v7 — continuous loop + Signal chat
│   ├── queue-runner-nightly.service   # Nightly batch trigger
│   ├── queue-runner-nightly.timer
│   ├── signal-cli.service             # Standalone JSON-RPC daemon
│   ├── bc250-health.service           # Health check timer
│   ├── bc250-health.timer
│   ├── ollama.service
│   ├── ollama-watchdog.service        # Ollama restart watchdog
│   ├── ollama-watchdog.timer
│   ├── ollama-proxy.service           # LAN proxy for Ollama API
│   └── ollama.service.d/
│       └── override.conf              # Vulkan + memory settings
├── scripts/
│   └── ollama-proxy.py            # Reverse proxy (injects think:false for qwen3)
├── generate-and-send.sh           → /opt/stable-diffusion.cpp/ (legacy EXEC pattern, intercepted by queue-runner)
└── generate-and-send-worker.sh    → legacy async worker (unused in v7, kept for EXEC pattern match)
```
| Local | → bc250 |
|---|---|
| `netscan/*` | `/opt/netscan/` |
| `systemd/queue-runner.service` | `/etc/systemd/system/queue-runner.service` |
| `systemd/signal-cli.service` | `/etc/systemd/system/signal-cli.service` |
| `systemd/ollama.*` | `/etc/systemd/system/ollama.*` |
| `generate-and-send*.sh` | `/opt/stable-diffusion.cpp/` |
```sh
# Typical deploy workflow
scp netscan/queue-runner.py bc250:/tmp/
ssh bc250 'sudo cp /tmp/queue-runner.py /opt/netscan/ && sudo systemctl restart queue-runner'
```

▸ ROCm crashes in Ollama logs
Expected — Ollama tries ROCm, it crashes on GFX1013, falls back to Vulkan. No action needed.
▸ Only 7.9 GiB GPU memory instead of 14 GiB
GTT tuning not applied. Check: cat /sys/module/ttm/parameters/pages_limit (should be 4194304). See §3.3.
▸ 14B model loads but inference returns HTTP 500
TTM pages_limit bottleneck. Fix: echo 4194304 | sudo tee /sys/module/ttm/parameters/pages_limit (see §3.3).
▸ Model loads on CPU instead of GPU
Check OLLAMA_VULKAN=1: sudo systemctl show ollama | grep Environment
▸ Context window OOM kills (the biggest gotcha on 16 GB)
Ollama allocates KV cache based on num_ctx. Many models default to 32K–40K context, which on a 14B Q4_K model means 14–16 GB just for the model — leaving nothing for the OS.
Symptoms: Gateway gets OOM-killed, Ollama journal shows 500 errors, dmesg shows oom-kill.
Root cause: The abliterated Qwen3 14B declares num_ctx 40960 → 16 GB total model memory.
Fix: Create a custom model with context baked in:
```sh
cat > /tmp/Modelfile.16k << 'EOF'
FROM huihui_ai/qwen3-abliterated:14b
PARAMETER num_ctx 16384
EOF
ollama create qwen3-14b-16k -f /tmp/Modelfile.16k
```

This drops memory from ~16 GB to ~11.1 GB. Do not rely on `OLLAMA_CONTEXT_LENGTH` — it doesn't reliably override API requests from the gateway.
▸ signal-cli not responding on port 8080
Check the service: systemctl status signal-cli. If it crashed, restart: sudo systemctl restart signal-cli. Verify JSON-RPC:
```sh
curl -s http://127.0.0.1:8080/api/v1/rpc \
  -d '{"jsonrpc":"2.0","method":"listAccounts","id":"1"}'
```

▸ zram competing with model for physical RAM
Fedora defaults to ~8 GB zram. zram compresses pages but stores them in physical RAM — directly competing with the model. On 16 GB systems running 14B models, disable or limit zram and use NVMe file swap instead:
```sh
sudo mkdir -p /etc/systemd/zram-generator.conf.d
echo -e '[zram0]\nzram-size = 2048' | sudo tee /etc/systemd/zram-generator.conf.d/small.conf
```

▸ Python cron scripts produce no output
Stdout is fully buffered under cron (no TTY). Add at script start:
```python
import sys

# Cron gives scripts no TTY, so stdout is block-buffered; force line buffering
sys.stdout.reconfigure(line_buffering=True)
sys.stderr.reconfigure(line_buffering=True)
```

▸ Signal delivery from signal-cli
Signal JSON-RPC API at http://127.0.0.1:8080/api/v1/rpc:
```sh
curl -X POST http://127.0.0.1:8080/api/v1/rpc \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","method":"send","params":{
        "account":"+<BOT>","recipient":["+<YOU>"],
        "message":"test"
      },"id":"1"}'
```

| Issue | Impact |
|---|---|
| Shared VRAM | Image gen requires stopping Ollama. Bot offline ~2–3 min (FLUX.2-klein-9B) or ~1 min (FLUX.2-klein-4B). |
| MoE context limit | 35B-A3B MoE tops out at 16K context (weights = 10.3 GiB, KV fills rest). Use 9B for >16K. |
| Signal latency | Messages queue during job execution (typical job 2–15 min). Chat checked between every job. |
| sd-cli hangs on GFX1013 | Vulkan cleanup bug → poll + kill workaround. |
| Cold start latency | 30–60s after Ollama restart (model loading). |
| Chinese thinking leak | Qwen3 occasionally outputs Chinese reasoning. Cosmetic. |
| Prefill rate degrades with context | 128 tok/s at 1.3K → 70 tok/s at 10K tokens (UMA bandwidth + attention scaling). |
| Gen speed degrades with context fill | 27 tok/s empty → 13 tok/s at 30K tokens. Partial model offload at KV limit causes cliff drop. |
| Ollama caps KV auto-size at ~40K (Q4_0) | num_ctx > 40960 accepted but silently truncated. Actual limit = VRAM ÷ per-token KV size. |
| Speculative decoding blocked | Ollama 0.18 has no --draft-model. Dual-model loading evicts the draft model. |
| TTS not feasible | CPU-based TTS (Piper, Coqui) competes with GPU for the same 16 GB UMA pool. No Vulkan TTS exists. |
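The "actual limit = VRAM ÷ per-token KV size" rule can be made concrete. For a grouped-query-attention model, the per-token KV cache is 2 × n_layers × n_kv_heads × head_dim × bytes_per_element (the 2 covers K and V). The dimensions below are illustrative placeholders, not measured Qwen3 values:

```python
def max_context(free_vram_bytes: int, n_layers: int, n_kv_heads: int,
                head_dim: int, bytes_per_elem: float) -> int:
    """How many tokens of KV cache fit in the given VRAM budget.

    Per-token KV = 2 (K and V) x layers x KV heads x head dim x element size.
    """
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return int(free_vram_bytes // per_token)

# Illustrative GQA dims (NOT exact Qwen3 numbers): 40 layers, 8 KV heads,
# head_dim 128, ~1 byte/element for a quantized KV cache
print(max_context(4 * 1024**3, 40, 8, 128, 1.0))  # → 52428
```

Plugging in real model dimensions and the post-weights VRAM budget reproduces ceilings in the ~40K range observed above.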
Pinned versions as of March 2026. All components built/installed on Fedora 43.
| Component | Version | Notes |
|---|---|---|
| OS | Fedora 43, kernel 6.18.9 | Headless, performance governor |
| Ollama | 0.18.0 | Vulkan backend, OLLAMA_FLASH_ATTENTION=1 |
| Mesa / RADV | 25.3.4 | Vulkan 1.4.328, RADV GFX1013 |
| stable-diffusion.cpp | `master-525 (d6dd6d7)` | Built with `-DSD_VULKAN=ON` |
| whisper.cpp | `v1.8.3-198 (30c5194c)` | Built with Vulkan, large-v3-turbo model |
| signal-cli | 0.13.24 | Native binary, JSON-RPC at :8080 |
| Qwen3.5-35B-A3B | IQ2_M (GGUF, 10.6 GB) | Primary MoE model, via unsloth |
| Qwen3.5:9b | Q4_K_M (GGUF, 6.1 GB) | Vision + long context model |
| FLUX.2-klein-9B | Q4_0 (GGUF, 5.3 GB) | Image generation, via leejet |
| ggml-large-v3-turbo | 1.6 GB | Whisper model for audio transcription |
| ESRGAN | RealESRGAN_x4plus (64 MB) | 4× image upscaling |
| Python | 3.13 | queue-runner, netscan scripts |
| Resource | URL |
|---|---|
| AMD BC-250 community docs (BIOS, setup) | https://elektricm.github.io/amd-bc250-docs/ |
| LLVM AMDGPU processor table (GFX1013) | https://llvm.org/docs/AMDGPUUsage.html#processors |
| Mesa RADV Vulkan driver | https://docs.mesa3d.org/drivers/radv.html |
| Linux TTM memory manager | https://www.kernel.org/doc/html/latest/gpu/drm-mm.html |
| Resource | URL |
|---|---|
| Ollama — local LLM runtime | https://github.com/ollama/ollama |
| Qwen3.5 model family (Alibaba) | https://huggingface.co/Qwen |
| Qwen3.5-35B-A3B GGUF (unsloth) | https://huggingface.co/unsloth/Qwen3.5-35B-A3B-GGUF |
| Qwen3.5-9B (Ollama) | https://ollama.com/library/qwen3.5:9b |
| GGUF quantization format | https://github.com/ggerganov/llama.cpp/blob/master/docs/gguf.md |
| Resource | URL |
|---|---|
| stable-diffusion.cpp (Vulkan) | https://github.com/leejet/stable-diffusion.cpp |
| FLUX.2-klein-9B GGUF | https://huggingface.co/leejet/FLUX.2-klein-9B-GGUF |
| FLUX.2-klein-4B GGUF | https://huggingface.co/leejet/FLUX.2-klein-4B-GGUF |
| FLUX.1-Kontext-dev (image editing) | https://huggingface.co/black-forest-labs/FLUX.1-Kontext-dev |
| Chroma (flash distilled) | https://huggingface.co/leejet/Chroma-GGUF |
| WAN 2.1 T2V (video generation) | https://huggingface.co/Wan-AI |
| Real-ESRGAN (image upscaling) | https://github.com/xinntao/Real-ESRGAN |
| Resource | URL |
|---|---|
| whisper.cpp (Vulkan STT) | https://github.com/ggerganov/whisper.cpp |
| Whisper large-v3-turbo model | https://huggingface.co/ggerganov/whisper-large-v3-turbo |
| Resource | URL |
|---|---|
| signal-cli (Signal messenger CLI) | https://github.com/AsamK/signal-cli |
| Signal Protocol | https://signal.org/docs/ |
▸ Historical: OpenClaw gateway configuration (replaced in v7)
OpenClaw v2026.2.26 was used as the Signal ↔ Ollama gateway from project inception through queue-runner v6. It was a Node.js daemon that managed signal-cli as a child process, routed messages to the LLM, and provided an agent framework with tool dispatch.
Why it was replaced:
- ~700 MB RSS on a 16 GB system (4.4% of total RAM)
- 15+ second overhead per agent turn (system prompt injection, tool resolution)
- Unreliable fallback chains caused "fetch failed" timeout cascades
- Could not run scripts as direct subprocesses — everything went through the LLM agent
- signal-cli children survived gateway OOM kills, holding port 8080 as orphans
- 9.6K system prompt that couldn't be reduced below ~4K without breaking tools
What replaced it: See §5 for the current architecture.
```sh
sudo dnf install -y nodejs npm
sudo npm install -g openclaw@latest
openclaw onboard \
  --non-interactive --accept-risk --auth-choice skip \
  --install-daemon --skip-channels --skip-skills --skip-ui --skip-health \
  --daemon-runtime node --gateway-bind loopback
```

`~/.openclaw/openclaw.json`:

```json
{
  "models": {
    "providers": {
      "ollama": {
        "baseUrl": "http://127.0.0.1:11434",
        "apiKey": "ollama-local",
        "api": "ollama",
        "models": [{
          "id": "qwen3-14b-16k",
          "name": "Qwen 3 14B (16K ctx)",
          "contextWindow": 16384,
          "maxTokens": 8192,
          "reasoning": true
        }]
      }
    }
  },
  "agents": {
    "defaults": {
      "model": {
        "primary": "ollama/qwen3-14b-16k",
        "fallbacks": ["ollama/qwen3-14b-abl-nothink:latest", "ollama/mistral-nemo:12b"]
      },
      "thinkingDefault": "high",
      "timeoutSeconds": 1800
    }
  }
}
```

```json
{
  "tools": {
    "profile": "coding",
    "alsoAllow": ["message", "group:messaging"],
    "deny": ["browser", "canvas", "nodes", "cron", "gateway"]
  },
  "skills": { "allowBundled": [] }
}
```

Personality lived in workspace markdown files (`~/.openclaw/workspace/`):
| File | Purpose | Size |
|---|---|---|
| `SOUL.md` | Core personality | 1.0 KB |
| `IDENTITY.md` | Name/emoji | 550 B |
| `USER.md` | Human info | 1.7 KB |
| `TOOLS.md` | Tool commands | 2.1 KB |
| `AGENTS.md` | Grounding rules | 1.4 KB |
| `WORKFLOW_AUTO.md` | Cron bypass rules | 730 B |
```json
{
  "channels": {
    "signal": {
      "enabled": true,
      "account": "+<BOT_PHONE>",
      "cliPath": "/usr/local/bin/signal-cli",
      "dmPolicy": "pairing",
      "allowFrom": ["+<YOUR_PHONE>"],
      "sendReadReceipts": true,
      "textChunkLimit": 4000
    }
  }
}
```

```sh
systemctl --user status openclaw-gateway   # status
openclaw logs --follow                     # live logs
openclaw doctor                            # diagnostics
openclaw channels status --probe           # signal health
```

The gateway service (`openclaw-gateway.service`) ran as a user-level systemd unit. It has been disabled and masked:

```sh
systemctl --user disable --now openclaw-gateway
systemctl --user mask openclaw-gateway
```

Artur Andrzejczak · [email protected] · March 2026
Development assisted by Claude Opus 4.6.
Code: AGPL-3.0 · Docs: CC BY-SA 4.0






















