65-87 t/s local LLM inference on a $3,299 mini PC. Within 5% of the $4,699 DGX Spark. No cloud, no subscription.
If this guide saves you time, consider giving it a star -- it helps others find it.
You are here What you'll get
+-----------+ +---------------------------+
| Strix | 30 min | 87 t/s on 30B MoE models |
| Halo | ==============> | 65 t/s on 35B models |
| mini PC | this guide | 70B+ models on one device |
+-----------+ | No cloud. No subscription |
+---------------------------+
One-Command Setup | Quick Start | Benchmarks | Which Model? | Which Backend? | What NOT To Do | Glossary
A complete guide for running local LLMs on AMD Ryzen AI MAX+ 395 (Strix Halo) with llama.cpp, Ollama, Vulkan, and ROCm. Several Strix Halo guides exist. This one is different:
- Every number is measured on this machine. No theoretical estimates, no copy-pasted specs. Every benchmark was run on a Beelink GTR9 Pro with timestamps.
- We document what does NOT work. Most guides only tell you what to enable. We tested optimizations that turned out to be regressions, driver versions that crash, and parameters that do nothing. That info is harder to find and more valuable.
- We track the moving target. Strix Halo support changes rapidly. This guide is updated with each change, noting what broke and what improved.
- We compare backends with data. Vulkan (RADV vs AMDVLK) vs ROCm HIP vs vLLM -- each has strengths. We measured them all.
- We explain everything. New to local LLMs? See the Glossary. Not sure which model to pick? See the Model Guide.
Built on findings from: kyuz0/amd-strix-halo-toolboxes (1.2k stars, community standard), lhl/strix-halo-testing (deepest research), and our own extensive testing.
If you've already set your BIOS (UMA = 512MB, IOMMU = off) and installed Ubuntu 24.04:
curl -fsSL https://raw.githubusercontent.com/hogeheer499-commits/strix-halo-guide/main/setup.sh | bash
This installs everything, configures Ollama with Vulkan, pulls a model, and runs a benchmark. Takes ~10 minutes (plus model download time). For manual step-by-step setup, see Quick Start.
- Hardware
- What You Can Run
- Benchmark Results
- Backend Decision Guide
- Quick Start (6 Steps)
- Phase 1: BIOS Configuration
- Phase 2: Ubuntu 24.04 Installation
- Phase 3: Kernel Configuration
- Phase 4: Performance Tuning
- Phase 5: Ollama Setup (Vulkan)
- Phase 6: Benchmarking
- Phase 7: ROCm with llama.cpp (Containers)
- Phase 8: vLLM Serving
- Phase 9: Multi-Node Clustering (RDMA)
- Phase 10: SSH and Remote Access
- Vulkan Driver Comparison
- Key Findings and Corrections
- Known Issues
- Troubleshooting
- Kernel and ROCm Compatibility
- Testing Checklist
- Model Recommendation Guide
- Cost: Local vs Cloud
- Buying Guide
- Glossary
- FAQ
- Community Resources
- Credits and References
- Contributing
- Changelog
- License
| System | CPU | GPU | RAM | Notes |
|---|---|---|---|---|
| Beelink GTR9 Pro | Ryzen AI MAX+ 395 | Radeon 8060S (40 CU) | 128GB LPDDR5X-8000 | This guide's primary test system |
| Framework Desktop 13 | Ryzen AI MAX+ 395 | Radeon 8060S (40 CU) | 128GB LPDDR5X-8000 | Used by kyuz0, lhl |
| GMKtec EVO-X2 | Ryzen AI MAX+ 395 | Radeon 8060S (40 CU) | 128GB LPDDR5X-8000 | pablo-ross guide |
| HP ZBook Ultra G1a | Ryzen AI MAX+ 395 | Radeon 8060S (40 CU) | 128GB LPDDR5X-8000 | Workstation laptop |
| Component | Spec |
|---|---|
| CPU | AMD Ryzen AI MAX+ 395 (16 cores / 32 threads, Zen 5) |
| GPU | Radeon 8060S (gfx1151, RDNA 3.5, 40 CUs) |
| RAM | 128GB unified LPDDR5X-8000 (~215 GB/s measured, 256 GB/s theoretical) |
| NPU | RyzenAI-npu5 (XDNA 2) |
Why this hardware? 128GB unified memory shared between CPU and GPU means you can run 70B+ models entirely on the GPU -- something an RTX 4090 (24GB VRAM) cannot do. You trade raw bandwidth (~215 GB/s vs ~1 TB/s) for the ability to run much larger, smarter models at a lower price ($3,299 vs $4,699 for the DGX Spark).
Real-world generation speeds measured on the Beelink GTR9 Pro (Vulkan RADV). Speeds marked with * are via llama-bench direct; others are via Ollama.
| Model | Size | Type | Generation Speed | Use Case |
|---|---|---|---|---|
| Qwen3-0.6B (Q8_0) | 0.8 GB | Dense | 266 t/s * | Ultra-fast tiny model |
| Llama 2 7B | 3.8 GB | Dense | 48-52 t/s | Testing, lightweight tasks |
| Qwen2.5-VL 7B | 6.0 GB | Vision | 21.4 t/s | Image understanding |
| Gemma 4 26B-A4B (UD-Q4_K_M) | 15.7 GB | MoE | 48.5 t/s * | Google's latest MoE, strong reasoning |
| Qwen3-Coder 30B-A3B (UD-Q4_K_XL) | 16.5 GB | MoE | 87 t/s * | Best speed/quality ratio |
| Qwen3.6 35B-A3B (Q4_K_M) | 20 GB | MoE | 64 t/s * | Best all-rounder, drop-in upgrade from 3.5 |
| Qwen3.5 35B-A3B | 23 GB | MoE | 48-65 t/s | General purpose, coding (65 with latest llama.cpp) |
| Qwen3-Coder 30B-A3B (Q8_0) | 32 GB | MoE | 51 t/s | Coding (highest quality MoE) |
| Qwen3-Coder-Next | 51 GB | Dense | 38-39 t/s | Large dense model |
| Llama 3.1 70B (Q4_K_M) | 42 GB | Dense | 4.7-4.9 t/s | 70B intelligence, doesn't fit on RTX 4090 |
| Llama 4 Scout 109B (Q4_K_M) | 61 GB | MoE | 18.3 t/s * | 109B params on a mini PC -- RTX 4090 can't even load this |
| gpt-oss-120b | ~70 GB | MoE | ~34-38 t/s | Largest practical model |
| Qwen3-Next 80B-A3B (UD-Q4_K_XL) | 42.9 GB | MoE | 55 t/s * | 80B model, 256K context -- faster than dense 51B |
| Kimi K2.5 1T (4-node cluster) | ~500 GB | MoE | distributed | AMD technical article |
All benchmarks run on 2026-03-20 and 2026-03-21. System: Beelink GTR9 Pro, kernel 6.19.4, tuned accelerator-performance active.
Qwen3.6-35B-A3B (Q4_K_M, ~20GB, MoE -- Ollama 0.21.2):
| Prompt Tokens | Prompt Eval | Generation | Notes |
|---|---|---|---|
| 20 | 163 t/s | 45.6 t/s | ~30% slower than llama-bench direct (64 t/s) |
| 22 | 174 t/s | 45.4 t/s | Ollama overhead is real but acceptable |
Qwen3.5-35B-A3B (Q4_K_M, ~23GB, MoE -- Ollama 0.20.4):
| Prompt Tokens | Prompt Eval | Generation | vs Previous (Mesa 26.0.1) |
|---|---|---|---|
| 14 | 121.3 t/s | 48.0 t/s | tg +4.8% |
| 23 | 182.3 t/s | 47.5 t/s | tg +4.4% |
| 122 | 456.7 t/s | 47.4 t/s | tg +4.2% |
Qwen3-Coder 30B-A3B (Q8_0, ~32GB, MoE):
| Prompt Tokens | Prompt Eval | Generation | Notes |
|---|---|---|---|
| 12 | 118.3 t/s | 51.4 t/s | Fastest via Ollama |
| 21 | 205.2 t/s | 51.3 t/s | Higher quality than Q4_K_M |
Qwen3-Coder-Next (~51GB, dense):
| Prompt Tokens | Prompt Eval | Generation | vs Previous |
|---|---|---|---|
| 12 | 90.7 t/s | 39.1 t/s | tg +2.9% |
| 21 | 129.5 t/s | 38.4 t/s | tg +3.8% |
| 120 | 301.2 t/s | 37.9 t/s | NEW |
Other Models:
| Model | Size | Prompt Tokens | pp (t/s) | tg (t/s) |
|---|---|---|---|---|
| Llama 2 7B | 3.8 GB | 24 | 384.6 | 52.0 |
| Qwen2.5-VL 7B | 6.0 GB | 23 | 81.7 | 21.4 |
| Qwen3.5 35B (no-think) | 23 GB | 14 | 127.1 | 47.4 |
Llama 3.1 70B (Q4_K_M, 42GB, Dense -- the "doesn't fit on RTX 4090" showcase):
| Prompt Tokens | Prompt Eval | Generation | Notes |
|---|---|---|---|
| 14 | 22.1 t/s | 4.9 t/s | Cold start |
| 23 | 36.8 t/s | 4.8 t/s | Realistic chat |
| 122 | 79.6 t/s | 4.7 t/s | Long prompt |
Why so slow? This is a 42GB dense model -- every token reads all 42GB of weights. At ~215 GB/s bandwidth, the theoretical maximum is 215/42 = 5.1 t/s. We hit 4.8 t/s = 94% of the theoretical ceiling. The model is slow not because of poor optimization, but because it's massive. An RTX 4090 (24GB VRAM) cannot run this model at all. This is the Strix Halo advantage: running models that don't fit on consumer GPUs.
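The same ceiling arithmetic works for any dense model. A quick sketch (bandwidth and weight size are the only inputs; substitute your own measurements):

```bash
# Theoretical tg ceiling = memory bandwidth / bytes of weights read per token
python3 -c "
bw_gbs, model_gb = 215, 42   # measured bandwidth (GB/s), weight size on disk (GB)
print(f'ceiling: {bw_gbs/model_gb:.1f} t/s')   # -> 5.1 t/s for Llama 3.1 70B Q4_K_M
"
```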
What improved? Mesa 26.0.1 to 26.0.2 plus enabling the tuned accelerator-performance profile gave a consistent +4-5% generation speed improvement across all models.
UPDATE (2026-03-21): Updating llama.cpp from b8298 to b8460 gave +25% on both pp and tg for MoE models. The new build includes a Vulkan Flash Attention refactor (PR #19625), graphics queue optimization for AMD (PR #20551), and GDN shader support for Qwen3.5 (PR #20334).
Important caveats:
- The +25% improvement is specific to MoE models on Vulkan due to the Wave32 FA refactor and graphics queue change. Dense models (Llama 2 7B, Llama 3.1 70B) showed minimal change (<2%) because they were already at the memory bandwidth ceiling.
- If you use kyuz0's containers, you get these updates automatically -- the containers rebuild on every llama.cpp master update. kyuz0's toolboxes remain the easiest way to stay current. Our finding here validates the importance of their approach.
- WARNING: AMDVLK silently overrides RADV. If AMDVLK is installed, its /etc/vulkan/icd.d/amd_icd64.json takes priority over RADV. This halves your pp speed (1080 → 660 pp512) without any visible error. Always set AMD_VULKAN_ICD=RADV or uninstall AMDVLK entirely: sudo dpkg -r amdvlk && sudo rm -f /etc/vulkan/icd.d/amd_icd64.json. Check your driver: RADV shows (RADV STRIX_HALO) (radv) with shared memory: 65536 in llama-bench output. AMDVLK shows (AMD open-source driver) with shared memory: 32768. We originally reported this as a llama.cpp regression -- it wasn't.
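A quick way to check for a stray AMDVLK ICD without running a benchmark (a sketch; vulkaninfo wording varies by Mesa version):

```bash
# If amd_icd64.json is listed here, AMDVLK can hijack Vulkan sessions
ls /etc/vulkan/icd.d/ 2>/dev/null
# Confirm which driver Vulkan actually reports
vulkaninfo --summary 2>/dev/null | grep -E "driverName|driverInfo"   # expect radv / Mesa
```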
Qwen3.5-35B-A3B (Q4_K_M, 19.9GB, MoE) -- the biggest improvement:
| Build | Driver | pp128 | pp512 | tg128 | vs old RADV |
|---|---|---|---|---|---|
| b8460 (latest) | RADV | 623 | 1080 | 64.85 | pp +24%, tg +25% |
| b8460 (latest) | AMDVLK | 521 | 663 | 64.10 | pp -24%, tg +23% |
| b8298 (kyuz0) | RADV | 583 | 868 | 52.06 | baseline |
| b8298 (kyuz0) | AMDVLK | 479 | 576 | 56.08 | - |
RADV now wins on EVERYTHING. The old AMDVLK tg advantage (+7.7%) is gone. With the latest build, RADV is faster on both pp (+63% over AMDVLK) and tg (+1.2% over AMDVLK). Use RADV. AMDVLK is discontinued -- uninstall it to avoid silent ICD hijacking.
Extended context scaling (latest build, RADV):
| pp512 | pp2048 | pp4096 | pp8192 | Drop at 8K |
|---|---|---|---|---|
| 1080 | 1057 | 1049 | 1049 | -3% |
pp is virtually flat from 512 to 8192 tokens. Only 3% drop at 8K context.
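To reproduce this sweep on your own model, llama-bench accepts comma-separated prompt lengths (same flags as elsewhere in this guide; -n 0 skips generation to isolate pp):

```bash
AMD_VULKAN_ICD=RADV ./build/bin/llama-bench \
  -m ~/models/your-model.gguf \
  -fa 1 -ngl 999 -mmp 0 \
  -p 512,2048,4096,8192 -n 0
```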
Qwen3-Coder 30B-A3B (UD-Q4_K_XL, 16.5GB, MoE):
| Build | Driver | pp512 | tg128 | Notes |
|---|---|---|---|---|
| b8460 (latest) | RADV | 1342 | 87.11 | Already at bandwidth ceiling |
| b8298 (kyuz0) | RADV | 1350 | 86.81 | ~same (model was already at ceiling) |
The 30B model shows minimal improvement because it was already hitting the memory bandwidth ceiling at 87 t/s. The 35B model had more headroom, which the new build exploited.
Gemma 4 26B-A4B (UD-Q4_K_M, 15.7GB, MoE) -- tested on b8933 (earliest build with Gemma 4 support):
| Build | Driver | pp512 | tg128 | Notes |
|---|---|---|---|---|
| b8933 | RADV | 1142 | 48.46 | Google's latest MoE |
Gemma 4 is architecturally slower on tg than Qwen MoE models despite similar size. The reason: head_dim 256/512 (vs Qwen's 128) makes flash attention less efficient, mixed sliding-window/full attention adds overhead, and 3.8B active params vs Qwen's 3.3B. This is not a llama.cpp issue -- it's inherent to the model design. 48.5 t/s is still 3x human reading speed and very usable for interactive chat.
WARNING: Gemma 4 is extremely sensitive to KV cache quantization. Using q8_0 KV cache causes 3.5x worse quality degradation compared to Qwen models. Stick with f16 KV cache for Gemma 4. Do NOT use --cache-type-k q4_0.
Llama 4 Scout 109B (Q4_K_M, 60.9GB, MoE -- 109B total params, 17B active):
| Build | Driver | pp512 | tg128 | Notes |
|---|---|---|---|---|
| b8933 | RADV | 331 | 18.32 | 109B model running on a mini PC |
A 109 billion parameter model running at 18.3 t/s on a $3,299 mini PC. An RTX 4090 (24GB VRAM) cannot even load this model. The speed is bandwidth-limited at 17B active parameters -- theoretical max is ~25 t/s at 215 GB/s, we hit 73% of that ceiling.
Qwen3-Next 80B-A3B (UD-Q4_K_XL, 42.9GB, MoE -- 80B total params, 3B active, 256K context):
| Build | Driver | pp512 | tg128 | Notes |
|---|---|---|---|---|
| b8933 | RADV | 657 | 54.92 | 80B model at 55 t/s |
80 billion parameters running at 55 t/s on a mini PC. This is the largest Qwen3-family MoE model -- 80B total with only 3B active parameters and a 256K context window. Despite being 42.9 GB on disk, the MoE routing keeps only 3B params active per token, making it faster than the 51B dense Qwen3-Coder-Next (38 t/s).
Qwen3.6-35B-A3B (Q4_K_M, 19.9GB, MoE -- drop-in upgrade from Qwen3.5, released April 2026):
| Build | Driver | pp512 | tg128 | Notes |
|---|---|---|---|---|
| b8460 | RADV | 1064 | 63.76 | Same speed as Qwen3.5 |
| b8933 | RADV | 1040 | 63.66 | No regression between builds |
Qwen3.6 is a drop-in replacement for Qwen3.5 with significantly improved coding and reasoning quality (same architecture, same active parameters, identical speed). Use Q4_K_M, not UD-Q4_K_M -- Unsloth Dynamic quantization costs 13% tg speed (56.6 vs 64.1 t/s) due to mixed-precision layers, with minimal quality benefit at this quant level.
ROCm HIP -- now working on kernel 6.19.4!
We discovered that HSA_OVERRIDE_GFX_VERSION=11.5.1 + HSA_ENABLE_SDMA=0 fixes the ROCm segfault on kernel 6.19.x. We also rebuilt ROCm with the same b8460 source to make the comparison fair:
| Build | pp128 | pp512 | tg128 | Notes |
|---|---|---|---|---|
| b8460 (latest, kernel 6.19.4) | 547 | 1047 | 54.67 | tg +14% vs b8301 |
| b8301 (self-compiled, kernel 6.19.4) | 542 | 1059 | 47.87 | old build |
| b8301 (self-compiled, kernel 6.18.14) | 488 | 996 | 48.80 | previous best |
ROCm also improved with the latest build: tg went from 47.87 to 54.67 (+14%) thanks to generic llama.cpp optimizations. But Vulkan RADV is still faster on both pp and tg: RADV 1080 vs ROCm 1047 pp512 (+3%), RADV 64.85 vs ROCm 54.67 tg128 (+19%). The +25% Vulkan improvement was ~14% generic (ROCm got this too) plus ~11% Vulkan-specific (FA refactor, graphics queue). ROCm's remaining advantage is hipBLASLt and rocWMMA at very long context (32K+).
Build version matters enormously:
| What we tested | pp512 | tg128 | Lesson |
|---|---|---|---|
| Ollama Vulkan RADV (b8298) | ~457 (via API) | 47.4 | Ollama adds overhead |
| llama-bench RADV (b8298) | 868 | 52.06 | Eliminating Ollama helps |
| llama-bench RADV (b8460) | 1080 | 64.85 | Updating llama.cpp = +25% |
| ROCm HIP (b8301, HSA fix) | 1059 | 47.87 | Old build, unfair comparison |
| ROCm HIP (b8460, HSA fix) | 1047 | 54.67 | ROCm got +14% tg from same update |
The single biggest optimization you can make is updating llama.cpp to the latest build. It gave us more improvement (+25% on MoE models) than all kernel tuning, batch size sweeps, and driver comparisons combined. This is counter-intuitive -- people spend hours on kernel parameters, GRUB flags, and Mesa versions, while git pull && cmake --build delivers more than everything else put together. Note: this applies to MoE models specifically. Dense models were already at the bandwidth ceiling and show <2% change.
Batch size and ubatch tuning results (b8298, for reference):
We swept batch sizes 64-2048 and ubatch sizes 32-1024. Result: default 512 is optimal. No headroom via tuning -- the improvement came from updating the build.
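For reference, the sweep itself is a one-liner -- llama-bench takes comma-separated lists for both flags, so you can verify the "default 512 is optimal" result on your own hardware:

```bash
AMD_VULKAN_ICD=RADV ./build/bin/llama-bench \
  -m ~/models/your-model.gguf \
  -fa 1 -ngl 999 -mmp 0 \
  -b 64,128,256,512,1024,2048 -ub 32,64,128,256,512,1024 \
  -p 512 -n 128
```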
How to build the latest llama.cpp with Vulkan:
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
CC=/usr/bin/gcc CXX=/usr/bin/g++ cmake -B build -S . \
-DGGML_VULKAN=ON \
-DCMAKE_BUILD_TYPE=Release \
-G "Unix Makefiles"
cmake --build build -j$(nproc)
# Benchmark
AMD_VULKAN_ICD=RADV ./build/bin/llama-bench \
-m ~/models/your-model.gguf \
-fa 1 -ngl 999 -mmp 0 -p 512 -n 128
ROCm on kernel 6.19.x (the fix):
# Add these environment variables before running llama-bench:
export HSA_OVERRIDE_GFX_VERSION=11.5.1
export HSA_ENABLE_SDMA=0
export ROCBLAS_USE_HIPBLASLT=1
Llama 2 7B (Q4_K_M, 3.8GB, Dense):
| Driver | pp128 | pp512 | pp1024 | tg128 |
|---|---|---|---|---|
| RADV | 1154 | 1377 | 1356 | 48.12 |
| AMDVLK | 335 | 327 | 325 | 48.02 |
AMDVLK is 3-4X slower on pp for dense models (2 GiB buffer limit). Use RADV.
Qwen3-0.6B (Q8_0, 762MB, Dense) -- maximum throughput:
| Driver | pp128 | pp512 | tg128 |
|---|---|---|---|
| RADV | 10,313 | 13,112 | 266 |
NOTE (March 2026): Kernel 6.19.x misidentifies gfx1151 as gfx1100 for ROCm, but this is fixable with HSA_OVERRIDE_GFX_VERSION=11.5.1 and HSA_ENABLE_SDMA=0. See ROCm on kernel 6.19.x for the full fix. Without these environment variables, ROCm containers will segfault.
Previous results on kernel 6.18.14 (for reference -- these worked):
| Build | Model | pp128 | pp512 | tg128 |
|---|---|---|---|---|
| Self-compiled b8301, FA on, -mmp 0 | Qwen3.5-35B-A3B Q4_K_M | 488 | 996 | 48.8 |
| kyuz0 b8298, FA on | Qwen3.5-35B-A3B Q4_K_M | 306 | 520 | 55.3 |
| kyuz0 b8298, FA off | Qwen3.5-35B-A3B Q4_K_M | 352 | 524 | 53.8 |
| kyuz0 b8189, FA + hipBLASLt | Llama 2 7B Q4_K_M | 1163 | 1261 | 45.07 |
Vulkan llama-bench Direct (kyuz0 containers, b8298) -- March 2026:
| Driver | Model | pp128 | pp256 | pp512 | pp1024 | tg128 |
|---|---|---|---|---|---|---|
| RADV | Qwen3.5-35B-A3B Q4_K_M | 503.67 | - | 858.88 | - | 52.15 |
| AMDVLK | Qwen3.5-35B-A3B Q4_K_M | 477.28 | - | 575.59 | - | 55.54 |
| RADV | Llama 2 7B Q4_K_M | 1153.53 | 1364.45 | 1377.18 | 1355.88 | 48.12 |
| AMDVLK | Llama 2 7B Q4_K_M | 334.50 | 337.96 | 327.35 | 325.33 | 48.02 |
Critical finding (b8298): AMDVLK has a 2 GiB single buffer allocation limit that cripples pp on dense models (3-4X slower on Llama 2 7B). On MoE models, AMDVLK was slightly faster on tg (+6.5%) with b8298, but this advantage disappeared with b8460 -- see the latest benchmarks where RADV wins on both pp and tg.
Vulkan RADV vs ROCm HIP (same build b8460, Qwen3.5-35B-A3B):
| Metric | Ollama (b8298) | Vulkan RADV (b8460) | ROCm HIP (b8460) | Best |
|---|---|---|---|---|
| pp512 | ~457 | 1080 | 1047 | Vulkan RADV |
| tg128 | 47.4 | 64.85 | 54.67 | Vulkan RADV |
Vulkan RADV wins on both pp and tg with the latest llama.cpp build. ROCm works on kernel 6.19.x with the HSA override fix but is no longer the fastest backend for MoE models. Use llama-bench or llama-server directly instead of Ollama to avoid the ~35% overhead.
Based on our measurements and lhl's detailed testing:
| Backend | Best For | pp (relative) | tg (relative) | Context Scaling | Setup Difficulty |
|---|---|---|---|---|---|
| Ollama + Vulkan RADV | General use, chat | Good | Good | Degrades at 8K+ | Easiest |
| llama.cpp + Vulkan RADV (container) | Max speed, no overhead | Best | Best (short ctx) | Degrades at 8K+ | Easy |
| llama.cpp + Vulkan AMDVLK | Not recommended | Slower than RADV on b8460 | Slower on dense (2 GiB limit) | Degrades at 8K+ | Easy |
| ROCm HIP | Batch processing | Excellent | Good | Poor at 32K+ | Medium (needs HSA fix on 6.19.x) |
| ROCm + rocWMMA (tuned) | Long context | Excellent | Best at 32K | Best scaling | Very hard |
| vLLM (TheRock) | API serving | Good | Good | Good | Hard |
| Hardware | Bandwidth | tg (MoE ~30B) | Max Model Size | Price |
|---|---|---|---|---|
| RTX 4090 | ~1008 GB/s | 100-122 t/s | 24 GB | ~$1600 GPU only |
| RTX 3090 | ~936 GB/s | 100-112 t/s | 24 GB | ~$800 used |
| Apple Mac Studio M4 Max 128GB | ~546 GB/s | ~100 t/s (MLX) | 128 GB | $3,699 |
| Beelink GTR9 Pro | ~215 GB/s | 65-87 t/s | 120+ GB | $3,299 |
| NVIDIA DGX Spark | ~273 GB/s | 52-56 t/s (120B) | 128 GB | $4,699 |
Apples-to-apples (gpt-oss-120b, same model, both platforms): Strix Halo gets 50-53 t/s vs DGX Spark's 52-56 t/s -- within 5-10% on the same workload, while costing $1,400 less ($3,299 vs $4,699). On smaller MoE models (Qwen3-30B), Strix Halo hits 87 t/s. The DGX Spark wins on prompt processing (3-5X faster) and long context (23%+ faster at 32K). Source: Framework Community, lhl.
Based on lhl's measurements with gpt-oss-120b (tg32):
| Context | Vulkan AMDVLK | ROCm Standard | ROCm rocWMMA-tuned |
|---|---|---|---|
| 2K | 50.05 t/s | 46.56 t/s | 48.97 t/s |
| 4K | 46.11 t/s | 38.25 t/s | 45.42 t/s |
| 8K | 43.15 t/s | 32.65 t/s | 43.55 t/s |
| 16K | 38.46 t/s | 25.50 t/s | 40.91 t/s |
| 32K | 31.54 t/s | 17.82 t/s | 36.43 t/s |
At 32K context, standard ROCm drops to 17.82 t/s. Vulkan holds at 31.54 t/s (1.8X faster). But lhl's tuned rocWMMA branch is the overall winner at 36.43 t/s -- 2X faster than standard ROCm and 15% faster than Vulkan at 32K.
At extreme context (130K tokens, from strixhalo.wiki):
| Backend | pp512 (t/s) | tg128 (t/s) |
|---|---|---|
| Vulkan RADV | 17 | 13 |
| ROCm | 41 | 5 |
| ROCm rocWMMA-tuned | 51 | 13 |
Which backend should I use?
|
Do you need long context (>32K)?
/ \
NO YES
| |
Just want it easy? ROCm + rocWMMA-tuned
/ \ (lhl's branch)
YES NO Best for 32K+ context
| |
Ollama + Build latest
Vulkan RADV llama.cpp yourself
| |
"It just llama-server +
works" Vulkan RADV
48 t/s 65 t/s
For those who want to get running as fast as possible:
- BIOS: Set UMA Frame Buffer to 512MB, disable IOMMU
- Install Ubuntu 24.04 LTS, switch to X11
- Kernel params: Add amd_iommu=off amdgpu.gttsize=131072 ttm.pages_limit=31457280 to GRUB
- Performance: Install tuned, set the accelerator-performance profile, upgrade Mesa via kisak PPA
- Ollama: Install, configure the Vulkan backend with OLLAMA_VULKAN=1 and HIP_VISIBLE_DEVICES=-1
- Test: ollama run qwen3.5:35b-a3b -- expect ~48 t/s generation
Each step is detailed in the phases below.
Do this BEFORE installing the OS.
Navigate to Integrated Graphics then UMA Frame Buffer Size and set to 512MB.
Why? By default, the BIOS reserves ~97GB for GPU VRAM, leaving only ~31GB visible to the OS. Setting it to 512MB lets the OS see ~125GB RAM. This does NOT reduce GPU performance -- Vulkan uses GTT (system memory) anyway, so the GPU still has access to all 128GB for LLM inference. We benchmarked before and after: zero speed difference.
Find the IOMMU setting and set to Disabled.
Why? lhl's memory bandwidth testing shows amd_iommu=off gives ~6% better memory reads compared to default (234 vs 221 GB/s). iommu=pt (pass-through, recommended by some guides) gives no benefit over default. We use amd_iommu=off in the kernel command line as well, but disabling in BIOS ensures it's completely off. Only re-enable if you need VFIO/GPU passthrough or RDMA clustering.
Install Ubuntu 24.04 LTS Desktop with default settings. After installation:
sudo apt update && sudo apt upgrade -y
Wayland causes issues with RustDesk, Zoom screen sharing, and some GPU monitoring tools.
sudo tee -a /etc/gdm3/custom.conf > /dev/null << 'EOF'
WaylandEnable=false
EOF
If the line already exists (commented out), uncomment it instead. Reboot to apply.
Ubuntu 26.04 LTS (released April 2026) ships with Linux 7.0, Mesa 26.0, and native apt install rocm. However, 26.04 is Wayland-only (the X11 switch above does not work) and the performance-relevant components (kernel, Mesa RADV) are already available on 24.04 via the kisak PPA and mainline kernel PPA. Upgrading is not needed for LLM performance. This guide stays on 24.04 LTS.
CRITICAL: Kernel version matters enormously for Strix Halo.
- Kernel 6.18.4+ is the minimum stable version (older kernels have gfx1151 stability bugs)
- Kernel 6.19.x misidentifies gfx1151 as gfx1100 for ROCm -- fixable with HSA_OVERRIDE_GFX_VERSION=11.5.1 (see Known Issues)
- Recommended: Kernel 6.18.6+ or 6.19.x (6.19.x needs the HSA override for ROCm)
Check your kernel:
uname -r
sudo tee /tmp/grub_update.txt << 'EOF'
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amd_iommu=off amdgpu.gttsize=131072 ttm.pages_limit=31457280 amdgpu.cwsr_enable=0"
EOF
Then edit /etc/default/grub and replace the GRUB_CMDLINE_LINUX_DEFAULT line with the content above.
| Parameter | Purpose | Impact |
|---|---|---|
| amd_iommu=off | Disable IOMMU completely | +6% memory bandwidth (lhl) |
| amdgpu.gttsize=131072 | Set GTT (GPU-accessible system memory) to 128GB | Required for large models |
| ttm.pages_limit=31457280 | Set TTM page limit to ~120GB | Required for large models |
| amdgpu.cwsr_enable=0 | Disable compute wave save/restore | Not needed for LLM inference |
Note: kyuz0's toolboxes use iommu=pt instead of amd_iommu=off. We use off based on lhl's benchmark data showing ~6% better memory bandwidth. The difference is documented in kyuz0 issue #66. If you need RDMA clustering, use iommu=pt instead (RDMA NICs require IOMMU for DMA remapping).
Apply:
sudo update-grub
sudo tee /etc/modprobe.d/amdgpu_llm_optimized.conf > /dev/null << 'EOF'
options amdgpu gttsize=122800
options ttm pages_limit=31457280
options ttm page_pool_size=31457280
EOF
Update initramfs:
sudo update-initramfs -u -k all
sudo tee /etc/udev/rules.d/99-amd-kfd.rules > /dev/null << 'EOF'
SUBSYSTEM=="kfd", GROUP="render", MODE="0666"
SUBSYSTEM=="drm", KERNEL=="card[0-9]*", GROUP="render", MODE="0666"
SUBSYSTEM=="drm", KERNEL=="renderD[0-9]*", GROUP="render", MODE="0666"
EOF
IMPORTANT: The renderD[0-9]* rule is critical. Without it, you get HSA_STATUS_ERROR_OUT_OF_RESOURCES errors with ROCm.
Add your user to GPU groups:
sudo usermod -aG render $USER
sudo usermod -aG video $USER
Reload and reboot:
sudo udevadm control --reload-rules
sudo udevadm trigger
sudo reboot
sudo apt install tuned -y
sudo systemctl enable --now tuned
sudo tuned-adm profile accelerator-performance
Verify:
tuned-adm active
# Expected: Current active profile: accelerator-performance
Impact: +5-8% overall performance improvement. Memory bandwidth improves from ~221 GB/s to ~234 GB/s write. We measured +4-5% token generation improvement when tuned was running vs not running.
WARNING: tuned may not survive reboots on some systems. Add a check to your .bashrc or create a systemd service to verify it's running after boot (a sketch of the systemd approach follows).
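A minimal sketch of that systemd service, assuming tuned-adm lives at /usr/sbin/tuned-adm (check with command -v tuned-adm and adjust the path if needed):

```bash
sudo tee /etc/systemd/system/tuned-reapply.service > /dev/null << 'EOF'
[Unit]
Description=Re-apply tuned accelerator-performance profile at boot
After=tuned.service
Requires=tuned.service

[Service]
Type=oneshot
ExecStart=/usr/sbin/tuned-adm profile accelerator-performance

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable tuned-reapply.service
```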
The default Mesa on Ubuntu 24.04 is significantly slower. Upgrade to 26.0.2+:
sudo add-apt-repository ppa:kisak/kisak-mesa
sudo apt update
sudo apt upgrade -y
Verify:
vulkaninfo --summary 2>&1 | grep driverInfo
# Expected: driverInfo = Mesa 26.0.2 - kisak-mesa PPA
Impact: Mesa 25.2.8 to 26.0.1 gave +9% prompt eval (87 to 96 t/s). Mesa 26.0.1 to 26.0.2 gave an additional small improvement.
Note: You may see DKMS errors about mt76-mt7925 during the upgrade. These are harmless -- see Troubleshooting.
The GPU should run at its maximum clock speed (2900 MHz) during inference:
cat /sys/class/drm/card*/device/pp_dpm_sclk
# Expected: 2: 2900Mhz * (asterisk on highest clock)
GPU Clock Bug: On some kernel/firmware combinations, the GPU gets stuck at 900 MHz, causing ~8% performance loss. If your GPU is not at 2900 MHz during load, see Troubleshooting.
dpkg -l | grep linux-firmware | head -5
CRITICAL: Do NOT install linux-firmware-20251125. It breaks ROCm support on Strix Halo (confirmed by kyuz0 toolboxes). Symptoms: instability, crashes, ROCm containers failing to start. The safe versions are 20240318 or 20260110+. If you're on 20251125, downgrade immediately:
# Check your version
dpkg -l | grep linux-firmware
# If 20251125, hold the package to prevent auto-updates pulling it back
sudo apt-mark hold linux-firmware
Ollama is the easiest way to run LLMs on Strix Halo. With the right configuration, it works great.
curl -fsSL https://ollama.com/install.sh | sh
Update (April 2026): Ollama ROCm now works on gfx1151 with HSA_OVERRIDE_GFX_VERSION=11.5.1 (ollama/ollama#14855). However, Vulkan is still ~9% faster on token generation (46.6 vs 42.4 t/s on Qwen3.5-35B). We recommend Vulkan for best performance. If you need ROCm (for vLLM compatibility or other reasons), add HSA_OVERRIDE_GFX_VERSION=11.5.1 and HSA_ENABLE_SDMA=0 to your Ollama environment instead of the Vulkan variables below.
sudo systemctl edit ollama
Add between the comment lines:
[Service]
Environment="OLLAMA_VULKAN=1"
Environment="HIP_VISIBLE_DEVICES=-1"
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_CONTEXT_LENGTH=8192"
Environment="AMD_VULKAN_ICD=RADV"
Environment="VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/radeon_icd.json"
Environment="OLLAMA_NUM_BATCH=512"
Environment="OLLAMA_NUM_PARALLEL=1"Restart:
sudo systemctl daemon-reload
sudo systemctl restart ollama
| Variable | Purpose |
|---|---|
| OLLAMA_VULKAN=1 | Force Vulkan backend (9% faster than ROCm on Strix Halo) |
| HIP_VISIBLE_DEVICES=-1 | Disable HIP device enumeration (avoids ROCm fallback) |
| OLLAMA_FLASH_ATTENTION=1 | Enable flash attention (+13% prompt processing) |
| OLLAMA_CONTEXT_LENGTH=8192 | Limit context to prevent OOM (increase if needed) |
| AMD_VULKAN_ICD=RADV | Force RADV driver (faster than AMDVLK for general use) |
| VK_ICD_FILENAMES=... | Explicitly point to RADV ICD file |
| OLLAMA_NUM_BATCH=512 | Larger batch size for better throughput |
| OLLAMA_NUM_PARALLEL=1 | Single request at a time (maximizes single-request speed) |
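To confirm the overrides actually reached the service, a quick check (a sketch -- the exact startup log wording varies across Ollama versions):

```bash
# Environment= lists every variable the unit launches with
systemctl show ollama | tr ' ' '\n' | grep -E "OLLAMA_VULKAN|HIP_VISIBLE|AMD_VULKAN_ICD"
# The startup log should mention the Vulkan backend rather than ROCm/HIP
journalctl -u ollama -n 50 --no-pager | grep -i vulkan
```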
# Fast MoE model, great for general use and coding (~23GB)
ollama pull qwen3.5:35b-a3b
# Higher quality MoE, Q8_0 quantization (~32GB)
ollama pull qwen3-coder:30b-a3b-q8_0
# Google's MoE model, strong reasoning (~16GB)
ollama pull gemma4:26b-a4b
# Large dense model for complex tasks (~51GB)
ollama pull qwen3-coder-next
ollama run qwen3.5:35b-a3b
You should see responses generating at ~48 t/s.
tee ~/bench-ollama.sh > /dev/null << 'SCRIPT'
#!/bin/bash
MODEL="${1:-qwen3.5:35b-a3b}"
PROMPT="${2:-hello how are you}"
echo "Model: $MODEL"
echo "Timestamp: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
curl -s http://localhost:11434/api/generate -d "{\"model\":\"$MODEL\",\"prompt\":\"$PROMPT\",\"stream\":false}" | python3 -c "
import sys,json
d=json.load(sys.stdin)
pp=d['prompt_eval_count']/d['prompt_eval_duration']*1e9
tg=d['eval_count']/d['eval_duration']*1e9
print(f'Prompt eval: {pp:.1f} t/s ({d[\"prompt_eval_count\"]} tokens)')
print(f'Generation: {tg:.1f} t/s ({d[\"eval_count\"]} tokens)')
print(f'Total time: {d[\"total_duration\"]/1e9:.2f}s')
"
SCRIPT
chmod +x ~/bench-ollama.sh
Usage:
# Default (qwen3.5:35b-a3b, short prompt)
bash ~/bench-ollama.sh
# Specific model with custom prompt
bash ~/bench-ollama.sh qwen3-coder-next "explain backpropagation in simple terms"
tee ~/bench-ollama-long.sh > /dev/null << 'SCRIPT'
#!/bin/bash
MODEL="${1:-qwen3.5:35b-a3b}"
echo "Model: $MODEL (long prompt)"
echo "Timestamp: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
curl -s http://localhost:11434/api/generate -d "{\"model\":\"$MODEL\",\"prompt\":\"You are an expert software architect. I need you to review and refactor the following Python code for a web application that handles user authentication, session management, database connections, API rate limiting, error handling, logging, caching with Redis, background job processing with Celery, WebSocket connections for real-time updates, file upload handling with S3 integration, email notification service, payment processing with Stripe, and search functionality with Elasticsearch. Please provide a comprehensive architecture review covering separation of concerns, SOLID principles, design patterns, security best practices, performance optimization, and scalability considerations.\",\"stream\":false}" | python3 -c "
import sys,json
d=json.load(sys.stdin)
pp=d['prompt_eval_count']/d['prompt_eval_duration']*1e9
tg=d['eval_count']/d['eval_duration']*1e9
print(f'Prompt eval: {pp:.1f} t/s ({d[\"prompt_eval_count\"]} tokens)')
print(f'Generation: {tg:.1f} t/s ({d[\"eval_count\"]} tokens)')
print(f'Total time: {d[\"total_duration\"]/1e9:.2f}s')
"
SCRIPT
chmod +x ~/bench-ollama-long.sh
Prompt processing speed scales with prompt length due to GPU parallelism:
| Prompt Tokens | pp (qwen3.5:35b-a3b) | pp (qwen3-coder-next) |
|---|---|---|
| 12-14 | 121 t/s | 91 t/s |
| 21-23 | 182 t/s | 130 t/s |
| 120-122 | 457 t/s | 301 t/s |
For maximum prompt processing performance, use llama.cpp with ROCm via kyuz0 containers.
NOTE: On kernel 6.19.x, ROCm requires HSA_OVERRIDE_GFX_VERSION=11.5.1 and HSA_ENABLE_SDMA=0 to work. Without these, it segfaults. See ROCm on kernel 6.19.x.
sudo apt install podman -y
curl -s https://raw.githubusercontent.com/89luca89/distrobox/main/install | sudo sh
Note: Ubuntu 24.04 does not include toolbox in its repos. Use Distrobox instead. The default toolbox on Ubuntu also breaks GPU access.
distrobox create llama-rocm-72 \
--image docker.io/kyuz0/amd-strix-halo-toolboxes:rocm-7.2 \
--additional-flags "--device /dev/dri --device /dev/kfd --group-add video --group-add render --group-add sudo --security-opt seccomp=unconfined"distrobox enter llama-rocm-72
rocm-smi # Should show your gfx1151 GPUThe container comes with pre-built, optimized llama.cpp binaries:
export ROCBLAS_USE_HIPBLASLT=1
llama-bench -m ~/models/your-model.gguf -fa 1 -ngl 999 -mmp 0 -p 128,512 -n 128
Critical flags:
| Flag | Impact | Notes |
|---|---|---|
| -fa 1 | +13% prompt processing | Always use on Strix Halo |
| -mmp 0 (--no-mmap) | +22% pp128, more stable | Always use on Strix Halo |
| ROCBLAS_USE_HIPBLASLT=1 | +8% token generation | Set in environment |
| -ngl 999 | Full GPU offload | Use all available VRAM |
The kyuz0 pre-built binary includes the critical compiler flag --amdgpu-unroll-threshold-local=600 which works around the LLVM compiler regression in ROCm 7+. Self-compiled binaries without this flag may be significantly slower.
If you need the latest llama.cpp features or want to use lhl's rocWMMA patches:
# Inside a ROCm container
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# Standard build (without rocWMMA)
cmake -B build -S . \
-DGGML_HIP=ON \
-DAMDGPU_TARGETS="gfx1151" \
-DCMAKE_HIP_FLAGS="-mllvm --amdgpu-unroll-threshold-local=600" \
-DCMAKE_BUILD_TYPE=Release
# With rocWMMA (for long context, use lhl's tuned branch)
cmake -B build -S . \
-DGGML_HIP=ON \
-DAMDGPU_TARGETS="gfx1151" \
-DGGML_HIP_ROCWMMA_FATTN=ON \
-DCMAKE_HIP_FLAGS="-mllvm --amdgpu-unroll-threshold-local=600" \
-DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)
WARNING: Do NOT enable GGML_HIP_ROCWMMA_FATTN=ON on upstream llama.cpp without lhl's patches. ROCm 7.2 has a 73% performance regression with rocWMMA FA enabled. lhl's custom rocm-wmma-tune branch fixes this and delivers 2X better performance at 32K context.
kyuz0's vLLM toolboxes enable API serving on gfx1151.
distrobox create vllm-gfx1151 \
--image docker.io/kyuz0/vllm-therock-gfx1151:latest \
--additional-flags "--device /dev/dri --device /dev/kfd --group-add video --group-add render --security-opt seccomp=unconfined"Known vLLM issues on gfx1151:
- Qwen3.5 block_size validation (issue #28): Hybrid mamba/attention models compute block_size=1056, which gets rejected by a hardcoded whitelist. Fix available in the issue.
- MIOpen encoder hang (issue #30): Vision models hang during kernel search because MIOpen lacks pre-compiled solver DBs for gfx1151. Workaround: disable encoder profiling.
Tested models on vLLM:
| Model | Max Context |
|---|---|
| Llama-3.1-8B | 128K |
| Gemma-3-12b | 128K |
| Qwen3-Coder-30B-A3B (GPTQ 4-bit) | 256K |
| gpt-oss-120b | 128K |
| Qwen3-Next-80B-A3B (GPTQ Int4) | 256K |
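For reference, serving one of these looks roughly like the sketch below, assuming the container exposes the standard vllm CLI. The model ID is a placeholder -- substitute the exact repo you tested, and expect to tune flags for gfx1151:

```bash
# Sketch: start an OpenAI-compatible vLLM server (model ID is illustrative)
vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct-GPTQ-Int4 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --port 8000
# Then query http://localhost:8000/v1 with any OpenAI-compatible client
```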
For models that exceed 128GB, you can cluster multiple Strix Halo machines using RDMA.
From kyuz0's vLLM clustering guide:
Hardware needed:
- 2x Strix Halo machines (e.g., Framework Desktop)
- 2x Intel E810-CQDA1 100GbE NICs
- 1x DAC cable (direct attach copper, no switch needed for 2 nodes)
Performance:
- ~50 Gbps bandwidth, ~5 us latency (vs ~70-100 us TCP/IP)
- TP=2 across machines = 256GB unified memory
- Enables trillion-parameter model inference (AMD article)
Additional kernel parameter for clustering:
pci=realloc
Network configuration:
# Set MTU to 9000 (jumbo frames)
sudo ip link set <interface> mtu 9000
sudo apt install openssh-server fail2ban -y
sudo sed -i 's/^#*PermitRootLogin.*/PermitRootLogin no/' /etc/ssh/sshd_config
sudo systemctl restart ssh
fail2ban starts automatically and blocks IPs after repeated failed login attempts. We found 68 brute-force attempts on our system within hours of enabling SSH -- fail2ban is essential.
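You can watch fail2ban work:

```bash
# Shows currently banned IPs and total failed attempts for the SSH jail
sudo fail2ban-client status sshd
```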
We tested both Vulkan drivers via llama-bench. Results depend heavily on the llama.cpp build version:
With kyuz0 containers (b8298):
| Driver | Model | pp512 | tg128 |
|---|---|---|---|
| RADV | Qwen3.5-35B-A3B | 859 | 52.15 |
| AMDVLK | Qwen3.5-35B-A3B | 576 | 55.54 |
| RADV | Llama 2 7B | 1377 | 48.12 |
| AMDVLK | Llama 2 7B | 327 | 48.02 |
With latest llama.cpp (b8460) -- AMDVLK advantage is gone:
| Driver | Model | pp512 | tg128 |
|---|---|---|---|
| RADV | Qwen3.5-35B-A3B | 1080 | 64.85 |
| AMDVLK | Qwen3.5-35B-A3B | 663 | 64.10 |
AMDVLK is discontinued. Uninstall it -- even inactive, its ICD file silently hijacks Vulkan and halves your pp speed. See AMDVLK warning above.
Our recommendation: Use RADV. AMDVLK is discontinued (last release April 2025) -- RADV is now AMD's only supported open-source Vulkan driver. Even before discontinuation, RADV won on both pp and tg with latest llama.cpp. AMDVLK also had a 2 GiB buffer limit that caused 3-4X slower pp on dense models. Don't install AMDVLK.
Optimal ubatch sizes per driver (from lhl's testing):
- AMDVLK: -ub 512
- RADV: -ub 1024
- ROCm HIP: -ub 2048
These findings correct several common recommendations found in other Strix Halo guides.
| Issue | Common Advice | Reality | What Happens If You Try |
|---|---|---|---|
| Ollama + ROCm on gfx1151 | - | Fixed in Ollama 0.20+ with HSA_OVERRIDE_GFX_VERSION=11.5.1. Works but ~9% slower tg than Vulkan | Use Vulkan for best speed, ROCm if you need vLLM compatibility |
| iommu=pt for speed | "Use pass-through for performance" | No benefit over default (lhl) | Same speed as iommu=on, wastes a kernel param |
| AMDVLK for all workloads | "AMDVLK is fastest" | Project discontinued (last release April 2025). RADV beats AMDVLK on both pp (+63%) and tg. Worse: even if you don't use AMDVLK, its ICD file (/etc/vulkan/icd.d/amd_icd64.json) silently hijacks Vulkan and halves your pp speed. You won't see an error -- just mysteriously slow prompt processing | Uninstall it completely: sudo dpkg -r amdvlk && sudo rm -f /etc/vulkan/icd.d/amd_icd64.json. Verify with llama-bench: RADV shows (RADV STRIX_HALO) with shared memory: 65536. AMDVLK shows (AMD open-source driver) with shared memory: 32768 |
| rocWMMA on upstream llama.cpp | "Enable for 2x speed" | 73% regression on ROCm 7.2 | Massively slower prompt processing |
| BIOS VRAM increase for speed | "More GPU VRAM = faster" | Zero speed difference, but you lose OS-visible RAM and GTT capacity. Set to 512MB or your system is crippled (31GB usable instead of 125GB). | OS sees only 31GB RAM, large models won't load at all |
| ROCm 7.0 RC | "Use ROCm 7 RC" | Segfaults on kernel 6.18.14+ | HSA_STATUS_ERROR crash |
| Kernel 6.19.x with ROCm (without fix) | "Just use latest kernel" | GPU misidentified as gfx1100 without HSA override | Segfaults unless you set HSA_OVERRIDE_GFX_VERSION=11.5.1 |
| linux-firmware-20251125 | Auto-update | Breaks ROCm on Strix Halo | Instability, crashes |
| PyTorch / HuggingFace Transformers | "Just load the model" | 92-95% of decode time is hipMemcpy, not compute. ~1.5 t/s on 70B vs llama.cpp's 4.8 t/s | PyTorch doesn't handle UMA correctly -- use llama.cpp or Ollama |
| Optimization | Impact | How |
|---|---|---|
| Mesa 25.2.8 to 26.0.2 | +9-10% pp | sudo add-apt-repository ppa:kisak/kisak-mesa |
| Flash Attention | +13% pp | -fa 1 or OLLAMA_FLASH_ATTENTION=1 |
| --no-mmap (disable mmap) | +22% pp128 | -mmp 0 in llama.cpp, always use on Strix Halo |
| hipBLASLt | +8% tg | ROCBLAS_USE_HIPBLASLT=1 (ROCm only) |
| tuned accelerator-performance | +5-8% overall | sudo tuned-adm profile accelerator-performance |
| RADV over AMDVLK | +63% pp, +1.2% tg | Uninstall AMDVLK entirely (see above). AMD_VULKAN_ICD=RADV works too but is easy to forget |
| amd_iommu=off | +6% memory bandwidth | GRUB parameter |
| BIOS VRAM to 512MB | OS sees 125GB vs 31GB, GTT gets full 128GB | No speed change, but required -- without this, models >31GB won't load |
| HIP_VISIBLE_DEVICES=-1 | Fixes Ollama crash | Required for Vulkan-only mode |
| LLVM unroll workaround | Restores ROCm 7+ perf | -mllvm --amdgpu-unroll-threshold-local=600 |
| lhl's rocWMMA-tuned | 2X tg at 32K context | Custom branch, requires manual build |
| Updating llama.cpp | +25% pp and tg (MoE) | git pull && cmake --build -- biggest single optimization |
| HSA_OVERRIDE_GFX_VERSION=11.5.1 | Fixes ROCm on kernel 6.19.x | Required for ROCm on 6.19.x, +6% pp vs 6.18.x |
Symptoms: Without the fix, ROCm containers segfault. ggml_cuda_init reports gfx1100 (0x1100) instead of gfx1151.
Fix: Set these environment variables before running any ROCm binary:
export HSA_OVERRIDE_GFX_VERSION=11.5.1
export HSA_ENABLE_SDMA=0
With this fix, ROCm works on kernel 6.19.4 and actually performs +6% better on pp than it did on kernel 6.18.14. See benchmarks for numbers.
Qwen3.5 ROCm Hang Bug (ROCm #6027)
Symptoms: Qwen3.5 models (35B-A3B and 27B) hang during load_tensors on ROCm. CPU pegs at 99.9%.
Status: Open. AMD confirmed working with TheRock 7.13.0a20260316+ nightlies.
Workaround: Use very conservative flags: --batch-size 128 --ubatch-size 32 --flash-attn off --n-gpu-layers 1
Symptoms: GPU stays at 900 MHz instead of 2900 MHz, causing ~8% performance loss.
Check:
cat /sys/class/drm/card*/device/pp_dpm_sclk
# Should show: 2: 2900Mhz *
Fix: Force highest performance level:
echo high | sudo tee /sys/class/drm/card*/device/power_dpm_force_performance_level
Newer kernels (6.18.4+) recognize gfx1151's 1.5X VGPR capacity compared to standard gfx11 chips. This enables better occupancy for compute shaders. If you're on an older kernel, you may not be getting full performance.
DKMS mt7925 WiFi Errors During apt install
You'll see this on every apt install:
Error! Bad return status for module build on kernel: 6.18.14-061814-generic
dkms autoinstall failed for mt76-mt7925(10)
This is harmless. WiFi works fine via the kernel driver. To permanently silence:
sudo dkms remove mt76-mt7925/1.5.0 --all
Ollama "Out of Memory" Even with Small Models
This happens when Ollama tries to use HIP/ROCm instead of Vulkan:
# Check current Ollama environment
systemctl show ollama | grep Environment
# Fix: ensure these are set
sudo systemctl edit ollama
# Add: OLLAMA_VULKAN=1, HIP_VISIBLE_DEVICES=-1
sudo systemctl daemon-reload
sudo systemctl restart ollama
ROCm Container Segfaults (Kernel 6.19.x)
If your ROCm containers crash immediately with segfaults on kernel 6.19.x:
# Fix: set these BEFORE running any ROCm binary
export HSA_OVERRIDE_GFX_VERSION=11.5.1
export HSA_ENABLE_SDMA=0
export ROCBLAS_USE_HIPBLASLT=1
# Then run llama-bench or llama-server as normal
llama-bench -m model.gguf -fa 1 -ngl 999 -mmp 0 -p 512 -n 128
The GPU is misidentified as gfx1100 instead of gfx1151 on kernel 6.19.x. The HSA_OVERRIDE_GFX_VERSION forces correct identification. This is a kernel/ROCm compatibility issue that will likely be fixed in future ROCm releases.
Verifying GPU Memory Configuration
# Check TTM pages limit
cat /sys/module/ttm/parameters/pages_limit
# Check GTT size
cat /sys/module/amdgpu/parameters/gttsize
# Check Vulkan driver
vulkaninfo --summary 2>&1 | grep -E "driverName|driverInfo"
# Check OS-visible RAM
free -h
# Check GPU memory allocation
for file in /sys/class/drm/card*/device/mem_info*; do
echo "$file: $(cat $file)"
done
rocm-smi Shows Wrong VRAM
For APUs with unified memory, mem_info_vram_total showing ~1GB is normal. The actual compute memory is in GTT, which should show ~128GB.
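You can read both pools directly from sysfs (values are in bytes):

```bash
# VRAM pool is tiny on an APU; GTT is where models actually load
cat /sys/class/drm/card*/device/mem_info_vram_total
cat /sys/class/drm/card*/device/mem_info_gtt_total
cat /sys/class/drm/card*/device/mem_info_gtt_used
```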
tuned Not Running After Reboot
# Check status
tuned-adm active
# If not running:
sudo systemctl enable --now tuned
sudo tuned-adm profile accelerator-performance
# Verify it persists
tuned-adm active
GPU Stuck at Low Clock Speed
# Check current clock
cat /sys/class/drm/card*/device/pp_dpm_sclk
# If not on highest (2900Mhz):
echo high | sudo tee /sys/class/drm/card*/device/power_dpm_force_performance_level
# To make persistent, add to /etc/rc.local or a udev rule
Based on community testing and our own findings:
| Kernel | ROCm 6.4.4 | ROCm 7.2 | ROCm 7 Nightly | Vulkan (Ollama) |
|---|---|---|---|---|
| 6.17.7 | Works (with right firmware) | Unknown | Works | Works |
| 6.18.4-6.18.14 | Works (patched) | Works | Works | Works |
| 6.19.4 | Works (HSA fix) | Works (HSA fix) | Unknown | Works |
Key rules:
- Kernel 6.18.4+ has a fix that breaks ALL older ROCm versions
- Kernel 6.19.x misidentifies gfx1151 as gfx1100, fixable with HSA_OVERRIDE_GFX_VERSION=11.5.1
- linux-firmware-20251125 breaks ROCm regardless of kernel
- linux-firmware-20260110+ is safe
Our current recommendation (March 2026): Kernel 6.19.x works for both Vulkan and ROCm (ROCm requires HSA_OVERRIDE_GFX_VERSION=11.5.1). Kernel 6.18.6-6.18.14 works without the HSA workaround.
After completing setup, verify each item (a script that automates these checks follows the list):
- free -h shows ~124GB total RAM
- vulkaninfo --summary shows RADV Mesa 26.0.2+
- tuned-adm active shows accelerator-performance
- cat /sys/class/drm/card*/device/pp_dpm_sclk shows 2900Mhz with asterisk
- cat /sys/module/ttm/parameters/pages_limit shows 31457280
- ollama --version returns without error
- ollama run qwen3.5:35b-a3b "hello" generates at 45+ t/s
- systemctl show ollama | grep Environment includes OLLAMA_VULKAN=1
- cat /etc/default/grub | grep CMDLINE includes amd_iommu=off
- uname -r shows 6.18.x+ (ROCm on 6.19.x requires the HSA override -- see Known Issues)
- dpkg -l | grep linux-firmware does NOT show 20251125
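A sketch that automates the checklist (adjust paths and expected values for your system):

```bash
#!/bin/bash
# verify-setup.sh -- quick pass/fail view of the checklist above
echo "RAM total:  $(free -h | awk '/^Mem:/{print $2}')   (expect ~124Gi)"
echo "Kernel:     $(uname -r)   (expect 6.18.x or newer)"
echo "tuned:      $(tuned-adm active 2>/dev/null)"
echo "TTM limit:  $(cat /sys/module/ttm/parameters/pages_limit)   (expect 31457280)"
vulkaninfo --summary 2>/dev/null | grep driverInfo
grep CMDLINE /etc/default/grub | grep -q amd_iommu=off \
  && echo "GRUB:       amd_iommu=off present" || echo "GRUB:       amd_iommu=off MISSING"
dpkg -l | grep -q "linux-firmware.*20251125" \
  && echo "Firmware:   WARNING -- 20251125 installed!" || echo "Firmware:   OK"
systemctl show ollama 2>/dev/null | grep -q OLLAMA_VULKAN=1 \
  && echo "Ollama:     Vulkan enabled" || echo "Ollama:     OLLAMA_VULKAN=1 not set"
```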
- kyuz0/amd-strix-halo-toolboxes -- Community standard containers for llama.cpp (1.2k+ stars)
- kyuz0/amd-strix-halo-vllm-toolboxes -- vLLM serving + RDMA clustering
- kyuz0/amd-strix-halo-gfx1151-toolboxes -- Meta repository with all toolboxes
- kyuz0 Backend Benchmarks Dashboard -- Interactive benchmark comparison
- lhl/strix-halo-testing -- Deep performance research and rocWMMA patches
- strixhalo.wiki -- Community wiki
- llm-tracker.info -- GPU performance comparison
- Level1Techs Forum -- Community benchmark results
- Framework Community -- Framework Desktop discussions
- ROCm Strix Halo Optimization Guide -- Official AMD guide
Not sure which model to run? Here's what we recommend based on use case:
| I want to... | Model | Size | Speed | Why |
|---|---|---|---|---|
| Code (best speed) | Qwen3-Coder 30B-A3B (UD-Q4_K_XL) | 16.5 GB | 87 t/s | Fastest coding model, MoE architecture |
| Code (best quality) | Qwen3-Coder 30B-A3B (Q8_0) | 32 GB | 51 t/s | Same model, higher fidelity quantization |
| Chat (general) | Qwen3.6 35B-A3B (Q4_K_M) | 20 GB | 64 t/s | Best all-rounder, successor to 3.5 |
| Chat (no thinking) | Qwen3.6 35B-A3B (no-think) | 20 GB | 64 t/s | Same speed, direct answers |
| Code (best quality, 256K ctx) | Qwen3-Next 80B-A3B | 42.9 GB | 55 t/s | 80B MoE, only 3B active, 256K context |
| Chat (smartest possible) | Qwen3-Coder-Next | 51 GB | 38 t/s | Dense 51B model, slower but smarter |
| Reasoning | Gemma 4 26B-A4B | 15.7 GB | 48.5 t/s | Google's latest MoE, strong reasoning |
| Analyze images | Qwen2.5-VL 7B | 6 GB | 21 t/s | Vision-language model |
| Maximum intelligence | Llama 3.3 70B (Q4) | ~40 GB | ~5 t/s | Slow but very capable |
| "Can it run?" | Llama 4 Scout 109B | 61 GB | 18 t/s | 109B model on a mini PC. RTX 4090 can't |
| Process documents | Qwen3.6 35B-A3B (Q4_K_M) | 20 GB | 64 t/s | Fast enough for RAG pipelines |
| Learn / experiment | Llama 2 7B | 3.8 GB | 52 t/s | Small, fast, well-documented |
| Throughput testing | Qwen3-0.6B (Q8_0) | 0.8 GB | 266 t/s | Speed ceiling benchmark |
How to install any model:
# Via Ollama (easiest)
ollama pull qwen3.6:35b-a3b
# For llama-bench direct (need GGUF file)
# Download from huggingface.co, place in ~/models/
Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf
| | | | | | |
| | | | | | +-- Quantization (see Glossary)
| | | | | +-- "Unsloth Dynamic" quant method
| | | | +-- Fine-tuned for instructions
| | | +-- 3B Active parameters (MoE)
| | +-- 30B Total parameters
| +-- Optimized for coding
+-- Model family (by Alibaba)
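One way to fetch a GGUF like this (the repo ID below is illustrative -- check the actual Hugging Face repo for the quant you want):

```bash
# Downloads just the one quantization file into ~/models/
huggingface-cli download unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF \
  Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf \
  --local-dir ~/models/
```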
Assumptions: Qwen3.6-35B-A3B level intelligence, 1000 tokens per query, 50 queries per day.
| Option | Monthly Cost | Speed | Privacy | Offline |
|---|---|---|---|---|
| ChatGPT Plus | $20/mo | Fast | No | No |
| Claude Pro | $20/mo | Fast | No | No |
| OpenAI API (gpt-4o, 50 queries/day) | ~$15/mo | Fast | No | No |
| Anthropic API (Claude Sonnet, 50 queries/day) | ~$12/mo | Fast | No | No |
| Strix Halo (after purchase) | ~$8/mo electricity | 48-87 t/s | Yes | Yes |
Break-even calculation:
| Scenario | System Cost | Monthly Savings | Break-even |
|---|---|---|---|
| vs ChatGPT Plus | ~$3,299 | $12/mo | ~23 years |
| vs API heavy use (200 queries/day) | ~$3,299 | ~$50/mo | ~5.5 years |
| vs API power use (1000+ queries/day) | ~$3,299 | ~$200/mo | ~16 months |
The real value is not cost savings. It's running AI with no rate limits, no content filters, no data leaving your machine, and no internet required. If you value privacy, unrestricted use, or offline capability, local LLM pays for itself immediately.
Power consumption:
- Idle: ~30W
- Under inference load: 120-140W
- Monthly electricity (8 hours/day inference): ~$8 at $0.15/kWh
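The arithmetic behind that estimate, as a sketch you can adapt to your own rates and duty cycle:

```bash
python3 -c "
load_w, idle_w, rate = 140, 30, 0.15           # watts under load, watts idle, USD per kWh
kwh = (load_w * 8 + idle_w * 16) / 1000 * 30   # 8 h/day inference, 16 h/day idle
print(f'{kwh:.0f} kWh/month -> about USD {kwh * rate:.2f}/month')
"
```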
Ollama provides an OpenAI-compatible API. Point any coding tool at it:
# For Cursor, Continue.dev, or any OpenAI-compatible client:
# Base URL: http://localhost:11434/v1
# Model: qwen3.5:35b-a3b (or qwen3-coder-next for max quality)
# API Key: ollama (or leave empty)For Claude Code specifically:
ANTHROPIC_BASE_URL=http://localhost:11434 claude --model qwen3.5:35b-a3b
At 48-87 t/s, local inference feels instant for code completion and review.
docker run -d -p 3000:8080 \
--add-host=host.docker.internal:host-gateway \
-v open-webui:/app/backend/data \
--name open-webui \
ghcr.io/open-webui/open-webui:main
Open http://localhost:3000. You get conversation history, document upload, multi-model support, and built-in RAG -- all local, no cloud.
For querying your own documents locally:
# 1. Pull an embedding model
ollama pull nomic-embed-text
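# (Sketch) Sanity-check the embedding endpoint -- /api/embeddings is Ollama's
# embeddings API; the prompt below is just a placeholder
curl -s http://localhost:11434/api/embeddings \
  -d '{"model":"nomic-embed-text","prompt":"test sentence"}' | head -c 200; echo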
# 2. Use Open WebUI's built-in RAG (easiest)
# or set up LangChain + ChromaDB for custom pipelines
kyuz0's ComfyUI toolboxes provide ROCm containers for Flux, Wan 2.2, and Hunyuan on gfx1151. For Vulkan-only: stable-diffusion.cpp works with the RADV driver.
Qwen3-TTS and Chatterbox TTS both run on Strix Halo with GPU acceleration. lhl's voicechat2 provides a complete local AI voice chat system.
All current Strix Halo mini PCs use the same AMD Ryzen AI MAX+ 395 APU with 128GB LPDDR5X-8000. The differentiators are form factor, cooling, ports, and price.
| System | Price (Apr 2026) | Cooling | Networking | Key Differentiator |
|---|---|---|---|---|
| GMKtec EVO-X2 | ~$2,349 | Air (blower) | 2.5GbE | Best value, most popular |
| Bosgame M5 | $2,399 | Air (blower) | 2.5GbE | Budget option |
| Framework Desktop 13 | ~$2,599 | Air (optimized) | Modular | Best community/support, quietest, DIY kit (no SSD/OS) |
| Beelink GTR9 Pro | $3,299 | Air (Mac Studio) | Dual 10GbE | Best for clustering (this guide's test system) |
| Corsair AI Workstation 300 | $3,399 | Liquid cooled | 2.5GbE | Brand reputation, quiet under load |
| Minisforum MS-S1 MAX | $3,039 | Air | Dual 10GbE, USB4 v2 | PCIe x16 slot (x4 speed) |
| HP ZBook Ultra G1a | ~$4,049+ | Air (laptop) | WiFi/1GbE | Only portable option, 14" OLED |
Note: Prices have increased significantly since launch due to global LPDDR5X memory shortages and tariffs. The DGX Spark went from $3,999 to $4,699 in Feb 2026. Strix Halo systems are up $500-1,000+ from launch prices (Corsair jumped $699 in one month). Check current availability before buying.
WARNING (Beelink GTR9 Pro): The v1 motherboard has a fatal NIC stability issue that cannot be fixed in software. Verify you are getting board revision v2.2 (with Realtek NICs) before purchasing. Beelink offers free replacement for v1 boards. Contact their support with your serial number.
Recommendation tiers:
- Best value: GMKtec EVO-X2 (~$2,349)
- Best overall: Framework Desktop 13 (~$2,599) -- best cooling, community, repairability, used by kyuz0 and lhl
- Best for clustering: Beelink GTR9 Pro v2.2 ($3,299) or Minisforum MS-S1 MAX ($3,039) -- dual 10GbE for RDMA
- Only if you need portability: HP ZBook Ultra G1a ($4,049+)
Important: ~90% of Chinese mini PCs (Bosgame, GMKtec, Beelink) use the same Sixunited platform internally. Performance is identical. Pick based on price, ports, and cooling preference.
| Feature | Linux (recommended) | Windows |
|---|---|---|
| LLM performance | Baseline (fastest) | ~20-40% slower |
| Max model size | ~120 GB | ~64 GB (known limitation) |
| ROCm/HIP | Supported (kernel 6.18.x) | Very limited |
| vLLM serving | Works | Not supported |
| Image generation | Works (ComfyUI) | Limited |
| Setup effort | Higher (this guide helps) | Lower (but slower) |
Linux is strongly recommended for Strix Halo LLM work. Windows works for casual use with Ollama but leaves significant performance on the table.
New to local LLMs? Here's what the technical terms mean.
Click to expand glossary
APU -- Accelerated Processing Unit. AMD's term for a chip that combines CPU and GPU on one die. Strix Halo's APU shares 128GB of memory between CPU and GPU, which is why it can run large models.
GGUF -- GPT-Generated Unified Format. The file format used by llama.cpp to store AI models. A .gguf file contains the model weights and metadata needed to run inference.
Quantization -- Reducing the precision of model weights to use less memory and run faster. Common types:
- Q4_K_M -- 4-bit quantization, medium quality. Good balance of size and quality.
- Q8_0 -- 8-bit quantization. Better quality, ~2x the size of Q4.
- UD-Q4_K_XL -- Unsloth Dynamic 4-bit. Uses higher precision for important layers.
- BF16 -- Full precision (16-bit). Best quality, largest size.
MoE (Mixture of Experts) -- A model architecture where only a subset of parameters are active for each token. A "30B-A3B" model has 30 billion total parameters but only activates 3 billion per token, making it much faster than a dense 30B model while retaining most of the intelligence.
Dense Model -- A model where all parameters are used for every token. Slower but potentially smarter per parameter count. A dense 7B model uses all 7 billion parameters for every token.
Token -- The basic unit of text for LLMs. Roughly 3/4 of a word in English. "Hello, how are you?" is about 6 tokens.
Prompt Processing (pp) -- How fast the model reads your input. Measured in tokens/second. Higher is better. A pp of 800 t/s means the model can read ~600 words per second.
Token Generation (tg) -- How fast the model writes its response. Measured in tokens/second. This is the speed you "feel" when chatting. 50 t/s feels instant. 5 t/s feels slow.
Unified Memory -- Memory shared between CPU and GPU. Unlike discrete GPUs (RTX 4090 has separate 24GB VRAM), Strix Halo's GPU uses the same 128GB as the CPU. This means you can load models up to ~120GB.
GTT (Graphics Translation Table) -- The portion of system memory that the GPU can access via Vulkan. On Strix Halo, you configure this to ~128GB so the GPU can use all available memory.
Vulkan -- A graphics/compute API. On Strix Halo, Vulkan is the most reliable backend for LLM inference via Ollama.
ROCm -- AMD's GPU compute platform (like NVIDIA's CUDA). Provides HIP backend for llama.cpp. On kernel 6.19.x, requires HSA_OVERRIDE_GFX_VERSION=11.5.1 to work. With the latest llama.cpp, Vulkan RADV is now faster than ROCm on both pp and tg for MoE models.
RADV -- Mesa's open-source Vulkan driver for AMD GPUs. AMD's only supported open-source Vulkan driver since AMDVLK was discontinued. Fastest backend for LLM inference on Strix Halo.
AMDVLK -- AMD's former open-source Vulkan driver. Discontinued (last release April 2025). Uninstall it -- even inactive, its ICD file silently hijacks Vulkan and halves pp speed.
Ollama -- A tool that makes running LLMs as easy as ollama run model-name. Handles model downloading, GPU acceleration, and provides an API. Uses Vulkan on Strix Halo.
llama.cpp -- The open-source C++ library that powers most local LLM inference. Supports Vulkan, ROCm/HIP, and CPU backends.
Flash Attention -- An optimized attention algorithm that reduces memory usage and improves speed. Always enable it on Strix Halo (-fa 1 or OLLAMA_FLASH_ATTENTION=1).
tuned -- A Linux daemon that applies system performance profiles. The accelerator-performance profile gives +5-8% LLM speed on Strix Halo.
What is the difference between Ollama and llama.cpp? Why is llama.cpp faster?
They are not two different programs. Ollama is a wrapper around llama.cpp. It adds model management (ollama pull), a simple API, and easy commands (ollama run). Under the hood, it runs the same llama.cpp inference engine.
So why is llama.cpp direct 35% faster? Two reasons:
- Wrapper overhead. Ollama adds layers between you and the GPU: model loading, API translation, memory management. This costs ~8-15% on token generation.
- Bundled version. Ollama ships with a specific llama.cpp version baked in. As of March 2026, Ollama bundles an older build that misses recent Vulkan optimizations (Flash Attention refactor, graphics queue on AMD, GDN shaders). These optimizations gave us +25% on MoE models. Ollama will catch up eventually, but there's always a lag.
Think of it like a web browser: Ollama is Chrome (easy to use, auto-updates, but bundles a specific engine version). llama.cpp direct is building Chromium from source (more work, but you get the latest engine immediately).
What should you use?
| Use case | Recommendation |
|---|---|
| Just want it to work | Ollama -- install and go, 48 t/s is still fast |
| Want maximum speed | llama-server (from latest llama.cpp) -- 65 t/s, same API as Ollama |
| Using kyuz0 containers | kyuz0 -- they auto-rebuild on llama.cpp updates, best of both worlds |
| Benchmarking | llama-bench -- eliminates all overhead, pure GPU measurement |
How to run llama-server (Ollama replacement with full speed):
```bash
# Start llama-server with your model (OpenAI-compatible API on port 8080)
cd ~/llama-cpp-latest
AMD_VULKAN_ICD=RADV ./build-vulkan/bin/llama-server \
  -m ~/models/Qwen3.6-35B-A3B-Q4_K_M.gguf \
  -ngl 999 -fa --no-mmap -c 8192 \
  --host 0.0.0.0 --port 8080
```

Then point your tools at http://localhost:8080/v1 instead of http://localhost:11434/v1. Same API, 35% faster.
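A quick way to confirm the server is answering, sketched with curl (the `model` field is mostly cosmetic here -- llama-server serves whatever model it was started with; `jq` is optional and assumed installed):

```bash
# Smoke-test the OpenAI-compatible endpoint started above
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "local", "messages": [{"role": "user", "content": "Say hello."}]}' \
  | jq -r '.choices[0].message.content'
```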
Can I run ChatGPT-level intelligence locally?
Yes. Qwen3.6-35B-A3B runs at 64 t/s via llama-server and is comparable to GPT-4o-mini for most tasks. For coding, Qwen3-Coder 30B-A3B runs at 87 t/s and is competitive with commercial coding assistants. For maximum intelligence, you can run 70B+ dense models at ~5 t/s -- slower but very capable.
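If you want to try this without building llama.cpp, the Ollama route is a single command. The tag below is the q8_0 build benchmarked later in this guide; verify the exact tag in the Ollama model library, since names change:

```bash
# Pull and chat with a fast coder model (confirm the tag with `ollama list` / the library)
ollama run qwen3-coder:30b-a3b-q8_0
```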
Do I need Linux? Can I use Windows?
Linux (Ubuntu 24.04) gives the best performance and is the only way to use ROCm. Windows works for basic inference via Ollama with Vulkan, and AMD's Adrenalin 25.8.1+ drivers added Variable Graphics Memory support for up to 96GB VGM. However, Windows performance is typically 10-20% lower and community tooling is less mature.
Is 128GB enough for the biggest models?
128GB unified memory lets you run models up to ~120GB (some memory reserved for OS and GPU overhead). This covers all 70B Q4 models and most 120B MoE models. For larger models, you can cluster two Strix Halo systems via RDMA for 256GB unified memory. AMD demonstrated a 4-node cluster running a 1 trillion parameter model.
How does this compare to a Mac Studio?
The Mac Studio M4 Max (128GB) costs $3,699 and gets ~100 t/s via MLX with ~546 GB/s bandwidth. The Beelink GTR9 Pro costs $3,299 and gets 50-87 t/s via Vulkan (model-dependent) with ~215 GB/s bandwidth. The Mac is faster per-model due to higher bandwidth, but costs $400 more. The Mac has better software polish (MLX is excellent). The Strix Halo offers better value, Linux flexibility, and ROCm/vLLM ecosystem access.
Why is my speed lower than the guide says?
Common causes:
- tuned not running -- Run `tuned-adm active`. Should show `accelerator-performance`. This alone is worth +5-8%.
- Old Mesa drivers -- Check `vulkaninfo --summary | grep driverInfo`. Should be Mesa 26.0.2+.
- Using Ollama instead of llama-bench -- Ollama has ~8% overhead. The 87 t/s number is via llama-bench direct.
- GPU clock stuck low -- Check `cat /sys/class/drm/card*/device/pp_dpm_sclk`. Should show 2900 MHz with an asterisk.
- Wrong BIOS VRAM setting -- Check `free -h`. Should show ~124GB. If it shows only ~31GB, set UMA Frame Buffer to 512MB in BIOS.
- Different model/quantization -- The 87 t/s is specifically Qwen3-Coder-30B-A3B UD-Q4_K_XL via RADV. Larger or denser models are slower.
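The checkable items above, bundled into one paste-able block (a convenience sketch; paths are the standard tuned/Mesa/amdgpu ones, adjust the card index if yours differs):

```bash
# One-shot performance health check
tuned-adm active                                          # expect: accelerator-performance
vulkaninfo --summary | grep -i driverinfo                 # expect: Mesa 26.0.2+
cat /sys/class/drm/card*/device/pp_dpm_sclk | grep '\*'   # expect: ~2900MHz marked active
free -h | awk '/^Mem:/ {print "Total RAM:", $2}'          # expect: ~124Gi
```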
Can I use this for AI coding assistants like Cursor or Continue.dev?
Yes. Ollama provides an OpenAI-compatible API at http://localhost:11434/v1. You can point any tool that supports OpenAI API to your local Ollama:
```bash
# In Continue.dev, Cursor, or any OpenAI-compatible client:
# Base URL: http://localhost:11434/v1
# Model: qwen3.5:35b-a3b
# API Key: (leave empty or use "ollama")
```

At 48 t/s, local inference feels instant for code completion and review tasks.
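Before pointing an editor at it, you can confirm the endpoint is live -- Ollama's OpenAI-compatible API also serves a model list (`jq` optional):

```bash
# Verify the endpoint and see the exact model names to use
curl -s http://localhost:11434/v1/models | jq .
```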
Can I run image generation (Stable Diffusion, Flux)?
Yes. kyuz0's ComfyUI toolboxes provide ROCm containers for image and video generation on gfx1151, supporting Flux, Wan 2.2, and Hunyuan models.
Can I fine-tune models on this hardware?
Yes, with limitations. QLoRA fine-tuning of 7B-30B models works via kyuz0's fine-tuning toolbox. Full fine-tuning of large models is not practical due to memory bandwidth constraints compared to datacenter GPUs.
- kyuz0 -- Maintainer of the Strix Halo toolbox ecosystem, community standard containers
- lhl -- Deep performance research, rocWMMA patches, IOMMU/bandwidth testing
- pablo-ross -- Original GMKtec EVO-X2 setup guide
- TechnigmaAI / Hardware Corner -- Alternative optimization guide
- AMD -- Trillion-parameter LLM clustering article
- Lychee-Technology -- Pre-built llama.cpp binaries for gfx1151
- kisak-mesa PPA -- Latest Mesa drivers for Ubuntu
- GPUOpen-Drivers/AMDVLK -- AMD's former open-source Vulkan driver (now discontinued)
Found something that's wrong, outdated, or missing?
- Open an issue with your hardware, kernel version, and benchmark results
- PRs welcome -- especially from other Strix Halo systems (Framework, GMKtec, HP ZBook)
- If you find a new optimization, include before/after benchmarks (a llama-bench template is sketched below)
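For before/after numbers, llama-bench is the cleanest tool since it eliminates server overhead. A minimal template, with the model path and build directory as placeholders:

```bash
# Run once on the old build and once on the new, then paste both tables in the PR
AMD_VULKAN_ICD=RADV ./build-vulkan/bin/llama-bench \
  -m ~/models/your-model.gguf \
  -ngl 999 -fa 1 -p 512 -n 128 -r 3 -o md
```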
- AMDVLK ICD hijacking discovered: all "pp regression" findings (b8460 vs b8933, Mesa 26.0.2 vs 26.0.5) were caused by AMDVLK's `/etc/vulkan/icd.d/amd_icd64.json` silently overriding RADV. No actual regression exists. Corrected in #22375. All benchmarks re-verified on actual RADV.
- Qwen3.6-35B-A3B benchmark: 64 t/s tg, 1064 pp512 via Vulkan RADV. Drop-in replacement for Qwen3.5 with better coding/reasoning quality, identical speed. Use Q4_K_M -- UD-Q4_K_M costs 13% speed (56.6 t/s).
- Qwen3-Next 80B-A3B benchmark: 55 t/s tg, 657 pp512 via Vulkan RADV (b8933). 80B MoE (3B active) with 256K context window. Faster than the 51B dense Qwen3-Coder-Next (38 t/s)
- Gemma 4 26B-A4B benchmark: 48.5 t/s tg, 1142 pp512 via Vulkan RADV (b8933). First Strix Halo benchmark for this model. Includes KV cache quantization warning (3.5x worse quality degradation vs Qwen at q8_0)
- Llama 4 Scout 109B benchmark: 18.3 t/s tg, 331 pp512 via Vulkan RADV (b8933). 109B parameter model running on a mini PC -- RTX 4090 can't load this
- Merged PR #1: vulkan-tools install check in setup.sh (thanks @ignasivt)
- Updated all prices: Beelink $2,999 to $3,299, Corsair $2,700 to $3,399, GMKtec $2,199 to ~$2,349
- Added linux-firmware-20251125 source attribution and downgrade instructions
- Added Ubuntu 26.04 LTS note (Wayland-only, testing in progress)
- Ollama upgraded to 0.21.2: FA now enabled by default. Qwen3.6 via Ollama: 45.5 t/s (vs 64 t/s llama-bench direct, ~30% overhead)
- Ollama ROCm confirmed working on gfx1151 with `HSA_OVERRIDE_GFX_VERSION=11.5.1` (Ollama 0.20.4). Benchmarked: 42.4 t/s tg vs Vulkan's 46.6 t/s (-9%). Vulkan still recommended for speed.
Performance discoveries:
- llama.cpp b8298 to b8460 = +25% tg and +24% pp on MoE models (52 to 65 t/s tg, 868 to 1080 pp512)
- Key PRs: #19625 (FA refactor), #20551 (graphics queue), #20334 (GDN shader)
- +25% breaks down as ~14% generic (both backends got this) + ~11% Vulkan-specific
- Dense models show <2% change (already at bandwidth ceiling)
- RADV now beats AMDVLK on both pp AND tg with latest build (old AMDVLK tg advantage gone)
- Exceeded theoretical tg ceiling: measured 65 t/s vs a calculated max of ~57 t/s. The standard formula (bandwidth / active_model_size) underestimates MoE performance because it ignores caching and memory-access optimizations in newer llama.cpp builds. The real ceiling is a moving target (back-of-envelope math after this list).
- RADV now beats ROCm on both pp (1080 vs 1047) and tg (65 vs 55) on same b8460 build
- ROCm works on kernel 6.19.4 with `HSA_OVERRIDE_GFX_VERSION=11.5.1` + `HSA_ENABLE_SDMA=0`
- ROCm b8460 got +14% tg from generic improvements (47.87 to 54.67 t/s)
- Batch/ubatch sweep: default 512 is optimal, no tuning headroom left
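Back-of-envelope version of that ceiling, using only numbers measured in this guide (the ~3.8 GB/token figure is back-solved from the ~57 t/s estimate, not measured directly):

```
naive ceiling:  tg_max ≈ bandwidth / bytes_read_per_token
                       ≈ 215 GB/s / ~3.8 GB ≈ 57 t/s   (the "calculated max")

measured 65 t/s implies: 215 / 65 ≈ 3.3 GB actually read per token
```

In other words, newer builds touch roughly half a gigabyte less per token than the naive estimate assumes -- that gap is the caching win.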
New benchmarks:
- Llama 3.1 70B (4.8 t/s, 94% of theoretical ceiling, doesn't fit on RTX 4090)
- Qwen3-Coder-30B UD-Q4_K_XL (87 t/s tg via RADV)
- Qwen3-0.6B (266 t/s tg, 13,112 pp512)
- Extended context scaling (pp flat from 512 to 8K, only 3% drop)
Beginner content:
- Ollama vs llama.cpp FAQ with browser analogy and llama-server setup
- Model recommendation guide (10 use cases)
- Cost comparison (local vs cloud with break-even analysis)
- Buying guide (7 systems with March 2026 verified prices, Beelink v1 board warning)
- Glossary (20+ terms for beginners)
- FAQ (8 common questions)
- Use cases (Claude Code, Cursor, RAG, image gen, TTS)
- Windows vs Linux comparison
Infrastructure:
- One-command setup script (`setup.sh`)
- Auto-update script for llama.cpp (`update-and-build.sh`)
- CONTRIBUTING.md and 3 GitHub issue templates
- GitHub release v1.0.0
- 19 topics for discoverability
- GitHub stars + last-commit badges
Fixes:
- All prices verified against current retail (March 2026 snapshot)
- DGX Spark comparison is now apples-to-apples (same model, same context)
- Fixed 12 outdated "ROCm broken on 6.19.x" references
- BIOS VRAM 512MB is mandatory, not just speed-neutral
- Vulkan Driver Comparison updated with b8460 data
- RADV_PERFTEST env vars (cswave32, nogttspill) tested: both are ~10% slower. Don't use them.
- Posted findings on llama.cpp Vulkan discussion
- Complete rewrite with live benchmarks on current system
- Added: Kernel 6.19.x ROCm fix (HSA_OVERRIDE_GFX_VERSION=11.5.1)
- Added: Mesa 26.0.2 results (+4-5% tg improvement over 26.0.1)
- Added: qwen3-coder:30b-a3b-q8_0 benchmarks (51.4 t/s -- fastest model)
- Added: Long context performance data from lhl (Vulkan vs ROCm at 32K)
- Added: rocWMMA status update (upstream broken, lhl's tuned branch works)
- Added: vLLM setup and known issues
- Added: RDMA clustering section
- Added: Kernel/ROCm compatibility matrix
- Added: linux-firmware-20251125 warning
- Added: LLVM compiler regression workaround
- Added: Qwen3.5 ROCm hang bug (ROCm #6027)
- Added: Backend decision guide
- Added: Testing checklist
- Added: Collapsible troubleshooting sections
- Updated: ROCm HIP works on kernel 6.19.4 with HSA override (even +6% faster pp than 6.18.14)
- Updated: All benchmark numbers re-measured
- Updated: Replaced `nano` instructions with `tee` for copy-paste-ready commands
- Corrected: rocWMMA is no longer a blanket "don't use" -- lhl's tuned branch is best for long context
- Corrected: `iommu=pt` has no benefit -- use `amd_iommu=off` instead
- Basic setup guide based on pablo-ross' GMKtec guide
- Ollama Vulkan configuration
- ROCm container setup
- Gorgon Halo (Ryzen AI Max 400, Q4 2026): Same architecture, higher clocks
- Medusa Halo (Ryzen AI Max 500): LPDDR6, ~80% more memory bandwidth
- Lemonade 10.0 (March 2026): First Linux NPU support for LLM inference via FastFlowLM
- AMD Variable Graphics Memory (Windows): Up to 128B parameter models in Vulkan llama.cpp
MIT
Found this guide useful? Give it a star on GitHub -- it helps other Strix Halo owners find it. Found something wrong? Open an issue.