65-87 t/s local LLM inference on a $3,299 mini PC. Within 5% of the $4,699 DGX Spark. No cloud, no subscription.
If this guide saves you time, consider giving it a star -- it helps others find it.
You are here What you'll get
+-----------+ +---------------------------+
| Strix | 30 min | 87 t/s on 30B MoE models |
| Halo | ==============> | 65 t/s on 35B models |
| mini PC | this guide | 70B+ models on one device |
+-----------+ | No cloud. No subscription |
+---------------------------+
One-Command Setup | Quick Start | Benchmarks | Which Model? | Which Backend? | What NOT To Do | Glossary
A complete guide for running local LLMs on AMD Ryzen AI MAX+ 395 (Strix Halo) with llama.cpp, Ollama, Vulkan, and ROCm. Several Strix Halo guides exist. This one is different:
- Every number is measured on this machine. No theoretical estimates, no copy-pasted specs. Every benchmark was run on a Beelink GTR9 Pro with timestamps.
- We document what does NOT work. Most guides only tell you what to enable. We tested optimizations that turned out to be regressions, driver versions that crash, and parameters that do nothing. That info is harder to find and more valuable.
- We track the moving target. Strix Halo support changes rapidly. This guide is updated with each change, noting what broke and what improved.
- We compare backends with data. Vulkan (RADV vs AMDVLK) vs ROCm HIP vs vLLM -- each has strengths. We measured them all.
- We explain everything. New to local LLMs? See the Glossary. Not sure which model to pick? See the Model Guide.
Built on findings from: kyuz0/amd-strix-halo-toolboxes (1.2k stars, community standard), lhl/strix-halo-testing (deepest research), and our own extensive testing.
If you've already set your BIOS (UMA = 512MB, IOMMU = off) and installed Ubuntu 24.04:
curl -fsSL https://raw.githubusercontent.com/hogeheer499-commits/strix-halo-guide/main/setup.sh | bash
This installs everything, configures Ollama with Vulkan, pulls a model, and runs a benchmark. Takes ~10 minutes (plus model download time). For manual step-by-step setup, see Quick Start.
- Hardware
- What You Can Run
- Benchmark Results
- Backend Decision Guide
- Quick Start (6 Steps)
- Phase 1: BIOS Configuration
- Phase 2: Ubuntu 24.04 Installation
- Phase 3: Kernel Configuration
- Phase 4: Performance Tuning
- Phase 5: Ollama Setup (Vulkan)
- Phase 6: Benchmarking
- Phase 7: ROCm with llama.cpp (Containers)
- Phase 8: vLLM Serving
- Phase 9: Multi-Node Clustering (RDMA)
- Phase 10: SSH and Remote Access
- Vulkan Driver Comparison
- Key Findings and Corrections
- Known Issues
- Troubleshooting
- Kernel and ROCm Compatibility
- Testing Checklist
- Model Recommendation Guide
- Cost: Local vs Cloud
- Buying Guide
- Glossary
- FAQ
- Community Resources
- Credits and References
- Contributing
- Changelog
- License
| System | CPU | GPU | RAM | Notes |
|---|---|---|---|---|
| Beelink GTR9 Pro | Ryzen AI MAX+ 395 | Radeon 8060S (40 CU) | 128GB LPDDR5X-8000 | This guide's primary test system |
| Framework Desktop 13 | Ryzen AI MAX+ 395 | Radeon 8060S (40 CU) | 128GB LPDDR5X-8000 | Used by kyuz0, lhl |
| GMKtec EVO-X2 | Ryzen AI MAX+ 395 | Radeon 8060S (40 CU) | 128GB LPDDR5X-8000 | pablo-ross guide |
| HP ZBook Ultra G1a | Ryzen AI MAX+ 395 | Radeon 8060S (40 CU) | 128GB LPDDR5X-8000 | Workstation laptop |
| Component | Spec |
|---|---|
| CPU | AMD Ryzen AI MAX+ 395 (16 cores / 32 threads, Zen 5) |
| GPU | Radeon 8060S (gfx1151, RDNA 3.5, 40 CUs) |
| RAM | 128GB unified LPDDR5X-8000 (~215 GB/s measured, 256 GB/s theoretical) |
| NPU | RyzenAI-npu5 (XDNA 2) |
Why this hardware? 128GB unified memory shared between CPU and GPU means you can run 70B+ models entirely on the GPU -- something an RTX 4090 (24GB VRAM) cannot do. You trade raw bandwidth (~215 GB/s vs ~1 TB/s) for the ability to run much larger, smarter models at a lower price ($3,299 vs $4,699 for the DGX Spark).
Real-world generation speeds measured on the Beelink GTR9 Pro (Vulkan RADV). Speeds marked with * are via llama-bench direct; others are via Ollama.
| Model | Size | Type | Generation Speed | Use Case |
|---|---|---|---|---|
| Qwen3-0.6B (Q8_0) | 0.8 GB | Dense | 266 t/s * | Ultra-fast tiny model |
| Llama 2 7B | 3.8 GB | Dense | 48-52 t/s | Testing, lightweight tasks |
| Qwen2.5-VL 7B | 6.0 GB | Vision | 21.4 t/s | Image understanding |
| Gemma 4 26B-A4B (UD-Q4_K_M) | 15.7 GB | MoE | 48.5 t/s * | Google's latest MoE, strong reasoning |
| Qwen3-Coder 30B-A3B (UD-Q4_K_XL) | 16.5 GB | MoE | 87 t/s * | Best speed/quality ratio |
| Qwen3.6 35B-A3B (Q4_K_M) | 20 GB | MoE | 64 t/s * | Best all-rounder, drop-in upgrade from 3.5 |
| Qwen3.5 35B-A3B | 23 GB | MoE | 48-65 t/s | General purpose, coding (65 with latest llama.cpp) |
| Qwen3-Coder 30B-A3B (Q8_0) | 32 GB | MoE | 51 t/s | Coding (highest quality MoE) |
| Qwen3-Coder-Next | 51 GB | Dense | 38-39 t/s | Large dense model |
| Llama 3.1 70B (Q4_K_M) | 42 GB | Dense | 4.7-4.9 t/s | 70B intelligence, doesn't fit on RTX 4090 |
| Llama 4 Scout 109B (Q4_K_M) | 61 GB | MoE | 18.3 t/s * | 109B params on a mini PC -- RTX 4090 can't even load this |
| gpt-oss-120b | ~70 GB | MoE | ~34-38 t/s | Largest practical model |
| Qwen3-Next 80B-A3B (UD-Q4_K_XL) | 42.9 GB | MoE | 55 t/s * | 80B model, 256K context -- faster than dense 51B |
| Kimi K2.5 1T (4-node cluster) | ~500 GB | MoE | distributed | AMD technical article |
All benchmarks run on 2026-03-20 and 2026-03-21. System: Beelink GTR9 Pro, kernel 6.19.4, tuned accelerator-performance active.
Qwen3.6-35B-A3B (Q4_K_M, ~20GB, MoE -- Ollama 0.21.2):
| Prompt Tokens | Prompt Eval | Generation | Notes |
|---|---|---|---|
| 20 | 163 t/s | 45.6 t/s | ~30% slower than llama-bench direct (64 t/s) |
| 22 | 174 t/s | 45.4 t/s | Ollama overhead is real but acceptable |
Qwen3.5-35B-A3B (Q4_K_M, ~23GB, MoE -- Ollama 0.20.4):
| Prompt Tokens | Prompt Eval | Generation | vs Previous (Mesa 26.0.1) |
|---|---|---|---|
| 14 | 121.3 t/s | 48.0 t/s | tg +4.8% |
| 23 | 182.3 t/s | 47.5 t/s | tg +4.4% |
| 122 | 456.7 t/s | 47.4 t/s | tg +4.2% |
Qwen3-Coder 30B-A3B (Q8_0, ~32GB, MoE):
| Prompt Tokens | Prompt Eval | Generation | Notes |
|---|---|---|---|
| 12 | 118.3 t/s | 51.4 t/s | Fastest via Ollama |
| 21 | 205.2 t/s | 51.3 t/s | Higher quality than Q4_K_M |
Qwen3-Coder-Next (~51GB, dense):
| Prompt Tokens | Prompt Eval | Generation | vs Previous |
|---|---|---|---|
| 12 | 90.7 t/s | 39.1 t/s | tg +2.9% |
| 21 | 129.5 t/s | 38.4 t/s | tg +3.8% |
| 120 | 301.2 t/s | 37.9 t/s | NEW |
Other Models:
| Model | Size | Prompt Tokens | pp (t/s) | tg (t/s) |
|---|---|---|---|---|
| Llama 2 7B | 3.8 GB | 24 | 384.6 | 52.0 |
| Qwen2.5-VL 7B | 6.0 GB | 23 | 81.7 | 21.4 |
| Qwen3.5 35B (no-think) | 23 GB | 14 | 127.1 | 47.4 |
Llama 3.1 70B (Q4_K_M, 42GB, Dense -- the "doesn't fit on RTX 4090" showcase):
| Prompt Tokens | Prompt Eval | Generation | Notes |
|---|---|---|---|
| 14 | 22.1 t/s | 4.9 t/s | Cold start |
| 23 | 36.8 t/s | 4.8 t/s | Realistic chat |
| 122 | 79.6 t/s | 4.7 t/s | Long prompt |
Why so slow? This is a 42GB dense model -- every token reads all 42GB of weights. At ~215 GB/s bandwidth, the theoretical maximum is 215/42 = 5.1 t/s. We hit 4.8 t/s = 94% of the theoretical ceiling. The model is slow not because of poor optimization, but because it's massive. An RTX 4090 (24GB VRAM) cannot run this model at all. This is the Strix Halo advantage: running models that don't fit on consumer GPUs.
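The same ceiling arithmetic works for any dense model. A quick sketch (bandwidth and weight size are the only inputs; substitute your own measurements):

```bash
# Theoretical tg ceiling = memory bandwidth / bytes of weights read per token
python3 -c "
bw_gbs, model_gb = 215, 42   # measured bandwidth (GB/s), weight size on disk (GB)
print(f'ceiling: {bw_gbs/model_gb:.1f} t/s')   # -> 5.1 t/s for Llama 3.1 70B Q4_K_M
"
```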
What improved? Mesa 26.0.1 to 26.0.2 plus enabling the tuned accelerator-performance profile gave a consistent +4-5% generation speed improvement across all models.
UPDATE (2026-03-21): Updating llama.cpp from b8298 to b8460 gave +25% on both pp and tg for MoE models. The new build includes a Vulkan Flash Attention refactor (PR #19625), graphics queue optimization for AMD (PR #20551), and GDN shader support for Qwen3.5 (PR #20334).
Important caveats:
- The +25% improvement is specific to MoE models on Vulkan due to the Wave32 FA refactor and graphics queue change. Dense models (Llama 2 7B, Llama 3.1 70B) showed minimal change (<2%) because they were already at the memory bandwidth ceiling.
- If you use kyuz0's containers, you get these updates automatically -- the containers rebuild on every llama.cpp master update. kyuz0's toolboxes remain the easiest way to stay current. Our finding here validates the importance of their approach.
- WARNING: AMDVLK silently overrides RADV. If AMDVLK is installed, its /etc/vulkan/icd.d/amd_icd64.json takes priority over RADV. This halves your pp speed (1080 → 660 pp512) without any visible error. Always set AMD_VULKAN_ICD=RADV or uninstall AMDVLK entirely: sudo dpkg -r amdvlk && sudo rm -f /etc/vulkan/icd.d/amd_icd64.json. Check your driver: RADV shows (RADV STRIX_HALO) (radv) with shared memory: 65536 in llama-bench output. AMDVLK shows (AMD open-source driver) with shared memory: 32768. We originally reported this as a llama.cpp regression -- it wasn't.
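A quick way to check for a stray AMDVLK ICD without running a benchmark (a sketch; vulkaninfo wording varies by Mesa version):

```bash
# If amd_icd64.json is listed here, AMDVLK can hijack Vulkan sessions
ls /etc/vulkan/icd.d/ 2>/dev/null
# Confirm which driver Vulkan actually reports
vulkaninfo --summary 2>/dev/null | grep -E "driverName|driverInfo"   # expect radv / Mesa
```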
Qwen3.5-35B-A3B (Q4_K_M, 19.9GB, MoE) -- the biggest improvement:
| Build | Driver | pp128 | pp512 | tg128 | vs old RADV |
|---|---|---|---|---|---|
| b8460 (latest) | RADV | 623 | 1080 | 64.85 | pp +24%, tg +25% |
| b8460 (latest) | AMDVLK | 521 | 663 | 64.10 | pp -24%, tg +23% |
| b8298 (kyuz0) | RADV | 583 | 868 | 52.06 | baseline |
| b8298 (kyuz0) | AMDVLK | 479 | 576 | 56.08 | - |
RADV now wins on EVERYTHING. The old AMDVLK tg advantage (+7.7%) is gone. With the latest build, RADV is faster on both pp (+63% over AMDVLK) and tg (+1.2% over AMDVLK). Use RADV. AMDVLK is discontinued -- uninstall it to avoid silent ICD hijacking.
Extended context scaling (latest build, RADV):
| pp512 | pp2048 | pp4096 | pp8192 | Drop at 8K |
|---|---|---|---|---|
| 1080 | 1057 | 1049 | 1049 | -3% |
pp is virtually flat from 512 to 8192 tokens. Only 3% drop at 8K context.
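To reproduce this sweep on your own model, llama-bench accepts comma-separated prompt lengths (same flags as elsewhere in this guide; -n 0 skips generation to isolate pp):

```bash
AMD_VULKAN_ICD=RADV ./build/bin/llama-bench \
  -m ~/models/your-model.gguf \
  -fa 1 -ngl 999 -mmp 0 \
  -p 512,2048,4096,8192 -n 0
```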
Qwen3-Coder 30B-A3B (UD-Q4_K_XL, 16.5GB, MoE):
| Build | Driver | pp512 | tg128 | Notes |
|---|---|---|---|---|
| b8460 (latest) | RADV | 1342 | 87.11 | Already at bandwidth ceiling |
| b8298 (kyuz0) | RADV | 1350 | 86.81 | ~same (model was already at ceiling) |
The 30B model shows minimal improvement because it was already hitting the memory bandwidth ceiling at 87 t/s. The 35B model had more headroom, which the new build exploited.
Gemma 4 26B-A4B (UD-Q4_K_M, 15.7GB, MoE) -- tested on b8933 (earliest build with Gemma 4 support):
| Build | Driver | pp512 | tg128 | Notes |
|---|---|---|---|---|
| b8933 | RADV | 1142 | 48.46 | Google's latest MoE |
Gemma 4 is architecturally slower on tg than Qwen MoE models despite similar size. The reason: head_dim 256/512 (vs Qwen's 128) makes flash attention less efficient, mixed sliding-window/full attention adds overhead, and 3.8B active params vs Qwen's 3.3B. This is not a llama.cpp issue -- it's inherent to the model design. 48.5 t/s is still 3x human reading speed and very usable for interactive chat.
WARNING: Gemma 4 is extremely sensitive to KV cache quantization. Using q8_0 KV cache causes 3.5x worse quality degradation compared to Qwen models. Stick with f16 KV cache for Gemma 4. Do NOT use --cache-type-k q4_0.
Llama 4 Scout 109B (Q4_K_M, 60.9GB, MoE -- 109B total params, 17B active):
| Build | Driver | pp512 | tg128 | Notes |
|---|---|---|---|---|
| b8933 | RADV | 331 | 18.32 | 109B model running on a mini PC |
A 109 billion parameter model running at 18.3 t/s on a $3,299 mini PC. An RTX 4090 (24GB VRAM) cannot even load this model. The speed is bandwidth-limited at 17B active parameters -- theoretical max is ~25 t/s at 215 GB/s, we hit 73% of that ceiling.
Qwen3-Next 80B-A3B (UD-Q4_K_XL, 42.9GB, MoE -- 80B total params, 3B active, 256K context):
| Build | Driver | pp512 | tg128 | Notes |
|---|---|---|---|---|
| b8933 | RADV | 657 | 54.92 | 80B model at 55 t/s |
80 billion parameters running at 55 t/s on a mini PC. This is the largest Qwen3-family MoE model -- 80B total with only 3B active parameters and a 256K context window. Despite being 42.9 GB on disk, the MoE routing keeps only 3B params active per token, making it faster than the 51B dense Qwen3-Coder-Next (38 t/s).
Qwen3.6-35B-A3B (Q4_K_M, 19.9GB, MoE -- drop-in upgrade from Qwen3.5, released April 2026):
| Build | Driver | pp512 | tg128 | Notes |
|---|---|---|---|---|
| b8460 | RADV | 1064 | 63.76 | Same speed as Qwen3.5 |
| b8933 | RADV | 1040 | 63.66 | No regression between builds |
Qwen3.6 is a drop-in replacement for Qwen3.5 with significantly improved coding and reasoning quality (same architecture, same active parameters, identical speed). Use Q4_K_M, not UD-Q4_K_M -- Unsloth Dynamic quantization costs 13% tg speed (56.6 vs 64.1 t/s) due to mixed-precision layers, with minimal quality benefit at this quant level.
ROCm HIP -- now working on kernel 6.19.4!
We discovered that HSA_OVERRIDE_GFX_VERSION=11.5.1 + HSA_ENABLE_SDMA=0 fixes the ROCm segfault on kernel 6.19.x. We also rebuilt ROCm with the same b8460 source to make the comparison fair:
| Build | pp128 | pp512 | tg128 | Notes |
|---|---|---|---|---|
| b8460 (latest, kernel 6.19.4) | 547 | 1047 | 54.67 | tg +14% vs b8301 |
| b8301 (self-compiled, kernel 6.19.4) | 542 | 1059 | 47.87 | old build |
| b8301 (self-compiled, kernel 6.18.14) | 488 | 996 | 48.80 | previous best |
ROCm also improved with the latest build: tg went from 47.87 to 54.67 (+14%) thanks to generic llama.cpp optimizations. But Vulkan RADV is still faster on both pp and tg: RADV 1080 vs ROCm 1047 pp512 (+3%), RADV 64.85 vs ROCm 54.67 tg128 (+19%). The +25% Vulkan improvement was ~14% generic (ROCm got this too) plus ~11% Vulkan-specific (FA refactor, graphics queue). ROCm's remaining advantage is hipBLASLt and rocWMMA at very long context (32K+).
Build version matters enormously:
| What we tested | pp512 | tg128 | Lesson |
|---|---|---|---|
| Ollama Vulkan RADV (b8298) | ~457 (via API) | 47.4 | Ollama adds overhead |
| llama-bench RADV (b8298) | 868 | 52.06 | Eliminating Ollama helps |
| llama-bench RADV (b8460) | 1080 | 64.85 | Updating llama.cpp = +25% |
| ROCm HIP (b8301, HSA fix) | 1059 | 47.87 | Old build, unfair comparison |
| ROCm HIP (b8460, HSA fix) | 1047 | 54.67 | ROCm got +14% tg from same update |
The single biggest optimization you can make is updating llama.cpp to the latest build. It gave us more improvement (+25% on MoE models) than all kernel tuning, batch size sweeps, and driver comparisons combined. This is counter-intuitive -- people spend hours on kernel parameters, GRUB flags, and Mesa versions, while git pull && cmake --build delivers more than everything else put together. Note: this applies to MoE models specifically. Dense models were already at the bandwidth ceiling and show <2% change.
Batch size and ubatch tuning results (b8298, for reference):
We swept batch sizes 64-2048 and ubatch sizes 32-1024. Result: default 512 is optimal. No headroom via tuning -- the improvement came from updating the build.
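For reference, the sweep itself is a one-liner -- llama-bench takes comma-separated lists for both flags, so you can verify the "default 512 is optimal" result on your own hardware:

```bash
AMD_VULKAN_ICD=RADV ./build/bin/llama-bench \
  -m ~/models/your-model.gguf \
  -fa 1 -ngl 999 -mmp 0 \
  -b 64,128,256,512,1024,2048 -ub 32,64,128,256,512,1024 \
  -p 512 -n 128
```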
How to build the latest llama.cpp with Vulkan:
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
CC=/usr/bin/gcc CXX=/usr/bin/g++ cmake -B build -S . \
-DGGML_VULKAN=ON \
-DCMAKE_BUILD_TYPE=Release \
-G "Unix Makefiles"
cmake --build build -j$(nproc)
# Benchmark
AMD_VULKAN_ICD=RADV ./build/bin/llama-bench \
-m ~/models/your-model.gguf \
-fa 1 -ngl 999 -mmp 0 -p 512 -n 128
ROCm on kernel 6.19.x (the fix):
# Add these environment variables before running llama-bench:
export HSA_OVERRIDE_GFX_VERSION=11.5.1
export HSA_ENABLE_SDMA=0
export ROCBLAS_USE_HIPBLASLT=1
Llama 2 7B (Q4_K_M, 3.8GB, Dense):
| Driver | pp128 | pp512 | pp1024 | tg128 |
|---|---|---|---|---|
| RADV | 1154 | 1377 | 1356 | 48.12 |
| AMDVLK | 335 | 327 | 325 | 48.02 |
AMDVLK is 3-4X slower on pp for dense models (2 GiB buffer limit). Use RADV.
Qwen3-0.6B (Q8_0, 762MB, Dense) -- maximum throughput:
| Driver | pp128 | pp512 | tg128 |
|---|---|---|---|
| RADV | 10,313 | 13,112 | 266 |
NOTE (March 2026): Kernel 6.19.x misidentifies gfx1151 as gfx1100 for ROCm, but this is fixable with HSA_OVERRIDE_GFX_VERSION=11.5.1 and HSA_ENABLE_SDMA=0. See ROCm on kernel 6.19.x for the full fix. Without these environment variables, ROCm containers will segfault.
Previous results on kernel 6.18.14 (for reference -- these worked):
| Build | Model | pp128 | pp512 | tg128 |
|---|---|---|---|---|
| Self-compiled b8301, FA on, -mmp 0 | Qwen3.5-35B-A3B Q4_K_M | 488 | 996 | 48.8 |
| kyuz0 b8298, FA on | Qwen3.5-35B-A3B Q4_K_M | 306 | 520 | 55.3 |
| kyuz0 b8298, FA off | Qwen3.5-35B-A3B Q4_K_M | 352 | 524 | 53.8 |
| kyuz0 b8189, FA + hipBLASLt | Llama 2 7B Q4_K_M | 1163 | 1261 | 45.07 |
Vulkan llama-bench Direct (kyuz0 containers, b8298) -- March 2026:
| Driver | Model | pp128 | pp256 | pp512 | pp1024 | tg128 |
|---|---|---|---|---|---|---|
| RADV | Qwen3.5-35B-A3B Q4_K_M | 503.67 | - | 858.88 | - | 52.15 |
| AMDVLK | Qwen3.5-35B-A3B Q4_K_M | 477.28 | - | 575.59 | - | 55.54 |
| RADV | Llama 2 7B Q4_K_M | 1153.53 | 1364.45 | 1377.18 | 1355.88 | 48.12 |
| AMDVLK | Llama 2 7B Q4_K_M | 334.50 | 337.96 | 327.35 | 325.33 | 48.02 |
Critical finding (b8298): AMDVLK has a 2 GiB single buffer allocation limit that cripples pp on dense models (3-4X slower on Llama 2 7B). On MoE models, AMDVLK was slightly faster on tg (+6.5%) with b8298, but this advantage disappeared with b8460 -- see the latest benchmarks where RADV wins on both pp and tg.
Vulkan RADV vs ROCm HIP (same build b8460, Qwen3.5-35B-A3B):
| Metric | Ollama (b8298) | Vulkan RADV (b8460) | ROCm HIP (b8460) | Best |
|---|---|---|---|---|
| pp512 | ~457 | 1080 | 1047 | Vulkan RADV |
| tg128 | 47.4 | 64.85 | 54.67 | Vulkan RADV |
Vulkan RADV wins on both pp and tg with the latest llama.cpp build. ROCm works on kernel 6.19.x with the HSA override fix but is no longer the fastest backend for MoE models. Use llama-bench or llama-server directly instead of Ollama to avoid the ~35% overhead.
Based on our measurements and lhl's detailed testing:
| Backend | Best For | pp (relative) | tg (relative) | Context Scaling | Setup Difficulty |
|---|---|---|---|---|---|
| Ollama + Vulkan RADV | General use, chat | Good | Good | Degrades at 8K+ | Easiest |
| llama.cpp + Vulkan RADV (container) | Max speed, no overhead | Best | Best (short ctx) | Degrades at 8K+ | Easy |
| llama.cpp + Vulkan AMDVLK | Not recommended | Slower than RADV on b8460 | Slower on dense (2 GiB limit) | Degrades at 8K+ | Easy |
| ROCm HIP | Batch processing | Excellent | Good | Poor at 32K+ | Medium (needs HSA fix on 6.19.x) |
| ROCm + rocWMMA (tuned) | Long context | Excellent | Best at 32K | Best scaling | Very hard |
| vLLM (TheRock) | API serving | Good | Good | Good | Hard |
| Hardware | Bandwidth | tg (MoE ~30B) | Max Model Size | Price |
|---|---|---|---|---|
| RTX 4090 | ~1008 GB/s | 100-122 t/s | 24 GB | ~$1600 GPU only |
| RTX 3090 | ~936 GB/s | 100-112 t/s | 24 GB | ~$800 used |
| Apple Mac Studio M4 Max 128GB | ~546 GB/s | ~100 t/s (MLX) | 128 GB | $3,699 |
| Beelink GTR9 Pro | ~215 GB/s | 65-87 t/s | 120+ GB | $3,299 |
| NVIDIA DGX Spark | ~273 GB/s | 52-56 t/s (120B) | 128 GB | $4,699 |
Apples-to-apples (gpt-oss-120b, same model, both platforms): Strix Halo gets 50-53 t/s vs DGX Spark's 52-56 t/s -- within 5-10% on the same workload, while costing $1,400 less ($3,299 vs $4,699). On smaller MoE models (Qwen3-30B), Strix Halo hits 87 t/s. The DGX Spark wins on prompt processing (3-5X faster) and long context (23%+ faster at 32K). Source: Framework Community, lhl.
Based on lhl's measurements with gpt-oss-120b (tg32):
| Context | Vulkan AMDVLK | ROCm Standard | ROCm rocWMMA-tuned |
|---|---|---|---|
| 2K | 50.05 t/s | 46.56 t/s | 48.97 t/s |
| 4K | 46.11 t/s | 38.25 t/s | 45.42 t/s |
| 8K | 43.15 t/s | 32.65 t/s | 43.55 t/s |
| 16K | 38.46 t/s | 25.50 t/s | 40.91 t/s |
| 32K | 31.54 t/s | 17.82 t/s | 36.43 t/s |
At 32K context, standard ROCm drops to 17.82 t/s. Vulkan holds at 31.54 t/s (1.8X faster). But lhl's tuned rocWMMA branch is the overall winner at 36.43 t/s -- 2X faster than standard ROCm and 15% faster than Vulkan at 32K.
At extreme context (130K tokens, from strixhalo.wiki):
| Backend | pp512 (t/s) | tg128 (t/s) |
|---|---|---|
| Vulkan RADV | 17 | 13 |
| ROCm | 41 | 5 |
| ROCm rocWMMA-tuned | 51 | 13 |
Which backend should I use?
|
Do you need long context (>32K)?
/ \
NO YES
| |
Just want it easy? ROCm + rocWMMA-tuned
/ \ (lhl's branch)
YES NO Best for 32K+ context
| |
Ollama + Build latest
Vulkan RADV llama.cpp yourself
| |
"It just llama-server +
works" Vulkan RADV
48 t/s 65 t/s
For those who want to get running as fast as possible:
- BIOS: Set UMA Frame Buffer to 512MB, disable IOMMU
- Install Ubuntu 24.04 LTS, switch to X11
- Kernel params: Add amd_iommu=off amdgpu.gttsize=131072 ttm.pages_limit=31457280 to GRUB
- Performance: Install tuned, set the accelerator-performance profile, upgrade Mesa via kisak PPA
- Ollama: Install, configure the Vulkan backend with OLLAMA_VULKAN=1 and HIP_VISIBLE_DEVICES=-1
- Test: ollama run qwen3.5:35b-a3b -- expect ~48 t/s generation
Each step is detailed in the phases below.
Do this BEFORE installing the OS.
Navigate to Integrated Graphics then UMA Frame Buffer Size and set to 512MB.
Why? By default, the BIOS reserves ~97GB for GPU VRAM, leaving only ~31GB visible to the OS. Setting it to 512MB lets the OS see ~125GB RAM. This does NOT reduce GPU performance -- Vulkan uses GTT (system memory) anyway, so the GPU still has access to all 128GB for LLM inference. We benchmarked before and after: zero speed difference.
Find the IOMMU setting and set to Disabled.
Why? lhl's memory bandwidth testing shows amd_iommu=off gives ~6% better memory reads compared to default (234 vs 221 GB/s). iommu=pt (pass-through, recommended by some guides) gives no benefit over default. We use amd_iommu=off in the kernel command line as well, but disabling in BIOS ensures it's completely off. Only re-enable if you need VFIO/GPU passthrough or RDMA clustering.
Install Ubuntu 24.04 LTS Desktop with default settings. After installation:
sudo apt update && sudo apt upgrade -y
Wayland causes issues with RustDesk, Zoom screen sharing, and some GPU monitoring tools.
sudo tee -a /etc/gdm3/custom.conf > /dev/null << 'EOF'
WaylandEnable=false
EOF
If the line already exists (commented out), uncomment it instead. Reboot to apply.
Ubuntu 26.04 LTS (released April 2026) ships with Linux 7.0, Mesa 26.0, and native apt install rocm. However, 26.04 is Wayland-only (the X11 switch above does not work) and the performance-relevant components (kernel, Mesa RADV) are already available on 24.04 via the kisak PPA and mainline kernel PPA. Upgrading is not needed for LLM performance. This guide stays on 24.04 LTS.
CRITICAL: Kernel version matters enormously for Strix Halo.
- Kernel 6.18.4+ is the minimum stable version (older kernels have gfx1151 stability bugs)
- Kernel 6.19.x misidentifies gfx1151 as gfx1100 for ROCm -- fixable with HSA_OVERRIDE_GFX_VERSION=11.5.1 (see Known Issues)
- Recommended: Kernel 6.18.6+ or 6.19.x (6.19.x needs the HSA override for ROCm)
Check your kernel:
uname -r
sudo tee /tmp/grub_update.txt << 'EOF'
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amd_iommu=off amdgpu.gttsize=131072 ttm.pages_limit=31457280 amdgpu.cwsr_enable=0"
EOF
Then edit /etc/default/grub and replace the GRUB_CMDLINE_LINUX_DEFAULT line with the content above.
| Parameter | Purpose | Impact |
|---|---|---|
| amd_iommu=off | Disable IOMMU completely | +6% memory bandwidth (lhl) |
| amdgpu.gttsize=131072 | Set GTT (GPU-accessible system memory) to 128GB | Required for large models |
| ttm.pages_limit=31457280 | Set TTM page limit to ~120GB | Required for large models |
| amdgpu.cwsr_enable=0 | Disable compute wave save/restore | Not needed for LLM inference |
Note: kyuz0's toolboxes use iommu=pt instead of amd_iommu=off. We use off based on lhl's benchmark data showing ~6% better memory bandwidth. The difference is documented in kyuz0 issue #66. If you need RDMA clustering, use iommu=pt instead (RDMA NICs require IOMMU for DMA remapping).
Apply:
sudo update-grub
sudo tee /etc/modprobe.d/amdgpu_llm_optimized.conf > /dev/null << 'EOF'
options amdgpu gttsize=122800
options ttm pages_limit=31457280
options ttm page_pool_size=31457280
EOF
Update initramfs:
sudo update-initramfs -u -k all
sudo tee /etc/udev/rules.d/99-amd-kfd.rules > /dev/null << 'EOF'
SUBSYSTEM=="kfd", GROUP="render", MODE="0666"
SUBSYSTEM=="drm", KERNEL=="card[0-9]*", GROUP="render", MODE="0666"
SUBSYSTEM=="drm", KERNEL=="renderD[0-9]*", GROUP="render", MODE="0666"
EOF
IMPORTANT: The renderD[0-9]* rule is critical. Without it, you get HSA_STATUS_ERROR_OUT_OF_RESOURCES errors with ROCm.
Add your user to GPU groups:
sudo usermod -aG render $USER
sudo usermod -aG video $USER
Reload and reboot:
sudo udevadm control --reload-rules
sudo udevadm trigger
sudo reboot
sudo apt install tuned -y
sudo systemctl enable --now tuned
sudo tuned-adm profile accelerator-performance
Verify:
tuned-adm active
# Expected: Current active profile: accelerator-performance
Impact: +5-8% overall performance improvement. Memory bandwidth improves from ~221 GB/s to ~234 GB/s write. We measured +4-5% token generation improvement when tuned was running vs not running.
WARNING: tuned may not survive reboots on some systems. Add a check to your .bashrc or create a systemd service to verify it's running after boot (a sketch of the systemd approach follows).
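A minimal sketch of that systemd service, assuming tuned-adm lives at /usr/sbin/tuned-adm (check with command -v tuned-adm and adjust the path if needed):

```bash
sudo tee /etc/systemd/system/tuned-reapply.service > /dev/null << 'EOF'
[Unit]
Description=Re-apply tuned accelerator-performance profile at boot
After=tuned.service
Requires=tuned.service

[Service]
Type=oneshot
ExecStart=/usr/sbin/tuned-adm profile accelerator-performance

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable tuned-reapply.service
```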
The default Mesa on Ubuntu 24.04 is significantly slower. Upgrade to 26.0.2+:
sudo add-apt-repository ppa:kisak/kisak-mesa
sudo apt update
sudo apt upgrade -y
Verify:
vulkaninfo --summary 2>&1 | grep driverInfo
# Expected: driverInfo = Mesa 26.0.2 - kisak-mesa PPA
Impact: Mesa 25.2.8 to 26.0.1 gave +9% prompt eval (87 to 96 t/s). Mesa 26.0.1 to 26.0.2 gave an additional small improvement.
Note: You may see DKMS errors about mt76-mt7925 during the upgrade. These are harmless -- see Troubleshooting.
The GPU should run at its maximum clock speed (2900 MHz) during inference:
cat /sys/class/drm/card*/device/pp_dpm_sclk
# Expected: 2: 2900Mhz * (asterisk on highest clock)
GPU Clock Bug: On some kernel/firmware combinations, the GPU gets stuck at 900 MHz, causing ~8% performance loss. If your GPU is not at 2900 MHz during load, see Troubleshooting.
dpkg -l | grep linux-firmware | head -5
CRITICAL: Do NOT install linux-firmware-20251125. It breaks ROCm support on Strix Halo (confirmed by kyuz0 toolboxes). Symptoms: instability, crashes, ROCm containers failing to start. The safe versions are 20240318 or 20260110+. If you're on 20251125, downgrade immediately:
# Check your version
dpkg -l | grep linux-firmware
# If 20251125, hold the package to prevent auto-updates pulling it back
sudo apt-mark hold linux-firmware
Ollama is the easiest way to run LLMs on Strix Halo. With the right configuration, it works great.
curl -fsSL https://ollama.com/install.sh | sh
Update (April 2026): Ollama ROCm now works on gfx1151 with HSA_OVERRIDE_GFX_VERSION=11.5.1 (ollama/ollama#14855). However, Vulkan is still ~9% faster on token generation (46.6 vs 42.4 t/s on Qwen3.5-35B). We recommend Vulkan for best performance. If you need ROCm (for vLLM compatibility or other reasons), add HSA_OVERRIDE_GFX_VERSION=11.5.1 and HSA_ENABLE_SDMA=0 to your Ollama environment instead of the Vulkan variables below.
sudo systemctl edit ollama
Add between the comment lines:
[Service]
Environment="OLLAMA_VULKAN=1"
Environment="HIP_VISIBLE_DEVICES=-1"
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_CONTEXT_LENGTH=8192"
Environment="AMD_VULKAN_ICD=RADV"
Environment="VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/radeon_icd.json"
Environment="OLLAMA_NUM_BATCH=512"
Environment="OLLAMA_NUM_PARALLEL=1"Restart:
sudo systemctl daemon-reload
sudo systemctl restart ollama
| Variable | Purpose |
|---|---|
| OLLAMA_VULKAN=1 | Force Vulkan backend (9% faster than ROCm on Strix Halo) |
| HIP_VISIBLE_DEVICES=-1 | Disable HIP device enumeration (avoids ROCm fallback) |
| OLLAMA_FLASH_ATTENTION=1 | Enable flash attention (+13% prompt processing) |
| OLLAMA_CONTEXT_LENGTH=8192 | Limit context to prevent OOM (increase if needed) |
| AMD_VULKAN_ICD=RADV | Force RADV driver (faster than AMDVLK for general use) |
| VK_ICD_FILENAMES=... | Explicitly point to RADV ICD file |
| OLLAMA_NUM_BATCH=512 | Larger batch size for better throughput |
| OLLAMA_NUM_PARALLEL=1 | Single request at a time (maximizes single-request speed) |
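To confirm the overrides actually reached the service, a quick check (a sketch -- the exact startup log wording varies across Ollama versions):

```bash
# Environment= lists every variable the unit launches with
systemctl show ollama | tr ' ' '\n' | grep -E "OLLAMA_VULKAN|HIP_VISIBLE|AMD_VULKAN_ICD"
# The startup log should mention the Vulkan backend rather than ROCm/HIP
journalctl -u ollama -n 50 --no-pager | grep -i vulkan
```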
# Fast MoE model, great for general use and coding (~23GB)
ollama pull qwen3.5:35b-a3b
# Higher quality MoE, Q8_0 quantization (~32GB)
ollama pull qwen3-coder:30b-a3b-q8_0
# Google's MoE model, strong reasoning (~16GB)
ollama pull gemma4:26b-a4b
# Large dense model for complex tasks (~51GB)
ollama pull qwen3-coder-next
ollama run qwen3.5:35b-a3b
You should see responses generating at ~48 t/s.
tee ~/bench-ollama.sh > /dev/null << 'SCRIPT'
#!/bin/bash
MODEL="${1:-qwen3.5:35b-a3b}"
PROMPT="${2:-hello how are you}"
echo "Model: $MODEL"
echo "Timestamp: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
curl -s http://localhost:11434/api/generate -d "{\"model\":\"$MODEL\",\"prompt\":\"$PROMPT\",\"stream\":false}" | python3 -c "
import sys,json
d=json.load(sys.stdin)
pp=d['prompt_eval_count']/d['prompt_eval_duration']*1e9
tg=d['eval_count']/d['eval_duration']*1e9
print(f'Prompt eval: {pp:.1f} t/s ({d[\"prompt_eval_count\"]} tokens)')
print(f'Generation: {tg:.1f} t/s ({d[\"eval_count\"]} tokens)')
print(f'Total time: {d[\"total_duration\"]/1e9:.2f}s')
"
SCRIPT
chmod +x ~/bench-ollama.sh
Usage:
# Default (qwen3.5:35b-a3b, short prompt)
bash ~/bench-ollama.sh
# Specific model with custom prompt
bash ~/bench-ollama.sh qwen3-coder-next "explain backpropagation in simple terms"
tee ~/bench-ollama-long.sh > /dev/null << 'SCRIPT'
#!/bin/bash
MODEL="${1:-qwen3.5:35b-a3b}"
echo "Model: $MODEL (long prompt)"
echo "Timestamp: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
curl -s http://localhost:11434/api/generate -d "{\"model\":\"$MODEL\",\"prompt\":\"You are an expert software architect. I need you to review and refactor the following Python code for a web application that handles user authentication, session management, database connections, API rate limiting, error handling, logging, caching with Redis, background job processing with Celery, WebSocket connections for real-time updates, file upload handling with S3 integration, email notification service, payment processing with Stripe, and search functionality with Elasticsearch. Please provide a comprehensive architecture review covering separation of concerns, SOLID principles, design patterns, security best practices, performance optimization, and scalability considerations.\",\"stream\":false}" | python3 -c "
import sys,json
d=json.load(sys.stdin)
pp=d['prompt_eval_count']/d['prompt_eval_duration']*1e9
tg=d['eval_count']/d['eval_duration']*1e9
print(f'Prompt eval: {pp:.1f} t/s ({d[\"prompt_eval_count\"]} tokens)')
print(f'Generation: {tg:.1f} t/s ({d[\"eval_count\"]} tokens)')
print(f'Total time: {d[\"total_duration\"]/1e9:.2f}s')
"
SCRIPT
chmod +x ~/bench-ollama-long.sh
Prompt processing speed scales with prompt length due to GPU parallelism:
| Prompt Tokens | pp (qwen3.5:35b-a3b) | pp (qwen3-coder-next) |
|---|---|---|
| 12-14 | 121 t/s | 91 t/s |
| 21-23 | 182 t/s | 130 t/s |
| 120-122 | 457 t/s | 301 t/s |
For maximum prompt processing performance, use llama.cpp with ROCm via kyuz0 containers.
NOTE: On kernel 6.19.x, ROCm requires HSA_OVERRIDE_GFX_VERSION=11.5.1 and HSA_ENABLE_SDMA=0 to work. Without these, it segfaults. See ROCm on kernel 6.19.x.
sudo apt install podman -y
curl -s https://raw.githubusercontent.com/89luca89/distrobox/main/install | sudo sh
Note: Ubuntu 24.04 does not include toolbox in its repos. Use Distrobox instead. The default toolbox on Ubuntu also breaks GPU access.
distrobox create llama-rocm-72 \
--image docker.io/kyuz0/amd-strix-halo-toolboxes:rocm-7.2 \
--additional-flags "--device /dev/dri --device /dev/kfd --group-add video --group-add render --group-add sudo --security-opt seccomp=unconfined"distrobox enter llama-rocm-72
rocm-smi # Should show your gfx1151 GPUThe container comes with pre-built, optimized llama.cpp binaries:
export ROCBLAS_USE_HIPBLASLT=1
llama-bench -m ~/models/your-model.gguf -fa 1 -ngl 999 -mmp 0 -p 128,512 -n 128
Critical flags:
| Flag | Impact | Notes |
|---|---|---|
| -fa 1 | +13% prompt processing | Always use on Strix Halo |
| -mmp 0 (--no-mmap) | +22% pp128, more stable | Always use on Strix Halo |
| ROCBLAS_USE_HIPBLASLT=1 | +8% token generation | Set in environment |
| -ngl 999 | Full GPU offload | Use all available VRAM |
The kyuz0 pre-built binary includes the critical compiler flag --amdgpu-unroll-threshold-local=600 which works around the LLVM compiler regression in ROCm 7+. Self-compiled binaries without this flag may be significantly slower.
If you need the latest llama.cpp features or want to use lhl's rocWMMA patches:
# Inside a ROCm container
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# Standard build (without rocWMMA)
cmake -B build -S . \
-DGGML_HIP=ON \
-DAMDGPU_TARGETS="gfx1151" \
-DCMAKE_HIP_FLAGS="-mllvm --amdgpu-unroll-threshold-local=600" \
-DCMAKE_BUILD_TYPE=Release
# With rocWMMA (for long context, use lhl's tuned branch)
cmake -B build -S . \
-DGGML_HIP=ON \
-DAMDGPU_TARGETS="gfx1151" \
-DGGML_HIP_ROCWMMA_FATTN=ON \
-DCMAKE_HIP_FLAGS="-mllvm --amdgpu-unroll-threshold-local=600" \
-DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)
WARNING: Do NOT enable GGML_HIP_ROCWMMA_FATTN=ON on upstream llama.cpp without lhl's patches. ROCm 7.2 has a 73% performance regression with rocWMMA FA enabled. lhl's custom rocm-wmma-tune branch fixes this and delivers 2X better performance at 32K context.
kyuz0's vLLM toolboxes enable API serving on gfx1151.
distrobox create vllm-gfx1151 \
--image docker.io/kyuz0/vllm-therock-gfx1151:latest \
--additional-flags "--device /dev/dri --device /dev/kfd --group-add video --group-add render --security-opt seccomp=unconfined"Known vLLM issues on gfx1151:
- Qwen3.5 block_size validation (issue #28): Hybrid mamba/attention models compute block_size=1056, which gets rejected by a hardcoded whitelist. Fix available in the issue.
- MIOpen encoder hang (issue #30): Vision models hang during kernel search because MIOpen lacks pre-compiled solver DBs for gfx1151. Workaround: disable encoder profiling.
Tested models on vLLM:
| Model | Max Context |
|---|---|
| Llama-3.1-8B | 128K |
| Gemma-3-12b | 128K |
| Qwen3-Coder-30B-A3B (GPTQ 4-bit) | 256K |
| gpt-oss-120b | 128K |
| Qwen3-Next-80B-A3B (GPTQ Int4) | 256K |
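For reference, serving one of these looks roughly like the sketch below, assuming the container exposes the standard vllm CLI. The model ID is a placeholder -- substitute the exact repo you tested, and expect to tune flags for gfx1151:

```bash
# Sketch: start an OpenAI-compatible vLLM server (model ID is illustrative)
vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct-GPTQ-Int4 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --port 8000
# Then query http://localhost:8000/v1 with any OpenAI-compatible client
```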
For models that exceed 128GB, you can cluster multiple Strix Halo machines using RDMA.
From kyuz0's vLLM clustering guide:
Hardware needed:
- 2x Strix Halo machines (e.g., Framework Desktop)
- 2x Intel E810-CQDA1 100GbE NICs
- 1x DAC cable (direct attach copper, no switch needed for 2 nodes)
Performance:
- ~50 Gbps bandwidth, ~5 us latency (vs ~70-100 us TCP/IP)
- TP=2 across machines = 256GB unified memory
- Enables trillion-parameter model inference (AMD article)
Additional kernel parameter for clustering:
pci=realloc
Network configuration:
# Set MTU to 9000 (jumbo frames)
sudo ip link set <interface> mtu 9000
sudo apt install openssh-server fail2ban -y
sudo sed -i 's/^#*PermitRootLogin.*/PermitRootLogin no/' /etc/ssh/sshd_config
sudo systemctl restart ssh
fail2ban starts automatically and blocks IPs after repeated failed login attempts. We found 68 brute-force attempts on our system within hours of enabling SSH -- fail2ban is essential.
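You can watch fail2ban work:

```bash
# Shows currently banned IPs and total failed attempts for the SSH jail
sudo fail2ban-client status sshd
```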
We tested both Vulkan drivers via llama-bench. Results depend heavily on the llama.cpp build version:
With kyuz0 containers (b8298):
| Driver | Model | pp512 | tg128 |
|---|---|---|---|
| RADV | Qwen3.5-35B-A3B | 859 | 52.15 |
| AMDVLK | Qwen3.5-35B-A3B | 576 | 55.54 |
| RADV | Llama 2 7B | 1377 | 48.12 |
| AMDVLK | Llama 2 7B | 327 | 48.02 |
With latest llama.cpp (b8460) -- AMDVLK advantage is gone:
| Driver | Model | pp512 | tg128 |
|---|---|---|---|
| RADV | Qwen3.5-35B-A3B | 1080 | 64.85 |
| AMDVLK | Qwen3.5-35B-A3B | 663 | 64.10 |
AMDVLK is discontinued. Uninstall it -- even inactive, its ICD file silently hijacks Vulkan and halves your pp speed. See AMDVLK warning above.
Our recommendation: Use RADV. AMDVLK is discontinued (last release April 2025) -- RADV is now AMD's only supported open-source Vulkan driver. Even before discontinuation, RADV won on both pp and tg with latest llama.cpp. AMDVLK also had a 2 GiB buffer limit that caused 3-4X slower pp on dense models. Don't install AMDVLK.
Optimal ubatch sizes per driver (from lhl's testing):
- AMDVLK: -ub 512
- RADV: -ub 1024
- ROCm HIP: -ub 2048
These findings correct several common recommendations found in other Strix Halo guides.
| Issue | Common Advice | Reality | What Happens If You Try |
|---|---|---|---|
| Ollama + ROCm on gfx1151 | - | Fixed in Ollama 0.20+ with HSA_OVERRIDE_GFX_VERSION=11.5.1. Works but ~9% slower tg than Vulkan | Use Vulkan for best speed, ROCm if you need vLLM compatibility |
| iommu=pt for speed | "Use pass-through for performance" | No benefit over default (lhl) | Same speed as iommu=on, wastes a kernel param |
| AMDVLK for all workloads | "AMDVLK is fastest" | Project discontinued (last release April 2025). RADV beats AMDVLK on both pp (+63%) and tg. Worse: even if you don't use AMDVLK, its ICD file (/etc/vulkan/icd.d/amd_icd64.json) silently hijacks Vulkan and halves your pp speed. You won't see an error -- just mysteriously slow prompt processing | Uninstall it completely: sudo dpkg -r amdvlk && sudo rm -f /etc/vulkan/icd.d/amd_icd64.json. Verify with llama-bench: RADV shows (RADV STRIX_HALO) with shared memory: 65536. AMDVLK shows (AMD open-source driver) with shared memory: 32768 |
| rocWMMA on upstream llama.cpp | "Enable for 2x speed" | 73% regression on ROCm 7.2 | Massively slower prompt processing |
| BIOS VRAM increase for speed | "More GPU VRAM = faster" | Zero speed difference, but you lose OS-visible RAM and GTT capacity. Set to 512MB or your system is crippled (31GB usable instead of 125GB). | OS sees only 31GB RAM, large models won't load at all |
| ROCm 7.0 RC | "Use ROCm 7 RC" | Segfaults on kernel 6.18.14+ | HSA_STATUS_ERROR crash |
| Kernel 6.19.x with ROCm (without fix) | "Just use latest kernel" | GPU misidentified as gfx1100 without HSA override | Segfaults unless you set HSA_OVERRIDE_GFX_VERSION=11.5.1 |
| linux-firmware-20251125 | Auto-update | Breaks ROCm on Strix Halo | Instability, crashes |
| PyTorch / HuggingFace Transformers | "Just load the model" | 92-95% of decode time is hipMemcpy, not compute. ~1.5 t/s on 70B vs llama.cpp's 4.8 t/s | PyTorch doesn't handle UMA correctly -- use llama.cpp or Ollama |
| Optimization | Impact | How |
|---|---|---|
| Mesa 25.2.8 to 26.0.2 | +9-10% pp | sudo add-apt-repository ppa:kisak/kisak-mesa |
| Flash Attention | +13% pp | -fa 1 or OLLAMA_FLASH_ATTENTION=1 |
| --no-mmap (disable mmap) | +22% pp128 | -mmp 0 in llama.cpp, always use on Strix Halo |
| hipBLASLt | +8% tg | ROCBLAS_USE_HIPBLASLT=1 (ROCm only) |
| tuned accelerator-performance | +5-8% overall | sudo tuned-adm profile accelerator-performance |
| RADV over AMDVLK | +63% pp, +1.2% tg | Uninstall AMDVLK entirely (see above). AMD_VULKAN_ICD=RADV works too but is easy to forget |
| amd_iommu=off | +6% memory bandwidth | GRUB parameter |
| BIOS VRAM to 512MB | OS sees 125GB vs 31GB, GTT gets full 128GB | No speed change, but required -- without this, models >31GB won't load |
| HIP_VISIBLE_DEVICES=-1 | Fixes Ollama crash | Required for Vulkan-only mode |
| LLVM unroll workaround | Restores ROCm 7+ perf | -mllvm --amdgpu-unroll-threshold-local=600 |
| lhl's rocWMMA-tuned | 2X tg at 32K context | Custom branch, requires manual build |
| Updating llama.cpp | +25% pp and tg (MoE) | git pull && cmake --build -- biggest single optimization |
| HSA_OVERRIDE_GFX_VERSION=11.5.1 | Fixes ROCm on kernel 6.19.x | Required for ROCm on 6.19.x, +6% pp vs 6.18.x |
Symptoms: Without the fix, ROCm containers segfault. ggml_cuda_init reports gfx1100 (0x1100) instead of gfx1151.
Fix: Set these environment variables before running any ROCm binary:
export HSA_OVERRIDE_GFX_VERSION=11.5.1
export HSA_ENABLE_SDMA=0
With this fix, ROCm works on kernel 6.19.4 and actually performs +6% better on pp than it did on kernel 6.18.14. See benchmarks for numbers.
Qwen3.5 ROCm Hang Bug (ROCm #6027)
Symptoms: Qwen3.5 models (35B-A3B and 27B) hang during load_tensors on ROCm. CPU pegs at 99.9%.
Status: Open. AMD confirmed working with TheRock 7.13.0a20260316+ nightlies.
Workaround: Use very conservative flags: --batch-size 128 --ubatch-size 32 --flash-attn off --n-gpu-layers 1
Symptoms: GPU stays at 900 MHz instead of 2900 MHz, causing ~8% performance loss.
Check:
cat /sys/class/drm/card*/device/pp_dpm_sclk
# Should show: 2: 2900Mhz *
Fix: Force highest performance level:
echo high | sudo tee /sys/class/drm/card*/device/power_dpm_force_performance_level
Newer kernels (6.18.4+) recognize gfx1151's 1.5X VGPR capacity compared to standard gfx11 chips. This enables better occupancy for compute shaders. If you're on an older kernel, you may not be getting full performance.
DKMS mt7925 WiFi Errors During apt install
You'll see this on every apt install:
Error! Bad return status for module build on kernel: 6.18.14-061814-generic
dkms autoinstall failed for mt76-mt7925(10)
This is harmless. WiFi works fine via the kernel driver. To permanently silence:
sudo dkms remove mt76-mt7925/1.5.0 --all
Ollama "Out of Memory" Even with Small Models
This happens when Ollama tries to use HIP/ROCm instead of Vulkan:
# Check current Ollama environment
systemctl show ollama | grep Environment
# Fix: ensure these are set
sudo systemctl edit ollama
# Add: OLLAMA_VULKAN=1, HIP_VISIBLE_DEVICES=-1
sudo systemctl daemon-reload
sudo systemctl restart ollama
ROCm Container Segfaults (Kernel 6.19.x)
If your ROCm containers crash immediately with segfaults on kernel 6.19.x:
# Fix: set these BEFORE running any ROCm binary
export HSA_OVERRIDE_GFX_VERSION=11.5.1
export HSA_ENABLE_SDMA=0
export ROCBLAS_USE_HIPBLASLT=1
# Then run llama-bench or llama-server as normal
llama-bench -m model.gguf -fa 1 -ngl 999 -mmp 0 -p 512 -n 128
The GPU is misidentified as gfx1100 instead of gfx1151 on kernel 6.19.x. The HSA_OVERRIDE_GFX_VERSION forces correct identification. This is a kernel/ROCm compatibility issue that will likely be fixed in future ROCm releases.
Verifying GPU Memory Configuration
# Check TTM pages limit
cat /sys/module/ttm/parameters/pages_limit
# Check GTT size
cat /sys/module/amdgpu/parameters/gttsize
# Check Vulkan driver
vulkaninfo --summary 2>&1 | grep -E "driverName|driverInfo"
# Check OS-visible RAM
free -h
# Check GPU memory allocation
for file in /sys/class/drm/card*/device/mem_info*; do
echo "$file: $(cat $file)"
done
rocm-smi Shows Wrong VRAM
For APUs with unified memory, mem_info_vram_total showing ~1GB is normal. The actual compute memory is in GTT, which should show ~128GB.
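You can read both pools directly from sysfs (values are in bytes):

```bash
# VRAM pool is tiny on an APU; GTT is where models actually load
cat /sys/class/drm/card*/device/mem_info_vram_total
cat /sys/class/drm/card*/device/mem_info_gtt_total
cat /sys/class/drm/card*/device/mem_info_gtt_used
```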
tuned Not Running After Reboot
# Check status
tuned-adm active
# If not running:
sudo systemctl enable --now tuned
sudo tuned-adm profile accelerator-performance
# Verify it persists
tuned-adm active
GPU Stuck at Low Clock Speed
# Check current clock
cat /sys/class/drm/card*/device/pp_dpm_sclk
# If not on highest (2900Mhz):
echo high | sudo tee /sys/class/drm/card*/device/power_dpm_force_performance_level
# To make persistent, add to /etc/rc.local or a udev rule
Based on community testing and our own findings:
| Kernel | ROCm 6.4.4 | ROCm 7.2 | ROCm 7 Nightly | Vulkan (Ollama) |
|---|---|---|---|---|
| 6.17.7 | Works (with right firmware) | Unknown | Works | Works |
| 6.18.4-6.18.14 | Works (patched) | Works | Works | Works |
| 6.19.4 | Works (HSA fix) | Works (HSA fix) | Unknown | Works |
Key rules:
- Kernel 6.18.4+ has a fix that breaks ALL older ROCm versions
- Kernel 6.19.x misidentifies gfx1151 as gfx1100, fixable with HSA_OVERRIDE_GFX_VERSION=11.5.1
- linux-firmware-20251125 breaks ROCm regardless of kernel
- linux-firmware-20260110+ is safe
Our current recommendation (March 2026): Kernel 6.19.x works for both Vulkan and ROCm (ROCm requires HSA_OVERRIDE_GFX_VERSION=11.5.1). Kernel 6.18.6-6.18.14 works without the HSA workaround.
After completing setup, verify each item (a script that automates these checks follows the list):
- free -h shows ~124GB total RAM
- vulkaninfo --summary shows RADV Mesa 26.0.2+
- tuned-adm active shows accelerator-performance
- cat /sys/class/drm/card*/device/pp_dpm_sclk shows 2900Mhz with asterisk
- cat /sys/module/ttm/parameters/pages_limit shows 31457280
- ollama --version returns without error
- ollama run qwen3.5:35b-a3b "hello" generates at 45+ t/s
- systemctl show ollama | grep Environment includes OLLAMA_VULKAN=1
- cat /etc/default/grub | grep CMDLINE includes amd_iommu=off
- uname -r shows 6.18.x+ (ROCm on 6.19.x requires the HSA override -- see Known Issues)
- dpkg -l | grep linux-firmware does NOT show 20251125
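A sketch that automates the checklist (adjust paths and expected values for your system):

```bash
#!/bin/bash
# verify-setup.sh -- quick pass/fail view of the checklist above
echo "RAM total:  $(free -h | awk '/^Mem:/{print $2}')   (expect ~124Gi)"
echo "Kernel:     $(uname -r)   (expect 6.18.x or newer)"
echo "tuned:      $(tuned-adm active 2>/dev/null)"
echo "TTM limit:  $(cat /sys/module/ttm/parameters/pages_limit)   (expect 31457280)"
vulkaninfo --summary 2>/dev/null | grep driverInfo
grep CMDLINE /etc/default/grub | grep -q amd_iommu=off \
  && echo "GRUB:       amd_iommu=off present" || echo "GRUB:       amd_iommu=off MISSING"
dpkg -l | grep -q "linux-firmware.*20251125" \
  && echo "Firmware:   WARNING -- 20251125 installed!" || echo "Firmware:   OK"
systemctl show ollama 2>/dev/null | grep -q OLLAMA_VULKAN=1 \
  && echo "Ollama:     Vulkan enabled" || echo "Ollama:     OLLAMA_VULKAN=1 not set"
```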
- kyuz0/amd-strix-halo-toolboxes -- Community standard containers for llama.cpp (1.2k+ stars)
- kyuz0/amd-strix-halo-vllm-toolboxes -- vLLM serving + RDMA clustering
- kyuz0/amd-strix-halo-gfx1151-toolboxes -- Meta repository with all toolboxes
- kyuz0 Backend Benchmarks Dashboard -- Interactive benchmark comparison
- lhl/strix-halo-testing -- Deep performance research and rocWMMA patches
- strixhalo.wiki -- Community wiki
- llm-tracker.info -- GPU performance comparison
- Level1Techs Forum -- Community benchmark results
- Framework Community -- Framework Desktop discussions
- ROCm Strix Halo Optimization Guide -- Official AMD guide
Not sure which model to run? Here's what we recommend based on use case:
| I want to... | Model | Size | Speed | Why |
|---|---|---|---|---|
| Code (best speed) | Qwen3-Coder 30B-A3B (UD-Q4_K_XL) | 16.5 GB | 87 t/s | Fastest coding model, MoE architecture |
| Code (best quality) | Qwen3-Coder 30B-A3B (Q8_0) | 32 GB | 51 t/s | Same model, higher fidelity quantization |
| Chat (general) | Qwen3.6 35B-A3B (Q4_K_M) | 20 GB | 64 t/s | Best all-rounder, successor to 3.5 |
| Chat (no thinking) | Qwen3.6 35B-A3B (no-think) | 20 GB | 64 t/s | Same speed, direct answers |
| Code (best quality, 256K ctx) | Qwen3-Next 80B-A3B | 42.9 GB | 55 t/s | 80B MoE, only 3B active, 256K context |
| Chat (smartest possible) | Qwen3-Coder-Next | 51 GB | 38 t/s | Dense 51B model, slower but smarter |
| Reasoning | Gemma 4 26B-A4B | 15.7 GB | 48.5 t/s | Google's latest MoE, strong reasoning |
| Analyze images | Qwen2.5-VL 7B | 6 GB | 21 t/s | Vision-language model |
| Maximum intelligence | Llama 3.3 70B (Q4) | ~40 GB | ~5 t/s | Slow but very capable |
| "Can it run?" | Llama 4 Scout 109B | 61 GB | 18 t/s | 109B model on a mini PC. RTX 4090 can't |
| Process documents | Qwen3.6 35B-A3B (Q4_K_M) | 20 GB | 64 t/s | Fast enough for RAG pipelines |
| Learn / experiment | Llama 2 7B | 3.8 GB | 52 t/s | Small, fast, well-documented |
| Throughput testing | Qwen3-0.6B (Q8_0) | 0.8 GB | 266 t/s | Speed ceiling benchmark |
How to install any model:
# Via Ollama (easiest)
ollama pull qwen3.6:35b-a3b
# For llama-bench direct (need GGUF file)
# Download from huggingface.co, place in ~/models/
Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf
| | | | | | |
| | | | | | +-- Quantization (see Glossary)
| | | | | +-- "Unsloth Dynamic" quant method
| | | | +-- Fine-tuned for instructions
| | | +-- 3B Active parameters (MoE)
| | +-- 30B Total parameters
| +-- Optimized for coding
+-- Model family (by Alibaba)
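One way to fetch a GGUF like this (the repo ID below is illustrative -- check the actual Hugging Face repo for the quant you want):

```bash
# Downloads just the one quantization file into ~/models/
huggingface-cli download unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF \
  Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf \
  --local-dir ~/models/
```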
Assumptions: Qwen3.6-35B-A3B level intelligence, 1000 tokens per query, 50 queries per day.
| Option | Monthly Cost | Speed | Privacy | Offline |
|---|---|---|---|---|
| ChatGPT Plus | $20/mo | Fast | No | No |
| Claude Pro | $20/mo | Fast | No | No |
| OpenAI API (gpt-4o, 50 queries/day) | ~$15/mo | Fast | No | No |
| Anthropic API (Claude Sonnet, 50 queries/day) | ~$12/mo | Fast | No | No |
| Strix Halo (after purchase) | ~$8/mo electricity | 48-87 t/s | Yes | Yes |
Break-even calculation:
| Scenario | System Cost | Monthly Savings | Break-even |
|---|---|---|---|
| vs ChatGPT Plus | ~$3,299 | $12/mo | ~23 years |
| vs API heavy use (200 queries/day) | ~$3,299 | ~$50/mo | ~5.5 years |
| vs API power use (1000+ queries/day) | ~$3,299 | ~$200/mo | ~16 months |
The real value is not cost savings. It's running AI with no rate limits, no content filters, no data leaving your machine, and no internet required. If you value privacy, unrestricted use, or offline capability, local LLM pays for itself immediately.
Power consumption:
- Idle: ~30W
- Under inference load: 120-140W
- Monthly electricity (8 hours/day inference): ~$8 at $0.15/kWh
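The arithmetic behind that estimate, as a sketch you can adapt to your own rates and duty cycle:

```bash
python3 -c "
load_w, idle_w, rate = 140, 30, 0.15           # watts under load, watts idle, USD per kWh
kwh = (load_w * 8 + idle_w * 16) / 1000 * 30   # 8 h/day inference, 16 h/day idle
print(f'{kwh:.0f} kWh/month -> about USD {kwh * rate:.2f}/month')
"
```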
Ollama provides an OpenAI-compatible API. Point any coding tool at it:
# For Cursor, Continue.dev, or any OpenAI-compatible client:
# Base URL: http://localhost:11434/v1
# Model: qwen3.5:35b-a3b (or qwen3-coder-next for max quality)
# API Key: ollama (or leave empty)For Claude Code specifically:
ANTHROPIC_BASE_URL=http://localhost:11434 claude --model qwen3.5:35b-a3b
At 48-87 t/s, local inference feels instant for code completion and review.
docker run -d -p 3000:8080 \
--add-host=host.docker.internal:host-gateway \
-v open-webui:/app/backend/data \
--name open-webui \
ghcr.io/open-webui/open-webui:main
Open http://localhost:3000. You get conversation history, document upload, multi-model support, and built-in RAG -- all local, no cloud.
For querying your own documents locally:
# 1. Pull an embedding model
ollama pull nomic-embed-text
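# (Sketch) Sanity-check the embedding endpoint -- /api/embeddings is Ollama's
# embeddings API; the prompt below is just a placeholder
curl -s http://localhost:11434/api/embeddings \
  -d '{"model":"nomic-embed-text","prompt":"test sentence"}' | head -c 200; echo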
# 2. Use Open WebUI's built-in RAG (easiest)
# or set up LangChain + ChromaDB for custom pipelines
kyuz0's ComfyUI toolboxes provide ROCm containers for Flux, Wan 2.2, and Hunyuan on gfx1151. For Vulkan-only: stable-diffusion.cpp works with the RADV driver.
Qwen3-TTS and Chatterbox TTS both run on Strix Halo with GPU acceleration. lhl's voicechat2 provides a complete local AI voice chat system.
All current Strix Halo mini PCs use the same AMD Ryzen AI MAX+ 395 APU with 128GB LPDDR5X-8000. The differentiators are form factor, cooling, ports, and price.
| System | Price (Apr 2026) | Cooling | Networking | Key Differentiator |
|---|---|---|---|---|
| GMKtec EVO-X2 | ~$2,349 | Air (blower) | 2.5GbE | Best value, most popular |
| Bosgame M5 | $2,399 | Air (blower) | 2.5GbE | Budget option |
| Framework Desktop 13 | ~$2,599 | Air (optimized) | Modular | Best community/support, quietest, DIY kit (no SSD/OS) |
| Beelink GTR9 Pro | $3,299 | Air (Mac Studio) | Dual 10GbE | Best for clustering (this guide's test system) |
| Corsair AI Workstation 300 | $3,399 | Liquid cooled | 2.5GbE | Brand reputation, quiet under load |
| Minisforum MS-S1 MAX | $3,039 | Air | Dual 10GbE, USB4 v2 | PCIe x16 slot (x4 speed) |
| HP ZBook Ultra G1a | ~$4,049+ | Air (laptop) | WiFi/1GbE | Only portable option, 14" OLED |
Note: Prices have increased significantly since launch due to global LPDDR5X memory shortages and tariffs. The DGX Spark went from $3,999 to $4,699 in Feb 2026. Strix Halo systems are up $500-1,000+ from launch prices (Corsair jumped $699 in one month). Check current availability before buying.
WARNING (Beelink GTR9 Pro): The v1 motherboard has a fatal NIC stability issue that cannot be fixed in software. Verify you are getting board revision v2.2 (with Realtek NICs) before purchasing. Beelink offers free replacement for v1 boards. Contact their support with your serial number.
Recommendation tiers:
- Best value: GMKtec EVO-X2 (~$2,349)
- Best overall: Framework Desktop 13 (~$2,599) -- best cooling, community, repairability, used by kyuz0 and lhl
- Best for clustering: Beelink GTR9 Pro v2.2 ($3,299) or Minisforum MS-S1 MAX ($3,039) -- dual 10GbE for RDMA
- Only if you need portability: HP ZBook Ultra G1a ($4,049+)
Important: ~90% of Chinese mini PCs (Bosgame, GMKtec, Beelink) use the same Sixunited platform internally. Performance is identical. Pick based on price, ports, and cooling preference.
| Feature | Linux (recommended) | Windows |
|---|---|---|
| LLM performance | Baseline (fastest) | ~20-40% slower |
| Max model size | ~120 GB | ~64 GB (known limitation) |
| ROCm/HIP | Supported (kernel 6.18.x) | Very limited |
| vLLM serving | Works | Not supported |
| Image generation | Works (ComfyUI) | Limited |
| Setup effort | Higher (this guide helps) | Lower (but slower) |
Linux is strongly recommended for Strix Halo LLM work. Windows works for casual use with Ollama but leaves significant performance on the table.
New to local LLMs? Here's what the technical terms mean.
Click to expand glossary
APU -- Accelerated Processing Unit. AMD's term for a chip that combines CPU and GPU on one die. Strix Halo's APU shares 128GB of memory between CPU and GPU, which is why it can run large models.
GGUF -- GPT-Generated Unified Format. The file format used by llama.cpp to store AI models. A .gguf file contains the model weights and metadata needed to run inference.
Quantization -- Reducing the precision of model weights to use less memory and run faster. Common types:
- Q4_K_M -- 4-bit quantization, medium quality. Good balance of size and quality.
- Q8_0 -- 8-bit quantization. Better quality, ~2x the size of Q4.
- UD-Q4_K_XL -- Unsloth Dynamic 4-bit. Uses higher precision for important layers.
- BF16 -- Full precision (16-bit). Best quality, largest size.
MoE (Mixture of Experts) -- A model architecture where only a subset of parameters are active for each token. A "30B-A3B" model has 30 billion total parameters but only activates 3 billion per token, making it much faster than a dense 30B model while retaining most of the intelligence.
Dense Model -- A model where all parameters are used for every token. Slower but potentially smarter per parameter count. A dense 7B model uses all 7 billion parameters for every token.
Token -- The basic unit of text for LLMs. Roughly 3/4 of a word in English. "Hello, how are you?" is about 6 tokens.
Prompt Processing (pp) -- How fast the model reads your input. Measured in tokens/second. Higher is better. A pp of 800 t/s means the model can read ~600 words per second.
Token Generation (tg) -- How fast the model writes its response. Measured in tokens/second. This is the speed you "feel" when chatting. 50 t/s feels instant. 5 t/s feels slow.
Unified Memory -- Memory shared between CPU and GPU. Unlike discrete GPUs (RTX 4090 has separate 24GB VRAM), Strix Halo's GPU uses the same 128GB as the CPU. This means you can load models up to ~120GB.
GTT (Graphics Translation Table) -- The portion of system memory that the GPU can access via Vulkan. On Strix Halo, you configure this to ~128GB so the GPU can use all available memory.
Vulkan -- A graphics/compute API. On Strix Halo, Vulkan is the most reliable backend for LLM inference via Ollama.
ROCm -- AMD's GPU compute platform (like NVIDIA's CUDA). Provides HIP backend for llama.cpp. On kernel 6.19.x, requires HSA_OVERRIDE_GFX_VERSION=11.5.1 to work. With the latest llama.cpp, Vulkan RADV is now faster than ROCm on both pp and tg for MoE models.
RADV -- Mesa's open-source Vulkan driver for AMD GPUs. AMD's only supported open-source Vulkan driver since AMDVLK was discontinued. Fastest backend for LLM inference on Strix Halo.
AMDVLK -- AMD's former open-source Vulkan driver. Discontinued (last release April 2025). Uninstall it -- even inactive, its ICD file silently hijacks Vulkan and halves pp speed.
Ollama -- A tool that makes running LLMs as easy as ollama run model-name. Handles model downloading, GPU acceleration, and provides an API. Uses Vulkan on Strix Halo.
llama.cpp -- The open-source C++ library that powers most local LLM inference. Supports Vulkan, ROCm/HIP, and CPU backends.
Flash Attention -- An optimized attention algorithm that reduces memory usage and improves speed. Always enable it on Strix Halo (-fa 1 or OLLAMA_FLASH_ATTENTION=1).
tuned -- A Linux daemon that applies system performance profiles. The accelerator-performance profile gives +5-8% LLM speed on Strix Halo.
What is the difference between Ollama and llama.cpp? Why is llama.cpp faster?
They are not two different programs. Ollama is a wrapper around llama.cpp. It adds model management (ollama pull), a simple API, and easy commands (ollama run). Under the hood, it runs the same llama.cpp inference engine.
So why is llama.cpp direct 35% faster? Two reasons:
- Wrapper overhead. Ollama adds layers between you and the GPU: model loading, API translation, memory management. This costs ~8-15% on token generation.
- Bundled version. Ollama ships with a specific llama.cpp version baked in. As of March 2026, Ollama bundles an older build that misses recent Vulkan optimizations (Flash Attention refactor, graphics queue on AMD, GDN shaders). These optimizations gave us +25% on MoE models. Ollama will catch up eventually, but there's always a lag.
Think of it like a web browser: Ollama is Chrome (easy to use, auto-updates, but bundles a specific engine version). llama.cpp direct is building Chromium from source (more work, but you get the latest engine immediately).
What should you use?
| Use case | Recommendation |
|---|---|
| Just want it to work | Ollama -- install and go, 48 t/s is still fast |
| Want maximum speed | llama-server (from latest llama.cpp) -- 65 t/s, same API as Ollama |
| Using kyuz0 containers | kyuz0 -- they auto-rebuild on llama.cpp updates, best of both worlds |
| Benchmarking | llama-bench -- eliminates all overhead, pure GPU measurement |
How to run llama-server (Ollama replacement with full speed):
```bash
# Start llama-server with your model (OpenAI-compatible API on port 8080)
cd ~/llama-cpp-latest
AMD_VULKAN_ICD=RADV ./build-vulkan/bin/llama-server \
  -m ~/models/Qwen3.6-35B-A3B-Q4_K_M.gguf \
  -ngl 999 -fa --no-mmap -c 8192 \
  --host 0.0.0.0 --port 8080
```

Then point your tools at http://localhost:8080/v1 instead of http://localhost:11434/v1. Same API, 35% faster.
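A quick way to confirm the server is answering, sketched with curl (the `model` field is mostly cosmetic here -- llama-server serves whatever model it was started with; `jq` is optional and assumed installed):

```bash
# Smoke-test the OpenAI-compatible endpoint started above
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "local", "messages": [{"role": "user", "content": "Say hello."}]}' \
  | jq -r '.choices[0].message.content'
```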
Can I run ChatGPT-level intelligence locally?
Yes. Qwen3.6-35B-A3B runs at 64 t/s via llama-server and is comparable to GPT-4o-mini for most tasks. For coding, Qwen3-Coder 30B-A3B runs at 87 t/s and is competitive with commercial coding assistants. For maximum intelligence, you can run 70B+ dense models at ~5 t/s -- slower but very capable.
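If you want to try this without building llama.cpp, the Ollama route is a single command. The tag below is the q8_0 build benchmarked later in this guide; verify the exact tag in the Ollama model library, since names change:

```bash
# Pull and chat with a fast coder model (confirm the tag with `ollama list` / the library)
ollama run qwen3-coder:30b-a3b-q8_0
```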
Do I need Linux? Can I use Windows?
Linux (Ubuntu 24.04) gives the best performance and is the only way to use ROCm. Windows works for basic inference via Ollama with Vulkan, and AMD's Adrenalin 25.8.1+ drivers added Variable Graphics Memory support for up to 96GB VGM. However, Windows performance is typically 10-20% lower and community tooling is less mature.
Is 128GB enough for the biggest models?
128GB unified memory lets you run models up to ~120GB (some memory reserved for OS and GPU overhead). This covers all 70B Q4 models and most 120B MoE models. For larger models, you can cluster two Strix Halo systems via RDMA for 256GB unified memory. AMD demonstrated a 4-node cluster running a 1 trillion parameter model.
How does this compare to a Mac Studio?
The Mac Studio M4 Max (128GB) costs $3,699 and gets ~100 t/s via MLX with ~546 GB/s bandwidth. The Beelink GTR9 Pro costs $3,299 and gets 50-87 t/s via Vulkan (model-dependent) with ~215 GB/s bandwidth. The Mac is faster per-model due to higher bandwidth, but costs $400 more. The Mac has better software polish (MLX is excellent). The Strix Halo offers better value, Linux flexibility, and ROCm/vLLM ecosystem access.
Why is my speed lower than the guide says?
Common causes:
- tuned not running -- Run `tuned-adm active`. Should show `accelerator-performance`. This alone is worth +5-8%.
- Old Mesa drivers -- Check `vulkaninfo --summary | grep driverInfo`. Should be Mesa 26.0.2+.
- Using Ollama instead of llama-bench -- Ollama has ~8% overhead. The 87 t/s number is via llama-bench direct.
- GPU clock stuck low -- Check `cat /sys/class/drm/card*/device/pp_dpm_sclk`. Should show 2900 MHz with an asterisk.
- Wrong BIOS VRAM setting -- Check `free -h`. Should show ~124GB. If it shows only ~31GB, set UMA Frame Buffer to 512MB in BIOS.
- Different model/quantization -- The 87 t/s is specifically Qwen3-Coder-30B-A3B UD-Q4_K_XL via RADV. Larger or denser models are slower.
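The checkable items above, bundled into one paste-able block (a convenience sketch; paths are the standard tuned/Mesa/amdgpu ones, adjust the card index if yours differs):

```bash
# One-shot performance health check
tuned-adm active                                          # expect: accelerator-performance
vulkaninfo --summary | grep -i driverinfo                 # expect: Mesa 26.0.2+
cat /sys/class/drm/card*/device/pp_dpm_sclk | grep '\*'   # expect: ~2900MHz marked active
free -h | awk '/^Mem:/ {print "Total RAM:", $2}'          # expect: ~124Gi
```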
Can I use this for AI coding assistants like Cursor or Continue.dev?
Yes. Ollama provides an OpenAI-compatible API at http://localhost:11434/v1. You can point any tool that supports OpenAI API to your local Ollama:
```bash
# In Continue.dev, Cursor, or any OpenAI-compatible client:
# Base URL: http://localhost:11434/v1
# Model: qwen3.5:35b-a3b
# API Key: (leave empty or use "ollama")
```

At 48 t/s, local inference feels instant for code completion and review tasks.
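Before pointing an editor at it, you can confirm the endpoint is live -- Ollama's OpenAI-compatible API also serves a model list (`jq` optional):

```bash
# Verify the endpoint and see the exact model names to use
curl -s http://localhost:11434/v1/models | jq .
```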
Can I run image generation (Stable Diffusion, Flux)?
Yes. kyuz0's ComfyUI toolboxes provide ROCm containers for image and video generation on gfx1151, supporting Flux, Wan 2.2, and Hunyuan models.
Can I fine-tune models on this hardware?
Yes, with limitations. QLoRA fine-tuning of 7B-30B models works via kyuz0's fine-tuning toolbox. Full fine-tuning of large models is not practical due to memory bandwidth constraints compared to datacenter GPUs.
- kyuz0 -- Maintainer of the Strix Halo toolbox ecosystem, community standard containers
- lhl -- Deep performance research, rocWMMA patches, IOMMU/bandwidth testing
- pablo-ross -- Original GMKtec EVO-X2 setup guide
- TechnigmaAI / Hardware Corner -- Alternative optimization guide
- AMD -- Trillion-parameter LLM clustering article
- Lychee-Technology -- Pre-built llama.cpp binaries for gfx1151
- kisak-mesa PPA -- Latest Mesa drivers for Ubuntu
- GPUOpen-Drivers/AMDVLK -- AMD's former open-source Vulkan driver (now discontinued)
Found something that's wrong, outdated, or missing?
- Open an issue with your hardware, kernel version, and benchmark results
- PRs welcome -- especially from other Strix Halo systems (Framework, GMKtec, HP ZBook)
- If you find a new optimization, include before/after benchmarks (a llama-bench template is sketched below)
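For before/after numbers, llama-bench is the cleanest tool since it eliminates server overhead. A minimal template, with the model path and build directory as placeholders:

```bash
# Run once on the old build and once on the new, then paste both tables in the PR
AMD_VULKAN_ICD=RADV ./build-vulkan/bin/llama-bench \
  -m ~/models/your-model.gguf \
  -ngl 999 -fa 1 -p 512 -n 128 -r 3 -o md
```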
- AMDVLK ICD hijacking discovered: all "pp regression" findings (b8460 vs b8933, Mesa 26.0.2 vs 26.0.5) were caused by AMDVLK's `/etc/vulkan/icd.d/amd_icd64.json` silently overriding RADV. No actual regression exists. Corrected in #22375. All benchmarks re-verified on actual RADV.
- Qwen3.6-35B-A3B benchmark: 64 t/s tg, 1064 pp512 via Vulkan RADV. Drop-in replacement for Qwen3.5 with better coding/reasoning quality, identical speed. Use Q4_K_M -- UD-Q4_K_M costs 13% speed (56.6 t/s).
- Qwen3-Next 80B-A3B benchmark: 55 t/s tg, 657 pp512 via Vulkan RADV (b8933). 80B MoE (3B active) with 256K context window. Faster than the 51B dense Qwen3-Coder-Next (38 t/s)
- Gemma 4 26B-A4B benchmark: 48.5 t/s tg, 1142 pp512 via Vulkan RADV (b8933). First Strix Halo benchmark for this model. Includes KV cache quantization warning (3.5x worse quality degradation vs Qwen at q8_0)
- Llama 4 Scout 109B benchmark: 18.3 t/s tg, 331 pp512 via Vulkan RADV (b8933). 109B parameter model running on a mini PC -- RTX 4090 can't load this
- Merged PR #1: vulkan-tools install check in setup.sh (thanks @ignasivt)
- Updated all prices: Beelink $2,999 to $3,299, Corsair $2,700 to $3,399, GMKtec $2,199 to ~$2,349
- Added linux-firmware-20251125 source attribution and downgrade instructions
- Added Ubuntu 26.04 LTS note (Wayland-only, testing in progress)
- Ollama upgraded to 0.21.2: FA now enabled by default. Qwen3.6 via Ollama: 45.5 t/s (vs 64 t/s llama-bench direct, ~30% overhead)
- Ollama ROCm confirmed working on gfx1151 with `HSA_OVERRIDE_GFX_VERSION=11.5.1` (Ollama 0.20.4). Benchmarked: 42.4 t/s tg vs Vulkan's 46.6 t/s (-9%). Vulkan still recommended for speed.
Performance discoveries:
- llama.cpp b8298 to b8460 = +25% tg and +24% pp on MoE models (52 to 65 t/s tg, 868 to 1080 pp512)
- Key PRs: #19625 (FA refactor), #20551 (graphics queue), #20334 (GDN shader)
- +25% breaks down as ~14% generic (both backends got this) + ~11% Vulkan-specific
- Dense models show <2% change (already at bandwidth ceiling)
- RADV now beats AMDVLK on both pp AND tg with latest build (old AMDVLK tg advantage gone)
- Exceeded theoretical tg ceiling: measured 65 t/s vs a calculated max of ~57 t/s. The standard formula (bandwidth / active_model_size) underestimates MoE performance because it ignores caching and memory-access optimizations in newer llama.cpp builds. The real ceiling is a moving target (back-of-envelope math after this list).
- RADV now beats ROCm on both pp (1080 vs 1047) and tg (65 vs 55) on same b8460 build
- ROCm works on kernel 6.19.4 with `HSA_OVERRIDE_GFX_VERSION=11.5.1` + `HSA_ENABLE_SDMA=0`
- ROCm b8460 got +14% tg from generic improvements (47.87 to 54.67 t/s)
- Batch/ubatch sweep: default 512 is optimal, no tuning headroom left
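Back-of-envelope version of that ceiling, using only numbers measured in this guide (the ~3.8 GB/token figure is back-solved from the ~57 t/s estimate, not measured directly):

```
naive ceiling:  tg_max ≈ bandwidth / bytes_read_per_token
                       ≈ 215 GB/s / ~3.8 GB ≈ 57 t/s   (the "calculated max")

measured 65 t/s implies: 215 / 65 ≈ 3.3 GB actually read per token
```

In other words, newer builds touch roughly half a gigabyte less per token than the naive estimate assumes -- that gap is the caching win.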
New benchmarks:
- Llama 3.1 70B (4.8 t/s, 94% of theoretical ceiling, doesn't fit on RTX 4090)
- Qwen3-Coder-30B UD-Q4_K_XL (87 t/s tg via RADV)
- Qwen3-0.6B (266 t/s tg, 13,112 pp512)
- Extended context scaling (pp flat from 512 to 8K, only 3% drop)
Beginner content:
- Ollama vs llama.cpp FAQ with browser analogy and llama-server setup
- Model recommendation guide (10 use cases)
- Cost comparison (local vs cloud with break-even analysis)
- Buying guide (7 systems with March 2026 verified prices, Beelink v1 board warning)
- Glossary (20+ terms for beginners)
- FAQ (8 common questions)
- Use cases (Claude Code, Cursor, RAG, image gen, TTS)
- Windows vs Linux comparison
Infrastructure:
- One-command setup script (`setup.sh`)
- Auto-update script for llama.cpp (`update-and-build.sh`)
- CONTRIBUTING.md and 3 GitHub issue templates
- GitHub release v1.0.0
- 19 topics for discoverability
- GitHub stars + last-commit badges
Fixes:
- All prices verified against current retail (March 2026 snapshot)
- DGX Spark comparison is now apples-to-apples (same model, same context)
- Fixed 12 outdated "ROCm broken on 6.19.x" references
- BIOS VRAM 512MB is mandatory, not just speed-neutral
- Vulkan Driver Comparison updated with b8460 data
- RADV_PERFTEST env vars (cswave32, nogttspill) tested: both are ~10% slower. Don't use them.
- Posted findings on llama.cpp Vulkan discussion
- Complete rewrite with live benchmarks on current system
- Added: Kernel 6.19.x ROCm fix (HSA_OVERRIDE_GFX_VERSION=11.5.1)
- Added: Mesa 26.0.2 results (+4-5% tg improvement over 26.0.1)
- Added: qwen3-coder:30b-a3b-q8_0 benchmarks (51.4 t/s -- fastest model)
- Added: Long context performance data from lhl (Vulkan vs ROCm at 32K)
- Added: rocWMMA status update (upstream broken, lhl's tuned branch works)
- Added: vLLM setup and known issues
- Added: RDMA clustering section
- Added: Kernel/ROCm compatibility matrix
- Added: linux-firmware-20251125 warning
- Added: LLVM compiler regression workaround
- Added: Qwen3.5 ROCm hang bug (ROCm #6027)
- Added: Backend decision guide
- Added: Testing checklist
- Added: Collapsible troubleshooting sections
- Updated: ROCm HIP works on kernel 6.19.4 with HSA override (even +6% faster pp than 6.18.14)
- Updated: All benchmark numbers re-measured
- Updated: Replaced `nano` instructions with `tee` for copy-paste-ready commands
- Corrected: rocWMMA is no longer a blanket "don't use" -- lhl's tuned branch is best for long context
- Corrected: `iommu=pt` has no benefit -- use `amd_iommu=off` instead
- Basic setup guide based on pablo-ross' GMKtec guide
- Ollama Vulkan configuration
- ROCm container setup
- Gorgon Halo (Ryzen AI Max 400, Q4 2026): Same architecture, higher clocks
- Medusa Halo (Ryzen AI Max 500): LPDDR6, ~80% more memory bandwidth
- Lemonade 10.0 (March 2026): First Linux NPU support for LLM inference via FastFlowLM
- AMD Variable Graphics Memory (Windows): Up to 128B parameter models in Vulkan llama.cpp
MIT
Found this guide useful? Give it a star on GitHub -- it helps other Strix Halo owners find it. Found something wrong? Open an issue.