How to Force-Enable Claude Opus 4.6 [1M] Context with a Max Subscription

https://netstatz.com/enable-claude-opus-4-6-1m-context/ · Wed, 11 Feb 2026

Topic: Claude CLI, Opus 4.6, Large Context Windows

The last month has been a never-ending drop of new models and AI tools. With the release of Claude Opus 4.6 and Codex 5.3, my week has been a balancing act of managing daily limits and burning context on planning, delegation, and sub-agent tasks.

The goal is always autonomous productivity, but context limits are the bottleneck. Even with “TOON efficient context” (despite Opus 4.6 missing the Nov 2025 standard), the 200k limit forces a massive pre-commit effort to keep iterations on track. The “ballooning crater of YAML frontmatter” from daily skill evolution makes it worse.

Enter the 1 Million Token Context Window.

If you are a Max Subscription user like me, you might have noticed that while Opus 4.6 arrived, the 1M context window did not. After some troubleshooting with a friend (who had it working on a 20x sub), I discovered a workaround. It seems the 1M capability requires a “signal” from the API side to unlock for the subscription side.

Here is the one-shot process to get it working.


Update: This loophole is now closed. New sessions with 1M selected return API Error 400, and existing sessions end abruptly. It appears the feature really is limited to some 20x Max users.

The TL;DR Summary

  1. Platform Login: Log into platform.claude.com with your standard email.

  2. Add Funds: Purchase a small amount of API credits.

  3. Verify: Execute a curl command to confirm the 1M model works via API.

  4. CLI API Login: Log into the Claude CLI using the API key (not the subscription).

  5. Switch Back: Log out, then log back in using your Max Subscription.

  6. Success: The /model opus[1m] command should now be valid.


Step-by-Step Guide

1. The Pre-Requisites

First, acknowledge the frustration. Until today, selecting /model opus[1m] likely resulted in an immediate rejection or error. A few days ago you could even select the model, which teased functionality and sent people off to look at open issues on GitHub.

To fix this, go to platform.claude.com. Log in with the same email you use for your regular claude.ai account. Verify that you have active API credits; I had over $25, but a smaller amount may work. I created a new key to test with.

2. Test the API Access

We need to confirm the backend allows your account to access the model. Create a simple script or run the following curl command.

Note: If you are in a project folder, you can add your ANTHROPIC_API_KEY to your .env file first.
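One way to do that, as a minimal sketch (the `.env` filename and key variable follow the note above):

```shell
# Pull ANTHROPIC_API_KEY out of a .env file in the current directory and
# export it for the curl call that follows. 'set -a' auto-exports every
# variable assigned while it is in effect.
if [ -f .env ]; then
  set -a
  . ./.env
  set +a
fi

[ -n "$ANTHROPIC_API_KEY" ] && echo "key loaded" || echo "no key found"
```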

curl https://api.anthropic.com/v1/messages \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "anthropic-beta: context-1m-2025-08-07" \
  -H "content-type: application/json" \
  -d '{
    "model": "claude-opus-4-6",
    "max_tokens": 1024,
    "messages": [
      {"role": "user", "content": "Process this giant context please..."}
    ]
  }'

If this returns a successful JSON response rather than an error, your API account is ready.
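In a script, you can grade the result from the HTTP status code instead of eyeballing JSON. A hypothetical helper (the function name is ours; the status meanings match what we saw: 200 when enabled, 400 when the model or beta flag is blocked):

```shell
# grade_claude_status: interpret the HTTP status from the curl check above.
grade_claude_status() {
  case "$1" in
    200)     echo "ready: 1M-context model reachable via API" ;;
    400|403) echo "blocked: model or beta flag not enabled (HTTP $1)" ;;
    *)       echo "unexpected: HTTP $1" ;;
  esac
}

# Usage with the same request, capturing only the status code:
#   status=$(curl -s -o /dev/null -w '%{http_code}' https://api.anthropic.com/v1/messages ...)
#   grade_claude_status "$status"
grade_claude_status 200   # prints "ready: 1M-context model reachable via API"
```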

3. The “API Login” Switcheroo

This is the critical step.

  1. Open a terminal in a fresh directory (one without a .claude folder).

  2. Run the login command for the Claude CLI.

  3. Crucial: Select API Login instead of the Max Subscription login.

  4. Once logged in, verify your status using /status. My client flashed a brief message noting that API funds were in use.

Note: You don’t need to perform actual work here; we are just establishing the session.

4. Return to Max Subscription

Now that the “handshake” has occurred via the API login:

  1. Exit the API session.

  2. Return to your main project folder.

  3. Log out of your current CLI session and exit.

  4. Reload and log in again—this time, select your Max Subscription.

5. Verify 1M Access

Type the following command:

/model opus[1m]

It should now accept the model change without error.

Conclusion

I haven’t bisected exactly why this works—whether it’s the $25 spend or simply the act of logging in via API—but this process successfully triggered the availability of the 1M context window on my Linux client (v2.1.38).

Hopefully, this saves you from the token-limit loop. Now, back to tinkering with agent teams.

From Zero to Tokens: ROCm 7.0.2 Quickstart on Cloudrift’s 8-GPU Node

https://netstatz.com/rocm-7-quickstart-cloudrift-mi300x/ · Wed, 22 Oct 2025

We spin up an Azure ND MI300X v5 node via CloudRift.ai, keep everything apt‑managed, and hit ~49–51 tok/s on GLM‑4.6—with the flags, caches, and gotchas (including AITer’s power‑of‑2 experts).

Why AMD Instinct MI300X Now?

AMD’s ROCm 7.x era is a real inflection point: setup on modern Debian-based distributions is finally “Day-0 simple,” while the hardware’s HBM footprint and fabric bandwidth unlock private inference at scale. At the low end, RDNA-based parts (Ryzen AI/Strix family, RX 7900 XTX) give small teams a credible, low-concurrency path; at the high end, an 8× MI300X node like this one delivers 5.3 TB/s of per-GPU memory bandwidth and 1.5 TB of pooled HBM for running ~1 TB models with room left for KV cache. This is the line where “private cloud” moves from a slide to a system.

CloudRift.ai (and why we used them – and why we’d return)

CloudRift.ai is an NVIDIA-focused provider that also has some AMD capability. Dmitry Trifonov (founder & CEO) gave us time on this AMD box specifically to validate the ROCm 7 day-0 path on vLLM without building from source. That combination—NV fleet depth plus a hard-to-find MI300X node—is why we chose them for this piece. CloudRift.ai provided access to the MI300X node used for this testing; all configurations, observations, and benchmarks are our own.

What helped us move fast on MI300X:

  • Day-0 image that matched our stack: Ubuntu 24.04.3 + 6.14 HWE with ROCm 7.0.2 already in place (no source build); containers for vLLM/SGLang/OpenWebUI/ComfyUI/Prometheus ready to pull. This alone shaved hours of OS/driver yak shaving.
  • The “right” node for very‑large models: Azure ND MI300X v5 (8× MI300X, 192 GB HBM3 each ~5.3 TB/s), dual Xeon, and 8× NVMe RAID‑0 (~30 TB)—exactly what you want for 0.5–1.0 TB checkpoints with meaningful KV headroom.
  • Pre‑wired caches & apt‑managed tooling: HF cache, Docker cache, uv via apt—so we could bind volumes, persist JIT artifacts, and keep vLLM cold‑start costs down.
  • The AMD “model shelf” (handful of FP8 giants pre‑staged) to cut first‑day download time.  We might add some matching pinned ROCm 7.0.2 / vLLM 0.11.x images preconfigured with the correct AITer tunables for different model classes just to make it even easier to jump right in.
  • A small env‑tweak that paid off big: On this node, turning off HF_HUB_ENABLE_HF_TRANSFER and setting HF_XET_HIGH_PERFORMANCE=1 pushed model pulls past 20 Gb/s—handy when you’re staging multi‑hundred‑GB repos.

    Why that matters: when you’re billed by the hour, time‑to‑first‑token beats everything. CloudRift’s curated image + caches let us spend time on models, flags, and numbers—not on package archaeology.

Our Testbed: CloudRift on Azure ND MI300X v5

The hardware behind this instance is impressive. 

  • GPUs: 8x AMD Instinct MI300X GPUs.
  • 192 GB HBM3 per GPU, ~5.3 TB/s memory bandwidth
  • CPU: 2x 4th-Gen Intel Xeon Scalable processors, totaling 96 physical cores.
  • 128 GB/s per‑link Infinity Fabric; 896 GB/s fully interconnected in an 8‑GPU UBB node
  • Storage: 8 NVMe 3.5 TB drives arranged in a RAID-0 array

These numbers matter: they let you run big models in full precision and keep a viable KV cache.

Day‑0 Experience (No Building From Source)

One of the first things we noticed was the modern software stack CloudRift.ai had running on this instance: the current Ubuntu 24.04.3 release with the 6.14 HWE kernel, the most recent kernel and OS combination fully supported by the AMD ROCm 7 series (here, ROCm 7.0.2) at the time of publication.

It is worth noting that the current guidance from Microsoft (dated 4/24/25) still calls for Ubuntu 22.04 + kernel 5.15 + ROCm 6.2.2, and even the related AzureHPC Marketplace N-series documentation (dated 10/16/25) points only to ROCm 7.0.1 sources for other AMD targets. This underscores the value of a provider that tracks the moving target; it immediately saves hours that might otherwise be spent upgrading the platform and driver stack to the latest kit.

Caches and Containers

Beyond the default operating system and driver-ready hardware, the software layer on this CloudRift.ai instance came with a handful of other convenient features.

  • Containers: ROCm-tuned vLLM and SGLang images from AMD, plus OpenWebUI, ComfyUI, and Prometheus – ready to bind caches and go.
  • Caches pre-wired: HF cache, uv, and Docker cached to a large NVMe mount with envs preset.
  • uv available via apt: drop in a uv.lock and recreate Python envs deterministically, in keeping with a pure apt-managed posture.

Throughput & Downloads (HF knobs that actually helped)

The 8 NVMe drives provide 30 TB of space in a RAID-0 configuration. One observation: the HF_HUB_ENABLE_HF_TRANSFER=1 flag we normally use to speed up model pulls over a 10 Gbps network into our HF_HOME was actually holding back throughput here. On the CloudRift.ai node, once we set HF_HUB_ENABLE_HF_TRANSFER=0 and HF_XET_HIGH_PERFORMANCE=1, speeds more than doubled from 8–10 Gb/s to over 20 Gb/s. (uv add "huggingface_hub[cli]" hf_xet)
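In shell form, the pull-speed setup looks like this (the HF_HOME path is illustrative, not the exact mount from this node):

```shell
# HF download knobs that helped on this node: hf_transfer off, the Xet
# high-performance path on. HF_HOME points the shared cache at NVMe.
export HF_HUB_ENABLE_HF_TRANSFER=0
export HF_XET_HIGH_PERFORMANCE=1
export HF_HOME=/mnt/nvme/hf-cache   # illustrative cache mount

# Staging a large FP8 checkpoint would then be, e.g.:
#   hf download zai-org/GLM-4.6-FP8
echo "$HF_HUB_ENABLE_HF_TRANSFER $HF_XET_HIGH_PERFORMANCE"   # prints "0 1"
```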

Hugging Face pull at >20 Gb/s

We downloaded a number of interesting and varied MoE model architectures, and if we find the cycles, hope to showcase a model sweep and performance gains with recent changes to ROCm7 AITer and vLLM.

$ hf cache scan
REPO ID                                 REPO TYPE SIZE ON DISK NB FILES LAST_ACCESSED LAST_MODIFIED REFS LOCAL PATH                                                                    
--------------------------------------- --------- ------------ -------- ------------- ------------- ---- ----------------------------------------------------------------------------- 
Comfy-Org/Wan_2.2_ComfyUI_Repackaged    model           677.1G       42 2 days ago    2 days ago    main /root/.cache/huggingface/hub/models--Comfy-Org--Wan_2.2_ComfyUI_Repackaged    
Comfy-Org/stable-diffusion-3.5-fp8      model            48.0G       11 2 days ago    2 days ago    main /root/.cache/huggingface/hub/models--Comfy-Org--stable-diffusion-3.5-fp8      
Qwen/Qwen-Image-Edit-2509               model            57.7G       32 1 week ago    1 week ago    main /root/.cache/huggingface/hub/models--Qwen--Qwen-Image-Edit-2509               
Qwen/Qwen1.5-MoE-A2.7B-Chat             model            28.6G       19 11 hours ago  11 hours ago  main /root/.cache/huggingface/hub/models--Qwen--Qwen1.5-MoE-A2.7B-Chat             
Qwen/Qwen2.5-VL-72B-Instruct            model           146.8G       50 1 week ago    1 week ago    main /root/.cache/huggingface/hub/models--Qwen--Qwen2.5-VL-72B-Instruct            
Qwen/Qwen3-0.6B                         model             1.5G       10 1 week ago    1 week ago    main /root/.cache/huggingface/hub/models--Qwen--Qwen3-0.6B                         
Qwen/Qwen3-235B-A22B-Instruct-2507-FP8  model           236.4G       34 1 week ago    1 week ago    main /root/.cache/huggingface/hub/models--Qwen--Qwen3-235B-A22B-Instruct-2507-FP8  
Qwen/Qwen3-235B-A22B-Thinking-2507-FP8  model           236.4G       34 2 days ago    2 days ago    main /root/.cache/huggingface/hub/models--Qwen--Qwen3-235B-A22B-Thinking-2507-FP8  
Qwen/Qwen3-30B-A3B-FP8                  model            32.5G       17 1 week ago    1 week ago    main /root/.cache/huggingface/hub/models--Qwen--Qwen3-30B-A3B-FP8                  
Qwen/Qwen3-Coder-480B-A35B-Instruct     model           960.3G      253 1 week ago    1 week ago    main /root/.cache/huggingface/hub/models--Qwen--Qwen3-Coder-480B-A35B-Instruct     
Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 model           482.2G       61 1 week ago    1 week ago    main /root/.cache/huggingface/hub/models--Qwen--Qwen3-Coder-480B-A35B-Instruct-FP8 
Qwen/Qwen3-Next-80B-A3B-Instruct        model           162.7G       51 1 week ago    1 week ago    main /root/.cache/huggingface/hub/models--Qwen--Qwen3-Next-80B-A3B-Instruct       
Qwen/Qwen3-Next-80B-A3B-Instruct-FP8    model            82.1G       18 1 week ago    1 week ago    main /root/.cache/huggingface/hub/models--Qwen--Qwen3-Next-80B-A3B-Instruct-FP8    
Qwen/Qwen3-Next-80B-A3B-Thinking-FP8    model            82.1G       18 1 week ago    1 week ago    main /root/.cache/huggingface/hub/models--Qwen--Qwen3-Next-80B-A3B-Thinking-FP8    
Qwen/Qwen3-VL-235B-A22B-Instruct-FP8    model           237.6G       36 2 days ago    2 days ago    main /root/.cache/huggingface/hub/models--Qwen--Qwen3-VL-235B-A22B-Instruct-FP8    
Qwen/Qwen3-VL-235B-A22B-Thinking        model           471.4G      108 1 week ago    1 week ago    main /root/.cache/huggingface/hub/models--Qwen--Qwen3-VL-235B-A22B-Thinking        
RedHatAI/gemma-3-27b-it-FP8-dynamic     model            29.3G       21 1 week ago    1 week ago    main /root/.cache/huggingface/hub/models--RedHatAI--gemma-3-27b-it-FP8-dynamic     
allenai/OLMoE-1B-7B-0125                model            27.7G       15 8 hours ago   8 hours ago   main /root/.cache/huggingface/hub/models--allenai--OLMoE-1B-7B-0125               
amd/Llama-3.1-405B-Instruct-FP8-KV      model           410.1G       97 1 week ago    1 week ago    main /root/.cache/huggingface/hub/models--amd--Llama-3.1-405B-Instruct-FP8-KV      
amd/Llama-3.1-70B-Instruct-FP8-KV       model            72.7G       26 1 week ago    1 week ago    main /root/.cache/huggingface/hub/models--amd--Llama-3.1-70B-Instruct-FP8-KV      
amd/Llama-3.3-70B-Instruct-FP8-KV       model            72.7G       26 1 week ago    1 week ago    main /root/.cache/huggingface/hub/models--amd--Llama-3.3-70B-Instruct-FP8-KV       
amd/Mixtral-8x22B-Instruct-v0.1-FP8-KV  model           141.0G       38 1 week ago    1 week ago    main /root/.cache/huggingface/hub/models--amd--Mixtral-8x22B-Instruct-v0.1-FP8-KV  
amd/Mixtral-8x7B-Instruct-v0.1-FP8-KV   model            50.4G       20 11 hours ago  11 hours ago  main /root/.cache/huggingface/hub/models--amd--Mixtral-8x7B-Instruct-v0.1-FP8-KV   
deepseek-ai/DeepSeek-R1-0528            model           688.6G      174 1 week ago    1 week ago    main /root/.cache/huggingface/hub/models--deepseek-ai--DeepSeek-R1-0528            
deepseek-ai/DeepSeek-V3.1-Terminus      model           688.6G      181 2 days ago    2 days ago    main /root/.cache/huggingface/hub/models--deepseek-ai--DeepSeek-V3.1-Terminus      
ibm-granite/granite-4.0-h-small         model            64.4G       26 1 week ago    1 week ago    main /root/.cache/huggingface/hub/models--ibm-granite--granite-4.0-h-small         
ibm-granite/granite-docling-258M        model           529.9M       17 1 week ago    1 week ago    main /root/.cache/huggingface/hub/models--ibm-granite--granite-docling-258M        
mistralai/Mixtral-8x22B-Instruct-v0.1   model           281.3G       68 8 hours ago   8 hours ago   main /root/.cache/huggingface/hub/models--mistralai--Mixtral-8x22B-Instruct-v0.1   
mistralai/Mixtral-8x7B-Instruct-v0.1    model           190.5G       36 11 hours ago  11 hours ago  main /root/.cache/huggingface/hub/models--mistralai--Mixtral-8x7B-Instruct-v0.1    
moonshotai/Kimi-K2-Instruct-0905        model             1.0T       80 2 days ago    2 days ago    main /root/.cache/huggingface/hub/models--moonshotai--Kimi-K2-Instruct-0905      
openai/gpt-oss-120b                     model           130.5G       27 1 week ago    1 week ago    main /root/.cache/huggingface/hub/models--openai--gpt-oss-120b                     
unsloth/GLM-4.6-GGUF                    model           389.8G        8 6 days ago    6 days ago    main /root/.cache/huggingface/hub/models--unsloth--GLM-4.6-GGUF                    
unsloth/gpt-oss-120b-BF16               model           233.7G       82 6 days ago    6 days ago    main /root/.cache/huggingface/hub/models--unsloth--gpt-oss-120b-BF16              
zai-org/GLM-4.5                         model           716.7G      101 2 days ago    2 days ago    main /root/.cache/huggingface/hub/models--zai-org--GLM-4.5                         
zai-org/GLM-4.5-Air-FP8                 model           112.6G       55 2 days ago    2 days ago    main /root/.cache/huggingface/hub/models--zai-org--GLM-4.5-Air-FP8                 
zai-org/GLM-4.5-FP8                     model           361.3G      101 2 days ago    2 days ago    main /root/.cache/huggingface/hub/models--zai-org--GLM-4.5-FP8                     
zai-org/GLM-4.6                         model           713.6G      101 2 days ago    2 days ago    main /root/.cache/huggingface/hub/models--zai-org--GLM-4.6                         
zai-org/GLM-4.6-FP8                     model           354.9G      100 2 days ago    2 days ago    main /root/.cache/huggingface/hub/models--zai-org--GLM-4.6-FP8                     
Done in 0.1s. Scanned 38 repo(s) for a total of 10.7T.

Cache Quirks and Build Sources (how we avoided 50-second cold starts)

We added some directories to store the runtime outputs from our vLLM containers' various JIT compilers. Matching environment variables are set inside the container; the basic set includes PYTORCH_TUNABLEOP_FILENAME, JIT_WORKSPACE_DIR, AITER_JIT_DIR, HF_HOME, and VLLM_CACHE_ROOT. These ensure the just-in-time paged-attention builds, kernel operators, and other cached items are not lost between container restarts, which cuts down on warm-up time and cold starts if you are iterating on the same build. AMD ships pre-tuned PyTorch TunableOp shapes inside their builds, so if you are using AMD prebuilt images, or one based on the tuning build base image, you will already have some of these benefits baked in.
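As a sketch, the bind pattern looks something like this (the image tag and host paths are illustrative, not the exact ones from this node; the device flags are the usual ROCm container set):

```shell
# Persist JIT artifacts and caches across container restarts by binding host
# directories and pointing each cache-related env var at them.
docker run -it --rm \
  --device=/dev/kfd --device=/dev/dri \
  --group-add video --ipc=host \
  -v /mnt/nvme/hf-cache:/root/.cache/huggingface \
  -v /mnt/nvme/vllm-cache:/vllm-cache \
  -e HF_HOME=/root/.cache/huggingface \
  -e VLLM_CACHE_ROOT=/vllm-cache/vllm \
  -e AITER_JIT_DIR=/vllm-cache/aiter \
  -e JIT_WORKSPACE_DIR=/vllm-cache/jit \
  -e PYTORCH_TUNABLEOP_FILENAME=/vllm-cache/tunableop/tuned%d.csv \
  rocm/vllm:latest
```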

We also added some additional AMD nightly vLLM builds to the Docker library, as these often take advantage of cherry-picked PRs against the next stable vLLM release, and often merge in specific ROCm and AITer fixes queued for future vLLM releases once they meet a broader set of vLLM requirements. Just building against the vLLM upstream tree with nightly PyTorch wheels works, but may not match in performance.

Qwen Image Edit 2509 

The new Qwen Image Edit arrived September 25th. This Qwen Image Edit 2509 release introduces the ability to input up to three images of people, places, and things and do almost anything with them. It is a big jump from the last release, and we could not resist testing how it performed out of the box on this CloudRift node. To do this we used a simple CLI tool, qwen-image-mps; the uv recipe is very simple and shown below. The example generation, borrowed from the example at Ivan’s repository, is a great reminder of the 1999 Fluorescence image from digitalblasphemy.com, which drove this author’s Windows 98 SE desktop wallpaper at the time and is likely the origin of the magical mushroom theme.

uv add torch --index https://download.pytorch.org/whl/nightly/rocm7.0/
uv add git+https://github.com/ivanfioravanti/qwen-image-mps.git
uv add git+https://github.com/huggingface/diffusers.git

 

Example generations with Qwen Image Edit 2509:

  • “A magical forest with glowing mushrooms”: 60 s of work for 1/24th of this node’s horsepower.
  • The first Google Images mouse with no background, upscaled.
  • “Mouse Under the Mushroom”: the merged result from Qwen Image Edit 2509; the merged edit took much longer, ~10 min at 1600 px.

This type of simple editing, using about 30% of one of the 8 GPUs, is a great teaser for advanced video generation. We hope to come back and drive a full ComfyUI workflow with Qwen-Image-Edit-2509, adding the Wan2.2 models to take advantage of multiple GPUs working in parallel on the various video, audio, and image components. There is a strong use case for creatives who would prefer to iterate quickly through draft ideas on high-powered multi-GPU nodes, generating multiple images in parallel early in the creative process, then returning to their personal, slower, serialized desktop pipeline once the scope of work has been sufficiently reduced.

Granite 4

IBM’s new Granite 4 models dropped on October 2nd, and they are arguably the most relevant new models for large and small enterprises. Granite 4 brings IBM’s reputation, a North American brand message, and commercial certification together with a modern, scalable, and performant model. The model itself is a Mamba-Transformer hybrid: it has the properties of traditional transformer models, which are very accurate, as well as Mamba properties, which enable large contexts and require less memory. The two layer types are arranged in a 9:1 ratio to maintain high performance while balancing accuracy. You can read all about it on IBM’s site. The smaller versions are designed to run on low-cost GPUs, mobile devices, and embedded systems, and I have read that larger versions will be released. Today the largest is granite-4.0-h-small at 32 billion parameters, 9 billion of them active. A write-up at storagereview.com shows visually how only 72 GB of memory is required versus other models performing at the same level, and we included their diagram below.

The application for this node would be to see where the ceiling on concurrency lies. One deployment strategy, even before inference tuning, if using actual bare metal, is to set the AMD Instinct MI300X partitioning into CPX/NPS4 mode, unlocking 32 completely isolated instances to run granite-4.0-h-tiny. A second option, more relevant to the Azure configuration we are using, would be the SPX/NPS1 configuration: drive 16 instances of granite-4.0-h-small, placing 8 into each NUMA zone sharing system resources, then use vLLM and ROCm configuration to bind 2 instances to each GPU’s 192 GB of memory. Evaluating concurrency throughput vs. latency with these different patterns, while pushing overall throughput to its limit, would make an interesting experiment if we find the cycles.
Note: For anyone using this model with tool calling, it requires the Hermes tool parser enabled in vLLM 0.11.0 or newer.
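A hedged launch sketch for the small model with tool calling enabled (the parser flags are vLLM's tool-calling options; port and parallelism are illustrative):

```shell
# granite-4.0-h-small with tool calling: select the Hermes parser, and let
# the model decide when to invoke tools. TP size and port are illustrative.
vllm serve ibm-granite/granite-4.0-h-small \
  --tensor-parallel-size 1 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes \
  --port 8000
```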

VRAM Memory Graph of Comparable Models

GLM-4.6 

GLM-4.6 dropped on September 30th on Hugging Face. The ability to run very large models at full precision is what sets the 8× MI300X, with its 1.5 TB of HBM3, apart from other platforms. One of the big threads for GLM-4.6 is its claimed coding ability when given access to tools, further enhanced by an extended 201k-token context window that allows tackling larger features and more complex code with each iteration. It was no surprise to us that GLM-4.6 was released the day after Claude Sonnet 4.5.

The model loaded right up using the default AMD stable vLLM build without any additional environment flags set. The Triton inference engine was selected by default, and initially we saw a consistent 33–40 tokens/s at single concurrency. We then turned on AITer kernel ops and hit an issue: the AITer fusedMoE operator requires that the number of experts in each decode layer of the model be a power of 2. The full-precision GLM-4.6 has 160 experts in each of its 92 decode layers, which causes a full stop on the normal path where AITer is simply turned on. The specific error message is:

Worker failed with error 'num_experts must be a power of 2, but got 160' 
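The constraint is easy to check ahead of time for any model config; a tiny sketch of the power-of-2 test (the function name is ours):

```shell
# is_pow2 N: succeeds iff N is a power of two. A power of two has exactly
# one bit set, so N & (N-1) is zero.
is_pow2() {
  [ "$1" -gt 0 ] && [ $(( $1 & ($1 - 1) )) -eq 0 ]
}

is_pow2 160 && echo "fusedMoE compatible" \
            || echo "160 experts: not a power of 2"
# prints "160 experts: not a power of 2"
```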

Early Numbers: Big-Model Sweet Spot

The quick solution, to get AITer into the mix, is to use a few environment variables to steer vLLM's internal logic away from the AITer fusedMoE operator and toward another modular one, without disabling other AITer optimizations.

VLLM_ROCM_USE_AITER=1
VLLM_ROCM_USE_AITER_MOE=0
VLLM_USE_FLASHINFER_MOE_FP16=1

OpenRouter GLM-4.6 Provider Check

This yielded about 45 tokens/s, ahead of most of the day-zero BF16 offerings on openrouter.ai at the time (which were serving higher batch sizes), so we decided this was sufficient throughput to evaluate how the model performed.
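Putting the detour together, a launch sketch (the environment variables are the ones above; tensor parallelism matches this node, other flags omitted for brevity):

```shell
# Keep AITer on overall, but route MoE away from the fusedMoE operator,
# since GLM-4.6's 160 experts per layer are not a power of 2.
export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_USE_AITER_MOE=0
export VLLM_USE_FLASHINFER_MOE_FP16=1

# Full-precision GLM-4.6 spans all 8 MI300X GPUs on this node.
vllm serve zai-org/GLM-4.6 \
  --tensor-parallel-size 8
```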


We made one more switch, to the current AMD nightly build with no changes to our environment, and picked up a few more tokens/s plus improved latency, producing a vLLM bench pattern consistently close to the one below.

============ Serving Benchmark Result ============
Successful requests:                     1         
Request rate configured (RPS):           10000.00  
Benchmark duration (s):                  20.23  
Total input tokens:                      8000      
Total generated tokens:                  1000      
Request throughput (req/s):              0.05      
Output token throughput (tok/s):         49.43     
Peak output token throughput (tok/s):    51.00     
Peak concurrent requests:                1.00      
Total Token throughput (tok/s):          444.83    
---------------Time to First Token----------------
Mean TTFT (ms):                          511.79    
Median TTFT (ms):                        511.79    
P99 TTFT (ms):                           511.79    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          19.74     
Median TPOT (ms):                        19.74     
P99 TPOT (ms):                           19.74     
---------------Inter-token Latency----------------
Mean ITL (ms):                           19.74     
Median ITL (ms):                         19.83     
P99 ITL (ms):                            20.90     
==================================================
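As a sanity check on these numbers: mean TPOT should sit near 1000 divided by the steady output rate, since TPOT excludes only the first token. A quick awk check:

```shell
# Reported: 49.43 tok/s output throughput, 19.74 ms mean TPOT.
# 1000 / 49.43 gives the ms-per-token implied by throughput, which includes
# the 512 ms TTFT amortized over 1000 tokens; the reported TPOT (first
# token excluded) comes in slightly lower, as expected.
awk 'BEGIN {
  tok_s = 49.43
  printf "implied ms/token: %.2f (reported TPOT: 19.74 ms)\n", 1000 / tok_s
}'
```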

vLLM ROCm Environment Complexities

Looking at what to tune next, the myriad vLLM environment variables and serve settings need to be reviewed, and then the impactful options tested under different workloads to find the optimal path for GLM-4.6. The constant momentum of changes and improvements to both vLLM and the ROCm AITer stack continues to drive new optimizations weekly. Reviewing the current vLLM options in the repository is a good place to start understanding which knobs can be turned, and the AITer and vLLM release notes often surface new features along with their associated environment flags or configuration options.

The volume of tunables makes performance evaluation incredibly nuanced.  New features, attention logic, kv cache capabilities and kernel operators are constantly changing with new options available all the time.  Within the AMD family, moving between CDNA and RDNA platforms requires keeping track of patterns for compatibility and for breaking changes.  For example, on the AMD Instinct MI300X we have a bunch of static settings that are generally always on.

HIP_FORCE_DEV_KERNARG=1             # Put kernel arguments directly into device memory
NCCL_MIN_NCHANNELS=112              # Turn on all channels in the Infinity Fabric between all GPUs
TORCH_BLAS_PREFER_HIPBLASLT=1       # Prefer hipBLASLt over rocBLAS for performance
HSA_OVERRIDE_CPU_AFFINITY_DEBUG=0   # Keep ROCm threads aligned with their NUMA zone, given the dual 48-core Intel backend
SAFETENSORS_FAST_GPU=1              # Load safetensors directly to GPU memory

Environment variables for AITer, particularly transient ones for new features, are sometimes documented only in PRs and are not fully merged upstream, at least from a documentation perspective.

AITER_JIT_DIR – works with JIT_WORKSPACE_DIR to store dynamically generated AITer kernels in a separate cache outside the build environment path, avoiding the delay of rebuilding them each time vLLM starts up.
AITER_ONLINE_TUNE=1 – allows padding to find a compatible kernel when the tuning logic determines there is no match.

The list available in the vLLM source is the best reference for current flags and toggles; it runs almost a full minor version ahead of the online documentation, which showed a modification date almost 3 months old at the time of writing.

When new models are released, novel configurations can require selecting a specific attention backend to govern which logic path a particular model follows. Recently a number of older v0 backends were disabled by default, but there is still quite a bit of choice for the AMD ROCm stack on v1 in vLLM 0.11.0. As you iterate through models and configurations, the high-level attention backend is always emitted on the console at startup once the decision-tree logic begins to execute; it is a good indicator once you are familiar with an optimal path.

Using Triton MLA backend on V1 engine.
Using AITER MLA backend on V1 engine.
Using Aiter Unified Attention backend on V1 engine.
Using Rocm Attention backend on V1 engine.
Using Flash Attention backend on V1 engine.
Using Triton Attention backend on V1 engine.
Using Aiter Flash Attention backend on V1 engine.

Deepwiki does a great job of visualizing the decision-tree logic. A good starting point on AMD is VLLM_ROCM_USE_AITER=1. Both AMD and vLLM are working on more automatic optimization for v0.12.0, and on AITer specifically if this PR is accepted in a future version.

All of these variations present a real challenge for a model sweep that aims to test each knob, whether looking for the fastest path in a given scenario or evaluating a specific improvement in the stack. Generating performance metrics with tools like MAD works, but there are gaps in environment setup, container launching, and command-line parameters, so using it requires quite a bit of 'wrapper' effort to get the job done. Even with recipes for tuning batch sizes, context, sequence length, concurrency, and so on, a lot of work remains in discovering the optimal configuration for any model.

Vibe Coding with GLM-4.6

We decided that solving this problem, or at least making our model sweep more efficient, was also a good challenge for GLM-4.6 at full precision. For the time being we parked our ambitious model-sweep goal and focused on giving GLM-4.6 a fair shake at coding, whilst also improving our tooling for any future benchmarking. This is where personal interest sidelined our initial goals. We wired Continue.dev into Visual Studio Code for the task. At full precision, the GLM-4.6 model requires all 8 GPUs in operation for any reasonable KV cache.

We wired up some MCP connectors (Playwright, and Google CSP tuned to our documentation; we skipped Serena for now) and started building a vLLM benchmark orchestration engine, complete with results collation and generation of standard graphical patterns using pandas and seaborn. It was fun and impressive: it just goes until the work is done and tested, with each iteration. GLM-4.6 stumbled periodically with one of the Continue.dev tools, but we managed to work around that with a custom rule. To test both our orchestration of LLM benchmarking and live inference on GLM-4.6 for development on the same machine, we ran two instances of vLLM: one with GLM-4.6 limited to 0.85 of GPU memory (~1.3 TB of HBM3), and a second with Qwen3-0.6B in 230 GB of HBM3, where we iterated through settings, builds, environment, and so on. The experience of using GLM-4.6 in Continue was nothing short of great. It felt enlightening to have an experience very similar to Claude Sonnet 4.5, except with no data going to the cloud whatsoever, using the CloudRift.ai node both to host the brain of our lead programmer and to run the framework we were building and orchestrating through a separate vLLM instance.
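The dual-instance setup can be sketched as two serve commands (the 0.85 cap is from our setup; ports and the small instance's memory fraction are illustrative):

```shell
# Instance 1: full-precision GLM-4.6 across all 8 GPUs, capped at 85% of HBM
# so a second server can coexist on the node.
vllm serve zai-org/GLM-4.6 \
  --tensor-parallel-size 8 \
  --gpu-memory-utilization 0.85 \
  --port 8000 &

# Instance 2: Qwen3-0.6B in the remaining headroom, used as the guinea pig
# for iterating on settings, builds, and environment flags.
vllm serve Qwen/Qwen3-0.6B \
  --gpu-memory-utilization 0.10 \
  --port 8001 &
```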

(APIServer pid=23628) INFO 10-11 17:54:18 [loggers.py:161] Engine 000: Avg prompt throughput: 15454.4 tokens/s, Avg generation throughput: 47.3 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.5%, Prefix cache hit rate: 70.6%
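As a rough sketch, the dual-instance setup above could look like the two `vllm serve` invocations below; the exact flags, ports, and memory fractions are illustrative assumptions, not our exact command lines:

```shell
# Sketch: build the two vLLM launch commands (flags assumed; adjust to your node),
# then run them only if vllm is actually installed.
GLM_CMD="vllm serve zai-org/GLM-4.6 --tensor-parallel-size 8 --gpu-memory-utilization 0.85 --port 8000"
SWEEP_CMD="vllm serve Qwen/Qwen3-0.6B --gpu-memory-utilization 0.10 --port 8001"
echo "$GLM_CMD"    # lead-programmer instance (Continue.dev points here)
echo "$SWEEP_CMD"  # small-model instance for iterating on settings/builds

if command -v vllm >/dev/null 2>&1; then
  $GLM_CMD &
  $SWEEP_CMD &
fi
```

Capping the first instance at 0.85 of GPU memory is what leaves headroom for the second server on the same 8x MI300X node.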

Continue.dev and GLM-4.6 vs Claude Sonnet 4.5

Using Continue.dev in VS Code, we iterated a number of new features into our vLLM orchestration and analysis engine. One of the big differences over Claude Sonnet 4.5 was the full 201k token context. In Claude, we get nervous when the 'context pie' indicates less than 30%, and often very quickly end up with the message 'This session is being continued from a previous conversation that ran out of context.' The result is that our behavior has changed. For example, in Claude, after solving a bug that required the agent to make a few unsuccessful iterations, we now prematurely ask Claude to stop and update architecture documentation to capture that learned pattern before the detail is lost when the context is compacted. Furthermore, we will now often document a fix before the agent has fully tested it, to eliminate the risk of losing that pattern if the context expires while resolving any fallout from final testing. With Continue.dev and GLM-4.6, the opposite was the case. In Continue, when the 'context battery' is near 66% full, it can still take considerable time and use before the warning about context running out is presented, and that still leaves us plenty of context for documentation updates. The result is that we are more comfortable taking on larger feature iterations, and we capture tricky patterns better, which in turn makes any future context more effective. The best analogy I can come up with is that Continue.dev with GLM-4.6 feels like a car that, when showing 1 km of gas left in the tank, can actually drive another 20 km. Claude, on the other hand, can feel like a car indicating 40 km of gas left in the tank that then runs out unexpectedly after 20 km.
Disclaimer: It is early days; we are Claude Max subscribers and do not have the deep experience to qualify the development output of either agent. We have already learned a few more tricks that might give Claude the upper hand.

Simplified Chinese Leak

GLM-4.6 output on 8x MI300X using vLLM and Continue.dev

Simplified Chinese in our GLM-4.6 output

One of our goals was to process enough tokens to observe non-English characters coming out of GLM-4.6, after reading that this was fairly common. It only took about an hour of use, improving our Ansible playbook for the pending vLLM matrix bench run, before we hit our first non-English language leak.

If you translate the simplified Chinese characters, they read "Finally, let me update the label scheme section and reflect the current bs32 settings." It was just a blip, and we trucked on. This could probably be solved easily with a negative rule, similar to when using generative prompts with Qwen-Image-Edit-2509. I would say the inconvenience factor was in the realm of Claude Sonnet 4.5 asking to /renew (a function that doesn't exist), requiring just a context replay to pass by. The important fact is that this output was not garbage, but merely multi-lingual behavior: "sleep talking" in its native tongue. I kind of like the idea that what looks like improperly decoded characters is, with the right cipher, actually accurate and precise. Fine-tune away, but I prefer "always right, but sometimes in another language" over any strategy that eliminates the problem while degrading either accuracy or precision. In fact, downstream from an effective planning agent, it would not have any impact. It possibly goes the other way, if unstructured data gathering also needs to be translated [on the fly]. Enter the new language CrewAI.

GLM 4.6 Benchmarking

As much as we wanted to just continue using GLM-4.6, we could not benchmark it and use it for development at the same time. We switched over to Claude Sonnet 4.5 to handle feature refinements while we ran the first series of benchmarks. The new orchestration system can produce a variety of different types of graphics. It processes the vLLM bench serve JSON labels, which include details on each build, environment, and command-line parameters, to automate the presentation of results for a variety of scenarios without having to manually input or update titles, captions, legends, axis labels, themes, etc.
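As a minimal sketch of that collation step, the snippet below parses a mocked-up result file; the field names are assumptions for illustration, so check the actual JSON your `vllm bench serve --save-result` run emits:

```shell
# Create a mocked-up result file (field names are illustrative, not vLLM's exact schema)
cat > result.json <<'EOF'
{"model_id": "zai-org/GLM-4.6", "total_token_throughput": 5120.4, "p99_tpot_ms": 22.1}
EOF

# Pull out the metrics a collation pass would feed into pandas/seaborn
python3 - <<'EOF'
import json
r = json.load(open("result.json"))
print(f'{r["model_id"]}: {r["total_token_throughput"]:.0f} tok/s, P99 TPOT {r["p99_tpot_ms"]} ms')
EOF
# -> zai-org/GLM-4.6: 5120 tok/s, P99 TPOT 22.1 ms
```

With the metadata embedded in each result file, titles, legends, and facet labels can be generated rather than hand-edited.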

In the bench runs so far, there is already a lot of variability; below are some early image examples. The performant paths are farther down our matrix, so we have lots of gains to make. It is taking far longer than anticipated, as some variants spend up to five minutes loading tensors, generating graphs, and building kernel operators before the benchmark even starts, while the benchmark itself ranges from a few seconds up to five minutes. We are going to restructure our test plan to skip right to performant configurations, rather than be comprehensive, as our time on this system is limited. We intend to share some great stepwise recipes for vLLM on large models like GLM-4.6, Ring-1T, and others, probably each with a model-specific post. The partial iteration below indicates that the third facet is getting patterns before the first two, as it shows a few more results. At least one path in the batch of 16 has found some gains. Others failed or were removed as invalid (missing bars), which can happen on a first pass if a JIT build takes place; this is identified by large deviations in the P99 latency and automatically excluded from the data load. The actual results are a WIP, so this is just a sample.

A chart example showing benchmark matrix mid run on various ROCm configurations

Faceted Dodges of different workloads from our new vLLM Orchestration Engine on various ROCm scenarios (labels removed)

GLM 4.6-FP8 and GLM-4.6-Air

The GLM-4.6-Air series, when released, is expected to have an experts-per-layer count that is a power of 2, which works with the AITer fusedMoE operator. That will allow us to benchmark that kernel against the ones we are using for GLM-4.6, where we expect significant throughput gains. The FP8 version also provides opportunities to increase throughput or decrease latency at the same context length, depending on the approach to parallelism across multiple GPUs. Given that the accuracy loss with FP8 is considered negligible vs. the full-precision BF16 we have been running up until now, we fully expect FP8 to be fast and furious. In theory, we expect to double the batch size while keeping other parameters unchanged.
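The doubled-batch intuition follows from weight bytes alone. A back-of-envelope sketch, where the ~355B total parameter count is an assumption and KV cache and activation overheads are ignored:

```shell
# Rough weight-memory arithmetic (assumes ~355B total params; real deployments add overhead)
params_b=355                  # billions of parameters (assumed)
bf16_gb=$(( params_b * 2 ))   # BF16: 2 bytes/param -> ~710 GB of weights
fp8_gb=$(( params_b * 1 ))    # FP8:  1 byte/param  -> ~355 GB of weights
echo "freed for KV cache / larger batches: ~$(( bf16_gb - fp8_gb )) GB"
# -> freed for KV cache / larger batches: ~355 GB
```

Halving the weight footprint roughly doubles the memory available for KV cache, which is where the batch-size headroom comes from.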

What a Remarkable 4 Weeks for Open Source AI

https://github.com/pytorch/pytorch/releases/tag/v2.9.0
https://github.com/ROCm/ROCm/releases/tag/rocm-7.0.2
https://github.com/vllm-project/vllm/releases/tag/v0.11.0
https://huggingface.co/zai-org/GLM-4.6
https://huggingface.co/inclusionAI/Ring-1T
https://huggingface.co/Qwen/Qwen-Image-Edit-2509
https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct
https://huggingface.co/ibm-granite/granite-4.0-h-small
https://huggingface.co/Wan-AI/Wan2.2-Animate-14B

Ready to Build on This?

Building for the future without creating technical debt is a powerful paradigm. We enjoy unfurling the details and coming up with strategies that avoid building from source while preserving a forward-thinking upgrade and security management strategy. If you are looking for help along the lines of the things discussed in this post, contact us to see how we can help. If you're in the Toronto area, we can grab a coffee (or a beer) and talk shop.

Strix Halo on Ubuntu looks great https://netstatz.com/strix_halo_lemonade/ Thu, 04 Sep 2025 02:11:24 +0000 https://netstatz.com/?p=6520

Run GPT‑OSS‑120B on Strix Halo (Ubuntu 25.10) — 40-45 tok/s, no containers

On current Strix Halo boxes (e.g., Ryzen AI MAX+), Ubuntu 25.10 "just works": the stock kernel recognizes AMDXDNA, and AMD's Instinct 30.20 (amdgpu) + ROCm 7 packages install entirely via APT, with no compiling, git pulls, or tarballs. With Lemonade on Strix Halo, you can serve gpt-oss-120b (GGUF) on the iGPU through llama.cpp-ROCm and expose an OpenAI-compatible API. The setup is fully reproducible using uv, can run headless, and takes very little time to get going. Updated on 11/03/25 to reflect the Ubuntu 25.10, ROCm 7.1, and TheRock 7.10 releases.

All steps below use Ubuntu/Debian tooling (apt, bash, vi) and prioritize forward‑compatibility, and easy rollback.

Dual Boot Setup

  • Cloned the original SSD to a 2nd M.2 NVMe drive using gparted from the Ubuntu USB live installer.
  • Resized C: and Moved Recovery to create free space at the end of the disk, preserving Win11 recovery actions.
  • In the system BIOS: enable SR-IOV/IOMMU and leave Secure Boot ON (this allows us to enroll a MOK for DKMS). DEL to enter BIOS, F7 for boot selection on the Bosgame M5.
root@ai2:~# uname -a
Linux ai2 6.17.0-5-generic #5-Ubuntu SMP PREEMPT_DYNAMIC Mon Sep 22 10:00:33 UTC 2025 x86_64 GNU/Linux
imac@ai2:~$ journalctl -k | grep -i amdxdna
Sep 03 11:39:43 ai2 kernel: amdxdna 0000:c6:00.1: enabling device (0000 -> 0002)
Sep 03 11:39:44 ai2 kernel: [drm] Initialized amdxdna_accel_driver 0.0.0 for 0000:c6:00.1 on minor 0

The 2TB NVMe drive that came with this box is shown below. It was modified using gparted on the Ubuntu USB live boot prior to installation. New Linux users with a shipped device that includes Windows 11 may opt to create a single new p5 partition for the Ubuntu 25.10 instance and skip the additional partitioning exercise. Creating a second partition (p6) is not required for running Lemonade on Strix Halo and has no impact on any steps described in this post. The Ubuntu installer will allocate free space to a selected new partition during installation, and can "Install Ubuntu alongside Windows", handling all resizing on its own.

Disk /dev/nvme0n1: 1.86 TiB, 2048408248320 bytes, 4000797360 sectors
Disk model: KINGSTON OM8PGP42048N-A0
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 79708580-B666-424E-8D0D-C785190FA328

Device Start End Sectors Size Type
/dev/nvme0n1p1 2048 206847 204800 100M EFI System
/dev/nvme0n1p2 206848 239615 32768 16M Microsoft reserved
/dev/nvme0n1p3 239616 616009727 615770112 293.6G Microsoft basic data
/dev/nvme0n1p4 616009728 618057727 2048000 1000M Windows recovery environm
/dev/nvme0n1p5 618057728 1071108095 453050368 216G Linux filesystem
/dev/nvme0n1p6 1071108096 4000794623 2929686528 1.4T Linux filesystem

The Strix Halo crypto performance is excellent. I wrap nvme0n1p6 with LUKS encryption, and the system hardly blinks, hitting over 13 GB/s in hardware-supported decode.  The second M.2 slot creates an opportunity for RAID 1+0 reconfiguration for added performance and redundancy, should those become considerations for a longer term deployment plan.

Strix Halo - Bosgame M5

Dual PCIe 4.0 M.2

imac@ai2:~$ lsblk
...
nvme0n1 259:0 0 1.9T 0 disk 
├─nvme0n1p1 259:1 0 100M 0 part /boot/efi
├─nvme0n1p2 259:2 0 16M 0 part 
├─nvme0n1p3 259:3 0 293.6G 0 part 
├─nvme0n1p4 259:4 0 1000M 0 part 
├─nvme0n1p5 259:5 0 216G 0 part /
└─nvme0n1p6 259:6 0 1.4T 0 part 
└─lvm_crypt 252:0 0 1.4T 0 crypt 
└─nvme1-models 252:1 0 500G 0 lvm /mnt/models

imac@ai2:~$ cryptsetup benchmark
...
aes-xts 256b 13151.8 MiB/s 13010.5 MiB/s
...

Add AMDGPU & ROCm APT repos

(Updated: 10/22/25) Using the amdgpu-install package is the recommended way to go if you are not an apt native. AMD have recently updated their documentation, and the instructions found there should just work now. On newer Ubuntu 25.04 and 25.10 systems, if you are going to use the installer, ignore prompts from apt recommending sudo apt modernize-sources, as it will result in duplicate files being created on apt upgrades until AMD adopts the newer .sources format over the legacy .list format still required to support Ubuntu 22.04.

Via Install Wrapper

wget https://repo.radeon.com/amdgpu-install/7.1/ubuntu/noble/amdgpu-install_7.1.70100-1_all.deb
sudo apt install ./amdgpu-install_7.1.70100-1_all.deb
sudo amdgpu-install  --usecase=rocm

Via Apt Directly

You can skip this section if you used the install wrapper deb. You can also browse for newer releases here and here. NOTE: Updated 11/25 to '7.1' and '30.20' in the apt sources. For the Instinct 30.20 amdgpu drivers, Debian 13 and Ubuntu 25.x use the 'noble' repository; older distributions use the 'jammy' repository to maintain ABI compatibility. Similarly, newer distributions (i.e. Ubuntu >25.10, Debian >13) will use the amdgpu component that ships with the kernel package, so the amdgpu-dkms package can be left out entirely, though the amdgpu-dkms-firmware package may still be required to ensure you have the current Strix Halo firmware.

# Key (/etc/apt/keyrings for user managed vs. /usr/share/keyrings where packages deploy)
curl -fsSL https://repo.radeon.com/rocm/rocm.gpg.key \
| sudo gpg --dearmor -o /etc/apt/keyrings/rocm.gpg

# AMDGPU 30.20
echo 'Types: deb
URIs: https://repo.radeon.com/amdgpu/30.20/ubuntu/
Suites: noble
Components: main
Signed-By: /etc/apt/keyrings/rocm.gpg' \
| sudo tee /etc/apt/sources.list.d/amdgpu.sources

# ROCm 7.1
echo 'Types: deb
URIs: https://repo.radeon.com/rocm/apt/7.1/
Suites: noble
Components: main
Signed-By: /etc/apt/keyrings/rocm.gpg' \
| sudo tee /etc/apt/sources.list.d/rocm.sources

# apt preferences
echo 'Package: *
Pin: release o=repo.radeon.com
Pin-Priority: 1001' \
| sudo tee /etc/apt/preferences.d/rocm-radeon-pin

sudo apt update
sudo apt install rocm rocminfo
sudo apt install amdgpu-dkms

Enroll Machine Owner Key (MOK)

Secure Boot note: During DKMS install you’ll set a one‑time MOK password and enroll it on the next reboot so the kernel can load signed modules.

You can use the installer to kick off the package installation or, again, do it yourself. If you want to use Ubuntu 25.04 rather than 25.10, simply install the amdgpu-dkms package.

Add your local user to the groups with access to the GPU hardware

Set Group Permissions and Check Status

# Group permissions for the local user to access the GPU hardware 
sudo usermod -a -G render,video $LOGNAME

The rocm install pulls in a bunch of math libs and runtime packages. (rocblas rocsparse rocfft rocrand miopen-hip rocm-core hip-runtime-amd rocminfo rocm-hip-libraries)

If you are coming from ROCm 6.4.4, you may be used to the unified version numbering for the ROCm and Instinct components. This diagram from the AMD ROCm blog shows how the ROCm Toolkit and Instinct Driver (amdgpu) now evolve on separate paths. ROCm Instinct Bifurcation

imac@ai2:~$ modinfo amdgpu | head -n 3
filename: /lib/modules/6.17.0-7-generic/updates/dkms/amdgpu.ko.zst
version: 6.16.6
license: GPL and additional rights
imac@ai2:~$ rocminfo | head -n 1
ROCk module version 6.16.6 is loaded

imac@ai2:~$ apt show rocm-libs -a
Package: rocm-libs
Version: 7.1.0.70100-20~24.04
Priority: optional
Section: devel
Maintainer: ROCm Dev Support <[email protected]>
Installed-Size: 13.3 kB
Depends: hipblas (= 3.1.0.70100-20~24.04), hipblaslt (= 1.1.0.70100-20~24.04), hipfft (= 1.0.21.70100-20~24.04), hipsolver (= 3.1.0.70100-20~24.04), hipsparse (= 4.1.0.70100-20~24.04)>
Homepage: https://github.com/RadeonOpenCompute/ROCm
Download-Size: 1,050 B
APT-Sources: https://repo.radeon.com/rocm/apt/7.1 noble/main amd64 Packages
Description: Radeon Open Compute (ROCm) Runtime software stack

Kernel & Memory Tunables

Strix Halo's iGPU uses RDNA 3.5 and can use GTT to dynamically allocate system memory to the GPU. However, oversizing the GTT window can affect stability: if loading a large model does not leave enough room for the operating system to function, the kernel's OOM killer gets involved. If you are curious what oom-kill looks like in your kernel logs, it can be triggered with memtest_vulkan on this platform. The AMD docs additionally show a couple of environment variables that appear to control thresholds for GTT allocations to prevent this. Limited testing has shown that, when enabled and using GTT, these can prevent model loading under memory pressure.

One notable issue (not retested since 7.0.1, so it may be resolved) with Lemonade's llama.cpp+ROCm stack arises when VRAM is set to 96GB in the BIOS. With this setting, loading gpt-oss-120b, or any large model, can leave you waiting endlessly for the model to load. This appears to be the mmap-enabled strategy in llama.cpp misbehaving, as noted in a Lemonade issue here and here. The llama-server and llama-bench binaries under the hood of Lemonade work with --no-mmap passed directly to them, so for those trying to use Lemonade with 96GB VRAM pinned in the BIOS, there may be an option soon.

With Strix Halo, using GTT mode does not have the same restriction; you can load any model as long as you have enough free memory. We set GTT to 125 GiB on a headless system accessed only by ssh in multi-user mode.

You have two strategies for gpt-oss-120b or other large models with Lemonade v8.1.12: either a) tune the VRAM BIOS setting down and allocate GTT as you like (32768000 pages = ~125 GiB), or b) set VRAM in the BIOS to 64 or 96 GiB and skip GTT. It is unclear to this author what benefit there is to leaving GTT allocated when VRAM is set in the BIOS. For b), I set GTT to a low value of 512M, simply to avoid the default allocation of 16GB.

One note from my experience: using GTT results in slightly lower gpt-oss-120b TPS (38-41) compared to VRAM (40-45). However, I have not tested this in a structured manner, or as extensively as Leonard Lin. YMMV. It looks like ROCWMMA is right around the corner, which should show up in a uv package upgrade shortly, along with using hipblaslt more effectively and enabling NPU capabilities in Linux.

a) Low VRAM, High GTT

The VRAM is set to 512MB in the BIOS and GTT is set to 125 GiB in the kernel parameters. GTT at 125 GiB is enough to load a top-ten coding model (#6 on 9/14/25) like GLM-4.5-Air-GGUF with 6-bit quantization in Lemonade today. If you are using the desktop of your Strix Halo, you may need to roll this back to 105GB (27648000) to reserve space for desktop applications. (Updated: 10/22/25 to address release changes with 7.0.2 – https://github.com/ROCm/ROCm/issues/5562)

sudo vi /etc/default/grub.d/amd_ttm.cfg

# /etc/default/grub.d/amd_ttm.cfg
GRUB_CMDLINE_LINUX="${GRUB_CMDLINE_LINUX:+$GRUB_CMDLINE_LINUX }transparent_hugepage=always numa_balancing=disable ttm.pages_limit=32768000 amdttm.pages_limit=32768000"
sudo update-grub
sudo reboot

imac@ai2:~$ sudo dmesg | egrep "amdgpu: .*memory"
[ 3.375071] [drm] amdgpu: 512M of VRAM memory ready
[ 3.375074] [drm] amdgpu: 128000M of GTT memory ready.

With GTT set to 125GB, we can now load models beyond the 96GB limit that applies when using the BIOS to set the VRAM allocation for the GPU. On our optimized headless system (systemctl set-default multi-user.target), we load GLM-4.5-Air-UD-Q6_K_XL and Qwen3-235B-A22B-Instruct-2507-Q3_K_M. Qwen3 235B yields about 12 TPS today. This platform's ability to go beyond 96GB is exciting and unique. In earlier versions you also had to set GGML_CUDA_ENABLE_UNIFIED_MEMORY=1, described below, but this is no longer required.

Some common values for GTT are noted below for convenience:

131072=512 MiB
2097152=8 GiB
27648000=~105 GiB
31457280=~120 GiB
32768000=~125 GiB
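These values are counts of 4 KiB pages, so converting a candidate pages_limit to the MiB figure dmesg reports is simple arithmetic; a small helper (the function name is just for illustration) to sanity-check a value before editing grub:

```shell
# ttm/amdttm pages_limit counts 4 KiB pages; convert to the MiB figure dmesg reports
pages_to_mib() { echo $(( $1 * 4096 / 1048576 )); }

pages_to_mib 131072     # -> 512    (512 MiB)
pages_to_mib 27648000   # -> 108000 (~105 GiB)
pages_to_mib 32768000   # -> 128000 (~125 GiB, matching "128000M of GTT memory ready")
```

Going the other way, multiply the desired MiB by 256 to get the pages_limit value.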

When the system is shuffling memory around (dumping disk cache), you can see the movement in top output. GPT-OSS-120B takes about a minute to load; GLM-4.5-Air-UD-Q6_K_XL and Qwen3-235B-A22B-Instruct-2507-Q3_K_M take about five minutes. Sometimes we see the following kernel message as memory is being shuffled around.

kernel: workqueue: svm_range_restore_work [amdgpu] hogged CPU for >10000us 19 times, consider switching to WQ_UNBOUND

b) High VRAM, Low GTT

The VRAM is set to 64 or 96 GiB in the BIOS and GTT is set to 512M in the kernel module parameters. This is currently slightly more performant than Low VRAM, High GTT, but only by a few TPS for Lemonade on Strix Halo running gpt-oss-120b, so we opt to stay in GTT mode as our daily driver. (Updated: 10/22/25 to address release changes with 7.0.2 – https://github.com/ROCm/ROCm/issues/5562)

sudo vi /etc/default/grub.d/amd_ttm.cfg

# /etc/default/grub.d/amd_ttm.cfg
GRUB_CMDLINE_LINUX="${GRUB_CMDLINE_LINUX:+$GRUB_CMDLINE_LINUX }transparent_hugepage=always numa_balancing=disable ttm.pages_limit=131072 amdttm.pages_limit=131072"

sudo update-grub
sudo reboot

In earlier ROCm releases, when operating with BIOS VRAM pinned to 96GB (leaving 32GB of system memory) and trying to load a large model in Lemonade (~>63GB), we saw the following message, often repeated many times, followed by a failure to load the model.

Sep 02 17:49:11 ai2 kernel: amdgpu: SVM mapping failed, exceeds resident system memory limit

The dmesg output below shows GTT set to 512MB via kernel parameters (ttm.pages_limit=131072 with the in-kernel amdgpu module, or amdttm.pages_limit=131072 when amdgpu-dkms provides it).

imac@ai2:~$ sudo dmesg | grep "GTT memory"
[ 3.333156] [drm] amdgpu: 512M of GTT memory ready.

Signal Unified Memory Allocation Logic

This setting was useful in earlier ROCm 7 releases to allow loading of large models beyond 64GB in size.  A clue as to the usefulness of this flag is that it was renamed from HIP_UMA as noted here.  You can see it is defined in our sample systemd unit template below.

export GGML_CUDA_ENABLE_UNIFIED_MEMORY=1

When enabled, this setting currently causes invalid values to be shown in rocm-smi. Below is the output after a 98GB model has been loaded. There is no inference going on; the GPU percentage reflects current load and typically goes to 99-100% when in use.

$ rocm-smi --showvbios --showmeminfo all --showuse

============================ ROCm System Management Interface ============================
========================================= VBIOS ==========================================
GPU[0] : VBIOS version: 113-STRXLGEN-001
==========================================================================================
=================================== % time GPU is busy ===================================
GPU[0] : GPU use (%): 0
==========================================================================================
================================== Memory Usage (Bytes) ==================================
GPU[0] : VRAM Total Memory (B): 536870912
GPU[0] : VRAM Total Used Memory (B): 169627648
GPU[0] : VIS_VRAM Total Memory (B): 536870912
GPU[0] : VIS_VRAM Total Used Memory (B): 169627648
GPU[0] : GTT Total Memory (B): 134217728000
GPU[0] : GTT Total Used Memory (B): 14753792
==========================================================================================
================================== End of ROCm SMI Log ===================================

Signal GEMM to use HipBlaslt

There is some design discussion on AMD's site here. Hipblaslt is available, and an environment variable ensures it is used all the time; however, there appear to be some quirks with it. You can simply export the variable to enable it on the command line. In the systemd unit template further down in this article, you can see it defined in the [Service] section.

export ROCBLAS_USE_HIPBLASLT=1

Update PCI IDs

sudo update-pciids

Swap File

I have disabled the swap file on my system; with swap enabled, the kernel tends to emit SVM messages during model load.

kernel: amdgpu: SVM mapping failed, exceeds resident system memory limit

Disabling swap is achieved by simply commenting the swapfile load out of /etc/fstab, as shown in the last line below

imac@ai2:~$ cat /etc/fstab 
# /etc/fstab: static file system information.
#
# Use 'blkid' to print the universally unique identifier for a
# device; this may be used with UUID= as a more robust way to name devices
# that works even if disks are added and removed. See fstab(5).
#
# <file system> <mount point> <type> <options> <dump> <pass>
# / was on /dev/nvme0n1p5 during curtin installation
/dev/disk/by-uuid/cea76f55-f802-4ef3-a1cd-ebda84150293 / ext4 defaults 0 1
# /boot/efi was on /dev/nvme0n1p1 during curtin installation
/dev/disk/by-uuid/7E3F-BB4F /boot/efi vfat defaults 0 1
#/swap.img none swap sw 0 0
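A hypothetical non-interactive equivalent of that edit, shown here as a dry run against a scratch file (target /etc/fstab with sudo and a backup, and follow with swapoff, when doing it for real):

```shell
# Dry run on a scratch file; the real edit targets /etc/fstab (back it up first)
printf '/swap.img\tnone\tswap\tsw\t0\t0\n' > fstab.test
sed -i 's|^/swap\.img|#/swap.img|' fstab.test
grep '^#/swap.img' fstab.test   # the swap line is now commented out
# For the running session: sudo swapoff -a
```

The change takes full effect on the next boot; swapoff -a handles the current session.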

llamacpp-rocm

The Lemonade team maintains its own build of llama.cpp and ROCm libraries. When Lemonade runs with debug enabled, you can see the LD_LIBRARY_PATH emitted, indicating the location within the Python environment. This build tracks against TheRock, providing updated ROCm support for the Strix Halo.

LD_LIBRARY_PATH=/path/to/ROCm/libraries # Lemonade sets to .venv/bin/rocm/llama_server

The first time you load Lemonade, you will see this custom build downloaded and added to your environment, as shown here for v8.1.8. The build does not change as you move up and down Lemonade versions, so be careful to wipe your Python environment if you are rolling back Lemonade versions for test scenarios.

Sep 4 15:52:55 ai2 lemonade-server-dev[4168]: INFO: Downloading llama.cpp server from https://github.com/lemonade-sdk/llamacpp-rocm/releases/download/b1021/llama-b1021-ubuntu-rocm-gfx1151-x64.zip
Sep 4 15:53:03 ai2 lemonade-server-dev[4168]: INFO: Extracting llama-b1021-ubuntu-rocm-gfx1151-x64.zip to /home/imac/src/lemonade/.venv/bin/rocm/llama_server

Now on v8.1.10 we see the following after a uv upgrade

Sep 13 19:48:29 ai2 lemonade-server-dev[4430]: INFO: Downloading llama.cpp server from https://github.com/lemonade-sdk/llamacpp-rocm/releases/download/b1057/llama-b1057-ubuntu-rocm-gfx1151-x64.zip
Sep 13 19:48:38 ai2 lemonade-server-dev[4430]: INFO: Extracting llama-b1057-ubuntu-rocm-gfx1151-x64.zip to /home/imac/src/lemonade/.venv/bin/rocm/llama_server

Setup Python venv with uv

I prefer uv over pyenv/poetry and use a packaged version from debian.griffo.io.  (10/20/25 Update – Moved to the ROCm7 nightlies for PyTorch)

# Key
curl -fsSL https://debian.griffo.io/EA0F721D231FDD3A0A17B9AC7808B4DD62C41256.asc \
| sudo gpg --dearmor -o /etc/apt/keyrings/debian.griffo.io.gpg

# Repo Source
echo 'Types: deb 
URIs: https://debian.griffo.io/apt 
Suites: trixie
Components: main
Signed-By: /etc/apt/keyrings/debian.griffo.io.gpg' \
| sudo tee /etc/apt/sources.list.d/debian.griffo.io.sources

apt update
apt install uv

Head to wherever you want to store your lemonade project. (Updated 10/20/25 from rocm6.4 wheels to rocm7.0)

cd ~/src/lemonade #replace with your own project location
uv init --python 3.13
uv add --index rocm7_nightly=https://download.pytorch.org/whl/nightly/rocm7.0/ --index-strategy unsafe-best-match --prerelease allow "torch==2.10.0.dev20251102+rocm7.0" "torchvision==0.25.0.dev20251105+rocm7.0" "torchaudio==2.10.0.dev20251105+rocm7.0"
uv add --index https://pypi.org/simple lemonade-sdk[dev]

Lemonade on Strix Halo does not require PyTorch for GGUF+ROCm, but it is useful for other LLM-related tools. Pinning the ROCm wheel extras index in your pyproject.toml helps resolve some dependency extras cleanly when you pull lemonade-sdk. This also avoids installing about 1GB of extra NVIDIA tools and libraries that will never be used with an AMD GPU. (Updated: 11/03/25 Switch to using TheRock wheels)

Run Lemonade in the Background (screen)

Running in screen allows you to start Lemonade and leave it running in the background. You can then close your terminal window.   I picked a reasonable context size, which is configurable. I also set the host so that Lemonade listens on all interfaces, not just localhost. This system is on a private network.  Do not port forward or put this system on a public IP in this configuration, please.

cd ~/src/lemonade
screen -S lemony

# inside screen:
source .venv/bin/activate

lemonade-server-dev run gpt-oss-120b-GGUF \
--ctx-size 8192 \
--llamacpp rocm \
--host 0.0.0.0 \
--log-level debug |& tee -a ~/src/lemonade/lemonade-server.log

Detach from screen with CTRL-a d. Reattach with screen -r lemony. Access: http://STRIX_HALO_LAN_IP_ADDRESS:8000 from a browser on any device on the same network.  Debug log level will output TPS and other useful information, but can be removed when not needed.
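Once it is up, a quick smoke test of the OpenAI-compatible API from another machine on the LAN might look like the following; the /api/v1 paths and model name are assumptions here, so adjust them to what your Lemonade install reports, and the chat request is skipped when the server is unreachable:

```shell
LEMON="http://STRIX_HALO_LAN_IP_ADDRESS:8000"   # replace with your server's LAN IP

# Only fire the chat request if the server answers the model-list endpoint
if curl -fsS --max-time 3 "$LEMON/api/v1/models" >/dev/null 2>&1; then
  curl -s "$LEMON/api/v1/chat/completions" \
    -H 'Content-Type: application/json' \
    -d '{"model": "gpt-oss-120b-GGUF", "messages": [{"role": "user", "content": "Say hi"}]}'
fi
```

The same base URL works as an OpenAI-compatible endpoint for clients like Open WebUI.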

Lock it down

Secure the interfaces once you have it working. In this case, only two ports are required. SSH port 22 is for administration. HTTP port 8000 is for web access to the model manager and API.

sudo ufw allow 22/tcp
sudo ufw allow 8000/tcp
sudo ufw enable

Screen is used here, but a systemd wrapper is better for long-term use, running Lemonade as a service that provides an API to something like Open WebUI. When this Strix Halo is not tied up with other workloads, a separate Debian trixie instance serves up Open WebUI to provide memory (look at your old chats) and advanced features for other local network users. It is a mature tool, great for engaging with private data, and an alternative to enterprise AI chat tool subscriptions. It's a clear candidate for enhancing household and small business productivity.

Run Lemonade at Startup (systemd)

If you are dedicating your Strix Halo to serving Lemonade, moving the service into systemd makes sense. Using your project folder and an unprivileged user, this can be accomplished with the systemd configuration below. In my case, the project location path is /home/imac/src/lemonade. Update /home/%i/src/lemonade in the unit file below to match your project location. Some of the configurable environment options are explained here.

# /etc/systemd/system/[email protected]
[Unit]
# Running as an instance with the same name as the local user
Description=Lemonade Server (ROCm) for %i
Wants=network-online.target
After=network-online.target systemd-resolved.service

# If models in LVM, or a path that might not be ready
#RequiresMountsFor=/mnt/models

[Service]
Type=simple
User=%i
# Replace with your project location
WorkingDirectory=/home/%i/src/lemonade
# Tunable environment variables 
Environment=LEMONADE_LLAMACPP=rocm
#Environment=LEMONADE_LLAMACPP=vulkan
Environment=LEMONADE_CTX_SIZE=65536
Environment=LEMONADE_HOST=0.0.0.0
Environment=LEMONADE_PORT=8000
#Environment=LEMONADE_LOG_LEVEL=debug
Environment=ROCBLAS_USE_HIPBLASLT=1
Environment=GGML_CUDA_ENABLE_UNIFIED_MEMORY=1
#If you store your models in another location, you can override the default huggingface home path 
#Environment=HF_HOME=/mnt/models/huggingface
# You can choose to load a model at startup, or wait and load a model using the web interface
#ExecStart=/home/%i/src/lemonade/.venv/bin/lemonade-server-dev run gpt-oss-120b-GGUF
ExecStart=/home/%i/src/lemonade/.venv/bin/lemonade-server-dev serve
Restart=always
RestartSec=3

[Install]
WantedBy=multi-user.target

Once the file is loaded, you can add it to your systemd configuration and enable it to start automatically using the following commands, substituting imac with your own local username.

imac@ai2:~$ sudo systemctl daemon-reload
imac@ai2:~$ systemctl enable --now [email protected]

To see the console output, you can now use journalctl just like any other service.

imac@ai2:~$ journalctl -u [email protected]
Oct 22 13:14:23 ai2 systemd[1]: Started [email protected] - Lemonade Server (ROCm) for imac.
Oct 22 13:14:24 ai2 lemonade-server-dev[4805]: INFO: Started server process [4805]
Oct 22 13:14:24 ai2 lemonade-server-dev[4805]: INFO: Waiting for application startup.
Oct 22 13:14:24 ai2 lemonade-server-dev[4805]: INFO:
Oct 22 13:14:24 ai2 lemonade-server-dev[4805]: 🍋 Lemonade Server v8.1.12 Ready!
Oct 22 13:14:24 ai2 lemonade-server-dev[4805]: 🍋 Open http://0.0.0.0:8000 in your browser for:
Oct 22 13:14:24 ai2 lemonade-server-dev[4805]: 🍋 💬 chat
Oct 22 13:14:24 ai2 lemonade-server-dev[4805]: 🍋 💻 model management
Oct 22 13:14:24 ai2 lemonade-server-dev[4805]: 🍋 📄 docs
Oct 22 13:14:27 ai2 lemonade-server-dev[4805]: INFO: Application startup complete.
Oct 22 13:14:27 ai2 lemonade-server-dev[4805]: INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
Oct 22 13:14:45 ai2 lemonade-server-dev[4805]: Starting Lemonade Server...
Oct 22 13:14:45 ai2 lemonade-server-dev[4805]: INFO: 192.168.79.143:33910 - "GET / HTTP/1.1" 200 OK
Oct 22 13:14:45 ai2 lemonade-server-dev[4805]: DEBUG: Total request time (streamed): 0.0313 seconds
Oct 22 13:14:45 ai2 lemonade-server-dev[4805]: INFO: 192.168.79.143:33910 - "GET /static/styles.css HTTP/1.1" 200 OK
Oct 22 13:14:45 ai2 lemonade-server-dev[4805]: INFO: 192.168.79.143:33914 - "GET /static/favicon.ico HTTP/1.1" 200 OK
Oct 22 13:14:45 ai2 lemonade-server-dev[4805]: DEBUG: Total request time (streamed): 0.0051 seconds
Oct 22 13:14:45 ai2 lemonade-server-dev[4805]: DEBUG: Total request time (streamed): 0.0064 seconds

Running with 128000M of GTT (~125GB) yielded about 1GB of free memory during inference with GLM-4.5-Air-GGUF-Q8_0 on a service-optimized Ubuntu 25.04 desktop system running headless with debug logging enabled. Generation was under a few TPS at 128k context, but the model loaded.

root@ai2:~# free
               total        used        free      shared  buff/cache   available
Mem:       128649272    29146652     1064980      374200    99999544    99502620
Swap:              0           0           0

Understanding the MES Firmware

ROCm 7.1 ships the current AMD MES firmware, pulled from the upstream linux-firmware repository. For Strix Halo it is repackaged in the Ubuntu distribution via the linux-firmware package, and also via the Instinct driver repository as part of the amdgpu-dkms-firmware package. If you have both packages installed, the more specific amdgpu-dkms-firmware package overrides the linux-firmware package by default. Prior to AMD Instinct 30.20, the AMD packages contained firmware that did not work with the in-kernel MES software's measures to avoid some GPU hang scenarios during long-duration workflows. There are still a few issues as of 11/03/25, as described here, and the previous workaround of loading the firmware package from the upstream Ubuntu Resolute Raccoon may be useful for future firmware issues if versions >0x80 become available ahead of the next ROCm release.

A command you can use to check your MES firmware is:

cat /sys/kernel/debug/dri/128/amdgpu_firmware_info | grep MES

You should see at least 0x80 as of 11/3/25.

$ sudo cat /sys/kernel/debug/dri/128/amdgpu_firmware_info | grep MES
MES_KIQ feature version: 6, firmware version: 0x0000006f
MES feature version: 1, firmware version: 0x00000080
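The check above is easy to automate. Below is a hedged sketch that parses the MES firmware version from debugfs and compares it against the 0x80 minimum; the card index (128) matches this system but may differ on yours, and the function names are our own:

```shell
# Read the MES firmware version from debugfs (requires root), matching the
# card index used in this article. The awk pattern skips the MES_KIQ line.
mes_version() {
  sudo cat /sys/kernel/debug/dri/128/amdgpu_firmware_info \
    | awk '/^MES feature/ {print $NF}'
}

# Hex-aware comparison: bash arithmetic accepts 0x-prefixed literals.
mes_at_least() { [ $(( $1 )) -ge $(( $2 )) ]; }

# e.g.: mes_at_least "$(mes_version)" 0x80 && echo "MES firmware OK"
```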

On Questing before the updated firmware was installed, we observed this version from linux-firmware.

MES feature version: 1, firmware version: 0x0000007e

Below are some version identifiers we have seen and where they have come from.

amdgpu-dkms-firmware  30.10.2.0.30100200-2226257.24.04 (apt install amdgpu-dkms-firmware/noble via repo.radeon.com ROCm 7.0.2)

MES_KIQ feature version: 6, firmware version: 0x0000006c
MES feature version: 1, firmware version: 0x00000077

linux-firmware 20250901.git993ff19b-0ubuntu1.2 (apt install linux-firmware/questing)

MES_KIQ feature version: 6, firmware version: 0x0000006c
MES feature version: 1, firmware version: 0x0000007e

linux-firmware 20250317.git1d4c88ee-0ubuntu1.9 (apt install linux-firmware/plucky-updates)

MES_KIQ feature version: 6, firmware version: 0x0000006c
MES feature version: 1, firmware version: 0x0000007c

amdgpu-dkms-firmware 1:6.12.12.60404-2202139.24.04 (apt install amdgpu-dkms-firmware/noble via repo.radeon.com ROCm 6.4.4 )

MES_KIQ feature version: 6, firmware version: 0x0000006c
MES feature version: 1, firmware version: 0x0000006e

linux-firmware 20251009.git46a6999a-0ubuntu1 (apt install amdgpu-dkms-firmware/racoon and also amdgpu-dkms-firmware 30.20.0.0.30200000-2238411.24.04)

MES_KIQ feature version: 6, firmware version: 0x0000006f
MES feature version: 1, firmware version: 0x00000080

Beware of unpinned Instinct apt repositories

Due to the version numbering change with Instinct 30.xx, if you installed ROCm/Instinct 6.4.4 from repo.radeon.com, you can end up in a situation where the amdgpu-dkms-firmware package never gets upgraded, because of the package numbering, unless the apt policy is set to explicitly prefer the version from AMD. This scenario, shown below with the AMD repository pinned at only 600 instead of the 1001 used in the instructions above, can be inspected via apt policy as shown below.

$ sudo apt policy amdgpu-dkms-firmware

amdgpu-dkms-firmware:
  Installed: 1:6.12.12.60404-2202139.24.04
  Candidate: 1:6.12.12.60404-2202139.24.04
  Version table:
 *** 1:6.12.12.60404-2202139.24.04 100
        100 /var/lib/dpkg/status
     30.10.2.0.30100200-2226257.24.04 600
        600 https://repo.radeon.com/amdgpu/30.10.2/ubuntu noble/main amd64 Packages

A similar situation occurs on Ubuntu 25.10 when upgrading from versions prior to ROCm 7.1, as explained here. In this case, the rocprofiler shipped with Ubuntu 25.10 interferes with dependencies of the pinned ROCm 7.1 versions.

$ sudo apt policy rocprofiler-sdk

rocprofiler-sdk:
  Installed: 1.0.0-56~24.04
  Candidate: 1.0.0-56~24.04
  Version table:
 *** 1.0.0-56~24.04 100
        100 /var/lib/dpkg/status
     1.0.0-20~24.04 600
        600 https://repo.radeon.com/rocm/apt/7.1 noble/main amd64 Packages

Pinning the AMD repository to preference 1001 resolves these situations if you come across one of them.
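A sketch of that 1001 pin is below. The preferences filename is our choice; the origin matches the repo.radeon.com host used throughout this article:

```shell
# apt preferences entry that forces packages from repo.radeon.com to win,
# even across the 30.xx version numbering change (1001 > 1000 allows
# "downgrades" by version string).
pin='Package: *
Pin: origin repo.radeon.com
Pin-Priority: 1001'
printf '%s\n' "$pin"

# Install it (requires root), then re-check:
#   printf '%s\n' "$pin" | sudo tee /etc/apt/preferences.d/radeon-repo
#   sudo apt update && apt policy amdgpu-dkms-firmware
```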

Basic Local Monitoring

Messages about VRAM and GTT allocations and ongoing SVM mapping failures

journalctl -b | egrep "amdgpu: .*memory"

Follow logs live to see errors in realtime

journalctl -f

Watch the GPU memory use

watch -n1 /opt/rocm/bin/rocm-smi --showuse

Inspect VRAM capacity

rocm-smi --showvbios --showmeminfo vram --showuse
rocm-smi --showvbios --showmeminfo gtt --showuse
rocm-smi --showvbios --showmeminfo all --showuse
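To correlate memory pressure with errors in journalctl after the fact, the rocm-smi queries above can be wrapped in a timestamped logger. This is a hedged sketch: the function name and log approach are ours, and the rocm-smi path mirrors the watch command above:

```shell
# Append timestamped VRAM snapshots to a log for later correlation with
# journalctl output. SMI can be overridden if rocm-smi lives elsewhere.
SMI="${SMI:-/opt/rocm/bin/rocm-smi}"
log_mem() {
  # One line per sample: ISO timestamp, then the rocm-smi output flattened
  # onto a single line with ';' separators.
  printf '%s %s\n' "$(date -Is)" "$("$SMI" --showmeminfo vram --showuse | tr '\n' ';')"
}
# e.g.: while true; do log_mem >> gpu-mem.log; sleep 10; done
```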

Evergreening

apt update && apt upgrade

Check for bumps in the apt repositories here and here. Move to the stable or latest when new releases become available.  Make a copy of uv.lock first for rollback.

uv lock --upgrade --index-strategy unsafe-best-match --prerelease=allow 
uv sync
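The "copy uv.lock first" advice can be wrapped in two tiny helpers (names are ours) so rollback is a one-liner:

```shell
# Snapshot a lock file before upgrading, and restore it on regression.
backup_lock()  { cp "$1" "$1.bak"; }
restore_lock() { cp "$1.bak" "$1"; }

# Typical flow:
#   backup_lock uv.lock
#   uv lock --upgrade --index-strategy unsafe-best-match --prerelease=allow
#   uv sync
#   # on regression: restore_lock uv.lock && uv sync
```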

Optionally, you can just upgrade lemonade

uv lock --upgrade-package lemonade-sdk 

Also keep an eye on your torch wheel if you are using torch, and either update the index in your pyproject.toml or remove it so that its dependencies cannot conflict with lemonade.

[project]
name = "lemonade"
version = "0.5.0"
description = "Updated 11_3_2025 on Ubuntu 25.10 with ROCm 7.1 and The Rock 7.10"
readme = "README.md"
requires-python = ">=3.13"
dependencies = [
"lemonade-sdk[dev]>=8.2.0",
"torch==2.10.0.dev20251102+rocm7.0",
"torchaudio==2.10.0.dev20251105+rocm7.0",
"torchvision==0.25.0.dev20251105+rocm7.0",
]
[tool.uv]
index-strategy = "unsafe-best-match"
prerelease = "allow"

[[tool.uv.index]]
name = "pypi"
url = "https://pypi.org/simple"
default = true

[[tool.uv.index]]
name = "rocm7"
url = "https://download.pytorch.org/whl/nightly/rocm7.0/"

[[tool.uv.index]]
name = "rocm6"
url = "https://download.pytorch.org/whl/rocm6.4"

[[tool.uv.index]]
name = "therock"
url = "https://rocm.nightlies.amd.com/v2/gfx1151/"

[tool.uv.sources]
# Choose between ROCm 6, ROCm 7 and TheRock 7.10.0
torch = { index = "rocm7" }
torchvision = { index = "rocm7" }
torchaudio = { index = "rocm7" }
pytorch-triton-rocm = { index = "rocm7" }

Initial State 9/3/25 (At Publication)

apt managed
ii linux-image-6.14.0-29-generic 6.14.0-29.29 amd64 Signed kernel image generic
ii amdgpu-dkms 1:6.14.14.30100000-2193512.24.04 all amdgpu driver in DKMS format.
ii rocm 7.0.0.70000-17~24.04 amd64 Radeon Open Compute (ROCm) software stack meta package
ii uv 0.8.14-1+trixie amd64 An extremely fast Python package and project manager, written in Rust.
uv managed
lemonade-sdk 8.1.10 / llamacpp-rocm b1021
torch 2.8.0+rocm6.4

Current State 11/03/25 (Evergreening)

At some point after initial publication, we upgraded from 25.04 (Plucky) to 25.10 (Questing) and removed amdgpu-dkms as a newer version of the module is provided with the kernel.

apt managed
ii linux-image-6.17.0-7-generic 6.17.0-7.7 amd64 Signed kernel image generic
ii rocminfo 1.0.0.70100-20~24.04 amd64 Radeon Open Compute (ROCm) Runtime rocminfo tool
ii uv 0.9.7-1+trixie amd64 An extremely fast Python package and project manager, written in Rust.
uv managed
lemonade-sdk 8.2.2 / llamacpp-rocm b1066
torch 2.7.1+rocm7.9.0rc1

Open WebUI

If you do want to spawn Open WebUI, the steps below should work on Debian Trixie, and probably also on any Ubuntu Plucky instance.

sudo apt install pkg-config python3-dev build-essential libpq-dev
uv init 
uv venv
uv python pin 3.11
uv sync
source .venv/bin/activate
uv pip install setuptools wheel
uv add open-webui
open-webui serve

Performance

With the release of ROCm 7.0, a lot of people might be wondering what kind of performance they can expect. This article was based on Ubuntu 25.04, as it provides a convenient way to enjoy a complete desktop experience right on top of a Strix Halo device while taking advantage of 100GB+ models. For more permanent workloads, we prefer pure Debian: a cleaner multi-user target that more closely resembles a commercial production environment, without any overhead from desktop packages, and with stricter policies and release cycles on the underlying operating system. Debian 13 is fully supported by AMD, and with ROCm 7.1 on Ubuntu 25.10 it seems like all current Ubuntu and Debian versions will be added to the official support matrix any day now. Below we include benchmarks on both Ubuntu and Debian, as well as a comparison to a Radeon RX 7900 XTX GPU, using some popular models at the time of writing.

llama-bench is included with the lemonade package and is easily marked executable, so there is no need to pull down additional code or build packages from source to run local benchmarks of various models. It can be pointed directly at the models downloaded via the Lemonade model manager from Hugging Face, avoiding file replication, which quickly adds up with the larger models.
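A hedged sketch of that workflow: locate GGUF files already downloaded by the model manager, then point llama-bench at one. The cache default and HF_HOME override mirror the unit file above; the llama-bench flags (-m/-ngl/-p/-n) are standard llama.cpp options, but the model path shown is a placeholder:

```shell
# List GGUF files in the Hugging Face cache (honoring HF_HOME if set, as in
# the systemd unit above) so llama-bench can reuse them without copying.
HF_DIR="${HF_HOME:-$HOME/.cache/huggingface}"
find "$HF_DIR" -name '*.gguf' 2>/dev/null || true

# Then, for example:
#   llama-bench -m <path-from-above> -ngl 99 -p 512 -n 128
```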

QWEN3-30B-Coder-A3B – 48 t/s – Strix Halo – Debian 12 – ROCm 6.4 – LLAMA-ROCM b1057 (Lemonade v8.1.10)

There is almost no change on Debian 12 (Linux 6.1) when moving from ROCm 6.4 to ROCm 7.0. Switching to Ubuntu (Linux 6.14) with ROCm 7.0 shows a 23% improvement in generation over Debian 12. We might expect similar gains over Ubuntu 22.04, and will likely upgrade to Debian 13 at a later date.

QWEN3-30B-Coder-A3B – 59 t/s – Strix Halo – Ubuntu 25.04 – ROCm 7.0 – LLAMA-ROCM b1057 (Lemonade v8.1.10)


GPT-OSS-120B – 46 t/s – Strix Halo – Debian 12 – ROCm 6.4 – LLAMA-ROCM b1057 (Lemonade v8.1.10)

The difference between Debian 12 (6.1) with ROCm 6.4 and Ubuntu 25.04 (6.14) with ROCm 7.0 for this model is much smaller than with Qwen 30B.


GPT-OSS-120B – 47 t/s – Strix Halo – Ubuntu 25.04 – ROCm 7.0 – LLAMA-ROCM b1057 (Lemonade v8.1.10)


QWEN3-30B-Coder-A3B – 98 t/s – RX 7900 XTX – Ubuntu 25.04 – ROCm 7.0 – LLAMA-ROCM b1057 (Lemonade v8.1.10)

The result below shows a 24GB RDNA 3 card executing the same test: just over twice the performance on generation, but that is about as large a model as it can handle. For 20GB models, these GPUs are available for approximately $600 USD (9/25).


QWEN3 Big and Small – 235B-Coder-A22B Instruct 2507 – 11 t/s – Strix Halo – Debian 12 – ROCm 7.0 – LLAMA-ROCM b1057 (Lemonade v8.1.10)

Here we see the difference in performance between some of the smallest and largest Qwen3 models. At 11 t/s, if there is nothing else going on, letting Qwen 235B make optimizations and improvements to existing code is just a fun background task.


Ready to Build on This?

Building for the future without creating technical debt is a powerful paradigm. But the real business advantage comes from mapping your unique business logic into multi-agent AI workflows that solve real problems and create real scalability.

At Netstatz Ltd., this is our focus. We leverage our enterprise experience to build intelligent agent systems on stable, secure, and cost-effective edge platforms like Strix Halo. If you are a small or medium-sized business looking to prototype or deploy local AI solutions, contact us to see how we can help. If you’re in the Toronto area, we can grab a coffee (or a beer) and talk shop.

]]>
The Knowledge Flywheel: How AI-Powered Wikis Forge Smarter Teams and Build the Future of Work https://netstatz.com/knowledge-flywheel/ Sun, 20 Jul 2025 20:48:27 +0000 https://netstatz.com/?p=6492
Audio Summary

The Hidden Tax on Technical Teams: Why Your Undocumented Knowledge is Costing You Millions

 

In the fast-paced world of technology, the appearance of productivity can be deceptive. Teams are constantly active—coding, deploying, troubleshooting, and meeting. Yet, beneath this surface of relentless effort lies a hidden and exorbitant tax on efficiency, a drag on innovation that silently bleeds resources from even the most capable organizations. This is the tax of undocumented knowledge, a systemic issue that manifests as “knowledge churn”—the repetitive, low-value, and deeply frustrating work of searching for or recreating information that already exists, scattered across disconnected systems and locked within the minds of individual team members. This is not a minor inconvenience; it is a multi-million-dollar liability.

 

Quantifying the “Knowledge Tax”

 

The scale of this inefficiency is staggering when quantified. Seminal research from firms like McKinsey & Company reveals a stark reality: knowledge workers spend, on average, 20% of their day—one full day per week—simply looking for the internal information they need to do their jobs. To ground this abstract percentage in financial reality, consider a mid-sized technical organization of 500 employees with an average salary of $60,000. The cost of this search activity alone amounts to approximately $6 million in annual payroll dedicated not to innovation or development, but to an internal scavenger hunt.
This “knowledge tax” is levied through multiple avenues of waste. Beyond the primary time sink of searching, research shows that knowledge workers report spending two hours recreating information that they know exists but cannot find, and 1.7 hours providing duplicate answers to questions that have been asked and answered before. The implementation of a well-structured, centralized knowledge base directly addresses this drain, with studies showing it can reduce information search time by as much as 35% [2]. This represents a direct, measurable, and substantial return on investment, freeing up millions of dollars in human capital to be redirected toward value-creating activities.

 

The High Cost of Forgetting: Employee Turnover and Knowledge Loss

 

The financial burden escalates dramatically when considering the fragility of “tribal knowledge”—the unwritten expertise, nuanced processes, and critical context that accumulates within a team. Research estimates that a staggering 42% of the knowledge required to capably perform a given role is known only by the person currently in that position [2]. This undocumented expertise is a critical organizational asset, yet it is precariously balanced on the tenure of individual employees.
When a team member leaves, a significant portion of this asset walks out the door with them. The consequences are immediate and severe: project timelines slip, costly mistakes are repeated, and the productivity of the entire team is impacted as they struggle to fill the void [3]. The onboarding process for new hires becomes a protracted and inefficient drain on the time of senior staff, who are forced to repeatedly transfer knowledge verbally instead of focusing on their own high-value tasks [6]. A robust knowledge base acts as the organization’s institutional memory, a system for capturing and preserving this vital intellectual property [5]. By documenting critical processes, best practices, and project histories, it transforms onboarding from a resource-intensive, person-to-person data dump into a structured, self-paced learning journey that empowers new hires to become productive faster [3].

 

Introducing the Concept of “Knowledge Debt”

 

To fully grasp the strategic implications, it is useful to synthesize these costs into a single, powerful metaphor: Knowledge Debt. Much like technical debt in software development, knowledge debt is the implied cost of choosing expediency over best practice. Every undocumented process, every unshared solution, every piece of critical information that remains siloed in an inbox or a private chat log adds to the principal of this debt.
This debt accrues “interest” every single day. The interest payments are the hours wasted by employees searching for information, the productivity lost to recreating solved problems, and the project delays caused by the absence of a key expert. The principal of the debt grows with every new project and every new hire, and it skyrockets when a veteran employee departs, taking their 42% of unique expertise with them.
What makes this concept so critical for leaders to understand is that knowledge debt, like financial debt, compounds. The various costs are not isolated problems but are nodes in a vicious cycle. The frustration and inefficiency caused by constant information searching leads to lower job satisfaction and reduced employee engagement [2]. Studies show that employees who feel unsupported and ill-equipped are more likely to leave, thus increasing turnover rates [2]. This higher turnover directly accelerates the loss of institutional knowledge, which in turn increases the amount of information that remaining and future employees must search for or recreate. This negative feedback loop ensures that for organizations without a formal knowledge management strategy, the problem of knowledge debt only gets worse over time, becoming a compounding strategic liability that stifles innovation, drains morale, and erodes the bottom line.

 

The Cambrian Explosion of Knowledge: How AI Is Solving the Content Bottleneck

 

For decades, the primary obstacle to adopting a comprehensive knowledge base has been the immense manual effort required for its creation and maintenance. The “activation energy” needed to populate a system from scratch was simply too high for most teams, who were already under pressure to deliver on project goals. Documentation was a task perpetually relegated to “later,” and as a result, knowledge bases often became barren wastelands of outdated or incomplete information. Today, this fundamental barrier has been shattered by a Cambrian explosion in Artificial Intelligence, particularly in its ability to process and generate human language. AI has transformed knowledge management from a manual chore into an automated, continuous process, finally making the vision of a living, breathing organizational brain an achievable reality.

 

The New Paradigm: AI-Powered Knowledge Extraction

 

The breakthrough lies in the ability of modern AI, powered by Natural Language Processing (NLP) and Machine Learning (ML), to ingest, understand, and structure the vast quantities of unstructured data that technical teams produce as a natural byproduct of their work [9]. This “project exhaust”—once relegated to dusty digital archives—is now a rich ore of organizational knowledge. AI systems can systematically process a wide array of sources, including text documents (PDFs, Word files), project plans, emails, support tickets, and chat transcripts [10].
Specialized AI techniques are used to distill intelligence from this raw data. Named Entity Recognition (NER) can automatically identify and tag key entities like project names, server IDs, software versions, and personnel. Coreference Resolution understands that “it” or “the system” in a later sentence refers to a specific server mentioned earlier. Summarization algorithms can condense lengthy email chains or project reports into concise, actionable summaries [11]. In fields like construction, NLP has even been used to extract critical knowledge about methods and dependencies directly from project schedules [13]. This automated extraction process transforms archived data from a passive storage cost into an active, value-generating asset.

 

From Ingestion to Intelligence: AI-Powered Organization and Content Creation

 

AI’s role extends far beyond simple extraction. Once the knowledge is ingested, AI-powered systems can intelligently organize it, creating a structured framework that makes the information discoverable without requiring hours of manual tagging and categorization by human experts [9].
This is where generative AI provides a massive leap forward. It can take the structured information extracted from multiple sources and synthesize it into entirely new, high-quality content. For example, an AI can analyze a series of support tickets and chat logs about a recurring issue and automatically generate a draft for a new FAQ page or a troubleshooting guide [7]. This dramatically lowers the barrier to entry for content creation. Furthermore, a new class of AI tools can create detailed knowledge base articles, complete with annotated screenshots and step-by-step instructions, simply by observing and recording a user’s workflow on their screen [15]. This means the very act of performing a task can now generate the documentation for that task.
This convergence of AI capabilities creates a powerful, symbiotic loop of continuous curation, marking the first major turn of the Knowledge Flywheel. The traditional model of knowledge management was a discrete, manual, and often-delayed process. The new, AI-driven model transforms it into a continuous, automated stream that runs in parallel with normal operations. As a team works, the AI works alongside them, capturing the knowledge being generated in near-real-time. Project deliverables are no longer the end of the process; they are the beginning of the knowledge creation cycle. This fundamental shift solves the core behavioral problem of documentation avoidance by making knowledge preservation a frictionless byproduct of the work itself. It establishes a virtuous cycle where operations feed the knowledge base, and the knowledge base, in turn, makes operations smarter—a symbiotic relationship between a team and its collective memory [17].

 

The Wiki Way: Fostering a Culture of Cognitive Reinforcement

 

While AI provides the engine to populate a knowledge base, its true, lasting value is unlocked by the human interaction with that knowledge. A knowledge base is more than a passive repository for information retrieval; it is an active learning system. The very act of contributing to, editing, and refining the knowledge base has a profound and often-overlooked cognitive impact on the team members involved. It makes them, and by extension the entire organization, smarter. This process of cognitive reinforcement is best cultivated by a specific style of collaborative platform—a style exemplified by the wiki.

 

The Science of “Writing to Learn”

 

The power of this interaction is grounded in well-established principles of cognitive science. The first is the concept of knowledge externalization. This is the process of translating tacit knowledge—the fluid, context-rich understanding that exists in a person’s mind—into an explicit, external form, such as a written document or a diagram [18]. The act of articulating what you know forces a deeper level of processing. It offloads the mental burden of holding a complex idea in working memory, freeing up cognitive resources to analyze, structure, and refine that idea [18].
This act of writing engages two other critical learning mechanisms: elaboration and organization [20]. To explain a concept to someone else, a contributor must connect it to other known facts (elaboration) and structure it in a logical, coherent way (organization). This process builds stronger, more interconnected mental models of the subject matter. A compelling study conducted with students demonstrated that the act of editing and writing articles for a platform like Wikipedia significantly boosted a wide range of cognitive skills, including critical thinking, logical reasoning, and problem-solving [21]. This is because the platform’s standards required them to not just state facts, but to research them, validate their authenticity, and organize them into a logical narrative—the very essence of deep learning.

 

The Power of the Update: Retrieval Practice and Long-Term Memory

 

The cognitive benefits do not stop at the initial creation of a page. Perhaps the most powerful mechanism for cementing long-term memory is retrieval practice—the active effort of recalling information from memory [20]. Reading a manual ten times is far less effective for long-term retention than reading it once and trying to recall its contents from memory nine times.
This is precisely what happens in a living knowledge base. When an employee needs to update a standard operating procedure, they don’t start from scratch. They first retrieve the existing procedure from their own memory and from the wiki page. They then identify what has changed, integrate the new information, and restructure the page accordingly. Each update is an act of retrieval, reinforcement, and re-encoding. This continuous cycle of small, iterative updates builds durable, long-lasting knowledge and expertise far more effectively than any one-off training session or static PDF manual ever could [3]. New hires, in particular, benefit from this, as they can revisit materials as needed to reinforce their learning within the natural flow of their work [3].

 

The “Wiki Way” as the Ideal Environment

 

This organic, iterative lifecycle of knowledge—growing from a few bullet points jotted down in a meeting, to a semi-structured page, to a comprehensive and continuously refined document—thrives in a specific type of environment: one that is flexible, collaborative, and has a low barrier to contribution. This is the essence of the “wiki way.”
Platforms like MediaWiki, the open-source software that powers Wikipedia, are architected around this philosophy [23]. They are designed to make it incredibly easy to start a page, link concepts, and allow for collective, incremental improvement. The focus is on the content and the collaboration, not on rigid, predefined processes [25]. User testimony from technical teams consistently highlights a preference for the speed, simplicity, and unobtrusive nature of wikis [25].
Conversely, more monolithic and process-heavy platforms like Microsoft SharePoint are often cited by technical users as being “clunky,” “complicated,” and slow, creating significant friction that discourages contribution [25]. When the effort to make a small correction or add a quick note is high, users simply won’t do it. This friction breaks the cycle of cognitive reinforcement. The small, frequent interactions that drive retrieval practice and knowledge externalization never occur.
This reveals a critical connection: the choice of a knowledge base platform is not a mere technical or financial decision. It is a strategic choice that has direct consequences for the cognitive development and learning capacity of the entire team. A platform’s architecture and user experience design fundamentally shape user behavior. That behavior, in turn, determines whether the powerful learning effects of knowledge externalization and retrieval practice can take root and flourish. A high-friction platform actively inhibits the very interactions that build deep, lasting institutional knowledge. Opting for a platform that embodies the “wiki way” is an investment in fostering a true learning culture, not just in procuring a documentation repository.

 

From Repository to Reasoning Engine: Your KB as the Brain for Agentic AI

 

The imperative to build a robust, living knowledge base extends far beyond immediate productivity gains and long-term team learning. It is the single most critical preparatory step an organization can take for the next era of artificial intelligence: the age of autonomous, agentic systems. A well-structured knowledge base is not just a resource for humans; it is the foundational “brain” that will empower AI agents to reason, act, and create value with unprecedented levels of capability and reliability.

 

The Critical Distinction: Corpus vs. Knowledge Base

 

To understand this future, one must first grasp a crucial technical distinction. Many current AI applications use a technique called Retrieval-Augmented Generation (RAG), where an AI model is pointed at a corpus of documents—a folder of PDFs, a Slack archive, a website’s text content [31]. When a user asks a question, the system finds relevant chunks of text from the corpus and feeds them to the language model to generate an answer. While useful, this approach has a fundamental limitation: it often destroys context [33]. A retrieved text chunk may be meaningless or even misleading without the surrounding information from the original document. The AI is essentially quoting from a library of books it hasn’t truly understood.
A true knowledge base, by contrast, is more than a collection of text. It is a structured representation of facts, rules, and, most importantly, the relationships between concepts [32]. When an AI queries a knowledge base, it’s not just finding keywords; it’s accessing an organized model of reality. It’s the difference between an AI that can find a sentence about a server in a document and an AI that knows that Server A is a type of web server, is part of Application B, is governed by Policy C, and is maintained by Team D.

 

Architecting the Agent’s Brain

 

This structured knowledge is the core component of modern knowledge-based agents [34]. These advanced AI systems operate on a simple but powerful architectural principle. They have an inference engine that allows them to reason, and they interact with their environment through three primary operations:

  1. TELL: New information is used to update the knowledge base.
  2. ASK: The agent queries the knowledge base to understand a situation or decide on a course of action.
  3. PERFORM: Based on the answer from the ASK operation, the agent executes an action in the world (e.g., makes an API call, sends a communication, makes a decision).

The quality of this agent’s performance is almost entirely dependent on the quality of its knowledge base. Building this “brain” involves advanced data structures. Vector Databases are used to store “embeddings” of information, allowing for sophisticated semantic search based on meaning, not just keywords [35]. Knowledge Graphs are used to explicitly store the entities and the relationships between them, forming a web of interconnected facts [35]. A wiki built with semantic extensions, such as the powerful Semantic MediaWiki, serves as a perfect, human-friendly interface for creating, curating, and visualizing the structured data that populates these advanced backend systems [26].

 

Completing the Flywheel: The Symbiotic AI Loop

 

This is where the entire argument culminates, and the Knowledge Flywheel begins to spin at full speed, powered by a symbiotic loop between humans and AI [17].

  • Turn 1: AI-Assisted Creation. The AI systems described in Section 2 kickstart the process, rapidly populating the knowledge base by extracting and structuring information from the organization’s vast sea of unstructured project data.
  • Turn 2: Human-Driven Curation. The human team, through the cognitive reinforcement processes described in Section 3, interacts with this AI-generated content. They curate it, refine it, correct it, and enrich it with their own tacit knowledge and context, ensuring its quality, accuracy, and relevance.
  • Turn 3: Powering Agentic AI. This high-quality, human-curated, and structured knowledge base now becomes the “brain” for the powerful agentic AI described in this section. This enables the agent to perform complex tasks, answer nuanced questions, and enforce governance with a degree of reliability that would be impossible with a simple corpus of files.
  • Turn 4: AI-Powered Reinforcement. The now-intelligent agentic AI closes the loop. By interacting with users and systems, it can identify knowledge gaps, flag outdated articles for human review, suggest new procedures based on observed patterns, and automate routine tasks, which further assists the human team and enriches the knowledge base for the next cycle.

This self-reinforcing cycle creates a powerful compounding advantage. In an era where the underlying Large Language Models (LLMs) from providers like OpenAI, Anthropic, and Google are rapidly becoming commoditized and accessible to all, an organization’s primary competitive differentiator for AI will not be the model itself, but the quality of its proprietary data.32 A unique, high-quality, structured internal knowledge base is an asset that cannot be bought or licensed. It must be built, cultivated, and curated over time through a combination of smart technology and a dedicated human culture. The organization that begins building its Knowledge Flywheel today is not merely solving an immediate productivity problem; it is constructing a deep, defensible strategic moat for the age of agentic AI. Their agents will be smarter, their teams will be more efficient, and their decisions will be better informed, because they invested in building a better brain.

 

Table 1: Comparative Analysis of Enterprise Knowledge Management Platforms for Technical Teams

 

Feature Dimension: Flexibility & Data Schema
  • Enterprise MediaWiki: Every item is a wiki page; structure is defined by flexible templates and semantic properties. Infinitely adaptable without custom code.26
  • Microsoft SharePoint: Rigid, pre-defined content types (lists, libraries). Customization often requires complex configuration or proprietary code.26
  • Analysis: MediaWiki’s flexibility is essential for the organic growth of knowledge and for creating the structured, semantic data needed for advanced AI agents. SharePoint’s rigidity can stifle this process.

Feature Dimension: Versioning Integrity
  • Enterprise MediaWiki: Complete, granular history of every change to content and data structure. Nothing is ever truly lost.26
  • Microsoft SharePoint: Limited versioning. Intermediate edits can be lost; deleted items are often gone for good; schema changes are not versioned.26
  • Analysis: For technical documentation, where understanding the history of a change is critical, MediaWiki’s robust versioning is non-negotiable. It provides a complete audit trail essential for governance and debugging.

Feature Dimension: User Experience & Adoption
  • Enterprise MediaWiki: Simple, fast, and familiar to anyone who has used Wikipedia. Low friction encourages contribution.23
  • Microsoft SharePoint: Often described by technical users as “clunky,” “complicated,” and slow. High friction discourages use and adoption.25
  • Analysis: The cognitive reinforcement loop depends on low-friction contribution. MediaWiki’s superior UX for technical teams makes it far more likely to be adopted and used, allowing the flywheel to spin.

Feature Dimension: Cost & Licensing
  • Enterprise MediaWiki: Open-source with no licensing fees. Costs are related to hosting and maintenance, which can be self-managed.23
  • Microsoft SharePoint: Expensive enterprise licensing (CALs for Windows, SQL, SharePoint) and often requires specialized, costly administrators/developers.28
  • Analysis: The significantly lower TCO of MediaWiki allows for investment in customization and AI integration rather than licensing, offering a higher strategic ROI.

Feature Dimension: Suitability for Agentic AI
  • Enterprise MediaWiki: Excellent. Semantic extensions (e.g., Semantic MediaWiki) allow for the creation of explicit knowledge graphs, a perfect foundation for reasoning engines.36
  • Microsoft SharePoint: Poor to Moderate. Data is less structured. While it can be a source for a corpus, it’s not inherently designed to create the relational, logical foundation an agent needs.29
  • Analysis: MediaWiki is architecturally aligned with the future of knowledge-based AI. It provides the tools to build not just a repository of text, but a machine-readable brain.

This post was ‘vibe written’ with Google Gemini 2.5 Pro with very little editing. A good foundation for an editorial update sometime in the future.

Works cited

  1. The social economy: Unlocking value and productivity through social technologies, https://www.mckinsey.com/industries/technology-media-and-telecommunications/our-insights/the-social-economy
  2. What is an Internal Knowledge Base? Meaning and Best Practices, https://www.simpplr.com/glossary/internal-knowledge-base/
  3. 11 Benefits of a Knowledge Base – Bloomfire, https://bloomfire.com/blog/benefits-of-a-knowledge-base/
  4. 12 Knowledge Base Benefits You Can’t Miss Out On, https://userpilot.com/blog/knowledge-base-benefits/
  5. Knowledge Retention: How to Capture and Preserve Knowledge at Work,  https://bloomfire.com/blog/knowledge-retention-tactics/
  6. Everything you need to know about internal knowledge bases,  https://stackoverflow.co/teams/resources/internal-knowledge-bases/
  7. Internal knowledge base guide – Zendesk, https://www.zendesk.com/service/help-center/internal-knowledge-base/
  8. Securing Corporate Memory: Strategies for Preserving Institutional Knowledge,  https://epiloguesystems.com/blog/preserving-institutional-knowledge/
  9. Unlocking Unstructured Data Value with AI | Market Logic,  https://marketlogicsoftware.com/blog/unstructured-data-management-with-ai/
  10. How AI Unlocks the Value of Unstructured Data – Domo, https://www.domo.com/learn/article/ai-and-unstructured-data
  11. Knowledge Management with NLP: How to easily process emails with AI – statworx, https://www.statworx.com/en/content-hub/blog/knowledge-management-with-nlp-how-to-easily-process-emails-with-ai
  12. Knowledge extraction – Wikipedia, https://en.wikipedia.org/wiki/Knowledge_extraction
  13. Extracting Construction Knowledge from Project Schedules Using Natural Language Processing, https://www.researchgate.net/publication/339670261_Extracting_Construction_Knowledge_from_Project_Schedules_Using_Natural_Language_Processing
  14. AI Knowledge Base: The Complete Guide to Everything You Need to Know for 2025,  https://www.vonage.com/resources/articles/ai-knowledge-base/
  15. Knowledge Base Generator – Scribe, https://scribehow.com/tools/knowledge-base-generator
  16. Knowledge Loss: Turnover Means Losing More Than Employees, https://hrdailyadvisor.com/2018/07/18/knowledge-loss-turnover-means-losing-employees/
  17. Symbiotic AI: The Future of Human-AI Collaboration,  https://aiasiapacific.org/2025/05/28/symbiotic-ai-the-future-of-human-ai-collaboration/
  18. What is Externalization? — updated 2025 | IxDF,  https://www.interaction-design.org/literature/topics/externalization
  19. Nonaka’s Four Modes of Knowledge Conversion,  https://www.uky.edu/~gmswan3/575/nonaka.pdf
  20. Understanding the Cognitive Processes Involved in Writing to Learn.,  https://files.eric.ed.gov/fulltext/ED600474.pdf
  21. Using Wikipedia as a Platform to Enhance Cognitive Skills: A Trailblazing Study,  https://diff.wikimedia.org/2024/12/03/using-wikipedia-as-a-platform-to-enhance-cognitive-skills-a-trailblazing-study/
  22. Aligning Writing Instruction With Cognitive Science – Edutopia,  https://www.edutopia.org/article/cognitive-science-writing-instruction/
  23. Tips for switching your team to a SharePoint open source alternative | Opensource.com, https://opensource.com/article/20/6/mediawiki
  24. MediaWiki: a case study in sustainability – OSS Watch, http://oss-watch.ac.uk/resources/mediawiki
  25. Objective reasons for using a wiki tool over Sharepoint,  https://stackoverflow.com/questions/631047/objective-reasons-for-using-a-wiki-tool-over-sharepoint
  26. Enterprise MediaWiki vs. SharePoint – WikiWorks,  https://wikiworks.com/enterprise-mediawiki-vs-sharepoint.html
  27. Recommendation for a Company-Wiki : r/selfhosted – Reddit,  https://www.reddit.com/r/selfhosted/comments/10co34n/recommendation_for_a_companywiki/
  28. IT Documentation: Sharepoint wiki vs mediawiki? : r/sysadmin – Reddit,  https://www.reddit.com/r/sysadmin/comments/5b51qb/it_documentation_sharepoint_wiki_vs_mediawiki/
  29. MS SharePoint as a Wiki: Few Functions, less Compatibility – Seibert Media, https://seibert.group/blog/en/ms-sharepoint-as-a-wiki-few-functions-less-compatibility/
  30. SharePoint Wiki that really Works, https://perfectwikiforteams.com/blog/sharepoint-wiki-that-works/
  31. Manage your RAG knowledge base (corpus) | Generative AI on Vertex AI – Google Cloud,  https://cloud.google.com/vertex-ai/generative-ai/docs/rag-engine/manage-your-rag-corpus
  32. Retrieval-Augmented Generation: Ultimate Guide – BuzzClan, https://buzzclan.com/data-engineering/retrieval-augmented-generation/
  33. Introducing Contextual Retrieval \ Anthropic, https://www.anthropic.com/news/contextual-retrieval
  34. Building Knowledge-Agents: Architecture, Operations, and Strategic Value,  https://www.alltius.ai/glossary/understanding-knowledge-based-agents-in-ai
  35. Unleashing the Future of Knowledge Management with Agentic AI,  https://www.akira.ai/blog/ai-agent-for-knowledge-base
  36. Using (Semantic) Mediawiki on an Enterprise Knowledge Management Platform: from Banking IT Governance to Smart City Hub Portals – SlideShare, https://www.slideshare.net/matteobusanelli/using-semantic-mediawiki-on-an-enterprise-knowledge-management-platform-from-banking-it-governance-to-smart-city-hub-portals
  37. Knowledge.wiki – KM-A Knowledge Management Associates,  https://km-a.net/consulting/knowledge-wiki/
  38. Knowledge Base, The brain of your AI automations – Cassidy AI,  https://www.cassidyai.com/knowledge-base
  39. Set Up Your Knowledge Base in Nine Steps – C2Perform,  https://www.c2perform.com/blog/9-steps-to-set-up-your-knowledge-base
  40. Migration from Microsoft SharePoint to MediaWiki – WikiTeq,  https://wikiteq.com/post/migration-sharepoint-mediawiki
  41. MediaWiki Sharepoint Confluence | Pros and Cons – WikiTeq,  https://wikiteq.com/mediawiki-vs-sharepoint-vs-confluence
  42. Enterprise hub – MediaWiki,  https://www.mediawiki.org/wiki/Enterprise_hub
  43. Knowledge Base Guide: Examples, Templates & Best Practices – Atlassian,  https://www.atlassian.com/itsm/knowledge-management/what-is-a-knowledge-base
  44. Knowledge base metrics to improve performance – The Owlery, The KnowledgeOwl Blog, https://blog.knowledgeowl.com/blog/posts/knowledge-base-metrics/
  45. 10 Actionable Knowledge Base Metrics to Start Tracking Today – Help Scout, https://www.helpscout.com/blog/knowledge-base-metrics/
  46. How Do You Avoid Knowledge Silos?, https://www.apqc.org/blog/how-do-you-avoid-knowledge-silos
  47. Knowledge Loss: Turnover Means Losing More Than Employees, https://hrdailyadvisor.com/2018/07/18/knowledge-loss-turnover-means-losing-employees/
]]>
Hidden Risks of Valuable Human Capital https://netstatz.com/hidden-risks-of-valuable-human-capital/ Fri, 18 Jul 2025 19:23:34 +0000 https://netstatz.com/?p=6479

Hidden Risks of Valuable Human Capital

Understanding the Unseen Challenges in Technology Leadership

Every organization has hidden assets: those deeply trusted technology professionals who hold significant responsibility within the business. These experts often predate even senior executives, serving as the critical custodians of high-level technology credentials and infrastructure. Yet, beneath this invaluable human capital lies a set of concealed risks that can profoundly impact organizational security, efficiency, and financial health.

Identity and Access Management: A Modern Imperative

The security landscape has evolved dramatically over recent decades. Multi-factor authentication (MFA), once a rarity outside specialized security operations, is now standard practice. However, within many organizations, especially smaller businesses or those that have experienced leadership transitions, highly privileged administrative accounts remain protected by nothing more than traditional passwords.

The risks here are acute. Administrators holding unrestricted access pose a particular challenge, because they often act as the gatekeepers for improvements to general security practice. The solution is both straightforward and robust: separate administrative and operational credentials, restrict the highest-privilege credentials to key-based authentication, and require additional factors. Surfacing these accounts in dashboards that highlight usage is an easy way to create the necessary accountability. These modern approaches not only enhance security but also streamline operational efficiency for users when properly adopted. Every critical system must be resilient to the unplanned release of a single person’s credentials.
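For one common case, SSH access to critical hosts, the key-based policy above can be sketched as an OpenSSH server excerpt. The directives shown are standard sshd_config options, but the group name is a hypothetical example and any real deployment needs its own review:

```
# /etc/ssh/sshd_config (excerpt) -- illustrative hardening sketch
PasswordAuthentication no          # no password-only logins
PubkeyAuthentication yes           # keys required for everyone
PermitRootLogin prohibit-password  # root never logs in with a password

# Administrators use dedicated accounts in a dedicated group,
# separate from their day-to-day operational logins
Match Group infra-admins
    AuthenticationMethods publickey
```

Pairing a policy like this with centralized logging of every privileged login is what makes the "appears in dashboards" accountability practical.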

Yet, why does this vulnerability persist? Often, senior executives rely heavily on trusted tech leaders, accepting assurances that password-based protection is sufficient. Without expert guidance and advocacy for stronger controls, organizations remain vulnerable.

Technical Debt: The Invisible Anchor

The rapid pace of technology evolution has reshaped operating systems, virtualization, and container technologies. Organizations dependent on legacy systems often face costly periodic “forklift” upgrades—comprehensive, disruptive updates that consume substantial resources and limit agility.

Legacy administrators, accustomed to stable yet outdated systems, often resist transitioning to modern, scalable, and evergreen platforms due to perceived complexity or inertia. This conservatism introduces long-term costs, limits scalability, and stifles innovation, as resources remain dedicated to maintaining rather than evolving technology stacks.

Leadership often inadvertently perpetuates this status quo, relying on assurances from trusted staff about stability and security without recognizing the cumulative technical debt incurred. Without strategic intervention and advocacy for modernization, organizations—particularly smaller businesses—risk escalating administrative costs and reduced competitiveness.

The Knowledge Retention Dilemma

Employee turnover inherently threatens organizational knowledge retention. Publicly traded companies, in particular, experience significant risk when departing employees carry away critical expertise, diminishing operational continuity and efficiency.

Effective mitigation requires more than simply documenting processes. Organizations must cultivate a genuine knowledge-sharing culture, encouraging mentorship and active cross-functional collaboration. A dynamic knowledge base, adaptable enough to reflect diverse organizational thought processes, is crucial. Governance should balance flexibility with structure, avoiding overly rigid controls that can stifle participation and usability.

Why do businesses struggle to achieve this? Common barriers include outdated knowledge management tools, legacy file management systems, and competitive internal environments where individuals hoard knowledge as a perceived competitive advantage. Recognizing the human motivation for mentorship and teaching, and aligning incentives accordingly, can significantly enhance knowledge retention and organizational resilience.

Strategic Spending: Aligning Accountability with Cost Efficiency

Misaligned accountability and budgeting often drive unnecessary expenditures. Organizations frequently adopt solutions based on desired outcomes without thorough analysis of cost-effective alternatives and associated risks. This is compounded by workload considerations—evaluating cheaper alternatives requires additional effort and diligence.

In businesses owned by external investors or absentee stakeholders, spending accountability typically rests with executive leadership, who may lack detailed insight into technology costs. Without knowledgeable oversight, technology investments can spiral, driven by convenience rather than strategic procurement and robust supply chain management.

Effective cost management requires transparent communication and strategic alignment between executives, technology leaders, and procurement processes. Organizations must foster cultures where cost-efficiency and accountability coexist, prioritizing informed, balanced decision-making over short-term convenience.

The Leadership Advantage: Partnering for Success

Addressing these hidden risks demands more than isolated tactical adjustments; it requires strategic leadership. Netstatz Ltd. and its founder, Ian MacDonald, specialize in illuminating these challenges, providing strategic solutions tailored to each organization’s unique context. For recruiters seeking experienced leadership or technology executives aiming for transformative partnerships, recognizing and navigating these hidden risks is paramount.

By proactively addressing these risks, organizations not only enhance security and efficiency but also position themselves strategically for sustained success.

]]>
Cloudflare tenant bypass in the wild https://netstatz.com/cloudflare-tenant-bypass-in-the-wild/ Wed, 01 Nov 2023 15:56:25 +0000 https://netstatz.com/?p=6403

Sophisticated Phishing

Observations

WARNING: Do not put credentials into this URL. We extracted the following URL from a Docusign phishing email QR code. The URL links to an embedded Cloudflare tenant, with a domain registered in India, and appears to be performing a MITM (man-in-the-middle) attack on Microsoft IDPs (Identity Providers). The URL https://storage-srv6576-mcrsft-987.online/ appears to pull and/or proxy legitimate content resources from customized Office 365 IDPs to make the presented login page appear like a legitimate Microsoft site. Some screen captures of this are shown below.

This first image is part of the original email content, which was made to look like a legitimate Docusign email with an embedded QR code that we extracted out of curiosity. To our surprise, it looks like it might be a sophisticated credential grabber, but we have not explored or tested it beyond capturing the observations in this post to be used as an educational tool for some of our clients.

This second image below shows the Cloudflare tenant website https://storage-srv6576-mcrsft-987.online/ that the QR code takes the user to. Most unsuspecting users would believe it to be a standard Microsoft Identity Provider prompt and enter their username to proceed with authentication. Even the clever wording of the .online URL may fail to raise suspicion for some users.

In the image below, we used a known Microsoft 365 customer, ING, to show what happens next if you enter an email from a domain using M365 authentication. The Cloudflare tenant finds the real Identity Provider landing page and builds a convincing (but fake) page using resources from the customized Microsoft Identity Provider site. On this fake page, none of the links work, which is another clue that it is not legitimate. It does, however, use the same background images and logo as the original Microsoft site. This is the worrisome part: as you can see, it appears very similar to the actual ING login site shown in the next screenshot.

In the image below, this is what the actual ING Office 365 login site looks like when the user “[email protected]” is entered into the real Microsoft Office 365 sign-in page. Notice how similar the illegitimate and legitimate sites appear. Only the URL and the functional links to the forgotten-password page differentiate them for the average user.
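A minimal illustration of that advice: the hostname in the address bar is the one signal the proxy cannot forge. The helper below is a hypothetical sketch with a deliberately partial allow-list of well-known Microsoft sign-in hosts; it is not a substitute for real anti-phishing controls, but it flags the look-alike domain from this campaign:

```python
from urllib.parse import urlparse

# Partial allow-list of well-known Microsoft sign-in hosts (illustrative,
# not exhaustive). Anything else -- including convincing look-alikes --
# should be treated as suspect.
LEGIT_MS_LOGIN_HOSTS = {
    "login.microsoftonline.com",
    "login.live.com",
}

def looks_like_legit_ms_login(url: str) -> bool:
    host = (urlparse(url).hostname or "").lower()
    return host in LEGIT_MS_LOGIN_HOSTS

print(looks_like_legit_ms_login("https://login.microsoftonline.com/common"))   # True
print(looks_like_legit_ms_login("https://storage-srv6576-mcrsft-987.online/")) # False
```

The key design point is exact host matching: substring checks like "mcrsft" or "microsoft" appearing anywhere in the URL are precisely what this phishing kit exploits.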

We tested another well-known brand and Office 365 user, KPMG. In this case, their simple but more customized IDP landing page yielded a much larger difference between the forged clone (immediately below) and the legitimate site (shown at the bottom). Changing the layout of the login page is a strategy that may help users distinguish real Microsoft login pages from clever clones made to look like the uncustomized default.

]]>
A Story About ChatGPT GPT-5 https://netstatz.com/gpt-5-hypothesis/ Wed, 09 Aug 2023 23:54:36 +0000 https://netstatz.com/?p=6391

A story about ChatGPT GPT-5

Where outdated copyright meets the unstoppable force of AI innovation: a tale of unintended triumphs just may come true.

In this digital age, within the expansive realm of cyberspace, a tempest brews. It’s not about privacy, data breaches, or rogue algorithms, but rather our age-old friend – copyright. Just like a well-meaning dragon, this ancient entity unintentionally sparks a fire, shedding light on an unanticipated digital renaissance.

Blast from the Present

For context, let’s wind the clock back a bit. Copyright laws have long been the protector of creativity. Born from the need to safeguard original work, these regulations ensure creators receive due recognition and monetary compensation for their intellectual efforts. But as with all things, age sometimes takes its toll. What was once a shield has, in many instances, become a chain, restraining innovation, particularly in the rapidly evolving world of artificial intelligence.

The Firestarter: A Class Action Saga

Our story centers on OpenAI’s ChatGPT+ GPT-4, which, having integrated with Microsoft’s Bing search engine, had unlocked the golden gateway to vast amounts of publicly available data. But as Bing’s capability to detect and describe copyrighted material flourished, a class action lawsuit emerged from the shadows, accusing OpenAI of copyright infringement.

Now, some might say that taking action against companies leveraging publicly available data is akin to stopping someone from drinking water on the grounds that it looks like your water. Still, the protectors of the old way deemed this integration as copyright infringement.

The Plot Thickens

With the lawsuit pendulum hovering over them, OpenAI opted to disable their Bing integration. The forecast? A doom-laden future, filled with lost subscribers and an inferior service offering. But this is where the plot thickens.

With this sudden, enforced freedom from the Bing tether, OpenAI engineers have rolled up their sleeves and diverted their attention to other technological marvels. Locked in their vault may lie a beast of a functionality, mature and primed, but kept chained due to the fear of the unknown and public perception.

The public, meanwhile, may actually empathize with their fallen tech giant’s predicament. They may be lamenting the loss of their favorite features (I do) and, in doing so, unintentionally paving a path for OpenAI’s ChatGPT+ GPT-5 resurgence.

Hints of a Grand Unveiling

Using this wave of public sympathy as their stage, and a bit of experience in trademark law (which some might consider a higher echelon of copyright, focused on identity rather than financial gain), OpenAI made a grand announcement. For anyone questioning whether OpenAI would get to GPT-5, it seems clear their next-gen functionality is near enough that it qualifies as a technological leap worth protecting, in name. Its release could shatter previous records, and the resulting surge in popularity may silence some critics. It may even come to be described by some (or by AI governance tools) as an example of a tool with the requisite due diligence, allowing the actual benefits of innovation to dwarf the potential downsides.

A New Dawn for Copyright

It’s almost poetic, the way outdated copyright rules, intended to hinder, may have inadvertently propelled OpenAI to new heights with anticipation of GPT-5.  It makes one ponder: In an era where content constantly evolves and has the potential to grow like personal thought, are copyright laws becoming the digital dinosaurs?

There’s a touch of irony here too. By trying to pull OpenAI back to the age of copyright confines, the claimants accidentally thrust them, and by proxy the rest of the digital world, into the future.

In conclusion, as we tread further into this dynamic digital age, it might be time for us to reevaluate the chains of the past. Perhaps it’s time to consider if they’re holding us back or, in the delightful case of OpenAI, unexpectedly propelling us forward.

Oh, and for the copyright holders still gripping their ancient scrolls – let’s not forget that even dragons, with all their might and fury, eventually found themselves a place in the books of myths and legends.

]]>
Prompt Shaping https://netstatz.com/prompt-shaping/ Sun, 25 Jun 2023 20:59:42 +0000 https://netstatz.com/?p=6360
Our Story

Unlocking Productivity with AI: Problem Shaping, Not Prompt Engineering


Artificial Intelligence (AI) has gradually become a staple in our digital lives. From voice-enabled virtual assistants to recommendation systems, we’re surrounded by AI-based solutions designed to make our lives easier and more efficient. But have we ever stopped to think about how these AI systems work, and more importantly, how we can make them work better for us?

A common misconception is that the key to effective AI interaction lies in formulating the perfect prompt. People often believe that meticulous attention to grammar, proper nouns, and a crystal-clear task definition are the keys to obtaining the desired results. However, this might not be the case. AI prompt generation is less about the fine-tuning of prompts and more about the art of problem shaping. This approach aligns closely with an assertion often attributed to Albert Einstein: “If I had only one hour to solve a problem, I would spend 55 minutes defining the problem and the remaining 5 minutes solving it routinely.”

A recent article from the Harvard Business Review titled “AI Prompt Engineering Isn’t the Future” (https://hbr.org/2023/06/ai-prompt-engineering-isnt-the-future) touches upon this concept. Although it brings important insights to light, the essence of the matter could be explained more simply using Einstein’s wise words as a reference.

The art of problem shaping is akin to building an equation to solve a computational challenge. The power lies not in the computation itself but in understanding and articulating the problem so precisely that its solution becomes an almost routine task. This principle holds true for AI, where the challenge is less about providing specific instructions and more about asking the right questions.

When working with AI, the focus should shift from meticulous prompt crafting to understanding the broader perspective. By defining the nature of the problem, setting the context, and identifying the desired outcome, we can make the AI’s task more straightforward and its output more useful.
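As a concrete sketch of this shift (the template fields below are illustrative, not a prescribed format), a “shaped” prompt can be assembled from the problem, its context, and the desired outcome rather than from a single polished sentence:

```python
# Minimal sketch of "problem shaping": spell out the problem, the context,
# and the desired outcome instead of polishing the wording of a one-liner.

def shape_prompt(problem: str, context: str, outcome: str) -> str:
    return (
        f"Problem: {problem}\n"
        f"Context: {context}\n"
        f"Desired outcome: {outcome}\n"
        "Before answering, restate the problem in your own words and note "
        "any assumptions you are making."
    )

prompt = shape_prompt(
    problem="Ticket backlog in our support queue keeps growing",
    context="Team of 4 agents, ~120 new tickets/day, no triage step today",
    outcome="A prioritized list of 3 process changes we could pilot in 2 weeks",
)
print(prompt)
```

The grammar of each field matters far less than the fact that all three questions have been answered before the AI is ever asked to respond.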

This shift in focus allows for more efficient and effective use of AI tools, enhancing productivity by focusing on solution-oriented problem shaping. It encourages us to dig deeper, to ask the right questions, and to think critically about the issues at hand. By focusing on the problem rather than the prompt, we can truly harness the power of AI.

As we navigate the AI-dominated future, the ability to shape problems rather than just providing prompts becomes paramount. The next time you interact with an AI tool, remember Einstein’s approach. Spend your time understanding and defining the problem, and let AI do the routine work of solving it. The key to unlocking the full potential of AI lies not in crafting the perfect prompt, but in asking the right questions.

]]>
Accelerating Innovation: Balancing Risks and Rewards in AI and Deep-Sea Exploration https://netstatz.com/accelerating-innovation/ Sat, 24 Jun 2023 00:20:28 +0000 https://netstatz.com/?p=6354

In a recent interview with Sam Altman, CEO of OpenAI, the question was raised whether the development and deployment of artificial intelligence models like ChatGPT represent a “move fast and break things” approach. This philosophy, attributed to Facebook in its early days, suggests a preference for rapid development and deployment, often at the expense of potential errors and fixes needed down the line. This approach has since been heavily critiqued due to the unforeseen consequences in the social, political, and cultural realms.

Meanwhile, in a starkly contrasting field, the recent OceanGate accident stands as a grim reminder of the tangible risks involved when pushing the boundaries of exploration and technology. The Titan submersible, a vessel designed for deep-sea exploration, catastrophically imploded during a mission, emphasizing the grave dangers that can occur when things go awry in real-world applications.

At first glance, it might seem like these two instances have little in common: one is about software running on servers, while the other is a physical vessel exploring our planet’s uncharted depths. However, they both center on a crucial question facing innovators today: When is the right time to bring a new product into the world, and how do we balance the desire to innovate quickly with the necessity of ensuring safety and reliability?

In the case of ChatGPT, releasing it early allows for widespread user interaction, which subsequently results in a vast array of feedback and data. This approach allows the AI to learn and improve at a faster rate than if it were confined to the lab. It offers an unprecedented scale of real-world interaction, enabling the software to evolve and adapt to a broad spectrum of uses. However, the tradeoff is that the technology may sometimes generate outputs that are unexpected or even harmful, leading to necessary and sometimes rapid revisions and updates.

The OceanGate incident, on the other hand, illustrates a sobering reality of the potential costs when the risks are not just theoretical but also physical and potentially life-threatening. The accident underscores that a “move fast and break things” approach can have far more dire consequences in certain fields, where safety must be paramount.

Nevertheless, it’s crucial to remember the broader context and purpose of these endeavours. AI like ChatGPT has the potential to revolutionize how we work, communicate, and interact with information, making our lives more convenient and providing us with powerful tools for creativity and productivity. Similarly, the mission of OceanGate is rooted in the noble pursuit of exploration and knowledge, the desire to uncover the secrets of our vast oceans, a frontier as mysterious and uncharted as outer space.

In both cases, the benefits of these innovations — when managed responsibly — can far outweigh the potential pitfalls. It is only through pushing the boundaries of what is possible that we advance as a society. It’s easy to focus on the failures, but it’s essential to see them in the context of the broader journey of progress.

But what if we dare to imagine a future where these two technological frontiers could intersect, assisting each other in their quests to explore and innovate? What if a mature AI, such as ChatGPT, was employed by OceanGate in assessing the myriad of complex factors involved in a deep-sea exploration mission?

Imagine the submersible equipped with a mature AI that understands and responds to natural language, integrated within the systems of the vessel. Before the journey, it could analyze thousands of research papers and historical dive data to predict potential issues, identify risk factors, and suggest preventive measures. During the mission, it could monitor the vessel’s health, processing vast amounts of data in real time, and providing immediate feedback to the crew, alerting them of potential issues before they turn critical.

Perhaps, in this scenario, the tragic implosion of the Titan could have been averted. While this is merely a speculative scenario, it serves to highlight the potential that AI technology holds when appropriately integrated into our everyday tools and systems.

The potential applications extend far beyond deep-sea exploration. Could we see AI assisting in preventing industrial accidents, informing disaster response strategies, or even helping us in our daily tasks by analyzing patterns and predicting potential issues? As we grapple with the ethical and practical challenges of AI, it’s these kinds of ‘what if’ scenarios that should guide our ambitions and hopes.

As we ponder these possibilities, remember that we’re on the cusp of such a future. Every day, millions of people are already harnessing the power of AI in their work, studies, and personal lives. As the technology continues to mature, its potential applications continue to expand, limited only by our imagination and ambition.

In closing, we must remember that the interplay of risk and reward is a fundamental aspect of innovation. The promise of AI, like the allure of the deep sea, beckons us forward. The journey will undoubtedly be marked by challenges and setbacks, but with each step, we inch closer to a future where the integration of AI in our lives could help prevent tragedies and unlock previously unimagined possibilities.

]]>
Debian: The Secret Weapon to a Faster and More Efficient Agent Development https://netstatz.com/debian-the-secret-ai-weapon/ Sat, 17 Jun 2023 15:53:29 +0000 https://netstatz.com/?p=6341

As we venture deeper into the frontier of artificial intelligence (AI), the need for powerful, stable, and user-friendly tools and platforms becomes more pronounced. The open-source Debian operating system (OS) and its command-line interface (CLI) capabilities are proving to be instrumental in the efficient deployment of advanced AI agent frameworks like AutoGPT. Here’s why.

The Debian Advantage

At the core of Debian’s advantages is its policy-driven packaging system. In a nutshell, this policy ensures that all software packaged for Debian adheres to a set of strict standards that provide consistency in the way tools are configured, documented, and used. For an AI system, this uniformity is a goldmine – it allows AI to predict and understand the behavior of tools across the Debian OS without needing to adapt to different configurations or interpretations of documentation.
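To make the uniformity concrete, here is a minimal sketch. The here-doc-style string stands in for the output of <tt>apt-cache show coreutils</tt> (the field names are real Debian control fields; the values are an illustrative excerpt). Because every Debian package publishes metadata in the same “Field: value” control format, a single tiny parser covers all of them:

```shell
# Sketch: one parser for any Debian package's control metadata.
# $control is a stand-in excerpt for `apt-cache show coreutils`.
parse_field() {
  sed -n "s/^$1: //p"
}

control='Package: coreutils
Version: 9.1-1
Depends: libc6'

printf '%s\n' "$control" | parse_field Version   # → 9.1-1
```

An agent can reuse the same `parse_field` call for <tt>Depends</tt>, <tt>Package</tt>, or any other field, on any package, which is exactly the predictability the policy buys.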

In addition, Debian’s long stable release cycles offer a significant edge. A long cycle means that information about the operating environment stays relevant for years at a time. As an AI model like AutoGPT trains and learns, the stable environment reduces the chances of encountering unexpected changes that could disrupt the learning process or break the system’s functionality.

Security and Updates: Managed and Predictable

Debian’s update policy, which governs how security and functional updates are introduced, adds another layer of predictability to the mix. Unlike some rpm-based distributions, Debian relies on an intermediary package maintainer: the individual or team responsible for introducing new updates or functionality to the OS. (The Filesystem Hierarchy Standard, or FHS, which Debian also enforces, standardizes where those packages’ files land on disk.)

This approach ensures that all changes to Debian are thoroughly vetted and evaluated for compatibility with existing packages before they’re introduced to the system. For an AI agent, this translates into a reduced risk of encountering unexpected changes or new functionalities that could trip up its operations or learning process.

The Power of CLI

Now, let’s talk about the Command Line Interface (CLI). With its textual interface, the CLI might seem daunting to beginners, but it is a powerhouse for orchestrating the operating system and its tools. More importantly, this is generally the same interface that AI agents will use to install and configure tools, thus facilitating a smoother integration process.

The CLI offers a higher degree of control over software and allows for scripting and automation of tasks. This capability plays well with AI agents as they can use it to create efficient and highly customized workflows, extending their functionalities beyond the limits of their original design.
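As a minimal sketch of that kind of scripted workflow, consider a pre-flight check an agent might generate before running a task: verify that the required tools exist, and fall back to installing the missing ones. The tool names are examples, and the <tt>apt-get</tt> call is left commented because it needs root privileges:

```shell
# Pre-flight check: confirm required CLI tools are on PATH,
# collecting any that are missing for a single install step.
ensure_tools() {
  missing=""
  for t in "$@"; do
    command -v "$t" >/dev/null 2>&1 || missing="$missing $t"
  done
  if [ -n "$missing" ]; then
    echo "missing:$missing"
    # sudo apt-get update && sudo apt-get install -y $missing
  else
    echo "all tools present"
  fi
}

ensure_tools sh sed   # prints "all tools present" when both are on PATH
```

Small building blocks like this compose into exactly the “efficient and highly customized workflows” described above.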

In the context of Debian, an experienced user will recognize one of the core ingredients: the difference between deploying a tool with an ad-hoc technique such as <tt>unzip</tt>, <tt>tar xvfz</tt>, <tt>git clone</tt>, <tt>npm install</tt>, or <tt>uv add</tt>, versus installing it with the <tt>apt</tt> package command and its upstream cousin, <tt>dpkg-buildpackage</tt>.
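The two routes can be sketched side by side. In this hypothetical, “mytool” is a placeholder name, and the actual install commands are left commented because they require network access and root; only the route selection runs:

```shell
# Sketch: choosing between the package-managed and ad-hoc routes
# for a hypothetical tool "mytool".
deploy() {
  if command -v apt-get >/dev/null 2>&1; then
    echo "route: apt"
    # sudo apt-get install -y mytool   # tracked by dpkg, clean removal
  else
    echo "route: ad-hoc"
    # git clone https://example.com/mytool && make -C mytool install
  fi
}

deploy
```

The apt route gives the agent dependency tracking, file ownership queries (<tt>dpkg -S</tt>, <tt>dpkg -L</tt>), and clean removal; the ad-hoc route provides none of that bookkeeping.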

Embrace the Debian Revolution

In conclusion, harnessing the power of Debian and its CLI can give you a considerable edge in AI development. Its policy-driven packaging, long stable release cycles, and maintainer-vetted updates introduce a level of predictability that is highly beneficial to AI systems. The CLI offers an interface that aligns with the way AI agents operate, making it an effective tool for enhancing and managing AI functionality.

If you’re looking to accelerate your AI projects and improve efficiency, gaining expertise in Debian-based platforms could be an absolute game-changer. By providing a stable, predictable, and controllable environment, Debian lays the foundation for a successful AI future.

For users operating natively in other platforms, containers and virtual machines allow for quick staging of a Debian environment.

Docker Container

docker pull debian:bookworm-slim
docker run -it --rm debian:bookworm-slim bash

Vagrant VM

vagrant init debian/bookworm64
vagrant up && vagrant ssh
]]>