tag:github.com,2008:https://github.com/iree-org/iree/releases Tags from iree 2026-03-20T01:58:50Z tag:github.com,2008:Repository/208145128/iree-3.11.0rc20260320 2026-03-20T10:41:20Z iree candidate iree-3.11.0rc20260320 <p>[Codegen][CAPI] Fix C API assertion for GPU pipeline attributes in Tr…</p> <p>…anslationInfoAttr (<a class="issue-link js-issue-link" href="https://github.com/iree-org/iree/pull/23868">#23868</a>)</p> <p>Context: some changes happen from the IREE side: <br />- <a class="issue-link js-issue-link" href="https://github.com/iree-org/iree/pull/23590">#23590</a> <br />- <a class="issue-link js-issue-link" href="https://github.com/iree-org/iree/pull/23687">#23687</a> <br />- <a class="issue-link js-issue-link" href="https://github.com/iree-org/iree/pull/23816">#23816</a> <br />and tuner CI error: <br /><a href="https://github.com/nod-ai/amd-shark-ai/actions/runs/23314739415/job/67811065632?pr=2865#step:8:135">https://github.com/nod-ai/amd-shark-ai/actions/runs/23314739415/job/67811065632?pr=2865#step:8:135</a></p> <p>This PR fixes the C API assertion in `TranslationInfoAttr.get()` to <br />accept `PipelineAttr` in addition to `DispatchLoweringPassPipelineAttr`</p> <p> Assisted-by: [Claude Code](<a href="https://claude.ai/code">https://claude.ai/code</a>)</p> <p>Signed-off-by: Bangtian Liu &lt;[email protected]&gt;</p> iree-github-actions-bot tag:github.com,2008:Repository/208145128/v3.11.0 2026-03-19T23:25:34Z Release v3.11.0 sa-faizal tag:github.com,2008:Repository/208145128/iree-3.11.0rc20260319 2026-03-19T10:44:39Z iree candidate iree-3.11.0rc20260319 <p>Refactor proactor pool, add frontier-carrying signals, and fix shared…</p> <p>… infra. (<a class="issue-link js-issue-link" href="https://github.com/iree-org/iree/pull/23804">#23804</a>)</p> <p>Proactor pool runner factory: <br />- Extract thread management from proactor_pool into an injectable <br />runner_factory callback. 
The pool creates proactors and delegates <br />poll-driving to a factory, enabling platforms without C threads (wasm, <br />embedded/RTOS) to use the pool with their own poll mechanisms. <br />- The thread-based runner moves to a new proactor_thread_runner target. <br />- _options_default() selects the thread runner on native platforms and <br />no runner on platforms without threads, so all existing callsites work <br />without changes.</p> <p>Frontier-carrying signals: <br />- Add optional frontier parameter to iree_hal_semaphore_signal() and <br />iree_hal_semaphore_list_signal() for cross-device causal ordering. <br />- Update all HAL drivers (local_task, local_sync, Vulkan, CUDA, HIP, <br />AMDGPU, Metal) and CTS tests to pass the frontier parameter. <br />- Add FIFO wait elision to semaphore submission tests.</p> <p>Shared fixes: <br />- Fix iree_call_once no-op on IREE_SYNCHRONIZATION_DISABLE_UNSAFE <br />platforms (wasm, bare-metal RISC-V). The single-threaded fallback never <br />called the init function. <br />- Guard file_transfer.c queue_copy fast path with DEVICE_VISIBLE check <br />so HOST_LOCAL-only buffers fall through to the streaming path. <br />- Add heap_buffer_wrap fallback in memory_file when device import fails. <br />- Split threaded semaphore CTS tests into semaphore_thread_test.cc so <br />platforms without C threading can run single-threaded tests.</p> <p>Co-authored-by: Claude &lt;[email protected]&gt;</p> iree-github-actions-bot tag:github.com,2008:Repository/208145128/iree-3.11.0rc20260318 2026-03-18T10:20:38Z iree candidate iree-3.11.0rc20260318 <p>Simplify RISC-V QEMU configuration and make CPU flags configurable (<a class="issue-link js-issue-link" href="https://github.com/iree-org/iree/pull/23777">#…</a></p> <p><a class="issue-link js-issue-link" href="https://github.com/iree-org/iree/pull/23777">…23777</a>)</p> <p>This change makes two improvements to RISC-V testing infrastructure:</p> <p>1. 
Allow toolchain files to control QEMU CPU parameters via <br />RISCV_QEMU_CPU_FLAGS variable, making it easier to customize CPU <br />features for testing.</p> <p>2. Unify QEMU binary configuration by replacing QEMU_RV64_BIN and <br />QEMU_RV32_BIN with a single QEMU_BIN variable. Since the same build <br />environment cannot support both riscv64 and riscv32 simultaneously, <br />having separate variables is unnecessary.</p> <p>Changes: <br />- Add RISCV_QEMU_CPU_FLAGS to linux_riscv32.cmake and <br />linux_riscv64.cmake <br />- Pass QEMU_CPU_FLAGS environment variable to tests <br />- Update run_riscv_test.sh to use QEMU_BIN and QEMU_CPU_FLAGS <br />- Update GitHub workflow to use QEMU_BIN instead of QEMU_RV64_BIN</p> <p>Signed-off-by: Han-Kuan Chen &lt;[email protected]&gt;</p> iree-github-actions-bot tag:github.com,2008:Repository/208145128/iree-3.11.0rc20260317 2026-03-17T10:41:08Z iree candidate iree-3.11.0rc20260317 <p>Unify HAL semaphores on async infrastructure. (<a class="issue-link js-issue-link" href="https://github.com/iree-org/iree/pull/23695">#23695</a>)</p> <p>Many breaking(ish) API changes here: this is the grand unification of <br />the HAL semaphore mechanism that unlocks heterogeneous execution and <br />remoting, so it's worth it :) Buffers will be next (those still have <br />issues and will need some iree_async_region_t work) but at least now <br />synchronization functions the same across all layers of the stack and <br />the kernel/devices have the ability to elide device-side waits thanks to <br />the frontiers (not yet wired, but coming soon). The task system was also <br />substantially cleaned up and now no longer has a poller thread (or a <br />63-concurrent-waiter limit). Future changes will continue to optimize <br />the task system to avoid additional thread hops to reduce CPU latency.</p> <p>Note, CUDA is untested here beyond building, as I don't have access to a <br />CUDA machine right now. 
Anyone with access to one would be appreciated <br />in filing full reports. Or figuring out how we get a CUDA CI :)</p> <p>---</p> <p>This branch replaces IREE's legacy semaphore system — where each HAL <br />driver implemented its own timeline semaphore from scratch — with a <br />single `iree_async_semaphore_t` that every driver embeds. The HAL <br />semaphore becomes a thin shell around a shared, well-tested core.</p> <p>The result: 388 files changed, **-14,000 net lines**, 115 files deleted. <br />The codebase gets simpler and more correct at the same time.</p> <p>### What this does</p> <p>**Unified type system.** Every HAL semaphore now embeds an <br />`iree_async_semaphore_t` at offset 0 (toll-free bridge). The async <br />semaphore owns the timeline value, failure status, timepoint list, and <br />optional frontier. Driver-specific semaphore types (CUDA events, Vulkan <br />timeline semaphores, Metal shared events, software semaphores) become <br />wrappers that add only their hardware-specific signaling on top. This <br />means timeline tracking, failure propagation, multi-wait, and timepoint <br />dispatch are written once and shared by all eight backends.</p> <p>**Centralized multi-wait.** Semaphore wait-any and wait-all are now <br />implemented once in `iree_hal_semaphore_wait_list`, using the proactor's <br />native wait primitives. The old approach — where each driver <br />reimplemented multi-wait with varying degrees of correctness — is gone. <br />The Vulkan driver gets a dedicated completion watcher thread that <br />bridges Vulkan's `vkWaitSemaphores` into the async semaphore's timepoint <br />system, so Vulkan waits participate in the same unified infrastructure <br />as everyone else.</p> <p>**Proactor integration.** The async proactor (io_uring / IOCP / kqueue) <br />is now wired through device creation into every semaphore. 
This is the <br />foundation for event-driven scheduling: instead of polling or <br />busy-waiting on GPU completion, semaphore timepoints can be delivered <br />through the OS's native async I/O mechanism. A proactor pool manages <br />per-thread proactor instances so that device creation doesn't require <br />callers to think about I/O infrastructure.</p> <p>**Deletion cascade.** With the async semaphore as the single source of <br />truth, a large amount of legacy infrastructure becomes dead code: <br />- `semaphore_base.h/c` and the bridge timepoint API (the old <br />compatibility layer between driver semaphores and the async system) <br />- `iree_loop_t` and `loop_sync` (moved to VM where it's only needed for <br />inline module execution) <br />- `wait_handle`, `event_pool`, `wait_primitive` (replaced by async <br />primitives) <br />- The entire `experimental/web/` and `experimental/webgpu/` trees (see <br />note below) <br />- `iree_hal_wait_flags_t` (replaced by a clean three-tier `ACTIVE` / <br />`YIELD` / `BLOCK` model)</p> <p>### Emscripten / WebGPU</p> <p>The old web and WebGPU samples are deleted in this branch. They were <br />built on the emscripten loop, which was built on `wait_handle` and the <br />old synchronous wait infrastructure — all of which is now gone.</p> <p>This is intentional, not collateral damage. When emscripten support <br />comes back, it will be built on the proactor system, which is a far more <br />natural fit. The browser's event loop is fundamentally a proactor: you <br />submit work (fetch, GPU dispatch, timer) and get called back when it <br />completes. The old emscripten loop fought against this by trying to <br />impose a synchronous polling model on an inherently callback-driven <br />environment. 
A proactor backend for emscripten will work *with* the <br />browser's execution model — `postMessage` for cross-worker signaling, <br />`requestAnimationFrame` for frame pacing, GPU completion callbacks for <br />timeline advancement — the same way io_uring works with the Linux kernel <br />and IOCP works with Windows.</p> <p>### Why this matters</p> <p>The old semaphore system was the main obstacle to several things we want <br />to do:</p> <p>**Correct error propagation.** Every driver had its own failure <br />handling, and most of them got edge cases wrong. With a single <br />implementation, failure status propagates correctly through the entire <br />pipeline: GPU error → driver callback → async semaphore → fence → user.</p> <p>**Remote execution.** The frontier/axis system (already landed) needs <br />semaphores that can be signaled from network events, not just GPU <br />completions. The unified async semaphore makes this trivial — a <br />network-backed semaphore is just an `iree_async_semaphore_t` with no <br />hardware wrapper. Without unification, we'd need to either special-case <br />remote semaphores in every driver's wait path or build a second parallel <br />wait infrastructure.</p> <p>**New driver development.** The AMDGPU driver (in progress) benefits <br />directly: instead of building semaphore infrastructure from scratch, it <br />embeds the async semaphore and implements only the HSA-specific <br />signaling. Same for any future driver.</p> <p>**Event-driven scheduling.** The proactor integration means we can move <br />toward a model where the runtime reacts to completions rather than <br />polling for them. 
This is necessary for efficient multi-device <br />orchestration and for keeping CPU utilization low during GPU-bound <br />workloads.</p> <p>### API changes</p> <p>| Area | Old | New | Notes | <br />|------|-----|-----|-------| <br />| **Semaphore embedding** | Each driver defines its own semaphore struct <br />| Embed `iree_async_semaphore_t` at offset 0 | Toll-free bridge: <br />`(iree_hal_semaphore_t*)async_sem` is valid | <br />| **Semaphore vtable** | Flat `iree_hal_semaphore_vtable_t` | Embeds <br />`iree_async_semaphore_vtable_t` at offset 0 | `query()` returns <br />`uint64_t` directly; `signal()` takes frontier; `fail()` → non-virtual <br />with `on_fail()` hook | <br />| **Device creation** | No params struct | <br />`iree_hal_device_create_params_t` with `proactor_pool` | All <br />`iree_hal_driver_create_device*` signatures change | <br />| **Wait flags** | `iree_hal_wait_flags_t` | `iree_async_wait_flags_t` | <br />Three-tier: `NONE` (block), `YIELD` (brief spin), `ACTIVE` (full spin) | <br />| **Wait mode** | `iree_hal_wait_mode_t` | `iree_async_wait_mode_t` | <br />Same values, moved to async layer | <br />| **Wait primitives** | `iree_wait_primitive_t` | <br />`iree_async_primitive_t` | Proactor-native; `WAIT_PRIMITIVE` → <br />`ASYNC_PRIMITIVE` in external timepoint types | <br />| **Wait source** | `iree_wait_source_ctl_fn_t` dispatch | <br />`iree_wait_source_resolve_fn_t` | Single function: sync when <br />`callback=NULL`, async when non-NULL | <br />| **Loop** | `iree/base/loop.h`, `iree_loop_*` | `iree/vm/loop.h`, <br />`iree_vm_loop_*` | Mechanical rename; only needed for VM inline <br />execution now | <br />| **Multi-wait** | Per-driver `vtable-&gt;wait_semaphores()` | <br />`iree_async_semaphore_multi_wait()` in base layer | Drivers no longer <br />implement this | <br />| **Executable cache** | `create_executable_cache(dev, id, loop, out)` | <br />`create_executable_cache(dev, id, out)` | `iree_loop_t` parameter <br />removed |</p> 
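<p>The "embed at offset 0" rows in the table above can be illustrated with a small, self-contained C sketch. This is only an illustration of the general toll-free-bridge pattern: every `demo_*` type and function below is a hypothetical stand-in invented for this example, not the actual `iree_async_semaphore_t` or `iree_hal_semaphore_t` definitions, and the monotonic-signal rule is an assumption about how a timeline semaphore typically behaves.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical stand-in for a shared async semaphore core (not IREE API).
typedef struct demo_async_semaphore_t {
  uint64_t timeline_value;  // last signaled timeline value
  int failed;               // nonzero once the semaphore has failed
} demo_async_semaphore_t;

// Hypothetical driver semaphore: shared core first, driver-only state after.
typedef struct demo_driver_semaphore_t {
  demo_async_semaphore_t async;  // must remain the first member
  int native_handle;             // placeholder for hardware-specific state
} demo_driver_semaphore_t;

// Compile-time check that the core really sits at offset 0.
static_assert(offsetof(demo_driver_semaphore_t, async) == 0,
              "async core must be the first member");

// Shared query logic, written once against the core type.
static uint64_t demo_async_semaphore_query(const demo_async_semaphore_t* s) {
  return s->timeline_value;
}

// Shared signal logic: assumed rule is that values must strictly increase
// and that a failed semaphore rejects further signals.
// Returns 1 if the value was accepted, 0 otherwise.
static int demo_async_semaphore_signal(demo_async_semaphore_t* s,
                                       uint64_t value) {
  if (s->failed || value <= s->timeline_value) return 0;
  s->timeline_value = value;
  return 1;
}

// The bridge itself: C guarantees that a pointer to a struct, suitably
// converted, points to its first member, so this cast is well-defined.
static demo_async_semaphore_t* demo_driver_semaphore_as_async(
    demo_driver_semaphore_t* s) {
  return (demo_async_semaphore_t*)s;
}
```

<p>Under these assumptions, a driver only adds its own fields after the embedded core, and any code that accepts the core type can operate on a driver semaphore through the cast without a separate lookup or allocation.</p>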
<p>**Deleted**: <br />- `iree/hal/utils/semaphore_base.h` — bridge timepoint API between <br />driver semaphores and async system <br />- `iree/base/wait_handle.h`, `iree/base/event_pool.h` — replaced by <br />async/proactor infrastructure <br />- `experimental/web/`, `experimental/webgpu/` — see Emscripten note <br />above</p> <p>### Review and verification</p> <p>The branch was developed iteratively and then rebased into a clean <br />23-commit sequence. A four-arc cross-validated review (multiple models <br />with manual verification) covered the full diff. All changes verified <br />under ASAN on Linux (`//runtime/...`), plus Windows (IOCP) and macOS <br />(kqueue) for the platform-specific async backends. CUDA and HIP drivers <br />confirmed to compile.</p> <p>ci-extra: all</p> <p>---------</p> <p>Co-authored-by: Claude &lt;[email protected]&gt;</p> iree-github-actions-bot tag:github.com,2008:Repository/208145128/iree-3.11.0rc20260316 2026-03-16T10:24:28Z iree candidate iree-3.11.0rc20260316 <p>[Codegen][ROCm] Fix vector distribution for transposed outputs (<a class="issue-link js-issue-link" href="https://github.com/iree-org/iree/pull/23791">#23791</a>)</p> <p>Layer norm-style dispatches with a multi-output generic that has a <br />transposed output used to crash with `failed to distribute` on a <br />proprietary model.</p> <p>Teach `shouldAttachLoweringConfig` to recognize non-identity output <br />indexing maps so the op gets a `lowering_config` and proper `to_layout` <br />anchors.</p> <p>Co-authored-by: Claude Opus 4.6 (1M context) &lt;[email protected]&gt;</p> iree-github-actions-bot tag:github.com,2008:Repository/208145128/iree-3.11.0rc20260315 2026-03-15T10:22:08Z iree candidate iree-3.11.0rc20260315 <p>[Codegen][ROCm] Fix crash in complex matmul configuration logic (<a class="issue-link js-issue-link" href="https://github.com/iree-org/iree/pull/23790">#23790</a>)</p> <p>`getContractionHeuristicSeeds` called <br />`problem.aType.getIntOrFloatBitWidth()` 
unconditionally, but aType can <br />be `complex&lt;f32&gt;` from complex batch matmul dispatches.</p> <p>Route complex contractions to the SIMT-based setContractConfig fallback <br />instead. This fixes a crash on a proprietary model.</p> <p>I checked the numerics against NumPy and the CPU backend: ROCm is as accurate as <br />CPU.</p> <p>Co-authored-by: Claude Opus 4.6 (1M context) &lt;[email protected]&gt;</p> iree-github-actions-bot tag:github.com,2008:Repository/208145128/iree-3.11.0rc20260314 2026-03-14T09:53:25Z iree candidate iree-3.11.0rc20260314 <p>[Codegen] Apply bounds to subgroup_id (<a class="issue-link js-issue-link" href="https://github.com/iree-org/iree/pull/23768">#23768</a>)</p> <p>Since we're about to start using subgroup_id more often with PCF, apply <br />bounds to it based on the subgroup size (or sizes) and the number of <br />threads in the workgroup. Also extend subgroup_size handling to account <br />for known subgroup sizes instead of giving up completely when there <br />isn't a fixed choice made yet.</p> <p>Also fix up some double-spaces in test attributes.</p> <p>---------</p> <p>Co-authored-by: Claude Opus 4.6 &lt;[email protected]&gt;</p> iree-github-actions-bot tag:github.com,2008:Repository/208145128/iree-3.11.0rc20260313 2026-03-13T10:38:22Z iree candidate iree-3.11.0rc20260313 <p>[Codegen][Tuner] Add col_major parameter to MMAAttr/VirtualMMAAttr Py…</p> <p>…thon binding (<a class="issue-link js-issue-link" href="https://github.com/iree-org/iree/pull/23757">#23757</a>)</p> <p>PR <a class="issue-link js-issue-link" href="https://github.com/iree-org/iree/pull/23633">#23633</a> removes the `attention_qk_matmul` and `attention_pv_matmul` <br />marker attributes from attention decomposition configs, replacing them <br />with a new `col_major = true` parameter on MMA intrinsics.</p> <p>This PR updates the Python bindings to support the `col_major` <br />parameter, enabling the tuner to generate attention configs compatible <br />with the new 
approach.</p> <p>After this PR, I will also add the required changes to the tuner side. </p> <p>Assisted-by: [Claude Code](<a href="https://claude.ai/code">https://claude.ai/code</a>)</p> <p>Signed-off-by: Bangtian Liu &lt;[email protected]&gt;</p> iree-github-actions-bot tag:github.com,2008:Repository/208145128/iree-3.11.0rc20260312 2026-03-12T19:21:16Z iree candidate iree-3.11.0rc20260312 <p>Sort cmake libs/files lists (<a class="issue-link js-issue-link" href="https://github.com/iree-org/iree/pull/23708">#23708</a>)</p> <p>Formatted using <a href="https://github.com/Hardcode84/3">https://github.com/Hardcode84/3</a></p> <p>---------</p> <p>Signed-off-by: Ivan Butygin &lt;[email protected]&gt;</p> iree-github-actions-bot