tag:github.com,2008:https://github.com/iree-org/iree/releasesTags from iree2026-03-20T01:58:50Ztag:github.com,2008:Repository/208145128/iree-3.11.0rc202603202026-03-20T10:41:20Ziree candidate iree-3.11.0rc20260320<p>[Codegen][CAPI] Fix C API assertion for GPU pipeline attributes in Tr…</p>
<p>…anslationInfoAttr (<a class="issue-link js-issue-link" href="https://github.com/iree-org/iree/pull/23868">#23868</a>)</p>
<p>Context: some changes have landed on the IREE side:
<br />- <a class="issue-link js-issue-link" href="https://github.com/iree-org/iree/pull/23590">#23590</a>
<br />- <a class="issue-link js-issue-link" href="https://github.com/iree-org/iree/pull/23687">#23687</a>
<br />- <a class="issue-link js-issue-link" href="https://github.com/iree-org/iree/pull/23816">#23816</a>
<br />and tuner CI error:
<br /><a href="https://github.com/nod-ai/amd-shark-ai/actions/runs/23314739415/job/67811065632?pr=2865#step:8:135">https://github.com/nod-ai/amd-shark-ai/actions/runs/23314739415/job/67811065632?pr=2865#step:8:135</a></p>
<p>This PR fixes the C API assertion in `TranslationInfoAttr.get()` to
<br />accept `PipelineAttr` in addition to `DispatchLoweringPassPipelineAttr`.</p>
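<p>The shape of the fix can be sketched with stand-in Python classes (the real check lives in the C API behind the bindings; these class and function names are illustrative, not IREE's):</p>

```python
# Stand-in attribute classes; the real types are IREE compiler attributes.
class DispatchLoweringPassPipelineAttr:
    pass

class PipelineAttr:
    pass

def make_translation_info(pipeline_attr):
    # Before the fix only DispatchLoweringPassPipelineAttr passed the
    # assertion; afterwards either pipeline attribute kind is accepted.
    if not isinstance(pipeline_attr,
                      (DispatchLoweringPassPipelineAttr, PipelineAttr)):
        raise TypeError("expected a pipeline attribute")
    return {"pass_pipeline": pipeline_attr}
```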
<p> Assisted-by: [Claude Code](<a href="https://claude.ai/code">https://claude.ai/code</a>)</p>
<p>Signed-off-by: Bangtian Liu <[email protected]></p>iree-github-actions-bottag:github.com,2008:Repository/208145128/v3.11.02026-03-19T23:25:34ZRelease v3.11.0sa-faizaltag:github.com,2008:Repository/208145128/iree-3.11.0rc202603192026-03-19T10:44:39Ziree candidate iree-3.11.0rc20260319<p>Refactor proactor pool, add frontier-carrying signals, and fix shared…</p>
<p>… infra. (<a class="issue-link js-issue-link" href="https://github.com/iree-org/iree/pull/23804">#23804</a>)</p>
<p>Proactor pool runner factory:
<br />- Extract thread management from proactor_pool into an injectable
<br />runner_factory callback. The pool creates proactors and delegates
<br />poll-driving to a factory, enabling platforms without C threads (wasm,
<br />embedded/RTOS) to use the pool with their own poll mechanisms.
<br />- The thread-based runner moves to a new proactor_thread_runner target.
<br />- _options_default() selects the thread runner on native platforms and
<br />no runner on platforms without threads, so all existing callsites work
<br />without changes.</p>
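<p>The runner-factory split described above can be sketched in Python (the `ProactorPool` and `thread_runner_factory` names here are illustrative, not the C API):</p>

```python
import threading

class ProactorPool:
    """The pool creates proactors but delegates poll-driving to an
    injectable runner factory; with no factory, the caller drives polling."""
    def __init__(self, count, runner_factory=None):
        self.proactors = [{"id": i, "polls": 0} for i in range(count)]
        self.runners = ([runner_factory(p) for p in self.proactors]
                        if runner_factory else [])

def thread_runner_factory(proactor):
    # Native platforms: drive each proactor's poll loop on its own thread.
    def run():
        for _ in range(3):  # stand-in for "poll until shutdown"
            proactor["polls"] += 1
    thread = threading.Thread(target=run)
    thread.start()
    return thread

pool = ProactorPool(2, runner_factory=thread_runner_factory)
for runner in pool.runners:
    runner.join()
```

<p>A platform without C threads simply passes no factory and pumps the proactors from its own event loop.</p>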
<p>Frontier-carrying signals:
<br />- Add optional frontier parameter to iree_hal_semaphore_signal() and
<br />iree_hal_semaphore_list_signal() for cross-device causal ordering.
<br />- Update all HAL drivers (local_task, local_sync, Vulkan, CUDA, HIP,
<br />AMDGPU, Metal) and CTS tests to pass the frontier parameter.
<br />- Add FIFO wait elision to semaphore submission tests.</p>
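<p>A minimal sketch of a frontier-carrying signal, assuming a frontier is a per-device map of minimum timeline values reached (the type and method names are illustrative):</p>

```python
class TimelineSemaphore:
    def __init__(self):
        self.value = 0
        self.frontier = {}  # device name -> timeline value causally reached

    def signal(self, new_value, frontier=None):
        # The frontier parameter is optional, so existing callers that only
        # advance the timeline keep working unchanged.
        assert new_value > self.value, "timeline values must advance"
        self.value = new_value
        if frontier is not None:
            for device, reached in frontier.items():
                self.frontier[device] = max(self.frontier.get(device, 0),
                                            reached)

sem = TimelineSemaphore()
sem.signal(1)                                   # old-style call, no frontier
sem.signal(2, frontier={"gpu0": 7, "gpu1": 3})  # carries causal ordering info
```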
<p>Shared fixes:
<br />- Fix iree_call_once no-op on IREE_SYNCHRONIZATION_DISABLE_UNSAFE
<br />platforms (wasm, bare-metal RISC-V). The single-threaded fallback never
<br />called the init function.
<br />- Guard file_transfer.c queue_copy fast path with DEVICE_VISIBLE check
<br />so HOST_LOCAL-only buffers fall through to the streaming path.
<br />- Add heap_buffer_wrap fallback in memory_file when device import fails.
<br />- Split threaded semaphore CTS tests into semaphore_thread_test.cc so
<br />platforms without C threading can run single-threaded tests.</p>
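<p>The iree_call_once bug is easy to model: the single-threaded fallback marked the flag as done but never invoked the initializer. A Python sketch of the broken and fixed fallbacks:</p>

```python
class OnceFlag:
    def __init__(self):
        self.done = False

def call_once_broken(flag, init_fn):
    # The buggy single-threaded fallback: flips the flag, forgets init_fn.
    if not flag.done:
        flag.done = True

def call_once_fixed(flag, init_fn):
    # Fixed: invoke the init function exactly once.
    if not flag.done:
        flag.done = True
        init_fn()

calls = []
flag = OnceFlag()
call_once_fixed(flag, lambda: calls.append("init"))
call_once_fixed(flag, lambda: calls.append("init"))  # second call is a no-op
```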
<p>Co-authored-by: Claude <[email protected]></p>iree-github-actions-bottag:github.com,2008:Repository/208145128/iree-3.11.0rc202603182026-03-18T10:20:38Ziree candidate iree-3.11.0rc20260318<p>Simplify RISC-V QEMU configuration and make CPU flags configurable (<a class="issue-link js-issue-link" href="https://github.com/iree-org/iree/pull/23777">#…</a></p>
<p><a class="issue-link js-issue-link" href="https://github.com/iree-org/iree/pull/23777">…23777</a>)</p>
<p>This change makes two improvements to RISC-V testing infrastructure:</p>
<p>1. Allow toolchain files to control QEMU CPU parameters via
<br />RISCV_QEMU_CPU_FLAGS variable, making it easier to customize CPU
<br />features for testing.</p>
<p>2. Unify QEMU binary configuration by replacing QEMU_RV64_BIN and
<br />QEMU_RV32_BIN with a single QEMU_BIN variable. Since the same build
<br />environment cannot support both riscv64 and riscv32 simultaneously,
<br />having separate variables is unnecessary.</p>
<p>Changes:
<br />- Add RISCV_QEMU_CPU_FLAGS to linux_riscv32.cmake and
<br />linux_riscv64.cmake
<br />- Pass QEMU_CPU_FLAGS environment variable to tests
<br />- Update run_riscv_test.sh to use QEMU_BIN and QEMU_CPU_FLAGS
<br />- Update GitHub workflow to use QEMU_BIN instead of QEMU_RV64_BIN</p>
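<p>The actual change lives in CMake and run_riscv_test.sh; this Python sketch only models the variable handling (the `QEMU_BIN` / `QEMU_CPU_FLAGS` names come from the change, while the `qemu_command` helper is hypothetical):</p>

```python
import os

def qemu_command(test_binary):
    # One QEMU_BIN variable replaces QEMU_RV64_BIN/QEMU_RV32_BIN, since one
    # build environment only ever targets one of riscv64 or riscv32.
    command = [os.environ["QEMU_BIN"]]
    cpu_flags = os.environ.get("QEMU_CPU_FLAGS", "")
    if cpu_flags:
        command += ["-cpu", cpu_flags]  # toolchain-file-controlled features
    return command + [test_binary]
```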
<p>Signed-off-by: Han-Kuan Chen <[email protected]></p>iree-github-actions-bottag:github.com,2008:Repository/208145128/iree-3.11.0rc202603172026-03-17T10:41:08Ziree candidate iree-3.11.0rc20260317<p>Unify HAL semaphores on async infrastructure. (<a class="issue-link js-issue-link" href="https://github.com/iree-org/iree/pull/23695">#23695</a>)</p>
<p>Many breaking(ish) API changes here: this is the grand unification of
<br />the HAL semaphore mechanism that unlocks heterogeneous execution and
<br />remoting, so it's worth it :) Buffers will be next (those still have
<br />issues and will need some iree_async_region_t work) but at least now
<br />synchronization functions the same across all layers of the stack and
<br />the kernel/devices have the ability to elide device-side waits thanks to
<br />the frontiers (not yet wired, but coming soon). The task system was also
<br />substantially cleaned up and now no longer has a poller thread (or a
<br />63-concurrent-waiter limit). Future changes will continue to optimize
<br />the task system to avoid additional thread hops to reduce CPU latency.</p>
<p>Note, CUDA is untested here beyond building, as I don't have access to a
<br />CUDA machine right now. Full reports from anyone with access to one
<br />would be appreciated. Or figuring out how we get a CUDA CI :)</p>
<p>---</p>
<p>This branch replaces IREE's legacy semaphore system — where each HAL
<br />driver implemented its own timeline semaphore from scratch — with a
<br />single `iree_async_semaphore_t` that every driver embeds. The HAL
<br />semaphore becomes a thin shell around a shared, well-tested core.</p>
<p>The result: 388 files changed, **-14,000 net lines**, 115 files deleted.
<br />The codebase gets simpler and more correct at the same time.</p>
<p>### What this does</p>
<p>**Unified type system.** Every HAL semaphore now embeds an
<br />`iree_async_semaphore_t` at offset 0 (toll-free bridge). The async
<br />semaphore owns the timeline value, failure status, timepoint list, and
<br />optional frontier. Driver-specific semaphore types (CUDA events, Vulkan
<br />timeline semaphores, Metal shared events, software semaphores) become
<br />wrappers that add only their hardware-specific signaling on top. This
<br />means timeline tracking, failure propagation, multi-wait, and timepoint
<br />dispatch are written once and shared by all eight backends.</p>
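<p>The offset-0 embedding is what makes the toll-free bridge work: a pointer to the outer driver semaphore is bit-identical to a pointer to the embedded core. A Python/ctypes sketch of the layout (struct names illustrative):</p>

```python
import ctypes

class AsyncSemaphore(ctypes.Structure):
    _fields_ = [("value", ctypes.c_uint64)]      # shared timeline state

class DriverSemaphore(ctypes.Structure):
    # Embedding the shared core at offset 0 means casting a pointer to the
    # outer struct yields a valid pointer to the inner one.
    _fields_ = [("base", AsyncSemaphore),
                ("hw_handle", ctypes.c_uint64)]  # driver-specific extras

sem = DriverSemaphore()
sem.base.value = 42
core = ctypes.cast(ctypes.pointer(sem),
                   ctypes.POINTER(AsyncSemaphore)).contents
```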
<p>**Centralized multi-wait.** Semaphore wait-any and wait-all are now
<br />implemented once in `iree_hal_semaphore_wait_list`, using the proactor's
<br />native wait primitives. The old approach — where each driver
<br />reimplemented multi-wait with varying degrees of correctness — is gone.
<br />The Vulkan driver gets a dedicated completion watcher thread that
<br />bridges Vulkan's `vkWaitSemaphores` into the async semaphore's timepoint
<br />system, so Vulkan waits participate in the same unified infrastructure
<br />as everyone else.</p>
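<p>The shared wait-any/wait-all logic can be sketched as one function over a semaphore list; this polling stand-in only illustrates the shape, where the real implementation uses the proactor's native wait primitives:</p>

```python
import threading
import time

class Semaphore:
    def __init__(self):
        self.value = 0

    def signal(self, value):
        self.value = value

def wait_list(pairs, wait_all, timeout=2.0):
    # One implementation serves both modes: `all` for wait-all, `any` for
    # wait-any, instead of each driver reimplementing multi-wait.
    combine = all if wait_all else any
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if combine(sem.value >= target for sem, target in pairs):
            return True
        time.sleep(0.001)
    return False

a, b = Semaphore(), Semaphore()
threading.Timer(0.05, lambda: a.signal(1)).start()
threading.Timer(0.10, lambda: b.signal(1)).start()
```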
<p>**Proactor integration.** The async proactor (io_uring / IOCP / kqueue)
<br />is now wired through device creation into every semaphore. This is the
<br />foundation for event-driven scheduling: instead of polling or
<br />busy-waiting on GPU completion, semaphore timepoints can be delivered
<br />through the OS's native async I/O mechanism. A proactor pool manages
<br />per-thread proactor instances so that device creation doesn't require
<br />callers to think about I/O infrastructure.</p>
<p>**Deletion cascade.** With the async semaphore as the single source of
<br />truth, a large amount of legacy infrastructure becomes dead code:
<br />- `semaphore_base.h/c` and the bridge timepoint API (the old
<br />compatibility layer between driver semaphores and the async system)
<br />- `iree_loop_t` and `loop_sync` (moved to VM where it's only needed for
<br />inline module execution)
<br />- `wait_handle`, `event_pool`, `wait_primitive` (replaced by async
<br />primitives)
<br />- The entire `experimental/web/` and `experimental/webgpu/` trees (see
<br />note below)
<br />- `iree_hal_wait_flags_t` (replaced by a clean three-tier `ACTIVE` /
<br />`YIELD` / `BLOCK` model)</p>
<p>### Emscripten / WebGPU</p>
<p>The old web and WebGPU samples are deleted in this branch. They were
<br />built on the emscripten loop, which was built on `wait_handle` and the
<br />old synchronous wait infrastructure — all of which is now gone.</p>
<p>This is intentional, not collateral damage. When emscripten support
<br />comes back, it will be built on the proactor system, which is a far more
<br />natural fit. The browser's event loop is fundamentally a proactor: you
<br />submit work (fetch, GPU dispatch, timer) and get called back when it
<br />completes. The old emscripten loop fought against this by trying to
<br />impose a synchronous polling model on an inherently callback-driven
<br />environment. A proactor backend for emscripten will work *with* the
<br />browser's execution model — `postMessage` for cross-worker signaling,
<br />`requestAnimationFrame` for frame pacing, GPU completion callbacks for
<br />timeline advancement — the same way io_uring works with the Linux kernel
<br />and IOCP works with Windows.</p>
<p>### Why this matters</p>
<p>The old semaphore system was the main obstacle to several things we want
<br />to do:</p>
<p>**Correct error propagation.** Every driver had its own failure
<br />handling, and most of them got edge cases wrong. With a single
<br />implementation, failure status propagates correctly through the entire
<br />pipeline: GPU error → driver callback → async semaphore → fence → user.</p>
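<p>With a single implementation, the failure path is one sticky status that reaches every waiter, past and future. A minimal sketch (names illustrative):</p>

```python
class AsyncSemaphore:
    def __init__(self):
        self.value = 0
        self.failure = None      # sticky failure status
        self.waiters = []        # pending (target_value, callback) pairs

    def fail(self, status):
        # Driver callback reports a device error: every pending waiter and
        # any later waiter observes the same status.
        self.failure = status
        for _, callback in self.waiters:
            callback(status)
        self.waiters.clear()

    def wait_async(self, target, callback):
        if self.failure is not None:
            callback(self.failure)   # late waiters still see the failure
        elif self.value >= target:
            callback(None)
        else:
            self.waiters.append((target, callback))

observed = []
sem = AsyncSemaphore()
sem.wait_async(5, observed.append)   # registered before the failure
sem.fail("GPU fault")
sem.wait_async(9, observed.append)   # registered after the failure
```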
<p>**Remote execution.** The frontier/axis system (already landed) needs
<br />semaphores that can be signaled from network events, not just GPU
<br />completions. The unified async semaphore makes this trivial — a
<br />network-backed semaphore is just an `iree_async_semaphore_t` with no
<br />hardware wrapper. Without unification, we'd need to either special-case
<br />remote semaphores in every driver's wait path or build a second parallel
<br />wait infrastructure.</p>
<p>**New driver development.** The AMDGPU driver (in progress) benefits
<br />directly: instead of building semaphore infrastructure from scratch, it
<br />embeds the async semaphore and implements only the HSA-specific
<br />signaling. Same for any future driver.</p>
<p>**Event-driven scheduling.** The proactor integration means we can move
<br />toward a model where the runtime reacts to completions rather than
<br />polling for them. This is necessary for efficient multi-device
<br />orchestration and for keeping CPU utilization low during GPU-bound
<br />workloads.</p>
<p>### API changes</p>
<p>| Area | Old | New | Notes |
<br />|------|-----|-----|-------|
<br />| **Semaphore embedding** | Each driver defines its own semaphore struct
<br />| Embed `iree_async_semaphore_t` at offset 0 | Toll-free bridge:
<br />`(iree_hal_semaphore_t*)async_sem` is valid |
<br />| **Semaphore vtable** | Flat `iree_hal_semaphore_vtable_t` | Embeds
<br />`iree_async_semaphore_vtable_t` at offset 0 | `query()` returns
<br />`uint64_t` directly; `signal()` takes frontier; `fail()` → non-virtual
<br />with `on_fail()` hook |
<br />| **Device creation** | No params struct |
<br />`iree_hal_device_create_params_t` with `proactor_pool` | All
<br />`iree_hal_driver_create_device*` signatures change |
<br />| **Wait flags** | `iree_hal_wait_flags_t` | `iree_async_wait_flags_t` |
<br />Three-tier: `NONE` (block), `YIELD` (brief spin), `ACTIVE` (full spin) |
<br />| **Wait mode** | `iree_hal_wait_mode_t` | `iree_async_wait_mode_t` |
<br />Same values, moved to async layer |
<br />| **Wait primitives** | `iree_wait_primitive_t` |
<br />`iree_async_primitive_t` | Proactor-native; `WAIT_PRIMITIVE` →
<br />`ASYNC_PRIMITIVE` in external timepoint types |
<br />| **Wait source** | `iree_wait_source_ctl_fn_t` dispatch |
<br />`iree_wait_source_resolve_fn_t` | Single function: sync when
<br />`callback=NULL`, async when non-NULL |
<br />| **Loop** | `iree/base/loop.h`, `iree_loop_*` | `iree/vm/loop.h`,
<br />`iree_vm_loop_*` | Mechanical rename; only needed for VM inline
<br />execution now |
<br />| **Multi-wait** | Per-driver `vtable->wait_semaphores()` |
<br />`iree_async_semaphore_multi_wait()` in base layer | Drivers no longer
<br />implement this |
<br />| **Executable cache** | `create_executable_cache(dev, id, loop, out)` |
<br />`create_executable_cache(dev, id, out)` | `iree_loop_t` parameter
<br />removed |</p>
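<p>The three-tier wait policy from the table can be sketched as a single loop over a poll function (the constants and helper here are illustrative, not the real flag values):</p>

```python
import time

WAIT_NONE, WAIT_YIELD, WAIT_ACTIVE = "none", "yield", "active"

def wait(poll, flag, timeout=1.0):
    # NONE blocks between polls, YIELD spins briefly before blocking,
    # ACTIVE spins the core for minimum latency.
    deadline = time.monotonic() + timeout
    spins = 0
    while time.monotonic() < deadline:
        if poll():
            return True
        if flag == WAIT_ACTIVE:
            continue              # full spin, never give up the core
        if flag == WAIT_YIELD and spins < 100:
            spins += 1            # brief spin first...
            continue
        time.sleep(0.001)         # ...then block (stand-in for an OS wait)
    return False
```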
<p>**Deleted**:
<br />- `iree/hal/utils/semaphore_base.h` — bridge timepoint API between
<br />driver semaphores and async system
<br />- `iree/base/wait_handle.h`, `iree/base/event_pool.h` — replaced by
<br />async/proactor infrastructure
<br />- `experimental/web/`, `experimental/webgpu/` — see Emscripten note
<br />above</p>
<p>### Review and verification</p>
<p>The branch was developed iteratively and then rebased into a clean
<br />23-commit sequence. A four-arc cross-validated review (multiple models
<br />with manual verification) covered the full diff. All changes verified
<br />under ASAN on Linux (`//runtime/...`), plus Windows (IOCP) and macOS
<br />(kqueue) for the platform-specific async backends. CUDA and HIP drivers
<br />confirmed to compile.</p>
<p>ci-extra: all</p>
<p>---------</p>
<p>Co-authored-by: Claude <[email protected]></p>iree-github-actions-bottag:github.com,2008:Repository/208145128/iree-3.11.0rc202603162026-03-16T10:24:28Ziree candidate iree-3.11.0rc20260316<p>[Codgen][ROCm] Fix vector distribution for transposed outputs (<a class="issue-link js-issue-link" href="https://github.com/iree-org/iree/pull/23791">#23791</a>)</p>
<p>Layer norm-style dispatches with a multi-output generic that has a
<br />transposed output used to crash with `failed to distribute` on a
<br />proprietary model.</p>
<p>Teach `shouldAttachLoweringConfig` to recognize non-identity output
<br />indexing maps so the op gets a `lowering_config` and proper `to_layout`
<br />anchors.</p>
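<p>The predicate change can be modeled with indexing maps as result permutations, so (d0, d1) -> (d1, d0) becomes [1, 0]; the function names mirror the PR but the representation is a simplification of MLIR affine maps:</p>

```python
def is_identity_map(result_dims):
    # An identity output map reads [0, 1, ..., n-1]; a transposed output
    # such as [1, 0] does not.
    return result_dims == list(range(len(result_dims)))

def should_attach_lowering_config(output_maps):
    # A multi-output generic with any non-identity output map now gets a
    # lowering_config (and thus proper to_layout anchors).
    return any(not is_identity_map(m) for m in output_maps)
```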
<p>Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]></p>iree-github-actions-bottag:github.com,2008:Repository/208145128/iree-3.11.0rc202603152026-03-15T10:22:08Ziree candidate iree-3.11.0rc20260315<p>[Codegen][ROCm] Fix crash in complex matmul configuration logic (<a class="issue-link js-issue-link" href="https://github.com/iree-org/iree/pull/23790">#23790</a>)</p>
<p>`getContractionHeuristicSeeds` called
<br />`problem.aType.getIntOrFloatBitWidth()` unconditionally, but aType can
<br />be `complex<f32>` from complex batch matmul dispatches.</p>
<p>Route complex contractions to the SIMT-based setContractConfig fallback
<br />instead. This fixes a crash on a proprietary model.</p>
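<p>The guard can be sketched by modeling element types as (kind, bits) pairs; the real code queries MLIR types, so this routing helper is illustrative only:</p>

```python
def pick_contract_config(a_type):
    kind, bits = a_type        # e.g. ("f", 32) or ("complex", 32)
    if kind == "complex":
        # complex<f32> has no single int/float bit width, so asking for one
        # asserted; route to the SIMT-based setContractConfig fallback.
        return "simt_fallback"
    return "heuristic_seeds"   # normal getContractionHeuristicSeeds path
```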
<p>I checked the numerics against NumPy and the CPU backend; ROCm is as
<br />accurate as the CPU.</p>
<p>Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]></p>iree-github-actions-bottag:github.com,2008:Repository/208145128/iree-3.11.0rc202603142026-03-14T09:53:25Ziree candidate iree-3.11.0rc20260314<p>[Codegen] Apply bounds to subgroup_id (<a class="issue-link js-issue-link" href="https://github.com/iree-org/iree/pull/23768">#23768</a>)</p>
<p>Since we're about to start using subgroup_id more often with PCF, apply
<br />bounds to it based on the subgroup size (or sizes) and the number of
<br />threads in the workgroup. Also extend subgroup_size handling to account
<br />for known subgroup sizes instead of giving up completely when there
<br />isn't a fixed choice made yet.</p>
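<p>The bound itself is simple arithmetic: even when several subgroup sizes are still possible, the smallest one gives a finite upper bound on subgroup_id. A sketch (helper name hypothetical):</p>

```python
def subgroup_id_bound(workgroup_threads, possible_subgroup_sizes):
    # subgroup_id < ceil(threads / size); with multiple candidate sizes the
    # smallest yields the largest (still finite) bound.
    smallest = min(possible_subgroup_sizes)
    return -(-workgroup_threads // smallest)  # ceiling division
```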
<p>Also fix up some double-spaces in test attributes.</p>
<p>---------</p>
<p>Co-authored-by: Claude Opus 4.6 <[email protected]></p>iree-github-actions-bottag:github.com,2008:Repository/208145128/iree-3.11.0rc202603132026-03-13T10:38:22Ziree candidate iree-3.11.0rc20260313<p>[Codegen][Tuner] Add col_major parameter to MMAAttr/VirtualMMAAttr Py…</p>
<p>…thon binding (<a class="issue-link js-issue-link" href="https://github.com/iree-org/iree/pull/23757">#23757</a>)</p>
<p>PR <a class="issue-link js-issue-link" href="https://github.com/iree-org/iree/pull/23633">#23633</a> removes the `attention_qk_matmul` and `attention_pv_matmul`
<br />marker attributes from attention decomposition configs, replacing them
<br />with a new `col_major = true` parameter on MMA intrinsics.</p>
<p>This PR updates the Python bindings to support the `col_major`
<br />parameter, enabling the tuner to generate attention configs compatible
<br />with the new approach.</p>
<p>After this PR, I will also add the required changes to the tuner side. </p>
<p>Assisted-by: [Claude Code](<a href="https://claude.ai/code">https://claude.ai/code</a>)</p>
<p>Signed-off-by: Bangtian Liu <[email protected]></p>iree-github-actions-bottag:github.com,2008:Repository/208145128/iree-3.11.0rc202603122026-03-12T19:21:16Ziree candidate iree-3.11.0rc20260312<p>Sort cmake libs/files lists (<a class="issue-link js-issue-link" href="https://github.com/iree-org/iree/pull/23708">#23708</a>)</p>
<p>Formatted using <a href="https://github.com/Hardcode84/3">https://github.com/Hardcode84/3</a></p>
<p>---------</p>
<p>Signed-off-by: Ivan Butygin <[email protected]></p>iree-github-actions-bot