tag:github.com,2008:https://github.com/iree-org/iree/releasesTags from iree2026-03-20T01:58:50Ztag:github.com,2008:Repository/208145128/iree-3.11.0rc202603202026-03-20T10:41:20Ziree candidate iree-3.11.0rc20260320<p>[Codegen][CAPI] Fix C API assertion for GPU pipeline attributes in Tr…</p>
<p>…anslationInfoAttr (<a class="issue-link js-issue-link" href="https://github.com/iree-org/iree/pull/23868">#23868</a>)</p>
<p>Context: some changes have landed on the IREE side:
<br />- <a class="issue-link js-issue-link" href="https://github.com/iree-org/iree/pull/23590">#23590</a>
<br />- <a class="issue-link js-issue-link" href="https://github.com/iree-org/iree/pull/23687">#23687</a>
<br />- <a class="issue-link js-issue-link" href="https://github.com/iree-org/iree/pull/23816">#23816</a>
<br />and tuner CI error:
<br /><a href="https://github.com/nod-ai/amd-shark-ai/actions/runs/23314739415/job/67811065632?pr=2865#step:8:135">https://github.com/nod-ai/amd-shark-ai/actions/runs/23314739415/job/67811065632?pr=2865#step:8:135</a></p>
<p>This PR fixes the C API assertion in `TranslationInfoAttr.get()` to
<br />accept `PipelineAttr` in addition to `DispatchLoweringPassPipelineAttr`.</p>
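<p>The shape of the fix can be sketched with stand-in Python classes (the real check lives in the C API behind the bindings; these class and function names are illustrative, not IREE's):</p>

```python
# Stand-in attribute classes; the real types are IREE compiler attributes.
class DispatchLoweringPassPipelineAttr:
    pass

class PipelineAttr:
    pass

def make_translation_info(pipeline_attr):
    # Before the fix only DispatchLoweringPassPipelineAttr passed the
    # assertion; afterwards either pipeline attribute kind is accepted.
    if not isinstance(pipeline_attr,
                      (DispatchLoweringPassPipelineAttr, PipelineAttr)):
        raise TypeError("expected a pipeline attribute")
    return {"pass_pipeline": pipeline_attr}
```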
<p> Assisted-by: [Claude Code](<a href="https://claude.ai/code">https://claude.ai/code</a>)</p>
<p>Signed-off-by: Bangtian Liu <[email protected]></p>iree-github-actions-bottag:github.com,2008:Repository/208145128/v3.11.02026-03-19T23:25:34ZRelease v3.11.0sa-faizaltag:github.com,2008:Repository/208145128/iree-3.11.0rc202603192026-03-19T10:44:39Ziree candidate iree-3.11.0rc20260319<p>Refactor proactor pool, add frontier-carrying signals, and fix shared…</p>
<p>… infra. (<a class="issue-link js-issue-link" href="https://github.com/iree-org/iree/pull/23804">#23804</a>)</p>
<p>Proactor pool runner factory:
<br />- Extract thread management from proactor_pool into an injectable
<br />runner_factory callback. The pool creates proactors and delegates
<br />poll-driving to a factory, enabling platforms without C threads (wasm,
<br />embedded/RTOS) to use the pool with their own poll mechanisms.
<br />- The thread-based runner moves to a new proactor_thread_runner target.
<br />- _options_default() selects the thread runner on native platforms and
<br />no runner on platforms without threads, so all existing callsites work
<br />without changes.</p>
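<p>The runner-factory split described above can be sketched in Python (the `ProactorPool` and `thread_runner_factory` names here are illustrative, not the C API):</p>

```python
import threading

class ProactorPool:
    """The pool creates proactors but delegates poll-driving to an
    injectable runner factory; with no factory, the caller drives polling."""
    def __init__(self, count, runner_factory=None):
        self.proactors = [{"id": i, "polls": 0} for i in range(count)]
        self.runners = ([runner_factory(p) for p in self.proactors]
                        if runner_factory else [])

def thread_runner_factory(proactor):
    # Native platforms: drive each proactor's poll loop on its own thread.
    def run():
        for _ in range(3):  # stand-in for "poll until shutdown"
            proactor["polls"] += 1
    thread = threading.Thread(target=run)
    thread.start()
    return thread

pool = ProactorPool(2, runner_factory=thread_runner_factory)
for runner in pool.runners:
    runner.join()
```

<p>A platform without C threads simply passes no factory and pumps the proactors from its own event loop.</p>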
<p>Frontier-carrying signals:
<br />- Add optional frontier parameter to iree_hal_semaphore_signal() and
<br />iree_hal_semaphore_list_signal() for cross-device causal ordering.
<br />- Update all HAL drivers (local_task, local_sync, Vulkan, CUDA, HIP,
<br />AMDGPU, Metal) and CTS tests to pass the frontier parameter.
<br />- Add FIFO wait elision to semaphore submission tests.</p>
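<p>A minimal sketch of a frontier-carrying signal, assuming a frontier is a per-device map of minimum timeline values reached (the type and method names are illustrative):</p>

```python
class TimelineSemaphore:
    def __init__(self):
        self.value = 0
        self.frontier = {}  # device name -> timeline value causally reached

    def signal(self, new_value, frontier=None):
        # The frontier parameter is optional, so existing callers that only
        # advance the timeline keep working unchanged.
        assert new_value > self.value, "timeline values must advance"
        self.value = new_value
        if frontier is not None:
            for device, reached in frontier.items():
                self.frontier[device] = max(self.frontier.get(device, 0),
                                            reached)

sem = TimelineSemaphore()
sem.signal(1)                                   # old-style call, no frontier
sem.signal(2, frontier={"gpu0": 7, "gpu1": 3})  # carries causal ordering info
```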
<p>Shared fixes:
<br />- Fix iree_call_once no-op on IREE_SYNCHRONIZATION_DISABLE_UNSAFE
<br />platforms (wasm, bare-metal RISC-V). The single-threaded fallback never
<br />called the init function.
<br />- Guard file_transfer.c queue_copy fast path with DEVICE_VISIBLE check
<br />so HOST_LOCAL-only buffers fall through to the streaming path.
<br />- Add heap_buffer_wrap fallback in memory_file when device import fails.
<br />- Split threaded semaphore CTS tests into semaphore_thread_test.cc so
<br />platforms without C threading can run single-threaded tests.</p>
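<p>The iree_call_once bug is easy to model: the single-threaded fallback marked the flag as done but never invoked the initializer. A Python sketch of the broken and fixed fallbacks:</p>

```python
class OnceFlag:
    def __init__(self):
        self.done = False

def call_once_broken(flag, init_fn):
    # The buggy single-threaded fallback: flips the flag, forgets init_fn.
    if not flag.done:
        flag.done = True

def call_once_fixed(flag, init_fn):
    # Fixed: invoke the init function exactly once.
    if not flag.done:
        flag.done = True
        init_fn()

calls = []
flag = OnceFlag()
call_once_fixed(flag, lambda: calls.append("init"))
call_once_fixed(flag, lambda: calls.append("init"))  # second call is a no-op
```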
<p>Co-authored-by: Claude <[email protected]></p>iree-github-actions-bottag:github.com,2008:Repository/208145128/iree-3.11.0rc202603182026-03-18T10:20:38Ziree candidate iree-3.11.0rc20260318<p>Simplify RISC-V QEMU configuration and make CPU flags configurable (<a class="issue-link js-issue-link" href="https://github.com/iree-org/iree/pull/23777">#…</a></p>
<p><a class="issue-link js-issue-link" href="https://github.com/iree-org/iree/pull/23777">…23777</a>)</p>
<p>This change makes two improvements to RISC-V testing infrastructure:</p>
<p>1. Allow toolchain files to control QEMU CPU parameters via
<br />RISCV_QEMU_CPU_FLAGS variable, making it easier to customize CPU
<br />features for testing.</p>
<p>2. Unify QEMU binary configuration by replacing QEMU_RV64_BIN and
<br />QEMU_RV32_BIN with a single QEMU_BIN variable. Since the same build
<br />environment cannot support both riscv64 and riscv32 simultaneously,
<br />having separate variables is unnecessary.</p>
<p>Changes:
<br />- Add RISCV_QEMU_CPU_FLAGS to linux_riscv32.cmake and
<br />linux_riscv64.cmake
<br />- Pass QEMU_CPU_FLAGS environment variable to tests
<br />- Update run_riscv_test.sh to use QEMU_BIN and QEMU_CPU_FLAGS
<br />- Update GitHub workflow to use QEMU_BIN instead of QEMU_RV64_BIN</p>
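<p>The actual change lives in CMake and run_riscv_test.sh; this Python sketch only models the variable handling (the `QEMU_BIN` / `QEMU_CPU_FLAGS` names come from the change, while the `qemu_command` helper is hypothetical):</p>

```python
import os

def qemu_command(test_binary):
    # One QEMU_BIN variable replaces QEMU_RV64_BIN/QEMU_RV32_BIN, since one
    # build environment only ever targets one of riscv64 or riscv32.
    command = [os.environ["QEMU_BIN"]]
    cpu_flags = os.environ.get("QEMU_CPU_FLAGS", "")
    if cpu_flags:
        command += ["-cpu", cpu_flags]  # toolchain-file-controlled features
    return command + [test_binary]
```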
<p>Signed-off-by: Han-Kuan Chen <[email protected]></p>iree-github-actions-bottag:github.com,2008:Repository/208145128/iree-3.11.0rc202603172026-03-17T10:41:08Ziree candidate iree-3.11.0rc20260317<p>Unify HAL semaphores on async infrastructure. (<a class="issue-link js-issue-link" href="https://github.com/iree-org/iree/pull/23695">#23695</a>)</p>
<p>Many breaking(ish) API changes here: this is the grand unification of
<br />the HAL semaphore mechanism that unlocks heterogeneous execution and
<br />remoting, so it's worth it :) Buffers will be next (those still have
<br />issues and will need some iree_async_region_t work) but at least now
<br />synchronization functions the same across all layers of the stack and
<br />the kernel/devices have the ability to elide device-side waits thanks to
<br />the frontiers (not yet wired, but coming soon). The task system was also
<br />substantially cleaned up and now no longer has a poller thread (or a
<br />63-concurrent-waiter limit). Future changes will continue to optimize
<br />the task system to avoid additional thread hops to reduce CPU latency.</p>
<p>Note, CUDA is untested here beyond building, as I don't have access to a
<br />CUDA machine right now. Full reports from anyone with access to one
<br />would be appreciated. Or figuring out how we get a CUDA CI :)</p>
<p>---</p>
<p>This branch replaces IREE's legacy semaphore system — where each HAL
<br />driver implemented its own timeline semaphore from scratch — with a
<br />single `iree_async_semaphore_t` that every driver embeds. The HAL
<br />semaphore becomes a thin shell around a shared, well-tested core.</p>
<p>The result: 388 files changed, **-14,000 net lines**, 115 files deleted.
<br />The codebase gets simpler and more correct at the same time.</p>
<p>### What this does</p>
<p>**Unified type system.** Every HAL semaphore now embeds an
<br />`iree_async_semaphore_t` at offset 0 (toll-free bridge). The async
<br />semaphore owns the timeline value, failure status, timepoint list, and
<br />optional frontier. Driver-specific semaphore types (CUDA events, Vulkan
<br />timeline semaphores, Metal shared events, software semaphores) become
<br />wrappers that add only their hardware-specific signaling on top. This
<br />means timeline tracking, failure propagation, multi-wait, and timepoint
<br />dispatch are written once and shared by all eight backends.</p>
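<p>The offset-0 embedding is what makes the toll-free bridge work: a pointer to the outer driver semaphore is bit-identical to a pointer to the embedded core. A Python/ctypes sketch of the layout (struct names illustrative):</p>

```python
import ctypes

class AsyncSemaphore(ctypes.Structure):
    _fields_ = [("value", ctypes.c_uint64)]      # shared timeline state

class DriverSemaphore(ctypes.Structure):
    # Embedding the shared core at offset 0 means casting a pointer to the
    # outer struct yields a valid pointer to the inner one.
    _fields_ = [("base", AsyncSemaphore),
                ("hw_handle", ctypes.c_uint64)]  # driver-specific extras

sem = DriverSemaphore()
sem.base.value = 42
core = ctypes.cast(ctypes.pointer(sem),
                   ctypes.POINTER(AsyncSemaphore)).contents
```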
<p>**Centralized multi-wait.** Semaphore wait-any and wait-all are now
<br />implemented once in `iree_hal_semaphore_wait_list`, using the proactor's
<br />native wait primitives. The old approach — where each driver
<br />reimplemented multi-wait with varying degrees of correctness — is gone.
<br />The Vulkan driver gets a dedicated completion watcher thread that
<br />bridges Vulkan's `vkWaitSemaphores` into the async semaphore's timepoint
<br />system, so Vulkan waits participate in the same unified infrastructure
<br />as everyone else.</p>
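<p>The shared wait-any/wait-all logic can be sketched as one function over a semaphore list; this polling stand-in only illustrates the shape, where the real implementation uses the proactor's native wait primitives:</p>

```python
import threading
import time

class Semaphore:
    def __init__(self):
        self.value = 0

    def signal(self, value):
        self.value = value

def wait_list(pairs, wait_all, timeout=2.0):
    # One implementation serves both modes: `all` for wait-all, `any` for
    # wait-any, instead of each driver reimplementing multi-wait.
    combine = all if wait_all else any
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if combine(sem.value >= target for sem, target in pairs):
            return True
        time.sleep(0.001)
    return False

a, b = Semaphore(), Semaphore()
threading.Timer(0.05, lambda: a.signal(1)).start()
threading.Timer(0.10, lambda: b.signal(1)).start()
```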
<p>**Proactor integration.** The async proactor (io_uring / IOCP / kqueue)
<br />is now wired through device creation into every semaphore. This is the
<br />foundation for event-driven scheduling: instead of polling or
<br />busy-waiting on GPU completion, semaphore timepoints can be delivered
<br />through the OS's native async I/O mechanism. A proactor pool manages
<br />per-thread proactor instances so that device creation doesn't require
<br />callers to think about I/O infrastructure.</p>
<p>**Deletion cascade.** With the async semaphore as the single source of
<br />truth, a large amount of legacy infrastructure becomes dead code:
<br />- `semaphore_base.h/c` and the bridge timepoint API (the old
<br />compatibility layer between driver semaphores and the async system)
<br />- `iree_loop_t` and `loop_sync` (moved to VM where it's only needed for
<br />inline module execution)
<br />- `wait_handle`, `event_pool`, `wait_primitive` (replaced by async
<br />primitives)
<br />- The entire `experimental/web/` and `experimental/webgpu/` trees (see
<br />note below)
<br />- `iree_hal_wait_flags_t` (replaced by a clean three-tier `ACTIVE` /
<br />`YIELD` / `BLOCK` model)</p>
<p>### Emscripten / WebGPU</p>
<p>The old web and WebGPU samples are deleted in this branch. They were
<br />built on the emscripten loop, which was built on `wait_handle` and the
<br />old synchronous wait infrastructure — all of which is now gone.</p>
<p>This is intentional, not collateral damage. When emscripten support
<br />comes back, it will be built on the proactor system, which is a far more
<br />natural fit. The browser's event loop is fundamentally a proactor: you
<br />submit work (fetch, GPU dispatch, timer) and get called back when it
<br />completes. The old emscripten loop fought against this by trying to
<br />impose a synchronous polling model on an inherently callback-driven
<br />environment. A proactor backend for emscripten will work *with* the
<br />browser's execution model — `postMessage` for cross-worker signaling,
<br />`requestAnimationFrame` for frame pacing, GPU completion callbacks for
<br />timeline advancement — the same way io_uring works with the Linux kernel
<br />and IOCP works with Windows.</p>
<p>### Why this matters</p>
<p>The old semaphore system was the main obstacle to several things we want
<br />to do:</p>
<p>**Correct error propagation.** Every driver had its own failure
<br />handling, and most of them got edge cases wrong. With a single
<br />implementation, failure status propagates correctly through the entire
<br />pipeline: GPU error → driver callback → async semaphore → fence → user.</p>
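<p>With a single implementation, the failure path is one sticky status that reaches every waiter, past and future. A minimal sketch (names illustrative):</p>

```python
class AsyncSemaphore:
    def __init__(self):
        self.value = 0
        self.failure = None      # sticky failure status
        self.waiters = []        # pending (target_value, callback) pairs

    def fail(self, status):
        # Driver callback reports a device error: every pending waiter and
        # any later waiter observes the same status.
        self.failure = status
        for _, callback in self.waiters:
            callback(status)
        self.waiters.clear()

    def wait_async(self, target, callback):
        if self.failure is not None:
            callback(self.failure)   # late waiters still see the failure
        elif self.value >= target:
            callback(None)
        else:
            self.waiters.append((target, callback))

observed = []
sem = AsyncSemaphore()
sem.wait_async(5, observed.append)   # registered before the failure
sem.fail("GPU fault")
sem.wait_async(9, observed.append)   # registered after the failure
```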
<p>**Remote execution.** The frontier/axis system (already landed) needs
<br />semaphores that can be signaled from network events, not just GPU
<br />completions. The unified async semaphore makes this trivial — a
<br />network-backed semaphore is just an `iree_async_semaphore_t` with no
<br />hardware wrapper. Without unification, we'd need to either special-case
<br />remote semaphores in every driver's wait path or build a second parallel
<br />wait infrastructure.</p>
<p>**New driver development.** The AMDGPU driver (in progress) benefits
<br />directly: instead of building semaphore infrastructure from scratch, it
<br />embeds the async semaphore and implements only the HSA-specific
<br />signaling. Same for any future driver.</p>
<p>**Event-driven scheduling.** The proactor integration means we can move
<br />toward a model where the runtime reacts to completions rather than
<br />polling for them. This is necessary for efficient multi-device
<br />orchestration and for keeping CPU utilization low during GPU-bound
<br />workloads.</p>
<p>### API changes</p>
<p>| Area | Old | New | Notes |
<br />|------|-----|-----|-------|
<br />| **Semaphore embedding** | Each driver defines its own semaphore struct
<br />| Embed `iree_async_semaphore_t` at offset 0 | Toll-free bridge:
<br />`(iree_hal_semaphore_t*)async_sem` is valid |
<br />| **Semaphore vtable** | Flat `iree_hal_semaphore_vtable_t` | Embeds
<br />`iree_async_semaphore_vtable_t` at offset 0 | `query()` returns
<br />`uint64_t` directly; `signal()` takes frontier; `fail()` → non-virtual
<br />with `on_fail()` hook |
<br />| **Device creation** | No params struct |
<br />`iree_hal_device_create_params_t` with `proactor_pool` | All
<br />`iree_hal_driver_create_device*` signatures change |
<br />| **Wait flags** | `iree_hal_wait_flags_t` | `iree_async_wait_flags_t` |
<br />Three-tier: `NONE` (block), `YIELD` (brief spin), `ACTIVE` (full spin) |
<br />| **Wait mode** | `iree_hal_wait_mode_t` | `iree_async_wait_mode_t` |
<br />Same values, moved to async layer |
<br />| **Wait primitives** | `iree_wait_primitive_t` |
<br />`iree_async_primitive_t` | Proactor-native; `WAIT_PRIMITIVE` →
<br />`ASYNC_PRIMITIVE` in external timepoint types |
<br />| **Wait source** | `iree_wait_source_ctl_fn_t` dispatch |
<br />`iree_wait_source_resolve_fn_t` | Single function: sync when
<br />`callback=NULL`, async when non-NULL |
<br />| **Loop** | `iree/base/loop.h`, `iree_loop_*` | `iree/vm/loop.h`,
<br />`iree_vm_loop_*` | Mechanical rename; only needed for VM inline
<br />execution now |
<br />| **Multi-wait** | Per-driver `vtable->wait_semaphores()` |
<br />`iree_async_semaphore_multi_wait()` in base layer | Drivers no longer
<br />implement this |
<br />| **Executable cache** | `create_executable_cache(dev, id, loop, out)` |
<br />`create_executable_cache(dev, id, out)` | `iree_loop_t` parameter
<br />removed |</p>
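<p>The three-tier wait policy from the table can be sketched as a single loop over a poll function (the constants and helper here are illustrative, not the real flag values):</p>

```python
import time

WAIT_NONE, WAIT_YIELD, WAIT_ACTIVE = "none", "yield", "active"

def wait(poll, flag, timeout=1.0):
    # NONE blocks between polls, YIELD spins briefly before blocking,
    # ACTIVE spins the core for minimum latency.
    deadline = time.monotonic() + timeout
    spins = 0
    while time.monotonic() < deadline:
        if poll():
            return True
        if flag == WAIT_ACTIVE:
            continue              # full spin, never give up the core
        if flag == WAIT_YIELD and spins < 100:
            spins += 1            # brief spin first...
            continue
        time.sleep(0.001)         # ...then block (stand-in for an OS wait)
    return False
```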
<p>**Deleted**:
<br />- `iree/hal/utils/semaphore_base.h` — bridge timepoint API between
<br />driver semaphores and async system
<br />- `iree/base/wait_handle.h`, `iree/base/event_pool.h` — replaced by
<br />async/proactor infrastructure
<br />- `experimental/web/`, `experimental/webgpu/` — see Emscripten note
<br />above</p>
<p>### Review and verification</p>
<p>The branch was developed iteratively and then rebased into a clean
<br />23-commit sequence. A four-arc cross-validated review (multiple models
<br />with manual verification) covered the full diff. All changes verified
<br />under ASAN on Linux (`//runtime/...`), plus Windows (IOCP) and macOS
<br />(kqueue) for the platform-specific async backends. CUDA and HIP drivers
<br />confirmed to compile.</p>
<p>ci-extra: all</p>
<p>---------</p>
<p>Co-authored-by: Claude <[email protected]></p>iree-github-actions-bottag:github.com,2008:Repository/208145128/iree-3.11.0rc202603162026-03-16T10:24:28Ziree candidate iree-3.11.0rc20260316<p>[Codgen][ROCm] Fix vector distribution for transposed outputs (<a class="issue-link js-issue-link" href="https://github.com/iree-org/iree/pull/23791">#23791</a>)</p>
<p>Layer norm-style dispatches with a multi-output generic that has a
<br />transposed output used to crash with `failed to distribute` on a
<br />proprietary model.</p>
<p>Teach `shouldAttachLoweringConfig` to recognize non-identity output
<br />indexing maps so the op gets a `lowering_config` and proper `to_layout`
<br />anchors.</p>
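<p>The predicate change can be modeled with indexing maps as result permutations, so (d0, d1) -> (d1, d0) becomes [1, 0]; the function names mirror the PR but the representation is a simplification of MLIR affine maps:</p>

```python
def is_identity_map(result_dims):
    # An identity output map reads [0, 1, ..., n-1]; a transposed output
    # such as [1, 0] does not.
    return result_dims == list(range(len(result_dims)))

def should_attach_lowering_config(output_maps):
    # A multi-output generic with any non-identity output map now gets a
    # lowering_config (and thus proper to_layout anchors).
    return any(not is_identity_map(m) for m in output_maps)
```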
<p>Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]></p>iree-github-actions-bottag:github.com,2008:Repository/208145128/iree-3.11.0rc202603152026-03-15T10:22:08Ziree candidate iree-3.11.0rc20260315<p>[Codegen][ROCm] Fix crash in complex matmul configuration logic (<a class="issue-link js-issue-link" href="https://github.com/iree-org/iree/pull/23790">#23790</a>)</p>
<p>`getContractionHeuristicSeeds` called
<br />`problem.aType.getIntOrFloatBitWidth()` unconditionally, but aType can
<br />be `complex<f32>` from complex batch matmul dispatches.</p>
<p>Route complex contractions to the SIMT-based setContractConfig fallback
<br />instead. This fixes a crash on a proprietary model.</p>
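<p>The guard can be sketched by modeling element types as (kind, bits) pairs; the real code queries MLIR types, so this routing helper is illustrative only:</p>

```python
def pick_contract_config(a_type):
    kind, bits = a_type        # e.g. ("f", 32) or ("complex", 32)
    if kind == "complex":
        # complex<f32> has no single int/float bit width, so asking for one
        # asserted; route to the SIMT-based setContractConfig fallback.
        return "simt_fallback"
    return "heuristic_seeds"   # normal getContractionHeuristicSeeds path
```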
<p>I checked the numerics against NumPy and the CPU backend; ROCm is as
<br />accurate as the CPU.</p>
<p>Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]></p>iree-github-actions-bottag:github.com,2008:Repository/208145128/iree-3.11.0rc202603142026-03-14T09:53:25Ziree candidate iree-3.11.0rc20260314<p>[Codegen] Apply bounds to subgroup_id (<a class="issue-link js-issue-link" href="https://github.com/iree-org/iree/pull/23768">#23768</a>)</p>
<p>Since we're about to start using subgroup_id more often with PCF, apply
<br />bounds to it based on the subgroup size (or sizes) and the number of
<br />threads in the workgroup. Also extend subgroup_size handling to account
<br />for known subgroup sizes instead of giving up completely when there
<br />isn't a fixed choice made yet.</p>
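<p>The bound itself is simple arithmetic: even when several subgroup sizes are still possible, the smallest one gives a finite upper bound on subgroup_id. A sketch (helper name hypothetical):</p>

```python
def subgroup_id_bound(workgroup_threads, possible_subgroup_sizes):
    # subgroup_id < ceil(threads / size); with multiple candidate sizes the
    # smallest yields the largest (still finite) bound.
    smallest = min(possible_subgroup_sizes)
    return -(-workgroup_threads // smallest)  # ceiling division
```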
<p>Also fix up some double-spaces in test attributes.</p>
<p>---------</p>
<p>Co-authored-by: Claude Opus 4.6 <[email protected]></p>iree-github-actions-bottag:github.com,2008:Repository/208145128/iree-3.11.0rc202603132026-03-13T10:38:22Ziree candidate iree-3.11.0rc20260313<p>[Codegen][Tuner] Add col_major parameter to MMAAttr/VirtualMMAAttr Py…</p>
<p>…thon binding (<a class="issue-link js-issue-link" href="https://github.com/iree-org/iree/pull/23757">#23757</a>)</p>
<p>PR <a class="issue-link js-issue-link" href="https://github.com/iree-org/iree/pull/23633">#23633</a> removes the `attention_qk_matmul` and `attention_pv_matmul`
<br />marker attributes from attention decomposition configs, replacing them
<br />with a new `col_major = true` parameter on MMA intrinsics.</p>
<p>This PR updates the Python bindings to support the `col_major`
<br />parameter, enabling the tuner to generate attention configs compatible
<br />with the new approach.</p>
<p>After this PR, I will also add the required changes to the tuner side. </p>
<p>Assisted-by: [Claude Code](<a href="https://claude.ai/code">https://claude.ai/code</a>)</p>
<p>Signed-off-by: Bangtian Liu <[email protected]></p>iree-github-actions-bottag:github.com,2008:Repository/208145128/iree-3.11.0rc202603122026-03-12T19:21:16Ziree candidate iree-3.11.0rc20260312<p>Sort cmake libs/files lists (<a class="issue-link js-issue-link" href="https://github.com/iree-org/iree/pull/23708">#23708</a>)</p>
<p>Formatted using <a href="https://github.com/Hardcode84/3">https://github.com/Hardcode84/3</a></p>
<p>---------</p>
<p>Signed-off-by: Ivan Butygin <[email protected]></p>iree-github-actions-bot