iree/async/: Proactor-based async I/O and causal frontier scheduling#23527
Conversation
stellaraccident left a comment
I will admit that it feels "disquieting" to just LGTM something of this scale, but that is what I am going to do. I spent several hours reviewing this both in terms of design and detail with AI assistance, and I endorse it -- both in terms of approach and code quality. A key point for the latter is that the author produced a better-tested system with more comprehensive controls and follow-through than I have seen from a human in a long time.
I'll at least comment that the idea of the system here makes sense and has a correct vibe, so it's probably fine. I'm going to be somewhat hesitant that this is landing in one chunk instead of as a structured sequence of commits that people might try to understand.
Hesitance is ok. But this is fine to land. I was hesitant reviewing it but we're not going to be arbitrarily splitting it to appease the humans. It is completely understandable with AI assistance.
Co-Authored-By: Claude <[email protected]>
Foundation types for the async I/O subsystem:
- types.h: Core typedefs (wait modes, socket flags, operation types)
- primitive.h: Platform fd/handle wrappers with lifecycle helpers
- affinity.h: CPU affinity masks for NUMA-aware scheduling
- span.h: Non-owning buffer spans for vectored I/O
- address.h: Network address abstraction (IPv4/IPv6/Unix/abstract) with platform-specific resolution (POSIX, Win32, generic fallback)
- region.h/region.c: Buffer region descriptors for scatter/gather
- slab.h/slab.c: Fixed-slot slab allocator for pool-backed objects
Also includes PlatformSelect support in bazel_to_cmake_converter.py for target_compatible_with constraints in CMake generation.
Co-Authored-By: Claude <[email protected]>
Cross-platform event signaling for waking proactor poll loops:
- event.h/event.c: Wraps eventfd (Linux) or pipe (macOS/BSD) with create/set/reset/destroy lifecycle. Platform differences are abstracted: Linux uses a single fd for both monitoring and signaling; macOS/BSD uses a pipe pair (read end for monitoring, write end for signaling).
- event_pool.h/event_pool.c: Pre-allocated pool of events for amortized allocation cost. Uses slab-based storage with acquire/release semantics for pool entries.
- event_pool_test.cc: Tests for pool lifecycle, acquire/release patterns, and concurrent access.
Co-Authored-By: Claude <[email protected]>
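The eventfd-vs-pipe split described above can be illustrated with a minimal standalone sketch. The `my_event_*` names are hypothetical stand-ins, not IREE's actual API; the point is the single-fd vs pipe-pair abstraction and the nonblocking/cloexec setup on the fallback path:

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>
#if defined(__linux__)
#include <sys/eventfd.h>
#endif

typedef struct {
  int read_fd;   // fd to monitor for readability
  int write_fd;  // fd to signal (same as read_fd on Linux)
} my_event_t;

static int my_event_create(my_event_t* e) {
#if defined(__linux__)
  int fd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
  if (fd < 0) return -1;
  e->read_fd = e->write_fd = fd;  // single fd serves both roles
#else
  int fds[2];
  if (pipe(fds) != 0) return -1;
  // Read end is monitored; write end is signaled. Both must be
  // nonblocking so a full pipe or empty read never stalls the poll loop.
  for (int i = 0; i < 2; ++i) {
    if (fcntl(fds[i], F_SETFL, O_NONBLOCK) != 0 ||
        fcntl(fds[i], F_SETFD, FD_CLOEXEC) != 0) {
      close(fds[0]);
      close(fds[1]);
      return -1;
    }
  }
  e->read_fd = fds[0];
  e->write_fd = fds[1];
#endif
  return 0;
}

static void my_event_set(my_event_t* e) {
  uint64_t one = 1;
  ssize_t n = write(e->write_fd, &one, sizeof(one));
  (void)n;  // EAGAIN means already signaled; benign
}

// Consumes (resets) the signal if present; returns nonzero if it was set.
static int my_event_consume(my_event_t* e) {
  uint64_t value = 0;
  return read(e->read_fd, &value, sizeof(value)) > 0;
}
```

A poll loop would register `read_fd` with poll/epoll/kqueue and call the consume function after wakeup.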
Thread-safe signaling primitives and cross-primitive relay system:
- notification.h/notification.c: Epoch-based notification with
atomic signal/query. Wraps platform futex or eventfd for
efficient cross-thread wakeup. Supports wait tokens to detect
signals that occur after a query snapshot.
- relay.h: Connects event sources to sinks ("when X happens,
trigger Y"). Sources can be PRIMITIVE (fd readability) or
NOTIFICATION (epoch advance). Sinks can be SIGNAL_PRIMITIVE
(write to eventfd) or SIGNAL_NOTIFICATION (signal notification).
The relay interface is implemented by each backend with
platform-specific dispatch (io_uring multishot poll, POSIX
event_set registration, etc.).
Co-Authored-By: Claude <[email protected]>
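The epoch-plus-wait-token mechanism above can be sketched in a few lines. This is an illustrative standalone version (the `my_notification_*` names are hypothetical, and the real implementation additionally wakes sleepers via futex or eventfd): a query snapshots the epoch as a token, and a later comparison detects any signal that happened after the snapshot.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
  _Atomic uint64_t epoch;  // advances on every signal
} my_notification_t;

typedef uint64_t my_wait_token_t;

// Snapshot the current epoch before deciding to wait.
static my_wait_token_t my_notification_prepare_wait(my_notification_t* n) {
  return atomic_load(&n->epoch);
}

// Signal: advance the epoch. The real implementation would also issue a
// futex wake or eventfd write here to rouse blocked waiters.
static void my_notification_signal(my_notification_t* n) {
  atomic_fetch_add(&n->epoch, 1);
}

// True if any signal occurred after the token's snapshot; a waiter that
// sees this must not block, avoiding the lost-wakeup race.
static bool my_notification_signaled_since(my_notification_t* n,
                                           my_wait_token_t token) {
  return atomic_load(&n->epoch) != token;
}
```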
Causal ordering and timeline synchronization primitives:
- frontier.h/frontier.c: Causal frontiers represent a snapshot of semaphore timeline positions. Used to propagate ordering guarantees across machines in the remoting layer. Supports merge, advance, and comparison operations.
- frontier_tracker.h/frontier_tracker.c: Tracks frontier state across multiple concurrent operations, coalescing updates and detecting when frontiers have advanced past target positions.
- semaphore.h/semaphore.c: Timeline semaphores with monotonically increasing uint64 values. Supports signal, query, wait, and failure propagation. Waiters are woken via notification when the timeline advances. Failed semaphores propagate their error status to all current and future waiters.
Includes unit tests and benchmarks for all three components.
Co-Authored-By: Claude <[email protected]>
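The frontier merge/compare semantics amount to an elementwise max and an elementwise "at least" test over timeline positions. A minimal sketch with a fixed axis count (the real component presumably handles dynamic axis sets; names here are hypothetical):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MY_AXIS_COUNT 4  // fixed for the sketch; real frontiers are dynamic

// One timeline value per axis; 0 means "nothing observed yet".
typedef struct {
  uint64_t values[MY_AXIS_COUNT];
} my_frontier_t;

// Merge: elementwise max. Combining two frontiers yields the earliest
// snapshot that causally dominates both inputs.
static void my_frontier_merge(my_frontier_t* dst, const my_frontier_t* src) {
  for (int i = 0; i < MY_AXIS_COUNT; ++i) {
    if (src->values[i] > dst->values[i]) dst->values[i] = src->values[i];
  }
}

// Reached: every axis must be at or past the target position. This is a
// partial order: two frontiers can each fail to reach the other.
static bool my_frontier_reached(const my_frontier_t* f,
                                const my_frontier_t* target) {
  for (int i = 0; i < MY_AXIS_COUNT; ++i) {
    if (f->values[i] < target->values[i]) return false;
  }
  return true;
}
```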
Async operation descriptors and resource types:
- operation.h: Base operation type with completion callback, flags
(LINKED for kernel-chained SQEs), and type discriminator for all
async operations.
- operations/: Per-category operation structs with documentation of
platform availability, threading model, and optimization paths:
- scheduling.h: nop, timer, event wait, sequence, notification
wait/signal
- net.h: socket accept/connect/send/recv/sendto/recvfrom/close
with multishot and zerocopy support
- file.h: read/write/readv/writev/fsync
- futex.h: futex wait/wake (io_uring 6.7+)
- message.h: internal proactor messages
- semaphore.h: semaphore wait/signal with frontier propagation
- socket.h/socket.c: Socket abstraction over platform fd/handle
with address binding, state machine tracking, and option helpers.
- file.h/file.c: File descriptor wrapper with open/close lifecycle.
- buffer_pool.h/buffer_pool.c: Pooled buffer management for
io_uring provided buffers and recv slab allocation.
Co-Authored-By: Claude <[email protected]>
Proactor interface and backend-shared implementation utilities:
- proactor.h: The central async I/O interface. Defines the vtable contract for submit/poll/cancel with thread-safety guarantees (submit is thread-safe via MPSC queue; poll has single-thread ownership). Documents the operation lifecycle, resource retention model, and platform capability negotiation.
- proactor.c/proactor_platform.c: Dispatches to platform backends via static registration (alwayslink pattern).
- api.h: Public API umbrella header.
- README.md: Architecture documentation covering the proactor model, operation lifecycle, data flow, and backend contracts.
Shared utilities (runtime/src/iree/async/util/):
- completion_pool: Pre-allocated completion entries for poll results
- continuation: Chained status-returning callbacks
- intrusive_list: Doubly-linked list without allocation
- message_pool: Pool for internal proactor messages
- operation_pool: Typed operation slab allocator
- proactor_thread: Helper for running a proactor on a dedicated thread with proper lifecycle management
- ready_pool: Lock-free MPSC pool for ready completions
- sequence_emulation: Backend-agnostic sequence operation dispatch supporting both LINK path (kernel-chained SQEs) and emulation path (step_fn callbacks between steps)
- signal: SIGPIPE suppression and signal handling utilities
- test_base.h: Test fixture base for proactor-backed tests
Co-Authored-By: Claude <[email protected]>
Conformance test suite for validating proactor backend implementations. Tests are backend-agnostic: each test suite is compiled against every registered backend via link-time composition (alwayslink registration libraries + shared test_main/benchmark_main entry points).
CTS framework (cts/util/):
- registry.h/registry.cc: Backend registration and discovery with CarrierPairFactory for test isolation
- test_base.h: GTest fixture with proactor lifecycle management
- benchmark_base.h: Google Benchmark fixture with proactor setup
- socket_test_base.h/socket_test_util.h: TCP/UDP test helpers
Test suites:
- core/: Lifecycle, nop dispatch, timer, error propagation, resource exhaustion, sequence (LINK + emulation paths)
- event/: Event create/set/reset, event pool acquire/release, event source POSIX integration
- sync/: Notification signal/wait, relay (primitive and notification sources/sinks), semaphore (sync query, async wait, linked sequences), cancellation, fence POSIX integration, signal handling
- socket/: TCP lifecycle, message passing, send flags, TCP control (shutdown/reset), TCP transfer (large payloads, vectored I/O), UDP send/recv, multishot recv, zerocopy send, POSIX-specific (abstract sockets, SO_REUSEPORT), Win32-specific (WSA)
- buffer/: Buffer pool allocation, registration
- futex/: Futex wait/wake (io_uring 6.7+)
Benchmarks: dispatch scalability, sequence overhead, relay fan-out, event pool throughput, socket throughput, futex latency.
Also adds IREE_EXPECT_NOT_OK and IREE_ASSERT_NOT_OK test macros to status_matchers.h.
Co-Authored-By: Claude <[email protected]>
io_uring proactor backend exploiting Linux 5.1+ kernel async I/O:
- uring.h/uring.c: Low-level io_uring ring management with direct syscall wrappers (no liburing dependency). Handles SQ/CQ ring mapping, feature probing, and SQE/CQE iteration.
- proactor.h/proactor.c: Full proactor implementation with MPSC pending_operations queue for thread-safe submit, SQE batching, and CQE processing. Supports SINGLE_ISSUER mode for optimal kernel performance.
- proactor_submit.c: Operation-to-SQE translation for all operation types including linked SQE chains (IOSQE_IO_LINK).
- proactor_probe.c: Runtime feature detection for io_uring capabilities (multishot accept, provided buffers, futex ops).
- proactor_registration.c: Static backend registration.
- notification.h/notification.c: io_uring-specific notification using IORING_OP_FUTEX_WAIT (6.7+) or eventfd+POLL_ADD fallback.
- relay.h/relay.c: Relay implementation using multishot POLL_ADD with async POLL_REMOVE and ZOMBIE state for safe teardown.
- socket.h/socket.c: Socket helpers for multishot accept, provided-buffer recv, and zerocopy send.
- buffer_ring.h/buffer_ring.c: io_uring provided buffer ring management (IORING_REGISTER_PBUF_RING).
- sparse_table.h/sparse_table.c: Sparse fd-to-state mapping for tracking active operations per file descriptor.
Also includes platform/linux/ signal handling helpers (shared between io_uring and POSIX backends) and CTS backend registration.
Co-Authored-By: Claude <[email protected]>
POSIX proactor backend using poll/epoll/kqueue for broad platform coverage (Linux, macOS, BSD, and any POSIX system):
- proactor.h/proactor.c: Full proactor implementation with three event_set backends (poll, epoll, kqueue) selected at compile time. All fd_map, event_set, and timer_list mutations are poll-thread-exclusive (no locking needed). Cross-thread submit uses an MPSC atomic slist. Resource retention is centralized in push_pending/complete_on_submit.
- event_set.h: Abstraction over poll/epoll/kqueue event notification with add/modify/remove/wait operations.
- event_set_poll.c: poll(2) backend (portable fallback).
- event_set_epoll.c: epoll(7) backend (Linux, O(1) readiness).
- event_set_kqueue.c: kqueue(2) backend (macOS/BSD).
- fd_map.h/fd_map.c: File descriptor to operation mapping with generation tracking for ABA safety on fd reuse.
- poll_set.h/poll_set.c: Dynamic pollfd array management for the poll(2) backend.
- fence.h/fence.c: POSIX fence using eventfd or pipe for cross-thread signaling of proactor completion.
- relay.h/relay.c: POSIX relay with synchronous lifecycle (no ZOMBIE state). Two dispatch paths: primitive sources use event_set fd monitoring; notification sources use epoch scanning.
- socket.h/socket.c: POSIX socket helpers for non-blocking I/O, vectored read/write, and connect-in-progress handling.
- signal.h/signal.c: SIGPIPE suppression for socket writes to peers that have closed their read end.
- timer_list.h/timer_list.c: Sorted doubly-linked timer list for userspace deadline management with O(n) insert, O(1) expiry.
- wake.h/wake.c: Wake pipe/eventfd for interrupting blocked poll/epoll/kqueue calls from other threads.
- worker.h/worker.c: Worker thread pool for multi-backend POSIX proactor configurations (poll + epoll on same system).
Includes CTS backend registration for poll and epoll variants.
Co-Authored-By: Claude <[email protected]>
Windows IOCP proactor implementation with full CTS coverage (251 pass, 14 skip — same futex skip profile as POSIX). Completion-based backend using GetQueuedCompletionStatusEx with no worker threads. All I/O operations use overlapped structures posted to a single completion port, with the poll loop draining completions and dispatching to operation-specific handlers.
Implements:
- Socket I/O: TCP connect/accept/recv/send, UDP recvfrom/sendto, multishot recv via recv pool, Winsock import
- File I/O: read/write with overlapped offsets, buffer registration
- Timers: CreateThreadpoolTimer with IOCP completion posting
- Events: CreateEvent-based with RegisterWaitForSingleObject
- Sequences: linked operation chains with completion dependencies
- Notifications: WaitOnAddress/WakeByAddress futex-style wait/signal with async poll-loop epoch comparison
- Semaphores: timeline semaphore with async WAIT/SIGNAL operations
- Relays: NOTIFICATION source → SIGNAL_PRIMITIVE (SetEvent) or SIGNAL_NOTIFICATION sinks, one-shot and persistent modes
- Signals: console control handler (Ctrl+C/Break/Close/Shutdown/Logoff)
- Cancellation: CancelIoEx for socket/file ops, timer/event cancellation
- Messaging: inter-proactor message passing via completion port
Split into proactor.c (poll loop, lifecycle, dispatch) and proactor_submit.c (operation submission from arbitrary threads via completion port posting).
Co-Authored-By: Claude <[email protected]>
Round 1 (Thread Safety & Worker Pool):
- Add missing `enqueued` CAS guard to the semaphore wait tracker, preventing MPSC slist corruption when multiple timepoints in an ALL-mode wait independently try to enqueue the same tracker.
- Fix completion pool exhaustion paths that silently dropped operations: process_operation_chain (both multishot and normal paths), timer expiry, and timer cancellation now dispatch directly when the pool is empty, matching the drain_completion_queue behavior.
Round 2 (Event Set Backend Correctness):
- Batch kqueue filter operations into single kevent() calls to prevent orphaned filters on partial failure (e.g., EVFILT_READ registered but EVFILT_WRITE fails, leaving the fd in an inconsistent state).
- Remove unreliable event coalescing in kqueue next_ready() — kevent(2) does not guarantee adjacent events for the same fd ident. The proactor dispatch loop already handles duplicate fd dispatches correctly.
- Use kqueue1(O_CLOEXEC) on FreeBSD/NetBSD for atomic close-on-exec. macOS/OpenBSD lack kqueue1; documented the kqueue()+fcntl() fallback.
- Add per-BSD platform macros (IREE_PLATFORM_FREEBSD, IREE_PLATFORM_NETBSD, IREE_PLATFORM_OPENBSD, IREE_PLATFORM_DRAGONFLYBSD) to target_platform.h.
Co-Authored-By: Claude <[email protected]>
Extract a shared iree_async_posix_socket_create_accepted() helper for accepted socket initialization — both submit_socket_accept and execute_accept were duplicating the struct init inline and missing the TRACE debug_label setup.

Fix non-Linux paths of iree_posix_accept() and iree_posix_socket() in compat.h to check all fcntl() return values; previously a failed F_SETFL would silently return a blocking socket that would hang the poll thread.
Co-Authored-By: Claude <[email protected]>
Add 6 CTS cancellation tests for EVENT_WAIT operations covering cancel before signal, cancel after completion, double cancel, event reuse after cancel, signal/cancel race, and cross-thread cancel with blocking poll.

Fix io_uring EVENT_WAIT cancellation: when the kernel cancels a linked POLL_ADD+READ pair via ASYNC_CANCEL, only the POLL_ADD head gets a CQE (-ECANCELED) — the linked READ subordinate is silently discarded with no CQE. Add TAG_LINKED_POLL to dispatch the user callback from the POLL_ADD error CQE directly. Also fix cancel() to use wake() instead of ring_submit() to avoid a SINGLE_ISSUER violation from background threads.

Fix IOCP EVENT_WAIT cancellation: add pending_event_wait_cancellation_count with a drain function (same pattern as timer cancellations) to call UnregisterWaitEx for cancelled waits. Fix carrier dispatch to check the CANCELLED flag. Add counter decrements in the pending_queue drain for both timer and event wait cancellation counters to prevent stale scans.
Co-Authored-By: Claude <[email protected]>
Fix unchecked fcntl() returns in wake.c pipe fallback path. The pipe fallback (used on non-Linux platforms where eventfd is unavailable) called fcntl(F_SETFL, O_NONBLOCK) and fcntl(F_SETFD, FD_CLOEXEC) without checking return values. On failure, the wake pipe would remain in blocking mode, causing the poll thread's drain loop to block indefinitely. Same pattern as the compat.h fix in round 3 — now all non-Linux fcntl paths in the POSIX backend check returns and fail loudly. Co-Authored-By: Claude <[email protected]>
import_fence is called from arbitrary threads (GPU driver completion callbacks, application threads) but both POSIX and io_uring backends violated the poll-thread-only contract for internal data structures.

POSIX: import_fence directly mutated fd_map and event_set from the caller's thread — a data race since these structures have no concurrency protection. Fixed by deferring registration to the poll thread via a new pending_fence_imports MPSC queue (same pattern as pending_queue and pending_semaphore_waits). The poll thread drains the queue at all three drain sites, registering fence fds in fd_map/event_set.

io_uring: import_fence called ring_submit() (io_uring_enter) from the caller's thread — a SINGLE_ISSUER violation since only the poll thread may enter the ring. Fixed by replacing ring_submit() with wake(): the SQE is already committed under the SQ lock, and the poll thread's next ring_submit() flushes it.

Adds cross-thread CTS tests (ImportFence_CrossThreadImportDuringPoll, ImportFence_MultipleCrossThreadImports) that catch both races under TSAN.
Co-Authored-By: Claude <[email protected]>
Fix semaphore wait tracker double-enqueue: the io_uring tracker was missing the enqueued CAS guard that POSIX has. In ALL-mode waits where multiple semaphores fail, multiple error callbacks would push the same tracker to the MPSC slist, creating a self-loop that hangs drain_pending_semaphore_waits.

Fix continuation chain overflow: submit_continuation_chain used a fixed 16-element stack array and silently dropped operations beyond index 15. Now uses the stack for the common case and heap-allocates for longer chains.
Co-Authored-By: Claude <[email protected]>
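The CAS guard that prevents the double-enqueue is a one-bit ownership race: whichever callback flips `enqueued` from false to true gets to push the tracker onto the MPSC slist; everyone else backs off. A minimal sketch (hypothetical names, not the IREE tracker struct):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct my_tracker_t {
  _Atomic bool enqueued;       // has this tracker been pushed already?
  struct my_tracker_t* next;   // MPSC slist link (unused in this sketch)
} my_tracker_t;

// Called from each timepoint/error callback that wants to enqueue the
// tracker. Returns true only for the single caller that wins the CAS;
// that caller then pushes the tracker to the slist exactly once.
static bool my_tracker_try_enqueue(my_tracker_t* t) {
  bool expected = false;
  return atomic_compare_exchange_strong(&t->enqueued, &expected, true);
}
```

Without the guard, two callbacks pushing the same node corrupt the intrusive list (the node's `next` gets overwritten), which is exactly the self-loop hang the commit describes.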
SQ ring correctness: remove bogus SQE rollback after ring_submit in register_event_source, submit_signal_poll, and register_relay. ring_submit advances *sq_tail (kernel-visible) before io_uring_enter; rolling back sq_local_tail after that creates a desync that corrupts the ring on the next submission (uint32_t underflow in to_submit). Keep ring_submit for SQ slot reclaim but ignore its return value.

ACCEPT: set accept_flags to SOCK_NONBLOCK | SOCK_CLOEXEC so accepted sockets get both flags atomically, matching the socket create path. Previously accepted sockets lacked FD_CLOEXEC entirely.

wait_cqe: convert the EINTR recursive retry to a loop with absolute deadline tracking. The old code restarted the full timeout duration on each EINTR, causing unbounded waits under frequent signals.

Fix misleading defs.h comments: the TIMEOUT_REALTIME comment incorrectly claimed iree_time_now() uses CLOCK_REALTIME (it uses CLOCK_MONOTONIC); the MSG_RING comment incorrectly placed command types in msg_ring_flags (they go in sqe->addr).

Clean up dead code: remove the unreachable double break in the MESSAGE switch case and simplify the process_internal_cqe ternary (? 0 : 0 → return 0).
Co-Authored-By: Claude <[email protected]>
Relay lifecycle fixes:
- Fix cancel tag encoding in unregister_relay (INTERNAL_MARKER|2 produced tag=0 instead of TAG_CANCEL; now uses iree_io_uring_internal_encode)
- Fix PENDING_REARM/FAULTED relay leak on unregister (no in-flight kernel op means the cancel SQE finds nothing; now cleans up immediately)
- Fix zombie multishot relay leak when POLL_REMOVE can't be submitted due to SQ pressure (self-healing: each CQE delivery retries POLL_REMOVE)
- Add ERROR_SENSITIVE + futex source validation (poll event flags vs futex error codes have incompatible semantics)
- Fix drain_source to detect hard errors (EBADF/EIO would busy-loop the multishot poll; now faults the relay on non-EAGAIN errors)
- Extract relay cleanup into a shared helper used by unregister and the CQE handler
Buffer registration:
- Add group_id overflow detection (uint16_t wraps after 65535 allocations, silently reusing in-use IDs; now fails with RESOURCE_EXHAUSTED)
Quality:
- Replace local #define duplicates in relay.c with proactor.h macros
- Convert all relay flag tests to iree_any_bit_set()
- Add IREE_ASSERT bounds check in sparse_table_release
- Fix the buffer_ring_free comment to explain the PBUF_RING safety rationale
Co-Authored-By: Claude <[email protected]>
Cross-validated review (flash + deep models + manual) focused on error handling, performance, and quality. Fixes:
Correctness:
- Fix LA57-unsafe timepoint index encoding: semaphore wait tracker pointer+index packing used a 48-bit mask while the internal encoding uses 56-bit for 5-level paging safety. Aligned to a 56-bit/8-bit split.
- Fix incorrect rollback count in two-SQE failure: when poll_sqe succeeded but read_sqe failed, the rollback missed the partially-allocated slot, leaving a zeroed SQE that would produce user_data=0.
- Fix uint32 overflow in buffer ring address calculation: i*buffer_size was uint32*uint32, wrapping at 4GB before pointer promotion.
Quality:
- Add TAG_NOP_PLACEHOLDER for the semaphore wait NOP SQE (was reusing TAG_CANCEL, a semantic category error and maintenance hazard).
- Remove dead epoch-check code in the NOTIFICATION_WAIT EAGAIN handler.
- Remove the dead satisfied_index variable with its "future work" comment.
- Add IREE_ASSERT on eventfd write in notification_signal.
- Add IREE_ASSERT(false) to the default case in process_internal_cqe.
- Add IREE_ASSERT for kernel-provided buffer_index bounds.
- Rewrite the stale "linked-hardlink" comment on the SEMAPHORE_WAIT NOP.
Co-Authored-By: Claude <[email protected]>
…loop
The CQE processing loops called ring_cq_ready() (one atomic load of cq_tail) then ring_peek_cqe() (a second atomic load of cq_tail) per CQE. Since peek_cqe already returns NULL when the ring is empty, use it directly as the loop condition, halving the atomic loads on the kernel-shared cacheline in the hot CQE processing path.
Co-Authored-By: Claude <[email protected]>
Software operations (SEMAPHORE_SIGNAL, SEMAPHORE_WAIT) no longer produce kernel SQEs. They execute in Phase 3 of submit (outside the SQ lock) with callbacks delivered via MPSC on the poll thread.
This fixes two bugs:
- Signal cascade and tracker allocation running under the SQ spinlock during the SQE fill loop, blocking all concurrent submitters.
- Eager signal ordering where linked chains like [RECV -> SIGNAL] fired the signal during submit() before the RECV SQE was submitted.
Key changes:
- Phase 3 in proactor_submit.c executes software ops after SQ lock release
- dispatch_continuation_chain iteratively walks mixed chains, pushing software completions to MPSC and submitting kernel stages to the ring
- cancel_continuation_chain_to_mpsc retains ops before pushing CANCELLED
- Three MPSC drain points in the poll loop preserve callback ordering
- process_cqe detaches the continuation before the trigger callback, dispatches after
- Runtime count <= 255 validation replaces the debug-only IREE_ASSERT
- Dead code removed: TAG_NOP_PLACEHOLDER, inline cancel_continuation_chain, the SEMAPHORE_SIGNAL CQE special case, dispatch_linked/tracker_continuation
CTS: 6 new tests for mixed software/kernel chains covering ordering, timing, error propagation, and pure-software chains.
Co-Authored-By: Claude <[email protected]>
POSIX and IOCP backends encoded {tracker_pointer, semaphore_index} in
timepoint user_data using a 48-bit/16-bit split. On LA57 (5-level
paging) systems, heap addresses can use bits 48-55, causing silent
pointer truncation. Changed to 56-bit/8-bit split matching the
already-fixed io_uring backend. Added count <= 255 runtime validation
returning OUT_OF_RANGE before tracker allocation in both backends.
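The 56-bit/8-bit split packs the tracker pointer and semaphore index into a single 64-bit user_data word. A 48-bit mask silently truncates heap pointers on LA57 machines (5-level paging uses virtual address bits up to 56), while an 8-bit index field still covers up to 255 semaphores per wait, which is why the runtime count validation matches. A minimal sketch (function names are illustrative):

```c
#include <assert.h>
#include <stdint.h>

// Low 56 bits: pointer. High 8 bits: index. Safe on both 4-level
// (48-bit) and 5-level (57-bit, user half <= 56 bits) x86-64 paging.
#define PTR_MASK 0x00FFFFFFFFFFFFFFull

static inline uint64_t pack_user_data(void* tracker, uint8_t index) {
  return ((uint64_t)(uintptr_t)tracker & PTR_MASK) | ((uint64_t)index << 56);
}

static inline void* unpack_tracker(uint64_t user_data) {
  return (void*)(uintptr_t)(user_data & PTR_MASK);
}

static inline uint8_t unpack_index(uint64_t user_data) {
  return (uint8_t)(user_data >> 56);
}
```

Since the index field is 8 bits, any wait over more than 255 semaphores cannot be encoded, hence the count <= 255 check before tracker allocation rather than a debug-only assert.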
kqueue event_set_add/modify batched multiple EV_ADD entries in a single
kevent() call but kevent(2) processes entries sequentially, not
atomically. If the first EV_ADD succeeded and the second failed, the
first filter was left orphaned. iree_kqueue_submit_changes now scans
all EV_ERROR results first, then submits compensating EV_DELETE for any
successfully-applied EV_ADD entries before returning the error.
Co-Authored-By: Claude <[email protected]>
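The compensating-delete strategy generalizes: apply a batch of sequentially-processed changes, and if any entry fails, undo the entries that already took effect before surfacing the error. This is a generic sketch of that pattern, not the actual iree_kqueue_submit_changes code (which scans EV_ERROR results from kevent and issues EV_DELETE for the applied filters):

```c
#include <assert.h>

typedef int (*apply_fn_t)(int change, void* ctx);   // 0 on success
typedef void (*undo_fn_t)(int change, void* ctx);   // reverses apply

// Apply changes in order; on the first failure, roll back everything
// already applied so no entry is left orphaned, then return the error.
int apply_batch(const int* changes, int count, apply_fn_t apply,
                undo_fn_t undo, void* ctx) {
  for (int i = 0; i < count; ++i) {
    int rc = apply(changes[i], ctx);
    if (rc != 0) {
      for (int j = i - 1; j >= 0; --j) undo(changes[j], ctx);  // compensate
      return rc;
    }
  }
  return 0;
}

// Demo apply/undo over a flag array; entry equal to fail_on fails.
typedef struct {
  int applied[8];
  int fail_on;
} demo_ctx_t;

static int demo_apply(int change, void* ctx) {
  demo_ctx_t* c = (demo_ctx_t*)ctx;
  if (change == c->fail_on) return -1;
  c->applied[change] = 1;
  return 0;
}

static void demo_undo(int change, void* ctx) {
  ((demo_ctx_t*)ctx)->applied[change] = 0;
}
```

The key invariant either way: after a failed batch, the target (here the flag array; in the real fix, the kqueue filter set) is exactly as it was before the call.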
Three design docs covering IREE's vector clock frontier system:
- Causal dependency tracking with frontiers, semaphores, and axes
- Comparison with binary event systems (CUDA/HIP)
- Multi-model scheduling scenarios (voice chat, RAG, etc.)
Interactive visualizer with step-through simulation, DAG rendering, timeline view, width/depth/GPU scaling, hover-to-highlight edges, and dark mode support (syncs with the mkdocs Material theme toggle via postMessage, falls back to prefers-color-scheme for standalone use).
Co-Authored-By: Claude <[email protected]>
Mark variables used only in iree_make_status format strings or IREE_ASSERT with IREE_ATTRIBUTE_UNUSED. These are stripped when IREE_STATUS_FEATURES==0 (runtime_small config), making the variables appear unused to -Wunused-variable. Co-Authored-By: Claude <[email protected]>
When overlapped I/O calls (WSASend, WSARecv, AcceptEx, ConnectEx, WSASendTo, WSARecvFrom, ReadFile, WriteFile) fail synchronously (error != *_IO_PENDING), the carrier was freed but outstanding_carrier_count was not decremented. This caused an assertion failure at proactor destroy time. Add iree_async_proactor_iocp_release_carrier() as the symmetric counterpart to allocate_carrier, and use it in all 9 submit error paths. Previously each site did iree_allocator_free without the decrement. Co-Authored-By: Claude <[email protected]>
accept4() requires _GNU_SOURCE for its glibc declaration. Without it, CMake builds (which don't define _GNU_SOURCE globally) fail with undeclared identifier errors. Define it at the top of both .c consumers and in compat.h as defense in depth. Co-Authored-By: Claude <[email protected]>
…sed tests
Two macOS CI failures:

Relay ERROR_SENSITIVE (bd-2u0h): On macOS kqueue, pipe close delivers EVFILT_READ+EV_EOF, which translates to POLLIN|POLLHUP. The relay dispatch checked !(revents & POLLIN) to suppress the sink, but POLLIN is always set alongside POLLHUP on kqueue, so the suppression never triggered. Fix: check POLLERR|POLLHUP unconditionally — the connection is dead either way.

Connect-refused tests (bd-132s): Five tests connected to hardcoded port 1 expecting ECONNREFUSED, but this is not portable. macOS stealth mode, FreeBSD tcp.blackhole, and Docker/bwrap sandboxes can silently drop SYN packets to closed ports instead of sending RST, causing the connect to hang forever. Fix: add a CreateRefusedAddress() helper that binds a listener to an ephemeral port, records the address, then closes the listener. The kernel knows this port was just in LISTEN state and sends RST immediately, regardless of firewall configuration.
Co-Authored-By: Claude <[email protected]>
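The CreateRefusedAddress technique can be sketched in plain POSIX sockets — bind a listener to an ephemeral loopback port, record the assigned address, then close it so subsequent connects get RST. This is an illustrative C version under assumed names, not the test helper itself:

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

// Fills *addr with a loopback address that will refuse connections.
// Returns 0 on success, -1 on failure.
int create_refused_address(struct sockaddr_in* addr) {
  int fd = socket(AF_INET, SOCK_STREAM, 0);
  if (fd < 0) return -1;
  struct sockaddr_in bind_addr;
  memset(&bind_addr, 0, sizeof(bind_addr));
  bind_addr.sin_family = AF_INET;
  bind_addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
  bind_addr.sin_port = 0;  // ephemeral: kernel picks a free port
  socklen_t len = sizeof(*addr);
  if (bind(fd, (struct sockaddr*)&bind_addr, sizeof(bind_addr)) != 0 ||
      listen(fd, 1) != 0 ||
      getsockname(fd, (struct sockaddr*)addr, &len) != 0) {
    close(fd);
    return -1;
  }
  // Close the listener: the port was just in LISTEN state, so the
  // kernel sends RST to new connects instead of silently dropping SYNs.
  close(fd);
  return 0;
}
```

Unlike a hardcoded closed port, this works under stealth-mode firewalls and sandboxes because loopback RST generation does not depend on external filtering rules.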
The previous formatting commit (bd956b8) ran local clang-format v21 instead of pre-commit's pinned v18.1.3. The two versions disagree on JS formatting (brace spacing, import grouping, continuation indent, ternary placement, object literal expansion), producing ~2000 lines of diff in the visualizer JS files. Also fix markdownlint violations: convert bold pseudo-headings to actual headings in scenarios.md (MD036), and suppress MD013 around HTML blocks in index.md that inherently exceed 80 chars. Co-Authored-By: Claude <[email protected]>
…ispatch
On macOS, poll() can deliver POLLERR alone (without POLLOUT) for a failed async connect. The operation chain dispatcher only matched operations whose requested events (POLLOUT for connect) intersected revents, so POLLERR-only events were silently ignored — the connect handler never fired and the callback never arrived, causing CI test timeouts on macos-14.

Include POLLERR and POLLHUP in the event match condition so that error events trigger all pending operations on the fd. Every I/O handler already handles non-EAGAIN errors correctly (getsockopt(SO_ERROR) for connect, EPIPE from writev for send, EOF from readv for recv), so this is safe.

The kqueue backend was unaffected because iree_kevent_to_poll_events() always translates the filter type (EVFILT_WRITE → POLLOUT), ensuring error events include the direction flag.
Co-Authored-By: Claude <[email protected]>
…ironments
io_uring_setup can fail with EPERM (seccomp/sysctl blocks it in containers), ENOMEM (RLIMIT_MEMLOCK exhausted on ARM CI), or ENOSYS (kernel too old). Previously these mapped to PERMISSION_DENIED, RESOURCE_EXHAUSTED, and UNIMPLEMENTED respectively, but the API contract promises UNAVAILABLE for all "not usable" conditions. All test skip logic checks for UNAVAILABLE, so these tests failed instead of skipping.

Map all io_uring_setup failures to IREE_STATUS_UNAVAILABLE (errno preserved in the status message for diagnostics). Add a POSIX proactor fallback on Linux when io_uring is unavailable, and wire up POSIX as the default on macOS/BSD (previously returned UNAVAILABLE unconditionally on non-Linux non-Windows).

Add platform/posix/api.h (parallel to io_uring/api.h and iocp/api.h) so proactor_platform.c can call the POSIX create function without pulling in internal POSIX proactor headers.
Co-Authored-By: Claude <[email protected]>
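The errno collapsing amounts to a small switch: every "io_uring is not usable here" errno maps to a single status code that the skip logic can key off, while the specific errno survives only in the status message text. A hedged sketch with a stand-in status enum (not IREE's iree_status_t):

```c
#include <assert.h>
#include <errno.h>

typedef enum {
  MY_STATUS_OK,
  MY_STATUS_UNAVAILABLE,  // backend not usable; tests skip on this
  MY_STATUS_INTERNAL,     // unexpected failure; tests report this
} my_status_t;

// Collapse "io_uring not usable" errnos into one status so callers
// (and the CTS skip logic) need only one check.
my_status_t map_io_uring_setup_error(int err) {
  switch (err) {
    case EPERM:   // seccomp/sysctl blocks io_uring in containers
    case ENOMEM:  // RLIMIT_MEMLOCK exhausted (seen on ARM CI)
    case ENOSYS:  // kernel too old for io_uring
      return MY_STATUS_UNAVAILABLE;
    default:
      return MY_STATUS_INTERNAL;
  }
}
```

The design tradeoff: callers lose errno granularity in the status code, which is why the commit preserves the raw errno in the status message for diagnostics.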
CtsTestBase::SetUp() calls GTEST_SKIP() and returns when the backend is unavailable, leaving proactor_ as nullptr. SignalTest::SetUp() continued past the base class return and called iree_async_proactor_supports_signals(proactor_), which dereferences proactor_->vtable — SEGV on NULL. Add a null guard after the base class SetUp call. Co-Authored-By: Claude <[email protected]>
…architectures
defs.h only listed x86_64 and aarch64 using raw compiler macros, so the RISC-V 64 CI cross-compile hit the #error. All three architecture families (x86, ARM, RISC-V) use 425/426/427 for both 32-bit and 64-bit variants. Switch to IREE_ARCH_* macros from target_platform.h and cover all six.
Co-Authored-By: Claude <[email protected]>
GCC's __atomic_load builtin rejects const-qualified pointers (fixed in GCC 13 with __typeof_unqual__). Cast away const to match the existing pattern used in atomic_freelist.h and buffer.c. Co-Authored-By: Claude <[email protected]>
proactor.c: Capture errno before any subsequent calls can clobber it, and include the fd, iovec count, total requested bytes, and raw errno in writev failure messages for easier debugging of macOS-specific send failures.

tcp_transfer_test.cc: Replace silent failure returns with ADD_FAILURE messages that report exact offsets, status strings, and tracker state. Add IREE_ASSERT_OK checks on accept/connect status and socket query_failure before data transfer begins.
Co-Authored-By: Claude <[email protected]>
The POSIX proactor's eager writev during SOCKET_SEND/SOCKET_SENDTO submit converted EAGAIN (socket send buffer full) to a terminal UNAVAILABLE error. This broke large transfers when the data exceeded the socket buffer capacity. Replace the EAGAIN→UNAVAILABLE conversion with push_pending, deferring the operation to the poll loop for POLLOUT-driven retry. This matches io_uring's internal async-poll behavior and the existing recv deferred path.

Fix transfer tests that had stack-local operation structs inside loops: when a send defers via push_pending, the operation is registered in the fd_map until completion. Stack-local ops going out of scope while registered corrupt the fd_map's operation chain via dangling next pointers. Also fix a stale CompletionTracker (missing Reset between iterations) in the chunked transfer test.
Co-Authored-By: Claude <[email protected]>
macOS XNU may perform a graceful FIN close even when unread data is present in the receive buffer, unlike Linux which sends RST. Without RST, the subsequent small eager sends (17 bytes each) all succeed because the kernel buffer absorbs them before the FIN->data->RST round-trip propagates back. Create the client socket with LINGER_ZERO so close() sends RST deterministically. The test still verifies that sends to a RST'd peer produce errors — it just forces the RST reliably across platforms. Co-Authored-By: Claude <[email protected]>
EVENT_WAIT and NOTIFICATION_WAIT use linked POLL_ADD+READ SQE pairs. IOSQE_CQE_SKIP_SUCCESS was set unconditionally on POLL_ADD without probing FEAT_CQE_SKIP. On kernels that don't support the flag (pre-5.17), the unknown bit causes POLL_ADD to fail with -EINVAL, generating two CQEs (POLL_ADD error + READ -ECANCELED). The TAG_LINKED_POLL handler processed the first CQE (release_resources + callback), and the normal handler processed the second (another release_resources + callback), causing a double-release that freed the event while still referenced. Fix: remove IOSQE_CQE_SKIP_SUCCESS and simplify TAG_LINKED_POLL to always return 0. The linked READ CQE now handles resource release and user callback for both success and failure on all kernel versions. Co-Authored-By: Claude <[email protected]>
…ering bazel query //... operates pre-analysis and returns all targets regardless of platform constraints. When piped through xargs to bazel test, these become "explicitly requested" targets — Bazel errors on incompatible explicitly-requested targets (exit code 123) instead of skipping them. This caused 23 Windows-only IOCP targets to fail on Linux CI hosts even though all 1719 tests passed. Replace query with cquery, which evaluates target_compatible_with constraints during the loading/analysis phase and filters incompatible targets from the output. Stderr is redirected to /dev/null because cquery emits analysis warnings that are not actionable in this context. Co-Authored-By: Claude <[email protected]>
glibc annotates read() and write() with __attribute__((warn_unused_result)). Unlike clang, GCC does not suppress this warning with (void) casts, causing build failures with -Werror. Rather than just suppressing the warning, use belt-and-suspenders assertions that catch lifecycle bugs while allowing benign conditions: - Signal handler (signal.c): assign to variable + (void) cast. Cannot assert in async-signal-safe context (no malloc/log/abort allowed). EAGAIN means signal already pending (coalesced by design). - All other eventfd/pipe write and read sites: assign return value and IREE_ASSERT(result >= 0 || errno == EAGAIN). EAGAIN is benign (already signaled / spurious wake), but EBADF indicates a use-after-close lifecycle bug that would otherwise cause silent hangs. IREE_ASSERT compiles out in production builds, so no runtime overhead. Co-Authored-By: Claude <[email protected]>
reinterpret_cast only works for pointer↔integer conversions, not integer↔integer. SOCKET (UINT_PTR) and uintptr_t are both integers, so static_cast is the correct cast. Co-Authored-By: Claude <[email protected]>
The cquery --output=label format includes config hashes (e.g. "(a1b2c3)") that xargs splits into spurious target patterns, and includes targets with target_compatible_with constraints for other platforms. Both cause errors when the labels are explicitly listed to bazel test via xargs. Switch to Starlark output that produces clean canonical labels and filters out targets with IncompatiblePlatformProvider. Co-Authored-By: Claude <[email protected]>
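A sketch of the invocation described above, following Bazel's documented pattern for filtering incompatible targets with Starlark cquery output (a CLI fragment; the target pattern is illustrative):

```shell
# cquery runs after analysis, so target_compatible_with is evaluated;
# Starlark output emits clean canonical labels with no config hashes, and
# targets carrying IncompatiblePlatformProvider collapse to empty lines.
bazel cquery //runtime/src/iree/async/... \
  --output=starlark \
  --starlark:expr='"" if "IncompatiblePlatformProvider" in providers(target) else str(target.label)' \
  2>/dev/null | grep -v '^$' | xargs bazel test
```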
…nt count pattern The worker main loop uses an event count (iree_notification_t) for efficient sleeping. The event count guarantees no lost wakeups for conditions checked between prepare_wait and commit_wait. The worker checked for new work between prepare/commit (correct), but the exit condition (state != RUNNING) was only checked at the loop top — outside the protected region. When request_exit's post() fired before a worker called prepare_wait, the worker captured the post-increment epoch as its wait_token. commit_wait then saw epoch == token and slept forever, since no subsequent posts would arrive. On the futex path (Linux without sanitizers) the race window is nanoseconds wide. On the pthread_cond path (macOS always, Linux under TSAN) it's microseconds to milliseconds — wide enough to hit reliably across 204 tests x 4 workers = 816 shutdown cycles per test run. The fix adds a state check between prepare_wait and commit_wait, completing the event count pattern for the exit condition. Verified: 0/10000 TSAN timeouts after fix vs 40% timeout rate before. Co-Authored-By: Claude <[email protected]>
…l execution Abstract namespace sockets are process-global on Linux: when multiple test processes run simultaneously (--runs_per_test or sharded CI), the hardcoded names like @iree_cts_abstract_test collide with EADDRINUSE. Generate unique names using PID + atomic counter so each test invocation gets its own namespace, regardless of how many processes run in parallel. Co-Authored-By: Claude <[email protected]>
Android is Linux at the kernel level (both IREE_PLATFORM_ANDROID and IREE_PLATFORM_LINUX are defined), so the POSIX proactor backend works unchanged. But the Bazel target_compatible_with selects only listed @platforms//os:linux and @platforms//os:macos, excluding Android. The generated CMake guards inherited this gap, causing the Android ARM64 CI to fail: the parent platform library requested iree::async::platform::posix but the target was not defined for CMAKE_SYSTEM_NAME "Android". Add @platforms//os:android compatibility to the POSIX backend, timer_list, and linux:signal (signalfd available on Android NDK 18+). Route Android to the POSIX backend in the platform select (no io_uring on Android). Co-Authored-By: Claude <[email protected]>
Two Android cross-compile failures: 1. proactor_platform.c calls iree_async_proactor_create_io_uring under IREE_PLATFORM_LINUX, which Android also defines. Android kernels do not expose io_uring, so the symbol is unresolved at link time. Guard the io_uring path with !IREE_PLATFORM_ANDROID so Android falls through to the POSIX-only else branch. 2. registration_test.cc uses memfd_create which requires Android API level 30+. The CI targets android-29. Define IREE_TEST_HAS_MEMFD that checks __ANDROID_API__ >= 30 and use it to conditionally compile the three dmabuf tests that depend on memfd_create. Co-Authored-By: Claude <[email protected]>
macos failures are an existing busted mlir C API issue.
web/emscripten build failures in the "samples" nightly workflow might be due to this PR: https://github.com/iree-org/iree/actions/runs/22293301576/job/64484902727#step:9:247
For the humans, from the human: this is a few weeks of deep feature branch work with a team of @claude's. It's gone through dozens of review cycles with mixed model teams and had quite a bit of stress testing. The design is convergent with subsequent work on the iree/net/ layer (which is built on top of this) as well as the remote HAL (which uses both). The future AMDGPU backend (and eventually all HAL drivers) will be natively built on iree/async/ for all their internal scheduling, enabling us to do distributed async execution of heterogeneous CPU/NVMe/NIC/NPU/GPU workloads. The existing iree/task/ system will be rebased on this soonish to replace its current polling infrastructure, and iree_loop_t will be upgraded to integrate better for wait operations. For now, this is a complete foundation across our platforms of interest and enough to unblock the AMDGPU and remote HAL efforts.
This PR introduces iree/async/, a completion-based (proactor pattern) async I/O layer that serves as the foundation for IREE's networking, storage, and distributed scheduling. It depends only on iree/base/ and provides the substrate that HAL drivers, networking, task executors, and the VM runtime will build on.

Why
ML inference at scale moves large tensors between GPUs, across networks, and through storage with latency budgets measured in microseconds. The data that matters — model weights, activations, KV caches — lives in GPU VRAM. Moving it between machines for distributed inference or to NVMe for checkpointing should not require the CPU to touch every byte. Modern hardware can already do this: NICs read directly from GPU VRAM (GPUDirect RDMA), NVMe controllers write to GPU memory (GPU Direct Storage), and GPUs access each other's memory across PCIe or NVLink. The software layer's job is to orchestrate these transfers, not participate in them.
The existing approach of layering vendor-specific libraries (NCCL, RCCL) with traditional reactor I/O (select/poll/epoll + read/write) cannot express the pipelines we need. A reactor tells you "this fd is ready" and then you make a separate syscall to do the I/O — every transition costs two syscalls and a copy through kernel buffers. You cannot express "wait for GPU completion, then send the result over TCP, then write to NVMe" as a single atomic submission. Layering multiple runtime systems means multiple threading models, multiple synchronization primitives, and multiple memory management systems — each adding latency at every boundary.
A completion-based proactor that handles all I/O through one submission/completion interface eliminates these boundaries. On io_uring, an entire pipeline — GPU fence wait, network send from registered GPU memory, disk write — can execute as linked SQEs in kernel space with zero userspace transitions between steps. The proactor is not in the data path; hardware and kernel handle the transfers directly.
Causal frontier scheduling
Beyond the I/O layer, this PR introduces a causal dependency tracking system based on vector clock frontiers. Timeline semaphores (the bridge between GPU queues and async I/O) carry frontier metadata: sparse vectors of (axis, epoch) pairs where each axis identifies a causal source (a GPU queue, a collective operation, a host thread) and each epoch marks a position on that timeline.

When a GPU queue signals a semaphore, the signal carries the queue's current frontier — a compact summary of everything that happened before the signal. When another queue waits on that semaphore, it inherits the frontier through a merge (component-wise maximum). Causal knowledge propagates transitively: if queue C waits on a semaphore signaled by queue B, which previously waited on a semaphore signaled by queue A, queue C's frontier reflects A's work without any direct interaction.
This enables three capabilities that binary events and standalone timeline semaphores cannot provide:
Wait elision: When a queue's local frontier already dominates an operation's dependency frontier, the device wait is skipped entirely. Sequential single-queue workloads pay zero synchronization cost — every wait is elided because the queue's own epoch already implies all prerequisites.
O(1) buffer reuse: When a buffer is freed, the deallocating queue's current frontier becomes the buffer's death frontier. Another queue can safely reuse the buffer by checking frontier dominance — one comparison instead of tracking every operation that touched the buffer. A weight tensor read by hundreds of operations has one death frontier, not hundreds of per-operation reference counts.
Remote pipeline scheduling: A remote machine receiving a frontier can locally determine whether prerequisites are satisfied across all contributing queues — including queues on other machines it has never communicated with directly — without round-trips to the originating devices. Entire multi-stage, multi-device pipelines can be submitted atomically before any work begins, and hardware FIFO ordering ensures correct execution.
Collective operations (all-reduce, all-gather) compress N device axes into a single collective channel axis, so tensor parallelism across 8 GPUs costs one frontier entry regardless of device count.
The async scheduling design docs include an interactive visualizer that renders DAGs, frontier propagation, and semaphore state across configurable scenarios — from laptop (3 concurrent models) to datacenter (multi-node MI300X cluster with RDMA) — with step-through execution showing exactly how frontiers flow through pipelines.
What's here
Core API (proactor.h, operation.h, semaphore.h, frontier.h): The proactor manages async operation submission and completion dispatch through a vtable-dispatched interface. Operations are caller-owned, intrusive structs — no proactor allocation on submit. Semaphores provide cross-layer timeline synchronization with frontier-carrying signals. All operations carry status with rich annotations and stack traces; there are no silent failures.

Operation types: Sockets (TCP/UDP/Unix, with accept, connect, recv, send, sendto, recvfrom, close), files (positioned pread/pwrite with open, read, write, close), events (cross-thread signaling), notifications (level-triggered epoch-based wakeup), timers, semaphore wait/signal, futex wait/wake, sequences (linked operation chains), and cross-proactor messages. Operations support multishot delivery (persistent accept/recv) and linked chaining (kernel-side sequences on io_uring, callback-emulated elsewhere).
Sockets (socket.h): Immutable configuration at creation (REUSE_ADDR, REUSE_PORT, NO_DELAY, KEEPALIVE, ZERO_COPY), then bind/listen synchronously, then all I/O is async. Imported sockets from existing file descriptors. Sticky failure state — once a socket encounters an error, subsequent operations complete immediately with the recorded failure.

Memory registration (region.h, span.h, slab.h): Registered memory regions for zero-copy I/O. Buffer registration pins memory and pre-computes backend handles so I/O operations reference memory by handle rather than re-mapping on every operation. Scatter-gather spans are non-owning value types; the proactor retains regions for in-flight operations automatically. Slab registration for fixed-size slot allocation with io_uring provided buffer ring integration.

Relays (relay.h): Declarative source-to-sink event dataflow. Connect a readable fd or notification epoch advance to an eventfd write or notification signal. On io_uring, certain source/sink combinations execute entirely in kernel space via linked SQEs.

Device fence bridging: Import sync_file fds from GPU drivers to advance async semaphores when GPU work completes. Export semaphore values as sync_file fds for GPU command buffers to wait on. The proactor bridges between kernel device synchronization and the async scheduling system, enabling ahead-of-time pipeline construction across GPU and I/O boundaries.
Signal handling: Process-wide signal subscription through the proactor — signalfd on Linux, self-pipe on other POSIX platforms. SIGINT, SIGTERM, SIGHUP, SIGQUIT, SIGUSR1, SIGUSR2 dispatched as callbacks from within poll().
Platform backends
io_uring (Linux 5.1+): The primary production backend. Direct syscalls, no liburing dependency. Exploits fixed files and registered buffers (avoid per-op fd lookup and page pinning), provided buffer rings for kernel-selected multishot recv buffers, linked SQEs for zero-round-trip operation sequences, zero-copy send (SEND_ZC), MSG_RING for cross-proactor messaging, futex ops (6.7+) for kernel-side semaphore waits in link chains, and sync_file fd polling for device fence import. Submit fills SQEs under a spinlock from any thread; io_uring_enter is called only from the poll thread (SINGLE_ISSUER).
POSIX (Linux epoll, macOS/BSD kqueue, fallback poll()): Broad-coverage backend with pluggable event notification. Emulates linked operations, multishot, and other io_uring features with per-step poll round-trips — functionally equivalent API, same behavioral contract, higher per-step latency. The proactive scheduling API costs nothing extra on POSIX while enabling zero-round-trip execution on io_uring. Platform-default selection: epoll on Linux, kqueue on macOS/BSD, poll() elsewhere.
IOCP (Windows): I/O Completion Ports backend. Closer in behavior to io_uring than the POSIX backend — completion-based rather than readiness-based. Socket operations, timer queue, and the full operation type set.
All backends report capabilities at runtime (query_capabilities()). Callers discover what's available — multishot, fixed files, registered buffers, linked operations, zero-copy send, dmabuf, device fences, absolute timeouts, futex operations, cross-proactor messaging — and adapt their code paths. "Emulated" in the capability matrix means the API works but uses a software fallback rather than a kernel-optimized path.

Testing
A conformance test suite (CTS) validates all backends against shared test suites. Tests are written once and run against every registered backend configuration — 5 io_uring configurations with different capability masks, plus per-platform and per-feature POSIX configurations, plus IOCP. Tag-based filtering ensures tests only run against backends that support the features they exercise.
Test suites cover core operations, socket I/O (TCP, UDP, Unix, multishot, zero-copy), file I/O, events, notifications, semaphores (async/sync/linked), relays, fences, cancellation, error propagation, and resource exhaustion. Benchmarks measure dispatch scalability, sequence overhead, relay fan-out, socket throughput, and event pool performance.
Thread safety model
The proactor's event loop is caller-driven: poll() has single-thread ownership, callbacks fire on the poll thread. submit(), cancel(), wake(), and send_message() are thread-safe from any thread. Semaphore signal/query and event set are thread-safe. Notification signal is both thread-safe and async-signal-safe. A utility wrapper (proactor_thread.h) provides optional dedicated-thread operation for applications that want it.

Design docs
runtime/src/iree/async/README.md — full API documentation with architecture diagrams, ownership rules, code examples, and the capability matrix

docs/.../async-scheduling/ — causal frontier design document with interactive visualizer, multi-device scheduling scenarios (laptop through datacenter), and comparison with binary events and standalone timeline semaphores

ci-extra: all