Rust Coroutines and the Abstraction Tax Your Profiler Won’t Show You

The async/await syntax landed in stable Rust in 2019 and immediately became the default answer to concurrent I/O. It was the right call for ecosystem growth. It was also a deliberate compromise for systems-level engineers who need something the standard model quietly refuses to give: full, auditable control over where execution pauses, what state gets saved, and who pays for the context transition. That cost exists. It’s just hidden behind generated code you didn’t write and can’t easily inspect.

This isn’t a tutorial. It’s a dissection.

Waker overhead analysis

Every Future in Rust’s async model gets polled. When it’s not ready, it registers a Waker — a handle the executor uses to reschedule the task. Sounds clean. The problem is how that waker is constructed and passed. Under the hood, RawWaker is a fat pointer: a data pointer plus a vtable. That vtable carries four function pointers — clone, wake, wake_by_ref, drop — and every clone, wake, or drop of the waker goes through one of them as an indirect call. For a futures chain three levels deep, a single wakeup cycle chases pointers through memory that may not be in cache.

This is vtable bloat in practice, not in theory.

The deeper issue is implicit state propagation. When an async function awaits, the compiler snapshots the live local variables into an anonymous struct — a generated state machine you never see. Spawn it onto a typical executor and that struct is heap-allocated and driven through a Box<dyn Future>. Opaque type erasure is the mechanism; indirection is the cost. You lose static dispatch, you lose inlining opportunities, and the optimizer has less to work with. The zero-cost abstraction guarantee only holds if the compiler can see through the abstraction — and here it often can’t.

// What the runtime juggles per wakeup (a conceptual mirror of
// core::task::{RawWaker, RawWakerVTable}; the real fields are private)
struct RawWaker {
    data: *const (),
    vtable: &'static RawWakerVTable,
}

struct RawWakerVTable {
    clone:        unsafe fn(*const ()) -> RawWaker,
    wake:         unsafe fn(*const ()),
    wake_by_ref:  unsafe fn(*const ()),
    drop:         unsafe fn(*const ()),
}
// Four fn pointers. Four potential cache misses.
// Per. Task. Per. Wakeup.

Static dispatch optimization

The fix isn’t to avoid wakers — it’s to know when the dynamic dispatch is actually necessary. In a single-threaded embedded executor with a fixed task count, every future type is known at compile time. You can build a waker that’s a no-op or a direct index into a task array. No vtable. No heap. No indirection. The standard machinery assumes a general-purpose runtime; when your runtime isn’t general-purpose, you’re carrying weight that buys you nothing.

Manual Future implementation in Rust

The async keyword is a code generator. It takes what looks like linear logic and emits a state machine enum — one variant per suspension point. That’s all it does. The compiler isn’t magic; it’s a pattern you can replicate by hand, with full visibility into every byte of generated state. Moving away from async means you write that enum yourself. You decide what gets saved. You decide the memory layout.


Here’s where the standard abstraction breaks down: the generated state machine includes everything in scope at each await point — whether you need it after resumption or not. The compiler is conservative. It saves state you won’t touch again because proving liveness across suspension points is hard. You write it manually, you save exactly what the next state needs. That’s the difference between a 48-byte state struct and a 200-byte one on a microcontroller with 256KB of RAM.

use core::task::{Context, Poll};
use core::pin::Pin;
use core::future::Future;

enum ReadFuture {
    Init,
    Waiting { buf: [u8; 64], pos: usize },
    Done,
}

impl Future for ReadFuture {
    type Output = usize;

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<usize> {
        // ReadFuture is Unpin, so plain &mut access through get_mut is safe.
        let this = self.get_mut();
        match *this {
            ReadFuture::Init => {
                *this = ReadFuture::Waiting { buf: [0u8; 64], pos: 0 };
                cx.waker().wake_by_ref();
                Poll::Pending
            }
            ReadFuture::Waiting { pos, .. } => {
                // check hardware register here, not a runtime queue
                *this = ReadFuture::Done;
                Poll::Ready(pos)
            }
            ReadFuture::Done => panic!("polled after completion"),
        }
    }
}

Zero-allocation concurrency

This code doesn’t allocate. The state lives where you put it — stack, static, memory-mapped region. No executor owns it through a Box. The poll contract is the same as the standard library expects, so it slots into any executor that speaks Future. What changed is that you control the state transitions explicitly, and the compiler has no room to insert phantom saves. That’s zero-allocation concurrency without a runtime policy enforcing it — just architecture.

Stackless state machine logic

A stackful coroutine saves the entire call stack on suspension — registers, return addresses, local frames. That’s typically 4KB to 8KB per coroutine on most platforms. Stackless means you save only what the state machine enum carries. The instruction pointer equivalent is the enum variant itself: Init means “start here,” Waiting means “resume from this branch.” There’s no hidden stack frame. There’s no dedicated stack segment. The memory layout is the size of the largest variant plus the tag, aligned to the strictest field.

On bare metal, this distinction is the difference between supporting 500 concurrent tasks and supporting 5.

// Memory layout: discriminant tag + largest variant + alignment padding
enum SensorPoll {
    Idle,                          // 0 bytes of payload
    Reading { register: u32,       // 4 bytes
              retries: u8 },       // 1 byte
    Faulted { code: u32 },         // 4 bytes
}
// sizeof(SensorPoll) == 8 bytes on current rustc: the 1-byte tag and
// retries fit in the padding around register. Not 4KB.
// No stack allocation. No guard pages. No context-switch overhead.
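The claim is checkable. A hosted sketch, assuming the same enum shape as above; the exact size depends on how rustc places the tag, so the assertion is deliberately loose:

```rust
use std::mem::{align_of, size_of};

// Same shape as the enum above; printing the layout makes the claim checkable.
#[allow(dead_code)]
enum SensorPoll {
    Idle,
    Reading { register: u32, retries: u8 },
    Faulted { code: u32 },
}

fn main() {
    println!(
        "size = {} bytes, align = {}",
        size_of::<SensorPoll>(),
        align_of::<SensorPoll>()
    );
    // Whatever the exact packing, it is a handful of bytes,
    // nowhere near a multi-KB coroutine stack.
    assert!(size_of::<SensorPoll>() <= 12);
}
```

Note that repr(Rust) gives the compiler freedom over this layout; if a driver needs a guaranteed layout for DMA or a debugger, repr(C) plus an explicit tag is the stable option.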

Memory layout alignment

Saving register states without a stack isn’t a limitation — it’s a constraint that forces honest design. If your coroutine needs 12 local variables across a suspension point, your state enum tells you that explicitly. There’s no hiding it behind an opaque generated struct. The enum is the contract: every field in every variant is something the system agreed to keep alive across a yield. When that enum gets large, it’s a design signal, not a compiler artifact. You fix the architecture, not the annotation.

No-std async executor

Tokio is a production-grade runtime built for servers. It has a thread pool, a work-stealing scheduler, a timer wheel, and a reactor that wraps epoll or kqueue. That’s roughly 50,000 lines of infrastructure you pull into your build the moment you add it as a dependency. On a bare-metal STM32 with no OS, no allocator, and 128KB of flash, that’s not a tradeoff — it’s a non-starter.

The minimal executor isn’t a stripped-down Tokio. It’s a different animal entirely.


What you actually need is a poll loop, a fixed task list, and a waker implementation that doesn’t require heap allocation. The entire executor can fit in under 60 lines of Rust. No trait objects for the task list — you know your task types at compile time, so you use an array of concrete state machines. No dynamic waker registration — when a task signals readiness, it writes to a bitmask. The poll loop reads the bitmask, iterates over ready tasks, calls poll directly. That’s it. That’s the runtime.

#![no_std]
#![no_main]

use core::sync::atomic::{AtomicU32, Ordering};
use core::task::{RawWaker, RawWakerVTable};

static READY_MASK: AtomicU32 = AtomicU32::new(0);

// The waker's data pointer carries the task index; it is never dereferenced.
unsafe fn waker_clone(p: *const ()) -> RawWaker {
    RawWaker::new(p, &TASK_VTABLE)
}
unsafe fn waker_wake(p: *const ()) {
    let idx = p as usize;
    READY_MASK.fetch_or(1 << idx, Ordering::Release);
}
unsafe fn waker_drop(_: *const ()) {}

static TASK_VTABLE: RawWakerVTable =
    RawWakerVTable::new(waker_clone, waker_wake, waker_wake, waker_drop);

Instruction pointer

The waker here carries the task index as a raw pointer — an abuse of the API that the contract technically permits. On wake, it sets a bit in a static atomic bitmask. The poll loop checks that mask each iteration. No heap. No Arc. No cross-thread synchronization beyond a single atomic store. The “instruction pointer” for each task is the enum variant it left in — that’s all the resume address you need when there’s no real stack to restore.

This works because the contract between executor and future is narrow: call poll, get Ready or Pending, reschedule if Pending. Everything else — priority queues, fairness, timer integration — is policy layered on top of that contract. Strip the policy and the mechanism is trivially small. Most embedded executors that claim to be “lightweight” are still carrying that policy weight. A real no-std executor discards it entirely and rebuilds only what the hardware demands.

The cost you pay is expressiveness. You can’t dynamically spawn tasks. Your task count is fixed at compile time. Priorities are whatever order you poll the bitmask bits. For a sensor fusion loop running on a Cortex-M4, those aren’t bugs — they’re features. Determinism and auditability matter more than flexibility when the system has no recovery path.

Custom suspension points

The await keyword hardcodes where a function yields. The compiler inserts the suspension point; you get no say in the conditions under which it fires. That sounds like a minor complaint until you’re writing a DMA transfer routine where you need to yield only after confirming the descriptor ring advanced — not after the I/O request was submitted, not after an interrupt fired, but after a specific hardware register bit transitions from 0 to 1 under a memory barrier.

Standard async gives you no hook for that. You wrap it in a custom Future anyway, which means you’re already writing manual poll logic — but now you’re also carrying the async machinery around it for no reason.

use core::future::Future;
use core::pin::Pin;
use core::task::{Context, Poll};

enum DmaTransfer {
    Armed { descriptor: u32 },
    Polling { descriptor: u32, deadline: u32 },
    Complete { bytes: usize },
    Faulted,
}

// start_dma, dma_complete, dma_byte_count, and current_tick are the
// platform's hardware shims.
impl Future for DmaTransfer {
    type Output = Result<usize, ()>;

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        let this = self.get_mut(); // DmaTransfer is Unpin
        match *this {
            DmaTransfer::Armed { descriptor } => {
                start_dma(descriptor);
                *this = DmaTransfer::Polling {
                    descriptor,
                    deadline: current_tick() + 1000,
                };
                cx.waker().wake_by_ref();
                Poll::Pending
            }
            DmaTransfer::Polling { descriptor, deadline } => {
                if dma_complete(descriptor) {
                    let n = dma_byte_count(descriptor);
                    *this = DmaTransfer::Complete { bytes: n };
                    Poll::Ready(Ok(n))
                } else if current_tick() > deadline {
                    *this = DmaTransfer::Faulted;
                    Poll::Ready(Err(()))
                } else {
                    cx.waker().wake_by_ref();
                    Poll::Pending
                }
            }
            DmaTransfer::Complete { bytes } => Poll::Ready(Ok(bytes)),
            DmaTransfer::Faulted => Poll::Ready(Err(())),
        }
    }
}

Explicit control over suspension

This is explicit control over suspension. The state machine encodes the hardware protocol directly: arm the DMA, poll the completion bit, enforce a deadline, fault cleanly on timeout. Each variant is a documented checkpoint in the transfer lifecycle. There’s no implicit state propagation — no hidden variable saved across an await point that you forgot was there. When the system auditor asks “what is this task doing at cycle 47,000,” you point to the enum variant. That’s the answer.


By moving away from generated code, we gain state determinism. In safety-critical systems, being able to map every possible byte of memory to a known hardware state isn’t just a “nice to have” — it’s a certification requirement. Manual state machines transform “magic” async logic into a traceable execution graph.

Driver architecture as state machines

Custom suspension points change how you think about driver architecture. Instead of writing a function that blocks until an operation completes, you write a state machine that advances when the hardware is ready. The driver becomes a description of valid hardware states and the transitions between them — which is what it always should have been. The OS kernel scheduler, the interrupt controller, and the task executor all speak the same language: poll, yield, resume. No magic. No hidden runtime. Just state and transitions.

This approach effectively eliminates the async-sync impedance mismatch. When the driver is a state machine, it doesn’t care if it’s being polled by a high-level executor or a simple interrupt handler. You’ve decoupled the logic from the execution strategy, achieving true runtime-agnostic code.

The honesty of manual implementation

The compiler lies to you about how cheap async is at the systems boundary. The vtable is there. The implicit saves are there. The opaque type erasure is there. None of it disappears because the syntax looks clean. Going manual doesn’t mean going primitive — it means going honest. You trade ergonomics for auditability, and on systems where auditability is the only acceptable currency, that trade is obvious.

The engineers who will push Rust into real-time kernels and safety-critical firmware aren’t waiting for a better async runtime. They’re already writing the enums — building the coroutine-level control that the standard library is still trying to figure out how to stabilize.
