modulovalue: A blog by Modestas Valauskas (https://modulovalue.com/feed.xml)


243,000 words dictated in 39 days, speech-to-text changed how I work
2026-02-05
https://modulovalue.com/blog/voxtral-transcribe-and-wispr-flow

I have been dictating everything for the past 39 days. Code prompts, messages, emails, notes. In that time I have spoken 243,554 words, which is roughly the length of two books. I would never have typed that many words in the same timeframe.

This post is about two things: the dictation tool that made this possible, and a new speech-to-text API that I built a test app for.

Wispr Flow

Wispr Flow* is a macOS (and Windows, and iOS) dictation app that runs in the background and works in any application. You hold a key, speak, and it types out what you said. It is not a simple transcription tool. It auto-edits filler words, adds punctuation, and matches the tone and formatting of the app you are using. It has a custom dictionary so it learns your terminology, which is important if you work with domain-specific terms.

I have been using it every single day since I got it over a month ago. I cannot recommend it strongly enough. It was genuinely life-changing.

My Wispr Flow statistics after 39 days of use.

The numbers: 39-day daily streak, 129 words per minute (top 2% of all Flow users), 243,554 total words dictated across 58 different apps.

What surprised me

I became more comfortable talking to people. As a developer, I am not used to talking to people all day. Most of my communication was typed. After five weeks of constant dictation, I noticed I was significantly more comfortable in conversations, meetings, and even casual interactions. Speaking became the default mode of expression rather than something I had to switch into.

I say what I think without filtering. When you type, there is a natural bottleneck. You think of something, then you figure out how to type it, then you type it. Dictation removes the middle step. I can say whatever comes to mind, and it flows out. This matters more than it sounds, because the bottleneck is not typing speed, it is the cognitive overhead of translating thoughts into keystrokes.

Staying in flow is effortless. You do not need your hands. You can be moving around. You can be looking at a different monitor. You can have multiple things open and narrate what you are doing without ever breaking your attention to type. I recently bought two additional monitors specifically so I can keep different contexts visible simultaneously instead of switching between desktops.

You do not even need to be at your desk. I reprogrammed a presentation clicker (a laser pointer with a Karabiner-Elements configuration) so that one button triggers Fn+Space (which activates Wispr Flow) and another button sends Enter. I can walk around my room, press a button, speak, press the button again, and the text appears. If I want to submit, I press the other button. It is the laziest, most effective input method I have ever used. I highly recommend trying something like this.

A reprogrammed presentation clicker for hands-free dictation.

Here is the Karabiner-Elements configuration. It remaps the clicker's tab key to Fn+Space (which triggers Wispr Flow) and down_arrow to Enter, scoped to the clicker's specific device ID so it does not affect any other keyboard:

{
  "description": "Laser Pointer Remaps",
  "manipulators": [
    {
      "type": "basic",
      "from": {
        "key_code": "tab"
      },
      "to": [
        {
          "key_code": "spacebar",
          "modifiers": ["fn"]
        }
      ],
      "conditions": [
        {
          "type": "device_if",
          "identifiers": [
            {
              "vendor_id": 4643,
              "product_id": 15975
            }
          ]
        }
      ]
    },
    {
      "type": "basic",
      "from": {
        "key_code": "down_arrow"
      },
      "to": [
        {
          "key_code": "return_or_enter"
        }
      ],
      "conditions": [
        {
          "type": "device_if",
          "identifiers": [
            {
              "vendor_id": 4643,
              "product_id": 15975
            }
          ]
        }
      ]
    }
  ]
}

Why this matters for AI

I pair Wispr Flow with ChatGPT Pro and Claude's 20x plan.

When you dictate prompts instead of typing them, you provide far more context. You can discover what you need to say while you are saying it. You do not need to correct yourself or be precise on the first try, because if you provide more context as you speak, the AI understands how you arrived at your conclusion. It can synthesize a better result than if you had carefully typed out a polished, minimal prompt.

Traditionally, when we speak, we filter ourselves. We are afraid of being redundant or imprecise or embarrassing ourselves. But when you are talking to a machine, there is no social pressure. You can speak like a child, without thinking about presentation, and provide far more context than you would ever type. The machine can handle the noise and extract the signal. The result is better output, consistently, than what I get from typed prompts.

Voxtral Transcribe 2

Mistral recently released Voxtral Transcribe 2, their speech-to-text API. It offers transcription in 13 languages with speaker diarization, context biasing, and word-level timestamps. The pricing is $0.003 per minute for the batch API and $0.006 per minute for the real-time API. To put that in perspective, transcribing one hour of audio costs $0.18 with the batch API or $0.36 with the real-time API. My 243,554 words at 129 words per minute amount to roughly 1,888 minutes of speech. Transcribing all of that would have cost about $5.66 with the batch API or $11.33 with the real-time API. That is cheaper than a month of Wispr Flow Pro at $15/month, and I used it heavily every single day. You can also self-host Voxtral since the real-time model's weights are open under Apache 2.0. The raw economics make it very practical for any application that needs speech-to-text.
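The cost math above is simple enough to sanity-check. A quick sketch, using only the figures from this paragraph:

```python
# Back-of-the-envelope check of the Voxtral pricing figures.
# All rates and usage numbers are taken from this post.
WORDS = 243_554          # total words dictated
WPM = 129                # measured words per minute
BATCH_RATE = 0.003       # USD per minute, batch API
REALTIME_RATE = 0.006    # USD per minute, real-time API

minutes = WORDS / WPM
print(f"{minutes:.0f} minutes of speech")
print(f"batch:     ${minutes * BATCH_RATE:.2f}")
print(f"real-time: ${minutes * REALTIME_RATE:.2f}")
```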

That said, Wispr Flow is worth every cent. It works on macOS and iOS, it auto-edits your speech, it learns your vocabulary, and it just works everywhere. The value is in the polish and integration, not just the transcription. But it is interesting to see that the underlying transcription itself has become so cheap that building on top of these APIs is very accessible.

Wispr Flow currently seems to use Whisper under the hood (the "Wispr" in the name is a play on "Whisper"). I am curious whether they will adopt an API like Voxtral, or whether the competitive landscape will push these models to converge. Voxtral does not yet have the equivalent of Wispr's custom dictionary, but Mistral does offer context biasing, which lets you provide a list of terms to improve recognition accuracy for domain-specific vocabulary.

Trying out the API

I wanted to try the Voxtral API myself. The demos on Mistral's site did not work for me because they do not support a hold-to-record interaction, which is the main use case I care about. So I threw together a simple browser-based test page to play with it.

The Voxtral Transcribe test app.

It is a single HTML page. You enter your Mistral API key, hold a button, speak, and get the transcription back with timestamps, speaker labels, and raw JSON. There is no backend. Your API key stays in your browser's localStorage and audio is sent directly to Mistral's API.

Try it here or view the source on GitHub.

You can also download the page to your desktop and run it locally. Nothing is shared, nothing is collected.

Features:

  • Hold-to-record and toggle modes
  • Speaker diarization (identifies and labels who is speaking at any given moment, so a recording of a conversation comes back with "Speaker 0" and "Speaker 1" labels rather than a single block of text)
  • Segment and word-level timestamps
  • Language selection (13 languages)
  • Context biasing for custom terminology
  • Copy text and raw JSON output

What I observed by building this is that Voxtral is noticeably faster than what Wispr Flow currently uses. With Wispr Flow, roughly one in every 30 to 40 messages makes me wait several seconds for the transcription to come through, and roughly one in every 60 to 70 messages fails completely. It is annoying, but the overall value Wispr Flow provides is good enough that I keep using it despite these issues.

The Voxtral API felt faster and I am very excited to see how reliable it becomes. If someone builds a dictation tool on top of this API, please let me know. I will be your first paying customer.

I hope more applications adopt APIs like this, and I hope it pushes the entire speech-to-text space forward. I cannot imagine going back to typing everything.


Addendum (February 10, 2026):

After this post went live, Nikita Efimov reached out to share DictationSolutions, a collection of dictation tools including WhisperInk. It is a Windows application, so I have not been able to test it, but it is nice to see people building in this space.

I also open-sourced VoxtralDictate, the macOS menu bar app I built on top of Mistral's Voxtral API. Press a keyboard combination to start recording, press it again to stop, and the transcript is pasted at your cursor. In practice, the API turned out to be noticeably slower than my initial tests suggested. I am not sure if I am doing something wrong, but compared to Wispr Flow the experience is much worse. And beyond raw speed, getting the user experience right for a dictation tool is a lot of work that I do not want to put in, since this is not something I want to build and maintain as an application. What I would love to see is an open-source dictation tool that gets the UI and UX right and makes the underlying speech-to-text API configurable, so you can swap between different providers. I am not aware of such a tool.

In other news, someone posted a Rust implementation of Voxtral Mini 4B on Hacker News that runs in the browser via WebAssembly and WebGPU. The quantized model is about 2.5 GB. I am curious whether this is how it becomes feasible to run speech-to-text locally, or on a dedicated inexpensive server. I would love to see a dictation tool that performs consistently and runs entirely locally.


* The Wispr Flow link above is a referral link. You get a free month if you use it. I do not care, I am already paying for it. But if you want to use it, feel free.

Modestas Valauskas


Benchmarking my parser generator against LLVM: I have a new target
2026-01-18
https://modulovalue.com/blog/benchmarking-against-llvm-parser

This is a follow-up to my previous post, I built a 2x faster lexer, then discovered I/O was the real bottleneck. In that post, I benchmarked my ARM64 assembly lexer against the official Dart scanner. This time, I wanted to answer a different question: how fast should a parser generator be able to go?

I found my answer by benchmarking against LLVM.

The setup

I have been working on a parser generator that produces LALR(k) parsers. To test it, I wrote a grammar for LLVM's textual IR format, a language complex enough to require LALR(2). My test corpus consists of 12,161 LLVM IR files totaling 112 MB, extracted from the LLVM test suite.

I benchmarked three lexers and three parsers:

Lexers:

  • A Dart-based lexer (generated from a specification by my lexer generator)
  • An ARM64 assembly lexer (generated from a specification by my lexer generator)
  • LLVM's official lexer (called via FFI)

Parsers:

  • A table-driven recursive ascent parser in Dart (interpreted, not optimized)
  • A table-driven recursive ascent parser (code-generated from a specification by my parser generator)
  • LLVM's official parser (called via FFI)

A note on terminology: most programmers are familiar with recursive descent, a top-down parsing technique where each grammar rule becomes a function that calls other rule functions. Recursive ascent is the bottom-up counterpart: the call stack mirrors the LR parse stack, and functions return by "ascending" to parent rules after reducing a production. Both of my parsers use this technique to implement LALR parsing.
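To make the technique concrete, here is a minimal recursive ascent parser in Python (a hand-written sketch, not output from my generator) for the grammar E -> E '+' n | n, with state 3 inlined into state 1 for brevity. Each state is a function, the call stack mirrors the LR stack, and a reduction unwinds by returning the number of stack symbols left to pop:

```python
def parse(tokens):
    """Recursive ascent parser for E -> E '+' n | n (n = int token).

    Each parser state is a function whose activation record stands in
    for one LR stack entry. A reduction returns (pops, value): how many
    stack symbols still need to unwind, and the semantic value.
    """
    toks = list(tokens) + ['$']
    i = 0

    def peek():
        return toks[i]

    def advance():
        nonlocal i
        i += 1

    def state0():
        # items: S' -> .E   E -> .E '+' n   E -> .n
        t = peek()
        if not isinstance(t, int):
            raise SyntaxError('expected a number')
        advance()
        pops, v = state2(t)        # shift n
        pops -= 1                  # E -> n unwinds the n frame
        assert pops == 0           # the reduction lands here; goto on E
        while True:
            pops, v = state1(v)    # frame for the stack symbol E
            if pops == 0:
                return v           # state1 accepted at end of input
            pops -= 1              # unwind this E frame
            assert pops == 0       # E -> E '+' n lands back here

    def state1(e):
        # items: S' -> E.   E -> E. '+' n
        t = peek()
        if t == '$':
            return 0, e            # accept
        if t == '+':
            advance()              # shift '+'  (state 3, inlined)
            t = peek()
            if not isinstance(t, int):
                raise SyntaxError('expected a number after +')
            advance()              # shift n    (leads to state 4)
            pops, v = state4(e, t)
            return pops - 2, v     # unwind the two frames inlined here
        raise SyntaxError('unexpected token: %r' % (t,))

    def state2(n):
        # item: E -> n.   reduce, popping 1 stack symbol
        return 1, n

    def state4(e, n):
        # item: E -> E '+' n.   reduce, popping 3 stack symbols
        return 3, e + n

    return state0()

print(parse([1, '+', 2, '+', 3]))  # semantic value: the sum
```

The interesting part is the unwinding: state4 reduces E -> E '+' n and returns a pop count of 3, and each frame on the way down subtracts the symbols it shifted until the count reaches zero at the frame that owns the goto.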

Both of my parsers are fully deterministic. I was able to implement a deterministic LLVM IR parser, which means no backtracking and no non-deterministic steps. This also means that generalized parsing algorithms like GLR or GLL would not help here, as they are designed to handle ambiguous or (more broadly) non-deterministic grammars. Similarly, PEG parsers use ordered choice to implicitly disambiguate grammars, but since my grammar is LALR(2) and is therefore deterministic, this offers no advantage.

The results

First, let me note that I/O remains a significant cost. Loading 12,161 files into memory took 2.5 seconds, nearly four times longer than the fastest parser. The lessons from my previous post still apply.

Lexer comparison

Lexer        Time    Throughput
LLVM (FFI)   248ms   451 MB/s
ARM64 ASM    293ms   382 MB/s
Dart         653ms   172 MB/s

My assembly lexer is only 1.18x slower than LLVM's. This is encouraging.

The Dart lexer is 2.6x slower, but this is not a reflection of Dart as a language. The Dart lexer is table-driven rather than direct-coded, and it exists as a quick proof of concept so I do not have to run a code generation step during development. There are many optimization techniques I could apply, like compiling state transitions into functions, but that defeats the purpose of having a fast iteration loop.

Parser comparison

Here is where things get interesting.

Parser                           Time      Throughput   Notes
LLVM (FFI)                       647ms     173 MB/s     Combined lex + parse
Generated recursive ascent       3,745ms   30 MB/s      Parse only, tokens pre-lexed
Non-generated recursive ascent   6,560ms   17 MB/s      Parse only, tokens pre-lexed

Important caveat: my parser only collects parse events without building any data structure. It also parses whitespace, which LLVM skips. This is a deliberate design choice: one of my goals is lossless parsing, where the original source can be reconstructed from the parse tree. This enables tools like pretty printers, incremental parsing, and educational tooling that can answer questions like "which grammar rules produced this range of code?" But it does add overhead compared to LLVM's approach.

LLVM's parser does far more in other ways: it builds a complete intermediate representation (IR) with type checking, symbol resolution, and validation. Despite doing less semantic work, my parser is significantly slower. This makes LLVM's speed even more impressive, and it means the real gap is even larger than these numbers suggest. To truly compete, I would need to not only match LLVM's parsing speed but also add IR construction on top.

LLVM lexes and parses 112 MB in 647 milliseconds. My best parser, using pre-lexed tokens, takes 3.7 seconds just to parse. The full pipeline comparison is stark:

Pipeline                          Time      vs LLVM
LLVM lex + parse                  647ms     1x
Dart lex + generated parser       4,398ms   6.8x slower
Dart lex + non-generated parser   7,213ms   11x slower

To beat LLVM, I would need to parse in under 400 milliseconds (since my assembly lexer takes about 250ms). My current parser takes 3.7 seconds. That is a 9x gap.

Why LLVM is so fast

LLVM's parser is a hand-written recursive descent parser. It does not generate an intermediate parse tree or abstract syntax tree. Instead, it constructs LLVM's intermediate representation (IR) directly during parsing.

This is not a tree. It is a graph. Instructions reference other instructions. Basic blocks reference other basic blocks. Functions reference global values. The parser builds this graph incrementally as it descends through the grammar.

This design has significant implications for parser generators.

The problem with fusing parsing and IR construction

The key insight is not that bottom-up parsers are inherently slow. The problem is that LLVM fuses parsing with IR construction, and this fusion is difficult to achieve with bottom-up parsing.

In a top-down parser (LL family, including recursive descent), you process nodes in pre-order: parents before children. When you enter a function, you can create the function object immediately. When you encounter instructions, you attach them to the already-existing function. The parent context is always available.

In a bottom-up parser (LR family, including recursive ascent), you process nodes in post-order: children before parents. When you reduce a production, all children have been parsed, but the parent does not exist yet. You cannot attach children to a parent that has not been created.

This means a bottom-up parser cannot easily inline the construction of a graph structure during parsing. You have two choices:

  1. Build an intermediate tree first, then transform it into the final representation in a separate pass. This adds object creation overhead and an extra traversal.

  2. Collect parse events and defer construction. This is what my parser does. It avoids intermediate tree allocation, but to build the final IR, you would need a separate pass over those events. Whether it is even practical to build a graph structure like LLVM's IR from bottom-up parse events is unclear.
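Choice 2's deferred pass can be sketched as replaying the post-order events against a stack. The event shapes here, ('shift', token) and ('reduce', lhs, arity), are hypothetical and not my parser's actual event format:

```python
def build_tree(events):
    """Replay bottom-up parse events into a tree.

    events is a post-order stream: ('shift', token) pushes a leaf,
    ('reduce', lhs, arity) pops `arity` children and pushes a node.
    """
    stack = []
    for ev in events:
        if ev[0] == 'shift':
            stack.append(ev[1])
        else:
            _, lhs, arity = ev
            children = tuple(stack[-arity:])
            del stack[-arity:]
            stack.append((lhs, children))
    assert len(stack) == 1, 'events did not reduce to a single root'
    return stack[0]

# Events for "n + n" under E -> E '+' n | n:
events = [
    ('shift', 'n'), ('reduce', 'E', 1),
    ('shift', '+'), ('shift', 'n'), ('reduce', 'E', 3),
]
print(build_tree(events))
```

Building a graph like LLVM's IR this way is the unclear part: a reduce produces a node before its eventual parent exists, so cross-references would have to be patched in afterwards.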

LLVM avoids both of these costs by building the IR directly as it parses. The IR is the intermediate representation, but there is no intermediate step between parsing and the IR, no AST, no extra pass. The recursive descent structure naturally provides the parent context needed to build the graph incrementally.

Could a bottom-up parser achieve the same fusion? Perhaps, but it would not be straightforward. The post-order nature of bottom-up parsing means you would need creative workarounds to maintain parent context during reductions. I have not found an elegant solution.

Is LLVM's lexer regular?

Here is an interesting observation: LLVM's lexer appears to be mostly regular. There are no nested comments, no string interpolation. In principle, it should be a pure deterministic finite automaton with no additional state.

This is unusual. Consider Dart, which I lexed in my previous post. To correctly lex Dart, you need a stack:

  • C-style block comments can be nested (/* outer /* inner */ still outer */)
  • String interpolation creates nested lexical contexts ("Hello ${name.toUpperCase()}")

LLVM IR has neither. The lexical structure should be regular.
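The nesting is exactly the part a pure DFA cannot handle. A sketch of the depth counter (the degenerate case of a stack) that lexing Dart's nested block comments requires:

```python
def skip_block_comment(src, i):
    """Skip a possibly nested /* ... */ comment starting at src[i].

    Returns the index just past the closing */. A DFA cannot do this:
    matching nested delimiters requires unbounded counting.
    """
    assert src.startswith('/*', i)
    depth = 0
    while i < len(src):
        if src.startswith('/*', i):
            depth += 1
            i += 2
        elif src.startswith('*/', i):
            depth -= 1
            i += 2
            if depth == 0:
                return i
        else:
            i += 1
    raise SyntaxError('unterminated block comment')

src = '/* outer /* inner */ still outer */ code'
end = skip_block_comment(src, 0)
print(src[end:])  # -> ' code'
```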

The cost of hand-written lexers

However, looking at the actual implementation, there are some context-sensitive parts. The lexer has a flag that controls how colons in identifiers are handled:

// If we stopped due to a colon, unless we were directed to ignore it,
// this really is a label.
if (!IgnoreColonInIdentifiers && *CurPtr == ':') {
  StrVal.assign(StartChar-1, CurPtr++);
  return lltok::LabelStr;
}

llvm/lib/AsmParser/LLLexer.cpp, lines 509-514

The parser sets this flag when parsing summary entries:

// For summary entries, colons should be treated as distinct tokens,
// not an indication of the end of a label token.
Lex.setIgnoreColonInIdentifiers(true);
// ... parsing code ...
Lex.setIgnoreColonInIdentifiers(false);

llvm/lib/AsmParser/LLParser.cpp, lines 1097-1106

And again when parsing memory attributes:

// We use syntax like memory(argmem: read), so the colon should not be
// interpreted as a label terminator.
Lex.setIgnoreColonInIdentifiers(true);

llvm/lib/AsmParser/LLParser.cpp, lines 2581-2588

This is unfortunate. It introduces coupling between the parser and lexer that would not exist if the lexical grammar were truly regular.

This is the kind of issue that would not arise if the lexer were generated from a specification. A formal specification forces you to make the lexical grammar explicit, and a generator will not even support context-sensitive rules. Hand-written lexers make it too easy to add "just one flag" to handle an edge case, and over time these accumulate. The LLVM team has full control over the IR format and regularly introduces breaking changes, so this could be cleaned up. But the temptation to add a quick fix is always there.

SIMD and the case for generation

Despite these impurities, the lexer is close enough to regular that SIMD techniques should still apply. Parsing regular languages with SIMD is a well-understood problem. Projects like Intel Hyperscan and its fork Vectorscan (which adds ARM NEON support) demonstrate that SIMD-accelerated regex matching can achieve remarkable throughput. Academic work on SIMD-accelerated regular expression matching shows 2-5x speedups over scalar code. Other projects like Rejit and RoaringRegex explore similar territory.

This is actually an argument for lexer generation. A hand-written lexer is unlikely to use SIMD intrinsics for every token type, and it is prone to accumulating context-sensitive hacks over time. A lexer generator could automatically emit SIMD-optimized code while enforcing that the lexical grammar remains regular. The fact that my current generated lexer is slower than LLVM's hand-written one does not mean generation is a dead end. It means there is room for improvement, and SIMD is one concrete path forward.

Real-world programming languages are messier. They have nested comments, string interpolation, heredocs, and other context-sensitive lexical structures that genuinely require a stack (looking at you, JavaScript, and your context-sensitive regex syntax). But for languages whose lexical structure could be regular, generation offers both performance opportunities and protection against accidental complexity.

What this means for my project

I now have a clear target: LLVM processes 112 MB of complex IR in 647ms, achieving 173 MB/s for combined lexing and parsing.

If I can match LLVM's performance, I will have done something significant. LLVM is one of the largest and most optimized compiler projects in the world, with contributions from Apple, Google, ARM, Intel, and dozens of other companies. It is not a low bar.

But I think it is achievable. The lexing side is already close (293ms vs 248ms). The parsing side needs work.

To close the gap, I see several paths:

  1. Optimize the current parser. My table-driven parser has not been optimized. Direct-coded parsing instead of table interpretation should offer a significant speedup.

  2. SIMD tricks. Projects like simdjson achieve remarkable speeds by processing multiple bytes in parallel. Some of these techniques could be applied to parser generators.

  3. Generate recursive descent parsers. My parser generator currently produces LALR(k) parsers. I could also support generating recursive descent (LL) parsers, which would allow fusing parsing with IR construction. Since LLVM IR requires LALR(2), I suspect I would need at least LL(2) to handle the same language, though I am not entirely certain this would be sufficient.

  4. Profile-guided optimization. If I generate parsers that compile to LLVM IR, I could use LLVM's PGO to optimize them. Using LLVM to beat LLVM has a certain poetic appeal.

The more interesting question is whether I can match LLVM's performance while still generating parsers from grammars, rather than writing them by hand. That would be genuinely useful. Once you have a context-free grammar that corresponds precisely to the implementation, you can generate pretty printers from a simple specification, build IR builders for any programming language that are proven to match the implementation, build IDE tooling, implement incremental parsing, and more. Changes to the grammar could be proven to be additive rather than breaking. Breaking changes could be introduced in a backwards-compatible manner with proper documentation.

Of course, migrating from a hand-written parser to a generated one is a difficult sell when it is several times slower. But I believe this is a solvable problem, and I am actively working on it. If you have ideas, feedback, or just want to follow along, feel free to reach out.

Conclusion

I set out to find a target for parser performance. I found one: LLVM processes 173 MB/s. My generated parser currently achieves 30 MB/s, a 5.8x gap. And since my parser does less work than LLVM's (no IR construction), the effective gap is even larger.

The gap exists not because bottom-up parsing is inherently slow, but because LLVM's architecture fuses parsing with IR construction in a way that requires pre-order traversal. To compete, I may need to generate top-down parsers, or find SIMD tricks that make the parsing paradigm less relevant.

I am not there yet. But I have a target now.

Modestas Valauskas


I built a 2x faster lexer, then discovered I/O was the real bottleneck
2026-01-13
https://modulovalue.com/blog/syscall-overhead-tar-gz-io-performance

I built an ARM64 assembly lexer (well, I generated one from my own parser generator, but this post is not about that) that processes Dart code 2x faster than the official scanner, a result I achieved using statistical methods to reliably measure small performance differences. Then I benchmarked it on 104,000 files and discovered my lexer was not the bottleneck. I/O was. This is the story of how I accidentally learned why pub.dev stores packages as tar.gz files.

The setup

I wanted to benchmark my lexer against the official Dart scanner. The pub cache on my machine had 104,000 Dart files totaling 1.13 GB, a perfect test corpus. I wrote a benchmark that:

  1. Reads each file from disk
  2. Lexes it
  3. Measures time separately for I/O and lexing

Simple enough.

The first surprise: lexing is fast

Here are the results:

Metric           ASM Lexer   Official Dart
Lex time         2,807 ms    6,087 ms
Lex throughput   402 MB/s    185 MB/s

My lexer was 2.17x faster. Success! But wait:

Metric          ASM Lexer   Official Dart
I/O time        14,126 ms   14,606 ms
Total time      16,933 ms   20,693 ms
Total speedup   1.22x       -

The total speedup was only 1.22x. My 2.17x lexer improvement was being swallowed by I/O. Reading files took 5x longer than lexing them.

The second surprise: the SSD is not the bottleneck

My MacBook has an NVMe SSD that can read at 5-7 GB/s. I was getting 80 MB/s. That is 1.5% of the theoretical maximum.

The problem was not the disk. It was the syscalls.

For 104,000 files, the operating system had to execute:

  • 104,000 open() calls
  • 104,000 read() calls
  • 104,000 close() calls

That is over 300,000 syscalls. Each syscall involves:

  • A context switch from user space to kernel space
  • Kernel bookkeeping and permission checks
  • A context switch back to user space

Each syscall costs roughly 1-5 microseconds. Multiply that by 300,000 and you get 0.3-1.5 seconds of pure overhead, before any actual disk I/O happens. Add filesystem metadata lookups, directory traversal, and you understand where the time goes.
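Spelling the estimate out (the 1 to 5 microseconds per syscall is this post's assumption, not a measurement):

```python
# Estimate pure syscall overhead for the 104,000-file benchmark.
n_files = 104_000
syscalls = 3 * n_files       # open + read + close per file
low = syscalls * 1e-6        # at 1 microsecond per syscall
high = syscalls * 5e-6       # at 5 microseconds per syscall
print(f"{syscalls:,} syscalls -> {low:.1f} to {high:.1f} s of overhead")
```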

I tried a few things that did not help much. Memory-mapping the files made things worse due to the per-file mmap/munmap overhead. Replacing Dart's file reading with direct FFI syscalls (open/read/close) only gave a 5% improvement. The problem was not Dart's I/O layer, it was the sheer number of syscalls.

The hypothesis

I have mirrored pub.dev several times in the past and noticed that all packages are stored as tar.gz archives. I never really understood why, but this problem reminded me of that fact. If syscalls are the problem, the solution is fewer syscalls. What if instead of 104,000 files, I had 1,351 files (one per package)?

I wrote a script to package each cached package into a tar.gz archive:

104,000 individual files -> 1,351 tar.gz archives
1.13 GB uncompressed     -> 169 MB compressed (6.66x ratio)
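My script was specific to my pub cache layout, but the idea can be sketched with Python's standard tarfile module (a hypothetical equivalent, not the script I used): pack a directory once, then read every file back in a single sequential pass:

```python
import os
import tarfile
import tempfile

def pack(src_dir, archive_path):
    """Write every file under src_dir into one tar.gz archive."""
    with tarfile.open(archive_path, 'w:gz') as tar:
        tar.add(src_dir, arcname='.')

def read_all(archive_path):
    """Read every regular file out of the archive sequentially."""
    contents = {}
    with tarfile.open(archive_path, 'r:gz') as tar:
        for member in tar.getmembers():
            if member.isfile():
                contents[member.name] = tar.extractfile(member).read()
    return contents

# Tiny demonstration with a throwaway package directory.
with tempfile.TemporaryDirectory() as tmp:
    pkg = os.path.join(tmp, 'pkg')
    os.makedirs(pkg)
    with open(os.path.join(pkg, 'main.dart'), 'w') as f:
        f.write('void main() {}')
    archive = os.path.join(tmp, 'pkg.tar.gz')
    pack(pkg, archive)
    files = read_all(archive)
    print(sorted(files))
```

Regardless of how the packing is done, the payoff is the same: three syscalls per archive instead of three per file.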

The results

Metric            Individual Files   tar.gz Archives
Files/Archives    104,000            1,351
Data on disk      1.13 GB            169 MB
I/O time          14,525 ms          339 ms
Decompress time   -                  4,507 ms
Lex time          2,968 ms           2,867 ms
Total time        17,493 ms          7,713 ms

The I/O speedup was 42.85x. Reading 1,351 sequential files instead of 104,000 random files reduced I/O from 14.5 seconds to 339 milliseconds.

The total speedup was 2.27x. Even with decompression overhead, the archive approach was more than twice as fast.

Breaking down the numbers

I/O: 14,525 ms to 339 ms

This is the syscall overhead in action. Going from 300,000+ syscalls to roughly 4,000 syscalls (open/read/close for 1,351 archives) eliminated most of the overhead.

Additionally, reading 1,351 files sequentially is far more cache-friendly than reading 104,000 files scattered across the filesystem. The OS can prefetch effectively, the SSD can batch operations, and the page cache stays warm.

Decompression: 4,507 ms

gzip decompression ran at about 250 MB/s using the archive package from pub.dev. This is the new bottleneck. I did not put much effort into optimizing decompression; an FFI-based solution using native zlib could be significantly faster. Modern alternatives like lz4 or zstd might also help.

Compression ratio: 6.66x

Source code compresses well. The 1.13 GB of Dart code compressed to 169 MB. This means less data to read from disk, which helps even on fast SSDs.

Why pub.dev uses tar.gz

pub.dev versions page with download button

pub.dev package download showing flame-1.34.0.tar.gz

This experiment accidentally explains the pub.dev package format. When you run dart pub get, you download tar.gz archives, not individual files. The reasons are now obvious:

  1. Fewer HTTP requests. One request per package instead of hundreds.
  2. Bandwidth savings. 6-7x smaller downloads.
  3. Faster extraction. Sequential writes beat random writes.
  4. Reduced syscall overhead. Both on the server (fewer files to serve) and the client (fewer files to write).
  5. Atomicity. A package is either fully downloaded or not. No partial states.

The same principles apply to npm (tar.gz), Maven (JAR/ZIP), PyPI (wheel/tar.gz), and virtually every package manager.

The broader lesson

Modern storage is fast. NVMe SSDs can sustain gigabytes per second. But that speed is only accessible for sequential access to large files. The moment you introduce thousands of small files, syscall overhead dominates.

This matters for:

  • Build systems. Compiling a project with 10,000 source files? The filesystem overhead might exceed the compilation time.
  • Log processing. Millions of small log files? Concatenate them. Claude uses JSONL for this reason.
  • Backup systems. This is why rsync and tar exist.

What I would do differently

If I were optimizing this further:

  1. Use zstd instead of gzip. 4-5x faster decompression with similar compression ratios.
  2. Use uncompressed tar for local caching. Skip decompression entirely, still get the syscall reduction.
  3. Parallelize with isolates. Multiple cores decompressing multiple archives simultaneously.

Conclusion

I set out to benchmark a lexer and ended up learning about syscall overhead. The lexer was 2x faster. The I/O optimization was 43x faster.


Addendum: Reader Suggestions

Linux-Specific Optimizations

servermeta_net pointed out two Linux-specific approaches: disabling speculative execution mitigations (which could improve performance in syscall-heavy scenarios) and using io_uring for asynchronous I/O. I ran these benchmarks on macOS, which does not support io_uring, but these Linux capabilities are intriguing. A follow-up post exploring how I/O performance can be optimized on Linux may be in order.

king_geedorah elaborated on how io_uring could help with this specific workload: open the directory file descriptor, extract all filenames via readdir, then submit all openat requests as submission queue entries (SQEs) at once. This batches what would otherwise be 104,000 sequential open() syscalls into a single submission, letting the kernel process them concurrently. The io_uring_prep_openat function prepares these batched open operations. This is closer to the "load an entire directory into an array of file descriptors" primitive that this workload really needs.

macOS-Specific Optimizations

tsanderdev pointed out that macOS's kqueue could potentially improve performance for this workload. While kqueue is not equivalent to Linux's io_uring (it lacks the same syscall batching through a shared ring buffer), it may still offer some improvement over synchronous I/O. I have not benchmarked this yet.

macOS vs Linux Syscall Performance

arter45 noted that macOS may be significantly slower than Linux for certain syscalls, linking to a Stack Overflow question showing open() being 4x slower on macOS compared to an Ubuntu VM. jchw explained that Linux's VFS layer is aggressively optimized: it uses RCU (Read-Copy-Update) schemes liberally to make filesystem operations minimally contentious, and employs aggressive dentry caching. Linux also separates dentries and generic inodes, whereas BSD/UNIX systems consolidate these into vnode structures. This suggests my benchmark results on macOS may actually understate the syscall overhead problem on that platform relative to Linux, or alternatively, that Linux users might see smaller gains from the tar.gz approach since their baseline is already faster.

Is it really the syscalls?

ori_b pushed back on the claim that syscall overhead is the bottleneck. On a Ryzen machine, entering and exiting the kernel takes about 150 cycles (~50ns). Even at 1 microsecond per mode switch, 300,000 syscalls would account for only 0.3 seconds of the 14.5-second I/O time. That is roughly 2%. The remaining time likely comes from filesystem metadata lookups, inode resolution, directory traversal, and random seek latency. Even NVMe SSDs have ~50-100 microseconds of latency per random read, and 300,000 random reads at that latency would account for most of the measured I/O time. So the framing might be more precisely stated as "per-file overhead" rather than "syscall overhead" since the expensive part is the work happening inside each syscall, not the context switch itself. It is also worth noting that ori_b's numbers are from a Linux Ryzen machine, where syscalls are faster than on macOS (as discussed above), adding another variable. I do not currently have tooling to break down where the 14.5 seconds actually goes, so this is something I want to investigate in the future.
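The arithmetic is easy to check. A back-of-envelope Python sketch using the figures from ori_b's comment:

```python
# Back-of-envelope check of ori_b's numbers: syscall mode switches vs.
# NVMe random-read latency for 300,000 file reads.
syscalls = 300_000
mode_switch_s = 1e-6      # generous 1 us per kernel entry/exit
nvme_read_s = 50e-6       # low end of the ~50-100 us random-read latency

switch_total = syscalls * mode_switch_s   # ~0.3 s
read_total = syscalls * nvme_read_s       # ~15 s, the order of the 14.5 s measured

io_time_s = 14.5
print(f"mode switches: {switch_total:.1f} s ({switch_total / io_time_s:.0%} of I/O time)")
print(f"random reads:  {read_total:.1f} s")
```

Even with the generous 1 us per mode switch, the switches explain only a small percentage of the measured I/O time, while device latency alone is in the right ballpark.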

Avoiding lstat with getdents64

stabbles pointed out that when scanning directories, you can avoid separate lstat() calls by using the d_type field from getdents64(). On most popular filesystems (ext4, XFS, Btrfs), the kernel populates this field with the file type directly, so you do not need an additional syscall to determine if an entry is a file or directory. The caveat: some filesystems return DT_UNKNOWN, in which case you still need to call lstat(). For my workload of scanning the pub cache, this could eliminate tens of thousands of stat syscalls during the directory traversal phase, before even getting to the file opens.
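Python exposes the same mechanism portably through os.scandir, whose DirEntry methods use the cached entry type (backed by d_type on Linux) and fall back to a stat call only when the filesystem reports DT_UNKNOWN. A small sketch:

```python
import os
import tempfile

# os.scandir yields DirEntry objects that carry the file type reported by
# the directory listing itself, so classifying entries usually needs no
# extra lstat syscall.
with tempfile.TemporaryDirectory() as root:
    open(os.path.join(root, "a.txt"), "w").close()
    os.mkdir(os.path.join(root, "sub"))

    files, dirs = [], []
    with os.scandir(root) as it:
        for entry in it:
            if entry.is_file(follow_symlinks=False):   # cached type, no lstat on ext4/XFS/Btrfs
                files.append(entry.name)
            elif entry.is_dir(follow_symlinks=False):
                dirs.append(entry.name)

print(files, dirs)
```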

Go Monorepo: 60x Speedup by Avoiding Disk I/O

ghthor shared a similar experience optimizing dependency graph analysis in a Go monorepo. Initial profiling pointed to GC pressure, but the real bottleneck was I/O from shelling out to go list, which performed stat calls and disk reads for every file. By replacing go list with a custom import parser using Go's standard library and reading file contents from git blobs (using git ls-files instead of disk stat calls), they reduced analysis time from 30-45 seconds to 500 milliseconds. This is a 60-90x improvement from the same fundamental insight: avoid per-file syscalls when you can batch or bypass them entirely.

Haiku's packagefs

smallstepforman described how Haiku OS solves this problem at the operating system level. Haiku packages are single compressed files that are never extracted. Instead, the OS uses packagefs, a virtual filesystem that presents the contents of all activated packages as a unified directory tree. Applications see normal paths like /usr/local/lib/foo.so, but the data is actually read from compressed package files in /system/packages. Install and uninstall are instant since you are just adding or removing a single file, not extracting or deleting thousands. This eliminates the syscall overhead entirely at the OS level rather than working around it at the application level. Haiku is an open-source OS recreating BeOS, known for its responsiveness and clean design. While not mainstream, its package architecture demonstrates that the "extract everything to disk" model most package managers use is not the only option.

SquashFS for Container Runtimes

stabbles suggested SquashFS with zstd compression as another alternative. It is used by various container runtimes and is popular in HPC environments where filesystems often have high latency. SquashFS can be mounted natively on Linux or via FUSE, letting you access files normally while the data stays compressed on disk. When questioned about syscall overhead, stabbles noted that even though syscall counts remain high, latency is reduced because the SquashFS file ensures files are stored close together, benefiting significantly from filesystem cache. This is a different tradeoff than tar.gz: you still pay per-file syscall costs, but you gain file locality and can use standard file APIs without explicit decompression. One commenter warned that when mounting a SquashFS image via a loop device, you should use losetup --direct-io=on to avoid double caching (the compressed backing file and the decompressed contents both being cached), which can reduce memory usage significantly.

SQLite as an Alternative

tsanderdev mentioned that this is also why SQLite can be much faster than a directory with lots of small files. I had completely forgotten about SQLite as an option. Storing file contents in a SQLite database would eliminate the syscall overhead while providing random access to individual files, something tar.gz does not offer.

This also explains something I have heard multiple times: Apple uses SQLite extensively for its applications, storing structured data and metadata in SQLite databases rather than as individual files. snej clarified that Apple's SQLite-based APIs (CoreData, SwiftData) are database APIs with an ORM and queries, not filesystem simulations. The Photos app, for example, uses SQLite for metadata and thumbnails, but the actual photos remain as individual files. Still, the principle holds for the data that is stored in SQLite: if 100,000 files on a modern Mac with NVMe storage takes 14 seconds to read, imagine what it was like on older, slower machines. The syscall overhead would have been even more punishing. For workloads where random access to many small records is needed, SQLite avoids those syscalls entirely. This is worth exploring.
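As a sketch of that idea, here is a minimal Python example (paths and contents are made up) storing many small "files" as rows in one SQLite database, with random access by name:

```python
import sqlite3

# One open() for the database file replaces an open/read/close triple per
# small file, while still allowing random access to individual entries.
conn = sqlite3.connect(":memory:")  # use a real path for an on-disk store
conn.execute("CREATE TABLE files (path TEXT PRIMARY KEY, content BLOB)")
conn.executemany(
    "INSERT INTO files VALUES (?, ?)",
    ((f"pkg/src/file_{i}.dart", f"// contents {i}".encode()) for i in range(10_000)),
)
conn.commit()

# Random access to a single "file" without touching the filesystem again.
row = conn.execute(
    "SELECT content FROM files WHERE path = ?", ("pkg/src/file_123.dart",)
).fetchone()
print(row[0])
```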

Skip the Cleanup Syscalls

matthieum suggested a common trick used by batch compilers: never call free, close, or munmap, and instead let the OS reap all resources when the process ends. For a one-shot batch process like a compiler (or a lexer benchmark), there is no point in carefully releasing resources that the OS will reclaim anyway.

GabrielDosReis added a caveat: depending on the workload, you might actually need to call close, or you could run out of file descriptors. On macOS, you can check your limits with:

$ launchctl limit maxfiles
maxfiles    256            unlimited

$ sysctl kern.maxfilesperproc
kern.maxfilesperproc: 61440

The first number (256) is the soft limit per process, the second is the hard limit. kern.maxfilesperproc shows the kernel's per-process maximum. With 104,000 files, skipping close calls would exhaust even the maximum limit. dinosaurdynasty noted that the low default soft limit is a historical artifact of the select() syscall, which can only handle file descriptors below 1024. Modern programs can simply raise their soft limit to the hard limit and not worry about it.
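dinosaurdynasty's suggestion is a few lines in most languages. A hedged Python sketch (the 1,048,576 fallback is an arbitrary choice for when the hard limit reports as unlimited, and the retry loop exists because macOS rejects values above kern.maxfilesperproc):

```python
import resource

# "Raise the soft limit to the hard limit and not worry about it."
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

target = hard if hard != resource.RLIM_INFINITY else 1_048_576
while target > soft:
    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
        break
    except ValueError:
        target //= 2  # e.g. macOS caps RLIMIT_NOFILE at kern.maxfilesperproc

new_soft, _ = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"fd soft limit: {soft} -> {new_soft} (hard limit: {hard})")
```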

There is even a further optimization: use a wrapper process. The wrapper launches a worker process that does all the work. When the worker signals completion (via stdout or a pipe), the wrapper terminates immediately without waiting for its detached child. Any script waiting on the wrapper can now proceed, while the OS asynchronously reaps the worker's resources in the background. I had not considered this approach before, but it seems worth trying.

Dwedit noted that on Windows, a similar optimization is to call CloseHandle from a secondary thread, keeping the main thread unblocked while handles are being released.

Linker Strategies for Fast Exits

MaskRay added context about how production linkers handle this exact problem. The mold linker uses the wrapper process approach mentioned above, forking a child to do all the work while the parent exits immediately after the child signals completion. This lets build systems proceed without waiting for resource cleanup. The --no-fork flag disables this behavior for debugging. The wild linker follows the same pattern.

lld takes a different approach with two targeted hacks: async unlink to remove old output files in a background thread, and calling _exit instead of exit to skip the C runtime's cleanup routines (unless LLD_IN_TEST is set for testing).

MaskRay notes a tradeoff with the wrapper process approach: when the heavy work runs in a child process, the parent process of the linker (typically a build system) cannot accurately track resource usage of the actual work. This matters for build systems that monitor memory consumption or CPU time.

Why pub.dev Actually Uses tar.gz

Bob Nystrom from the Dart team clarified that my speculation about pub.dev's format choice was partially wrong. Fewer HTTP requests and bandwidth savings definitely factored into the decision, as did reduced storage space on the server. Atomicity is important too, though archives do not fully solve the problem since downloads and extracts can still fail. However, it is unlikely that the I/O performance benefits (faster extraction, reduced syscall overhead) were considered: pub extracts archives immediately after download, the extraction benefit only occurs once during pub get, that single extraction is a tiny fraction of a fairly expensive process, and pub never reads the files again except for the pubspec. The performance benefit I measured only applies when repeatedly reading from archives, which is not how pub works.

This raises an interesting question: what if pub did not extract archives at all? For a clean (non-incremental) compilation of a large project like the Dart Analyzer with hundreds of dependencies, the compiler needs to access thousands of files across many packages. If packages remained in an archive format with random access support (like ZIP), the syscall overhead from opening and closing all those files could potentially be reduced. Instead of thousands of open/read/close syscalls scattered across the filesystem, you would have one open call per package archive, then seeks within each archive. Whether the decompression overhead would outweigh the syscall savings is unclear, but it might be worth exploring for build systems where clean builds of large dependency trees are common.

Use dart:io for gzip Instead of package:archive

Simon Binder pointed out that dart:io already includes gzip support backed by zlib, so there is no need to use package:archive for decompression. Since dart:io does not support tar archives, I used package:archive for everything and did not think of mixing in dart:io's gzip support separately. Using dart:io's GZipCodec for decompression while only relying on package:archive for tar extraction could yield better performance. I will try this approach when I attempt to lex a bigger corpus.

TAR vs ZIP: Sequential vs Random Access

vanderZwan pointed out that ZIP files could provide SQLite-like random access benefits. This highlights a fundamental architectural difference between TAR and ZIP:

TAR was designed in 1979 for sequential tape drives. Each file's metadata is stored in a header immediately before its contents, with no central index. To find a specific file, you must read through the archive sequentially. When compressed as tar.gz, the entire stream is compressed together, so accessing any file requires decompressing everything before it. The format was standardized by POSIX (POSIX.1-1988 for ustar, POSIX.1-2001 for pax), is well-documented, and preserves Unix file attributes fully.

ZIP was designed in 1989 with a central directory stored at the end of the archive. This directory contains offsets to each file's location, enabling random access: read the central directory once, then seek directly to any file. Each file is compressed individually, so you can decompress just the file you need. This is why JAR files, OpenDocument files, and EPUB files all use the ZIP format internally.

  Aspect           | TAR                         | ZIP
  -----------------|-----------------------------|--------------------------------
  Random access    | No (sequential only)        | Yes (central directory)
  Standardization  | POSIX standard              | PKWARE-controlled specification
  Unix permissions | Fully preserved             | Limited support
  Compression      | External (gzip, zstd, etc.) | Built-in, per-file
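The difference is easy to see with Python's standard library. A sketch (member names and contents are arbitrary) that reads one member from each format:

```python
import io
import tarfile
import zipfile

# ZIP's central directory allows seeking straight to one member; tar must
# be walked sequentially, and tar.gz must be decompressed up to the member.
payload = {f"file_{i}.txt": f"contents {i}".encode() for i in range(100)}

zip_buf = io.BytesIO()
with zipfile.ZipFile(zip_buf, "w", zipfile.ZIP_DEFLATED) as zf:
    for name, data in payload.items():
        zf.writestr(name, data)

tar_buf = io.BytesIO()
with tarfile.open(fileobj=tar_buf, mode="w:gz") as tf:
    for name, data in payload.items():
        info = tarfile.TarInfo(name)
        info.size = len(data)
        tf.addfile(info, io.BytesIO(data))

# ZIP: read the central directory once, then jump to the one member.
with zipfile.ZipFile(io.BytesIO(zip_buf.getvalue())) as zf:
    wanted = zf.read("file_42.txt")

# tar.gz: everything before the member is decompressed to reach it.
with tarfile.open(fileobj=io.BytesIO(tar_buf.getvalue()), mode="r:gz") as tf:
    sequential = tf.extractfile("file_42.txt").read()

assert wanted == sequential == b"contents 42"
```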

There seems to be no widely-adopted Unix-native format that combines random access with proper Unix metadata support. TAR handles sequential access with full Unix semantics. ZIP handles random access but originated from MS-DOS and has inconsistent Unix permission support. What we lack is something like "ZIP for Unix": random access with proper ownership, permissions, extended attributes, and ACLs.

The closest answer is dar (Disk ARchive), designed explicitly as a tar replacement with modern features. It stores a catalogue index at the end of the archive for O(1) file extraction, preserves full Unix metadata including extended attributes and ACLs, supports per-file compression with choice of algorithm, and can isolate the catalogue separately for fast browsing without the full archive. However, dar has not achieved the ubiquity of tar or zip.

For my lexer benchmark, random access would not help since I process all files anyway. But for use cases requiring access to specific files within an archive, this architectural distinction matters.

Block-Based Compression

cb321 pointed out that there is a middle ground between uncompressed archives (random access but large) and fully compressed streams (small but sequential). Standard gzip compresses everything into a single block, so accessing any byte requires decompressing from the beginning. BGZF (Blocked GNU Zip Format), developed by genomics researchers for tools like samtools, compresses data in independent 64KB blocks. Each block is a valid gzip stream, so the file remains compatible with standard gunzip, but with an index you can seek directly to any block and decompress just that portion. This allows random access to multi-gigabyte genome files without decompressing terabytes of data. Zstd offers a similar seekable format with better compression ratios and faster decompression. For tar archives, combining block-based compression with an external file offset index could provide random access to individual files while still benefiting from compression.
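A toy Python sketch of the blocked approach (64 KB blocks and made-up data; real BGZF additionally records each block's compressed size in a gzip extra field and ships a binary index):

```python
import gzip

# Compress independent fixed-size blocks and keep an offset index.
# Concatenated gzip members still form a valid gzip stream, but with the
# index you can decompress just the block you need.
data = bytes(range(256)) * 4096   # ~1 MiB of sample data
BLOCK = 64 * 1024

blocks, index, offset = [], [], 0
for start in range(0, len(data), BLOCK):
    comp = gzip.compress(data[start:start + BLOCK])
    index.append((start, offset, len(comp)))  # (uncompressed pos, file pos, size)
    blocks.append(comp)
    offset += len(comp)
archive = b"".join(blocks)

# The whole archive still decompresses with plain gzip:
assert gzip.decompress(archive) == data

# Random access: find the block containing byte 500,000, decompress only it.
target = 500_000
ustart, fpos, size = next(e for e in reversed(index) if e[0] <= target)
block = gzip.decompress(archive[fpos:fpos + size])
assert block[target - ustart] == data[target]
```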

RE2C: A Faster Approach to Lexer Generation

rurban mentioned that RE2C generates lexers that are roughly 10x faster than flex. The key difference is architectural: while flex generates table-driven lexers that look up transitions in arrays at runtime, RE2C generates direct-coded lexers where the finite automaton is encoded directly as conditional jumps and comparisons. This eliminates table lookup overhead and produces code that is both faster and easier for CPU branch predictors to handle.

RE2C also supports computed gotos (via the -g flag), a GCC/Clang extension that compiles switch statements into indirect jumps through a label address table. For lexers with many states, this can significantly reduce branch mispredictions. Other optimizations include DFA minimization and tunnel automaton construction.

My ARM64 assembly lexer currently uses a table-driven approach, so exploring direct-coded generation is an interesting avenue. Another option is profile-guided optimization: compiling the lexer to LLVM IR and using PGO to optimize hot paths based on real Dart code patterns, something I mentioned as a future direction in my LLVM parser benchmarking post. Part of my lexer's speed advantage over the official Dart scanner likely comes from simplicity: my lexer is pure, maintaining only a stack for lexer states across multiple finite automata, while the Dart scanner must construct a linked list of tokens, handle error recovery, and manage additional bookkeeping. Isolating how much of the performance difference comes from architecture versus feature set is something I want to investigate further.

Game Engine Archives: MPQ and CASC

Iggyhopper pointed out that Blizzard Entertainment solved this same problem decades ago with their MPQ archive format (Mo'PaQ, short for Mike O'Brien Pack). First deployed in Diablo in 1996, MPQ bundles game assets (textures, sounds, models, level data) into large archive files with built-in compression, encryption, and fast random access via hash table indexing. The format was used across StarCraft, Diablo II, Warcraft III, and World of Warcraft. At GDC Austin 2009, Blizzard co-founder Frank Pearce revealed that WoW contained 1.5 million assets, a number that has only grown across subsequent expansions. In 2014, Blizzard replaced MPQ with CASC (Content Addressable Storage Container) starting with Warlords of Draenor, adding self-maintaining integrity checks and faster patching. The same principle from this blog post applies: bundling assets into large archives avoids the per-file overhead that would make loading millions of individual files impractical for a real-time game.

Amdahl's Law

fun__friday pointed out that the main takeaway is to measure before you start optimizing something, referencing Amdahl's law. This is a fair point, and this blog post is a textbook illustration of it: when lexing accounts for only ~17% of total execution time, even a 2x improvement in lexing yields only about a 1.09x overall speedup (1 / (0.83 + 0.17/2) ≈ 1.09). The theoretical maximum speedup from improving just the lexing component is bounded by the fraction of time spent on everything else. Measure first, optimize second.

That said, from a "business" standpoint it makes sense to focus on the largest bottlenecks (following, e.g., the critical path method) and the parts that take up the most time. However, software can be reused, and making a single component faster can have significant benefits for other consumers of that component. A faster lexer benefits not just this benchmark but every tool that uses it: formatters, linters, analyzers, compilers. I think our software community thrives in part because we do not strictly follow the common sense that business optimization dictates.

The Limits of Profiling

Ameisen expanded on the "measure first" advice with an important caveat: measuring can itself be very difficult or misleading. Three cases stand out.

First, "death by a thousand cuts," where many small inefficiencies individually appear as noise in a profiler but collectively add up to significant overhead. No single hotspot dominates, so there is nothing obvious to fix.

Second, indirect task dependencies, where speeding up one component has cascading benefits that a profiler will not attribute to it. Ameisen gives the example of a sprite resampling mod where a faster hashing algorithm not only helps the render thread directly but also keeps worker threads fed with data sooner, reducing overall latency in ways that are invisible in a flat profile.

Third, profilers show what is slow, not why it is slow. Cache invalidations from false sharing are a classic example: the profiler points at a slow memory access, but the actual cause (another thread writing to the same cache line) is hidden. The thing causing the slowdown and the thing made slow by it are different, and only the latter shows up in the profile.

In a follow-up comment, Ameisen shared concrete examples: concurrent workers running 30% slower because their output data needed its own cache line but did not have it (false sharing), and a render thread gaining a 20% speedup from removing a safety branch that always passed, because the branch triggered an undocumented CPU pipeline flush when followed by a locked instruction. Neither issue showed up meaningfully in a profiler.


Discuss on Hacker News

Discuss on r/ProgrammingLanguages

Discuss on r/programming

Discuss on Lobsters

Modestas Valauskas
Statistical Methods for Reliable Benchmarks
2026-01-06
https://modulovalue.com/blog/statistical-methods-for-reliable-benchmarks

Benchmarking is critical for performance-sensitive code. Yet most developers approach it with surprisingly crude methods: run some code, measure the time, compare the average against another piece of code. This approach is fundamentally flawed, and the numbers it produces can be actively misleading.

The good news is that there are simple statistical techniques that give us a much better understanding of how code actually performs. These techniques apply to every language, but for this post I will focus on Dart. I have written a package called benchmark_harness_plus that implements everything discussed here.

The Problem with Averages

Consider a simple benchmark that runs 10 times:

Run 1:  5.0 us
Run 2:  5.1 us
Run 3:  4.9 us
Run 4:  5.0 us
Run 5:  5.2 us
Run 6:  4.8 us
Run 7:  5.0 us
Run 8:  5.1 us
Run 9:  4.9 us
Run 10: 50.0 us  <- GC pause

The mean (average) is 9.5 us. But does this represent typical performance? Absolutely not. Nine out of ten runs completed in about 5 us. The mean is nearly double the actual typical performance because a single garbage collection pause skewed everything.

This is not a contrived example. GC pauses, OS scheduling, CPU throttling, and background processes constantly interfere with measurements. In real benchmarks, outliers are the norm, not the exception.

The Solution: Median

The median is the middle value when samples are sorted. For the data above:

Sorted: [4.8, 4.9, 4.9, 5.0, 5.0, 5.0, 5.1, 5.1, 5.2, 50.0]
Median: 5.0 us  (average of the two middle values)

The median correctly reports 5.0 us, completely ignoring the outlier. This is why benchmark_harness_plus uses median as the primary comparison metric.
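Python's statistics module confirms both numbers:

```python
import statistics

# The ten runs from above (microseconds), including the GC-pause outlier:
runs = [5.0, 5.1, 4.9, 5.0, 5.2, 4.8, 5.0, 5.1, 4.9, 50.0]

print(statistics.mean(runs))    # 9.5, dragged up by the single outlier
print(statistics.median(runs))  # 5.0, the typical run
```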

When to look at mean vs median:

The relationship between mean and median tells you about your data distribution:

  • Mean ≈ Median: Symmetric distribution, no significant outliers
  • Mean > Median: High outliers present (common in benchmarks, caused by GC and OS)
  • Mean < Median: Low outliers present (rare, might indicate measurement issues)

When you still need the mean

As editor_of_the_beast pointed out to me, referencing Marc Brooker's post Two Places the Mean Isn't Useless, the mean remains essential for capacity planning and throughput calculations. If you want to know how many requests per second your system can handle, you need the mean latency, outliers and all. Those GC pauses consume real time and affect actual throughput.

Little's Law (L = λ × W) only works with means, not medians or percentiles. If you need to calculate how many concurrent connections you can sustain, or how much buffer space you need, the mean is irreplaceable.

The distinction is this: for comparing which implementation is faster under typical conditions, use the median. For calculating system capacity where every millisecond counts toward the total, use the mean.
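A tiny worked example of that distinction (both figures are hypothetical):

```python
# Little's Law: L = lambda * W, where lambda is the arrival rate and W is
# the MEAN time each request spends in the system. The mean must include
# GC pauses and other outliers, since they occupy real capacity.
arrival_rate = 200     # requests per second (hypothetical)
mean_latency = 0.045   # seconds, outliers included (hypothetical)

concurrent = arrival_rate * mean_latency  # average requests in flight
print(round(concurrent, 1))  # 9.0
```

Swapping in the median latency here would understate how many requests are actually in flight whenever the distribution has high outliers.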

But how do I know if I can trust the results?

This is the question most benchmarking tools fail to answer. You get a number, but is it reliable? Could the next run produce something completely different?

The answer is the Coefficient of Variation (CV%).

CV% expresses the standard deviation as a percentage of the mean:

CV% = (standard deviation / mean) * 100

This normalizes variance across different scales. A standard deviation of 1.0 means very different things for a measurement of 10 us versus 1000 us. CV% makes them comparable.
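A short sketch of why this normalization matters, using made-up samples with identical absolute spread:

```python
import statistics

def cv_percent(samples):
    # CV% = (sample standard deviation / mean) * 100
    return statistics.stdev(samples) / statistics.mean(samples) * 100

# Identical absolute spread, very different relative noise:
fast = [9.0, 10.0, 11.0]        # mean 10 us,   stdev 1.0
slow = [999.0, 1000.0, 1001.0]  # mean 1000 us, stdev 1.0

print(f"{cv_percent(fast):.1f}%")  # 10.0%
print(f"{cv_percent(slow):.1f}%")  # 0.1%
```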

Trust thresholds:

  CV%    | Reliability | What it means
  -------|-------------|-----------------------------------------------------------------
  < 10%  | Excellent   | Highly reliable. Trust exact ratios.
  10-20% | Good        | Rankings are reliable. Ratios are approximate.
  20-50% | Moderate    | Directional only. You know which is faster, but not by how much.
  > 50%  | Poor        | Unreliable. The measurement is mostly noise.

When benchmark_harness_plus reports CV% > 50%, it warns you explicitly. You should not trust those numbers.

The Complete Picture

Here is what proper benchmark output looks like:

  Variant      |     median |       mean |    fastest |   stddev |    cv% |  vs base
  --------------------------------------------------------------------------------
  growable     |       1.24 |       1.31 |       1.05 |     0.15 |   11.5 |        -
  fixed-length |       0.52 |       0.53 |       0.50 |     0.02 |    3.8 |    2.38x
  generate     |       0.89 |       0.91 |       0.85 |     0.04 |    4.4 |    1.39x

  (times in microseconds per operation)

How to read this:

  1. Check CV% first. All values are under 20%, so these measurements are reliable.

  2. Compare medians. fixed-length (0.52 us) is fastest, growable (1.24 us) is slowest.

  3. Look at mean vs median. The growable variant has mean (1.31) > median (1.24), suggesting some high outliers. The others are close, indicating symmetric distributions.

  4. Check the ratios. fixed-length is 2.38x faster than growable. Because both have good CV%, this ratio is trustworthy.

What benchmark_harness_plus does differently

The standard benchmark_harness package reports a single mean value. benchmark_harness_plus implements several statistical best practices:

1. Multiple Samples

Instead of one measurement, the package collects multiple independent samples (default: 10). Each sample times many iterations of the code, then records the average time per operation. This gives us enough data points to compute meaningful statistics.

2. Proper Warmup

Before any measurements, each variant runs through a warmup phase (default: 500 iterations). This allows:

  • The Dart VM to JIT-compile hot paths
  • CPU caches to warm up
  • Lazy initialization to complete

Warmup results are discarded entirely.

3. Randomized Ordering

By default, the order of variants is randomized for each sample. This reduces systematic bias from:

  • CPU frequency scaling
  • Thermal throttling
  • Memory pressure changes over time

If variant A always runs before variant B, the second variant might consistently benefit from (or suffer from) the state left by the first.

4. Reliability Assessment

Every result includes CV%, and the package provides a reliability property that categorizes results as excellent, good, moderate, or poor. You no longer have to guess whether your numbers are meaningful.

Usage

import 'package:benchmark_harness_plus/benchmark_harness_plus.dart';

void main() {
  final benchmark = Benchmark(
    title: 'List Creation',
    variants: [
      BenchmarkVariant(
        name: 'growable',
        run: () {
          final list = <int>[];
          for (var i = 0; i < 100; i++) {
            list.add(i);
          }
        },
      ),
      BenchmarkVariant(
        name: 'fixed-length',
        run: () {
          final list = List<int>.filled(100, 0);
          for (var i = 0; i < 100; i++) {
            list[i] = i;
          }
        },
      ),
    ],
  );

  final results = benchmark.run(log: print);
  printResults(results, baselineName: 'growable');
}

The package includes three configuration presets:

BenchmarkConfig.quick     // Fast feedback during development
BenchmarkConfig.standard  // Normal benchmarking (default)
BenchmarkConfig.thorough  // Important performance decisions

You can also create custom configurations:

BenchmarkConfig(
  iterations: 5000,
  samples: 20,
  warmupIterations: 1000,
  randomizeOrder: true,
)

When Measurements Are Unreliable

If you see CV% values above 50%, your measurements are dominated by noise. Common causes:

Sub-microsecond operations. Very fast code is inherently difficult to measure accurately. Timer resolution becomes a limiting factor. Solution: increase iterations so each sample takes at least 10ms.

System interference. Background processes, browser tabs, other applications. Solution: close unnecessary programs, or accept that some variance is unavoidable.

Inconsistent input. If the code under test behaves differently based on input, and you are using random input, variance will be high. Solution: use deterministic test data.

The operation is genuinely variable. Some code has inherently variable performance (cache-dependent algorithms, I/O, network calls). In these cases, high CV% is not a measurement problem; it is telling you something true about the code.

Summary

The core techniques are simple:

  1. Use median, not mean. Median ignores outliers.
  2. Collect multiple samples. One measurement tells you almost nothing.
  3. Report CV%. Know whether you can trust your results.
  4. Warm up before measuring. Let the JIT do its work.
  5. Randomize variant order. Reduce systematic bias.

These principles apply to any language. For Dart, benchmark_harness_plus implements all of them with sensible defaults.

The package is available at pub.dev/packages/benchmark_harness_plus.


Addendum: The Case for the Fastest Time

Bob Nystrom from the Dart language team pointed out that the fastest time has a special property: it is an existence proof. If the machine ran the code that fast once, that represents what the code is actually capable of. Noise from GC, OS scheduling, and other interference can only add time, never subtract it. The minimum filters out that external noise and shows the algorithm's true potential.

This approach works well when comparing pure algorithms where you want to isolate the code's performance from system interference. For more complex cases involving throughput or real-world conditions, the noise is part of what you are measuring and should not be filtered out.

I have added a "fastest" column to benchmark_harness_plus (as of version 1.1.0) so this metric is now visible alongside median and mean.

Different Metrics for Different Questions

What has become clear from these discussions is that different metrics answer different questions:

  • Fastest (minimum): "How fast can this code run?" An existence proof of capability. Best for comparing pure algorithms where you want to isolate the code from system noise.

  • Median: "How fast does this code typically run?" Robust against outliers. Best for understanding typical performance under normal conditions.

  • Mean (average): "What is the total time cost?" Essential for capacity planning and throughput calculations where every millisecond counts toward the total.

There seems to be a gap in how we talk about benchmarking. We use the same word for very different activities: comparing algorithm efficiency, measuring system throughput, profiling latency distributions, and capacity planning. Each requires different statistical treatment, yet we often reach for the same crude tools.

Perhaps what we need is a clearer taxonomy of benchmarking types, with explicit guidance on which metrics matter for each. The fastest time, the median, and the mean are all valuable, but they answer fundamentally different questions. Knowing which question you are asking is the first step to getting a meaningful answer.

On GC Triggering

An earlier version of this package attempted to trigger garbage collection between variants by allocating and discarding memory. Vyacheslav Egorov from the Dart Compiler team pointed out that this is counterproductive: the GC is a complicated state machine driven by heuristics, and allocations can cause it to start concurrent marking, introducing more noise rather than reducing it.

The GC triggering logic has been removed as of version 1.2.0. A better approach for Dart 3.11+ is to use the dart:developer NativeRuntime API to record timeline events and check whether any GC occurred during the benchmark run, making GC visibility part of the report rather than trying to prevent it.


Discuss on r/dartlang

Modestas Valauskas
The Case for Snake Case: A Kolmogorov Complexity Argument
2025-12-27T00:00:00+01:00
https://modulovalue.com/blog/snake-case-vs-camel-case-kolmogorov-complexity

Software engineering is drowning in complexity. Much of it is unintended, implicit, and hidden beneath layers of convention we rarely question. Today, I want to examine one of these conventions: identifier naming. Specifically, I will argue that snake_case is objectively superior to camelCase, and I will use Kolmogorov complexity to make this case.

What is Kolmogorov Complexity?

Kolmogorov complexity measures the computational resources needed to specify an object. In practical terms, it asks: how much information, how many rules, how many external dependencies do we need to perform a given operation?

When we apply this lens to identifier naming conventions, the results are striking.

Parsing Identifiers: Where Complexity Hides

Consider the seemingly simple task of splitting an identifier into its component words. This operation is fundamental, both for tooling (linters, refactoring tools, documentation generators) and for human comprehension.

Snake Case: Minimal Complexity

components = identifier.split("_")

That is it. The entire algorithm fits in a single, trivial operation. The delimiter is explicit, unambiguous, and universal. The underscore character has the same meaning in ASCII, in Unicode, in every locale, in every context. No external knowledge is required. No lookup tables. No edge cases.

Camel Case: Hidden Complexity Explosion

# Good luck.

To split a camelCase identifier, you must:

  1. Find capitalization boundaries. This requires knowing which characters are "uppercase" and which are "lowercase."

  2. Consult the Unicode standard. Capitalization is not a property of characters in isolation. It is defined by the Unicode Standard, a specification that spans thousands of pages and is updated regularly. The uppercase/lowercase mapping for a single character can depend on locale, context, and version of the standard you are using.

  3. Handle abbreviations. Is XMLParser split as [XML, Parser] or [X, M, L, Parser]? What about parseHTTPSURL? The answer depends on implicit human knowledge, conventions that vary by codebase, team, and era. There is no algorithm that can reliably determine this without external context.

  4. Account for edge cases. What about iPhone? Or eBay? These are valid identifiers that violate the "rules" entirely.
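To see the ambiguity in code, here is a naive Python splitter. This is my own sketch using a common heuristic regex, not a canonical algorithm:

```python
import re

def split_camel(identifier: str) -> list[str]:
    # Heuristic: a run of uppercase letters not followed by a lowercase
    # letter is one "word" (XML); otherwise an optional uppercase letter
    # followed by lowercase letters is one word (Parser).
    return re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])", identifier)

print(split_camel("XMLParser"))      # → ['XML', 'Parser']
print(split_camel("parseHTTPSURL"))  # → ['parse', 'HTTPSURL']  (HTTPS + URL? no way to tell)
print(split_camel("iPhone"))         # → ['i', 'Phone']  (arguably one word)
```

Even this regex, which is already far from `split("_")`, only handles ASCII `[A-Z]`; supporting Unicode identifiers correctly means pulling in the full Unicode character database.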

The Kolmogorov complexity of camelCase parsing is not merely higher. It is unbounded in a practical sense, because it depends on an external, evolving standard (Unicode) and on implicit cultural knowledge that cannot be formalized.

Constructing Identifiers: The Same Story

Suppose you have a list of words and want to form an identifier.

Snake Case

identifier = "_".join(components)

Done. Append underscores between components. No transformation of the components themselves is required.

Camel Case

identifier = components[0].lower() + "".join(c.title() for c in components[1:])

This looks simple until you ask: what does title() actually do? The answer: it calls into Unicode case mapping. For the character "i", the uppercase form is "I" in most locales, but in Turkish it is "İ" (a capital I with a dot above). The title() function must either choose a locale, consult environment variables, or produce inconsistent results.

You have now introduced a dependency on:

  • The Unicode standard
  • Locale settings
  • Runtime environment configuration

Your identifier construction algorithm is no longer self-contained. Its Kolmogorov complexity has ballooned.
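Python makes the weirdness of Unicode case mapping easy to observe. These are real behaviors of CPython's locale-independent str methods, not hypotheticals:

```python
# Case mapping is not even one-to-one: uppercasing can change the length.
assert "ß".upper() == "SS"   # German sharp s becomes two characters

# Turkish dotted capital İ (U+0130) lowercases to *two* code points:
# "i" plus a combining dot above.
assert len("İ".lower()) == 2

# Python's str methods are locale-independent, so "i".upper() is always "I".
# Correct Turkish casing ("i" -> "İ") requires a locale-aware library instead.
assert "i".upper() == "I"

print("all case-mapping checks passed")
```

So a "simple" camelCase join already sits on top of mappings that are not one-to-one and not locale-neutral.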

Why This Matters

Some might argue this is academic. Who cares about edge cases with Turkish "i" or abbreviations?

I argue that this matters deeply, for several reasons:

1. Tooling reliability. Every refactoring tool, every linter, every code search engine that works with identifiers must solve this problem. The ambiguity in camelCase means these tools are either incomplete, inconsistent, or carry massive hidden complexity.

2. Internationalization. Software is global. Identifiers increasingly contain Unicode characters. A naming convention that relies on capitalization is fundamentally tied to the Latin alphabet's peculiar property of having case distinctions, a property that most of the world's writing systems do not share.

3. Cognitive load. When a human reads parseHTTPSURL, they must mentally segment it. Different readers will segment it differently. This ambiguity consumes cognitive resources that could be spent on understanding the actual logic.

4. The principle of least complexity. Unintended complexity is one of the greatest problems in software engineering today. It accumulates silently, manifesting as bugs, maintenance burden, and developer frustration. We should actively seek to minimize it.

An Objective Argument

I am not claiming snake_case is more aesthetically pleasing. Aesthetics are subjective. I am claiming that, by the objective measure of Kolmogorov complexity, snake_case requires fundamentally less information to parse and construct.

  • Snake case parsing: one operation, one delimiter, no external dependencies.
  • Camel case parsing: character classification, Unicode case mapping, abbreviation heuristics, cultural conventions.

The difference is not marginal. It is categorical.
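One concrete way to see the categorical difference (my own Python sketch): snake_case round-trips through split and join losslessly, while camelCase destroys the original casing of abbreviations:

```python
def snake_join(words): return "_".join(words)
def snake_split(s): return s.split("_")

def camel_join(words):
    return words[0].lower() + "".join(w.capitalize() for w in words[1:])

# Snake case: split and join are exact inverses.
s = "parse_https_url"
assert snake_join(snake_split(s)) == s

# Camel case: even given a "correct" split, rejoining loses the acronym
# casing, because capitalize() cannot know that HTTPS was an acronym.
words = ["parse", "HTTPS", "URL"]
assert camel_join(words) == "parseHttpsUrl"  # not "parseHTTPSURL"
```

In information-theoretic terms: snake_case encoding is invertible, camelCase encoding is lossy.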

Given the complexity argument above, one might wonder: how do major programming languages handle this? I surveyed the official style guides of twelve popular languages.

Language     Variables/Functions   Official Source
C++          snake_case            Google C++ Style Guide
Python       snake_case            PEP 8
Rust         snake_case            RFC 430
Ruby         snake_case            Ruby Style Guide
Java         camelCase             Oracle Code Conventions
JavaScript   camelCase             MDN Guidelines
Go           camelCase             Effective Go
C#           camelCase             Microsoft Naming Guidelines
Swift        camelCase             Swift API Design Guidelines
Kotlin       camelCase             Kotlin Coding Conventions
PHP          camelCase             PSR-1
Dart         camelCase             Effective Dart: Style

The score is 8-4 in favor of camelCase. Does this invalidate my argument?

No. Popularity is not an argument for correctness. Many of these conventions were established decades ago, when ASCII dominance made capitalization seem trivial, when tooling was primitive, and when the hidden costs of implicit complexity were not yet understood.

Consider that Python, one of the most widely adopted languages of the past decade, chose snake_case. Rust, designed with modern sensibilities about safety and correctness, also chose snake_case. Ruby, known for developer happiness, chose snake_case.

The camelCase languages reveal a pattern of convention inheritance rather than deliberate design. Java popularized camelCase in the 1990s. JavaScript adopted "Java" in its name for marketing reasons (Brendan Eich himself considered it "a marketing ploy by Netscape") and likely copied the convention. C# was Microsoft's answer to Java, developed after Sun's lawsuit forced them to abandon their Java implementation. Dart was Google's attempt to replace JavaScript, as revealed in a leaked 2010 internal memo where the language (then called "Dash") was designed to "ultimately replace JavaScript as the lingua franca of web development." Go was designed for programmers "early in their careers" who are "most familiar with procedural languages, particularly from the C family." Swift inherited from Objective-C, which had used camelCase since the NeXT era. PHP started as a personal project ("Personal Home Page") and grew organically. In most cases, the choice was made to fit existing convention, not because someone analyzed the complexity tradeoffs.

Conclusion

The next time someone dismisses naming conventions as "just style," consider the hidden complexity beneath the surface. Snake case is not merely a preference. It is the convention with lower Kolmogorov complexity, fewer external dependencies, and less room for ambiguity.

In an industry that struggles daily with accidental complexity, choosing the simpler encoding for something as fundamental as identifiers is not pedantry. It is engineering discipline.

use_snake_case. Your future self, your tools, and your international colleagues will have one less thing to worry about.


Addendum

I received quite a lot of pushback on this post, which ironically gave me even better arguments to support my case.

1. Snake case eliminates the need for abbreviation style guides. With camelCase, every organization needs rules for handling abbreviations. The .NET guidelines differ from legacy Java guidelines, which differ from Google's guidelines. Is it HttpUrl or HTTPURL or HTTPUrl? With snake_case, it is simply http_url. No style guide needed. No debates.

2. I am arguing against camelCase, not PascalCase. Snake case and PascalCase can coexist peacefully. Use snake_case for variables and functions, PascalCase for types. The problem is specifically lowerCamelCase, which should be snake_case instead.

3. Snake case and camelCase can be combined meaningfully. Within snake_case, underscores serve as the primary separator. This frees up camelCase to denote something else entirely within each component. For example, parse_XMLDocument or get_userId could use camelCase to preserve domain-specific casing while still having unambiguous word boundaries. You gain an additional layer of expressiveness.

4. The complexity argument is real at the machine level. Try implementing split and join in assembly. For snake_case, you need a few bytes of code and a loop: scan for underscore, done. For camelCase, you need a Unicode parser and lookup tables that can exceed hundreds of kilobytes. The Kolmogorov complexity difference is not abstract theory; it manifests directly in binary size and execution complexity.

5. Snake case helps with semantic search. Someone suggested that splitting parseHTTPSURL into p a r s e h t t p s u r l would enable fuzzy matching. This does not hold up. LLMs are trained on tokens from real-world text, not individual characters. They can infer that https_url relates to HTTPS and URL, but h t t p s u r l introduces uncertainty. Vector databases using metric spaces like Levenshtein distance will rank p lower than https. Vector embeddings have the same problem: p is nowhere close to being a synonym of https. With snake_case, the components are already explicitly separated as meaningful tokens: parse_https_url. No character-level decomposition needed.

6. Yes, camelCase saves space. Someone noted that "technically camelCase is better in terms of space utilization but that is it." Well, I agree. That is indeed the one advantage.

Modestas Valauskas
IIFEs are Dart's most underrated feature
2025-12-23T12:00:00+01:00
https://modulovalue.com/blog/iifes-are-darts-most-underrated-feature

IIFEs in Dart are severely underrated and barely anyone seems to agree with me. This is a hill I'm willing to die on, and I've decided to collect my thoughts into a blog post that will hopefully get you on this hill, too.

IIFE stands for Immediately Invoked Function Expression. What does that even mean? Let's start at the beginning. Bear with me. If you already know what IIFEs are, feel free to skip to the use cases.


What's an expression?

A programming language consists of different kinds of building blocks. An expression is one of them: it describes the values your program produces. Examples include literals like 123 or 'hello', variable references, arithmetic like 1 + 2, and function calls.

What's a function?

A function is a collection of statements (and since expressions can be statements, also a collection of expressions). Functions have a name, parameters, a return type, and a body where statements live:

String greet(String name) {
  123; // expression statement
  final message = 'Hello, $name!'; // expression on the right-hand side
  return message;
}

Here, greet is the name, String name is the parameter, the return type is String, and everything between the curly braces is the body. The body contains three statements: an expression statement (123;), a variable declaration with an expression on the right-hand side, and a return statement.

There are many other places where expressions can exist, but these are the most relevant for now.

What's a function expression?

Dart supports anonymous functions: expressions that are functions. Anonymous means the function has no name, and its return type is implicit: it can't be written out and is inferred automatically:

final a = () {};

final b = (String name) {
  final message = 'Hello, $name!';
  return message;
};

What's an invocation?

To "invoke" something means essentially to call or execute something.

void main() {
  print("Hello World");
}

In that example, print is a function that was invoked, or in other words, called. You can also be very explicit about calling something in Dart and call the call method of a Function:

void main() {
  print.call("Hello World");
}

Think of calling and invoking something as being one and the same thing.

What's immediacy?

What happens if we immediately invoke (call) a function expression? Let me present to you, an immediately invoked function expression:

void main() {
//vvvvv function expression
  () {}();
  //   ^^ invocation
}

That's an expression that is a function expression that is being invoked directly where it was defined.

Admittedly, this looks weird at first sight, but everything in programming does the first time you see it. The question is: what does it give us? And IIFEs give us a whole lot.

IIFE use cases

There are many common annoyances in Dart that are immediately (no pun intended) solved by using IIFEs.

Dart has no if-expression

Dart only supports if statements. Ternary expressions work for simple cases, but they become unreadable with multiple conditions or when you need to execute statements.

With an IIFE, you get an if-expression:

final result = () {
  if (condition1) {
    return "a";
  } else if (condition2) {
    return "b";
  } else {
    return "c";
  }
}();

This is especially useful when initializing final variables that depend on complex logic.

IIFEs reduce mental load by scoping things locally

Consider:

void foo() {
  // Many lines of unrelated code
  // ...
  () {
    final a = 123;
    final b = "abc";
    // ...
  }();
  // Many lines of other unrelated code
  // ...
}

Anything defined in the IIFE doesn't pollute the scope that follows it. Without it, you'd have to be careful not to redeclare or overwrite values that were declared earlier, and you could accidentally shadow names used elsewhere.

Note: You might wonder why you need the ()() at all. Dart does support plain block statements { ... } for scoping without the function wrapper. But blocks are statements, not expressions, so you can only use them where statements are allowed. IIFEs give you scoping and an expression you can use anywhere.

This, for example, is an invalid program since foo is declared locally, but the intention is to use the global foo:

void foo() {}

void main() {
  final foo = 123;
  print(foo);
  foo(); // Error: 'foo' isn't a function
}

However, this is fine, since foo only exists within the IIFE:

void foo() {}

void main() {
  () {
    final foo = 123;
    print(foo);
  }();
  foo(); // Works: calls the global foo
}

Region indicators

Most IDEs support collapsible regions. IIFEs are a natural way to tell your IDE that you want something to be collapsible.

Collapsible IIFE in IDE

The markers on the left give you the opportunity to collapse the whole body of a function expression. Why is that useful? If you need to make sense of what's going on in your codebase, it helps to ignore things you've already determined are irrelevant to the problem at hand. Collapsible regions let you do exactly that.

IIFEs give you statements everywhere

For Flutter to be fun, you should know what an IIFE is.

Widget build(BuildContext context) {
  return Column(
    children: [
      () {
        // Complex logic here
        if (isLoading) {
          return CircularProgressIndicator();
        }
        return Text(data);
      }(),
      // More widgets...
    ],
  );
}

Flutter critics tend to complain about deeply nested widget trees. Well, if you use an IIFE, that's no longer a problem. You can always exchange a Flutter widget (which is an expression) for an IIFE, flatten the logic, and return the widget you need.

This also allows you to copy and paste a list of statements into places that support expressions and places that support statements interchangeably.

IIFEs also help reduce widget duplication. If multiple branches of a conditional return the same outer widget, an IIFE lets you factor it out:

// Before: duplicated Card across three branches
if (user.isPremium) {
  return Card(
    elevation: 4,
    margin: EdgeInsets.all(8),
    child: Column(children: [
      Icon(Icons.star, color: user.tier.color),
      Text(user.displayName),
      Text('Member since ${user.joinDate.year}'),
    ]),
  );
} else if (user.isTrialActive) {
  return Card(
    elevation: 4,
    margin: EdgeInsets.all(8),
    child: Column(children: [
      Icon(Icons.hourglass_top),
      Text(user.displayName),
      Text('${user.trialDaysLeft} days left'),
    ]),
  );
} else {
  return Card(
    elevation: 4,
    margin: EdgeInsets.all(8),
    child: Column(children: [
      Icon(Icons.person),
      Text(user.displayName),
      Text('Upgrade to premium'),
    ]),
  );
}

// After: Card and Column factored out with IIFE
Card(
  elevation: 4,
  margin: EdgeInsets.all(8),
  child: Column(
    children: () {
      if (user.isPremium) {
        final memberDuration = DateTime.now().difference(user.joinDate);
        final years = memberDuration.inDays ~/ 365;
        final badge = years >= 5 ? Icons.diamond : Icons.star;
        return [
          Icon(badge, color: user.tier.color),
          Text(user.displayName),
          Text('Member for $years years'),
        ];
      } else if (user.isTrialActive) {
        final daysLeft = user.trialEnd.difference(DateTime.now()).inDays;
        final isUrgent = daysLeft <= 3;
        return [
          Icon(Icons.hourglass_top, color: isUrgent ? Colors.red : null),
          Text(user.displayName),
          Text('$daysLeft days left'),
        ];
      } else {
        return [
          Icon(Icons.person),
          Text(user.displayName),
          Text('Upgrade to premium'),
        ];
      }
    }(),
  ),
)

Yes, you could put the outer widget in a new function, but that adds complexity and increases the mental load. Do you make it public/private? What do you call it? Where do you put it? In my view, having such helper functions is unhelpful since they will only be used in one place. You don't need them at all when you use IIFEs.

IIFEs can return null in widget lists

As u/Dustlay pointed out, IIFEs have an advantage over Builder widgets: they can return null. A WidgetBuilder must return a Widget, so you'd need at least an empty SizedBox(). But in a Row or Column, that empty widget can mess with spacing.

Column(
  children: [
    Text('Header'),
    // Builder can't return null - you'd need SizedBox() which affects spacing
    // Builder(builder: (context) => showExtra ? ExtraWidget() : SizedBox()),

    // IIFE with ?() can return null
    ?() {
      if (!showExtra) return null;
      final data = computeSomething();
      return ExtraWidget(data: data);
    }(),
    Text('Footer'),
  ],
)

The ? is a null-aware element applied to an IIFE. When the IIFE returns null, the element is omitted from the list entirely. No phantom SizedBox taking up space or interfering with MainAxisAlignment.spaceBetween.

Try-catch as an expression

Dart has no try-catch expression. With an IIFE, you can handle errors and return a value in one go:

final config = () {
  try {
    return jsonDecode(configString);
  } catch (e) {
    return defaultConfig;
  }
}();

This is particularly useful for parsing, file operations, or any fallible initialization where you want a guaranteed value.

Late final with complex initialization

When a late final field needs more than a simple expression to initialize:

class DataProcessor {
  late final Map<String, Handler> _handlers = () {
    final map = <String, Handler>{};
    for (final type in supportedTypes) {
      map[type.name] = type.createHandler();
      map['${type.name}_legacy'] = type.createLegacyHandler();
    }
    return map;
  }();
}

The alternative would be initializing in a constructor or a separate method, but the IIFE keeps the initialization logic right where the field is declared.

More advanced: together with late, IIFEs help you define Excel-style data flow graphs inside of classes without having to add a ton of constructor, initialization, or function declaration boilerplate.

Null-safe value extraction

When you need to safely extract a value through multiple nullable layers:

final userName = () {
  final user = response.data?.user;
  if (user == null) return 'Anonymous';
  if (user.displayName?.isNotEmpty == true) {
    return user.displayName!;
  }
  return user.email?.split('@').first ?? 'User ${user.id}';
}();

This is cleaner than deeply nested ternaries or spreading the logic across multiple statements that pollute your scope.

Switch expressions only support expressions

Dart 3's switch expressions have a limitation: each arm can only contain a single expression. You can't execute statements, declare variables, or add debug logging inside a switch arm.

// This doesn't work - statements aren't allowed in switch expressions
final result = switch (status) {
  Status.loading => {
    print('Loading...'); // Error: statements not allowed
    return LoadingWidget();
  },
  Status.error => ErrorWidget(),
  Status.success => SuccessWidget(),
};

IIFEs solve this elegantly:

final result = switch (status) {
  Status.loading => () {
    print('Loading state entered');
    final message = computeLoadingMessage();
    return LoadingWidget(message: message);
  }(),
  Status.error => () {
    logError(status.error);
    return ErrorWidget(retry: handleRetry);
  }(),
  Status.success => SuccessWidget(data: status.data),
};

Without IIFEs, you'd need to extract each complex arm into a separate function, scattering related logic across your codebase. The IIFE keeps everything inline and readable.

Debug-only code with assert

Dart's assert statements are removed entirely in production builds. Since assert takes an expression, you can use an IIFE to run arbitrary debug-only code with zero production overhead. This is a pattern commonly used by Flutter. Thanks to u/SchandalRwartz for pointing this out.

Here's an example from the Flutter engine:

assert(() {
  // In debug mode, register the schedule frame extension.
  developer.registerExtension('ext.ui.window.scheduleFrame', _scheduleFrame);

  // In debug mode, allow shaders to be reinitialized.
  developer.registerExtension('ext.ui.window.reinitializeShader', _reinitializeShader);

  return true;
}());

The IIFE returns true so the assertion passes, but the real purpose is executing the statements inside. In production, the entire assert statement disappears, including the IIFE and all its side effects.

This is useful for debug logging, registering development tools, or running expensive validation that you only want during development.

Performance

A common concern: "Don't IIFEs create overhead?" I analyzed this by examining what both the Dart VM and dart2js produce.

If-else benchmark

void main() {
  for (int i = 0; i < 100000; i++) { getX(); getY(); }
  print('Done warming up');
  final sw1 = Stopwatch()..start();
  for (int i = 0; i < 10000000; i++) { getX(); }
  sw1.stop();
  final sw2 = Stopwatch()..start();
  for (int i = 0; i < 10000000; i++) { getY(); }
  sw2.stop();
  print('IIFE: ${sw1.elapsedMicroseconds}us');
  print('Traditional: ${sw2.elapsedMicroseconds}us');
}

@pragma('vm:never-inline')
int getX() {  // IIFE version
  final result = () {
    if (condition()) { return expensive() * 2; }
    else { return 42; }
  }();
  return result;
}

@pragma('vm:never-inline')
int getY() {  // Traditional version
  final int result;
  if (condition()) { result = expensive() * 2; }
  else { result = 42; }
  return result;
}

@pragma('vm:prefer-inline')
bool condition() => DateTime.now().millisecondsSinceEpoch % 2 == 0;
@pragma('vm:prefer-inline')
int expensive() => DateTime.now().microsecondsSinceEpoch;

Run with dart --print-flow-graph-optimized benchmark.dart to see the optimized IL:

getX (IIFE):

B1[function entry]:2
    CheckStackOverflow:8(stack=0, loop=0)
    v75 <- StaticCall:16( _getCurrentMicros@0150898<0> ) T{int}
    Branch if RelationalOp:12(<, v75 T{_Smi}, v41) T{bool} goto (28, 29)
    ...
    Branch if TestInt(v87, v26) goto (5, 6)
B5[target]:20  // condition() returned true
    v68 <- StaticCall:16( _getCurrentMicros@0150898<0> ) T{int}
    v18 <- BinarySmiOp:24(<<, v68 T{_Smi}, v26) T{_Smi}  // expensive() * 2
    goto B7
B6[target]:30  // condition() returned false
    goto B7
B7[join]:19 pred(B5, B6) {
    v27 <- phi(v18 T{_Smi}, v24 T{_Smi})  // v24 is constant 42
}
    DartReturn:22(v27)

getY (Traditional):

B1[function entry]:2
    CheckStackOverflow:8(stack=0, loop=0)
    v57 <- StaticCall:16( _getCurrentMicros@0150898<0> ) T{int}
    Branch if RelationalOp:12(<, v57 T{_Smi}, v22) T{bool} goto (26, 27)
    ...
    Branch if TestInt(v69, v23) goto (3, 4)
B3[target]:22  // condition() returned true
    v50 <- StaticCall:16( _getCurrentMicros@0150898<0> ) T{int}
    v78 <- BinarySmiOp:26(<<, v50 T{_Smi}, v23) T{_Smi}  // expensive() * 2
    goto B5
B4[target]:28  // condition() returned false
    goto B5
B5[join]:32 pred(B3, B4) {
    v5 <- phi(v78 T{_Smi}, v4)  // v4 is constant 42
}
    DartReturn:40(v5)

The structure is identical. The IIFE is completely inlined with no closure allocation or call overhead.

Switch expression benchmark

enum Status { loading, error, success }

void main() {
  final statuses = [Status.loading, Status.error, Status.success];
  for (int i = 0; i < 100000; i++) {
    getWithIIFE(statuses[i % 3]);
    getTraditional(statuses[i % 3]);
  }
  print('Done warming up');
  final sw1 = Stopwatch()..start();
  for (int i = 0; i < 10000000; i++) { getWithIIFE(statuses[i % 3]); }
  sw1.stop();
  final sw2 = Stopwatch()..start();
  for (int i = 0; i < 10000000; i++) { getTraditional(statuses[i % 3]); }
  sw2.stop();
  print('Switch+IIFE: ${sw1.elapsedMicroseconds}us');
  print('Traditional: ${sw2.elapsedMicroseconds}us');
}

@pragma('vm:never-inline')
String getWithIIFE(Status status) {  // Switch expression with IIFEs
  return switch (status) {
    Status.loading => () {
      final msg = computeMessage('load');
      return 'Loading: $msg';
    }(),
    Status.error => () {
      final code = getErrorCode();
      return 'Error $code: ${computeMessage('err')}';
    }(),
    Status.success => 'OK',
  };
}

@pragma('vm:never-inline')
String getTraditional(Status status) {  // Traditional switch statement
  switch (status) {
    case Status.loading: return 'Loading: ${computeMessage('load')}';
    case Status.error: return 'Error ${getErrorCode()}: ${computeMessage('err')}';
    case Status.success: return 'OK';
  }
}

@pragma('vm:prefer-inline')
String computeMessage(String prefix) => '$prefix-${DateTime.now().millisecond}';
@pragma('vm:prefer-inline')
int getErrorCode() => DateTime.now().second;

The optimized IL for both versions is structurally identical, just like the if-else case. The VM inlines IIFEs in switch expression arms just as effectively.

dart2js output

Compile with dart compile js -O2 -o out.js file.dart and inspect the output:

Traditional version (inlined directly):

// Cleaned up:
getY() {
  var result;
  if (condition()) {
    result = expensive() * 2;
  } else {
    result = 42;
  }
  return result;
}

// Actual minified output:
// cg(){var t,s=A.cb()
// if(A.ak(new A.H(Date.now(),0,!1))>500){Date.now()
// t=0}else t=42
// A.ci("IIFE: "+s+", Traditional: "+t)}

IIFE version (closure as prototype method):

// Cleaned up:
getX() {
  return new A.ai().$0();  // <-- allocates closure, calls $0
}

A.ai.prototype = {
  $0() {
    if (condition()) {
      return expensive() * 2;
    } else {
      return 42;
    }
  }
}

// Actual minified output:
// cb(){return new A.ai().$0()},
// A.ai.prototype={
// $0(){if(A.ak(new A.H(Date.now(),0,!1))>500){Date.now()
// return 0}else return 42},
// $S:0}

The IIFE version has a small overhead: new A.ai().$0() allocates a closure object and calls through $0. However, V8 and other modern JS engines inline these aggressively, so benchmarks show no measurable difference in practice.

Bottom line: IIFEs have zero runtime cost in optimized Dart VM code, and negligible cost in JavaScript. Use them freely.

Conclusion

IIFEs are a simple concept with broad applications. They give you expressions where Dart only offers statements, scoping where Dart gives you a flat namespace, and flexibility where the language is rigid.

The syntax () {}() might look odd at first, but once you internalize it, you'll start seeing opportunities everywhere: that complex ternary that's getting out of hand, that variable leaking into scope where it shouldn't, that widget tree begging for a bit of logic.

Dart 3 added switch expressions and if-case patterns, which cover some of the ground IIFEs used to own. But IIFEs remain more general. They're not a feature the language gave you; they're a pattern that emerges from first principles. And patterns that emerge from first principles tend to stick around.

Give IIFEs a chance. Your code will thank you.


PS: IIFEs emerged as a concept in the JavaScript world (MDN reference) where they are very common. The thing about Dart is that IIFEs are even cleaner than in JS. To use a function expression with a function body, you don't need any function keywords in Dart, they just work. And you can immediately call them without wrapping parentheses. That's actually very cool, and I can't imagine a cleaner syntax for IIFEs than what Dart offers. I'd love to see the Dart community not reinvent the wheel, but actually use the wheel we have, because our wheel is even better.

PPS: IIFEs increase the expressivity of statements and expressions by providing a way to merge them. Suppose we don't know what IIFEs are and our language consists only of statements and expressions: then there is no way to put arbitrary statements inside arbitrary expressions. Once we add IIFEs to our vocabulary, we always can. Similar to how expression statements allow you to put expressions into statements, IIFEs can be seen as "statement expressions" because they allow you to put statements into expressions.


Discuss on Reddit

Modestas Valauskas
I failed to run Dart on the web (but FYI you can run Linux on the web)
2025-12-21T14:00:00+01:00
https://modulovalue.com/blog/i-failed-to-run-dart-on-the-web

I had what I thought was a fun idea: what if you could compile and run Dart code entirely in the browser? No server, no cloud VM, just pure client-side execution.

The plan was simple:

  1. Use v86, a JavaScript x86 emulator, to boot Linux in the browser
  2. Use Monaco Editor for a nice code editing experience
  3. Download the Dart SDK into the virtual Linux environment
  4. Compile and run Dart code

I got steps 1 and 2 working. Step 3 is where it all fell apart.

What I Built

Here's the working prototype, the actual demo running live right here in this blog post:

Go ahead, try it. On the left, you have a code editor. On the right, a Linux terminal. Write a script, click "Run Script", and watch it execute in a real Linux environment, all running in your browser.

The fascinating part? That's not a terminal emulator pretending to be Linux. That's actual Linux. The output shows a real root filesystem with /bin, /dev, /proc, and all the standard directories you'd expect. (Open in a new tab if you prefer a full-page view.)

The v86 Magic

v86 is an x86 emulator written in JavaScript and WebAssembly. It can boot real operating systems: Linux, Windows 98, FreeDOS, and others. The emulation is accurate enough to run unmodified operating system images.

If you've never seen this before, take a moment to appreciate how wild it is. Your browser is running a complete x86 CPU emulation, which is running a Linux kernel, which is running a shell, which is executing your commands. All of this happens entirely client-side.

Why It Failed

The Dart SDK requires glibc (the GNU C Library). Most tiny Linux distributions use musl libc instead because it's much smaller. The distributions that support glibc are significantly larger and take much longer to boot.

I tried several approaches:

Alpine Linux: Fast and small, but uses musl. Dart doesn't work.

Buildroot with glibc: I could configure it to use glibc, but the resulting image was too large and boot times were unacceptable for a web demo.

Debian/Ubuntu minimal: Way too large. Boot times measured in minutes, not seconds.

The fundamental problem is that the Dart SDK is designed for real machines with real resources. It expects glibc, it expects fast disk I/O, and it expects more memory than a browser-based x86 emulator can reasonably provide.

What I Learned

Even though the project didn't achieve its goal, I found the exploration worthwhile.

v86 is impressive. The fact that you can boot Linux in a browser is remarkable. For educational purposes, lightweight Linux tools, or just showing off what WebAssembly can do, it's a fantastic project.

The web platform keeps surprising me. Between WebAssembly, SharedArrayBuffer, and modern JavaScript APIs, browsers can do things that would have seemed impossible a decade ago.

Some tools just aren't meant for constrained environments. The Dart SDK is designed for development machines. Trying to squeeze it into a browser-emulated Linux with limited resources was always going to be a stretch.

A Challenge

I'm setting this aside for now, but maybe you can get it to work. If you manage to get Dart running in the browser via v86 (or any other way), I'd love to hear about it.

Addendum: It was brought to my attention by Norbert Kozsir that Mike Diarmid managed to compile the Dart VM to WASM: see his tweet. Maybe he'll actually share his approach with the world? The challenge remains: there's no open source way to do this.

Addendum 2: u/Journeyj012 pointed out that Linux also runs in a PDF. Few people know this, but PDF supports JavaScript. Check out linuxpdf on GitHub.

Discuss on Reddit: r/linux


Modestas Valauskas
Concepts Reader: a backup plan for my life
2025-12-20T14:00:00+01:00
https://modulovalue.com/blog/concepts-reader-a-backup-plan-for-my-life

I manage my entire life in Concepts, the infinite canvas drawing app for iPad.

Open full app in new tab | Source code on GitHub

Not notes. Not lists. Graphs.

I draw what I call TODO graphs: dependency diagrams where tasks, ideas, and projects connect to each other visually. Every area of my life (work, personal projects, long-term goals) exists as interconnected nodes on an infinite canvas. This isn't a productivity hack. It's how my brain works.

Example of a TODO graph in Concepts

An example board. Center: current TODOs. Periphery: research papers I'm reading or implementing. The graphs have topmost nodes with no dependencies; those are tasks I can work on today.

The Concepts team has been stellar. Responsive support, thoughtful updates, a product that genuinely feels like it was made for people who think visually. I've been a happy user for years.

But recently, I've been worried.

The Problem with "Infinite"

My boards have grown. A lot. Some of them are now gigabytes in size. I've already had to buy a new iPad once because older devices couldn't handle the memory requirements. The app occasionally struggles with rendering performance on my largest canvases.

This isn't a complaint about Concepts. They're doing impressive work pushing the boundaries of what's possible on an iPad. But the reality is that my use case is extreme, and there's always a chance that future updates, iOS changes, or simply the accumulation of more data will eventually break things beyond repair.

And if that happens? My "life" is trapped in a proprietary format.

Well, it happened before.

One day my board simply wouldn't open on my iPad anymore. I felt lost. Luckily, you can AirDrop boards to your computer, and I discovered that .concept files are just ZIP archives. I tried compressing the embedded images to reduce the file size. That didn't work. The runtime apparently uses raw pixel data, so compression doesn't help. I had to manually replace images with smaller versions until the board would load again.

That was the first time I started to have trust issues.

Recently, performance problems appeared. Scrolling became sluggish. I couldn't reproduce the issues in other boards. Support asked me to share my board so they could investigate.

Share my board? Share my life with them? No.

Building a Safety Net

So I wrote a viewer. Just a proof of concept to answer one question: if Concepts stops working for me tomorrow, can I still access my data?

The answer is yes.

The app above is written in Flutter. Drop a .concept file onto it, and it renders all your strokes and images. Pan around, zoom in, see your work.

It's not Concepts. You can't edit anything. But you can see everything. And sometimes that's enough.

What's Inside a .concept File

Turns out, .concept files are just ZIP archives with a specific structure:

  • Strokes.plist contains all the stroke data in Apple's binary plist format
  • Resources.plist maps image references to actual files
  • Drawing.plist stores the document's transform matrix
  • ImportedImages/ holds embedded images

The stroke data itself is straightforward: arrays of points with pressure, color, and brush information. Parsing it required writing a binary plist parser, but the format is well-documented.
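As a hedged sketch of how I'd sanity-check these files by hand: a .concept file should start with the ZIP signature, and the plists inside should start with the binary plist magic. The byte lists below stand in for file contents (in a real script they would come from File(...).readAsBytesSync()); the helper names are mine:

```dart
// 'P' 'K' is the signature at the start of every ZIP archive.
bool looksLikeZip(List<int> bytes) =>
    bytes.length >= 2 && bytes[0] == 0x50 && bytes[1] == 0x4B;

// Apple binary plists start with the ASCII magic "bplist00".
bool looksLikeBinaryPlist(List<int> bytes) =>
    bytes.length >= 8 && String.fromCharCodes(bytes.take(8)) == 'bplist00';

void main() {
  // A .concept file is a ZIP, so it begins with the 'PK' signature.
  print(looksLikeZip([0x50, 0x4B, 0x03, 0x04])); // true
  // Strokes.plist inside it begins with the "bplist00" magic.
  print(looksLikeBinaryPlist('bplist00xyz'.codeUnits)); // true
}
```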

Why Flutter?

I wanted this to run everywhere. Web, desktop, mobile. Flutter made that trivial. The same codebase works on all platforms, and the web version means anyone can try it without installing anything.

Work in Progress

The viewer isn't feature complete yet. Some strokes render incorrectly. Concepts is a complex app with many features, and supporting them all in the viewer is not straightforward.

If you find bugs, please report them on GitHub. A minimal repro file helps a lot.

The Lesson

This isn't about incompatibility or Concepts doing something wrong. Concepts was never designed to be a production-grade research tool or a graph visualization system. I've been surprised time and time again by how far it goes, how much it handles, how well it scales. The team has exceeded every reasonable expectation.

But the real question is: what happens when you reach the limits of your tool and there's no alternative that supports your use case?

That's why having a backup plan matters. Not because your tool will fail you, but because you might outgrow it. A read-only viewer won't replace what you've built your workflow around, but it means your data isn't trapped if you ever need to move on.

Addendum: PDF Support

I've added PDF rendering support to the viewer, and I'm impressed.

Flutter renders PDFs much better than Concepts does, without the stutters I experience in Concepts when navigating large boards with many PDF pages. This surprised me. Concepts claims to have a custom renderer, so I assumed they would be highly optimized for exactly these use cases. Apparently, you don't need a custom renderer if you're using a good framework like Flutter.

There are a couple of possible explanations. I'm rendering PDF pages at a lower resolution than Concepts does. Additionally, on a macOS desktop, I have all my RAM available, whereas iPadOS caps the RAM each app may use, even when the device has free memory to spare. There's an upper limit per app, regardless of what's actually available.


Modestas Valauskas
Understanding Dart class modifiers by using lattices
2025-12-18T10:00:00+01:00
https://modulovalue.com/blog/understanding-dart-class-modifiers-lattices

Dart 3.0 introduced class modifiers, and at first glance, the combinations can feel overwhelming. base, final, interface, mixin. How do they all fit together? What combinations are valid? Which ones are redundant?

It turns out there's an elegant way to understand the entire system: lattice theory.

The Four Capabilities

Every Dart type has some combination of four fundamental capabilities:

  1. Extendable: can be used with extends
  2. Implementable: can be used with implements
  3. Mixinable: can be used with with
  4. Constructable: can be instantiated directly

Each class modifier combination enables or restricts these capabilities. The lattice below shows how all valid combinations relate to each other.

The Class Modifiers Lattice

Dart Class Modifiers Lattice

Open full diagram in new tab ↗

Reading the Lattice

The lattice flows from bottom to top:

  • Bottom: Nothing (no capabilities)
  • Top: mixin class (all four capabilities)

Each arrow represents adding one capability. Follow arrows upward to see how adding capabilities transforms one modifier combination into another.

Color Coding

Node backgrounds:

  • Yellow: Existed before Dart 3.0
  • Green: New with class modifiers
  • Red: Impossible combinations

Arrow colors represent which capability is being added:

  • Orange: Mixinable
  • Teal: Extendable
  • Blue: Implementable
  • Brown: Constructable

Key Insights

1. mixin class is Maximum Capability

A plain mixin class has all four capabilities. It's the most permissive type you can declare. This is why it sits at the top of the lattice.

2. Some Combinations Are Impossible

Notice the red nodes. For example, final base mixin class is impossible because mixin classes must be extendable (that's how mixins work), but final prevents extension.

3. Pre-Dart 3.0 Types Were Limited

The yellow nodes show what existed before: class, abstract class, and mixin. The green nodes are all the new combinations that class modifiers enable.

Practical Examples

// All four capabilities
mixin class A {}

// Remove implementability, must extend, can't just implement
base mixin class B {}

// Remove extendability and mixinability, can only implement
interface class C {}

// Remove everything except constructability
final class D {}
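The declarations above can also be read from the call-site side. Here is a minimal sketch exercising all four capabilities of a plain mixin class in one program (the declaration names are mine):

```dart
mixin class A {
  int get value => 1;
}

class UsesExtends extends A {}       // 1. Extendable
class UsesImplements implements A {  // 2. Implementable
  @override
  int get value => 2;
}
class UsesWith with A {}             // 3. Mixinable

void main() {
  final a = A();                     // 4. Constructable
  print([
    a.value,
    UsesExtends().value,    // inherited from A
    UsesImplements().value, // overridden
    UsesWith().value,       // mixed in from A
  ]); // [1, 1, 2, 1]
}
```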

Why Lattices?

Lattices aren't just a visualization trick. They reveal the algebraic structure of the type system. The fact that Dart's class modifiers form a clean lattice means the design is internally consistent. There are no arbitrary restrictions or special cases.

Understanding this structure helps you:

  • Remember which combinations are valid
  • Predict what capabilities a type has
  • Choose the right modifier for your use case

The next time you're unsure which class modifier to use, think about which capabilities you want to allow, and find the corresponding node in the lattice.

Conclusion

Complex systems with many interacting options are hard to reason about. When you have four independent boolean capabilities, you get 2^4 = 16 possible combinations. Trying to understand these combinations through documentation alone quickly becomes overwhelming.

Lattices provide a way to model these combinations visually and algebraically. Instead of memorizing rules, you can see the relationships. Each node is a valid state, each edge is a transition, and the structure itself encodes the constraints.

As a fun aside: this hierarchy is actually a 4-dimensional cube (a tesseract). Each of the four capabilities corresponds to one dimension, and moving along an edge means toggling that capability on or off. But since we can't intuitively grasp 4-dimensional geometric objects (at least I can't, though some claim they can), the lattice representation serves as a more accessible algebraic structure to aid intuition.

PS: To be complete when it comes to class modifiers, we would also have to discuss sealed classes. They don't fit into this system. In my view, they are a separate feature and should be discussed separately.

PPS: Here's the official documentation for class modifiers, the spec, and the officially maintained ANTLR grammar showing the syntax of modifiers.

Addendum: Robert Nystrom, the lead designer of this feature, pointed out that despite mixin class having the most capabilities, the Dart team doesn't think it should be used often. It is mostly there for backwards compatibility.

Discuss on Reddit: r/dartlang

Modestas Valauskas