Oskar Wickström

There and Back Again: From Quickstrom to Bombadil

2026-01-28T00:00:00+01:00

Today I’m announcing and open-sourcing the Bombadil project — a brand new property-based browser testing framework. I started working on this when joining Antithesis two months ago. While still in its infancy, we’ve decided to build it in the open and share our progress from the start.

We decided on the name Bombadil last week. A few days later, this exploded:

It's going to be tough for startups when all the Lord of the Rings names are taken and the only thing left is something like Bombadil AI.
— Patrick Collison (@patrickc) January 25, 2026

While Quickstrom proved its worth, finding bugs in more than half of the TodoMVC apps, Bombadil aims to improve on its shortcomings while also envisioning a more ambitious future for generative testing of web apps: faster and smarter state space exploration, a modern and usable specification language, better tools for reproducing and debugging test failures, and a better distribution story.

I consider Bombadil the successor of Quickstrom. After years of trying to sustain Quickstrom through various models, I now have a much better answer: building it at Antithesis, making something valuable that is open and free to use, while also strengthening the company’s commercial offering. Bombadil can be used locally or in CI to test your web apps early. Power users can take it further and run Bombadil within Antithesis and its deterministic hypervisor. That gives you perfect reproductions of failed tests. You can even combine Bombadil with other workloads in Antithesis and test your entire stack deterministically. Isn’t that the holy grail of testing?

Bombadil is built from scratch with a focus on accessibility: a new specification language and better tooling for writing and maintaining specs. Right now, I’m working on a specification DSL in TypeScript. It’s based on linear temporal logic, just like Quickstrom’s, but aims to be a lot more ergonomic. Here’s an example that verifies that error messages eventually disappear:

    const errorMessage = extract(
      (state) =>
        state.document.body.querySelector(".error")
          ?.textContent
          ?? null
    );

    export const errorEventuallyDisappears = always(
      condition(() => errorMessage !== null).implies(
        eventually(() => errorMessage === null)
          .within(5, "seconds")
      )
    );

Both the original hacky PureScript DSL and the bespoke language Specstrom were huge obstacles to adoption. Today, TypeScript is a widely adopted language among quality-minded web developers, and if you’re not into that, Bombadil will work with plain JavaScript too.

We’re writing Bombadil in Rust, leveraging its excellent ecosystem to ship a single statically-linked executable — download for your platform, point it at a Chromium-based browser, and off you go!

Bombadil is early but usable. Check it out on GitHub, try it on your projects, and let us know what breaks. Also let us know what it breaks in your systems! Join us on Discord for help and discussion, or follow development on Twitter/X. This is just the beginning — we’re actively seeking feedback and early adopters to help shape where this goes.

Computer Says No: Error Reporting for LTL

2025-11-01T00:00:00+01:00

Quickstrom is a property-based testing tool for web applications, using QuickLTL for specifying the intended behavior. QuickLTL is a linear temporal logic (LTL) over finite traces, especially suited for testing. As with many other logic systems, when a formula evaluates to false — like when a counterexample to a safety property is found or a liveness property cannot be shown to hold — the computer says no. That is, you get “false” or “test failed”, perhaps along with a trace. Understanding complex bugs in stateful systems then comes down to staring at the specification alongside the trace, hoping you can somehow pin down what went wrong. It’s not great.

Instead, we should have helpful error messages explaining why a property does not hold; which parts of the specification failed and which concrete values from the trace were involved. Not false, unsat, or even assertion error: x != y. We should get the full story. I started exploring this space a few years ago when I worked actively on Quickstrom, but for some reason it went on the shelf half-finished. Time to tie up the loose ends!

The starting point was Picostrom, a minimal Haskell version of the checker in Quickstrom, and Error Reporting Logic (ERL), a paper introducing a way of rendering natural-language messages to explain propositional logic counterexamples. I ported it to Rust mostly to see what it turned into, and extended it with error reporting supporting temporal operators. The code is available at codeberg.org/owi/picostrom-rs under the MIT license.

Between the start of my work and picking it back up now, A Language for Explaining Counterexamples was published, which looks closely related, although it’s focused on model checking with CTL. If you’re interested in other related work, check out A Systematic Literature Review on Counterexample Explanation in Model Checking.

All right, let’s dive in!

QuickLTL and Picostrom

A quick recap on QuickLTL is in order before we go into the Picostrom code. QuickLTL operates on finite traces, making it suitable for testing. It’s a four-valued logic, meaning that a formula evaluates to one of these values:

definitely true
definitely false
probably true
probably false

It extends propositional logic with temporal operators, much like LTL:

next_d(P): P must hold in the next state, demanding a next state is available. This forces the evaluator to draw a next state.
next_f(P): P must hold in the next state, defaulting to probably false if no next state is available.
next_t(P): P must hold in the next state, defaulting to probably true if no next state is available.
eventually_N(P): P must hold in the current or a future state. It demands at least N states, evaluating on all available states, finally defaulting to probably false.
always_N(P): P must hold in the current and all future states. It demands at least N states, evaluating on all available states, finally defaulting to probably true.

You can think of eventually_N(P) as unfolding into a sequence of N nested next_d, wrapping an infinite sequence of next_f, all connected by ∨. Let’s define that inductively with a coinductive base case:

$$
\begin{align}
\text{eventually}_0(P) & = P \lor \text{next}_f(\text{eventually}_0(P)) \\
\text{eventually}_{N + 1}(P) & = P \lor \text{next}_d(\text{eventually}_N(P))
\end{align}
$$

And similarly, always_N(P) can be defined as:

$$
\begin{align}
\text{always}_0(P) & = P \land \text{next}_t(\text{always}_0(P)) \\
\text{always}_{N + 1}(P) & = P \land \text{next}_d(\text{always}_N(P))
\end{align}
$$

This is essentially how the evaluator expands these temporal operators, but for error reporting reasons, not exactly.
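To make the unfolding concrete, here is a minimal TypeScript sketch that follows the two pairs of equations literally. It is not how Picostrom is implemented: it assumes the whole (non-empty) trace is available up front, it only handles an atomic P given as a predicate, and the names `Verdict`, `Outcome`, `eventuallyN`, and `alwaysN` are made up for illustration. When a demanding next runs out of states, the sketch reports that more states are needed instead of drawing one.

```typescript
// Hypothetical names throughout; this is not Picostrom's API.
type Verdict =
  | "definitely true"
  | "definitely false"
  | "probably true"
  | "probably false";

// A demanding operator ran out of states: the caller would have to draw more.
type Outcome = Verdict | "needs more states";

// eventually_N(P), evaluated from position `i` of a non-empty finite trace.
function eventuallyN<S>(
  n: number,
  p: (state: S) => boolean,
  trace: S[],
  i: number = 0
): Outcome {
  // Left operand of the disjunction: P holds in the current state.
  if (p(trace[i])) return "definitely true";
  const hasNext = i + 1 < trace.length;
  if (n > 0) {
    // next_d: a next state is demanded.
    return hasNext ? eventuallyN(n - 1, p, trace, i + 1) : "needs more states";
  }
  // next_f: continue if possible, otherwise default to probably false.
  return hasNext ? eventuallyN(0, p, trace, i + 1) : "probably false";
}

// always_N(P), the dual: a single violation is a definite counterexample.
function alwaysN<S>(
  n: number,
  p: (state: S) => boolean,
  trace: S[],
  i: number = 0
): Outcome {
  // Left operand of the conjunction: P fails in the current state.
  if (!p(trace[i])) return "definitely false";
  const hasNext = i + 1 < trace.length;
  if (n > 0) {
    // next_d: a next state is demanded.
    return hasNext ? alwaysN(n - 1, p, trace, i + 1) : "needs more states";
  }
  // next_t: continue if possible, otherwise default to probably true.
  return hasNext ? alwaysN(0, p, trace, i + 1) : "probably true";
}
```

For example, `eventuallyN(1, (x: number) => x === 9, [1, 2, 3])` never sees a 9 and falls through to the "probably false" default, while `alwaysN(0, (x: number) => x < 3, [1, 2, 3])` hits a definite counterexample in the last state.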

Finally, there are atoms, which are domain-specific expressions embedded in the AST, evaluating to ⊤ or ⊥. The AST is parameterized on the atom type, so you can plug in an atom language of choice. An atom type must implement the Atom trait, which in simplified form looks like this:

    trait Atom {
        type State;

        fn eval(&self, state: &Self::State) -> bool;

        fn render(
            &self,
            mode: TextMode,
            negated: bool,
        ) -> String;

        fn render_actual(
            &self,
            negated: bool,
            state: &Self::State,
        ) -> String;
    }

For testing the checker, and for this blog post, I’m using the following atom type:

    enum TestAtom {
        Literal(u64),
        Select(Identifier),
        Equals(Box<TestAtom>, Box<TestAtom>),
        LessThan(Box<TestAtom>, Box<TestAtom>),
        GreaterThan(Box<TestAtom>, Box<TestAtom>),
    }

    enum Identifier {
        A,
        B,
        C,
    }

Evaluation

The first step, like in ERL, is transforming the formula into negation normal form (NNF), which means pushing down all negations into the atoms:

    enum Formula<Atom> {
        Atomic {
            negated: bool,
            atom: Atom,
        },
        // There's no `Not` variant here!
        // ...
    }

This makes it much easier to construct readable sentences, in addition to another important upside which I’ll get to in a second. The NNF representation is the one used by the evaluator internally.
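As an illustration of what pushing negations down means, here is a small TypeScript sketch of an NNF conversion. It is an assumption-laden toy, not Picostrom's code: the `Formula` and `Nnf` types and the `toNnf` function are hypothetical, temporal operators are shown without their N subscripts, and implication and the next operators are left out. It relies only on De Morgan's laws and the standard duality between always and eventually.

```typescript
// Hypothetical input syntax, which still has an explicit `not` node.
type Formula =
  | { kind: "atom"; name: string }
  | { kind: "not"; inner: Formula }
  | { kind: "and"; left: Formula; right: Formula }
  | { kind: "or"; left: Formula; right: Formula }
  | { kind: "always"; inner: Formula }
  | { kind: "eventually"; inner: Formula };

// Hypothetical NNF syntax: negation only survives as a flag on atoms.
type Nnf =
  | { kind: "atom"; negated: boolean; name: string }
  | { kind: "and"; left: Nnf; right: Nnf }
  | { kind: "or"; left: Nnf; right: Nnf }
  | { kind: "always"; inner: Nnf }
  | { kind: "eventually"; inner: Nnf };

// `negated` tracks whether an odd number of negations surrounds `f`.
function toNnf(f: Formula, negated: boolean = false): Nnf {
  switch (f.kind) {
    case "atom":
      return { kind: "atom", negated, name: f.name };
    case "not":
      return toNnf(f.inner, !negated);
    case "and": {
      const left = toNnf(f.left, negated);
      const right = toNnf(f.right, negated);
      // De Morgan: not (a and b) = (not a) or (not b)
      return negated ? { kind: "or", left, right } : { kind: "and", left, right };
    }
    case "or": {
      const left = toNnf(f.left, negated);
      const right = toNnf(f.right, negated);
      // De Morgan: not (a or b) = (not a) and (not b)
      return negated ? { kind: "and", left, right } : { kind: "or", left, right };
    }
    case "always": {
      const inner = toNnf(f.inner, negated);
      // Duality: not always(p) = eventually(not p)
      return negated ? { kind: "eventually", inner } : { kind: "always", inner };
    }
    case "eventually": {
      const inner = toNnf(f.inner, negated);
      // Duality: not eventually(p) = always(not p)
      return negated ? { kind: "always", inner } : { kind: "eventually", inner };
    }
  }
}
```

Under these assumptions, the negation of "a and always b" becomes "not-a or eventually not-b", with both negations stored on the atoms.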

Next, the eval function takes an Atom::State and a Formula, and produces a Value:

    enum Value<'a, A: Atom> {
        True,
        False { problem: Problem<'a, A> },
        Residual(Residual<'a, A>),
    }

A value is either immediately true or false, meaning that we don’t need to evaluate on additional states, or a residual, which describes how to continue evaluating a formula when given a next state. Also note how the False variant holds a Problem, which is what we’d report as definitely false. The True variant doesn’t need to hold any such information, because due to NNF, it can’t be negated and “turned into a problem.”

I won’t cover every variant of the Residual type, but let’s take one example:

    enum Residual<'a, A: Atom> {
        // ...
        AndAlways {
            start: Numbered<&'a A::State>,
            left: Box<Residual<'a, A>>,
            right: Box<Residual<'a, A>>,
        },
        // ...
    }

When such a value is returned, the evaluator checks if it’s possible to stop at this point, i.e. if there are no demanding operators in the residual. If not possible, it draws a new state and calls step on the residual. The step function is analogous to eval, also returning a Value, but it operates on a Residual rather than a Formula.

The AndAlways variant describes an ongoing evaluation of the always operator, where the left and right residuals are the operands of ∧ in the inductive definition I described earlier. The start field holds the starting state, which is used when rendering error messages. Similarly, the Residual enum has variants for ∨, ∧, ⟹, next, eventually, and a few others.

When the stop function deems it possible to stop evaluating, we get back a value of this type:

    enum Stop<'a, A: Atom> {
        True,
        False(Problem<'a, A>),
    }

Those variants correspond to probably true and probably false. In the false case, we get a Problem which we can render. Recall how the Value type returned by eval and step also had True and False variants? Those are the definite cases.

Rendering Problems

The Problem type is a tree structure, mirroring the structure of the evaluated formula, but only containing the parts of it that contributed to its falsity.

    enum Problem<'a, A: Atom> {
        And {
            left: Box<Problem<'a, A>>,
            right: Box<Problem<'a, A>>,
        },
        Or {
            left: Box<Problem<'a, A>>,
            right: Box<Problem<'a, A>>,
        },
        Always {
            state: Numbered<&'a A::State>,
            problem: Box<Problem<'a, A>>,
        },
        Eventually {
            state: Numbered<&'a A::State>,
            formula: Box<Formula<A>>,
        },
        // A bunch of others...
    }

I’ve written a simple renderer that walks the Problem tree, constructing English error messages. When hitting the atoms, it uses the render and render_actual methods from the Atom trait I showed you before.

The mode is very much like in the ERL paper, i.e. whether it should be rendered in deontic (e.g. “x should equal 4”) or indicative (e.g. “x equals 4”) form:

    enum TextMode {
        Deontic,
        Indicative,
    }

The render method should render the atom according to the mode, and render_actual should render relevant parts of the atom in a given state, like its variable assignments.
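Here is a rough TypeScript sketch of what those two methods could look like for a comparison atom. It is only an illustration under several assumptions: the atom type is flattened to a comparison between two terms (the Rust `TestAtom` nests), the negated phrasings are my own guesses, `renderActual` ignores the `negated` flag, and all names are hypothetical.

```typescript
type TextMode = "deontic" | "indicative";
type Identifier = "A" | "B" | "C";
type Term =
  | { kind: "literal"; value: number }
  | { kind: "select"; id: Identifier };
type Comparison = "equals" | "lessThan" | "greaterThan";
type TestAtom = { op: Comparison; left: Term; right: Term };
type State = Record<Identifier, number>;

// [positive, negated] phrasing per comparison and text mode.
// The negated wordings are assumptions made for this sketch.
const phrases: Record<Comparison, Record<TextMode, [string, string]>> = {
  equals: {
    deontic: ["must equal", "must not equal"],
    indicative: ["equals", "does not equal"],
  },
  lessThan: {
    deontic: ["must be less than", "must not be less than"],
    indicative: ["is less than", "is not less than"],
  },
  greaterThan: {
    deontic: ["must be greater than", "must not be greater than"],
    indicative: ["is greater than", "is not greater than"],
  },
};

function renderTerm(term: Term): string {
  return term.kind === "literal" ? String(term.value) : term.id;
}

// Renders the atom as a requirement (deontic) or as a statement (indicative).
function render(atom: TestAtom, mode: TextMode, negated: boolean): string {
  const pair = phrases[atom.op][mode];
  const verb = negated ? pair[1] : pair[0];
  return `${renderTerm(atom.left)} ${verb} ${renderTerm(atom.right)}`;
}

// Renders the variable assignments the atom depends on in a given state.
function renderActual(atom: TestAtom, state: State): string {
  const parts: string[] = [];
  for (const term of [atom.left, atom.right]) {
    if (term.kind === "select") parts.push(`${term.id}=${state[term.id]}`);
  }
  return parts.join(" and ");
}
```

With this sketch, an atom for B = 3 renders as "B must equal 3" in deontic mode and "B equals 3" in indicative mode, and an atom comparing B and C reports both assignments, such as "B=0 and C=0".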

With all these pieces in place, we can finally render some error messages! Let’s say we have this formula:

eventually₁₀(B = 3 ∧ C = 4)

If we run a test and never see such a state, the rendered error would be:

Probably false: eventually B must equal 3 and C must equal 4, but it was not observed starting at state 0

Neat! This is the kind of error reporting I want for my stateful tests.

Implication

You can trace why some subformula is relevant by using implication. A common pattern in state machine specs and other safety properties is:

precondition ⟹ before ∧ next_t(after)

So, let’s say we have this formula:

always_N((A > 0) ⟹ (B > 5 ∧ next_t(C < 10)))

If the conditions on B or C are false, the error includes the antecedent:

Definitely false: B must be greater than 5 and in the next state, C must be less than 10 since A is greater than 0, […]

Small Errors, Short Tests

Let’s consider a conjunction of two invariants. We could of course combine the two atomic propositions with conjunction inside a single always(...), but in this case we have the formula:

always(A < 3) ∧ always(B > C)

An error message, where both invariants fail, might look like the following:

Definitely false: it must always be the case that A is less than 3 and it must always be the case that B is greater than C, but A=3 in state 3 and B=0 in state 3

If only the second invariant (B > C) fails, we get a smaller error:

Definitely false: it must always be the case that B is greater than C, but B=0 and C=0 in state 0

And, crucially, if one of the invariants fails before the other we also get a smaller error, ignoring the other invariant. While single-state conjunctions evaluate both sides, possibly creating composite errors, conjunctions over time short-circuit in order to stop tests as soon as possible.

Diagrams

Let’s say we have a failing safety property like the following:

next_d(always₈(B > C))

The textual error might be:

Definitely false: in the next state, it must always be the case that B is greater than C, but B=13 and C=15 in state 6

But with some tweaks we could also draw a diagram, using the Problem tree and the collected states:

Or for a liveness property like next_d(eventually₈(B = C)), where there is no counterexample at a particular state, we could draw a diagram showing how we give up after some time:

These are only sketches, but I think they show how the Problem data structure can be used in many interesting ways. What other visualizations would be possible? An interactive state space explorer could show how problems evolve as you navigate across time. You could generate spreadsheets or HTML documents, or maybe even annotate the relevant source code of some system-under-test? I think it depends a lot on the domain this is applied to.

No Loose Ends

It’s been great to finally finish this work! I’ve had a lot of fun working through the various head-scratchers in the evaluator, getting strange combinations of temporal operators to render readable error messages. I also enjoyed drawing the diagrams, and almost nerd-sniped myself into automating that. Maybe another day. I hope this is interesting or even useful to someone out there. LTL is really cool and should be used more!

The code, including many rendering test cases, is available at codeberg.org/owi/picostrom-rs.

A special thanks goes to Divyanshu Ranjan for reviewing a draft of this post.

Programming in the Sun: A Year with the Daylight Computer

2025-10-10T00:00:00+02:00

I’ve been hinting on X/Twitter about my use of the Daylight DC-1 as a programming environment, and after about a year of use, it’s time to write about it in longer form. This isn’t a full product review, but rather an experience report on coding in sunlight. It’s also about the Boox Tab Ultra – which has a different type of display – and how it compares to the DC-1 for my use cases.

This is not a sponsored post.

Why do I even bother, you might ask? Sunlight makes me energetic and alert, which I need when I work. Living in the Nordics, 50% of the year is primarily dark, so any direct daylight I can get becomes really important. I usually run light mode on my Framework laptop during the day, but working in actual daylight with these displays, or plain old paper, is even better.

The Setup

Here are the main components of this coding environment:

Daylight DC-1: an Android-based tablet with a “Live Paper” display (Reflective LCD, not E-Ink)
8BitDo Retro Mechanical Keyboard: a mechanical Bluetooth-enabled keyboard, with Kailh key switches and USB-C for charging and an optional wired connection
Termux: a terminal emulator for Android, with a package collection based on apt
SSH, tmux, and Neovim: nothing surprising here

I use a slimmed-down version of my regular dotfiles, because this setup doesn’t use Nix. I’ve manually installed Neovim, tmux, and a few other essentials, using the package manager that comes with Termux. I’ve configured Termux to not show its virtual keyboard when a physical keyboard is connected (the Bluetooth keyboard). The Termux theme is “E-Ink” and the font is JetBrains Mono, all built into Termux. Neovim uses the built-in quiet colorscheme for maximum contrast.

Certain work requires a more capable environment, and in those cases I connect to my workstation using SSH and run tmux in there. For writing or simpler programming projects (I’ve even done Rust work with Cargo, for instance), the local Termux environment is fine.

Sometimes I want to go really minimalist, so I hide the tmux status bar and run Goyo in Neovim. Deep breaths. Feel the fresh air in your lungs. This is especially nice for writing blog posts like this one.

My blog editing works locally in Termux, with a live reloading Chrome in a split window, here during an evening writing session with the warm backlight enabled:

There’s the occasional Bluetooth connection problem with the 8BitDo keyboard. I also don’t love the layout, and I’m considering getting the Kinesis Freestyle2 Blue instead. I already have the wired version for my workstation, and the ergonomics are great.

Daylight DC-1 vs Boox Tab Ultra

What about the Boox? I’ve had this device for longer and I really like it too, but not for the same tasks. The E-Ink display is, quite frankly, a lot nicer to read on; EPUB books, research PDFs, web articles, etc. The 227 PPI instead of the Daylight’s 190 PPI makes a difference, and I like the look of E-Ink better overall.

However, the refresh rate and ghosting make it a bit frustrating for typing. Same goes for drawing, which I’ve used the Daylight for a lot. Most of my home renovation blueprints are sketched on the Daylight. The refresh rate makes it possible.

When reading at night with a more direct bedside lamp, often in combination with a subtle backlight, the Boox is much better. The Daylight screen can glare quite a bit, so the only option is backlight only. And at that point, a lot of the paperlike quality goes away.

You can also get some glare when there’s direct sunlight at a particular angle:

Even if I don’t write or program directly on the Boox, I’ve experimented with using it as a secondary display, like for the live reload blog preview:

To sum up, these devices are good for different things, in my experience. I’ve probably spent more time on the Boox, because I’ve had it for longer and I’ve read a lot on it, but the Daylight has been much better for typing and drawing.

Another thing I’d like to try is a larger E-Ink monitor for my workstation, like the one Zack is hacking on. I’m hoping this technology continues to improve on refresh rate, because I love E-Ink. Until then, the Daylight is a good compromise.

Finding Bugs in a Coding Agent with Lightweight DST

2025-08-28T00:00:00+02:00

Amp is a coding agent which I’ve been working on for the last six months at Sourcegraph. And in the last couple of weeks, I’ve been building a testing rig inspired by Deterministic Simulation Testing (DST) to test the most crucial parts of the system. DST is closely related to fuzzing and property-based testing.

The goal is to get one of Amp’s most central pieces, the ThreadWorker, under heavy scrutiny. We’ve had a few perplexing bug reports, where users experienced corrupted threads, LLM API errors from invalid tool calls, and more vague issues like “it seems like it’s spinning forever.” Reproducing such problems manually is usually somewhere between impractical and impossible. I want to reproduce them deterministically, and in a way where we can debug and fix them. And beyond the known ones, I’d like to find the currently unknown ones before our users hit them.

Generative testing to the rescue!

Approach: Lightweight DST in TypeScript

Amp is written in TypeScript, which is an ecosystem currently not drowning in fuzzing tools. My starting point was using jsfuzz, which I hadn’t used before but it looked promising. However, I had a bunch of problems getting it to run together with our Bun stack. One could use fast-check, but as far as I can tell, the model-based testing they support doesn’t fit with our needs. We don’t have a model of the system, and we need to generate values in multiple places as the test runs. So, I decided to build something from scratch for our purposes.

I borrowed an idea I got from matklad last year: instead of passing a seeded PRNG to generate test input, we generate an entropy Buffer with random contents, and track our position in that array with a cursor. Drawing a random byte consumes the byte at the current position and increments the cursor. We don’t know up-front how many bytes we need for a given fuzzer, so the entropy buffer grows dynamically when needed, appending more random bytes. This, together with a bunch of methods for drawing different types of values, is packaged up in an Entropy class:

    class Entropy {
      random(count: number): Uint8Array { ... }
      randomRange(minIncl: number, maxExcl: number): number { ... }
      // ... lots of other stuff
    }
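To show how the cursor and the growing buffer fit together, here is a minimal, self-contained sketch of such a class. It is not Amp's implementation: the buffer is a plain number array filled from `Math.random`, `randomRange` uses a simple modulo (which is slightly biased), and the `toHex` and `fromHex` helpers for replaying a run are my own additions for illustration.

```typescript
class Entropy {
  private bytes: number[] = [];
  private cursor = 0;

  // Start from recorded bytes (for replay) or from an empty buffer.
  constructor(initial: number[] = []) {
    this.bytes = initial.slice();
  }

  // Consume `count` bytes at the cursor, growing the buffer on demand.
  random(count: number): Uint8Array {
    while (this.bytes.length < this.cursor + count) {
      this.bytes.push(Math.floor(Math.random() * 256));
    }
    const drawn = Uint8Array.from(
      this.bytes.slice(this.cursor, this.cursor + count)
    );
    this.cursor += count;
    return drawn;
  }

  // Integer in [minIncl, maxExcl), assuming minIncl < maxExcl.
  // Consumes four bytes; the modulo introduces a small bias.
  randomRange(minIncl: number, maxExcl: number): number {
    const b = this.random(4);
    const n = b[0] * 0x1000000 + b[1] * 0x10000 + b[2] * 0x100 + b[3];
    return minIncl + (n % (maxExcl - minIncl));
  }

  // Hypothetical helper: serialize the whole buffer for later replay.
  toHex(): string {
    return this.bytes
      .map((b) => (b < 16 ? "0" : "") + b.toString(16))
      .join("");
  }

  // Hypothetical helper: rebuild an Entropy from a serialized buffer.
  static fromHex(hex: string): Entropy {
    const bytes: number[] = [];
    for (let i = 0; i < hex.length; i += 2) {
      bytes.push(parseInt(hex.slice(i, i + 2), 16));
    }
    return new Entropy(bytes);
  }
}
```

Because every draw reads from the buffer in order, replaying the same bytes through the same fuzzer reproduces the same sequence of decisions, as long as the fuzzer itself is deterministic.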

A fuzzer is an ES module written in TypeScript, exporting a single function:

    export async function fuzz(entropy: Entropy) {
      // test logic here
    }

Any exception thrown by fuzz is considered a test failure. We use the node:assert module for our test assertions, but it could be anything.

Another program, the fuzz runner, imports a built fuzzer module and runs as many tests as it can before a given timeout. If it finds a failure, it prints out the command to reproduce that failure:

    Fuzzing example.fuzzer.js iteration 1000...
    Fuzzing example.fuzzer.js iteration 2000...
    Fuzzer failed: AssertionError [ERR_ASSERTION]: 3 != 4
        at [...]

    Reproduce with:

        bun --console-depth=10 scripts/fuzz.ts \
          dist/example.fuzzer.js \
          --verbose \
          --reproduce=1493a513f88d0fd9325534c33f774831

Why use this Entropy rather than a seed? More about that at the end of the post!

The ThreadWorker Fuzzer

In the fuzzer for our ThreadWorker, we stub out all IO and other nondeterministic components, and we install fake timers to control when and how asynchronous code is run. In effect, we have determinism and simulation to run tests in, so I guess it qualifies as DST.

The test simulates a sequence of user actions (send message, cancel, resume, and wait). Similarly, it simulates responses from tool calls (like the agent reading a file) and from inference backends (like the Anthropic API). We inject faults and delays in both tool calls and inference requests to test our error handling and possible race conditions.

After all user actions have been executed, we make sure to approve any pending tool calls that require confirmation. Next, we tell the fake timer to run all outstanding timers until the queue is empty; like fast-forwarding until there’s nothing left to do. Finally, we check that the thread is idle, i.e. that there’s no ongoing inference and that all tool calls have terminated. This is a liveness property.

After the liveness property, we check a bunch of safety properties:

all messages posted by the user are present in the thread
all message pairs involving tool calls are valid according to Anthropic’s API specification
all tool calls have settled in expected terminal states

Some of these are targeted at specific known bugs, while some are more general but have found bugs we did not expect.

Here’s a highly simplified version of the fuzzer:

    export async function fuzz(entropy: Entropy) {
      const clock = sinon.useFakeTimers({
        loopLimit: 1_000_000,
      })
      const worker = setup() // including stubbing IO, etc

      try {
        const resumed = worker.resume()
        await clock.runAllAsync()
        await resumed

        async function run() {
          for (let round = 0; round < entropy.randomRange(1, 50); round++) {
            const action = await generateNextAction(entropy, worker)
            switch (action.type) {
              case 'user-message':
                await worker.handle({
                  ...action,
                  type: 'user:message',
                })
                break
              case 'cancel':
                await worker.cancel()
                break
              case 'resume':
                await worker.resume()
                break
              case 'sleep':
                await sleep(action.milliseconds)
                break
              case 'approve': {
                await approveTool(action.threadID, action.toolUseID)
                break
              }
            }
          }

          // Approve any remaining tool uses to ensure termination into an
          // idle thread state
          const blockedTools = await blockedToolUses()
          await Promise.all(blockedTools.map(approve))
        }

        const done = run()
        await clock.runAllAsync()
        await done

        // check liveness and safety properties
        // ...
      } finally {
        sinon.restore()
      }
    }

Now, let’s dig into the findings!

Results

Given I’ve been working on this for about a week in total, I’m very happy with the outcome. Here are some issues the fuzzer found:

Corrupted thread due to eagerly starting tool calls during streaming

While streaming tool use blocks from the Anthropic API, we invoked tools eagerly, while not all of them were finished streaming. This, in combination with how state was managed, led to tool results being incorrectly split across messages. Anthropic’s API would reject any further requests, and the thread would essentially be corrupted. This was reported by a user and was the first issue we found and fixed using the fuzzer.

Another variation, which the fuzzer also found, was a race condition where user messages interfered at a particular timing with ongoing tool calls, splitting them up incorrectly.

Subagent tool calls not terminating when subthread tool calls were rejected

Due to a recent change in behavior, where we don’t run inference automatically after tool call rejection, subagents could end up never signalling their termination, which led to the main thread never reaching an idle state.

I confirmed this in both VSCode and the CLI: infinite spinners, indeed.

Tool calls blocked on user not getting cancelled after user message

Due to how some tool calls require confirmation, like reading files outside the workspace or running some shell commands, in combination with how we represent and track termination of tools, there’s a possibility for such tools to be resumed and then, after an immediate user cancellation, not be properly cancelled. This leads to incorrect mutations of the thread data.

I’ve not yet found the cause of this issue, but it’s perfectly reproducible, so that’s a start.

Furthermore, we were able to verify an older bug fix, where Anthropic’s API would send an invalid message with an empty tool use block array. That used to get the agent into an infinite loop. With the fuzzer, we verified and improved the old fix which had missed another case.

How about number of test runs and timeouts? Most of these bugs were found almost immediately, i.e. within a second. The last one in the list above takes longer, around a minute normally. We run a short version of each fuzzer in every CI build, and longer runs on a nightly basis. This is up for a lot of tuning and experimentation.

Why the Entropy Buffer?

So why the entropy buffer instead of a seeded PRNG? The idea is to use that buffer to mutate the test input, instead of just bombarding with random data every time. If we can track which parts of the entropy were used where, we can make those slices “smaller” or “bigger.” We can use something like gradient descent or simulated annealing to optimize inputs, maximizing some objective function set by the fuzzer. Finally, we might be able to minimize inputs by manipulating the entropy.
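As a rough idea of what mutating the entropy could look like, here is a small TypeScript sketch. It uses plain hill climbing rather than gradient descent or simulated annealing, it treats the entropy as a bare `Uint8Array`, and the names `mutate` and `hillClimb`, the `rng` parameter, and the objective function are all assumptions made for illustration. None of this exists in the fuzzer described above.

```typescript
// Returns a copy of the entropy with one randomly chosen byte replaced.
// `rng` must return numbers in [0, 1), like Math.random.
function mutate(bytes: Uint8Array, rng: () => number): Uint8Array {
  const copy = Uint8Array.from(bytes);
  if (copy.length === 0) return copy;
  const index = Math.floor(rng() * copy.length);
  copy[index] = Math.floor(rng() * 256);
  return copy;
}

// Keeps a mutated buffer whenever it scores at least as well as the best
// one seen so far, according to an objective set by the fuzzer.
function hillClimb(
  initial: Uint8Array,
  objective: (bytes: Uint8Array) => number,
  steps: number,
  rng: () => number
): Uint8Array {
  let best = initial;
  let bestScore = objective(best);
  for (let i = 0; i < steps; i++) {
    const candidate = mutate(best, rng);
    const score = objective(candidate);
    if (score >= bestScore) {
      best = candidate;
      bestScore = score;
    }
  }
  return best;
}
```

In a real setup the objective would run the fuzzer on the candidate entropy and measure something interesting, for example how many rounds of actions were executed or how deep the thread state got. Minimizing a failing input would be the same loop with an objective that prefers smaller values while the failure still reproduces.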

In case the JavaScript community gets some powerful fuzzing framework like AFL++, that could also just be plugged in. Who knows, but I find this an interesting approach that’s worth exploring. I believe the entropy buffer approach is also similar to how Hypothesis works under the hood. Someone please correct me if that’s not the case.

Anyhow, that’s today’s report from the generative testing mines. Cheers!

Machine: Learning; Human: Unlearning;

2025-02-11T00:00:00+01:00

This last month has been fascinating. I guess LLMs have finally resonated with me on a deeper level. It wasn’t like I woke up and suddenly everything was different, but their impact is growing on me non-linearly, forcing me to rewire my brain.

I know there are probably tons of blog posts by the newly converted. I’m not trying to offer any grand insights, I’m just documenting my process and current ideas.

Gradually, Then Suddenly

I’ve been a typical sceptic of Copilot and similar tools. Sure, it’s nice to generate boilerplate and throw-away scripts, but that’s a minor part of what we do all day, right? I even took a break from using them for many months, and I’ve had serious qualms with their use in some areas outside coding.

After messing around with copilot.lua in Neovim, I tried Cursor. Their vision and what they’ve already built opened my eyes, especially the shadow workspaces, Tab, and the rule files. At the same time, a critical mass of friends and peers were building new products on top of these models; things I highly respect and can see massive value in.

Since then I’ve actively been looking for how I can use LLMs — beyond the auto-complete and chat interaction modes, beyond making me a slightly more productive developer. Don’t get me wrong, I love being alone coding for hours in a cozy room. It’s great. But I’m also curious to see how far I can push this myself, and of course how far and where the industry goes.

Taking on New Projects

One project that I’ve been working on goes under the working name site2doc. It’s a tool that converts entire web sites into EPUB and PDF books. Mostly because I want it for myself. I’d like to read online material, typeset minimalistically and beautifully, offline on my ebook reader. It turned out others want that too. There are great tools for converting single pages, but not for entire sites.

My main problem is that the web is highly unstructured and diverse. To be frank, a lot of sites have really bad markup. No titles at all, identical titles across pages, elements used inconsistently. The list goes on. This makes it very difficult for site2doc to generate a useful table of contents.

A friend suggested using LLMs to extract the information, and I experimented with using screenshots as input to classify pages. Both Claude 3.5 Sonnet and Gemini 2.0 Flash performed well, but I haven’t been able to generalize the approach across many sites. There’s just too much variability in how websites are structured, and I’m not sure how to handle it. I’m open to suggestions!

The other project, temporarily named converge, is a bit closer to what everyone else is doing: using LLMs for programming. It’s an agent that, given some file or module, covers it with a generated test suite, and then goes on to optimize it. The key idea is that the original code is the source of truth. The particular optimization goal could be performance, resource usage, readability, safety, or robustness. So far I’ve focused only on performance, partly because evaluation is straightforward.

Going beyond example-based test suites, I’ve been thinking about how property-based testing (PBT) might fit in. The obvious approach is to have the LLM generate property tests rather than examples. I don’t know how well this would work, if the LLM can generate meaningful properties.

A more interesting way is to generate an oracle property that compares the behavior of new code generated by the LLM to the original code: ∀i.old(i) = new(i), where i is some generated input. This provides a rigorous way to verify that optimizations preserve the original behavior. I’m curious to see how PBT’s shrinking could guide the LLM to iteratively fix the generated code.
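The oracle property is simple enough to sketch without any PBT library. The following TypeScript is only an illustration, not how converge works: `findCounterexample`, its parameters, and the generator are hypothetical, and there is no shrinking, so the first counterexample found is returned as-is.

```typescript
// Checks the oracle property "for all i: oldImpl(i) = newImpl(i)" on
// `runs` generated inputs. Returns the first disagreeing input, or null.
function findCounterexample<I, O>(
  oldImpl: (input: I) => O,
  newImpl: (input: I) => O,
  generate: () => I,
  runs: number,
  equal: (a: O, b: O) => boolean = (a, b) => a === b
): { input: I } | null {
  for (let run = 0; run < runs; run++) {
    const input = generate();
    if (!equal(oldImpl(input), newImpl(input))) {
      return { input };
    }
  }
  return null;
}

// Toy example: the original code is the source of truth.
const original = (n: number) => n * 2;
const optimized = (n: number) => n + n;
// A deliberately broken "optimization" that misbehaves above 90.
const broken = (n: number) => (n > 90 ? 0 : n * 2);

// A deterministic stand-in for a random generator: 0, 1, 2, ...
function counter(): () => number {
  let next = 0;
  return () => next++;
}

const ok = findCounterexample(original, optimized, counter(), 100);
const bad = findCounterexample(original, broken, counter(), 100);
```

Here `ok` is `null`, while `bad` holds the input 91, the first value on which the two implementations disagree. A real property-based testing library would additionally shrink such a counterexample, which is the part that might help guide the LLM.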

Another random idea: have the LLM explain existing code in natural language, then generate new code based only on the description. Run the old and new code side-by-side, and see how they differ, functionally and non-functionally.

I’ve only run converge on toy examples and snippets so far. I’m sure there are major challenges in applying it to larger code bases. Here’s what it currently does to Achilles numbers in Rosetta Code:

» converge -input AchillesNumbers.java
time=2025-02-11T09:35:44.877+01:00 level=INFO msg="test suite created"
time=2025-02-11T09:35:45.271+01:00 level=WARN msg="tests failed" attempt=1
time=2025-02-11T09:35:53.308+01:00 level=INFO msg="test suite modified"
time=2025-02-11T09:35:53.709+01:00 level=WARN msg="tests failed" attempt=2
time=2025-02-11T09:36:02.457+01:00 level=INFO msg="test suite modified"
time=2025-02-11T09:36:02.885+01:00 level=INFO msg="tests passed"
time=2025-02-11T09:36:09.508+01:00 level=INFO msg="increasing benchmark iterations" duration=0.038339998573064804 attempt=0
time=2025-02-11T09:36:15.056+01:00 level=INFO msg="increasing benchmark iterations" duration=0.2295980006456375 attempt=1
time=2025-02-11T09:36:21.225+01:00 level=INFO msg="increasing benchmark iterations" duration=0.5998150110244751 attempt=2
time=2025-02-11T09:36:28.010+01:00 level=INFO msg="benchmark run"
time=2025-02-11T09:36:35.737+01:00 level=INFO msg="code optimized" attempt=0
time=2025-02-11T09:38:19.094+01:00 level=INFO msg="tests failed"
...
time=2025-02-11T09:38:33.798+01:00 level=INFO msg="code optimized" attempt=9
optimization succeeded: 1.870953s -> 1.337585s (-28.51%)

That took about three minutes. It does all sorts of tricks to make it faster, but one that caught my eye was the conversion from HashMap to HashSet, and finally to BitSet. Here’s part of the diff:

...
 public class AchillesNumbers {
-
-    private Map pps = new HashMap<>();
+    private final BitSet pps = new BitSet();
+    private final BitSet achillesCache = new BitSet();
+    private static final byte[] SMALL_PRIMES = {2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47};
+    private static final int[] POWERS_OF_TEN = {1, 10, 100, 1000, 10000, 100000, 1000000, 10000000, 100000000};
+    private static final short[] SQUARES = new short[317];
+    private static final int[] CUBES = new int[47];
+    private static final short[] TOTIENTS = new short[1000];
 ...
 }

Unlearning

I’m surprised to find myself as excited as I am now. I did not see it coming! Just during the last few days, I’ve realized how much I need to unlearn in order to make better use of what these models have learned.

I was implementing control flow, using Claude to generate various bits of code for the converge tool. Then I realized that, hey, maybe it should be the other way around? Claude plans the control flow, and my tool just provides the ways of interacting with the environment (modifying source files, running tests, etc). It’s not a revelation, but an example of how one might need to think differently.

Even more down to earth, things like saying “do X for me” rather than asking “how do I do X?”. Instead of asking for some chunk of code, I tell it to solve a problem for me. Of course, I still review the changes. Cursor and Cody have both been great ways of changing my thinking.

What other habits and thought patterns might need to change? I don’t know how programming will look in the future, but I’m actively working on keeping an open mind and hopefully playing a small role in shaping it.

Comment on Hacker News or X.

How I Built "The Monospace Web"

2024-09-26T00:00:00+02:00

Recently, I published The Monospace Web, a minimalist design exploration. It all started with this innocent post, yearning for a simpler web. Perhaps too typewriter-nostalgic, but it was an interesting starting point. After some hacking and sharing early screenshots, @noteed asked for grid alignment, and down the rabbit hole I went.

This morphed into a technical challenge, while still having that creative aspect that I started with. Subsequent screenshots with the fixed grid and responsive tables sparked a lot of interest. About a week later I published the source, and since then there have been a lot of forks. People use it for their personal web sites, mostly, but also for apps.

I’d like to explain how it works. Not everything, just focusing on the most interesting parts.

The Fixed Grid

This design aligns everything, horizontally and vertically, to a fixed grid. Like a table with equal-size cells. Every text character should be exactly contained in a cell in that grid. Borders and other visual elements may span cells more freely.

The big idea here is to use the ch unit in CSS. I actually did not know about it before this experiment. The ch unit is described in CSS Values and Units Module Level 4:

Represents the typical advance measure of European alphanumeric characters, and measured as the used advance measure of the “0” (ZERO, U+0030) glyph in the font used to render it.

Further, it notes:

This measurement is an approximation (and in monospace fonts, an exact measure) of a single narrow glyph’s advance measure, thus allowing measurements based on an expected glyph count.

Fantastic! A cell is thus 1ch wide. And the cell height is equal to the line-height. In order to refer to the line height in calculations, it’s extracted as a CSS variable:

:root {  --line-height: 1.2rem; }

So far there’s no actual grid on the page. This is just a way of measuring and sizing elements based on the width and height of monospace characters. Every element must take up space evenly divisible by the dimensions of a cell; if its top left corner starts at a cell, then its bottom right must do so as well. That means that all elements, given that their individual sizes respect the grid, line up properly.

The Font

I’ve chosen JetBrains Mono for the font. It looks great, sure, but there’s a more important reason for this particular choice: it has good support for box-drawing characters at greater line heights. Most monospace fonts I tried broke at line heights above 110% or so. Lines and blocks were chopped up vertically. With JetBrains Mono, I can set it to 120% before it starts to become choppy.

I suspect Pragmata Pro or Berkeley Mono might work in this regard, but I haven’t tried them yet.

Also, if you want to use this design with a greater line height, you can probably do so if you don’t need box-drawing characters. Then you may also consider pretty much any monospace font. Why not use web-safe ones, trimming down the page weight by about 600kB!

To avoid alternate glyphs for numbers, keeping everything aligned to the grid, I set:

:root {  font-variant-numeric: tabular-nums lining-nums; }

And, for a unified thickness of borders, text, and underlines:

:root {
  --border-thickness: 2px;
  font-weight: 500;
  text-decoration-thickness: var(--border-thickness);
}

This gives the design that thick and sturdy feel.

The Body

The body element is the main container in the page. It is at most 80 characters wide. (Huh, I wonder where that number came from?)

Now for one of the key tricks! I wanted this design to be reasonably responsive. For a viewport width smaller than 84ch (80ch and 4ch of padding), the body width needs to be some smaller width that is still evenly divisible by the cell dimensions. This can be accomplished with CSS rounding:

body {
  padding: var(--line-height) 2ch;
  max-width: calc(min(80ch, round(down, 100%, 1ch)));
}

This way, the body shrinks in steps according to the grid.
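To make the stepping concrete, here is the same arithmetic as a small Python sketch. It only mirrors the min and round(down, …) calculation with widths measured in ch; the function name and the sample widths are made up for illustration.

```python
import math

MAX_BODY_CH = 80  # the 80ch cap from the max-width rule


def body_width_ch(available_ch: float) -> int:
    """Mirror min(80ch, round(down, 100%, 1ch)), with widths measured in ch."""
    return min(MAX_BODY_CH, math.floor(available_ch))


for available in (120.0, 80.0, 63.7, 40.2):
    print(available, "->", body_width_ch(available))
```

Any fractional part of a character is dropped, so the body always ends on a cell boundary.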

The Horizontal Rule

Surprisingly, the custom horizontal rule styling was a fiddly enterprise. I wanted it to feel heavy, with double lines. The lines are vertically centered around the break between two cells:

To respect the grid, the top and bottom spacing needs to be calculated. But padding won’t work, and margin interacts with adjacent elements’ margins, so this required two elements:

hr {
  position: relative;
  display: block;
  height: var(--line-height);
  margin: calc(var(--line-height) * 1.5) 0;
  border: none;
  color: var(--text-color);
}

hr:after {
  display: block;
  content: "";
  position: absolute;
  top: calc(var(--line-height) / 2 - var(--border-thickness));
  left: 0;
  width: 100%;
  border-top: calc(var(--border-thickness) * 3) double var(--text-color);
  height: 0;
}

The hr itself is just a container that takes up space; 4 lines in total. The hr:after pseudo-element draws the two lines, using border-top-style, at the vertical center of the hr.

The Table

Table styling was probably the trickiest. Recalling the principles from the beginning, every character must be perfectly aligned with the grid. But I wanted vertical padding of table cells to be half the line height. A full line height worth of padding is way too airy.

This requires the table being horizontally offset by half the width of a character, and vertically offset by half the line height.

table {
  position: relative;
  top: calc(var(--line-height) / 2);
  width: calc(round(down, 100%, 1ch));
  border-collapse: collapse;
}

Cell padding is calculated based on cell size and borders, to keep grid alignment:

th, td {
  border: var(--border-thickness) solid var(--text-color);
  padding:
    calc((var(--line-height) / 2))
    calc(1ch - var(--border-thickness) / 2)
    calc((var(--line-height) / 2) - (var(--border-thickness)))
  ;
  line-height: var(--line-height);
}

Finally, the first row must have slightly less vertical padding to compensate for the top border. This is hacky, and would be nicer to solve with some kind of negative margin on the table. But then I’d be back in margin interaction land, and I don’t like it there.

table tbody tr:first-child > * {
  padding-top: calc(
    (var(--line-height) / 2) - var(--border-thickness)
  );
}

Another quirk is that columns need to have set sizes. All but one should use the width-min class, and the remaining should use width-auto. Otherwise, cells divide the available width in a way that doesn’t align with the grid.
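The cell padding arithmetic can be checked with a few lines of Python. This is a sketch under assumptions: the pixel values for the line height and 1ch are invented, and it assumes that with collapsed borders each cell effectively owns half of every shared border. Only the 2px border thickness comes from the CSS above.

```python
from fractions import Fraction

# Hypothetical pixel values; only --border-thickness: 2px comes from the post.
LINE_HEIGHT = Fraction(96, 5)  # 19.2px, i.e. 1.2rem at a 16px root font size
CH = Fraction(48, 5)           # 9.6px, an assumed advance width of "0"
BORDER = Fraction(2)           # --border-thickness


def cell_box(text_columns: int) -> tuple[Fraction, Fraction]:
    """Width and height of one td, assuming each cell owns half of every collapsed border."""
    padding_top = LINE_HEIGHT / 2
    padding_bottom = LINE_HEIGHT / 2 - BORDER
    padding_x = CH - BORDER / 2
    width = text_columns * CH + 2 * padding_x + 2 * (BORDER / 2)
    height = LINE_HEIGHT + padding_top + padding_bottom + 2 * (BORDER / 2)
    return width, height


width, height = cell_box(text_columns=10)
print(width / CH, height / LINE_HEIGHT)
```

Under those assumptions the border thickness cancels out: a cell with ten characters of text is 12ch wide and two line heights tall.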

The Layout Grid

I also included a grid class for showcasing how a grid layout helper could work. Much like the 12-column systems found in CSS frameworks, but funkier. To use it, you simply slap a grid class on a container. It uses a glorious hack to count the children in pure CSS:

.grid > * {
  flex: 0 0 calc(
    round(
      down,
      (100% - (1ch * (var(--grid-cells) - 1))) / var(--grid-cells),
      1ch
    )
  );
}
.grid:has(> :last-child:nth-child(1)) { --grid-cells: 1; }
.grid:has(> :last-child:nth-child(2)) { --grid-cells: 2; }
.grid:has(> :last-child:nth-child(3)) { --grid-cells: 3; }
.grid:has(> :last-child:nth-child(4)) { --grid-cells: 4; }
.grid:has(> :last-child:nth-child(5)) { --grid-cells: 5; }
.grid:has(> :last-child:nth-child(6)) { --grid-cells: 6; }
.grid:has(> :last-child:nth-child(7)) { --grid-cells: 7; }
.grid:has(> :last-child:nth-child(8)) { --grid-cells: 8; }
.grid:has(> :last-child:nth-child(9)) { --grid-cells: 9; }

Look at it go!

Unlike with tables, the layout grid rows don’t have to fill the width. Depending on your viewport width, you’ll see a ragged right margin. However, by setting flex-grow: 1; on one of the children, that one grows to fill up the remaining width.
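The ragged margin falls out of the rounding, which a short Python sketch can show. It assumes a 1ch gap between children and measures everything in ch; the function names are hypothetical.

```python
import math


def grid_cell_width_ch(container_ch: float, cells: int, gap_ch: int = 1) -> int:
    """Width of each child in ch: share the space left after the gaps, rounded down."""
    free = container_ch - gap_ch * (cells - 1)
    return math.floor(free / cells)


def ragged_margin_ch(container_ch: float, cells: int, gap_ch: int = 1) -> float:
    """Leftover width on the right when no child has flex-grow: 1."""
    used = cells * grid_cell_width_ch(container_ch, cells, gap_ch) + gap_ch * (cells - 1)
    return container_ch - used


for cells in (3, 4):
    print(cells, grid_cell_width_ch(80, cells), ragged_margin_ch(80, cells))
```

With an 80ch container, three children fit exactly at 26ch each, while four children get 19ch each and leave 1ch over on the right.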

The Media Elements

Images and video grow to fill the width. But they have their own proportions, making vertical alignment a problem. How many multiples of the line height should the height of the media be? I couldn’t figure out a way to calculate this with CSS alone.

One option was a preprocessor step that would calculate and set the ratio of every such element as a CSS variable, and then have CSS calculate a padding-bottom based on the ratio and the width:

<img style="--ratio: 0.821377" ... >

However, I eventually settled for a small chunk of JavaScript to calculate the difference, and set an appropriate padding-bottom. Ending up in JavaScript was inevitable, I suppose.
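The calculation itself is simple. Here it is sketched in Python rather than JavaScript, purely to show the arithmetic; this is not the script used on the page. The function name is hypothetical, and ratio is assumed to mean width divided by height.

```python
import math


def media_padding_bottom(width_px: float, ratio: float, line_height_px: float) -> float:
    """Padding needed so the media's height plus padding is a whole number of lines.

    `ratio` is assumed to be width divided by height.
    """
    height_px = width_px / ratio
    lines = math.ceil(height_px / line_height_px)
    return lines * line_height_px - height_px


# A 650px wide image at 2:1 is 325px tall; with a hypothetical 20px line height
# it needs 17 lines (340px), so the padding makes up the difference.
print(media_padding_bottom(650, 2.0, 20))
```

The padding has to be recomputed whenever the rendered width changes, which is presumably why this ends up in a script rather than in static CSS.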

Summary

There are many small things I haven’t shown in detail, including:

the debug grid overlay, which you see in the screenshots
ordered list numbering
the tree-rendered list
various form controls and buttons
the custom details element

But I think I’ve covered the most significant bits. For a full tour, have a look at the source code.

There are still bugs, like alignment not working the same in all browsers, and not working at all in some cases. I’m not sure why yet, and I might try to fix it in at least Firefox and Chromium. Those are the ones I can test with easily.

This has been a fun project, and I’ve learned a bunch of new CSS tricks. Also, the amount of positive feedback has been overwhelming. Of course, there’s been some negative feedback as well, not to be dismissed. I do agree with the concern about legibility. Monospace fonts can be beautiful and very useful, but they’re probably not the best for prose or otherwise long bodies of text.

Some have asked for reusable themes or some form of published artifact. I won’t spend time maintaining such packages, but I encourage those who would. Fork it, tweak it, build new cool things with it!

Finally, I’ll note that I’m happy with how the overall feel of this design turned out, even setting aside the monospace aspect. Maybe it would work with a proportional, or perhaps semi-proportional, font as well?

A Flexible Minimalist Neovim for 2024

2024-08-12T00:00:00+02:00

In the eternal search for a better text editor, I’ve recently gone back to Neovim. I’ve taken the time to configure it myself, with as few plugins and other cruft as possible. My goal is a minimalist editing experience, tailored for exactly those tasks that I do regularly, and nothing more. In this post, I’ll give a brief tour of my setup and its motivations.

Over the years, I’ve been through a bunch of editors. Here are most of them, in roughly chronological order:

Adobe Dreamweaver
Sublime Text
Atom
Vim/Neovim
IntelliJ IDEA
VS Code
Emacs

The majority were used specifically for the work I was doing at the time. VS Code for web development, IntelliJ IDEA for JVM languages, Emacs for Lisps. Vim and Emacs have been the most generally applicable, and the ones that I’ve enjoyed the most.

I’ve also evaluated Zed recently, but it hasn’t quite stuck. The start-up time and responsiveness are impressive. However, it seems to insist on being a global application. I want my editor to be local to each project, with the correct PATH from direnv, and I want multiple isolated instances to run simultaneously. Maybe I’ll revisit it later.

Returning to Neovim

Yeah, so I’m back in Neovim. I actually started with the LazyVim distribution based on recommendation. On the positive side, it got me motivated to use Neovim again. But I had some frustrations with the distribution.

The start-up time wasn’t great. I guess it did some lazy loading of plugins to speed things up, but the experience was still that of an IDE taking its time to get ready. Not the Neovim experience I was hoping for.

More importantly, it was full of distractions; popups, status messages, news, and plugins I didn’t need. I guess it takes a batteries-included approach. That might make sense for beginners and those just getting into Neovim, but I realized quickly that I wanted something different.

Supposedly I could strip things out, but instead I decided to start from scratch and build the editor I wanted. One that I understand. Joran Dirk Greef talks about two different types of artists, sculptors and painters, and this is an exercise in painting.

More concretely, my main goals are:

Plugins: I want to keep plugins to an absolute minimum. My editor is meant for coding and writing, and for what I work on right now. I might add or remove plugins and configuration over time, and that’s fine.
User interface: It should be as minimalist as I can make it. Visual distractions kept at a minimum. This includes busy colorschemes, which I find add little value. A basic set of typographical conventions in monochrome works well for me.
Start-up time: With the way I use Neovim, it needs to start fast. I quit and start it all the time, jumping between directories, working on different parts of the system or a project. Often I put it in the background with C-z, but not always. Making it faster seems to be mainly an exercise in minimizing plugins.

I have to mention Nix, you know?

I manage dotfiles and other personal configuration using Nix and home-manager. The Neovim configuration is no exception, so I’ll include some of the Nix bits as well.

The vim.nix module declares that I want Neovim installed and exposed as vim:

programs.neovim = {
  enable = true;
  vimAlias = true;
  ...
};

In that attribute set, there are two other important parts; the plugins list and the extraConfig.

Plugins

Let’s start with the plugins:

plugins = with pkgs.vimPlugins; [
  nvim-lspconfig
  (nvim-treesitter.withPlugins(p: [
    p.bash
    p.json
    p.lua
    p.markdown
    p.nix
    p.python
    p.rust
    p.zig
    p.vimdoc
  ]))
  conform-nvim
  neogit
  fzf-vim
];

Basically it’s five plugins, not counting the various treesitter parsers:

nvim-lspconfig: LSP is included in Neovim nowadays, but it doesn’t know about specific language servers. The lspconfig plugin helps with configuring Neovim for use with various servers.
nvim-treesitter: This provides better highlighting for Neovim. I’ve only included the languages I use right now. No nice-to-haves. I could probably remove this plugin, but I haven’t tried yet.
conform-nvim: Auto-formatting is useful and I don’t want to think about it. Of course, I could run :%!whatever-formatter, but I’d rather have the editor do it for me.
neogit: Magit was the reason I clung to Emacs for so long. Neogit is, for my purposes, a worthy replacement. It enabled me to finally make the switch.
fzf-vim: A wrapper around the awesome fzf fuzzy finder. I use :Files and :GFiles, as quick jump-to-file commands (think C-p in VS Code or Zed). They are bound to ff and gf, respectively. This might be another plugin I could do without, writing a small helper around fzf, or just making do with :find and **/ wildcards. On the other hand, I’m trying out :Buffers instead of stumbling around with :bnext and :bprev.

One great thing with the Nix setup is I don’t need a package manager in Neovim itself.

Batteries Are Included

Many things I don’t need plugins for. For instance, there are a ton of plugins for auto-completion, but Neovim has most of that built in, and I prefer triggering it manually:

File name completions with C-x C-f
Omnicomplete (rebound to C-Space in my case)
Buffer completions with C-x C-n
Spelling suggestions with C-x C-s or z=

I’ve tried various snippet engines many times, but not found them very useful. Most of my time is spent reading or modifying existing code, not churning out new boilerplate. Instead they tend to clutter the auto-completion list. Snippets might make more sense for things like HTML, but I don’t write HTML often, and in that case I’d prefer some emmet/zen-coding plugin.

You can get great mileage from learning how to use the Quickfix list. I’m no expert, but I prefer investing in composable primitives that I can reuse in different ways. Project-wide search-and-replace is such an example:

:grep whatever
:cfdo s/whatever/something else/g | update
:cfdo :bd

Here we search (:grep, which I’ve configured to use rg), substitute and save each file, and delete those buffers afterwards.

I also use :make and :compiler a lot. Neovim is cool.

Life in Monochrome

Maybe I’m just growing old, but I prefer a monochrome colorscheme. Right now I’m using the built-in quiet scheme with a few extra tweaks:

set termguicolors
set bg=dark
colorscheme quiet
highlight Keyword gui=bold
highlight Comment gui=italic
highlight Constant guifg=#999999
highlight NormalFloat guibg=#333333

It’s black-and-white, but keywords are bold, comments are darker and italic, and literals are light gray. Here’s how it looks with some Zig code:

Maybe I’m coming off as nostalgic or conservative, but I do find it more readable this way.

Another thing I’m going to try soon is writing on the Daylight Computer, hopefully in Neovim. Being comfortable with a monochrome colorscheme should come in handy.

The Full Configuration

My config uses a Vimscript entrypoint (extraConfig in the Nix code). This part is based on my near-immortal config from the good old Vim days. Early on, it calls vim.loader.enable() to improve startup time. I use Lua scripts for configuring the plugins and related keymap bindings. Maybe everything could be Lua, but I haven’t gotten that far yet. However, it’s nice to have the base config somewhat portable; I can just copy-paste it onto a server or some other temporary environment and have a decent /usr/bin/vim experience.

You’ll find the full configuration in vim.nix and the Lua bits inside the vim/ directory.

That’s about it! I’m really happy with how fast and minimalistic it is. It starts in just above 100ms. And I can understand all of my configuration (even if I don’t understand all of Neovim.) Perhaps I’ve spent more time on it than I’ve saved, but at least I’m happy so far.

I’m writing and publishing this on my birthday. What a treat to find time to blog on such an occasion!

Statically Typed Functional Programming with Python 3.12

2024-05-23T00:00:00+02:00

Lately I’ve been messing around with Python 3.12, discovering new features around typing and pattern matching. Combined with dataclasses, they provide support for a style of programming that I’ve employed in Kotlin and Typescript at work. That style in turn is based on what I’d do in OCaml or Haskell, like modelling data with algebraic data types. However, the more advanced concepts from Haskell — and OCaml too, I guess — don’t transfer that well to mainstream languages.

What I’m describing in this post is a trade-off that I find comfortable to use in Python, especially with the new features that I’ll describe. Much of this works nicely in Kotlin and Typescript, too, with minor adaptions.

The principles I try to use are:

Declarative rather than imperative operations on data: Transform or fold data using map, reduce, or for-comprehensions instead of for-loops
Functions with destructuring pattern-matching to dispatch based on data: Use match statements rather than if isinstance(...) or inheritance and dynamic dispatch
Programs structured mainly around data and functions: Programs are trees of function invocations on data, rather than class hierarchies, dependency injection, and overuse of exceptions
Effects pushed to the outer layers of the program (hexagonal architecture): Within reasonable bounds, functions in the guts of programs are pure and return data, whereas the outer layers interpret that data and manage effects (IO, non-determinism, etc)

This list is not exhaustive, but I’m trying to keep this focused. Also, I’m intentionally not taking this in the direction of Haskell, with typeclass hierarchies, higher-kinded types, and so on. I don’t believe cramming such constructs in would benefit Python programs in practice. Furthermore, I won’t be talking about effects and hexagonal architecture in this post.

The examples are all type-annotated and checked with Pyright. You could do all of this without static type-checking, as far as I know.

Finally, note that I consider myself a Python rookie, as I mostly use it for small tools and scripts. The largest program I’ve written in Python is Quickstrom. This post is meant to inspire and trigger new ideas, not to tell anyone how to write Python code.

All right, let’s get started and see what’s possible!

Preliminaries

First, let’s get some boilerplate stuff out of the way. I’m running Python 3.12. Some of the things I’ll show can be done with earlier versions, but not all.

We’ll not be using any external packages, only modules from the standard library:

from typing import *
from dataclasses import dataclass

You might not want to use wildcard imports in more serious programs, but it’s acceptable for these examples.

Pattern Matching

Let’s start with a classic example from the functional programming world: an evaluator for a simple expression-based language. It only supports a few operations in order to keep it simple. First, we define the different types of expressions using dataclasses and a union type:

type Expr = int | bool | str | BinOp | Let | If

@dataclass
class BinOp:
    op: Literal["<"] | Literal[">"]
    lhs: Expr
    rhs: Expr

@dataclass
class Let:
    name: str
    value: Expr
    body: Expr

@dataclass
class If:
    cond: Expr
    t: Expr
    f: Expr

The syntax for type aliases and the union type operator (|) are both recent additions. Previously, you could create type aliases using regular top-level bindings, but mutually recursive types required some types to be enclosed in strings. Otherwise, Python would complain that the second type (e.g. BinOp in the code above) wasn’t defined. It’s a bit cleaner now.

Note that we use existing primitive types from Python (int, bool, and str), combined with dataclasses for complex expressions. The str is interpreted as a reference to a name bound in the lexical scope, not as a string literal, as we’ll see in the following snippet.

The evaluator tracks bindings in the Env. Evaluating an expression results in a value that is either an integer or a boolean.

type Env = dict[str, Value]
type Value = int | bool

Now, let’s look at the eval function. Here we pattern-match on the expression, which is a union. For literals, we just return the value:

def eval(env: Env, expr: Expr) -> Value:
    match expr:
        case int() | bool():
            return expr
        ...

References are looked up in the environment:

        case str():
            return env[expr]

Let-bindings create a new environment with the new binding:

        case Let(name, value, body):
            new_env = env | {name: eval(env, value)}
            return eval(new_env, body)

Finally, the BinOp and If branches pattern-match on the evaluated nested expressions to make sure they’re of the correct types:

        case BinOp(op, lhs, rhs):
            l = eval(env, lhs)
            r = eval(env, rhs)
            match op, l, r:
                case "<", int(), int():
                    return l < r
                case ">", int(), int():
                    return l > r
                case _:
                    raise ValueError(
                        f"Invalid binary operation {op} on {lhs} and {rhs}"
                    )
        case If(cond, t, f):
            match eval(env, cond):
                case True:
                    return eval(env, t)
                case False:
                    return eval(env, f)
                case c:
                    raise ValueError(f"Expected bool condition, got: {c}")

All right, let’s try it out:

>>> example = Let("x", 1, If(BinOp("<", "x", 2), 42, 0))
>>> eval({}, example)
42

Nice! But this is far from a robust evaluator. If we run it with a deep enough expression, we’d get a RecursionError saying that the maximum recursion depth was exceeded. This is a commonly occurring problem when writing recursive functions in Python.¹ The eval function could be rewritten with an explicit stack for operations and operands, but it’s a bit fiddly.

In some cases, you can restructure a recursive function as tail-recursive, and then manually convert it to a loop. Perhaps you could automatically optimize tail-calls, or use a trampoline. Some solutions avoid stack overflows, at the expense of increased heap memory usage. In simpler cases, combinators like map and reduce might suffice, instead of explicit recursion.
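To illustrate the trampoline idea, here is a minimal sketch on a much simpler function than the evaluator. The names sum_to and trampoline are made up for this example. The tail-recursive function returns a thunk instead of calling itself, and a plain loop runs the thunks.

```python
from typing import Callable, Union

# A step is either a final int result or a zero-argument thunk producing the next step.
Step = Union[int, Callable[[], "Step"]]


def sum_to(n: int, acc: int = 0) -> Step:
    """Tail-recursive sum of 1..n that returns a thunk instead of calling itself."""
    if n == 0:
        return acc
    return lambda: sum_to(n - 1, acc + n)


def trampoline(step: Step) -> int:
    """Run thunks in a loop, so the call stack stays shallow."""
    while callable(step):
        step = step()
    return step


print(trampoline(sum_to(100_000)))
```

A directly recursive version would exceed the default recursion limit at this depth. The trampolined one allocates a closure per step instead, which is the stack-for-heap trade mentioned above.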

Either way, recursive functions and stack overflows are something to watch out for.

Generics

Since Python 3.12, it’s also much nicer to work with generic types. Previously, you had to define type variables before using them in type signatures. This felt very awkward to me.

Let’s look at an example that models a rose tree data type. To spice it up a little, I’m including a map function for both types of the tree nodes. This is equivalent to fmap in Haskell, but without the typeclass and higher-kinded types.

type RoseTree[T] = Branch[T] | Leaf[T]

@dataclass
class Branch[A]:
    branches: list[RoseTree[A]]

    def map[B](self, f: Callable[[A], B]) -> Branch[B]:
        return Branch([b.map(f) for b in self.branches])

@dataclass
class Leaf[A]:
    value: A

    def map[B](self, f: Callable[[A], B]) -> Leaf[B]:
        return Leaf(f(self.value))

Let’s print these trees using pattern matching. Here’s a function that’s written as a loop, maintaining a list of remaining sub-trees to print:

def print_tree[T](tree: RoseTree[T]):
    trees = [(tree, 0)]
    while trees:
        match trees.pop(0):
            case Branch(branches), level:
                print(" " * level * 2 + "*")
                trees = [(branch, level + 1) for branch in branches] + trees
            case Leaf(value), level:
                print(" " * level * 2 + "- " + repr(value))

It could be even simpler using plain recursion, but then we could run into stack depth issues again. Anyway, let’s try it out:

example = Branch(
    [
        Leaf(1),
        Leaf(2),
        Branch([Leaf(3), Leaf(4)]),
        Branch([Leaf(5), Leaf(6)]),
        Leaf(7),
    ]
)

>>> print_tree(example.map(str))
*
  - '1'
  - '2'
  *
    - '3'
    - '4'
  *
    - '5'
    - '6'
  - '7'

As you can see from the repr being printed, all the values are mapped to strings.

Protocols

As a last example, I’d like to show how you can do basic structural subtyping using protocols. This is useful in cases where you don’t want to define all variants of a union in a single place. For instance, you might have many types of events that can be emitted in an application. Centrally defining each type of event adds unwanted coupling. Breaking apart a base class for events and the code that later on consumes the events decreases cohesion. In such cases, a protocol might be a better option.

We’ll need a new import:

from datetime import datetime

Now, consider the following events in module A:

@dataclass
class Increment[Time]:
    id: str
    time: Time

    def description(self: Self) -> str:
        return "Incremented counter"

@dataclass
class Reset[Time]:
    id: str
    time: Time

    def description(self: Self) -> str:
        return "Reset counter"

We made them generic just to showcase the combination of protocols and generics. The Time type parameter isn’t instantiated in any other way than datetime in this example.

In another module B — that doesn’t depend on A, and isn’t depended upon by A — the protocol is defined, along with the log_event function:

class Event[Time](Protocol):
    id: str
    time: Time

    def description(self: Self) -> str: ...

def log_event(event: Event[datetime]):
    print(f"Got {event.id} at {event.time}: {event.description()}")

Increment and Reset both implement the Event protocol by virtue of being structurally compatible. They can both be passed to log_event:

log_event(
    Increment("foo", datetime.now()),
)

If we annotate the Event protocol with @runtime_checkable, we can check it with isinstance and use it in match cases:

@runtime_checkable
class Event[Time](Protocol):
    ...

def log(x: Any):
    match x:
        case Event() if isinstance(x.time, datetime):
            log_event(x)
        case _:
            print(x)

Pretty neat!

That’s all I have for now. Maybe more Python hacking and blog posts will pop up if there’s interest. I’m positive about the evolution of Python and functional programming, as it’s something I use quite regularly.

Join the discussion on Twitter, Hacker News, or Lobsters.

Thank you @tusharisanerd for reviewing a draft of this post.

¹ See On Recursion, Continuations and Trampolines for more in-depth explanations of various solutions to recursive functions and the stack.

Specifying State Machines with Temporal Logic

2021-05-05T00:00:00+02:00

Quickstrom uses linear temporal logic (LTL) for specifying web applications. When explaining how it works, I’ve found that the basics of LTL are intuitive to newcomers. On the other hand, it’s not obvious how to specify real-world systems using LTL. That’s why I’m sharing some of my learnings and ideas from the past year in the form of blog posts.

This post focuses on how to use LTL to specify systems in terms of state machines. It’s a brief overview that avoids going into too much detail. For more information on how to test web applications using such specifications, see the Quickstrom documentation.

To avoid possible confusion, I want to start by pointing out that a state machine specification in this context is not the same as a model in TLA+ (or similar modeling languages.) We’re not building a model to prove or check properties against. Rather, we’re defining properties in terms of state machine transitions, and the end goal is to test actual system behavior (e.g. web applications, desktop applications, APIs) by checking that recorded traces match our specifications.

Linear Temporal Logic

In this post, we’ll be using an LTL language. It’s a sketch of a future specification language for Quickstrom.

A formula (plural formulae) is a logical expression that evaluates to true or false. We have the constants:

true
false

We combine formulae using the logical connectives, e.g.:

&&
||
not
==>

The ==> operator is implication. So far we have propositional logic, but we need a few more things.

Temporal Operators

At the core of our language we have the notion of state. Systems change state over time, and we’d like to express that in our specifications. But the formulae we’ve seen so far do not deal with time. For that, we use temporal operators.

To illustrate how the temporal operators work, I’ll use diagrams to visualize traces (sequences of states). A filled circle (●) denotes a state in which the formula is true, and an empty circle (○) denotes a state where the formula is false.

For example, let’s say we have two formulae, P and Q, where:

P is true in the first and second state
Q is true in the second state

Both formulae are false in all other states. The formulae and trace would be visualized as follows:

P ●───●───○
Q ○───●───○

Note that in these diagrams, we assume that the last state repeats forever. This might seem a bit weird, but drawing an infinite number of states is problematic.

All of the examples explaining operators have links to the Linear Temporal Logic Visualizer, in which you can interactively experiment with the formulae. The syntax is not the same as in the article, but hopefully that’s not a problem.

The next operator takes a formula as an argument and evaluates it in the next state.

next P  ●───○───○
P       ○───●───○

Open in LTL Visualizer

The next operator is relative to the current state, not the first state in the trace. This means that we can nest nexts to reach further into the future.

next (next P)  ●───●───○───○───○
next P         ○───●───●───○───○
P              ○───○───●───●───○

Open in LTL Visualizer
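To make the diagrams above concrete, here is a minimal Python sketch of next. It is not Quickstrom code. It assumes a formula is represented by its truth value in each state of a finite, non-empty trace, and that the last state repeats forever, as in the diagrams. The names next_ and row are made up for this illustration.

```python
from typing import Sequence


def next_(values: Sequence[bool]) -> list[bool]:
    """Truth of `next F` in each state, given the truth of F in each state.

    Assumes a finite, non-empty trace whose last state repeats forever.
    """
    return list(values[1:]) + [values[-1]]


def row(values: Sequence[bool]) -> str:
    """Render truth values in the same style as the diagrams."""
    return "───".join("●" if v else "○" for v in values)


# P is true only in the third and fourth of five states.
p = [False, False, True, True, False]

print("next (next P) ", row(next_(next_(p))))
print("next P        ", row(next_(p)))
print("P             ", row(p))
```

Each application of next_ shifts the row one state to the left, which is why nesting reaches further into the future.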

Next for State Transitions

All right, time for a more concrete example, something we’ll evolve throughout this post. Let’s say we have a formula gdprConsentIsVisible which is true when the GDPR consent screen is visible. We specify that the screen should be visible in the current and next state like so:

gdprConsentIsVisible && next gdprConsentIsVisible

A pair of consecutive states is called a step. When specifying state machines, we use the next operator to describe state transitions. A state transition formula is a logical predicate on a step.

In the GDPR example above, we said that the consent screen should stay visible in both states of the step. If we want to describe a state change in the consent screen’s visibility, we can say:

gdprConsentIsVisible && next (not gdprConsentIsVisible)

The formula describes a state transition from a visible to a hidden consent screen.

Always

But interesting state machines usually have more than one possible transition, and interesting behaviors likely contain multiple steps.

While we could nest formulae containing the next operator, we’d be stuck with specifications only describing a finite number of transitions.

Consider the following, where we like to state that the GDPR consent screen should always be visible:

gdprConsentIsVisible && next (gdprConsentIsVisible && next ...)

This doesn’t work for state machines with cycles, i.e. with possibly infinite traces, because we can only nest a finite number of next operators. We want state machine specifications that describe any number of transitions.

This is where we pick up the always operator. It takes a formula as an argument, and it’s true if the given formula is true in the current state and in all future states.

always P  ●───●───●───●───●
P         ●───●───●───●───●
always Q  ○───○───●───●───●
Q         ●───○───●───●───●

Open in LTL Visualizer

Note how always Q is true in the third state and onwards, because that’s when Q becomes true in the current and all future states.

Let’s revisit the always-visible consent screen specification. Instead of trying to nest an infinite number of next formulae, we say:

always gdprConsentIsVisible

Neat! This is called an invariant property. Invariants are assertions on individual states, and an invariant property says that it must hold for every state in the trace.
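As a rough illustration (not Quickstrom code), always can be sketched in Python over a recorded trace. The sketch assumes a finite, non-empty trace whose last state repeats forever, so checking the recorded suffix is enough. The function and variable names are made up for this example.

```python
from typing import Sequence


def always(values: Sequence[bool]) -> list[bool]:
    """Truth of `always F` in each state, given the truth of F in each state.

    Assumes a finite, non-empty trace whose last state repeats forever, so
    only the recorded suffix starting at each state needs to be checked.
    """
    return [all(values[i:]) for i in range(len(values))]


# Q from the diagram: true in every state except the second.
q = [True, False, True, True, True]
print(always(q))  # [False, False, True, True, True]

# An invariant property holds for a trace when `always` holds in the
# first state. Here the consent screen disappears in the last state.
gdpr_consent_is_visible = [True, True, True, False]
print(always(gdpr_consent_is_visible)[0])  # False
```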

Always for State Machines

Now, let’s up our game. To specify the system as a state machine, we can combine transitions with disjunction (||) and the always operator. First, we define the individual transition formulae open and close:

let open = not gdprConsentIsVisible && next gdprConsentIsVisible;
let close = gdprConsentIsVisible && next (not gdprConsentIsVisible);

Our state machine formula says that it always transitions as described by open or close:

always (open || close)

We have a state machine specification! Note that this specification only allows for transitions where the visibility of the consent screen changes back and forth.
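Here is a small Python sketch of what checking such a specification against a recorded trace could look like. It is an illustration only, with one loud assumption: it checks the steps between recorded states and nothing after the last one. Under the reading where the last state repeats forever, the final stuttering step would be neither open nor close. The function names are made up for this example.

```python
from typing import Sequence


def is_open(before: bool, after: bool) -> bool:
    """The `open` transition: hidden in this state, visible in the next."""
    return (not before) and after


def is_close(before: bool, after: bool) -> bool:
    """The `close` transition: visible in this state, hidden in the next."""
    return before and (not after)


def follows_state_machine(visible: Sequence[bool]) -> bool:
    """Check `open || close` on every recorded step of the trace.

    Assumption: only steps between recorded states are checked.
    """
    steps = zip(visible, visible[1:])
    return all(is_open(a, b) or is_close(a, b) for a, b in steps)


# Visibility of the consent screen in each recorded state.
print(follows_state_machine([False, True, False, True]))  # True
print(follows_state_machine([False, True, True, False]))  # False
```

The second trace is rejected because its middle step keeps the screen visible, which is neither open nor close.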

So far we’ve only seen examples of safety properties. Those are properties that specify that “nothing bad happens.” But we also want to specify that systems somehow make progress. The following two temporal operators let us specify liveness properties, i.e. “good things eventually happen.”

Quickstrom does not support liveness properties yet.¹

Eventually

We’ve used next to specify transitions, and always to specify invariants and state machines. But we might also want to use liveness properties in our specifications. In this case, we are not talking about specific steps, but rather goals.

The temporal operator eventually takes a formula as an argument, and it’s true if the given formula is true in the current or any future state.

eventually P  ○───○───○───○───○
P             ○───○───○───○───○
eventually Q  ●───●───●───●───○
Q             ○───○───○───●───○

Open in LTL Visualizer

For instance, we could say that the consent screen should initially be visible and eventually be hidden:

gdprConsentIsVisible && eventually (not gdprConsentIsVisible)

This doesn’t say that it stays hidden. It may become visible again, and our specification would allow that. To specify that it should stay hidden, we use a combination of eventually and always:

gdprConsentIsVisible && eventually (always (not gdprConsentIsVisible))

Let’s look at a diagram to understand this combination of temporal operators better:

eventually (always P)  ○───○───○───○───○
P                      ○───○───●───●───○
eventually (always Q)  ●───●───●───●───●
Q                      ○───○───●───●───●

Open in LTL Visualizer

The formula eventually (always P) is not true in any state, because P never starts being true forever. The other formula, eventually (always Q), is true in all states because Q becomes true forever in the third state.
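The same result can be reproduced with a short Python sketch. This is an illustration under the diagrams' assumption that the trace is finite, non-empty, and that its last state repeats forever. It is not how Quickstrom evaluates formulae, and the function names are made up.

```python
from typing import Sequence


def always(values: Sequence[bool]) -> list[bool]:
    """`always F` per state: F holds in this and all later recorded states."""
    return [all(values[i:]) for i in range(len(values))]


def eventually(values: Sequence[bool]) -> list[bool]:
    """`eventually F` per state: F holds in this or some later recorded state."""
    return [any(values[i:]) for i in range(len(values))]


p = [False, False, True, True, False]
q = [False, False, True, True, True]

# P is false in the last (repeating) state, so it never becomes true forever.
print(eventually(always(p)))  # [False, False, False, False, False]

# Q stays true from the third state onwards.
print(eventually(always(q)))  # [True, True, True, True, True]
```

Under these assumptions, eventually (always F) boils down to whether F holds in the last recorded state.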

Until

The last temporal operator I want to discuss is until. For P until Q to be true, P must be true until Q becomes true.

P until Q  ●───●───●───●───○
P          ●───●───○───○───○
Q          ○───○───●───●───○

Open in LTL Visualizer

Just as with the eventually operator, the stop condition (Q) doesn’t have to stay true forever, but it has to be true at least once.

The until operator is more expressive than always and eventually, and they can both be defined using until.²
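A minimal Python sketch can make this tangible. It assumes a finite, non-empty trace whose last state repeats forever, and represents each formula by its truth value per state. The derived operators follow the usual definitions eventually P = true until P and always P = not (eventually not P). All function names are made up for this illustration.

```python
from typing import Sequence


def until(p: Sequence[bool], q: Sequence[bool]) -> list[bool]:
    """`P until Q` per state: Q holds now or later, and P holds until then.

    Assumes the last state repeats forever, so if Q is not found in the
    recorded suffix it never becomes true.
    """
    result = []
    for i in range(len(p)):
        holds = False
        for j in range(i, len(p)):
            if q[j]:
                holds = True
                break
            if not p[j]:
                break
        result.append(holds)
    return result


def eventually(p: Sequence[bool]) -> list[bool]:
    """eventually P = true until P"""
    return until([True] * len(p), p)


def always(p: Sequence[bool]) -> list[bool]:
    """always P = not (eventually not P)"""
    return [not v for v in eventually([not v for v in p])]


p = [True, True, False, False, False]
q = [False, False, True, True, False]
print(until(p, q))  # [True, True, True, True, False]

print(eventually([False, False, False, True, False]))  # [True, True, True, True, False]
print(always([True, False, True, True, True]))  # [False, False, True, True, True]
```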

Anyway, let’s get back to our running example. Suppose we have another formula supportChatVisible that is true when the support chat button is shown. We want to make sure it doesn’t show up until after the GDPR consent screen is closed:

not supportChatVisible until not gdprConsentIsVisible

The negations make it a bit harder to read, but it’s equivalent to the informal statement: “the support chat button is hidden at least until the GDPR consent screen is hidden.” It doesn’t demand that the support chat button is ever visible, though. For that, we instead say:

gdprConsentIsVisible until (supportChatVisible && not gdprConsentIsVisible)

In this formula, supportChatVisible has to become true eventually, and at that point the consent screen must be hidden.

Until for State Machines

We can use the until operator to define a state machine formula where the final transition is more explicit.

Let’s say we want to specify the GDPR consent screen more rigorously. Suppose we already have the possible state transition formulae defined:

allowCollectedData
disallowCollectedData
submit

We can then put together the state machine formula:

let gdprConsentStateMachine =
  gdprConsentIsVisible
    && (allowCollectedData || disallowCollectedData)
    until (submit && next (not gdprConsentIsVisible));

In this formula we allow any number of allowCollectedData or disallowCollectedData transitions, until the final submit resulting in a closed consent screen.

What’s next?

We’ve looked at some temporal operators in LTL, and how to use them to specify state machines. I’m hoping this post has given you some ideas and inspiration!

Another blog post worth checking out is TLA+ Action Properties by Hillel Wayne. It’s written specifically for TLA+, but most of the concepts are applicable to LTL and Quickstrom-style specifications.

I intend to write follow-ups, covering atomic propositions, queries, actions, and events. If you want to comment, there are threads on GitHub, Twitter, and on Lobsters. You may also want to sponsor my work.

Thank you Vitor Enes, Andrey Mokhov, Pascal Poizat, and Liam O’Connor for reviewing drafts of this post.

Edits

2021-05-28: Added links to the Linear Temporal Logic Visualizer matching the relevant examples. Note that the syntax is different in the visualizer.

Footnotes

A future version of Quickstrom will use a different flavor of LTL tailored for testing, and that way support liveness properties.↩︎
We can define eventually P = true until P, and perhaps a bit harder to grasp, always P = not (true until not P). Or we could say always P = not (eventually not P).↩︎

Introducing Quickstrom: High-confidence browser testing

2020-08-27T00:00:00+02:00

In the last post I shared the results from testing TodoMVC implementations using WebCheck. The project has since been renamed Quickstrom (thank you, Tom) and is now released as open source.

What is Quickstrom?

Quickstrom is a new autonomous testing tool for the web. It can find problems in any type of web application that renders to the DOM. Quickstrom automatically explores your application and presents minimal failing examples. Focus your effort on understanding and specifying your system, and Quickstrom can test it for you.

Past and future

I started writing Quickstrom on April 2, 2020, about a week after our first child was born. Somehow that code compiled, and evolved into a capable testing tool. I’m now happy and excited to share it with everyone!

In the future, when Quickstrom is more robust and has more mileage, I might build a commercial product on top of it. This is one of the reasons I’ve chosen an AGPL-2.0 license for the code, and why contributors must sign a CLA before pull requests can be merged. The idea is to keep the CLI test runner AGPL forever, but I might need a license exception if I build a closed-source SaaS product later on.

Learning more

Interested in Quickstrom? Start by checking out any of these resources:

Main website
Project documentation, including installation instructions and usage guides
Source code

And keep an eye out for updates by signing up for the newsletter, or by following me on Twitter. Documentation should be significantly improved soon.

Comments

If you have any comments or questions, please reply to any of the following threads:

The TodoMVC Showdown: Testing with WebCheck

2020-07-02T00:00:00+02:00

In this post I’ll share the results from testing TodoMVC implementations using my new testing tool named WebCheck. I’ll explain how the specification works, what problems were found in the TodoMVC implementations, and what this means for my project.

WebCheck

During the last three months I’ve used my spare time to build WebCheck. It’s a browser testing framework combining ideas from:

property-based testing (PBT)
TLA+ and linear temporal logic
functional programming

In WebCheck, you write a specification for your web application, instead of manually writing test cases. The specification describes the intended behavior as a finite-state machine and invariants, using logic formulae written in a language inspired by the temporal logic of actions (PDF) from TLA+. WebCheck generates hundreds or thousands of tests and runs them to verify if your application is accepted by the specification.

The tests generated by WebCheck are sequences of possible actions, determined from the current state of the DOM at the time of each action. You can think of WebCheck as exploring the state space automatically, based on DOM introspection. It can find user behaviors and problems that we, the biased and lazy humans, are unlikely to think of and to write tests for. Our job is instead to think about the requirements and to improve the specification over time.
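To give a feel for this kind of exploration, here is a toy Python sketch. It is not WebCheck’s implementation. The counter “application”, its two actions, and all function names are hypothetical stand-ins for a real DOM, chosen only to show actions being derived from the current state at each step.

```python
import random


def available_actions(state: dict) -> list[str]:
    """Actions possible in the current state (a stand-in for DOM introspection)."""
    actions = ["increment"]
    if state["count"] > 0:
        # In this toy app, the decrement button only exists when count > 0.
        actions.append("decrement")
    return actions


def apply_action(state: dict, action: str) -> dict:
    """Return the state after performing the action."""
    delta = 1 if action == "increment" else -1
    return {"count": state["count"] + delta}


def explore(steps: int, rng: random.Random) -> list[dict]:
    """Perform randomly chosen available actions, recording every state."""
    state = {"count": 0}
    trace = [state]
    for _ in range(steps):
        action = rng.choice(available_actions(state))
        state = apply_action(state, action)
        trace.append(state)
    return trace


trace = explore(20, random.Random(0))

# An invariant checked over the recorded trace: the count is never negative.
print(all(s["count"] >= 0 for s in trace))
```

The tester never writes the sequence of clicks. They only state the invariant, and the exploration supplies the behaviors.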

Specifications vs Models

In property-based testing, when testing state machines using models, the model should capture the essential complexity of the system under test (SUT). It needs to be functionally complete to be a useful oracle. For a system that is conceptually simple, e.g. a key-value database engine, this is not a problem. But for a system that is inherently complex, e.g. a business application with a big pile of rules, a useful model tends to grow as complex as the system itself.

In WebCheck, the specification is not like such a model. You don’t have to implement a complete functional model of your system. You can leave out details and specify only the most important aspects of your application. As an example, I wrote a specification that states that “there should at no time be more than one spinner on a page”, and nothing else. Again, this is possible to specify in PBT in general, but not with model-based PBT, from what I’ve seen.

TodoMVC as a Benchmark

Since the start of this project, I’ve used the TodoMVC implementations as a benchmark of WebCheck, and developed a general specification for TodoMVC implementations. The TodoMVC contribution documentation has a high-level feature specification, and the project has a Cypress test suite, but I was curious if I could find anything new using WebCheck.

Early on, checking the mainstream framework implementations, I found that both the Angular and Mithril implementations were rejected by my specification, and I submitted an issue in the TodoMVC issue tracker. Invigorated by the initial success, I decided to check the remaining implementations and gradually improve my specification.

I’ve generalized the specification to work on nearly all the implementations listed on the TodoMVC website. Some of them use the old markup, which uses IDs instead of classes for most elements, so I had to support both variants.

The Specification

Before looking at the test results, you might want to have a look at the WebCheck specification that I’ve published as a gist:

TodoMVC.spec.purs

The gist includes a brief introduction to the WebCheck specification language and how to write specifications. I’ll write proper documentation for the specification language eventually, but this can give you a taste of how it works, at least. I’ve excluded support for the old TodoMVC markup to keep the specification as simple as possible.

The specification doesn’t cover all features of TodoMVC yet. Most notably, it leaves out the editing mode entirely. Further, it doesn’t cover the usage of local storage in TodoMVC, and local storage is disabled in WebCheck for now.

I might refine the specification later, but I think I’ve found enough to motivate using WebCheck on TodoMVC applications. Further, this is likely how WebCheck would be used in other projects. You specify some things and you leave out others.

The astute reader might have noticed that the specification language looks like PureScript. And it pretty much is PureScript, with some WebCheck-specific additions for temporal modalities and DOM queries. I decided not to write a custom DSL, and instead write a PureScript interpreter. That way, specification authors can use the tools and packages from the PureScript ecosystem. It works great so far!

Test Results

Below are the test results from running WebCheck and my TodoMVC specification on the examples listed on the TodoMVC website. I’ll use short descriptions of the problems (some of which are recurring), and explain in more detail further down.

	Example	Problems/Notes
Pure JavaScript
✓	Backbone.js
❌	AngularJS	Clears input field on filter change
✓	Ember.js
✓	Dojo
✓	Knockback
✓	CanJS
✓	Polymer
✓	React
❌	Mithril	Clears input field on filter change
✓	Vue
✓	MarionetteJS
Compiled to JavaScript
✓	Kotlin + React
✓	Spine
✓	Dart
✓	GWT
✓	Closure
✓	Elm
❌	AngularDart	Race condition on initialization; Filters not implemented
✓	TypeScript + Backbone.js
✓	TypeScript + AngularJS
✓	TypeScript + React
✓	Reagent
✓	Scala.js + React
✓	Scala.js + Binding.scala
✓	js_of_ocaml
–	Humble + GopherJS	Missing/broken link
Under evaluation by TodoMVC
✓	Backbone.js + RequireJS
❌	KnockoutJS + RequireJS	Inconsistent first render
✓	AngularJS + RequireJS	Needs a custom `readyWhen` condition
✓	CanJS + RequireJS
❌	Lavaca + RequireJS	Clears input field on filter change
❌	cujoJS	Race condition on initialization; Filters not implemented
✓	Sammy.js
–	soma.js	Missing/broken link
❌	DUEL	Clears input field on filter change
✓	Kendo UI
❌	Dijon	Filters not implemented
✓	Enyo + Backbone.js
❌	SAPUI5	No input field
✓	Exoskeleton
✓	Ractive.js
✓	React + Alt
✓	React + Backbone.js
✓	Aurelia
❌	Angular 2.0	Filters not implemented
✓	Riot
✓	JSBlocks
Real-time
–	SocketStream	State cannot be cleared
✓	Firebase + AngularJS
Node.js
–	Express + gcloud-node	Missing/broken link
Non-framework implementations
❌	VanillaJS	Adds pending item on other interaction
❌	VanillaJS ES6	Adds pending item on other interaction; `.todo-count strong` is missing
✓	jQuery

✓ Passed, ❌ Failed, – Not testable

Filters not implemented: There’s no way of switching between “All”, “Active”, and “Completed” items. This is specified in the TodoMVC documentation under Routing.
Race condition on initialization: The event listeners are attached some time after the .new-todo form is rendered. Although unlikely, if you’re quick enough you can focus the input, press Return, and post the form. This will navigate the user agent to the same page but with a query parameter, e.g. index.html?text=. In TodoMVC it’s not the end of the world, but there are systems where you do not want this to happen.
Inconsistent first render: The application briefly shows an inconsistent view, then renders the valid initial state. KnockoutJS + RequireJS shows an empty list of items and “0 left” at the bottom, even though the footer should be hidden when there are no items.
Needs a custom readyWhen condition: The specification awaits an element matching .todoapp (or #todoapp for the old markup) in the DOM before taking any action. In this case, the implementation needs a modified specification that instead awaits a framework-specific class, e.g. .ng-scope. This is an inconvenience in testing the implementation using WebCheck, rather than an error.
No input field: There’s no input field to enter TODO items in. I’d argue this defeats the purpose of a TODO list application, and it’s indeed specified in the official documentation.
Adds pending item on other interaction: When there’s a pending item in the input field, and another action is taken (toggle all, change filter, etc), the pending item is submitted automatically without a Return key press.
.todo-count strong element is missing: An element matching the selector .todo-count strong must be present in the DOM when there are items, showing the number of active items, as described in the TodoMVC documentation.
State cannot be cleared: This is not an implementation error, but an issue where the test implementation makes it hard to perform repeated isolated testing. State cannot (to my knowledge) be cleared between tests, and so isolation is broken. This points to a key requirement currently placed by WebCheck: the SUT must be stateless, with respect to a new private browser window. In future versions of WebCheck, I’ll add hooks to let the tester clear the system state before each test is run.
Missing/broken link: The listed implementation seems to be moved or decommissioned.

Note that I can’t decide which of these problems are bugs. That’s up to the TodoMVC project maintainers. I see them as problems, or disagreements between the implementations and my specification. A good chunk of humility is in order when testing systems designed and built by others.

The Future is Bright

I’m happy with how effective WebCheck has been so far, after only a few months of spare-time prototyping. Hopefully, I’ll have something more polished that I can make available soon. An open source tool that you can run yourself, a SaaS version with a subscription model, or maybe both. When that time comes, maybe WebCheck can be part of the TodoMVC project’s testing. And perhaps in your project?

If you’re interested in WebCheck, please sign up for the newsletter. I’ll post regular project updates, and definitely no spam. You can also follow me on Twitter.

Comments

If you have any comments or questions, please reply to any of the following threads:

Thanks to Hillel Wayne, Felix Holmgren, Martin Janiczek, Tom Harding, and Martin Clausen for reviewing drafts of this post.

Time Travelling and Fixing Bugs with Property-Based Testing

2019-11-17T00:00:00+01:00

Property-based testing (PBT) is a powerful testing technique that helps us find edge cases and bugs in our software. A challenge in applying PBT in practice is coming up with useful properties. This tutorial is based on a simple but realistic system under test (SUT), aiming to show some ways you can test and find bugs in such logic using PBT. It covers refactoring, dealing with non-determinism, testing generators themselves, number of examples to run, and coupling between tests and implementation. The code is written in Haskell and the testing framework used is Hedgehog.

This tutorial was originally written as a book chapter, and later extracted as a standalone piece. Since I’m not expecting to finish the PBT book any time soon, I decided to publish the chapter here.

The business logic we’ll test is the validation of a website’s user signup form. The website requires users to sign up before using the service. When signing up, a user must pick a valid username. Users must be between 18 and 150 years old.

Stated formally, the validation rules are:

$$ \begin{aligned} 1 \leq \text{length}(\text{name}) \leq 50 \\ 18 \leq \text{age} \leq 150 \end{aligned} \qquad(1)$$

The signup and its validation have already been implemented by previous programmers. There have been user reports of strange behaviour, and we’re going to locate and fix the bugs using property tests.

Poking around the codebase, we find the data type representing the form:

data SignupForm = SignupForm
  { formName :: Text
  , formAge :: Int
  } deriving (Eq, Show)

And the existing validation logic, defined as validateSignup. We won’t dig into the implementation yet, only its type signature:

validateSignup
  :: SignupForm -> Validation (NonEmpty SignupError) Signup

It’s a pure function, taking SignupForm data as an argument, and returning a Validation value. In case the form data is valid, it returns a Signup data structure. This data type resembles SignupForm in its structure, but refines the age as a Natural when valid:

data Signup = Signup
  { name :: Text
  , age :: Natural
  } deriving (Eq, Show)

In case the form data is invalid, validateSignup returns a non-empty list of SignupError values. SignupError is a union type of the possible validation errors:

data SignupError
  = NameTooShort Text
  | NameTooLong Text
  | InvalidAge Int
  deriving (Eq, Show)

The Validation Type

The Validation type comes from the validation package. It’s parameterized by two types:

the type of validation failures
the type of a successfully validated value

The Validation type is similar to the Either type. The major difference is that it accumulates failures, rather than short-circuiting on the first failure. Failures are accumulated when combining multiple Validation values using Applicative.

Using a non-empty list for failures in the Validation type is common practice. It means that if the validation fails, there’s at least one error value.
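To illustrate the difference between accumulating and short-circuiting, here is a rough Python analogue. It is not the Haskell validateSignup, whose implementation hasn’t been shown yet. It assumes names of 1 to 50 characters and ages from 18 to 150, and uses plain lists of failures in place of Validation. The function names are made up for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NameTooShort:
    name: str


@dataclass(frozen=True)
class NameTooLong:
    name: str


@dataclass(frozen=True)
class InvalidAge:
    age: int


def validate_name(name: str) -> list:
    """Return the failures for the name (an empty list means valid)."""
    if len(name) < 1:
        return [NameTooShort(name)]
    if len(name) > 50:
        return [NameTooLong(name)]
    return []


def validate_age(age: int) -> list:
    """Return the failures for the age (an empty list means valid)."""
    return [] if 18 <= age <= 150 else [InvalidAge(age)]


def validate_accumulating(name: str, age: int) -> list:
    """Validation-like: run every check and collect all failures."""
    return validate_name(name) + validate_age(age)


def validate_short_circuit(name: str, age: int) -> list:
    """Either-like: stop at the first check that fails."""
    return validate_name(name) or validate_age(age)


print(validate_accumulating("", 10))   # [NameTooShort(name=''), InvalidAge(age=10)]
print(validate_short_circuit("", 10))  # [NameTooShort(name='')]
```

With the accumulating variant, a user who gets both fields wrong sees both problems at once instead of fixing them one submission at a time.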

Validation Property Tests

Let’s add some property tests for the form validation, and explore the existing implementation. We begin in a new test module, and we’ll need a few imports:

import Data.List.NonEmpty (NonEmpty (..))
import Data.Text (Text)
import Data.Validation
import Hedgehog
import qualified Hedgehog.Gen as Gen
import qualified Hedgehog.Range as Range

Also, we’ll need to import the implementation module:

import Validation

We’re now ready to define some property tests.

A Positive Property Test

The first property test we’ll add is a positive test. That is, a test using only valid input data. This way, we know the form validation should always be successful. We define prop_valid_signup_form_succeeds:

prop_valid_signup_form_succeeds = property $ do
  let genForm = SignupForm <$> validName <*> validAge ➊
  form <- forAll genForm ➋

  case validateSignup form of ➌
    Success{} -> pure ()
    Failure failure' -> do
      annotateShow failure'
      failure

First, we define genForm (➊), a generator producing form data with valid names and ages. Next, we generate form values from our defined generator (➋). Finally, we apply the validateSignup function and pattern match on the result (➌):

In case it’s successful, we have the test pass with pure ()
In case it fails, we print the failure' and fail the test

The validName and validAge generators are defined as follows:

validName :: Gen Text
validName = Gen.text (Range.linear 1 50) Gen.alphaNum

validAge :: Gen Int
validAge = Gen.integral (Range.linear 18 150)

Recall the validation rules (eq. 1). The ranges in these generators yielding valid form data are defined precisely in terms of the validation rules.

The character generator used for names is alphaNum, meaning we’ll only generate names with alphabetic letters and numbers. If you’re comfortable with regular expressions, you can think of validName as producing values matching [a-zA-Z0-9]+.

Let’s run some tests:

λ> check prop_valid_signup_form_succeeds
  ✓ <interactive> passed 100 tests.

Hooray, it works.

Negative Property Tests

In addition to the positive test, we’ll add negative tests for the name and age, respectively. Opposite to positive tests, our negative tests will only use invalid input data. We can then expect the form validation to always fail.

First, let’s test invalid names.

prop_invalid_name_fails = property $ do
  let genForm = SignupForm <$> invalidName <*> validAge ➊
  form <- forAll genForm

  case validateSignup form of ➋
    Failure (NameTooLong{} :| []) -> pure ()
    Failure (NameTooShort{} :| []) -> pure ()
    other -> do ➌
      annotateShow other
      failure

Similar to the positive property test, we define a generator genForm (➊). Note that we use invalidName instead of validName.

Again, we pattern match on the result of applying validateSignup (➋). In this case we expect failure. Both NameTooLong and NameTooShort are expected failures. If we get anything else, the test fails (➌).

The test for invalid age is similar, except we use the invalidAge generator, and expect only InvalidAge validation failures:

prop_invalid_age_fails = property $ do
  let genForm = SignupForm <$> validName <*> invalidAge
  form <- forAll genForm
  case validateSignup form of
    Failure (InvalidAge{} :| []) -> pure ()
    other -> do
      annotateShow other
      failure

The invalidName and invalidAge generators are also defined in terms of the validation rules (eq. 1), but with ranges ensuring no overlap with valid data:

invalidName :: Gen Text
invalidName =
  Gen.choice [mempty, Gen.text (Range.linear 51 100) Gen.alphaNum]

invalidAge :: Gen Int
invalidAge = Gen.integral (Range.linear minBound 17)

Let’s run our new property tests:

λ> check prop_invalid_name_fails
  ✓ <interactive> passed 100 tests.
λ> check prop_invalid_age_fails
  ✓ <interactive> passed 100 tests.

All good? Maybe not. The astute reader might have noticed a problem with one of our generators. We’ll get back to that later.

Accumulating All Failures

When validating the form data, we want all failures returned to the user posting the form, rather than returning only one at a time. The Validation type accumulates failures when combined with Applicative, which is exactly what we want. Yet, while the hard work is handled by Validation, we still need to test that we’re correctly combining validations in validateSignup.

We define a property test generating form data, where all fields are invalid (➊). It expects the form validation to fail, returning two failures (➋).

prop_two_failures_are_returned = property $ do
  let genForm = SignupForm <$> invalidName <*> invalidAge ➊
  form <- forAll genForm
  case validateSignup form of
    Failure failures | length failures == 2 -> pure () ➋
    other -> do
      annotateShow other
      failure

This property is weak. It states nothing about which failures should be returned. We could assert that the validation failures are equal to some expected list. But how do we know if the name is too long or too short? I’m sure you’d be less thrilled if we replicated all of the validation logic in this test.

Let’s define a slightly stronger property. We pattern match, extract the two failures (➊), and check that they’re not equal (➋).

prop_two_different_failures_are_returned = property $ do
  let genForm = SignupForm <$> invalidName <*> invalidAge
  form <- forAll genForm
  case validateSignup form of
    Failure (failure1 :| [failure2]) -> ➊
      failure1 /== failure2 ➋
    other -> do
      annotateShow other
      failure

We’re still not being specific about which failures should be returned. But unlike prop_two_failures_are_returned, this property at least makes sure there are no duplicate failures.

The Value of a Property

Is there a faulty behaviour that would slip past prop_two_different_failures_are_returned? Sure. The implementation could have a typo or copy-paste error, and always return NameTooLong failures, even if the name is too short. Does this mean our property is bad? Broken? Useless?

In itself, this property doesn’t give us strong confidence in the correctness of validateSignup. In conjunction with our other properties, however, it provides value. Together they make up a stronger test suite.

Let’s look at it in another way. What are the benefits of weaker properties over stronger ones? In general, weak properties are beneficial in that they are:

easier to define
likely to catch simple mistakes early
less coupled to the SUT

A small investment in a set of weak property tests might catch a lot of mistakes. While they won’t precisely specify your system and catch the trickiest of edge cases, their power-to-weight ratio is compelling. Moreover, a set of weak properties is better than no properties at all. If you can’t formulate the strong property you’d like, instead start simple. Lure out some bugs, and improve the strength and specificity of your properties over time.

Coming up with good properties is a skill. Practice, and you’ll get better at it.

Testing Generators

Remember how in Negative Property Tests we noted that there’s a problem? The issue is, we’re not covering all validation rules in our tests. But the problem is not in our property definitions. It’s in one of our generators, namely invalidAge. We’re now in a peculiar situation: we need to test our tests.

One way to test a generator is to define a property specifically testing the values it generates. For example, if we have a generator positive that is meant to generate only positive integers, we can define a property that asserts that all generated integers are positive:

    positive :: Gen Int
    positive = Gen.integral (Range.linear 1 maxBound)

    prop_integers_are_positive = property $ do
      n <- forAll positive
      assert (n >= 1)

We could use this technique to check that all values generated by validAge are valid. How about invalidAge? Can we check that it generates values such that all boundaries of our validation function are hit? No, not using this technique. Testing the correctness of a generator using a property can only find problems with individual generated values. It can’t perform assertions over all generated values. In that sense, it’s a local assertion.

Instead, we’ll find the generator problem by capturing statistics on the generated values and performing global assertions. Hedgehog, and a few other PBT frameworks, can measure the occurrences of user-defined labels. A label in Hedgehog is a Text value, declared with an associated condition. When Hedgehog runs the tests, it records the percentage of tests in which the condition evaluates to True. After the test run is complete, we’re presented with a listing of percentages per label.

We can even have Hedgehog fail the test unless a certain percentage is met. This way, we can declare minimum coverage requirements for the generators used in our property tests.
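To make the difference between local and global assertions concrete, here is a minimal sketch in plain Haskell, without Hedgehog. It is not how Hedgehog implements labels. The names coveragePercent, meetsCoverage, and sampleAges are hypothetical and only illustrate the idea: a label's percentage is computed over all generated values, and a minimum requirement is checked against that percentage.

```haskell
-- A hedged sketch of the idea behind coverage labels, not Hedgehog's code.

-- | Percentage of samples for which the label's condition holds.
coveragePercent :: (a -> Bool) -> [a] -> Double
coveragePercent condition samples
  | null samples = 0
  | otherwise =
      100 * fromIntegral (length (filter condition samples))
          / fromIntegral (length samples)

-- | A global assertion: does the label reach the required minimum percentage?
meetsCoverage :: Double -> (a -> Bool) -> [a] -> Bool
meetsCoverage minimumPercent condition samples =
  coveragePercent condition samples >= minimumPercent

-- | Hypothetical sample of generated ages (made up for illustration).
sampleAges :: [Int]
sampleAges = [3, 5, 9, 12, 14, 16, 17, 160]
```

With this sample, meetsCoverage 5 (>= 151) sampleAges holds, while a stricter requirement such as meetsCoverage 50 (>= 151) sampleAges does not. No single value can tell us that; only the whole set of samples can.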

Adding Coverage Checks

Let’s check that we generate values covering enough cases, based on the validation rules in eq. 1. In prop_invalid_age_fails, we use cover to ensure we generate values outside the boundaries of valid ages. 5% is enough for each, but realistically they could both get close to 50%.

    prop_invalid_age_fails = property $ do
      let genForm = SignupForm <$> validName <*> invalidAge
      form <- forAll genForm
      cover 5 "too young" (formAge form <= 17)
      cover 5 "too old" (formAge form >= 151)
      case validateSignup form of
        Failure (InvalidAge{} :| []) -> pure ()
        other -> do
          annotateShow other
          failure

Let’s run some tests again.

    λ> check prop_invalid_age_fails
      ✗ failed after 100 tests.
        too young 100% ████████████████████ ✓ 5%
      ⚠ too old     0% ···················· ✗ 5%

        ┏━━ test/Validation/V1Test.hs ━━━
     63 ┃ prop_invalid_age_fails = property $ do
     64 ┃   let genForm = SignupForm <$> validName <*> invalidAge
     65 ┃   form <- forAll genForm
     66 ┃   cover 5 "too young" (formAge form <= 17)
     67 ┃   cover 5 "too old" (formAge form >= 151)
        ┃   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        ┃   │ Failed (0% coverage)
     68 ┃   case validateSignup form of
     69 ┃     Failure (InvalidAge{} :| []) -> pure ()
     70 ┃     other -> do
     71 ┃       annotateShow other
     72 ┃       failure

        Insufficient coverage.

100% too young and 0% too old. The invalidAge generator is clearly not good enough. Let’s have a look at its definition again:

    invalidAge :: Gen Int
    invalidAge = Gen.integral (Range.linear minBound 17)

We’re only generating invalid ages between the minimum bound of Int and 17. Let’s fix that, by using Gen.choice and another generator for ages greater than 150:

    invalidAge :: Gen Int
    invalidAge = Gen.choice
      [ Gen.integral (Range.linear minBound 17)
      , Gen.integral (Range.linear 151 maxBound)
      ]

Running tests again, the coverage check stops complaining. But there’s another problem:

    λ> check prop_invalid_age_fails
      ✗ failed at test/Validation/V1Test.hs:75:7 after 3 tests and 2 shrinks.
        too young 67% █████████████▎······ ✓ 5%
      ⚠ too old    0% ···················· ✗ 5%

        ┏━━ test/Validation/V1Test.hs ━━━
     66 ┃ prop_invalid_age_fails = property $ do
     67 ┃   let genForm = SignupForm <$> validName <*> invalidAge
     68 ┃   form <- forAll genForm
        ┃   │ SignupForm { formName = "a" , formAge = 151 }
     69 ┃   cover 5 "too young" (formAge form <= 17)
     70 ┃   cover 5 "too old" (formAge form >= 151)
     71 ┃   case validateSignup form of
     72 ┃     Failure (InvalidAge{} :| []) -> pure ()
     73 ┃     other -> do
     74 ┃       annotateShow other
        ┃       │ Success Signup { name = "a" , age = 151 }
     75 ┃       failure
        ┃       ^^^^^^^

OK, we have an actual bug. When the age is 151 or greater, the form is deemed valid. It should cause a validation failure. Looking closer at the implementation, we see that a pattern guard is missing the upper bound check:

    validateAge age'
      | age' >= 18 = Success (fromIntegral age')
      | otherwise = Failure (pure (InvalidAge age'))

If we change it to age' >= 18 && age' <= 150, and rerun the tests, they pass.

    λ> check prop_invalid_age_fails
      ✓ passed 100 tests.
        too young 53% ██████████▌········· ✓ 5%
        too old   47% █████████▍·········· ✓ 5%

We’ve fixed the bug.
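For reference, the corrected guard can be sketched in isolation. This is a simplified stand-in and not the article's implementation: it uses Either instead of Validation, a hypothetical single-constructor AgeError type, and returns the age as a plain Int.

```haskell
-- Stand-in error type; the article's SignupError has more constructors.
newtype AgeError = InvalidAge Int
  deriving (Eq, Show)

-- | Accept ages in the closed interval [18, 150]; reject everything else.
-- Either is used here only as a simple substitute for Validation.
validateAge :: Int -> Either AgeError Int
validateAge age
  | age >= 18 && age <= 150 = Right age
  | otherwise = Left (InvalidAge age)
```

Both boundaries are now checked by the same guard, so 18 and 150 are accepted while 17 and 151 are rejected.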

Measuring and declaring requirements on coverage is a powerful tool in Hedgehog. It gives us visibility into the generative tests we run, making it practical to debug generators. It ensures our tests meet our coverage requirements, even as implementation and tests evolve over time.

From Ages to Birth Dates

So far, our efforts have been successful. We’ve fixed real issues in both implementation and tests. Management is pleased. They’re now asking us to modify the signup system, and use our testing skills to ensure quality remains high.

Instead of entering their age, users will enter their birth date. Let’s suppose this information is needed for something important, like sending out birthday gifts. The form validation function must be modified to check, based on the supplied birth date, if the user signing up is old enough.

First, we import the Calendar module from the time package:

import Data.Time.Calendar

Next, we modify the SignupForm data type to carry a formBirthDate of type Day, rather than an Int.

    data SignupForm = SignupForm
      { formName :: Text
      , formBirthDate :: Day
      } deriving (Eq, Show)

And we make the corresponding change to the Signup data type:

    data Signup = Signup
      { name :: Text
      , birthDate :: Day
      } deriving (Eq, Show)

We’ve also been requested to improve the validation errors. Instead of just InvalidAge, we define three constructors for various invalid birthdates:

    data SignupError
      = NameTooShort Text
      | NameTooLong Text
      | TooYoung Day
      | TooOld Day
      | NotYetBorn Day
      deriving (Eq, Show)

Finally, we need to modify the validateSignup function. Here, we’re faced with an important question. How should the validation function obtain today’s date?

Keeping Things Deterministic

We could make validateSignup a non-deterministic action, which in Haskell would have the following type signature:

    validateSignup
      :: SignupForm -> IO (Validation (NonEmpty SignupError) Signup)

Note the use of IO. It means we could retrieve the current time from the system clock, and extract the Day value representing today’s date. But this approach has severe drawbacks.

If validateSignup uses IO to retrieve the current date, we can’t test it with other dates. What if there’s a bug that causes validation to behave incorrectly only on a particular date? We’d have to run the tests on that specific date to trigger it. If we introduce a bug, we want to know about it immediately. Not weeks, months, or even years after the bug was introduced. Furthermore, if we find such a bug with our tests, we can’t easily reproduce it on another date. We’d have to rewrite the implementation code to trigger the bug again.

Instead of using IO, we’ll use a simple technique for keeping our function pure: take all the information the function needs as arguments. In the case of validateSignup, we’ll pass today’s date as the first argument:

    validateSignup
      :: Day -> SignupForm -> Validation (NonEmpty SignupError) Signup
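The production code still needs the real date from somewhere. One way to keep that concern at the edge of the program is a thin IO wrapper around the pure function. The following is a minimal sketch using the time package; withToday and isInFuture are hypothetical names, and reading the date in UTC rather than the user's time zone is an assumption made for brevity.

```haskell
import Data.Time.Calendar (Day, fromGregorian)
import Data.Time.Clock (getCurrentTime, utctDay)

-- | Hypothetical helper: read the current UTC date once, at the edge of the
-- program, and hand it to a pure function that takes today's date.
withToday :: (Day -> a) -> IO a
withToday f = f . utctDay <$> getCurrentTime

-- | A pure function of today's date, standing in for validateSignup.
isInFuture :: Day -> Day -> Bool
isInFuture today date = date > today

-- | In tests we can pick any "today" we like, for example a leap day.
exampleToday :: Day
exampleToday = fromGregorian 2020 2 29
```

Application code would call something like withToday (\today -> validateSignup today form), while tests call the pure function directly with generated dates.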

Again, let’s not worry about the implementation just yet. We’ll focus on the tests.

Generating Dates

In order to test the new validateSignup implementation, we need to generate Day values. We’re going to use a few functions from a separate module called Data.Time.Gen, previously written by some brilliant developer in our team. Let’s look at their type signatures. The implementations are not very interesting.

The generator, day, generates a day within the given range:

day :: Range Day -> Gen Day

A day range is constructed with linearDay:

linearDay :: Day -> Day -> Range Day

Alternatively, we might use exponentialDay:

exponentialDay :: Day -> Day -> Range Day

The linearDay and exponentialDay range functions are analogous to Hedgehog’s linear and exponential ranges for integral numbers.

To use the generator functions from Data.Time.Gen, we first add an import, qualified as Time:

import qualified Data.Time.Gen as Time

Next, we define a generator anyDay:

    anyDay :: Gen Day
    anyDay =
      let low = fromGregorian 1900 1 1
          high = fromGregorian 2100 12 31
      in Time.day (Time.linearDay low high)

The date range [1900-01-01, 2100-12-31] is arbitrary. We could pick any centuries we like, provided the time package supports the range. But why not make it somewhat realistic?

Rewriting Existing Properties

Now, it’s time to rewrite our existing property tests. Let’s begin with the one testing that validating a form with all valid data succeeds:

    prop_valid_signup_form_succeeds = property $ do
      today <- forAll anyDay ➊
      let genForm = SignupForm <$> validName <*> validBirthDate today
      form <- forAll genForm ➋

      case validateSignup today form of
        Success{} -> pure ()
        Failure failure' -> do
          annotateShow failure'
          failure

A few new things are going on here. We’re generating a date representing today (➊), and generating a form with a birth date based on today’s date (➋). Generating today’s date, we’re effectively time travelling and running the form validation on that date. This means our validBirthDate generator must know which date is today, in order to pick a valid birth date. We pass today’s date as a parameter, and generate a date within the range of 18 to 150 years earlier:

    validBirthDate :: Day -> Gen Day
    validBirthDate today = do
      n <- Gen.integral (Range.linear 18 150)
      pure (n `yearsBefore` today)

We define the helper function yearsBefore in the test suite. It offsets a date backwards in time by a given number of years:

    yearsBefore :: Integer -> Day -> Day
    yearsBefore years = addGregorianYearsClip (negate years)

The Data.Time.Calendar module exports the addGregorianYearsClip function. It adds a number of years, clipping February 29th (leap days) to February 28th where necessary.
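As a small worked example of the clipping behaviour, the sketch below applies yearsBefore to a leap day and to an ordinary day. The two example dates are chosen arbitrarily for illustration.

```haskell
import Data.Time.Calendar (Day, addGregorianYearsClip, fromGregorian)

-- | Same helper as in the test suite: offset a date backwards by whole years.
yearsBefore :: Integer -> Day -> Day
yearsBefore years = addGregorianYearsClip (negate years)

-- | 2020-02-29 is a leap day, but 2002 is not a leap year, so the result
-- is clipped to 2002-02-28.
leapDayExample :: Day
leapDayExample = 18 `yearsBefore` fromGregorian 2020 2 29

-- | An ordinary date keeps its month and day: 2002-06-15.
ordinaryExample :: Day
ordinaryExample = 18 `yearsBefore` fromGregorian 2020 6 15
```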

Let’s run tests:

    λ> check prop_valid_signup_form_succeeds
      ✓ passed 100 tests.

Let’s move on to the next property, checking that invalid birth dates do not pass validation. Here, we use the same pattern as before, generating today’s date, but use invalidBirthDate instead:

    prop_invalid_age_fails = property $ do
      today <- forAll anyDay
      form <- forAll (SignupForm <$> validName <*> invalidBirthDate today)

      cover 5 "not yet born" (formBirthDate form > today)
      cover 5 "too young" (formBirthDate form > 18 `yearsBefore` today)
      cover 5 "too old" (formBirthDate form < 150 `yearsBefore` today)

      case validateSignup today form of
        Failure (TooYoung{} :| []) -> pure ()
        Failure (NotYetBorn{} :| []) -> pure ()
        Failure (TooOld{} :| []) -> pure ()
        other -> do
          annotateShow other
          failure

Notice that we’ve also adjusted the coverage checks. There’s a new label, “not yet born,” for birth dates in the future. Running tests, we see the label in action:

    λ> check prop_invalid_age_fails
      ✓ passed 100 tests.
        not yet born 18% ███▌················ ✓ 5%
        too young    54% ██████████▊········· ✓ 5%
        too old      46% █████████▏·········· ✓ 5%

Good coverage, all tests passing. We’re not quite done, though. There’s a particular set of dates that we should be sure to cover: “today” dates and birth dates that are close to, or exactly, 18 years apart.

Within our current property test for invalid ages, we’re only sure that generated birth dates include at least 5% too old, and at least 5% too young. We don’t know how far away from the “18 years” validation boundary they are.

We could tweak our existing generators to produce values close to that boundary. Given a date T, exactly 18 years before today’s date, then:

invalidBirthDate would need to produce birth dates just after but not equal to T
validBirthDate would need to produce birth dates just before or equal to T

There’s another option, though. Instead of defining separate properties for valid and invalid ages, we’ll use a single property for all cases. This way, we only need a single generator.

A Single Validation Property

In Building on developers’ intuitions to create effective property-based tests, John Hughes talks about “one property to rule them all.” Similarly, we’ll define a single property prop_validates_age for birth date validation. We’ll base our new property on prop_invalid_age_fails, but generalize to cover both positive and negative tests:

    prop_validates_age = property $ do
      today <- forAll anyDay
      form <- forAll (SignupForm <$> validName <*> anyBirthDate today) ➊

      let tooYoung = formBirthDate form > 18 `yearsBefore` today ➋
          notYetBorn = formBirthDate form > today
          tooOld = formBirthDate form < 150 `yearsBefore` today
          oldEnough = formBirthDate form <= 18 `yearsBefore` today
          exactly age = formBirthDate form == age `yearsBefore` today
          closeTo age =
            let diff' =
                  diffDays (formBirthDate form) (age `yearsBefore` today)
            in abs diff' `elem` [0 .. 2]

      cover 10 "too young" tooYoung
      cover 1 "not yet born" notYetBorn
      cover 1 "too old" tooOld

      cover 20 "old enough" oldEnough ➌
      cover 1 "exactly 18" (exactly 18)
      cover 5 "close to 18" (closeTo 18)

      case validateSignup today form of ➍
        Failure (NotYetBorn{} :| []) | notYetBorn -> pure ()
        Failure (TooYoung{} :| []) | tooYoung -> pure ()
        Failure (TooOld{} :| []) | tooOld -> pure ()
        Success{} | oldEnough -> pure ()
        other -> annotateShow other >> failure

There are a few new things going on here:

➊ Instead of generating exclusively invalid or valid birth dates, we’re now generating any birth date based on today’s date
➋ The boolean expressions are used both in coverage checks and in asserting, so we separate them in a let binding
➌ We add three new labels for the valid cases
➍ Finally, we assert on both valid and invalid cases, based on the same expressions used in coverage checks

Note that our assertions are more specific than in prop_invalid_age_fails. The failure cases only pass if the corresponding label expressions are true. The oldEnough case covers all valid birth dates. Any result other than the four expected cases is considered incorrect.

The anyBirthDate generator is based on today’s date:

    anyBirthDate :: Day -> Gen Day
    anyBirthDate today =
      let ➊
          inPast range = do
            years <- Gen.integral range
            pure (years `yearsBefore` today)
          inFuture = do
            years <- Gen.integral (Range.linear 1 5)
            pure (addGregorianYearsRollOver years today)
          daysAroundEighteenthYearsAgo = do
            days <- Gen.integral (Range.linearFrom 0 (-2) 2)
            pure (addDays days (18 `yearsBefore` today))
      in ➋
          Gen.frequency
            [ (5, inPast (Range.exponential 1 150))
            , (1, inPast (Range.exponential 151 200))
            , (2, inFuture)
            , (2, daysAroundEighteenthYearsAgo)
            ]

We define helper functions (➊) for generating dates in the past, in the future, and close to 18 years ago. Using those helper functions, we combine four generators, with different date ranges, using a Gen.frequency distribution (➋). The weights we use are selected to give us good coverage.

Let’s run some tests:

    λ> check prop_validates_age
      ✓ passed 100 tests.
        too young    62% ████████████▍······· ✓ 10%
        not yet born 20% ████················ ✓ 1%
        too old       4% ▊··················· ✓ 1%
        old enough   38% ███████▌············ ✓ 20%
        exactly 18   16% ███▏················ ✓ 1%
        close to 18  21% ████▏··············· ✓ 5%

Looks good! We’ve gone from testing positive and negative cases separately, to instead have a single property covering all cases, based on a single generator. It’s now easier to generate values close to the valid/invalid boundary of our SUT, i.e. around 18 years from today’s date.

February 29th

For the fun of it, let’s run some more tests. We’ll crank it up to 20000.

    λ> check (withTests 20000 prop_validates_age)
      ✗ failed at test/Validation/V3Test.hs:141:64 after 17000 tests and 25 shrinks.
        too young    60% ████████████········ ✓ 10%
        not yet born 20% ███▉················ ✓ 1%
        too old       9% █▉·················· ✓ 1%
        old enough   40% ███████▉············ ✓ 20%
        exactly 18   14% ██▊················· ✓ 1%
        close to 18  21% ████▎··············· ✓ 5%

        ┏━━ test/Validation/V3Test.hs ━━━
    114 ┃ prop_validates_age = property $ do
    115 ┃   today <- forAll anyDay
        ┃   │ 1956 - 02 - 29
    116 ┃   form <- forAll (SignupForm <$> validName <*> anyBirthDate today) ➊
        ┃   │ SignupForm { formName = "aa" , formBirthDate = 1938 - 03 - 01 }
    117 ┃
    118 ┃   let tooYoung = formBirthDate form > 18 `yearsBefore` today ➋
    119 ┃       notYetBorn = formBirthDate form > today
    120 ┃       tooOld = formBirthDate form < 150 `yearsBefore` today
    121 ┃       oldEnough = formBirthDate form <= 18 `yearsBefore` today
    122 ┃       exactlyEighteen = formBirthDate form == 18 `yearsBefore` today
    123 ┃       closeToEighteen =
    124 ┃         let diff' =
    125 ┃               diffDays (formBirthDate form) (18 `yearsBefore` today)
    126 ┃         in abs diff' `elem` [0 .. 2]
    127 ┃
    128 ┃   cover 10 "too young" tooYoung
    129 ┃   cover 1 "not yet born" notYetBorn
    130 ┃   cover 1 "too old" tooOld
    131 ┃
    132 ┃   cover 20 "old enough" oldEnough ➌
    133 ┃   cover 1 "exactly 18" exactlyEighteen
    134 ┃   cover 5 "close to 18" closeToEighteen
    135 ┃
    136 ┃   case validateSignup today form of ➍
    137 ┃     Failure (NotYetBorn{} :| []) | notYetBorn -> pure ()
    138 ┃     Failure (TooYoung{} :| []) | tooYoung -> pure ()
    139 ┃     Failure (TooOld{} :| []) | tooOld -> pure ()
    140 ┃     Success{} | oldEnough -> pure ()
    141 ┃     other -> annotateShow other >> failure
        ┃     │ Success Signup { name = "aa" , birthDate = 1938 - 03 - 01 }
        ┃     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Failure! Chaos! What’s going on here? Let’s examine the test case:

Today’s date is 1956-02-29
The birth date is 1938-03-01
The validation function considers this valid (it returns a Success value)
The test considers this invalid (oldEnough is False)

This means that when the validation runs on a leap day, February 29th, and the person would turn 18 years old the day after (on March 1st), the validation function incorrectly considers the person old enough. We’ve found a bug.

Test Count and Coverage

Two things led us to find this bug:

Most importantly, that we generate today’s date and pass it as a parameter. Had we used the actual date, retrieved with an IO action, we’d only be able to find this bug every 1461 days. Pure functions are easier to test.
That we ran more tests than the default of 100. We might not have found this bug until much later, when the generated dates happened to trigger this particular bug. In fact, running 20000 tests does not always trigger the bug.

Our systems are often too complex to be tested exhaustively. Let’s use our form validation as an example. Between 1900-01-01 and 2100-12-31 there are 73,413 days. Selecting today’s date and the birth date from that range, we have more than five billion combinations. Running that many Hedgehog tests in GHCi on my laptop (based on some quick benchmarks) would take about a month. And this is a simple pure validation function!
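The numbers above can be checked with a few lines of Haskell and the time package. This is only a sketch of the arithmetic; the names daysInRange, combinations, and leapCycleDays are made up for this example.

```haskell
import Data.Time.Calendar (diffDays, fromGregorian)

-- | Days between the first and last date of the range used by anyDay: 73,413.
daysInRange :: Integer
daysInRange = diffDays (fromGregorian 2100 12 31) (fromGregorian 1900 1 1)

-- | Every pairing of today's date with a birth date from the same range,
-- which comes to 5,389,468,569.
combinations :: Integer
combinations = daysInRange * daysInRange

-- | Days in a regular four-year leap cycle, i.e. the usual distance between
-- two consecutive leap days: 1461.
leapCycleDays :: Integer
leapCycleDays = 4 * 365 + 1
```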

To increase coverage, even if it’s not going to be exhaustive, we can increase the number of tests we run. But how many should we run? On a continuous integration server we might be able to run more than we do locally, but we still want to keep a tight feedback loop. And what if our generators never produce inputs that reveal existing bugs, regardless of the number of tests we run?

If we can’t test exhaustively, we need to ensure our generators cover interesting combinations of inputs. We need to carefully design and measure our tests and generators, based on the edge cases we already know of, as well as the ones that we discover over time. PBT without measuring coverage easily turns into a false sense of security.

In the case of our leap day bug, we can catch it with fewer tests, and on every test run. We need to make sure we cover leap days, used both as today’s date and as the birth date, even with a low number of tests.

Covering Leap Days

To generate inputs that cover certain edge cases, we combine specific generators using Gen.frequency:

    (today, birthDate') <- forAll
      (Gen.frequency
        [ (5, anyDayAndBirthDate) ➊

        , (2, anyDayAndBirthDateAroundYearsAgo 18) ➋
        , (2, anyDayAndBirthDateAroundYearsAgo 150)

        , (1, leapDayAndBirthDateAroundYearsAgo 18) ➌
        , (1, leapDayAndBirthDateAroundYearsAgo 150)

        , (1, commonDayAndLeaplingBirthDateAroundYearsAgo 18) ➍
        , (1, commonDayAndLeaplingBirthDateAroundYearsAgo 150)
        ]
      )

Arbitrary values for today’s date and the birth date are drawn most frequently (➊), with a weight of 5. Next, with weights of 2, are generators for cases close to the boundaries of the validation function (➋). Finally, with weights of 1, are generators for special cases involving leap days as today’s date (➌) and leap days as birth date (➍).

Note that these generators return pairs of dates. For most of these generators, there’s a strong relation between today’s date and the birth date. For example, we can’t first generate any today’s date, pass that into a generator function, and expect it to always generate a leap day that occurred 18 years ago. Such a generator would have to first generate the leap day and then today’s date.

Let’s define the generators. The first one, anyDayAndBirthDate, picks any today’s date within a wide date range. It also picks a birth date from an even wider date range, resulting in some future birth dates and some ages above 150.

    anyDayAndBirthDate :: Gen (Day, Day)
    anyDayAndBirthDate = do
      today <- Time.day
        (Time.linearDay (fromGregorian 1900 1 1)
                        (fromGregorian 2020 12 31)
        )
      birthDate' <- Time.day
        (Time.linearDay (fromGregorian 1850 1 1)
                        (fromGregorian 2050 12 31)
        )
      pure (today, birthDate')

Writing automated tests with a hard-coded year 2020 might scare you. Won’t these tests fail when run in the future? No, not these tests. Remember, the validation function is deterministic. We control today’s date. The actual date on which we run these tests doesn’t matter.

Similar to the previous generator is anyDayAndBirthDateAroundYearsAgo. First, it generates any date as today’s date (➊). Next, it generates an arbitrary date approximately some number of years ago (➋), where the number of years is an argument of the generator.

    anyDayAndBirthDateAroundYearsAgo :: Integer -> Gen (Day, Day)
    anyDayAndBirthDateAroundYearsAgo years = do
      today <- Time.day ➊
        (Time.linearDay (fromGregorian 1900 1 1)
                        (fromGregorian 2020 12 31)
        )
      birthDate' <- addingApproxYears (negate years) today ➋
      pure (today, birthDate')

The addingApproxYears generator adds a number of years to a date, and offsets it between two days back and two days forward in time.

    addingApproxYears :: Integer -> Day -> Gen Day
    addingApproxYears years today = do
      days <- Gen.integral (Range.linearFrom 0 (-2) 2)
      pure (addDays days (addGregorianYearsRollOver years today))

The last two generators used in our frequency distribution cover leap day edge cases. First, let’s define the leapDayAndBirthDateAroundYearsAgo generator. It generates a leap day used as today’s date, and a birth date close to the given number of years ago.

    leapDayAndBirthDateAroundYearsAgo :: Integer -> Gen (Day, Day)
    leapDayAndBirthDateAroundYearsAgo years = do
      today <- leapDay (Range.linear 1904 2020)
      birthDate' <- addingApproxYears (negate years) today
      pure (today, birthDate')

The leapDay generator uses mod to only generate years divisible by 4 and constructs dates on February 29th. That alone isn’t enough to only generate valid leap days, though. Years divisible by 100 but not by 400 are not leap years. To keep the generator simple, we discard those years using the already existing isLeapDay predicate as a filter.

    leapDay :: Range Integer -> Gen Day
    leapDay yearRange = Gen.filter isLeapDay $ do
      year <- Gen.integral yearRange
      pure (fromGregorian (year - year `mod` 4) 2 29)

In general, we should be careful about discarding generated values using filter. If we discard too much, Hedgehog gives up and complains loudly. In this particular case, discarding a few generated dates is fine. Depending on the year range we pass it, we might not discard any date.
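To see what the filter actually discards, here is a pure sketch of the same construction without Hedgehog. The definition of isLeapDay is an assumption, since only its name is mentioned above, and leapDayCandidate and leapDaysFrom are hypothetical helpers. The sketch relies on fromGregorian clipping invalid dates, so February 29th in a non-leap year becomes February 28th.

```haskell
import Data.Time.Calendar (Day, fromGregorian, toGregorian)

-- | Assumed definition of the predicate: is this date a February 29th?
isLeapDay :: Day -> Bool
isLeapDay date =
  let (_, month, dayOfMonth) = toGregorian date
  in month == 2 && dayOfMonth == 29

-- | Round the year down to a multiple of 4 and aim for February 29th.
-- In a non-leap year such as 1900, fromGregorian clips to February 28th.
leapDayCandidate :: Integer -> Day
leapDayCandidate year = fromGregorian (year - year `mod` 4) 2 29

-- | Candidates for a list of years, keeping only real leap days.
leapDaysFrom :: [Integer] -> [Day]
leapDaysFrom = filter isLeapDay . map leapDayCandidate
```

With years between 1904 and 2020, every candidate lands in a real leap year, so nothing is discarded. A range that includes 1900 to 1903 or 2100 to 2103 would produce candidates in 1900 or 2100, which the filter removes.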

Finally, we define the commonDayAndLeaplingBirthDateAroundYearsAgo generator. It first generates a leap day used as the birth date, and then a today’s date approximately the given number of years after the birth date.

    commonDayAndLeaplingBirthDateAroundYearsAgo :: Integer -> Gen (Day, Day)
    commonDayAndLeaplingBirthDateAroundYearsAgo years = do
      birthDate' <- leapDay (Range.linear 1904 2020)
      today <- addingApproxYears years birthDate'
      pure (today, birthDate')

That’s it for the generators. Now, how do we know that we’re covering the edge cases well enough? With coverage checks!

      cover 5 ➊
        "close to 18, validated on common day"
        (closeTo 18 && not (isLeapDay today))
      cover 1
        "close to 18, validated on leap day"
        (closeTo 18 && isLeapDay today)

      cover 5 ➋
        "close to 150, validated on common day"
        (closeTo 150 && not (isLeapDay today))
      cover 1
        "close to 150, validated on leap day"
        (closeTo 150 && isLeapDay today)

      cover 5 ➌
        "exactly 18 today, born on common day"
        (exactly 18 && not (isLeapDay birthDate'))
      cover ➍
        1
        "legally 18 today, born on leap day"
        ( isLeapDay birthDate'
        && (addGregorianYearsRollOver 18 birthDate' == today)
        )

We add new checks to the property test, checking that we hit both leap day and regular day cases around the 18th birthday (➊) and the 150th birthday (➋). Notice that we had similar checks before, but we were not discriminating between leap days and common days.

Finally, we check the coverage of two leap day scenarios that can occur when a person legally turns 18: a person born on a common day turning 18 on a leap day (➌), and a leapling turning 18 on a common day (➍).

Running the modified property test, we get the leap day counter-example every time, even with as few as a hundred tests. For example, we might see today’s date being 1904-02-29 and the birth date being 1886-03-01. The validation function deems the person old enough. Again, this is incorrect.

Now that we can quickly and reliably reproduce the failing example, we are in a great position to find the error. While we could use a fixed seed to reproduce the particular failing case from the run of 20000 tests, we are now more confident that the property test would catch future leap day-related bugs, if we were to introduce new ones. Digging into the implementation, we find that the culprit is a boolean expression in a pattern guard:

birthDate' <= addGregorianYearsRollOver (-18) today

The use of addGregorianYearsRollOver together with adding a negative number of years is the problem, rolling over to March 1st instead of clipping to February 28th. Instead, we should use addGregorianYearsClip:

birthDate' <= addGregorianYearsClip (-18) today
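The difference is easy to see on the counter-example from the failing test run. The sketch below evaluates both boundary computations for today's date 1956-02-29 and the birth date 1938-03-01. The names rollOverBoundary, clipBoundary, oldEnoughBuggy, and oldEnoughFixed are made up for this illustration.

```haskell
import Data.Time.Calendar
  (Day, addGregorianYearsClip, addGregorianYearsRollOver, fromGregorian)

today, birthDate :: Day
today = fromGregorian 1956 2 29
birthDate = fromGregorian 1938 3 1

-- | Buggy boundary: 1938-02-29 does not exist, so it rolls over to 1938-03-01.
rollOverBoundary :: Day
rollOverBoundary = addGregorianYearsRollOver (-18) today

-- | Fixed boundary: clips to 1938-02-28.
clipBoundary :: Day
clipBoundary = addGregorianYearsClip (-18) today

-- | The guard expression evaluated with each boundary.
oldEnoughBuggy, oldEnoughFixed :: Bool
oldEnoughBuggy = birthDate <= rollOverBoundary -- True: wrongly accepted
oldEnoughFixed = birthDate <= clipBoundary     -- False: correctly rejected
```

The clipped boundary also matches yearsBefore in the test suite, so the implementation and the property now agree on where the 18-year boundary lies.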

Running 100 tests again, we see that they all pass, and that our coverage requirements are met.

    λ> Hedgehog.check prop_validates_age
      ✓ passed 100 tests.
        too young                             17% ███▍················ ✓ 10%
        not yet born                           7% █▍·················· ✓ 1%
        too old                               19% ███▊················ ✓ 1%
        old enough                            83% ████████████████▌··· ✓ 20%
        close to 18, validated on common day  30% ██████·············· ✓ 5%
        close to 18, validated on leap day     2% ▍··················· ✓ 1%
        close to 150, validated on common day 31% ██████▏············· ✓ 5%
        close to 150, validated on leap day    6% █▏·················· ✓ 1%
        exactly 18 today, born on common day  17% ███▍················ ✓ 5%
        legally 18 today, born on leap day     5% █··················· ✓ 1%

Summary

In this tutorial, we started with a simple form validation function, checking the name and age of a person signing up for an online service. We defined property tests for positive and negative tests, learned how to test generators with coverage checks, and found bugs in both the test suite and the implementation.

When requirements changed, we had to start working with dates. In order to keep the validation function deterministic, we had to pass in today’s date. This enabled us to simulate the validation running on any date, in combination with any reported birth date, and trigger bugs that could otherwise take years to find, if ever. Had we not made it deterministic, we would likely not have found the leap day bug later on.

To generate inputs that sufficiently test the validation function’s boundaries, we rewrote our separate positive and negative properties into a single property, and used coverage checks to ensure the quality of our generators. Choosing between multiple disjoint properties and a single, more complicated property is a hard trade-off.

With multiple properties, for example split between positive and negative tests, both generators and assertions can be simpler and more targeted. On the other hand, you run a risk of missing certain inputs. The set of properties might not cover the entire space of inputs. Furthermore, performing coverage checks across multiple properties, using multiple targeted generators, can be problematic.

Ensuring coverage of generators in a single property is easier. You might even get away with a naive generator, depending on the system you’re testing. If not, you’ll need to combine more targeted generators, for example with weighted probabilities. The drawback of using a single property is that the assertion not only becomes more complicated, it’s also likely to mirror the implementation of the SUT. As we saw with our single property testing the validation function, the assertion duplicated the validation rules. You might be able to reuse the coverage expressions in assertions, but still, there’s a strong coupling.

The choice between single or multiple properties comes down to how you want to cover the boundaries of the SUT. Ultimately, both approaches can achieve the same coverage, in different ways. They both suffer from the classic problem of a test suite mirroring the system it’s testing.

Finally, running a larger number of tests, we found a bug related to leap days. Again, without having made the validation function deterministic, this could’ve only been found on a leap day. We further refined our generators to cover leap day cases, and found the bug reliably with as few as 100 tests. The bug was easy to find and fix when we had the inputs pointing directly towards it.

That’s it for this tutorial. Thanks for reading, and happy property testing and time travelling!

Property-Based Testing in a Screencast Editor, Case Study 3: Integration Testing

2019-06-02T00:00:00+02:00

In the last article we looked at how Komposition automatically classifies moving and still segments in imported video media, and how I went from ineffective testing by eye to finding curious bugs using property-based testing (PBT). If you haven’t read it, or its preceding posts, I encourage you to check them out first.

This is the final case study in the “Property-Based Testing in a Screencast Editor” series. It covers property-based integration testing and its value during aggressive refactoring work within Komposition.

A History of Two Stacks

In Komposition, a project’s state is represented using an in-memory data structure. It contains the hierarchical timeline, the focus, import and render settings, project storage file paths, and more. To let users navigate backwards and forwards in their history of project edits, for example when they have made a mistake, Komposition supplies undo and redo commands.

The undo/redo history was previously implemented as a data structure recording project states, composed of:

a current state variable
a stack of previous states
a stack of possible future states

The undo/redo history data structure held entire project state values. Each undoable and redoable user action created a new state value. Let’s look a bit closer at how this worked.

Performing Actions

When a user performed an undoable/redoable action, the undo/redo history would:

push the previous state onto the undo stack
perform the action and replace the current state
discard all states in the redo stack

This can be visualized as in the following diagram, where the state d is being replaced with a new state h, and d being pushed onto the undo stack. The undo/redo history to the left of the dividing line is the original, and the one to the right is the resulting history.

Again, note that performing new actions discarded all states in the redo stack.

Undoing Actions

When the user chose to undo an action, the undo/redo history would:

pop the undo stack and use that state as the current state
push the previous state onto the redo stack

The following diagram shows how undoing the last performed action’s resulting state, d, pushes d onto the redo stack, and pops c from the undo stack to use that as the current state.

Redoing Actions

When the user chose to redo an action, the undo/redo history would:

pop the redo stack and use that state as the current state
push the previous state onto the undo stack

The last diagram shows how redoing, recovering a previously undone state, pops g from the redo stack to use that as the current state, and pushes the previous state d onto the undo stack.

Note that not all user actions in Komposition are undoable/redoable. Actions like navigating the focus or zooming are not recorded in the history.
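The three operations described above can be sketched as a small data structure. This is only an illustration of the "two stacks of states" idea, not Komposition's actual code; the names History, initialHistory, perform, undo, and redo are hypothetical, and the project state is left as a type parameter.

```haskell
-- A minimal sketch of the "two stacks of states" history.
-- All names are hypothetical, not Komposition's actual API.

data History s = History
  { current   :: s
  , undoStack :: [s] -- most recent previous state first
  , redoStack :: [s] -- most recently undone state first
  } deriving (Eq, Show)

initialHistory :: s -> History s
initialHistory s = History s [] []

-- | Perform an action: push the current state onto the undo stack,
-- replace the current state, and discard the redo stack.
perform :: (s -> s) -> History s -> History s
perform action (History cur undos _) =
  History (action cur) (cur : undos) []

-- | Undo: pop the undo stack and push the current state onto the
-- redo stack. Returns Nothing when there is nothing to undo.
undo :: History s -> Maybe (History s)
undo (History _ [] _) = Nothing
undo (History cur (prev : undos) redos) =
  Just (History prev undos (cur : redos))

-- | Redo: pop the redo stack and push the current state onto the
-- undo stack. Returns Nothing when there is nothing to redo.
redo :: History s -> Maybe (History s)
redo (History _ _ []) = Nothing
redo (History cur undos (next : redos)) =
  Just (History next (cur : undos) redos)
```

Since every entry in both stacks is a complete state value, the memory use grows with the number of edits multiplied by the size of the state.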

Dealing With Performance Problems

While the “two stacks of states” algorithm was easy to understand and implement, it failed to meet my non-functional requirements. A screencast project composed of hundreds or thousands of small edits would consume gigabytes of disk space when stored, take tens of seconds to load from disk, and consume many gigabytes of RAM when in memory.

Now, you might think that my implementation was incredibly naive, and that the performance problems could be fixed with careful profiling and optimization. And you’d probably be right! I did consider going down that route, optimizing the code, time-windowing edits to compact history on the fly, and capping the history at some fixed size. Those would all be interesting pursuits, but in the end I decided to try something else.

Refactoring with Property-Based Integration Tests

Instead of optimizing the current stack-based implementation, I decided to implement the undo/redo history in terms of inverse actions. In this model, actions not only modify the project state, they also return another action, its inverse, that reverses the effects of the original action. Instead of recording a new project state data structure for each edit, the history only records descriptions of the actions themselves.
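To make the idea of inverse actions concrete, here is a minimal sketch on a heavily simplified timeline, assumed to be a plain list of clip names. The Action type and the runAction function are hypothetical and only illustrate the shape of the model: running an action yields both the new timeline and the action that reverses it.

```haskell
-- A minimal sketch of inverse actions on a simplified "timeline"
-- (a list of clip names). Types and names are hypothetical.

data Action
  = InsertAt Int String -- insert a clip at an index
  | DeleteAt Int        -- delete the clip at an index
  deriving (Eq, Show)

-- | Run an action, returning the new timeline together with the
-- inverse action, or Nothing if the index is out of bounds.
runAction :: Action -> [String] -> Maybe ([String], Action)
runAction (InsertAt i clip) timeline
  | i >= 0 && i <= length timeline =
      let (before, after) = splitAt i timeline
      in Just (before ++ clip : after, DeleteAt i)
  | otherwise = Nothing
runAction (DeleteAt i) timeline =
  case splitAt i timeline of
    (before, clip : after)
      | i >= 0 -> Just (before ++ after, InsertAt i clip)
    _ -> Nothing
```

In this sketch a history would only need to store values of type Action, which are small compared to entire project states.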

I realized early that introducing the new undo/redo history implementation in Komposition was not going to be a small task. It would touch the majority of command implementation code, large parts of the main application logic, and the project binary serialization format. What it wouldn’t affect, though, was the module describing user commands in abstract.

To provide a safety net for the refactoring, I decided to cover the undo/redo functionality with tests. As the user commands would stay the same throughout my modifications, I chose to test at that level, which can be characterized as integration-level testing. The tests run Komposition, including its top-level application control flow, but with the user interface and some other effects stubbed out. Making your application testable at this level is hard work, but the payoff can be huge.

With Komposition featuring close to twenty types of user commands, combined with a complex hierarchical timeline and navigation model, the combinatorial explosion of possible states was daunting. Relying on example-based tests to safeguard my work was not satisfactory. While PBT couldn’t cover the entire state space either, I was confident it would improve my chances of finding actual bugs.

Undo/Redo Tests

Before I began refactoring, I added tests for the inverse property of undoable/redoable actions. The first test focuses on undoing actions, and is structured as follows:

Generate an initial project and application state
Generate a sequence of undoable/redoable commands (wrapped in events)
Run the application with the initial state and the generated events
Run an undo command for each original command
Assert that the final timeline is equal to the initial timeline

Let’s look at the Haskell Hedgehog property test:

hprop_undo_actions_are_undoable = property $ do

  -- 1. Generate initial timeline and focus
  timelineAndFocus <- forAllWith showTimelineAndFocus $
    Gen.timelineWithFocus (Range.linear 0 10) Gen.parallel

  -- ... and initial application state
  initialState <- forAll (initializeState timelineAndFocus)

  -- 2. Generate a sequence of undoable/redoable commands
  events <- forAll $
    Gen.list (Range.exponential 1 100) genUndoableTimelineEvent

  -- 3. Run 'events' on the original state
  beforeUndos <- runTimelineStubbedWithExit events initialState

  -- 4. Run as many undo commands as undoable commands
  afterUndos <- runTimelineStubbedWithExit (undoEvent <$ events) beforeUndos

  -- 5. That should result in a timeline equal to the one we started
  -- with
  timelineToTree (initialState ^. currentTimeline)
    === timelineToTree (afterUndos ^. currentTimeline)

The second test, focusing on redoing actions, is structured very similarly to the previous test:

Generate an initial project and application state
Generate a sequence of undoable commands (wrapped in events)
Run the application with the initial state and the generated events
Run an undo command for each original command
Run a redo command for each original command
Assert that the final timeline is equal to the timeline before undoing actions

The test code is also very similar:

hprop_undo_actions_are_redoable = property $ do

  -- 1. Generate the initial timeline and focus
  timelineAndFocus <- forAllWith showTimelineAndFocus $
    Gen.timelineWithFocus (Range.linear 0 10) Gen.parallel

  -- ... and the initial application state
  initialState <- forAll (initializeState timelineAndFocus)

  -- 2. Generate a sequence of undoable/redoable commands
  events <- forAll $
    Gen.list (Range.exponential 1 100) genUndoableTimelineEvent

  -- 3. Run 'events' on the original state
  beforeUndos <- runTimelineStubbedWithExit events initialState

  -- 4. Run undo commands corresponding to all original commands
  afterRedos <-
    runTimelineStubbedWithExit (undoEvent <$ events) beforeUndos
    -- 5. Run redo commands corresponding to all original commands
    >>= runTimelineStubbedWithExit (redoEvent <$ events)

  -- 6. That should result in a timeline equal to the one we had
  -- before undoing actions
  timelineToTree (beforeUndos ^. currentTimeline)
    === timelineToTree (afterRedos ^. currentTimeline)

Note that these tests only assert on the equality of timelines, not entire project states, as undoable commands only operate on the timeline.

All Tests Passing, Everything Works

The undo/redo tests were written and run on the original stack-based implementation, kept around during a refactoring that took me two weeks of hacking during late nights and weekends, and finally run and passing with the new implementation based on inverse actions. Except for a few minimal adjustments to data types, these tests stayed untouched during the entire process.

The confidence I had when refactoring felt like a super power. Two simple property tests made the undertaking possible. They found numerous bugs, including:

Off-by-one index errors in actions modifying the timeline
Inconsistent timeline focus:
- focus was incorrectly restored on undoing an action
- focus was outside of the timeline bounds
Non-inverse actions:
- actions returning incorrectly constructed inverses
- the inverse of splitting a sequence is joining sequences, and joining them back up didn’t always work

After all tests passed, I ran the application with its GUI, edited a screencast project, and it all worked flawlessly. It’s almost too good to be true, right?

Property testing is not a silver bullet, and there might still be bugs lurking in my undo/redo history implementation. The tests I run are never going to be exhaustive and my generators might be flawed. That being said, they gave me a confidence in refactoring that I’ve never had before. Or maybe I just haven’t hit that disastrous edge case yet?

Why Test With Properties?

This was the last case study in the “Property-Based Testing in a Screencast Editor” series. I’ve had a great time writing these articles and giving talks on the subject. Before I wrap up, I’ll summarize my thoughts on PBT in general and my experience with it in Komposition.

Property-based testing is not only for pure functions; you can use it to test effectful actions. It is not only for unit testing; you can write integration tests using properties. It’s not only for functional programming languages; there are good frameworks for most popular programming languages.

Properties describe the general behavior of the system under test, and verify its correctness using a variety of inputs. Not only is this an effective way of finding errors, it’s also a concise way of documenting the system.

The iterative process in property-based testing, in my experience, comes down to the following steps:

Think about the specification of your system under test
Think about how generators and tests should work
Write or modify generators, tests, and implementation code, based on steps 1 and 2
Get minimal examples of failing tests
Repeat

Using PBT within Komposition has made it possible to confidently refactor large parts of the application. It has found errors in my thinking, my generators, my tests, and in my implementation code. Testing video scene classification went from a time consuming, repetitive, and manual verification process to a fast, effective, and automated task.

In short, it’s been a joy, and I look forward to continuing to use PBT in my work and in my own projects. I hope I’ve convinced you of its value, and inspired you to try it out, no matter what kind of project you’re working on and what programming language you are using. Involve your colleagues, practice writing property tests together, and enjoy finding complicated bugs before your users do!

Buy the Book

This series is now available as an ebook on Leanpub. While the content is mostly the same, there are a few changes bringing it up-to-date. Also, if you’ve already enjoyed the articles, you might want to support my work by purchasing this book. Finally, you might enjoy a nicely typeset PDF, or an EPUB book, over a web page.

Property-Based Testing in a Screencast Editor, Case Study 2: Video Scene Classification

2019-04-17T00:00:00+02:00

In the previous case study on property-based testing (PBT) in Komposition we looked at timeline flattening. This post covers the video classifier, how it was tested before, and the bugs I found when I wrote property tests for it.

If you haven’t read the introduction or the first case study yet, I recommend checking them out!

Classifying Scenes in Imported Video

Komposition can automatically classify scenes when importing video files. This is a central productivity feature in the application, effectively cutting recorded screencast material automatically, letting the user focus on arranging the scenes of their screencast. Scenes are segments that are considered moving, as opposed to still segments:

A still segment is a sequence of at least S seconds of near-equal frames
A moving segment is a sequence of non-equal frames, or a sequence of near-equal frames with a duration less than S

S is a preconfigured minimum still segment duration in Komposition. In the future it might be configurable from the user interface, but for now it’s hard-coded.

Equality of two frames f₁ and f₂ is defined as a function E(f₁, f₂), described informally as:

comparing corresponding pixel color values of f₁ and f₂, with a small epsilon for tolerance of color variation, and
deciding two frames equal when at least 99% of corresponding pixel pairs are considered equal.
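The informal definition of E(f₁, f₂) can be sketched as follows. The frame representation (a flat list of RGB triples) and the concrete epsilon value are assumptions made for illustration; only the 99% threshold comes from the description above, and the real classifier works on pixel buffers tuned for performance.

```haskell
-- A minimal sketch of frame equality E(f1, f2). The representation
-- and the epsilon value are assumptions, not Komposition's code.

type Pixel = (Int, Int, Int) -- RGB components in [0, 255]
type Frame = [Pixel]         -- assumed: both frames have equal length

-- | Hypothetical per-component tolerance for color variation.
epsilon :: Int
epsilon = 5

pixelsNearEqual :: Pixel -> Pixel -> Bool
pixelsNearEqual (r1, g1, b1) (r2, g2, b2) =
  all (<= epsilon) [abs (r1 - r2), abs (g1 - g2), abs (b1 - b2)]

-- | Two frames are considered equal when at least 99% of the
-- corresponding pixel pairs are near-equal.
framesEqual :: Frame -> Frame -> Bool
framesEqual f1 f2 =
  let pairs      = zip f1 f2
      equalCount = length (filter (uncurry pixelsNearEqual) pairs)
  in equalCount * 100 >= length pairs * 99
```

The comparison uses integer arithmetic (equalCount * 100 against total * 99) to avoid rounding issues at exactly 99%.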

In addition to the rules stated above, there are two edge cases:

The first segment is always considered a moving segment (even if it’s just a single frame)
The last segment may be a still segment with a duration less than S

The second edge case is not what I would call a desirable feature, but rather a shortcoming due to the classifier not doing any type of backtracking. This could be changed in the future.

Manually Testing the Classifier

The first version of the video classifier had no property tests. Instead, I wrote what I thought was a decent classifier algorithm, mostly messing around with various pixel buffer representations and parallel processing to achieve acceptable performance.

The only type of testing I had available, except for general use of the application, was a color-tinting utility. This was a separate program using the same classifier algorithm. It took as input a video file, and produced as output a video file where each frame was tinted green or red, for moving and still frames, respectively.

In the recording above you see the color-tinted output video based on a recent version of the classifier. It classifies moving and still segments rather accurately. Before I wrote property tests and fixed the bugs that I found, it did not look so pretty, flipping back and forth at seemingly random places.

At first, debugging the classifier with the color-tinting tool seemed like a creative and powerful technique. But the feedback loop was horrible, having to record video, process it using the slow color-tinting program, and inspect it by eye. In hindsight, I can conclude that PBT is far more effective for testing the classifier.

Video Classification Properties

Figuring out how to write property tests for video classification wasn’t obvious to me. It’s not uncommon in example-based testing that tests end up mirroring the structure, and even the full implementation complexity, of the system under test. The same can happen in property-based testing.

With some complex systems it’s very hard to describe the correctness as a relation between any valid input and the system’s observed output. The video classifier is one such case. How do I decide if an output classification is correct for a specific input, without reimplementing the classification itself in my tests?

The other way around is easy, though! If I have a classification, I can convert that into video frames. Thus, the solution to the testing problem is to not generate the input, but instead generate the expected output. Hillel Wayne calls this technique “oracle generators” in his recent article.¹

The classifier property tests generate high-level representations of the expected classification output, which are lists of values describing the type and duration of segments.

Next, the list of output segments is converted into a sequence of actual frames. Frames are two-dimensional arrays of RGB pixel values. The conversion is simple:

Moving segments are converted to a sequence of alternating frames, flipping between all gray and all white pixels
Still segments are converted to a sequence of frames containing all black pixels
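The conversion rules can be illustrated with a small sketch. It is a simplification under stated assumptions: a frame is represented by a single uniform color instead of a two-dimensional array of RGB pixels, segment lengths are frame counts, and the names Segment, Frame, segmentToFrames, and segmentsToFrames are hypothetical.

```haskell
-- A minimal sketch of converting generated segments to frames.
-- Frames are simplified to a single uniform color (an assumption).

data Segment
  = Moving Int -- number of frames in a moving segment
  | Still Int  -- number of frames in a still segment
  deriving (Eq, Show)

data Frame = Gray | White | Black
  deriving (Eq, Show)

-- | Moving segments alternate between gray and white frames;
-- still segments consist of black frames only.
segmentToFrames :: Segment -> [Frame]
segmentToFrames (Moving n) = take n (cycle [Gray, White])
segmentToFrames (Still n)  = replicate n Black

segmentsToFrames :: [Segment] -> [Frame]
segmentsToFrames = concatMap segmentToFrames
```

Because the colors are fixed and distinct, consecutive frames within a moving segment always differ, and frames within a still segment are always equal.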

The example sequence in the diagram above, when converted to pixel frames with a frame rate of 10 FPS, can be visualized like in the following diagram, where each thin rectangle represents a frame:

By generating high-level output and converting it to pixel frames, I have input to feed the classifier with, and I know what output it should produce. Writing effective property tests then comes down to writing generators that produce valid output, according to the specification of the classifier. In this post I’ll show two such property tests.

Testing Still Segment Minimum Length

As stated in the beginning of this post, classified still segments must have a duration greater than or equal to S, where S is the minimum still segment duration used as a parameter for the classifier. The first property test we’ll look at asserts that this invariant holds for all classification output.

hprop_classifies_still_segments_of_min_length = property $ do

  -- 1. Generate a minimum still segment length/duration
  minStillSegmentFrames <- forAll $ Gen.int (Range.linear 2 (2 * frameRate))
  let minStillSegmentTime = frameCountDuration minStillSegmentFrames

  -- 2. Generate output segments
  segments <- forAll $
    genSegments (Range.linear 1 10)
                (Range.linear 1
                              (minStillSegmentFrames * 2))
                (Range.linear minStillSegmentFrames
                              (minStillSegmentFrames * 2))
                resolution

  -- 3. Convert test segments to actual pixel frames
  let pixelFrames = testSegmentsToPixelFrames segments

  -- 4. Run the classifier on the pixel frames
  let counted = classifyMovement minStillSegmentTime (Pipes.each pixelFrames)
                & Pipes.toList
                & countSegments

  -- 5. Sanity check
  countTestSegmentFrames segments === totalClassifiedFrames counted

  -- 6. Ignore last segment and verify all other segments
  case initMay counted of
    Just rest ->
      traverse_ (assertStillLengthAtLeast minStillSegmentTime) rest
    Nothing -> success
  where
    resolution = 10 :. 10

This chunk of test code is pretty busy, and it’s using a few helper functions that I’m not going to bore you with. At a high level, this test:

Generates a minimum still segment duration, based on a minimum frame count (let’s call it n) in the range [2, 20]. The classifier currently requires that n ≥ 2, hence the lower bound. The upper bound of 20 frames is an arbitrary number that I’ve chosen.
Generates valid output segments using the custom generator genSegments, where
- moving segments have a frame count in [1, 2n], and
- still segments have a frame count in [n, 2n].
Converts the generated output segments to actual pixel frames. This is done using a helper function that returns a list of alternating gray and white frames, or all black frames, as described earlier.
Runs the classifier on the pixel frames and counts the number of consecutive frames within each classified segment, producing a list like [Moving 18, Still 5, Moving 12, Still 30].
Performs a sanity check that the number of frames in the generated expected output is equal to the number of frames in the classified output. The classifier must not lose or duplicate frames.
Drops the last classified segment, which according to the specification can have a frame count less than n, and asserts that all other still segments have a frame count greater than or equal to n.
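The counting and checking steps in the list above could look roughly like the following sketch. The helper names echo the ones used in the test, but these definitions are hypothetical reimplementations over a simplified per-frame classification type, not the helpers from Komposition's test suite.

```haskell
import Data.List (group)

-- A sketch of the counting and checking steps, using a simplified
-- per-frame classification. Hypothetical, not Komposition's helpers.

data Classified = Moving | Still
  deriving (Eq, Show)

-- | Collapse runs of equally classified frames into (kind, count) pairs.
countSegments :: [Classified] -> [(Classified, Int)]
countSegments frames =
  [ (kind, length run) | run@(kind : _) <- group frames ]

-- | Total number of frames, used for the sanity check.
totalClassifiedFrames :: [(Classified, Int)] -> Int
totalClassifiedFrames = sum . map snd

-- | Ignore the last segment, then require every remaining still
-- segment to have at least n frames.
stillSegmentsLongEnough :: Int -> [(Classified, Int)] -> Bool
stillSegmentsLongEnough n counted = all ok (dropLast counted)
  where
    ok (Still, count) = count >= n
    ok (Moving, _)    = True
    dropLast xs = take (length xs - 1) xs
```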

Let’s run some tests.

> :{
| hprop_classifies_still_segments_of_min_length
|   & Hedgehog.withTests 10000
|   & Hedgehog.check
| :}
  ✓  passed 10000 tests.

Cool, it looks like it’s working.

Sidetrack: Why generate the output?

Now, you might wonder why I generate output segments first, and then convert to pixel frames. Why not generate random pixel frames to begin with? The property test above only checks that the still segments are long enough!

The benefit of generating valid output becomes clearer in the next property test, where I use it as the expected output of the classifier. Converting the output to a sequence of pixel frames is easy, and I don’t have to state any complex relation between the input and output in my property. When using oracle generators, the assertions can often be plain equality checks on generated and actual output.

But there’s benefit in using the same oracle generator for the “minimum still segment length” property, even if it’s more subtle. By generating valid output and converting to pixel frames, I can generate inputs that cover the edge cases of the system under test. Using property test statistics and coverage checks, I could inspect coverage, and even fail test runs where the generators don’t hit enough of the cases I’m interested in.²

Had I generated random sequences of pixel frames, then perhaps the majority of the generated examples would only produce moving segments. I could tweak the generator to get closer to either moving or still frames, within some distribution, but wouldn’t that just be a variation of generating valid scenes? It would be worse, in fact. I wouldn’t then be reusing existing generators, and I wouldn’t have a high-level representation that I could easily convert from and compare with in assertions.

Testing Moving Segment Time Spans

The second property states that the classified moving segments must start and end at the same timestamps as the moving segments in the generated output. Compared to the previous property, the relation between generated output and actual classified output is stronger.

hprop_classifies_same_scenes_as_input = property $ do
  -- 1. Generate a minimum still segment duration
  minStillSegmentFrames <- forAll $ Gen.int (Range.linear 2 (2 * frameRate))
  let minStillSegmentTime = frameCountDuration minStillSegmentFrames

  -- 2. Generate test segments
  segments <- forAll $ genSegments (Range.linear 1 10)
                                   (Range.linear 1
                                                 (minStillSegmentFrames * 2))
                                   (Range.linear minStillSegmentFrames
                                                 (minStillSegmentFrames * 2))
                                   resolution

  -- 3. Convert test segments to actual pixel frames
  let pixelFrames = testSegmentsToPixelFrames segments

  -- 4. Convert expected output segments to a list of expected time spans
  -- and the full duration
  let durations = map segmentWithDuration segments
      expectedSegments = movingSceneTimeSpans durations
      fullDuration = foldMap unwrapSegment durations

  -- 5. Classify movement of frames
  let classifiedFrames =
        Pipes.each pixelFrames
        & classifyMovement minStillSegmentTime
        & Pipes.toList

  -- 6. Classify moving scene time spans
  let classified =
        (Pipes.each classifiedFrames
         & classifyMovingScenes fullDuration)
        >-> Pipes.drain
        & Pipes.runEffect
        & runIdentity

  -- 7. Check classified time span equivalence
  expectedSegments === classified

  where
    resolution = 10 :. 10

Steps 1–3 are the same as in the previous property test. From there, this test:

Converts the generated output segments into a list of time spans. Each time span marks the start and end of an expected moving segment. Furthermore, it needs the full duration of the input in step 6, so that’s computed here.
Classifies the movement of each frame, i.e. if it’s part of a moving or still segment.
Runs the second classifier function called classifyMovingScenes, based on the full duration and the frames with classified movement data, resulting in a list of time spans.
Compares the expected and actual classified list of time spans.
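As an illustration of the conversion from expected segments to time spans, here is a minimal sketch. It assumes that segments carry their duration in seconds and that moving and still segments alternate; the Segment and TimeSpan types and the function names are hypothetical stand-ins for the helpers used in the test.

```haskell
-- A sketch of computing expected moving time spans from segments.
-- Types and names are hypothetical; durations are in seconds.

data Segment
  = Moving Rational
  | Still Rational
  deriving (Eq, Show)

-- | Start and end timestamps of a moving segment.
data TimeSpan = TimeSpan Rational Rational
  deriving (Eq, Show)

segmentDuration :: Segment -> Rational
segmentDuration (Moving d) = d
segmentDuration (Still d)  = d

-- | Walk the segments, tracking the running start time, and emit a
-- time span for each moving segment.
movingTimeSpans :: [Segment] -> [TimeSpan]
movingTimeSpans = go 0
  where
    go _ [] = []
    go start (Moving d : rest) =
      TimeSpan start (start + d) : go (start + d) rest
    go start (Still d : rest) = go (start + d) rest

-- | The full duration of the input.
fullDuration :: [Segment] -> Rational
fullDuration = sum . map segmentDuration
```

Using Rational rather than floating point keeps the timestamps exact, so expected and actual spans can be compared with plain equality in this sketch.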

While this test looks somewhat complicated with its setup and various conversions, the core idea is simple. But is it effective?

Bugs! Bugs everywhere!

Preparing for a talk on property-based testing, I added the “moving segment time spans” property a week or so before the event. At this time, I had used Komposition to edit multiple screencasts. Surely, all significant bugs were caught already. Adding property tests should only confirm the level of quality the application already had. Right?

Nope. First, I discovered that my existing tests were fundamentally incorrect to begin with. They were not reflecting the specification I had in mind, the one I described in the beginning of this post.

Furthermore, I found that the generators had errors. At first, I used Hedgehog to generate the pixels used for the classifier input. Moving frames were based on a majority of randomly colored pixels and a small percentage of equally colored pixels. Still frames were based on a random single color.

The problem I had not anticipated was that the colors used in moving frames were not guaranteed to be distinct from the color used in still frames. In small-sized examples I got black frames at the beginning and end of moving segments, and black frames for still segments, resulting in different classified output than expected. Hedgehog shrinking the failing examples’ colors towards 0, which is black, highlighted this problem even more.

I made my generators much simpler, using the alternating white/gray frames approach described earlier, and went on to running my new shiny tests. Here’s what I got:

What? Where does 0s–0.6s come from? The classified time span should’ve been 0s–1s, as the generated output has a single moving scene of 10 frames (1 second at 10 FPS). I started digging, using the annotate function in Hedgehog to inspect the generated and intermediate values in failing examples.

I couldn’t find anything incorrect in the generated data, so I shifted focus to the implementation code. The end timestamp 0.6s was consistently showing up in failing examples. Looking at the code, I found a curious hard-coded value 0.5 being bound and used locally in classifyMovement.

The function is essentially a fold over a stream of frames, where the accumulator holds vectors of previously seen and not-yet-classified frames. Stripping down and simplifying the old code to highlight one of the bugs, it looked something like this:

classifyMovement minStillSegmentTime =
  case ... of
    InStillState{..} ->
      if someDiff > minEqualTimeForStill
        then ...
        else ...
    InMovingState{..} ->
      if someOtherDiff >= minStillSegmentTime
        then ...
        else ...
  where
    minEqualTimeForStill = 0.5

Let’s look at what’s going on here. In the InStillState branch it uses the value minEqualTimeForStill, instead of always using the minStillSegmentTime argument. This is likely a residue from some refactoring where I meant to make the value a parameter instead of having it hard-coded in the definition.

Sparing you the gory implementation details, I’ll outline two more problems that I found. In addition to using the hard-coded value, it incorrectly classified frames based on that value. Frames that should’ve been classified as “moving” ended up “still”. That’s why I didn’t get 0s–1s in the output.

Why didn’t I see 0s–0.5s, given the hard-coded value 0.5? Well, there was also an off-by-one bug, in which one frame was classified incorrectly together with the accumulated moving frames.
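One way to read these numbers, spelled out as a small calculation: at 10 FPS each frame lasts 0.1 seconds, so the hard-coded 0.5 seconds plus a single misclassified frame gives 0.6 seconds. The names below are hypothetical and only restate the arithmetic; they are not taken from the classifier.

```haskell
-- A worked calculation of the 0.6s end timestamp, assuming it is
-- the hard-coded threshold plus one extra frame. Names are hypothetical.

frameRate :: Rational
frameRate = 10 -- frames per second, as in the failing example

frameDuration :: Rational
frameDuration = 1 / frameRate

hardCodedThreshold :: Rational
hardCodedThreshold = 0.5 -- the value bound to minEqualTimeForStill

-- | The threshold plus one frame too many (the off-by-one).
observedEnd :: Rational
observedEnd = hardCodedThreshold + 1 * frameDuration
```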

The classifyMovement function is 30 lines of Haskell code juggling some state, and I managed to mess it up in three separate ways at the same time. With these tests in place I quickly found the bugs and fixed them. I ran thousands of tests, all passing.

Finally, I ran the application, imported a previously recorded video, and edited a short screencast. The classified moving segments were notably better than before.

Summary

A simple streaming fold can hide bugs that are hard to detect with manual testing. The consistent result of 0.6, together with the hard-coded value 0.5 and a frame rate of 10 FPS, pointed clearly towards an off-by-one bug. I consider this a great showcase of how powerful shrinking in PBT is, consistently presenting minimal examples that point towards specific problems. It’s not just a party trick on ideal mathematical functions.

Could these errors have been caught without PBT? I think so, but what effort would it require? Manual testing and introspection did not work for me. Code review might have revealed the incorrect definition of minEqualTimeForStill, but perhaps not the off-by-one and incorrect state handling bugs. There are of course many other QA techniques, and I won’t evaluate them all. But given the low effort that PBT requires in this setting, the number of problems it finds, and the accuracy it provides when troubleshooting, I think it’s a clear win.

I also want to highlight the iterative process that I find naturally emerges when applying PBT:

Think about how your system is supposed to work. Write down your specification.
Think about how to generate input data and how to test your system, based on your specification. Tune your generators to provide better test data. Try out alternative styles of properties. Perhaps model-based or metamorphic testing fits your system better.
Run tests and analyze the minimal failing examples. Fix your implementation until all tests pass.

This can be done when modifying existing code, or when writing new code. You can apply this without having any implementation code yet, perhaps just a minimal stub, and the workflow is essentially the same as TDD.

Coming Up

The final post in this series will cover testing at a higher level of the system, with effects and multiple subsystems being integrated to form a full application. We will look at property tests that found many bugs and that made a substantial refactoring possible.

Introduction
Timeline Flattening
Video Scene Classification
Integration Testing

Until then, thanks for reading!

Credits

Thank you Ulrik Sandberg, Pontus Nagy, and Fredrik Björeman for reviewing drafts of this post.

Footnotes

¹ See the “Oracle Generators” section in Finding Property Tests.
² John Hughes’ talk Building on developers’ intuitions goes into depth on this. There’s also work being done to provide similar functionality for Hedgehog.

Property-Based Testing in a Screencast Editor, Case Study 1: Timeline Flattening

2019-03-24T00:00:00+01:00

In the first post of this series I introduced the Komposition screencast editor, and briefly explained the fundamentals of property-based testing (PBT). Furthermore, I covered how to write testable code, regardless of how you check your code with automated tests. Lastly, I highlighted some difficulties in using properties to perform component and integration testing.

If you haven’t read the introductory post, I suggest doing so before continuing with this one. You’ll need an understanding of what PBT is for this case study to make sense.

This post is the first case study in the series, covering the timeline flattening process in Komposition and how it’s tested using PBT. The property tests aren’t integration-level tests, but rather unit tests. This case study serves as a warm-up to the coming, more advanced, ones.

Before we look at the tests, we need to learn more about Komposition’s hierarchical timeline and how the flattening process works.

The Hierarchical Timeline

Komposition’s timeline is hierarchical. While many non-linear editing systems have support for some form of nesting,¹ they are primarily focused on flat timeline workflows. The timeline structure and the keyboard-driven editing in Komposition are optimized for the screencast editing workflow I use.

It’s worth emphasizing that Komposition is not a general video editor. In addition to its specific editing workflow, you may need to adjust your recording workflow to use it effectively².

Video and Audio in Parallels

At the lowest level of the timeline are clips and gaps. Those are put within the video and audio tracks of parallels. The following diagram shows a parallel consisting of two video clips and one audio clip.

The tracks of a parallel are played simultaneously (in parallel), as indicated by the arrows in the above diagram. The tracks start playing at the same time. This makes parallels useful to synchronize the playback of specific parts of a screencast, and to group closely related clips.

Gaps

When editing screencasts made up of separate video and audio recordings you often end up with differing clip durations. The voice-over audio clip might be longer than the corresponding video clip, or vice versa. A useful default behaviour is to extend the short clips. For audio, this is easy. Just pad with silence. For video, it’s not so clear what to do. In Komposition, shorter video tracks are padded with repeated still frame sections called gaps.

The following diagram shows a parallel with a short video clip and a longer audio clip. The dashed area represents the implicit gap.

You can also add gaps manually, specifying a duration of the gap and inserting it into a video or audio track. The following diagram shows a parallel with manually added gaps in both video and audio tracks.

Manually added gaps (called explicit gaps) are padded with still frames or silence, just like implicit gaps, which are added automatically to match track duration.

Sequences

Parallels are put in sequences. The parallels within a sequence are played sequentially; the first one is played in its entirety, then the next one, and so on. This behaviour is different from how parallels play their tracks. Parallels and sequences, with their different playback behaviors, make up the fundamental building blocks of the compositional editing in Komposition.

The following diagram shows a sequence of two parallels, playing sequentially:

The Timeline

Finally, at the top level, we have the timeline. Effectively, the timeline is a sequence of sequences; it plays every child sequence in sequence. This level exists so that larger chunks of a screencast can be grouped within separate sequences.

I use separate sequences within the timeline to delimit distinct parts of a screencast, such as the introduction, the different chapters, and the summary.
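The hierarchy described so far can be summed up in a handful of types. The following is a minimal sketch, not Komposition’s actual data model: all type and function names are made up for illustration, durations are plain seconds, and clips carry no references to media.

```haskell
-- Hypothetical, simplified model of the hierarchical timeline.
type Seconds = Double

data VideoPart = VideoClip Seconds | VideoGap Seconds
  deriving (Eq, Show)

data AudioPart = AudioClip Seconds | AudioGap Seconds
  deriving (Eq, Show)

-- A parallel holds one video track and one audio track.
data Parallel = Parallel [VideoPart] [AudioPart]
  deriving (Eq, Show)

newtype Sequence = Sequence [Parallel]
  deriving (Eq, Show)

newtype Timeline = Timeline [Sequence]
  deriving (Eq, Show)

videoPartDuration :: VideoPart -> Seconds
videoPartDuration (VideoClip d) = d
videoPartDuration (VideoGap d) = d

audioPartDuration :: AudioPart -> Seconds
audioPartDuration (AudioClip d) = d
audioPartDuration (AudioGap d) = d

-- Tracks play simultaneously, so a parallel lasts as long as its
-- longest track (the shorter one is padded with an implicit gap).
parallelDuration :: Parallel -> Seconds
parallelDuration (Parallel video audio) =
  max (sum (map videoPartDuration video)) (sum (map audioPartDuration audio))

-- Parallels play one after another, so their durations add up.
sequenceDuration :: Sequence -> Seconds
sequenceDuration (Sequence parallels) = sum (map parallelDuration parallels)

-- The timeline plays its sequences one after another.
timelineDuration :: Timeline -> Seconds
timelineDuration (Timeline sequences) = sum (map sequenceDuration sequences)
```

In this model, the difference between parallels and sequences shows up as `max` versus `sum` in the duration functions.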

Timeline Flattening

Komposition currently uses FFmpeg to render the final media. This is done by constructing an ffmpeg command invocation with a filter graph describing how to fit together all clips, still frames, and silent audio parts.

FFmpeg doesn’t know about hierarchical timelines; it only cares about video and audio streams. To convert the hierarchical timeline into a suitable representation to build the FFmpeg filter graph from, Komposition performs timeline flattening.

The flat representation of a timeline contains only two tracks: audio and video. All gaps are explicitly represented in those tracks. The following graph shows how a hierarchical timeline is flattened into two tracks.

Notice in the graphic above how the implicit gaps at the ends of video and audio tracks get represented with explicit gaps in the flat timeline. This is because FFmpeg does not know how to render implicit gaps. All gaps are represented explicitly, and are converted to clips of still frames or silent audio when rendered with FFmpeg.
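To make the idea concrete, here is a rough sketch of flattening over a simplified, hypothetical model. It is not the code in Komposition’s Render module: it ignores still frame sources entirely and never fails, whereas the real flattenTimeline returns a Maybe. All names below are assumptions made for the example.

```haskell
type Seconds = Double

-- A part of a track: a named clip or a gap, each with a duration.
data Part = Clip String Seconds | Gap Seconds
  deriving (Eq, Show)

-- Video track and audio track of a parallel.
data Parallel = Parallel [Part] [Part]

newtype Sequence = Sequence [Parallel]

newtype Timeline = Timeline [Sequence]

-- The flat representation: exactly one video and one audio track.
data Flat = Flat { flatVideo :: [Part], flatAudio :: [Part] }
  deriving (Eq, Show)

partDuration :: Part -> Seconds
partDuration (Clip _ d) = d
partDuration (Gap d) = d

trackDuration :: [Part] -> Seconds
trackDuration = sum . map partDuration

-- Turn the implicit gap at the end of a short track into an
-- explicit gap, so that the track reaches the target duration.
padTo :: Seconds -> [Part] -> [Part]
padTo target parts
  | missing > 0 = parts ++ [Gap missing]
  | otherwise = parts
  where
    missing = target - trackDuration parts

-- Both tracks of a flattened parallel get the same duration.
flattenParallel :: Parallel -> Flat
flattenParallel (Parallel video audio) =
  Flat (padTo total video) (padTo total audio)
  where
    total = max (trackDuration video) (trackDuration audio)

-- Flatten every parallel and concatenate the results in playback order.
flattenTimeline :: Timeline -> Flat
flattenTimeline (Timeline sequences) =
  Flat (concatMap flatVideo flats) (concatMap flatAudio flats)
  where
    flats = [flattenParallel p | Sequence ps <- sequences, p <- ps]
```

Because every parallel is padded before concatenation, clips in later parallels keep their start times in both flat tracks.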

Property Tests

To test the timeline flattening, there are a number of properties that are checked. I’ll go through each one and its property test code.

These properties were primarily written after I already had an implementation. They capture some general properties of flattening that I’ve come up with. In other cases, I’ve written properties before beginning on an implementation, or to uncover an existing bug that I’ve observed.

Thinking about your system’s general behaviour and expressing that as executable property tests is hard. I believe, like with any other skill, that it requires a lot of practice. Finding general patterns for properties, like the ones Scott Wlaschin describes in Choosing properties for property-based testing, is a great place to start. When you struggle with finding properties of your system under test, try applying these patterns and see which work for you.

Property: Duration Equality

Given a timeline t, where all parallels have at least one video clip, the total duration of the flattened t must be equal to the total duration of t. Or, in a more dense notation,

∀t ∈ T → duration(flatten(t)) = duration(t)

where T is the set of timelines with at least one video clip in each parallel.

The reason that all parallels must have at least one video clip is that the flattening algorithm can currently only locate still frames for video gaps from within the same parallel. If it encounters a parallel with no video clips, the timeline flattening fails. This limitation is discussed in greater detail at the end of this article.

The test for the duration equality property is written using Hedgehog, and looks like this:

    hprop_flat_timeline_has_same_duration_as_hierarchical =
      property $ do
        -- 1. Generate a timeline with video clips in each parallel
        timeline' <- forAll $
          Gen.timeline (Range.exponential 0 5) Gen.parallelWithClips

        -- 2. Flatten the timeline and extract the result
        let Just flat = Render.flattenTimeline timeline'

        -- 3. Check that hierarchical and flat timeline duration are equal
        durationOf AdjustedDuration timeline'
          === durationOf AdjustedDuration flat

It generates a timeline using forAll and custom generators (1). Instead of generating timelines of any shape and filtering out only the ones with video clips in each parallel, which would be very inefficient, this test uses a custom generator to only obtain inputs that satisfy the invariants of the system under test.

The range passed as the first argument to Gen.timeline is used as the bounds of the generator, such that each level in the generated hierarchical timeline will have at most 5 children.

Gen.timeline takes as its second argument another generator, the one used to generate parallels, which in this case is Gen.parallelWithClips. With Hedgehog generators being regular values, it’s practical to compose them like this. A “higher-order generator” can be a regular function taking other generators as arguments.

As you might have noticed in the assertion (3), durationOf takes as its first argument a value AdjustedDuration. What’s that about? Komposition supports adjusting the playback speed of video media for individual clips. To calculate the final duration of a clip, the playback speed needs to be taken into account. By passing AdjustedDuration we take playback speed into account for all video clips.
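As a rough illustration of what such a mode argument can look like, here is a small sketch. Only the names durationOf and AdjustedDuration come from the test above; the OriginalDuration constructor, the VideoClip record, and the way speed is represented are assumptions made for this example.

```haskell
type Seconds = Double

-- Which duration to compute. OriginalDuration is a hypothetical
-- counterpart to AdjustedDuration, added for the example.
data DurationMode = OriginalDuration | AdjustedDuration

-- A hypothetical clip: its recorded length and a playback speed
-- factor, where 2.0 means "play twice as fast".
data VideoClip = VideoClip
  { clipLength :: Seconds
  , clipSpeed :: Double
  }

durationOf :: DurationMode -> VideoClip -> Seconds
durationOf OriginalDuration clip = clipLength clip
durationOf AdjustedDuration clip = clipLength clip / clipSpeed clip
```

With these definitions, a 10 second clip played at double speed has an adjusted duration of 5 seconds, which is the duration that matters when comparing the hierarchical and flat timelines.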

Sidetrack: Finding a Bug

Let’s say I had introduced a bug in timeline flattening, in which not all video gaps were added correctly to the flat video track. The flattening is implemented as a fold, and it would not be unthinkable that the accumulator was incorrectly constructed in some case. The test would catch this quickly and present us with a minimal counter-example:

Hedgehog prints the source code for the failing property. Below the forAll line the generated value is printed. The difference between the expected and actual value is printed below the failing assertion. In this case it’s a simple expression of type Duration. In case you’re comparing large tree-like structures, this diff will highlight only the differing expressions. Finally, it prints the following:

    This failure can be reproduced by running:
    > recheck (Size 23) (Seed 16495576598183007788 5619008431246301857)

When working on finding and fixing the fold bug, we can use the printed size and seed values to deterministically rerun the test with the exact same inputs.

Property: Clip Occurrence

Slightly more complicated than the duration equality property, the clip occurrence property checks that all clips from the hierarchical timeline, and no other clips, occur within the flat timeline. As discussed in the introduction on timeline flattening, implicit gaps get converted to explicit gaps and thereby add more gaps, but no video or audio clips should be added or removed.

    hprop_flat_timeline_has_same_clips_as_hierarchical =
      property $ do
        -- 1. Generate a timeline with video clips in each parallel
        timeline' <- forAll $
          Gen.timeline (Range.exponential 0 5) Gen.parallelWithClips

        -- 2. Flatten the timeline
        let flat = Render.flattenTimeline timeline'

        -- 3. Check that all video clips occur in the flat timeline
        flat ^.. _Just . Render.videoParts . each . Render._VideoClipPart
          === timelineVideoClips timeline'

        -- 4. Check that all audio clips occur in the flat timeline
        flat ^.. _Just . Render.audioParts . each . Render._AudioClipPart
          === timelineAudioClips timeline'

The hierarchical timeline is generated and flattened like before (1, 2). The two assertions check that the respective video clips (3) and audio clips (4) are equal. It’s using lenses to extract clips from the flat timeline, and the helper functions timelineVideoClips and timelineAudioClips to extract clips from the original hierarchical timeline.

Still Frames Used

In the process of flattening, the still frame source for each gap is selected. It doesn’t assign the actual pixel data to the gap, but a value describing which asset the still frame should be extracted from, and whether to pick the first or the last frame (known as the still frame mode). This representation lets the flattening algorithm remain a pure function, and thus easier to test. Another processing step runs the effectful action that extracts still frames from video files on disk.

The decision of still frame mode and source is made by the flattening algorithm based on the parallel in which each gap occurs, and what video clips are present before or after. It favors using clips occurring after the gap. It only uses frames from clips before the gap in case there are no clips following it. To test this behaviour, I’ve defined three properties.
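The selection rule can be sketched in a few lines over a simplified video track. This is an illustration of the rule as described above, not Komposition’s implementation: the types are hypothetical, clips are identified by name only, and picking the nearest clip on either side is an assumption.

```haskell
-- A simplified video track part: a named clip or a gap.
data VideoPart = Clip String | Gap
  deriving (Eq, Show)

data StillFrameMode = FirstFrame | LastFrame
  deriving (Eq, Show)

-- For each gap in the track, in order, pick a still frame mode and
-- the name of the source clip:
--
--   * the first frame of the nearest following clip, if there is one
--   * otherwise the last frame of the nearest preceding clip
--   * Nothing if the track contains no clips at all
stillFrames :: [VideoPart] -> [Maybe (StillFrameMode, String)]
stillFrames = go Nothing
  where
    -- The first argument is the most recent clip seen so far.
    go _ [] = []
    go _ (Clip name : rest) = go (Just name) rest
    go previous (Gap : rest) = choose previous rest : go previous rest

    choose previous rest =
      case [name | Clip name <- rest] of
        next : _ -> Just (FirstFrame, next)
        [] -> fmap (\name -> (LastFrame, name)) previous
```

For a track like `[Clip "a", Gap, Clip "b", Gap]`, the first gap borrows the first frame of "b", and the trailing gap falls back to the last frame of "b". The Nothing case corresponds to the limitation mentioned earlier: a track without video clips has nowhere to get a still frame from.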

Property: Single Initial Video Clip

The following property checks that an initial single video clip, followed by one or more gaps, is used as the still frame source for those gaps.

    hprop_flat_timeline_uses_still_frame_from_single_clip =
      property $ do
        -- 1. Generate a video track generator where the first
        --    video part is always a clip
        let genVideoTrack = do
              v1 <- Gen.videoClip
              vs <- Gen.list (Range.linear 1 5) Gen.videoGap
              pure (VideoTrack () (v1 : vs))

        -- 2. Generate a timeline with the custom video track
        --    generator
        timeline' <- forAll $ Gen.timeline
          (Range.exponential 0 5)
          (Parallel () <$> genVideoTrack <*> Gen.audioTrack)

        -- 3. Flatten the timeline
        let flat = Render.flattenTimeline timeline'

        -- 4. Check that any video gaps will use the last frame
        --    of a preceding video clip
        flat
          ^.. ( _Just
              . Render.videoParts
              . each
              . Render._StillFramePart
              . Render.stillFrameMode
              )
          &   traverse_ (Render.LastFrame ===)

The custom video track generator (1) always produces tracks with an initial video clip followed by one or more video gaps. The generated timeline (2) can contain parallels with any audio track shape, which may result in a longer audio track and thus an implicit gap at the end of the video track. In either case, all video gaps should be padded with the last frame of the initial video clip, which is checked in the assertion (4).

Property: Ending with a Video Clip

In case the video track ends with a video clip, and is longer than the audio track, all video gaps within the track should use the first frame of a following clip.

    hprop_flat_timeline_uses_still_frames_from_subsequent_clips =
      property $ do
        -- 1. Generate a parallel where the video track ends with
        --    a video clip, and where the audio track is shorter
        let
          genParallel = do
            vt <-
              VideoTrack ()
                <$> (   snoc
                    <$> Gen.list (Range.linear 1 10) Gen.videoPart
                    <*> Gen.videoClip
                    )
            at <- AudioTrack () . pure . AudioGap () <$> Gen.duration'
              (Range.linearFrac
                0
                (durationToSeconds (durationOf AdjustedDuration vt) - 0.1)
              )
            pure (Parallel () vt at)

        -- 2. Generate a timeline with the custom parallel generator
        timeline' <- forAll $ Gen.timeline (Range.exponential 0 5) genParallel

        -- 3. Flatten the timeline
        let flat = Render.flattenTimeline timeline'

        -- 4. Check that all gaps use the first frame of subsequent clips
        flat
          ^.. ( _Just
              . Render.videoParts
              . each
              . Render._StillFramePart
              . Render.stillFrameMode
              )
          &   traverse_ (Render.FirstFrame ===)

The custom generator (1) produces parallels where the video track is guaranteed to end with a clip, and where the audio track is at least 100 ms shorter than the video track. This ensures that there’s no implicit video gap at the end of the video track. Generating (2) and flattening (3) are otherwise the same as before. The assertion (4) checks that all video gaps use the first frame of a following clip.

Property: Ending with an Implicit Video Gap

The last property on still frame usage covers the case where the video track is shorter than the audio track. This leaves an implicit gap which, just like explicit gaps inserted by the user, is padded with still frames.

    hprop_flat_timeline_uses_last_frame_for_automatic_video_padding =
      property $ do
        -- 1. Generate a parallel where the video track only contains a video
        --    clip, and where the audio track is longer
        let
          genParallel = do
            vt <- VideoTrack () . pure <$> Gen.videoClip
            at <- AudioTrack () . pure . AudioGap () <$> Gen.duration'
              (Range.linearFrac
                (durationToSeconds (durationOf AdjustedDuration vt) + 0.1)
                10
              )
            pure (Parallel () vt at)

        -- 2. Generate a timeline with the custom parallel generator
        timeline' <- forAll $ Gen.timeline (Range.exponential 0 5) genParallel

        -- 3. Flatten the timeline
        let flat = Render.flattenTimeline timeline'

        -- 4. Check that video gaps (which should be a single gap at the
        --    end of the video track) use the last frame of preceding clips
        flat
          ^.. ( _Just
              . Render.videoParts
              . each
              . Render._StillFramePart
              . Render.stillFrameMode
              )
          &   traverse_ (Render.LastFrame ===)

The custom generator (1) generates a video track consisting of a single video clip, and an audio track that is at least 100 ms longer. Generating the timeline (2) and flattening (3) are again similar to the previous property tests. The assertion (4) checks that all video gaps use the last frame of preceding clips, even if we know that there should only be one at the end.

Properties: Flattening Equivalences

The last property I want to show in this case study checks flattening at the sequence and parallel levels. While rendering a full project always flattens at the timeline level, the preview feature in Komposition can be used to render and preview a single sequence or parallel.

There should be no difference between flattening an entire timeline and flattening all of its sequences or parallels and folding those results into a single flat timeline. This is what the flattening equivalences properties are about.

    hprop_flat_timeline_is_same_as_all_its_flat_sequences =
      property $ do
        -- 1. Generate a timeline
        timeline' <- forAll $
          Gen.timeline (Range.exponential 0 5) Gen.parallelWithClips

        -- 2. Flatten all sequences and fold the resulting flat
        --    timelines together
        let flat = timeline' ^.. sequences . each
                   & foldMap Render.flattenSequence

        -- 3. Make sure we successfully flattened the timeline
        flat /== Nothing

        -- 4. Flatten the entire timeline and compare to the
        --    flattened sequences
        Render.flattenTimeline timeline' === flat

The first property generates a timeline (1) where all parallels have at least one video clip. It flattens all sequences within the timeline and folds the results together (2). Folding flat timelines together means concatenating their video and audio tracks, resulting in a single flat timeline.

Before the final assertion, it checks that we got a result (3) and not Nothing. As it’s using the Gen.parallelWithClips generator there should always be video clips in each parallel, and we should always successfully flatten and get a result. The final assertion (4) checks that rendering the original timeline gives the same result as the folded-together results of rendering each sequence.

The other property is very similar, but operates on parallels rather than sequences:

    hprop_flat_timeline_is_same_as_all_its_flat_parallels =
      property $ do
        -- 1. Generate a timeline
        timeline' <- forAll $
          Gen.timeline (Range.exponential 0 5) Gen.parallelWithClips

        -- 2. Flatten all parallels and fold the resulting flat
        --    timelines together
        let flat = timeline' ^.. sequences . each . parallels . each
                   & foldMap Render.flattenParallel

        -- 3. Make sure we successfully flattened the timeline
        flat /== Nothing

        -- 4. Flatten the entire timeline and compare to the
        --    flattened parallels
        Render.flattenTimeline timeline' === flat

The only difference is in the traversal (2), where we apply Render.flattenParallel to each parallel instead of applying Render.flattenSequence to each sequence.
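The folding step in both properties relies on flat timelines forming a semigroup. The sketch below shows how foldMap behaves under that assumption, using only the standard Semigroup and Monoid instances from base. The Flat type and foldFlat are stand-ins made up for this example, with tracks reduced to lists of clip names.

```haskell
-- A stand-in for a flat timeline: one video and one audio track.
data Flat = Flat { flatVideo :: [String], flatAudio :: [String] }
  deriving (Eq, Show)

-- Combining two flat timelines concatenates their tracks.
instance Semigroup Flat where
  Flat v1 a1 <> Flat v2 a2 = Flat (v1 <> v2) (a1 <> a2)

-- Flatten each element with the given function and fold the results
-- together. This uses the Monoid instance of Maybe from base, which
-- combines the values inside Just and treats Nothing as the identity.
foldFlat :: (a -> Maybe Flat) -> [a] -> Maybe Flat
foldFlat = foldMap
```

Two details of the Maybe monoid in base are worth keeping in mind when reading the properties: folding an empty list gives Nothing, and a Nothing among Just values is skipped rather than making the whole result Nothing.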

Missing Properties

Whew! That was quite a lot of properties and code, especially for a warm-up. But timeline flattening could be tested more thoroughly! I haven’t yet written the following properties, but I’m hoping to find some time to add them:

Clip playback timestamps are the same. The “clip occurrence” property only checks that the hierarchical timeline’s clips occur in the flat timeline. It doesn’t check when in the flat timeline they occur. One way to test this would be to first annotate each clip in the original timeline with its playback timestamp, and transfer this information through to the flat timeline. Then the timestamps could be included in the assertion.
Source assets used as still frame sources. The “still frames used” properties only check the still frame mode of gaps, not the still frame sources. The algorithm could have a bug where it always uses the first video clip’s asset as a frame source, and the current property tests would not catch it.
Same flat result is produced regardless of sequence grouping. Sequences can be split or joined in any way without affecting the final rendered media. They are merely ways of organizing parallels in logical groups. A property could check that however you split or join sequences within a timeline, the flattened result is the same.

A Missing Feature

As pointed out earlier, parallels must have at least one video clip. The flattening algorithm can only locate still frame sources for video gaps from within the same parallel. This is an annoying limitation when working with Komposition, and the algorithm should be improved.

As the existing set of properties describes timeline flattening fairly well, changing the algorithm could be done with a TDD-like workflow:

Modify the property tests to capture the intended behaviour
Tests will fail, with the errors showing how the existing implementation fails to find still frame sources as expected
Change the implementation to make the tests pass

PBT is not only an after-the-fact testing technique. It can be used much like conventional example-based testing to drive development.

Obligatory Cliff-Hanger

In this post we’ve looked at timeline flattening, the simplest case study in the “Property-Based Testing in a Screencast Editor” series. The system under test was a module of pure functions, with complex enough behaviour to showcase PBT as a valuable tool. The tests are more closely related to the design choices and concrete representations of the implementation.

Coming case studies will dive deeper into the more complex subsystems of Komposition, and finally we’ll see how PBT can be used for integration testing. At that level, the property tests are less tied to the implementation, and focus on describing the higher-level outcomes of the interaction between subsystems.

Next up are property tests for the video classifier. It’s also implemented as a pure function, but with slightly more complicated logic that is trickier to test. We’re going to look at an interesting technique where we generate the expected output instead of the input.

Thanks for reading! See you next time.

Credits

Thank you Chris Ford and Ulrik Sandberg for proof-reading and giving valuable feedback on drafts of this post.

Footnotes

1. Final Cut Pro has compound clips, and Adobe Premiere Pro has nested sequences.
2. The section on workflow in Komposition’s documentation describes how to plan, record, and edit your screencast in a way compatible with Komposition.

Property-Based Testing in a Screencast Editor: Introduction

2019-03-02T00:00:00+01:00

This is the first in a series of posts about using property-based testing (PBT) within Komposition, a screencast editor that I’ve been working on during the last year. It introduces PBT and highlights some challenges in testing properties of an application like Komposition.

Future posts will focus on individual case studies, covering increasingly complex components and how they are tested. I’ll reflect on what I’ve learned in each case, what bugs the tests have found, and what still remains to be improved.

For example, I’ll explain how using PBT helped me find and fix bugs in the specification and implementation of Komposition’s video classifier. Those were bugs that would be very hard to find using example-based tests or using a static type system!

This series is not a tutorial on PBT, but rather a collection of motivating examples. That said, you should be able to follow along without prior knowledge of PBT.

Komposition

In early 2018 I started producing Haskell screencasts. A majority of the work involved cutting and splicing video by hand in a non-linear editing system (NLE) like Premiere Pro or Kdenlive. I decided to write a screencast editor specialized for my needs, reducing the amount of manual labor needed to edit the recorded material. Komposition was born.

Komposition is a modal GUI application built for editing screencasts. Unlike most NLE systems, it features a hierarchical timeline, built out of sequences, parallels, tracks, clips, and gaps. To make the editing experience more efficient, it automatically classifies scenes in screen capture video, and sentences in recorded voice-over audio.

If you are curious about Komposition and want to learn more right away, check out its documentation.

Some of the most complex parts of Komposition include focus and timeline transformations, video classification, video rendering, and the main application logic. Those are the areas in which I’ve spent most effort writing tests, using a combination of example-based and property-based testing.

I’ve selected the four most interesting areas where I’ve applied PBT in Komposition, and I’ll cover one in each coming blog post:

Timeline flattening
Video scene classification
Focus and timeline consistency
Symmetry of undo/redo

I hope these case studies will be motivating, and that they will show the value of properties all the way from unit testing to integration testing.

Property-Based Testing

To get the most out of this series, you need a basic understanding of what PBT is, so let’s start there. For my take on a minimal definition, PBT is about:

Specifying your system under test in terms of properties, where properties describe invariants of the system based on its input and output.
Testing that those properties hold against a large variety of inputs.

It’s worth noting that PBT is not equal to QuickCheck, or any other specific tool, for that matter. The set of inputs doesn’t have to be randomly generated. You don’t have to use “shrinking”. You don’t have to use a static type system or a functional programming language. PBT is a general idea that can be applied in many ways.
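To illustrate that point, here is a minimal sketch of a property checked without any PBT library, random generation, or shrinking. Instead it enumerates every small input exhaustively. The helper names (listsUpTo, counterExamples) are made up for this example.

```haskell
import Data.List (sort)

-- A property: sorting an already sorted list changes nothing.
prop_sortIdempotent :: [Int] -> Bool
prop_sortIdempotent xs = sort (sort xs) == sort xs

-- All lists of length 0 to n with elements drawn from the alphabet,
-- shortest lists first.
listsUpTo :: Int -> [a] -> [[a]]
listsUpTo n alphabet =
  concatMap (\k -> sequence (replicate k alphabet)) [0 .. n]

-- Every small input for which the property does not hold.
counterExamples :: ([Int] -> Bool) -> [[Int]]
counterExamples prop = filter (not . prop) (listsUpTo 4 [0, 1, 2])
```

Evaluating `counterExamples prop_sortIdempotent` gives an empty list, meaning the property holds for all enumerated inputs. A false property, such as `\xs -> reverse xs == xs`, produces counter-examples, and since the inputs are enumerated shortest first, the first one is already small.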

The following resources are useful if you want to learn more about PBT:

The introductory articles on Hypothesis, although specific to Python.
“What is Property Based Testing?” by David R. MacIver is a definition of what PBT is, and particularly what it isn’t.

The code examples will be written in Haskell and using the Hedgehog testing system. You don’t have to know Haskell to follow this series, as I’ll explain the techniques primarily without code. But if you are interested in the Haskell specifics and in Hedgehog, check out “Property testing with Hedgehog” by Tim Humphries.

Properties of the Ugly Parts

When I started with PBT, I struggled with applying it to anything beyond simple functions. Examples online are often focused on the fundamentals. They cover concepts like reversing lists, algebraic laws, and symmetric encoders and decoders. Those are important properties to test, and they are good examples for teaching the foundations of PBT.

I wanted to take PBT beyond pure and simple functions, and leverage it on larger parts of my system. The “ugly” parts, if you will. In my experience, the complexity of a system often becomes much higher than the sum of its parts. The way subsystems are connected and form a larger graph of dependencies drives the need for integration testing at an application level.

Finding resources on integration testing using PBT is hard, and it might drive you to think that PBT is not suited for anything beyond the introductory examples. With the case studies in this blog series I hope to contribute to debunking such misconceptions.

Designing for Testability

In my case, it’s a desktop multimedia application. What if we’re working on a backend that connects to external systems and databases? Or if we’re writing a frontend application with a GUI driven by user input? In addition to these kinds of systems being hard to test at a high level due to their many connected subsystems, they usually have stateful components, side effects, and non-determinism. How do we make such systems testable with properties?

Well, the same way we would design our systems to be testable with examples. Going back to “Writing Testable Code” by Miško Hevery from 2008, and Kent Beck’s “Test-Driven Development by Example” from 2003, setting aside the OOP specifics, many of their guidelines apply equally well to code tested with properties:

Determinism: Make it possible to run the “system under test” deterministically, such that your tests can be reliable. This does not mean the code has to be pure, but you might need to stub or otherwise control side effects during your tests.
No global state: In order for tests to be repeatable and independent of execution order, you might have to roll back database transactions, use temporary directories for generated files, stub out effects, etc.
High cohesion: Strive for modules of high cohesion, with smaller units each having a single responsibility. Spreading closely related responsibilities thin across multiple modules makes the implementation harder to maintain and test.
Low coupling: Decrease coupling between interface and implementation. This makes it easier to write tests that don’t depend on implementation details. You may then modify the implementation without modifying the corresponding tests.

I find these guidelines universal for writing testable code in any programming language I’ve used professionally, regardless of paradigm or type system. They apply to both example-based and property-based testing.
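As a small illustration of the determinism guideline, here is one way to make a time-dependent function controllable in tests: the effect is passed in as an argument instead of being performed directly. The function and its naming scheme are invented for this example and are not taken from Komposition.

```haskell
import Data.Functor.Identity (Identity (..))

-- Builds an output file name from a title and a timestamp. How the
-- timestamp is obtained is left to the caller: production code can
-- pass an IO action that reads the clock, while a test can pass a
-- fixed value.
outputFileName :: Monad m => m Int -> String -> m String
outputFileName getTimestamp title = do
  timestamp <- getTimestamp
  pure (title ++ "-" ++ show timestamp ++ ".mp4")

-- A deterministic variant for tests, using a stubbed timestamp.
outputFileNameAt :: Int -> String -> String
outputFileNameAt timestamp title =
  runIdentity (outputFileName (Identity timestamp) title)
```

With the clock stubbed out, `outputFileNameAt 42 "intro"` always evaluates to `"intro-42.mp4"`, so both example-based and property-based tests of the naming logic are repeatable.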

Patterns for Properties

Great, so we know how to write testable code. But how do we write properties for more complex units, and even for integration testing? There aren’t a lot of educational resources on this subject that I know of, but I can recommend the following starting points:

“Choosing properties for property-based testing” by Scott Wlaschin, giving examples of properties within a set of common categories.
The talk “Property-Based Testing for Better Code” by Jessica Kerr, with examples of generating valid inputs and dealing with timeouts.

Taking a step back, we might ask “Why is it so hard to come up with these properties?” I’d argue that it’s because doing so forces us to understand our system in a way we’re not used to. It’s challenging to understand and express the general behavior of a system, rather than particular anecdotes that we’ve observed or come up with.

If you want to get better at writing properties, the only advice I can give you (in addition to studying whatever you can find in other projects) is to practice. Try it out on whatever you’re working on. Talk to your colleagues about using properties in addition to example-based tests at work. Begin at a smaller scale, testing simple functions, and progress towards testing larger parts of your system once you’re comfortable. It’s a long journey, filled with reward, surprise, and joy!

Testing Case Studies

With a basic understanding of PBT, how we can write testable code, and how to write properties for our system under test, we’re getting ready to dive into the case studies:

Credits

Thank you Chris Ford, Alejandro Serrano Mena, Tobias Pflug, Hillel Wayne, and Ulrik Sandberg for kindly providing your feedback on my drafts!

Why I'm No Longer Taking Donations

2018-12-29T00:00:00+01:00

Haskell at Work, the screencast focused on Haskell in practice, is approaching its one year birthday. Today, I decided to stop taking donations through Patreon due to the negative stress I’ve been experiencing.

The Beginning

This journey started in January 2018. Having a wave of inspiration after watching some of Gary Bernhardt’s new videos, I decided to try making my own videos about practical Haskell programming. The goal was not only to produce high-quality content, but also to deliver it with high video and audio quality. Haskell at Work was born, and the first video was surprisingly well-received by followers on Twitter.

With the subsequent episodes being published in rapid succession, a follower base on YouTube grew quickly. A thousand or so followers might not be exceptional for a programming screencast channel on YouTube, but to me this was exciting and unexpected. To be honest, Haskell is not exactly a mainstream programming language.

Early on, encouraged by some followers, and being eager to develop the concept, I decided to set up Patreon as a way for people to donate to Haskell at Work. Much like the follower count, the number of patrons and their monthly donations grew rapidly, beyond any hopes I had.

Fatigue Kicks In

The majority of screencasts were published between January and May. Then came the summer and my month-long vacation, in which I attended ZuriHac and spent three weeks in Bali with my wife and friends. Also, I had started getting side-tracked by my project to build a screencast video editor in Haskell. Working on Komposition also spawned the Haskell package gi-gtk-declarative, and my focus got swept away from screencasts. In all fairness, I’m not great at consistently doing one thing for an extended period. My creativity and energy come in bursts, and they may not strike where and when I hope. Maybe this can be managed or controlled somehow, but I don’t know how.

With the lower publishing pace over the summer, a vicious circle of anxiety and low productivity grew. I had thoughts about shutting down the Patreon back then, but decided to instead pause it for a few months.

Regaining Energy

By October, I had recovered some energy. I got very good feedback and lots of encouragement from people at Haskell eXchange, and decided to throw myself back into the game. I published one screencast in November, but something was still there nagging me. I felt pressure and guilt over not having delivered on the promise given.

By this time, the Patreon donations had covered my recording equipment expenses, hosting costs over the year, and a few programming books I bought. The donations were still coming in, however, at around $160 per month, with me producing no obvious value for the patrons. The guilt was still there, even stronger than before.

I’m certain that this is all in my head. I do not blame any supporter for these feelings. You have all been great! With all the words of caution you hear about not reading the comments, having a YouTube channel filled with positive feedback, and almost exclusively thumbs-up ratings, I’m beyond thankful for the support I have received.

Trying Something Else

After Christmas this year, I had planned to record and publish a new screencast. Various personal events got in the way, though, and I had very little time to spend on working with it, resulting in the same kind of stress. I took a step back and thought about it carefully, and I’ve realized that money is not a good driver for the free material and open-source code work that I do, and that it’s time for a change.

I want to make screencasts because I love doing it, and I will do so when I have time and energy.

From the remaining funds in my PayPal account, I have allocated enough to keep the domain name and hosting costs covered for another year, and I have donated the remaining amount (USD 450) to Haskell.org.

Please keep giving me feedback and suggestions for future episodes. Your ideas are great! I’m looking forward to making more Haskell at Work videos in the future, and I’m toying around with ideas on how to bring in guests, and possibly trying out new formats. Stay tuned, and thank you all for your support!

Writing a Screencast Video Editor in Haskell

2018-10-26T00:00:00+02:00

For the last six months I’ve been working on a screencast video editor called Komposition, and it’s now released and open source. This is an experience report, based on a talk from Lambda World Cádiz 2018, that’ll give an overview of Komposition’s design, implementation, testing, and planned future work.

Background

It all began with Haskell at Work, the series of screencasts focused on practical Haskell that I’ve been producing the last year. The screencasts are fast-paced tutorials centered around the terminal and text editor, complemented by a voice-over audio track.

My workflow for producing screencasts looks like this:

Write a very detailed script. The script usually gets uploaded to the website as-is, being used as the show notes for the screencast.
Record video separately, including all mistakes and retries, in a single video file. Each little part of editing or running commands in the terminal is separated by a rest, where I don’t type anything at all for a few seconds.
Record audio separately, where the voice-over audio is based on the script. A mistake or mispronunciation is corrected by taking a small break and then trying the same sentence or paragraph again.
Cut and arrange all parts of the screencast using a video editor. This is the most painful and time-consuming part of the process.

I’ve found this workflow beneficial for me, partly because of being a non-native English speaker and not being keen on coding and talking simultaneously, but also because I think it helps with organizing the content into a cohesive narrative. I can write the script almost as a text-based tutorial, read it through to make sure it flows well, and then go into recording. Having to redo the recording phase is very time-consuming, so I’m putting a lot of effort into the script phase to catch mistakes early.

Video Editors

I’ve used a bunch of video editors. First, I tried free software alternatives, including Kdenlive, OpenShot, and a few more. Unfortunately, the audio effects available were a bit disappointing. I use, at a minimum, normalization and a noise gate. Having a good compressor is a plus.

More importantly, these applications are built for general video editing, like film shot with a camera, and they are optimized for that purpose. I’m using a very small subset of their feature set, which is not suited to my editing workflow. It works, but the tasks are repetitive and time-consuming.

On the commercial side, there are applications like Premiere Pro and Final Cut Pro. These are proprietary and expensive systems, but they offer very good effects and editing capabilities. I used Premiere Pro for a while and enjoyed the stability and quality of the tools, but I still suffered from the repetitive workload of cutting and organizing the video and audio clips to form my screencasts.

What does a programmer do when faced with a repetitive task? Spend way more time on automating it! Thus, I started what came to be the greatest yak shave of my life.

Building a Screencast Video Editor

I decided to build a screencast video editor tailored to my workflow; an editor with a minimal feature set, doing only screencast editing, and doing that really well.

You might think “why not extend the free software editors to cover your needs?” That is a fair question. I wanted to rethink the editing experience, starting with a blank slate, and question the design choices made in the traditional systems. Also, to be honest, I’m not so keen on using my spare time to write C++.

I decided to write it in GHC Haskell, using GTK+ for the graphical user interface. Another option would be Electron and PureScript, but the various horror stories about Electron memory usage, in combination with it running on NodeJS, made me decide against it. As I expected my application to perform some performance-critical tasks around video and audio processing, Haskell seemed like the best choice of the two. There are many other languages and frameworks that could’ve been used, but this combination fit me well.

Komposition

About five months later, after many hours of hacking, Komposition was released open-source under the Mozilla Public License 2.0. The project working name was “FastCut,” but when making it open source I renamed it to the Swedish word for “composition.” It has nothing to do with KDE.

Komposition is a modal, cross-platform, GUI application. The modality means that you’re always in exactly one mode, and that only certain actions can be taken depending on the mode. It currently runs on Linux, Windows, and macOS, if you compile and install it from source.

At the heart of the editing model lies the hierarchical timeline, which we’ll dive into shortly. Other central features of Komposition include the automatic video and audio classification tools. They automate the tedious task of working through your recorded video and audio files to cut out the interesting parts. After importing, you’ll have a collection of classified video scenes, and a collection of classified audio parts. The audio parts are usually sentences, but it depends on how you take rests when recording your voice-over audio.

Keyboard-Driven Editing

Komposition is built for keyboard-driven editing, currently with Vim-like bindings, and commands transforming the hierarchical timeline, inspired by Paredit for Emacs.

There are corresponding menu items for most commands, and there’s limited support for using the mouse in the timeline. If you need help with keybindings, press the question mark key, and you will be presented with a help dialog showing the bindings available in the current mode.

The Hierarchical Timeline

The hierarchical timeline is a tree structure with a fixed depth. At the leaves of the tree there are clips. Clips are placed in the video and audio tracks of a parallel. It’s named that way because the video and audio tracks play in parallel. Clips within a track play in sequence.

If the audio track is longer than the video track, the remaining part of the video track is padded with still frames from an adjacent clip.

Explicit gaps can be added to the video and audio tracks. Video gaps are padded with still frames, and audio gaps are silent. When adding the gap, you specify its duration.

Parallels are put in sequences, and they are played in sequence. The first parallel is played until its end, then the next is played, and so on. Parallels and sequences are used to group cohesive parts of your screencast, and to synchronize the start of related video and audio clips.

When editing within a sequence or parallel, for example when deleting or adding clips, you will not affect the synchronization of other sequences or parallels. If there were only audio and video tracks in Komposition, deleting an audio clip would possibly shift many other audio clips, causing them to get out of sync with their related video clips. This is why the timeline structure is built up using sequences and parallels.

Finally, the timeline is the top-level structure that contains sequences. This is merely for organizing larger parts of a screencast. You can comfortably build up your screencast with a single sequence containing parallels.

Note that the timeline always contains at least one sequence, and that all sequences contain at least one parallel. The tracks within a parallel can be empty, though.
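To make the structure concrete, here is a minimal sketch of how such a fixed-depth tree could be modeled with plain data types. All names (Part, Parallel, Sequence, Timeline, and the duration helpers) and the choice of seconds as Double are assumptions made for illustration, not Komposition's actual definitions. NonEmpty encodes the "at least one" rules, and the longer track decides the duration of a parallel.

```haskell
import Data.List.NonEmpty (NonEmpty (..))

-- Durations in seconds (an assumption made for this sketch).
type Duration = Double

-- A track element is either a named clip or an explicit gap.
data Part
  = Clip String Duration
  | Gap Duration
  deriving (Eq, Show)

-- A parallel holds a video track and an audio track that play side by
-- side. Either track may be empty.
data Parallel = Parallel
  { videoTrack :: [Part]
  , audioTrack :: [Part]
  }
  deriving (Eq, Show)

-- NonEmpty encodes "at least one parallel" and "at least one sequence".
newtype Sequence = Sequence (NonEmpty Parallel)
  deriving (Eq, Show)

newtype Timeline = Timeline (NonEmpty Sequence)
  deriving (Eq, Show)

partDuration :: Part -> Duration
partDuration (Clip _ d) = d
partDuration (Gap d) = d

trackDuration :: [Part] -> Duration
trackDuration = sum . map partDuration

-- Tracks play in parallel, so the longer one decides the duration.
parallelDuration :: Parallel -> Duration
parallelDuration p =
  max (trackDuration (videoTrack p)) (trackDuration (audioTrack p))

-- Parallels and sequences play one after another, so durations add up.
sequenceDuration :: Sequence -> Duration
sequenceDuration (Sequence ps) = sum (fmap parallelDuration ps)

timelineDuration :: Timeline -> Duration
timelineDuration (Timeline ss) = sum (fmap sequenceDuration ss)

-- One sequence with two parallels (hypothetical file names).
example :: Timeline
example =
  Timeline
    ( Sequence
        ( Parallel [Clip "intro.mp4" 4] [Clip "intro.wav" 6]
            :| [Parallel [Clip "demo.mp4" 10, Gap 2] [Clip "demo.wav" 9]]
        )
        :| []
    )
```

In GHCi, timelineDuration example evaluates to 18.0: the first parallel lasts as long as its audio (6 seconds), the second as long as its video (12 seconds). Deleting a clip inside one parallel changes only that parallel's duration, which is the isolation property described above.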

Documentation

The project website includes a user guide, covering the concepts of the application, along with recommendations on how to plan and record your screencasts to achieve the best results and experience using Komposition.

The landing page features a tutorial screencast, explaining how to import, edit, and render a screencast. It’s already a bit outdated, but I might get around to making an updated version when doing the next release. Be sure to check it out, it’ll give you a feel for how editing with Komposition works.

I can assure you, editing a screencast, that is about editing screencasts using your screencast editor, in your screencast editor, is quite the mind-bender. And I thought I had recursion down.

Implementation

I’ve striven to keep the core domain code in Komposition pure. That is, only pure functions and data structures. Currently, the timeline and focus, command and event handling, key bindings, and the video classification algorithm are all pure. There are still impure parts, like audio and video import, audio classification, preview frame rendering, and the main application control flow.

Some parts are inherently effectful, so it doesn’t make sense to try writing them as pure functions, but as soon as the complexity increases, it’s worth considering what can be separated out as pure functions. The approach of “Functional core, imperative shell” describes this style very well. If you can do this for the complex parts of your program, you have a great starting point for automated testing, something I’ll cover later in this post.

GTK+

Komposition’s GUI is built with GTK+ 3 and the Haskell bindings from gi-gtk. I noticed early in the project that the conventional programming style and the APIs of GTK+ were imperative, callback-oriented, and all operating within IO, making it painful to use from Haskell, especially while trying to keep complex parts of the application free of effects.

To mitigate this issue, I started building a library (more yak shaving!) called gi-gtk-declarative, which is a declarative layer on top of gi-gtk. The previous post in this blog describes the project in detail. Using the declarative layer, rendering becomes a pure function (state -> Widget event), where state and event vary with the mode and view that’s being rendered. Event handling is based on values and pure functions.

There are cases where custom widgets are needed, calling the imperative APIs of gi-gtk, but they are well-isolated and few in number.

Type-Indexed State Machines

When starting this project, I had an itch of curiosity that I decided to scratch. Last year I worked on porting the Idris ST library, providing a way to encode type-indexed state machines in GHC Haskell. The library is called Motor. I wanted to try it in the context of a GUI application.

Just to give some short examples, the following type signatures, used in the main application control flow, operate on the application state machine that’s parameterized by its mode.

The start function takes a name and key maps, and creates a new application state machine associated with the name, and in the state of WelcomeScreenMode:

start
  :: Name n
  -> KeyMaps
  -> Actions m '[ n !+ State m WelcomeScreenMode] r ()

The returnToTimeline function takes a name of an existing state machine and a TimelineModel, and transitions the application from the current mode to TimelineMode, given that the current mode instantiates the ReturnsToTimeline class:

returnToTimeline
  :: ReturnsToTimeline mode
  => Name n
  -> TimelineModel
  -> Actions m '[ n := State m mode !--> State m TimelineMode] r ()

The usage of Motor in Komposition is likely the most complicated aspect of the codebase, and I have been unsure if it is worth the cost. On the positive side, combining this with GADTs and the singleton pattern for mode-specific commands and events, GHC can really help out with pattern-matching and exhaustivity-checking. No nasty (error "this shouldn't happen") as the fall-through case when pattern matching!

I’m currently in the process of rewriting much of the state machine encoding in Komposition, using it more effectively for managing windows and modals in GTK+, and I think this warrants the use of Motor more clearly. Otherwise, it might be worth falling back to a less advanced encoding, like the one I described in Finite-State Machines, Part 2: Explicit Typed State Transitions.

Singleton Pattern

The singleton pattern is used in a few places in Komposition, as mentioned above. To show a concrete example, the encoding of mode-specific commands and events is based on the Mode data type.

data Mode
  = WelcomeScreenMode
  | TimelineMode
  | LibraryMode
  | ImportMode

This type is lifted to the kind level using the DataKinds language extension, and is used in the corresponding definition of the singleton SMode.

data SMode m where
  SWelcomeScreenMode :: SMode WelcomeScreenMode
  STimelineMode :: SMode TimelineMode
  SLibraryMode :: SMode LibraryMode
  SImportMode :: SMode ImportMode

The Command data type is parameterized by the mode in which the command is valid. Some commands are valid in all modes, like Cancel and Help, while others are specific to a certain mode. FocusCommand and JumpFocus are only valid in the timeline mode, as seen in their type signatures below.

data Command (mode :: Mode) where
  Cancel :: Command mode
  Help :: Command mode
  FocusCommand :: FocusCommand -> Command TimelineMode
  JumpFocus :: Focus SequenceFocusType -> Command TimelineMode
  -- ...

Finally, by passing a singleton for a mode to the keymaps function, we get back a keymap specific to that mode. This is used to do event handling and key bindings generically for the entire application.

keymaps :: SMode m -> KeyMap (Command m)
keymaps =
  \case
    SWelcomeScreenMode ->
      [ ([KeyChar 'q'], Mapping Cancel)
      , ([KeyEscape], Mapping Cancel)
      , ([KeyChar '?'], Mapping Help)
      ]
    -- ...

In the spirit of calling out usage of advanced GHC features, I think singletons and GADTs are one more such instance. However, I find them very useful in this context, and worth the added cognitive load. You don’t have to go full “Dependent Haskell” or bring in the singletons library to leverage some of these techniques.
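For readers who want to try the technique in isolation, here is a trimmed-down sketch that compiles on its own with only the base library. It keeps two modes, replaces the key sequences with single characters, and adds a hypothetical Delete command that is only valid in the timeline mode. KeyMap, describe, and boundCommands are simplifications invented for this example, not Komposition's definitions.

```haskell
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE GADTs #-}
{-# LANGUAGE KindSignatures #-}
{-# LANGUAGE LambdaCase #-}

data Mode = WelcomeScreenMode | TimelineMode

-- Singleton: exactly one value for each type-level mode.
data SMode (m :: Mode) where
  SWelcomeScreenMode :: SMode 'WelcomeScreenMode
  STimelineMode :: SMode 'TimelineMode

-- Commands are indexed by the mode in which they are valid.
data Command (m :: Mode) where
  Cancel :: Command m
  Help :: Command m
  Delete :: Command 'TimelineMode -- hypothetical timeline-only command

-- Simplified keymap: an association list from characters to commands.
type KeyMap cmd = [(Char, cmd)]

-- Matching on the singleton tells GHC which mode we are in, so each
-- branch may only list commands valid for that mode.
keymaps :: SMode m -> KeyMap (Command m)
keymaps = \case
  SWelcomeScreenMode -> [('q', Cancel), ('?', Help)]
  STimelineMode -> [('q', Cancel), ('?', Help), ('d', Delete)]

describe :: Command m -> String
describe = \case
  Cancel -> "cancel"
  Help -> "help"
  Delete -> "delete"

-- Names of the commands bound in a given mode.
boundCommands :: SMode m -> [String]
boundCommands = map (describe . snd) . keymaps
```

Adding ('d', Delete) to the SWelcomeScreenMode branch is rejected by the type checker, because Delete has type Command 'TimelineMode. That is the kind of help from GHC that I find worth the added cognitive load.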

Automatic Scene Classification

The automatic classification of scenes in video is implemented mainly using the Pipes and ffmpeg-light libraries. It begins with readVideoFile, which, given a video file path, gives us a Pipes.Producer of timed frames. These are basically pixel arrays tagged with their time in the original video. The producer streams the video file and yields timed frames as they are consumed.

readVideoFile :: MonadIO m => FilePath -> Producer (Timed Frame) m ()

The frames are converted from JuicyPixels frames to massiv frames. Then, the producer is passed to the classifyMovement function together with a minimum segment duration (such that segments cannot be shorter than N seconds), which returns a producer of Classified frames, tagging each frame as being either moving or still.

classifyMovement
  :: Monad m
  => Time -- ^ Minimum segment duration
  -> Producer (Timed RGB8Frame) m ()
  -> Producer (Classified (Timed RGB8Frame)) m ()

data Classified f
  = Moving f
  | Still f
  deriving (Eq, Functor, Show)

Finally, the classifyMovingScenes function, given a full duration of the original video and a producer of classified frames, returns a producer that yields ProgressUpdate values and returns a list of time spans.

classifyMovingScenes ::
     Monad m
  => Duration -- ^ Full length of video
  -> Producer (Classified (Timed RGB8Frame)) m ()
  -> Producer ProgressUpdate m [TimeSpan]

The time spans describe which parts of the original video are considered moving scenes, and the progress update values are used to render a progress bar in the GUI as the classification makes progress.
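The last step, turning a stream of classified frames into time spans, can be illustrated with a much simplified sketch. It uses plain lists instead of Pipes producers, carries only a timestamp per frame, and ignores progress updates and the minimum segment duration. The names movingSpans, Classified, and TimeSpan as defined here are assumptions for this example and do not mirror the real implementation.

```haskell
-- A frame classification tagged with its timestamp in seconds.
data Classified
  = Moving Double
  | Still Double
  deriving (Eq, Show)

-- Start and end of a moving scene, in seconds.
data TimeSpan = TimeSpan Double Double
  deriving (Eq, Show)

-- Collect consecutive moving frames into time spans. A span starts at the
-- first moving frame and ends at the next still frame, or at the full
-- duration if the video ends while still moving.
movingSpans :: Double -> [Classified] -> [TimeSpan]
movingSpans fullDuration = go Nothing
  where
    -- The first argument is the start time of the currently open span.
    go Nothing [] = []
    go (Just start) [] = [TimeSpan start fullDuration]
    go Nothing (Moving t : rest) = go (Just t) rest
    go Nothing (Still _ : rest) = go Nothing rest
    go (Just start) (Moving _ : rest) = go (Just start) rest
    go (Just start) (Still t : rest) = TimeSpan start t : go Nothing rest
```

For example, movingSpans 5 [Still 0, Moving 1, Moving 2, Still 3, Moving 4] evaluates to [TimeSpan 1.0 3.0,TimeSpan 4.0 5.0].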

Automatic Sentence Classification

Similar to the video classification, Komposition also classifies audio files to find sentences or paragraphs in voice-over audio. The implementation relies on the sox tool, a separate executable that’s used to:

normalize the audio,
apply a noise gate, and
auto-split by silence.

One problem with sox is that it, as far as I can tell, can only write the split audio files to disk. I haven’t found a way to retrieve the time spans in the original audio file, so that information is unfortunately lost. This will become more apparent when Komposition supports editing the start and end position of clips, as it can’t be supported for audio clips produced by sox.

I hope to find some way around this, by extending or parsing output from sox somehow, by using libsox through FFI bindings, or by implementing the audio classification in Haskell. I’m trying to avoid the last alternative.

Rendering

The rendering pipeline begins with a pure function that converts the hierarchical timeline to a flat timeline representation. This representation consists of a video track and an audio track, where all gaps are made explicit, and where the tracks are of equal duration.
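A rough sketch of the flattening idea follows, under heavy simplification: a timeline is just a list of parallels, durations are seconds as Double, and the shorter track of each parallel is padded with a generic gap rather than with still frames. The types and function names are made up for this example.

```haskell
data Part
  = Clip String Double
  | Gap Double
  deriving (Eq, Show)

-- Video parts and audio parts that play side by side.
data Parallel = Parallel [Part] [Part]

-- The flat representation: one video track and one audio track.
data Flat = Flat
  { flatVideo :: [Part]
  , flatAudio :: [Part]
  }
  deriving (Eq, Show)

duration :: [Part] -> Double
duration = sum . map partDuration
  where
    partDuration (Clip _ d) = d
    partDuration (Gap d) = d

-- Make implicit padding explicit by appending a gap to a short track.
padTo :: Double -> [Part] -> [Part]
padTo target parts
  | missing > 0 = parts ++ [Gap missing]
  | otherwise = parts
  where
    missing = target - duration parts

-- Both tracks of a flattened parallel get the duration of the longer one.
flattenParallel :: Parallel -> Flat
flattenParallel (Parallel video audio) =
  Flat (padTo d video) (padTo d audio)
  where
    d = max (duration video) (duration audio)

-- Concatenate the padded tracks of all parallels, in order.
flatten :: [Parallel] -> Flat
flatten parallels =
  Flat (concatMap flatVideo flats) (concatMap flatAudio flats)
  where
    flats = map flattenParallel parallels
```

Because every parallel is padded before concatenation, the two flat tracks always end up with the same total duration, and the start of each parallel stays synchronized across video and audio.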

From the flat representation, an FFmpeg command is built up. This is based on a data type representation of the FFmpeg command-line syntax, and most importantly the filter graph DSL that’s used to build up the complex FFmpeg rendering command.

Having an intermediate data type representation when building up complex command invocations, instead of going directly to Text or String, is something I highly recommend.
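As a small illustration of that recommendation, the following sketch models a tiny subset of an FFmpeg invocation as data and converts it to an argument list only at the very end. The data types and toArgs are hypothetical and far simpler than the representation used in Komposition. The only FFmpeg features assumed are the -i and -vf options and the scale and fps filters.

```haskell
import Data.List (intercalate)

-- A tiny, hypothetical model of a command line.
newtype Input = Input FilePath

data Filter
  = Scale Int Int -- width and height
  | SetFrameRate Int -- frames per second

data Command = Command
  { inputs :: [Input]
  , videoFilters :: [Filter]
  , output :: FilePath
  }

renderFilter :: Filter -> String
renderFilter (Scale w h) = "scale=" ++ show w ++ ":" ++ show h
renderFilter (SetFrameRate r) = "fps=" ++ show r

-- Convert the structured command to process arguments in one place.
toArgs :: Command -> [String]
toArgs cmd =
  concatMap (\(Input path) -> ["-i", path]) (inputs cmd)
    ++ filterArgs
    ++ [output cmd]
  where
    filterArgs
      | null (videoFilters cmd) = []
      | otherwise =
          ["-vf", intercalate "," (map renderFilter (videoFilters cmd))]
```

Working with values like these means that combining, inspecting, and testing commands happens on structured data, and escaping or separator mistakes are confined to a single rendering function.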

Preview

The preview pipeline is very similar to the rendering pipeline. In fact, it’s using the same machinery, except for the output being a streaming HTTP server instead of a file, and that it’s passed the proxy media instead of the full-resolution original video. On the other side there’s a GStreamer widget embedded in the GTK+ user interface that plays back the HTTP video stream.

Using HTTP might seem like a strange choice for IPC between FFmpeg and GStreamer. Surprisingly, it’s the option that has worked most reliably across operating systems for me, but I’d like to find another IPC mechanism, eventually.

The HTTP solution is also somewhat unreliable, as I couldn’t find a way to ensure that the server is ready to stream, so there’s a race condition between the server and the GStreamer playback, silently “solved” with an ugly threadDelay.

Testing

Let’s talk a bit about testing in Komposition. I’ve used multiple techniques, but the one that stands out as unconventional, and specific to the domain of this application, is the color-tinting video classifier.

It uses the same classification functions as described before, but instead of returning time spans, it creates a new video file where moving frames are tinted green and still frames are tinted red. This tool made it much easier to tweak the classifier and test it on real recordings.

Property-Based Testing

I’ve used Hedgehog to test some of the most complex parts of the codebase. This has been incredibly valuable, and has found numerous errors and bad assumptions in the implementation. The functionality tested with Hedgehog and properties includes:

Timeline commands and movement: It generates a sequence of commands, together with a consistent timeline and focus. It folds over the commands, applying each one to the current timeline and focus, and asserts that the resulting timeline and focus are still consistent. The tested property ensures that there’s no possibility of out-of-bounds movement, and that deleting or otherwise transforming the timeline doesn’t cause an inconsistent timeline and focus pair.
Video scene classification: It generates known test scenes of random duration, that are either scenes of only still frames, or scenes with moving frames. It translates the test scenes, which are just descriptions, to real frames, and runs the classifier on the frames. Finally, it checks that the classified scenes are the same as the generated test scenes.
Flattening of hierarchical timeline: The flattening process converts the hierarchical timeline to a flat representation. The tested property ensures that hierarchical and flat timelines are always of the same total duration. There are other properties that could be added, e.g. that all clips in the original timeline are present in the flat timeline.
Round-trip properties of FFmpeg format printers and parsers: This is a conventional use of property-based tests. It ensures that parsing an FFmpeg-format timestamp string, produced by the FFmpeg-format timestamp printer, gives you back the same timestamp as you started with.
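To show the shape of a round-trip property without pulling in any dependencies, here is a sketch that uses only the base library. Instead of Hedgehog generators, it checks the property exhaustively over a range of inputs, which is a stand-in technique and not how the real tests work. The timestamp format is simplified to whole seconds printed as HH:MM:SS, and printTimestamp, parseTimestamp, and roundTrips are names invented for this example.

```haskell
import Text.Printf (printf)
import Text.Read (readMaybe)

-- Print a whole number of seconds as HH:MM:SS.
printTimestamp :: Int -> String
printTimestamp total = printf "%02d:%02d:%02d" h m s
  where
    (h, rest) = total `divMod` 3600
    (m, s) = rest `divMod` 60

-- Parse HH:MM:SS back into a number of seconds.
parseTimestamp :: String -> Maybe Int
parseTimestamp str =
  case splitOn ':' str of
    [h, m, s] -> do
      h' <- readMaybe h
      m' <- readMaybe m
      s' <- readMaybe s
      pure (h' * 3600 + m' * 60 + s')
    _ -> Nothing

-- Split a string on a separator character.
splitOn :: Char -> String -> [String]
splitOn sep s =
  case break (== sep) s of
    (chunk, []) -> [chunk]
    (chunk, _ : rest) -> chunk : splitOn sep rest

-- The property: parsing a printed timestamp gives back the original.
roundTrips :: Int -> Bool
roundTrips t = parseTimestamp (printTimestamp t) == Just t

-- Exhaustive check over a fixed range, standing in for random generation.
checkAll :: Bool
checkAll = all roundTrips [0 .. 100000]
```

With Hedgehog, the fixed range would be replaced by a generator, and a failing case would be shrunk to a minimal counterexample automatically.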

There are also cases of example-based testing, but I won’t cover them in this report.

Used Packages

Komposition depends on a fairly large collection of Haskell and non-Haskell tools and libraries to work with video, audio, and GUI rendering. I want to highlight some of them.

haskell-gi

The haskell-gi family of packages is used extensively, including:

gi-gobject
gi-glib
gi-gst
gi-gtk
gi-gdk
gi-gdkpixbuf
gi-pango

They supply bindings to GTK+, GStreamer, and more, primarily generated from the GObject Introspection metadata. While GTK+ has been problematic to work with in Haskell, these bindings have been crucial to the development of Komposition.

massiv & massiv-io

The massiv package is an array library that uses function composition to accomplish a sort of fusion. It’s used to do parallel pixel comparison in the video classifier. Thank you, Alexey Kuleshevich (author and maintainer of the massiv packages) for helping me implement the first version!

Pipes

The Pipes library is used extensively in Komposition:

The streaming video reader from ffmpeg-light is wrapped in a Pipes.Producer to provide composable streaming.
In general, effectful operations with progress notifications are producers that yield ProgressUpdate values as they perform their work.
pipes-safe is used for handling resources and processes.
pipes-parse is used in stateful transformations in the video classifier.

A big thanks to Gabriel Gonzalez, the author of Pipes and the related packages!

Others

To name a few more:

I’ve used protolude as the basis for a custom prelude.
The lens library is used for working with nested data structures, positional updates in lists, and monadic transformations.
typed-process is used together with pipes-safe, in a situation where I couldn’t use the regular process package because of version constraint issues. The typed-process API turned out to be really nice, so I think it will be used more in the future.

Summary

Looking back at this project, the best part has been to first write it for my own use, and later find out that quite a lot of people are interested in how it’s built, and even in using it themselves. I’ve already received pull requests, bug reports, usability feedback, and many kind words and encouragements. It’s been great!

Also, it’s been fun to work on an application that can be considered outside of Haskell’s comfort zone, namely a multimedia and GUI application. Komposition is not the first application to explore this space — see Movie Monad and Gifcurry for other examples — but it is exciting, nonetheless.

Speaking of using Haskell, the effort to keep complex domain logic free of effects, and the use of property-based testing with Hedgehog to lure out nasty bugs, has been incredibly satisfying and a great learning experience.

The Problematic Parts

It’s not been all fun and games, though. I’ve spent many hours struggling with FFmpeg, video and audio codecs, containers, and streaming. Executing external programs and parsing their output has been time-consuming and very hard to test. GTK+ has been very valuable, but also difficult to work with in Haskell. Finally, management of non-Haskell dependencies, in combination with trying to be cross-platform, is painful. Nix has helped with my own setup, but not everyone will install Komposition using Nix.

Next Steps

There are many features that I’d like to add in the near future.

More commands to work with the timeline, e.g. yank, paste, and join.
More Vim-like movement commands.
Previewing of any timeline part. Currently you can only preview the entire timeline, a sequence, or a parallel.
Adjustable clips, meaning that you can change the time span of a clip. This is useful if the automatic classification generated slightly incorrect clip time spans.
Content-addressed project files, to enable reuse of generated files, and to avoid collision. This includes most files involved in importing, and generated preview frames.

It would be great to set up packaging for popular operating systems, so that people can install Komposition without compiling from source. There’s already a Nix expression that can be used, but I’d like to supply Debian packages, macOS bundles, Windows installers, and possibly more.

There are some things that I’d like to explore and assess, but that won’t necessarily happen. The first is to use GStreamer in the rendering pipeline, instead of FFmpeg. I think this is possible, but I haven’t done the research yet. The second thing, an idea that evolved when talking to people at Lambda World Cádiz, would be to use voice recognition on audio clips to show text in the preview area, instead of showing a waveform.

Finally, there are some long-awaited refactorings and cleanups waiting, and optimization of the FFmpeg filter graph and the diffing in gi-gtk-declarative. Some of these I’ve already started on.

Wrap-Up

I hope you enjoyed reading this report, and that you now have a clearer picture of Komposition, its implementation, and where it’s going. If you’re interested in using it, let me know how it works out, either by posting in the Gitter channel or by reaching out on Twitter. If you want to contribute by reporting bugs or sending pull requests, there’s the issue tracker on GitHub.

Thanks for reading!

Declarative GTK+ Programming with Haskell

2018-09-04T00:00:00+02:00

This post introduces a declarative GTK+ architecture for Haskell which I’ve been working on during the journey with FastCut, a video editor specialized for my own screencast editing workflow. It outlines the motivation, introduces the packages and their uses, and highlights parts of the implementation.

Imperative GUI Programming

When starting to work on FastCut, I wanted to use a GUI framework that would allow FastCut to be portable across Linux, macOS, and Windows. Furthermore, I was looking for a native GUI framework, as opposed to something based on web technology. GTK+ stood out as an established framework, with good bindings for Haskell in the haskell-gi package.

It didn’t take me very long before the imperative APIs of GTK+ became problematic. Widgets are created, properties are set, callbacks are attached, and methods are called, all in IO. With the imperative style, I see there being two distinct phases of a widget’s life cycle, which need to be handled separately in your rendering code:

construction, where you instantiate the widget with its initial state, and
subsequent updates, where some relevant state changes and the widget’s state is modified to reflect that. Often, the application state and the widget state are the same.

This programming style makes testing and locally reasoning about your code really hard, as it forces you to scatter your application logic and state over IORefs, MVars, and various callbacks, mutating shared state in IO. These symptoms are not specific to GTK+ or its Haskell bindings, but are common in object-oriented and imperative GUI frameworks. If you’re familiar with JavaScript and jQuery UI programming, you have most likely experienced similar problems.

While GTK+ has the Glade “WYSIWYG” editor, and the Builder class, they only make the first phase declarative. You still need to find all widgets in the instantiated GUI, attach event handlers, and start mutating state, to handle the second phase.

After having experienced these pains and getting repeatedly stuck when building FastCut with the imperative GTK+ APIs, I started exploring what a declarative GTK+ programming model could look like.

Going Declarative

The package I’ve been working on is called gi-gtk-declarative, and it aims to be a minimal declarative layer on top of the GTK+ bindings in haskell-gi.

Rendering becomes a pure function from state to a declarative widget, which is a data structure representation of the user interface. The library uses a patching mechanism to calculate the updates needed to be performed on the actual GTK+ widgets, based on differences in the declarative widgets, similar to a virtual DOM implementation.
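To give an intuition for the patching idea, here is a toy diff over a two-constructor widget type. It is only a sketch of the general virtual DOM technique. The Widget and Patch types and the diff function are invented for this example, and the actual algorithm in gi-gtk-declarative is different and considerably more involved.

```haskell
-- A toy declarative widget tree.
data Widget
  = Label String
  | Box [Widget]
  deriving (Eq, Show)

-- Instructions for updating an existing, stateful widget.
data Patch
  = Keep -- nothing changed
  | SetText String -- update the label text in place
  | Replace Widget -- discard the old widget and create the new one
  | PatchChildren [Patch] Int [Widget]
  -- ^ patches for shared positions, number of trailing old children to
  --   remove, and new children to append
  deriving (Eq, Show)

-- Compare the previous and the next declarative widget.
diff :: Widget -> Widget -> Patch
diff old new
  | old == new = Keep
diff (Label _) (Label text) = SetText text
diff (Box olds) (Box news) =
  PatchChildren
    (zipWith diff olds news)
    (max 0 (length olds - length news))
    (drop (length olds) news)
diff _ new = Replace new
```

For instance, diffing Box [Label "a", Label "b"] against Box [Label "a", Label "c", Label "d"] yields PatchChildren [Keep,SetText "c"] 0 [Label "d"]: the first child is untouched, the second is updated in place, and the third is created.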

Event handling is declarative, i.e. you declare which events are emitted on particular signals, rather than mutating state in callbacks or using concurrency primitives to communicate changes. Events of widgets can be mapped to other data types using the Functor instance, making widgets reusable and composable.
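The role of the Functor instance can be shown with a toy widget type whose type parameter is the event it emits. This is a sketch with made-up types (Widget, CounterEvent, AppEvent) and does not use the library's real Widget type. It shows a reusable component with a local event type being embedded twice in a larger view by mapping its events.

```haskell
{-# LANGUAGE DeriveFunctor #-}

-- A toy declarative widget, parameterized by the type of emitted events.
data Widget event
  = Button String event -- label, and the event emitted on click
  | Box [Widget event]
  deriving (Functor, Show)

-- A reusable component with its own local event type.
data CounterEvent = Incremented | Decremented
  deriving (Eq, Show)

counter :: Widget CounterEvent
counter = Box [Button "+" Incremented, Button "-" Decremented]

-- The application event type wraps the events of two counters.
data AppEvent
  = LeftCounter CounterEvent
  | RightCounter CounterEvent
  deriving (Eq, Show)

-- fmap lifts the component's events into the application's event type.
view :: Widget AppEvent
view = Box [fmap LeftCounter counter, fmap RightCounter counter]

-- Collect every event that the widget tree can emit, in order.
events :: Widget event -> [event]
events (Button _ e) = [e]
events (Box children) = concatMap events children
```

The counter component knows nothing about AppEvent, which is what makes it reusable. The parent decides how to tag its events.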

Finally, gi-gtk-declarative tries to be agnostic of the application architecture it’s used in. It should be possible to reuse it for different styles. As we shall see later there is a state reducer-based architecture available in gi-gtk-declarative-app-simple, and FastCut is using a custom architecture based on indexed monads and the Motor library.

Declarative Widgets

By reusing type-level information provided by the haskell-gi framework, and by using the OverloadedLabels language extension in GHC, gi-gtk-declarative can support many widgets automatically. Even though some widgets need special support, specifically containers, it is a massive benefit not having to redefine all GTK+ widgets.

Single Widgets

A single widget, i.e. one that cannot contain children, is constructed using the widget smart constructor, taking a GTK+ widget constructor and an attribute list:

myButton = widget Button []

myCheckButton = widget CheckButton []

Note that Button and CheckButton are constructors from gi-gtk. They are not defined by gi-gtk-declarative.

Bins

Bins, in GTK+ lingo, are widgets that contain a single child. They are created using the bin smart constructor:

myScrollArea =
  bin ScrolledWindow [] $
    widget Button []

Other examples of bins are Expander, Viewport, and SearchBar.

Containers

Containers are widgets that can contain zero or more child widgets, and they are created using the container smart constructor:

myListBox =
  container ListBox [] $ do
    bin ListBoxRow [] $ widget Button []
    bin ListBoxRow [] $ widget CheckButton []

In regular GTK+, containers often accept any type of widget to be added to the container, and if the container requires its children to be of a specific type, it will automatically insert the in-between widget implicitly. An example of such a container is ListBox, which automatically wraps added children in ListBoxRow bins, if needed.

In gi-gtk-declarative, on the other hand, containers restrict the type of their children to make these relationships explicit. Thus, as seen above, to embed child widgets in a ListBox they have to be wrapped in ListBoxRow bins.

Another example, although slightly different, is Box. While Box doesn’t have a specific child widget type, you can in regular GTK+ add children using the boxPackStart and boxPackEnd methods. The arguments to those methods are expand, fill, and padding, which control how the child is rendered (packed) in the box. As gi-gtk-declarative doesn’t support method calls, there is a helper function and corresponding declarative widget boxChild to control Box child rendering:

myBox =
  container Box [] $ do
    boxChild False False 0 $ widget Button []
    boxChild True True 0 $ widget CheckButton []

Note that we are using do notation to construct adjacent boxChild markup values. There is a monadic MarkupOf builder in the library that the container smart constructor takes as its last argument. Although we need the Monad instance to be able to use do notation, the return value of such expressions is rarely useful, and is thus constrained to () by the library.
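If you are curious how do notation can be used to collect adjacent children, the following sketch builds a minimal writer-like monad for that purpose. It is not the library's MarkupOf type. The names Markup, single, and children, as well as the toy Widget type, are assumptions made for illustration.

```haskell
-- A writer-like builder that accumulates child widgets.
newtype Markup child a = Markup ([child], a)

instance Functor (Markup child) where
  fmap f (Markup (cs, a)) = Markup (cs, f a)

instance Applicative (Markup child) where
  pure a = Markup ([], a)
  Markup (cs1, f) <*> Markup (cs2, a) = Markup (cs1 ++ cs2, f a)

instance Monad (Markup child) where
  Markup (cs1, a) >>= k =
    let Markup (cs2, b) = k a
     in Markup (cs1 ++ cs2, b)

-- Emit a single child.
single :: child -> Markup child ()
single c = Markup ([c], ())

-- Only builders returning () are accepted, similar in spirit to the
-- constraint described above.
children :: Markup child () -> [child]
children (Markup (cs, ())) = cs

data Widget
  = Button String
  | CheckButton String
  deriving (Eq, Show)

-- Adjacent statements in the do block become adjacent children.
myChildren :: [Widget]
myChildren = children $ do
  single (Button "OK")
  single (CheckButton "Remember me")
```

Here myChildren evaluates to [Button "OK",CheckButton "Remember me"]. The monad exists only to make sequencing read nicely, which is why constraining the result to () loses nothing.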

Attributes

All declarative widgets can have attributes, and so far we’ve only seen empty attribute lists. Attributes on declarative widgets are not the same as GTK+ properties. They do include GTK+ properties, but also include CSS class declarations and event handling.

Properties

One type of attribute is a property declaration. To declare a property, use the (:=) operator, which takes a property name label, and a property value, much like (:=) in haskell-gi:

myButtonWithLabel =
  widget Button [#label := "Click Here"]

myHorizontallyScrolledWindow =
  bin ScrolledWindow [ #hscrollbarPolicy := PolicyTypeAutomatic ] $
    someSuperWideWidget

myContainerWithMultipleSelection =
  container ListBox [ #selectionMode := SelectionModeMultiple ] $
    children

To find out what properties are available, see the GI.Gtk.Objects module, find the widget module you’re interested in, and see the “Properties” section. As an example, you’d find the properties available for Button here.

Events

Using the on attribute, you can emit events on GTK+ signal emissions.

counterButton clickCount =
  let msg = "I've been clicked "
            <> Text.pack (show clickCount)
            <> " times."
  in widget
       Button
       [ #label := msg
       , on #clicked ButtonClicked
       ]

Some events need to be constructed with IO actions, to be able to query underlying GTK+ widgets for attributes. The onM attribute receives the widget as its first argument, and returns an IO event action. In the following example getColorButtonRgba has type ColorButton -> IO (Maybe RGBA), and so we compose it with an fmap of the ColorChanged constructor to get an IO Event.

data Event = ColorChanged (Maybe RGBA)

colorButton color =
  widget
    ColorButton
    [ #title := "Selected color"
    , #rgba := color
    , onM #colorSet (fmap ColorChanged . getColorButtonRgba)
    ]

You can think of onM having the following signature, even if it’s really a simplified version:

onM
  :: Gtk.SignalProxy widget
  -> (widget -> IO event)
  -> Attribute widget event

Finally, CSS classes can be declared for widgets in the attributes list, using the classes attribute:

myAnnoyingButton =
  widget
    Button
    [ classes ["big-button"]
    , #label := "CLICK ME"
    ]

GI.Gtk.Declarative.App.Simple

In addition to the declarative widget library gi-gtk-declarative, there’s an application architecture for you to use, based on the state reducer design of Pux.

At the heart of this architecture is the App:

data App state event =
  App
    { update :: state -> event -> Transition state event
    , view :: state -> Widget event
    , inputs :: [Producer event IO ()]
    , initialState :: state
    }

The type parameters state and event will be instantiated with our specific types used in our application. For example, if we were writing a “Snake” clone, our state datatype would describe the current playing field, the snake length and where it’s been, the edible objects’ positions, etc. The event datatype would likely include key press events, such as “arrow down” and “arrow right”.

The App datatype consists of:

an update function, that reduces the current state and a new event to a Transition, which decides the next state to transition to,
a view function, that renders a state value as a Widget, parameterized by the App’s event type,
inputs, which are Producers that feed events into the application, and
the initial state value of the state reduction loop.

Running Applications

To run an App, you can use the convenient run function, that initializes GTK+ and sets up a window for you:

run
  :: Typeable event
  => Text -- ^ Window title
  -> Maybe (Int32, Int32) -- ^ Optional window size
  -> App state event -- ^ Application to run
  -> IO ()

There’s also runInWindow if you like to initialize GTK+ yourself, and set up your own window; something you need to do if you want to use CSS, for instance.

Declarative “Hello, world!”

The haskell-gi README includes a “Hello, world!” example, written in an imperative style:

main :: IO ()
main = do
  Gtk.init Nothing

  win <- new Gtk.Window [ #title := "Hi there" ]

  on win #destroy Gtk.mainQuit

  button <- new Gtk.Button [ #label := "Click me" ]

  on button #clicked $
    set
      button
      [ #sensitive := False
      , #label := "Thanks for clicking me"
      ]

  #add win button

  #showAll win

  Gtk.main

It has two states: either the button has not been clicked yet, in which case it shows a “sensitive” button, or the button has been clicked, in which case it shows an “insensitive” button and a label thanking the user for clicking.

Let’s rewrite this application in a declarative style, using gi-gtk-declarative and gi-gtk-declarative-app-simple, and see how that works out! Our state and event datatypes describe what states the application can be in, and what events can be emitted, respectively:

data State = NotClicked | Clicked

data Event = ButtonClicked

Our view function, here defined as view', renders a button according to what state the application is in:

view' :: State -> Widget Event
view' = \case
  NotClicked ->
    widget
      Button
      [ #label := "Click me"
      , on #clicked ButtonClicked
      ]
  Clicked ->
    widget
      Button
      [ #sensitive := False
      , #label := "Thanks for clicking me"
      ]

The update function reduces the current state and an event to a Transition state event, which can be either Transition or Exit. Here we always transition to the Clicked state if the button has been clicked.

update' :: State -> Event -> Transition State Event
update' _ ButtonClicked = Transition Clicked (return Nothing)

Note that the Transition constructor not only takes the next state, but also an IO (Maybe Event) action. This makes it possible to generate a new event in the update function.
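To get a feel for how such a reducer is driven, here is a minimal sketch of a loop that feeds events through an update function, with no GTK+ involved. The Transition type is redefined locally to mirror the description above, while runEvents and the extra Closed event are hypothetical names of mine, not part of the library:

```haskell
-- A toy model of the state reducer loop; no GTK+ involved.
-- Transition mirrors the description in the post. The runEvents function
-- and the Closed event are hypothetical, added only for illustration.
data Transition state event
  = Transition state (IO (Maybe event))
  | Exit

data State = NotClicked | Clicked
  deriving (Eq, Show)

data Event = ButtonClicked | Closed
  deriving (Eq, Show)

update' :: State -> Event -> Transition State Event
update' _ ButtonClicked = Transition Clicked (return Nothing)
update' _ Closed = Exit

-- Feed a fixed list of events through the update function. The IO action
-- of each transition is run, and any follow-up event it yields is
-- processed before the remaining events. Returns the last state reached.
runEvents
  :: (state -> event -> Transition state event)
  -> state
  -> [event]
  -> IO state
runEvents _ s [] = return s
runEvents step s (e : es) =
  case step s e of
    Exit -> return s
    Transition s' action -> do
      next <- action
      runEvents step s' (maybe es (: es) next)
```

The real run function is driven by GTK+ signals and the inputs producers rather than a list, but the shape of the reduction is the same.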

Finally, we run the “Hello, world!” application using run.

main :: IO ()
main =
  run
    "Hi there"
    Nothing
    App
      { view = view'
      , update = update'
      , inputs = []
      , initialState = NotClicked
      }

Comparing with the imperative version, I like this style a lot better. The rendering code is a pure function, and core application logic can also be pure functions on data structures, instead of mutation of shared state. Moreover, the small state machine that was hiding in the original code is now explicit with the State sum type and the update' function.

There are more examples in gi-gtk-declarative if you want to check them out.

Implementation

Writing gi-gtk-declarative has been a journey full of insights for me, and I’d like to share some implementation notes that might be interesting and helpful if you want to understand how the library works.

Patching

At the core of the library lies the Patchable type class. The create method creates a new GTK+ widget given a declarative widget. The patch method calculates a minimal patch given two declarative widgets; the old and the new version:

class Patchable widget where
  create :: widget e -> IO Gtk.Widget
  patch :: widget e1 -> widget e2 -> Patch

A patch describes a modification of a GTK+ widget, specifies a replacement of a GTK+ widget, or says that the GTK+ widget should be kept as-is.

data Patch
  = Modify (Gtk.Widget -> IO ())
  | Replace (IO Gtk.Widget)
  | Keep

Replacing a widget is necessary if the declarative widget changes from one type of widget to another, say from Button to ListBox. We can’t modify a Button to become a ListBox in GTK+, so we have to create a new GTK+ widget and replace the existing one.
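The three-way decision can be illustrated with a pure toy model. This is only a sketch: Toy, ToyPatch, and diff are hypothetical names, and the real Patch constructors carry IO actions on GTK+ widgets rather than plain values.

```haskell
-- A pure toy model of patching; the real library patches GTK+ widgets in IO.
-- Toy, ToyPatch, and diff are hypothetical names used for illustration.
data Toy
  = ToyButton String -- a button with a label
  | ToyLabel String -- a label with some text
  deriving (Eq, Show)

data ToyPatch
  = Modify String -- update the text of the existing widget
  | Replace Toy -- throw the old widget away and build this one
  | Keep -- nothing changed
  deriving (Eq, Show)

-- Compare the old and the new declarative value and decide what to do.
diff :: Toy -> Toy -> ToyPatch
diff old new
  | old == new = Keep
diff (ToyButton _) (ToyButton l) = Modify l
diff (ToyLabel _) (ToyLabel t) = Modify t
diff _ new = Replace new
```

For example, diff (ToyButton "a") (ToyLabel "a") yields Replace (ToyLabel "a"), as a button cannot be turned into a label.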

Heterogeneous Widgets

Declarative widgets are often wrapped in the Widget datatype, to support widgets of any type to be used as a child in a heterogeneous container, and to be able to return any declarative widget, as we did in the App view function previously. The Widget datatype is a GADT:

data Widget event where
  Widget
    :: ( Typeable widget
       , Patchable widget
       , Functor widget
       , EventSource widget
       )
    => widget event
    -> Widget event

If you look at the Widget constructor’s type signature, you can see that it hides the inner widget type, and that it carries all the constraints needed to write instances for patching and event handling.

We can define a Patchable instance for Widget as the inner widget is constrained to have an instance of Patchable. As the widget is also constrained with Typeable, we can use eqT to compare the types of two Widgets. If their inner declarative widget types are equal, we can calculate a patch from the declarative widgets. If not, we replace the old GTK+ widget with a new one created from the new declarative widget.

instance Patchable Widget where
  create (Widget w) = create w
  patch (Widget (w1 :: t1 e1)) (Widget (w2 :: t2 e2)) =
    case eqT @t1 @t2 of
      Just Refl -> patch w1 w2
      _ -> Replace (create w2)

Similar to the case with Patchable, as we’ve constrained the inner widget type in the GADT, we can define instances for Functor and EventSource.

At first, it might seem unintuitive to use dynamic typing in Haskell, but I think this case is very motivating, and it’s central to the implementation of gi-gtk-declarative.
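If the eqT trick is new to you, here is a standalone sketch of the same idea using only base. AnyValue and sameValue are hypothetical stand-ins for Widget and patch; the point is that matching on Refl convinces GHC that the two hidden types are the same, so that we can use an operation requiring equal types.

```haskell
{-# LANGUAGE GADTs #-}
{-# LANGUAGE ScopedTypeVariables #-}
{-# LANGUAGE TypeApplications #-}
{-# LANGUAGE TypeOperators #-}

import Data.Typeable (Typeable, eqT, (:~:) (Refl))

-- An existential wrapper hiding the inner type, while remembering that it
-- is Typeable and has an Eq instance. AnyValue is a hypothetical stand-in
-- for the Widget GADT.
data AnyValue where
  AnyValue :: (Typeable a, Eq a) => a -> AnyValue

-- Compare two wrapped values. If the hidden types differ, we get Nothing.
-- Otherwise, matching on Refl lets us use (==) at the shared type.
sameValue :: AnyValue -> AnyValue -> Maybe Bool
sameValue (AnyValue (x :: a)) (AnyValue (y :: b)) =
  case eqT @a @b of
    Just Refl -> Just (x == y)
    Nothing -> Nothing
```

Comparing an Int with a String through sameValue gives Nothing, which corresponds to the Replace branch in the Patchable instance for Widget.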

Smart Constructors and Return Type Polymorphism

All smart constructors — widget, bin, and container — can return either a Widget value, such that you can use it in a context where the inner widget type needs to be hidden, or a MarkupOf with a type specifically needed in the contexts in which the widget is used, for example, a bin or container with a requirement on what child widget it can contain.

Here are some possible specializations of smart constructor return types:

widget Button [] :: Widget MyEvent
widget Button [] :: MarkupOf (SingleWidget Button) MyEvent ()

bin ScrolledWindow [] _ :: Widget MyEvent
bin ScrolledWindow [] _ :: MarkupOf (Bin ScrolledWindow Widget) MyEvent ()

container Box [] _ :: Widget MyEvent
container Box [] _ :: MarkupOf (Container Box (Children BoxChild)) MyEvent ()

As a small example, consider the helper textRow that constructs a ListBoxRow to be contained in a ListBox:

myList :: Widget Event
myList =
  container ListBox [] $
    mapM textRow ["Foo", "Bar", "Baz"]

textRow :: Text -> MarkupOf (Bin ListBoxRow Widget) Event ()
textRow t =
  bin ListBoxRow [] $
    widget Label [ #label := t ]

As the type signature above shows, the textRow function returns a MarkupOf value parameterized by a specific child type: Bin ListBoxRow Widget. You can read the whole type as “markup containing bins of list box rows, where the list box rows can contain any type of widget, and where they all emit events of type Event.” I know, it’s a mouthful, but as you probably won’t split your markup-building function up so heavily, and as GHC will be able to infer these types, it’s not an issue.

The return type polymorphism of the smart constructors should not affect type inference badly. If you find a case where it does, please submit an issue on GitHub.

Summary

Callback-centric GUI programming is hard. I prefer using data structures and pure functions for core application code, and keep it decoupled from the GUI code by making rendering as simple as a function State -> Widget. This is the ideal I’m striving for, and what motivated the creation of these packages.

I have just released gi-gtk-declarative and gi-gtk-declarative-app-simple on Hackage. They are both to be regarded as experimental packages, but I hope for them to be useful and stable some day. Please try them out, post issues on the GitHub tracker if you find anything weird, and give me a shout if you have any questions.

The gi-gtk-declarative library is used heavily in FastCut, with great success. Unfortunately that project is not open source yet, so I can’t point you to any code examples right now. Hopefully, I’ll have it open-sourced soon, and I’m planning on blogging more about its implementation.

Until then, happy native and declarative GUI hacking!

Finite-State Machines, Part 2: Explicit Typed State Transitions

2017-11-19T00:00:00+01:00

In the first part of this series, we left off having made states explicit using Haskell data types. We concluded that state transitions were implicit, and that a mistake in implementation, making an erroneous state transition, would not be caught by the type system. We also noted that side effects performed at state transitions complicated testing of the state machine, as we were tied to IO.

Before addressing those problems, let’s remind ourselves of the example state machine diagram. If you haven’t read the previous post, I recommend you go do that first.

We begin with the language extensions we’ll need, along with the module declaration.

{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE GADTs #-}
{-# LANGUAGE GeneralizedNewtypeDeriving #-}
{-# LANGUAGE TypeFamilies #-}
module EnforcingLegalStateTransitions where

Let me quickly explain the GHC language extensions used:

OverloadedStrings converts string literals in Haskell source code to Text values, in our case.
GADTs enables the use of generalized algebraic data types, an extension to GHC that lets us specify different type signatures for each constructor in a data type. This is useful to parameterize constructors with different types, something we will use for state data types.
We use GeneralizedNewtypeDeriving to have our implementation newtype derive instances for Functor, Applicative, Monad, and MonadIO. I’ll explain this later in this post.
TypeFamilies extends GHC Haskell with what you can consider type-level functions. We’ll use this extension to associate a concrete state data type with our instance of the state machine.

In addition to Control.Monad.IO.Class, we import the same modules as in the previous post.

import Control.Monad.IO.Class
import Data.List.NonEmpty
import Data.Semigroup
import qualified Data.Text.IO as T

From the modules specific to the blog post series we import some functions and data types. As before, their exact implementations are not important. The ConsoleInput module provides helpers for retrieving text input and confirmation from the terminal.

import qualified PaymentProvider
import Checkout
  ( Card(..)
  , CartItem
  , OrderId(..)
  , mkItem
  , calculatePrice
  , newOrderId
  )
import qualified ConsoleInput

Enough imports, let’s go build our state machine!

The State Machine Protocol

In contrast to the transition function in the previous post, where a single function had the responsibility of deciding which state transitions were legal, performing associated side effects on transitions, and transitioning to the next state, we will now separate the protocol from the program. In other words, the set of states, and the associated state transitions for certain events, will be encoded separately from the implementation of the automaton. Conceptually, it still follows the definition we borrowed from Erlang:

State(S) × Event(E) → Actions (A), State(S’)

With the risk of stretching a metaphor too thin, I like to think of the split representation as taking the single-function approach and turning it inside-out. Our program is separate from our state machine protocol, and we can implement it however we like, as long as we follow the protocol.

Empty Data Types for States

Our states are no longer represented by a single data type with constructors for each state. Instead, we create an empty data type for each state. Such a type has no constructors, and is therefore not inhabited by any value. We will use them solely as markers in GADT constructors, a technique generally known as phantom types.

data NoItems
data HasItems
data NoCard
data CardSelected
data CardConfirmed
data OrderPlaced

I know that phantom types and GADTs sound scary at first, but please hang in there, as I’ll explain their practical use throughout this post, and hopefully give you a sense of why we are using them.
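As a warm-up, here is a tiny sketch of phantom types on their own, before GADTs and type classes enter the picture. The door example and all its names are hypothetical and not part of the checkout code:

```haskell
-- Empty data types, used purely as type-level markers.
data Opened
data Closed

-- The parameter s never appears on the right-hand side: it is a phantom.
newtype Door s = Door String

newDoor :: String -> Door Closed
newDoor = Door

open :: Door Closed -> Door Opened
open (Door name) = Door name

close :: Door Opened -> Door Closed
close (Door name) = Door name

doorName :: Door s -> String
doorName (Door name) = name

-- An expression like open (open (newDoor "front")) is rejected by the
-- type checker, as the result of open is not a Door Closed.
```

The markers exist only at the type level, and they let the type checker reject an illegal sequence of operations.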

State Machine Protocol as a Type Class

We encode our state machine protocol, separate from the program, using a type class.

class Checkout m where

In our protocol, we do not want to be specific about what data type is used to represent the current state; we only care about the state type it is parameterized by. Therefore, we use an associated type alias, also known as an open type family, with kind * -> * to represent states.

  type State m :: * -> *

The signature * -> * can be thought of as a type-level function, or a type constructor, from type to type (the star is the kind of types). The parameter m to State means we are associating the state type with the instance of m, so that different instances can specify their own concrete state types.

There are two benefits of using an associated type for state:

Different instances of Checkout can provide different concrete state data types, hiding whatever nasty implementation details they need to operate, such as database connections, web sessions, or file handles.
The concrete state type is not known when using the state machine protocol, and it is therefore impossible to create a state “manually”; the program would not typecheck.

In our case, the parameter will always be one of our empty data types declared for states. As an example, (State m NoItems) has kind *, and is used to represent the “NoItems” state abstractly.

Note that m also has kind * -> *, but not for the same reason; the m is going to be the monadic type we use for our implementation, and is therefore higher-kinded.

Events as Type Class Methods

Checkout specifies the state machine events as type class methods, where method type signatures describe state transitions. The initial method creates a new checkout, returning a “NoItems” state. It can be thought of as a constructor in object-oriented programming terms.

  initial
    :: m (State m NoItems)

The value returned, of type (State m NoItems), is the first state. We use this value as a parameter to the subsequent event, transitioning to another state. Events that transition state from one to another take the current state as an argument, and return the resulting state.

The select event is a bit tricky, as it is accepted from both “NoItems” and “HasItems”. We use the union data type SelectState, analogous to Either, that represents the possibility of either “NoItems” or “HasItems”. The definition of SelectState is included further down this post.

  select
    :: SelectState m
    -> CartItem
    -> m (State m HasItems)

Worth noting is that the resulting state is returned inside m. We do that to enable the instance of Checkout to perform computations available in m at the state transition.

Does this ring a bell?

Just as in the previous post, we want the possibility to interleave side effects on state transitions, and using a monadic return value gives the instance that flexibility.

The checkout event is simpler than select, as it transitions from exactly one state to another.

  checkout
    :: State m HasItems
    -> m (State m NoCard)

Some events, like selectCard, carry data in the form of arguments, corresponding to how some event data constructors had arguments. Most of the events in Checkout follow the patterns described so far.

  selectCard
    :: State m NoCard
    -> Card
    -> m (State m CardSelected)

  confirm
    :: State m CardSelected
    -> m (State m CardConfirmed)

  placeOrder
    :: State m CardConfirmed
    -> m (State m OrderPlaced)

Note that we are not doing any error handling. All operations return state values. In a real-world system you might need to handle error cases, like selectCard not accepting the entered card number. I have deliberately excluded error handling from this already lengthy post, but I will probably write a post about different ways of handling errors in this style of state machine encoding.

Similar to select, the cancel event is accepted from more than one state. In fact, it is accepted from three states: “NoCard”, “CardSelected”, and “CardConfirmed”. Like with select, we use a union data type representing the ternary alternative.

  cancel
    :: CancelState m
    -> m (State m HasItems)

Finally, we have the end method as a way of ending the state machine in its terminal state, similar to a destructor in object-oriented programming terms. Instances of Checkout can have end clean up resources associated with the machine.

  end
    :: State m OrderPlaced
    -> m OrderId

As promised, I will show you the definitions of SelectState and CancelState, the data types that represent alternative source states for the select and cancel events, respectively.

data SelectState m
  = NoItemsSelect (State m NoItems)
  | HasItemsSelect (State m HasItems)

data CancelState m
  = NoCardCancel (State m NoCard)
  | CardSelectedCancel (State m CardSelected)
  | CardConfirmedCancel (State m CardConfirmed)

Each constructor takes a specific state as argument, thus creating a union type wrapping the alternatives.
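To see the shape of such a protocol in one place, here is a much smaller sketch of the same encoding: a light switch with two states. Everything in it (Switch, SwitchState, Toggles, and the IO instance) is hypothetical and only meant to illustrate the pattern of an associated state type with events as methods. I use Type from Data.Kind, which is the same kind as the star above.

```haskell
{-# LANGUAGE TypeFamilies #-}

import Data.Kind (Type)

-- Empty marker types for the two states.
data Off
data On

-- A hypothetical two-state protocol, following the same pattern as
-- Checkout: an associated state type, and events as methods.
class Monad m => Switch m where
  type SwitchState m :: Type -> Type
  initialSwitch :: m (SwitchState m Off)
  turnOn :: SwitchState m Off -> m (SwitchState m On)
  turnOff :: SwitchState m On -> m (SwitchState m Off)

-- A program written against the protocol only. It cannot construct or
-- inspect states, only pass them along to the next event.
toggleTwice :: Switch m => m (SwitchState m Off)
toggleTwice = initialSwitch >>= turnOn >>= turnOff

-- A concrete state for an IO instance, counting the transitions made.
-- The marker type s is a phantom parameter.
newtype Toggles s = Toggles Int

instance Switch IO where
  type SwitchState IO = Toggles
  initialSwitch = return (Toggles 0)
  turnOn (Toggles n) = return (Toggles (n + 1))
  turnOff (Toggles n) = return (Toggles (n + 1))
```

Trying to write initialSwitch >>= turnOff in toggleTwice would fail to typecheck, as turnOff does not accept a state marked Off.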

A Program using the State Machine Protocol

Now that we have a state machine protocol, the Checkout type class, we can write a program with it. This is the automaton part of our implementation, i.e. the part that drives the state machine forward.

As long as the program follows the protocol, it can be structured however we like; we can drive it using user input from a console, by listening to a queue of commands, or by incoming HTTP requests from a web server. In the interest of this post, however, we will keep to reading user input from the console.

The type signature of fillCart constrains m to be an instance of both Checkout and MonadIO. Moreover, it is a function from a “NoItems” state to a “HasItems” state. The type is similar to the event methods’ type signatures in the Checkout protocol, and similarly describes a state transition with a type.

fillCart
  :: (Checkout m, MonadIO m)
  => State m NoItems
  -> m (State m HasItems)

This is where we are starting to use the MTL style of abstracting effects, and combining different effects by constraining the monadic type with multiple type classes.

The critical reader might object to using MonadIO, and claim that we have not separated all side effects, and failed in making the program testable. They wouldn’t be wrong. I have deliberately left the direct use of MonadIO in to keep the example somewhat concrete. We could refactor it to depend on, say, a UserInput type class for collecting more abstract user commands. By using MonadIO, though, the example highlights particularly how the state machine protocol has been abstracted, and how the effects of state transitions are guarded by the type system, rather than making everything abstract. I encourage you to try out both approaches in your code!

The definition of fillCart takes a “NoItems” state value as an argument, prompts the user for the first cart item, selects it, and hands off to selectMoreItems.

fillCart noItems =
  mkItem <$> ConsoleInput.prompt "First item:"
  >>= select (NoItemsSelect noItems)
  >>= selectMoreItems

The event methods of the Checkout protocol, and functions like fillCart and selectMoreItems, are functions from one state to a monadic return value of another state, and thus compose using (>>=).

The selectMoreItems function remains in a “HasItems” state. It asks the user if they want to add more items. If so, it asks for the next item, selects that and recurses to possibly add even more items; if not, it returns the current “HasItems” state. Note how we need to wrap the “HasItems” state in HasItemsSelect to create a SelectState value.

selectMoreItems
  :: (Checkout m, MonadIO m)
  => State m HasItems
  -> m (State m HasItems)
selectMoreItems s = do
  more <- ConsoleInput.confirm "More items?"
  if more
    then
      mkItem <$> ConsoleInput.prompt "Next item:"
      >>= select (HasItemsSelect s)
      >>= selectMoreItems
    else return s

When all items have been added, we are ready to start the checkout part. The type signature of startCheckout tells us that it transitions from a “HasItems” state to an “OrderPlaced” state.

startCheckout
  :: (Checkout m, MonadIO m)
  => State m HasItems
  -> m (State m OrderPlaced)

The function starts the checkout, prompts for a card, and selects it. It asks the user to confirm the use of the selected card, and ends by placing the order. If the user did not confirm, the checkout is canceled, and we go back to selecting more items, followed by attempting a new checkout.

startCheckout hasItems = do
  noCard <- checkout hasItems
  card <- ConsoleInput.prompt "Card:"
  cardSelected <- selectCard noCard (Card card)
  useCard <- ConsoleInput.confirm ("Confirm use of '" <> card <> "'?")
  if useCard
    then confirm cardSelected >>= placeOrder
    else cancel (CardSelectedCancel cardSelected) >>=
         selectMoreItems >>=
         startCheckout

Note that the protocol allows for cancellation in all three checkout states, but the program only gives the user a chance to cancel at the end of the process. Again, the program must follow the rules of the protocol, but it is not required to trigger all events the protocol allows for.

The definition of checkoutProgram is a composition of what we have so far. It creates the state machine in its initial state, fills the shopping cart, starts the checkout, and eventually ends the checkout.

checkoutProgram
  :: (Checkout m, MonadIO m)
  => m OrderId
checkoutProgram =
  initial >>= fillCart >>= startCheckout >>= end

We now have a complete program using the Checkout state machine protocol. To run it, however, we need an instance of Checkout.

Defining an Instance for Checkout

To define an instance for Checkout, we need a type to define it for. A common way of defining such types, especially in MTL style, is using newtype around a monadic value. The type name often ends with T to denote that it’s a transformer. Also by convention, the constructor takes a single-field record, where the field accessor follows the naming scheme run; in our case runCheckoutT.

newtype CheckoutT m a = CheckoutT
  { runCheckoutT :: m a
  } deriving ( Functor
             , Monad
             , Applicative
             , MonadIO
             )

We derive the MonadIO instance automatically, along with the standard Functor, Applicative, and Monad hierarchy.

Monad Transformer Stacks and Instance Coupling

Had we not derived MonadIO, the program from before, with constraints on both Checkout and MonadIO, would not have compiled. Therein lies a subtle dependency that is hard to see at first, but that might cause you a lot of headache. Data types used to instantiate MTL style type classes, when stacked, need to implement all type classes in use. This is caused by the stacking aspect of monad transformers, and is a common critique of MTL style.

Other techniques for separating side effects, such as free monads or extensible effects, have other tradeoffs. I have chosen to focus on MTL style as it is widely used, and in my opinion, a decent starting point. If anyone rewrites these examples using another technique, please drop a comment!

A Concrete State Data Type

Remember how we have, so far, only been talking about state values abstractly, in terms of the associated type alias State in the Checkout class? It is time to provide the concrete data type for state that we will use in our instance.

As discussed earlier, the type we associate with State needs to have kind (* -> *). The argument is the state marker type, i.e. one of the empty data types for states. We define the data type CheckoutState using a GADT, where s is the state type.

data CheckoutState s where

With GADTs, data constructors specify their own type signatures, allowing the use of phantom types, and differently typed values resulting from the constructors. Each constructor parameterizes CheckoutState with a different state type.

The NoItems constructor is nullary, and constructs a value of type CheckoutState NoItems.

  NoItems
    :: CheckoutState NoItems

The data constructor NoItems is defined here, whereas the type NoItems is defined in the beginning of the program, and they are not directly related.

There is, however, a relation between them in terms of the CheckoutState data type. If we have a value of type CheckoutState NoItems, and we pattern match on it, GHC knows that there is only one constructor for such a value. This will become very handy when defining our instance.

The other constructors are defined similarly, but some have arguments, in the same way the State data type in the previous post had. They accumulate the extended state needed by the state machine, up until the order is placed.

  HasItems
    :: NonEmpty CartItem
    -> CheckoutState HasItems

  NoCard
    :: NonEmpty CartItem
    -> CheckoutState NoCard

  CardSelected
    :: NonEmpty CartItem
    -> Card
    -> CheckoutState CardSelected

  CardConfirmed
    :: NonEmpty CartItem
    -> Card
    -> CheckoutState CardConfirmed

  OrderPlaced
    :: OrderId
    -> CheckoutState OrderPlaced

We have a concrete state data type, defined as a GADT, and we can go ahead defining the instance of Checkout for our CheckoutT newtype. We need MonadIO to perform IO on state transitions, such as charging the customer card.

instance (MonadIO m) => Checkout (CheckoutT m) where

Next, we can finally tie the knot, associating the state type for CheckoutT with CheckoutState.

  type State (CheckoutT m) = CheckoutState

We continue by defining the methods. The initial method creates the state machine by returning the initial state, the NoItems constructor.

  initial = return NoItems

In select, we receive the current state, which can be either one of the constructors of SelectState. Unwrapping those gives us the CheckoutState value. We return the HasItems state with the selected item prepended to a non-empty list.

  select state item =
    case state of
      NoItemsSelect NoItems ->
        return (HasItems (item :| []))
      HasItemsSelect (HasItems items) ->
        return (HasItems (item <| items))

As emphasized before, GHC knows which constructors of CheckoutState can occur in the SelectState wrappers, and we can pattern match exhaustively on only the possible state constructors.

The checkout, selectCard, and confirm methods accumulate the extended state, and return the appropriate state constructor.

  checkout (HasItems items) =
    return (NoCard items)

  selectCard (NoCard items) card =
    return (CardSelected items card)

  confirm (CardSelected items card) =
    return (CardConfirmed items card)

Now for placeOrder, where we want to perform a side effect. We have constrained m to be an instance of MonadIO, and we can thus use the effectful newOrderId and PaymentProvider.chargeCard in our definition.

  placeOrder (CardConfirmed items card) = do
    orderId <- newOrderId
    let price = calculatePrice items
    PaymentProvider.chargeCard card price
    return (OrderPlaced orderId)

Similar to select, cancel switches on the alternatives of the CancelState data type. In all cases it returns the HasItems state with the current list of items.

  cancel cancelState =
    case cancelState of
      NoCardCancel (NoCard items) ->
        return (HasItems items)
      CardSelectedCancel (CardSelected items _) ->
        return (HasItems items)
      CardConfirmedCancel (CardConfirmed items _) ->
        return (HasItems items)

Finally, the definition of end returns the generated order identifier.

  end (OrderPlaced orderId) = return orderId

The CheckoutT instance of Checkout is complete, and we are ready to stitch everything together into a running program.

Putting the Pieces Together

To run checkoutProgram, we need an instance of Checkout, and an instance of MonadIO. There is already an instance (MonadIO IO) available. To select our CheckoutT instance for Checkout, we use runCheckoutT.

example :: IO ()
example = do
  OrderId orderId <- runCheckoutT checkoutProgram
  T.putStrLn ("Completed with order ID: " <> orderId)

The complete checkout program is run, using the CheckoutT instance, and an OrderId is returned, which we print at the end. A sample execution of this program looks like this:

λ> example
First item: Banana
More items? (y/N) y
Next item: Horse
More items? (y/N) y
Next item: House
More items? (y/N) n
Card: 0000-0000-0000-0000
Confirm use of '0000-0000-0000-0000'? (y/N) y
Charging card 0000-0000-0000-0000 $200
Completed with order ID: foo

Cool, we have a console implementation running!

Instances Without Side Effects

A benefit of using MTL style, in addition to making effects explicit, is that we can write alternative instances. We might write an instance that only logs the effects, using a Writer monad, collecting them as pure values in a list, and use that instance when testing the state machine.
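Here is a minimal sketch of what such a pure instance could look like. To keep it self-contained I use a hypothetical two-state Pay protocol in place of Checkout, and a hand-rolled Logged type in place of a Writer from a library, so every name below is an assumption made for illustration:

```haskell
{-# LANGUAGE TypeFamilies #-}

import Data.Kind (Type)

data Unpaid
data Paid

-- A hypothetical two-state protocol, standing in for Checkout.
class Monad m => Pay m where
  type PayState m :: Type -> Type
  start :: m (PayState m Unpaid)
  charge :: PayState m Unpaid -> Int -> m (PayState m Paid)

-- A hand-rolled writer: a result paired with a log of effects.
newtype Logged a = Logged { runLogged :: (a, [String]) }

instance Functor Logged where
  fmap f (Logged (a, w)) = Logged (f a, w)

instance Applicative Logged where
  pure a = Logged (a, [])
  Logged (f, w1) <*> Logged (a, w2) = Logged (f a, w1 ++ w2)

instance Monad Logged where
  Logged (a, w1) >>= k =
    let Logged (b, w2) = k a
    in Logged (b, w1 ++ w2)

logEffect :: String -> Logged ()
logEffect msg = Logged ((), [msg])

-- The pure state carries no data; the marker is a phantom parameter.
data PureState s = PureState

-- Instead of charging anything, this instance records the effect.
instance Pay Logged where
  type PayState Logged = PureState
  start = return PureState
  charge PureState amount = do
    logEffect ("charge " ++ show amount)
    return PureState

-- A program written against the protocol, usable with any instance.
payProgram :: Pay m => m (PayState m Paid)
payProgram = start >>= \s -> charge s 200
```

Running snd (runLogged payProgram) gives the list of effects the program would have performed, which is something a test can assert on without any IO.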

Parting Thoughts

Using a sort of extended MTL style, with conventions for state machine encodings, gives us more type safety in state transitions. In addition to having turned our state machine program inside-out, into a protocol separated from the automaton, we have guarded side effects with types in the form of type class methods. Abstract state values, impossible to create outside the instance, are now passed explicitly in state transitions.

But we still have a rather loud elephant in the room.

Suppose I’d write the following function, wherein I place the order twice. Do you think it would typecheck?

doBadThings ::
     (Checkout m, MonadIO m)
  => State m CardConfirmed
  -> m (State m OrderPlaced)
doBadThings cardConfirmed = do
  _ <- placeOrder cardConfirmed
  placeOrder cardConfirmed

The answer is yes, it would typecheck. With the Checkout instance we have, the customer’s card would be charged twice, without doubt departing from our business model, and likely hurting our brand.

The problem is that we are allowed to discard the state transitioned to, a value of type (State m OrderPlaced), returned by the first placeOrder expression. Then, we can place the order again, using the old state value of type (State m CardConfirmed). The ability to reuse, or never use, state values is the Achilles’ heel of this post’s state machine encoding.

We could venture into the land of linear types, a feature recently proposed to be added to GHC. With linear types, we could ensure state values are used exactly once, making our current approach safer.
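As a rough illustration of the idea, and not of this post's encoding, the following sketch uses the LinearTypes extension that later shipped with GHC 9.0. It is pure rather than monadic, and the two data types are simplified stand-ins for the state values discussed above.

```haskell
{-# LANGUAGE LinearTypes #-}

-- Simplified stand-ins for the abstract state values in this post.
data CardConfirmed = CardConfirmed String
data OrderPlaced = OrderPlaced String
  deriving (Show, Eq)

-- The linear arrow (%1 ->) says the argument must be consumed exactly once.
placeOrder :: CardConfirmed %1 -> OrderPlaced
placeOrder (CardConfirmed card) = OrderPlaced card

-- Uncommenting this should be rejected by the type checker,
-- because the linear argument 'c' is used twice:
--
-- placeTwice :: CardConfirmed %1 -> (OrderPlaced, OrderPlaced)
-- placeTwice c = (placeOrder c, placeOrder c)
```

Carrying this over to effectful, monadic transitions takes more machinery than shown here, so treat it only as a hint of what the type checker could enforce.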

I’d like to step back for a moment, however, and remind you that the techniques we encounter along this journey are not ordered as increasingly “better”, in terms of what you should apply in your work. I show more and more advanced encodings, using various GHC language extensions, but it doesn’t mean you should necessarily use the most advanced one. Simplicity is powerful, something Tim Humphries’ tweet thread reminded me about this morning, and I recommend you start out simple.

As demonstrated, the extended MTL style for encoding state machines presented in this post has a type safety flaw. That doesn’t mean the technique is useless and should be forever rejected. At least not in my opinion. It gives additional type safety around state transitions, it composes well with MTL style programs in general, and it uses a modest collection of type system features and language extensions. We can write alternative instances, without any IO, and use them to test our state machine programs in a pure setting.

If you still feel that all hope is lost, then I’m happy to announce that there will be more posts coming in this series! To demonstrate a possible next step, in terms of even greater type safety, in the next post we will be exploring indexed monads and row kinds as a way of armoring the Achilles’ heel.

Happy hacking!

Revisions

November 20, 2017: Based on a Reddit comment, on the lack of error handling in event type class methods, I added a small note about that below the selectCard type signature.

Finite-State Machines, Part 1: Modeling with Haskell Data Types

2017-11-10T00:00:00+01:00

Stateful programs often become complex beasts as they grow. Program state incohesively spread across a bunch of variables, spuriously guarded by even more variables, is what I refer to as implicit state. When working with such code, we have to reconstruct a model mentally, identifying possible states and transitions between them, to modify the program with confidence. Even if a test suite can help, the process is tedious and error-prone, and I insist we should have our tools do the heavy lifting instead.

By teaching the type system about possible states and state transitions in our program, it can verify that we follow our own business rules, both when we write new code, and when we modify existing code. It is not merely a process of asking the compiler “did I do okay?” Our workflow can be a conversation with the compiler, a process known as type-driven development. Moreover, the program encodes the state machine as living machine-verified documentation.

After having given my talk at Code Mesh on this topic, and having spent a lot of time researching and preparing examples, I want to share the material in the form of a blog post series. Each post will cover increasingly advanced techniques that give greater static safety guarantees. That is not to say that the latter techniques are inherently better, nor that they are the ones that you should use. This series is meant as a small à la carte of event-driven state machine encodings and type safety, where you can choose from the menu based on your taste and budget. I will, however, present the techniques in a linear form. Also, note that these posts do not claim to exhaust all options for state machine encodings.

There are many trade-offs, including type safety and strictness, implementation complexity, and how language, technique, and library choices affect your team. Taking one step towards explicit state, in an area where it leverages your project in doing so, can be the best choice. You don’t have to go nuts with type systems to use explicit states in your program! Furthermore, most mainstream languages today let you encode states as data types in some way.

This is the introductory post, in which I’ll show the first step on our way from implicit state and despair to writing stateful and effectful programs with great confidence. We will use Haskell and algebraic data types (ADTs) to encode possible states as data types. You should be able to read and understand this post without knowing much Haskell. If not, tell me, and I will try to explain better.

Finite-State Machines

First, we should have a clear understanding of what a finite-state machine is. There are many variations and definitions, and I’m sure you, especially if coming from an engineering background, have some relation to state machines.

In general, a finite-state machine can be described as an abstract machine with a finite set of states, being in one state at a time. Events trigger state transitions; that is, the machine changes from being in one state to being in another state. The machine defines a set of legal transitions, often expressed as associations from a state and event pair to a state.

For the domains we will be exploring, the Erlang documentation’s definition of a finite-state machine is simple and useful:

State(S) × Event(E) → Actions (A), State(S’)

If we are in state S and the event E occurs, we should perform the actions A and make a transition to the state S’.
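Read as a Haskell type, that definition is a function from a state and an event to a list of actions and a next state. The sketch below uses a hypothetical door machine, with made-up action names, only to show the shape.

```haskell
-- A hypothetical two-state machine, used only to illustrate the definition.
data DoorState = Opened | Closed
  deriving (Show, Eq)

data DoorEvent = Open | Close
  deriving (Show, Eq)

-- Actions are plain values here; something else would have to perform them.
data Action = Creak | Slam
  deriving (Show, Eq)

-- State(S) × Event(E) → Actions(A), State(S')
transition :: DoorState -> DoorEvent -> ([Action], DoorState)
transition Closed Open  = ([Creak], Opened)
transition Opened Close = ([Slam], Closed)
-- Any other combination performs no actions and stays in the same state.
transition state  _     = ([], state)
```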

That is the basis for the coming posts. I will not categorize machines as Mealy or Moore machines, or use UML state charts, at least not to any greater extent. Some diagrams will use the notation for hierarchical state machines for convenience.

Example: Shopping Cart Checkout

The running example we will use in these posts is a shopping cart checkout, modeled as an event-driven finite-state machine. This stems from a real-world project I worked on, where the lack of explicit states in code became a real problem as requirements evolved. It’s the use-case that inspired me to look for more robust methods.

As shown graphically in the state diagram above, we start in “NoItems”, selecting one or more items, staying in “HasItems”, until we begin the checkout. We enter the nested “Checkout” machine on the “checkout” event. By modeling it as a hierarchically nested machine, we can have all its states accept the “cancel” event. We select and confirm a card, and eventually place an order, if not canceled.

States and Events as Data Types

Let’s get started. We will use Text instead of String, and NonEmpty lists. The two modules PaymentProvider and Checkout hide some implementation detail of lesser importance.

    {-# LANGUAGE OverloadedStrings #-}
    module StateMachinesWithAdts where

    import Control.Monad (foldM)
    import Data.List.NonEmpty
    import Data.Text (Text)
    import Text.Printf (printf)

    import qualified PaymentProvider
    import Checkout
      ( Card(..)
      , CartItem(..)
      , calculatePrice
      )

CheckoutState is a sum type, with one constructor for each valid state. Some constructors are nullary, meaning they have no arguments. Others have arguments, for the data they carry.

    data CheckoutState
      = NoItems
      | HasItems (NonEmpty CartItem)
      | NoCard (NonEmpty CartItem)
      | CardSelected (NonEmpty CartItem) Card
      | CardConfirmed (NonEmpty CartItem) Card
      | OrderPlaced
      deriving (Show, Eq)

Looking at the state constructors in the definition of CheckoutState, we can see how they accumulate state as the machine makes progress, right up until the order is placed.

Note that CartItem and Card are imported from the shared Checkout module.

Similar to the data type for states, the data type for events, called CheckoutEvent, defines one constructor for each valid event. The non-nullary constructors carry some data with the event.

    data CheckoutEvent
      = Select CartItem
      | Checkout
      | SelectCard Card
      | Confirm
      | PlaceOrder
      | Cancel
      deriving (Show, Eq)

We have now translated the diagram to Haskell data types, and we can implement the state transitions and actions.

A Pure State Reducer Function

Now, we might consider the simplest possible implementation of a state machine a function from state and event to the next state, very much like the definition from Erlang’s documentation quoted above. Such a function could have the following type:

CheckoutState -> CheckoutEvent -> CheckoutState

In a state machine that itself can be regarded as a pure function, such as a parser, or a calculator, the above signature would be fine. For our purposes, however, we need to interleave side effects with state transitions. We might want to validate that the selected items exist using external database queries, and send requests to a third-party payment provider when placing the order.
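For the pure case, a calculator makes a small example. The sketch below is not part of the checkout code. The event type and function names are invented for illustration, and running the machine is just a left fold over the events.

```haskell
import Data.List (foldl')

-- Hypothetical events for a tiny calculator.
data CalcEvent
  = Add Int
  | Multiply Int
  | Clear

-- The state is the running total; each event maps it to the next total.
calc :: Int -> CalcEvent -> Int
calc total (Add n)      = total + n
calc total (Multiply n) = total * n
calc _     Clear        = 0

-- Running the pure machine from the initial state 0.
runCalc :: [CalcEvent] -> Int
runCalc = foldl' calc 0
```

For instance, `runCalc [Add 2, Multiply 5, Add 1]` evaluates to 11.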

Reaching for IO

Some systems built around the concept of a state reducer function, such as The Elm Architecture or Pux, support a way of specifying the side effects together with the next state. A starting point to achieve this in Haskell, for our checkout state machine, is the following type signature:

    checkout
      :: CheckoutState
      -> CheckoutEvent
      -> IO CheckoutState

A state transition then returns IO of the next state, meaning that we can interleave side effects with transitions. We create a type alias for such a function type, named FSM.

    type FSM s e =
      s -> e -> IO s

Then we can write the type signature for checkout using our data types as parameters.

checkout :: FSM CheckoutState CheckoutEvent

The definition of checkout pattern matches on the current state and the event. The first five cases simply build up the state values based on the events, and transition appropriately. We could add validation of selected items, and validation of the selected credit card, but we would then need explicit error states, or terminate the entire state machine on such invalid inputs. I’ll err on the side of keeping this example simple.

    checkout NoItems (Select item) =
      return (HasItems (item :| []))

    checkout (HasItems items) (Select item) =
      return (HasItems (item <| items))

    checkout (HasItems items) Checkout =
      return (NoCard items)

    checkout (NoCard items) (SelectCard card) =
      return (CardSelected items card)

    checkout (CardSelected items card) Confirm =
      return (CardConfirmed items card)

Remember the state diagram? The nested “Checkout” machine accepts the “cancel” event in all its states, and so does our implementation. We switch on the current state, and cancel in the correct ones, otherwise remaining in the current state.

    checkout state Cancel =
      case state of
        NoCard items -> return (HasItems items)
        CardSelected items _ -> return (HasItems items)
        CardConfirmed items _ -> return (HasItems items)
        _ -> return state

To demonstrate how an interleaved side effect is performed, we use the imported chargeCard and calculatePrice to charge the card. The implementations of chargeCard and calculatePrice are not important.

    checkout (CardConfirmed items card) PlaceOrder = do
      PaymentProvider.chargeCard card (calculatePrice items)
      return OrderPlaced

The last case is a fall-through pattern, for unaccepted events in the current state, which effectively has the machine remain in its current state.

checkout state _ = return state

That is it for checkout, our state reducer function using IO.

Running the State Machine

To run our machine, we can rely on foldM. Given a machine, an initial state, and a foldable sequence of events, we get back the terminal state inside IO.

    runFsm :: Foldable f => FSM s e -> s -> f e -> IO s
    runFsm = foldM
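To see foldM doing the work in isolation, here is a self-contained sketch with a toy machine. It repeats the FSM alias and runFsm from above. The traffic light types are invented for the example and have nothing to do with the checkout.

```haskell
import Control.Monad (foldM)

type FSM s e = s -> e -> IO s

runFsm :: Foldable f => FSM s e -> s -> f e -> IO s
runFsm = foldM

-- A hypothetical two-state machine driven by a single kind of event.
data Light = Red | Green
  deriving (Show, Eq)

data Tick = Tick

-- Each tick flips the light; a real machine could perform IO here.
light :: FSM Light Tick
light Red   Tick = return Green
light Green Tick = return Red
```

Running `runFsm light Red [Tick, Tick, Tick]` threads the state through three transitions and returns Green inside IO.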

Just getting back the terminal state might be too much of a black box. To see what happens as we run a machine, we can decorate it with logging. The withLogging function runs the state machine it receives as an argument, logs its transition, and returns the next state.

    withLogging
      :: (Show s, Show e)
      => FSM s e
      -> FSM s e
    withLogging fsm s e = do
      s' <- fsm s e
      printf "- %s × %s → %s\n" (show s) (show e) (show s')
      return s'

Combining these building blocks and running them in GHCi, with a list of events as input, we see the transitions logged and our side-effecting chargeCard operation.

    *StateMachinesWithAdts> runFsm (withLogging checkout) NoItems
      [ Select (CartItem "potatoes" 23.95)
      , Select (CartItem "fish" 168.50)
      , Checkout
      , SelectCard (Card "0000-0000-0000-0000")
      , Confirm
      , PlaceOrder
      ]
    - NoItems × Select (CartItem {itemId = "potatoes", itemPrice = 23.95}) → HasItems (CartItem {itemId = "potatoes", itemPrice = 23.95} :| [])
    - HasItems (CartItem {itemId = "potatoes", itemPrice = 23.95} :| []) × Select (CartItem {itemId = "fish", itemPrice = 168.5}) → HasItems (CartItem {itemId = "fish", itemPrice = 168.5} :| [CartItem {itemId = "potatoes", itemPrice = 23.95}])
    - HasItems (CartItem {itemId = "fish", itemPrice = 168.5} :| [CartItem {itemId = "potatoes", itemPrice = 23.95}]) × Checkout → NoCard (CartItem {itemId = "fish", itemPrice = 168.5} :| [CartItem {itemId = "potatoes", itemPrice = 23.95}])
    - NoCard (CartItem {itemId = "fish", itemPrice = 168.5} :| [CartItem {itemId = "potatoes", itemPrice = 23.95}]) × SelectCard (Card "0000-0000-0000-0000") → CardSelected (CartItem {itemId = "fish", itemPrice = 168.5} :| [CartItem {itemId = "potatoes", itemPrice = 23.95}]) (Card "0000-0000-0000-0000")
    - CardSelected (CartItem {itemId = "fish", itemPrice = 168.5} :| [CartItem {itemId = "potatoes", itemPrice = 23.95}]) (Card "0000-0000-0000-0000") × Confirm → CardConfirmed (CartItem {itemId = "fish", itemPrice = 168.5} :| [CartItem {itemId = "potatoes", itemPrice = 23.95}]) (Card "0000-0000-0000-0000")
    Charging card 0000-0000-0000-0000 $192.45
    - CardConfirmed (CartItem {itemId = "fish", itemPrice = 168.5} :| [CartItem {itemId = "potatoes", itemPrice = 23.95}]) (Card "0000-0000-0000-0000") × PlaceOrder → OrderPlaced
    OrderPlaced

Yes, the logging is somewhat verbose, but there we have it: a simplified event-driven state machine using ADTs for states and events. The data types protect us from constructing illegal values, they bring the code closer to our conceptual model, and they make state transitions explicit.

Side Effects and Illegal Transitions

This is a great starting point, and as I argued in the introduction of this post, probably the leg on our journey with the highest return on investment. It is, however, still possible to implement illegal state transitions! We would not get any compile-time error bringing our attention to such mistakes. Another concern is that the state machine implementation is tightly coupled with IO, making it hard to test.
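To illustrate the first concern, the sketch below uses cut-down versions of the state and event types (plain lists and strings instead of NonEmpty and CartItem, and only a few constructors). One clause jumps straight from NoItems to OrderPlaced, which the diagram forbids, and the compiler has no objection.

```haskell
-- Simplified stand-ins for the types in this post.
data CheckoutState
  = NoItems
  | HasItems [String]
  | OrderPlaced
  deriving (Show, Eq)

data CheckoutEvent
  = Select String
  | PlaceOrder
  deriving (Show, Eq)

checkout :: CheckoutState -> CheckoutEvent -> IO CheckoutState
checkout NoItems (Select item) =
  return (HasItems [item])
checkout (HasItems items) (Select item) =
  return (HasItems (item : items))
-- Illegal according to the diagram, yet perfectly well-typed:
checkout NoItems PlaceOrder =
  return OrderPlaced
checkout state _ =
  return state
```

Nothing in the type `CheckoutState -> CheckoutEvent -> IO CheckoutState` relates a particular input state to the states it may lead to, so only tests or code review would catch the bad clause.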

We could factor out the side effects in checkout using MTL-style type classes or free monads, or perhaps using extensible-effects. That said, in the next post I will show you a technique to tackle both the side effect and testability concerns, using MTL-style abstraction of the state machine itself.

Why don’t you go on and read part 2 next!

Motor: Finite-State Machines in Haskell

2017-10-27T00:00:00+02:00

While writing my talk “Finite-state machines? Your compiler wants in!”, I have worked on porting the Idris ST library to Haskell. I call it Motor.

Motor is an experimental Haskell library for building finite-state machines with type-safe transitions and effects. I have just published it on Hackage, written a bunch of documentation with Haddock, and put the source code on GitHub.

This blog post is very similar to the Hackage documentation, and aims to pique your interest. The library and documentation will probably evolve and outdate this description, though.

State Machines using Row Types

The central finite-state machine abstraction in Motor is the MonadFSM type class. MonadFSM is an indexed monad type class, meaning that it has not one, but three type parameters:

A Row of input resource states
A Row of output resource states
A return type (just as in Monad)

The MonadFSM parameter kinds might look a bit scary, but they state the same thing:

    class IxMonad m =>
      MonadFSM (m :: (Row *) -> (Row *) -> * -> *) where
      ...

The rows describe how the FSM computation will affect the state of its resources when evaluated. A row is essentially a type-level map, from resource names to state types, and the FSM computation's rows describe the resource states before and after the computation.

An FSM computation newConn that adds a resource named "connection" with state Idle could have the following type:

    newConn :: MonadFSM m =>
      m r ("connection" ::= Idle :| r) ()

A computation spawnTwoPlayers that adds two resources could have this type:

    spawnTwoPlayers :: MonadFSM m =>
      m r ("hero2" ::= Standing :| "hero1" ::= Standing :| r) ()

Motor uses the extensible records in Data.OpenRecords, provided by the CTRex library, for row kinds. Have a look at its documentation to learn more about the type-level operators available for rows.

Building on Indexed Monads

As mentioned above, MonadFSM is an indexed monad. It uses the definition from Control.Monad.Indexed, in the indexed package. This means that you can use ibind and friends to compose FSM computations.

    -- 'c1' and 'c2' are FSM computations
    c1 >>>= \_ -> c2
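If indexed monads are new to you, the following sketch may help. It does not use Motor or the indexed package. The IxMonad class here is a minimal stand-in defined locally, and the Door type with its Opened and Closed indices is invented for the example. The point is only that the two extra type parameters track a state before and after each computation, and that composition requires them to line up.

```haskell
-- A minimal stand-in for an indexed monad class, defined locally so that
-- the example needs nothing beyond base.
class IxMonad m where
  ireturn :: a -> m i i a
  (>>>=)  :: m i j a -> (a -> m j k b) -> m i k b

-- Type-level door states; they have no values.
data Opened
data Closed

-- A door program indexed by its state before (i) and after (j), run in IO.
newtype Door i j a = Door { runDoor :: IO a }

instance IxMonad Door where
  ireturn = Door . return
  Door m >>>= f = Door (m >>= runDoor . f)

open :: Door Closed Opened ()
open = Door (putStrLn "Opening")

close :: Door Opened Closed ()
close = Door (putStrLn "Closing")

-- The output index of 'open' matches the input index of 'close'.
openThenClose :: Door Closed Closed ()
openThenClose = open >>>= \_ -> close

-- Rejected if uncommented: the second 'open' requires a Closed door.
-- openTwice = open >>>= \_ -> open
```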

You can combine this with the RebindableSyntax language extension to get do-syntax for FSM programs:

    test :: MonadFSM m => m Empty Empty ()
    test = do
      c1
      c2
      r <- c3
      c4 r
      where
        (>>) a = (>>>=) a . const
        (>>=) = (>>>=)

See 24 Days of GHC Extensions: Rebindable Syntax for some more information on how to use RebindableSyntax.

State Actions

To make it easier to read and write FSM computation types, there is some syntax sugar available.

State actions allow you to describe state changes of named resources with a single list, as opposed to writing two rows. They also take care of matching the CTRex row combinators with the expectations of Motor, which can be tricky to do by hand.

There are three state actions:

Add adds a new resource.
To transitions the state of a resource.
Delete deletes an existing resource.

A mapping from a resource name to a state action is written using the :-> type operator, with a Symbol on the left, and a state action type on the right. Here are some examples:

"container" :-> Add Empty  "list" :-> To Empty NonEmpty  "game" :-> Delete GameEnded

So, the list of mappings from resource names to state actions describes what happens to each resource. Together with an initial row of resources r, and a return value a, we can declare the type of an FSM computation using the Actions type:

MonadFSM m => Actions m '[ n1 :-> a1, n2 :-> a2, ... ] r a

A computation that adds two resources could have the following type:

    addingTwoThings ::
      MonadFSM m =>
      Actions m '[ "container" :-> Add Empty
                 , "game" :-> Add Started
                 ] r ()

Infix Operators

As an alternative to the Add, To, and Delete types, Motor offers infix operator aliases. These start with ! to indicate that they can be effectful.

The !--> operator is an infix alias for To:

    useStateMachines ::
      MonadFSM m =>
      Actions m '[ "program" :-> NotCool !--> Cool ] r ()

The !+ and !- are infix aliases for mappings from resource names to Add and Delete state actions, respectively:

    startNewGame ::
      MonadFSM m =>
      Actions m '[ "game" !+ Started ] r ()

    endGameWhenWon ::
      MonadFSM m =>
      Actions m '[ "game" !- Won ] r ()

Row Polymorphism

Because of how CTRex works, FSM computations that have a free variable as their input row of resources, i.e. that are polymorphic in the sense of other resource states, must list all their actions in reverse order.

    doFourThings ::
      Game m
      => Actions m '[ "hero2" !- Standing
                    , "hero1" !- Standing
                    , "hero2" !+ Standing
                    , "hero1" !+ Standing
                    ] r ()
    doFourThings = do
      spawn hero1
      spawn hero2
      perish hero1
      perish hero2
      where
        (>>) a = (>>>=) a . const
        (>>=) = (>>>=)

This is obviously quite clumsy. If anyone has ideas on how to fix or work around it, please get in touch. Had the r been replaced by Empty in the type signature above, it could have had type NoActions m Empty () instead.

Running the State Machine

The runFSM function in Motor.FSM runs an FSM computation in some base monad:

runFSM :: Monad m => FSM m Empty Empty a -> m a

FSM has instances for IxMonadTrans and a bunch of other type classes. More might be added as they are needed.

Examples

There is only one small Door example in the repository, along with some test programs. I haven’t had much time to write examples, but hopefully I will soon. The door example does feature most of the relevant concepts, though.

Automating the Build of your Technical Presentation

2017-09-24T00:00:00+02:00

Writing technical presentations that include code samples and diagrams can be really tedious. In mainstream presentation software, such as Keynote and PowerPoint, your workflow is likely to manually copy-and-paste source code from your editor into your slides. If you’re not using the drawing capabilities of your presentation software, you have to perform similar steps to include diagrams.

In my process of writing a technical presentation, code samples and diagrams are not written first, and included in the slides at the last minute – I work iteratively on slide content, source code, and diagrams, all at the same time. Having to repeat the time-consuming and error-prone process of updating code samples in slides, each time my original source code changes, breaks my creative flow completely. I also want to have my source code compiled and executable, so that I can be confident it is correct.

The main features I’m looking for in a technical presentation setup include:

Text-based sources for everything (slides, code samples, diagrams, presentation template, styling, and build script)
The ability to include sections of external source code files into slides
Repeatable and fully-automated builds
PDF output with and without notes

I’m less interested in:

Slide transitions and animation
Videos and GIFs

This article demonstrates a setup that fulfills these goals, using Pandoc Markdown, Beamer, Graphviz and Make. I have also created a template based on my setup, that you can use if you like this approach.

Writing Slides with Pandoc Markdown

One of my favorite tools in technical writing is Pandoc. I use it for documentation, talks, Markdown preview, this website, and for converting existing documents to more desirable formats¹.

A very nice feature of Pandoc is slideshow output formats. You can write your slides in Markdown using regular headings, with the slide content below:

    ---
    title: My Awesome Topic
    subtitle: Ramblings on the Subject
    author: Alice
    date: September 2017
    ---

    # Introduction

    - Something
    - Another thing
    - The last one

    # I Can LaTeX

    \centerline{\Large{\textit{I can embed LaTeX as well.}}}

    # A Program

    ``` javascript
    function coolTools() {
      return ["pandoc", "beamer"];
    }
    ```

Build the LaTeX source code using Pandoc and the beamer target, and then generate the PDF using pdflatex:

    pandoc -t beamer -o slides.tex slides.md
    pdflatex slides.tex

Voilà! You have a PDF, such as this one.

You might want to customize some of the Beamer styling, which is done by including a .tex file using the -H command line parameter of Pandoc. The full template described below uses this technique to change the styling.

Including Source Code from External Files

As stated in the introduction of this article, I want my source code samples to compile, and possibly be executable. If I have to write code directly in the slides, I will most likely make mistakes, and there will be no compiler or toolchain to tell me about it.

There are a number of ways to include code from external files with Pandoc, but I will shamelessly refer to my own filter called pandoc-include-code, which I use extensively. To include a source code file, write an empty fenced code block and use the include attribute to specify the path to the external file:

    ``` {.javascript include=my-program.js}
    ```

Now, suppose you have a Haskell program in a file Sample.hs.

    module Sample where

    data Animal = Dog | Cat

    isAfraidOf :: Animal -> Animal -> Bool
    isAfraidOf Cat Dog = True
    isAfraidOf _ _ = False

    result :: Bool
    result = Dog `isAfraidOf` Cat

The issue is that you want to include just the Animal data type and the isAfraidOf definition, not the top module declaration and the result definition. By wrapping the content in two special comments, a start snippet comment and an end snippet comment that both carry the snippet’s name, you create a named snippet:

    module Sample2 where

    -- start snippet animals
    data Animal = Dog | Cat

    isAfraidOf :: Animal -> Animal -> Bool
    isAfraidOf Cat Dog = True
    isAfraidOf _ _ = False
    -- end snippet animals

    result :: Bool
    result = Dog `isAfraidOf` Cat

In the Markdown source, refer to the snippet’s name when including:

    ``` {.haskell include=Sample2.hs snippet=animals}
    ```

The included code will be only that in your snippet:

    data Animal = Dog | Cat

    isAfraidOf :: Animal -> Animal -> Bool
    isAfraidOf Cat Dog = True
    isAfraidOf _ _ = False

You can still compile the code, load it in the REPL, and write tests for it, while including interesting parts into your slides. You are not depending on specific line number ranges, which of course becomes a nightmare once you edit your source code.

The last feature of pandoc-include-code I want to demonstrate is the dedent attribute. Let’s say we have a Javascript file with a class method that you’re interested in:

    class Foo {
      // start snippet bar
      bar() {
        return "bar";
      }
      // end snippet bar
    }

When including snippets of indented source code, you might want to “dedent”, i.e. remove extra leading whitespace. This is easily accomplished with the dedent attribute, specifying how many whitespace characters you want removed:

    ``` {.javascript include=sample1.js snippet=bar dedent=2}
    ```

The included code will be “dedented” to the first column:

    bar() {
      return "bar";
    }
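To give a feel for what snippet extraction and dedenting amount to, here is a small Haskell sketch. It is not the implementation of pandoc-include-code. The function extractSnippet and its behavior are my own simplification: it matches marker comments by substring, so it would confuse snippet names that share a prefix.

```haskell
import Data.List (isInfixOf)

-- Keep the lines strictly between the "start snippet NAME" and
-- "end snippet NAME" markers, then remove up to 'dedent' leading
-- spaces from each kept line.
extractSnippet :: String -> Int -> String -> String
extractSnippet name dedent =
  unlines
    . map dedentLine
    . takeWhile (not . isEnd)
    . drop 1                      -- skip the start marker line itself
    . dropWhile (not . isStart)
    . lines
  where
    isStart = (("start snippet " ++ name) `isInfixOf`)
    isEnd   = (("end snippet " ++ name) `isInfixOf`)
    -- Only spaces within the first 'dedent' columns are removed.
    dedentLine line =
      let (prefix, rest) = splitAt dedent line
      in dropWhile (== ' ') prefix ++ rest
```

Applied to the contents of the class Foo file above with the name "bar" and a dedent of 2, it returns the three lines of the bar method starting in the first column.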

Generating Diagrams

Often I want a couple of diagrams in a presentation, to illustrate some design or flow in a program. I enjoy generating diagrams from plain text sources, instead of drawing by hand or using special drawing software with binary formats. Both Graphviz and PlantUML are powerful tools that are relatively easy to integrate with the presentation build in a Makefile.

Let’s say I want to generate a state diagram. The following Graphviz source code generates a simple yet beautiful diagram:

    digraph door_states {
      graph [ dpi = 300 ];
      splines=true;
      esep=5;
      rankdir=LR;

      size="8,5";

      edge [ fontname = "Ubuntu" ];
      node [ fontname = "Ubuntu Bold" ];

      node [shape = point, width = .25, height = .25 ];
      Start;

      node [shape = circle];
      Opened;
      Closed;

      Start -> Closed
      Closed -> Opened [ label = "open" ];
      Opened -> Closed [ label = "close" ];
    }

Generate a PNG file using the dot command:

dot -Tpng -o door.png door.dot

The generated PNG image looks like this:

To automate this process with Make, you can find all .dot files, transform those paths into a list of target paths, and have Make run the dot command to generate all targets.

    DIAGRAM_SRCS=$(shell find src -name '*.dot')
    DIAGRAMS=$(DIAGRAM_SRCS:src/%.dot=target/%.png)

    .PHONY: all
    all: $(DIAGRAMS)

    target/%.png: src/%.dot
    	mkdir -p $(shell dirname $@)
    	dot -Tpng $< -o $@

A similar setup can be used with PlantUML, although you might want the JAR file to download automatically:

    PLANTUML=deps/plantuml.jar

    UML_SRCS=$(shell find src -name '*.uml.txt')
    UMLS=$(UML_SRCS:src/%.uml.txt=target/%.png)

    .PHONY: all
    all: $(UMLS)

    target/%.png: src/%.uml.txt $(PLANTUML)
    	mkdir -p $(shell dirname $@)
    	cat $< | java -jar $(PLANTUML) -tpng -pipe > $@

    $(PLANTUML):
    	mkdir -p $(shell dirname $@)
    	wget http://sourceforge.net/projects/plantuml/files/plantuml.jar/download -O $@

I have used PlantUML in this blog, just as described above, to generate diagrams for posts. See the post Hyper: Elegant Weapons for a More Civilized Page for an example.

Wrapping Up

Based on the techniques described in this post, I have created a template that you can use for your own presentations. It is published at GitHub. I hope this will be useful to someone, and that it can be a good complement to this article.

What I really like about the tools and techniques demonstrated in this article is that they are not limited to writing presentations. I use the same tools for writing documentation, and for writing this blog. Pandoc is an amazing piece of software, and I have just scratched the surface of what it can do. For instance, if you do not want PDF output for your talk, there’s a number of Javascript-based formats for slideshows available.

Now go on and write some cool tech talks!

Footnotes

I once needed to convert a technical manual from ODF to reStructuredText. A single Pandoc command later, and I had the sources for a proper Sphinx build.↩︎

Tagless Final Encoding of a Test Language

2017-06-05T00:00:00+02:00

I have experimented with a test language encoded in tagless final style, instead of algebraic data types, to support the typed combinators beforeEach and beforeAll. Although the intended use is for PureScript Spec, I want to share the Haskell prototype I ended up with, and explain how I got there.

The Algebraic Data Type Approach

The PureScript Spec project, inspired by Haskell’s hspec, provides an EDSL and framework for writing, organizing, and running PureScript tests. Combinators use a State monad collecting tests into an algebraic data structure, representing the test language tree structure.

    describe "My Module" $ do
      describe "Feature #1" $ do
        it "does addition" ((1 + 1) `shouldEqual` 2)
        it "does subtraction" ((1 - 1) `shouldEqual` 0)
      describe "Feature #2" $ do
        it "does multiplication" ((2 * 2) `shouldEqual` 4)

The Group data type holds describe groups and it tests, here shown in a simplified form, and translated to Haskell. The test body has the parameterized type t, making the Group structure suitable for representing not only tests to be run, but also for test results.

    data Group t
      = Describe String [Group t]
      | It String t

A test suite to be run can have type [Group (IO ())], and a test result can have type [Group TestResult].
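One way to connect those two types is to derive Traversable for Group and run every test body in place. The sketch below is my own illustration, not code from PureScript Spec. The TestResult type and the runTest and runSuite functions are hypothetical, since the post does not define them.

```haskell
{-# LANGUAGE DeriveTraversable #-}
import Control.Exception (SomeException, try)

data Group t
  = Describe String [Group t]
  | It String t
  deriving (Show, Eq, Functor, Foldable, Traversable)

-- Hypothetical result type; a failing test is one that throws.
data TestResult
  = Passed
  | Failed String
  deriving (Show, Eq)

-- Run one test body, turning any exception into a Failed result.
runTest :: IO () -> IO TestResult
runTest test = do
  outcome <- try test
  return $ case outcome of
    Right ()  -> Passed
    Left err  -> Failed (show (err :: SomeException))

-- The tree shape is preserved; only the leaves change type.
runSuite :: [Group (IO ())] -> IO [Group TestResult]
runSuite = traverse (traverse runTest)
```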

In a GitHub pull request for PureScript Spec, we discussed how to implement setup and tear-down functions around tests, and how to make them type safe. I started poking around in the code base, but soon realized that the required change was larger than I first imagined, and so I began on a clean slate prototype. The reason I used Haskell was to focus more on modeling different alternatives, and less on hand-written instances for newtypes.

I wanted a setup function to provide a return value, and all tests run with the setup to receive the return value as a parameter. Thus, n setup functions would require test functions of n arguments. A test with an incorrect number of arguments would give a type error at compile-time.

My first attempt was to extend the current design by adding a new constructor BeforeEach to the Group data type. Using the already parameterized test body, tests inside a BeforeEach would be functions from the return value of type b, to some test body of type t. For each nesting of BeforeEach, test body types would get an additional argument.

    data Group b t
      = Describe String [Group b t]
      | It String t
      | BeforeEach (IO b) (Group b (b -> t))

While this structure can hold multiple nested BeforeEach values, and enforce the correct number of arguments to It body functions, the type b cannot vary throughout the structure. Requiring all setup functions in a test suite to return values of the same type was not acceptable. I suspect that there might be a way to solve this in Haskell using GADTs and existential types, but I’m not sure how it would translate to PureScript.
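For the curious, here is one way the GADT route might look in Haskell. This is a sketch of my suspicion and not the design this post ends up with. The collect and runAll functions and the example suite are invented for illustration. The setup type b is existentially quantified in the BeforeEach constructor, so each setup can return a different type.

```haskell
{-# LANGUAGE GADTs #-}
import Control.Applicative ((<**>))
import Control.Monad (join)

data Group t where
  Describe   :: String -> [Group t] -> Group t
  It         :: String -> t -> Group t
  -- 'b' does not appear in the result type, so it can vary per BeforeEach.
  BeforeEach :: IO b -> Group (b -> t) -> Group t

-- Flatten to named tests. Each setup runs once per test, outermost first,
-- and its result is applied to the test function.
collect :: Group t -> [(String, IO t)]
collect (It name test) =
  [(name, return test)]
collect (Describe name groups) =
  [ (name ++ " / " ++ n, t) | g <- groups, (n, t) <- collect g ]
collect (BeforeEach setup inner) =
  [ (n, setup <**> mkTest) | (n, mkTest) <- collect inner ]

-- With IO () test bodies, collecting yields IO (IO ()), hence the join.
runAll :: Group (IO ()) -> IO ()
runAll suite = mapM_ run (collect suite)
  where
    run (name, test) = putStrLn name >> join test

-- Two setups with different return types in the same suite.
example :: Group (IO ())
example =
  Describe "setup"
    [ BeforeEach (return (42 :: Int)) (It "gets an Int" (\n -> print n))
    , BeforeEach (return "hi") (It "gets a String" putStrLn)
    ]
```

A test with the wrong number of arguments fails to typecheck, because every BeforeEach adds one function argument to the test body type of the group it wraps.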

Following the idea of parameterizing Group further, I’d probably end up close to a specialized version of the Free monad. Why free monads matter explains a similar path, arriving at the Free monad, most eloquently. I decided, however, to try out a tagless final style encoding for the test language in my Haskell prototype.

Exploring Tagless Final Encoding

Having kept an eye out for practical examples of tagless final style, I was keen on trying it out for the test language. The discussion started at a local meetup in Malmö, where I presented the problem, together with my suspicion that a combination of tagless final style encoding and Applicative would solve it elegantly. The following design is the result of my own explorations after the meetup, and will hopefully be of use for the real implementation in PureScript Spec in the future.

We begin by declaring a bunch of language extensions, the module, and our imports.

{-# LANGUAGE DeriveFunctor #-}
{-# LANGUAGE FlexibleContexts #-}
{-# LANGUAGE FlexibleInstances #-}
{-# LANGUAGE GeneralizedNewtypeDeriving #-}
{-# LANGUAGE MultiParamTypeClasses #-}
module Test.Spec where

import Control.Monad.Identity
import Control.Monad.IO.Class
import Control.Monad.Writer
import Data.Function
import Data.List
import System.IO.Memoize

Encoding the test language in tagless final style means that all operations are overloaded functions in a type class. MonadSpec takes two type arguments, m and f, constrained to Monad and Applicative, respectively. The class includes the operations it, describe, beforeEach, and beforeAll.

class (Monad m, Applicative f) => MonadSpec m f where

The operations of the type class constitute the whole language. Beginning with the leaf operation it, we see that it takes a string description, some test of type a, and returns a Spec parameterized by m and (f a).

 it :: String -> a -> Spec m (f a)

Having the test, a value of type a, wrapped up inside the (Applicative f) is essential for our operations to compose.

The describe operation takes a string describing a group of tests, and another Spec, with any type of tests, as long as they are inside the (Applicative f).

 describe :: String -> Spec m (f a) -> Spec m (f a)

The setup combinators beforeEach and beforeAll have identical type signatures. They take a setup value of type (f a), and a Spec with tests of type (f (a -> b)), returning a Spec with tests of type (f b). The type shows that the applicative test functions are applied by the setup combinators.

 beforeEach :: f a -> Spec m (f (a -> b)) -> Spec m (f b)
 beforeAll :: f a -> Spec m (f (a -> b)) -> Spec m (f b)

What is a Spec? A Writer monad transformer, collecting Group values. Using an explicit WriterT is needed for the interpreter, explained shortly, to capture nested tests, apply test functions to setup combinators’ return values, and change the type of the test structure during interpretation.

type Spec m a = WriterT [Group a] m ()

The Collector interpreter collects a Spec into a data structure, much like the original approach, but with all test functions fully applied inside the applicative.

newtype Collector m a = Collector { unCollector :: m a }
  deriving ( Functor
           , Applicative
           , Monad
           , MonadIO
           )

The Group data structure holds the applied test functions, and thus has no constructors for BeforeEach and BeforeAll.

data Group a
  = Describe String [Group a]
  | It String a
  deriving (Functor)

In effect, the Collector interpreter loses information when collecting the test suite as a Group data structure. Other interpreters, e.g. a test suite pretty-printer, would provide their own instances of the beforeEach and beforeAll operations, and retain the setup information for their particular purposes.

The MonadSpec instance for the Collector interpreter defines the applicative type parameter as IO. This means that all setup combinator values must have type (IO a), where the nested test functions have type (IO (a -> b)).

instance (Monad m, MonadIO m)
  => MonadSpec (Collector m) IO where

The it instance wraps the test body inside the applicative, and reports the Group in the WriterT monad.

 it name body =
   tell [It name (return body)]

To wrap test groups in a Describe value, the instance of describe runs the nested WriterT to collect all groups.

 describe name spec = do
   groups <- lift $ execWriterT spec
   tell [Describe name groups]

The interesting part is the beforeEach instance, where the inner test function is applied using (<*>) and (&). As the Group data type has an instance of Functor, the application can be mapped recursively over the structure using fmap.

 beforeEach setup spec = do
   groups <- lift $ execWriterT spec
   tell $ fmap ((&) <$> setup <*>) <$> groups

This is where WriterT must be explicit in the MonadSpec operations. In a previous attempt, I had a MonadWriter constraint on the interpreter, and no mention of WriterT in MonadSpec. That approach prevented the monoidal type of MonadWriter from changing during interpretation, a change required to apply test functions when interpreting setup combinators. Instead, by making WriterT explicit in the MonadSpec operations, the Collector instance can collect groups using lift and execWriterT, and freely change the test function types.
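The order in which the beforeEach instance runs effects can be checked in isolation. The following is a minimal sketch, not part of the prototype: logged, effectOrder, setupFirst, and testFirst are hypothetical helpers. It compares the combination used in the instance, (&) <$> setup <*> test, with plain application, test <*> setup, where the wrapped test value may itself carry effects from inner setup combinators.

```haskell
import Data.Function ((&))
import Data.IORef (IORef, modifyIORef, newIORef, readIORef)

-- Record a label in a log, then return a value.
logged :: IORef [String] -> String -> a -> IO a
logged ref label value = do
  modifyIORef ref (++ [label])
  return value

-- Run a setup/test pair with a given way of combining them, returning
-- the test result and the order in which the effects ran.
effectOrder
  :: (IO Int -> IO (Int -> Bool) -> IO Bool)
  -> IO (Bool, [String])
effectOrder combine = do
  ref <- newIORef []
  result <- combine (logged ref "setup" 20) (logged ref "test" (== 20))
  order <- readIORef ref
  return (result, order)

-- The combination used in the beforeEach instance: setup effects run first.
setupFirst :: IO Int -> IO (Int -> Bool) -> IO Bool
setupFirst setup test = (&) <$> setup <*> test

-- Plain application: the effects of the wrapped test run before the setup.
testFirst :: IO Int -> IO (Int -> Bool) -> IO Bool
testFirst setup test = test <*> setup
```

Both variants produce the same test result. Only the logged order differs, which matters once setup combinators are nested and the outer setup is expected to run before the inner one.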

As a slight variation on beforeEach, the beforeAll combinator must only run the setup action once, applying all test functions with the same return value. Using the io-memoize package, and the existing beforeEach combinator, we can do this neat trick:

 beforeAll setup spec = do
   s <- liftIO $ once setup
   beforeEach s spec
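To see why wrapping the setup action is enough, here is a rough sketch of the memoization idea using only base. The function onceSketch is a hypothetical stand-in with the same type as once from io-memoize, IO a -> IO (IO a); it is not the package's actual implementation, and demo is only there to exercise it.

```haskell
import Control.Concurrent.MVar (modifyMVar, newMVar)
import Data.IORef (modifyIORef, newIORef, readIORef)

-- Wrap an action so that it runs at most once. Later runs of the
-- returned action reuse the cached result.
onceSketch :: IO a -> IO (IO a)
onceSketch action = do
  cache <- newMVar Nothing
  return $ modifyMVar cache $ \cached ->
    case cached of
      Just value -> return (cached, value)
      Nothing -> do
        value <- action
        return (Just value, value)

-- Run a counting action three times through the memoized wrapper, and
-- report how many times the underlying action actually executed.
demo :: IO (Int, [String])
demo = do
  counter <- newIORef (0 :: Int)
  memoized <- onceSketch (modifyIORef counter (+ 1) >> return "foo")
  results <- sequence [memoized, memoized, memoized]
  runs <- readIORef counter
  return (runs, results)
```

Since the memoized action is still an ordinary IO value, it can be handed to beforeEach unchanged, which is what the beforeAll instance relies on.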

Collecting all tests, fully applied with setup return values, is a matter of running the WriterT and Collector instances, and joining the applicative values with the test body values, here constrained to be the same monadic type m. Note that Applicative has been a superclass of Monad since GHC 7.10.

collect
  :: Monad m
  => Spec (Collector m) (m (m a))
  -> m [Group (m a)]
collect spec = do
  groups <- unCollector $ execWriterT spec
  return (map (fmap join) groups)

Using collect, the run function traverses the group of tests, printing and running all tests, in the (MonadIO m) instance.

run :: MonadIO m => Spec (Collector m) (m (m ())) -> m ()
run spec = do
  groups <- collect spec
  mapM_ (go []) groups
  where
    go ctx (Describe name groups) =
      mapM_ (go (ctx ++ [name])) groups
    go ctx (It name body) = do
      let heading = intercalate " > " ctx ++ " > it " ++ name
      liftIO $ putStrLn heading
      body

We can now nest setup combinators, describe groupings, and tests, with different types, while retaining type safety. The test suite type signature is somewhat verbose, but that could be hidden with a type alias. The only drawback, as I see it, is that the outer setup combinators provide the right-most test function parameters, which feels a bit backwards from a user point of view.

mySpec :: MonadSpec m IO => Spec m (IO (IO ()))
mySpec =
  beforeAll (putStrLn "before all!" >> return "foo") $ do
    describe "module 1" $
      beforeEach (putStrLn "before each 1!" >> return 20) $
        describe "feature A" $ do
          it "works!" (\x y -> assert (x == 20))
          it "works again!" (\x y -> assert (y == "foo"))
    describe "module 2" $
      beforeEach (putStrLn "before each 2!" >> return 30) $
        describe "feature B" $ do
          it "works!" (\x y -> assert (x == 30))
          it "works again!" (\x y -> assert (y == "foo"))
  where
    assert True = return ()
    assert False = fail "Test failure!"

Using IO and fail to report test failures would be less than ideal in a real test suite, but it serves our purposes in this experiment. OK, let’s run the test suite, already!

main :: IO ()
main = run mySpec

Looking at the output, we see that “before all!” is printed only once, and that “before each …” is printed for every test.

*Test.Spec> :main
module 1 > feature A > it works!
before all!
before each 1!
module 1 > feature A > it works again!
before each 1!
module 2 > feature B > it works!
before each 2!
module 2 > feature B > it works again!
before each 2!

Summary

Using tagless final style for embedded languages is fun and powerful. I hope this demonstration can serve as a motivating example for readers interested in EDSLs and tagless final style, in addition to it perhaps influencing or finding its way into the PureScript Spec project. It would also be interesting to explore a Free monad approach, and to compare the two.

Revisions

June 5, 2017: Based on feedback from Matthias Heinzel, regarding the order of execution of setup actions, the expression (<*> setup) in the beforeEach instance was changed to ((&) <$> setup <*>).

Hyper: Elegant Weapons for a More Civilized Page

2017-01-06T00:00:00+01:00

Since laying Oden aside, I have been getting back into PureScript, and started working on a project called Hyper that I would like to introduce you to.

Other than experimenting with extensible records and row typing, and spamming my followers on Twitter, I have tried to document the design and purpose of the project so that it will be approachable for others. Now, however, I feel that a blog post introducing the project without diving too deep into implementation details would be helpful and interesting.

If you have any feedback, or would like me to write more about Hyper here, please let me know in the comments at the bottom of this page.

Background & Motivation

I have been working with NodeJS for web servers, on and off, for the last 3-4 years. As projects grow, and even at the start of a project, programming errors creep in around the middleware setup. Many things can go wrong, and they are often hard to debug. You have no static guarantees that third-party middleware and your handlers are correctly set up: that they cover all cases, execute in the correct order, do not interfere with each other, and handle errors properly.

After an inspiring talk by Edwin Brady on dependent types for state machines, I decided to try to improve this situation for web server middleware using extensible records and row types in PureScript, to see if I could provide similar safety guarantees to those demonstrated in Idris. It turns out that extensible records work really well for this case!

Another thing circling around in my head, to the point of real concern, is the amount of energy that gets directed towards pure functional programming in single-page applications. The ideas are great, and very cool designs emerge, but I am afraid we completely abandon progressive enhancement and “regular” server-side rendered web applications. Thus, I wanted to direct some of the FP love towards good ol’ web server software.

I messed around a bit in Haskell with GADTs and phantom types, but PureScript records really stood out, so I decided to go ahead with PureScript. I think it deserves a web server library like the one Hyper aims to become. With alternative PureScript backends emerging, Hyper could also run on Erlang and C++ servers, not only NodeJS.

Design

The basic building blocks of Hyper are Conn and Middleware. They are heavily inspired by other middleware libraries, like Plug in Elixir and connect in NodeJS. The Conn represents the entirety of an HTTP connection, both the request and response.

type Conn req res components =
  { request :: req
  , response :: res
  , components :: components
  }

A Conn is a record containing some request req, and some response res. What are those values? Well, it depends on the middleware. Usually they are also records, where middleware specify the fields that they require, or provide, in both the request and response records. The structure of the request and response records is open, and the compiler guarantees that the composition of middleware is correct.

You might wonder what the purpose of components is. For the sake of this blog post, let us just say that it is a place for user-defined things not directly related to HTTP.

The following type requires the request to have at least a body of type String, and headers of type StrMap String. The request can have more fields; this type does not care. The response and components are not constrained at all by this type.

forall req res c.
  Conn { body :: String
       , headers :: StrMap String
       | req
       }
       res
       c

The second building block, middleware, are simply functions transforming connection values. They take a pure connection, and return another connection inside some type m, where m is usually an Applicative or a Monad.

type Middleware m c c' = c -> m c'

As middleware are monadic functions, just as the computations used with Bind, they compose using Kleisli composition.

-- Compose three middleware functions sequentially,
-- from left to right:
authenticateUser >=> parseForm >=> saveTodo

As m is a parameter of Middleware, you can customize the middleware chain to use a monad stack for tracking state, providing configuration, gathering metrics, and much more.
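The same composition can be sketched in Haskell, where (>=>) comes from Control.Monad. This is only an analogue of the PureScript design: the Conn record below is a deliberately tiny, hypothetical stand-in for Hyper's extensible record, and the three middleware bodies are made up for the illustration. Choosing Either String for m shows how the chain picks up error handling from the monad.

```haskell
import Control.Monad ((>=>))

-- A Haskell rendering of the Middleware alias.
type Middleware m c c' = c -> m c'

-- A fixed record standing in for Hyper's open Conn record.
data Conn = Conn
  { user :: Maybe String
  , todo :: Maybe String
  } deriving (Eq, Show)

-- Hypothetical middleware, named after the ones in the example above.
authenticateUser :: Middleware (Either String) Conn Conn
authenticateUser conn = Right conn { user = Just "alice" }

parseForm :: Middleware (Either String) Conn Conn
parseForm conn = Right conn { todo = Just "Write blog post" }

-- Fails in the Either monad if an earlier middleware did not run.
saveTodo :: Middleware (Either String) Conn Conn
saveTodo conn =
  case (user conn, todo conn) of
    (Just _, Just _) -> Right conn
    _                -> Left "missing user or todo"

-- Kleisli composition, from left to right.
app :: Middleware (Either String) Conn Conn
app = authenticateUser >=> parseForm >=> saveTodo
```

Unlike the row-typed version, nothing here stops saveTodo from being composed first. That ordering mistake only surfaces at runtime as a Left value, which is the gap the open records in Hyper are meant to close.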

Response State Transitions

In Hyper, the state of a response is tracked in its type. This guarantees correctness in response handling, preventing incorrect ordering of headers and body writes, incomplete responses, or other similar mistakes.

The contract of state transitions is encoded in a type class implemented by response writers for specific servers. This makes it possible to write reusable functions on top of the protocol that can run with different server implementations.

To save a few keystrokes, and your innocent eyes, let us begin by looking at the type alias for middleware transitioning between response states.

type ResponseStateTransition m rw from to =
  forall req res c.
  Middleware
  m
  (Conn req {writer :: rw from | res} c)
  (Conn req {writer :: rw to | res} c)

Now, on to the encoding of state transitions, the ResponseWriter type class.

class ResponseWriter rw m b | rw -> b where
  writeStatus
    :: Status
    -> ResponseStateTransition m rw StatusLineOpen HeadersOpen
  writeHeader
    :: Header
    -> ResponseStateTransition m rw HeadersOpen HeadersOpen
  closeHeaders
    :: ResponseStateTransition m rw HeadersOpen BodyOpen
  send
    :: b
    -> ResponseStateTransition m rw BodyOpen BodyOpen
  end
    :: ResponseStateTransition m rw BodyOpen ResponseEnded

I know, it looks a bit scary with all the types. Stay strong, or have a look at it rendered as a state diagram instead.

We can write a middleware, based on the state transition functions of the type class, that responds to all requests with a friendly greeting.

writeStatus (Tuple 200 "OK")
>=> writeHeader (Tuple "Content-Type" "text/plain")
>=> closeHeaders
>=> send "Greetings, friend!"
>=> end

Say we forget the line with closeHeaders. What do we get? An informative type error, of course!

Could not match type HeadersOpen with type BodyOpen
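The technique of tracking the response state in a type can also be sketched in plain Haskell with a phantom type parameter. This is an analogue under simplifying assumptions, not Hyper's API: the Response type, its list of output lines, and the render helper are hypothetical, and the operations are ordinary functions rather than type class members.

```haskell
import Control.Monad ((>=>))

-- Response states, used only at the type level.
data StatusLineOpen
data HeadersOpen
data BodyOpen
data ResponseEnded

-- A hypothetical writer that accumulates output lines. The state
-- parameter is a phantom type with no runtime representation.
newtype Response state = Response [String]

writeStatus :: Monad m => Int -> Response StatusLineOpen -> m (Response HeadersOpen)
writeStatus code (Response out) = return (Response (out ++ ["HTTP/1.1 " ++ show code]))

writeHeader :: Monad m => (String, String) -> Response HeadersOpen -> m (Response HeadersOpen)
writeHeader (name, value) (Response out) = return (Response (out ++ [name ++ ": " ++ value]))

closeHeaders :: Monad m => Response HeadersOpen -> m (Response BodyOpen)
closeHeaders (Response out) = return (Response (out ++ [""]))

send :: Monad m => String -> Response BodyOpen -> m (Response BodyOpen)
send body (Response out) = return (Response (out ++ [body]))

end :: Monad m => Response BodyOpen -> m (Response ResponseEnded)
end (Response out) = return (Response out)

-- Removing the closeHeaders line makes this definition fail to type check,
-- because send expects a Response BodyOpen, not a Response HeadersOpen.
greet :: Monad m => Response StatusLineOpen -> m (Response ResponseEnded)
greet =
  writeStatus 200
    >=> writeHeader ("Content-Type", "text/plain")
    >=> closeHeaders
    >=> send "Greetings, friend!"
    >=> end

-- Only a finished response can be rendered.
render :: Response ResponseEnded -> [String]
render (Response out) = out
```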

There are easier-to-use functions written on top of the response API so that you do not have to write out all state transitions explicitly.

writeStatus statusOK
>=> contentType textHTML
>=> closeHeaders
>=> respond (div [] [ h1 [] [ text "Greetings!" ]
                    , p [] [ text "You are most welcome." ]
                    ])

The respond function has the added benefit of decoupling the response type, in this case HTML, from the response type required by the server, which for NodeJS is a Buffer. The transformation between HTML and Buffer is done behind the scenes by a bit of type class machinery.

The bits presented so far constitute the low level API for responding to HTTP requests in Hyper. I have plans for designing a simpler interface, based on Ring response maps in Clojure, where response handlers simply return a data structure describing the response to be written. Such an API can be built on top of the existing low-level API.

Wrapping Up

We have looked at the core design of Hyper, and the motivation behind the project, but merely scratched the surface. The documentation describes in much greater detail the implementation and current components of Hyper. I hope, however, that I have caught your interest. If so, please go ahead and check out some of the provided examples at GitHub.

Also, note that the project is highly experimental, not nearly ready for any production use. But it could be! If you are interested in contributing, do not hesitate to send me a tweet or a PM. I need help writing the library, middleware, servers for different PureScript backends, more examples, documentation, etc. If you want to have a look at the source code, it’s also on GitHub.

Thanks for reading!

Note at Feb 13, 2017: Since this blog post was published, the design of Hyper has changed. It is no longer based on simple monadic functions, but instead indexed monads, to provide safer interaction with response writing side effects. The documentation at hyper.wickstrom.tech should always be up-to-date, so please have a look there as well.

Taking a Step Back from Oden

2016-10-10T00:00:00+02:00

I have decided to stop working on Oden.

Why? Building a programming language, even a small one like Oden where I can piggy-back on the success and engineering efforts of Go, takes a lot of time. A lot of time. I currently work professionally at a startup called Empear where we develop products for analyzing and understanding large software projects based on their history. The work is a lot of fun, but also takes much of my time and energy. Working with something as demanding as Oden on nights and weekends, at least in the long run, is unrealistic for me. I managed to keep a good pace during the spring of 2016, but since this summer the tempo has steadily declined into sporadic efforts every other week or so. While that may be fine in many open-source projects, I think Oden needs far better momentum if it is to be usable in any foreseeable future.

I have wrestled with these thoughts for some time now, trying to find ways not to put the project on the shelf. Gathering and building a community of people is not easy. Neither is having the stamina to work on a project alone for an extended period of time. If I cannot build it alone, and if I cannot rally the group of people needed to build it together, then I should focus my efforts elsewhere. In other words, I should find something to work on that fits my current work situation and private life better.

Summarizing the year of 2016, it has been a hell of a ride. I went from only having talked at internal company events, to presenting at a local user group, and finally being a speaker at PolyConf in Poznań, Curry On in Rome, and Lambda World in Cádiz! I have learned a lot about programming language design, compiler implementation, open-source project management, project marketing, and blogging. Regardless of whether you build something that “succeeds”, whatever that means, I can highly recommend building your own language and compiler. It is such a joy seeing your own little creature taking shape, running the first Hello World, writing the documentation, and meeting people who are excited about your work. There are negative sides as well, at least if you publish your work. The haters will find you and do their best to bring you down. I was shocked by that experience, losing sleep and feeling really bad about myself. Eventually, though, I got out of it and started seeing all the wonderful reactions and encouragement from people around me, especially at conferences. That almost became a shock in itself! “Are people really this excited about my little language project?”

All of this has come together as a challenging balance for me: a balance between doing Oden for my own sake, as a way of learning more about designing and building a programming language, and doing it for other people and for the Go community. I have reached a turning point and I will stop here. This post may seem like an attempt to create some drama, which is not the point. I think it is reasonable for me to be clear about why I am doing this, and to avoid any false expectations about the project’s future.

The Oden source code will remain at GitHub, with a big disclaimer that it is no longer in active development. I will shut down oden-lang.org and associated sites this week as I am paying for the servers myself. I might transfer the latest User Guide to a GitHub Pages site for free hosting.

As a final note, I still think the Go ecosystem deserves a decent functional programming language. Maybe some new language will emerge and fill those shoes, or a backend for PureScript or OCaml can win some ground. Whatever the outcome, if Oden played even the smallest part in that story, I would be very happy.

Custom Formatting in HTML and LaTeX Code Listings using Pandoc

2016-07-10T00:00:00+02:00

I have worked intensively on the Oden User Guide lately, primarily on improving content, but also on providing high-quality PDF and HTML output formats with detailed control over typesetting.

For some code listings and syntax examples I want the typesetting to convey the meaning of text in the listing – user input, command output, placeholders, etc. The User Guide build uses Pandoc to transform Markdown documents to PDF (through LaTeX) and to HTML. I could not find a good way to express the custom formatting in a way that worked with both LaTeX and HTML using standard Pandoc functionality.

Therefore, I created a filter to handle my input format. Using this filter, I can write code listings in separate files, using a small subset of HTML. Here’s an example of a shell command listing source file from the User Guide:

$ <strong>GOPATH=$PWD/target/go:$GOPATH go build -o hello hello/main</strong>
$ <strong>./hello</strong>
Hello, world!

The strong tags in the listing are part of the HTML subset I use. To include listings in the Pandoc Markdown source I use regular code block syntax and add the custom include and formatted attributes:

```{include=src/listings/hello-world-go-build-and-run.html formatted=true}
```

The output, both in HTML and PDF, looks like this:

$ GOPATH=$PWD/target/go:$GOPATH go build -o hello hello/main
$ ./hello
Hello, world!

For listings explaining the syntax of the Oden language I want placeholders to be typeset in italic text. Where the language supports a sequence of forms I want to express that using placeholder expressions with subscripts. The following listing source file explains the let binding syntax of Oden.

let <em>identifier<sub>1</sub></em> = <em>expression<sub>1</sub></em>
    <em>identifier<sub>2</sub></em> = <em>expression<sub>2</sub></em>
    ...
    <em>identifier<sub>n</sub></em> = <em>expression<sub>n</sub></em>
in <em>body-expression</em>

When included in the document, just like in the example before, the output looks like this:

let identifier₁ = expression₁
    identifier₂ = expression₂
    ...
    identifierₙ = expressionₙ
in body-expression

The filter is very simplistic in that it only supports em, strong, and sub elements, but it suits the needs of the Oden User Guide. I find extending Pandoc with filters very powerful, and I hope you find this technique and the blog post useful. If you are interested in the complete solution, including the Makefile compiling the filter and running Pandoc, please see doc/user-guide in the Oden repository.
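The core idea of translating the HTML subset into LaTeX markup can be sketched as a small pure function. This is an illustration only, not the filter from the Oden repository: toLatex and tagTable are hypothetical names, the choice of LaTeX commands is an assumption, and the sketch neither integrates with Pandoc nor escapes characters that are special in LaTeX.

```haskell
import Data.List (stripPrefix)

-- Supported tags and their assumed LaTeX counterparts.
tagTable :: [(String, String)]
tagTable =
  [ ("<strong>", "\\textbf{"), ("</strong>", "}")
  , ("<em>", "\\emph{"), ("</em>", "}")
  , ("<sub>", "\\textsubscript{"), ("</sub>", "}")
  ]

-- Replace every supported tag, copying all other characters unchanged.
-- Special LaTeX characters such as $ are not escaped in this sketch.
toLatex :: String -> String
toLatex [] = []
toLatex input@(c : rest) =
  case [ (out, remaining)
       | (tag, out) <- tagTable
       , Just remaining <- [stripPrefix tag input]
       ] of
    ((out, remaining) : _) -> out ++ toLatex remaining
    []                     -> c : toLatex rest
```

For the HTML output the listing can be passed through as it is, since the subset is already valid HTML, so only the LaTeX side needs a translation step like this one.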

Long live Pandoc!

Paramount Color Scheme for Vim

2016-05-15T00:00:00+02:00

Having tried a lot of color schemes for editors, especially for Vim, I have gotten quite picky. All right, very picky. Most of the time Vim has been configured to use Tomorrow Night or Gruvbox. Although they’re great, they have felt a bit over the top. Also, depending on where I’m sitting at the moment I use both dark and light backgrounds for best contrast.

Over the last couple of days I’ve tried the off color scheme with some small modifications for accent colors, e.g. numbers, strings, and escape sequences. This setup fit my taste very well, so I decided to package it as a color scheme for Vim. I call it Paramount.

The goal of Paramount is to not clutter your editor with all colors of the rainbow, just to keep it simple. It uses three monochrome colors for most text together with a purple accent color. Diffs and some errors use red and green colors.

Paramount is based on the pencil and off color schemes. Thanks for the great work on those projects!

Screenshots

The following screenshots show some Go code together with the Latin Modern Mono font on light and dark backgrounds.

…and if you use the Monaco font to show the Ruby code from vimcolors.com:

Installation

Simply copy the color scheme file to your ~/.vim/colors directory or use a plugin manager like Plug or Vundle and add "owickstrom/vim-colors-paramount" as a plugin.

The source is available on GitHub. You can preview the theme on the ~/.vim/colors site.
