William Schultz. My personal website.
https://will62794.github.io/
Mon, 16 Mar 2026 17:44:17 +0000 (Jekyll v3.10.0)
<h1>Canonicalized Distributed Protocol Specs</h1>
<style>
pre {
font-size: 0.75em;
line-height: 1.5;
margin: 1.2em 0;
padding: 1.1em 1.4em;
border-radius: 10px;
border: none;
background: linear-gradient(98deg, #f9fafc 0%, #eef1f5 100%);
box-shadow: 0 3px 18px 0 rgba(80,100,138,0.07);
color: #222;
font-family: 'Fira Mono', 'Consolas', 'SFMono-Regular', Menlo, Monaco, monospace;
overflow-x: auto;
transition: background 0.25s;
}
pre:hover {
background: linear-gradient(98deg, #f1f5fb 0%, #e5eaf3 100%);
}
</style>
<p>Formal descriptions of message passing distributed protocols are complex and heterogeneous. In theory, writing a formal spec of a distributed protocol is a good way to formalize and communicate its precise behavior. In practice, though, many of these specs become quite <a href="https://github.com/ongardie/raft.tla/blob/master/raft.tla">large</a> and <a href="https://github.com/Vanlightly/vsr-tlaplus/blob/main/vsr-revisited/paper/VSR.tla">challenging to digest</a>. They use different message formats and patterns for how information is communicated between nodes, making protocol comprehension and modification tedious and <a href="https://jira.mongodb.org/browse/SERVER-34728">error-prone</a>. There are <a href="https://groups.google.com/g/raft-dev/c/cBNLTZT2q8o">long discussions</a> around the various message types used and comparisons between Raft and Viewstamped Replication.</p>
<!-- , an EPaxos [spec](https://github.com/efficient/epaxos/blob/791b115669fca472d3136f6a2eda46c00b3f8251/tla%2B/EgalitarianPaxos.tla#L61-L90) has 9 different message types, and in general [these specs](https://github.com/ongardie/raft.tla/blob/master/raft.tla) just become pretty large and challenging to digest succinctly. -->
<!-- These specs become complex and difficult to understand when specified at sufficient level of detail to fully capture fine-grained, asynchronous message passing details [1,2,3]. -->
<p>I’ve found that the way these protocols are described also often blurs the separation between (1) the messaging-specific details and communication patterns of a protocol and (2) the essential behavior required for ensuring correctness.
It would be nice to have a better <em>canonical</em> format for describing and modeling distributed protocols, one that makes their similarities and differences clearer, and that potentially also facilitates mechanical derivation of protocol optimizations, modifications, etc., without obscuring things with too many implementation-specific choices.</p>
<p>Raft, for example, chooses two specific message types, <em>RequestVote</em> and <em>AppendEntries</em>, to implement its protocol behavior, along with a host of other specific state variables for tracking protocol progress.
What does a version of Raft look like if we abstract away concrete message types and communication patterns, i.e., specify it in what we can call a “canonicalized” message passing form? We can take a very simple approach and see how far it takes us.</p>
<p>Conceptually, we will express protocols in a model where all actions on a given node follow a simple, common template:</p>
<ol>
<li><strong>Read</strong> its local state and optionally a message from the network.</li>
<li><strong>Update</strong> its local state based on this read.</li>
<li><strong>Broadcast</strong> its entire updated state into the network as a new message.</li>
</ol>
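<p>As a rough illustration outside of TLA+, this template can be sketched in a few lines of Python (all names here are hypothetical, not from the spec below): each action is a function from (local state, received message) to an updated local state, and every successful update is followed by a broadcast of the full state.</p>

```python
import copy

# Sketch of the canonical action template. A message is just a frozen
# snapshot of the sender's entire local state, tagged with its identity.
network = set()  # plays the role of the global "msgs" variable

def broadcast_universal_msg(node_id, state):
    network.add((node_id, tuple(sorted(state.items()))))

def step(node_id, state, action, msg=None):
    # 1. Read local state (and optionally a message), 2. update local
    # state, 3. broadcast the full updated state into the network.
    new_state = action(copy.deepcopy(state), msg)
    if new_state is None:  # None signals a failed precondition
        return state
    broadcast_universal_msg(node_id, new_state)
    return new_state

# One example action in this template: adopt a higher term seen remotely.
def update_term(state, msg):
    sender, fields = msg
    remote = dict(fields)
    if remote["currentTerm"] > state["currentTerm"]:
        state["currentTerm"] = remote["currentTerm"]
        return state
    return None
```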
<p>We don’t impose any message type details on communication between nodes, so we can think of every action as reading some message from the network and updating local state appropriately in response. More simply, since every message is a full recording of a node’s local state at sending time, we can view every action as reading the remote (past) state of some other node and acting in response.</p>
<p>As an example, we can apply this to a version of the originally published <a href="https://github.com/ongardie/raft.tla/blob/master/raft.tla">Raft TLA+ spec</a>, which contains roughly 9 distinct, core protocol actions. If we write a version of this spec in a “canonical” form, we end up with the following election-related actions: <code class="language-plaintext highlighter-rouge">GrantVote</code>, <code class="language-plaintext highlighter-rouge">RecordGrantedVote</code>, and <code class="language-plaintext highlighter-rouge">BecomeLeader</code>:</p>
<pre>
<span style="color: green">\* Server i grants its vote to a candidate server.</span>
<b>GrantVote</b>(i, m) ==
/\ m.currentTerm >= currentTerm[i]
/\ state[i] = Follower
/\ LET j == m.from
logOk == \/ LastTerm(m.log) > LastTerm(log[i])
\/ /\ LastTerm(m.log) = LastTerm(log[i])
/\ Len(m.log) >= Len(log[i])
grant == /\ m.currentTerm >= currentTerm[i]
/\ logOk
/\ votedFor[i] \in {Nil, j} IN
/\ votedFor' = [votedFor EXCEPT ![i] = IF grant THEN j ELSE votedFor[i]]
/\ currentTerm' = [currentTerm EXCEPT ![i] = m.currentTerm]
/\ UNCHANGED <<state, candidateVars, leaderVars, logVars>>
/\ BroadcastUniversalMsg(i)
<span style="color: green">\* Server i records a vote that was granted for it in its current term.</span>
<b>RecordGrantedVote</b>(i, m) ==
/\ m.currentTerm = currentTerm[i]
/\ state[i] = Candidate
/\ votesGranted' =
[votesGranted EXCEPT ![i] =
<span style="color: green">\* The sender must have voted for us in this term.</span>
votesGranted[i] \cup IF (i = m.votedFor) THEN {m.from} ELSE {}]
/\ UNCHANGED <<serverVars, votedFor, leaderVars, logVars, msgs>>
<span style="color: green">\* Candidate i becomes a leader.</span>
<b>BecomeLeader</b>(i) ==
/\ state[i] = Candidate
/\ votesGranted[i] \in Quorum
/\ state' = [state EXCEPT ![i] = Leader]
/\ nextIndex' = [nextIndex EXCEPT ![i] = [j \in Server |-> Len(log[i]) + 1]]
/\ matchIndex' = [matchIndex EXCEPT ![i] = [j \in Server |-> 0]]
/\ UNCHANGED <<currentTerm, votedFor, candidateVars, logVars, msgs>>
/\ BroadcastUniversalMsg(i)
</pre>
<p>where each action is parameterized on a message <code class="language-plaintext highlighter-rouge">m</code> whose fields exactly match the state variables on a local node, and the <a href="https://github.com/will62794/dist-protocol-canonicalization/blob/b80954af376903f503002b3608d1fefcf119573e/code/RaftAsyncUniversal/RaftAsyncUniversal.tla#L111-L122"><code class="language-plaintext highlighter-rouge">BroadcastUniversalMsg</code></a> operator simply pushes a node’s full, updated state into the network as a new message, stored in a global <code class="language-plaintext highlighter-rouge">msgs</code> state variable.</p>
<pre>
<b>BroadcastUniversalMsg</b>(s) ==
msgs' = msgs \cup {[
from |-> s,
currentTerm |-> currentTerm'[s],
state |-> state'[s],
votedFor |-> votedFor'[s],
log |-> log'[s],
commitIndex |-> commitIndex'[s]
]}
</pre>
<p>We can do this similarly for the core log replication related actions:</p>
<pre>
<span style="color: green">\* Server i appends a new log entry from some other server.</span>
<b>AppendEntry</b>(i, m) ==
/\ m.currentTerm = currentTerm[i]
/\ state[i] \in { Follower } \* is this precondition necessary?
\* Can always append an entry if we are a prefix of the other log, and will only
\* append if other log actually has more entries than us.
/\ IsPrefix(log[i], m.log)
/\ Len(m.log) > Len(log[i])
\* Only update logs in this action. Commit learning is done separately.
/\ log' = [log EXCEPT ![i] = Append(log[i], m.log[Len(log[i]) + 1])]
/\ UNCHANGED <<candidateVars, commitIndex, leaderVars, votedFor, currentTerm, state>>
/\ BroadcastUniversalMsg(i)
<span style="color: green">\* Server i learns that another server has applied an entry up to some point in its log.</span>
<b>LeaderLearnsOfAppliedEntry</b>(i, m) ==
/\ state[i] = Leader
\* Entry is applied in current term.
/\ m.currentTerm = currentTerm[i]
\* Only need to update if newer.
/\ Len(m.log) > matchIndex[i][m.from]
\* Follower must have a matching log entry.
/\ Len(m.log) \in DOMAIN log[i]
/\ m.log[Len(m.log)] = log[i][Len(m.log)]
\* Update matchIndex to highest index of their log.
/\ matchIndex' = [matchIndex EXCEPT ![i][m.from] = Len(m.log)]
/\ UNCHANGED <<serverVars, candidateVars, logVars, nextIndex, msgs>>
<span style="color: green">\* Leader advances its commit index.</span>
<b>AdvanceCommitIndex</b>(i) ==
/\ state[i] = Leader
/\ LET \* The maximum indexes for which a quorum agrees
agreeIndexes == {index \in 1..Len(log[i]) : Agree(i, index) \in Quorum}
\* New value for commitIndex'[i]
newCommitIndex ==
IF /\ agreeIndexes /= {}
/\ log[i][Max(agreeIndexes)] = currentTerm[i]
THEN Max(agreeIndexes)
ELSE commitIndex[i]
IN
/\ commitIndex[i] < newCommitIndex \* only enabled if it actually advances
/\ commitIndex' = [commitIndex EXCEPT ![i] = newCommitIndex]
/\ UNCHANGED <<serverVars, candidateVars, leaderVars, log>>
/\ BroadcastUniversalMsg(i)
</pre>
<p>This type of specification approach strips message type and communication pattern specific details out of the protocol. All we do is define actions that can read some past state of another node and make updates based on it. In this model, we can view a protocol as specified simply in terms of (1) its state variables and (2) its actions, each of which is a read of some (current or past) node state followed by a local state update.</p>
<h3 id="history-queries">History Queries</h3>
<p>We can push this specification approach further, simplifying some actions to express their reads entirely in terms of <em>history queries</em>, rather than incrementally updating and reading an auxiliary variable. For example, for the <code class="language-plaintext highlighter-rouge">BecomeLeader</code> action, it is really just waiting until the <code class="language-plaintext highlighter-rouge">votesGranted</code> variable has accumulated the right internal state so that it can safely transition to a leader state. If we ignore this variable entirely, we can express the action precondition with one big precondition query like this:</p>
<pre>
<span style="color: green">\* Candidate i becomes a leader.</span>
<b>BecomeLeader</b>(i, Q) ==
/\ state[i] = Candidate
<span style="background-color: #ccffcc">/\ \A j \in Q : \E m \in msgs : m.currentTerm = currentTerm[i] /\ m.from = j /\ m.votedFor = i</span>
/\ state' = [state EXCEPT ![i] = Leader]
/\ nextIndex' = [nextIndex EXCEPT ![i] = [j \in Server |-> Len(log[i]) + 1]]
/\ matchIndex' = [matchIndex EXCEPT ![i] = [j \in Server |-> 0]]
/\ UNCHANGED <<currentTerm, votedFor, candidateVars, logVars, msgs>>
/\ BroadcastUniversalMsg(i)
</pre>
<p>which checks for the appropriate quorum of voters given the set of messages (states) in the network.</p>
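<p>To make the shape of this history query concrete outside of TLA+, here is a hypothetical Python rendering of the highlighted precondition, treating the network as a collection of full-state snapshots:</p>

```python
def become_leader_precondition(i, current_term, quorum, msgs):
    """History query: every member of the quorum has, at some point, sent
    a message showing it voted for node i in i's current term. Each
    message is a dict snapshot of the sender's full local state."""
    return all(
        any(m["currentTerm"] == current_term
            and m["from"] == j
            and m["votedFor"] == i
            for m in msgs)
        for j in quorum
    )
```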
<p>We can do something similar for the log replication related actions. <code class="language-plaintext highlighter-rouge">LeaderLearnsOfAppliedEntry</code>, which records log application progress from other nodes, can be folded into <code class="language-plaintext highlighter-rouge">AdvanceCommitIndex</code> in the same way:</p>
<pre>
<span style="color: green">\* Leader advances its commit index.</span>
<b>AdvanceCommitIndex</b>(i, Q, newCommitIndex) ==
/\ state[i] = Leader
/\ newCommitIndex > commitIndex[i]
<span style="background-color: #ccffcc">/\ \A j \in Q : \E m \in msgs :
/\ m.from = j
/\ Len(m.log) >= newCommitIndex
/\ log[i][newCommitIndex] = m.log[newCommitIndex]
/\ m.currentTerm = currentTerm[i]</span>
/\ commitIndex' = [commitIndex EXCEPT ![i] = newCommitIndex]
/\ UNCHANGED <<serverVars, candidateVars, leaderVars, log>>
/\ BroadcastUniversalMsg(i)
</pre>
<!-- So, for example, `RecordGrantedVote` is simply checking for some `votedFor` value, and recording this state into a local variable `votesGranted`. Similarly, `BecomeLeader` is simply reading the (current) `votesGranted` state and setting some `state` variable. -->
<!-- This canonical description model also reduces the possible design space of protocols. For example, given only `state` and `currentTerm` variables, what are our possible options for implementing a protocol that ensures Election Safety? Everyone can just become leader at term when they decide to, but to ensure safety, they must check that no one else is currently leader in the term they want to go to. -->
<p>Applying this history query specification approach, we end up with a <a href="https://github.com/will62794/dist-protocol-canonicalization/blob/16b93ab4d26d7abdcd5e4fbb6306db5c1cd6d898/code/RaftAsyncUniversal/RaftAsyncUniversal.tla#L280-L295">simplified set of actions</a> for the protocol:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">BecomeCandidate</code></li>
<li><code class="language-plaintext highlighter-rouge">GrantVote</code></li>
<li><code class="language-plaintext highlighter-rouge">BecomeLeader</code></li>
<li><code class="language-plaintext highlighter-rouge">ClientRequest</code></li>
<li><code class="language-plaintext highlighter-rouge">AppendEntry</code></li>
<li><code class="language-plaintext highlighter-rouge">TruncateEntry</code></li>
<li><code class="language-plaintext highlighter-rouge">AdvanceCommitIndex</code></li>
<li><code class="language-plaintext highlighter-rouge">LearnCommit</code></li>
<li><code class="language-plaintext highlighter-rouge">UpdateTerm</code></li>
</ul>
<p>where the previously required <code class="language-plaintext highlighter-rouge">RecordGrantedVote</code> and <code class="language-plaintext highlighter-rouge">LeaderLearnsOfAppliedEntry</code> actions have been subsumed into the <code class="language-plaintext highlighter-rouge">BecomeLeader</code> and <code class="language-plaintext highlighter-rouge">AdvanceCommitIndex</code> actions respectively, as well as their associated state variables <code class="language-plaintext highlighter-rouge">votesGranted</code> and <code class="language-plaintext highlighter-rouge">matchIndex</code>.</p>
<p>Simplifying the action structure by utilizing history queries can also have a non-trivial impact on model checking performance, as we are able to cut out a number of intermediate steps from the protocol. For example, in one experiment, even for a relatively small model (3 servers, <code class="language-plaintext highlighter-rouge">MaxTerm = 2</code>, <code class="language-plaintext highlighter-rouge">MaxLogLen=1</code>), running the original spec with <code class="language-plaintext highlighter-rouge">RecordGrantedVote</code> and
<code class="language-plaintext highlighter-rouge">LeaderLearnsOfAppliedEntry</code> actions enabled generates 2,060,946 distinct states. With these actions disabled and using the history query based spec, only 27,062 distinct states were generated, a roughly 76x reduction.</p>
<h3 id="query-incrementalization">Query Incrementalization</h3>
<p>Specifying a protocol in terms of history queries is conceptually satisfying and a nice way to abstract away more of the lower level protocol details. It moves the protocol further away from a practical implementation, though, since it’s not realistic for a node to have the ability to continuously read and query over the entire history of all states of other nodes. We can bridge this over to practical implementations, though, by viewing this as an <a href="https://materialize.com/blog/ivm-database-replica/">incremental view maintenance</a> problem.</p>
<p>That is, in a real system, we essentially want to maintain the correct output of these precondition queries based on the current state of the network. We can view this as an online maintenance problem i.e. instead of computing the query output over a giant batch of historical messages, we update the output of the query incrementally as each new message arrives.
This is a formal way to map between the abstract, query-oriented protocol specification and a more practical, operational algorithmic implementation. It is also sufficiently general: as long as we know that the queries we write down can be computed incrementally, any protocol specified in this manner could, in principle, be automatically “incrementalized” into a practical, operational version.</p>
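<p>As a concrete, hypothetical sketch, the quorum-of-votes history query from <code class="language-plaintext highlighter-rouge">BecomeLeader</code> could be maintained incrementally by folding each arriving message into a running vote set, which is essentially what the original <code class="language-plaintext highlighter-rouge">votesGranted</code> variable was doing:</p>

```python
class IncrementalVoteQuery:
    """Incrementally maintains the output of the history query 'has every
    member of some quorum voted for this node in this term?' as messages
    arrive one at a time, instead of rescanning the full history."""

    def __init__(self, node_id, term, quorum_size):
        self.node_id = node_id
        self.term = term
        self.quorum_size = quorum_size
        self.votes = set()  # derived view, analogous to votesGranted[i]

    def on_message(self, m):
        # Delta step: a new message can only ever add to the view.
        if m["currentTerm"] == self.term and m["votedFor"] == self.node_id:
            self.votes.add(m["from"])

    def satisfied(self):
        return len(self.votes) >= self.quorum_size
```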
<p>A lot of previous work has explored the <a href="https://ecommons.cornell.edu/server/api/core/bitstreams/ef203133-30b8-45e8-a504-53b3b5443632/content">foundations</a> of evaluating these types of (first order logic) queries incrementally, particularly in the <a href="https://corescholar.libraries.wright.edu/knoesis/352/">context of Datalog</a>. I’m not as clear, though, what work has been done on automatically “incrementalizing” these types of queries into practical, operational versions for realistic protocols like Raft. <a href="https://speakerdeck.com/jhellerstein/hydroflow-a-compiler-target-for-fast-correct-distributed-programs">Hydroflow</a> might be the closest project tackling similar ideas.</p>
<!-- From an efficiency and optimization perspective, we can also deal with the reasonable objection that passing around every node's full state for any real protocol is infeasible e.g. you can't be passing around an entire Raft log in every message, even though it's easy to do in an abstract spec. So, we can also define transformation functions that operate on the full state of a node for sake of a practical efficiency. For example, if we send a message that contains a node's full state $$s$$, we can send $$m = f(s)$$ into the network, and assume the receiver can easily compute $$s = f^{-1}(m)$$ to get the full state back, so that we could still express our protocol logic in terms of the full state. -->
<!-- For example, in Raft as classically defined, an AppendEntries message may only send one new log entry (or a chunk of them) from a primary to a follower. This is based on a local computation, though, based on its knowledge of its own log and the log application progress (`matchIndex`) of the follower node. So, we can think of this as applying some transformation function $$f$$ on these local state variables to produce a message format that is efficient to send across the network. -->
<h3 id="related-work">Related Work</h3>
<p>This approach is similar to past work on the <a href="https://link.springer.com/article/10.1007/s00446-009-0084-6">Heard-Of Model</a>, and also a specification approach taken in some <a href="https://dl.acm.org/doi/10.14778/3137765.3137778">PaxosStore specifications</a> from WeChat that they refer to as <em>semi-symmetric</em> message passing. The notion of specifying protocols as queries over histories also has been around for a while. This includes the foundational work done on <a href="https://www2.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-173.pdf">Dedalus</a> and <a href="https://bloom-lang.net/">Bloom</a> by Peter Alvaro and also on <a href="https://dl.acm.org/doi/10.1145/2994595">DistAlgo</a>. My understanding is that this also overlapped somewhat with the “relational transducer” model for declarative networking used in <a href="https://netdb.cis.upenn.edu/fvn/ndlogsemantics.pdf">NDLog</a> and <a href="https://arxiv.org/pdf/1012.2858">similar techniques</a>. The general idea of a history-oriented approach to specification has appeared in a kind of folk way in some of Lamport’s <a href="https://github.com/tlaplus/Examples/blob/9ac1cdc8d54ce619105ffed96a7c9b52041733ae/specifications/Paxos/Paxos.tla#L108-L141">original specs</a> of Paxos. Similar concepts also appear in posts on a <a href="https://quint-lang.org/posts/soup#long-story-short">message soup</a> approach to modeling. I believe the <a href="https://hydro.run/papers/hydroflow-thesis.pdf">Hydroflow work</a> is also more recently taking these ideas further by concretely exploring ways to incrementally compute (e.g. compile) network or dataflow queries.</p>
Sat, 07 Mar 2026 00:00:00 +0000
https://will62794.github.io/distributed-systems/2026/03/07/canonical-dist-protocols.html
<h1>Verified Transpilation with Claude</h1>
<p>We can check correctness of a TLA+ specification using the <a href="https://github.com/tlaplus/tlaplus">TLC</a> model checker, which will exhaustively explore a spec’s reachable states to check that a specified property (i.e. an invariant) holds.
TLC was originally developed <a href="https://link.springer.com/chapter/10.1007/3-540-48153-2_6">over 20 years ago</a> and has had a lot of development effort put into it. It is a mature and performant tool, but it is written in Java and is essentially a dynamic interpreter for the TLA+ language. So it likely remains far from the theoretical upper limit of performance for checking finite, explicit-state system models, meaning there are still <a href="https://conf.tlapl.us/2018/kuppe.pdf">performance gains</a> to be had by moving to a lower-level representation for model checking. This is the approach taken by other state-of-the-art model checkers like <a href="https://spinroot.com/spin/whatispin.html">SPIN</a>, which generates C code for model checking that can be compiled and run natively rather than dynamically interpreting the model code.</p>
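<p>Explicit-state checking of this kind is, at its core, a breadth-first exploration of the reachable state graph. A minimal Python sketch over a toy transition system (not TLA+ itself) looks like:</p>

```python
from collections import deque

def explore(init_states, next_states):
    """Exhaustive breadth-first exploration of all reachable states.
    States must be hashable; next_states maps a state to its successors."""
    seen = set(init_states)
    queue = deque(init_states)
    while queue:
        s = queue.popleft()
        for t in next_states(s):
            if t not in seen:
                seen.add(t)
                queue.append(t)
    return seen

# Toy system: a counter modulo 5 that can also be reset to zero.
def successors(s):
    (x,) = s
    return [((x + 1) % 5,), (0,)]
```

<p>Tools like TLC additionally hash each state to a fingerprint so the visited set stays compact, which is where the fingerprint-collision probability estimates in its output come from.</p>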
<h2 id="transpiling-with-claude">Transpiling with Claude</h2>
<p>In general, this kind of lower-level translation task for TLA+ is relatively nontrivial: it requires compiling TLA+ constructs down into some lower level representation (e.g. C/C++ data structures) for compilation and native execution. Building any general approach here requires a somewhat detailed understanding of the language and existing interpreter implementations, and of how to translate specs into a lower level representation while accurately preserving their semantics.</p>
<p>Instead of building a whole compilation engine, we can try asking Claude to do these as one-off translations for us. This is a kind of standard transpilation/compilation task, but in a “bespoke” way, since we’re not aiming to build any kind of generic compiler, and can also take advantage of any details specific to the given problem instance (more and more software problems seem to be falling under this type of “bespoke” category with LLMs). Some <a href="https://arxiv.org/abs/2406.03003">other folks</a> have tried doing this recently for a variety of standard programming languages.</p>
<p>Since we already have TLC as an existing, reference interpreter, we can also ask Claude to generate an automated validation harness for us i.e. one that checks (at least for finite domains), that the output of the optimized C++ version of the model exactly matches that from the original TLA+ model. This gives us a convenient kind of (approximately) verified compilation step for going from high level TLA+ spec to a lower level model.</p>
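<p>The core of such a harness is simple: load both JSON dumps and compare the state sets. Since the two tools need not compute identical fingerprints, a hypothetical sketch like the following (assuming both dumps share a <code class="language-plaintext highlighter-rouge">{ "states": [...] }</code> JSON format) compares canonical serializations of the state values instead:</p>

```python
import json

def canonical(state_val):
    # Canonical, order-independent serialization of one state's value.
    return json.dumps(state_val, sort_keys=True)

def compare_dumps(tlc_dump, cpp_dump):
    """Compare two {"states": [{"fp": ..., "val": ..., "initial": ...}]}
    dumps, returning the states unique to each side (both empty on a
    successful validation)."""
    tlc = {canonical(s["val"]) for s in tlc_dump["states"]}
    cpp = {canonical(s["val"]) for s in cpp_dump["states"]}
    return tlc - cpp, cpp - tlc
```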
<p>We can easily try this out for a given TLA+ spec by condensing this whole workflow into a prompt to Claude Code (i.e. wrap it into a <a href="https://code.claude.com/docs/en/skills">skill</a>). The prompt itself was developed over a few rounds of trial and error and refinement, to make sure Claude knew how to generate scripts with the right arguments, compare outputs properly, etc. The overall prompt is as follows:</p>
<style>
pre {
white-space: pre-wrap; /* Since CSS 2.1 */
white-space: -moz-pre-wrap; /* Mozilla, since 1999 */
white-space: -pre-wrap; /* Opera 4-6 */
white-space: -o-pre-wrap; /* Opera 7 */
word-wrap: break-word; /* Internet Explorer 5.5+ */
font-size: 14px;
border: 1px solid #ccc;
}
.language-markdown{
font-size: 12px;
}
</style>
<figure class="highlight"><pre><code class="language-markdown" data-lang="markdown"><span class="gu">## Generate Optimized C++ version of TLA+ Spec</span>
Take the chosen TLA+ spec (ask the user for which one) and generate a C++ program that generates its full reachable state space as the model checker would do but in a way as optimized as possible for a C++ implementation. Do this single threaded, and check with the user for how to instantiate the constant finite parameters in the compiled C++ version. Ensure that the C++ version dumps out all states in a standard JSON format, and can output this to a JSON file. Assume a general JSON dump format that contains a state array like <span class="sb">`{ "states": [ {fp: <uint64_t>, val: <JSON>, initial: <boolean>}, ... ]}`</span>, where 'val' is the actual JSON representation of that state, and 'fp' is some hash/fingerprint for that state. Also add an option to run this state space exploration with JSON dumping disabled.
Finally generate a Makefile with a simple, barebones default target for building it.
<span class="gu">## Validate Conformance between TLC and C++ version </span>
Now validate to make sure that the set of states generated and dumped into JSON by the C++ version match the set of states generated and dumped by TLC in JSON. Generate a Python script that runs TLC to generate the same state space and dump it to JSON using the tla2tools-checkall.jar binary which supports a <span class="sb">`-dump json states.json`</span> argument, and then have the script validate that the states match between the TLC output and the C++ generated state space.
Generate a simple validation report in Markdown after completing this.
<span class="gu">## Benchmark Throughput Difference</span>
Measure the throughput (states/second) difference in the state space generation states between TLC and the C++ version. Check with the user for the finite model config parameters to use for this run, and update the generated C++ version of the spec to account for this if needed. You can do this benchmark by measuring the total runtime of TLC for an exhaustive run, and measuring its time duration and from this compute distinct states per second, and doing this similarly for the C++ version. When doing this, disable JSON dumping for both TLC and C++ to avoid the associated overhead. In order to measure the throughput of TLC, make sure to use the time duration reported by the final output of TLC. You don't need to do multiple runs of each, a single run is fine.
Generate a simple markdown report file on the results once the benchmark is complete.</code></pre></figure>
<p>From within a <a href="https://github.com/will62794/model-compiler">repo</a>, we can <a href="https://github.com/will62794/model-compiler/blob/main/.claude/commands/compile_tla.md">store this</a> as a Markdown file under <code class="language-plaintext highlighter-rouge">~/.claude/commands</code> and then open up Claude Code and run the <code class="language-plaintext highlighter-rouge">compile_tla</code> command, which will then prompt us to get started with a given TLA+ spec.</p>
<h2 id="benchmarks">Benchmarks</h2>
<p>We can start with a test on the <a href="https://github.com/will62794/model-compiler/blob/main/TwoPhase/TwoPhase.tla"><code class="language-plaintext highlighter-rouge">TwoPhase.tla</code></a> specification, a standard TLA+ example and benchmark modeling two-phase commit. If we start up Claude Code and run our compilation command on this spec, Claude chugs away, with a few interaction points from the user (e.g. to confirm finite model parameters, etc.) and we can see it generate the following validation report, for a finite model with 4 transaction resource managers:</p>
<figure class="highlight"><pre><code class="language-markdown" data-lang="markdown"><span class="gh"># TwoPhase Spec Validation Report</span>
<span class="gu">## Configuration</span>
<span class="p">-</span> <span class="gs">**Spec**</span>: TwoPhase.tla
<span class="p">-</span> <span class="gs">**Resource Managers**</span>: 4 (rm1, rm2, rm3, rm4)
<span class="p">-</span> <span class="gs">**Depth Limit**</span>: Unlimited (full state space)
<span class="gu">## Results Summary</span>
| Metric | TLC | C++ |
|--------|-----|-----|
| Total States | 1568 | 1568 |
| Initial States | 1 | 1 |
| Runtime | 1.90s | 0.48s |
<span class="gu">## State Comparison</span>
<span class="p">
-</span> <span class="gs">**Common states**</span>: 1568
<span class="p">-</span> <span class="gs">**Only in TLC**</span>: 0
<span class="p">-</span> <span class="gs">**Only in C++**</span>: 0
<span class="gu">## Validation Status: PASSED</span></code></pre></figure>
<p>As a sanity check, we can go into this spec’s directory and take a look. Claude generated a <a href="https://github.com/will62794/model-compiler/blob/60c3c076f34d0a2984143205096b952d657c66eb/TwoPhase/saved_outputs/TwoPhase.cpp">456 line C++ file</a>, <code class="language-plaintext highlighter-rouge">TwoPhase.cpp</code>, that when compiled and run produces:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>./twophase
TwoPhase State Space Generator <span class="o">(</span>C++<span class="o">)</span>
Configuration: NUM_RM <span class="o">=</span> 4
Depth limit: unlimited
JSON output: disabled
Exploration complete.
States found: 1568
Transitions: 5377
Duration: 0.000155417 seconds
Throughput: 10088986 states/second
</code></pre></div></div>
<p>If we run TLC with the same model parameters, we get the following:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Model checking completed. No error has been found.
Estimates of the probability that TLC did not check all reachable states
because two distinct states had the same fingerprint:
calculated (optimistic): val = 3.2E-13
5378 states generated, 1568 distinct states found, 0 states left on queue.
The depth of the complete state graph search is 14.
The average outdegree of the complete state graph is 1 (minimum is 0, the maximum 9 and the 95th percentile is 4).
Finished in 00s at (2026-01-20 21:42:36)
</code></pre></div></div>
<p>which serves as a strong extra sanity check that the C++ model is doing the right thing. Even generating exactly the correct number of distinct, reachable states would be hard to cheat, and the generated Python <a href="https://github.com/will62794/model-compiler/blob/81be9fb8c91e87cb354c4982e0ea915c0c59ef4f/TwoPhase/saved_outputs/validate.py">validation script</a> should also ensure that the generated JSON state spaces match exactly between TLC and the C++ version.</p>
<p>As a few extra sanity “spot checks”, we can also run a few manual queries on the JSON outputs from TLC and the C++ version. As an example, one of the generated JSON states from TLC looks like the following:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
</span><span class="nl">"fp"</span><span class="p">:</span><span class="w"> </span><span class="mi">12161962213042174405</span><span class="p">,</span><span class="w">
</span><span class="nl">"val"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"rmState"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"rm1"</span><span class="p">:</span><span class="w"> </span><span class="s2">"working"</span><span class="p">,</span><span class="w">
</span><span class="nl">"rm2"</span><span class="p">:</span><span class="w"> </span><span class="s2">"working"</span><span class="p">,</span><span class="w">
</span><span class="nl">"rm3"</span><span class="p">:</span><span class="w"> </span><span class="s2">"working"</span><span class="p">,</span><span class="w">
</span><span class="nl">"rm4"</span><span class="p">:</span><span class="w"> </span><span class="s2">"working"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"tmState"</span><span class="p">:</span><span class="w"> </span><span class="s2">"init"</span><span class="p">,</span><span class="w">
</span><span class="nl">"tmPrepared"</span><span class="p">:</span><span class="w"> </span><span class="p">[],</span><span class="w">
</span><span class="nl">"msgs"</span><span class="p">:</span><span class="w"> </span><span class="p">[]</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"initial"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<p>so even counting the occurrences of a few field values serves as a reasonable validation hash on the outputs e.g.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ for val in "aborted" "working" "prepared"; do grep -o $val tlc_states.json | wc -l && grep -o $val cpp_states.json | wc -l; done
4144
4144
1120
1120
2272
2272
</code></pre></div></div>
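<p>The same spot check can also be scripted, e.g. as a small Python sketch of this occurrence-count comparison (the helper names here are hypothetical, purely for illustration):</p>

```python
def value_counts(text, values):
    # Count raw occurrences of each field value in a JSON dump string.
    return {v: text.count(v) for v in values}

def spot_check(tlc_text, cpp_text, values=("aborted", "working", "prepared")):
    # The dumps pass the spot check iff every value occurs the same
    # number of times in both outputs.
    return value_counts(tlc_text, values) == value_counts(cpp_text, values)

# Usage against the dumped files from above:
# with open("tlc_states.json") as f1, open("cpp_states.json") as f2:
#     assert spot_check(f1.read(), f2.read())
```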
<p>In general, it feels helpful and important to have these kinds of lightweight, human verifiable “spot checks”. Even with a nicely verifiable task like the one here, asking the model to “verify all its outputs” still doesn’t engender a strong enough degree of trust that it did the right thing, at least for experimental tasks like these where you are less confident in its outputs.</p>
<p>After running the benchmarking step, Claude also generated this report:</p>
<figure class="highlight"><pre><code class="language-markdown" data-lang="markdown"><span class="gh"># TwoPhase State Space Generation Benchmark</span>
<span class="gu">## Configuration</span>
<span class="p">
-</span> <span class="gs">**Spec**</span>: TwoPhase.tla
<span class="p">-</span> <span class="gs">**Resource Managers**</span>: 8 (rm1, rm2, rm3, rm4, rm5, rm6, rm7, rm8)
<span class="p">-</span> <span class="gs">**JSON Output**</span>: Disabled (pure exploration benchmark)
<span class="p">-</span> <span class="gs">**Workers**</span>: 1 (single-threaded)
<span class="p">-</span> <span class="gs">**Benchmark Iterations**</span>: 3
<span class="gu">## Results</span>
| Metric | TLC | C++ |
|--------|-----|-----|
| States | 1,745,408 | 1,745,408 |
| Avg Duration | 49.9251s | 0.851219s |
| Min Duration | 43.0652s | 0.808712s |
| Avg Throughput | 34,961 states/s | 2,050,480 states/s |
| Max Throughput | 40,529 states/s | 2,158,257 states/s |
<span class="gu">## Speedup</span>
<span class="p">
-</span> <span class="gs">**Average throughput speedup**</span>: C++ is <span class="gs">**58.7x faster**</span> than TLC
<span class="p">-</span> <span class="gs">**Peak throughput speedup**</span>: C++ is <span class="gs">**53.3x faster**</span> than TLC</code></pre></figure>
<p>showing that the C++ version achieved over a 50x throughput speedup over TLC for a larger parameter configuration (8 resource managers).</p>
<p>This is pretty cool, and impressive that Claude is able to generate what seems to be a semantically accurate translation of the high-level spec essentially in one shot. It also seems reasonable that validating these kinds of translation steps at smaller finite parameters would be sufficient to assume generalization to larger parameter configurations, e.g. when it is desirable to run larger model checking jobs but infeasible to do full validation at those larger parameters.</p>
<h3 id="abstractdynamicraft">AbstractDynamicRaft</h3>
<p>We can run the same command for another spec, an <a href="https://github.com/will62794/model-compiler/blob/main/AbstractDynamicRaft/AbstractDynamicRaft.tla">abstracted variant of Raft</a> that also includes basic dynamic reconfiguration functionality. This time Claude produces a <a href="https://github.com/will62794/model-compiler/blob/81be9fb8c91e87cb354c4982e0ea915c0c59ef4f/AbstractDynamicRaft/saved_outputs/AbstractDynamicRaft.cpp">740-line C++ file</a> and generates the following validation report:</p>
<figure class="highlight"><pre><code class="language-markdown" data-lang="markdown"><span class="gh"># AbstractDynamicRaft Validation Report</span>
<span class="gu">## Summary</span>
| Metric | Value |
|--------|-------|
| TLC States | 470098 |
| C++ States | 470098 |
| Common States | 470098 |
| Only in TLC | 0 |
| Only in C++ | 0 |
| TLC Initial States | 7 |
| C++ Initial States | 7 |
<span class="gu">## Result</span>
<span class="gs">**✓ PASSED**</span>: The state spaces match exactly.
The C++ implementation correctly generates the same state space as TLC.
<span class="gu">## Configuration</span>
<span class="p">
-</span> <span class="gs">**Spec**</span>: AbstractDynamicRaft.tla
<span class="p">-</span> <span class="gs">**Server**</span>: {n1, n2, n3}
<span class="p">-</span> <span class="gs">**MaxTerm**</span>: 2
<span class="p">-</span> <span class="gs">**MaxLogLen**</span>: 2
<span class="p">-</span> <span class="gs">**MaxConfigVersion**</span>: 2
<span class="p">-</span> <span class="gs">**InitTerm**</span>: 0</code></pre></figure>
<p>followed by the benchmark report:</p>
<figure class="highlight"><pre><code class="language-markdown" data-lang="markdown"><span class="gh"># AbstractDynamicRaft Benchmark Report</span>
<span class="gu">## Summary</span>
Comparison of state space exploration throughput between TLC (Java) and optimized C++ implementation.
| Metric | TLC | C++ |
|--------|-----|-----|
| Distinct States | 470,098 | 470,098 |
| Duration | 60.00s | 1.696s |
| Throughput | 7,835 states/sec | 277,180 states/sec |
<span class="gu">## Performance Analysis</span>
<span class="p">
-</span> <span class="gs">**C++ Speedup**</span>: <span class="gs">**35.4x**</span> faster than TLC
<span class="p">-</span> <span class="gs">**TLC Throughput**</span>: 7,835 states/second
<span class="p">-</span> <span class="gs">**C++ Throughput**</span>: 277,180 states/second
<span class="gu">## Configuration</span>
<span class="p">
-</span> <span class="gs">**Spec**</span>: AbstractDynamicRaft.tla
<span class="p">-</span> <span class="gs">**Server**</span>: {n1, n2, n3}
<span class="p">-</span> <span class="gs">**MaxTerm**</span>: 2
<span class="p">-</span> <span class="gs">**MaxLogLen**</span>: 2
<span class="p">-</span> <span class="gs">**MaxConfigVersion**</span>: 2
<span class="p">-</span> <span class="gs">**InitTerm**</span>: 0
<span class="gu">## Notes</span>
<span class="p">
-</span> JSON state dumping was disabled for both TLC and C++ during benchmarking
<span class="p">-</span> TLC was run with <span class="sb">`-Xmx8g`</span> heap and <span class="sb">`-XX:+UseParallelGC`</span>
<span class="p">-</span> C++ was compiled with <span class="sb">`-O3 -march=native -flto`</span> optimizations
<span class="p">-</span> Single-threaded execution for both</code></pre></figure>
<p>showing a roughly 35x speedup over TLC when running with the C++ version.</p>
<h3 id="bakery">Bakery</h3>
<p>Finally, we can try out one more benchmark, a <a href="https://github.com/will62794/model-compiler/blob/main/Bakery/Bakery.tla">specification</a> of Lamport’s Bakery algorithm for mutual exclusion, with the following validation results:</p>
<figure class="highlight"><pre><code class="language-markdown" data-lang="markdown"><span class="gh"># TLA+ / C++ State Space Validation Report</span>
<span class="gu">## Specification</span>
<span class="p">-</span> <span class="gs">**TLA+ Spec**</span>: Bakery.tla (Lamport's Bakery Algorithm)
<span class="p">-</span> <span class="gs">**Configuration**</span>: N=2, Nat={0,1,2,3}
<span class="gu">## Results</span>
| Metric | TLC | C++ |
|--------|-----|-----|
| Distinct States | 7161 | 7161 |
<span class="gu">## Validation Status</span>
<span class="gs">**PASSED**</span>: State spaces are identical.</code></pre></figure>
<p>followed by the benchmarking report:</p>
<figure class="highlight"><pre><code class="language-markdown" data-lang="markdown"><span class="gh"># TLA+ / C++ State Space Exploration Benchmark Report</span>
<span class="gu">## Specification</span>
<span class="p">-</span> <span class="gs">**TLA+ Spec**</span>: Bakery.tla (Lamport's Bakery Algorithm)
<span class="p">-</span> <span class="gs">**Configuration**</span>: N=3, Nat={0,1,2,3}
<span class="p">-</span> <span class="gs">**Distinct States**</span>: 6,016,610
<span class="gu">## Benchmark Configuration</span>
<span class="p">-</span> JSON state dumping: <span class="gs">**disabled**</span> for both tools (measuring pure exploration throughput)
<span class="p">-</span> TLC: Single worker thread with parallel GC
<span class="p">-</span> C++: Single-threaded BFS exploration with O3 optimization
<span class="gu">## Results</span>
| Metric | TLC | C++ |
|--------|-----|-----|
| Total Time | 102 sec | 5.45 sec |
| Distinct States | 6,016,610 | 6,016,610 |
| Throughput | ~58,986 states/sec | ~1,103,780 states/sec |
<span class="gu">## Performance Comparison</span>
| Metric | Value |
|--------|-------|
| <span class="gs">**Speedup**</span> | <span class="gs">**18.7x**</span> |
| C++ / TLC Throughput Ratio | 18.71 |
<span class="gu">## Analysis</span>
The C++ implementation achieves approximately <span class="gs">**18.7x higher throughput**</span> than TLC for the Bakery algorithm state space exploration.</code></pre></figure>
<p>Again, along with validation success, we get almost a 19x throughput speedup.</p>
<h3 id="final-thoughts">Final Thoughts</h3>
<p>This is another impressive, general capability of coding agents, and also an example that re-frames the types of programming tasks we might traditionally care about solving. In a “classical” view of programming, the only natural way to solve this task would be to build a general-purpose compiler, but LLMs let us consider making these one-off tasks solvable in a <a href="https://www.geoffreylitt.com/2023/03/25/llm-end-user-programming.html"><em>bespoke</em> way</a>, sidestepping the task of building a compiler altogether. So, the intelligence and generality of LLMs can reduce the hardness of the problems that need to be solved, at least for problems that have this “bespoke” quality to them.</p>
<p>It is worth pointing out a variety of caveats that still limit this approach as a practical, real-world solution. First, all of the above was limited to single-threaded execution, while TLC is able to safely run many model checking workers in parallel. Parallelism requires extra care around concurrency control and efficient data structure design, e.g. a shared, concurrent BFS queue must be managed between workers, as well as the state hash (fingerprint) set, which has been a source of nontrivial <a href="https://lemmster.de/talks/MSc_MarkusAKuppe_1497363471.pdf">performance engineering work</a> in the past. Additionally, one of TLC’s unique features is its ability to spill states to disk when they are too large to fit in memory. The above approach would be fundamentally memory limited, though with modern machines this is becoming less of a concern. Nevertheless, this is still quite a promising solution for the inner loop of any model checking or verification task, which ultimately requires fast generation and evaluation of a spec’s transition relation in order to enumerate reachable states.</p>
<p>A recurring pattern in this type of experiment is the balance of trust between you and the agent. Having a verifiable feedback loop helps a lot, but even so, it still felt necessary to review the outputs at different stages and manually verify that things were actually being done correctly and not cheated in odd ways, for example checking that the generated state spaces actually contain real, nonempty sets of states.</p>
<p>There also seemed to be a related, missing feature here, somewhat similar to the <a href="https://arxiv.org/abs/2501.07278">continual learning</a> idea: being confident that the LLM is developing a deeper understanding of the problem at hand, to the point where you trust it to take on things more autonomously. It felt hard to predict when and where it would or wouldn’t make the same mistake twice, or go off on small, unexpected tangents. A lot of these issues can be addressed with careful trial and error and prompt refinement, but there may still be a better kind of “teaching” workflow here, to get an agent up to speed on a new problem and have it record the knowledge it learned more durably. At one point it seemed promising to start off with a “teaching” session, having Claude learn about the workflow and develop its <em>own</em>, repeatable prompt for the task, but I didn’t go very far with this.</p>
<p>As with many LLM-oriented tasks, the determinism of the outputs of these types of workflows was also hazy to understand, and there often seemed to be a better, ideal breakdown of a workflow into steps that are truly “non-deterministic” or LLM-driven and those that can be cached as relatively deterministic scripts (e.g. the validation scripts). When starting out, though, it is easiest to write everything up as a single agent prompt and re-run the workflow from scratch to test it out and experiment. Working with Claude in this way also makes it really nice to think about these experimentation workflows “end to end”, without focusing on chaining together various Python scripts, bash scripts, compilation steps, etc. In a sense, a true kind of <a href="https://research.google/blog/accelerating-scientific-breakthroughs-with-an-ai-co-scientist/">end-to-end research/experimentation workflow</a>, especially when going further and generating whole written reports or visuals from an experiment, something that is typically quite manual and requires a lot of analysis and stitching things together.</p>
<p><em>All the tests here were run with Claude Code v2.1.14 on Opus 4.5, on a 2024 Apple M3 MacBook Pro, with the code and Claude prompts found <a href="https://github.com/will62794/model-compiler">here</a>.</em></p>
Tue, 20 Jan 2026 00:00:00 +0000
https://will62794.github.io/verification/llms/compilation/2026/01/20/bespoke-compilation.html
https://will62794.github.io/verification/llms/compilation/2026/01/20/bespoke-compilation.htmlverificationllmscompilationGit for Transactions<p>The idealized model of a transactional data storage system is one of a sequential, serializable system, where clients can submit transactions and the system ensures the outcomes are as if those transactions were executed against a single copy of that data. In practice, performance limitations of this model have historically pushed systems to explore a wide set of alternative, weakly consistent models.</p>
<!-- An alternate approach is to re-examine the application level requirements we really want out of weakly consistent systems. In general, strongly consistent systems try to offer a sequential (e.g. linearizable/serializable) data store as the fundamental abstraction to a client. Alternatively, weakly consistent systems do not provide such a clean abstraction. -->
<p>In the weakly consistent world, one approach is to define some reasonable isolation or consistency level that is <a href="http://www.bailis.org/papers/ramp-sigmod2014.pdf">applicable to a wide enough range of applications</a>, or allow the application to <a href="https://learn.microsoft.com/en-us/azure/cosmos-db/consistency-levels">tune the consistency level</a> to its specific needs. Another perspective is to lean even more strongly into the detailed mechanics of the weakly consistent model, abandoning the strictly sequential view of storage, and expose this flexibility to users. This is the type of approach explored in <a href="https://www.cs.cornell.edu/~youerpu/papers/2016-sigmod-tardis.pdf">TARDiS</a> (SIGMOD 2016), a concurrency control approach for transactional storage systems that essentially throws out the sequential data store model, and instead adopts weak consistency through an explicit <em>branch and merge</em> concurrency model.</p>
<div style="text-align: center;">
<img src="/assets/git-for-txns/basic-merging-2.png" alt="TARDiS Model" width="530" />
</div>
<p>Essentially, TARDiS adopts a view of transactional isolation and consistency in the style of <a href="https://en.wikipedia.org/wiki/Git">Git</a>: at the start of a transaction, a client forks history onto their own local branch, and is able to perform reads and writes in isolation on this branch. When they have completed their operations, they can “merge” their changes back into a main branch of history. TARDiS leaves this merging task to the application, rather than to the underlying storage layer.</p>
<h2 id="branch-and-merge-transactions">Branch and Merge Transactions</h2>
<p>The proposed TARDiS system consists of a transactional key-value store that tracks conflicting execution branches with 3 mechanisms:</p>
<ol>
<li>branch on conflict</li>
<li>inter-branch isolation</li>
<li>application-specific merge</li>
</ol>
<p>In a standard transactional data store, we can imagine that the entire system consists of a single, linear history of states. As transactions commit, the effects of their write operations are applied to the latest state in this linear history, and new transactions may read from some state in this history (e.g. from either the latest state or some historical snapshot).</p>
<p>TARDiS breaks from this model and instead includes an explicit notion of branching in its data store abstraction. That is, when clients execute transactions, they may do so in <em>single mode</em>, which means they are executing their transactions against a chosen branch, or in <em>merge mode</em>, which allows them to explicitly decide how to merge together conflicting changes across branches. As in Git, branches are isolated from concurrent transactions, so each can be viewed conceptually as its own linear, sequential thread of history by an application.</p>
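<p>As a rough illustration of this model, the following is a minimal Python sketch of a branching state DAG with single-mode transactions. The API and representation here are hypothetical, purely for illustration, and are not the paper’s actual interface:</p>

```python
import itertools

class StateDAG:
    """Hypothetical sketch of a TARDiS-style branching store: each state
    is an immutable key-value snapshot linked to its parent state(s)."""

    def __init__(self):
        self._ids = itertools.count()
        self.root = next(self._ids)
        self.snapshots = {self.root: {}}  # state id -> kv map
        self.parents = {self.root: []}    # state id -> parent state ids

    def begin(self, read_state):
        # Single mode: fork a private, isolated copy of the chosen read state.
        return dict(self.snapshots[read_state])

    def commit(self, commit_state, kv):
        # Append a new state as a child of `commit_state`. Two commits
        # against the same parent fork the history into two branches.
        sid = next(self._ids)
        self.snapshots[sid] = dict(kv)
        self.parents[sid] = [commit_state]
        return sid

dag = StateDAG()
t1 = dag.begin(dag.root); t1["x"] = 1
t2 = dag.begin(dag.root); t2["x"] = 2  # concurrent, isolated transaction
a = dag.commit(dag.root, t1)
b = dag.commit(dag.root, t2)           # history now has two branches
```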
<h3 id="begin-and-commit">Begin and Commit</h3>
<p>In this model, there are a few natural modifications to the lifecycle of a transaction. First, when a transaction begins, it is not obvious where the transaction will “start from”, since there is no longer a global, sequential state history. So, a transaction first needs some strategy for selecting a branch to begin execution against, i.e. which state in the history DAG it will begin from, called its <em>read state</em>. Similarly, upon commit, a transaction can choose a <em>commit state</em>, the state in the history DAG to which it will append its new changes.</p>
<div style="text-align: center;">
<img src="/assets/git-for-txns/tardis-branching.png" alt="TARDiS Model" width="600" />
</div>
<p>There is also a notion of <em>begin</em> and <em>end constraints</em>, which place extra validity conditions on a transaction’s start and commit, allowing a user more control over the degree of local branching. Essentially, <em>begin</em> constraints place conditions on what read states are valid for a transaction to choose from, and <em>end</em> constraints place conditions on whether a transaction is valid to successfully commit.</p>
<div style="text-align: center;">
<img src="/assets/git-for-txns/begin-commit-constraints.png" alt="TARDiS Model" width="470" />
</div>
<p>These constraints can be used and composed to guarantee the properties of various standard database isolation levels e.g. snapshot isolation or serializability.</p>
<p>For example, to achieve <em>serializability</em>, one can combine the constraints of</p>
<ul>
<li>Begin Constraint: <em>Ancestor</em></li>
<li>End Constraint: <em>Serializability</em>, <em>No Branching</em></li>
</ul>
<p>These constraints require that a transaction starts from a read state that is the child of its latest committed transaction, and enforce that, upon commit, the new state does not fork the history. They also implicitly require tracking the read and write sets of transactions on a branch, since for serializability we may need to validate that no concurrent transactions intersected with our read/write sets.</p>
<p>There are also constraints for ensuring <em>snapshot isolation</em>, e.g. doing something similar to the serializability constraints but validating only write-write conflicts between transactions. The paper does not go into depth on the formal definitions of these constraints, but my impression is that they are sufficient to provide guarantees analogous to these standard isolation levels.</p>
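<p>To make the shape of such commit-time validation concrete, here is a hedged Python sketch of end constraints in this style. The function names and exact conflict rules are assumptions for illustration, since the paper does not give the formal definitions:</p>

```python
def serializable_commit_ok(read_set, write_set, concurrent_txns):
    # Sketch of a serializability end constraint: abort if any transaction
    # that committed on the branch after our read state wrote something we
    # read or wrote. `concurrent_txns` is a list of (reads, writes) sets.
    for _, other_writes in concurrent_txns:
        if other_writes & read_set or other_writes & write_set:
            return False
    return True

def snapshot_commit_ok(write_set, concurrent_txns):
    # Snapshot isolation variant: only write-write conflicts abort.
    return all(not (other_writes & write_set)
               for _, other_writes in concurrent_txns)
```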
<h3 id="merging">Merging</h3>
<p>To make the concept of branch merging explicit, TARDiS includes a concept of <em>merge transactions</em>. Conceptually, these can be viewed as similar to standard transactions in <em>single mode</em>, except that they may operate on multiple <em>read states</em> (i.e. multiple branches). These merge transactions are also a bit special in that they are given access to additional structure about the global state DAG, most notably</p>
<ul>
<li><em>Fork Points</em>: the fork point in the history between the set of states being merged</li>
<li><em>Conflict Writes</em>: the set of conflicting writes that occurred on the set of branches being merged.</li>
</ul>
<p>Access to this information allows merge transactions to explicitly resolve conflicts between branches in an application-specific manner. For example, the paper considers a simple counter value that has diverged among conflicting branches. Given the values on each branch and the fork point, a merge transaction can compute a new, resolved value by adding each branch’s difference from the fork-point value to the value at the fork point.</p>
<div style="text-align: center;">
<img src="/assets/git-for-txns/counter-code.png" alt="TARDiS Model" width="460" />
</div>
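<p>A minimal sketch of that counter merge logic in Python (the function name and signature are hypothetical, following the description above):</p>

```python
def merge_counter(fork_value, branch_values):
    # Resolve a diverged counter by applying each branch's delta
    # relative to the value at the fork point.
    return fork_value + sum(v - fork_value for v in branch_values)

# Fork point at 10; one branch incremented to 13 (+3), another to 12 (+2).
merge_counter(10, [13, 12])  # -> 15
```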
<p><br /></p>
<h2 id="concluding-thoughts">Concluding Thoughts</h2>
<p>It is interesting to note how the ideas in this paper echo work that came just a bit later, in Crooks’ work on a <a href="https://www.cs.cornell.edu/lorenzo/papers/Crooks17Seeing.pdf">state-based isolation formalism</a>, which appeared in PODC 2017; it seems similar ideas were being developed concurrently. For example, the notions of <em>read states</em> and <em>end constraints</em> appear quite analogous to the “read state” and “commit test” concepts in the client-centric formulation. In general, both papers share a common conceptual core of viewing transactional isolation models as centered around <em>state-centric histories</em>, i.e. the database moves through a sequence of states over time, new transactions may conceptually read from one of these states, and upon commit may create a new state, appending to this history.</p>
<p>TARDiS is an interesting attempt at managing weakly consistent data interfaces in a more principled manner. On the flip side, my intuition is that managing and merging these branches in a complex application would become burdensome and unintuitive for most application developers. For software and systems builders familiar with Git, DAGs, etc., this may be more palatable, but even in Git I find it rare to deal with merging more than 1-2 branches at a time, and even then, resolving merge conflicts can still be somewhat tedious. Perhaps this type of system, though, would be effective as a more internal layer that other tools/apps could build on top of, rather than one that users interface with directly. Regardless, I think the ideas in the paper are productive and useful as an alternative model for conceptualizing transactions in general, and weak consistency or isolation models especially.</p>
<p>They also note that similar ideas have been explored in the past, including <a href="https://dl.acm.org/doi/10.5555/1267680.1267707">Olive</a> and <a href="https://www.cs.utexas.edu/~lorenzo/corsi/cs380d/papers/p172-terry.pdf">Bayou</a>, and there is sort of a <a href="https://www.dolthub.com/blog/2024-07-08-are-git-branches-mvcc/">folk understanding</a> of the <a href="https://buttondown.com/jaffray/archive/git-workflow-is-snapshot-isolated/">underlying relationships</a> between multiversion concurrency control, Git, snapshot transactions, etc. This also bears similarities to other earlier work on <a href="https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/ecr-esop2012.pdf">eventually consistent transactions</a>, and to some more recent practical systems primitives like the <a href="https://github.com/facebook/rocksdb/wiki/merge-operator">Merge operator</a> in RocksDB.</p>
Tue, 14 Oct 2025 00:00:00 +0000
https://will62794.github.io/databases/transactions/isolation/2025/10/14/git-for-transactions.html
https://will62794.github.io/databases/transactions/isolation/2025/10/14/git-for-transactions.htmldatabasestransactionsisolationOn Writing, Specification, and Outputs<p>In Andy Grove’s <a href="https://www.goodreads.com/book/show/324750.High_Output_Management">High Output Management</a>, on his experiences from management at Intel, he makes a comment about the value of writing “reports” in a business or organizational setting:</p>
<!-- pg. 48 -->
<blockquote>
<p>But reports also have another totally different function. As they are formulated and written, the author is forced to be more precise than he might be verbally. Hence their value stems from the discipline and the thinking the writer is forced to impose upon himself as he identifies and deals with trouble spots in his presentation. Reports are more a <strong>medium</strong> of <strong>self-discipline</strong> than a way to communicate information. Writing the report is important; reading it often is not.</p>
</blockquote>
<p>This comment felt nicely portable to a core type of value derived from the use of formal methods in an engineering and design context, somewhat analogous to Leslie Lamport’s quip on writing and mathematics:</p>
<blockquote>
<p>Writing is nature’s way of letting you know how sloppy your thinking is…Mathematics is nature’s way of letting you know how sloppy your writing is.</p>
</blockquote>
<p>and one that has been <a href="https://www.youtube.com/watch?v=pnfrWPFWbAA">re-emphasized</a> across industry, e.g. recently <a href="https://cacm.acm.org/practice/systems-correctness-practices-at-amazon-web-services/">by folks</a> at AWS:</p>
<blockquote>
<p>First, the act of deeply thinking about and formally writing down distributed protocols forces a structured way of thinking that leads to deeper insights about the structure of protocols and the problem to be solved.</p>
</blockquote>
<p>In the setting of Grove’s book, as its title suggests, a core theme is about measuring <em>outputs</em> of your work, not merely <em>activity</em>. For programmers or other types of lower-level individual contributors, outputs may be significantly easier to measure quantitatively, e.g. lines of code written, features shipped, bugs fixed, etc. For managers (or, more generally, a broader class of modern knowledge workers or researchers), these outputs may be less tangible and harder to measure concretely.</p>
<p>A main idea of his framework is that, more abstractly, output can essentially be measured as the output of a team you manage and/or the output of the other people or teams in an organization that you influence. A related aspect of this framework is one of Grove’s alternate definitions of “manager”, which he calls a <em>know-how</em> manager:</p>
<!-- page 40 -->
<blockquote>
<p>If the manager is a knowledge specialist, a <em>know-how</em> manager, his potential for influencing neighboring organizations is enormous. The internal consultant who supplies needed insight to a group struggling with a problem will affect the work and the output of the entire group…Thus, the definition of a “manager” should be broadened: individual contributors, who gather and disseminate know-how and information should also be seen as middle managers, because they exert great power within the organization.</p>
</blockquote>
<p>This <em>know-how manager</em> concept is perhaps natural in organizations where individual engineers may hold significant influence even if they have not formally assumed a “management” role. It also broadens the space of possible outputs one might produce. That is, if disseminating know-how or transferring knowledge to others is one way to increase the output of an organization, we may view the writing process similarly. Deepening and expanding one’s own understanding of a problem by writing (or specifying) is an activity that later increases the output of the organization, when others need to come to understand or extend that problem or system. If writing or specifying deepens one’s understanding of a problem or domain, this should ultimately be valuable, since it implies additional leverage in the future, when that person serves as a core contributor of insights or knowledge on a domain that impacts many others in the organization.</p>
<p>Further applying this “output-oriented” framework to Grove’s first statement above would imply that a detailed written report (or a formal specification, analogously) may not necessarily be a direct <em>output</em> (if no-one reads it, how could it be?). Rather, the valuable output associated with writing a report may be decoupled from the written report itself. Instead, it is associated with the way it impacts the eventual output of a team, or organization. That is, the process of writing itself and the <em>understanding, clarity, and knowledge</em> gained in the process is more closely tied to output here, though less tangible and harder to measure concretely.</p>
<p>The ideas presented by Grove here also felt somewhat related and applicable to a <a href="https://youtu.be/vtIzMaLkCaM?feature=shared&t=1288">classic talk</a> on academic writing from Larry McEnerney, where he notes:</p>
<blockquote>
<p>In the real world you’re going to stop paying your readers to care about what’s inside of your head…You think writing is communicating ideas to your readers. It is not…It’s not conveying your ideas to your reader…It’s changing their ideas.</p>
</blockquote>
<p>So, in this setting (academic writing), we can consider a fundamental output to be <em>changing the reader’s ideas</em>. This framing abstracts the process away from the concrete written document itself. That is, the output of an academic researcher is not papers, strictly, but the degree to which they can change the behavior or ideas of others who consume their work. I think this also serves as a decent proxy for the concept of <em>impact</em>, especially in modern academic research, where a published tool, benchmark, dataset, blog post, talk, lecture, or tweet may have equal if not more impact (e.g. effect on others’ ideas/behaviors) than a traditional academic paper.</p>
<p>I found Grove’s writing on this topic a helpful lens to think about the value of writing, and particularly the use of formal methods and formal specification in an engineering context, especially given that I have often been discouraged when discussions in this domain search for justifications of value based on how well a formal specification can map to a running system implementation, or help test and validate that system implementation more effectively. Those are certainly valuable auxiliary goals, but I find it helpful to separate them from the other, core value in developing these kinds of specifications, which may often be quite abstract or difficult to measure, similar to writing a report that no-one ever reads.</p>
<!-- I also find the following a helpful thought exercise: if I wrote a detailed formal specification of a problem to the point of convincing myself of a detailed, rigorous problem statement and solution, and then threw away the specification entirely, would this still have been a valuable process for me and/or the broader organization? Similarly, I find it is often not always a useful framing to measure the success of a specification process in terms of effective long-term maintenance of that specification and its conformance with an underlying system implementation. That is, we wouldn't consider it a negative outcome of a design process if our written design document becomes out of sync with the system implementation and is not maintained in accordance over time. It is obvious and natural to expect this of design documents, so it should not be unexpected for specifications. -->
Fri, 19 Sep 2025 00:00:00 +0000
https://will62794.github.io/formal-methods/specification/writing/2025/09/19/on-writing-and-specification.html
https://will62794.github.io/formal-methods/specification/writing/2025/09/19/on-writing-and-specification.htmlformal-methodsspecificationwritingLogless Raft<p>The standard use of Raft is for implementing a fault tolerant, replicated state machine by means of a replicated <em>log</em>, maintained at each server within a replication group. Depending on the nature of the state we want to replicate, we can employ a simpler variant of Raft that achieves the same essential correctness properties. We can call this <em>logless Raft</em> and it can be useful when we are only replicating a single, small piece of state (e.g. configuration, metadata, etc.) between servers.</p>
<h3 id="simplifying-log-management">Simplifying Log Management</h3>
<p>There is a lot of machinery included in the standard descriptions of Raft related to the intricacies of replicating log entries between servers, recording the applied indices of the log on each server, etc. (e.g. <code class="language-plaintext highlighter-rouge">matchIndex</code>, <code class="language-plaintext highlighter-rouge">nextIndex</code>, <code class="language-plaintext highlighter-rouge">commitIndex</code>). There are also strategies specifically for dealing with cleanup and garbage collection of stale, divergent logs, <a href="https://github.com/ongardie/raft.tla/blob/6ecbdbcf1bcde2910367cdfd67f31b0bae447ddd/raft.tla#L375-L382">handled</a> as part of the <code class="language-plaintext highlighter-rouge">AppendEntries</code> request/response flow.</p>
<div style="text-align: center">
<img src="/assets/logless-raft/raft-algo1.png" alt="Logless Raft Diagram" width="260" />
<img src="/assets/logless-raft/raft-algo2.png" alt="Logless Raft Diagram" width="260" />
</div>
<p>Most of this log and index management machinery is bookkeeping around what log entries a node (e.g. a leader) should send to other nodes (<code class="language-plaintext highlighter-rouge">nextIndex</code>), what entries other nodes have received so far (<code class="language-plaintext highlighter-rouge">matchIndex</code>), and which entries have been marked as committed (<code class="language-plaintext highlighter-rouge">commitIndex</code>). These details may be required from an implementation perspective, but from a protocol correctness perspective they are somewhat extraneous.</p>
<p>We can reduce this complexity with a variant of Raft that gets rid of the lower-level implementation details around log index management and propagation of this information. Instead, Raft servers can send their <em>entire logs</em> to each other in each message. Receiving nodes can, based on their local log state and the log they received, determine which entries (if any) they can go ahead and append to their own log. Individual nodes no longer track any of the <code class="language-plaintext highlighter-rouge">nextIndex</code>/<code class="language-plaintext highlighter-rouge">matchIndex</code> bookkeeping variables, and the information flow between leaders/followers can also become more symmetric, e.g. both can propagate their entire logs to each other as a way of communicating new updates or feedback about which log entries have been appended.</p>
<h3 id="log-merging-vs-log-replication">Log Merging vs. Log Replication</h3>
<p>In this model, both log append and log truncation operations, normally incremental processes that may occur via repeated rounds of <code class="language-plaintext highlighter-rouge">AppendEntries</code> messages from a leader, are subsumed into a single <em>log merge</em> operation. That is, when a node \(i\) receives a log from node \(j\), it determines whether it can install this incoming log based on certain conditions.</p>
<p>At a high level, these conditions can be expressed as a check of whether node \(i\)’s own log, \(log[i]\), is a prefix of \(log[j]\). If so, it is safe for the node to extend its log to the received log, by updating \(log[i]\) to the value of \(log[j]\). If \(log[i]\) is not a prefix of \(log[j]\), then the node must check for a “staleness” or “divergence” condition, by comparing the last term of both logs. If \(i\)’s log has an older last term than \(log[j]\), then it is safe to replace \(log[i]\) with \(log[j]\). Otherwise, it is not safe for the node to modify its own log.</p>
<p>In both cases, this “prefix” check can be implemented in Raft by simply comparing the last term of each log, similar to how logs are compared in standard vote requests in Raft. That is, if the terms of the last entry in each log are the same, then the prefix check can be done by comparing log lengths; otherwise, the check is done by comparing the terms of the last entry in each log, with newer terms taking precedence.</p>
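<p>As a rough sketch of this merge rule (not taken verbatim from the TLA+ specification), we can model a log simply as a list of entry terms and compare logs by (last term, length):</p>

```python
def last_term(log):
    """Term of a log's last entry, or 0 for an empty log."""
    return log[-1] if log else 0

def merge(local, incoming):
    """Return the log a node should hold after receiving `incoming`.

    Logs are modeled as lists of entry terms. The "prefix" and
    "staleness" checks reduce to comparing (last term, length).
    """
    if last_term(incoming) > last_term(local):
        # Local log is stale/divergent: safe to replace it wholesale.
        return list(incoming)
    if last_term(incoming) == last_term(local) and len(incoming) > len(local):
        # Same last term, so the local log is a prefix: safe to extend.
        return list(incoming)
    # Otherwise, keep the local log unchanged.
    return local
```

<p>Note that a log like <code class="language-plaintext highlighter-rouge">[1, 2]</code> is preserved even against a longer log <code class="language-plaintext highlighter-rouge">[1, 1, 1]</code>, since newer terms take precedence over length.</p>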
<p>A simplified version of this Raft variant is defined in <a href="https://github.com/will62794/raft-logless/blob/main/AbstractRaft.tla">this TLA+ specification</a> (along with an <a href="https://will62794.github.io/spectacle/#!/home?specpath=https%3A%2F%2Fraw.githubusercontent.com%2Fwill62794%2Fraft-logless%2Frefs%2Fheads%2Fmain%2FAbstractRaft.tla&constants%5BServer%5D=%7Bs1%2Cs2%2Cs3%7D&constants%5BSecondary%5D=Secondary&constants%5BPrimary%5D=Primary&constants%5BNil%5D=Nil&constants%5BInitTerm%5D=0&constants%5BMaxTerm%5D=3&constants%5BMaxLogLen%5D=3&trace=318c702a">explorable version</a>). In that specification the <code class="language-plaintext highlighter-rouge">MergeEntries</code> action represents the key “log merge” operation, and encodes the log prefix checking rules for both append and/or garbage collection.</p>
<h3 id="a-closer-look-at-raft-log-structure">A Closer Look at Raft Log Structure</h3>
<p>We can gain some additional intuition on the above merging view with another, closer look at the way that logs are structured across nodes in classic Raft. Specifically, we can view the set of all node logs as forming a global <em>log tree</em> structure, where each node’s local log is a “view” on this global tree, i.e. a local log can be seen as a path in this tree. Over time, new branches may be created or pruned from this tree (e.g. via log truncation), and nodes may sync their local logs to come back in sync with (newer) branches.</p>
<p>We can illustrate this more concretely if we look at a sample protocol behavior through this lens. The diagram below shows a behavior from the above TLA+ specification of the abstract variant of Raft with a configuration of 4 servers (<code class="language-plaintext highlighter-rouge">{n1,n2,n3,n4}</code>). The log tree structure shown is defined so that nodes correspond to log entries (i.e. <code class="language-plaintext highlighter-rouge">(index,term)</code> pairs) and edges correspond to adjacent log entries in some given log across any node. The log tree is also annotated with each node’s current “position” in the tree, i.e. the log entry that corresponds to its current last log entry (nodes with an empty log are simply omitted in those annotations), and entries marked as committed are highlighted in green. A special “root” node in gray denotes an empty log, the initial state for all nodes.</p>
<table style="margin: 0 auto;min-width: 780px;">
<tr>
<td style="text-align: left; font-size: 12px;padding: 15px"><b>State 0</b>: Initial State</td>
<td style="text-align: left;padding: 18px">
<img src="/assets/logless-raft/imgs/log_tree_0.png" alt="Logless Raft Diagram" height="50.160000000000004" />
</td>
</tr>
<tr>
<td style="text-align: left; font-size: 12px;padding: 15px"><b>State 1</b>: BecomeLeader(n1, ['n1', 'n2', 'n3'])</td>
<td style="text-align: left;padding: 18px">
<img src="/assets/logless-raft/imgs/log_tree_1.png" alt="Logless Raft Diagram" height="50.160000000000004" />
</td>
</tr>
<tr>
<td style="text-align: left; font-size: 12px;padding: 15px"><b>State 2</b>: ClientRequest(n1)</td>
<td style="text-align: left;padding: 18px">
<img src="/assets/logless-raft/imgs/log_tree_2.png" alt="Logless Raft Diagram" height="50.160000000000004" />
</td>
</tr>
<tr>
<td style="text-align: left; font-size: 12px;padding: 15px"><b>State 3</b>: ClientRequest(n1)</td>
<td style="text-align: left;padding: 18px">
<img src="/assets/logless-raft/imgs/log_tree_3.png" alt="Logless Raft Diagram" height="50.160000000000004" />
</td>
</tr>
<tr>
<td style="text-align: left; font-size: 12px;padding: 15px"><b>State 4</b>: ClientRequest(n1)</td>
<td style="text-align: left;padding: 18px">
<img src="/assets/logless-raft/imgs/log_tree_4.png" alt="Logless Raft Diagram" height="50.160000000000004" />
</td>
</tr>
<tr>
<td style="text-align: left; font-size: 12px;padding: 15px"><b>State 5</b>: MergeEntries(n2, n1)</td>
<td style="text-align: left;padding: 18px">
<img src="/assets/logless-raft/imgs/log_tree_5.png" alt="Logless Raft Diagram" height="50.160000000000004" />
</td>
</tr>
<tr>
<td style="text-align: left; font-size: 12px;padding: 15px"><b>State 6</b>: MergeEntries(n3, n1)</td>
<td style="text-align: left;padding: 18px">
<img src="/assets/logless-raft/imgs/log_tree_6.png" alt="Logless Raft Diagram" height="50.160000000000004" />
</td>
</tr>
<tr>
<td style="text-align: left; font-size: 12px;padding: 15px"><b>State 7</b>: CommitEntry(n1, ['n1', 'n2', 'n3'])</td>
<td style="text-align: left;padding: 18px">
<img src="/assets/logless-raft/imgs/log_tree_7.png" alt="Logless Raft Diagram" height="50.160000000000004" />
</td>
</tr>
<tr>
<td style="text-align: left; font-size: 12px;padding: 15px"><b>State 8</b>: ClientRequest(n1)</td>
<td style="text-align: left;padding: 18px">
<img src="/assets/logless-raft/imgs/log_tree_8.png" alt="Logless Raft Diagram" height="50.160000000000004" />
</td>
</tr>
<tr>
<td style="text-align: left; font-size: 12px;padding: 15px"><b>State 9</b>: BecomeLeader(n2, ['n2', 'n3', 'n4'])</td>
<td style="text-align: left;padding: 18px">
<img src="/assets/logless-raft/imgs/log_tree_9.png" alt="Logless Raft Diagram" height="50.160000000000004" />
</td>
</tr>
<tr>
<td style="text-align: left; font-size: 12px;padding: 15px"><b>State 10</b>: ClientRequest(n2)</td>
<td style="text-align: left;padding: 18px">
<img src="/assets/logless-raft/imgs/log_tree_10.png" alt="Logless Raft Diagram" height="109.78" />
</td>
</tr>
<tr>
<td style="text-align: left; font-size: 12px;padding: 15px"><b>State 11</b>: MergeEntries(n3, n2)</td>
<td style="text-align: left;padding: 18px">
<img src="/assets/logless-raft/imgs/log_tree_11.png" alt="Logless Raft Diagram" height="109.78" />
</td>
</tr>
<tr>
<td style="text-align: left; font-size: 12px;padding: 15px"><b>State 12</b>: MergeEntries(n4, n2)</td>
<td style="text-align: left;padding: 18px">
<img src="/assets/logless-raft/imgs/log_tree_12.png" alt="Logless Raft Diagram" height="109.78" />
</td>
</tr>
<tr>
<td style="text-align: left; font-size: 12px;padding: 15px"><b>State 13</b>: CommitEntry(n2, ['n2', 'n3', 'n4'])</td>
<td style="text-align: left;padding: 18px">
<img src="/assets/logless-raft/imgs/log_tree_13.png" alt="Logless Raft Diagram" height="109.78" />
</td>
</tr>
<tr>
<td style="text-align: left; font-size: 12px;padding: 15px"><b>State 14</b>: ClientRequest(n2)</td>
<td style="text-align: left;padding: 18px">
<img src="/assets/logless-raft/imgs/log_tree_14.png" alt="Logless Raft Diagram" height="109.78" />
</td>
</tr>
<tr>
<td style="text-align: left; font-size: 12px;padding: 15px"><b>State 15</b>: BecomeLeader(n3, ['n1', 'n3', 'n4'])</td>
<td style="text-align: left;padding: 18px">
<img src="/assets/logless-raft/imgs/log_tree_15.png" alt="Logless Raft Diagram" height="109.78" />
</td>
</tr>
<tr>
<td style="text-align: left; font-size: 12px;padding: 15px"><b>State 16</b>: ClientRequest(n3)</td>
<td style="text-align: left;padding: 18px">
<img src="/assets/logless-raft/imgs/log_tree_16.png" alt="Logless Raft Diagram" height="139.04" />
</td>
</tr>
<tr>
<td style="text-align: left; font-size: 12px;padding: 15px"><b>State 17</b>: MergeEntries(n1, n2)</td>
<td style="text-align: left;padding: 18px">
<img src="/assets/logless-raft/imgs/log_tree_17.png" alt="Logless Raft Diagram" height="109.78" />
</td>
</tr>
</table>
<!-- <div style="text-align: center">
<img src="/assets/logless-raft/log_tree_filmstrip.png" alt="Logless Raft Diagram" width="380">
</div> -->
<p>When a new leader gets elected, a “fork” may be created in this tree, if the new leader did not contain all previously created (but uncommitted) log entries. For example, this first occurs in State 10, when node <code class="language-plaintext highlighter-rouge">n2</code> has become leader and written a new entry but without the log entry <code class="language-plaintext highlighter-rouge">(4,1)</code> created by <code class="language-plaintext highlighter-rouge">n1</code>. Similarly, another fork is created when a branch via <code class="language-plaintext highlighter-rouge">n3</code> is created in State 16.</p>
<p>Note also that local log “pointers” move along paths in this tree as new logs are replicated or “merged” around. For example, in the State 10 to State 11 transition, <code class="language-plaintext highlighter-rouge">n3</code> replicates the log from <code class="language-plaintext highlighter-rouge">n2</code>, and so moves its pointer in the tree ahead to entry <code class="language-plaintext highlighter-rouge">(4,2)</code>. Note also that due to the key “log matching” property that is maintained in Raft, <code class="language-plaintext highlighter-rouge">(index, term)</code> pairs uniquely identify prefixes/paths within this tree.</p>
<p>Pruning of branches in this tree occurs when a node with an old/stale log merges its log with a newer one. In standard Raft, this pruning also occurs, but typically in stages, e.g. first a node truncates its log, and then it replicates new entries to come into alignment with an up-to-date branch. For example, in State 17, <code class="language-plaintext highlighter-rouge">n1</code> has merged itself onto the newer branch of <code class="language-plaintext highlighter-rouge">n2</code>, pruning its older, stale branch ending in entry <code class="language-plaintext highlighter-rouge">(4,1)</code>.</p>
<p>This perspective on Raft logs helps to provide intuition on the “merging” strategy we outlined above. Local logs can be seen as views or paths in this global tree structure, and replication of logs between nodes can be viewed as a way of bringing divergent branches back in sync and replicating a branch to a sufficient number of nodes to ensure safe commit.
Note that this <a href="https://decentralizedthoughts.github.io/2021-07-17-simplifying-raft-with-chaining/">blog post</a> on chaining in Raft puts forth similar perspectives, partially through the lens of blockchain protocols.</p>
<h3 id="going-logless">Going Logless</h3>
<p>In this abstract, “merging”-based variant of Raft, lower-level log management operations have been abstracted away. That is, the entire log is a monolithic piece of state that is replicated between nodes in one shot, and we only care about some notion of logical “ordering” between two different logs, which is determined by the “last term” ordering condition described above.</p>
<p>In this monolithic log model, we only care about a comparison between the ends of each log. So, it is relatively straightforward to see that we can view such a protocol as simply storing a “rolled up” log at each node, i.e. storing the full piece of state that corresponds to applying all entries in a log, tagged with the “(last index, last term)” of that log. When we propagate logs around, we don’t actually need to send the whole log, but only the state corresponding to application of that log’s entries. And we can easily compare two pieces of this state by simply comparing the tagged index/term values.</p>
<!-- With the key property that we prefer nodes to accept only logically "newer" logs. This can be modeled by simply viewing the log as an arbitrary piece of state that is tagged with a version/index number, which is increased every time a new version is created. -->
<p>From this perspective, we can now imagine a variant of Raft that stores some arbitrary piece of state, which gets updated “in-place” via client operations at a leader node. This state is propagated to followers via messages that contain the entire state, and they decide whether or not to install the newer state based on this simple “merging” logic, which performs the logical version comparison between their own local state and the state they received. When we write a new entry down on a leader, we can simply update that state in-place and increment the “index” (which can perhaps more appropriately be called an object “version”) for the local state.</p>
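<p>As a minimal sketch of this idea (with illustrative names, not drawn from any particular implementation), each node can hold a single state object tagged with a (term, version) pair, and install an incoming state only if it is logically newer, using the same comparison rule as the log merge:</p>

```python
from dataclasses import dataclass, field

@dataclass
class LoglessReplica:
    """A 'logless' node: a single rolled-up state tagged with
    (term, version) in place of a full log."""
    state: dict = field(default_factory=dict)
    version: int = 0   # plays the role of the last log index
    term: int = 0      # plays the role of the last log term

    def client_write(self, updates):
        # A leader updates the state in place and bumps the version.
        self.state.update(updates)
        self.version += 1

    def receive(self, state, version, term):
        # Install the incoming state only if it is logically newer,
        # comparing term first, then version (as in the log merge rule).
        if (term, version) > (self.term, self.version):
            self.state = dict(state)
            self.version, self.term = version, term
```

<p>A follower that receives a state tagged with an older (term, version) pair simply ignores it, just as a node with a newer log ignores a stale incoming log.</p>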
<h3 id="related-work">Related Work</h3>
<p>There are many other versions of logless or “register” style consensus algorithms. Recent proposals like <a href="https://arxiv.org/abs/1802.07000">CASPaxos</a> and <a href="https://arxiv.org/pdf/2001.03362">RMWPaxos</a> try to do something similar for Paxos-based systems, and there is also a history of literature on “<a href="https://groups.csail.mit.edu/tds/papers/Lynch/FTCS97.pdf">atomic registers</a>”, implementing this type of primitive in a distributed fashion. This <a href="https://distributedthoughts.com/2017/03/27/log-less-consensus/">post</a> from the author of <a href="https://arxiv.org/pdf/1702.04242">Bizur</a> also discusses similar ideas.</p>
<p>I haven’t seen this logless variation specifically appear in the context of a Raft-based protocol, though it is essentially similar to the ideas employed in the design of a <a href="https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.OPODIS.2021.26">new reconfiguration protocol</a> within MongoDB’s Raft-based consensus system. It is also somewhat informative to derive this logless variant through a series of relatively straightforward modifications to standard, “log-based” versions of Raft.</p>
Mon, 25 Aug 2025 00:00:00 +0000
https://will62794.github.io/distributed-systems/consensus/2025/08/25/logless-raft.html
https://will62794.github.io/distributed-systems/consensus/2025/08/25/logless-raft.htmldistributed-systemsconsensusSimple Serializable Snapshot Isolation<p>In <a href="https://arxiv.org/abs/2405.18393"><em>A Critique of Snapshot Isolation</em></a>, published in EuroSys 2012, the authors present <em>write-snapshot isolation</em>, a simple but clever approach to making snapshot isolation serializable. This work was published a few years after Michael Cahill’s original work on <a href="https://courses.cs.washington.edu/courses/cse444/08au/544M/READING-LIST/fekete-sigmod2008.pdf"><em>Serializable Snapshot Isolation</em></a> (TODS 2009), and around a similar time as the work of Dan Ports on <a href="https://dl.acm.org/doi/10.14778/2367502.2367523">implementing serializable snapshot isolation</a> (VLDB 2012), which applied Cahill’s ideas in PostgreSQL.</p>
<p>At the highest level, the idea of this paper is that instead of detecting and aborting “write-write” conflicts, as is done in <a href="https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/tr-95-51.pdf">classic snapshot isolation</a>, it is sufficient to guarantee serializability by instead detecting and preventing “read-write” conflicts. That is, a conflict where one transaction writes to a key that is read by another concurrent transaction. They also show that, at least for some workloads, there is no significant fundamental concurrency/performance impact of this approach vs. snapshot isolation.</p>
<h2 id="snapshot-isolation">Snapshot Isolation</h2>
<p>Classic snapshot isolation ensures that each transaction observes a consistent snapshot of the database, and prevents conflicting writes by concurrent transactions. There are standard lock-based and lock-free implementations of SI, both of which basically rely on the assignment of a “read” and “commit” timestamp to each transaction. That is, a centralized <em>timestamp oracle</em> is used to assign timestamps for ordering transactions. A transaction \(T_i\) with read timestamp \(T_s(T_i)\) will read the latest version of data with commit timestamp \(\delta < T_s(T_i)\). Two transactions conflict if they (1) write to the same row \(r\) and (2) have temporal overlap: \(T_s(T_i) < T_c(T_j)\) and \(T_s(T_j) < T_c(T_i)\) (i.e. their read and commit timestamp spans overlap).</p>
<div style="text-align: center">
<img src="/assets/diagrams/critique-of-si/classic-si.png" alt="Write-snapshot isolation lock-free algorithm" width="400px" />
</div>
<p>Google’s <a href="https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Peng.pdf">Percolator system</a> implemented a standard lock-based version of SI, which adds <em>lock</em> and <em>write</em> columns, where the <em>write</em> column maintains the commit timestamp. Basically, it runs a 2PC algorithm, and updates the <em>lock</em> column on all modified rows during the first phase of 2PC. If a transaction tries to write into a locked item, it may either wait, abort, or force the transaction holding that lock to abort. In the second phase of 2PC, the data is then updated with the commit timestamp and the locks are removed. Slow or failed transactions that are holding locks, though, may prevent others from making progress.</p>
<p>A basic lock-free implementation of snapshot isolation can be done using a centralized oracle that is responsible for receiving commit requests from all transactions and checking for conflicts.</p>
<div style="text-align: center">
<img src="/assets/diagrams/critique-of-si/lock-free-si.png" alt="Write-snapshot isolation lock-free algorithm" width="480px" />
</div>
<p>This algorithm checks, for each row \(R\) modified by a transaction, whether there is temporal overlap with any other transaction on that row, i.e. whether any other transaction has concurrently written to it. If so, the transaction must be aborted. Otherwise, it is assigned a new commit timestamp and allowed to commit, marking each of its modified rows with the newly chosen commit timestamp.</p>
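<p>A minimal sketch of this commit-time check (with illustrative names, assuming a single-threaded oracle) can track, per row, the commit timestamp of its latest writer:</p>

```python
class SIOracle:
    """Sketch of a centralized timestamp oracle for classic snapshot
    isolation: a committing transaction aborts if any row in its write
    set was committed by another transaction after its read timestamp."""
    def __init__(self):
        self.ts = 0
        self.last_commit = {}  # row -> commit timestamp of its latest writer

    def start(self):
        # Assign a read timestamp to a new transaction.
        self.ts += 1
        return self.ts

    def commit(self, read_ts, write_set):
        # Write-write conflict: some row in the write set was committed
        # by a temporally overlapping transaction.
        if any(self.last_commit.get(r, 0) > read_ts for r in write_set):
            return None  # abort
        self.ts += 1
        for r in write_set:
            self.last_commit[r] = self.ts
        return self.ts  # commit timestamp
```

<p>With two concurrent transactions writing the same row, whichever commits second sees the other’s newer commit timestamp on that row and aborts.</p>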
<h2 id="serializability">Serializability</h2>
<p>The paper first examines the question of what role write-write conflicts play in snapshot isolation and serializability. The most standard example of non-serializable snapshot isolation histories are those containing <em><a href="https://jepsen.io/consistency/phenomena/a5b">write skew</a></em> anomalies, where transactions don’t write to conflicting keys, but may update keys in a way that violates some global constraint.</p>
<p>They note, though, that aborting transactions on write-write conflicts is also overly restrictive in some ways i.e. transactions will be aborted in some cases even if no serialization anomaly would manifest. They consider a modified variant of the <em><a href="https://jepsen.io/consistency/phenomena/p4">lost update</a></em> anomaly, like</p>
\[r_1(x) \, \, w_2(x) \, \, w_1(x) \, \, c_1 \, \, c_2\]
<p>Standard write-write conflict checks will abort one of these transactions unnecessarily, since a lost update anomaly won’t actually manifest here.
As they summarize:</p>
<blockquote>
<p>In other words, write-write conflict avoidance of snapshot isolation, besides allowing some histories that are not serializable, unnecessarily lowers the concurrency of transactions by preventing some valid, serializable histories.</p>
</blockquote>
<h2 id="making-snapshot-isolation-serializable">Making Snapshot Isolation Serializable</h2>
<p>Instead of detecting write-write conflicts of concurrent transactions, as done under classic snapshot isolation, they introduce <em>write-snapshot isolation</em> (WSI), which instead detects and aborts <em>read-write</em> conflicts. They state the conflict conditions more formally as:</p>
<ol>
<li>RW-spatial overlap: \(T_j\) writes into row \(r\) and \(T_i\) reads from row \(r\);</li>
<li>RW-temporal overlap: \(T_s(T_i) < T_c(T_j) < T_c(T_i)\).</li>
</ol>
<p>Essentially, if a transaction \(T_j\) is concurrent with \(T_i\) and writes a key \(k\) that \(T_i\) reads from, this is manifested as a conflict and \(T_i\) must be prevented from committing. Most importantly, write-snapshot isolation is sufficient to strengthen snapshot isolation to be fully serializable.</p>
<div style="text-align: center">
<img src="/assets/diagrams/critique-of-si/write-si-diagram.png" alt="Write-snapshot isolation lock-free algorithm" width="380px" />
</div>
<!-- ### Read-Only Transactions -->
<p>They also point out that the simple condition of checking for read-write conflicts is not quite precise enough, and would, by default, lead to unnecessary aborts of read-only transactions. Read-only transactions needn’t abort even if they fall under the conflict detection condition for write-snapshot isolation, since they don’t affect the values read by other, concurrent transactions.</p>
<p>They prove that write-snapshot isolation is serializable, by basically showing that you can use commit timestamps of transactions for a serial ordering, and that read-write conflict detection is sufficient to ensure that all transaction reads would be equivalent to those read in a serial history, since they are not allowed to proceed if they conflict with a concurrent write into their read set. And, similarly, the output of writes from each transaction is maintained and respects the commit timestamp ordering.</p>
<p>They present a lock-free implementation of write-snapshot isolation, which augments the classic SI approach by recording both the read set \(R_r\) and write set \(R_w\) of each transaction, which are used upon transaction commit at an “oracle”.</p>
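<p>A sketch of this commit check (illustrative names, assuming a single-threaded oracle): relative to classic SI’s write-write check, it tests the transaction’s <em>read set</em> against recently committed writes, and exempts read-only transactions:</p>

```python
class WSIOracle:
    """Sketch of the write-snapshot isolation commit check: detect
    read-write conflicts instead of write-write conflicts."""
    def __init__(self):
        self.ts = 0
        self.last_commit = {}  # row -> commit timestamp of its latest writer

    def start(self):
        # Assign a read timestamp to a new transaction.
        self.ts += 1
        return self.ts

    def commit(self, read_ts, read_set, write_set):
        # Read-only transactions (empty write set) never need to abort.
        if write_set:
            # Read-write conflict: a temporally overlapping transaction
            # committed a write into this transaction's read set.
            if any(self.last_commit.get(r, 0) > read_ts for r in read_set):
                return None  # abort
        self.ts += 1
        for r in write_set:
            self.last_commit[r] = self.ts
        return self.ts  # commit timestamp
```

<p>Under this check, a write skew pair (two concurrent transactions that each read both rows and write different ones) cannot both commit, since the second committer finds a concurrent write in its read set.</p>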
<div style="text-align: center">
<img src="/assets/diagrams/critique-of-si/write-si-lock-free-algo" alt="Write-snapshot isolation lock-free algorithm" width="450px" />
</div>
<p>This is a nice idea since it is mostly the same as write-write conflict detection of SI, just a bit generalized to handle reads as well as writes. Marc Brooker makes some similar observations in a related <a href="https://brooker.co.za/blog/2024/12/17/occ-and-isolation.html">blog post</a>.</p>
<h2 id="performance">Performance</h2>
<p>Their approach raises the question of how different classic SI is from write-snapshot isolation in terms of histories that are allowed or proscribed. Intuitively, it doesn’t seem that there would be something inherently more restrictive about the prevention of read-write conflicts vs. write-write conflicts.</p>
<p>They compare the concurrency level offered by a centralized, lock-free implementation of write-snapshot isolation with that of a <a href="https://dl.acm.org/doi/10.1109/DSNW.2011.5958809">standard snapshot isolation implementation</a>. They implemented both snapshot isolation and write-snapshot isolation in HBase (an open-source clone of <a href="https://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf">BigTable</a>) to test this. Overall, they test a YCSB workload variant with both normally distributed and zipfian (modeling the case where some items are extremely popular) row selection, and find that there is essentially minimal performance difference between the two, at least for these (somewhat artificial) workloads.</p>
<!-- ## Comparison with Other Approahces -->
<div style="text-align: center">
<img src="/assets/diagrams/critique-of-si/eval1.png" alt="Write-snapshot isolation lock-free algorithm" width="810px" />
</div>
<p>They find similar results for abort rate comparison between SI and WSI, with the latter being slightly higher, but the overall difference is negligible.</p>
<div style="text-align: center">
<img src="/assets/diagrams/critique-of-si/eval2.png" alt="Write-snapshot isolation lock-free algorithm" width="810px" />
</div>
<h2 id="concluding-thoughts">Concluding Thoughts</h2>
<p>Overall, this paper provides a nice perspective on snapshot isolation in general, and a re-consideration of its underlying assumptions. One takeaway is a reinforcement of the somewhat arbitrary delineations between isolation levels. For example, snapshot isolation is intuitive in some ways i.e. every transaction reads from a consistent snapshot, but the notion of write-write conflicts is somewhat arbitrary. This paper kind of sheds light on that by showing that, in some sense, read-write conflicts are actually the more “natural” type of conflict you would care about, at least in the sense that they give you a more fundamental guarantee i.e. serializability. I’m not sure that the specific anomalies allowed under SI (i.e. write skew) are fundamentally intuitive in any way.</p>
<!-- Furthermore, it also helps provide some insight on the question of basically why a set of given transactions as executed by a databases will or will not satisfy serializability. Read-write dependencies seem to capture this more fundamentally. In other words, if a set of transaction execute concurrently, serializability issues arise if there is some way that the execution of those transactions will not be equivalent to executing them sequentially. And, this will only occur when there are some types of dependencies between these transactions. Namely, when one transaction reads from a key that another transaction writes to. Read-write dependencies are the somewhat natural dependency category you care about when considering serializability anomalies (at least at levels with snapshot read guarantees). -->
<p>Note that <a href="http://pmg.csail.mit.edu/papers/adya-phd.pdf">Adya style formalisms</a> and considerations of these types of anomalies center around the concept of <em>anti-dependencies</em> and their appearance in cycles (e.g. the <a href="https://jepsen.io/consistency/phenomena/g2">G2 anomaly class</a>). Adya defines an anti-dependency for a transaction that writes a newer version of a value read by another transaction. Cahill’s <a href="https://dl.acm.org/doi/10.1145/1620585.1620587">work on serializable snapshot isolation</a> builds on <a href="https://dsf.berkeley.edu/cs286/papers/ssi-tods2005.pdf">earlier results from Fekete</a> (TODS 2005), which showed that any non-serializable SI history must contain a cycle with two consecutive anti-dependency edges, and furthermore, that each of these edges involves two transactions that are active concurrently. For example, in the classic write skew anomaly, such a cycle exists with just two transactions, each with a mutual anti-dependency on the other, satisfying Fekete’s condition. Cahill’s technique basically tracks incoming and outgoing \(rw\) dependencies. This bears similarities to the global approach of checking read set / write set conflicts between transactions, as done in WSI, but uses per-transaction metadata.</p>
<!-- They do note, however, that -->
<!-- Adya defines anti-dependency for a transaction that writes a newer version of a value read by another transaction. -->
<!--
I think it's useful to think about a starting point of something like [RAMP transactions](http://www.bailis.org/papers/ramp-sigmod2014.pdf), which essentially provide snapshot-based reads of classic snapshot isolation, but with no write-write conflicts. Classic SI augements this with write-write conflicts, whereas WSI -->
<p>Some of the ideas from this paper have been more directly implemented in <a href="https://hypermode.com/blog/badger-txn">transactional systems like Badger</a>, and are similar to how serializable transactions are <a href="https://dl.acm.org/doi/pdf/10.1145/3318464.3386134">implemented in CockroachDB</a>.</p>
Tue, 13 May 2025 00:00:00 +0000
https://will62794.github.io/databases/transactions/isolation/2025/05/13/simple-serializable-snapshot-isolation.html
https://will62794.github.io/databases/transactions/isolation/2025/05/13/simple-serializable-snapshot-isolation.htmldatabasestransactionsisolationTransactions as Transformers<p>Database transactions are traditionally modeled as a sequence of read/write operations on a set of keys, where each read operation returns some value and each write sets a key to some value. This is reflected in most of the formalisms that define various transactional isolation semantics (<a href="https://pmg.csail.mit.edu/papers/icde00.pdf">Adya</a>, <a href="https://www.cs.cornell.edu/lorenzo/papers/Crooks17Seeing.pdf">Crooks</a>, etc.).</p>
<p>For many isolation levels used in practice in modern database systems (e.g. snapshot isolation and above), we can alternatively view transactions as <em>state transformers</em>. That is, instead of a lower-level sequence of read/write operations, a transaction can be viewed as a function that takes in a current state and returns a set of modifications to a subset of database keys, based on values it read from that state. This view is not fully general in its applicability to all isolation levels (e.g. read committed), but we can explore this perspective and how it simplifies various aspects of reasoning about existing isolation levels and their anomalies.</p>
<p><!-- this may not be the best model, and leads to some unnecessary confusion and complexity. --></p>
<h3 id="state-transformer-model">State Transformer Model</h3>
<p>Most standard formalisms represent a transaction as a sequence of read/write operations over a subset of some fixed set of database keys and values, e.g.</p>
\[T:
\begin{cases}
&r(x,v_0) \\
&r(y, v_1) \\
&w(x, v_2) \\
&r(z, v_0)
\end{cases}\]
<p>For transactions operating at isolation levels that read from a consistent database snapshot, though (e.g. <a href="https://drops.dagstuhl.de/storage/00lipics/lipics-vol042-concur2015/LIPIcs.CONCUR.2015.58/LIPIcs.CONCUR.2015.58.pdf">Read Atomic</a> and stronger), we can think of transactions as more cleanly partitioned between a “read phase” and an “update phase”. That is, we can consider the “input” of a transaction as the subset of keys it reads from its snapshot, and its “output” as writes to some subset of keys, each of which can depend, at most, on some subset of keys read from that transaction’s snapshot. In other words, at levels like snapshot isolation, although a transaction may seem to pass through many stages, doing reads and writes internally, we can compact these stages down into a single read phase and a single update phase.</p>
<p>We can formalize this idea into the view of transactions as <em>state transformers</em>. For example, for a database with a key set \(\mathcal{K}=\{x,y,z\}\), we can consider an example of a transaction modeled in this way:</p>
\[T:
\begin{cases}
&\mathcal{R}=\{x,y\} \\
&x' = f_x(y,z) \\
&y' = f_y()
\end{cases}\]
<p>In this representation, \(\mathcal{R}=\{x,y\}\) is the set of keys read by the transaction upfront, and each \(f_k\) is a <em>key transformer</em> function i.e. a pure function describing the updates that get applied to each key \(k\) that is being updated by that transaction. Each such function can optionally depend on the values read from the current snapshot state for that transaction. We can refer to the set of keys passed as arguments to a key transformer as the <em>update dependencies</em> of a key.</p>
<p>Note that we separate transactions into “read” and “update” portions, where the read-only phase is considered to happen upfront, and its key set may or may not differ from the keys appearing in the update dependencies of the key transformers. We might choose a different model in which transactions consist only of the key transformer functions, but the above formulation also lets us include read-only transactions in our model.</p>
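<p>As a concrete illustration, here is one minimal way to encode this model in Python. This is just a sketch: the representation (a read set plus per-key dependency/function pairs) and the particular transformer bodies are illustrative assumptions, not part of any formal definition above.</p>

```python
# A minimal, hypothetical encoding of the state transformer model.
# A transaction is a read set plus one "key transformer" per updated key;
# each transformer is a pure function of the key values it depends on.

# Database state: key -> value
snapshot = {"x": 0, "y": 1, "z": 2}

# Transaction T from the text: reads {x, y}; x' = f_x(y, z), y' = f_y().
# The lambda bodies here are made up purely for illustration.
T = {
    "reads": {"x", "y"},
    "transformers": {
        # key: (update dependency keys, pure function of those dependencies)
        "x": (("y", "z"), lambda y, z: y + z),  # f_x(y, z)
        "y": ((), lambda: 42),                  # f_y(): a constant ("blind") write
    },
}

def apply_txn(state, txn):
    """Apply every key transformer against a single snapshot of `state`."""
    new_state = dict(state)
    for key, (deps, f) in txn["transformers"].items():
        new_state[key] = f(*(state[d] for d in deps))
    return new_state

print(apply_txn(snapshot, T))  # {'x': 3, 'y': 42, 'z': 2}
```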
<!-- More formally, we simply define a transaction as a set of key transformer functions $$F_\mathcal{W_T}$$, where $$\mathcal{W_T} \subseteq \mathcal{K}$$ is the set of keys that the transaction updates, and each transformer function $$f_k$$ for $$k \in \mathcal{W_T}$$ is a function over some subset of key dependencies $$\mathcal{D_k} \subseteq \mathcal{K}$$. -->
<h3 id="isolation-anomalies-and-lost-update">Isolation Anomalies and Lost Update</h3>
<p>Viewing transaction operations as being composed of these key transformer functions, i.e. functions that read some values and produce some writes, helps clarify some awkward aspects of existing transaction isolation models and their treatment of anomalies, particularly because most traditional transaction formalisms don’t make these kinds of update/transformer operations an explicit, first-class member of their formal model.</p>
<p>For example, we can consider some standard treatments of the <em>lost update</em> anomaly. In the <a href="https://software.imdea.org/~andrea.cerone/works/Framework.pdf">Cerone 2015</a> framework, they represent transactions as sequences of read/write operations over a set of keys.</p>
<div style="text-align: center">
<img src="/assets/diagrams/txn-transformers/cerone-defs-new.png" alt="Transaction Isolation Models" width="710" style="border: 1px solid gray;padding: 5px;margin-bottom: 5px;" />
</div>
<p>When they define the <em>lost update</em> anomaly in their model, though, they must skirt the issue somewhat by resorting to a notion of “application code” that could have produced this sequence of writes:</p>
<div style="text-align: center">
<img src="/assets/diagrams/txn-transformers/cerone-lost-update-explanation.png" alt="Transaction Isolation Models" width="700" style="border: 1px solid gray;padding: 5px;margin-bottom: 25px;" />
</div>
<div style="text-align: center">
<img src="/assets/diagrams/txn-transformers/cerone-lost-update.png" alt="Transaction Isolation Models" width="410" style="border: 1px solid gray;padding: 2px;" />
</div>
<p>This is common across many <a href="https://www.cs.umb.edu/~poneil/ROAnom.pdf">other descriptions of the <em>lost update</em> anomaly</a>. One unsatisfying aspect of these descriptions is that the underlying formalism doesn’t accurately capture the underlying reason the anomaly occurs. In addition, lost update is often presented as the canonical anomaly that “write-write” conflicts in snapshot isolation (SI) exist to prevent. But if we removed the reads from \(T_1\) and \(T_2\) in the above example, this SI approach would still need to abort one of the transactions, and it’s not really clear why that is necessary, since the notion of “lost update” goes away when both transactions are doing only writes, albeit to the same key.</p>
<p><!-- along with the standard explanation that snapshot isolation specifically prevents lost update anomalies by preventing write-write conflicts between concurrent transactions. These explanations can be somewhat unsatisfying, though, and the formal representation of these issues don't do a great job of illustrating the general, underlying issue at play. --></p>
<!-- In the base formalism, though, it's not quite clear exactly why exactlywe consider the above example an anomaly, and why abort of write-write conflicts necessarily remedies the issue. For example, given two write-only transactions that write to the same key, basic snapshot isolation approaches will abort one of them, but this clearly won't cause an anomaly, and it seems confusing as to why such writes would need to be aborted in the first place. -->
<p>One view is that anomalies like <em>lost update</em> (the specific anomaly that write-write conflicts in snapshot isolation are supposed to prevent) are fundamentally unnatural to express without a model that accounts for the true “update” semantics (e.g. read-write dependencies) between transactions. In other words, the underlying problem of “lost update” arises specifically when a write depends on a value that was read in the same transaction. Most formalisms don’t make this “write that depends on a read” semantics explicit, though, and so resort to a vague notion of “application code” that might have performed such updates.</p>
<!-- In other words, if two transactions conflict by writing to the same key, what's the problem? One of them will commit after the other, and the database state will then reflect this as it should, and from an external observer's perspective (i.e. another transaction), this is no different than if the two transactions had executed in some serial order. These type of anomalies only "make sense" by resorting to some vague higher level "application code" notion. What we really care about is whether the value of that write was computed based specifically on the values that it read. Most existing formalisms don't make this explicit, and just kind of gloss over it with the mention of "application code". -->
<p>In the state transformer model, we might say that a more precise definition of <em>lost update</em> is the case where two transactions \(T_1\) and \(T_2\) update the same key \(x\) via key transformers \(f^{T_1}_x\) and \(f^{T_2}_x\) <em>and</em> \(x\) is a dependency of one of these transformer functions e.g.</p>
\[\begin{aligned}
f^{T_1}_x(x) = x + 1 \\
f^{T_2}_x(x) = x + 3
\end{aligned}\]
<p>That is, a lost update is a problem specifically because of the read-write dependency that exists between the two transactions. This creates a potential serializability anomaly: if you execute two transactions with transformers as in the above example, the order of these transactions matters for the final outcome, since they incur a semantic (read-write) dependency on each other. That is, if they both execute on the same data snapshot and are allowed to commit, the result will be semantically incorrect, i.e. you really have “lost” one of the updates, since the outcome will be either \(x=1\) or \(x=3\), but not \(x=4\) as it should be (assuming \(x=0\) in the shared snapshot).</p>
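<p>We can see this concretely with a small sketch that runs the two transformers above, first against a shared snapshot and then in serial composition. The variable names here are purely illustrative.</p>

```python
# Lost update: both transformers depend on the value of x that they read.
f_t1 = lambda x: x + 1   # f_x^{T1}(x) = x + 1
f_t2 = lambda x: x + 3   # f_x^{T2}(x) = x + 3

x0 = 0  # value of x in the shared snapshot

# Concurrent execution: both transactions compute against the same snapshot,
# so whichever write commits last determines the final state.
concurrent_outcomes = [f_t1(x0), f_t2(x0)]   # x ends up as 1 or 3

# Serial execution composes the transformers instead (either order gives 4).
serial = f_t2(f_t1(x0))

print(concurrent_outcomes, serial)  # [1, 3] 4
```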
<p>Similarly, such an anomaly can also arise with a different dependency structure e.g.</p>
\[\begin{aligned}
f^{T_1}_x() &= 6 \\
f^{T_2}_x(x) &= x + 3
\end{aligned}\]
<p>In this case, the order of execution <em>can</em> matter, but if these transactions are concurrent and \(T_1\)’s write were allowed to “win”, then we end up in the state \(x=6\), which is equivalent to the scenario where the transactions executed serially with \(T_2\) going first. If \(T_2\)’s write “wins”, though, then (again assuming \(x=0\) initially) we end up in a state where \(x=3\), which is not equivalent to either serial execution of these transactions, which produce \(x=9\) (\(T_1 \rightarrow T_2\)) or \(x=6\) (\(T_2 \rightarrow T_1\)).</p>
<p>Finally, we can also have a case where both transactions perform “blind” writes to the same key, incurring no dependency on each other, e.g.</p>
\[\begin{aligned}
f^{T_1}_x() &= 6 \\
f^{T_2}_x() &= 3
\end{aligned}\]
<p>In this case, no true lost update anomaly can manifest, since the resulting state after both transactions commit will always be equivalent to their execution in some sequential order. Essentially, existing transaction formalisms can be seen as behaving as in this case, i.e. as if all key transformers have no key dependencies. That is, they always write “constant” values, i.e. values that are not dependent on anything the transaction read.
<!-- This is the case because a semantic notion of "dependence" is not explicitly representable in most of these formalisms. --></p>
<!-- In such a world, we might argue that "lost update" isn't a "true" anomaly at all, since if two transactions conflict by writing to the same key, what's the problem? One of them will commit after the other, and the database state will then reflect this as it should, and from an external observer's perspective (i.e. another transaction), this is no different than if the two transactions had executed in some serial order. -->
<h3 id="write-skew-and-a-generalized-view-of-anomalies">Write Skew and a Generalized View of Anomalies</h3>
<p>This transformer model also gives us a way to see that <em>lost update</em> is really a special case of a more general class of anomalies. In particular, two transactions writing to the same key is not a fundamental condition for this class of anomalies; it just happens to hold in the <em>lost update</em> special case.</p>
<p>For example, we can also consider <em>write skew</em>, the canonical anomaly permitted under snapshot isolation, within this framework. The classic write skew example manifests when two transactions write to non-intersecting key sets, but both update keys in a way that may break some external “semantic” constraint. As illustrated again in Cerone by example:</p>
<div style="text-align: center">
<img src="/assets/diagrams/txn-transformers/cerone-write-skew.png" alt="Transaction Isolation Models" width="710" style="border: 1px solid gray;padding: 2px;" />
</div>
<p>We can represent this example in the state transformer model as something like:</p>
\[T_1: \quad
\begin{aligned}
f_x(x,y) &= \text{if } (x + y) > 100 \text{ then } (x - 100) \text{ else } x \\
\end{aligned}\]
\[T_2: \quad
\begin{aligned}
f_y(x,y) &= \text{if } (x + y) > 100 \text{ then } (y - 100) \text{ else } y
\end{aligned}\]
<p>In this case, even though transactions \(T_1\) and \(T_2\) write to disjoint keys, the key transformers in each transaction depend on both keys, \(x\) and \(y\), due to their conditional update logic.
Again, the core problem arises from the read-write dependencies between these transactions: the writes of one transaction change values in the update dependency key set of the other’s key transformers. Thus, their order of execution matters, and so the resulting state will not be equivalent to some serial execution.</p>
<p>Viewed from this perspective, it is easier to understand <em>lost update</em> and <em>write skew</em> as special cases of a more general class of anomalies that can arise when there are data dependencies between the <em>write set</em> of one transaction and the <em>update dependency set</em> of another. This provides a more general view of this type of anomaly, i.e. such anomalies arise when there exist read-write dependencies between a set of transactions.</p>
<p>For example, we can also consider subtle variations on the standard write-skew example:</p>
\[\begin{aligned}
T_1: \quad &f_x(y) = y - 50 \\ \\
T_2: \quad &f_y(x,y) = y + x
\end{aligned}\]
<p>This isn’t quite the same as the classical write skew constraint-violation example, but it can still cause a serialization anomaly. For example, from a starting state \((x=100, y=200)\), executing both transactions concurrently (i.e. against the same snapshot) leaves us in the state</p>
\[(x=150, y=300)\]
<p>whereas serial execution gives us either</p>
\[(x=150, y=350)_{T_1 \rightarrow T_2}\]
\[(x=250, y=300)_{T_2 \rightarrow T_1}\]
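<p>A minimal Python sketch of the execution just described, checking the arithmetic for the concurrent case and both serial orders:</p>

```python
# The write-skew variant above, executed from (x=100, y=200).
f_x = lambda y: y - 50        # T1: x' = f_x(y)
f_y = lambda x, y: y + x      # T2: y' = f_y(x, y)

s0 = {"x": 100, "y": 200}

# Concurrent: both transformers read from the shared snapshot s0.
concurrent = {"x": f_x(s0["y"]), "y": f_y(s0["x"], s0["y"])}

# Serial T1 -> T2.
s1 = dict(s0, x=f_x(s0["y"]))
serial_12 = dict(s1, y=f_y(s1["x"], s1["y"]))

# Serial T2 -> T1.
s2 = dict(s0, y=f_y(s0["x"], s0["y"]))
serial_21 = dict(s2, x=f_x(s2["y"]))

print(concurrent)   # {'x': 150, 'y': 300}
print(serial_12)    # {'x': 150, 'y': 350}
print(serial_21)    # {'x': 250, 'y': 300}
```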
<!-- From one perspective, we could say this is simply the case that occurs when the "read sets" and "write sets" of transactions intersect. This is not fully accurate, though. -->
<p>The state transformer model also sheds light on various quirks of standard transaction isolation definitions and implementations that appear somewhat confusing or ad hoc when examined in more detail.
For example, in classic snapshot isolation, write-write conflict handling forces one of two transactions to abort if they both write to the same key. In theory, this behavior exists to prevent lost update anomalies, as mentioned above. In reality, though, this is a very coarse-grained and conservative way to prevent these anomalies. What we really care about is aborting a write that <em>depended</em> on a value that was written by another transaction. Aborting on write conflicts is just one way to prevent the particular “lost update” special case (i.e. when two transactions directly write to the same key), and does nothing to handle write skew and the more general class of update-based anomalies. Moreover, it is overly conservative: if two transactions do no reads but write to the same key, there is no strict need to abort either of them, though one of them must be aborted under most snapshot isolation implementations.</p>
<p>Similarly, there are related workarounds in other models where a basic notion of “read set”/“write set” conflicts is not precise enough. For example, in <a href="https://arxiv.org/abs/2405.18393"><em>A Critique of Snapshot Isolation</em></a>, the authors make the simple but clever observation that detecting <em>read-write</em> conflicts (rather than <em>write-write</em>) is sufficient to make snapshot isolation serializable. But even here, they have to add a special case for read-only transactions, e.g.</p>
<blockquote>
<p>Plainly, since a read-only transaction does not perform any writes, it does not affect the values read by other transactions, and therefore does not affect the concurrent transactions as well. Because the reads in both snapshot isolation and write-snapshot isolation are performed on a fixed snapshot of the database that is determined by the transaction start timestamp, the return value of a read operation is always the same, independent of the real time that the read is executed. Hence, a read-only transaction is not affected by concurrent transactions and intuitively does not have to be aborted….In other words, the read-only transactions are not checked for conflicts and hence never abort.</p>
</blockquote>
<p>When viewed in the state transformer model, we can see why read-only transactions need not be checked for conflicts: none of their reads are used in any update dependency key set. Anomalies of this class only arise when reads are used as part of a true “update”, which most models just don’t explicitly represent.</p>
<p>Another related aspect is that the transformer-based view can, in theory, assist in finer-grained conflict analysis between transactions. For example, in the <em>write-snapshot</em> isolation approach, any transaction that does writes may be prone to abort due to a read-write conflict (i.e. it read a key that was written by a concurrent transaction). If transactions have very large read sets, this makes them very prone to aborts, since their read conflict surface area is large. But we should really only be concerned with keys that are read <em>and</em> used in some update dependency set of a key transformer of that transaction; e.g. a transaction may read thousands of keys while only a small number of them are used in an update dependency.</p>
<p>The state transformer model helps us see many of these special anomaly cases in a unified way. For example, two write-only transactions that write to the same key can be understood as both having transformer functions with empty key dependency sets, so no conflict manifests under the conflict rules of this model. Similarly for read-only transactions: if a transaction does reads but none of those reads are actually used as dependencies of a key transformer function, no conflict needs to manifest.</p>
<p>Overall, given a set of transactions that may execute concurrently, we can say that, in our state transformer model, serialization anomalies (and therefore conflicts) may arise if there exist dependencies between the write set of one transaction and the key transformer dependency set of another. This generalizes both lost update and write skew into a broader, unified class of anomalies. Furthermore, it becomes clear that special cases like “write conflicts” or “lost updates” are not fundamentally about “transactions writing to the same key”, but are rather instances of the general problem of these update dependency relationships.</p>
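<p>This generalized conflict condition is simple enough to sketch directly. The hypothetical encoding below (per-key dependency sets, as in the earlier examples) flags a potential conflict exactly when one transaction’s write set intersects the other’s update dependency set, which classifies both lost update and write skew as conflicts while letting blind writes through.</p>

```python
# A transaction is a dict of key transformers: key -> (dependency keys, fn).

def write_set(txn):
    return set(txn["transformers"])

def update_deps(txn):
    return {d for deps, _ in txn["transformers"].values() for d in deps}

def may_conflict(t1, t2):
    """Generalized condition: one txn writes into the other's update deps."""
    return bool(write_set(t1) & update_deps(t2) or
                write_set(t2) & update_deps(t1))

# Lost update: both write x, and both transformers depend on x.
lost_update = ({"transformers": {"x": (("x",), lambda x: x + 1)}},
               {"transformers": {"x": (("x",), lambda x: x + 3)}})

# Blind writes: both write x, but with no dependencies -- no conflict needed.
blind = ({"transformers": {"x": ((), lambda: 6)}},
         {"transformers": {"x": ((), lambda: 3)}})

# Write skew: disjoint write sets, but each depends on the other's write.
skew = ({"transformers": {"x": (("x", "y"), lambda x, y: x)}},
        {"transformers": {"y": (("x", "y"), lambda x, y: y)}})

print(may_conflict(*lost_update), may_conflict(*blind), may_conflict(*skew))
# True False True
```

Note that a read-only transaction has an empty write set and no transformers, so it trivially never conflicts under this rule, matching the special case discussed above.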
<p><!-- is no need to special case read-only transactions, since they already satisfy the condition we defined above which is about the dependency set of each key transformer. This provides a more unifying view to understand when serialization anomalies will arise. Furthermore, we can make a finer-grained distinction that allows transactions to proceed if the key transformer updates are "constant". --></p>
<!--
### Fekete's Read-Only Anomaly
What about Fekete's [read-only snapshot transaction anomaly](https://www.cs.umb.edu/~poneil/ROAnom.pdf)? The state transformer view also provides a simplified view on this. Fekete's original example is given as follows:
$$
T_1:
\begin{cases}
&r(y,0) \\
&w(x,-11)
\end{cases}
$$
$$
T_2:
\begin{cases}
&r(x,0) \\
&r(y,0) \\
&w(y,20)
\end{cases}
$$
$$
T_3:
\begin{cases}
&r(x,0) \\
&r(y,20) \\
\end{cases}
$$
He claims that this is a "read-only" transaction anomaly since if you remove $$T_3$$ then the execution of $$T_1$$ and $$T_2$$ in isolation is serializable. But I think this argument is a bit misleading, since if you look at a prior example from the same paper of basic *write skew*, it is shown as follows, for 2 transactions:
$$
T_1:
\begin{cases}
&r(x,70) \\
&r(y,80) \\
&w(x,-30)
\end{cases}
$$
$$
T_2:
\begin{cases}
&r(x,70) \\
&r(y,80) \\
&w(y,-20)
\end{cases}
$$
It seems that almost the same argument would apply here. That is, why can't we say that $$T_1$$ and $$T_2$$ serializable (if we remove $$T_1$$'s read of $$x$$) for the same reason as in the ROA scenario, even though this is claimed to exhibit a "write skew" anomaly? Again, the problem here is that with "blind writes" there isn't a precise way to define these type of anomalies without resorting to explicit, operation level "update" semantics. In the state transformer model, Fekete's write skew example would more accurately be represented as:
$$
T_1:
\begin{cases}
&\mathcal{R}=\{x,y\} \\
&x' = \text{if } x + y > 100 \text{ then } (x - 100) \text{ else } x
\end{cases}
$$
$$
T_2:
\begin{cases}
&\mathcal{R}=\{x,y\} \\
&y' = \text{if } x + y > 100 \text{ then } (y - 100) \text{ else } y \\
\end{cases}
$$
So, one way to view this is that the read-only anomaly is really just a way that write skew becomes "visible" in the default existing formal model (???)
This demystifies the "read-only" anomaly somewhat, showing that it isn't really fundamentally different from write skew case, but just an awkward artifact of the way that many default formalisms express transactions and anomalies. -->
<h3 id="merging-and-deterministic-scheduling">Merging and Deterministic Scheduling</h3>
<p>This state transformer view of transactions also opens up a few interesting questions about whether we can be smarter when thinking about conflicts. That is, if transactions are formally expressed in this state transformer structure, we could consider cases where, instead of aborting transactions that encounter certain types of conflicts (e.g. write-write), we semantically merge the effects of their key transformers into a unified operation that reflects the correct, sequential execution of both transformers. This is in essence similar to <a href="https://inria.hal.science/inria-00555588v1/document">CRDT-based ideas</a>, but applied in the context of a more classic transaction processing paradigm. Similarly, we might consider related ideas on <a href="https://www.cs.cornell.edu/~matthelb/papers/morty-eurosys23.pdf">“re-execution”</a>, e.g. we could imagine that, if a conflict is detected from a concurrent transaction writing into your update dependency key set, it may be fine to simply re-compute the result of your update based on the written value, dynamically “correcting” the serialization anomaly.</p>
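<p>A minimal sketch of the re-execution idea, using the earlier lost update transformers: if \(T_1\) commits first, \(T_2\)’s transformer can simply be re-evaluated against the committed value rather than aborted. (This is a toy illustration of the idea, not the algorithm of any particular system.)</p>

```python
# Re-execution sketch: recompute a conflicting transformer instead of aborting.
f_t1 = lambda x: x + 1   # T1's transformer on x
f_t2 = lambda x: x + 3   # T2's transformer on x

x0 = 0                   # shared snapshot value
x_after_t1 = f_t1(x0)    # T1 commits first: x = 1

# Naive snapshot execution of T2 would commit f_t2(x0) = 3, losing T1's update.
# Re-execution instead recomputes T2's transformer on the committed state:
x_final = f_t2(x_after_t1)
print(x_final)  # 4 -- equivalent to the serial order T1 -> T2
```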
<p>Similarly, if transactions can be represented in this fashion for most practical systems/isolation levels, this raises the question of whether we can also apply ideas from deterministic transaction scheduling (i.e. <a href="https://cs.yale.edu/homes/thomson/publications/calvin-sigmod12.pdf">Calvin</a>-style), since in theory the read/write sets of a transaction are known upfront. In practice, many transactions still determine their full read/write sets dynamically (or via predicates) as they execute, but it may still be useful to model transactions within this formalism, even if it doesn’t fully align with the execution model in practice.</p>
<!-- Also, even in Adya type models, which explicitly model read and write dependencies between transactions, they still don't make explicit this finer-grained notion of update dependencies, since they just determine a "read dependency" as one that may lead to an update based anomaly if the transaction uses that read value in an update expression. For read only transactions, though, we don't need to be worried about these anomalies, since the values read are not being used in update computations. (???) -->
<p>Note that the existing world of stored procedure transactions, and one-shot transaction models like those described in <a href="https://research.google/pubs/spanner-googles-globally-distributed-database-2/">Spanner</a>, share similarities with this state transformer view of transactions. It is also similar to formal reasoning arguments that move all reads to the beginning of a transaction, and all writes to the commit point, to simplify reasoning.
It may be helpful, though, to make this type of model a more fundamental part of the isolation formalism itself.</p>
Sun, 04 May 2025 00:00:00 +0000
https://will62794.github.io/databases/transactions/isolation/2025/05/04/transformers-not-transactions.html
https://will62794.github.io/databases/transactions/isolation/2025/05/04/transformers-not-transactions.htmldatabasestransactionsisolationModern Views of Transaction Isolation<p>There have been many attempts to formalize the <a href="https://jepsen.io/consistency">zoo of various transaction isolation and consistency concepts</a> over the years. It is <a href="https://dl.acm.org/doi/pdf/10.1145/3035918.3056096">not always clear</a>, though, to what extent these attempts have clarified things, especially when each approach has introduced new variations of complexity and formal notation. The rise of distributed storage and database systems and the need to reason about isolation in these contexts has likely <a href="https://www.youtube.com/watch?v=ecZp6cWhDjg">worsened the situation</a>.</p>
<!-- If we want to say anything precise about transaction isolation, we do need some formal model in which to reason about it. -->
<p>There are a host of proposed formalisms that all approach the problem from different angles, with different frameworks, notations, etc. (<a href="http://pmg.csail.mit.edu/papers/adya-phd.pdf">Adya</a>, <a href="https://drops.dagstuhl.de/storage/00lipics/lipics-vol042-concur2015/LIPIcs.CONCUR.2015.58/LIPIcs.CONCUR.2015.58.pdf">Cerone</a>, <a href="https://dl.acm.org/doi/10.1145/3087801.3087802">Crooks</a>). They are all quite dense and differ in nontrivial ways, so it is helpful to try to understand some of the common underlying concepts between them. In particular, there are two “modern” (post-<a href="http://pmg.csail.mit.edu/papers/adya-phd.pdf">Adya 1999</a>) formalisms of isolation, <a href="https://drops.dagstuhl.de/storage/00lipics/lipics-vol042-concur2015/LIPIcs.CONCUR.2015.58/LIPIcs.CONCUR.2015.58.pdf">Cerone 2015</a> and <a href="https://dl.acm.org/doi/10.1145/3087801.3087802">Crooks 2017</a>, which take a similar “read-centric” view of isolation. Their surface details and formalizations appear quite different, but they share many similarities in their core ideas.</p>
<!-- They are also both notably distant from the foundational work of [Adya 1999](https://pmg.csail.mit.edu/papers/adya-phd.pdf). -->
<!-- which are both notably distant from the foundational work of [Adya 1999](https://pmg.csail.mit.edu/papers/adya-phd.pdf) and bear similarities in their core ideas. -->
<h2 id="modern-isolation-formalisms">Modern Isolation Formalisms</h2>
<p>A unifying concept of essentially any transaction isolation formalism is that an isolation definition can be viewed as a <em>condition over a set of committed transactions</em>. That is, given some set of transactions that were committed by a database system, these transactions either satisfy a given isolation level or not, based on the sequence of read and write operations present in each of these transactions.</p>
<div style="text-align: center">
<img src="/assets/diagrams/txn-isolation/txnvis1-CommittedTxns.drawio.png" alt="Transaction Isolation Models" width="550" />
</div>
<p>Note that a core aspect of any formal isolation definition is that it places conditions on <em>how reads observe database state</em>. If we have a set of transactions that only perform writes, we might have some intuitive correctness notion for how a database should execute these transactions, but we can’t make such a definition formal unless there exist some read operations that may observe the effects of other transactions’ writes. So, we can say that, to a first degree, a transaction isolation definition should be about <em>conditions on the set of values that a transaction can read</em>.
<!-- We can keep this "read-centric" view in mind in context of some of the modern formalisms for transaction isolation. -->
<!-- There are two notable, modern models of transaction isolation that try to capture some of this intuition formally: -->
The <a href="https://drops.dagstuhl.de/storage/00lipics/lipics-vol042-concur2015/LIPIcs.CONCUR.2015.58/LIPIcs.CONCUR.2015.58.pdf">2015 model of Cerone et al.</a> and the subsequent <a href="https://dl.acm.org/doi/10.1145/3087801.3087802">Crooks 2017 client-centric model</a> both approach isolation from a similar, “read-centric” view.</p>
<p>Under this “read-centric” view, when we think about how to define an isolation level, we should first be concerned with how we define what values a transaction can read. If our isolation level makes no restrictions on this, then a transaction can read any value (which is, in practice, how you might define a level like <em>read uncommitted</em>). More sensibly, we would expect a transaction to read states that are <em>reasonable</em>, in some sense. More concretely, we should expect transactions to actually read values written by other transactions. This could be a starting definition for isolation (similar to <em>read committed</em>), and one step up in strength from allowing transactions to read any possible value.</p>
<p>There are some other reasonable constraints, though. Basically, we likely expect that the possible states we read from came about through some “reasonable” execution of the transactions we gave to the database. One “reasonable” type of execution would be to execute these transactions in some sequential order. This is, for example, what we would expect out of a database system if we gave it a series of transactions one by one, with no concurrent overlap between transactions (e.g. the classic notion of serializability). The Cerone and Crooks models both allow for a more precise formalization of these ideas.</p>
<h3 id="cerone-2015">Cerone 2015</h3>
<p>The Cerone paper, <a href="https://drops.dagstuhl.de/storage/00lipics/lipics-vol042-concur2015/LIPIcs.CONCUR.2015.58/LIPIcs.CONCUR.2015.58.pdf"><em>A Framework for Transactional Consistency Models with Atomic Visibility</em></a>, starts with a core simplifying assumption of <em>atomic visibility</em>, which is that either all or none of the operations of a transaction can become visible to other transactions. This means that their model cannot represent isolation levels like <em>Read Committed</em>, which is weaker than <em>Read Atomic</em>, the weakest level their model can express.</p>
<p>Their model encodes the intuitive idea of “read-centric” isolation by first defining a <em>visibility</em> relation between transactions, i.e. a way of defining which transactions are visible to other transactions. That is, if a transaction reads a key, which other transactions’ writes should it observe? The model defines this in terms of <em>abstract executions</em>, where an abstract execution consists of a set of committed transactions (called a <em>history</em> \(\mathcal{H}\)) along with two relations over this set:</p>
<ul>
<li><strong>Visibility</strong> (\(\mathsf{VIS} \subseteq \mathcal{H} \times \mathcal{H}\)): acyclic relation where \(T \overset{\mathsf{VIS}}{\rightarrow} S\) means that \(S\) is aware of \(T\).</li>
<li><strong>Arbitration</strong> (\(\mathsf{AR} \subseteq \mathcal{H} \times \mathcal{H}\)): total order such that \(\mathsf{AR} \supseteq \mathsf{VIS}\), where \(T \overset{\mathsf{AR}}{\rightarrow} S\) means that the writes of \(S\) supersede those written by \(T\) (essentially, it only meaningfully orders writes by concurrent transactions).</li>
</ul>
<p>Basically, \(\mathsf{VIS}\) is a partial ordering of transactions in a history, and \(\mathsf{AR}\) is a total order on transactions that is a superset of \(\mathsf{VIS}\) (i.e. any edge in \(\mathsf{VIS}\) is also by default an edge in \(\mathsf{AR}\)). Note that \(\mathsf{AR}\) is a total order, so every two transactions are comparable by this ordering even if, in some cases (as discussed below), this ordering is not relevant, and could be omitted.</p>
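<p>These two structural conditions can be checked mechanically. The sketch below (a hypothetical encoding of mine, not from the paper) represents \(\mathsf{VIS}\) and \(\mathsf{AR}\) as sets of transaction pairs and verifies that \(\mathsf{AR}\) is a strict total order containing \(\mathsf{VIS}\):</p>

```python
from itertools import combinations

def valid_abstract_execution(txns, vis, ar):
    """Check the structural conditions on an abstract execution:
    AR is a strict total order on the transactions, and AR contains
    VIS (which also makes VIS acyclic, since it sits inside a total
    order)."""
    ar, vis = set(ar), set(vis)
    # Totality/antisymmetry: each pair is ordered in exactly one direction.
    for a, b in combinations(txns, 2):
        if ((a, b) in ar) == ((b, a) in ar):
            return False
    # Transitivity of AR.
    for (a, b) in ar:
        for (c, d) in ar:
            if b == c and (a, d) not in ar:
                return False
    # Every visibility edge must also be an arbitration edge.
    return vis <= ar

txns = ["T1", "T2", "T3"]
ar = [("T1", "T2"), ("T2", "T3"), ("T1", "T3")]   # total order: T1 < T2 < T3
vis = [("T1", "T2"), ("T2", "T3")]                # partial order inside AR
print(valid_abstract_execution(txns, vis, ar))    # True
```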
<figure style="text-align: center" id="fig1">
<div style="text-align: center;padding:30px;">
<img src="/assets/diagrams/txn-isolation/txnvis1-SampleHistory.drawio.png" alt="Transaction Isolation Models" width="450" />
</div>
<figcaption>Figure 1: A history \(\mathcal{H}\) of committed transactions with a possible visibility and arbitration relation. Note that \(\mathsf{AR}\) is a total order, so we can visualize this by ordering all transactions in some linear (left-to-right) order.</figcaption>
</figure>
<p>A consistency model (e.g. isolation level) is then defined as a set of <em>consistency axioms</em> constraining executions, where a consistency model allows histories for which there exists an abstract execution satisfying the axioms. In other words, given a set of transactions that executed against the database, they satisfy a consistency/isolation level if there exists an abstract execution that obeys the axioms of that consistency/isolation level, meaning that there exists a \(\mathsf{VIS}\) and \(\mathsf{AR}\) relation over this set that satisfies the axioms.</p>
<p>The weakest isolation level defined in the Cerone model, <em>Read Atomic</em>, imposes only two conditions: <em>internal</em> and <em>external</em> consistency, which are defined intuitively as:</p>
<ul>
<li>\(I\small{NT}\) (internal consistency): a read from an object returns the same value as the last write to or read from that object within the transaction.</li>
<li>\(E\small{XT}\) (external consistency): the value returned by an external read of an object \(x\) in \(T\) is determined by the transactions \(\mathsf{VIS}\)-preceding \(T\) that write to \(x\). If there are no such transactions, then \(T\) reads the initial value 0. Otherwise, it reads the final value written by the last such transaction in \(\mathsf{AR}\).</li>
</ul>
<p>Internal consistency is a bit tedious from a formal perspective and is the less interesting condition, stating essentially that you read your own writes within a transaction. External consistency is the more important condition and depends on the visibility relation, stating that a transaction will read the value written by the latest transaction preceding it in the visibility relation, with conflicts decided by the arbitration relation. Note that if the read and write sets of two transactions are disjoint, then the visibility relation is essentially irrelevant for them, so it doesn’t matter whether such an edge is or isn’t included in \(\mathsf{VIS}\).</p>
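<p>As a rough illustration of the \(E\small{XT}\) condition, the sketch below (my own encoding, with hypothetical names) computes the value an external read must return: the final write of the \(\mathsf{AR}\)-latest visible writer, or the initial value 0 if no visible transaction wrote the key:</p>

```python
def ext_read_value(t, key, vis, ar_order, writes, init=0):
    """The value an external read of `key` in transaction `t` must return
    under EXT: the final value written by the AR-latest VIS-predecessor
    of `t` that writes to `key`, or `init` if there is no such transaction.
    vis: set of (T, S) pairs (T visible to S); ar_order: all transactions
    listed in arbitration order; writes: {txn: {key: final value}}."""
    visible_writers = [s for s in ar_order
                       if (s, t) in vis and key in writes.get(s, {})]
    if not visible_writers:
        return init
    return writes[visible_writers[-1]][key]      # AR-latest writer wins

writes = {"T1": {"x": 1}, "T2": {"x": 2}}
vis = {("T1", "T3"), ("T2", "T3")}               # T3 sees both writers
print(ext_read_value("T3", "x", vis, ["T1", "T2", "T3"], writes))  # 2
```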
<p>So, at this weakest defined isolation level, <em>Read Atomic</em>, we can think about a whole batch of committed transactions, where the only restriction that we place on their reads is that they observe the effects of some other transaction(s) in this set, determined by each transaction’s incoming visibility (\(\mathsf{VIS}\)) edges (e.g. as illustrated in <a href="#fig1">Figure 1</a>). If multiple transactions among the incoming visibility edges wrote to conflicting key sets, then \(\mathsf{AR}\) exists to arbitrate between them, determining which write is observed. Note also that \(\mathsf{AR}\) is a total order, so we can think about (and visualize) it as a global, linear ordering of all transactions, as illustrated by the left-to-right ordering in <a href="#fig1">Figure 1</a>. In some cases this total ordering is not relevant to the semantics of transactions, but we can imagine that it always exists in the background. Also, note that \(\mathsf{AR}\) is a superset of \(\mathsf{VIS}\), which means you can’t have a visibility edge that goes “backwards” in this arbitration total order.</p>
<p>The underlying model requires that the visibility relation is acyclic, but without any other restrictions there are still some unintuitive semantics allowed at this weakest definition, with <em>causality violations</em> as the notable example. Basically, the visibility relation is not, by default, required to be transitive at <em>Read Atomic</em>, so a transaction can observe the effects of some other transaction that itself observed an “earlier” transaction, while not observing the effects of that “earlier” transaction, as shown by the example below with 3 transactions (i.e. \(T_3\) observes the effect of \(T_2\) via \(y\), and \(T_2\) observes the effect of \(T_1\) via \(x\), but \(T_3\) does not observe the effect of \(T_1\)’s write to \(x\)).</p>
<div style="text-align: center">
<img src="/assets/diagrams/txn-isolation/txnvis1-CausalViolation.drawio.png" alt="Transaction Isolation Models" width="420" />
</div>
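<p>Such missing transitive visibility edges can be enumerated directly. This small sketch (my own encoding, not from the paper) flags every pair of transactions whose omission from \(\mathsf{VIS}\) constitutes a causality violation of this kind:</p>

```python
def causal_violations(vis):
    """Missing transitive visibility edges: pairs (A, C) where A -> B -> C
    exists in VIS but A -> C does not. A transitivity requirement on VIS
    (as in Causal Consistency) forbids any such pair."""
    vis = set(vis)
    return {(a, d) for (a, b) in vis for (c, d) in vis
            if b == c and a != d and (a, d) not in vis}

# The 3-transaction example above: T2 sees T1 and T3 sees T2,
# but T3 misses T1's write.
print(causal_violations({("T1", "T2"), ("T2", "T3")}))   # {('T1', 'T3')}
```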
<p>Moving up the isolation hierarchy from <em>Read Atomic</em> in Cerone’s model, we start strengthening requirements on what the reads in transactions can observe. In their framework, this starts by first adding a transitivity condition on visibility (\(T{\small{RANS}}V{\small{IS}}\)), to get <em>Causal Consistency</em>. This is then extended to <em>Parallel Snapshot Isolation (PSI)</em> and <em>Prefix Consistency</em>, two levels that are not strictly comparable to each other in the hierarchy (see <a href="#fig2">Figure 2</a>).</p>
<figure style="text-align: center" id="fig2">
<div style="text-align: center">
<img src="/assets/fig1-framework-atomic-viz.png" alt="Transaction Isolation Models" width="790" />
</div>
<figcaption>Figure 2: Consistency models in the Cerone framework.</figcaption>
</figure>
<p>Note that the \(N{\small{O}}C{\small{ONFLICT}}\) condition enforced at PSI is the first condition that is not related to the <em>values observed by reads</em>. Rather, it places conditions on valid cases of conflicting writes between transactions.</p>
<p>Similarly, there is a notable transition from PSI to Prefix Consistency in this hierarchy, corresponding to a switch from <em>partial</em> to <em>total</em> ordering requirements on the visibility relation. Basically, the \(P\small{REFIX}\) condition requires that if \(T\) observes \(S\), then it also observes all \(\mathsf{AR}\) predecessors of \(S\). In the example below, which illustrates the <em>long fork</em> anomaly of PSI, transactions \(T_3\) and \(T_4\) can be understood as observing the effects of \(T_1\) and \(T_2\) in “different orders” i.e. for \(T_3\) it appears as if \(T_1 \rightarrow T_2\), but for \(T_4\) it appears as if \(T_2 \rightarrow T_1\).</p>
<figure style="text-align: center">
<img src="/assets/diagrams/txn-isolation/txnvis1-LongFork.drawio.png" alt="Transaction Isolation Models" width="580" style="display: block; margin-left: auto; margin-right: auto;" />
<figcaption style="text-align: center;width:730px;margin:auto;margin-top:10px;">Case of long fork anomaly allowed under Parallel Snapshot Isolation. Omission of the dotted visibility edge \(\mathsf{VIS}_{\small{PREFIX}}\) enables this anomaly, but its existence is forced under the \(P\small{REFIX}\) condition (e.g. at full snapshot isolation).</figcaption>
</figure>
<p>Under the \(P\small{REFIX}\) condition, the arbitration ordering of \(\mathsf{AR}\) between \(T_1\) and \(T_2\) comes into play, effectively enforcing a fixed order on how \(T_1\) and \(T_2\) are observed by other transactions. That is, in the above example, if \(T_3\) observes \(T_1\), then by \(P\small{REFIX}\) it must also observe \(T_1\)’s \(\mathsf{AR}\) predecessor \(T_2\). Similarly, \(T_4\) is then only required to observe \(T_2\), conforming to the \(T_2 \rightarrow T_1\) ordering enforced by \(\mathsf{AR}\). Since \(\mathsf{AR}\) is a total order, this condition is basically saying that if you observe any transaction from some set, then you are forced to observe all of its \(\mathsf{AR}\) predecessors in that set, in the fixed order decided by arbitration. So, this effectively forces visibility to be totally ordered for concurrent transactions.</p>
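<p>A rough check of the \(P\small{REFIX}\) condition can be sketched as follows (my own hypothetical encoding, with \(\mathsf{AR}\) given as a list of transactions in arbitration order), using the long-fork shape from the figure above:</p>

```python
def satisfies_prefix(vis, ar_order):
    """PREFIX: if T observes S, then T also observes every AR-predecessor
    of S, so concurrent writers are seen by everyone in one fixed order."""
    vis = set(vis)
    pos = {t: i for i, t in enumerate(ar_order)}
    for (s, t) in vis:
        for p in ar_order[:pos[s]]:          # all AR-predecessors of s
            if p != t and (p, t) not in vis:
                return False
    return True

# Long-fork shape: T3 sees T1 but misses T1's AR-predecessor T2.
ar_order = ["T2", "T1", "T3", "T4"]
vis = {("T1", "T3"), ("T2", "T4")}
print(satisfies_prefix(vis, ar_order))                   # False
print(satisfies_prefix(vis | {("T2", "T3")}, ar_order))  # True
```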
<p>If we move all the way to serializability, the conditions are strengthened to \(T{\small{OTAL}}V{\small{IS}}\), requiring simply that \(\mathsf{VIS}\) is a total order (along with the \(I\small{NT}\) and \(E\small{XT}\) conditions).
<a href="https://arxiv.org/abs/2405.18393">A Critique of Snapshot Isolation</a>, though, offers another approach to formalizing serializability: we instead alter snapshot isolation to prevent <em>read-write</em> conflicts instead of <em>write-write</em> conflicts i.e. if a transaction’s read set is written to by a concurrent transaction, then we must abort it. This is an alternative way to formalize serializability that mirrors more closely the \(N{\small{O}}C{\small{ONFLICT}}\) strengthening added for snapshot isolation levels.</p>
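<p>The contrast between the two conflict rules can be sketched over read/write key sets (my own illustration; the dictionary encoding is an assumption), using the classic write-skew shape where the write-write check of snapshot isolation passes but the read-write check does not:</p>

```python
def ww_conflict(t1, t2):
    """Snapshot isolation's write-write rule: concurrent transactions
    must not write a common key (otherwise one of them aborts)."""
    return bool(t1["writes"] & t2["writes"])

def rw_conflict(t1, t2):
    """The serializable variant: abort if either concurrent transaction
    writes a key that the other one reads."""
    return bool((t1["reads"] & t2["writes"]) or (t2["reads"] & t1["writes"]))

# Classic write skew: disjoint write sets but crossing read-write edges.
t1 = {"reads": {"y"}, "writes": {"x"}}
t2 = {"reads": {"x"}, "writes": {"y"}}
print(ww_conflict(t1, t2))   # False: both commit under snapshot isolation
print(rw_conflict(t1, t2))   # True: one must abort for serializability
```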
<p>Note that the <em>Read Atomic</em> isolation model (the weakest expressed in the Cerone formalism) can be viewed as an interesting “boundary” in isolation strength since, for something weaker like <em>Read Committed</em>, you only need to ensure that reads within a transaction of a key \(k\) read the value written by <em>some</em> other transaction to key \(k\). At such a weak level, there is no restriction on reading from a “consistent” state across keys. So, the weakest interpretation of read committed might be simply that any read can read any value that was written to that key at some point by any transaction in the history. This may not even impose any notion of ordering on transactions, since you really only care about your consistency guarantees at the level of a single key.</p>
<p>The <em>Read Atomic</em> model was first discussed in <a href="http://www.bailis.org/papers/ramp-sigmod2014.pdf">Bailis’ 2014 paper</a> on RAMP transactions. Note that <em>Read Atomic</em> can be viewed as similar to Snapshot Isolation but with an allowance for concurrent updates (i.e. it does not prevent write-write conflicts). This was also preceded by their earlier proposal of <a href="https://www.vldb.org/pvldb/vol7/p181-bailis.pdf"><em>Monotonic Atomic View</em></a> (MAV) isolation, which is strictly weaker than <em>Read Atomic</em>. Essentially, MAV ensures that once you observe any effect of a transaction you subsequently observe all of its effects, but it doesn’t require that reads all come from the same, fixed database snapshot (i.e. it is stronger than <em>Read Committed</em> but weaker than <em>Read Atomic</em>).</p>
<h3 id="crooks-2017">Crooks 2017</h3>
<p>While the Cerone 2015 formalization starts with the visibility and arbitration ordering concepts, the Crooks formalism, presented in <a href="https://dl.acm.org/doi/10.1145/3087801.3087802"><em>Seeing is Believing: A Client-Centric Specification of Database Isolation</em></a>, takes a different starting point, though there are underlying similarities. Crooks similarly defines isolation over a set of committed transactions, but formalizes its definitions in terms of <em>executions</em>, which are simply totally ordered sequences of these transactions.</p>
<figure style="text-align: center">
<div style="text-align: center;padding:30px;">
<img src="/assets/diagrams/txn-isolation/txnvis1-ReadStates.drawio.png" alt="Transaction Isolation Models" width="550" />
</div>
<figcaption>An <i>execution</i> of transactions with associated read states \(s_i\) in Crooks model.</figcaption>
</figure>
<p>The basic idea of Crooks’ formalism is centered on a <em>state-based</em> or <em>client-centric</em> view of isolation. That is, the values observed by any transaction are determined by <em>read states</em>, the states that the database passed through as it executed the transactions according to the chosen execution ordering. In a sense, this is similar to the notion of serializability as classically defined i.e. the values observed by each transaction being consistent with <em>some</em> sequential execution ordering that could have occurred.</p>
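<p>A minimal sketch of this read-state idea (my own encoding, not from the paper): given an execution order, the candidate read states are simply the database states produced after each transaction commits:</p>

```python
def read_states(execution, init=0, keys=("x", "y")):
    """The sequence of complete database states produced by applying an
    execution (a total order of transactions) one transaction at a time.
    In Crooks' model, reads must be explainable by some such state."""
    state = {k: init for k in keys}
    states = [dict(state)]
    for txn in execution:
        for op, key, val in txn:
            if op == "w":
                state[key] = val
        states.append(dict(state))       # state after this commit
    return states

t1 = [("w", "x", 1)]
t2 = [("w", "x", 2), ("w", "y", 2)]
for s in read_states([t1, t2]):
    print(s)   # prints the three states the database passes through
```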
<p>This is ultimately quite similar to the Cerone view, since the visibility relation (\(\mathsf{VIS}\)) serves a similar purpose i.e. by picking out which transactions’ writes are visible to you. Cerone doesn’t formulate this in terms of “read states” as Crooks does, but essentially the same idea is present i.e. the “read state” in the Cerone model is created by applying the writes of your \(\mathsf{VIS}\)-preceding transactions.</p>
<p>Crooks’ model also has a technical difference from the Cerone model, in that it allows expression of weaker models like <em>Read Committed</em>, since it does not make the <em>atomic visibility</em> assumption that Cerone does. It does this by allowing each read operation of a transaction to potentially read from a <em>different</em> read state, allowing for expression of the fractured reads anomaly that the Cerone model cannot represent.</p>
<figure style="text-align: center" id="fig3">
<div style="text-align: center;padding:5px;">
<img src="/assets/crooks-commit-tests.png" alt="Transaction Isolation Models" width="630" />
</div>
<figcaption>Figure 3: Execution of transactions with associated read states in Crooks model.</figcaption>
</figure>
<p>Crooks is also naturally able to represent <em>Read Atomic</em>, though. The formal definition (as shown in <a href="#fig3">Figure 3</a>) is somewhat dense (note that \(sf_o\) represents the first read state for an operation \(o\)), but intuitively it says that if an operation \(o\) observes the writes of a transaction \(T_i\), all subsequent operations reading a key in \(T_i\)’s write set must read from a state that includes \(T_i\)’s effects.</p>
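<p>As a rough sketch of the fractured-reads intuition (my own simplified encoding, which checks a reader against a single writer transaction and ignores overwrites by later transactions):</p>

```python
def has_fractured_read(reader_ops, writer_writes):
    """Detect a fractured read against one writer: once the reader has
    observed any of the writer's (atomic) writes, a later read of another
    key in the writer's write set must not return a stale pre-writer value.
    reader_ops: ordered (key, value) reads; writer_writes: {key: value}."""
    seen_writer = False
    for key, val in reader_ops:
        if key in writer_writes:
            if val == writer_writes[key]:
                seen_writer = True
            elif seen_writer:
                return True              # saw the writer, then missed it
    return False

writer = {"x": 1, "y": 1}                # written atomically by one txn
print(has_fractured_read([("x", 1), ("y", 0)], writer))  # True: fractured
print(has_fractured_read([("x", 1), ("y", 1)], writer))  # False
```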
<p>While the Crooks and Cerone models differ on the surface and in their formal details, they can be viewed as quite similar in their core ideas, which are about first establishing what possible values a transaction can read. We can roughly map Cerone’s model to Crooks’ model as well. We can consider the \(\mathsf{AR}\) total order of Cerone as analogous to the “execution order” used in the Crooks model, which is also a total order of transactions. The \(\mathsf{VIS}\) relation of Cerone is then akin to the selection of read states in Crooks’ model. That is, each transaction in the chosen total order picks out some transactions that are visible to it, and reads values accordingly. In Crooks’ model, based on the read state you pick, the transactions visible to you (as in Cerone) would be determined by the transactions preceding that read state.</p>
<p><em>Published Mon, 17 Mar 2025. <a href="https://will62794.github.io/formal-methods/specification/2025/03/17/transaction-isolation-models.html">Permalink</a>.</em></p>
<h2 id="interactive-formal-specifications">Interactive Formal Specifications</h2>
<p>Formal specifications <a href="https://github.com/elastic/elasticsearch-formal-models">have</a> <a href="https://www.datadoghq.com/blog/engineering/formal-modeling-and-simulation/">become</a> <a href="https://www.amazon.science/publications/how-amazon-web-services-uses-formal-methods">a core part</a> of rigorous distributed systems design and verification, but existing tools have been lacking in providing good interfaces for interacting with, exploring, visualizing, and sharing these specifications and models in a portable and effective manner.</p>
<div style="text-align: center;">
<img src="/assets/interactive-formal-specs/ReconfigBroken2-Screenshot 2025-11-21 at 10.28.23 AM.png" alt="TLA+ Web Explorer Visualization" style="width: 68%; height: auto; border:solid 1px; border-radius: 8px;" width="95%" />
<div style="font-size: 0.85em; color: #555; margin-top: 6px;">
<span>Visualization of dynamic reconfiguration behavior in Raft.</span>
</div>
</div>
<p><a href="https://github.com/will62794/tla-web">Spectacle</a> aims to address this shortcoming by providing a browser-based tool for exploring and visualizing formal specifications written in <a href="https://lamport.azurewebsites.net/tla/tla.html">the TLA+ specification language</a>. It takes inspiration from past attempts at building similar tools, like Diego Ongaro’s <a href="https://www.usenix.org/system/files/login/articles/login_fall16_06_ongaro.pdf">Runway</a>, but it builds on top of TLA+, taking advantage of an existing, well-defined formal specification language, rather than trying to build a new language alongside the tool.</p>
<h3 id="the-javascript-interpreter">The JavaScript Interpreter</h3>
<p>At the core of the tool is a native JavaScript interpreter for TLA+. <a href="https://github.com/tlaplus/tlaplus">TLC</a> is the primary existing interpreter and model checker for TLA+ specifications, and it is mature, well-maintained, and has been optimized for performance over many years. It is, however, a somewhat complex and intricate codebase, written in Java, so it was not a great candidate for integration into a browser-based tool that would allow for dynamic interaction with specifications.</p>
<p>One could build a type of language server into TLC that allows for remote interaction, but this seemed likely to provide a less than ideal interactive experience, and would require an external server to be maintained and available whenever the tool is used. The now defunct <a href="https://github.com/Z3Prover/z3/discussions/5473">Rise4Fun</a> site from Microsoft Research illustrates the pitfalls of relying on a remote service for running these types of tools.</p>
<p>The development of a JavaScript interpreter for TLA+ was enabled by earlier work on <a href="https://ahelwer.ca/post/2023-01-11-tree-sitter-tlaplus/">building a tree-sitter parser for TLA+</a>, which can be compiled to <a href="https://webassembly.org/">WebAssembly</a> and run in the browser. Parsing TLA+ itself is a non-trivial task, so the development of this browser-based parser was a big step forward in enabling the interpreter. The interpreter itself is written entirely in vanilla JavaScript, and <a href="https://github.com/will62794/spectacle/blob/master/js/eval.js">currently consists of</a> around 5000 lines of code. The goal is for the interpreter semantics to conform as closely as possible to TLC semantics, which we try to achieve via a conformance testing approach that compares the results of the JavaScript interpreter and TLC <a href="https://will62794.github.io/spectacle/test.html">on a large corpus of TLA+ specifications</a>.</p>
<p>A benefit of this interpreter implementation is its ability to dynamically evaluate TLA+ specifications and expressions in the browser. For example, the demo below shows the dynamic evaluation of initial states for a single variable declaration (e.g. <code class="language-plaintext highlighter-rouge">VARIABLE x</code>):</p>
<div style="
display: flex;
flex-direction: row;
width: 98%;
margin: 0 auto 24px auto;
gap: 0;
min-height: 180px;
border-radius: 12px;
box-shadow: 0 1px 7px rgba(0,0,0,0.04);
background: #f5f6fa;">
<!-- Left: Input -->
<div style="
flex: 1 1 0;
border-right: 1px solid #e0e0e0;
padding: 26px 20px 26px 20px;
display: flex;
flex-direction: column;
justify-content: flex-start;">
<div style="font-size: 0.96em; font-weight: 500; margin-bottom: 8px; color: #444;">Initial state expression</div>
<textarea id="tla-repl-input" placeholder="Enter state expression (e.g. x \in {1,2,3})" style="
font-size: 14px;
padding: 12px;
width: 100%;
resize: vertical;
min-height: 90px;
max-height: 250px;
font-family: 'Fira Mono', 'Consolas', 'Menlo', monospace;
border-radius: 8px;
border: 1px solid #cacaca;
background: #fff;
box-sizing: border-box;"></textarea>
</div>
<!-- Right: Output -->
<div style="
flex: 1 1 0;
padding: 26px 20px 26px 20px;
display: flex;
flex-direction: column;
justify-content: flex-start;">
<div style="font-size: 0.96em; font-weight: 500; margin-bottom: 8px; color: #444;">Initial states generated</div>
<div id="tla-init-states" style="
font-size: 14px;
font-family: 'Fira Mono', 'Consolas', 'Menlo', monospace;
border: solid 1px #ccc;
padding: 15px 14px;
border-radius: 8px;
min-height: 70px;
background: #fff;"></div>
</div>
</div>
<script src="https://cdn.jsdelivr.net/npm/[email protected]/lodash.min.js"></script>
<script src="/assets/tla-web-embed/js/hash-sum/hash-sum.js"></script>
<script src="/assets/tla-web-embed/js/eval.js"></script>
<script>LANGUAGE_BASE_URL = "js";</script>
<script src="/assets/tla-web-embed/js/tree-sitter.js"></script>
<script>
//
// Main script that sets up and runs the interpreter.
//
let tree;
let parser;
let languageName = "tlaplus";
let enableEvalTracing = false;
/**
* Main UI initialization logic.
*/
async function init() {
await TreeSitter.init();
parser = new TreeSitter();
let tree = null;
var ASSIGN_PRIMED = false;
// Load the tree-sitter TLA+ parser.
let language;
const url = `/assets/tla-web-embed/${LANGUAGE_BASE_URL}/tree-sitter-${languageName}.wasm`;
try {
language = await TreeSitter.Language.load(url);
} catch (e) {
console.error(e);
return;
}
tree = null;
parser.setLanguage(language);
// Define a very simple spec inline.
// This can also be fetched from a remote URL.
let inputExpr = document.getElementById('tla-repl-input');
inputExpr.addEventListener('input', function() {
// Re-parse and evaluate spec with new input value
let specText = `
---- MODULE test ----
VARIABLE x
Init == ${inputExpr.value}
Next == x' = x
====`;
document.getElementById('tla-init-states').innerHTML = "";
// Parse the spec
let spec = new TLASpec(specText, "");
spec.parse().then(function () {
// Initialize the interpreter after parsing the spec
let interp = new TlaInterpreter();
// Generate initial states
let initStates = interp.computeInitStates(spec.spec_obj, {}, false);
console.log("Init states:", initStates);
document.getElementById('tla-init-states').innerHTML = "";
initStates.forEach(state => {
document.getElementById('tla-init-states').innerHTML += state.toString().substring(2) + "<br>\n";
});
});
});
}
// Initialize things.
init();
</script>
<h3 id="interactive-trace-exploration">Interactive Trace Exploration</h3>
<p>A core feature of the tool is the ability to load a TLA+ specification and <em>interactively</em> explore its behaviors. It provides the capability for a user to, from any current state, select an enabled action to transition to a next state, and also allows for back-tracking in the current trace. It also supports <em>trace expressions</em>: arbitrary TLA+ expressions that are evaluated at each state of the current trace.</p>
<p>For example, below shows a partial trace of the <a href="https://github.com/will62794/tla-web/blob/07c093c27a0886c70cbbf1ab1c1b7188caf4ca3d/specs/TwoPhase.tla">two-phase commit protocol specification</a> in the tool:</p>
<div style="text-align: center;">
<img src="/assets/interactive-formal-specs/2pc-partial-Screenshot 2025-11-21 at 10.16.00 AM.png" alt="TLA+ Web Explorer Visualization" style="width:90%; height:auto; border:solid 1px; border-radius: 8px;" width="95%" />
</div>
<p>The tool also provides the ability to easily share traces via static links, which can be reloaded in a new browser window while retaining the generated trace and its existing parameters/settings. This provides a universal, portable way to share system traces, something that was quite awkward with existing tools. For example, here is a link showing two-phase commit <a href="https://will62794.github.io/spectacle/#!/home?specpath=.%2Fspecs%2FTwoPhase.tla&initPred=Init&nextPred=Next&constants%5BRM%5D=%7Brm1%2Crm2%2Crm3%7D&trace=736d33ec%2Cb018b670_0890eb82%2C51a8a4d6_1b98ff2e%2C51194ae5_567308aa%2C5eb7f37c_0890eb82%2C2350e907_1b98ff2e%2C5e3c1474_567308aa%2C05ef0b8c%2C4300e35b_0890eb82%2Cf5692076_1b98ff2e%2C0029e85c_567308aa">driving all the way to commit</a>, and another link showing it <a href="https://will62794.github.io/spectacle/#!/home?specpath=.%2Fspecs%2FTwoPhase.tla&initPred=Init&nextPred=Next&constants%5BRM%5D=%7Brm1%2Crm2%2Crm3%7D&trace=736d33ec%2C5dbafbf8_0890eb82%2C2ba67f94_1b98ff2e%2C25f7a42e_567308aa%2Cb9c48c8a">driving through to abort</a>. It is also easy to link to system traces/counterexamples that illustrate interesting behaviors and/or edge cases of different protocols e.g. here is a <a href="https://will62794.github.io/spectacle/#!/home?specpath=.%2Fspecs%2FAbstractRaft.tla&constants%5BServer%5D=%7Bs1%2Cs2%2C%20s3%7D&constants%5BSecondary%5D=%22Secondary%22&constants%5BPrimary%5D=%22Primary%22&constants%5BNil%5D=%22Nil%22&constants%5BInitTerm%5D=0&initPred=Init&nextPred=Next&trace=318c702a%2C0785f33f_f64845d8%2Cbbf1576c_f013900a%2C79ad3285_f013900a%2C708acdc2_35419b82%2C3ef71dbf_26e3c58d%2C3c79f059_35419b82%2C9f67014c_f64845d8">link</a> to a case of a Raft leader being elected, writing a log entry and then committing it across all nodes.</p>
<p>In addition to the trace exploration and expression features, the tool also provides a basic REPL interface, which allows arbitrary expressions to be evaluated in the context of the currently loaded specification. This feature mostly subsumes <a href="https://github.com/will62794/tlaplus_repl">previous attempts</a> at providing a REPL-like interface for TLA+ specifications.</p>
<h3 id="visualization">Visualization</h3>
<p>The above features are effective for exploring and understanding a specification, but in some cases it can be helpful to have a more polished and visual way to understand a system and its states/behaviors. Currently, the tool provides a very simple, SVG-based DSL for defining visualizations directly in a TLA+ specification itself, rather than requiring a separate interface/language for defining visualizations.</p>
<p>For example, here is a <a href="https://will62794.github.io/spectacle/#!/home?specpath=.%2Fspecs%2FCabbageGoatWolf.tla&initPred=Init&nextPred=Next&trace=f3cb45ca%2C4357915f_7da698e2%2C126ae834_bf3b326e%2C76c2f092_652fccef%2C7229f089_f598e730%2C29e91cea_2ac3323e%2C50fe2821_bf3b326e%2C1d26e01c_9abe74ba%2C5f98d202_f598e730%2C3a9fa186_34b35f78%2Ca49994fc_bf3b326e%2Ceec0674a_652fccef%2C2afe63ed_f598e730%2C2883b61a_7da698e2%2C73ea1058_bf3b326e">simple visualization</a> of the final state of the <a href="https://en.wikipedia.org/wiki/Wolf,_goat_and_cabbage_problem">Wolf, Goat, and Cabbage</a> puzzle solution:</p>
<div style="text-align: center;">
<img src="/assets/interactive-formal-specs/CabbageGoatSolution-Screenshot 2025-11-21 at 10.20.08 AM.png" alt="TLA+ Web Explorer Visualization" style="width: 95%; height: auto; border:solid 1px; border-radius: 8px;" width="95%" />
</div>
<p>and here is <a href="https://will62794.github.io/spectacle/#!/home?specpath=.%2Fspecs%2FAbstractRaft.tla&constants%5BServer%5D=%7Bs1%2Cs2%2C%20s3%7D&constants%5BSecondary%5D=%22Secondary%22&constants%5BPrimary%5D=%22Primary%22&constants%5BNil%5D=%22Nil%22&constants%5BInitTerm%5D=0&initPred=Init&nextPred=Next&trace=318c702a%2C0785f33f_f64845d8%2Cbbf1576c_f013900a%2C2bf180d4_5a6a532a%2C4a68b9f3_1fec2cce%2C0aa370b2_4e287604%2C6ac87ace_4e287604">a visualization</a> of an abstract Raft specification with an elected leader and some log entries replicated across nodes:</p>
<div style="text-align: center;">
<img src="/assets/interactive-formal-specs/RaftLogs-Screenshot 2025-11-21 at 10.21.29 AM.png" alt="TLA+ Web Explorer Visualization" style="width: 95%; height: auto; border:solid 1px; border-radius: 8px;" width="95%" />
</div>
<p>The visualization DSL can currently be defined directly in the TLA+ specification itself, as seen <a href="https://github.com/will62794/tla-web/blob/07c093c27a0886c70cbbf1ab1c1b7188caf4ca3d/specs/CabbageGoatWolf.tla#L74-L107">here</a>, and provides a set of basic SVG primitives that can be <a href="https://github.com/will62794/tla-web/blob/07c093c27a0886c70cbbf1ab1c1b7188caf4ca3d/specs/CabbageGoatWolf.tla#L109-L156">arranged and positioned in hierarchical groups</a>, following standard SVG conventions. In the future, these visualization primitives could be expanded with a variety of richer structures (e.g. graphs, lists, charts, <a href="https://andrewcmyers.github.io/constrain/">constraint-based approaches</a>, etc.), but for now even this simple set of primitives allows for a variety of helpful system visualizations.</p>
<h3 id="conclusion">Conclusion</h3>
<p>Overall, the vision is for Spectacle to be complementary to existing TLA+ tooling. For example, it is expected that TLC will remain the primary tool for model checking of non-trivial TLA+ specifications, since it is still the most performant tool for doing so. The goal is for Spectacle to be a tool for prototyping, exploring, and understanding specs, and sharing the results of these explorations in a convenient and portable manner, aspects that few existing tools in the ecosystem excel at.</p>
Thu, 12 Dec 2024 00:00:00 +0000
https://will62794.github.io/formal-methods/specification/2024/12/12/interactive-formal-specs.html
https://will62794.github.io/formal-methods/specification/2024/12/12/interactive-formal-specs.htmlformal-methodsspecificationDecomposing Protocols with Interaction Graphs<!-- When verifying complex protocols, it is often useful to break them up into smaller components, verify each component separately, and then compose the results to verify the overall protocol. Ideally we would like to be able to break down a protocol into as small components as possible, verify each component separately, and then compose the results to verify the overall protocol. There are a few approaches to doing this for protocols, which we can do by analyzing the interactions between components, and abstracting sub-components based on these interactions. -->
<!-- ## Protocol Decomposition via Interaction Graphs -->
<p>Concurrent and distributed protocols can be formally viewed as a set of logical <em>actions</em>, each of which symbolically describes the allowed state transitions of the system. We can analyze the structure of a protocol’s actions to understand the interactions between them, and to reason about a protocol’s underlying compositional structure.
<!-- e.g. for improving verification efficiency if possible. --></p>
<p>One approach to decomposing a protocol into subcomponents is to break up its actions into disjoint subsets, and view each disjoint subset of actions as a separate logical component. This is a useful starting point for decomposition of protocols since actions represent the atomic units of concurrent behavior within a protocol specification. We can also use this basic type of decomposition to define various formal notions of <em>interaction</em> between individual actions or subcomponents of a protocol.</p>
<!-- , which illustrates the logical interaction structure of a protocol and can also be used for accelerating verification for some protocols with the adequate interaction structure. -->
<p>As a simple example, consider the following protocol specification:</p>
\[\small
\begin{align*}
&\text{VARIABLES } a,b,c \\[0.4em]
&{Init } \triangleq \\
& \quad \land \, a = 0 \\
& \quad \land \, b = 0 \\
& \quad \land \, c = 0 \\[0.4em]
& IncrementA
\triangleq \\
& \quad \land \, b = 0 \\
& \quad \land \, a' = a + b \\
&\quad \land {\text{UNCHANGED }} \langle b,c \rangle \\[0.4em]
& IncrementB
\triangleq \\
& \quad \land \, b' = b + c \\
& \quad \land \, \text{UNCHANGED } \langle a,c \rangle \\[0.4em]
& IncrementC \triangleq \\
&\quad \land \, c < cycle \\
&\quad \land \, c' = (c + 1) \% cycle \\
&\quad \land \, \text{UNCHANGED } \langle a,b \rangle
\\[0.4em]
&Next \triangleq \\
&\quad \lor IncrementA \\
&\quad \lor IncrementB \\
&\quad \lor IncrementC \\
\\
&Inv \triangleq a \in \{0,1\} \quad \text{(* top-level invariant. *)} \\
& L1 \triangleq b \in \{0,1\} \\
& L2 \triangleq c \in \{0,1\}
\end{align*}\]
<p>In this case, we can consider decomposing the protocol into two logical sub-components,</p>
\[\begin{align*}
M_1 &= \{IncrementA\} & \qquad Vars(M_1)=\{a,b\} \\
M_2 &= \{IncrementB, IncrementC\} & \qquad Vars(M_2)=\{b,c\}
\end{align*}\]
<p>with the state variables associated with each component.</p>
<p>In this case, it is clear that the logical <em>interaction</em> between \(M_1\) and \(M_2\) can be defined in terms of their single shared variable, \(b\). Furthermore, this interaction is “uni-directional” in terms of the data flow between components, i.e. only \(M_1\) reads from \(b\) and only \(M_2\) writes to \(b\). In this simple case of interaction it is also clear that, for example, verification of \(M_1\)’s behaviors should only depend on the behavior of the interaction variable \(b\). The full behavior of \(M_2\) is irrelevant to the behavior of \(M_1\), enabling a natural type of compositional verification.</p>
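<p>As a rough sketch of this read/write analysis (a hypothetical helper, not part of the author's tooling, with each component's read/write sets extracted by hand from the spec above), we can classify the direction of interaction between \(M_1\) and \(M_2\):</p>

```python
# Hand-extracted read/write sets for each component (an assumption, not
# parsed from the TLA+ spec): IncrementA reads {a,b} and writes {a};
# IncrementB and IncrementC together read {b,c} and write {b,c}.
M1 = {"reads": {"a", "b"}, "writes": {"a"}}        # {IncrementA}
M2 = {"reads": {"b", "c"}, "writes": {"b", "c"}}   # {IncrementB, IncrementC}

def interaction(m1, m2):
    """Classify the dataflow between two components over shared variables."""
    shared = (m1["reads"] | m1["writes"]) & (m2["reads"] | m2["writes"])
    m1_to_m2 = m1["writes"] & m2["reads"]  # data flowing M1 -> M2
    m2_to_m1 = m2["writes"] & m1["reads"]  # data flowing M2 -> M1
    if m1_to_m2 and not m2_to_m1:
        kind = "uni-directional (M1 -> M2)"
    elif m2_to_m1 and not m1_to_m2:
        kind = "uni-directional (M2 -> M1)"
    elif m1_to_m2 and m2_to_m1:
        kind = "bi-directional"
    else:
        kind = "independent"
    return shared, kind

print(interaction(M1, M2))  # -> ({'b'}, 'uni-directional (M2 -> M1)')
```

Here the only shared variable is \(b\), and the dataflow is uni-directional from \(M_2\) into \(M_1\), matching the discussion above.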
<!-- That is, we can consider all behaviors of $$M_2$$, projected to the interaction variable $$b$$, and then verify $$M_1$$ against this behavior. -->
<p>More generally, if we consider every action of a protocol as its own, fine-grained component, with associated read/write variables, we can check pairwise interactions between all actions of an original protocol to produce an <em>interaction graph</em>, as shown below. This then serves as a starting point for understanding the interaction between protocol actions and the potential boundaries for protocol decomposition.</p>
<p align="center">
<img src="https://github.com/will62794/ipa/blob/main/specs/M_uni/M_uni_interaction_graph.png?raw=true" alt="Example Protocol Interaction Graph" width="430" />
</p>
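<p>An interaction graph like the one above can be derived with a simple pairwise check over hand-extracted read/write sets (a minimal illustrative sketch, not the author's actual tool): an edge from one action to another indicates that the first writes a variable the second reads.</p>

```python
# Read/write sets for the example protocol's actions, extracted by hand
# (an assumption for illustration, not parsed from the TLA+ spec).
actions = {
    "IncrementA": {"reads": {"a", "b"}, "writes": {"a"}},
    "IncrementB": {"reads": {"b", "c"}, "writes": {"b"}},
    "IncrementC": {"reads": {"c"},      "writes": {"c"}},
}

def interaction_edges(actions):
    """Edge (A1, A2) labeled with the variables A1 writes and A2 reads."""
    edges = {}
    for a1, rw1 in actions.items():
        for a2, rw2 in actions.items():
            if a1 == a2:
                continue  # ignore self-interaction in this sketch
            shared = rw1["writes"] & rw2["reads"]
            if shared:
                edges[(a1, a2)] = shared
    return edges

print(interaction_edges(actions))
# e.g. IncrementB -> IncrementA via {'b'}, IncrementC -> IncrementB via {'c'}
```

This recovers the acyclic, uni-directional dataflow structure of the example: \(IncrementC\) feeds \(IncrementB\) via \(c\), which in turn feeds \(IncrementA\) via \(b\).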
<p>As another example, we can consider <a href="https://github.com/will62794/ipa/blob/main/specs/consensus_epr/consensus_epr.tla">this simplified consensus protocol</a> for selecting a value among a set of nodes via a simple leader election protocol. The protocol has 5 actions, related to nodes sending out vote requests and votes, a node processing received votes and getting elected as leader, and a leader deciding on a value. We can examine this protocol’s interaction graph as follows:</p>
<figure id="consensus-interaction-graph">
<p align="center">
<img src="https://github.com/will62794/ipa/blob/main/specs/consensus_epr/consensus_epr_interaction_graph.png?raw=true" alt="Simple Consensus Protocol Interaction Graph" width="730" />
</p>
<figcaption>Figure 1. Interaction graph for simple consensus protocol.</figcaption>
</figure>
<p>Here, we see that its interaction graph admits a simple, acyclic structure, with uni-directional dataflow between nearly all actions.
<!-- We can utilize this for accelerating verification as we discuss below. --></p>
<p>We can see another example of an interaction graph, for the two-phase commit protocol, based on its formal specification <a href="https://github.com/will62794/scimitar/blob/main/benchmarks/TwoPhase.tla">here</a>:</p>
<figure id="2pc-interaction-graph">
<p align="center">
<img src="https://github.com/will62794/ipa/blob/8cbb8e01a7640a13504a4f2577088a906ada2077/specs/TwoPhase/TwoPhase_interaction_graph.png?raw=true" alt="Two Phase Commit Protocol Interaction Graph" width="750" />
</p>
<figcaption>Figure 2. Two phase commit protocol interaction graph.</figcaption>
</figure>
<p>This interaction graph, annotated with the interaction variables along its edges, makes explicit the logical dataflow between actions of the protocol, and also suggests natural action groupings for decomposition: specifically, into the resource manager (\(RM\)) sub-component and the transaction manager (\(TM\)) sub-component, i.e.</p>
\[\small
\begin{align*}
&RM = \{RMRcvAbortMsg, RMRcvCommitMsg, RMPrepare, RMChooseToAbort\} \\
&TM = \{TMRcvPrepare, TMAbort, TMCommit\}
\end{align*}\]
<figure>
<p align="center">
<img src="https://github.com/will62794/ipa/blob/main/specs/TwoPhase/TwoPhase_interaction_graph_partitioned.png?raw=true" alt="Two Phase Commit Protocol Interaction Graph" width="750" />
</p>
<figcaption>Figure 3. Two phase commit protocol interaction graph from <a href="#2pc-interaction-graph">Figure 2</a> with partitioned components shown.</figcaption>
</figure>
<p>For example, we can note that the only outgoing dataflow from the \(RM\) set of actions is via the \(msgsPrepared\) variable, which is read via \(TMRcvPrepare\). The only incoming dataflow to the resource manager sub-component is via the \(msgsAbort\) and \(msgsCommit\) variables, which are written to by the transaction manager.</p>
<p>This matches our intuitive notions of the protocol whereby the resource manager and transaction manager behave as logically separate processes, and only interact via the relevant message channels (\(msgsAbort\), \(msgsCommit\), and \(msgsPrepared\)).</p>
<h2 id="compositional-verification">Compositional Verification</h2>
<p>The decomposition concepts above provide a way to view a protocol in terms of how its fine-grained atomic sub-components interact. We can, in some cases, utilize this structure for a kind of compositional verification when a protocol’s interaction graph is amenable.</p>
<h3 id="simple-consensus-protocol">Simple Consensus Protocol</h3>
<p>For example, we can consider the interaction graph of the simple consensus protocol from above. Its mostly acyclic interaction graph (<a href="#consensus-interaction-graph">Figure 1</a>) makes it directly amenable to a simple form of efficient, compositional verification. If we want to verify the core safety property of this protocol, \(NoConflictingValues\), which states that no two nodes decide on distinct values, we can check this with the TLC model checker in a few seconds using a model with 3 nodes (\(Node=\{n1,n2,n3\}\)), generating a reachable state space with 110,464 states.</p>
<p>From the protocol’s interaction graph, however, it is easy to see that the actions \(\{SendRequestVote, SendVote\}\) operate independently of the rest of the protocol, interacting only via writes to the \(vote\_msg\) variable. So, one approach to verifying this protocol is to start by verifying the \(\{SendRequestVote, SendVote\}\) actions independently of the rest of the protocol, and then verify the rest of the protocol against this behavior. More specifically, the overall protocol only depends on the observable behavior of this \(\{SendRequestVote, SendVote\}\) sub-component with respect to the \(vote\_msg\) variable.</p>
<p>For example, if we model check the protocol with the pruned transition relation of</p>
\[\begin{align*}
&Next_A \triangleq \\
& \quad \vee SendRequestVote\\
& \quad \vee SendVote \\
\end{align*}\]
<p>we generate 16,128 distinct reachable states, a ~7x reduction from the full state space. Now, since the only “interaction variable” between this \(Next_A\) sub-protocol and the rest of the protocol is the \(vote\_msg\) variable, we could project the state space of \(Next_A\) to the \(vote\_msg\) variable and verify the rest of the protocol against this projected state space.</p>
<p>With an explicit state model checker, we could directly compute this projection by generating and projecting the full state graph, and using this projected state graph as the “environment” under which to verify the rest of the protocol. Alternatively, we can come up with an <em>abstraction</em> of the \(Next_A\) protocol that reflects the external behavior of the interaction variable \(vote\_msg\) adequately.</p>
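<p>As a toy illustration of the explicit-state projection approach (using a small, hypothetical state-graph fragment rather than real TLC output), we can collapse states that agree on the interaction variable and lift the transitions accordingly:</p>

```python
# A sketch of projecting an explicit state graph onto a single interaction
# variable. States that agree on the projected variable collapse into one
# abstract state; transitions are lifted, dropping stuttering steps.

def project(states, transitions, var):
    # Map each concrete state (a dict of variable -> value) to its
    # (hashable) value of `var`.
    abs_state = {sid: frozenset(state[var]) for sid, state in states.items()}
    abs_states = set(abs_state.values())
    abs_transitions = {
        (abs_state[s], abs_state[t]) for (s, t) in transitions
        if abs_state[s] != abs_state[t]  # drop stuttering steps
    }
    return abs_states, abs_transitions

# Hypothetical fragment: vote_msg modeled as a set of (src, dst) pairs.
states = {
    0: {"vote_msg": set(), "voted": set()},
    1: {"vote_msg": set(), "voted": {"n1"}},          # differs only in `voted`
    2: {"vote_msg": {("n1", "n2")}, "voted": {"n1"}},
}
transitions = {(0, 1), (1, 2)}
abs_states, abs_edges = project(states, transitions, "vote_msg")
print(len(abs_states))  # states 0 and 1 collapse -> 2 abstract states
```

The projected graph is then the "environment" over \(vote\_msg\) against which the rest of the protocol could be checked.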
<p>For example, consider the following abstract model over the single \(vote\_msg\) variable that logically merges the \(SendRequestVote\) and \(SendVote\) actions into one atomic action:</p>
\[\begin{align*}
&SendRequestVote\_SendVote(src, dst) \triangleq \\
&\quad \wedge \, \nexists m \in vote\_msg : m[1] = src \\
&\quad \wedge \, vote\_msg' = vote\_msg \cup \{\langle src,dst \rangle\}
% &\quad \wedge \, \text{UNCHANGED } \langle vote\_request\_msg, voted, votes, leader, decided \rangle\\
\end{align*}\]
<p>This atomic action adds a new message into \(vote\_msg\) only if no existing node has already put such a message into \(vote\_msg\) (i.e. since nodes can’t vote twice in the original protocol).</p>
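<p>To get a feel for the abstract action's behavior, here is a small explicit-state exploration of it (a sketch, not TLC; the node names and the set representation of \(vote\_msg\) are assumptions): from each state, any node that has not yet sent a vote message may add one \((src, dst)\) pair.</p>

```python
# Explicit-state exploration of the abstract SendRequestVote_SendVote action
# over 2 nodes. The precondition mirrors the abstract spec: no existing
# message in vote_msg with the same source node.
from itertools import product

NODES = ["n1", "n2"]  # assumed node set for illustration

def successors(vote_msg):
    srcs_used = {m[0] for m in vote_msg}
    for src, dst in product(NODES, NODES):
        if src not in srcs_used:  # precondition: no existing msg from src
            yield frozenset(vote_msg | {(src, dst)})

def reachable():
    init = frozenset()  # vote_msg starts empty
    seen, frontier = {init}, [init]
    while frontier:
        s = frontier.pop()
        for t in successors(s):
            if t not in seen:
                seen.add(t)
                frontier.append(t)
    return seen

print(len(reachable()))
```

With 2 nodes this yields 9 reachable \(vote\_msg\) values: the empty set, 4 single-message states, and 4 two-message states; each node contributes at most one message, as in the original protocol.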
<p>We can formally check that this is a valid abstraction of the \(Next_A\) sub-protocol by showing a refinement between them e.g. showing that every behavior of \(Next_A\) is a valid behavior of this abstract spec:</p>
\[(Init \wedge \square [Next_A]_{vars}) \Rightarrow (Init \wedge \square [SendRequestVote\_SendVote]_{vote\_msg})\]
<p>Verifying this refinement is one way of ensuring that the abstract spec preserves the “externally observable” transitions of this sub-component (e.g. with respect to the \(vote\_msg\) variable).</p>
<p>Due to the acyclic nature of this protocol’s interaction graph, we could continue applying this compositional rule to further accelerate verification, but even with this initial reduction, we can see significant improvement. That is, now that we have an abstraction of the \(\{SendRequestVote, SendVote\}\) sub-protocol that preserves its interactions with the rest of the protocol, we can try verifying the rest of the protocol against this abstraction e.g.</p>
\[\begin{align*}
& Next_B \triangleq \\
&\quad \vee \wedge \exists i,j \in Node : SendRequestVote\_SendVote(i,j) \\
&\quad \quad \wedge \text{UNCHANGED } \langle vote\_request\_msg,voted,votes,leader,decided \rangle \\
&\quad \vee \exists i,j \in Node : RecvVote(i,j) \\
&\quad \vee \exists i \in Node, Q \in Quorum : BecomeLeader(i,Q) \\
&\quad \vee \exists i \in Node, v \in Value : Decide(i,v)
\end{align*}\]
<p>Model checking the above protocol (\(Next_B\)) with TLC produces 514 distinct reachable states, a >200x reduction from the original state space.</p>
<p>So, in this case, with only a simple dataflow/interaction analysis, we were able to reduce the largest model checking sub-problem by a factor of ~7x, since model checking of the \(Next_A\) sub-protocol was the most expensive verification sub-problem.
<!-- e.g. since we would need to verify that $$Next_A$$ is a valid abstraction of the $$\{SendRequestVote, SendVote\}$$ sub-protocol. --></p>
<!--
### Two Phase Commit Protocol
Any actions taken by `RMChooseToAbort` can only *disable* the `RMPrepare` action, right? So, from an external view, it should be safe to absorb `RMChooseToAbort` into `RMPrepare`, since it could only restrict the behaviors of `RMPrepare`, and there is not true "data flow" from `RMChooseToAbort` to `RMPrepare`, since the update expressions of`RMPrepare` don't actually depend on `RMChooseToAbort` writes?
Is there an "interaction preserving abstraction" that exists for the transaction manager sub-component in the two-phase commit protocol? Well, if we break down the protocol into transaction manager and resource manager sub-components, then we know the only interaction points between these two sub-components are via the $$\{msgsCommit, msgsAbort\}$$ (written to by RM, read by TM) and the $$msgsPrepared$$ (read by TM, written to by RM) variables.
From the perspective of the transaction manager, all it knows about is the view of the `msgsPrepared` variable, and it simply waits until it is filled up with enough resource managers. So, we can consider this abstraction of the `RM`:
$$
\begin{align*}
&RMAtomic(rm) \triangleq \\
&\quad \land msgsCommit = \{\} \\
&\quad \land msgsAbort = \{\} \\
&\quad \land msgsPrepared' = msgsPrepared \cup \{[type \mapsto Prepared, rm \mapsto rm]\} \\
&\quad \land \text{UNCHANGED } \langle tmState, tmPrepared, rmState, msgsCommit, msgsAbort \rangle
\end{align*}
$$
We can model check the original two-phase commit protocol with 4 resource managers, $$RM=\{rm1,rm2,rm3,rm4\}$$, for the main $$Consistency$$ safety property,
$$
\small
Consistency \triangleq \forall rm_1, rm_2 \in RM : \neg (rmState[rm_1] = aborted \wedge rmState[rm_2] = committed)
$$
and find that it has 1568 reachable states. If we instead model check the protocol against the $$RMAtomic$$ abstraction
$$
\begin{align*}
Next_{TwoPhase_A} &\triangleq \\
&\lor RMAtomic(rm) \triangleq \\
&\lor \exists rm \in RM : TMRcvPrepared(rm) \\
&\lor \exists rm \in RM : TMAbort(rm) \\
&\lor \exists rm \in RM : TMCommit(rm)
\end{align*}
$$
we find 163 reachable states, a ~10x reduction.
In this example, though, the original interaction between these two logical sub-components ($$RM$$ and $$TM$$) was not as simple as the acyclic dataflow of the simple consensus protocol, so just doing the above verification step is not sufficient to establish the top level safety property. That is, we actually need to show formally that this `RMAtomic` abstraction is truly "interaction preserving". That is, we need to prove that it would behave the same as the original component with respect to the interaction variables, $$\{msgsCommit, msgsAbort, msgsPrepared\}$$. In general, one way to show this would be to show that the RMAtomic component is, formally, an abstraction of the original $$RM$$ component, i.e. that $$RM$$ is a refinement of $$RMAtomic$$, roughly, that
$$
Next_{RM} \Rightarrow RMAtomic
$$
In general, though proving this refinement may be hard, and require development of auxiliary invariants to constrain the interaction variables suitably (?) to prove this step refinement condition.
-->
<!-- In this case, it is fairly easy to intuitively see why this is true. For example, we can first consider actions that write to `msgsPrepared` in the original component and the abstracted one. Only the `RMPrepare` action of the original sub-component do this, -->
<!--
## Generalized Interaction Semantics
Note that the above notions of interaction between protocol actions are based on static (i.e. syntactic) checks and so they are, in fact, conservative. That is, they may syntactically determine that two actions interact, even when they, in a semantic sense, do not. For this, we need a more general notion of "interaction".
As a concrete example, consider that even if an action $$A$$ writes to a variable that another action $$B$$ reads, this does not necessarily mean that the two actions interact. If both share variable $$x$$, and $$A$$ and $$B$$ are defined as follows:
$$
\begin{aligned}
&A \triangleq \\
&\quad \wedge \, x = 1 \\
&\quad \wedge \, x' = 2 \\[0.5em]
&B \triangleq \\
&\quad \wedge \, x = 0 \\
&\quad \wedge \, x' = 3
\end{aligned}
$$
then these two actions don't truly "interact". In a sense, the actions of $$A$$ will always be "invisible" to $$B$$, since they have no effect on whether $$B$$ is enabled/disabled or on the outcome after a $$B$$ action is taken.
Intuitively, we can say that one action $$A$$ "interacts" with another action $$B$$ if action $$A$$ can "affect" $$B$$. More concretely, $$A$$ could either:
1. Enable or disable $$B$$.
2. Affect the resulting state after a $$B$$ action is taken.
This gives rise to a more precise notion of interaction compared to our syntactic, read/write definition from above. Note that this notion of interaction we define (conversely, "independence") bears similarity to the independence notions used in classical [partial order reduction](https://www.cs.cmu.edu/~emc/15817-f08/lectures/partialorder.pdf) techniques.
Related ideas also appear in [early papers](https://www-old.cs.utah.edu/docs/techreports/2003/pdf/UUCS-03-028.pdf) on symbolic partial order reduction, which use a SAT solver to check these independence conditions.
We can formally encode the two interaction properties above for generic actions $$A_1, A_2$$, as follows, defined as temporal logic formulas stating whether $$A_1$$ "interacts with" / "affects" $$A_2$$:
<figure id="semantic-interaction">
$$
\begin{aligned}
Enabledness \triangleq& \, \square[A_1 \Rightarrow ({A_2}^{Pre} \Leftrightarrow {A_{2}^{Pre}}')]_{vars} \\
Commutativity \triangleq& \, \square[A_1 \Rightarrow (A_{2}^{Post} \Leftrightarrow {A_{2}^{Post}}')]_{vars}
\end{aligned}
$$
<figcaption>Figure 5. Semantic interaction conditions between one action and another.</figcaption>
</figure>
where $${A_i}^{Pre}$$ represents the formula of $$A_i$$'s precondition, and $$A_i^{Post}$$ represent the list of $$A_i$$'s update expressions (i.e. its postcondition expressions).
Essentially, the $$Enabledness$$ condition states that if $$A_2$$ is enabled/disabled in a current state, then after an $$A_1$$ transition, $$A_2$$ is still enabled/disabled. Similarly, $$Commutativity$$ states that if an $$A_1$$ step is taken, the update expressions of $$A_2$$ are unchanged. Note that we can in theory check these conditions symbolically or, for small enough protocols, using an explicit state tool like TLC, given we define the set of type-correct states (similar to how TLC can be [used to check inductive invariants](https://lamport.azurewebsites.net/tla/inductive-invariant.pdf)).
-->
<!--
DISABLE this semantic graph details temporarily.
For example, in the case of the simple consensus protocol from <a href="#consensus-interaction-graph">above</a>, its semantic interaction graph based on these new property definitions turns out to be the same as the one based on read/write interactions, since the read/write relationships already capture the semantic interaction accurately.
For two-phase commit, however, its semantic interaction graph differs slightly from the original one <a href="#2pc-interaction-graph">above</a>, as follows:
<figure id="2pc-semantic-interaction-graph">
<p align="center">
<img src="https://github.com/will62794/ipa/blob/8cbb8e01a7640a13504a4f2577088a906ada2077/specs/TwoPhase/TwoPhase_semantic_interaction_graph.png?raw=true" alt="Two Phase Commit Protocol Interaction Graph" width="750">
</p>
<figcaption>Figure 4. Interaction graph for the two-phase commit protocol, based on the semantic independence.</figcaption>
</figure>
We can see, for example, that the $$RMRcvAbortMsg$$ and $$RMRcvCommitMsg$$ actions are determined as interacting in the original, syntactic interaction graph, but in the refined, semantic interaction graph, they do not interact. This makes sense if we look at these underlying actions:
$$
\small
\begin{aligned}
&RMRcvCommitMsg(rm) \triangleq \\
&\quad \land \, \langle \text{Commit} \rangle \in msgsCommit \\
&\quad \land \, rmState' = [rmState \text{ EXCEPT }![rm] = \text{committed}] \\
&\quad \land \, \text{UNCHANGED } \langle tmState, tmPrepared, msgsPrepared, msgsCommit, msgsAbort \rangle \\[1em]
&RMRcvAbortMsg(rm) \triangleq \\
&\quad \land \, \langle \text{Abort} \rangle \in msgsAbort \\
&\quad \land \, rmState' = [rmState \text{ EXCEPT }![rm] = \text{aborted}] \\
&\quad \land \, \text{UNCHANGED } \langle tmState, tmPrepared, msgsPrepared, msgsCommit, msgsAbort \rangle
\end{aligned}
$$
From a naive syntactic analysis, we observe that both actions read from the $$rmState$$ variable (e.g. in their postcondition), and both write to that variable as well, so we determine that they interact. Semantically, though, the updates of both actions don't depend on the value of $$rmState$$, so writes to that variable shouldn't "affect" either actions. Thus, these two actions can be considered as semantically independent. This leads to the slightly refined version of the interaction graph shown in the [figure above](#2pc-semantic-interaction-graph), where we still include arrows representing read/write dependencies between actions, but *only* if those actions semantically interact by the conditions above.
-->
<!-- From the interaction graph [above](#2pc-semantic-interaction-graph), we can apply some simple rewrites to derive an interaction prerserving abstraction. If we take the $$RMChooseToAbort$$, we can try to rewrite this somehow to preserve its interactions with the rest of the components. It interacts with $$RMRcvAbortMsg$$ and $$RMRcvCommitMsg$$ only via $$rmState$$, and similarly for $$RMPrepare$$, which is in fact the only action that can observe its transitions. So, we what if we merge it with $$RMPrepare$$? If we do this, then we need to preserve this merged node's interaction with $$RMPrepare$$. -->
<!-- We know that $$RMChooseToAbort$$ transitions a resource manager to state `"aborted"` if that resource manager's state is currently `"working"`, so we need to preserve these externally visible transitions. The only way that $$RMChooseToAbort$$ can affect $$RMPrepare$$ -->
<!-- $$
\small
RMChooseToAbort(rm) \triangleq \neg \langle \text{Commit} \rangle \in msgsCommit \Rightarrow RMChooseToAbort(rm)
$$ -->
<!-- The above definitions provide a more precise notion of interaction between two actions, for which the syntactic checks we defined above are an overapproximation.
In practice, there may be a tradeoff between the read/write, syntactic interaction analysis and the semantic interaction analysis. The former can in theory be done statically, based only on syntactic analysis of actions, whereas the semantic notions of interaction may require some symbolic analysis e.g. checking the independence conditions properly may in general require a SAT/SMT query. In general, though, this may be worth it if the semantic interactions can help us reduce verification times significantly. Especially since these independence conditions can be generated automatically, without any kind of special synthesis or learning procedure needed (e.g. in the case of inductive/loop invariant synthesis). -->
<!-- ## Conditional Interaction
TODO. Explore conditional interaction for Paxos based ballots.
-->
<!-- ## Questions
- **TODO:** how exactly do we check that one abstraction is "interaction preserving" w.r.t some interaction variable, like in the consensus_epr example? just a refinement check?
- Note that for some interactions that are "read only", this may be even a more fine-grained distinction in the sense that the read variable may only appear in the precondition of an action, and so may only *restrict* the behavior of the component that reads from this variable.
- Can we have some simple graph-based rewriting rules that are permitted as valid "merging" of multiple actions as long as we preserve externally visible behavior?
- Can we have another finer-grained interaction notion that reflects the "can enable" vs "can disable" distinction?
- Can you also do "conditional" interaction? i.e. interaction might occur between two Raft actions in general, but may not occur between those actions executed across different term boundaries? -->
<h2 id="conclusions">Conclusions</h2>
<p>The ideas and techniques discussed above are similar to various types of compositional verification techniques that have been applied in other contexts. Similar ideas are utilized in the “interaction preserving abstraction” techniques of <a href="https://arxiv.org/abs/2202.11385">this paper</a>, and also in the work on <em><a href="https://iandardik.github.io/assets/papers/recomp_fmcad24.pdf">recomposition</a></em>, which builds similar techniques into the TLC model checker. The concept of using dataflow to analyze distributed and concurrent protocols has also appeared in past work (e.g. <a href="https://www.cs.cornell.edu/~krzys/krzys_debs2009.pdf">distributed data flow</a>), and in more <a href="https://dl.acm.org/doi/10.1145/3639257">recent work</a> on using a Datalog-like language to automatically optimize distributed protocols using pre-defined rewrite rules.</p>
<p>Note that the code used to model the protocols above and generate their associated interaction graphs can be found <a href="https://github.com/will62794/ipa">here</a>.</p>
Mon, 02 Dec 2024 00:00:00 +0000
https://will62794.github.io/distributed-systems/verification/2024/12/02/interaction-graphs.html
https://will62794.github.io/distributed-systems/verification/2024/12/02/interaction-graphs.htmldistributed-systemsverification