reublogNotes on distributed systems, .NET, and software.https://reubenbond.github.ioDeploying Wyam To GitHub Using Visual Studio Onlinehttps://reubenbond.github.io/posts/-setting-up-wyamhttps://reubenbond.github.io/posts/-setting-up-wyamHere goes nothing! This blog is built with Dave Glick's Wyam static site generator and deployed from a git repo in Visual Studio Online to GitHub Pages. Here's how to set up something similar.Tue, 03 Oct 2017 00:00:00 GMT<p>Here goes nothing! This blog is built with <a href="https://twitter.com/daveaglick">Dave Glick's</a> <a href="https://wyam.io/">Wyam</a> static site generator and deployed from a git repo in Visual Studio Online to GitHub Pages. Here's how to set up something similar.</p> <h1>Prerequisites</h1> <ul> <li>A Visual Studio Online repository for your blog source. <ul> <li>You could also have VSO pull the source from GitHub or somewhere else instead, but I haven't covered that here.</li> </ul> </li> <li>A GitHub repository which will serve the compiled output via GitHub Pages. <ul> <li>I created a repository called <a href="https://github.com/ReubenBond/reubenbond.github.io"><code>reubenbond.github.io</code></a> under my profile, <a href="https://github.com/ReubenBond/"><code>ReubenBond</code></a>.</li> </ul> </li> <li>Cake so you can test it out locally.
Install it via <a href="https://chocolatey.org/">Chocolatey</a>: <code>choco install cake.portable</code></li> </ul> <h1>Kick-starting Wyam with Cake</h1> <p>Create a file called <code>build.cake</code> in the root of your repo with these contents:</p> <pre><code>#tool nuget:?package=Wyam
#addin nuget:?package=Cake.Wyam

var target = Argument("target", "Default");

Task("Build")
    .Does(() =&gt;
    {
        Wyam(new WyamSettings
        {
            Recipe = "Blog",
            Theme = "CleanBlog",
            UpdatePackages = true
        });
    });

Task("Preview")
    .Does(() =&gt;
    {
        Wyam(new WyamSettings
        {
            Recipe = "Blog",
            Theme = "CleanBlog",
            UpdatePackages = true,
            Preview = true,
            Watch = true
        });
    });

Task("Default")
    .IsDependentOn("Build");

RunTarget(target);
</code></pre> <p>Add a file called <code>config.wyam</code> like so:</p> <pre><code>#recipe Blog
#theme CleanBlog

Settings[Keys.Host] = "yourname.github.io";
Settings[BlogKeys.Title] = "MegaBlog";
Settings[BlogKeys.Description] = "Blog of the Gods";
</code></pre> <p>Create a folder called <code>input</code> and add a folder called <code>posts</code> inside that. Now create <code>input/posts/fist-post.md</code>:</p> <pre><code>Title: Fist Post! A song of fice and ire
Published: 10/30/2017
Tags: ['Fists']
---
This post is about fists and how clumpy they always are.
</code></pre> <p>Great! Try running it using Cake. Because Wyam targets an older version of Cake at the time of writing, I'm adding the <code>--settings_skipverification=true</code> option so that Cake doesn't complain.</p> <pre><code>cake --settings_skipverification=true -target=Preview
</code></pre> <p>Open a browser to http://localhost:5080 and see the results.
The <code>Preview</code> target watches for file changes so it can automatically recompile &amp; refresh your browser whenever you save changes.</p> <h1>Automating Deployment</h1> <ol> <li>Install the <a href="https://marketplace.visualstudio.com/items?itemName=cake-build.cake">Cake build task from the Visual Studio Marketplace</a> into VSO.</li> <li>In Visual Studio Online, create a new, empty build for your repo, selecting an appropriate build agent.</li> <li>Add the Cake Build task.</li> <li>Select the <code>build.cake</code> file from the root of your repo as the <em>Cake Script</em>.</li> <li>Set the <em>Target</em> to <code>Default</code>.</li> <li>Optionally add the <code>--settings_skipverification=true</code> option to <em>Cake Arguments</em>.</li> <li>Add a new <em>PowerShell Script</em> build task, set <em>Type</em> to <code>Inline Script</code> and add these contents:</li> </ol> <pre><code>param (
    [string]$Token,
    [string]$UserName,
    [string]$Repository
)

$localFolder = "gh-pages"
$repo = "https://$($UserName):$($Token)@github.com/$($Repository).git"

git clone $repo --branch=master $localFolder
Copy-Item "output\*" $localFolder -recurse
Set-Location $localFolder
git add *
git commit -m "Update."
git push
</code></pre> <ol> <li>Create a new GitHub Personal Access token from GitHub's Developer Settings page, or by <a href="https://github.com/settings/tokens/new">clicking here</a>.
I added all of the <code>repo</code> permissions to the token.</li> <li>In VSO, add arguments for the script, replacing <code>TOKEN</code> with your token and replacing the other values as appropriate:</li> </ol> <pre><code>-Token TOKEN -UserName "ReubenBond" -Repository "ReubenBond/reubenbond.github.io"
</code></pre> <ol> <li>On the <em>Triggers</em> pane, enable Continuous Integration.</li> <li>Click <em>Save &amp; queue</em>, then cross your fingers.</li> </ol> <p>Hopefully that's it and you can now add new blog posts to the <code>input/posts</code> directory.</p> Reuben BondCode Generation on .NEThttps://reubenbond.github.io/posts/codegen-1https://reubenbond.github.io/posts/codegen-1A brief overview of code generation APIs in .NETWed, 01 Nov 2017 00:00:00 GMT<p><em>This is the first part in what's hopefully a series of short posts covering code generation on the .NET platform.</em></p> <p>Almost every .NET application relies on code generation in some form, usually because it relies on a library which generates code as a part of how it functions. Eg, Json.NET <a href="https://github.com/JamesNK/Newtonsoft.Json/blob/473a7721bd67cca8fef1ecc37da1951a1c180022/Src/Newtonsoft.Json/Utilities/DynamicReflectionDelegateFactory.cs">leverages code generation</a>, and so do <a href="https://github.com/aspnet/MvcPrecompilation">ASP.NET</a>, Entity Framework, <a href="https://github.com/dotnet/orleans">Orleans</a>, most serialization libraries, many dependency injection libraries, and probably every test mocking library.</p> <p>Let's skip past <em>why</em> code generation is useful and jump straight into a high level overview of code generation technologies for .NET.</p> <h2>Kinds of Code Generation</h2> <p>The 3 code gen methods for .NET which we'll discuss are: <strong>Expression Trees</strong>, <strong>IL Generation</strong>, and <strong>Syntax Generation</strong>. There are other methods, such as text templating (eg using T4).
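</p> <p>To make the first of these concrete, here's a minimal sketch (a standalone illustration, not code from any of the libraries above) of building and compiling a LINQ expression tree at runtime:</p> <pre><code>using System;
using System.Linq.Expressions;

// Build the expression (x, y) =&gt; x * y + 1 by hand, then compile it
// into an ordinary delegate at runtime.
var x = Expression.Parameter(typeof(int), "x");
var y = Expression.Parameter(typeof(int), "y");
var body = Expression.Add(Expression.Multiply(x, y), Expression.Constant(1));
var multiplyAddOne = Expression.Lambda&lt;Func&lt;int, int, int&gt;&gt;(body, x, y).Compile();

Console.WriteLine(multiplyAddOne(3, 4)); // 13
</code></pre> <p>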
Here are the pros and cons of each as I see them.</p> <h3>Expression Trees</h3> <p>Using <strong>LINQ Expression Trees</strong> to compile expressions at runtime.</p> <p>:::tip Easy to use, expressive, and often the most approachable place to start when you need runtime code generation. :::</p> <p>:::caution Expression trees are interpreted on AOT-only platforms like iOS, and some language constructs simply are not available. :::</p> <h3>IL Generation</h3> <p>Using <strong>Reflection.Emit</strong> to dynamically create types and methods using Common Intermediate Language (known as CIL or just IL), which is the assembly language of the CLR.</p> <p>:::tip IL generation can produce code which cannot be expressed in C#, such as direct access to private members. :::</p> <p>:::warning The trade-off is ergonomics: IL is verbose, awkward to debug, difficult to use for higher-level features like <code>async</code>/<code>await</code>, and unavailable on AOT-only platforms. :::</p> <h3>Syntax Generation</h3> <p>Using <strong>Roslyn</strong> or some other API to generate C# syntax trees or source code and compile it either at runtime or when the target project is built.</p> <p>:::tip Syntax generation gives you direct access to the full C# language and works well on AOT-only platforms because the output is plain source code. :::</p> <p>:::note The API can feel indirect because it was designed for parsing and compilation first, not authoring, and runtime scenarios mean shipping Roslyn with your app. :::</p> <h2>Orleans</h2> <p><a href="https://github.com/dotnet/orleans">Microsoft Orleans</a> uses the latter two approaches: IL and Roslyn. It uses Roslyn wherever possible, since it allows for easy access to C# language features like <code>async</code> and since it's easy to comprehend both the code generator and the generated code. Otherwise, IL generation is used for two things:</p> <ol> <li>Generating code at runtime.
For example <a href="https://github.com/dotnet/orleans/blob/375a98191ca40c27ca8ed61199a6a77a7995e75e/src/Orleans.Core/Serialization/ILSerializerGenerator.cs"><code>ILSerializerGenerator</code></a> generates serializers as a last resort for types for which C# serializers could not be generated (for example, private inner classes). It's a faster and less restricted alternative to .NET's <a href="https://msdn.microsoft.com/en-us/library/system.runtime.serialization.formatters.binary.binaryformatter(v=vs.110).aspx"><code>BinaryFormatter</code></a>.</li> <li>Producing code which cannot be expressed in C#. For example, <a href="https://github.com/dotnet/orleans/blob/375a98191ca40c27ca8ed61199a6a77a7995e75e/src/Orleans.Core.Abstractions/Serialization/FieldUtils.cs"><code>FieldUtils</code></a> provides access to private fields and methods for serialization.</li> </ol> <h2>General Strategy</h2> <p>Regardless of which technology a library makes use of, code generation typically involves two phases:</p> <ol> <li>Metadata Collection <ul> <li>The code generator takes some input and creates an abstract representation of it in order to drive the code synthesis process.</li> <li>Eg, a library for deeply cloning objects might take a <code>Type</code> as input and generate an object describing each field in that type.</li> </ul> </li> <li>Code Synthesis <ul> <li>The code generator uses the metadata model to drive the process of actually generating code (LINQ expressions, IL instructions, syntax tree nodes).</li> <li>Eg, our deep cloning library will generate a method which takes an object of the specified type from the metadata model and then recursively copies each of the fields.</li> </ul> </li> </ol> <p>The two phases can be merged for simple code generators. Orleans uses two phases. In phase 1, the input assembly is scanned and metadata is collected for types matching various criteria: Grain classes, Grain interfaces, serializable types, and custom serializer registrations.
In phase 2, support classes are generated. For example, each grain interface has two classes generated: an RPC proxy and an RPC stub.</p> <h2>Conclusion</h2> <p>That's enough for now. Maybe next time we'll take a look at writing that hypothetical deep cloning library using IL generation. After that, we can take a look at a serialization library I've been working on which uses Roslyn for both metadata collection and syntax generation. If either of those things are interesting to you, let me know here or on <a href="https://twitter.com/reubenbond">Twitter</a>.</p> <p>:::important If IL generation is the piece you want to see in practice, the next post walks through a deep-copy implementation step by step. :::</p> <p><a href="/posts/codegen-2-il-boogaloo"><strong>Next Post: .NET IL Generation - Writing DeepCopy</strong></a></p> Reuben Bond.NET IL Generation - Writing DeepCopyhttps://reubenbond.github.io/posts/codegen-2-il-boogaloohttps://reubenbond.github.io/posts/codegen-2-il-boogalooImplementing a powerful object cloning library using IL generation.Sat, 04 Nov 2017 00:00:00 GMT<p><em>This is the second part in a series of short posts covering code generation on the .NET platform.</em></p> <h3>IL Generation</h3> <p><a href="/posts/codegen-1">Last time</a>, we skimmed over some methods to generate code on .NET and one of them was emitting IL. IL generation lets us circumvent the rules C# and other languages put in place to protect us from our own stupidity. Without those rules, we can implement all kinds of fancy foot guns. Rules like “don't access private members of foreign types” and “don't modify <code>readonly</code> fields”. That last one is interesting: C#'s <code>readonly</code> translates into <code>initonly</code> on the IL/metadata level so theoretically we shouldn't be able to modify those fields even using IL. As a matter of fact we can, but it comes at a cost: <strong>our IL will no longer be verifiable</strong>. 
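</p> <p>To make that concrete, here's a hedged sketch (hypothetical types, not from any particular library) of a <code>DynamicMethod</code> which writes to a <code>readonly</code> (<code>initonly</code>) field:</p> <pre><code>using System;
using System.Reflection.Emit;

public sealed class Target
{
    public readonly int Value = 1;
}

public static class ReadonlyWriter
{
    public static void Demo()
    {
        var method = new DynamicMethod(
            "SetReadonlyValue",
            null,                                   // return type: void
            new[] { typeof(Target), typeof(int) },  // (Target, int) parameters
            typeof(Target).Module,
            true);                                  // skipVisibility: ignore accessibility checks

        var il = method.GetILGenerator();
        il.Emit(OpCodes.Ldarg_0);                                  // load the target instance
        il.Emit(OpCodes.Ldarg_1);                                  // load the new value
        il.Emit(OpCodes.Stfld, typeof(Target).GetField("Value"));  // store to the initonly field
        il.Emit(OpCodes.Ret);

        var setter = (Action&lt;Target, int&gt;)method.CreateDelegate(typeof(Action&lt;Target, int&gt;));
        var target = new Target();
        setter(target, 42); // target.Value is now 42, despite readonly
    }
}
</code></pre> <p>The delegate runs just fine, but the IL inside it is unverifiable.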
That means that certain tools will bark at you if you try to write IL code which commits this sin, tools such as <a href="https://docs.microsoft.com/en-us/dotnet/framework/tools/peverify-exe-peverify-tool">PEVerify</a> and <a href="https://github.com/dotnet/corert/tree/master/src/ILVerify">ILVerify</a>. Verifiable code also has ramifications for <a href="https://docs.microsoft.com/en-us/dotnet/framework/misc/security-transparent-code">Security-Transparent Code</a>. Thankfully for us, Code Access Security and Security Transparent Code <a href="https://github.com/dotnet/corefx/blob/master/Documentation/project-docs/porting.md#code-access-security-cas">don't exist in .NET Core</a> and they usually don't cause issues for .NET Framework.</p> <p>Enough stalling, onto our mission briefing.</p> <h3>DeepCopy</h3> <p>Today we're going to implement the guts of a library for creating deep copies of objects. Essentially it provides one method:</p> <pre><code>public static T Copy&lt;T&gt;(T original);
</code></pre> <p>Our library will be called <em>DeepCopy</em> and the source is up on GitHub at <a href="https://github.com/ReubenBond/DeepCopy">ReubenBond/DeepCopy</a>; feel free to mess about with it. The majority of the code was adapted from the <a href="https://github.com/dotnet/orleans">Orleans</a> codebase.</p> <p>Deep copying is important for frameworks such as <a href="https://github.com/dotnet/orleans">Orleans</a>, since it allows us to safely send mutable objects between grains on the same node without having to first serialize &amp; then deserialize them, among other things. Of course, immutable objects (such as strings) are shared without copying.
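</p> <p>In use, it looks something like this (assuming a static <code>DeepCopier</code> class hosts the <code>Copy</code> method; <code>Person</code> is a made-up type for illustration):</p> <pre><code>public class Person
{
    public string Name;
    public List&lt;Person&gt; Friends = new List&lt;Person&gt;();
}

var original = new Person { Name = "Reuben" };
original.Friends.Add(original); // the person is their own friend

Person copy = DeepCopier.Copy(original);
// 'copy' is a distinct object graph, and the self-reference survives:
// ReferenceEquals(copy.Friends[0], copy) == true
</code></pre> <p>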
Oddly enough, serializing then deserializing an object is the <a href="https://stackoverflow.com/a/78612/635314">accepted Stack Overflow answer</a> to the question of “how can I deep copy an object?”.</p> <p>Let's see if we can fix that.</p> <h3>Battle Plan</h3> <p>The <code>Copy</code> method will recursively copy every field in the input object into a new instance of the same type. It must be able to deal with multiple references to the same object, so that if the user provides an object which contains a reference to itself then the result will also contain a reference to itself. That means we'll need to perform reference tracking. That's easy to do: we maintain a <code>Dictionary&lt;object, object&gt;</code> which maps from original object to copy object. Our main <code>Copy&lt;T&gt;(T orig)</code> method will call into a helper method with a <code>CopyContext</code> holding that dictionary as a parameter:</p> <pre><code>public static T Copy&lt;T&gt;(T original, CopyContext context) { /* TODO: implementation */ }
</code></pre> <p>The copy routine is roughly as follows:</p> <ul> <li>If the input is <code>null</code>, return <code>null</code>.</li> <li>If the input has already been copied (or is currently being copied), return its copy.</li> <li>If the input is 'immutable', return the input.</li> <li>If the input is an array, copy each element into a new array and return it.</li> <li>Create a new instance of the input type and recursively copy each field from the input to the output and return it.</li> </ul> <p>Our definition of immutable is simple: the type is either a primitive or it's marked using a special <code>[Immutable]</code> attribute. More elaborate immutability could probably be soundly implemented, so <a href="https://github.com/ReubenBond/DeepCopy/pull/new/master">submit a PR</a> if you've improved upon it.</p> <p>Everything but the last step in our routine is simple enough to do without generating code.
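</p> <p>As a baseline, here's what that last step looks like with plain reflection (a hedged sketch, not the library's actual code; it assumes the <code>Copy</code> and <code>CopyContext.RecordObject</code> helpers described above):</p> <pre><code>// Naive field-by-field copy using reflection. Correct, but slow on hot paths.
private static object ReflectionCopy(object original, CopyContext context)
{
    var type = original.GetType();
    var result = FormatterServices.GetUninitializedObject(type);
    context.RecordObject(original, result);

    var fields = type.GetFields(
        BindingFlags.Instance | BindingFlags.Public | BindingFlags.NonPublic);
    foreach (var field in fields)
    {
        // Recursively deep-copy each field value into the new instance.
        field.SetValue(result, Copy(field.GetValue(original), context));
    }

    return result;
}
</code></pre> <p>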
The last step, recursively copying each field, can be performed using reflection to get and set field values. Reflection is a real performance killer on the hot path, though, and so we're going to go our own route using IL.</p> <h3>Diving Into The Code</h3> <p>The main IL generation in <em>DeepCopy</em> occurs inside <a href="https://github.com/ReubenBond/DeepCopy/blob/1b00515b6b6aece93b4bea61bf40780265c2e349/src/DeepCopy/CopierGenerator.cs#L52"><code>CopierGenerator.cs</code></a> in the <code>CreateCopier&lt;T&gt;(Type type)</code> method. Let's walk through it:</p> <p>First we create a new <code>DynamicMethod</code> which will hold the IL code we emit. We have to tell <code>DynamicMethod</code> the signature of the method we're creating. In our case, it's a generic delegate type, <code>delegate T DeepCopyDelegate&lt;T&gt;(T original, CopyContext context)</code>. Then we get the <code>ILGenerator</code> for the method so that we can begin emitting IL code to it.</p> <pre><code>var dynamicMethod = new DynamicMethod(
    type.Name + "DeepCopier",
    typeof(T), // The return type of the delegate
    new[] {typeof(T), typeof(CopyContext)}, // The parameter types of the delegate.
    typeof(CopierGenerator).Module,
    true);
var il = dynamicMethod.GetILGenerator();
</code></pre> <p>The IL is going to be rather complicated because it needs to deal with immutable types and value types, but let's walk through it bit-by-bit.</p> <pre><code>// Declare a variable to store the result.
il.DeclareLocal(type);
</code></pre> <p>Next we need to initialize our new local variable to a new instance of the input type. There are 3 cases to consider, each corresponding to a block in the following code:</p> <ul> <li>The type is a value type (struct). Initialize it by essentially using a <code>default(T)</code> expression.</li> <li>The type has a parameterless constructor. Initialize it by calling <code>new T()</code>.</li> <li>The type does not have a parameterless constructor.
In this case we ask the framework for help and we call <code>FormatterServices.GetUninitializedObject(type)</code>.</li> </ul> <pre><code>// Construct the result.
var constructorInfo = type.GetConstructor(Type.EmptyTypes);
if (type.IsValueType)
{
    // Value types can be initialized directly.
    // C#: result = default(T);
    il.Emit(OpCodes.Ldloca_S, (byte)0);
    il.Emit(OpCodes.Initobj, type);
}
else if (constructorInfo != null)
{
    // If a default constructor exists, use that.
    // C#: result = new T();
    il.Emit(OpCodes.Newobj, constructorInfo);
    il.Emit(OpCodes.Stloc_0);
}
else
{
    // If no default constructor exists, create an instance using GetUninitializedObject.
    // C#: result = (T)FormatterServices.GetUninitializedObject(type);
    var field = this.fieldBuilder.GetOrCreateStaticField(type);
    il.Emit(OpCodes.Ldsfld, field);
    il.Emit(OpCodes.Call, this.methodInfos.GetUninitializedObject);
    il.Emit(OpCodes.Castclass, type);
    il.Emit(OpCodes.Stloc_0);
}
</code></pre> <h3>Interlude - What IL Should We Emit?</h3> <p>Even if you're not a first-timer with IL, it's not always easy to work out what IL you need to emit to achieve the desired result. This is where tools come in to help you. Personally I typically write my code in C# first, slap it into <a href="https://www.linqpad.net/">LINQPad</a>, hit run and open the IL tab in the output. It's great for experimenting.</p> <p><img src="/images/linqpad-il.png" alt="LINQPad is seriously handy!" title="LINQPad makes quick experiments with generated IL easy to inspect" /></p> <p>Another option is to use a decompiler/disassembler like <a href="https://www.jetbrains.com/decompiler/">JetBrains' dotPeek</a>. You would compile your assembly and open it in dotPeek to reveal the IL.</p> <p>Finally, if you're like me, then <a href="https://www.jetbrains.com/resharper/">ReSharper</a> is indispensable. It's like coding on rails (train tracks, not Ruby).
ReSharper comes with a convenient <a href="https://www.jetbrains.com/help/resharper/Viewing_Intermediate_Language.html">IL Viewer</a>.</p> <p><img src="/images/resharper-il.png" alt="ReSharper IL Viewer" title="ReSharper helps inspect the IL produced by a compiled assembly" /></p> <p>Alright, so that's how you work out what IL to generate. You'll occasionally want to <a href="https://msdn.microsoft.com/en-us/library/system.reflection.emit.opcodes(v=vs.110).aspx">visit the docs</a>, too.</p> <h3>Back To Emit</h3> <p>Now we have a new instance of the input type stored in our local result variable. Before we do anything else, we must record the newly created reference. We push each argument onto the stack in order and use the non-virtual <code>Call</code> op-code to invoke <code>context.RecordObject(original, result)</code>. We can use the non-virtual <code>Call</code> op-code to call <code>CopyContext.RecordObject</code> because <code>CopyContext</code> is a <code>sealed</code> class. If it wasn't, we would use <code>Callvirt</code> instead.</p> <pre><code>// An instance of a value type can never appear multiple times in an object graph,
// so only record reference types in the context.
if (!type.IsValueType)
{
    // Record the object.
    // C#: context.RecordObject(original, result);
    il.Emit(OpCodes.Ldarg_1); // context
    il.Emit(OpCodes.Ldarg_0); // original
    il.Emit(OpCodes.Ldloc_0); // result, i.e., the copy of original
    il.Emit(OpCodes.Call, this.methodInfos.RecordObject);
}
</code></pre> <p>On to the meat of our generator! With the accounting out of the way, we can enumerate over each field and generate code to copy each one into our <code>result</code> variable. The comments narrate the process:</p> <pre><code>// Copy each field.
foreach (var field in this.copyPolicy.GetCopyableFields(type))
{
    // Load a reference to the result.
    if (type.IsValueType)
    {
        // Value types need to be loaded by address rather than copied onto the stack.
        il.Emit(OpCodes.Ldloca_S, (byte)0);
    }
    else
    {
        il.Emit(OpCodes.Ldloc_0);
    }

    // Load the field from the original.
    il.Emit(OpCodes.Ldarg_0);
    il.Emit(OpCodes.Ldfld, field);

    // Deep-copy the field if needed, otherwise just leave it as-is.
    if (!this.copyPolicy.IsShallowCopyable(field.FieldType))
    {
        // Copy the field using the generic Copy&lt;T&gt; method.
        // C#: Copy&lt;T&gt;(field)
        il.Emit(OpCodes.Ldarg_1);
        il.Emit(OpCodes.Call, this.methodInfos.CopyInner.MakeGenericMethod(field.FieldType));
    }

    // Store the copy of the field on the result.
    il.Emit(OpCodes.Stfld, field);
}
</code></pre> <p>Return the result and build our delegate using <code>CreateDelegate</code> so that we can start using it immediately.</p> <pre><code>// C#: return result;
il.Emit(OpCodes.Ldloc_0);
il.Emit(OpCodes.Ret);

return dynamicMethod.CreateDelegate(typeof(DeepCopyDelegate&lt;T&gt;)) as DeepCopyDelegate&lt;T&gt;;
</code></pre> <p>That's the guts of the library. Of course many details were left out, such as:</p> <ul> <li>Caching <code>Type</code> values in static fields so that we can reference them from our generated code.
See <a href="https://github.com/ReubenBond/DeepCopy/blob/1b00515b6b6aece93b4bea61bf40780265c2e349/src/DeepCopy/StaticFieldBuilder.cs#L64"><code>StaticFieldBuilder.cs</code></a>.</li> <li>The special handling of arrays in <a href="https://github.com/ReubenBond/DeepCopy/blob/1b00515b6b6aece93b4bea61bf40780265c2e349/src/DeepCopy/DeepCopier.cs#L69"><code>DeepCopier.cs</code></a>.</li> <li>Optimizations such as using <a href="https://github.com/ReubenBond/DeepCopy/blob/master/src/DeepCopy/CachedReadConcurrentDictionary.cs"><code>CachedReadConcurrentDictionary&lt;TKey, TValue&gt;</code></a> for a slight improvement over <code>ConcurrentDictionary&lt;TKey, TValue&gt;</code> for workloads with a diminishing write volume.</li> </ul> Reuben BondPerformance Tuning for .NET Corehttps://reubenbond.github.io/posts/dotnet-perf-tuninghttps://reubenbond.github.io/posts/dotnet-perf-tuningSome of you may know I've been spending whatever time I can scrounge together grinding away at a new serialization library for .NET. Serializers can be complicated beasts. They have to be reliable, flexible, and fast beyond reproach. I won't convince you that serialization libraries have to be quick — in this post, that's a given. These are some tips from my experience in optimizing Hagar's performance. Most of this advice is applicable to other types of libraries or applications.Tue, 15 Jan 2019 00:00:00 GMT<p>Some of you may know I've been spending whatever time I can scrounge together grinding away at a new serialization library for .NET. Serializers can be complicated beasts. They have to be reliable, flexible, and fast beyond reproach. I won't convince you that serialization libraries have to be quick — in this post, that's a given. These are some tips from my experience in optimizing <a href="https://github.com/ReubenBond/Hagar">Hagar</a>'s performance. 
<strong>Most of this advice is applicable to other types of libraries or applications.</strong></p> <p>A post on performance should have minimal overhead and get straight to the point, so this post focuses on tips to help you and things to look out for. <a href="https://twitter.com/reubenbond">Message me on Twitter</a> if something is unclear or you have something to add.</p> <p>:::note This post is intentionally heuristic-heavy: each tip is aimed at hot paths where nanoseconds and allocations add up under load. :::</p> <h2>Maximize profitable inlining</h2> <p>Inlining is the technique where a method body is copied to the call site so that we can avoid the cost of jumping, argument passing, and register saving/restoring. In addition to saving those costs, inlining is a requirement for other optimizations. Roslyn (C#'s compiler) does not inline code. Instead, it is the responsibility of the JIT, as are most optimizations.</p> <p>:::tip Inlining is rarely about saving the call itself. The bigger win is that once a method is inlined, the JIT can see more of the surrounding code and unlock other optimizations. :::</p> <h3>Use static <em>throw helpers</em></h3> <p>A recent change which involved a significant refactor added around 20ns to the call duration for the serialization benchmark, increasing times from ~130ns to ~150ns (which is significant).</p> <p>The culprit was the <code>throw</code> statement added in this helper method:</p> <pre><code>public static Writer&lt;TBufferWriter&gt; CreateWriter&lt;TBufferWriter&gt;(
    this TBufferWriter buffer,
    SerializerSession session)
    where TBufferWriter : IBufferWriter&lt;byte&gt;
{
    if (session == null) throw new ArgumentNullException(nameof(session));
    return new Writer&lt;TBufferWriter&gt;(buffer, session);
}
</code></pre> <p>When a method contains a <code>throw</code> statement, the JIT will not inline it.
The common trick to solve this is to add a static "throw helper" method to do the dirty work for you, so the end result looks like this:</p> <pre><code>public static Writer&lt;TBufferWriter&gt; CreateWriter&lt;TBufferWriter&gt;(
    this TBufferWriter buffer,
    SerializerSession session)
    where TBufferWriter : IBufferWriter&lt;byte&gt;
{
    if (session == null) ThrowSessionNull();
    return new Writer&lt;TBufferWriter&gt;(buffer, session);

    void ThrowSessionNull() =&gt; throw new ArgumentNullException(nameof(session));
}
</code></pre> <p>Crisis averted. The codebase uses this trick in many places. Having the <code>throw</code> statement in a separate method may have other benefits such as improving the locality of your commonly used code paths, but I'm unsure and haven't measured the impact.</p> <h3>Minimize virtual/interface calls</h3> <p>Virtual calls are slower than direct calls. If you're writing a performance critical system then there's a good chance you'll see virtual call overhead show up in the profiler. For one, virtual calls require indirection.</p> <p>Devirtualization is a feature of many JIT compilers, and RyuJIT is no exception. It's a complicated feature, though, and there are not many cases where RyuJIT can currently <em>prove</em> (to itself) that a method can be devirtualized and therefore become a candidate for inlining. Here are a couple of general tips for taking advantage of devirtualization, but I'm sure there are more (so let me know if you have any).</p> <ul> <li>Mark classes as <code>sealed</code> by default. When a class/method is marked as <code>sealed</code>, RyuJIT can take that into account and is likely able to inline a method call.</li> <li>Mark <code>override</code> methods as <code>sealed</code> if possible.</li> <li>Use concrete types instead of interfaces.
Concrete types give the JIT more information, so it has a better chance of being able to inline your call.</li> <li>Instantiate and use non-sealed objects in the same method (rather than having a 'create' method). RyuJIT can devirtualize non-sealed method calls when the type is definitely known, such as immediately after construction.</li> <li>Use generic type constraints for polymorphic types so that they can be specialized using a concrete type and interface calls can be devirtualized. In Hagar, our core writer type is defined as follows:</li> </ul> <pre><code>public ref struct Writer&lt;TBufferWriter&gt; where TBufferWriter : IBufferWriter&lt;byte&gt;
{
    private TBufferWriter output;

    // --- etc ---
</code></pre> <p>All calls to methods on <code>output</code> in the CIL which Roslyn emits will be preceded by a <code>constrained</code> instruction which tells the JIT that instead of making a virtual/interface call, the call can be made to the precise method defined on <code>TBufferWriter</code>. This helps with devirtualization. All calls to methods defined on <code>output</code> are successfully devirtualized as a result. Here's <a href="https://github.com/dotnet/coreclr/issues/9908">a CoreCLR thread by Andy Ayers</a> on the JIT team which details current and future work for devirtualization.</p> <h2>Reduce allocations</h2> <p>.NET's garbage collector is a remarkable piece of engineering. GC allows for algorithmic optimizations for some lock-free data structures and also removes whole classes of bugs and lightens the developer's cognitive load.
All things considered, garbage collection is a <em>tremendously</em> successful technique for memory management.</p> <p>However, while the GC is a powerful workhorse, it helps to lighten its load, not only because it means your application will pause for collection less often (and more generally, less CPU time will be devoted to GC work), but also because a lighter working set is beneficial for cache locality.</p> <p>The rule-of-thumb for allocations is that they should either die in the first generation (Gen0) or live forever in the last (Gen2).</p> <p>:::important A useful rule of thumb is that allocations should either die young in Gen0 or live long enough to justify promotion. The awkward middle is where GC overhead tends to hurt. :::</p> <p>.NET uses a bump allocator where each thread allocates objects from its per-thread context by 'bumping' a pointer. For this reason, better cache locality can be achieved for short-lived allocations when they are allocated and used on the same thread.</p> <p>For more info on .NET's GC, see <a href="https://twitter.com/matthewwarren">Matt Warren</a>'s blog post series, <a href="http://mattwarren.org/2016/02/04/learning-how-garbage-collectors-work-part-1/"><em>Learning How Garbage Collectors Work</em></a> here and pre-order <a href="https://twitter.com/konradkokosa">Konrad Kokosa</a>'s book, <a href="https://prodotnetmemory.com/"><em>Pro .NET Memory Management</em> here</a>. Also check out his fantastic free <a href="https://prodotnetmemory.com/data/netmemoryposter.pdf">.NET memory management poster here</a>, it's a great reference.</p> <h3>Pool buffers/objects</h3> <p>Hagar itself doesn't manage buffers but instead defers the responsibility to the user. This might sound onerous but it's not, since it's compatible with <a href="https://blogs.msdn.microsoft.com/dotnet/2018/07/09/system-io-pipelines-high-performance-io-in-net/"><code>System.IO.Pipelines</code></a>.
Therefore, we can take advantage of the high performance buffer pooling which the default <code>Pipe</code> provides by means of <code>System.Buffers.ArrayPool&lt;T&gt;</code>.</p> <p>Generally speaking, reusing buffers lets you put much less pressure on the GC - your users will be thankful. Don't write your own buffer pool unless you truly need to, though - those times have passed.</p> <p>:::caution Reach for <code>ArrayPool&lt;T&gt;</code> or <code>System.IO.Pipelines</code> before building your own pool. Custom pooling code is easy to get subtly wrong and hard to benchmark honestly. :::</p> <h3>Avoid boxing</h3> <p>Wherever possible, do not box value types by casting them to a reference type. This is common advice, but it requires some consideration in your API design. In Hagar, interface and method definitions which might accept value types are made generic so that they can be specialized to the precise type and avoid boxing/unboxing costs. As a result, there is no hot-path boxing. Boxing is still present in some cases, such as string formatting for exception messages. Those particular boxing allocations can be removed by explicit <code>.ToString()</code> calls on the arguments.</p> <p>:::warning Boxing on a hot path is easy to miss because the code still looks clean. Generic APIs often pay for themselves here by letting the JIT specialize away the allocation. :::</p> <h3>Reduce closure allocations</h3> <p>Allocate closures only once and store the result for repeated use. For example, it's common to pass a delegate to <code>ConcurrentDictionary&lt;K, V&gt;.GetOrAdd</code>. Instead of writing the delegate as an inline lambda, define it as a private field on the class.
Here's an example from the optional <code>ISerializable</code> support package in Hagar:</p> <pre><code>private readonly Func&lt;Type, Action&lt;object, SerializationInfo, StreamingContext&gt;&gt; createConstructorDelegate; public ObjectSerializer(SerializationConstructorFactory constructorFactory) { // Other parameters/statements omitted. this.createConstructorDelegate = constructorFactory.GetSerializationConstructorDelegate; } // Later, on a hot code path: var constructor = this.constructors.GetOrAdd(info.ObjectType, this.createConstructorDelegate); </code></pre> <h2>Minimize copying</h2> <p>.NET Core 2.0 and 2.1, along with recent C# versions, have made considerable strides in allowing library developers to eliminate data copying. The most notable addition is <code>Span&lt;T&gt;</code>, but it's also worth mentioning <code>in</code> parameter modifiers and <code>readonly struct</code>.</p> <h3>Use <code>Span&lt;T&gt;</code> to avoid array allocations and data copying</h3> <p><code>Span&lt;T&gt;</code> and friends are a gigantic performance win for .NET, particularly .NET Core where they use an optimized representation to reduce their size, which required adding GC support for interior pointers. Interior pointers are managed references which point within the bounds of an array, as opposed to only being able to point to the first element and therefore requiring an additional field containing an offset into the array. For more info on <code>Span&lt;T&gt;</code> and friends, read Stephen Toub's article, <a href="https://msdn.microsoft.com/en-us/magazine/mt814808.aspx"><em>All About Span: Exploring a New .NET Mainstay</em></a>.</p> <p>Hagar makes extensive use of <code>Span&lt;T&gt;</code> because it allows us to cheaply create views over small sections of larger buffers to work with.
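</p> <p>As a small, self-contained illustration of the idea - with hypothetical names, not Hagar's actual API - the following rents a buffer from the shared pool, then slices a no-copy window over a few bytes of it to read a field without any intermediate array allocation:</p>

```csharp
using System;
using System.Buffers;
using System.Buffers.Binary;

// Rent a large buffer from the shared pool; no new array is allocated
// if a suitable one is available for reuse. Note: rented buffers are
// not cleared, so write before you read.
byte[] pooled = ArrayPool<byte>.Shared.Rent(256);
uint value;
try
{
    // Write a field at some offset inside the larger rented buffer...
    BinaryPrimitives.WriteUInt32LittleEndian(pooled.AsSpan(16, 4), 42u);

    // ...then take a cheap, no-copy window over just those 4 bytes.
    ReadOnlySpan<byte> field = pooled.AsSpan(16, 4);
    value = BinaryPrimitives.ReadUInt32LittleEndian(field);
}
finally
{
    ArrayPool<byte>.Shared.Return(pooled);
}

Console.WriteLine(value); // 42
```

<p>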
Enough has been written on the subject that there's little use in me writing more here.</p> <h3>Pass structs by <code>ref</code> to minimize on-stack copies</h3> <p>Hagar uses two main structs, <code>Reader</code> &amp; <code>Writer&lt;TOutputBuffer&gt;</code>. These structs each contain several fields and are passed to almost every call along the serialization/deserialization call path.</p> <p>Without intervention, each method call made with these structs would carry significant cost, since the entire struct would need to be copied onto the stack for every call, not to mention that any mutations would need to be copied back to the caller.</p> <p>We can avoid that cost by passing these structs as <code>ref</code> parameters. C# also supports using <code>ref this</code> as the target for an extension method, which is very convenient. As far as I know, there's no way to ensure that a particular struct type is always passed by ref, and this can lead to subtle bugs: if you accidentally omit <code>ref</code> in the parameter list of a call, the struct will be silently copied and modifications made by a method (eg, advancing a write pointer) will be lost.</p> <h3>Avoid defensive copies</h3> <p>The compiler sometimes has to do extra work to guarantee language invariants. When a <code>struct</code> is stored in a <code>readonly</code> field, the compiler will insert instructions to <em>defensively copy</em> that field before involving it in any operation which isn't guaranteed to <em>not</em> mutate it.
Typically this means calls to methods defined on the struct type itself, because passing a struct as an argument to a method defined on another type already requires copying the struct onto the stack (unless it's passed by <code>ref</code> or <code>in</code>).</p> <p>This defensive copy can be avoided if the struct is defined as a <code>readonly struct</code>, which is a C# 7.2 language feature, enabled by adding <code>&lt;LangVersion&gt;7.2&lt;/LangVersion&gt;</code> to your csproj file.</p> <p>If you cannot define the struct itself as a <code>readonly struct</code>, it is sometimes better to omit the <code>readonly</code> modifier from otherwise immutable fields of that type.</p> <p>See Jon Skeet's NodaTime library as an example. In <a href="https://github.com/nodatime/nodatime/pull/1130">this PR</a>, Jon made most structs <code>readonly</code> and was therefore able to add the <code>readonly</code> modifier to fields holding those structs without negatively impacting performance.</p> <h2>Reduce branching &amp; branch misprediction</h2> <p>Modern CPUs rely on long pipelines of instructions which are processed with some concurrency. This involves analyzing instructions to determine which ones don't depend on previous instructions, and it involves guessing which conditional jumps will be taken. To make these guesses, the CPU uses a component called the branch predictor, which typically reads &amp; writes entries in a table, revising its prediction based upon what happened the last time each conditional jump was executed.</p> <p>When it guesses correctly, this prediction process provides a substantial speedup.
When it mispredicts the branch (jump target), however, it needs to throw out all of the work performed in processing instructions after the branch and re-fill the pipeline with instructions from the correct branch before continuing execution.</p> <p>The fastest branch is no branch. First try to minimize the number of branches, always measuring whether or not your alternative is faster. When you cannot eliminate a branch, try to minimize misprediction rates. This may involve <a href="https://stackoverflow.com/a/11227902/635314">using sorted data</a> or restructuring your code.</p> <p>One strategy for eliminating a branch is to replace it with a lookup. Sometimes an algorithm can be made branch-free instead of using conditionals. Sometimes <a href="https://mijailovic.net/2018/06/06/sha256-armv8/">hardware</a> <a href="https://blogs.msdn.microsoft.com/dotnet/2018/10/10/using-net-hardware-intrinsics-api-to-accelerate-machine-learning-scenarios/">intrinsics</a> can be used to eliminate branching.</p> <h2>Other miscellaneous tips</h2> <ul> <li>Avoid LINQ. LINQ is great in application code, but rarely belongs on a hot path in library/framework code. LINQ is difficult for the JIT to optimize (<code>IEnumerable&lt;T&gt;</code>...) and tends to be allocation-happy.</li> <li>Use concrete types instead of interfaces or abstract types. This was mentioned above in the context of inlining, but it has other benefits. Perhaps the most common example: if you are iterating over a <code>List&lt;T&gt;</code>, it's best to <em>not</em> cast that list to <code>IEnumerable&lt;T&gt;</code> first (eg, by using LINQ or passing it to a method as an <code>IEnumerable&lt;T&gt;</code> parameter).
The reason is that enumerating over a list using <code>foreach</code> uses a non-allocating <code>List&lt;T&gt;.Enumerator</code> struct, but when the list is cast to <code>IEnumerable&lt;T&gt;</code>, that struct must be boxed to <code>IEnumerator&lt;T&gt;</code> for <code>foreach</code>.</li> <li>Reflection is exceptionally useful in library code, but it <em>will</em> kill you if you give it the chance. Cache the results of reflection, consider generating delegates for accessors using IL or Roslyn, or better yet, use an existing library such as <a href="https://github.com/aspnet/Common/blob/ff87989d893b000aac1bfef0157c92be1f04f714/shared/Microsoft.Extensions.ObjectMethodExecutor.Sources/ObjectMethodExecutor.cs"><code>Microsoft.Extensions.ObjectMethodExecutor.Sources</code></a>, <a href="https://github.com/aspnet/Common/blob/ff87989d893b000aac1bfef0157c92be1f04f714/shared/Microsoft.Extensions.PropertyHelper.Sources/PropertyHelper.cs"><code>Microsoft.Extensions.PropertyHelper.Sources</code></a>, or <a href="https://github.com/mgravell/fast-member"><code>FastMember</code></a>.</li> </ul> <h2>Library-specific optimizations</h2> <h2>Optimize generated code</h2> <p>Hagar uses Roslyn to generate C# code for the POCOs you want to serialize, and this C# code is included in your project at compile time. There are some optimizations which we can perform on the generated code to make things faster.</p> <h3>Avoid virtual calls by skipping codec lookup for well-known types</h3> <p>When complex objects contain fields of well-known types such as <code>int</code>, <code>Guid</code>, and <code>string</code>, the code generator will directly insert calls to the hand-coded codecs for those types instead of calling into the <code>CodecProvider</code> to retrieve an <code>IFieldCodec&lt;T&gt;</code> instance for that type.
This lets the JIT inline those calls and avoids virtual/interface indirection.</p> <h3>(Unimplemented) Specialize generic types at runtime</h3> <p>Similar to the above, the code generator could emit code which uses specialization at runtime.</p> <h2>Pre-compute constant values to eliminate some branching</h2> <p>During serialization, each field is prefixed with a header – usually a single byte – which tells the deserializer which field was encoded. This field header contains 3 pieces of info: the wire type of the field (fixed-width, length-prefixed, tag-delimited, referenced, etc), the schema type of the field (expected, well-known, previously-defined, encoded), which is used for polymorphism, and the field id, encoded in the last 3 bits (if it's less than 7). In many cases, it's possible to know exactly what this header byte will be at compile time. If a field has a value type, then we know that the runtime type can never differ from the field type, and we always know the field id.</p> <p>Therefore, we can often save all of the work required to compute the header value and can directly embed it into the code as a constant. This saves branching and generally eliminates a lot of IL code.</p> <h2>Choose appropriate data structures</h2> <p>One of the big performance disadvantages Hagar has when compared to other serializers such as <a href="https://github.com/mgravell/protobuf-net">protobuf-net</a> (in its default configuration?) and <a href="https://github.com/neuecc/MessagePack-CSharp">MessagePack-CSharp</a> is that it supports cyclic graphs and therefore must track objects as they're serialized so that object cycles are not lost during deserialization. When this was first implemented, the core data structure was a <code>Dictionary&lt;object, int&gt;</code>. It was clear in initial benchmarking that reference tracking was a dominating cost. In particular, clearing the dictionary between messages was expensive.
By switching to an array of structs instead, the cost of indexing and maintaining the collection is largely eliminated, and reference tracking no longer appears in the benchmarks. There is a downside: for large object graphs this new approach is likely slower. If that becomes an issue, we can dynamically switch between implementations.</p> <h2>Choose appropriate algorithms</h2> <p>Hagar spends a lot of time encoding/decoding variable-length integers, often referred to as varints, in order to reduce the size of the payload, making it more compact for storage/transport. Many binary serializers use this technique, including <a href="https://developers.google.com/protocol-buffers/docs/encoding#varints">Protocol Buffers</a>. Even .NET's <code>BinaryWriter</code> uses this encoding. Here's a <a href="https://github.com/Microsoft/referencesource/blob/60a4f8b853f60a424e36c7bf60f9b5b5f1973ed1/mscorlib/system/io/binarywriter.cs#L414">snippet from the reference source</a>:</p> <pre><code>protected void Write7BitEncodedInt(int value) { // Write out an int 7 bits at a time. The high bit of the byte, // when on, tells reader to continue reading more bytes. uint v = (uint) value; // support negative numbers while (v &gt;= 0x80) { Write((byte) (v | 0x80)); v &gt;&gt;= 7; } Write((byte)v); } </code></pre> <p>Looking at this source, I want to point out that <a href="https://developers.google.com/protocol-buffers/docs/encoding#signed-integers">ZigZag encoding</a> may be more efficient than casting to <code>uint</code> for signed integers which are often negative.</p> <p>Varints in these serializers use an encoding called Little-Endian Base-128, or LEB128, which stores 7 bits of the value in each encoded byte. It uses the most significant bit of each byte to indicate whether or not another byte follows (1 = yes, 0 = no). This is a simple format, but it may not be the fastest. It might turn out that PrefixVarint is faster.
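</p> <p>Before moving on, the ZigZag point above can be made concrete with a standalone sketch. This follows protobuf's mapping; it is not code from any particular serializer:</p>

```csharp
using System;

// ZigZag maps signed values of small magnitude to small unsigned ones:
// 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ... so LEB128 output stays short.
static uint ZigZagEncode(int value) => (uint)((value << 1) ^ (value >> 31));
static int ZigZagDecode(uint value) => (int)(value >> 1) ^ -(int)(value & 1);

// Count the bytes the 7-bits-per-byte LEB128 loop above would emit.
static int Leb128Length(uint v)
{
    int bytes = 1;
    while (v >= 0x80) { v >>= 7; bytes++; }
    return bytes;
}

// A plain (uint) cast of -1 sets all 32 bits: five bytes on the wire.
Console.WriteLine(Leb128Length(unchecked((uint)-1))); // 5
// ZigZag turns -1 into 1: a single byte, and it round-trips.
Console.WriteLine(Leb128Length(ZigZagEncode(-1)));    // 1
Console.WriteLine(ZigZagDecode(ZigZagEncode(-1)));    // -1
```

<p>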
With PrefixVarint, all of those continuation bits from LEB128 are written in one shot, at the beginning of the encoded value. This may let us use <a href="https://mijailovic.net/2018/06/06/sha256-armv8/">hardware</a> <a href="https://blogs.msdn.microsoft.com/dotnet/2018/10/10/using-net-hardware-intrinsics-api-to-accelerate-machine-learning-scenarios/">intrinsics</a> to improve the speed of encoding &amp; decoding. By moving the size information to the front, we may also be able to read more bytes at a time from the payload, reducing internal bookkeeping and improving performance. If someone wants to implement this in C#, I will happily take a PR if it turns out to be faster.</p> <hr /> <p>Hopefully you've found something useful in this post. <a href="https://twitter.com/reubenbond">Let me know</a> if something is unclear or you have something to add. Since I started writing this, I've moved to Redmond and officially joined Microsoft on the <a href="https://github.com/dotnet/orleans">Orleans</a> team, working on some very exciting things.</p> Reuben Bond

CASPaxos
https://reubenbond.github.io/posts/caspaxos
Linearizable databases without logs
Tue, 21 Jan 2020 00:00:00 GMT

<p>Recently I've been playing around with a new algorithm known as <a href="https://arxiv.org/abs/1802.07000">CASPaxos</a>. In this post I'm going to talk about the algorithm and its potential benefits for distributed databases, particularly key-value stores.</p> <p>Distributed databases must be <strong>reliable</strong> and <strong>scalable</strong>. To achieve reliability, DBs replicate data to other servers. To achieve scalability in terms of total storage capacity, DBs must allow the data to be replicated to only a subset of servers - enough to make the data reasonably reliable, but few enough that adding a new server still increases the total storage capacity of the system and doesn't make the system unbearably slow.
A typical replication factor is 3: each piece of data is stored on 3 servers. Replication is typically implemented using a consensus algorithm. Well-known algorithms in this family that are used for replication are Raft, Multi-Paxos, and ZAB (which is used in ZooKeeper). Those 3 algorithms make servers agree on the ordering of operations in a log. By executing those operations in order, the database engines on each server can create identical replicas of a database. Logs feature very prominently in distributed/reliable systems (read <em><a href="https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying">The Log: What every software engineer should know about real-time data's unifying abstraction</a></em> by Jay Kreps).</p> <p><a href="https://arxiv.org/abs/1802.07000">CASPaxos</a> is a new algorithm in this space, and it is significantly simpler than the aforementioned algorithms because it does not use log replication. It is a slight modification of the original Paxos algorithm, which is very simple and typically used as a minimal building block for more complicated algorithms such as Multi-Paxos. Instead of replicating log entries between servers, CASPaxos replicates entire values. Because of this, it is best suited to relatively small values, such as individual entries in a key-value store.</p> <p>So why is this interesting? In short: it offers us simplicity &amp; performance. Before getting into its benefits, here's a <strong>sloppy, inaccurate description of CASPaxos - <a href="https://arxiv.org/abs/1802.07000">I recommend you read the paper</a></strong>.</p> <p>:::tip <strong>Why it stands out:</strong> CASPaxos replaces replicated logs with replicated values. That keeps the core protocol small and makes it a useful mental model before tackling full replicated-log systems. :::</p> <h2>CASPaxos</h2> <p>CASPaxos replicates changes to a single register amongst a set of replicas.
The register holds a user-defined value which is modified by successive applications of some change function (a closure). Each of these modifications is protected by version stamps (ballot numbers) which help to ensure that previously committed register values are not clobbered without first being observed by the writer. The protocol facilitates learning previously committed values so that replicas can keep up with one another.</p> <p>If you are familiar with Raft, you will know that at its core it replicates a log of values. Conceptually, a log-based replicated state machine folds a fixed function over a sequence of data (the log entries). By contrast, CASPaxos does not use a fixed function and instead folds varying closures over state, with the resulting state itself being replicated to other replicas.</p> <p>To illustrate, the following expansions show the result of applying <code>[e0, e1, e2]</code> (log entries) in Raft, versus <code>[f0, f1, f2]</code> (closures) in CASPaxos:</p> <ul> <li>Raft: <code>state = f(e2, f(e1, f(e0, ∅)))</code></li> <li>CASPaxos: <code>state = f2(f1(f0(∅)))</code></li> </ul> <p>Beyond what gets replicated and how the current state of the system is computed, Raft and CASPaxos differ in many other ways. For example, CASPaxos is leaderless, whereas Raft uses a strong leader. CASPaxos does not specify the use of heartbeats (in the core algorithm), whereas Raft does.
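</p> <p>The fold comparison above can be restated in a few lines of code - a toy illustration, with integers standing in for state and simple closures standing in for log entries and change functions:</p>

```csharp
using System;

// Raft-style: one fixed apply function, folded over replicated log entries.
static int Apply(int state, int entry) => state + entry;

int[] log = { 1, 2, 3 };
int raftState = 0;
foreach (var entry in log) raftState = Apply(raftState, entry);

// CASPaxos-style: the closures themselves are what get proposed; each one
// transforms the register's current value, and the resulting *value* is
// what gets replicated to the other replicas.
Func<int, int>[] changes = { s => s + 1, s => s + 2, s => s + 3 };
int register = 0;
foreach (var change in changes) register = change(register);

Console.WriteLine(raftState); // 6
Console.WriteLine(register);  // 6
```

<p>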
Many of these differences exist because Raft is a more <em>batteries included</em> algorithm which covers many of the practical concerns involved in building a replicated database.</p> <p>Neither approach is strictly better than the other, but since the CASPaxos approach (replicating state values rather than log entries) was fairly novel to me in the context of distributed consensus, I'd like to explore some of the implications, especially as they might apply to the systems I work on.</p> <p><a href="https://arxiv.org/abs/1802.07000">Read the paper</a> to understand the algorithm in more detail.</p> <h2>Simplicity</h2> <p>The canonical implementation of CASPaxos by its author, <a href="https://twitter.com/rystsov">Denis Rystsov (@rystsov)</a>, is <a href="https://github.com/gryadka/js">Gryadka</a>, a key-value store written in JavaScript which sits atop Redis. The core, including the CASPaxos implementation, is fewer than 500 lines of code. <a href="https://raft.github.io/">Raft</a> was also designed to be a simple and understandable algorithm, but it carries the weight of log replication, which brings the need for log compaction, which in turn brings the need for snapshotting and snapshot transfer. Raft also requires leadership elections because it is built around the concept of a "strong leader". All writes must be served by the single leader in a Raft system, whereas writes can be served by any replica in a CASPaxos system. CASPaxos is simpler to implement than Raft. The <a href="https://raft.github.io/raft.pdf">extended Raft paper</a> is a great read, and <a href="https://github.com/ongardie/dissertation#readme">Diego Ongaro's Ph.D. dissertation</a> includes an important simplification to the original paper's membership change algorithm.
Let's be clear here: Raft definitely achieved its goal of understandability, and it truly deserves the widespread adoption it's seen.</p> <p>:::important <strong>What the simplicity buys you:</strong> if your workload looks like a replicated key-value store, fewer moving parts means less machinery for leader routing, log compaction, and snapshot transfer. :::</p> <h2>Storage Performance</h2> <p>To analyse the performance implications of CASPaxos, we need to take a little detour and discuss real-world systems. One great example is <a href="https://www.cockroachlabs.com/">CockroachDB</a>, a distributed SQL database. CockroachDB aims to be <strong>reliable</strong> and <strong>scalable</strong>. To achieve this, it partitions its data and replicates each piece of data to a subset of the servers in the system using an algorithm its developers call <a href="https://www.cockroachlabs.com/blog/scaling-raft/">MultiRaft</a>. If they were to use a single Raft consensus group, then adding additional servers would not increase the total capacity of the database. If they were to use many Raft consensus groups naively, the overhead of each consensus group would take a toll on throughput. For example, Raft requires heartbeat messages while idle to maintain leadership. MultiRaft requires multiplexing each consensus group's log records on disk for performance. That means that log entries for each group might not live near each other on disk, since they are interspersed with many other groups' records. This can hurt recovery performance. The alternative is to store each group's log in contiguous disk segments, but this reduces write throughput: spinning disks and SSDs both perform better when operating sequentially. The optimizations required to make Raft scale well are tricky, largely because of its log-based nature.</p> <p>Speaking of storage, let's talk briefly about storage engines. The storage engine is the database component responsible for reading and writing data in a reliable way.
Examples include RocksDB, LMDB, ESENT (used in Exchange &amp; Active Directory), WiredTiger, TokuDB, and InnoDB. Two of the most common data structures for implementing a storage engine are <a href="https://en.wikipedia.org/wiki/B%2B_tree">B+ Trees</a> and, more recently, <a href="https://en.wikipedia.org/wiki/Log-structured_merge-tree">Log-Structured Merge-Trees</a> (LSM trees). In order to make B+ Trees reliable (any machine may crash at any time), a <a href="https://sqlite.org/wal.html">Write-Ahead Log</a> (WAL) is used. This log is a file containing a sequential list of the database transactions which are being performed. The storage engine eventually applies these transactions to the database image. During crash recovery, the storage engine reads this file and ensures that all of the committed transactions have been applied. A widely used algorithm for this kind of recovery is <a href="https://en.wikipedia.org/wiki/Algorithms_for_Recovery_and_Isolation_Exploiting_Semantics">ARIES</a>, and it can be found in many reliable storage engines. So B+ Trees split your data into two parts: a log file and a tree. Log-Structured Merge-Trees also generally adopt a Write-Ahead Log for recovery. Since spinning disks and SSDs perform best with sequential reads &amp; writes, log files are a good fit for high-performance, reliable systems.</p> <p>Raft is built around log replication, so it might make sense to integrate with the storage engine so that a single log can be used for both purposes: local durability as well as replication. Unfortunately, the storage engine's log is generally not visible to the storage engine's consumer and is usually considered an implementation detail. This means that Raft implementations which use an off-the-shelf storage engine such as RocksDB must store log records inside the storage engine itself so that they can be read back later.
The result is that each operation needs at least 2 writes (1 on the critical path): one for the log entry and one for the result of applying the log entry once it's committed (eg, updating a value in a key-value store). A B+ Tree engine needs 4 writes (1 on the critical path). By contrast, CASPaxos needs just 1 write: updating the value itself. Log-based algorithms have natural write amplification, whereas CASPaxos does not.</p> <p>By removing the need for logs, CASPaxos can achieve high write throughput with off-the-shelf storage engines.</p> <p>:::tip <strong>Write amplification matters:</strong> when the storage engine already maintains its own WAL, layering a replicated log on top often means writing both the log entry and the materialized result. CASPaxos avoids that extra replicated-log layer. :::</p> <h2>Coordination</h2> <p>Each key in a key-value store based on CASPaxos is completely independent of all other keys. This means that no cross-key coordination is required when serving operations on individual keys. Compare this with Raft or MultiRaft, where all operations within a given consensus group are strictly ordered. This ordering requires coordination, which has some overhead. It means that a slow operation on one key can more easily impact operations on other keys. The low level of coordination required by CASPaxos supports high-concurrency systems without added complexity.</p> <p>Coordination is sometimes required, however - for example, when implementing multi-object transactions. Multi-object transactions can be implemented as a higher layer on top of a key-value store with linearizable keys using <a href="https://en.wikipedia.org/wiki/Two-phase_commit_protocol">two-phase commit (2PC)</a>.
For example, this is how we implement <a href="https://www.microsoft.com/en-us/research/publication/transactions-distributed-actors-cloud-2/">ACID transactions in Orleans</a>, supporting any strongly consistent key-value store.</p> <h2>Challenges</h2> <p>So far we've talked about ways in which CASPaxos might be more suitable for building a distributed key-value store than Raft or MultiRaft. CASPaxos is a simple algorithm, and there are many system design questions which are not addressed by the paper. So here are some potential challenges when building a real-world system on CASPaxos, as well as some thoughts on how to solve them.</p> <p>:::warning The paper defines the core protocol, not a full production database. The remaining sections are the practical questions you still need to answer when turning CASPaxos into a complete system. :::</p> <h2>Server Catch-up</h2> <p>When adding a new server to the database system, the server needs to be brought up to speed with the existing servers. This requires adding it to the consensus group as well as copying all data for the keys which it will be replicating. The CASPaxos paper describes this process as a part of membership change. However, a similar process is needed to ensure that data remains sufficiently reliable. For example, if a server loses network connectivity for a few seconds, it may miss updates to some rarely updated keys. The CASPaxos algorithm does not discuss how to ensure that all updates are eventually replicated. In Raft, it is the leader's responsibility to keep followers up to speed. In a system built around CASPaxos, which is leaderless, we will likely need to implement a different solution.</p> <h2>Membership Change</h2> <p>The membership change algorithm in the paper does not offer safety in all cases, and it implies a single administrator in the system. Therefore, it is not suitable for use with automated cluster management systems.
The <a href="https://github.com/ReubenBond/orleans/tree/poc-caspaxos/src/Orleans.MetadataStore">proof-of-concept CASPaxos implementation</a> on <a href="https://dotnet.github.io/orleans/Documentation/Introduction.html">Orleans</a> uses a <a href="https://github.com/ReubenBond/orleans/blob/f617b0ce67079a6b79c80fa3c73540fe24d2db7b/src/Orleans.MetadataStore/Configuration/ConfigurationManager.cs#L138">different membership change algorithm</a>. It ought to be suitable for automated systems (such as the <a href="https://dotnet.github.io/orleans/Documentation/Runtime-Implementation-Details/Cluster-Management.html">cluster membership algorithm used in Orleans</a>). I believe the algorithm will be safe once fully implemented, but that has not been demonstrated yet. The key idea is to leverage the consensus mechanism of the protocol for cluster membership change, similar to how Raft and Multi-Paxos commit configuration changes to the log. It uses a special-purpose register to store the cluster configuration. Proposers indicate which version of the configuration they are using in all calls to Acceptors, and Acceptors reject requests from Proposers running old configurations. This is similar to Raft's notion of neutralizing old leaders. Additionally, membership changes are restricted to at most one server at a time, which is a special case of <em>joint consensus</em>. This is the same restriction that Diego Ongaro specified in <a href="https://github.com/ongardie/dissertation#readme">his Ph.D. dissertation</a> for Raft. In a sense, this extension turns CASPaxos into a 2-level store with the cluster configuration register at the top and data registers below, so the ballot vector is <code>[configuration ballot, data ballot]</code>.</p> <h2>Scale-out</h2> <p>Adding additional servers should increase the total storage capacity of the system. CASPaxos specifies only the minimal building block of a key-value store, so this scale-out is not discussed in the paper.
The Raft paper also does not specify this, which motivated the development of MultiRaft for CockroachDB. The dynamic range-based partitioning scheme used by CockroachDB is a good candidate. Implementing it might involve storing range configurations in registers and extending the membership change modification to include 3 levels: <code>[cluster ballot, range ballot, data ballot]</code>.</p> <h2>Large Values</h2> <p>CASPaxos is not suitable for replicating large values because each value is sent over the wire every time it is updated. For a replication factor of 3, the entire value is sent 3 times for every update, and 6 times if the proposer cannot take advantage of the <em>distinguished leader</em> optimization.</p> <p>This limitation could be alleviated in several ways, or it can be ignored and argued away, leaving users to tackle the problem themselves if they truly need large values.</p> <p>:::caution CASPaxos shines when values are modest in size. If updates routinely move large blobs, the simplicity win can be eroded by network and storage bandwidth costs. :::</p> <p>One way to alleviate it might be to split a large value over several registers. Without going into detail, this might involve extending the membership change modification yet again to include 4 levels, at which point it may make sense to generalize it into a <em>ballot vector</em>, <code>[...parent ballots, register ballot]</code>. Specifically, <code>[config ballot, range ballot, file ballot, register ballot]</code>. At this point, the system is structured more like a tree than a flat key-value store.</p> <h2>Conclusion</h2> <p>I hope you've enjoyed the post. If you'd like to discuss any aspects of it - for example, some glaring inaccuracies - drop me a line via Twitter (<a href="https://twitter.com/reubenbond">@ReubenBond</a>).</p> <p>Distributed systems is a young field with many exciting areas for research and development.</p> Reuben Bond