Code Thoughts

Does Your Code Leave a Trail of Slowness?

Mon, 27 Feb 2017 19:17:27 +0000

The Trail of Slowness

Often I implore people to make better-performing coding decisions when the downsides are small or nonexistent. A common response is that you should only worry about performance once you have measured the code in question and found performance to be an issue. There are a couple of problems with that ethos. The first is that early decisions can be hard to reverse late in a project if performance turns out to be an issue. But there is a more insidious problem: areas of your code for which performance is not important may be causing other code, or even other programs, to slow down. This trail of slowness left behind by uninformed performance decisions will not show up in any particular place in a profiler; it will just slow everything down a bit.

Why Does This Happen

Modern CPU performance is severely limited by RAM latency. A request to get data from RAM can take over 100 CPU cycles to complete, and while your CPU waits, it just sits there, doing nothing useful. This problem is addressed by a series of caches in your CPU: smaller but much faster regions of memory. Whenever you request data from RAM, an entire cache line of contiguous memory (typically 64 bytes on current x86 CPUs) is pulled into the CPU caches. If the next memory you request is within that cache line, you will get it very quickly.

This is why iterating over an array in order (contiguous memory) is much, much faster than iterating over a LinkedList, where each node is in a random(ish) location in the heap. Many of those requests from RAM while iterating over a LinkedList miss the cache, and cause long CPU stalls.
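To see the difference for yourself, here is a minimal sketch that sums the same values from an array and from a LinkedList and times both. The names (TraversalDemo, SumArray, SumLinkedList) and the element count are my own choices for illustration. Note that a freshly built list tends to have its nodes allocated close together, so the gap on a long-running, fragmented heap is likely to be larger than what this prints.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

public static class TraversalDemo {
    // Contiguous memory: consecutive elements share cache lines.
    public static long SumArray(int[] a) {
        long sum = 0;
        for (int i = 0; i < a.Length; i++) {
            sum += a[i];
        }
        return sum;
    }

    // Each node is a separate heap object reached by following a reference.
    public static long SumLinkedList(LinkedList<int> l) {
        long sum = 0;
        foreach (int n in l) {
            sum += n;
        }
        return sum;
    }

    public static void Main() {
        const int N = 1000000; // arbitrary size, large enough to exceed the L1/L2 caches
        var a = new int[N];
        var l = new LinkedList<int>();
        for (int i = 0; i < N; i++) {
            a[i] = i;
            l.AddLast(i);
        }

        var sw = Stopwatch.StartNew();
        long arraySum = SumArray(a);
        long arrayTicks = sw.ElapsedTicks;

        sw.Restart();
        long listSum = SumLinkedList(l);
        long listTicks = sw.ElapsedTicks;

        Console.WriteLine("Array:      sum=" + arraySum + " ticks=" + arrayTicks);
        Console.WriteLine("LinkedList: sum=" + listSum + " ticks=" + listTicks);
    }
}
```

Run it in a Release build; timings from a Debug build say very little.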

But it is worse than just being slow. Every time you request memory that isn’t in the cache already, you have to pull down a whole cache line and replace data that is already in the cache. Any code running that might have hoped to use that cached data won’t have it, and will now have to get it from RAM again. This is sometimes called “thrashing the cache”.

How Bad Can It Be?

This can vary wildly based on the workloads involved. In the contrived example below, I do a fixed amount of work with a LinkedList, while at the same time other threads do as much work as possible for 5 seconds. They are able to perform the ArraySumSquare function about 3.5 million times in those 5 seconds.

When I alter the RunLinkedList function to do the same fixed amount of work on arrays instead, the original array loops are now able to perform ArraySumSquare about 4.4 million times.

This could be similar to a situation where a server has a periodic computation it does, where taking a few seconds isn’t considered a problem at all. But that job is having a significant impact on users of the live system, increasing latency by ~20%. Or it could be similar to a code editor that is doing some parsing behind the scenes, where the performance seems fine when you profile it, but it is slowing down UI response due to the cache thrashing.

What To Do

Use arrays or array-backed lists by default unless you have good reasons not to. Most languages have resizable array-backed structures that are as convenient to use as LinkedLists:

Java - ArrayList
C# - List
F# - ResizeArray (see FSharpX for higher order functions on these!)
C++ - std::vector
Rust - Vec

Even if you periodically insert into or remove from the front or middle of your collection, an array-backed list is usually still faster overall. You may also consider using a sorted array along with BinarySearch instead of a tree, when applicable. Think about how you lay out your data structures, and avoid unnecessary pointer hops. Avoid virtual functions when they aren’t really necessary. Understand how branch prediction works so you can set your code up for success. Take some time to learn how memory works, and you can make better default decisions and better designs early in your projects.
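As a minimal sketch of the sorted array idea, the following keeps a List<int> in sorted order and uses List<T>.BinarySearch for both lookup and insertion. The class and method names (SortedListDemo, InsertSorted, Contains) are hypothetical helpers for this example. BinarySearch returns the bitwise complement of the insertion index when the value is missing, which is what makes the insert a one-liner.

```csharp
using System;
using System.Collections.Generic;

public static class SortedListDemo {
    // Inserts value at its sorted position. Returns false if it was already present.
    public static bool InsertSorted(List<int> sorted, int value) {
        int index = sorted.BinarySearch(value);
        if (index >= 0) {
            return false; // already in the list
        }
        // A negative result is the bitwise complement of the insertion point.
        sorted.Insert(~index, value);
        return true;
    }

    // O(log n) lookup over contiguous memory, no node-to-node pointer chasing.
    public static bool Contains(List<int> sorted, int value) {
        return sorted.BinarySearch(value) >= 0;
    }

    public static void Main() {
        var set = new List<int>();
        foreach (int v in new[] { 42, 7, 19, 7, 3 }) {
            InsertSorted(set, v);
        }
        Console.WriteLine(string.Join(", ", set)); // 3, 7, 19, 42
        Console.WriteLine(Contains(set, 19));      // True
        Console.WriteLine(Contains(set, 20));      // False
    }
}
```

Inserts are O(n) because elements have to shift, so this suits collections that are read far more often than they are written.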

Sample Code

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading;

namespace ConsoleApplication1 {
    class Program {
         
        static void Main(string[] args) {
            Thread [] arrayThreads = new Thread[8];
            Thread [] listThreads = new Thread[8];

            for (int i = 0; i < 8; i++) {
                arrayThreads[i] = new Thread(RunArray);
                listThreads[i] = new Thread(RunLinkedList);
                arrayThreads[i].Start();
                listThreads[i].Start();
            }

            for (int i = 0; i < 8; i++) {
                arrayThreads[i].Join();
                listThreads[i].Join();                
            }

            Console.WriteLine("Perf Critical Ops:" + totalOps);                        
            Console.ReadLine();

        }

        public static int totalOps = 0;
        public const int LEN = 10000;
        public const int TIME = 5000;

        public static void RunArray() {
            Stopwatch sw = new Stopwatch();
            sw.Start();
            int[] a = new int[LEN];
            for (int i = 0; i < a.Length; i++) {
                a[i] = 2;
            }
            long sum = 0;
            int count = 0;
            while (sw.ElapsedMilliseconds < TIME) {                
                sum += ArraySumSquare(a);
                count++;
            }
            Interlocked.Add(ref totalOps, count);                        
        }

        public static int ArraySumSquare(int[] a) {                        
            int sum = 0;
            for (int i = 0; i < a.Length; i++) {
                sum += a[i] * a[i];
            }
            return sum;
        }

        public static void RunLinkedList() {
            Stopwatch sw = new Stopwatch();
            sw.Start();
            LinkedList<int> l = new LinkedList<int>();
            for (int i = 0; i < LEN; i++) {
                l.AddLast(2);
            }
            long sum = 0;
            int count = 0;
            while (count < 75000) {
                sum += LinkedListSumSquare(l);
                count++;
            }            
        }
        
        public static int LinkedListSumSquare(LinkedList<int> l) {            
            int sum = 0;
            foreach (var num in l) {
                sum += num * num;
            }
            return sum;
        }
    }
}

PVS-Studio C#

Tue, 01 Nov 2016 19:17:27 +0000

PVS-Studio is a popular static analysis tool in the C++ world, and plenty of articles have been written about the kinds of bugs it can find in C++ projects, such as this entertaining one about the Unreal Engine. About a year ago they added C# support, and they have steadily been adding more C# analysis features since. Today I grabbed version 6.10 and ran it on my code base at work, which is a fairly large ASP.NET MVC web application. Here are some of the things it found:

The ‘user.Group’ object was used before it was verified against null

  var user = _entityRepository.GetOnlineUserByUsername(username);
  string nsId = user.Group.NetSuiteInternalId;

One of the nicer features of PVS-Studio is that it can identify cases like this, where a NullReferenceException is possible. By identifying these and dealing with them, you can eliminate a common class of error and handle it more gracefully.
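One way to deal with a warning like this is sketched below, assuming C# 6 or later for the null-conditional operator. The User and Group classes here are stand-ins I made up for the example, not our real entities, and whether returning null is the right fallback depends on the caller.

```csharp
using System;

// Hypothetical stand-ins for the real entity classes.
public class Group {
    public string NetSuiteInternalId { get; set; }
}

public class User {
    public Group Group { get; set; }
}

public static class NullCheckDemo {
    // Returns null instead of throwing when user or user.Group is null.
    public static string GetNetSuiteId(User user) {
        return user?.Group?.NetSuiteInternalId;
    }

    public static void Main() {
        var withGroup = new User { Group = new Group { NetSuiteInternalId = "123" } };
        var withoutGroup = new User();

        Console.WriteLine(GetNetSuiteId(withGroup) ?? "(none)");    // 123
        Console.WriteLine(GetNetSuiteId(withoutGroup) ?? "(none)"); // (none)
        Console.WriteLine(GetNetSuiteId(null) ?? "(none)");         // (none)
    }
}
```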

Expression ‘result.Success’ is always true

if(!result.Success)
  return Json(result);

//...

if (result.Success)
{    
    //...    

}           

This and a couple of other similar examples were identified. A warning like this could point to a serious logic error in the code. At the very least it identifies unnecessary checks cluttering up the code and slowing down execution.

It is odd that the body of ‘CanReadType’ function is fully equivalent to the body of ‘CanWriteType’ function

public override bool CanReadType(Type type)
{
    return SupportedType(type);
}

public override bool CanWriteType(Type type)
{
    return SupportedType(type);
}

This turned out to be correct for us, since we needed to override both methods. This class of message can sometimes identify code that can be combined to shrink your code base, or may identify a copy-paste error.

An odd precise comparison: transaction.TaxRate == 0. Consider using a comparison with a defined precision: Math.Abs(A-B) < Epsilon

 if(transaction.TaxRate == 0)

Any instance of comparing a floating point value to an exact value will be identified as a low-risk problem.
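A minimal sketch of the comparison the message suggests, assuming the value is a double. The Epsilon value and the NearlyEqual helper are my own choices for illustration; the right tolerance depends on the magnitude of your data, and for money a decimal type with exact comparison may be the better fix.

```csharp
using System;

public static class FloatCompareDemo {
    // Assumed tolerance for this example only.
    public const double Epsilon = 1e-9;

    public static bool NearlyEqual(double a, double b) {
        return Math.Abs(a - b) < Epsilon;
    }

    public static void Main() {
        // Mathematically zero, but not exactly zero in binary floating point.
        double taxRate = 0.1 + 0.2 - 0.3;

        Console.WriteLine(taxRate == 0);            // False
        Console.WriteLine(NearlyEqual(taxRate, 0)); // True
    }
}
```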

A part of conditional expression is always true if it is evaluated: billingAddressSameAsShipping != “on”

if (string.IsNullOrEmpty(CFM.BillingAddress.Id) && 
    string.IsNullOrEmpty(billingAddressSameAsShipping) && 
    billingAddressSameAsShipping != "on")

Another common mistake; this can often arise when an if statement is later modified with an extra check. In this case, the check for != “on” is unnecessary: whenever string.IsNullOrEmpty(billingAddressSameAsShipping) is true, the string cannot equal “on”. But this could expose logic mistakes as well.

The ‘url’ variable is assigned to itself

url = url = "user/" + userId + "/attorney/" + id;

A copy/paste error that likely gets compiled away, but still clutters up the code.

IDisposable object ‘serverError’ is not disposed before method returns

var serverError = new HttpResponseMessage(HttpStatusCode.InternalServerError);
return Request.CreateResponse(HttpStatusCode.InternalServerError);

PVS-Studio will identify a few different problems with the IDisposable interface, including this one, classes that implement a Dispose method but not the IDisposable interface, and classes that have IDisposable members but don’t implement IDisposable themselves. Proper handling of these issues can reduce memory use and GC pressure.
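In the snippet above the simplest fix is to delete the unused serverError object, or return it instead of creating a second response. When a disposable really is needed only within a method, a using block guarantees Dispose is called. Here is a minimal sketch with a made-up TrackedResource class standing in for a real disposable type:

```csharp
using System;

// Hypothetical disposable used only to make the Dispose call observable.
public sealed class TrackedResource : IDisposable {
    public bool IsDisposed { get; private set; }

    public void Dispose() {
        IsDisposed = true;
    }
}

public static class DisposeDemo {
    // Creates a resource, uses it, and returns it so the caller can inspect its state.
    public static TrackedResource UseAndDispose() {
        TrackedResource seen;
        using (var resource = new TrackedResource()) {
            seen = resource;
            // ... work with the resource here ...
        } // Dispose() runs here, even if the body throws
        return seen;
    }

    public static void Main() {
        Console.WriteLine(UseAndDispose().IsDisposed); // True
    }
}
```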

The ‘DateTime’ constructor could receive the ‘0’ value while positive value is expected. Inspect the first argument

DateTime now = DateTime.Now;
if (year < 0 || year > now.Year || month <= 0 || month > 12)
{
    throw new Exception("The input query time is not valid");
}

DateTime StartOfMonth = new DateTime(year, month, 1);

PVS-Studio has identified that our check for “year < 0” is not sufficient to guarantee correct input to the DateTime constructor. It should be year <= 0.

The ‘DateNeeded’ variable is assigned values twice successively. Perhaps this is a mistake

if (detail)
{
    while ((date.ToString("ddd") != day))
    {
        date = date.AddDays(1);
    }

    //if the date is a holiday, add 1 
    if (santa.CheckDate(DateNeeded, type))
        DateNeeded = date.AddDays(1);
}

//fixes time 
date = date.AddHours(time - date.Hour);
date = date.AddMinutes(-date.Minute);

DateNeeded = date;

PVS-Studio has identified that the assignment DateNeeded = date.AddDays(1) is nonsensical, because it is immediately overwritten before being used.

The return value of function ‘Insert’ is required to be utilized.

userString.Insert(0, suffixStr);
   

This is a very common mistake. C# strings are immutable, so the pattern for methods like Insert, Substring, etc. is to return the new string. You can easily forget this, and then the operation you were trying to do on the string just doesn’t happen at all.
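A minimal sketch of the mistake and the fix. The class and method names and the sample strings are made up for this example:

```csharp
using System;

public static class StringInsertDemo {
    // The bug: the new string returned by Insert is thrown away.
    public static string InsertIgnored(string s, string prefix) {
        s.Insert(0, prefix);
        return s;
    }

    // The fix: assign the returned string.
    public static string InsertAssigned(string s, string prefix) {
        s = s.Insert(0, prefix);
        return s;
    }

    public static void Main() {
        Console.WriteLine(InsertIgnored("Smith", "Mr. "));  // Smith
        Console.WriteLine(InsertAssigned("Smith", "Mr. ")); // Mr. Smith
    }
}
```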

Quick Review

This doesn’t represent all of the C# capabilities that PVS-Studio has; these are just the issues it found in our project. It integrates nicely with Visual Studio (see screenshot below), but you can use it standalone as well. They have a nice evaluation version that lets you try it out for quite a long time. I find the errors it reports to be much more relevant than ReSharper’s analyses, which are numerous but seem to be mostly stylistic. There do not appear to be any optimization tips for C# yet, as there are for C++ code, but perhaps that is coming in the future.

Language Helper

Thu, 29 Sep 2016 19:17:27 +0000

One of my many half-finished side projects is a text adventure engine, where players can play Zork-like games or make their own game from within the engine. One of the tricky problems I ran into was handling certain English grammar issues in the engine in a way that was not extremely annoying for users. For instance, you might want to indicate in the engine editor that a room has 2 gold coins and a dagger in it. When someone plays the game you could do the usual gamedev thing and present the room something like this:

You enter a scary dungeon.  
You see:
  2 gold coins
  1 dagger

> Take 1 coin
You take 1 coin.
You see:
  1 gold coin
  1 dagger

It is easy to write code to do that, but it breaks the fourth wall, and no longer reads like a story. What if instead you wanted it to look like so:

You enter a scary dungeon.  There are two gold coins and a dagger here.
> Take 1 coin
You take a coin, and now a gold coin and a dagger remain

This is harder code to write, especially when you support an editor where users might enter any words they want as items. For starters, you have to identify whether the indefinite article for a given word should be “a” or “an”, which has no simple rules you can follow. You also need to know the plural forms of words, which is not as simple as just adding an s.
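To make the difficulty concrete, here is a small sketch of the obvious rules and where they fall over. NaiveArticle and NaivePlural are deliberately simplistic helpers written for this illustration; they are not part of any library.

```csharp
using System;

public static class NaiveGrammarDemo {
    // Naive rule: "an" before a vowel letter, "a" otherwise.
    public static string NaiveArticle(string word) {
        char first = char.ToLowerInvariant(word[0]);
        return "aeiou".IndexOf(first) >= 0 ? "an" : "a";
    }

    // Naive rule: just append "s".
    public static string NaivePlural(string word) {
        return word + "s";
    }

    public static void Main() {
        foreach (string word in new[] { "dagger", "apple", "hour", "unicorn" }) {
            Console.WriteLine(NaiveArticle(word) + " " + word);
        }
        // Prints "a dagger", "an apple", "a hour", "an unicorn".
        // The last two are wrong: the choice depends on the sound, not the letter.

        foreach (string word in new[] { "coin", "knife", "sheep" }) {
            Console.WriteLine(NaivePlural(word));
        }
        // Prints "coins", "knifes", "sheeps". Only the first is correct.
    }
}
```

Cases like these are why a dictionary of known words is hard to avoid.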

The LanguageHelper library

I’ve put together a library that helps make some of these things easier, and might be useful for various text-based RPG applications. It may even be useful for non-text-based ones, now that text to speech is starting to sound convincing.

What I have so far:

Query any word for its plural form
Query any word for the correct indefinite article
Query any verb for past tense, progressive tense, and past perfect tense
Turn an integer into the words that represent the integer
A json dictionary format so you can easily add your own words, or cull the dictionary for performance/size reasons

Things I would like to add:

Query a word for synonyms and antonyms
More complete coverage of verb conjugation
Pronunciation data?
More languages?

The library is currently in .NET 2.0, so it can be consumed by any C# or F# code, including Unity3D, but if people think this sounds useful I would create a C++ version as well.

Useful?

Does this sound useful? Does this already exist? What other features would you need? Let me know!

Sample Use Case

Following is a simple use case example. Notice there are some subtleties that the library handles well. “steel ingot” is a two-word item. The library correctly picks the indefinite article based on the leading word “steel” but applies the plural form only to the trailing word.

    (* Some imaginary blacksmith entity structure *)
    let item = "steel ingot"
    let qty = 2
    let currentAction = "forge"
    let pastAction = "sleep"

    (* Player asks what the Smith has for sale *)
    let response = "I have " + wordBank.QueryNounQty(item,qty)
    printfn "%s" response
    (* Output: I have two steel ingots *)

    (* Smith only has 1 *)
    let qty = 1
    let response = "I have " + wordBank.QueryNounQty(item,qty)
    printfn "%s" response
    (* Output: I have a steel ingot *)

    (* Smith has none left *)
    let qty = 0
    let response = "I have " + wordBank.QueryNounQty(item,qty)
    printfn "%s" response
    (* Output: I have no steel ingots *)

    (* What are you doing? *)
    let response = "I am " + wordBank.QueryVerbPresent(currentAction)
    printfn "%s" response
    (* Output: I am forging *)

    (* What did you do earlier? *)
    let response = "I " + wordBank.QueryVerbPast(pastAction)
    printfn "%s" response
    (* Output: I slept *)

Some sample queries

Taking out the garbage

Thu, 01 Sep 2016 19:17:27 +0000

One of the never-ending arguments about languages is the pros and cons of garbage collection. Some people hate it because they think it is slow, others insist it is fast, some hate that it takes control away from them, others love it for that same reason. I’m going to explore this a bit and show some pros and cons that arise, and how you can deal with them in C#, Java, and C++. I will be creating a basic framework for a game in the style of Minecraft. Don’t get too excited, there won’t be any rendering or anything playable. Also, please don’t take any of these experiments as evidence of innate performance qualities of any of these languages. In all three cases, I am aware of ways to optimize the code further; this is just meant to illustrate the relative costs of allocation and how you can start reducing those costs in each language. If I get emails about fairness I’m going to refer you to this paragraph.

The Naive Approach, in C Sharp

Link To Gist

Above is a link to how many developers might naively begin to implement a game like Minecraft. (Note that experienced AAA game devs’ eyes would bleed at this.) It is 3D, so of course you have the obligatory Vector class, with obligatory operator overloading so you can do simple operations on those vectors with very obvious code and very little typing. The game world consists of Chunks, which are loaded as you approach close enough to them and unloaded as you get too far away. Each Chunk has a number of Entities that move around at various speeds each game tick. The player moves forward, and every few ticks she passes into a new Chunk, which causes one Chunk to be loaded and another to be unloaded. Each Chunk also has a number of Blocks. Everything in the game (Chunks, Blocks, Entities) has a position represented by a Vector. Each tick of the game, as the player moves forward a little bit, the chunks are iterated over and told to update their entities, then checked to see if they have gone out of range, and removed if so. If a chunk is removed, a new one is added to replace it.

Simple enough, and the approach above would and does work fine for many games. It isn’t completely dumb: it uses an array-backed List to keep track of things, because arrays are fast, and it pre-allocates them to the proper size when it can to avoid wasting memory and CPU on growing the array. Modern computers are fast, so this shouldn’t be a problem!

But the problem is that while computers are fast, so too have our expectations grown. A screen used to have 64,000 pixels max, now they have 2 million at a minimum. A game world that you could explore for hours used to be impressive, but now players expect endless worlds larger than planets, with detail down to blades of grass. And all of that has to happen at 90FPS on two monitors at once because VR! So, while our game is simple in principle, we load up and move around a lot of simple things: 65k blocks per chunk, 100 chunks at a time, plus 1,000 entities per chunk, all moving around every game tick.

Naive Approach

You can see in the code we have a rudimentary frame rate lock at 60fps, and on a modern Core i7 CPU we aren’t hitting that frame rate ever! Some other troubling stats (collected with Perfmon) are clear:

80% of CPU time spent in garbage collection!!!
600 MB/s allocations
5.8 seconds to load the world.
31.4 ms per tick

I’m not even rendering yet! Or playing sounds, or doing networking. This is the kind of performance that causes people to say “Garbage collection is terrible and slow!”, which causes people to respond “No you just aren’t using it right!”, which then leads to “If I have to think about memory management anyway, what is the point of garbage collection!” and so on.

The root problem here, as is often the case, is too many allocations, which would be a problem even if there wasn’t any garbage collector, though perhaps not quite as bad. I am casually creating new Vectors all the time for just a short while and then tossing them away. With blocks I am creating longer-lived objects and then regularly tossing them to the garbage collector as well. There are many things I can do to improve this, but C# has one feature which is an ‘easy fix’, and that is structs, which are value types. A struct used as a local variable or temporary does not get its own heap allocation, and structs are passed by value. You can’t just turn all of your classes into structs, as copying larger classes around by value would be wasteful, but small ones you can. In this case, Vector is a perfect candidate, at only 12 bytes.
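For reference, a 12-byte Vector as a struct might look roughly like the sketch below. The field names, the operators, and the Step helper are assumptions for this illustration rather than a copy of the gist. With a struct, the temporaries produced by operator overloads are plain values instead of short-lived heap objects.

```csharp
using System;

// Three 4-byte floats: 12 bytes, cheap to copy by value.
public struct Vector {
    public float X, Y, Z;

    public Vector(float x, float y, float z) {
        X = x;
        Y = y;
        Z = z;
    }

    public static Vector operator +(Vector a, Vector b) {
        return new Vector(a.X + b.X, a.Y + b.Y, a.Z + b.Z);
    }

    public static Vector operator *(Vector a, float s) {
        return new Vector(a.X * s, a.Y * s, a.Z * s);
    }
}

public static class VectorDemo {
    // The intermediate result of velocity * dt never becomes garbage for the GC.
    public static Vector Step(Vector position, Vector velocity, float dt) {
        return position + velocity * dt;
    }

    public static void Main() {
        var p = new Vector(1f, 2f, 3f);
        var v = new Vector(2f, 0f, -2f);
        Vector next = Step(p, v, 0.5f);
        Console.WriteLine(next.X + ", " + next.Y + ", " + next.Z); // 2, 2, 2

        // Value semantics: assignment copies, so changing the copy leaves p alone.
        Vector copy = p;
        copy.X = 100f;
        Console.WriteLine(p.X); // 1
    }
}
```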

Change class Vector to struct Vector

Just six characters and look at the difference:

45% of CPU time spent in garbage collection
200 MB/s allocations
4.4 seconds to load the world.
4.1ms per tick

Suddenly things have gone from a hopeless situation, where the tick alone overruns the frame budget and leaves negative time for rendering, to having 11 milliseconds and 50% of the CPU to spare. But the struct is just a partial fix. Next I will refactor the code a bit for even better performance.

Refactor

Link to Gist

After a more serious refactoring:

0.1% of CPU time spent in garbage collection
1 MB/s allocations
25ms to load the world
0.97ms per tick

A huge difference! Very little time is spent on GC now, world load times are effectively instant, and game ticks now process in under a millisecond. At this point almost nothing happens in a game tick except updating positions of entities and occasionally unloading one chunk to be replaced by another.

What Changed?

Lots of little things. Using arrays instead of List where it doesn’t cause any extra work saves a tiny bit of overhead. Structuring loops just right lets the runtime elide array bounds checks in some places. Not using foreach on Lists reduces GC pressure and improves runtime, and there are other minor tweaks. But the main thing was rethinking how the data is organized. Previously, each chunk had 65k Block objects. By thinking about the data, one can figure out how many possible block types there are; they will likely have some bound. In this case 256 was chosen to replicate Minecraft. You could easily bump that up to a short or int and still realize the bulk of this improvement. So instead of each chunk allocating and storing a complete Block object 65k times, it just stores an index into a global array of Blocks. This is similar to how Minecraft actually does things. This optimization only makes sense if blocks are static things most of the time, as they are in Minecraft. You can break blocks and place blocks, but they rarely have state associated with them that changes. This trick will not work with entities as currently designed, as they are moving around, their health is changing, and so on.
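The index-into-a-global-array idea can be sketched as follows. Everything here is illustrative: the Block fields, the BlockTypes table and its three sample entries, and the exact chunk size of 65,536 are assumptions, not the code from the gist.

```csharp
using System;

// One shared, immutable description per block type.
public sealed class Block {
    public readonly string Name;
    public readonly bool IsSolid;

    public Block(string name, bool isSolid) {
        Name = name;
        IsSolid = isSolid;
    }
}

public static class BlockTypes {
    // Global table of at most 256 block types, indexed by a byte id.
    public static readonly Block[] All = new Block[256];

    static BlockTypes() {
        All[0] = new Block("air", false);
        All[1] = new Block("stone", true);
        All[2] = new Block("dirt", true);
        // Remaining ids are left unassigned in this sketch.
    }
}

public sealed class Chunk {
    public const int BlockCount = 65536; // assumed value for "65k"

    // One byte per block instead of one heap object per block.
    private readonly byte[] blockIds = new byte[BlockCount];

    public Block GetBlock(int index) {
        return BlockTypes.All[blockIds[index]];
    }

    public void SetBlock(int index, byte typeId) {
        blockIds[index] = typeId;
    }
}

public static class ChunkDemo {
    public static void Main() {
        var chunk = new Chunk();              // a single 64 KB array allocation
        chunk.SetBlock(5, 1);
        Console.WriteLine(chunk.GetBlock(0).Name); // air
        Console.WriteLine(chunk.GetBlock(5).Name); // stone
    }
}
```

Loading a chunk now costs one array allocation rather than tens of thousands of object allocations, and unloading it hands the garbage collector one object to reclaim.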

This sort of thing is a very basic example of Data Oriented Programming. Think a little bit more about what is actually happening with your data in memory, and less about what an idiomatic OOP design should be. Note that had you proceeded with the original design further into the development cycle, refactoring to make these changes could have ended up being very painful. Now that memory isn’t being shuffled around like mad, there is plenty of CPU available to replace the mockup code with some real ‘AI’.

One could go further with this, for instance by pulling the Vector class out and replacing it with an array of positions, or even separate arrays of x, y, and z values. Depending on how the data is accessed, this could have big speed benefits due to cache locality and allow you to utilize SIMD instructions. But that sort of madness is beyond the scope of this blog post.

The Naive Approach, Java

Link to Gist

Recreating the same naive program in Java, I get the following stats (collected with Mission Control):

17% of CPU time spent in garbage collection
156 MB/s allocations
4.1s to load the world
4.9ms per tick

Java has no value types yet, so we can’t apply the trick of making Vector a struct, but the memory use and GC time are already much better than in the .NET case where we made Vector a struct. Part of the reason for this is that the JVM does escape analysis, so some of the wasteful Vector allocations that are only used within a function can be allocated on the stack automatically. Interestingly, escape analysis is a feature coming soon to .NET, and value types are coming soon to Java.

Java refactored

Link to Gist

But what if I refactor to avoid the wasteful allocations in the first place?

.04% of CPU time spent in garbage collection
1.3 MB/s allocations
50ms to load the world
1.18ms per tick

The stats are now very much in line with the refactored .NET code. While they are slightly worse, don’t read too much into that; I’m not as experienced at Java and probably made more obvious small mistakes. What should be noted here is that avoiding wasteful allocations is important, no matter the platform. One obvious way to further improve the Java code would be to eliminate the Vector class entirely, and just use float x, y, z everywhere we use it. This is a bit painful, but gets rid of the reliance on escape analysis and saves some object overhead. This would be equivalent to converting the class to a value type, if/when Java has those. Another option is to use a pool of Vector objects, which you reuse.

C++ Extremely naively

The first experiment I ran with C++ was to strictly copy the behavior of C# / Java, and create the same objects, on the heap, every time. This was very unnatural, as creating a new Vector and then 2 lines of code later calling delete on it kind of alerts you to the absurdity of the situation. But for completeness I wrote that code and:

0% of CPU time spent in garbage collection (but lots spent allocating!)
1 second to load the world
17.39ms per tick

While the game world loads a lot faster, the per tick performance is still really bad. It is even worse than the naive Java implementation, probably because there is no escape analysis saving us from allocating Vectors all over the place. This code is actually kind of absurdly naive though; it takes extra typing annoyance to make code this bad, and it is pretty unlikely that even someone not keen on performance would do this. However, this is how I was taught to do things with C++ in school, so you never know.

Minor refactor - no more heap allocating Vectors

Link To Gist

This is vaguely equivalent to making the Vector class a struct in C#. I’m also passing Vector by value, and never allocating it on the heap. The code gets smaller and simpler, and I don’t have to worry about memory management as much.

0% of CPU time spent in garbage collection (but some spent allocating)
737ms to load the world
1.4ms per tick

Major refactor

Link To Gist

C++ implementation improvements courtesy of Jean-Michaël Celerier

Applying the same tricks as we did in the other languages, so that we aren’t allocating so many blocks, and a few other tricks that C++ gives us the flexibility to do:

0% of CPU time spent in garbage collection (a bit spent allocating)
8ms to load the world
0.55ms per tick

Rust naively

Link To Gist

This implementation contributed kindly by Maplicant

This was implemented by a newcomer to Rust who attempted a translation of the naive C# implementation. Performance is quite good, the fastest of all the naive ones.

0% of CPU time spent in garbage collection (but lots spent allocating!)
1.2 seconds to load the world
1.7ms per tick

Rust Refactored

Link To Gist

This implementation contributed kindly by Zachary Dremann

The refactored Rust implementation also performs very well:

0% of CPU time spent in garbage collection (but some spent allocating!)
14 milliseconds to load the world
0.7ms per tick

Go? Haskell?

The Haskell and Go communities have gotten into the act too with lots of fun experiments. I won’t be able to collate the performance of all of these efforts but they are fun to read up on.

Performance Comparisons:

A couple of quick graphs showing some performance comparisons. Again, I reiterate not to read these as proof of the performance superiority of any memory management approach. I assure you that all of these implementations could be optimized further than they are here. The point is to show how allocations are expensive in all cases, but in different ways, and how good design brings performance into reasonable ranges in all cases, though with differing levels of effort.

The Naive Approaches (With Vector Fixes in C# and C++)

Note the Y axis here is log time. Notice how in all 4 languages performance starts to degrade at the same time, about 1,000 ticks in, probably reflecting when the heap begins to fill up, allocations get expensive for C++, and GC has to kick in more for .NET and Java. While one could bicker for eternity about which language is doing best here, the fact is that all of these implementations are completely unacceptable.

The Refactored Approaches

Notice all 4 languages perform well here, but the garbage collected languages do still have GC pauses, which are a big problem in gaming. Most game developers would get even more clever, using object pooling and other techniques to try to get allocations down to zero within the main game loop, if possible.

Conclusions

The main takeaway here is to think about how you work with memory, no matter what language you use. C# offers some nice tools in value types to make this a bit easier. Java on the other hand uses escape analysis to attempt to “auto struct” things for you. Both of these approaches have pros and cons. It is less important to worry about which is best, and more important just to understand how your language works, so the code you type will leverage its strengths and avoid its weaknesses. C++ doesn’t make allocation free; allocating too much is one of the primary causes of performance problems in C++ code as well. It does give you the most control to make things perform well, but it will be up to you to figure it out. Manage your memory well.

Benchmark Details

All benchmarks were run with what I believe to be the latest and greatest compilers available for Windows for each language (debatable for C++). If you identify cases where code or compiler/environment choices are suboptimal, email me please.

Environment

Host Process Environment Information:
BenchmarkDotNet=v0.9.8.0
OS=Microsoft Windows NT 6.2.9200.0
Processor=Intel(R) Core(TM) i7-4712HQ CPU 2.30GHz, ProcessorCount=8
Frequency=2240907 ticks, Resolution=446.2479 ns, Timer=TSC

C# Runtime Details

CLR=MS.NET 4.0.30319.42000, Arch=64-bit RELEASE [RyuJIT]
GC=Concurrent Workstation
JitModules=clrjit-v4.6.1590.0  
Jit=RyuJit  GarbageCollection=Concurrent Workstation  

C++ Details

Visual Studio 2015 Update 3, Optimizations set for maximum speed, AVX2 Instructions on

Java Details

Oracle Java 64-bit, version 8 update 102. Testing done with JMH.

Rust Details

rustc 1.13.0, built with cargo rustc --release -- -C lto -C target-cpu=native

Think Before You Parallelize

Tue, 30 Aug 2016 19:17:27 +0000

In 2005 Intel released the Pentium D, which began the era of multi-core desktop CPUs. Today, even our phones have multiple cores. Making use of all of those cores is not always easy, but modern languages and libraries have come a long way to help programmers take advantage of them. All kinds of utility functions and concurrency abstractions have been developed in an attempt to make using all our cores more accessible and simple. Sometimes these abstractions have a lot of overhead though, and sometimes it doesn’t even make sense to parallelize an operation in the first place.

Don’t Parallelize when it is already Parallelized

Suppose you are working on a big number crunching function for a high traffic website that is a bit of a performance bottleneck. You get the idea to parallelize it, and it tests much faster on your dev machine with 4 cores. You expect great things on the 24 core production server. However, once you deploy you find that performance in production is actually slightly worse! What you forgot was that the web server was already parallelizing things at a higher level, using all 24 production cores to handle multiple requests simultaneously. When your parallelized function fires up, all the other cores are busy with other requests. So you take the hit of whatever overhead was required to parallelize the function, with no benefit.

On the other hand, if your website was, say, a low traffic internal website with only a few dozen hits per day, then the plan to parallelize would likely pay off, as there will always be spare cores to crunch the numbers fast. You have to consider the overall CPU utilization of your web server, and how your parallelized function will interact with the other jobs going on. Will it thrash the L1 cache and slow other things down? Test and measure.

Another scenario, say you are working on 3D game, you have some trick physics math where you need to crunch numbers, maybe adding realistic building physics to Minecraft. But separate threads are already handling procedural generation of new chunks, rendering, networking, and player input. If these are keeping most of the system’s cores busy, then parallelizing your physics code isn’t going to help overall. On the other hand if those other threads are not doing a lot of work, cores may indeed be free for you to crunch some physics.

So think about the system your code is running in, if things are getting parallelized at a higher level, it may not do any good to do it again at a lower level. Instead, focus on algorithms that run as efficiently as possible on a single core.

Consider your target hardware

Many developers have very nice machines, probably with a minimum of 8 logical cores these days, possibly more. But consider the entire scope of where your code might run. Will it run on a low cost virtualized web app fabric in the cloud? This may only have 1 or 2 virtual cores for you to work with. Will it run on old desktops or cheap phones that maybe only have 2 cores? An algorithm that gets sped up on your 8 core system at home may not fare so well on systems with only 2 or 3.
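One way to act on this is to ask the runtime how many cores it reports before going parallel at all. Below is a minimal Java sketch of that idea. The class name AdaptiveSum and the cutoffs MIN_CORES and MIN_LENGTH are made up for illustration, not measured values, so tune them on your own target hardware.

```java
import java.util.Arrays;
import java.util.stream.DoubleStream;

final class AdaptiveSum {
    // Hypothetical cutoffs for illustration only; measure before trusting them.
    static final int MIN_CORES = 4;
    static final int MIN_LENGTH = 100_000;

    /** Decides whether a parallel loop is even worth attempting. */
    static boolean shouldParallelize(int cores, int length) {
        return cores >= MIN_CORES && length >= MIN_LENGTH;
    }

    /** Sums the squares, going parallel only when the heuristic allows it. */
    static double squareSum(double[] data) {
        // Logical cores visible to the JVM; says nothing about how busy they are.
        int cores = Runtime.getRuntime().availableProcessors();
        DoubleStream stream = Arrays.stream(data);
        if (shouldParallelize(cores, data.length)) {
            stream = stream.parallel();
        }
        return stream.map(x -> x * x).sum();
    }

    public static void main(String[] args) {
        double[] data = new double[1_000_000];
        Arrays.fill(data, 2.0);
        System.out.println(squareSum(data));
    }
}
```

Note that availableProcessors() only reports how many cores exist, not how many are idle. On a busy web server the core count can be high while the spare capacity is close to zero, so a check like this complements measurement rather than replacing it.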

Case Study: Easy Parallel Loops

It is common in a given programming language to have compiler hints or library functions for doing easy parallel loops when it is appropriate. What happens behind the scenes can be very different depending on the abstractions each language or library uses. In some cases a number of threads may be created to operate on chunks of the loop, or ThreadPools may be used to reduce the overhead of creating Threads. It is important to have a rough understanding of how the abstractions available to you work, so you can make educated guesses about when it might be useful to use them, how to tune them, and how to measure them. At minimum you should consider the following issues.

If the computational overhead of creating or managing threads is greater than the benefit you get, you can end up with slower results than you would get from a single threaded implementation. I will compare some toy workloads with some common parallel loop abstractions in C#, F#, C++, Java, and Rust.

CSharp

    
    // rawArray is assumed to be a double[] field holding the 1 million inputs.
    // The work shown is x * x; the second benchmark swaps in Math.Sin(x).
    public double ImperativeSquareSum()
    {
        var localArray = rawArray;
        double result = 0.0;
        for (int i = 0; i < localArray.Length; i++)
        {
            result += localArray[i] * localArray[i];
        }
        return result;
    }

    public double LinqParallelSquareSum()
    {
        var localArray = rawArray;
        return localArray.AsParallel().Sum(x => x * x);
    }

    public double ParallelForSquareSum()
    {
        var localArray = rawArray;
        object lockObject = new object();
        double result = 0.0;
        Parallel.For(0, localArray.Length,
            () => 0.0, // each worker starts its own partial sum at zero
            (i, loopState, partialResult) => partialResult + localArray[i] * localArray[i],
            (localPartialSum) => { lock (lockObject) { result += localPartialSum; } });
        return result;
    }

1 million doubles - (result += x*x)

Method	Median	Bytes Allocated/Op
Imperative	1.1138 ms	29,480.65
LinqParallel	3.3174 ms	117,802.56
ParallelFor	1.9985 ms	59,264.27

The easiest way to parallelize work like this in C# is with PLINQ. Just type your collection name, then .AsParallel() and fire away with Linq queries. Unfortunately in this case it does no good, and neither does the Parallel.For function. The workload of just squaring doubles and adding them up isn’t enough to get a net benefit here. You would need to roll your own function using ThreadPools or perhaps Threads directly to see a speedup.
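To give a feel for what rolling your own might look like, here is a minimal sketch. It is written in Java rather than C#, purely so it can stand alone, and it is not the code that was benchmarked. The class name ChunkedSquareSum and the one-chunk-per-worker split are my own assumptions. The idea is that each worker sums a contiguous chunk into a private local variable, and the partial results are combined once per chunk instead of once per element.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class ChunkedSquareSum {
    /** Splits the array into one contiguous chunk per worker and adds up the partial sums. */
    static double squareSum(double[] data, ExecutorService pool, int workers) {
        int chunk = (data.length + workers - 1) / workers; // ceiling division
        List<CompletableFuture<Double>> partials = new ArrayList<>();
        for (int w = 0; w < workers; w++) {
            final int start = Math.min(data.length, w * chunk);
            final int end = Math.min(data.length, start + chunk);
            partials.add(CompletableFuture.supplyAsync(() -> {
                double local = 0.0; // private to this task, so no locking is needed
                for (int i = start; i < end; i++) {
                    local += data[i] * data[i];
                }
                return local;
            }, pool));
        }
        double result = 0.0;
        for (CompletableFuture<Double> partial : partials) {
            result += partial.join(); // combine once per chunk, not once per element
        }
        return result;
    }

    public static void main(String[] args) {
        int workers = Runtime.getRuntime().availableProcessors();
        // Create the pool once and reuse it; creating threads per call is the overhead to avoid.
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            double[] data = new double[1_000_000];
            Arrays.fill(data, 2.0);
            System.out.println(squareSum(data, pool, workers));
        } finally {
            pool.shutdown();
        }
    }
}
```

Whether this beats the single threaded loop for a workload as small as x*x still has to be measured; the sketch only removes per-element synchronization and per-call thread creation.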

1 million doubles - (result += Math.Sin(x))

Method	Median	Bytes Allocated/Op
Imperative	37.1130 ms	840,522.92
LinqParallel	9.8497 ms	225,694.67
ParallelFor	8.5615 ms	166,386.40

With the bigger workload there is now a large improvement from parallelizing. It takes a CPU about 2 orders of magnitude more cycles to perform a sin operation than it does an add or multiply. Because of this, per-element overhead cost becomes a much smaller percentage of overall runtime, and we get the ~4x speedup we expect from 4 physical cores. It also reduces the relative cost of the simple Linq approach compared to the more complex Parallel.For abstraction. Consider how big the workload is to help decide if the simple Linq approach is worth the cost.

FSharp

F# has a number of easy to use 3rd party libraries for this purpose. All can be used from C# as well. A quick rundown of them here:

    (* Nessos Streams ParStream *)
    array
    |> ParStream.ofArray                    
    |> ParStream.fold (fun acc x -> acc + x*x)  (+) (fun () -> 0.0) 

    (* FSharp.Collections.ParallelSeq *)
    array
    |> PSeq.reduce (fun acc x -> acc+x*x)

    (* SIMDArray (uses AVX2 SIMD as well) *)
    array
    |> Array.SIMDParallel.fold (fun acc x -> acc + x*x)
                               (fun acc x -> acc + x*x)  
                               (+) (+) 0.0 

1 million doubles (result += x*x)

Method	Time
.NET / F# Parallel SIMDArray	0.26ms
.NET / F# Nessos Streams	1.05ms
.NET / F# ParallelSeq	3.1ms

SIMDArray is ‘cheating’ here as it also does SIMD operations, but I include it because I wrote it, so I do what I want. All of these outperform the PLINQ version above, and all but ParallelSeq beat Parallel.For as well.

1 million doubles (result += Math.Sin(x))

Method	Time
.NET / F# Nessos Streams	6.7ms
.NET / F# ParallelSeq	9.9ms

The Sin operation can’t be SIMDified here so SIMDArray is out. Nessos streams again proves to be better than the core library functions.

C++

Now the same experiment in C++. Most C++ compilers can auto parallelize loops, which you can control via compiler flags or inline hints in your code. For instance with Visual Studio’s C++ compiler you can just put #pragma loop(hint_parallel(8)) on top of a loop, and it will parallelize it if it can. Unfortunately our toy example is (intentionally) a tiny bit too complex for that. Since we are summing up results, this creates a data dependency. Fortunately we can use OpenMP, which is available in Microsoft Visual C++, GCC, Clang, and other popular C++ compilers:

    double result = 0;
    #pragma omp parallel for reduction(+ : result)
    for(int i = 0; i < COUNT; i++) 	
    {
       result += /*Do Work*/;
    }

This is equivalent to the Parallel.For loop used above in C#, where you identify that you will be aggregating data. This is actually less typing and easier to read too, even if the syntax is odd. How does it perform?

1 million doubles - (result += x*x) No SIMD

Method	Median
ForLoop	1.031 ms
ParallelizedForLoop	0.375 ms

We can see that OpenMP is managing a more efficient abstraction than .NET for this case, achieving almost a 3x speedup where .NET was actually a bit slower. Newer OpenMP implementations available in other compilers can also be directed to do SIMD vectorization in the loop for an even bigger speed increase. That does not seem to be available in MS Visual C++, and the usual automatic vectorization does not seem to happen within the omp loop. Automatic vectorization can be done on the single threaded for loop, but it was turned off for these C++ tests. The C++ compiler does emit the older SSE instructions, as is the case with .NET and Java as well, but they only use a single lane. MSVC++ will use all lanes if you specify /fp:fast, but only in the non OMP loop.

1 million doubles - (result += sin(x)) No SIMD

Method	Median
ForLoop	10.625 ms
ParallelizedForLoop	2.44 ms

This time a little more than a 4x speedup, and as you can see the results are overall faster than .NET as well.

Java

Java’s streams library, which performed excellently in a previous blog post, can be used here again. You simply have to tell it you want a parallel stream:

    //Regular stream
    sum = Arrays.stream(array).reduce(0,(acc,x) -> /*Do Work*/);

    //ParallelStream
    sum = Arrays.stream(array).parallel().reduce(0,(acc,x) -> /*Do Work*/);

1 million doubles (result += x*x)

Method	Median
Stream	1.03ms
Parallel Stream	0.375ms -> .8ms

1 million doubles (result += Math.sin(x))

Method	Median
Stream	34.5ms
Parallel Stream	7.8ms -> 14ms

Java performs right on par with C++ in the first example, but falls behind when using Math.sin(). It appears that this is not due to the parallel streams, but due to Java using a more accurate sin implementation, rather than calling the x86 instruction directly. This difference may not exist on other hardware. I do not like it when a language tells me I can’t touch the hardware if I want. A Math.NativeSin() would be nice. The streams library overall though has proven to be excellent, matching C++ in both scalar and parallel varieties.
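For readers who want something they can paste and run, here is a complete sketch of the two toy workloads using streams. It is one reasonable way to fill in the work, not necessarily the exact code that was benchmarked, and the class name StreamSums is made up. It uses map followed by sum rather than a hand-written reduce, because the accumulator passed to a parallel reduce has to be associative, and (acc, x) -> acc + x*x is not.

```java
import java.util.Arrays;

final class StreamSums {
    /** Sequential sum of squares. */
    static double squareSum(double[] data) {
        return Arrays.stream(data).map(x -> x * x).sum();
    }

    /** Same pipeline, but the stream is allowed to split the work across threads. */
    static double parallelSquareSum(double[] data) {
        return Arrays.stream(data).parallel().map(x -> x * x).sum();
    }

    /** The heavier workload: sin of every element, summed in parallel. */
    static double parallelSinSum(double[] data) {
        return Arrays.stream(data).parallel().map(Math::sin).sum();
    }

    public static void main(String[] args) {
        double[] data = new double[1_000_000];
        Arrays.fill(data, 0.5);
        System.out.println(squareSum(data));
        System.out.println(parallelSquareSum(data));
        System.out.println(parallelSinSum(data));
    }
}
```

Parallel streams run on the JVM's common fork/join pool by default, so in a server that is already busy on every core the same caveats from earlier in this post apply.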

Update!

Further experiments with Java using the JMH testing framework have shown the parallel streams to exhibit inconsistent performance: sometimes executing in ~.375ms indefinitely, sometimes executing that fast for only a few dozen iterations and then suddenly taking ~.8ms indefinitely after that. Reasons unknown. If you are a JVM expert and have ideas, please email me.

Javascript

pfffftttt (yeah I know about Web Workers)

Rust

Rust provides no easy loop parallelizing abstractions out of the box; you have to roll your own. OpenMP style features may be in the works for Rust though, and 3rd party libraries are available. So let’s take a look at a nice one called Rayon, which adds a “par_iter” providing similar functions to the regular iter, but in parallel. The code remains very simple:

    
    // The regular iter
    vector.iter().map(|&x| /* do work */).sum()

    // Parallel iter
    vector.par_iter().map(|&x| /* do work */).sum()

1 million doubles (result += x*x)

Method	Median
iter	1.05 ms
par_iter	.375 ms

1 million doubles (result += x.sin())

Method	Median
iter	9.65 ms
par_iter	2.44 ms

These are excellent results, tied with C++, and requiring only a single line of code to express.

Summary

The loop abstractions examined here are just one type of parallel or concurrent programming abstraction available. There is a whole universe out there: Actor Models, Async/Await, Tasks, Thread Pools, and so on. Be sure to understand what you are using, and measure whether it will really be useful, or whether you should focus on fast single threaded algorithms or look for third party tools with better performance.

Aggregated Testing Results

1 million doubles ( result += x*x) No SIMD ( Except SIMDArray)

Method	Time	Lines Of Code
.NET / F# SIMDArray	0.26ms	1
Rust Rayon	0.375ms	1
C++ OpenMP	0.375ms	~5
Java Parallel Streams	0.375 ms -> 0.8ms	1
.NET / F# Nessos Streams	1.05ms	~2
.NET Parallel.For	1.9ms	~6
.NET / F# ParallelSeq	3.1ms	1
.NET Parallel Linq (Sum)	3.3ms	1
.NET Parallel Linq (Aggregate)	8ms	1

1 million doubles ( result += sin(x)) No SIMD

Method	Time	Lines Of Code
Rust Rayon	2.44ms	1
C++ OpenMP	2.44ms	~4
.NET / F# Nessos Streams	6.7ms	~2
Java Parallel Streams	7.8ms -> 14ms	1
.NET Parallel.For	8.5615ms	~6
.NET Parallel Linq (Sum)	9.8497ms	1
.NET / F# ParallelSeq	9.9ms	1
.NET Parallel Linq (Aggregate)	45.6ms	1

Benchmark Details

All benchmarks were run with what I believe to be the latest and greatest compilers available for Windows for each language (debatable for C++). JIT warmup time is accounted for when applicable. If you identify cases where code or compiler/environment choices are suboptimal, email me please.

Environment

Host Process Environment Information:
BenchmarkDotNet=v0.9.8.0
OS=Microsoft Windows NT 6.2.9200.0
Processor=Intel(R) Core(TM) i7-4712HQ CPU 2.30GHz, ProcessorCount=8
Frequency=2240907 ticks, Resolution=446.2479 ns, Timer=TSC

F# / C# Runtime Details

CLR=MS.NET 4.0.30319.42000, Arch=64-bit RELEASE [RyuJIT]
GC=Concurrent Workstation
JitModules=clrjit-v4.6.1590.0
Type=SIMDBenchmark  Mode=Throughput  Platform=X64  
Jit=RyuJit  GarbageCollection=Concurrent Workstation  

C++ Details

Visual Studio 2015 Update 3, Optimizations set for maximum speed, SIMD off

Java Details

Oracle Java 64-bit, version 8 update 102. Testing done with JMH.

Rust Details

rustc 1.13.0-nightly, built with cargo rustc --release -- -C lto -C target-cpu=native

When Big O Fools Ya

Sat, 20 Aug 2016 19:17:27 +0000

Big O notation is a great tool. It allows one to quickly make smart choices among various data structures and algorithms. But sometimes a casual Big O analysis can fool us if we don’t think carefully about the impact of constant factors. One such example comes up very often when programming on modern CPUs: choosing between an Array and a List or Tree type structure.

Memory, Slow Slow Memory

In the early 1980s, the time it took to get data from RAM and the time it took to do computation on the data were roughly in parity. You could use algorithms that hop randomly over the heap, grabbing data and working with it. Since that time, CPUs have gotten faster at a much higher rate than RAM has. Today, a CPU can compute on the order of 100 to 1000 times faster than it can get data from RAM. This means when the CPU needs data from RAM it has to stall for hundreds of cycles, doing nothing. Obviously this would be a useless situation, so modern CPUs have various levels of cache built in. Any time you request one piece of data from RAM, you also get chunks of contiguous memory pulled into the caches on the CPU. The result is that when you iterate over contiguous memory, you can access it about as fast as the CPU can operate, because you will be streaming chunks of data into the L1 cache. If you iterate over memory in random locations, you will often miss the CPU caches, and performance can suffer greatly. If you want to learn more about this, Mike Acton’s CppCon talk is a great starting point and great fun too.

The consequence of this is that arrays have become the go-to data structure if performance is important, sometimes even when Big O analysis suggests they would be slower. Where you wanted a Tree before, you may want a sorted array and a binary search algorithm. Where you wanted a Queue before, you may want a growable array, and so on.
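As a rough illustration of the tree replacement, here is a minimal Java sketch of a read-mostly set backed by a sorted int array. The class name SortedArraySet and its two methods are made up for this example. Lookups are O(log n), just like a balanced tree, but every probe lands in one contiguous block of memory instead of chasing node pointers.

```java
import java.util.Arrays;

final class SortedArraySet {
    private final int[] keys; // sorted and contiguous

    SortedArraySet(int[] values) {
        keys = values.clone();
        Arrays.sort(keys);
    }

    /** O(log n) membership test over contiguous memory. */
    boolean contains(int key) {
        return Arrays.binarySearch(keys, key) >= 0;
    }

    /** Smallest stored key that is >= the given key, or null if there is none. */
    Integer ceiling(int key) {
        int pos = Arrays.binarySearch(keys, key);
        if (pos < 0) {
            pos = -pos - 1; // a miss is encoded as -(insertion point) - 1
        }
        return pos < keys.length ? keys[pos] : null;
    }

    public static void main(String[] args) {
        SortedArraySet set = new SortedArraySet(new int[]{42, 7, 19, 3});
        System.out.println(set.contains(19));
        System.out.println(set.ceiling(8));
    }
}
```

The trade-off is that adding a key means shifting elements, which is O(n), so this shape fits data that is built once (or rarely) and queried often.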

Linked List vs Array List

Once you are familiar with how important contiguous memory access is, it should be no surprise that if you want to iterate over a collection quickly, an array will be faster than a Linked List. Environments with clever allocators and garbage collectors may be able to keep Linked List nodes somewhat contiguous, some of the time, but they can’t guarantee it. Using a raw array usually involves quite a bit more complex code, especially if you want to be able to insert or add items, as you will have to deal with growing the array, shuffling elements around, and so on. Most languages have core libraries which include some sort of growable array data structure to help with this. In C++ you have vector, in C# you have List<T> (aliased as ResizeArray in F#), and in Java there is ArrayList. Usually these data structures expose the same, or a similar, interface as the Linked List collection. I will refer to such data structures as Array Lists from here on, but keep in mind all the C# examples are using the List<T> class, not the older ArrayList class.

So what if you need a data structure that you can insert items into, and iterate over quickly? Let us assume for this example that we have a use case where we will insert into the front of a collection about 5 times more often than we iterate over it. Let us also assume that the Linked List and Array List in our environment have interfaces which are equally pleasant to work with for this task. All that remains then to make a choice is to determine which one performs better. In the interest of optimizing our own valuable time, one might turn to Big O analysis. Referring to the handy Big-O Cheat Sheet, the relevant time complexities for these two data structures are:

	Iterate	Insert
Array List	O(n)	O(n)
Linked List	O(n)	O(1)

Array Lists are problematic for insertion. At a minimum, an insert has to copy every single element beyond the insertion point, moving each over by 1 to make space for the inserted element, which makes it O(n). Sometimes it will also have to allocate a new, bigger array to make room for the insertion. This doesn’t change the Big O time complexity, but it does take time and waste memory. So it seems for our use case, where insert happens 5 times more often than iterating, that the best choice is clear. As long as n is large enough, Linked List should perform better overall.
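To make the cost of a front insert concrete, here is a minimal Java sketch of a growable int array. The class name IntArrayList, the starting capacity of 4, and the doubling growth policy are assumptions for illustration, not the internals of any particular library.

```java
final class IntArrayList {
    private int[] items = new int[4];
    private int size = 0;

    /** Inserts at the front: every existing element moves one slot to the right, so O(n). */
    void insertFront(int value) {
        if (size == items.length) {
            // Full: allocate a bigger array and copy into it, leaving slot 0 free.
            int[] bigger = new int[items.length * 2];
            System.arraycopy(items, 0, bigger, 1, size);
            items = bigger;
        } else {
            // Room available: shift in place (arraycopy handles the overlap correctly).
            System.arraycopy(items, 0, items, 1, size);
        }
        items[0] = value;
        size++;
    }

    int get(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException();
        }
        return items[index];
    }

    int size() {
        return size;
    }

    public static void main(String[] args) {
        IntArrayList list = new IntArrayList();
        for (int i = 1; i <= 6; i++) {
            list.insertFront(i);
        }
        System.out.println(list.get(0) + " ... " + list.get(list.size() - 1));
    }
}
```

The shift is a single bulk copy over contiguous memory, which is exactly the kind of work caches and prefetchers handle well. That is worth keeping in mind when reading the measurements that follow.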

Empiricism

But, to know things for sure, we always have to count. So let us do an experiment in C#, using BenchmarkDotNet. C# provides the generic collections LinkedList<T>, which is a classic Linked List, and List<T>, which is an Array List. Their interfaces are similar, and both allow us to implement our use case with ease. We will assume a worst case scenario for Array List by always inserting at the front, necessitating that the entire array be copied on each insertion. The testing environment specs are:

Host Process Environment Information:
BenchmarkDotNet.Core=v0.9.9.0
OS=Microsoft Windows NT 6.2.9200.0
Processor=Intel(R) Core(TM) i7-4712HQ CPU 2.30GHz, ProcessorCount=8
Frequency=2240910 ticks, Resolution=446.2473 ns, Timer=TSC
CLR=MS.NET 4.0.30319.42000, Arch=64-bit RELEASE [RyuJIT]
GC=Concurrent Workstation
JitModules=clrjit-v4.6.1590.0

Type=Bench  Mode=Throughput  

Test Cases:

    [Benchmark(Baseline=true)]
    public int ArrayTest()
    {        
        //In C#, List<T> is an array backed list.
        List<int> local = arrayList;
        int localInserts = inserts;
        int sum = 0;
        for (int i = 0; i < localInserts; i++)
        {
            local.Insert(0, 1); //Insert the number 1 at the front
        }

        // For loops iterate over List<T> much faster than foreach
        for (int i = 0; i < local.Count; i++)
        {
            sum += local[i];  //do some work here so the JIT doesn't elide the loop entirely
        }
        return sum;
    }

    [Benchmark]
    public int ListTest()
    {
        LinkedList<int> local = linkedList;
        int localInserts = inserts;
        int sum = 0;
        for (int i = 0; i < localInserts; i++)
        {
            local.AddFirst(1); //Insert the number 1 at the front
        }

        // Again, iterating the fastest possible way over this collection
        var node = local.First;
        for (int i = 0; i < local.Count; i++)
        {
            sum += node.Value;
            node = node.Next;
        }

        return sum;
    }

Results:

Method	length	inserts	Median
ArrayTest	100	5	38.9983 us
ListTest	100	5	51.7538 us

The Array List wins by a nice margin. But this is a small list. Big O only tells us about performance as n grows large, so we should see this trend eventually reverse as n grows larger. Let’s try it:

Method	Length	Inserts	Median
ArrayTest	100	5	38.9983 us
ListTest	100	5	51.7538 us
ArrayTest	1000	5	42.1585 us
ListTest	1000	5	49.5561 us
ArrayTest	100000	5	208.9662 us
ListTest	100000	5	312.2153 us
ArrayTest	1000000	5	2,179.2469 us
ListTest	1000000	5	4,913.3430 us
ArrayTest	10000000	5	36,103.8456 us
ListTest	10000000	5	49,395.0839 us

Length	ArrayList (median, us)	LinkedList (median, us)
100	38.9983	51.7538
1000	42.1585	49.5561
100000	208.9662	312.2153
1000000	2179.2469	4913.3430
10000000	36103.8456	49395.0839

Here we get the result that will be counterintuitive to many. No matter how large n gets, the Array List still performs better overall. In order for the Array List to perform worse, the ratio of inserts to iterations has to change, not just the length of the collection. Note that this isn’t an actual failure of Big O analysis; it is merely a common human failure in our application of it. If you actually “did the math”, Big O would tell you that the two data structures here will grow at the same speed when there is a constant ratio of inserts to iterations.
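For the curious, here is one way of doing that math, as a simplified sketch. The symbols are my own assumptions for illustration: k is the number of front inserts per iteration, n is the collection length, a is the per-element cost of shifting during an Array List insert, b is the per-element cost of iterating the Array List, c is the cost of one Linked List front insert, and d is the per-node cost of iterating the Linked List.

```latex
\begin{aligned}
T_{\text{array}}(n) &= k \cdot a\,n + b\,n = (ka + b)\,n \\
T_{\text{list}}(n)  &= k \cdot c + d\,n \\[4pt]
T_{\text{array}}(n) < T_{\text{list}}(n)
  &\iff ka + b < d + \frac{kc}{n} \\
  &\;\approx\; k < \frac{d - b}{a} \quad \text{for large } n
\end{aligned}
```

For a fixed k both totals are linear in n, so growing n alone never flips the winner. Under this model only the ratio k and the constant factors matter: cache misses make d larger than b, and a bulk copy over contiguous memory keeps a small.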

Where the break even point occurs will depend on many factors, though a good rule of thumb suggested by Chandler Carruth at Google is that Array Lists will outperform Linked Lists until you are inserting about an order of magnitude more often than you are iterating. This rule of thumb works well in this particular case, as 10:1 is where we see Array List start to lose:

Method	Length	Inserts	Median
ArrayTest	100000	10	328,147.7954 ns
ListTest	100000	10	324,349.0560 ns

Devils in the Details

The reason Array List wins here is that the integers being iterated over are lined up contiguously in memory. Each time an integer is requested from memory, an entire cache line of integers is pulled into the L1 cache, so the next 64 bytes of data are ready to go. With the Linked List, each call to node.Next makes a pointer hop to the next node, and there is no guarantee that nodes will be contiguous in memory. Therefore we will miss the cache sometimes. But we aren’t always iterating over value types like this; especially in object oriented managed languages, we often iterate over reference types. In that case, even with an Array List, while the pointers themselves are contiguous in memory, the objects they point to are not. The situation is still better than with a Linked List, where you will be making two pointer hops per iteration instead of one, but how does this affect the relative performance?

It narrows it quite a bit, depending on the size of the objects, and the details of your hardware and software environment. Refactoring the example above to use Lists of small objects (12 bytes), the break even point drops to about 4 inserts per iteration:

Method	Length	Inserts	Median
ArrayTestObject	100000	0	674.1864 us
ListTestObject	100000	0	1,140.9044 us
ArrayTestObject	100000	2	959.0482 us
ListTestObject	100000	2	1,121.5423 us
ArrayTestObject	100000	4	1,230.6550 us
ListTestObject	100000	4	1,142.6658 us

Managed C# code suffers a bit in this case because iterating over this Array List incurs some unnecessary array bounds checking. C++ vector would likely fare better. If you were really aggressive about this you could probably write a faster Array List class using unsafe C# code to avoid the array bounds checks. Also, the relative differences here will depend greatly on how your allocator and garbage collector manage the heap, how big your objects are, and other factors. Larger objects tended to cause the relative performance of the Array List to improve in my environment. In the context of a complete application the relative performance of Array List might improve as well as the heap gets more fragmented, but you will have to test to know for sure.

As an aside, if your objects are sufficiently small (16 to 32 bytes or less, depending on various factors) you should consider making them value types (struct in .NET) instead of objects. Not only will you benefit greatly from contiguous memory access, but you will potentially reduce garbage collection overhead as well, depending on your usage of them:

Method	Length	Inserts	Median
ArrayTestObject	100000	10	2,094.8273 us
ListTestObject	100000	10	1,154.3014 us
ArrayTestStruct	100000	10	792.0004 us
ListTestStruct	100000	10	1,206.0713 us

Java may handle this better since it does some automatic cleverness with small objects, or you may have to just use separate arrays of primitive types. Though onerous to type, this can sometimes be faster than an array of structs, depending on your data access patterns. Consider it when performance matters.
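Here is a minimal Java sketch of that "separate arrays of primitive types" layout. The class name Particles and its fields are hypothetical, chosen only to show the shape: one primitive array per field, rather than one array of references to small objects.

```java
final class Particles {
    // One contiguous primitive array per field instead of an array of Particle objects.
    final double[] x;
    final double[] y;
    final double[] mass;

    Particles(int count) {
        x = new double[count];
        y = new double[count];
        mass = new double[count];
    }

    /** Touches only the mass array, so each cache line fetched holds nothing but masses. */
    double totalMass() {
        double sum = 0.0;
        for (int i = 0; i < mass.length; i++) {
            sum += mass[i];
        }
        return sum;
    }

    /** Mass-weighted average of x; reads two arrays in lockstep, both sequentially. */
    double centerOfMassX() {
        double weighted = 0.0;
        for (int i = 0; i < mass.length; i++) {
            weighted += mass[i] * x[i];
        }
        return weighted / totalMass();
    }

    public static void main(String[] args) {
        Particles p = new Particles(3);
        p.mass[0] = 1.0; p.mass[1] = 1.0; p.mass[2] = 2.0;
        p.x[0] = 0.0;    p.x[1] = 2.0;    p.x[2] = 4.0;
        System.out.println(p.totalMass());
        System.out.println(p.centerOfMassX());
    }
}
```

This layout tends to pay off when a loop reads only one or two fields of many items. If most loops read every field of each item, an array of structs (where the language has them) may do just as well, so it is worth measuring with your real access patterns.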

Make Sure the Abstraction is Worth It

It is common for people to object to these sorts of considerations on the basis of code clarity, correctness, and maintainability. Of course each problem domain has its own priorities, but I feel strongly that when the clarity benefit of the abstraction is small, and the performance impact is large, we should choose better performance as a rule. By taking time to understand your environment, you will be aware of cases where a faster but equally clear option exists, as is often the case with Array Lists vs Linked Lists.

As some food for thought, here are 8 different ways to add up a list of numbers in C#, with their run times and memory costs. Checked arithmetic is used in all cases to keep the comparison with Linq fair, as its Sum method uses checked arithmetic. Notice how much better performing the fastest option is. Notice how expensive the most popular method (Linq) is. Notice that the foreach abstraction works out well with raw Arrays, but not with Array List or Linked List. Whatever your language and environment of choice is, understand these details so you can make smart default choices.

Method	Length	Median	Bytes Allocated/Op
LinkedListLinq	100000	990.7718 us	23,192.49
RawArrayLinq	100000	643.8204 us	11,856.39
LinkedListForEach	100000	489.7294 us	11,909.99
LinkedListFor	100000	299.9746 us	6,033.70
ArrayListForEach	100000	270.3873 us	6,035.88
ArrayListFor	100000	97.0850 us	1,574.32
RawArrayForEach	100000	53.0535 us	1,574.84
RawArrayFor	100000	53.1745 us	1,577.77

    [Benchmark(Baseline = true)]
    public int LinkedListLinq()
    {
        var local = linkedList;
        return local.Sum();
    }

    [Benchmark]
    public int LinkedListForEach()
    {
        var local = linkedList;
        int sum = 0;
        checked
        {
            foreach (var node in local)
            {
                sum += node;
            }
        }
        return sum;
    }

    [Benchmark]
    public int LinkedListFor()
    {
        var local = linkedList;
        int sum = 0;
        var node = local.First;
        for (int i = 0; i < local.Count; i++)
        {
            checked
            {
                sum += node.Value;
                node = node.Next;
            }
        }

        return sum;
    }

    [Benchmark]
    public int ArrayListFor()
    {
        //In C#, List<T> is an array backed list
        List<int> local = arrayList;
        int sum = 0;

        for (int i = 0; i < local.Count; i++)
        {
            checked
            {
                sum += local[i];
            }
        }

        return sum;
    }

    [Benchmark]
    public int ArrayListForEach()
    {        
        //In C#, List<T> is an array backed list
        List<int> local = arrayList;
        int sum = 0;
        checked
        {
            foreach (var x in local)
            {
                sum += x;
            }
        }
        return sum;
    }

    [Benchmark]
    public int RawArrayLinq()
    {
        int[] local = rawArray;
        return local.Sum();
    }

    [Benchmark]
    public int RawArrayForEach()
    {
        int[] local = rawArray;
        int sum = 0;
        checked
        {
            foreach (var x in local)
            {
                sum += x;
            }
        }
        return sum;
    }

    [Benchmark]
    public int RawArrayFor()
    {
        int[] local = rawArray;
        int sum = 0;

        for (int i = 0; i < local.Length; i++)
        {
            checked
            {
                sum += local[i];
            }
        }

        return sum;
    }

Adventures in F# Performance

Sat, 13 Aug 2016 19:17:27 +0000

Apologies to functional programming enthusiasts, what follows is a lot of imperative code. What can I say, it is the array library after all!

After working on an F# SIMD Array library for a while, and learning about some nice benchmarking tools for .NET thanks to Jared Hester, I got the idea to try contributing to the F# core libraries myself. I had been poking around in the official Microsoft F# repo because I was modeling my SIMD Library after the core Array library, duplicating all relevant functions in SIMD form. As I got familiar with the code I saw a function I thought I could speed up. Steffen Forkmann pointed me to a blog post of his about how to get started building and contributing to the FSharp language, so I got to work.

Array.filter

This was the first function I thought I could improve, and mostly I was wrong! Array.filter takes an array and a predicate function as its arguments and applies the function to each element of the array. The resulting array contains only the elements that satisfy the predicate. The original implementation used a List<T>, which is a .NET collection similar to a C++ vector: an array backed list that doubles in size as you add items and fill it up. Each time you fill it up, you have to allocate a whole new array and discard the old one. This leads to a worst case scenario where, if the array's length just exceeds a power of 2, like 1025, and 0 elements are filtered, you end up allocating 3,836 elements when you only needed 1025. And then you allocate another 1025 to copy the array out of the List. But in the best case, when everything is filtered, you allocate only a handful of bytes for List overhead:

let filter f (array: _[]) = 
    checkNonNull "array" array
    let res = List<_>() // ResizeArray
    for i = 0 to array.Length - 1 do 
        let x = array.[i] 
        if f x then res.Add(x)
    res.ToArray()

I tried a few things, and settled on this for a while:

let filter f (array: _[]) = 
    checkNonNull "array" array                        
    let temp = Array.zeroCreateUnchecked array.Length
    let mutable c = 0
    for i = 0 to array.Length-1 do                
        if f array.[i] then
            temp.[i] <- true
            c <- c + 1
            
    let result = Array.zeroCreateUnchecked c
    c <- 0    
    let mutable i = 0
    while c < result.Length do
        if temp.[i] then
            result.[c] <- array.[i]
            c <- c + 1
        i <- i + 1
    result

This allocates an array of booleans the same length as the input up front, which are usually stored as bytes in .NET. So in the common case, where you have a 32bit or 64bit pointer, int, or float, as your array element, it will allocate no more than 1/8 to 1/4 of your array size in extra data instead of 3x to 4x. Reducing GC pressure is a big win with garbage collected languages so that seemed like a good thing. There are some gotchas though:

The loops now both have branches in them.
The branch pattern will sometimes be random, so branch prediction will miss them, which is slow.
The performance advantage goes negative compared to the original implementation as the size of the array type shrinks.

So in cases where most things are filtered, and the distribution of elements is somewhat random as to whether they get filtered or not, performance was sometimes worse. Performance also differed in 32bit vs 64 bit builds, and on different machines. Benchmarking this was really hard because you have to account for different array type sizes, lengths, different distribution of filtering and amount of filtering. It didn’t always win, and it was hard to decide if it was really better.

Then Asik suggested a solution which ended up being the final answer:

let filter f (array: _[]) = 
    checkNonNull "array" array
    let res = Array.zeroCreateUnchecked array.Length 
    let mutable count = 0
    for x in array do                 
        if f x then 
            res.[count] <- x
            count <- count + 1
    Array.subUnchecked 0 count res

This just allocates an entire array of whatever type was input, adds elements into it, and then uses Array.sub which calls fast native code to copy sub sections of arrays into new ones. This was always faster than the original core lib solution, but sometimes allocated more memory. The Microsoft guys considered that a net win, so they took it. The improvement here varied a lot, but was usually around 20% faster. Worst case performance would be with large array types (Like a 16 byte struct) where most elements are likely to get filtered. You might want to roll your own filter if you are doing that. This same optimization was applied to the similar Array.choose function.

UPDATE:

Asik and I have collaborated and got a new filter merged that keeps the speed of the above solution, while reducing allocations by ~30% on average. We did this by implementing a growing array by hand, taking advantage of extra knowledge we have, like that the upper bound for its size is array.Length, and some other tricks. Another interesting solution is being proposed by Paul Westcott which uses a bit array. This may reduce allocations yet again while maintaining similar performance, pretty cool.

As an aside, if your predicate is a pure function, and a fast function, you can apply the predicate twice to avoid any extra allocations at all. This is very fast for sufficiently simple predicates, like > or < comparisons.
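A minimal sketch of that two-pass idea, using only public Array functions. The name filterTwoPass is made up for illustration and is not part of the core library. It assumes the predicate is pure: if the second pass returned true more often than the first, the writes would run past the end of the result.

```fsharp
/// Hypothetical helper: filters with two predicate passes and one exact-size allocation.
let filterTwoPass (predicate: 'T -> bool) (array: 'T[]) =
    // Pass 1: count the matches so the result can be allocated at its exact size.
    let mutable count = 0
    for i = 0 to array.Length - 1 do
        if predicate array.[i] then count <- count + 1
    // Pass 2: apply the predicate again and copy the matches across.
    let result : 'T[] = Array.zeroCreate count
    let mutable next = 0
    for i = 0 to array.Length - 1 do
        let x = array.[i]
        if predicate x then
            result.[next] <- x
            next <- next + 1
    result

let bigOnes = filterTwoPass (fun x -> x > 2) [| 1; 2; 3; 4 |]
// bigOnes = [|3; 4|]
```

The trade is one extra predicate call per element in exchange for no temporary buffer, which only pays off when the predicate is about as cheap as a comparison.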

Performance test results for filtering 50% of random doubles on 64bit RyuJit

Method	Median	StdDev	Gen 0	Gen 1	Gen 2	Bytes Allocated/Op
CoreFilter	10.7906 ms	0.2096 ms	20.00	-	314.00	3 953 196,34
ArrayFilter	8.3605 ms	0.0374 ms	-	-	329.99	3 762 296,97

Array.partition

I wasn’t too happy with the filter optimization because I felt like sometimes taking a memory hit wasn’t so great. So I started scanning through the library for other opportunities, and came across Array.partition, which takes an array and a predicate, returning a tuple with two arrays. One array contains every element that was true, the other every element that was false.

let partition f (array: _[]) = 
    checkNonNull "array" array
    let res1 = List<_>() // ResizeArray
    let res2 = List<_>() // ResizeArray
    for i = 0 to array.Length - 1 do 
        let x = array.[i] 
        if f x then res1.Add(x) else res2.Add(x)
    res1.ToArray(), res2.ToArray()

I had more respect for the (array backed) List solutions now, after failing to get a clear win by using raw arrays with filter. So I tried to look for something more clever. I realized that one invariant here is that the result will always be the same size as the input. If the input is 100 elements, the output will be 100 elements. So fundamentally, we shouldn’t need to use a data structure that grows. I thought about creating a struct where you could tag each element with a true or false on the first pass, and then copy the results into the two output arrays. But that still wastes array.Length bytes of memory. Then I had a great idea, maybe my best idea! Allocate an array the same size and type as the input, and put all the true elements on the left, and all the false elements on the right! The only memory wasted is an extra int to keep track of where one set ends and the other begins. You then just copy the left side of the array into the first result, and the reverse of the right side of the array into the second result:

let partition f (array: _[]) = 
    checkNonNull "array" array
    let res = Array.zeroCreateUnchecked array.Length        
    let mutable upCount = 0
    let mutable downCount = array.Length-1    
    for x in array do                
        if f x then 
            res.[upCount] <- x
            upCount <- upCount + 1
        else
            res.[downCount] <- x
            downCount <- downCount - 1
        
    let res1 = Array.subUnchecked 0 upCount res
    let res2 = Array.zeroCreateUnchecked (array.Length - upCount)    

    downCount <- array.Length-1
    for i = 0 to res2.Length-1 do
        res2.[i] <- res.[downCount]        
        downCount <- downCount - 1

    res1 , res2
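
The core library version above relies on internal helpers (checkNonNull, zeroCreateUnchecked, subUnchecked). As a rough way to try the idea outside the library, here is the same algorithm sketched with public Array functions only. The name partitionSketch is hypothetical. The usage line shows why the right-hand side is copied in reverse: it restores the original order of the false elements.

```fsharp
/// Hypothetical stand-alone version of the two-ended partition.
let partitionSketch (predicate: 'T -> bool) (array: 'T[]) =
    let res : 'T[] = Array.zeroCreate array.Length
    let mutable upCount = 0                      // next free slot on the left
    let mutable downCount = array.Length - 1     // next free slot on the right
    for x in array do
        if predicate x then
            res.[upCount] <- x
            upCount <- upCount + 1
        else
            res.[downCount] <- x
            downCount <- downCount - 1
    let trues = Array.sub res 0 upCount
    // The false elements were written right to left, so read them back in reverse.
    let falseCount = array.Length - upCount
    let falses = Array.init falseCount (fun i -> res.[array.Length - 1 - i])
    trues, falses

let evens, odds = partitionSketch (fun x -> x % 2 = 0) [| 1 .. 10 |]
// evens = [|2; 4; 6; 8; 10|], odds = [|1; 3; 5; 7; 9|]
```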

Performance test results for partitioning random int arrays with predicate `(fun x -> x % 2 = 0)`

Method	ArrayLength	Median	StdDev	Scaled	Gen 0	Gen 1	Gen 2	Bytes Allocated/Op
Partition	10	180.8758 ns	16.3650 ns	1.00	0.01	-	-	185.22
NewPartition	10	76.6145 ns	1.2114 ns	0.42	0.01	-	-	90.38
Partition	10000	117,268.5175 ns	1,064.2667 ns	1.00	6.40	-	-	99,742.26
NewPartition	10000	79,020.6291 ns	474.4149 ns	0.67	2.64	-	-	43,572.00
Partition	10000000	154,545,402.8213 ns	3,116,253.3692 ns	1.00	-	-	62.02	59,133,643.66
NewPartition	10000000	98,768,489.7225 ns	726,198.4079 ns	0.64	-	-	34.00	29,686,956.03

Adventures in IL and Disassembly

One of the performance drawbacks of most managed/safe languages is that they do array bounds checking. This prevents you from accidentally wandering off the end of an array and overwriting memory at random, which is a useful feature. But it comes with a performance cost, as you end up eating some CPU cycles checking array bounds each time through the loop. The .NET JIT will identify some but not all cases when these bounds checks can be eliminated. You have to take some care to structure your loop just right, or it will be missed. F# added some confusion here since their loops have different syntax than C#, and sometimes compile strangely, or badly. You can peek at the byte code or C# equivalent representation of it with tools like ILSpy. For instance, this loop:

let len = array.Length
for i = 0 to len-1 do
  (* stuff *)

compiles to the C# equivalent of:

int num = len - 1;
if (num >= i)
{
    do
    {
        // stuff
        i++;
    }
    while (i != num + 1);
}

This is madness. Maybe some of that madness gets JITted away, but it definitely does cause the array bounds elision to be missed, slowing it down. This was a pattern used in many places in the core Array library, so I just went through and mechanically replaced them all with the pattern that works:

for i = 0 to array.Length-1 do
    (* stuff *)

Which becomes the C# equivalent of:

for (int i = 0; i < array.Length; i++) {
    //stuff
}

AHHHH, much better, and now we get array bounds elision from the JIT too. The impact of this change can be pretty big in some cases, when any functions applied per array element are very simple the array bounds checking makes up a sizeable % of total run time. In other cases it will be a very small impact. Array.map (fun x-> SieveofEratosthenes x) isn’t going to be noticeably better. But it impacted almost all of the functions in the Array module, and I assume (which is dangerous) it would take some overhead out of JITing the IL as well.

If you want to know for sure if the loop is doing what you want, you will need to view the disassembly, since the 32 bit JIT differs from the 64 bit one, which differs from Mono, etc. In Visual Studio you can get at it from Debug -> Windows -> Disassembly while the program is running. Here is an example of code with, and without a bounds check:

Since this process is done in the JIT, you don’t always have control over it. Sometimes you can massage your code to be sure the JIT will do the right thing, but sometimes you can’t. If you get desperate, write the function in C# using an unsafe loop, and call it from F#.

Other loop patterns to beware of in .NET:

For loops that go from 0 to anything less than the array length will not get the bounds check elided.
For loops that go backwards will not get array bounds checking elided.
With for loops over arrays in F# that have a stride length of something other than 1, the compiler generates a loop that uses an Enumerator, which is much slower and generates garbage. Use a while loop, or tail recursion instead.
For loops over arrays that are class members will miss the array bounds elision. Make a function local copy of the array reference first.
The for x in array syntax in F# works out fine. There may be other performance considerations but a normal for loop is generated and bounds checking is elided.

These things are all true as of 64bit RyuJIT .NET 4.6.2 and F# 4.4.0, some of them are being actively worked on and could improve soon.
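
Two of the workarounds from the list above can be sketched as follows. The names sumEveryNth and Samples are invented for illustration, and stride is assumed to be positive. Whether the JIT actually elides the bounds checks in code like this still has to be confirmed in the disassembly for your runtime.

```fsharp
/// Strided iteration with a while loop instead of `for i in 0 .. stride .. n`,
/// which avoids the enumerator based loop described above.
let sumEveryNth (stride: int) (values: float[]) =
    let mutable sum = 0.0
    let mutable i = 0
    while i < values.Length do
        sum <- sum + values.[i]
        i <- i + stride
    sum

/// An array held as a class member: take a function-local copy of the
/// reference before looping over it.
type Samples(data: float[]) =
    member this.Total() =
        let local = data
        let mutable sum = 0.0
        for i = 0 to local.Length - 1 do
            sum <- sum + local.[i]
        sum

let everyOther = sumEveryNth 2 [| 1.0; 2.0; 3.0; 4.0; 5.0 |]   // 1.0 + 3.0 + 5.0
let total = Samples([| 1.0; 2.0; 3.0 |]).Total()
```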

Performance test results of bounds check elision from `Array.map` with mapping function `(fun x -> x + 1)`

Method	Length	Median	StdDev	Scaled
Old	10	17.5030 ns	0.5275 ns	1.00
New	10	14.1205 ns	0.4858 ns	0.81
Old	10000	10,212.8762 ns	118.7990 ns	1.00
New	10000	8,963.2690 ns	329.8907 ns	0.88

Delving Into Parallel

The Array module has a sub module Parallel. Array.Parallel.map, for instance, will use a Parallel.For loop to multithread your map operation. Scanning through these I saw Parallel.partition:

let partition predicate (array : 'T[]) =
    checkNonNull "array" array
    let inputLength = array.Length
    let lastInputIndex = inputLength - 1

    let isTrue = Array.zeroCreateUnchecked inputLength
    Parallel.For(0, inputLength, 
        fun i -> isTrue.[i] <- predicate array.[i]
        ) |> ignore
    
    let mutable trueLength = 0
    for i in 0 .. lastInputIndex do
        if isTrue.[i] then trueLength <- trueLength + 1
    
    let trueResult = Array.zeroCreateUnchecked trueLength
    let falseResult = Array.zeroCreateUnchecked (inputLength - trueLength)
    let mutable iTrue = 0
    let mutable iFalse = 0
    for i = 0 to lastInputIndex do
        if isTrue.[i] then
            trueResult.[iTrue] <- array.[i]
            iTrue <- iTrue + 1
        else
            falseResult.[iFalse] <- array.[i]
            iFalse <- iFalse + 1

    (trueResult, falseResult)

What stuck out at me here was that they were iterating over the entire isTrue array a second time in order to count up how many true elements there are. This struck me as fundamentally unnecessary. So I tried creating an accumulation variable above the Parallel.For call, and just incrementing that within the loop. Nope! You can’t add in parallel like that safely on x86 (or perhaps any architecture?). It worked sometimes but not always. Then I remembered System.Threading.Interlocked.Increment(Int32), which provides a thread safe way to increment an int. This worked! But then it was just as slow as the single threaded version of the function, since every thread was constantly locking on the increment function. So I read the documentation. Sometimes this stuff is awful to read. Func<Int32>? Action<TLocal>? PC LOAD LETTER?!?! But if you go slow and stare at this for a while it will start to make sense. The key info here is that there is a Parallel.For overload which can internally keep track of an accumulator for you. This will let us track the total number of true elements without iterating over the array again. So the new solution becomes:

let partition predicate (array : 'T[]) =
    checkNonNull "array" array
    let inputLength = array.Length                
    let isTrue = Array.zeroCreateUnchecked inputLength                
    let mutable trueLength = 0
                                    
    Parallel.For(0, 
                    inputLength, 
                    (fun () -> 0),
                    (fun i _ trueCount -> 
                    if predicate array.[i] then
                        isTrue.[i] <- true
                        trueCount + 1
                    else
                        trueCount),                        
                    Action<int> (fun x -> System.Threading.Interlocked.Add(&trueLength,x) |> ignore) ) |> ignore
                    
    let res1 = Array.zeroCreateUnchecked trueLength
    let res2 = Array.zeroCreateUnchecked (inputLength - trueLength)
    let mutable iTrue = 0
    let mutable iFalse = 0
    for i = 0 to isTrue.Length-1 do
        if isTrue.[i] then
            res1.[iTrue] <- array.[i]
            iTrue <- iTrue + 1
        else
            res2.[iFalse] <- array.[i]
            iFalse <- iFalse + 1

    res1, res2

In this case, each thread has its own accumulator value, keeping track of its own trueCount. So they are free to increment it without locking. As threads finish, they each do a locked add, adding their own trueCount to the final result stored in trueLength. This locked add only happens NumThreads times, instead of array.Length times, so it causes no terrible performance penalty. The final result is about 30% faster with no memory use penalty.
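
Stripped of the partition logic, the accumulator pattern can be sketched as a stand-alone counting function. The name countMatching is made up for illustration. The shared total is kept in a one-element array here, which is my own choice so that its address can be handed to Interlocked.Add from inside the closure; the core library code above uses a mutable local instead.

```fsharp
open System
open System.Threading
open System.Threading.Tasks

/// Hypothetical helper: counts matching elements with per-worker subtotals.
let countMatching (predicate: 'T -> bool) (array: 'T[]) =
    let total = [| 0 |]
    Parallel.For(
        0,
        array.Length,
        (fun () -> 0),                         // localInit: each worker starts at 0
        (fun i _ localCount ->                 // body: no locking, purely local state
            if predicate array.[i] then localCount + 1 else localCount),
        Action<int>(fun localCount ->          // localFinally: one locked add per worker
            Interlocked.Add(&total.[0], localCount) |> ignore))
    |> ignore
    total.[0]

let evenCount = countMatching (fun x -> x % 2 = 0) [| 1 .. 1000 |]
// evenCount = 500
```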

Performance test results of `Array.Parallel.partition` with predicate `(fun x -> x % 2 = 0)`

Method	Length	Median	StdDev	Scaled	Gen 0	Gen 1	Gen 2	Bytes Allocated/Op
Original	1000	21.8514 us	0.5300 us	1.00	0.16	-	-	3,471.77
New	1000	20.5297 us	0.8840 us	0.94	0.17	-	-	3,489.75
Original	10000	160.0466 us	3.1249 us	1.00	1.21	-	-	28,955.03
New	10000	118.1885 us	2.8572 us	0.74	1.20	-	-	28,666.02
Original	100000	1,282.9827 us	7.3705 us	1.00	-	-	10.17	211,334.08
New	100000	917.0063 us	17.4501 us	0.71	-	-	7.27	151,441.53
Original	1000000	12,467.9427 us	728.8799 us	1.00	-	-	65.99	2,353,833.73
New	1000000	9,700.7108 us	990.4339 us	0.78	-	-	65.24	2,309,151.64
Original	10000000	125,043.1745 us	1,753.3497 us	1.00	-	-	35.28	29,670,713.02
New	10000000	86,908.7271 us	1,472.4448 us	0.70	-	-	35.53	29,909,345.33

Recursion is slower … sometimes

Sometimes a recursive implementation will be a substantive speed hit. While the F# compiler is very good at tail recursion optimization, turning most recursive functions into nice loops, there can still be a small to medium performance penalty in some cases. For example, Array.compareWith got about 20% faster when converted from this recursive implementation to while loops:

let inline compareWith (comparer:'T -> 'T -> int) (array1: 'T[]) (array2: 'T[]) = 
    checkNonNull "array1" array1
    checkNonNull "array2" array2

    let length1 = array1.Length
    let length2 = array2.Length
    let minLength = Operators.min length1 length2

    let rec loop index  =
        if index = minLength  then
            if length1 = length2 then 0
            elif length1 < length2 then -1
            else 1
        else  
            let result = comparer array1.[index] array2.[index]
            if result <> 0 then result else
            loop (index+1)

    loop 0

It took some care to realize this performance improvement, early attempts were actually worse. So keep in mind that testing and IL inspection will be necessary to know for sure if a recursive implementation is a problem. It often is not.
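
For comparison, here is one way the same logic can be written with a while loop. This is a sketch of the general shape only, not the code that was merged, and the name compareWithLoop is hypothetical. As noted above, whether a rewrite like this is actually faster has to be measured.

```fsharp
/// Hypothetical loop based equivalent of the recursive compareWith.
let compareWithLoop (comparer: 'T -> 'T -> int) (array1: 'T[]) (array2: 'T[]) =
    if isNull (box array1) then nullArg "array1"
    if isNull (box array2) then nullArg "array2"
    let length1 = array1.Length
    let length2 = array2.Length
    let minLength = min length1 length2
    let mutable i = 0
    let mutable result = 0
    // Stop at the first pair that differs, or at the end of the shorter array.
    while i < minLength && result = 0 do
        result <- comparer array1.[i] array2.[i]
        i <- i + 1
    if result <> 0 then result
    elif length1 = length2 then 0
    elif length1 < length2 then -1
    else 1

let shorterFirst = compareWithLoop compare [| 1; 2 |] [| 1; 2; 3 |]
// shorterFirst = -1
```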

Notes on Benchmarking

Benchmarking managed code isn’t easy. As well as dealing with non deterministic hardware and operating systems just like you do with C/C++, you also add the complication of the JIT, the runtime, and garbage collection. All of these things can cause code to run faster one moment, and slower the next. Especially when looking for marginal gains you can easily fool yourself into thinking you have made progress when you have actually regressed, or vice versa. Also you may have achieved a slight runtime improvement for your function, but generated more garbage that has to be collected, leading to a net loss. Or maybe you thrashed the L1 cache with your new algorithm such that the next function goes slower, when run in the context of a full program. These can be hard to identify in a benchmark. This is why I cringe a bit when I see people say “you can optimize it later when you identify that it is a bottleneck”. Identifying bottlenecks can be hard. If you see easy ways to avoid pointer hopping or creating garbage, take them.

I used the BenchmarkDotNet library which helps solve some, but not all, of these challenges. It will automatically warm up the JIT for you, figure out how many trials need to run for each test to get good data, and report on memory usage and GC events (though this feature has some bugs, so be careful). It also spits out nice reports on the results in HTML, CSV, and Markdown formats. The Markdown format is very handy as you can paste it into your Pull Requests. You can see a sample stub that I used here.

You can do this too

If you are interested in improving the quality or performance of software in the world, consider doing something about it. You do not need to be highly skilled or experienced. I am just an average web developer by day, not a language architect or assembler expert or anything. You just need some patience. Learning how a given project’s repository and build process works is often the hardest part. Ask questions of the community, don’t worry about seeming dumb. You will get less dumb every time you ask a dumb question. Pick your favorite open source language or library, make it better. Code bases are huge and even those written by grey beard wizards will have mistakes and bottlenecks that you can find and fix. If the code base is way above your head, start with improving documentation or error messages or other important but not so glamorous work. It is always highly appreciated, and can be a way to familiarize yourself with the project and endear yourself to the other team members. Plus it also makes the world a better place.

What does this mean for FSharp

The net effect of all of this for F# programs out there will vary considerably. For the changes I’ve been working on you need to be making use of arrays (you should! Learn about cache misses). There are PRs from other people for performance improvements in other areas too, which is great to see. If you are using arrays and using the core library Array module, things will just go faster. Whether it makes a substantive difference just depends on your use case. For fun I put together a toy example that hits a lot of the key functions that have been sped up, and compared the current 4.4.0 Core lib against what will hopefully all get merged into 4.4.1:

(* Init,create and map faster due to array bounds check elision *)
let array1 = Array.init TEN_MILLION (fun i ->  i)                                               
let array2 = Array.create TEN_MILLION 5                
let added = Array.map2 (fun x y -> x+y) array1 array2

(* Rev faster due to array bounds elision and micro optimizations *)
let backwards = added |> Array.rev        

(* AverageBy is much faster as it now no longer just calls into Seq.AverageBy *)
let average = backwards |> Array.averageBy (fun x-> (float)x)        

(* Use aggregating Parallel.For loop *)
let greaterThan400 = backwards
                     |> Array.Parallel.choose (fun x -> match x with 
                                                        | x when x > 400 -> Some x
                                                        | _ -> None )
                                            
(* Partition faster and uses less memory due to new algorithm *)
let (even,odd) = greaterThan400 |> Array.partition (fun x -> x % 2 = 0)        

(* Filter faster due to using preallocated array instead of List<T> *)
let filtered = even |> Array.filter(fun x -> x % 4 = 0)

Results

Host Process Environment Information:
BenchmarkDotNet=v0.9.8.0
OS=Microsoft Windows NT 6.2.9200.0
Processor=Intel(R) Core(TM) i7-4712HQ CPU 2.30GHz, ProcessorCount=8
Frequency=2240908 ticks, Resolution=446.2477 ns, Timer=TSC
CLR=MS.NET 4.0.30319.42000, Arch=64-bit RELEASE [RyuJIT]
GC=Concurrent Workstation
JitModules=clrjit-v4.6.1590.0

Type=SIMDBenchmark  Mode=Throughput  Platform=X64  
Jit=RyuJit  GarbageCollection=Concurrent Workstation  

Method	Length	Median	StdDev	Gen 0	Gen 1	Gen 2	Bytes Allocated/Op
Old	10	3.4036 us	0.0806 us	0.07	0.00	-	852.03
New	10	3.4044 us	0.3243 us	0.08	0.00	-	1,118.24
Old	1000	52.2478 us	4.7930 us	2.15	-	-	31,762.76
New	1000	41.3602 us	2.7741 us	1.37	-	-	22,699.78
Old	100000	6,001.7350 us	286.9376 us	58.77	2.10	74.12	3,114,798.18
New	100000	3,296.3410 us	89.5167 us	-	-	78.91	3,254,820.11
Old	1000000	40,985.6462 us	830.1541 us	556.34	-	211.89	30,759,505.83
New	1000000	33,555.2876 us	3,514.3688 us	519.70	-	229.11	29,579,065.56
Old	10000000	405,780.3801 us	8,032.9328 us	5,660.00	-	227.00	333,268,994.49
New	10000000	286,415.5958 us	7,394.8703 us	5,049.00	-	183.00	287,616,350.72

Making the obvious code fast

Fri, 22 Jul 2016 19:17:27 +0000

Jonathan Blow of “The Witness” fame likes to talk about just typing the obvious code first. Usually it will turn out to be fast enough. If it doesn’t, you can go back and optimize it later. His thoughts come in the context of working on games in C/C++. I think these languages, with modern incarnations of their compilers, are compatible with this philosophy. Not only are the compilers very mature but they are low level enough that you are forced to do things by hand, and think about what the machine is doing most of the time, especially if you stick to C or a ‘mostly C’ subset of C++. However in most higher level languages, there tend to be performance traps where the obvious, or idiomatic solution is particularly bad.

What counts as obvious or idiomatic, is of course often a matter of opinion. The language itself may encourage certain choices by making them easier to type, or highlighting them in documentation and teaching materials. The community that grows up around a language may just come to prefer certain constructs and encourage others to use them. It is very common to see programmers encouraged to use high level constructs over lower level ones, in the interest of readability and simplicity. This is a worthy ideal, but often people aren’t aware of what the cost really is. Some of these constructs have a much higher cost than people realize.

In this article I will explore a number of languages, with a toy map and reduce example. Within each language, I will explore a number of approaches, ranging from high level to hand coded imperative loops and SIMD operations. Some of the performance pitfalls I will show may be specific to this toy example. With a different toy example, the languages that excel and those that do poorly could be totally different. This is meant merely to explore, and get people thinking about the performance cost of abstractions. For each case I will show code examples so you can consider the differences in complexity.

The Task

We wish to take an array of 32 million 64bit floating point values, and compute the sum of their squares. This will let us explore some fundamental abilities of various languages: their ability to iterate over arrays efficiently, whether they can vectorize basic loops, and whether higher order functions like map and reduce compile to efficient code. When applicable, I will show runtimes of both map and reduce, so we get insight into whether the language can stream higher order functions together, and also the runtime with a single reduce or fold operation.

The Results

Benchmark Details

C - 17 milliseconds

    double sum = 0.0;    
    for (int i = 0; i < COUNT; i++) 
    {
        double v = values[i] * values[i]; 
        sum += v;
    }

ANSI C is a bare bones language, no higher order functions or loop abstractions exist to even think about, so this imperative loop is what most programmers will turn to to complete this task. If I thought that this would be a performance critical piece of code, I might use SIMD intrinsics, which requires this nasty mess:

C - SIMD Explicit - 17 milliseconds

    __m256d vsum = _mm256_setzero_pd();
    for(int i = 0; i < COUNT/4; i=i+1) {
        __m256d v = values[i];
        vsum = _mm256_add_pd(vsum,_mm256_mul_pd(v,v));
    }
    double *tsum = (double *)&vsum; // view the 4 lanes as plain doubles
    double sum = tsum[0]+tsum[1]+tsum[2]+tsum[3];

However, notice that the runtime is the same for the obvious and SIMD versions! It turns out that the obvious code was automatically turned into SIMD enhanced machine instructions, a process called “auto vectorization”. Visual C++ is not known for being the most clever of C++ compilers but it still gets this right:

double sum = 0.0;    
	for (int i = 0; i < COUNT; i++) {
00007FF7085C1120  vmovupd     ymm0,ymmword ptr [rcx]  
00007FF7085C1124  lea         rcx,[rcx+40h]  
		double v = values[i] * values[i];  //square em
00007FF7085C1128  vmulpd      ymm2,ymm0,ymm0  
00007FF7085C112C  vmovupd     ymm0,ymmword ptr [rcx-20h]  
00007FF7085C1131  vaddpd      ymm4,ymm2,ymm4  
00007FF7085C1135  vmulpd      ymm2,ymm0,ymm0  
00007FF7085C1139  vaddpd      ymm3,ymm2,ymm5  
00007FF7085C113D  vmovupd     ymm5,ymm3  
00007FF7085C1141  sub         rdx,1  
00007FF7085C1145  jne         imperative+80h (07FF7085C1120h)  
		sum += v;
	}

To get the SIMD instructions used here, which can operate on 4 doubles at a time, you have to tell the compiler that you want ‘fast floating point’ and that you want to target AVX2 instructions as well. Results will be different when vectorized, because the additions happen in a different order, though they will actually be more accurate, not less (in this case, and maybe in all cases?).
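
The reason a ‘fast floating point’ switch is needed at all is that floating point addition is not associative, so a compiler may not regroup the additions unless you allow it to. The sketch below is written in F#, not C, and uses made up function names. It mimics what a 4 lane vectorized loop does to the order of additions by keeping four independent partial sums.

```fsharp
// Regrouping changes the result: these two values are not equal as IEEE 754 doubles.
let leftToRight = (0.1 + 0.2) + 0.3
let regrouped = 0.1 + (0.2 + 0.3)

/// Sum of squares in strict left to right order, like the plain loop.
let sumOfSquaresInOrder (values: float[]) =
    let mutable sum = 0.0
    for i = 0 to values.Length - 1 do
        sum <- sum + values.[i] * values.[i]
    sum

/// Sum of squares with four independent partial sums, one per "lane".
let sumOfSquaresFourLanes (values: float[]) =
    let mutable s0 = 0.0
    let mutable s1 = 0.0
    let mutable s2 = 0.0
    let mutable s3 = 0.0
    let blockEnd = values.Length - (values.Length % 4)
    let mutable i = 0
    while i < blockEnd do
        s0 <- s0 + values.[i] * values.[i]
        s1 <- s1 + values.[i + 1] * values.[i + 1]
        s2 <- s2 + values.[i + 2] * values.[i + 2]
        s3 <- s3 + values.[i + 3] * values.[i + 3]
        i <- i + 4
    // Combine the lanes, then handle any leftover elements one at a time.
    let mutable sum = (s0 + s1) + (s2 + s3)
    while i < values.Length do
        sum <- sum + values.[i] * values.[i]
        i <- i + 1
    sum
```

On most inputs the two functions agree to many digits but not necessarily to the last bit, which is the kind of difference the compiler flag asks you to accept.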

C# Linq Select Sum - 260 milliseconds

    var sum = values.Sum(x => x * x);

C# Linq Aggregate - 280 milliseconds

    var sum = values.Aggregate(0.0,(acc, x) => acc + x * x);

C# for loop - 34 milliseconds

    double sum = 0.0;
    foreach (var v in values)
    {       
        double square = v * v;
        sum += square;       
    }

Stepping up a level to C#, we have a couple of idiomatic solutions. Many C# programmers today might use Linq which as you can see is much slower. It also creates a lot of garbage, putting more pressure on the garbage collector. Oddly, the Aggregate function, which is equivalent to fold or reduce in most other languages, is slower despite being a single step instead of two. The foreach loop in the last example is also commonly used. While this pattern has big performance pitfalls when used on collections like List<T>, with arrays it compiles to efficient code. This is nice as it saves you some typing without runtime penalty. The runtime here is still twice as slow as the C code, but that is entirely due to not being automatically vectorized. With the .NET JIT, it is not considered a worthwhile tradeoff to do this particular optimization.

With C# you also have to take some care with array access in loops, or bounds checking overhead can be introduced. In this case the JIT gets it right, and there is no bounds checking overhead.

C# SIMD Explicit - 17 milliseconds

    Vector<double> vsum = new Vector<double>(0.0);
    for (int i = 0; i < COUNT; i += Vector<double>.Count)
    {
        var value = new Vector<double>(values, i);
        vsum = vsum + (value * value);
    }
    double sum = 0;
    for(int i = 0; i < Vector<double>.Count;i++)
    {
        sum += vsum[i];
    }

While the .NET JIT won’t do SIMD automatically, we can explicitly use some SIMD instructions, and achieve performance nearly identical to C. An advantage here for C# is that the SIMD code is a bit less nasty than using intrinsics, and that particular instructions whether they be AVX2, SSE2, NEON, or whatever the hardware supports, can be decided upon at runtime. Whereas the C code above would require separate compilation for each architecture. A disadvantage for C# is that not all SIMD instructions are exposed by the Vector library, so something like SIMD enhanced noise functions can’t be done with nearly the same performance. As well, the machine code produced by the Vector library is not always as efficient when you step out of toy examples.

F# - 127 milliseconds

    let sum =
        values
        |> Array.map (fun x -> x * x)
        |> Array.sum

The obvious F# code is beautiful, I like typing this, and I like working with it. But performance is terrible. Just as with C# you get no auto vectorization, as they use the same JIT. Additionally the array is iterated over twice, once to map them to squares, and once to sum them. Finally, since immutability is the default, each operation returns a new array, incurring allocation costs and GC pressure. So the total performance impact on an application is likely to be worse than this micro benchmark would suggest.

F# Streams - 98 milliseconds

    let sum = 
        values
        |> Stream.ofArray
        |> Stream.map (fun x -> x * x)
        |> Stream.sum

F# is a functional first language, rather than a pure functional language like Haskell. If you do happen to use pure functions, you can stream your map and sum operations together, and avoid iterating over the array twice. The Nessos Streams library provides this, with a nice performance improvement as a result.

F# Fold - 75 milliseconds

    let sum = 
        values
        |> Array.fold (fun acc x -> acc + x*x) 0.0

When we use a single fold operation, we no longer iterate over the collection twice and allocate extra memory, and runtime improves even more. Since there is no overhead for streaming together multiple higher order functions as there is in the Streams library, it does slightly better.

F# Imperative - 38 milliseconds

    let mutable sum = 0.0
    for i = 0 to values.Length-1 do
            let x = values.[i]
            sum <- sum + x*x            

One of the nice things about F#, is that while it is a functional leaning language, very few barriers are put in your way if you want to go imperative for the sake of speed. Write a normal for loop, and you get the same performance as SSE vectorized C.

F# SIMD - 18ms

    let sum =
        values
        |> Array.SIMD.fold (fun acc v -> acc +v*v) (+) 0.0    

Now to get serious. First we use fold, so that we can combine the summing and squaring into a single pass. Then we use the SIMDArray extensions that I have been working on which let you take full advantage of SIMD with more idiomatic F#. Performance here is great, nearly as fast as C, but it took a lot of work to get here. At the moment there is no way to combine the lazy stream optimization with the SIMD ones. If you want to filter->map->reduce you will still be doing a lot of extra work. This should be possible in principle though. Please submit a PR!
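
To give a feel for the kind of loop that has to exist underneath such a fold, here is a hand written sketch using System.Numerics.Vector<T> directly from F#. This is not the SIMDArray implementation, and the name sumOfSquaresVector is made up. It assumes a runtime where System.Numerics.Vector<T> is available, and it handles arrays whose length is not a multiple of the vector width with a scalar tail loop.

```fsharp
open System.Numerics

/// Hypothetical hand rolled SIMD sum of squares.
let sumOfSquaresVector (values: float[]) =
    let width = Vector<float>.Count
    let mutable acc = Vector<float>.Zero
    let mutable i = 0
    // Process whole vectors while a full one still fits.
    while i <= values.Length - width do
        let v = Vector<float>(values, i)
        acc <- acc + v * v
        i <- i + width
    // Add up the lanes of the accumulator.
    let mutable sum = 0.0
    for lane = 0 to width - 1 do
        sum <- sum + acc.[lane]
    // Scalar tail for the leftover elements.
    while i < values.Length do
        sum <- sum + values.[i] * values.[i]
        i <- i + 1
    sum

let fiftyFive = sumOfSquaresVector [| 1.0; 2.0; 3.0; 4.0; 5.0 |]
// fiftyFive = 55.0
```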

Rust - 34ms

    let sum: f64 = values.iter()
                .map(|x| x * x)
                .sum();

Rust achieves impressive numbers with the most obvious approach. This is super cool. I feel that this behavior should be the goal for any language offering these kinds of higher order functions as part of the language or core library. Using a traditional for loop or a ‘for x in y’ style loop is also just as fast. It is also possible to use rust intrinsics to get the same speed as the AVX2 vectorized C code here, but to use those you have to write out the loop explicitly:

Rust SIMD - 17ms

    let mut sum = 0.0;
    unsafe {
        for v in values {
            let x : f64 = std::intrinsics::fmul_fast(*v,*v);
            sum = std::intrinsics::fadd_fast(sum,x); 
        }
    }
    sum

It would be nice if the rustc compiler had an option to just apply this globally, so you could use the higher order functions. Also, these features are marked as unstable, and likely to remain unstable forever. This might make it problematic to use this feature for any important production project. It would also be nice if the unsafe block was not required. Hopefully the Rust maintainers have a plan to make this better.

Javascript map reduce (node.js) 10,000ms

var sum = values.map(x => x*x).
                 reduce( (total,num,index,array) => total+num,0.0);

Javascript reduce (node.js) 800 and then 300 milliseconds

var sum = values.reduce( (total,num,index,array) => total+num*num,0.0)

It is common to see these higher order javascript functions suggested as the most elegant way to do this, but it is incredibly slow. Simplifying the combined map and reduce improves runtime by an order of magnitude to 800ms, though after 3 or 4 iterations the JIT does some magic and runtime drops to 300ms thereafter. This represents the first time I have seen any substantive JIT optimization happen during runtime in the wild!

Javascript foreach (node.js) 800 and then 300 milliseconds

    var sum = 0.0;
    values.forEach( (element,index,array) => sum += element*element  );

Slightly less elegant but also a popular idiom in javascript, this is faster than map and reduce, but is still amazingly slow. Again, after 3 or 4 iterations the JIT does some magic and it speeds up from around 800 to 300 milliseconds.

Javascript imperative (node.js) 37 milliseconds

    var sum = 0.0;
    for (var i = 0; i < values.length;i++){
        var x = values[i];
        sum += x*x;
    }

Finally, when we get down to a basic imperative for loop, javascript performs comparably to SSE vectorized C.

Java Streams Map Sum 138 milliseconds

    double sum = Arrays.stream(values).
                        map(x -> x*x).
                        sum();

Java Streams Reduce 34 milliseconds

    double sum = Arrays.stream(values).
                        reduce(0,(acc,x) -> acc+x*x);

Java 8 includes a very nice library called stream which provides higher order functions over collections in a lazy evaluated manner, similar to the F# Nessos streams library and Rust. Given that this is a lazy evaluated system, it is odd that there is such a performance difference between map then sum and a single reduction. The reduce function is compiling down to the equivalent of SSE vectorized C, but the map then sum is not even close. It turns out that the sum() method on DoubleStream:

may be implemented using compensated summation or other technique to reduce the error bound in the numerical sum compared to a simple summation of double values.

A nice feature, but not clearly communicated by the method name! If we tweak the java code to do normal summation the runtime remains as fast as SSE vectorized C, a nice accomplishment:

Java Streams Map Reduce 34 milliseconds

    double sum = Arrays.stream(values).
                        map(x -> x*x).
                        reduce(0,(acc,x) -> acc+x);
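
For readers who have not met compensated summation, the technique named in the sum() documentation quoted above, here is a minimal sketch of the Kahan variant next to a plain loop. It is written in JavaScript rather than Java purely to keep it short, it is not the JDK's implementation, and the function names are hypothetical.

```javascript
// Plain left-to-right summation of squares.
function naiveSumSquares(values) {
    let sum = 0.0;
    for (let i = 0; i < values.length; i++) {
        sum += values[i] * values[i];
    }
    return sum;
}

// Kahan (compensated) summation of squares.
// c carries the low-order bits that were lost by the previous addition.
function kahanSumSquares(values) {
    let sum = 0.0;
    let c = 0.0;
    for (let i = 0; i < values.length; i++) {
        const y = values[i] * values[i] - c; // apply the correction first
        const t = sum + y;                   // low-order bits of y may be lost here
        c = (t - sum) - y;                   // recover what was lost, negated
        sum = t;
    }
    return sum;
}

// Each small square (about 1e-16) is below half the spacing between doubles near 1.
// The plain loop returns exactly 1 here; the compensated loop returns a value slightly above 1.
const sample = [1, 1e-8, 1e-8, 1e-8, 1e-8];
console.log(naiveSumSquares(sample), kahanSumSquares(sample));
```

The compensated loop performs four floating point additions or subtractions per element where the plain loop performs one, which is at least consistent with sum() being slower than a simple reduce.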

There does not appear to be a way to get SIMD out of Java, either explicitly or via automatic vectorization by the Hotspot JVM. There are 3rd party libraries available that do it by calling C++ code. I do see some literature stating that the JVM can and does auto-vectorize, but I’m not seeing evidence of that in this case, or when I use a for loop, either.

Go for Range 37 milliseconds

    sum := 0.0
    for _,v := range values[:] {
        sum = sum +  v*v
    }

Go for loop 37 milliseconds

    sum := 0.0
    for i := 0; i < len(values); i++ {
        x := values[i]
        sum = sum +  x*x
    }

Go has good performance with both the usual imperative loop and its ‘range’ idiom, which is like a ‘foreach’ in other languages. Neither auto vectorization nor explicit SIMD support appears to be on the Go radar. There are no map/reduce/fold higher order functions in the standard library, so we can’t compare them. Go does a good thing here by not providing a slow path at all.

Conclusion

I have shown some performance pitfalls in various languages here. One should not read too much into this as an argument for general performance of these languages. Every language has some pitfalls where the preferred or easiest approaches to solving a problem can lead to performance pitfalls. In Java, for instance, everything is objects. Objects all allocate on the heap (unless the JIT does some work at runtime to determine it doesn’t need to go on the heap, but that isn’t a freebie). Since Java is also a garbage collected language, this can lead to performance pitfalls when you type the obvious code. With experience, you can learn about these pitfalls and do work to avoid them, just like you can avoid pitfalls of Linq in C#, by not using it, or the pitfalls of F# by using Stream or SIMD libraries instead of the core ones. But even then, you have to take extra care, and type extra code, or take on more dependencies to do that. This is partially purpose defeating, since high level languages are supposed to let you type less, and get things working faster.

What I would like to see is more of an attitude change among high level language designers and their communities. None of the issues above need to exist. Java could (and will, soon) provide value types (as C# does) to make it less painful to avoid GC pressure if you use lots of small, short lived constructs. Go could provide more SIMD support, either via a SIMD library or better auto vectorization. F# could provide efficient Streams as part of the core library like Java does. .NET could auto vectorize in the JIT and/or provide more complete coverage of SIMD instructions in the Vector library. We, the community, can help by providing libraries and submitting PRs to make the obvious code faster. Time and energy will be saved, batteries will last longer, users will be happier.

Benchmark Details

All benchmarks were run with what I believe to be the latest and greatest compilers available for Windows for each language. JIT warmup time is accounted for when applicable. If you identify cases where code or compiler/environment choices are suboptimal, please email me.

Environment

Host Process Environment Information:
BenchmarkDotNet=v0.9.8.0
OS=Microsoft Windows NT 6.2.9200.0
Processor=Intel(R) Core(TM) i7-4712HQ CPU 2.30GHz, ProcessorCount=8
Frequency=2240907 ticks, Resolution=446.2479 ns, Timer=TSC

F# / C# Runtime Details

CLR=MS.NET 4.0.30319.42000, Arch=64-bit RELEASE [RyuJIT]
GC=Concurrent Workstation
JitModules=clrjit-v4.6.1590.0
Type=SIMDBenchmark  Mode=Throughput  Platform=X64  
Jit=RyuJit  GarbageCollection=Concurrent Workstation  

C Details

Visual Studio 2015 Update 3, fast floating point, 64 bit, AVX2 instructions enabled, all speed optimizations on

Rust Details

v1.13 Nightly, --release -opt-level=3

Javascript/Node Details

v6.4.0 64bit NODE_ENV=production

Java Details

Oracle Java 64bit version 8 update 102

Go Details

Go 1.7

De-Cruft Visual Studio

Mon, 11 Jul 2016 19:17:27 +0000

Above is a screen shot of what Visual Studio looks like on most people’s desktops. There is a lot going on, and some people like it that way. They have a large monitor, they use all of these features, and they suffer no performance and stability problems. Some of us however, are really only interested in seeing the code, and find the rest of this to be a distraction that eats system resources and screen real estate. I will quickly explain how you can turn off any of these visual features that you do not actually want. This can free you from visual distractions, or open up screen real estate to make side by side code editing a more practical endeavour. It may even reduce system resource use and improve stability somewhat. (citation needed). These tips all assume you are using Visual Studio 2015, though some may work on older versions as well.

Remove any extensions you don’t use

Often times performance and stability issues with Visual Studio 2015 are due to extensions. Take a quick glance at your installed extensions by navigating to Tools->Extensions and Updates->Installed and see if there is anything there that you never actually use. If so, uninstall it. If you have used ReSharper for a long time, note that Visual Studio has gradually added many of the features that once required it. If you don’t need ReSharper, you can get huge improvements in responsiveness by uninstalling it.

Disable CodeLens

The CodeLens feature of Visual Studio can be quite useful. It can display various metadata about your code and its state within the context of your source control. But if you do not make use of it often, you can save a great deal of visual clutter, and perhaps improve the resource utilization and responsiveness of Visual Studio as well, by turning it off. You can disable it globally at Tools->Options->Text Editor->All Languages, or on a per-language basis if you prefer.

Solution Explorer and Output / Error Panes

These are very commonly used tools, but you can have your cake and eat it too with them. At the top of these panels you will see a thumb-tack icon. You can click that to toggle ‘Auto-Hide’. With ‘Auto-Hide’ enabled the panes will normally stay minified, but you can bring them up and put the focus in them with shortcut keys (CNTRL-ALT-L for solution explorer, CNTRL-ALT-O for Output, etc.) and then you can hop back to your code with the ESC key. While they are open, the focus will be in the pane and you can navigate them with the ARROW and ENTER keys, no need for the mouse. This is a great way to free up space, and maintain the usefulness of the solution explorer.

Tools->Options->Text Editor->All Languages->General will give you the option to turn off the Navigation bar, a thin bar with dropdowns that show the method you are currently in. Some people like this feature. If you never use it, free up the space and turn it off. You can turn it on/off on a per language basis as well. I also like to actually turn line numbers on here.

Hiding the code outlining graphics, if you find those useless, is a bit more tricky. For some languages like C++ and C#, you can turn the feature off from within the language specific options under Text Editor. For others, like Javascript, you have to turn it off by hand on each file with CNTRL-M CNTRL-P.

Reduce the Margins

By default there is a lot of horizontal space taken up by the Selection and Indicator margins. If you don’t make use of these, you can go to Tools->Options->Text Editor->General and unselect them both.

If you have been learning your keyboard shortcuts, the icons under the menu bar should be completely useless to you, and you can remove them by right clicking empty space in that area and deselecting any icon groups you don’t need. This can quickly free up vertical space. You can go even further, and hide the menu text as well with extensions like ‘Hide Main Menu’. You can still use the menu, as pressing the ALT key brings it back up. Just go to Tools->Extensions and Updates->Online and search for ‘Hide Main Menu’.

Full Screen

Tap ALT-SHIFT-ENTER to go into fullscreen mode. This frees up some space as the window borders go away, and it also makes the icons under the menu go away.

Tab Group Jumper

More of a productivity improvement than a de-crufting, but once you free up all this space, you may find you have room for 2 or 3 pages of code side by side. Unfortunately, Visual Studio provides no way to jump between tab groups without the mouse. The Tab Group Jumper extension adds this functionality. Go to Tools->Extensions and Updates->Online and search for ‘Tab Jumper’.

After De-Crufting

Now with space freed up, you have more room on your screen for code, side by side editing, or whatever else you desire.

Marginal Gains

Fri, 01 Jul 2016 19:17:27 +0000

In a former life I was heavily involved in bike racing, and became obsessed with the concept of “marginal gains”. It is the idea that there exist a multitude of choices you can make, each of which, in isolation, has little to no effect on your result, but in totality can be the difference between success and failure. It is a philosophy which requires a delicate balance. Too much time spent worrying about minutiae can distract from the business of actually training. But ignoring marginal gains completely means you will eventually lose to someone who did not.

So it is with being a software developer. Taking time to master your tools will save you time, and expand your abilities. But, at some point, you just need to shut up and code.

Become one with the command line

If you spend most of your time in Windows software development it is possible to get by and never really master various command line systems and tools. Taking the time to force yourself to learn these things can be extremely valuable and open up entirely new worlds. Being familiar with how to get things building from source in Linux, for example, can allow you to leverage and contribute to open source projects that might otherwise be unavailable to you. If your projects are deployed to cloud infrastructures such as Azure or AWS, being able to manage all of that from Bash or Powershell lets you get things done much, much faster than working through web GUIs. You will likely be able to easily automate a lot of your daily tasks with scripts. For instance, do you type “git add *, git commit -m “foo”, git push” 30 times a day? (or use your mouse and click through 3 menus in the GUI equivalent?) That is an easy fix for even a beginner at bash or batch file scripting. Being familiar with the command line also opens up options such as using faster or more flexible code editors, rather than being stuck in Visual Studio, Eclipse, etc. because you don’t understand how to build and run things without the IDE to help you. The possibilities here are too many to list. Take stock of the kinds of things you work on, and take time out of your day to learn Bash, or Powershell, or whatever command line skills may be relevant to your job and interests.

Master your code editors

Whatever you use to edit code, whether it be an IDE like Visual Studio or a text editor like VIM or Sublime, take some time to truly master it. Think carefully about what wastes your time as you use it. Do you spend lots of time navigating around code with the arrow keys? Learn what shortcuts are available to speed that up. Move to next token, move to next / previous matching brace, these are often features available with a hotkey in a good editor. If you want a feature that isn’t there, look into customizing the editor to add it. Occasionally take a day and make it a goal to do all your coding without touching the mouse. At first you will find hundreds of things you can’t accomplish without doing so, but gradually you will learn how to bring that down to zero, either by learning keyboard shortcuts, or tweaking your environment so you don’t need them. Here are some common examples:

Changing indentation on blocks

With Visual Studio you can do this by selecting and using TAB and SHIFT-TAB. You can even select rectangular regions with ALT-SHIFT-ARROWS.

Token Delete and Multi Line Editing

In Visual Studio you can delete entire tokens at a time with CNTRL-DEL or CNTRL-BACKSPACE. You can also navigate the cursor a token at a time with CNTRL-ARROW. Some may find it useful to also have a camelcase/pascalcase aware feature. The third gif here shows Multi Line editing in action, where you can use the ALT-SHIFT selection feature and make identical changes to many lines at once. Most code editors will have a feature like this, it can be very useful to learn it.

There are dozens of little tricks like this available in any decent code editor. Any time you find yourself having to bang a lot of keys or use the mouse to get things done, investigate whether your editor has a shortcut built in for the task, or whether you can easily add one. These will take practice to use quickly and without thinking about it, but when you build up a nice set of shortcuts, such that you are rarely touching the mouse or repeating keystrokes, you will get things done faster, and more pleasantly.

A few more freebies in Visual Studio:

F9 set/unset a breakpoint on the current line
F12 go to definition of the token the cursor is currently on
F5 to run, SHIFT-F5 to stop, the current default project
CNTRL-TAB to pop to the previously focused window
Hold CNTRL and press TAB to cycle through all previously open windows
CNTRL-K CNTRL-O in C++ files to hop between .h/.hpp and .cpp/.c files
CNTRL-ALT-L to pop to the solution explorer, ARROWS and ENTER to navigate/open

Expand either the depth, or breadth, of your skills

Take some time to Git Gud. Maybe you are committed to being deeply expert in one aspect of programming. If so, take time to deepen your understanding. Think about what language features or programming concepts have confused you in the past. Set time aside to master those things.

Alternatively, especially if you are younger, take time to learn and practice entirely new philosophies. Has your career to date been entirely in managed runtimes? Start a personal project in C, D, or Rust, and learn what programming without garbage collection, and with access to the bare metal, is like. Have you done nothing but Object Oriented Programming your whole life? Try some side projects in a functional language, and try it again in ANSI C. Learn for yourself what the pros and cons of these paradigms are for you. Do not believe the assertions you hear every day about different programming paradigms. Almost none of them are backed up with rigorous evidence. Take time to investigate the universes you are not familiar with. Consider contributing to an open source project, where you can learn from a different set of people than you deal with at your day job. You will likely pick up useful tips and techniques there that you can use elsewhere. Even if you end up sticking with what you already know, you will likely learn some things to make you better at that too.