Code Thoughts Thoughts on code https://jackmott.github.io// Tue, 14 Aug 2018 20:05:02 +0000 Tue, 14 Aug 2018 20:05:02 +0000 Jekyll v3.7.3 Does Your Code Leave a Trail of Slowness? <h1 id="the-trail-of-slowness">The Trail of Slowness</h1> <p>Often I implore people to make better performing coding decisions when the downsides are small or nonexistent. A common response to this is that you should only worry about performance when you have measured the code in question and found performance to be an issue. There are a couple of problems with that ethos, the first of which is that sometimes early decisions will be hard to reverse late in a project if performance turns out to be an issue. But there is a more insidious problem. Areas of your code for which performance is not important may be causing <em>other</em> code, or even <em>other programs</em>, to slow down. This trail of slowness left behind by uninformed performance decisions will not show up in any particular place in a profiler; it will just slow everything down a bit.</p> <h1 id="why-does-this-happen">Why Does This Happen</h1> <p>Modern CPU performance is severely limited by RAM latency. A request to get data from RAM can take over 100 CPU cycles to complete, and while your CPU waits, it just sits there, doing nothing useful. This problem is addressed by a series of caches in your CPU, smaller but much faster regions of memory. Whenever you request data from RAM, an entire cache line of contiguous memory is pulled into the CPU caches. Assuming that the next memory you request is in that contiguous segment of memory in the cache line, you will get it very quickly.</p> <p>This is why iterating over an array in order (contiguous memory) is much, much faster than iterating over a LinkedList, where each node is in a random(ish) location in the heap. Many of those requests from RAM while iterating over a LinkedList miss the cache and cause long CPU stalls.</p> <p>But it is worse than just being slow. 
Every time you request memory that isn’t in the cache already, you have to pull down a whole cache line and replace data that is already in the cache. Any code running that might have hoped to use that cached data won’t have it, and will now have to get it from RAM again. This is sometimes called “thrashing the cache”.</p> <h1 id="how-bad-can-it-be">How Bad Can It Be?</h1> <p>This can vary wildly based on the workloads involved. In the contrived example below, I do a fixed amount of work with a LinkedList, while at the same time I have other threads running doing as much work as possible for 5 seconds. Those threads are able to perform the ArraySumSquare function about 3.5 million times in those 5 seconds.</p> <p>When I alter the RunLinkedList function to do the same fixed amount of work on arrays instead, the original array loops are now able to perform ArraySumSquare about 4.4 million times.</p> <p>This could be similar to a situation where a server has a periodic computation it does, where taking a few seconds isn’t considered a problem at all. But that job is having a significant impact on users using the live system, increasing latency by ~20%. Or it could be similar to a code editor that is doing some parsing behind the scenes where the performance seems fine when you profile it, but it is slowing down UI response due to the cache thrashing.</p> <h1 id="what-to-do">What To Do</h1> <p>Use arrays or array-backed lists by default unless you have good reasons not to. Most languages have resizable array-backed structures that are as convenient to use as LinkedLists:</p> <ul> <li>Java - ArrayList</li> <li>C# - List&lt;T&gt;</li> <li>F# - ResizeArray (see FSharpX for higher order functions on these!)</li> <li>C++ - std::vector</li> <li>Rust - Vec</li> </ul> <p>Even if you are inserting or removing from the front or middle of your collection periodically, it is <a href="https://jackmott.github.io/programming/2016/08/20/when-bigo-foolsya.html">usually still faster overall</a>. 
You may also consider using a sorted array along with BinarySearch instead of a tree, when applicable. Think about how you lay out your data structures, avoid unnecessary pointer hops. Avoid virtual functions when they aren’t really necessary. Understand how branch prediction works so you can set your code up for success. <a href="https://www.akkadia.org/drepper/cpumemory.pdf">Take some time to learn how memory works</a>, and you can make better default decisions, and better designs early in your projects.</p> <h1 id="sample-code">Sample Code</h1> <div class="language-c# highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">using</span> <span class="nn">System</span><span class="p">;</span> <span class="k">using</span> <span class="nn">System.Collections.Generic</span><span class="p">;</span> <span class="k">using</span> <span class="nn">System.Diagnostics</span><span class="p">;</span> <span class="k">using</span> <span class="nn">System.Threading</span><span class="p">;</span> <span class="k">namespace</span> <span class="nn">ConsoleApplication1</span> <span class="p">{</span> <span class="k">class</span> <span class="nc">Program</span> <span class="p">{</span> <span class="k">static</span> <span class="k">void</span> <span class="nf">Main</span><span class="p">(</span><span class="kt">string</span><span class="p">[]</span> <span class="n">args</span><span class="p">)</span> <span class="p">{</span> <span class="n">Thread</span> <span class="p">[]</span> <span class="n">arrayThreads</span> <span class="p">=</span> <span class="k">new</span> <span class="n">Thread</span><span class="p">[</span><span class="m">8</span><span class="p">];</span> <span class="n">Thread</span> <span class="p">[]</span> <span class="n">listThreads</span> <span class="p">=</span> <span class="k">new</span> <span class="n">Thread</span><span class="p">[</span><span class="m">8</span><span class="p">];</span> <span class="k">for</span> <span 
class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span> <span class="n">i</span> <span class="p">&lt;</span> <span class="m">8</span><span class="p">;</span> <span class="n">i</span><span class="p">++)</span> <span class="p">{</span> <span class="n">arrayThreads</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="p">=</span> <span class="k">new</span> <span class="nf">Thread</span><span class="p">(</span><span class="n">RunArray</span><span class="p">);</span> <span class="n">listThreads</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="p">=</span> <span class="k">new</span> <span class="nf">Thread</span><span class="p">(</span><span class="n">RunLinkedList</span><span class="p">);</span> <span class="n">arrayThreads</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="nf">Start</span><span class="p">();</span> <span class="n">listThreads</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="nf">Start</span><span class="p">();</span> <span class="p">}</span> <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span> <span class="n">i</span> <span class="p">&lt;</span> <span class="m">8</span><span class="p">;</span> <span class="n">i</span><span class="p">++)</span> <span class="p">{</span> <span class="n">arrayThreads</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="nf">Join</span><span class="p">();</span> <span class="n">listThreads</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="nf">Join</span><span class="p">();</span> <span class="p">}</span> <span class="n">Console</span><span class="p">.</span><span 
class="nf">WriteLine</span><span class="p">(</span><span class="s">"Perf Critical Ops:"</span> <span class="p">+</span> <span class="n">totalOps</span><span class="p">);</span> <span class="n">Console</span><span class="p">.</span><span class="nf">ReadLine</span><span class="p">();</span> <span class="p">}</span> <span class="k">public</span> <span class="k">static</span> <span class="kt">int</span> <span class="n">totalOps</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span> <span class="k">public</span> <span class="k">const</span> <span class="kt">int</span> <span class="n">LEN</span> <span class="p">=</span> <span class="m">10000</span><span class="p">;</span> <span class="k">public</span> <span class="k">const</span> <span class="kt">int</span> <span class="n">TIME</span> <span class="p">=</span> <span class="m">5000</span><span class="p">;</span> <span class="k">public</span> <span class="k">static</span> <span class="k">void</span> <span class="nf">RunArray</span><span class="p">()</span> <span class="p">{</span> <span class="n">Stopwatch</span> <span class="n">sw</span> <span class="p">=</span> <span class="k">new</span> <span class="nf">Stopwatch</span><span class="p">();</span> <span class="n">sw</span><span class="p">.</span><span class="nf">Start</span><span class="p">();</span> <span class="kt">int</span><span class="p">[]</span> <span class="n">a</span> <span class="p">=</span> <span class="k">new</span> <span class="kt">int</span><span class="p">[</span><span class="n">LEN</span><span class="p">];</span> <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span> <span class="n">i</span> <span class="p">&lt;</span> <span class="n">a</span><span class="p">.</span><span class="n">Length</span><span class="p">;</span> <span class="n">i</span><span class="p">++)</span> <span class="p">{</span> <span 
class="n">a</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="p">=</span> <span class="m">2</span><span class="p">;</span> <span class="p">}</span> <span class="kt">long</span> <span class="n">sum</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span> <span class="kt">int</span> <span class="n">count</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span> <span class="k">while</span> <span class="p">(</span><span class="n">sw</span><span class="p">.</span><span class="n">ElapsedMilliseconds</span> <span class="p">&lt;</span> <span class="n">TIME</span><span class="p">)</span> <span class="p">{</span> <span class="n">sum</span> <span class="p">+=</span> <span class="nf">ArraySumSquare</span><span class="p">(</span><span class="n">a</span><span class="p">);</span> <span class="n">count</span><span class="p">++;</span> <span class="p">}</span> <span class="n">Interlocked</span><span class="p">.</span><span class="nf">Add</span><span class="p">(</span><span class="k">ref</span> <span class="n">totalOps</span><span class="p">,</span> <span class="n">count</span><span class="p">);</span> <span class="p">}</span> <span class="k">public</span> <span class="k">static</span> <span class="kt">int</span> <span class="nf">ArraySumSquare</span><span class="p">(</span><span class="kt">int</span><span class="p">[]</span> <span class="n">a</span><span class="p">)</span> <span class="p">{</span> <span class="kt">int</span> <span class="n">sum</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span> <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span> <span class="n">i</span> <span class="p">&lt;</span> <span class="n">a</span><span class="p">.</span><span class="n">Length</span><span class="p">;</span> <span class="n">i</span><span 
class="p">++)</span> <span class="p">{</span> <span class="n">sum</span> <span class="p">+=</span> <span class="n">a</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="p">*</span> <span class="n">a</span><span class="p">[</span><span class="n">i</span><span class="p">];</span> <span class="p">}</span> <span class="k">return</span> <span class="n">sum</span><span class="p">;</span> <span class="p">}</span> <span class="k">public</span> <span class="k">static</span> <span class="k">void</span> <span class="nf">RunLinkedList</span><span class="p">()</span> <span class="p">{</span> <span class="n">Stopwatch</span> <span class="n">sw</span> <span class="p">=</span> <span class="k">new</span> <span class="nf">Stopwatch</span><span class="p">();</span> <span class="n">sw</span><span class="p">.</span><span class="nf">Start</span><span class="p">();</span> <span class="n">LinkedList</span><span class="p">&lt;</span><span class="kt">int</span><span class="p">&gt;</span> <span class="n">l</span> <span class="p">=</span> <span class="k">new</span> <span class="n">LinkedList</span><span class="p">&lt;</span><span class="kt">int</span><span class="p">&gt;();</span> <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span> <span class="n">i</span> <span class="p">&lt;</span> <span class="n">LEN</span><span class="p">;</span> <span class="n">i</span><span class="p">++)</span> <span class="p">{</span> <span class="n">l</span><span class="p">.</span><span class="nf">AddLast</span><span class="p">(</span><span class="m">2</span><span class="p">);</span> <span class="p">}</span> <span class="kt">long</span> <span class="n">sum</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span> <span class="kt">int</span> <span class="n">count</span> <span class="p">=</span> <span class="m">0</span><span 
class="p">;</span> <span class="k">while</span> <span class="p">(</span><span class="n">count</span> <span class="p">&lt;</span> <span class="m">75000</span><span class="p">)</span> <span class="p">{</span> <span class="n">sum</span> <span class="p">+=</span> <span class="nf">LinkedListSumSquare</span><span class="p">(</span><span class="n">l</span><span class="p">);</span> <span class="n">count</span><span class="p">++;</span> <span class="p">}</span> <span class="p">}</span> <span class="k">public</span> <span class="k">static</span> <span class="kt">int</span> <span class="nf">LinkedListSumSquare</span><span class="p">(</span><span class="n">LinkedList</span><span class="p">&lt;</span><span class="kt">int</span><span class="p">&gt;</span> <span class="n">l</span><span class="p">)</span> <span class="p">{</span> <span class="kt">int</span> <span class="n">sum</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span> <span class="k">foreach</span> <span class="p">(</span><span class="kt">var</span> <span class="n">num</span> <span class="k">in</span> <span class="n">l</span><span class="p">)</span> <span class="p">{</span> <span class="n">sum</span> <span class="p">+=</span> <span class="n">num</span> <span class="p">*</span> <span class="n">num</span><span class="p">;</span> <span class="p">}</span> <span class="k">return</span> <span class="n">sum</span><span class="p">;</span> <span class="p">}</span> <span class="p">}</span> <span class="p">}</span> </code></pre></div></div> Mon, 27 Feb 2017 19:17:27 +0000 https://jackmott.github.io//2017/02/27/trail-of-slow.html https://jackmott.github.io//2017/02/27/trail-of-slow.html PVS-Studio C# <p><a href="http://www.viva64.com/en/pvs-studio/">PVS-Studio</a> is a popular static analysis tool in the C++ world, and plenty of articles have been written about the kinds of bugs it can find in C++ projects, such as this entertaining one about the <a 
href="https://www.unrealengine.com/blog/how-pvs-studio-team-improved-unreal-engines-code">Unreal Engine</a>. About a year ago they added C# support, and have steadily been adding more C# analysis features since. Today I grabbed version 6.10 and ran it on my code base at work, which is a fairly large ASP/MVC web application. Here are some of the things it found:</p> <h4 id="the-usergroup-object-was-used-before-it-was-verified-against-null">The ‘user.Group’ object was used before it was verified against null</h4> <div class="language-c# highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="kt">var</span> <span class="n">user</span> <span class="p">=</span> <span class="n">_entityRepository</span><span class="p">.</span><span class="nf">GetOnlineUserByUsername</span><span class="p">(</span><span class="n">username</span><span class="p">);</span> <span class="kt">string</span> <span class="n">nsId</span> <span class="p">=</span> <span class="n">user</span><span class="p">.</span><span class="n">Group</span><span class="p">.</span><span class="n">NetSuiteInternalId</span><span class="p">;</span> </code></pre></div></div> <p>One of the nicer features of PVS-Studio is that it can identify cases like this, where a null reference exception is possible. 
By identifying these cases and handling them, you can eliminate a common class of error, or at least deal with it more gracefully.</p> <h4 id="expression-resultsucces-is-always-true">Expression ‘result.Success’ is always true</h4> <div class="language-c# highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">if</span><span class="p">(!</span><span class="n">result</span><span class="p">.</span><span class="n">Success</span><span class="p">)</span> <span class="k">return</span> <span class="nf">Json</span><span class="p">(</span><span class="n">result</span><span class="p">);</span> <span class="c1">//...</span> <span class="k">if</span> <span class="p">(</span><span class="n">result</span><span class="p">.</span><span class="n">Success</span><span class="p">)</span> <span class="p">{</span> <span class="c1">//... </span> <span class="p">}</span> </code></pre></div></div> <p>This and a couple of other similar examples were identified. This warning could point to a serious logic error in the code. 
At the very least it identifies unnecessary checks cluttering up the code and slowing down execution.</p> <h4 id="it-is-odd-that-the-body-of-canreadtype-function-is-fully-equivalent-to-the-body-of-canwritetype-function">It is odd that the body of ‘CanReadType’ function is fully equivalent to the body of ‘CanWriteType’ function</h4> <div class="language-c# highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">public</span> <span class="k">override</span> <span class="kt">bool</span> <span class="nf">CanReadType</span><span class="p">(</span><span class="n">Type</span> <span class="n">type</span><span class="p">)</span> <span class="p">{</span> <span class="k">return</span> <span class="nf">SupportedType</span><span class="p">(</span><span class="n">type</span><span class="p">);</span> <span class="p">}</span> <span class="k">public</span> <span class="k">override</span> <span class="kt">bool</span> <span class="nf">CanWriteType</span><span class="p">(</span><span class="n">Type</span> <span class="n">type</span><span class="p">)</span> <span class="p">{</span> <span class="k">return</span> <span class="nf">SupportedType</span><span class="p">(</span><span class="n">type</span><span class="p">);</span> <span class="p">}</span> </code></pre></div></div> <p>This turned out to be correct for us, since we needed to override both methods. This class of message can sometimes identify code that can be combined to shrink your code base down, or may identify a copy-paste error.</p> <h4 id="an-odd-precise-comparison-transactiontaxrate--0-consider-using-a-comparison-with-a-defined-precision-mathabsa-b--epsilon">An odd precise comparison: transaction.TaxRate == 0. 
Consider using a comparison with a defined precision: Math.Abs(A-B) &lt; Epsilon</h4> <div class="language-c# highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">if</span><span class="p">(</span><span class="n">transaction</span><span class="p">.</span><span class="n">TaxRate</span> <span class="p">==</span> <span class="m">0</span><span class="p">)</span> </code></pre></div></div> <p>Any comparison of a floating point value to an exact value will be flagged as a low-risk problem.</p> <h4 id="a-part-of-conditional-expression-is-always-true-if-it-is-evaluated-billingaddresssameasshipping--on">A part of conditional expression is always true if it is evaluated: billingAddressSameAsShipping != “on”</h4> <div class="language-c# highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span> <span class="p">(</span><span class="kt">string</span><span class="p">.</span><span class="nf">IsNullOrEmpty</span><span class="p">(</span><span class="n">CFM</span><span class="p">.</span><span class="n">BillingAddress</span><span class="p">.</span><span class="n">Id</span><span class="p">)</span> <span class="p">&amp;&amp;</span> <span class="kt">string</span><span class="p">.</span><span class="nf">IsNullOrEmpty</span><span class="p">(</span><span class="n">billingAddressSameAsShipping</span><span class="p">)</span> <span class="p">&amp;&amp;</span> <span class="n">billingAddressSameAsShipping</span> <span class="p">!=</span> <span class="s">"on"</span><span class="p">)</span> </code></pre></div></div> <p>This is another common mistake, which often arises when an if statement is later modified with an extra check. In this case, the check for != “on” is now unnecessary, because a null or empty string can never equal “on”. 
But this could expose logic mistakes as well.</p> <h4 id="the-url-variable-is-assigned-to-itself">The ‘url’ variable is assigned to itself</h4> <div class="language-c# highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">url</span> <span class="p">=</span> <span class="n">url</span> <span class="p">=</span> <span class="s">"user/"</span> <span class="p">+</span> <span class="n">userId</span> <span class="p">+</span> <span class="s">"/attorney/"</span> <span class="p">+</span> <span class="n">id</span><span class="p">;</span> </code></pre></div></div> <p>A copy/paste error that likely gets compiled away, but still clutters up the code.</p> <h4 id="idisposable-object-servererror-is-not-disposed-before-method-returns">IDisposable object ‘serverError’ is not disposed before method returns</h4> <div class="language-c# highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">var</span> <span class="n">serverError</span> <span class="p">=</span> <span class="k">new</span> <span class="nf">HttpResponseMessage</span><span class="p">(</span><span class="n">HttpStatusCode</span><span class="p">.</span><span class="n">InternalServerError</span><span class="p">);</span> <span class="k">return</span> <span class="n">Request</span><span class="p">.</span><span class="nf">CreateResponse</span><span class="p">(</span><span class="n">HttpStatusCode</span><span class="p">.</span><span class="n">InternalServerError</span><span class="p">);</span> </code></pre></div></div> <p>PVS-Studio will identify a few different problems with the IDisposable interface, including this one, classes that implement the Dispose method but not the IDisposable interface, and classes which have IDisposable members but don’t implement IDisposable themselves. 
Proper handling of these issues can reduce memory use and GC pressure.</p> <h4 id="the-datetime-constructor-could-receive-the-0-value-while-positive-value-is-expected-inspect-the-first-argument">The ‘DateTime’ constructor could receive the ‘0’ value while positive value is expected. Inspect the first argument</h4> <div class="language-c# highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">DateTime</span> <span class="n">now</span> <span class="p">=</span> <span class="n">DateTime</span><span class="p">.</span><span class="n">Now</span><span class="p">;</span> <span class="k">if</span> <span class="p">(</span><span class="n">year</span> <span class="p">&lt;</span> <span class="m">0</span> <span class="p">||</span> <span class="n">year</span> <span class="p">&gt;</span> <span class="n">now</span><span class="p">.</span><span class="n">Year</span> <span class="p">||</span> <span class="n">month</span> <span class="p">&lt;=</span> <span class="m">0</span> <span class="p">||</span> <span class="n">month</span> <span class="p">&gt;</span> <span class="m">12</span><span class="p">)</span> <span class="p">{</span> <span class="k">throw</span> <span class="k">new</span> <span class="nf">Exception</span><span class="p">(</span><span class="s">"The input query time is not valid"</span><span class="p">);</span> <span class="p">}</span> <span class="n">DateTime</span> <span class="n">StartOfMonth</span> <span class="p">=</span> <span class="k">new</span> <span class="nf">DateTime</span><span class="p">(</span><span class="n">year</span><span class="p">,</span> <span class="n">month</span><span class="p">,</span> <span class="m">1</span><span class="p">);</span> </code></pre></div></div> <p>PVS-Studio has identified that our check for “year &lt; 0” is not sufficient to guarantee correct input to the DateTime constructor, which does not accept a year of 0. 
It should be year &lt;= 0.</p> <h4 id="the-dateneeded-variable-is-assigned-values-twice-successively-perhaps-this-is-a-mistake">The ‘DateNeeded’ variable is assigned values twice successively. Perhaps this is a mistake</h4> <div class="language-c# highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span> <span class="p">(</span><span class="n">detail</span><span class="p">)</span> <span class="p">{</span> <span class="k">while</span> <span class="p">((</span><span class="n">date</span><span class="p">.</span><span class="nf">ToString</span><span class="p">(</span><span class="s">"ddd"</span><span class="p">)</span> <span class="p">!=</span> <span class="n">day</span><span class="p">))</span> <span class="p">{</span> <span class="n">date</span> <span class="p">=</span> <span class="n">date</span><span class="p">.</span><span class="nf">AddDays</span><span class="p">(</span><span class="m">1</span><span class="p">);</span> <span class="p">}</span> <span class="c1">//if the date is a holiday, add 1 </span> <span class="k">if</span> <span class="p">(</span><span class="n">santa</span><span class="p">.</span><span class="nf">CheckDate</span><span class="p">(</span><span class="n">DateNeeded</span><span class="p">,</span> <span class="n">type</span><span class="p">))</span> <span class="n">DateNeeded</span> <span class="p">=</span> <span class="n">date</span><span class="p">.</span><span class="nf">AddDays</span><span class="p">(</span><span class="m">1</span><span class="p">);</span> <span class="p">}</span> <span class="c1">//fixes time </span> <span class="n">date</span> <span class="p">=</span> <span class="n">date</span><span class="p">.</span><span class="nf">AddHours</span><span class="p">(</span><span class="n">time</span> <span class="p">-</span> <span class="n">date</span><span class="p">.</span><span class="n">Hour</span><span class="p">);</span> <span class="n">date</span> <span class="p">=</span> <span 
class="n">date</span><span class="p">.</span><span class="nf">AddMinutes</span><span class="p">(-</span><span class="n">date</span><span class="p">.</span><span class="n">Minute</span><span class="p">);</span> <span class="n">DateNeeded</span> <span class="p">=</span> <span class="n">date</span><span class="p">;</span> </code></pre></div></div> <p>PVS-Studio has identified that the assignment of DateNeeded = date.AddDays(1) is nonsensical, because it is immediately overwritten before being used.</p> <h4 id="the-return-value-of-function-insert-is-required-to-be-utilized">The return value of function ‘Insert’ is required to be utilized.</h4> <div class="language-c# highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">userString</span><span class="p">.</span><span class="nf">Insert</span><span class="p">(</span><span class="m">0</span><span class="p">,</span> <span class="n">suffixStr</span><span class="p">);</span> </code></pre></div></div> <p>This is a very common mistake. C# strings are immutable, and so the pattern for functions like Insert, Substring, etc. is to return the new string. You can easily forget this, and the operation you were trying to do on the string just doesn’t happen at all.</p> <h2 id="quick-review">Quick Review</h2> <p>This doesn’t represent all of the C# capabilities that PVS-Studio has; these are just the issues it found in our project. It integrates nicely with Visual Studio (see screenshot below), but you can use it standalone as well. They have a nice evaluation version that lets you try it out for quite a long time. I find the errors it finds to be much more relevant than ReSharper’s analyses, which are numerous but seem to mostly be stylistic. 
There do not appear to be any optimization tips for C# yet, as there are for C++ code, but perhaps that is coming in the future.</p> <p><img src="/images/pvs.png" alt="pvs" title="PVS Visual Studio" /></p> Tue, 01 Nov 2016 19:17:27 +0000 https://jackmott.github.io//programming/2016/11/01/PVS-Studio.html https://jackmott.github.io//programming/2016/11/01/PVS-Studio.html programming Language Helper <p>One of my many half-finished side projects is a <a href="http://jackmott.github.io/dungeonbuilder/">text adventure engine</a>, where players can play Zork-like games, or make their own game from within the engine as well. One of the tricky problems I ran into was handling certain English grammar issues in the engine, in a way that was not extremely annoying for users. For instance, you might want to indicate in the engine editor that a room has 2 gold coins and a dagger in it. When someone plays the game you could do the usual gamedev thing and present the room something like this:</p> <div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>You enter a scary dungeon. You see: 2 gold coins 1 dagger &gt; Take 1 coin You take 1 coin. You see: 1 gold coin 1 dagger </code></pre></div></div> <p>It is easy to write code to do that, but it breaks the fourth wall, and no longer reads like a story. What if instead you wanted it to look like so:</p> <div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>You enter a scary dungeon. There are two gold coins and a dagger here. &gt; Take 1 coin You take a coin, and now a gold coin and a dagger remain </code></pre></div></div> <p>This is harder code to write, especially when you support an editor where users might enter any words they want as items. For starters, you have to identify whether the indefinite article for a given word should be “a” or “an”, and there are no simple rules you can follow for that. 
You need to know the plural forms of words, which is not as simple as just adding an s.</p> <h2 id="the-languagehelper-library">The LanguageHelper library</h2> <p>I’ve put together a library that helps make some of these things easier, and might be useful for various text-based RPG applications. Maybe even useful for non-text-based ones now that text-to-speech is starting to sound convincing.</p> <p>What I have so far:</p> <ul> <li>Query any word for its plural form</li> <li>Query any word for the correct indefinite article</li> <li>Query any verb for past tense, progressive tense, and past perfect tense</li> <li>Turn an integer into the words that represent the integer</li> <li>A json dictionary format so you can easily add your own words, or cull the dictionary for performance/size reasons</li> </ul> <p>Things I would like to add:</p> <ul> <li>Query a word for synonyms and antonyms</li> <li>More complete coverage of verb conjugation</li> <li>Pronunciation data?</li> <li>More languages?</li> </ul> <p>The library is currently in .NET 2.0, so it can be consumed by any C# or F# code, including <a href="https://unity3d.com/">Unity3D</a>, but if people think this sounds useful I would create a C++ version as well.</p> <h2 id="useful">Useful?</h2> <p>Does this sound useful? Does this already exist? What other features would you need? Let me know!</p> <h2 id="sample-use-case">Sample Use Case</h2> <p>Following is a simple use case example. Notice there are some subtleties that the library handles well. “steel ingot” is a two-word item. 
The library correctly pulls out the indefinite article for the leading word “steel” but applies the plural form only to the trailing word.</p> <div class="language-ocaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="c">(* Some imaginary blacksmith entity structure *)</span> <span class="k">let</span> <span class="n">item</span> <span class="o">=</span> <span class="s2">"steel ingot"</span> <span class="k">let</span> <span class="n">qty</span> <span class="o">=</span> <span class="mi">2</span> <span class="k">let</span> <span class="n">currentAction</span> <span class="o">=</span> <span class="s2">"forge"</span> <span class="k">let</span> <span class="n">pastAction</span> <span class="o">=</span> <span class="s2">"sleep"</span> <span class="c">(* Player asks what the Smith has for sale *)</span> <span class="k">let</span> <span class="n">response</span> <span class="o">=</span> <span class="s2">"I have "</span> <span class="o">+</span> <span class="n">wordBank</span><span class="o">.</span><span class="nc">QueryNounQty</span><span class="p">(</span><span class="n">item</span><span class="o">,</span><span class="n">qty</span><span class="p">)</span> <span class="n">printf</span> <span class="s2">"%A</span><span class="se">\n</span><span class="s2">"</span> <span class="n">response</span> <span class="c">(* Output: I have two steel ingots *)</span> <span class="c">(* Smith only has 1 *)</span> <span class="k">let</span> <span class="n">qty</span> <span class="o">=</span> <span class="mi">1</span> <span class="k">let</span> <span class="n">response</span> <span class="o">=</span> <span class="s2">"I have "</span> <span class="o">+</span> <span class="n">wordBank</span><span class="o">.</span><span class="nc">QueryNounQty</span><span class="p">(</span><span class="n">item</span><span class="o">,</span><span class="n">qty</span><span class="p">)</span> <span class="n">printf</span> <span class="s2">"%A</span><span
class="se">\n</span><span class="s2">"</span> <span class="n">response</span> <span class="c">(* Output: I have a steel ingot *)</span> <span class="c">(* Smith has none left *)</span> <span class="k">let</span> <span class="n">qty</span> <span class="o">=</span> <span class="mi">0</span> <span class="k">let</span> <span class="n">response</span> <span class="o">=</span> <span class="s2">"I have "</span> <span class="o">+</span> <span class="n">wordBank</span><span class="o">.</span><span class="nc">QueryNounQty</span><span class="p">(</span><span class="n">item</span><span class="o">,</span><span class="n">qty</span><span class="p">)</span> <span class="n">printf</span> <span class="s2">"%A</span><span class="se">\n</span><span class="s2">"</span> <span class="n">response</span> <span class="c">(* Output: I have no steel ingots *)</span> <span class="c">(* What are you doing? *)</span> <span class="k">let</span> <span class="n">response</span> <span class="o">=</span> <span class="s2">"I am "</span> <span class="o">+</span> <span class="n">wordBank</span><span class="o">.</span><span class="nc">QueryVerbPresent</span><span class="p">(</span><span class="n">currentAction</span><span class="p">)</span> <span class="n">printf</span> <span class="s2">"%A</span><span class="se">\n</span><span class="s2">"</span> <span class="n">response</span> <span class="c">(* Output: I am forging *)</span> <span class="c">(* What did you do earlier? 
*)</span> <span class="k">let</span> <span class="n">response</span> <span class="o">=</span> <span class="s2">"I "</span> <span class="o">+</span> <span class="n">wordBank</span><span class="o">.</span><span class="nc">QueryVerbPast</span><span class="p">(</span><span class="n">pastAction</span><span class="p">)</span> <span class="n">printf</span> <span class="s2">"%A</span><span class="se">\n</span><span class="s2">"</span> <span class="n">response</span> <span class="c">(* Output: I slept *)</span> </code></pre></div></div> <h2 id="some-sample-queries">Some sample queries</h2> <p><img src="/images/demo.gif" alt="DemoGif" title="Demo Gif" /></p> Thu, 29 Sep 2016 19:17:27 +0000 https://jackmott.github.io//programming/2016/09/29/language-helper.html https://jackmott.github.io//programming/2016/09/29/language-helper.html programming Taking out the garbage <p>One of the never-ending arguments about languages is the pros and cons of garbage collection. Some people hate it because they think it is slow, others insist it is fast, some hate that it takes away control from them, others love it for that same reason. I’m going to explore this a bit and show some pros and cons that arise, and how you can deal with them in C#, Java, and C++. I will be creating a basic framework for a game in the style of <a href="https://minecraft.net/en/">Minecraft</a>. Don’t get too excited, there won’t be any rendering or anything playable. Also, please don’t take any of these experiments to represent evidence of innate performance qualities of any of these languages. In all three cases, I am aware of ways to optimize the code further; this is just meant to illustrate the relative costs of allocation and how you can start reducing those costs in each language.
If I get emails about fairness I’m going to refer you to this paragraph.</p> <h2 id="the-naive-approach-in-c-sharp">The Naive Approach, in C Sharp</h2> <p><a href="https://gist.github.com/jackmott/6d0ba936b24595402b49b6e76137c788">Link To Gist</a></p> <p>Above is a link to how many developers might naively begin to implement a game like Minecraft. (Note that experienced AAA game devs’ eyes would bleed at this.) It is 3D, so of course you have the obligatory Vector class, with obligatory operator overloading so you can do simple operations on those vectors with very obvious code and very little typing. The game world consists of Chunks, which are loaded as you approach close enough to them, and unloaded as you get too far away. Each Chunk has a number of Entities, which move around at various speeds each game tick. The player moves forward and every few ticks she passes into a new Chunk, which causes 1 Chunk to be loaded and another to be unloaded. Each Chunk also has a number of Blocks. Everything in the game (Chunks, Blocks, Entities) has a position represented by a Vector class. Each tick of the game, as the player moves forward a little bit, the chunks are iterated over and told to update their entities, then checked to see if they have gone out of range, and removed if so. If a chunk is removed, a new one is added to replace it.</p> <p>Simple enough, and the approach above would and does work fine for many games. It isn’t completely dumb: it uses an array backed List to keep track of things, because arrays are fast, and it pre-allocates them to the proper size when it can to avoid wasting memory and CPU on growing the array. Modern computers are fast, so this shouldn’t be a problem!</p> <p>But the problem is that while computers are fast, so too have our expectations grown. A screen used to have 64,000 pixels max, now they have 2 million at a minimum.
A game world that you could explore for hours used to be impressive, but now players expect endless worlds larger than planets, with detail down to blades of grass. And all of that has to happen at 90FPS on two monitors at once because VR! So, while our game is simple in principle, we load up and move around a <em>lot</em> of simple things: 65k blocks per chunk, 100 chunks at a time, plus 1,000 entities per chunk, all moving around every game tick.</p> <h3 id="naive-approach">Naive Approach</h3> <p>You can see in the code we have a rudimentary frame rate lock, at 60fps, and on a modern Core i7 CPU we aren’t hitting that frame rate ever! Some other troubling stats (collected with Perfmon) are clear:</p> <ul> <li>80% of CPU time spent in garbage collection!!!</li> <li>600 MB/s allocations</li> <li>5.8 seconds to load the world.</li> <li>31.4 ms per tick</li> </ul> <p>I’m not even rendering yet! Or playing sounds, or doing networking. This is the kind of performance that causes people to say “Garbage collection is terrible and slow!”, which causes people to respond “No you just aren’t using it right!”, which then leads to “If I have to think about memory management anyway, what is the point of garbage collection!” and so on.</p> <p>The root problem here, as is often the case, is too many allocations, which would be a problem even if there wasn’t any garbage collector, though perhaps not quite as bad. I am casually creating new Vectors all the time for just a short while and then tossing them away. With blocks I am creating longer lived objects and then regularly tossing them to the garbage collector as well. There are many things I can do to improve this, but C# has one feature which is an ‘easy fix’, and that is structs, which are a value type. They do not each get their own heap allocation, and they are passed by value. You can’t just turn all of your classes into structs, as larger classes being copied around by value would be wasteful, but small ones you can.
In this case, Vector is a perfect candidate, at only 12 bytes.</p> <h3 id="change-class-vector-to-struct-vector">Change class Vector to struct Vector</h3> <p>Just six characters and look at the difference:</p> <ul> <li>45% of CPU time spent in garbage collection</li> <li>200 MB/s allocations</li> <li>4.4 seconds to load the world.</li> <li>4.1ms per tick</li> </ul> <p>Suddenly things have gone from a hopeless situation where there is negative time for the frame rate budget for rendering, to having 11 milliseconds and 50% of the CPU to spare. But the struct is just a partial fix. Next I will refactor the code a bit for even better performance.</p> <h3 id="refactor">Refactor</h3> <p><a href="https://gist.github.com/jackmott/fa4ca37bd372c1fe99f514bcf92519a0">Link to Gist</a></p> <p>After a more serious refactoring:</p> <ul> <li>0.1% of CPU time spent in garbage collection</li> <li>1 MB/s allocations</li> <li>25ms to load the world</li> <li>0.97ms per tick</li> </ul> <p>A huge difference! Very little time is spent on GC now, world load times are now instant and game ticks now process in under a millisecond. At this point almost nothing happens in a game tick except updating positions of entities, and occasional unloading of one chunk to be replaced by another.</p> <h3 id="what-changed">What Changed?</h3> <p>Lots of little things. Using arrays instead of List when it doesn’t cause any extra work saves a tiny bit of overhead. Structuring loops just right in some places lets the JIT elide some array bounds checks. Not using <em>foreach</em> on Lists reduces GC pressure and improves runtime, and there are other minor tweaks. But the main thing was rethinking how the data is organized. Previously, each chunk had 65k Block objects. By thinking about the data, one can figure out how many possible block types there are. They will likely have some bound. In this case 256 was chosen to replicate Minecraft.
You could easily bump that up to a short or int and still realize the bulk of this improvement. So instead of each chunk allocating and storing a complete Block object 65k times, it just stores an index into a global array of Blocks. This is similar to how Minecraft actually does things. This optimization only makes sense if blocks are static things, most of the time, as they are in Minecraft. You can break blocks, and place blocks, but they rarely have state associated with them that changes. This trick will not work with entities as currently designed, as they are moving around, their health is changing, and so on.</p> <p>This sort of thing is a very basic example of <a href="https://dataorientedprogramming.wordpress.com/tag/mike-acton/">Data Oriented Programming</a>. Think a little bit more about what is actually happening with your data in memory, and less about what an idiomatic OOP design should be. Note that had you proceeded with the original design further into the development cycle, refactoring to make these changes could end up being very painful. Now that memory isn’t being shuffled around like mad, there is plenty of CPU available to replace the mockup code with some real ‘AI’.</p> <p>One could go further with this, for instance by pulling the Vector class out and replacing it with an array of positions, or even separate arrays of x, y, and z values. Depending on how the data is accessed, this could have big speed benefits due to cache locality and allow you to utilize SIMD instructions.
But that sort of madness is beyond the scope of this blog post.</p> <h2 id="the-naive-approach-java">The Naive Approach, Java</h2> <p><a href="https://gist.github.com/jackmott/ddc406ecaf1cd9b86bb5a3dc2581cf28">Link to Gist</a></p> <p>Recreating the same naive program in Java, I get the following stats (collected with Mission Control):</p> <ul> <li>17% of CPU time spent in garbage collection</li> <li>156 MB/s allocations</li> <li>4.1s to load the world</li> <li>4.9ms per tick</li> </ul> <p>Java has no value types yet, so we can’t apply the trick of making Vector a struct, but the memory use and GC time are already much better than the .NET case where we made Vector a struct. Part of the reason for this is that the JVM does escape analysis, so some of the wasteful Vector allocations that are only being used within a function can be allocated on the stack automatically. Interestingly, escape analysis is a feature coming soon to .NET, and value types are coming soon to Java.</p> <h3 id="java-refactored">Java refactored</h3> <p><a href="https://gist.github.com/jackmott/21d8765608767f972b0f8b69a344d4cf">Link to Gist</a></p> <p>But what if I refactor to avoid the wasteful allocations in the first place?</p> <ul> <li>0.04% of CPU time spent in garbage collection</li> <li>1.3 MB/s allocations</li> <li>50ms to load the world</li> <li>1.18ms per tick</li> </ul> <p>The stats are now very much in line with the refactored .NET code. While slightly worse, don’t read too much into that; I’m not as experienced at Java and have probably made more obvious small mistakes. What should be noted here is that avoiding wasteful allocations is important, no matter the platform. One obvious way to further improve the Java code would be to eliminate the Vector class entirely, and just use float x,y,z in place everywhere we use it. This is a bit painful, but gets rid of the reliance on escape analysis and saves some object overhead.
This would be equivalent to converting the class to a value type, if/when Java has those. Another option is to use a pool of Vector objects, which you reuse.</p> <h2 id="c-extremely-naively">C++ Extremely naively</h2> <p>The first experiment I ran with C++ was to strictly copy the behavior of C# / Java, and create the same objects, on the heap, every time. This was very unnatural, as creating a <em>new Vector</em> and then 2 lines of code later calling <em>delete</em> on it kind of alerts you to the absurdity of the situation. But for completeness I wrote that code and:</p> <ul> <li>0% of CPU time spent in garbage collection (but lots spent allocating!)</li> <li>1 second to load the world</li> <li>17.39ms per tick</li> </ul> <p>While the game world loads a lot faster, the per tick performance is still really bad. Even worse than the naive Java implementation, probably because there is no escape analysis saving us from allocating Vectors all over the place. This code is actually kind of absurdly naive though, it takes extra typing annoyance to make code this bad and it is pretty unlikely even someone not keen on performance would do this. However this is how I was taught to do things with C++ in school, so you never know.</p> <h3 id="minor-refactor---no-more-heap-allocating-vectors">Minor refactor - no more heap allocating Vectors</h3> <p><a href="https://gist.github.com/jackmott/a14abc480429a0aa494275b8e30e3511">Link To Gist</a></p> <p>This is vaguely equivalent to making the Vector class a struct in C#. I’m also passing Vector by value, and never allocating it on the heap. 
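</p> <p>To make the contrast concrete, here is a minimal sketch of the two styles. It is not taken from the linked gists: <code class="highlighter-rouge">move_naive</code> and <code class="highlighter-rouge">move_by_value</code> are hypothetical names made up for illustration, and the Vector is assumed to be nothing more than three floats.</p>

```cpp
// Illustrative 12-byte vector: three floats, like the Vector discussed in the post.
struct Vector {
    float x;
    float y;
    float z;
};

static_assert(sizeof(Vector) == 3 * sizeof(float),
              "Vector should carry no hidden per-object overhead");

// Operators take and return Vector by value, so temporaries never touch the heap.
inline Vector operator+(Vector a, Vector b) {
    return Vector{a.x + b.x, a.y + b.y, a.z + b.z};
}

inline Vector operator*(Vector v, float s) {
    return Vector{v.x * s, v.y * s, v.z * s};
}

// "Extremely naive" style: every intermediate Vector is its own heap
// allocation, and each new has to be paired with a delete a few lines later.
inline Vector move_naive(const Vector& position, const Vector& velocity, float dt) {
    Vector* step = new Vector{velocity.x * dt, velocity.y * dt, velocity.z * dt};
    Vector* moved = new Vector{position.x + step->x,
                               position.y + step->y,
                               position.z + step->z};
    Vector result = *moved;
    delete moved;
    delete step;
    return result;
}

// Value style: the same arithmetic with no new/delete and nothing to clean up.
inline Vector move_by_value(Vector position, Vector velocity, float dt) {
    return position + velocity * dt;
}
```

<p>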
The code gets smaller and simpler, and I don’t have to worry about memory management as much.</p> <ul> <li>0% of CPU time spent in garbage collection (but some spent allocating)</li> <li>737ms to load the world</li> <li>1.4ms per tick</li> </ul> <h3 id="major-refactor">Major refactor</h3> <p><a href="https://gist.github.com/jackmott/f2fb4d967003f0ca18494ca2cb1e8fe0">Link To Gist</a></p> <p><em>C++ implementation improvements courtesy of <a href="https://gist.github.com/jcelerier">Jean-Michaël Celerier</a></em></p> <p>Applying the same tricks as we did in the other languages, so that we aren’t allocating so many blocks, and a few other tricks that C++ gives us the flexibility to do:</p> <ul> <li>0% of CPU time spent in garbage collection (a bit spent allocating)</li> <li>8ms to load the world</li> <li>0.55ms per tick</li> </ul> <h2 id="rust-naively">Rust naively</h2> <p><a href="https://gist.github.com/jackmott/7a235472d74f9a2bbcfc1d6cae5ad7f4">Link To Gist</a></p> <p><em>This implementation contributed kindly by <a href="https://github.com/Maplicant">Maplicant</a></em></p> <p>This was implemented by a newcomer to Rust who attempted a translation of the naive C# implementation, and performance is quite good! It is the fastest of all the naive ones.</p> <ul> <li>0% of CPU time spent in garbage collection (but lots spent allocating!)</li> <li>1.2 seconds to load the world</li> <li>1.7ms per tick</li> </ul> <h2 id="rust-refactored">Rust Refactored</h2> <p><a href="https://gist.github.com/jackmott/c6a37ba67b82efdd7da08c43ef271a48">Link To Gist</a></p> <p><em>This implementation contributed kindly by <a href="https://github.com/Dr-Emann">Zachary Dremann</a></em></p> <p>The refactored Rust implementation also performs very well:</p> <ul> <li>0% of CPU time spent in garbage collection (but some spent allocating!)</li> <li>14 milliseconds to load the world</li> <li>0.7ms per tick</li> </ul> <h2 id="go-haskell">Go?
Haskell?</h2> <p>The <a href="https://www.reddit.com/r/haskell/comments/52j2c9/performance_in_the_large_benchmark/">Haskell</a> and <a href="https://gist.github.com/dgryski/61763e2ee58d3446b25bbe00b44f974e">Go</a> communities have gotten into the act too with lots of fun experiments. I won’t be able to collate the performance of all of these efforts but they are fun to read up on.</p> <h2 id="performance-comparisons">Performance Comparisons:</h2> <p>A couple of quick graphs showing some performance comparisons. Again, I reiterate not to read these as proof of the performance superiority of any memory management approach. I assure you that all of these implementations could be optimized further than they are here. The point is to show how allocations are expensive in all cases, but in different ways, and how good design brings performance into reasonable ranges in all cases, though with differing levels of effort.</p> <h3 id="the-naive-approaches-with-vector-fixes-in-c-and-c">The Naive Approaches (With Vector Fixes in C# and C++)</h3> <p><img src="/images/mem-slow.png" alt="Naive" title="Naive" /></p> <p>Note the Y Axis here is Log Time. Notice how in all 4 languages performance starts to degrade at the same time, about 1,000 ticks in, probably reflecting when the heap begins to fill up, and allocations get expensive for C++, and GC has to kick in more for .NET and Java. While one could bicker about which language is doing best here for eternity, the fact is that all of these implementations are completely unacceptable.</p> <h3 id="the-refactored-approaches">The Refactored Approaches</h3> <p><img src="/images/mem-fast.png" alt="Faster" title="Faster" /></p> <p>Notice all 4 languages perform well here, but the garbage collected languages do still have GC pauses, which are a big problem in gaming.
Most game developers would get even more clever, using object pooling and other techniques to try to get allocations down to zero within the main game loop, if possible.</p> <h2 id="conclusions">Conclusions</h2> <p>The main takeaway here is to think about how you work with memory, no matter what language you use. C# offers some nice tools in value types to make this a bit easier. Java on the other hand uses escape analysis to attempt to “auto struct” things for you. Both of these approaches have pros and cons. It is less important to worry about which is best, and more important just to understand how your language works, so the code you type will leverage its strengths, and avoid its weaknesses. C++ doesn’t make allocation free; allocating too much is one of the primary causes of performance problems in C++ code as well. It does give you the most control to make things perform well, but it will be up to you to figure it out. Manage your memory well.</p> <h2 id="benchmark-details">Benchmark Details</h2> <p>All benchmarks were run with what I believe to be the latest and greatest compilers available for Windows for each language (debatable for C++).
If you identify cases where code or compiler/environment choices are sub optimal, email me please.</p> <h4 id="environment">Environment</h4> <div class="language-ini highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="err">Host</span> <span class="err">Process</span> <span class="err">Environment</span> <span class="err">Information:</span> <span class="py">BenchmarkDotNet</span><span class="p">=</span><span class="s">v0.9.8.0</span> <span class="py">OS</span><span class="p">=</span><span class="s">Microsoft Windows NT 6.2.9200.0</span> <span class="py">Processor</span><span class="p">=</span><span class="s">Intel(R) Core(TM) i7-4712HQ CPU 2.30GHz, ProcessorCount=8</span> <span class="py">Frequency</span><span class="p">=</span><span class="s">2240907 ticks, Resolution=446.2479 ns, Timer=TSC</span> </code></pre></div></div> <h4 id="c-runtime-details">C# Runtime Details</h4> <div class="language-ini highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="py">CLR</span><span class="p">=</span><span class="s">MS.NET 4.0.30319.42000, Arch=64-bit RELEASE [RyuJIT]</span> <span class="py">GC</span><span class="p">=</span><span class="s">Concurrent Workstation</span> <span class="py">JitModules</span><span class="p">=</span><span class="s">clrjit-v4.6.1590.0 </span> <span class="py">Jit</span><span class="p">=</span><span class="s">RyuJit GarbageCollection=Concurrent Workstation </span> </code></pre></div></div> <h3 id="c-details">C++ Details</h3> <p>Visual Studio 2015 Update 3, Optimizations set for maximum speed, AVX2 Instructions on</p> <h3 id="java-details">Java Details</h3> <p>Oracle Java 64bit version 8 update 102 Testing done with JMH</p> <h3 id="rust-details">Rust Details</h3> <p>rustc 1.13.0 build with <code class="highlighter-rouge">cargo rustc --release -- -C lto -C target-cpu=native</code></p> Thu, 01 Sep 2016 19:17:27 +0000 
https://jackmott.github.io//programming/2016/09/01/performance-in-the-large.html https://jackmott.github.io//programming/2016/09/01/performance-in-the-large.html programming Think Before You Parallelize <p>In 2005 Intel released the Pentium D, which began the era of multi-core desktop CPUs. Today, even our phones have multiple cores. Making use of all of those cores is not always easy to do, but modern languages and libraries have come a long way to help programmers take advantage. All kinds of utility functions and concurrent abstractions have been developed in an attempt to make using all our cores more accessible and simple. Sometimes these abstractions have a lot of overhead though, and sometimes it doesn’t even make sense to parallelize an operation in the first place.</p> <h2 id="dont-parallelize-when-it-is-already-parallelized">Don’t Parallelize when it is already Parallelized</h2> <p>Suppose you are working on a big number crunching function for a high traffic website that is a bit of a performance bottleneck. You get the idea to parallelize it and it tests much faster on your dev machine with 4 cores. You expect great things on the 24 core production server. However, once you deploy you find that performance in production is actually slightly worse! What you forgot was that the web server was already parallelizing things at a higher level, using all 24 production cores to handle multiple requests simultaneously. When your parallelized function fires up, all the other cores are busy with other requests. So you take the hit of whatever overhead was required to parallelize the function with no benefit.</p> <p>On the other hand, if your website was, say, a low traffic internal website with only a few dozen hits per day, then the plan to parallelize would likely pay off, as there will always be spare cores to crunch the numbers fast.
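</p> <p>One way to keep that decision adjustable is to make the worker count a parameter rather than baking it in. The following is only a rough sketch under my own assumptions, not code from any of the benchmarks: <code class="highlighter-rouge">sum_squares_parallel</code>, its <code class="highlighter-rouge">workers</code> argument, and the <code class="highlighter-rouge">min_chunk</code> threshold of 65,536 elements are all hypothetical, and the threshold is a guess you would have to tune by measuring.</p>

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Single-core baseline: sum of squares over [first, last).
double sum_squares(const double* first, const double* last) {
    double total = 0.0;
    for (; first != last; ++first) {
        total += *first * *first;
    }
    return total;
}

// Hypothetical helper: uses at most `workers` threads, and only as many as
// give each thread at least `min_chunk` elements. With one worker (or a
// small input) it simply runs the single-core loop and pays no thread cost.
double sum_squares_parallel(const std::vector<double>& data,
                            unsigned workers,
                            std::size_t min_chunk = 1 << 16) {
    const std::size_t n = data.size();
    const std::size_t max_by_size = n / min_chunk;
    const std::size_t count = std::min<std::size_t>(workers, max_by_size);
    if (count <= 1) {
        return sum_squares(data.data(), data.data() + n);
    }

    // One partial sum per worker, each written once, then combined at the end.
    std::vector<double> partial(count, 0.0);
    std::vector<std::thread> threads;
    threads.reserve(count - 1);
    const std::size_t chunk = n / count;

    for (std::size_t t = 0; t + 1 < count; ++t) {
        const double* begin = data.data() + t * chunk;
        threads.emplace_back([begin, chunk, &partial, t] {
            partial[t] = sum_squares(begin, begin + chunk);
        });
    }
    // The calling thread handles the last chunk, including any remainder.
    partial[count - 1] =
        sum_squares(data.data() + (count - 1) * chunk, data.data() + n);

    for (std::thread& th : threads) {
        th.join();
    }
    return std::accumulate(partial.begin(), partial.end(), 0.0);
}
```

<p>On the busy 24 core web server you might pass a worker count of 1, while on the quiet internal site you might pass <code class="highlighter-rouge">std::thread::hardware_concurrency()</code>. Either way the choice lives in one place and can be changed after measuring.</p> <p>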
You have to consider the overall CPU utilization of your webserver, and how your parallelized function will interact with the other jobs going on. Will it thrash the L1 cache and slow other things down? Test and measure.</p> <p>Another scenario: say you are working on a 3D game, and you have some tricky physics math where you need to crunch numbers, maybe adding realistic building physics to Minecraft. But separate threads are already handling procedural generation of new chunks, rendering, networking, and player input. If these are keeping most of the system’s cores busy, then parallelizing your physics code isn’t going to help overall. On the other hand if those other threads are not doing a lot of work, cores may indeed be free for you to crunch some physics.</p> <p>So think about the system your code is running in. If things are getting parallelized at a higher level, it may not do any good to do it again at a lower level. Instead, focus on algorithms that run as efficiently as possible on a single core.</p> <h2 id="consider-your-target-hardware">Consider your target hardware</h2> <p>Many developers have very nice machines, probably with a minimum of 8 logical cores these days, possibly more. But consider the entire scope of where your code might run. Will it run on a low cost virtualized web app fabric in the cloud? This may only have 1 or 2 virtual cores for you to work with. Will it run on old desktops or cheap phones, that maybe only have 2 cores? An algorithm that gets sped up on your 8 core system at home may not fare so well on systems with only 2 or 3.</p> <h2 id="case-study-easy-parallel-loops">Case Study Easy Parallel Loops</h2> <p>It is common in a given programming language to have compiler hints or library functions for doing easy parallel loops when it is appropriate. What happens behind the scenes can be very different depending on the abstractions each language or library uses.
In some cases a number of threads may be created to operate on chunks of the loop, or ThreadPools may be used to reduce the overhead of creating Threads. It is important to have a rough understanding of how the abstractions available to you work, so you can make educated guesses about when it might be useful to use them, how to tune them, and how to measure them. At minimum you should consider the following issues.</p> <p>If the computational overhead of creating or managing the threads is greater than the benefit you get, you can end up with slower results than you would get from a single-threaded implementation. I will compare some toy workloads with some common parallel loop abstractions in C#, F#, C++ and Java.</p> <h3 id="csharp">CSharp</h3> <div class="language-c# highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">public</span> <span class="kt">double</span> <span class="nf">ImperativeSquareSum</span><span class="p">()</span> <span class="p">{</span> <span class="kt">var</span> <span class="n">localArray</span> <span class="p">=</span> <span class="n">rawArray</span><span class="p">;</span> <span class="kt">double</span> <span class="n">result</span> <span class="p">=</span> <span class="m">0.0</span><span class="p">;</span> <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span> <span class="n">i</span> <span class="p">&lt;</span> <span class="n">localArray</span><span class="p">.</span><span class="n">Length</span><span class="p">;</span> <span class="n">i</span><span class="p">++)</span> <span class="p">{</span> <span class="n">result</span> <span class="p">+=</span> <span class="c1">//Do Work</span> <span class="p">}</span> <span class="k">return</span> <span class="n">result</span><span class="p">;</span> <span class="p">}</span> <span class="k">public</span> <span class="kt">double</span> <span
class="nf">LinqParallelSquareSum</span><span class="p">()</span> <span class="p">{</span> <span class="kt">var</span> <span class="n">localArray</span> <span class="p">=</span> <span class="n">rawArray</span><span class="p">;</span> <span class="k">return</span> <span class="n">localArray</span><span class="p">.</span><span class="nf">AsParallel</span><span class="p">().</span><span class="nf">Sum</span><span class="p">(</span><span class="cm">/* Do Work */</span><span class="p">);</span> <span class="p">}</span> <span class="k">public</span> <span class="kt">double</span> <span class="nf">ParallelForSquareSum</span><span class="p">()</span> <span class="p">{</span> <span class="kt">var</span> <span class="n">localArray</span> <span class="p">=</span> <span class="n">rawArray</span><span class="p">;</span> <span class="kt">object</span> <span class="n">lockObject</span> <span class="p">=</span> <span class="k">new</span> <span class="kt">object</span><span class="p">();</span> <span class="kt">double</span> <span class="n">result</span> <span class="p">=</span> <span class="m">0.0</span><span class="p">;</span> <span class="n">Parallel</span><span class="p">.</span><span class="nf">For</span><span class="p">(</span><span class="m">0</span><span class="p">,</span> <span class="n">localArray</span><span class="p">.</span><span class="n">Length</span><span class="p">,()</span> <span class="p">=&gt;</span> <span class="m">0.0</span><span class="p">,</span> <span class="p">(</span><span class="n">i</span><span class="p">,</span> <span class="n">loopState</span><span class="p">,</span> <span class="n">partialResult</span><span class="p">)</span> <span class="p">=&gt;</span> <span class="p">{</span> <span class="cm">/*Do Work*/</span> <span class="p">},</span> <span class="p">(</span><span class="n">localPartialSum</span><span class="p">)</span> <span class="p">=&gt;</span> <span class="p">{</span> <span class="k">lock</span> <span class="p">(</span><span 
class="n">lockObject</span><span class="p">)</span> <span class="p">{</span> <span class="n">result</span> <span class="p">+=</span> <span class="n">localPartialSum</span><span class="p">;</span> <span class="p">}});</span> <span class="k">return</span> <span class="n">result</span><span class="p">;</span> <span class="p">}</span> </code></pre></div></div> <h4 id="1-million-doubles---result--xx">1 million doubles - (result += x*x)</h4> <table> <thead> <tr> <th>Method</th> <th>Median</th> <th>Bytes Allocated/Op</th> </tr> </thead> <tbody> <tr> <td>Imperative</td> <td>1.1138 ms</td> <td>29,480.65</td> </tr> <tr> <td>LinqParallel</td> <td>3.3174 ms</td> <td>117,802.56</td> </tr> <tr> <td>ParallelFor</td> <td>1.9985 ms</td> <td>59,264.27</td> </tr> </tbody> </table> <p><br /> <br /> The easiest way to parallelize work like this in C# is with <a href="https://msdn.microsoft.com/en-us/library/dd460688(v=vs.110).aspx">PLINQ</a>. Just type your collection name, then <em>.AsParallel()</em> and fire away with Linq queries. Unfortunately in this case it does no good, and neither does the <em>Parallel.For</em> function. The workload of just squaring doubles and adding them up isn’t enough to get a net benefit here. You would need to roll your own function using ThreadPools or perhaps Threads directly to see a speedup.</p> <h4 id="1-million-doubles---result--mathsinx">1 million doubles - (result += Math.Sin(x))</h4> <table> <thead> <tr> <th>Method</th> <th>Median</th> <th>Bytes Allocated/Op</th> </tr> </thead> <tbody> <tr> <td>Imperative</td> <td>37.1130 ms</td> <td>840,522.92</td> </tr> <tr> <td>LinqParallel</td> <td>9.8497 ms</td> <td>225,694.67</td> </tr> <tr> <td>ParallelFor</td> <td>8.5615 ms</td> <td>166,386.40</td> </tr> </tbody> </table> <p><br /> With the bigger workload there is now a large improvement by parallelizing. It takes a CPU about 2 orders of magnitude more cycles to perform a sin operation than it does an add or multiply.
Because of this, per-element overhead cost becomes a much smaller percentage of overall runtime, and we get the ~4x speedup we expect from 4 physical cores. It also reduces the relative cost of the simple Linq approach compared to the more complex <em>Parallel.For</em> abstraction. Consider how big the workload is to help decide if the simple Linq approach is worth the cost.</p> <h3 id="fsharp">FSharp</h3> <p>F# has a number of easy to use 3rd party libraries for this purpose. All can be used from C# as well. A quick rundown of them here:</p> <div class="language-ocaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="c">(* Nessos Streams ParStream *)</span> <span class="kt">array</span> <span class="o">|&gt;</span> <span class="nn">ParStream</span><span class="p">.</span><span class="n">ofArray</span> <span class="o">|&gt;</span> <span class="nn">ParStream</span><span class="p">.</span><span class="n">fold</span> <span class="p">(</span><span class="k">fun</span> <span class="n">acc</span> <span class="n">x</span> <span class="o">-&gt;</span> <span class="n">acc</span> <span class="o">+</span> <span class="n">x</span><span class="o">*</span><span class="n">x</span><span class="p">)</span> <span class="p">(</span><span class="o">+</span><span class="p">)</span> <span class="p">(</span><span class="k">fun</span> <span class="bp">()</span> <span class="o">-&gt;</span> <span class="mi">0</span><span class="o">.</span><span class="mi">0</span><span class="p">)</span> <span class="c">(* FSharp.Collections.ParallelSeq *)</span> <span class="kt">array</span> <span class="o">|&gt;</span> <span class="nn">PSeq</span><span class="p">.</span><span class="n">reduce</span> <span class="p">(</span><span class="k">fun</span> <span class="n">acc</span> <span class="n">x</span> <span class="o">-&gt;</span> <span class="n">acc</span><span class="o">+</span><span class="n">x</span><span class="o">*</span><span class="n">x</span><span 
class="p">)</span> <span class="c">(* SIMDArray (uses AVX2 SIMD as well) *)</span> <span class="kt">array</span> <span class="o">|&gt;</span> <span class="nn">Array</span><span class="p">.</span><span class="nn">SIMDParallel</span><span class="p">.</span><span class="n">fold</span> <span class="p">(</span><span class="k">fun</span> <span class="n">acc</span> <span class="n">x</span> <span class="o">-&gt;</span> <span class="n">acc</span> <span class="o">+</span> <span class="n">x</span><span class="o">*</span><span class="n">x</span><span class="p">)</span> <span class="p">(</span><span class="k">fun</span> <span class="n">acc</span> <span class="n">x</span> <span class="o">-&gt;</span> <span class="n">acc</span> <span class="o">+</span> <span class="n">x</span><span class="o">*</span><span class="n">x</span><span class="p">)</span> <span class="p">(</span><span class="o">+</span><span class="p">)</span> <span class="p">(</span><span class="o">+</span><span class="p">)</span> <span class="mi">0</span><span class="o">.</span><span class="mi">0</span> </code></pre></div></div> <h3 id="1-million-doubles-result--xx">1 million doubles (result += x*x)</h3> <table> <thead> <tr> <th>Method</th> <th>Time</th> </tr> </thead> <tbody> <tr> <td>.NET / F# Parallel SIMDArray</td> <td>0.26ms</td> </tr> <tr> <td>.NET / F# Nessos Streams</td> <td>1.05ms</td> </tr> <tr> <td>.NET / F# ParallelSeq</td> <td>3.1ms</td> </tr> </tbody> </table> <p><br /></p> <p><a href="https://github.com/jackmott/SIMDArray">SIMDArray</a> is ‘cheating’ here as it also does SIMD operations, but I include it because I wrote it, so I do what I want. 
SIMDArray and Nessos Streams outperform the core library functions above, while ParallelSeq lands close to PLINQ.</p> <h3 id="1-million-doubles-result--mathsinx">1 million doubles (result += Math.Sin(x))</h3> <table> <thead> <tr> <th>Method</th> <th>Time</th> </tr> </thead> <tbody> <tr> <td>.NET / F# Nessos Streams</td> <td>6.7ms</td> </tr> <tr> <td>.NET / F# ParallelSeq</td> <td>9.9ms</td> </tr> </tbody> </table> <p><br /></p> <p>The Sin operation can’t be SIMDified here so SIMDArray is out. Nessos Streams again proves to be better than the core library functions.</p> <h3 id="c">C++</h3> <p>Now the same experiment in C++. Most C++ compilers can auto-parallelize loops, which you can control via compiler flags or inline hints in your code. For instance with Visual Studio’s C++ compiler you can just put <em>#pragma loop(hint_parallel(8))</em> on top of a loop and, with the /Qpar compiler flag set, it will parallelize it if it can. Unfortunately our toy example is (intentionally) a tiny bit too complex for that. Since we are summing up results, this creates a data dependency. Fortunately we can use <a href="http://openmp.org/wp/">OpenMP</a>, which is available in Microsoft Visual C++, GCC, Clang, and other popular C++ compilers:</p> <div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="kt">double</span> <span class="n">result</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="cp">#pragma omp parallel for reduction(+ : result) </span> <span class="k">for</span><span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">COUNT</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span> <span class="n">result</span> <span class="o">+=</span> <span class="cm">/*Do Work*/</span><span class="p">;</span> <span class="p">}</span> 
</code></pre></div></div> <p>This is equivalent to the Parallel.For loop used above in C#, where you identify that you will be aggregating data. This is actually less typing and easier to read too, even if the syntax is odd. How does it perform?</p> <h4 id="1-million-doubles---result--xx--no-simd">1 million doubles - (result += x*x) No SIMD</h4> <table> <thead> <tr> <th>Method</th> <th>Median</th> </tr> </thead> <tbody> <tr> <td>ForLoop</td> <td>1.031 ms</td> </tr> <tr> <td>ParallelizedForLoop</td> <td>0.375 ms</td> </tr> </tbody> </table> <p><br /> We can see that OpenMP is managing a more efficient abstraction than .NET for this case, achieving almost a 3x speedup where .NET was actually slower. Newer OpenMP implementations available on other compilers can also be directed to do SIMD vectorization in the loop for even more speed. That does not seem to be available in MS Visual C++, and the usual automatic vectorization does not seem to happen within the omp loop. Automatic vectorization can be done on the single-threaded for loop but it was turned off for these C++ tests. <em>The C++ compiler does emit older SSE instructions, as is the case with .NET and Java as well, but they only use a single lane. MSVC++ will use all lanes if you specify /fp:fast but only in the non-OMP loop.</em></p> <h4 id="1-million-doubles---result--sinx--no-simd">1 million doubles - (result += sin(x)) No SIMD</h4> <table> <thead> <tr> <th>Method</th> <th>Median</th> </tr> </thead> <tbody> <tr> <td>ForLoop</td> <td>10.625 ms</td> </tr> <tr> <td>ParallelizedForLoop</td> <td>2.44 ms</td> </tr> </tbody> </table> <p><br /></p> <p>This time a little more than a 4x speedup, and as you can see the results are overall faster than .NET as well.</p> <h3 id="java">Java</h3> <p>Java’s streams library, which performed excellently in a <a href="https://jackmott.github.io/programming/2016/07/22/making-obvious-fast.html">previous blog post</a>, can be used here again. 
You simply have to tell it you want a parallel stream:</p> <div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="c1">//Regular stream</span> <span class="n">sum</span> <span class="o">=</span> <span class="n">Arrays</span><span class="o">.</span><span class="na">stream</span><span class="o">(</span><span class="n">array</span><span class="o">).</span><span class="na">reduce</span><span class="o">(</span><span class="mi">0</span><span class="o">,(</span><span class="n">acc</span><span class="o">,</span><span class="n">x</span><span class="o">)</span> <span class="o">-&gt;</span> <span class="cm">/*Do Work*/</span><span class="o">);</span> <span class="c1">//ParallelStream</span> <span class="n">sum</span> <span class="o">=</span> <span class="n">Arrays</span><span class="o">.</span><span class="na">stream</span><span class="o">(</span><span class="n">array</span><span class="o">).</span><span class="na">parallel</span><span class="o">().</span><span class="na">reduce</span><span class="o">(</span><span class="mi">0</span><span class="o">,(</span><span class="n">acc</span><span class="o">,</span><span class="n">x</span><span class="o">)</span> <span class="o">-&gt;</span> <span class="cm">/*Do Work*/</span><span class="o">);</span> </code></pre></div></div> <p><br /></p> <h4 id="1-million-doubles-result--xx-1">1 million doubles (result += x*x)</h4> <table> <thead> <tr> <th>Method</th> <th>Median</th> </tr> </thead> <tbody> <tr> <td>Stream</td> <td>1.03ms</td> </tr> <tr> <td>Parallel Stream</td> <td>0.375ms -&gt; .8ms</td> </tr> </tbody> </table> <p><br /></p> <h4 id="1-million-doubles-result--mathsinx-1">1 million doubles (result += Math.sin(x))</h4> <table> <thead> <tr> <th>Method</th> <th>Median</th> </tr> </thead> <tbody> <tr> <td>Stream</td> <td>34.5ms</td> </tr> <tr> <td>Parallel Stream</td> <td>7.8ms -&gt; 14ms</td> </tr> </tbody> </table> <p><br /></p> <p>Java performs right on par with C++ in the first 
example, but falls behind when using Math.sin(). It appears that this is not due to the parallel streams, but due to Java using a more accurate sin implementation, rather than calling the x86 instruction directly. This difference may not exist on other hardware. I do not like it when a language tells me I can’t touch the hardware if I want. A Math.NativeSin() would be nice. The streams library overall though has proven to be excellent, matching C++ on the x*x workload in both scalar and parallel varieties.</p> <h4 id="update">Update!</h4> <p>Further experiments with Java using the JMH testing framework have shown the parallel streams to exhibit inconsistent performance. Sometimes they execute in ~.375ms indefinitely. Sometimes they execute that fast for only a few dozen iterations, then suddenly take ~.8ms indefinitely after that. The reasons are unknown; if you are a JVM expert and have ideas, please email me.</p> <h3 id="javascript">Javascript</h3> <p>pfffftttt (yeah I know about Web Workers)</p> <h3 id="rust">Rust</h3> <p>Rust provides no easy loop parallelizing abstractions out of the box; you have to roll your own. OpenMP style features <a href="https://github.com/rust-lang/rfcs/issues/859">may be in the works for Rust</a> though, and 3rd party libraries are available. So let’s take a look at a nice one called <a href="https://github.com/nikomatsakis/rayon">Rayon</a> which adds a “par_iter” providing functions similar to the regular iter, but in parallel. 
The code remains very simple:</p> <div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="c">// The regular iter</span> <span class="n">vector</span><span class="nf">.iter</span><span class="p">()</span><span class="nf">.map</span><span class="p">(|</span><span class="o">&amp;</span><span class="n">x</span><span class="p">|</span> <span class="cm">/* do work */</span><span class="p">)</span><span class="nf">.sum</span><span class="p">()</span> <span class="c">// Parallel iter</span> <span class="n">vector</span><span class="nf">.par_iter</span><span class="p">()</span><span class="nf">.map</span><span class="p">(|</span><span class="o">&amp;</span><span class="n">x</span><span class="p">|</span> <span class="cm">/* do work */</span><span class="p">)</span><span class="nf">.sum</span><span class="p">()</span> </code></pre></div></div> <p><br /></p> <h4 id="1-million-doubles-result--xx-2">1 million doubles (result += x*x)</h4> <table> <thead> <tr> <th>Method</th> <th>Median</th> </tr> </thead> <tbody> <tr> <td>iter</td> <td>1.05 ms</td> </tr> <tr> <td>par_iter</td> <td>.375 ms</td> </tr> </tbody> </table> <p><br /></p> <h4 id="1-million-doubles-result--mathsinx-2">1 million doubles (result += Math.sin(x))</h4> <table> <thead> <tr> <th>Method</th> <th>Median</th> </tr> </thead> <tbody> <tr> <td>iter</td> <td>9.65 ms</td> </tr> <tr> <td>par_iter</td> <td>2.44 ms</td> </tr> </tbody> </table> <p><br /></p> <p>These are excellent results, tied with C++, and requiring only a single line of code to express.</p> <h2 id="summary">Summary</h2> <p>The loop abstractions examined here are just one type of parallel or concurrent programming abstraction available. There is a whole universe out there: Actor Models, Async/Await, Tasks, Thread Pools, and so on. 
Be sure to understand what you are using, and measure whether it will really be useful, or whether you should focus on fast single threaded algorithms or look for third party tools with better performance.</p> <h2 id="aggregated-testing-results">Aggregated Testing Results</h2> <h4 id="1-million-doubles--result--xx--no-simd--except-simdarray">1 million doubles ( result += x*x) No SIMD ( Except SIMDArray)</h4> <table> <thead> <tr> <th>Method</th> <th>Time</th> <th>Lines Of Code</th> </tr> </thead> <tbody> <tr> <td>.NET / F# SIMDArray</td> <td>0.26ms</td> <td>1</td> </tr> <tr> <td>Rust Rayon</td> <td>0.375ms</td> <td>1</td> </tr> <tr> <td>C++ OpenMP</td> <td>0.375ms</td> <td>~5</td> </tr> <tr> <td>Java Parallel Streams</td> <td>0.375 ms -&gt; 0.8ms</td> <td>1</td> </tr> <tr> <td>.NET / F# Nessos Streams</td> <td>1.05ms</td> <td>~2</td> </tr> <tr> <td>.NET Parallel.For</td> <td>1.9ms</td> <td>~6</td> </tr> <tr> <td>.NET / F# ParallelSeq</td> <td>3.1ms</td> <td>1</td> </tr> <tr> <td>.NET Parallel Linq (Sum)</td> <td>3.3ms</td> <td>1</td> </tr> <tr> <td>.NET Parallel Linq (Aggregate)</td> <td>8ms</td> <td>1</td> </tr> </tbody> </table> <p><br /></p> <h4 id="1-million-doubles--result--sinx--no-simd">1 million doubles ( result += sin(x)) No SIMD</h4> <table> <thead> <tr> <th>Method</th> <th>Time</th> <th>Lines Of Code</th> </tr> </thead> <tbody> <tr> <td>Rust Rayon</td> <td>2.44ms</td> <td>1</td> </tr> <tr> <td>C++ OpenMP</td> <td>2.44ms</td> <td>~4</td> </tr> <tr> <td>.NET / F# Nessos Streams</td> <td>6.7ms</td> <td>~2</td> </tr> <tr> <td>Java Parallel Streams</td> <td>7.8ms -&gt; 14ms</td> <td>1</td> </tr> <tr> <td>.NET Parallel.For</td> <td>8.5615ms</td> <td>~6</td> </tr> <tr> <td>.NET Parallel Linq (Sum)</td> <td>9.8497ms</td> <td>1</td> </tr> <tr> <td>.NET / F# ParallelSeq</td> <td>9.9ms</td> <td>1</td> </tr> <tr> <td>.NET Parallel Linq (Aggregate)</td> <td>45.6ms</td> <td>1</td> </tr> </tbody> </table> <p><br /></p> <h2 id="benchmark-details">Benchmark Details</h2> 
<p>All benchmarks were run with what I believe to be the latest and greatest compilers available for Windows for each language (debatable for C++). JIT warmup time is accounted for when applicable. If you identify cases where code or compiler/environment choices are suboptimal, email me please.</p> <h4 id="environment">Environment</h4> <div class="language-ini highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="err">Host</span> <span class="err">Process</span> <span class="err">Environment</span> <span class="err">Information:</span> <span class="py">BenchmarkDotNet</span><span class="p">=</span><span class="s">v0.9.8.0</span> <span class="py">OS</span><span class="p">=</span><span class="s">Microsoft Windows NT 6.2.9200.0</span> <span class="py">Processor</span><span class="p">=</span><span class="s">Intel(R) Core(TM) i7-4712HQ CPU 2.30GHz, ProcessorCount=8</span> <span class="py">Frequency</span><span class="p">=</span><span class="s">2240907 ticks, Resolution=446.2479 ns, Timer=TSC</span> </code></pre></div></div> <h4 id="f--c-runtime-details">F# / C# Runtime Details</h4> <div class="language-ini highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="py">CLR</span><span class="p">=</span><span class="s">MS.NET 4.0.30319.42000, Arch=64-bit RELEASE [RyuJIT]</span> <span class="py">GC</span><span class="p">=</span><span class="s">Concurrent Workstation</span> <span class="py">JitModules</span><span class="p">=</span><span class="s">clrjit-v4.6.1590.0</span> <span class="py">Type</span><span class="p">=</span><span class="s">SIMDBenchmark Mode=Throughput Platform=X64 </span> <span class="py">Jit</span><span class="p">=</span><span class="s">RyuJit GarbageCollection=Concurrent Workstation </span> </code></pre></div></div> <h3 id="c-details">C++ Details</h3> <p>Visual Studio 2015 Update 3, Optimizations set for maximum speed, SIMD off</p> <h3 id="java-details">Java Details</h3> <p>Oracle Java 64bit version 8 
update 102. Testing done with JMH.</p> <h3 id="rust-details">Rust Details</h3> <p>rustc 1.13.0-nightly built with <code class="highlighter-rouge">cargo rustc --release -- -C lto -C target-cpu=native</code></p> Tue, 30 Aug 2016 19:17:27 +0000 https://jackmott.github.io//programming/2016/08/30/think-before-you-parallelize.html programming When Big O Fools Ya <p>Big O notation is a great tool. It allows one to quickly make smart choices among various data structures and algorithms. But sometimes a casual Big O analysis can fool us if we don’t think carefully about the impact of constant factors. One such example comes up very often when programming on modern CPUs, and that is when choosing between an Array, and a List, or Tree type structure.</p> <h3 id="memory-slow-slow-memory">Memory, Slow Slow Memory</h3> <p>In the early 1980s, the time it took to get data from RAM, and the time it took to do computation on the data were roughly at parity. You could use algorithms that hop randomly over the heap, grabbing data and working with it. Since that time, CPUs have gotten faster at a much higher rate than RAM has. Today, a CPU can compute on the order of 100 to 1000 times faster than it can get data from RAM. This means when the CPU needs data from RAM it has to stall for hundreds of cycles, doing nothing. Obviously this would be a useless situation, so modern CPUs have various levels of cache built in. Any time you request one piece of data from RAM, you also get chunks of contiguous memory pulled into the caches on the CPU. The result is that when you iterate over contiguous memory, you can access it about as fast as the CPU can operate, because you will be streaming chunks of data into the L1 cache. If you iterate over memory in random locations, you will often miss the CPU caches, and performance can suffer greatly. 
If you want to learn more about this, <a href="https://www.youtube.com/watch?v=rX0ItVEVjHc">Mike Acton’s CppCon talk</a> is a great starting point and great fun too.</p> <p>The consequence of this is that arrays have become the go-to data structure if performance is important, sometimes even when Big O analysis suggests they would be slower. Where you wanted a Tree before you may want a sorted array and a binary search algorithm. Where you wanted a Queue before you may want a growable array, and so on.</p> <h3 id="linked-list-vs-array-list">Linked List vs Array List</h3> <p>Once you are familiar with how important contiguous memory access is, it should be no surprise that if you want to iterate over a collection quickly, an array will be faster than a Linked List. Environments with clever allocators and garbage collectors may be able to keep Linked List nodes somewhat contiguous, some of the time, but they can’t guarantee it. Using a raw array usually involves considerably more complex code, especially if you want to be able to insert or add items, as you will have to deal with growing the array, shuffling elements around, and so on. Most languages have core libraries which include some sort of growable array data structure to help with this. In C++ you have <a href="http://www.cplusplus.com/reference/vector/vector/">vector</a>, in C# you have <a href="https://msdn.microsoft.com/en-us/library/6sh2ey19(v=vs.110).aspx">List&lt;T&gt; (aliased as ResizeArray in F#)</a>, and in Java there is <a href="https://docs.oracle.com/javase/8/docs/api/java/util/ArrayList.html">ArrayList</a>. Usually these data structures expose the same, or a similar, interface as the Linked List collection. <strong>I will refer to such data structures as Array Lists from here on, but keep in mind all the C# examples are using the List&lt;T&gt; class, not the older ArrayList class.</strong></p> <p>So what if you need a data structure that you can insert items into, and iterate over quickly? 
Let us assume for this example that we have a use case where we will insert into the front of a collection about 5 times more often than we iterate over it. Let us also assume that the Linked List and Array List in our environment have interfaces which are equally pleasant to work with for this task. All that remains then to make a choice is to determine which one performs better. In the interest of optimizing our own valuable time, one might turn to Big O analysis. Referring to the handy <a href="http://bigocheatsheet.com/">Big-O Cheat Sheet</a>, the relevant time complexities for these two data structures are:</p> <table> <thead> <tr> <th> </th> <th>Iterate</th> <th>Insert</th> </tr> </thead> <tbody> <tr> <td>Array List</td> <td>O(n)</td> <td>O(n)</td> </tr> <tr> <td>Linked List</td> <td>O(n)</td> <td>O(1)</td> </tr> </tbody> </table> <p><br /> Array Lists are problematic for insertion. At a minimum, an Array List has to copy every single element beyond the insertion point to move them over by 1 and make space for the inserted element, making insertion O(n). Sometimes it will also have to allocate a new, bigger array to make room for the insertion. This doesn’t change the Big O time complexity, but does take time, and waste memory. So it seems for our use case, where insert happens 5 times more often than iterating, that the best choice is clear. As long as n is large enough, Linked List should perform better overall.</p> <h3 id="empiricism">Empiricism</h3> <p>But, to know things for sure, we always have to count. So let us do an experiment in C#, using <a href="https://github.com/PerfDotNet/BenchmarkDotNet">BenchmarkDotNet</a>. C# provides generic collections LinkedList&lt;T&gt; which is a classic Linked List, and List&lt;T&gt; which is an Array List. Their interfaces are similar, and both allow us to implement our use case with ease. 
We will assume a worst case scenario for Array List, by always inserting at the front, necessitating that the entire array be copied on each insertion. The testing environment specs are:</p> <div class="language-ini highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="err">Host</span> <span class="err">Process</span> <span class="err">Environment</span> <span class="err">Information:</span> <span class="py">BenchmarkDotNet.Core</span><span class="p">=</span><span class="s">v0.9.9.0</span> <span class="py">OS</span><span class="p">=</span><span class="s">Microsoft Windows NT 6.2.9200.0</span> <span class="py">Processor</span><span class="p">=</span><span class="s">Intel(R) Core(TM) i7-4712HQ CPU 2.30GHz, ProcessorCount=8</span> <span class="py">Frequency</span><span class="p">=</span><span class="s">2240910 ticks, Resolution=446.2473 ns, Timer=TSC</span> <span class="py">CLR</span><span class="p">=</span><span class="s">MS.NET 4.0.30319.42000, Arch=64-bit RELEASE [RyuJIT]</span> <span class="py">GC</span><span class="p">=</span><span class="s">Concurrent Workstation</span> <span class="py">JitModules</span><span class="p">=</span><span class="s">clrjit-v4.6.1590.0</span> <span class="py">Type</span><span class="p">=</span><span class="s">Bench Mode=Throughput </span> </code></pre></div></div> <h3 id="test-cases">Test Cases:</h3> <div class="language-c# highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="p">[</span><span class="nf">Benchmark</span><span class="p">(</span><span class="n">Baseline</span><span class="p">=</span><span class="k">true</span><span class="p">)]</span> <span class="k">public</span> <span class="kt">int</span> <span class="nf">ArrayTest</span><span class="p">()</span> <span class="p">{</span> <span class="c1">//In C#, List&lt;T&gt; is an array backed list.</span> <span class="n">List</span><span class="p">&lt;</span><span class="kt">int</span><span class="p">&gt;</span> 
<span class="n">local</span> <span class="p">=</span> <span class="n">arrayList</span><span class="p">;</span> <span class="kt">int</span> <span class="n">localInserts</span> <span class="p">=</span> <span class="n">inserts</span><span class="p">;</span> <span class="kt">int</span> <span class="n">sum</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span> <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span> <span class="n">i</span> <span class="p">&lt;</span> <span class="n">localInserts</span><span class="p">;</span> <span class="n">i</span><span class="p">++)</span> <span class="p">{</span> <span class="n">local</span><span class="p">.</span><span class="nf">Insert</span><span class="p">(</span><span class="m">0</span><span class="p">,</span> <span class="m">1</span><span class="p">);</span> <span class="c1">//Insert the number 1 at the front</span> <span class="p">}</span> <span class="c1">// For loops iterate over List&lt;T&gt; much faster than foreach</span> <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span> <span class="n">i</span> <span class="p">&lt;</span> <span class="n">local</span><span class="p">.</span><span class="n">Count</span><span class="p">;</span> <span class="n">i</span><span class="p">++)</span> <span class="p">{</span> <span class="n">sum</span> <span class="p">+=</span> <span class="n">local</span><span class="p">[</span><span class="n">i</span><span class="p">];</span> <span class="c1">//do some work here so the JIT doesn't elide the loop entirely</span> <span class="p">}</span> <span class="k">return</span> <span class="n">sum</span><span class="p">;</span> <span class="p">}</span> <span class="p">[</span><span class="n">Benchmark</span><span class="p">]</span> 
<span class="k">public</span> <span class="kt">int</span> <span class="nf">ListTest</span><span class="p">()</span> <span class="p">{</span> <span class="n">LinkedList</span><span class="p">&lt;</span><span class="kt">int</span><span class="p">&gt;</span> <span class="n">local</span> <span class="p">=</span> <span class="n">linkedList</span><span class="p">;</span> <span class="kt">int</span> <span class="n">localInserts</span> <span class="p">=</span> <span class="n">inserts</span><span class="p">;</span> <span class="kt">int</span> <span class="n">sum</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span> <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span> <span class="n">i</span> <span class="p">&lt;</span> <span class="n">localInserts</span><span class="p">;</span> <span class="n">i</span><span class="p">++)</span> <span class="p">{</span> <span class="n">local</span><span class="p">.</span><span class="nf">AddFirst</span><span class="p">(</span><span class="m">1</span><span class="p">);</span> <span class="c1">//Insert the number 1 at the front</span> <span class="p">}</span> <span class="c1">// Again, iterating the fastest possible way over this collection</span> <span class="kt">var</span> <span class="n">node</span> <span class="p">=</span> <span class="n">local</span><span class="p">.</span><span class="n">First</span><span class="p">;</span> <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span> <span class="n">i</span> <span class="p">&lt;</span> <span class="n">local</span><span class="p">.</span><span class="n">Count</span><span class="p">;</span> <span class="n">i</span><span class="p">++)</span> <span class="p">{</span> <span class="n">sum</span> <span class="p">+=</span> <span 
class="n">node</span><span class="p">.</span><span class="n">Value</span><span class="p">;</span> <span class="n">node</span> <span class="p">=</span> <span class="n">node</span><span class="p">.</span><span class="n">Next</span><span class="p">;</span> <span class="p">}</span> <span class="k">return</span> <span class="n">sum</span><span class="p">;</span> <span class="p">}</span> </code></pre></div></div> <h3 id="results">Results:</h3> <table> <thead> <tr> <th>Method</th> <th>Length</th> <th>Inserts</th> <th>Median</th> </tr> </thead> <tbody> <tr> <td>ArrayTest</td> <td>100</td> <td>5</td> <td>38.9983 us</td> </tr> <tr> <td>ListTest</td> <td>100</td> <td>5</td> <td>51.7538 us</td> </tr> </tbody> </table> <p><br /></p> <p>The Array List wins by a nice margin. But this is a small list. Big O only tells us about performance as <code class="highlighter-rouge">n</code> grows large, so we should see this trend eventually reverse as <code class="highlighter-rouge">n</code> grows. Let’s try it:</p> <div style="width:750px;"> <div style="width:350px;float:left;"> <table> <thead> <tr> <th>Method</th> <th>Length</th> <th>Inserts</th> <th>Median</th> </tr> </thead> <tbody> <tr> <td>ArrayTest</td> <td>100</td> <td>5</td> <td>38.9983 us</td> </tr> <tr> <td>ListTest</td> <td>100</td> <td>5</td> <td>51.7538 us</td> </tr> <tr> <td>ArrayTest</td> <td>1000</td> <td>5</td> <td>42.1585 us</td> </tr> <tr> <td>ListTest</td> <td>1000</td> <td>5</td> <td>49.5561 us</td> </tr> <tr> <td>ArrayTest</td> <td>100000</td> <td>5</td> <td>208.9662 us</td> </tr> <tr> <td>ListTest</td> <td>100000</td> <td>5</td> <td>312.2153 us</td> </tr> <tr> <td>ArrayTest</td> <td>1000000</td> <td>5</td> <td>2,179.2469 us</td> </tr> <tr> <td>ListTest</td> <td>1000000</td> <td>5</td> <td>4,913.3430 us</td> </tr> <tr> <td>ArrayTest</td> <td>10000000</td> <td>5</td> <td>36,103.8456 us</td> </tr> <tr> <td>ListTest</td> <td>10000000</td> <td>5</td> <td>49,395.0839 us</td> </tr> </tbody> </table> </div> <div 
style="width:400px;float:right"> <table class="highchart" data-graph-container=".. .. .highchart-container" data-graph-type="line" graph-color="#000" data-graph-xaxis-title-text="array size" data-graph-yaxis1-title-text="log runtime us" style="display:none"> <thead> <tr> <th>Length</th> <th>ArrayList</th> <th>LinkedList</th> </tr> </thead> <tbody> <tr> <td>100</td> <td>38.9983</td> <td>51.7538</td> </tr> <tr> <td>1000</td> <td>42.1585</td> <td>49.5561</td> </tr> <tr> <td>100000</td> <td>208.9662</td> <td>312.2153</td> </tr> <tr> <td>1000000</td> <td>2179.2469</td> <td>4913.3430</td> </tr> <tr> <td>10000000</td> <td>36103.8456</td> <td>49395.0839</td> </tr> </tbody> </table> <div class="highchart-container" style="background-color:#000;"></div> </div> <div style="clear:both;"></div> </div> <p><br /> Here we get the result that will be counterintuitive to many. No matter how large <code class="highlighter-rouge">n</code> gets, the Array List still performs better overall. In order for the Array List to lose, the <em>ratio</em> of inserts to iterations has to change, not just the length of the collection. Note that this isn’t an actual failure of Big O analysis; it is merely a common human failure in our application of it. If you actually “did the math”, Big O would tell you that the runtimes of the two data structures grow at the same rate when there is a constant ratio of inserts to iterations. With a fixed number k of inserts per iteration, the Array List costs k·O(n) + O(n) and the Linked List costs k·O(1) + O(n), and both of those are O(n).</p> <p>Where the break-even point occurs will depend on many factors, though a good rule of thumb suggested by <a href="https://www.youtube.com/watch?v=fHNmRkzxHWs">Chandler Carruth</a> at Google is that Array Lists will outperform Linked Lists until you are inserting about an order of magnitude more often than you are iterating. 
This rule of thumb works well in this particular case, as 10:1 is where we see Array List start to lose:</p> <table> <thead> <tr> <th>Method</th> <th>Length</th> <th>Inserts</th> <th>Median</th> </tr> </thead> <tbody> <tr> <td>ArrayTest</td> <td>100000</td> <td>10</td> <td>328,147.7954 ns</td> </tr> <tr> <td>ListTest</td> <td>100000</td> <td>10</td> <td>324,349.0560 ns</td> </tr> </tbody> </table> <p><br /></p> <h3 id="devils-in-the-details">Devils in the Details</h3> <p>The reason Array List wins here is that the integers being iterated over are lined up contiguously in memory. Each time an integer is requested from memory, an entire cache line of integers is pulled into the L1 cache, so the neighbouring integers on that line (typically 64 bytes) are ready to go. With the Linked List, each call to <code class="highlighter-rouge">node.Next</code> makes a pointer hop to the next node, and there is no guarantee that nodes will be contiguous in memory. Therefore we will miss the cache sometimes. But we aren’t always iterating over value types like this; in object-oriented managed languages especially, we often iterate over reference types. In that case, even with an Array List, while the pointers themselves are contiguous in memory, the objects they point to are not. The situation is still better than with a Linked List, where you will be making two pointer hops per iteration instead of one, but how does this affect the relative performance?</p> <p>It narrows the gap quite a bit, depending on the size of the objects, and the details of your hardware and software environment. 
Refactoring the example above to use Lists of small objects (12 bytes), the break even point drops to about 4 inserts per iteration:</p> <table> <thead> <tr> <th>Method</th> <th>Length</th> <th>Inserts</th> <th>Median</th> </tr> </thead> <tbody> <tr> <td>ArrayTestObject</td> <td>100000</td> <td>0</td> <td>674.1864 us</td> </tr> <tr> <td>ListTestObject</td> <td>100000</td> <td>0</td> <td>1,140.9044 us</td> </tr> <tr> <td>ArrayTestObject</td> <td>100000</td> <td>2</td> <td>959.0482 us</td> </tr> <tr> <td>ListTestObject</td> <td>100000</td> <td>2</td> <td>1,121.5423 us</td> </tr> <tr> <td>ArrayTestObject</td> <td>100000</td> <td>4</td> <td>1,230.6550 us</td> </tr> <tr> <td>ListTestObject</td> <td>100000</td> <td>4</td> <td>1,142.6658 us</td> </tr> </tbody> </table> <p><br /></p> <p>Managed C# code suffers a bit in this case because iterating over this Array List incurs some unnecessary array bounds checking. C++ vector would likely fare better. If you were really aggressive about this you could probably write a faster Array List class using unsafe C# code to avoid the array bounds checks. Also, the relative differences here will depend greatly on how your allocator and garbage collector manage the heap, how big your objects are, and other factors. Larger objects tended to cause the relative performance of the Array List to improve in my environment. In the context of a complete application the relative performance of Array List might improve as well as the heap gets more fragmented, but you will have to test to know for sure.</p> <p>As an aside, if your objects are sufficiently small (16 to 32 bytes or less, depending on various factors) you should consider making them value types (<code class="highlighter-rouge">struct</code> in .NET) instead of objects. 
Not only will you benefit greatly from contiguous memory access, but you will potentially reduce garbage collection overhead as well, depending on your usage of them:</p> <table> <thead> <tr> <th>Method</th> <th>Length</th> <th>Inserts</th> <th>Median</th> </tr> </thead> <tbody> <tr> <td>ArrayTestObject</td> <td>100000</td> <td>10</td> <td>2,094.8273 us</td> </tr> <tr> <td>ListTestObject</td> <td>100000</td> <td>10</td> <td>1,154.3014 us</td> </tr> <tr> <td>ArrayTestStruct</td> <td>100000</td> <td>10</td> <td>792.0004 us</td> </tr> <tr> <td>ListTestStruct</td> <td>100000</td> <td>10</td> <td>1,206.0713 us</td> </tr> </tbody> </table> <p><br /></p> <p>Java may handle this better since it does some automatic cleverness with small objects, or you may have to just use separate arrays of primitive types. Though onerous to type, <a href="https://software.intel.com/en-us/articles/memory-layout-transformations">this can sometimes be faster</a> than an array of structs, depending on your data access patterns. Consider it when performance matters.</p> <h3 id="make-sure-the-abstraction-is-worth-it">Make Sure the Abstraction is Worth It</h3> <p>It is common for people to object to these sorts of considerations on the basis of code clarity, correctness, and maintainability. Of course each problem domain has its own priorities, but I feel strongly that when the clarity benefit of the abstraction is small, and the performance impact is large, we should choose better performance as a rule. By taking time to understand your environment, you will be aware of cases where a faster but equally clear option exists, as is often the case with Array Lists vs Linked Lists.</p> <p>As some food for thought, here are 8 different ways to add up a list of numbers in C#, with their run times and memory costs. Checked arithmetic is used in all cases to keep the comparison with Linq fair, as its Sum method uses checked arithmetic. Notice how <em>much</em> better performing the fastest option is. 
Notice how expensive the most popular method (Linq) is. Notice that the <code class="highlighter-rouge">foreach</code> abstraction works out well with raw Arrays, but not with Array List or Linked List. Whatever your language and environment of choice is, understand these details so you can make smart default choices.</p> <table> <thead> <tr> <th>Method</th> <th>Length</th> <th>Median</th> <th>Bytes Allocated/Op</th> </tr> </thead> <tbody> <tr> <td>LinkedListLinq</td> <td>100000</td> <td>990.7718 us</td> <td>23,192.49</td> </tr> <tr> <td>RawArrayLinq</td> <td>100000</td> <td>643.8204 us</td> <td>11,856.39</td> </tr> <tr> <td>LinkedListForEach</td> <td>100000</td> <td>489.7294 us</td> <td>11,909.99</td> </tr> <tr> <td>LinkedListFor</td> <td>100000</td> <td>299.9746 us</td> <td>6,033.70</td> </tr> <tr> <td>ArrayListForEach</td> <td>100000</td> <td>270.3873 us</td> <td>6,035.88</td> </tr> <tr> <td>ArrayListFor</td> <td>100000</td> <td>97.0850 us</td> <td>1,574.32</td> </tr> <tr> <td>RawArrayForEach</td> <td>100000</td> <td>53.0535 us</td> <td>1,574.84</td> </tr> <tr> <td>RawArrayFor</td> <td>100000</td> <td>53.1745 us</td> <td>1,577.77</td> </tr> </tbody> </table> <p><br /></p> <div class="language-c# highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="p">[</span><span class="nf">Benchmark</span><span class="p">(</span><span class="n">Baseline</span> <span class="p">=</span> <span class="k">true</span><span class="p">)]</span> <span class="k">public</span> <span class="kt">int</span> <span class="nf">LinkedListLinq</span><span class="p">()</span> <span class="p">{</span> <span class="kt">var</span> <span class="n">local</span> <span class="p">=</span> <span class="n">linkedList</span><span class="p">;</span> <span class="k">return</span> <span class="n">local</span><span class="p">.</span><span class="nf">Sum</span><span class="p">();</span> <span class="p">}</span> <span class="p">[</span><span class="n">Benchmark</span><span 
class="p">]</span> <span class="k">public</span> <span class="kt">int</span> <span class="nf">LinkedListForEach</span><span class="p">()</span> <span class="p">{</span> <span class="kt">var</span> <span class="n">local</span> <span class="p">=</span> <span class="n">linkedList</span><span class="p">;</span> <span class="kt">int</span> <span class="n">sum</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span> <span class="k">checked</span> <span class="p">{</span> <span class="k">foreach</span> <span class="p">(</span><span class="kt">var</span> <span class="n">node</span> <span class="k">in</span> <span class="n">local</span><span class="p">)</span> <span class="p">{</span> <span class="n">sum</span> <span class="p">+=</span> <span class="n">node</span><span class="p">;</span> <span class="p">}</span> <span class="p">}</span> <span class="k">return</span> <span class="n">sum</span><span class="p">;</span> <span class="p">}</span> <span class="p">[</span><span class="n">Benchmark</span><span class="p">]</span> <span class="k">public</span> <span class="kt">int</span> <span class="nf">LinkedListFor</span><span class="p">()</span> <span class="p">{</span> <span class="kt">var</span> <span class="n">local</span> <span class="p">=</span> <span class="n">linkedList</span><span class="p">;</span> <span class="kt">int</span> <span class="n">sum</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span> <span class="kt">var</span> <span class="n">node</span> <span class="p">=</span> <span class="n">local</span><span class="p">.</span><span class="n">First</span><span class="p">;</span> <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span> <span class="n">i</span> <span class="p">&lt;</span> <span class="n">local</span><span class="p">.</span><span class="n">Count</span><span class="p">;</span> <span 
class="n">i</span><span class="p">++)</span> <span class="p">{</span> <span class="k">checked</span> <span class="p">{</span> <span class="n">sum</span> <span class="p">+=</span> <span class="n">node</span><span class="p">.</span><span class="n">Value</span><span class="p">;</span> <span class="n">node</span> <span class="p">=</span> <span class="n">node</span><span class="p">.</span><span class="n">Next</span><span class="p">;</span> <span class="p">}</span> <span class="p">}</span> <span class="k">return</span> <span class="n">sum</span><span class="p">;</span> <span class="p">}</span> <span class="p">[</span><span class="n">Benchmark</span><span class="p">]</span> <span class="k">public</span> <span class="kt">int</span> <span class="nf">ArrayListFor</span><span class="p">()</span> <span class="p">{</span> <span class="c1">//In C#, List&lt;T&gt; is an array backed list</span> <span class="n">List</span><span class="p">&lt;</span><span class="kt">int</span><span class="p">&gt;</span> <span class="n">local</span> <span class="p">=</span> <span class="n">arrayList</span><span class="p">;</span> <span class="kt">int</span> <span class="n">sum</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span> <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span> <span class="n">i</span> <span class="p">&lt;</span> <span class="n">local</span><span class="p">.</span><span class="n">Count</span><span class="p">;</span> <span class="n">i</span><span class="p">++)</span> <span class="p">{</span> <span class="k">checked</span> <span class="p">{</span> <span class="n">sum</span> <span class="p">+=</span> <span class="n">local</span><span class="p">[</span><span class="n">i</span><span class="p">];</span> <span class="p">}</span> <span class="p">}</span> <span class="k">return</span> <span class="n">sum</span><span class="p">;</span> 
<span class="p">}</span> <span class="p">[</span><span class="n">Benchmark</span><span class="p">]</span> <span class="k">public</span> <span class="kt">int</span> <span class="nf">ArrayListForEach</span><span class="p">()</span> <span class="p">{</span> <span class="c1">//In C#, List&lt;T&gt; is an array backed list</span> <span class="n">List</span><span class="p">&lt;</span><span class="kt">int</span><span class="p">&gt;</span> <span class="n">local</span> <span class="p">=</span> <span class="n">arrayList</span><span class="p">;</span> <span class="kt">int</span> <span class="n">sum</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span> <span class="k">checked</span> <span class="p">{</span> <span class="k">foreach</span> <span class="p">(</span><span class="kt">var</span> <span class="n">x</span> <span class="k">in</span> <span class="n">local</span><span class="p">)</span> <span class="p">{</span> <span class="n">sum</span> <span class="p">+=</span> <span class="n">x</span><span class="p">;</span> <span class="p">}</span> <span class="p">}</span> <span class="k">return</span> <span class="n">sum</span><span class="p">;</span> <span class="p">}</span> <span class="p">[</span><span class="n">Benchmark</span><span class="p">]</span> <span class="k">public</span> <span class="kt">int</span> <span class="nf">RawArrayLinq</span><span class="p">()</span> <span class="p">{</span> <span class="kt">int</span><span class="p">[]</span> <span class="n">local</span> <span class="p">=</span> <span class="n">rawArray</span><span class="p">;</span> <span class="k">return</span> <span class="n">local</span><span class="p">.</span><span class="nf">Sum</span><span class="p">();</span> <span class="p">}</span> <span class="p">[</span><span class="n">Benchmark</span><span class="p">]</span> <span class="k">public</span> <span class="kt">int</span> <span class="nf">RawArrayForEach</span><span class="p">()</span> <span class="p">{</span> <span 
class="kt">int</span><span class="p">[]</span> <span class="n">local</span> <span class="p">=</span> <span class="n">rawArray</span><span class="p">;</span> <span class="kt">int</span> <span class="n">sum</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span> <span class="k">checked</span> <span class="p">{</span> <span class="k">foreach</span> <span class="p">(</span><span class="kt">var</span> <span class="n">x</span> <span class="k">in</span> <span class="n">local</span><span class="p">)</span> <span class="p">{</span> <span class="n">sum</span> <span class="p">+=</span> <span class="n">x</span><span class="p">;</span> <span class="p">}</span> <span class="p">}</span> <span class="k">return</span> <span class="n">sum</span><span class="p">;</span> <span class="p">}</span> <span class="p">[</span><span class="n">Benchmark</span><span class="p">]</span> <span class="k">public</span> <span class="kt">int</span> <span class="nf">RawArrayFor</span><span class="p">()</span> <span class="p">{</span> <span class="kt">int</span><span class="p">[]</span> <span class="n">local</span> <span class="p">=</span> <span class="n">rawArray</span><span class="p">;</span> <span class="kt">int</span> <span class="n">sum</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span> <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span> <span class="n">i</span> <span class="p">&lt;</span> <span class="n">local</span><span class="p">.</span><span class="n">Length</span><span class="p">;</span> <span class="n">i</span><span class="p">++)</span> <span class="p">{</span> <span class="k">checked</span> <span class="p">{</span> <span class="n">sum</span> <span class="p">+=</span> <span class="n">local</span><span class="p">[</span><span class="n">i</span><span class="p">];</span> <span class="p">}</span> <span 
class="p">}</span> <span class="k">return</span> <span class="n">sum</span><span class="p">;</span> <span class="p">}</span> </code></pre></div></div> <script src="https://code.jquery.com/jquery-2.2.4.min.js" type="text/javascript"></script> <script src="https://code.highcharts.com/highcharts.js"></script> <script src="/js/highchartTable-min.js" type="text/javascript"></script> <script> $(document).ready(function() { $('table.highchart').highchartTable(); }); </script> Sat, 20 Aug 2016 19:17:27 +0000 https://jackmott.github.io//programming/2016/08/20/when-bigo-foolsya.html https://jackmott.github.io//programming/2016/08/20/when-bigo-foolsya.html programming Adventures in F# Performance <p>Apologies to functional programming enthusiasts: what follows is a lot of imperative code. What can I say, it is the array library after all!</p> <p>After working on an F# <a href="https://github.com/jackmott/SIMDArray">SIMD Array library</a> for a while, and learning about some nice benchmarking tools for .NET thanks to <a href="https://twitter.com/cloudRoutine">Jared Hester</a>, I got the idea to try contributing to the F# core libraries myself. I had been poking around in the official <a href="https://github.com/Microsoft/visualfsharp">Microsoft F# repo</a> because I was modeling my SIMD Library after the <a href="https://github.com/Microsoft/visualfsharp/blob/master/src/fsharp/FSharp.Core/array.fs">core Array library</a>, duplicating all relevant functions in SIMD form. As I got familiar with the code I saw a function I thought I could speed up. Steffen Forkmann pointed me to a <a href="http://www.navision-blog.de/blog/2016/04/25/make-failure-great-again-a-small-journey-into-the-f-compiler/">blog post of his</a> about how to get started building and contributing to the F# language, so I got to work.</p> <h3 id="arrayfilter">Array.filter</h3> <p>This was the first function I thought I could improve, and mostly I was wrong! 
<code class="highlighter-rouge">Array.filter</code> takes an array and a predicate function as its arguments and applies the function to each element of the array. The resulting array contains only the elements that satisfy the predicate. The original implementation used a List&lt;T&gt;, which is a .NET collection similar to a C++ vector: an array backed List that doubles in size as you add items and fill it up. Each time you fill it up, you have to allocate a whole new array and discard the old one. This leads to a worst case scenario where, if the array's length just exceeds a power of 2, like 1025, and no elements are filtered out, you end up allocating 3,836 elements when you only needed 1025. And then you allocate another 1025 to copy the array out of the List. But in the best case, when everything is filtered out, you allocate only a handful of bytes for List&lt;T&gt; overhead:</p> <div class="language-ocaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">let</span> <span class="n">filter</span> <span class="n">f</span> <span class="p">(</span><span class="kt">array</span><span class="o">:</span> <span class="n">_</span><span class="bp">[]</span><span class="p">)</span> <span class="o">=</span> <span class="n">checkNonNull</span> <span class="s2">"array"</span> <span class="kt">array</span> <span class="k">let</span> <span class="n">res</span> <span class="o">=</span> <span class="nc">List</span><span class="o">&lt;_&gt;</span><span class="bp">()</span> <span class="o">//</span> <span class="nc">ResizeArray</span> <span class="k">for</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span> <span class="k">to</span> <span class="kt">array</span><span class="o">.</span><span class="nc">Length</span> <span class="o">-</span> <span class="mi">1</span> <span class="k">do</span> <span class="k">let</span> <span class="n">x</span> <span class="o">=</span> <span class="kt">array</span><span class="o">.</span><span
class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="k">if</span> <span class="n">f</span> <span class="n">x</span> <span class="k">then</span> <span class="n">res</span><span class="o">.</span><span class="nc">Add</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="n">res</span><span class="o">.</span><span class="nc">ToArray</span><span class="bp">()</span> </code></pre></div></div> <p>I tried a few things, and settled on this for a while:</p> <div class="language-ocaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">let</span> <span class="n">filter</span> <span class="n">f</span> <span class="p">(</span><span class="kt">array</span><span class="o">:</span> <span class="n">_</span><span class="bp">[]</span><span class="p">)</span> <span class="o">=</span> <span class="n">checkNonNull</span> <span class="s2">"array"</span> <span class="kt">array</span> <span class="k">let</span> <span class="n">temp</span> <span class="o">=</span> <span class="nn">Array</span><span class="p">.</span><span class="n">zeroCreateUnchecked</span> <span class="kt">array</span><span class="o">.</span><span class="nc">Length</span> <span class="k">let</span> <span class="k">mutable</span> <span class="n">c</span> <span class="o">=</span> <span class="mi">0</span> <span class="k">for</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span> <span class="k">to</span> <span class="kt">array</span><span class="o">.</span><span class="nc">Length</span><span class="o">-</span><span class="mi">1</span> <span class="k">do</span> <span class="k">if</span> <span class="n">f</span> <span class="kt">array</span><span class="o">.</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="k">then</span> <span class="n">temp</span><span class="o">.</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span 
class="o">&lt;-</span> <span class="bp">true</span> <span class="n">c</span> <span class="o">&lt;-</span> <span class="n">c</span> <span class="o">+</span> <span class="mi">1</span> <span class="k">let</span> <span class="n">result</span> <span class="o">=</span> <span class="nn">Array</span><span class="p">.</span><span class="n">zeroCreateUnchecked</span> <span class="n">c</span> <span class="n">c</span> <span class="o">&lt;-</span> <span class="mi">0</span> <span class="k">let</span> <span class="k">mutable</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span> <span class="k">while</span> <span class="n">c</span> <span class="o">&lt;</span> <span class="n">result</span><span class="o">.</span><span class="nc">Length</span> <span class="k">do</span> <span class="k">if</span> <span class="n">temp</span><span class="o">.</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="k">then</span> <span class="n">result</span><span class="o">.</span><span class="p">[</span><span class="n">c</span><span class="p">]</span> <span class="o">&lt;-</span> <span class="kt">array</span><span class="o">.</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="n">c</span> <span class="o">&lt;-</span> <span class="n">c</span> <span class="o">+</span> <span class="mi">1</span> <span class="n">i</span> <span class="o">&lt;-</span> <span class="n">i</span> <span class="o">+</span> <span class="mi">1</span> <span class="n">result</span> </code></pre></div></div> <p>This allocates an array of booleans the same length as the input up front, which are usually stored as bytes in .NET. So in the common case, where you have a 32bit or 64bit pointer, int, or float, as your array element, it will allocate no more than 1/8 to 1/4 of your array size in extra data instead of 3x to 4x. Reducing GC pressure is a big win with garbage collected languages so that seemed like a good thing. 
There are some gotchas though:</p> <ul> <li>The loops now both have branches in them.</li> <li>The branch pattern will sometimes be random, so the branch predictor will often guess wrong, which is slow.</li> <li>The performance advantage goes negative compared to the original implementation as the size of the array's element type shrinks.</li> </ul> <p>So in cases where most things are filtered, and the distribution of elements is somewhat random as to whether they get filtered or not, performance was sometimes worse. Performance also differed in 32bit vs 64bit builds, and on different machines. Benchmarking this was really hard because you have to account for different element type sizes, different array lengths, different distributions of filtering, and different amounts of filtering. It didn’t always win, and it was hard to decide if it was really better.</p> <p>Then <a href="https://github.com/asik">Asik</a> suggested a solution which ended up being the final answer:</p> <div class="language-ocaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">let</span> <span class="n">filter</span> <span class="n">f</span> <span class="p">(</span><span class="kt">array</span><span class="o">:</span> <span class="n">_</span><span class="bp">[]</span><span class="p">)</span> <span class="o">=</span> <span class="n">checkNonNull</span> <span class="s2">"array"</span> <span class="kt">array</span> <span class="k">let</span> <span class="n">res</span> <span class="o">=</span> <span class="nn">Array</span><span class="p">.</span><span class="n">zeroCreateUnchecked</span> <span class="kt">array</span><span class="o">.</span><span class="nc">Length</span> <span class="k">let</span> <span class="k">mutable</span> <span class="n">count</span> <span class="o">=</span> <span class="mi">0</span> <span class="k">for</span> <span class="n">x</span> <span class="k">in</span> <span class="kt">array</span> <span class="k">do</span> <span class="k">if</span> <span class="n">f</span> <span
class="n">x</span> <span class="k">then</span> <span class="n">res</span><span class="o">.</span><span class="p">[</span><span class="n">count</span><span class="p">]</span> <span class="o">&lt;-</span> <span class="n">x</span> <span class="n">count</span> <span class="o">&lt;-</span> <span class="n">count</span> <span class="o">+</span> <span class="mi">1</span> <span class="nn">Array</span><span class="p">.</span><span class="n">subUnchecked</span> <span class="mi">0</span> <span class="n">count</span> <span class="n">res</span> </code></pre></div></div> <p>This just allocates an entire array of whatever type was input, adds elements into it, and then uses Array.sub, which calls fast native code to copy sub sections of arrays into new ones. This was <em>always</em> faster than the original core lib solution, but sometimes allocated more memory. The Microsoft guys considered that a net win, so they took it. The improvement varied a lot, but it was usually around 20% faster. Worst case performance would be with large element types (like a 16 byte struct) where most elements are likely to get filtered. You might want to roll your own filter if you are doing that. This same optimization was applied to the similar Array.choose function.</p> <h4 id="update">UPDATE:</h4> <p><a href="https://github.com/asik">Asik</a> and I have collaborated and got a <a href="https://github.com/jackmott/visualfsharp/blob/497f11a9bc7367af79264090096d3bf6da2b6903/src/fsharp/FSharp.Core/array.fs#L487">new filter merged</a> that keeps the speed of the above solution, while reducing allocations by ~30% on average. We did this by implementing a growing array by hand, taking advantage of extra knowledge we have, like that the upper bound for its size is array.Length, and some other tricks. 
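</p> <p>A minimal sketch of the hand-grown buffer idea follows. It is not the merged FSharp.Core code: the name <code class="highlighter-rouge">filterGrowing</code>, the starting capacity of 16, and the doubling policy are all assumptions made for illustration. The only thing it takes from the description above is that the buffer never needs to grow past <code class="highlighter-rouge">array.Length</code>:</p> <div class="language-ocaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Hypothetical sketch: filterGrowing is an illustrative name, not part of FSharp.Core.
let filterGrowing (f: 'T -&gt; bool) (array: 'T[]) =
    if isNull (box array) then nullArg "array"
    // Assumed starting capacity of 16, capped by the input length.
    let mutable buffer : 'T[] = Array.zeroCreate (min 16 array.Length)
    let mutable count = 0
    for x in array do
        if f x then
            if count = buffer.Length then
                // Double the capacity, but never past the known upper bound.
                let grown : 'T[] = Array.zeroCreate (min (buffer.Length * 2) array.Length)
                Array.blit buffer 0 grown 0 count
                buffer &lt;- grown
            buffer.[count] &lt;- x
            count &lt;- count + 1
    // Skip the trimming copy when the buffer happens to be exactly full.
    if count = buffer.Length then buffer else Array.sub buffer 0 count
</code></pre></div></div> <p>Under these assumptions, <code class="highlighter-rouge">filterGrowing (fun x -&gt; x % 2 = 0) [| 1 .. 10 |]</code> returns <code class="highlighter-rouge">[| 2; 4; 6; 8; 10 |]</code>. When few elements pass the predicate the buffer stays small, and when all of them pass it stops growing at exactly <code class="highlighter-rouge">array.Length</code> and is returned without a final copy.</p> <p>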
Another <a href="https://gist.github.com/manofstick/229384a1bd0bdb26caf9a780b952d9b8#file-filter_bench-fs-L89">interesting solution</a> is being proposed by <a href="https://gist.github.com/manofstick">Paul Westcott</a> which uses a bit array. This may reduce allocations yet again while maintaining similar performance, pretty cool.</p> <p>As an aside, if your predicate is a pure function, and a fast function, you can <a href="https://github.com/jackmott/SIMDArray/blob/master/src/SIMDArray/SIMDArray.fs#L1076">apply the predicate twice</a> to avoid any extra allocations at all. This is <em>very</em> fast for sufficiently simple predicates, like &gt; or &lt; comparisons.</p> <h4 id="performance-test-results-for-filtering-50-of-random-doubles-on-64bit-ryujit">Performance test results for filtering 50% of random doubles on 64bit RyuJit</h4> <table> <thead> <tr> <th>Method</th> <th>Median</th> <th>StdDev</th> <th>Gen 0</th> <th>Gen 1</th> <th>Gen 2</th> <th>Bytes Allocated/Op</th> </tr> </thead> <tbody> <tr> <td>CoreFilter</td> <td>10.7906 ms</td> <td>0.2096 ms</td> <td>20.00</td> <td>-</td> <td>314.00</td> <td>3,953,196.34</td> </tr> <tr> <td>ArrayFilter</td> <td>8.3605 ms</td> <td>0.0374 ms</td> <td>-</td> <td>-</td> <td>329.99</td> <td>3,762,296.97</td> </tr> </tbody> </table> <p><br /></p> <h3 id="arraypartition">Array.partition</h3> <p>I wasn’t too happy with the filter optimization because I felt like sometimes taking a memory hit wasn’t so great. So I started scanning through the library for other opportunities, and came across Array.partition, which takes an array and a predicate, returning a tuple with two arrays. 
One array contains every element that was true, the other every element that was false.</p> <div class="language-ocaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">let</span> <span class="n">partition</span> <span class="n">f</span> <span class="p">(</span><span class="kt">array</span><span class="o">:</span> <span class="n">_</span><span class="bp">[]</span><span class="p">)</span> <span class="o">=</span> <span class="n">checkNonNull</span> <span class="s2">"array"</span> <span class="kt">array</span> <span class="k">let</span> <span class="n">res1</span> <span class="o">=</span> <span class="nc">List</span><span class="o">&lt;_&gt;</span><span class="bp">()</span> <span class="o">//</span> <span class="nc">ResizeArray</span> <span class="k">let</span> <span class="n">res2</span> <span class="o">=</span> <span class="nc">List</span><span class="o">&lt;_&gt;</span><span class="bp">()</span> <span class="o">//</span> <span class="nc">ResizeArray</span> <span class="k">for</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span> <span class="k">to</span> <span class="kt">array</span><span class="o">.</span><span class="nc">Length</span> <span class="o">-</span> <span class="mi">1</span> <span class="k">do</span> <span class="k">let</span> <span class="n">x</span> <span class="o">=</span> <span class="kt">array</span><span class="o">.</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="k">if</span> <span class="n">f</span> <span class="n">x</span> <span class="k">then</span> <span class="n">res1</span><span class="o">.</span><span class="nc">Add</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="k">else</span> <span class="n">res2</span><span class="o">.</span><span class="nc">Add</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="n">res1</span><span class="o">.</span><span 
class="nc">ToArray</span><span class="bp">()</span><span class="o">,</span> <span class="n">res2</span><span class="o">.</span><span class="nc">ToArray</span><span class="bp">()</span> </code></pre></div></div> <p>I had more respect for the (array backed) List solutions now, after failing to get a clear win by using raw arrays with filter. So I tried to look for something more clever. I realized that one invariant here is that the two results together will always be the same size as the input. If the input is 100 elements, the outputs will total 100 elements. So fundamentally, we shouldn’t need to use a data structure that grows. I thought about creating a struct where you could tag each element with a true or false on the first pass, and then copy the results into the two output arrays. But that still wastes array.Length bytes of memory. Then I had a great idea, maybe my best idea! Allocate an array the same size and type as the input, and put all the true elements on the left, and all the false elements on the right! The only memory wasted is an extra int to keep track of where one set ends and the other begins. 
You then just copy the left side of the array into the first result, and the reverse of the right side of the array into the second result:</p> <div class="language-ocaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">let</span> <span class="n">partition</span> <span class="n">f</span> <span class="p">(</span><span class="kt">array</span><span class="o">:</span> <span class="n">_</span><span class="bp">[]</span><span class="p">)</span> <span class="o">=</span> <span class="n">checkNonNull</span> <span class="s2">"array"</span> <span class="kt">array</span> <span class="k">let</span> <span class="n">res</span> <span class="o">=</span> <span class="nn">Array</span><span class="p">.</span><span class="n">zeroCreateUnchecked</span> <span class="kt">array</span><span class="o">.</span><span class="nc">Length</span> <span class="k">let</span> <span class="k">mutable</span> <span class="n">upCount</span> <span class="o">=</span> <span class="mi">0</span> <span class="k">let</span> <span class="k">mutable</span> <span class="n">downCount</span> <span class="o">=</span> <span class="kt">array</span><span class="o">.</span><span class="nc">Length</span><span class="o">-</span><span class="mi">1</span> <span class="k">for</span> <span class="n">x</span> <span class="k">in</span> <span class="kt">array</span> <span class="k">do</span> <span class="k">if</span> <span class="n">f</span> <span class="n">x</span> <span class="k">then</span> <span class="n">res</span><span class="o">.</span><span class="p">[</span><span class="n">upCount</span><span class="p">]</span> <span class="o">&lt;-</span> <span class="n">x</span> <span class="n">upCount</span> <span class="o">&lt;-</span> <span class="n">upCount</span> <span class="o">+</span> <span class="mi">1</span> <span class="k">else</span> <span class="n">res</span><span class="o">.</span><span class="p">[</span><span class="n">downCount</span><span class="p">]</span> <span 
class="o">&lt;-</span> <span class="n">x</span> <span class="n">downCount</span> <span class="o">&lt;-</span> <span class="n">downCount</span> <span class="o">-</span> <span class="mi">1</span> <span class="k">let</span> <span class="n">res1</span> <span class="o">=</span> <span class="nn">Array</span><span class="p">.</span><span class="n">subUnchecked</span> <span class="mi">0</span> <span class="n">upCount</span> <span class="n">res</span> <span class="k">let</span> <span class="n">res2</span> <span class="o">=</span> <span class="nn">Array</span><span class="p">.</span><span class="n">zeroCreateUnchecked</span> <span class="p">(</span><span class="kt">array</span><span class="o">.</span><span class="nc">Length</span> <span class="o">-</span> <span class="n">upCount</span><span class="p">)</span> <span class="n">downCount</span> <span class="o">&lt;-</span> <span class="kt">array</span><span class="o">.</span><span class="nc">Length</span><span class="o">-</span><span class="mi">1</span> <span class="k">for</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span> <span class="k">to</span> <span class="n">res2</span><span class="o">.</span><span class="nc">Length</span><span class="o">-</span><span class="mi">1</span> <span class="k">do</span> <span class="n">res2</span><span class="o">.</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">&lt;-</span> <span class="n">res</span><span class="o">.</span><span class="p">[</span><span class="n">downCount</span><span class="p">]</span> <span class="n">downCount</span> <span class="o">&lt;-</span> <span class="n">downCount</span> <span class="o">-</span> <span class="mi">1</span> <span class="n">res1</span> <span class="o">,</span> <span class="n">res2</span> </code></pre></div></div> <h4 id="performance-test-results-for-partitioning-random-int-arrays-with-predicate-fun-x---x--2--0">Performance test results for partitioning random int arrays with 
predicate <code class="highlighter-rouge">(fun x -&gt; x % 2 = 0)</code></h4> <table> <thead> <tr> <th>Method</th> <th>ArrayLength</th> <th>Median</th> <th>StdDev</th> <th>Scaled</th> <th>Gen 0</th> <th>Gen 1</th> <th>Gen 2</th> <th>Bytes Allocated/Op</th> </tr> </thead> <tbody> <tr> <td>Partition</td> <td>10</td> <td>180.8758 ns</td> <td>16.3650 ns</td> <td>1.00</td> <td>0.01</td> <td>-</td> <td>-</td> <td>185.22</td> </tr> <tr> <td>NewPartition</td> <td>10</td> <td>76.6145 ns</td> <td>1.2114 ns</td> <td>0.42</td> <td>0.01</td> <td>-</td> <td>-</td> <td>90.38</td> </tr> <tr> <td>Partition</td> <td>10000</td> <td>117,268.5175 ns</td> <td>1,064.2667 ns</td> <td>1.00</td> <td>6.40</td> <td>-</td> <td>-</td> <td>99,742.26</td> </tr> <tr> <td>NewPartition</td> <td>10000</td> <td>79,020.6291 ns</td> <td>474.4149 ns</td> <td>0.67</td> <td>2.64</td> <td>-</td> <td>-</td> <td>43,572.00</td> </tr> <tr> <td>Partition</td> <td>10000000</td> <td>154,545,402.8213 ns</td> <td>3,116,253.3692 ns</td> <td>1.00</td> <td>-</td> <td>-</td> <td>62.02</td> <td>59,133,643.66</td> </tr> <tr> <td>NewPartition</td> <td>10000000</td> <td>98,768,489.7225 ns</td> <td>726,198.4079 ns</td> <td>0.64</td> <td>-</td> <td>-</td> <td>34.00</td> <td>29,686,956.03</td> </tr> </tbody> </table> <p><br /></p> <h3 id="adventures-in-il-and-dissasembly">Adventures in IL and Disassembly</h3> <p>One of the performance drawbacks of most managed/safe languages is that they do array bounds checking. This prevents you from accidentally wandering off the end of an array and overwriting memory at random, which is a useful feature. But it comes with a performance cost, as you end up spending some CPU cycles checking array bounds each time through the loop. The .NET JIT will identify <a href="https://blogs.msdn.microsoft.com/clrcodegeneration/2009/08/13/array-bounds-check-elimination-in-the-clr/">some but not all</a> cases when these bounds checks can be eliminated. 
You have to take some care to structure your loop just right, or the elimination will be missed. F# adds some confusion here since its loops have different syntax than C#, and <a href="https://github.com/Microsoft/visualfsharp/issues/1419">sometimes compile strangely, or badly</a>. You can peek at the byte code or C# equivalent representation of it with tools like <a href="http://ilspy.net/">ILSpy</a>. For instance, this loop:</p> <div class="language-ocaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">let</span> <span class="n">len</span> <span class="o">=</span> <span class="kt">array</span><span class="o">.</span><span class="nc">Length</span> <span class="k">for</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span> <span class="k">to</span> <span class="n">len</span><span class="o">-</span><span class="mi">1</span> <span class="k">do</span> <span class="c">(* stuff *)</span> </code></pre></div></div> <p>compiles to the C# equivalent of:</p> <div class="language-c# highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">num</span> <span class="p">=</span> <span class="n">len</span> <span class="p">-</span> <span class="m">1</span><span class="p">;</span> <span class="k">if</span> <span class="p">(</span><span class="n">num</span> <span class="p">&gt;=</span> <span class="n">i</span><span class="p">)</span> <span class="p">{</span> <span class="k">do</span> <span class="p">{</span> <span class="c1">// stuff</span> <span class="n">i</span><span class="p">++;</span> <span class="p">}</span> <span class="k">while</span> <span class="p">(</span><span class="n">i</span> <span class="p">!=</span> <span class="n">num</span> <span class="p">+</span> <span class="m">1</span><span class="p">);</span> <span class="p">}</span> </code></pre></div></div> <p>This is madness. Maybe some of that madness gets JITted away, but it definitely does cause the array 
bounds elision to be missed, slowing it down. This was a pattern used in many places in the core Array library, so I just went through and mechanically replaced them all with the pattern that works:</p> <div class="language-ocaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span> <span class="k">to</span> <span class="kt">array</span><span class="o">.</span><span class="nc">Length</span><span class="o">-</span><span class="mi">1</span> <span class="k">do</span> <span class="c">(* stuff *)</span> </code></pre></div></div> <p>Which becomes the C# equivalent of:</p> <div class="language-c# highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span> <span class="n">i</span> <span class="p">&lt;</span> <span class="n">array</span><span class="p">.</span><span class="n">Length</span><span class="p">;</span> <span class="n">i</span><span class="p">++)</span> <span class="p">{</span> <span class="c1">//stuff</span> <span class="p">}</span> </code></pre></div></div> <p>AHHHH, much better, and now we get array bounds elision from the JIT too. The impact of this change can be pretty big in some cases: when the function applied per array element is very simple, the array bounds checking makes up a sizeable percentage of total run time. In other cases the impact will be very small. Array.map (fun x-&gt; SieveofEratosthenes x) isn’t going to be noticeably better. But it impacted almost all of the functions in the Array module, and I assume (which is dangerous) it would take some overhead out of JITing the IL as well.</p> <p>If you want to know for sure if the loop is doing what you want, since the 32-bit JIT differs from the 64-bit one, which differs from Mono, and so on, you will need to view the disassembly. 
In Visual Studio you can get it from Debug -&gt; Windows -&gt; Disassembly while the program is running. Here is an example of code with, and without, a bounds check:</p> <p><img src="/images/dissassembly.png" alt="Disassembly Example" /></p> <p>Since this process is done in the JIT, you don’t always have control over it. Sometimes you can massage your code to be sure the JIT will do the right thing, but sometimes you can’t. If you get desperate, write the function in C# using an unsafe loop, and call it from F#.</p> <h4 id="other-loop-patterns-to-beware-of-in-net">Other loop patterns to beware of in .NET:</h4> <ul> <li>For loops that go from 0 to anything <em>less</em> than the array length will not get the bounds check elided.</li> <li>For loops that go backwards will not get array bounds checking elided.</li> <li>With for loops over arrays in F# that have a stride length of something other than 1, the compiler generates a loop that uses an Enumerator, which is much slower and generates garbage. Use a while loop, or tail recursion instead.</li> <li>For loops over arrays that are class members will miss the array bounds elision. Make a function local copy of the array reference first.</li> <li>The <code class="highlighter-rouge">for x in array</code> syntax in F# works out fine. 
There may be other performance considerations, but a normal for loop is generated and bounds checking is elided.</li> </ul> <p><em>These things are all true as of the 64-bit RyuJIT in .NET 4.6.2 and F# 4.4.0; some of them are being actively worked on and could improve soon.</em></p> <h4 id="performance-test-results-of-bounds--check-elision-from-arraymap-with-mapping-function-fun-x---x--1">Performance test results of bounds check elision from <code class="highlighter-rouge">Array.map</code> with mapping function <code class="highlighter-rouge">(fun x -&gt; x + 1)</code></h4> <table> <thead> <tr> <th>Method</th> <th>Length</th> <th>Median</th> <th>StdDev</th> <th>Scaled</th> </tr> </thead> <tbody> <tr> <td>Old</td> <td>10</td> <td>17.5030 ns</td> <td>0.5275 ns</td> <td>1.00</td> </tr> <tr> <td>New</td> <td>10</td> <td>14.1205 ns</td> <td>0.4858 ns</td> <td>0.81</td> </tr> <tr> <td>Old</td> <td>10000</td> <td>10,212.8762 ns</td> <td>118.7990 ns</td> <td>1.00</td> </tr> <tr> <td>New</td> <td>10000</td> <td>8,963.2690 ns</td> <td>329.8907 ns</td> <td>0.88</td> </tr> </tbody> </table> <p><br /></p> <h3 id="delving-into-parallel">Delving Into Parallel</h3> <p>The Array module has a submodule, Parallel. <code class="highlighter-rouge">Array.Parallel.map</code>, for instance, will use a <code class="highlighter-rouge">Parallel.For</code> loop to multithread your map operation. 
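</p> <p>As a rough illustration of that idea, here is a minimal sketch of a map built on <code class="highlighter-rouge">Parallel.For</code>. The name <code class="highlighter-rouge">parallelMap</code> is hypothetical, and this is not the actual FSharp.Core source, just the general shape such a function might take:</p>
<div class="language-ocaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code>open System.Threading.Tasks

(* Hypothetical helper for illustration, not the FSharp.Core implementation *)
let parallelMap (f: 'T -&gt; 'U) (array: 'T[]) : 'U[] =
    (* One output slot per input element, so the result never has to grow *)
    let result : 'U[] = Array.zeroCreate array.Length
    (* Each index is written by exactly one iteration, so no locking is needed *)
    Parallel.For(0, array.Length, fun i -&gt; result.[i] &lt;- f array.[i]) |&gt; ignore
    result

let squares = parallelMap (fun x -&gt; x * x) [| 1; 2; 3; 4 |]
(* squares = [| 1; 4; 9; 16 |] *)
</code></pre></div></div>
<p>Because every iteration touches only its own index, the iterations never conflict. That stops being true as soon as they share a counter.</p> <p>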
Scanning through these I saw <code class="highlighter-rouge">Parallel.partition</code>:</p> <div class="language-ocaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">let</span> <span class="n">partition</span> <span class="n">predicate</span> <span class="p">(</span><span class="kt">array</span> <span class="o">:</span> <span class="k">'</span><span class="nc">T</span><span class="bp">[]</span><span class="p">)</span> <span class="o">=</span> <span class="n">checkNonNull</span> <span class="s2">"array"</span> <span class="kt">array</span> <span class="k">let</span> <span class="n">inputLength</span> <span class="o">=</span> <span class="kt">array</span><span class="o">.</span><span class="nc">Length</span> <span class="k">let</span> <span class="n">lastInputIndex</span> <span class="o">=</span> <span class="n">inputLength</span> <span class="o">-</span> <span class="mi">1</span> <span class="k">let</span> <span class="n">isTrue</span> <span class="o">=</span> <span class="nn">Array</span><span class="p">.</span><span class="n">zeroCreateUnchecked</span> <span class="n">inputLength</span> <span class="nn">Parallel</span><span class="p">.</span><span class="nc">For</span><span class="p">(</span><span class="mi">0</span><span class="o">,</span> <span class="n">inputLength</span><span class="o">,</span> <span class="k">fun</span> <span class="n">i</span> <span class="o">-&gt;</span> <span class="n">isTrue</span><span class="o">.</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">&lt;-</span> <span class="n">predicate</span> <span class="kt">array</span><span class="o">.</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="p">)</span> <span class="o">|&gt;</span> <span class="n">ignore</span> <span class="k">let</span> <span class="k">mutable</span> <span class="bp">true</span><span class="nc">Length</span> <span class="o">=</span> <span 
class="mi">0</span> <span class="k">for</span> <span class="n">i</span> <span class="k">in</span> <span class="mi">0</span> <span class="o">..</span> <span class="n">lastInputIndex</span> <span class="k">do</span> <span class="k">if</span> <span class="n">isTrue</span><span class="o">.</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="k">then</span> <span class="bp">true</span><span class="nc">Length</span> <span class="o">&lt;-</span> <span class="bp">true</span><span class="nc">Length</span> <span class="o">+</span> <span class="mi">1</span> <span class="k">let</span> <span class="bp">true</span><span class="nc">Result</span> <span class="o">=</span> <span class="nn">Array</span><span class="p">.</span><span class="n">zeroCreateUnchecked</span> <span class="bp">true</span><span class="nc">Length</span> <span class="k">let</span> <span class="bp">false</span><span class="nc">Result</span> <span class="o">=</span> <span class="nn">Array</span><span class="p">.</span><span class="n">zeroCreateUnchecked</span> <span class="p">(</span><span class="n">inputLength</span> <span class="o">-</span> <span class="bp">true</span><span class="nc">Length</span><span class="p">)</span> <span class="k">let</span> <span class="k">mutable</span> <span class="n">iTrue</span> <span class="o">=</span> <span class="mi">0</span> <span class="k">let</span> <span class="k">mutable</span> <span class="n">iFalse</span> <span class="o">=</span> <span class="mi">0</span> <span class="k">for</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span> <span class="k">to</span> <span class="n">lastInputIndex</span> <span class="k">do</span> <span class="k">if</span> <span class="n">isTrue</span><span class="o">.</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="k">then</span> <span class="bp">true</span><span class="nn">Result</span><span class="p">.[</span><span class="n">iTrue</span><span 
class="p">]</span> <span class="o">&lt;-</span> <span class="kt">array</span><span class="o">.</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="n">iTrue</span> <span class="o">&lt;-</span> <span class="n">iTrue</span> <span class="o">+</span> <span class="mi">1</span> <span class="k">else</span> <span class="bp">false</span><span class="nn">Result</span><span class="p">.[</span><span class="n">iFalse</span><span class="p">]</span> <span class="o">&lt;-</span> <span class="kt">array</span><span class="o">.</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="n">iFalse</span> <span class="o">&lt;-</span> <span class="n">iFalse</span> <span class="o">+</span> <span class="mi">1</span> <span class="p">(</span><span class="bp">true</span><span class="nc">Result</span><span class="o">,</span> <span class="bp">false</span><span class="nc">Result</span><span class="p">)</span> </code></pre></div></div> <p>What stuck out at me here was that they were iterating over the entire isTrue array a second time in order to count up how many true elements there are. This struck me as fundamentally unnecessary. So I tried creating an accumulation variable above the Parallel.For call, and just incrementing that within the loop. Nope! You can’t add in parallel like that safely on x86 (or perhaps any architecture?). An unsynchronized increment from several threads is a race, so it worked sometimes but not always. Then I remembered <code class="highlighter-rouge">System.Threading.Interlocked.Increment(Int32)</code>, which provides a thread safe way to increment an int. This worked! But then it was just as slow as the scalar version of the function, since every thread was constantly contending on the interlocked increment. So I <a href="https://msdn.microsoft.com/en-us/library/system.threading.tasks.parallel.for(v=vs.110).aspx">read the documentation</a>. Sometimes this stuff is awful to read. <code class="highlighter-rouge">Func&lt;Int32&gt;</code>? 
<code class="highlighter-rouge">Action&lt;TLocal&gt;</code>? PC LOAD LETTER?!?! But if you go slow and stare at this for a while it will start to make sense. The key info here is that there is a <code class="highlighter-rouge">Parallel.For</code> loop which can internally keep track of an accumulator for you. This will let us track the total number of true elements without iterating over the array again. So the new solution becomes:</p> <div class="language-ocaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">let</span> <span class="n">partition</span> <span class="n">predicate</span> <span class="p">(</span><span class="kt">array</span> <span class="o">:</span> <span class="k">'</span><span class="nc">T</span><span class="bp">[]</span><span class="p">)</span> <span class="o">=</span> <span class="n">checkNonNull</span> <span class="s2">"array"</span> <span class="kt">array</span> <span class="k">let</span> <span class="n">inputLength</span> <span class="o">=</span> <span class="kt">array</span><span class="o">.</span><span class="nc">Length</span> <span class="k">let</span> <span class="n">isTrue</span> <span class="o">=</span> <span class="nn">Array</span><span class="p">.</span><span class="n">zeroCreateUnchecked</span> <span class="n">inputLength</span> <span class="k">let</span> <span class="k">mutable</span> <span class="bp">true</span><span class="nc">Length</span> <span class="o">=</span> <span class="mi">0</span> <span class="nn">Parallel</span><span class="p">.</span><span class="nc">For</span><span class="p">(</span><span class="mi">0</span><span class="o">,</span> <span class="n">inputLength</span><span class="o">,</span> <span class="p">(</span><span class="k">fun</span> <span class="bp">()</span> <span class="o">-&gt;</span> <span class="mi">0</span><span class="p">)</span><span class="o">,</span> <span class="p">(</span><span class="k">fun</span> <span class="n">i</span> <span class="n">_</span> <span 
class="bp">true</span><span class="nc">Count</span> <span class="o">-&gt;</span> <span class="k">if</span> <span class="n">predicate</span> <span class="kt">array</span><span class="o">.</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="k">then</span> <span class="n">isTrue</span><span class="o">.</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">&lt;-</span> <span class="bp">true</span> <span class="bp">true</span><span class="nc">Count</span> <span class="o">+</span> <span class="mi">1</span> <span class="k">else</span> <span class="bp">true</span><span class="nc">Count</span><span class="p">)</span><span class="o">,</span> <span class="nc">Action</span><span class="o">&lt;</span><span class="kt">int</span><span class="o">&gt;</span> <span class="p">(</span><span class="k">fun</span> <span class="n">x</span> <span class="o">-&gt;</span> <span class="nn">System</span><span class="p">.</span><span class="nn">Threading</span><span class="p">.</span><span class="nn">Interlocked</span><span class="p">.</span><span class="nc">Add</span><span class="p">(</span><span class="o">&amp;</span><span class="bp">true</span><span class="nc">Length</span><span class="o">,</span><span class="n">x</span><span class="p">)</span> <span class="o">|&gt;</span> <span class="n">ignore</span><span class="p">)</span> <span class="p">)</span> <span class="o">|&gt;</span> <span class="n">ignore</span> <span class="k">let</span> <span class="n">res1</span> <span class="o">=</span> <span class="nn">Array</span><span class="p">.</span><span class="n">zeroCreateUnchecked</span> <span class="bp">true</span><span class="nc">Length</span> <span class="k">let</span> <span class="n">res2</span> <span class="o">=</span> <span class="nn">Array</span><span class="p">.</span><span class="n">zeroCreateUnchecked</span> <span class="p">(</span><span class="n">inputLength</span> <span class="o">-</span> <span 
class="bp">true</span><span class="nc">Length</span><span class="p">)</span> <span class="k">let</span> <span class="k">mutable</span> <span class="n">iTrue</span> <span class="o">=</span> <span class="mi">0</span> <span class="k">let</span> <span class="k">mutable</span> <span class="n">iFalse</span> <span class="o">=</span> <span class="mi">0</span> <span class="k">for</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span> <span class="k">to</span> <span class="n">isTrue</span><span class="o">.</span><span class="nc">Length</span><span class="o">-</span><span class="mi">1</span> <span class="k">do</span> <span class="k">if</span> <span class="n">isTrue</span><span class="o">.</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="k">then</span> <span class="n">res1</span><span class="o">.</span><span class="p">[</span><span class="n">iTrue</span><span class="p">]</span> <span class="o">&lt;-</span> <span class="kt">array</span><span class="o">.</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="n">iTrue</span> <span class="o">&lt;-</span> <span class="n">iTrue</span> <span class="o">+</span> <span class="mi">1</span> <span class="k">else</span> <span class="n">res2</span><span class="o">.</span><span class="p">[</span><span class="n">iFalse</span><span class="p">]</span> <span class="o">&lt;-</span> <span class="kt">array</span><span class="o">.</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="n">iFalse</span> <span class="o">&lt;-</span> <span class="n">iFalse</span> <span class="o">+</span> <span class="mi">1</span> <span class="n">res1</span><span class="o">,</span> <span class="n">res2</span> </code></pre></div></div> <p>In this case, each thread has its own accumulator value, keeping track of its own <code class="highlighter-rouge">trueCount</code>. So the threads are free to increment it without locking. 
As threads finish, they each do a locked add, adding their own personal <code class="highlighter-rouge">trueCount</code> to the final result stored in <code class="highlighter-rouge">trueLength</code>. This locked add only happens roughly NumThreads times, instead of array.Length times, so it causes no significant performance penalty. The final result is about 20-30% faster on arrays of 10,000 elements or more, with no memory use penalty.</p> <h4 id="performance-test-results-of-arrayparallelpartition-with-predicate-fun-x---x--2--0">Performance test results of <code class="highlighter-rouge">Array.Parallel.partition</code> with predicate <code class="highlighter-rouge">(fun x -&gt; x % 2 = 0)</code></h4> <table> <thead> <tr> <th>Method</th> <th>Length</th> <th>Median</th> <th>StdDev</th> <th>Scaled</th> <th>Gen 0</th> <th>Gen 1</th> <th>Gen 2</th> <th>Bytes Allocated/Op</th> </tr> </thead> <tbody> <tr> <td>Original</td> <td>1000</td> <td>21.8514 us</td> <td>0.5300 us</td> <td>1.00</td> <td>0.16</td> <td>-</td> <td>-</td> <td>3,471.77</td> </tr> <tr> <td>New</td> <td>1000</td> <td>20.5297 us</td> <td>0.8840 us</td> <td>0.94</td> <td>0.17</td> <td>-</td> <td>-</td> <td>3,489.75</td> </tr> <tr> <td>Original</td> <td>10000</td> <td>160.0466 us</td> <td>3.1249 us</td> <td>1.00</td> <td>1.21</td> <td>-</td> <td>-</td> <td>28,955.03</td> </tr> <tr> <td>New</td> <td>10000</td> <td>118.1885 us</td> <td>2.8572 us</td> <td>0.74</td> <td>1.20</td> <td>-</td> <td>-</td> <td>28,666.02</td> </tr> <tr> <td>Original</td> <td>100000</td> <td>1,282.9827 us</td> <td>7.3705 us</td> <td>1.00</td> <td>-</td> <td>-</td> <td>10.17</td> <td>211,334.08</td> </tr> <tr> <td>New</td> <td>100000</td> <td>917.0063 us</td> <td>17.4501 us</td> <td>0.71</td> <td>-</td> <td>-</td> <td>7.27</td> <td>151,441.53</td> </tr> <tr> <td>Original</td> <td>1000000</td> <td>12,467.9427 us</td> <td>728.8799 us</td> <td>1.00</td> <td>-</td> <td>-</td> <td>65.99</td> <td>2,353,833.73</td> </tr> <tr> <td>New</td> <td>1000000</td> <td>9,700.7108 us</td> <td>990.4339 
us</td> <td>0.78</td> <td>-</td> <td>-</td> <td>65.24</td> <td>2,309,151.64</td> </tr> <tr> <td>Original</td> <td>10000000</td> <td>125,043.1745 us</td> <td>1,753.3497 us</td> <td>1.00</td> <td>-</td> <td>-</td> <td>35.28</td> <td>29,670,713.02</td> </tr> <tr> <td>New</td> <td>10000000</td> <td>86,908.7271 us</td> <td>1,472.4448 us</td> <td>0.70</td> <td>-</td> <td>-</td> <td>35.53</td> <td>29,909,345.33</td> </tr> </tbody> </table> <p><br /></p> <h3 id="recursion-is-slower--sometimes">Recursion is slower … sometimes</h3> <p>Sometimes a recursive implementation will be a substantive speed hit. While the F# compiler is very good at tail recursion optimization, turning most recursive functions into nice loops, there can still be a small to medium performance penalty in some cases. For example, Array.compareWith got about 20% faster when converted from this recursive implementation to while loops:</p> <div class="language-ocaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">let</span> <span class="n">inline</span> <span class="n">compareWith</span> <span class="p">(</span><span class="n">comparer</span><span class="o">:</span><span class="k">'</span><span class="nc">T</span> <span class="o">-&gt;</span> <span class="k">'</span><span class="nc">T</span> <span class="o">-&gt;</span> <span class="kt">int</span><span class="p">)</span> <span class="p">(</span><span class="n">array1</span><span class="o">:</span> <span class="k">'</span><span class="nc">T</span><span class="bp">[]</span><span class="p">)</span> <span class="p">(</span><span class="n">array2</span><span class="o">:</span> <span class="k">'</span><span class="nc">T</span><span class="bp">[]</span><span class="p">)</span> <span class="o">=</span> <span class="n">checkNonNull</span> <span class="s2">"array1"</span> <span class="n">array1</span> <span class="n">checkNonNull</span> <span class="s2">"array2"</span> <span class="n">array2</span> <span class="k">let</span> 
<span class="n">length1</span> <span class="o">=</span> <span class="n">array1</span><span class="o">.</span><span class="nc">Length</span> <span class="k">let</span> <span class="n">length2</span> <span class="o">=</span> <span class="n">array2</span><span class="o">.</span><span class="nc">Length</span> <span class="k">let</span> <span class="n">minLength</span> <span class="o">=</span> <span class="nn">Operators</span><span class="p">.</span><span class="n">min</span> <span class="n">length1</span> <span class="n">length2</span> <span class="k">let</span> <span class="k">rec</span> <span class="n">loop</span> <span class="n">index</span> <span class="o">=</span> <span class="k">if</span> <span class="n">index</span> <span class="o">=</span> <span class="n">minLength</span> <span class="k">then</span> <span class="k">if</span> <span class="n">length1</span> <span class="o">=</span> <span class="n">length2</span> <span class="k">then</span> <span class="mi">0</span> <span class="n">elif</span> <span class="n">length1</span> <span class="o">&lt;</span> <span class="n">length2</span> <span class="k">then</span> <span class="o">-</span><span class="mi">1</span> <span class="k">else</span> <span class="mi">1</span> <span class="k">else</span> <span class="k">let</span> <span class="n">result</span> <span class="o">=</span> <span class="n">comparer</span> <span class="n">array1</span><span class="o">.</span><span class="p">[</span><span class="n">index</span><span class="p">]</span> <span class="n">array2</span><span class="o">.</span><span class="p">[</span><span class="n">index</span><span class="p">]</span> <span class="k">if</span> <span class="n">result</span> <span class="o">&lt;&gt;</span> <span class="mi">0</span> <span class="k">then</span> <span class="n">result</span> <span class="k">else</span> <span class="n">loop</span> <span class="p">(</span><span class="n">index</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span> <span 
class="n">loop</span> <span class="mi">0</span> </code></pre></div></div> <p>It took some care to realize this performance improvement; early attempts were actually worse. So keep in mind that testing and IL inspection will be necessary to know for sure if a recursive implementation is a problem. It often is not.</p> <h3 id="notes-on-benchmarking">Notes on Benchmarking</h3> <p>Benchmarking managed code isn’t easy. As well as dealing with non-deterministic hardware and operating systems just like you do with C/C++, you also add the complication of the JIT, the runtime, and garbage collection. All of these things can cause code to run faster one moment, and slower the next. Especially when looking for marginal gains, you can easily fool yourself into thinking you have made progress when you have actually regressed, or vice versa. Also, you may have achieved a slight runtime improvement for <em>your</em> function, but generated more garbage that has to be collected, leading to a net loss. Or maybe you thrashed the L1 cache with your new algorithm such that the next function goes slower when run in the context of a full program. These can be hard to identify in a benchmark. This is why I cringe a bit when I see people say “you can optimize it later when you identify that it is a bottleneck”. Identifying bottlenecks can be hard. If you see easy ways to avoid pointer hopping or creating garbage, take them.</p> <p>I used the <a href="https://github.com/PerfDotNet/BenchmarkDotNet">BenchmarkDotNet</a> library, which helps solve some, but not all, of these challenges. It will automatically warm up the JIT for you, figure out how many trials need to run for each test to get good data, and report on memory usage and GC events (though this feature has some bugs, so be careful). It also spits out nice reports on the results in HTML, CSV, and Markdown formats. The Markdown format is very handy as you can paste it into your Pull Requests. 
You can see a sample stub that I used <a href="/src/array-perf.fs">here</a>.</p> <h3 id="you-can-do-this-too">You can do this too</h3> <p>If you are interested in improving the quality or performance of software in the world, consider doing something about it. You do not need to be highly skilled or experienced. I am just an average web developer by day, not a language architect or assembler expert or anything. You just need some patience. Learning how a given project’s repository and build process works is often the hardest part. Ask questions of the community; don’t worry about seeming dumb. You will get less dumb every time you ask a dumb question. Pick your favorite open source language or library, and make it better. Code bases are huge, and even those written by greybeard wizards will have mistakes and bottlenecks that you can find and fix. If the code base is way above your head, start with improving documentation or error messages or other important but not so glamorous work. It is always highly appreciated, and can be a way to familiarize yourself with the project and endear yourself to the other team members. Plus it also makes the world a better place.</p> <h3 id="what-does-this-mean-for-fsharp">What does this mean for FSharp</h3> <p>The net effect of all of this for F# programs out there will vary considerably. To benefit from the changes I’ve been working on, you need to be making use of arrays (you should! Learn about cache misses). There are PRs from other people for performance improvements in other areas too, which is great to see. If you are using arrays and using the core library Array module, things will just go faster. Whether it makes a substantive difference just depends on your use case. 
For fun I put together a toy example that hits a lot of the key functions that have been sped up, and compared the current 4.4.0 Core lib against what will hopefully all get merged into 4.4.1:</p> <div class="language-ocaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">(* Init,create and map faster due to array bounds check elision *)</span> <span class="k">let</span> <span class="n">array1</span> <span class="o">=</span> <span class="nn">Array</span><span class="p">.</span><span class="n">init</span> <span class="nc">TEN_MILLION</span> <span class="p">(</span><span class="k">fun</span> <span class="n">i</span> <span class="o">-&gt;</span> <span class="n">i</span><span class="p">)</span> <span class="k">let</span> <span class="n">array2</span> <span class="o">=</span> <span class="nn">Array</span><span class="p">.</span><span class="n">create</span> <span class="nc">TEN_MILLION</span> <span class="mi">5</span> <span class="k">let</span> <span class="n">added</span> <span class="o">=</span> <span class="nn">Array</span><span class="p">.</span><span class="n">map2</span> <span class="p">(</span><span class="k">fun</span> <span class="n">x</span> <span class="n">y</span> <span class="o">-&gt;</span> <span class="n">x</span><span class="o">+</span><span class="n">y</span><span class="p">)</span> <span class="n">array1</span> <span class="n">array2</span> <span class="c">(* Rev faster due to array bounds elision and micro optimizations *)</span> <span class="k">let</span> <span class="n">backwards</span> <span class="o">=</span> <span class="n">added</span> <span class="o">|&gt;</span> <span class="nn">Array</span><span class="p">.</span><span class="n">rev</span> <span class="c">(* AverageBy is much faster as it now no longer just calls into Seq.AverageBy *)</span> <span class="k">let</span> <span class="n">average</span> <span class="o">=</span> <span class="n">backwards</span> <span class="o">|&gt;</span> <span 
class="nn">Array</span><span class="p">.</span><span class="n">averageBy</span> <span class="p">(</span><span class="k">fun</span> <span class="n">x</span><span class="o">-&gt;</span> <span class="p">(</span><span class="kt">float</span><span class="p">)</span><span class="n">x</span><span class="p">)</span> <span class="c">(* Use aggregating Parallel.For loop *)</span> <span class="k">let</span> <span class="n">greaterThan400</span> <span class="o">=</span> <span class="n">backwards</span> <span class="o">|&gt;</span> <span class="nn">Array</span><span class="p">.</span><span class="nn">Parallel</span><span class="p">.</span><span class="n">choose</span> <span class="p">(</span><span class="k">fun</span> <span class="n">x</span> <span class="o">-&gt;</span> <span class="k">match</span> <span class="n">x</span> <span class="k">with</span> <span class="o">|</span> <span class="n">x</span> <span class="k">when</span> <span class="n">x</span> <span class="o">&gt;</span> <span class="mi">400</span> <span class="o">-&gt;</span> <span class="nc">Some</span> <span class="n">x</span> <span class="o">|</span> <span class="n">_</span> <span class="o">-&gt;</span> <span class="nc">None</span> <span class="p">)</span> <span class="c">(* Partition faster and uses less memory due to new algorithm *)</span> <span class="k">let</span> <span class="p">(</span><span class="n">even</span><span class="o">,</span><span class="n">odd</span><span class="p">)</span> <span class="o">=</span> <span class="n">greaterThan400</span> <span class="o">|&gt;</span> <span class="nn">Array</span><span class="p">.</span><span class="n">partition</span> <span class="p">(</span><span class="k">fun</span> <span class="n">x</span> <span class="o">-&gt;</span> <span class="n">x</span> <span class="o">%</span> <span class="mi">2</span> <span class="o">=</span> <span class="mi">0</span><span class="p">)</span> <span class="c">(* Filter faster due to using preallocated array instead of List&lt;T&gt; 
*)</span> <span class="k">let</span> <span class="n">filtered</span> <span class="o">=</span> <span class="n">even</span> <span class="o">|&gt;</span> <span class="nn">Array</span><span class="p">.</span><span class="n">filter</span><span class="p">(</span><span class="k">fun</span> <span class="n">x</span> <span class="o">-&gt;</span> <span class="n">x</span> <span class="o">%</span> <span class="mi">4</span> <span class="o">=</span> <span class="mi">0</span><span class="p">)</span> </code></pre></div></div> <h3 id="results">Results</h3> <div class="language-ini highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="err">Host</span> <span class="err">Process</span> <span class="err">Environment</span> <span class="err">Information:</span> <span class="py">BenchmarkDotNet</span><span class="p">=</span><span class="s">v0.9.8.0</span> <span class="py">OS</span><span class="p">=</span><span class="s">Microsoft Windows NT 6.2.9200.0</span> <span class="py">Processor</span><span class="p">=</span><span class="s">Intel(R) Core(TM) i7-4712HQ CPU 2.30GHz, ProcessorCount=8</span> <span class="py">Frequency</span><span class="p">=</span><span class="s">2240908 ticks, Resolution=446.2477 ns, Timer=TSC</span> <span class="py">CLR</span><span class="p">=</span><span class="s">MS.NET 4.0.30319.42000, Arch=64-bit RELEASE [RyuJIT]</span> <span class="py">GC</span><span class="p">=</span><span class="s">Concurrent Workstation</span> <span class="py">JitModules</span><span class="p">=</span><span class="s">clrjit-v4.6.1590.0</span> <span class="py">Type</span><span class="p">=</span><span class="s">SIMDBenchmark Mode=Throughput Platform=X64 </span> <span class="py">Jit</span><span class="p">=</span><span class="s">RyuJit GarbageCollection=Concurrent Workstation </span> </code></pre></div></div> <table> <thead> <tr> <th>Method</th> <th>Length</th> <th>Median</th> <th>StdDev</th> <th>Gen 0</th> <th>Gen 1</th> <th>Gen 2</th> <th>Bytes Allocated/Op</th> 
</tr> </thead> <tbody> <tr> <td>Old</td> <td>10</td> <td>3.4036 us</td> <td>0.0806 us</td> <td>0.07</td> <td>0.00</td> <td>-</td> <td>852.03</td> </tr> <tr> <td>New</td> <td>10</td> <td>3.4044 us</td> <td>0.3243 us</td> <td>0.08</td> <td>0.00</td> <td>-</td> <td>1,118.24</td> </tr> <tr> <td>Old</td> <td>1000</td> <td>52.2478 us</td> <td>4.7930 us</td> <td>2.15</td> <td>-</td> <td>-</td> <td>31,762.76</td> </tr> <tr> <td>New</td> <td>1000</td> <td>41.3602 us</td> <td>2.7741 us</td> <td>1.37</td> <td>-</td> <td>-</td> <td>22,699.78</td> </tr> <tr> <td>Old</td> <td>100000</td> <td>6,001.7350 us</td> <td>286.9376 us</td> <td>58.77</td> <td>2.10</td> <td>74.12</td> <td>3,114,798.18</td> </tr> <tr> <td>New</td> <td>100000</td> <td>3,296.3410 us</td> <td>89.5167 us</td> <td>-</td> <td>-</td> <td>78.91</td> <td>3,254,820.11</td> </tr> <tr> <td>Old</td> <td>1000000</td> <td>40,985.6462 us</td> <td>830.1541 us</td> <td>556.34</td> <td>-</td> <td>211.89</td> <td>30,759,505.83</td> </tr> <tr> <td>New</td> <td>1000000</td> <td>33,555.2876 us</td> <td>3,514.3688 us</td> <td>519.70</td> <td>-</td> <td>229.11</td> <td>29,579,065.56</td> </tr> <tr> <td>Old</td> <td>10000000</td> <td>405,780.3801 us</td> <td>8,032.9328 us</td> <td>5,660.00</td> <td>-</td> <td>227.00</td> <td>333,268,994.49</td> </tr> <tr> <td>New</td> <td>10000000</td> <td>286,415.5958 us</td> <td>7,394.8703 us</td> <td>5,049.00</td> <td>-</td> <td>183.00</td> <td>287,616,350.72</td> </tr> </tbody> </table> Sat, 13 Aug 2016 19:17:27 +0000 https://jackmott.github.io//programming/2016/08/13/adventures-in-fsharp.html https://jackmott.github.io//programming/2016/08/13/adventures-in-fsharp.html programming Making the obvious code fast <p><a href="http://number-none.com/blow/">Jonathan Blow</a> of “The Witness” fame likes to talk about just typing the obvious code first. Usually it will turn out to be fast enough. If it doesn’t, you can go back and optimize it later. 
His thoughts come in the context of working on games in C/C++. I think these languages, with modern incarnations of their compilers, are compatible with this philosophy. Not only are the compilers very mature, but the languages are low level enough that you are forced to do things by hand and think about what the machine is doing most of the time, especially if you stick to C or a ‘mostly C’ subset of C++. However, in most higher level languages there tend to be performance traps where the obvious or idiomatic solution is particularly bad.</p> <p>What counts as obvious or idiomatic is, of course, often a matter of opinion. The language itself may encourage certain choices by making them easier to type, or highlighting them in documentation and teaching materials. The community that grows up around a language may just come to prefer certain constructs and encourage others to use them. It is very common to see programmers encouraged to use high level constructs over lower level ones, in the interest of readability and simplicity. This is a worthy ideal, but often people aren’t aware of what the cost really is. Some of these constructs have a much higher cost than people realize.</p> <p>In this article I will explore a number of languages using a toy map and reduce example. Within each language, I will explore a number of approaches, ranging from high level constructs to hand coded imperative loops and SIMD operations. Some of the performance pitfalls I will show may be specific to this toy example. With a different toy example, the languages that excel and those that do poorly could be totally different. This is meant merely to explore, and get people thinking about the performance cost of abstractions. For each case I will show code examples so you can consider the differences in complexity.</p> <h2 id="the-task">The Task</h2> <p>We wish to take an array of 32 million 64-bit floating point values, and compute the sum of their squares. 
This will let us explore some fundamental abilities of various languages: their ability to iterate over arrays efficiently, whether they can vectorize basic loops, and whether higher order functions like map and reduce compile to efficient code. When applicable, I will show the runtime of a map followed by a reduce, so we get insight into whether the language can stream higher order functions together, and also the runtime with a single reduce or fold operation.</p> <h2 id="the-results">The Results</h2> <ul> <li><a href="#benchmark">Benchmark Details</a></li> </ul> <h3 id="c---17-milliseconds">C - 17 milliseconds</h3> <div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="kt">double</span> <span class="n">sum</span> <span class="o">=</span> <span class="mi">0</span><span class="p">.</span><span class="mi">0</span><span class="p">;</span> <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">COUNT</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span> <span class="kt">double</span> <span class="n">v</span> <span class="o">=</span> <span class="n">values</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">*</span> <span class="n">values</span><span class="p">[</span><span class="n">i</span><span class="p">];</span> <span class="n">sum</span> <span class="o">+=</span> <span class="n">v</span><span class="p">;</span> <span class="p">}</span> </code></pre></div></div> <p>ANSI C is a bare bones language with no higher order functions or loop abstractions to even think about, so this imperative loop is what most programmers will turn to for this task. 
If I thought that this would be a performance critical piece of code, I might use SIMD intrinsics, which requires this nasty mess:</p> <h3 id="c---simd-explicit---17-milliseconds">C - SIMD Explicit - 17 milliseconds</h3> <div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="n">__m256d</span> <span class="n">vsum</span> <span class="o">=</span> <span class="n">_mm256_setzero_pd</span><span class="p">();</span> <span class="k">for</span><span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">COUNT</span><span class="o">/</span><span class="mi">4</span><span class="p">;</span> <span class="n">i</span><span class="o">=</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span> <span class="p">{</span> <span class="n">__m256d</span> <span class="n">v</span> <span class="o">=</span> <span class="n">values</span><span class="p">[</span><span class="n">i</span><span class="p">];</span> <span class="n">vsum</span> <span class="o">=</span> <span class="n">_mm256_add_pd</span><span class="p">(</span><span class="n">vsum</span><span class="p">,</span><span class="n">_mm256_mul_pd</span><span class="p">(</span><span class="n">v</span><span class="p">,</span><span class="n">v</span><span class="p">));</span> <span class="p">}</span> <span class="kt">double</span> <span class="o">*</span><span class="n">tsum</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">vsum</span><span class="p">;</span> <span class="kt">double</span> <span class="n">sum</span> <span class="o">=</span> <span class="n">tsum</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">+</span><span class="n">tsum</span><span class="p">[</span><span class="mi">1</span><span 
class="p">]</span><span class="o">+</span><span class="n">tsum</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span><span class="o">+</span><span class="n">tsum</span><span class="p">[</span><span class="mi">3</span><span class="p">];</span> </code></pre></div></div> <p>However, notice that the runtime is the same for the obvious and SIMD versions! It turns out that the obvious code was automatically turned into SIMD enhanced machine instructions, a process called “auto vectorization”. Visual C++ is not known for being the most clever of C++ compilers, but it still gets this right:</p> <pre><code class="language-asm">double sum = 0.0; for (int i = 0; i &lt; COUNT; i++) { 00007FF7085C1120 vmovupd ymm0,ymmword ptr [rcx] 00007FF7085C1124 lea rcx,[rcx+40h] double v = values[i] * values[i]; //square em 00007FF7085C1128 vmulpd ymm2,ymm0,ymm0 00007FF7085C112C vmovupd ymm0,ymmword ptr [rcx-20h] 00007FF7085C1131 vaddpd ymm4,ymm2,ymm4 00007FF7085C1135 vmulpd ymm2,ymm0,ymm0 00007FF7085C1139 vaddpd ymm3,ymm2,ymm5 00007FF7085C113D vmovupd ymm5,ymm3 00007FF7085C1141 sub rdx,1 00007FF7085C1145 jne imperative+80h (07FF7085C1120h) sum += v; } </code></pre> <p>To get the SIMD instructions used here, which can operate on 4 doubles at a time, you have to tell the compiler that you want ‘fast floating point’ and that you want to target AVX2 instructions as well. Results will be different when vectorized, because the additions happen in a different order, though they will actually be more accurate, not less. 
(in this case, maybe all?)</p> <h3 id="c-linq-select-sum---260-milliseconds">C# Linq Select Sum - 260 milliseconds</h3> <div class="language-c# highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="kt">var</span> <span class="n">sum</span> <span class="p">=</span> <span class="n">values</span><span class="p">.</span><span class="nf">Sum</span><span class="p">(</span><span class="n">x</span> <span class="p">=&gt;</span> <span class="n">x</span> <span class="p">*</span> <span class="n">x</span><span class="p">);</span> </code></pre></div></div> <h3 id="c-linq-aggregate---280-milliseconds">C# Linq Aggregate - 280 milliseconds</h3> <div class="language-c# highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="kt">var</span> <span class="n">sum</span> <span class="p">=</span> <span class="n">values</span><span class="p">.</span><span class="nf">Aggregate</span><span class="p">(</span><span class="m">0.0</span><span class="p">,(</span><span class="n">acc</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="p">=&gt;</span> <span class="n">acc</span> <span class="p">+</span> <span class="n">x</span> <span class="p">*</span> <span class="n">x</span><span class="p">);</span> </code></pre></div></div> <h3 id="c-for-loop---34-milliseconds">C# for loop - 34 milliseconds</h3> <div class="language-c# highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="kt">double</span> <span class="n">sum</span> <span class="p">=</span> <span class="m">0.0</span><span class="p">;</span> <span class="k">foreach</span> <span class="p">(</span><span class="kt">var</span> <span class="n">v</span> <span class="k">in</span> <span class="n">values</span><span class="p">)</span> <span class="p">{</span> <span class="kt">double</span> <span class="n">square</span> <span class="p">=</span> <span class="n">v</span> <span class="p">*</span> <span class="n">v</span><span 
class="p">;</span> <span class="n">sum</span> <span class="p">+=</span> <span class="n">square</span><span class="p">;</span> <span class="p">}</span> </code></pre></div></div> <p>Stepping up a level to C#, we have a couple of idiomatic solutions. Many C# programmers today might use Linq, which as you can see is much slower. It also creates a lot of garbage, putting more pressure on the garbage collector. Oddly, the Aggregate function, which is equivalent to fold or reduce in most other languages, is slower despite being a single step instead of two. The foreach loop in the third example is also commonly used. While this pattern has big performance pitfalls when used on collections like List&lt;T&gt;, with arrays it compiles to efficient code. This is nice as it saves you some typing without runtime penalty. The runtime here is still twice that of the C code, but that is entirely due to not being automatically vectorized. With the .NET JIT, it is not considered a worthwhile tradeoff to do this particular optimization.</p> <p>With C# you also have to take some care with array access in loops, or <a href="http://www.codeproject.com/Articles/844781/Digging-Into-NET-Loop-Performance-Bounds-checking">bounds checking overhead can be introduced</a>. 
In this case the JIT gets it right, and there is no bounds checking overhead.</p> <h3 id="c-simd-explicit---17-milliseconds">C# SIMD Explicit - 17 milliseconds</h3> <div class="language-c# highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="n">Vector</span><span class="p">&lt;</span><span class="kt">double</span><span class="p">&gt;</span> <span class="n">vsum</span> <span class="p">=</span> <span class="k">new</span> <span class="n">Vector</span><span class="p">&lt;</span><span class="kt">double</span><span class="p">&gt;(</span><span class="m">0.0</span><span class="p">);</span> <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span> <span class="n">i</span> <span class="p">&lt;</span> <span class="n">COUNT</span><span class="p">;</span> <span class="n">i</span> <span class="p">+=</span> <span class="n">Vector</span><span class="p">&lt;</span><span class="kt">double</span><span class="p">&gt;.</span><span class="n">Count</span><span class="p">)</span> <span class="p">{</span> <span class="kt">var</span> <span class="k">value</span> <span class="p">=</span> <span class="k">new</span> <span class="n">Vector</span><span class="p">&lt;</span><span class="kt">double</span><span class="p">&gt;(</span><span class="n">values</span><span class="p">,</span> <span class="n">i</span><span class="p">);</span> <span class="n">vsum</span> <span class="p">=</span> <span class="n">vsum</span> <span class="p">+</span> <span class="p">(</span><span class="k">value</span> <span class="p">*</span> <span class="k">value</span><span class="p">);</span> <span class="p">}</span> <span class="kt">double</span> <span class="n">sum</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span> <span class="k">for</span><span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="p">=</span> 
<span class="m">0</span><span class="p">;</span> <span class="n">i</span> <span class="p">&lt;</span> <span class="n">Vector</span><span class="p">&lt;</span><span class="kt">double</span><span class="p">&gt;.</span><span class="n">Count</span><span class="p">;</span><span class="n">i</span><span class="p">++)</span> <span class="p">{</span> <span class="n">sum</span> <span class="p">+=</span> <span class="n">vsum</span><span class="p">[</span><span class="n">i</span><span class="p">];</span> <span class="p">}</span> </code></pre></div></div> <p>While the .NET JIT won’t do SIMD automatically, we can explicitly use some SIMD instructions, and achieve performance nearly identical to C. An advantage here for C# is that the SIMD code is a bit less nasty than using intrinsics, and that the particular instructions, whether they be AVX2, SSE2, NEON, or whatever the hardware supports, can be decided upon at runtime, whereas the C code above would require separate compilation for each architecture. A disadvantage for C# is that not all SIMD instructions are exposed by the Vector library, so something like <a href="https://github.com/Auburns/FastNoiseSIMD">SIMD enhanced noise functions</a> can’t be done with nearly the same performance. As well, the machine code produced by the Vector library is not always as efficient when you step out of toy examples.</p> <h3 id="f---127-milliseconds">F# - 127 milliseconds</h3> <div class="language-ocaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">let</span> <span class="n">sum</span> <span class="o">=</span> <span class="n">values</span> <span class="o">|&gt;</span> <span class="nn">Array</span><span class="p">.</span><span class="n">map</span> <span class="n">square</span> <span class="o">|&gt;</span> <span class="nn">Array</span><span class="p">.</span><span class="n">sum</span> </code></pre></div></div> <p>The obvious F# code is beautiful; I like typing this, and I like working with it. 
But performance is terrible. Just as with C#, you get no auto vectorization, as they use the same JIT. Additionally, the data is iterated over twice: once to map the values to their squares, and once to sum them. Finally, since immutability is the default, the map returns a new array rather than updating in place, incurring allocation costs and GC pressure. So the total performance impact on an application is likely to be worse than this micro benchmark would suggest.</p> <h3 id="f-streams---98-milliseconds">F# Streams - 98 milliseconds</h3> <div class="language-ocaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">let</span> <span class="n">sum</span> <span class="o">=</span> <span class="n">values</span> <span class="o">|&gt;</span> <span class="nn">Stream</span><span class="p">.</span><span class="n">map</span> <span class="n">square</span> <span class="o">|&gt;</span> <span class="nn">Stream</span><span class="p">.</span><span class="n">sum</span> </code></pre></div></div> <p>F# is a functional-first language, rather than a pure functional language like Haskell. If you do happen to use pure functions, you can stream your map and sum operations together, and avoid iterating over the array twice. 
The <a href="https://github.com/nessos/Streams">Nessos Streams</a> library provides this, with a nice performance improvement as a result.</p> <h3 id="f-fold---75-milliseconds">F# Fold - 75 milliseconds</h3> <div class="language-ocaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">let</span> <span class="n">sum</span> <span class="o">=</span> <span class="n">values</span> <span class="o">|&gt;</span> <span class="nn">Array</span><span class="p">.</span><span class="n">fold</span> <span class="p">(</span><span class="k">fun</span> <span class="n">acc</span> <span class="n">x</span> <span class="o">-&gt;</span> <span class="n">acc</span> <span class="o">+</span> <span class="n">x</span><span class="o">*</span><span class="n">x</span><span class="p">)</span> <span class="mi">0</span><span class="o">.</span><span class="mi">0</span> </code></pre></div></div> <p>When we use a single fold operation, we no longer iterate over the collection twice and allocate extra memory, and runtime improves even more. 
Since there is no overhead for streaming together multiple higher order functions as there is in the Streams library, it does slightly better.</p> <h3 id="f-imperative---38-milliseconds">F# Imperative - 38 milliseconds</h3> <div class="language-ocaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">let</span> <span class="k">mutable</span> <span class="n">sum</span> <span class="o">=</span> <span class="mi">0</span><span class="o">.</span><span class="mi">0</span> <span class="k">for</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span> <span class="k">to</span> <span class="n">values</span><span class="o">.</span><span class="nc">Length</span><span class="o">-</span><span class="mi">1</span> <span class="k">do</span> <span class="k">let</span> <span class="n">x</span> <span class="o">=</span> <span class="n">values</span><span class="o">.</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="n">sum</span> <span class="o">&lt;-</span> <span class="n">sum</span> <span class="o">+</span> <span class="n">x</span><span class="o">*</span><span class="n">x</span> </code></pre></div></div> <p>One of the nice things about F#, is that while it is a functional leaning language, very few barriers are put in your way if you want to go imperative for the sake of speed. 
Write a normal for loop, and you get the same performance as SSE vectorized C.</p> <h3 id="f-simd---18ms">F# SIMD - 18ms</h3> <div class="language-ocaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">let</span> <span class="n">sum</span> <span class="o">=</span> <span class="n">values</span> <span class="o">|&gt;</span> <span class="nn">Array</span><span class="p">.</span><span class="nn">SIMD</span><span class="p">.</span><span class="n">fold</span> <span class="p">(</span><span class="k">fun</span> <span class="n">acc</span> <span class="n">v</span> <span class="o">-&gt;</span> <span class="n">acc</span> <span class="o">+</span><span class="n">v</span><span class="o">*</span><span class="n">v</span><span class="p">)</span> <span class="p">(</span><span class="o">+</span><span class="p">)</span> <span class="mi">0</span><span class="o">.</span><span class="mi">0</span> </code></pre></div></div> <p>Now to get serious. First we use fold, so that we can combine the summing and squaring into a single pass. Then we use the <a href="https://github.com/jackmott/SIMDArray">SIMDArray extensions</a> that I have been working on which let you take full advantage of SIMD with more idiomatic F#. Performance here is great, nearly as fast as C, but it took a lot of work to get here. At the moment there is no way to combine the lazy stream optimization with the SIMD ones. If you want to filter-&gt;map-&gt;reduce you will still be doing a lot of extra work. This should be possible in principle though. Please submit a PR!</p> <h3 id="rust---34ms">Rust - 34ms</h3> <div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">let</span> <span class="n">sum</span> <span class="o">=</span> <span class="n">values</span><span class="nf">.iter</span><span class="p">()</span><span class="nf">. 
map</span><span class="p">(|</span><span class="n">x</span><span class="p">|</span> <span class="n">x</span><span class="o">*</span><span class="n">x</span><span class="p">)</span><span class="nf">. sum</span><span class="p">()</span> </code></pre></div></div> <p>Rust achieves impressive numbers with the most obvious approach. This is super cool. I feel that this behavior should be the goal for any language offering these kinds of higher order functions as part of the language or core library. Using a traditional for loop or a ‘for x in y’ style loop is just as fast. It is also possible to use Rust intrinsics to get the same speed as the AVX2 vectorized C code here, but to use those you have to write out the loop explicitly:</p> <h3 id="rust-simd---17ms">Rust SIMD - 17ms</h3> <div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">let</span> <span class="k">mut</span> <span class="n">sum</span> <span class="o">=</span> <span class="mf">0.0</span><span class="p">;</span> <span class="k">unsafe</span> <span class="p">{</span> <span class="k">for</span> <span class="n">v</span> <span class="n">in</span> <span class="n">values</span> <span class="p">{</span> <span class="k">let</span> <span class="n">x</span> <span class="p">:</span> <span class="nb">f64</span> <span class="o">=</span> <span class="nn">std</span><span class="p">::</span><span class="nn">intrinsics</span><span class="p">::</span><span class="nf">fmul_fast</span><span class="p">(</span><span class="o">*</span><span class="n">v</span><span class="p">,</span><span class="o">*</span><span class="n">v</span><span class="p">);</span> <span class="n">sum</span> <span class="o">=</span> <span class="nn">std</span><span class="p">::</span><span class="nn">intrinsics</span><span class="p">::</span><span class="nf">fadd_fast</span><span class="p">(</span><span class="n">sum</span><span class="p">,</span><span class="n">x</span><span class="p">);</span> 
<span class="p">}</span> <span class="p">}</span> <span class="n">sum</span> </code></pre></div></div> <p>It would be nice if the rustc compiler had an option to just apply this globally, so you could use the higher order functions. Also, these features are marked as unstable, and likely to remain unstable forever. This might make it problematic to use this feature for any important production project. It would also be nice if the unsafe block was not required. Hopefully the Rust maintainers have a plan to make this better.</p> <h3 id="javascript-map-reduce-nodejs-10000ms">Javascript map reduce (node.js) 10,000ms</h3> <div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">var</span> <span class="nx">sum</span> <span class="o">=</span> <span class="nx">values</span><span class="p">.</span><span class="nx">map</span><span class="p">(</span><span class="nx">x</span> <span class="o">=&gt;</span> <span class="nx">x</span><span class="o">*</span><span class="nx">x</span><span class="p">).</span> <span class="nx">reduce</span><span class="p">(</span> <span class="p">(</span><span class="nx">total</span><span class="p">,</span><span class="nx">num</span><span class="p">,</span><span class="nx">index</span><span class="p">,</span><span class="nx">array</span><span class="p">)</span> <span class="o">=&gt;</span> <span class="nx">total</span><span class="o">+</span><span class="nx">num</span><span class="p">,</span><span class="mf">0.0</span><span class="p">);</span> </code></pre></div></div> <h3 id="javascript-reduce-nodejs-800-and-then-300-milliseconds">Javascript reduce (node.js) 800 and then 300 milliseconds</h3> <div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">var</span> <span class="nx">sum</span> <span class="o">=</span> <span class="nx">values</span><span class="p">.</span><span class="nx">reduce</span><span class="p">(</span> <span 
class="p">(</span><span class="nx">total</span><span class="p">,</span><span class="nx">num</span><span class="p">,</span><span class="nx">index</span><span class="p">,</span><span class="nx">array</span><span class="p">)</span> <span class="o">=&gt;</span> <span class="nx">total</span><span class="o">+</span><span class="nx">num</span><span class="o">*</span><span class="nx">num</span><span class="p">,</span><span class="mf">0.0</span><span class="p">)</span> </code></pre></div></div> <p>It is common to see these higher order JavaScript functions suggested as the most elegant way to do this, but chaining map and reduce is incredibly slow. Collapsing the map and reduce into a single reduce improves runtime by an order of magnitude, to 800ms, and after 3 or 4 iterations the JIT does some magic and runtime drops to 300ms thereafter. This represents the first time I have seen any substantive JIT optimization happen during runtime in the wild!</p> <h3 id="javascript-foreach-nodejs-800-and-then-300-milliseconds">Javascript foreach (node.js) 800 and then 300 milliseconds</h3> <div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="kd">var</span> <span class="nx">sum</span> <span class="o">=</span> <span class="mf">0.0</span><span class="p">;</span> <span class="nx">values</span><span class="p">.</span><span class="nx">forEach</span><span class="p">(</span> <span class="p">(</span><span class="nx">element</span><span class="p">,</span><span class="nx">index</span><span class="p">,</span><span class="nx">array</span><span class="p">)</span> <span class="o">=&gt;</span> <span class="nx">sum</span> <span class="o">+=</span> <span class="nx">element</span><span class="o">*</span><span class="nx">element</span> <span class="p">)</span> </code></pre></div></div> <p>Slightly less elegant but also a popular idiom in JavaScript, this is faster than map and reduce, but is still amazingly slow. 
Again, after 3 or 4 iterations the JIT does some magic and it speeds up from around 800 to 300 milliseconds.</p> <h3 id="javascript-imperative-nodejs-37-milliseconds">Javascript imperative (node.js) 37 milliseconds</h3> <div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="kd">var</span> <span class="nx">sum</span> <span class="o">=</span> <span class="mf">0.0</span><span class="p">;</span> <span class="k">for</span> <span class="p">(</span><span class="kd">var</span> <span class="nx">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="nx">i</span> <span class="o">&lt;</span> <span class="nx">values</span><span class="p">.</span><span class="nx">length</span><span class="p">;</span><span class="nx">i</span><span class="o">++</span><span class="p">){</span> <span class="kd">var</span> <span class="nx">x</span> <span class="o">=</span> <span class="nx">values</span><span class="p">[</span><span class="nx">i</span><span class="p">];</span> <span class="nx">sum</span> <span class="o">+=</span> <span class="nx">x</span><span class="o">*</span><span class="nx">x</span><span class="p">;</span> <span class="p">}</span> </code></pre></div></div> <p>Finally, when we get down to a basic imperative for loop, JavaScript performs comparably to SSE vectorized C.</p> <h3 id="java-streams-map-sum-138-milliseconds">Java Streams Map Sum 138 milliseconds</h3> <div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="kt">double</span> <span class="n">sum</span> <span class="o">=</span> <span class="n">Arrays</span><span class="o">.</span><span class="na">stream</span><span class="o">(</span><span class="n">values</span><span class="o">).</span> <span class="n">map</span><span class="o">(</span><span class="n">x</span> <span class="o">-&gt;</span> <span class="n">x</span><span class="o">*</span><span class="n">x</span><span 
class="o">).</span> <span class="n">sum</span><span class="o">();</span> </code></pre></div></div> <h3 id="java-streams-reduce-34-milliseconds">Java Streams Reduce 34 milliseconds</h3> <div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="kt">double</span> <span class="n">sum</span> <span class="o">=</span> <span class="n">Arrays</span><span class="o">.</span><span class="na">stream</span><span class="o">(</span><span class="n">values</span><span class="o">).</span> <span class="n">reduce</span><span class="o">(</span><span class="mi">0</span><span class="o">,(</span><span class="n">acc</span><span class="o">,</span><span class="n">x</span><span class="o">)</span> <span class="o">-&gt;</span> <span class="n">acc</span><span class="o">+</span><span class="n">x</span><span class="o">*</span><span class="n">x</span><span class="o">);</span> </code></pre></div></div> <p>Java 8 includes a very nice library called <a href="https://docs.oracle.com/javase/8/docs/api/java/util/stream/package-summary.html">stream</a> which provides higher order functions over collections in a lazy evaluated manner, similar to the F# Nessos streams library and Rust. Given that this is a lazy evaluated system, it is odd that there is such a performance difference between map then sum and a single reduction. The reduce function is compiling down to the equivalent of SSE vectorized C, but the map then sum is not even close. It turns out that the <code class="highlighter-rouge">sum()</code> method on <code class="highlighter-rouge">DoubleStream</code>:</p> <blockquote> <p>may be implemented using compensated summation or other technique to reduce the error bound in the numerical sum compared to a simple summation of double values.</p> </blockquote> <p>A nice feature, but not clearly communicated by the method name! 
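</p> <p>As a rough illustration of what compensated summation involves, here is a minimal sketch of one such technique, Kahan summation, next to plain summation. The class and method names are hypothetical, and the Javadoc does not promise that <code class="highlighter-rouge">DoubleStream.sum()</code> uses this exact algorithm; it only permits a technique of this kind.</p>

```java
import java.util.Arrays;

public class CompensatedSum {
    // Plain left-to-right summation: one addition per element.
    static double naiveSum(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum;
    }

    // Kahan (compensated) summation: tracks the rounding error of each
    // addition in c and feeds it back into the next one.
    static double kahanSum(double[] values) {
        double sum = 0.0;
        double c = 0.0; // running compensation for lost low-order bits
        for (double v : values) {
            double y = v - c;   // apply the stored correction to the next value
            double t = sum + y; // low-order bits of y may be lost here
            c = (t - sum) - y;  // recover what was lost (with the sign flipped)
            sum = t;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Ten million copies of 0.1, so rounding error has room to accumulate.
        double[] values = new double[10_000_000];
        Arrays.fill(values, 0.1);
        System.out.println("naive:  " + naiveSum(values));
        System.out.println("kahan:  " + kahanSum(values));
        System.out.println("stream: " + Arrays.stream(values).sum());
    }
}
```

<p>The compensated loop does several dependent floating point operations per element instead of one, which is a plausible (though unverified here) explanation for the gap between the map then sum and the single reduce timings.</p> <p>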
If we tweak the java code to do normal summation the runtime remains as fast as SSE vectorized C, a nice accomplishment:</p> <h3 id="java-streams-map-reduce-34-milliseconds">Java Streams Map Reduce 34 milliseconds</h3> <div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="kt">double</span> <span class="n">sum</span> <span class="o">=</span> <span class="n">Arrays</span><span class="o">.</span><span class="na">stream</span><span class="o">(</span><span class="n">values</span><span class="o">).</span> <span class="n">map</span><span class="o">(</span><span class="n">x</span> <span class="o">-&gt;</span> <span class="n">x</span><span class="o">*</span><span class="n">x</span><span class="o">).</span> <span class="n">reduce</span><span class="o">(</span><span class="mi">0</span><span class="o">,(</span><span class="n">acc</span><span class="o">,</span><span class="n">x</span><span class="o">)</span> <span class="o">-&gt;</span> <span class="n">acc</span><span class="o">+</span><span class="n">x</span><span class="o">);</span> </code></pre></div></div> <p>There does not appear to be a way to get SIMD out of Java, either explicitly or via automatic vectorization by the Hotspot JVM. There are 3rd party libraries available that do it by calling C++ code. 
I do see some literature stating that the JVM can and does auto-vectorize, but I’m not seeing evidence of that in this case, or when I use a for loop, either.</p> <h3 id="go-for-range-37-milliseconds">Go for Range 37 milliseconds</h3> <div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="x"> </span><span class="n">sum</span><span class="x"> </span><span class="o">:=</span><span class="x"> </span><span class="m">0.0</span><span class="x"> </span><span class="k">for</span><span class="x"> </span><span class="n">_</span><span class="p">,</span><span class="n">v</span><span class="x"> </span><span class="o">:=</span><span class="x"> </span><span class="k">range</span><span class="x"> </span><span class="n">values</span><span class="p">[</span><span class="o">:</span><span class="p">]</span><span class="x"> </span><span class="p">{</span><span class="x"> </span><span class="n">sum</span><span class="x"> </span><span class="o">=</span><span class="x"> </span><span class="n">sum</span><span class="x"> </span><span class="o">+</span><span class="x"> </span><span class="n">v</span><span class="o">*</span><span class="n">v</span><span class="x"> </span><span class="p">}</span><span class="x"> </span></code></pre></div></div> <h3 id="go-for-loop-37-milliseconds">Go for loop 37 milliseconds</h3> <div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="x"> </span><span class="n">sum</span><span class="x"> </span><span class="o">:=</span><span class="x"> </span><span class="m">0.0</span><span class="x"> </span><span class="k">for</span><span class="x"> </span><span class="n">i</span><span class="x"> </span><span class="o">:=</span><span class="x"> </span><span class="m">0</span><span class="p">;</span><span class="x"> </span><span class="n">i</span><span class="x"> </span><span class="o">&lt;</span><span class="x"> </span><span class="nb">len</span><span 
class="p">(</span><span class="n">values</span><span class="p">);</span><span class="x"> </span><span class="n">i</span><span class="o">++</span><span class="x"> </span><span class="p">{</span><span class="x"> </span><span class="n">x</span><span class="x"> </span><span class="o">:=</span><span class="x"> </span><span class="n">values</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="x"> </span><span class="n">sum</span><span class="x"> </span><span class="o">=</span><span class="x"> </span><span class="n">sum</span><span class="x"> </span><span class="o">+</span><span class="x"> </span><span class="n">x</span><span class="o">*</span><span class="n">x</span><span class="x"> </span><span class="p">}</span><span class="x"> </span></code></pre></div></div> <p>Go has good performance with both the usual imperative loop and its ‘range’ idiom, which is like a ‘foreach’ in other languages. <br /> Neither auto vectorization nor explicit SIMD support appears to be on the Go radar. There are no map/reduce/fold higher order functions in the standard library, so we can’t compare them. Go does a good thing here by not providing a slow path at all.</p> <h3 id="conclusion">Conclusion</h3> <p>I have shown some performance pitfalls in various languages here. One should not read too much into this as an argument for general performance of these languages. Every language has cases where the preferred or easiest approach to solving a problem leads to poor performance. In Java, for instance, almost everything is an object. Objects all allocate on the heap (unless the JIT does some work at runtime to determine that an object doesn’t need to go on the heap, but that isn’t a freebie). Since Java is also a garbage collected language, this can lead to performance pitfalls when you type the obvious code. 
With experience, you can learn about these pitfalls and do work to avoid them, just like you can avoid the pitfalls of LINQ in C# by not using it, or the pitfalls of F# by using Stream or SIMD libraries instead of the core ones. But even then, you have to take extra care, type extra code, or take on more dependencies to do that. This partially defeats the purpose, since high level languages are supposed to let you type less and get things working faster.</p> <p>What I would like to see is more of an attitude change among high level language designers and their communities. None of the issues above need to exist. Java could (and will, soon) provide value types (as C# does) to make it less painful to avoid GC pressure if you use lots of small, short-lived constructs. Go could provide more SIMD support, either via a SIMD library or better auto vectorization. F# could provide efficient Streams as part of the core library like Java does. .NET could auto vectorize in the JIT and/or provide more complete coverage of SIMD instructions in the Vector library. We, the community, can help by providing libraries and <a href="https://jackmott.github.io/programming/2016/08/13/adventures-in-fsharp.html">submitting PRs</a> to make the obvious code faster. Time and energy will be saved, batteries will last longer, users will be happier.</p> <h3 id="benchmark-details-">Benchmark Details <a name="benchmark"></a></h3> <p>All benchmarks were run with what I believe to be the latest and greatest compilers available for Windows for each language. JIT warmup time is accounted for when applicable. 
If you identify cases where code or compiler/environment choices are suboptimal, please email me.</p> <h4 id="environment">Environment</h4> <div class="language-ini highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="err">Host</span> <span class="err">Process</span> <span class="err">Environment</span> <span class="err">Information:</span> <span class="py">BenchmarkDotNet</span><span class="p">=</span><span class="s">v0.9.8.0</span> <span class="py">OS</span><span class="p">=</span><span class="s">Microsoft Windows NT 6.2.9200.0</span> <span class="py">Processor</span><span class="p">=</span><span class="s">Intel(R) Core(TM) i7-4712HQ CPU 2.30GHz, ProcessorCount=8</span> <span class="py">Frequency</span><span class="p">=</span><span class="s">2240907 ticks, Resolution=446.2479 ns, Timer=TSC</span> </code></pre></div></div> <h4 id="f--c-runtime-details">F# / C# Runtime Details</h4> <div class="language-ini highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="py">CLR</span><span class="p">=</span><span class="s">MS.NET 4.0.30319.42000, Arch=64-bit RELEASE [RyuJIT]</span> <span class="py">GC</span><span class="p">=</span><span class="s">Concurrent Workstation</span> <span class="py">JitModules</span><span class="p">=</span><span class="s">clrjit-v4.6.1590.0</span> <span class="py">Type</span><span class="p">=</span><span class="s">SIMDBenchmark Mode=Throughput Platform=X64 </span> <span class="py">Jit</span><span class="p">=</span><span class="s">RyuJit GarbageCollection=Concurrent Workstation </span> </code></pre></div></div> <h3 id="c-details">C Details</h3> <p>Visual Studio 2015 Update 3, fast floating point, 64 bit, AVX2 instructions enabled, all speed optimizations on</p> <h3 id="rust-details">Rust Details</h3> <p>v1.13 Nightly, --release -opt-level=3</p> <h3 id="javascriptnode-details">Javascript/Node Details</h3> <p>v6.4.0 64bit NODE_ENV=production</p> <h3 id="java-details">Java Details</h3> 
<p>Oracle Java 64bit version 8 update 102</p> <h3 id="go-details">Go Details</h3> <p>Go 1.7</p> Fri, 22 Jul 2016 19:17:27 +0000 https://jackmott.github.io//programming/2016/07/22/making-obvious-fast.html https://jackmott.github.io//programming/2016/07/22/making-obvious-fast.html programming De-Cruft Visual Studio <p><img src="/images/vs-cruft.png" alt="VS-Cruft" title="VS Cruft" /> Above is a screenshot of what Visual Studio looks like on most people’s desktops. There is a <em>lot</em> going on, and some people like it that way. They have a large monitor, they use all of these features, and they suffer no performance or stability problems. Some of us, however, are really only interested in seeing the code, and find the rest of this to be a distraction that eats system resources and screen real estate. I will quickly explain how you can turn off any of these visual features that you do not actually want. This can free you from visual distractions, or open up screen real estate to make side by side code editing a more practical endeavour. It may even reduce system resource use and improve stability somewhat (citation needed). These tips all assume you are using Visual Studio 2015, though some may work on older versions as well.</p> <h3 id="remove-any-extensions-you-dont-use">Remove any extensions you don’t use</h3> <p>Performance and stability issues with Visual Studio 2015 are often due to extensions. Take a quick glance at your installed extensions by navigating to Tools-&gt;Extensions and Updates-&gt;Installed and see if there is anything there that you never actually use. If so, uninstall it. If you have used ReSharper for a long time, note that Visual Studio has slowly been adding many of the features that ReSharper used to add. 
If you don’t need ReSharper, you can get huge improvements in responsiveness by uninstalling it.</p> <h3 id="disable-codelens">Disable CodeLens</h3> <p>The CodeLens feature of Visual Studio can be quite useful. It can display various metadata about your code, and its state within the context of your source control. But if you do not make use of it often, you can save a great deal of visual clutter, and perhaps improve the resource utilization and responsiveness of Visual Studio as well, by turning it off. You can disable it globally at Tools-&gt;Options-&gt;Text Editor-&gt;All Languages, or on a per-language basis if you prefer.</p> <h3 id="solution-explorer-and-output--error-panes">Solution Explorer and Output / Error Panes</h3> <p><img src="/images/vs-solution-explore.gif" alt="VS-AutoHide" title="VS AutoHide" /></p> <p>These are very commonly used tools, but you can have your cake and eat it too with them. At the top of these panels you will see a thumb-tack icon. You can click that to toggle ‘Auto-Hide’. With ‘Auto-Hide’ enabled the panes will normally stay collapsed, but you can bring them up and put the focus in them with shortcut keys (CTRL-ALT-L for Solution Explorer, CTRL-ALT-O for Output, etc.) and then you can hop back to your code with the ESC key. While they are open, the focus will be in the pane and you can navigate them with the ARROW and ENTER keys, with no need for the mouse. This is a great way to free up space while keeping the Solution Explorer useful.</p> <h3 id="hide-the-navigation-bar-and-code-outlining">Hide The Navigation Bar and Code Outlining</h3> <p>Tools-&gt;Options-&gt;Text Editor-&gt;All Languages-&gt;General will give you the option to turn off the Navigation bar, a thin bar with dropdowns that show the current method you are in. Some people like this feature, but if you never use it, free up the space and turn it off. You can turn it on/off on a per-language basis as well. 
I also like to turn line numbers on here.</p> <p>Hiding the code outlining graphics, if you find those useless, is a bit more tricky. For some languages like C++ and C#, you can turn the feature off from within the language specific options under Text Editor. For others, like JavaScript, you have to turn it off by hand on each file with CTRL-M CTRL-P.</p> <h3 id="reduce-the-margins">Reduce the Margins</h3> <p>By default there is a lot of horizontal space taken up by the Selection and Indicator margins. If you don’t make use of these, you can go to Tools-&gt;Options-&gt;Text Editor-&gt;General and unselect them both.</p> <h3 id="cleaning-up-the-menu-bar">Cleaning up the Menu Bar</h3> <p>If you have been learning your keyboard shortcuts, the icons under the menu bar should be completely useless to you, and you can remove them by right clicking empty space in that area and deselecting any icon groups you don’t need. This can quickly free up vertical space. You can go even further, and hide the menu text as well with extensions like ‘Hide Main Menu’. You can still use the menu, as pressing the ALT key brings it back up. Just go to Tools-&gt;Extensions and Updates-&gt;Online and search for ‘Hide Main Menu’.</p> <h3 id="full-screen">Full Screen</h3> <p>Tap ALT-SHIFT-ENTER to go into fullscreen mode. This frees up some space as the window borders go away. This also makes the icons under the menu go away.</p> <h3 id="tab-group-jumper">Tab Group Jumper</h3> <p>This is more of a productivity improvement than a de-crufting, but once you free up all this space, you may find you have room for 2 or 3 pages of code side by side. Unfortunately, Visual Studio provides no way to jump between tab groups without the mouse. The Tab Group Jumper extension adds this functionality. 
Go to Tools-&gt;Extensions and Updates-&gt;Online and search for ‘Tab Jumper’.</p> <h3 id="after-de-crufting">After De-Crufting</h3> <p><img src="/images/vs-nocruft.png" alt="VS-NoCruft" title="VS NoCruft" /></p> <p>Now with space freed up, you have more room on your screen for code, side by side editing, or whatever else you desire.</p> Mon, 11 Jul 2016 19:17:27 +0000 https://jackmott.github.io//programming/tools/editor/ide/visual/studio/2016/07/11/decruft-visual-studio.html https://jackmott.github.io//programming/tools/editor/ide/visual/studio/2016/07/11/decruft-visual-studio.html programming tools editor ide visual studio Marginal Gains <p>In a former life I was heavily involved in bike racing, and became obsessed with the concept of “marginal gains”. It is the idea that there exist a multitude of choices you can make, each of which, in isolation, has little to no effect on your result, but in totality can be the difference between success and failure. It is a philosophy which requires a delicate balance. Too much time spent worrying about minutiae can distract from the business of actually training. But ignoring marginal gains completely means you will eventually lose to someone who did not.</p> <p>So it is with being a software developer. Taking time to master your tools will save you time, and expand your abilities. But, at some point, you just need to shut up and code.</p> <h2 id="become-one-with-the-command-line">Become one with the command line</h2> <p>If you spend most of your time in Windows software development it is possible to get by and never really master various command line systems and tools. Taking the time to force yourself to learn these things can be extremely valuable and open up entirely new worlds. Being familiar with how to build things from source on Linux, for example, can allow you to leverage and contribute to open source projects that might otherwise be unavailable to you. 
If your projects are deployed to cloud infrastructures such as Azure or AWS, being able to manage all of that from Bash or PowerShell lets you get things done much, much faster than working through web GUIs. You will likely be able to easily automate a lot of your daily tasks with scripts. For instance, do you type <code class="highlighter-rouge">git add *</code>, <code class="highlighter-rouge">git commit -m "foo"</code>, and <code class="highlighter-rouge">git push</code> 30 times a day (or use your mouse and click through 3 menus in the GUI equivalent)? That is an easy fix for even a beginner at bash or batch file scripting. Being familiar with the command line also opens up options such as using faster or more flexible code editors, rather than being stuck in Visual Studio, Eclipse, etc., because you don’t understand how to build and run things without the IDE to help you. The possibilities here are too many to list. Take stock of the kinds of things you work on, and take time out of your day to learn Bash, or PowerShell, or whatever command line skills may be relevant to your job and interests.</p> <h2 id="master-your-code-editors">Master your code editors</h2> <p>Whatever you use to edit code, whether it be an IDE like Visual Studio or a text editor like Vim or Sublime, take some time to truly master it. Think carefully about what wastes your time as you use it. Do you spend lots of time navigating around code with the arrow keys? Learn what shortcuts are available to speed that up. Moving to the next token, or to the next or previous matching brace, is often available with a hotkey in a good editor. If you want a feature that isn’t there, look into customizing the editor to add it. Occasionally take a day and make it a goal to do all your coding without touching the mouse. 
At first you will find hundreds of things you can’t accomplish without the mouse, but gradually you will learn how to bring that down to zero, either by learning keyboard shortcuts, or by tweaking your environment so you don’t need the mouse. 
Here are some common examples:</p> <h3 id="changing-indentation-on-blocks">Changing indentation on blocks</h3> <p><img src="/images/tab-1.gif" alt="tab-1" title="Tab Slow" /> <img src="/images/tab-2.gif" alt="tab-2" title="Tab Fast" /></p> <p>With Visual Studio you can do this by selecting the block and using TAB and SHIFT-TAB. You can even select rectangular regions with ALT-SHIFT-ARROWS.</p> <h3 id="token-delete-and-multi-line-editing">Token Delete and Multi Line Editing</h3> <p><img src="/images/token-1.gif" alt="token-1" title="Token Slow" /> <img src="/images/token-2.gif" alt="token-2" title="Token Faster" /> <img src="/images/token-3.gif" alt="token-3" title="Token Fastest" /></p> <p>In Visual Studio you can delete entire tokens at a time with CTRL-DEL or CTRL-BACKSPACE. You can also move the cursor a token at a time with CTRL-ARROW. Some may find it useful to also have a camelCase/PascalCase aware version of this feature. The third gif here shows Multi Line editing in action, where you can use the ALT-SHIFT selection feature and make identical changes to many lines at once. Most code editors have a feature like this, and it can be very useful to learn it.</p> <p>There are dozens of little tricks like this available in any decent code editor. Any time you find yourself having to bang a lot of keys or use the mouse to get things done, investigate whether your editor has a shortcut built in for the task, or whether you can easily add one. 
These will take practice to use quickly and without thinking about it, but when you build up a nice set of shortcuts, such that you are rarely touching the mouse or repeating keystrokes, you will get things done faster, and more pleasantly.</p> <p>A few more freebies in Visual Studio:</p> <ul> <li>F9 to set/unset a breakpoint on the current line</li> <li>F12 to go to the definition of the token the cursor is currently on</li> <li>F5 to run, and SHIFT-F5 to stop, the current default project</li> <li>CTRL-TAB to pop to the previously focused window</li> <li>Hold CTRL and press TAB to cycle through all previously open windows</li> <li>CTRL-K CTRL-O in C++ files to hop between .h/.hpp and .cpp/.c files</li> <li>CTRL-ALT-L to pop to the Solution Explorer, ARROWS and ENTER to navigate/open</li> </ul> <h2 id="expand-either-the-depth-or-breadth-of-your-skills">Expand either the depth, or breadth, of your skills</h2> <p>Take some time to Git Gud. Maybe you are committed to being deeply expert in one aspect of programming. If so, take time to deepen your understanding. Think about what language features or programming concepts have confused you in the past. Set time aside to master those things.</p> <p>Alternatively, especially if you are younger, take time to learn and practice entirely new philosophies. Has your career to date been entirely in managed runtimes? Start a personal project in C, D, or Rust, and learn what programming without garbage collection, and with access to the bare metal, is like. Have you done nothing but Object Oriented Programming your whole life? Try some side projects in a functional language, and try them again in ANSI C. Learn for yourself what the pros and cons of these paradigms are for you. Do not believe the assertions you hear every day about different programming paradigms; almost none of them are backed up with rigorous evidence. Take time to investigate the universes you are not familiar with. 
Consider contributing to an open source project, where you can learn from a different set of people than you deal with at your day job. You will likely pick up useful tips and techniques that you can use elsewhere. Even if you end up sticking with what you already know, you will likely learn some things to make you better at that too.</p> Fri, 01 Jul 2016 19:17:27 +0000 https://jackmott.github.io//programming/tools/editor/ide/visual/studio/2016/07/01/marginal-gains.html https://jackmott.github.io//programming/tools/editor/ide/visual/studio/2016/07/01/marginal-gains.html programming tools editor ide visual studio