<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Dangling Pointers]]></title><description><![CDATA[Summaries of computer science research papers, with a focus on unsolved problems.]]></description><link>https://danglingpointers.substack.com</link><image><url>https://substackcdn.com/image/fetch/$s_!j8tU!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F720c32cb-d53e-457b-98ce-fd37bde18669_1280x1280.png</url><title>Dangling Pointers</title><link>https://danglingpointers.substack.com</link></image><generator>Substack</generator><lastBuildDate>Sat, 11 Apr 2026 03:40:48 GMT</lastBuildDate><atom:link href="https://danglingpointers.substack.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Blake Pelton]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[danglingpointers@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[danglingpointers@substack.com]]></itunes:email><itunes:name><![CDATA[Blake Pelton]]></itunes:name></itunes:owner><itunes:author><![CDATA[Blake Pelton]]></itunes:author><googleplay:owner><![CDATA[danglingpointers@substack.com]]></googleplay:owner><googleplay:email><![CDATA[danglingpointers@substack.com]]></googleplay:email><googleplay:author><![CDATA[Blake Pelton]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[FlexGuard: Fast Mutual Exclusion Independent of Subscription]]></title><description><![CDATA[The perfect amount of busy waiting]]></description><link>https://danglingpointers.substack.com/p/flexguard-fast-mutual-exclusion-independent</link><guid 
isPermaLink="false">https://danglingpointers.substack.com/p/flexguard-fast-mutual-exclusion-independent</guid><dc:creator><![CDATA[Blake Pelton]]></dc:creator><pubDate>Tue, 07 Apr 2026 12:04:36 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!qT24!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1495e8c0-9468-4eb5-b121-21b4a0c234c6_1130x419.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a href="https://dl.acm.org/doi/10.1145/3731569.3764852">FlexGuard: Fast Mutual Exclusion Independent of Subscription</a> Victor Laforet, Sanidhya Kashyap, C&#259;lin Iorgulescu, Julia Lawall, and Jean-Pierre Lozi <em>SOSP'25</em></p><p>This paper presents an interesting use of <a href="https://ebpf.io/">eBPF</a> to effectively add an OS feature: coordination between user space locking code and the kernel thread scheduler to improve locking performance.</p><h2>Oversubscription</h2><p>The paper describes most lock implementations as <em>spin-then-park</em> locks (e.g., busy wait in user space for some time, then give up and call the OS to block the waiting thread).  A big problem with busy waiting is the performance cliff under <em>oversubscription</em>.  Oversubscription occurs when there are more active threads than cores.  In this case, busy waiting can be harmful, because it wastes CPU cycles when there is other useful work to do.  The worst case occurs when a thread acquires a lock and then is preempted by the OS scheduler while many other threads are busy waiting.  If the OS thread scheduler were smart, it would preempt one of the busy waiters and let the lock holder keep running.  But alas, that level of coordination isn&#8217;t available &#8230; until now.</p><h2>eBPF</h2><p>In the good old days, researchers would have modified Linux scheduling code and tested their modified kernel.  The modern (easier) way to achieve this is to use eBPF.  
The authors wrote an eBPF program that runs (in kernel space) each time a context switch occurs.  This program is called the <em>Preemption Monitor</em>.  The Preemption Monitor works in conjunction with a custom user space lock implementation.</p><p>The net result is that the Preemption Monitor can reliably detect when the OS scheduler preempts a thread that is holding a lock.  When this occurs, the eBPF program writes information to a variable that user space code can read.</p><h2>Lock Algorithm</h2><p>The locking algorithm is as follows:</p><ul><li><p>First, try to acquire the lock with a simple atomic compare-and-swap.</p></li><li><p>If that fails, then busy wait.  Similar to <a href="https://danglingpointers.substack.com/p/hapax-locks-scalable-value-based">Hapax locks</a>, this busy waiting avoids contention on one cache line by forcing all threads to agree on the order in which they will acquire the lock and letting each thread spin on per-thread variables.</p></li><li><p>During busy waiting, the variable written by the Preemption Monitor is checked.  If this variable indicates that some thread has acquired a lock and has then been preempted by the OS, threads stop busy waiting and instead call the OS to block until the lock is released (using the same system call that a futex-based lock would use).</p></li></ul><h2>Results</h2><p>Fig. 2 shows performance results.  The x-axis shows thread count (which varies over time).  The green line is FlexGuard.  
The idea is that it gives great performance when there is no oversubscription (i.e., fewer than 150 threads) and offers performance similar to a purely blocking lock (the dark blue line) when there is oversubscription.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qT24!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1495e8c0-9468-4eb5-b121-21b4a0c234c6_1130x419.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qT24!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1495e8c0-9468-4eb5-b121-21b4a0c234c6_1130x419.png 424w, https://substackcdn.com/image/fetch/$s_!qT24!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1495e8c0-9468-4eb5-b121-21b4a0c234c6_1130x419.png 848w, https://substackcdn.com/image/fetch/$s_!qT24!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1495e8c0-9468-4eb5-b121-21b4a0c234c6_1130x419.png 1272w, https://substackcdn.com/image/fetch/$s_!qT24!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1495e8c0-9468-4eb5-b121-21b4a0c234c6_1130x419.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qT24!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1495e8c0-9468-4eb5-b121-21b4a0c234c6_1130x419.png" width="1130" height="419" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1495e8c0-9468-4eb5-b121-21b4a0c234c6_1130x419.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:419,&quot;width&quot;:1130,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:187417,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://danglingpointers.substack.com/i/190421438?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1495e8c0-9468-4eb5-b121-21b4a0c234c6_1130x419.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!qT24!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1495e8c0-9468-4eb5-b121-21b4a0c234c6_1130x419.png 424w, https://substackcdn.com/image/fetch/$s_!qT24!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1495e8c0-9468-4eb5-b121-21b4a0c234c6_1130x419.png 848w, https://substackcdn.com/image/fetch/$s_!qT24!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1495e8c0-9468-4eb5-b121-21b4a0c234c6_1130x419.png 1272w, https://substackcdn.com/image/fetch/$s_!qT24!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1495e8c0-9468-4eb5-b121-21b4a0c234c6_1130x419.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a><figcaption class="image-caption">Source: <a href="https://dl.acm.org/doi/10.1145/3731569.3764852">https://dl.acm.org/doi/10.1145/3731569.3764852</a></figcaption></figure></div><h2>Dangling Pointers</h2><p>This problem seems ripe for overengineering.  In some sick world, the compiler, OS, and hardware could all coordinate to support a &#8220;true critical section&#8221;.  All pages accessed inside this critical section would be pinned into main memory (or even closer to the CPU), and the OS would try extremely hard not to preempt threads inside of the critical section.  
This would require some upper bound on the critical section working set and running time.</p>]]></content:encoded></item><item><title><![CDATA[RTSpMSpM: Harnessing Ray Tracing for Efficient Sparse Matrix Computations]]></title><description><![CDATA[When you have a hammer ...]]></description><link>https://danglingpointers.substack.com/p/rtspmspm-harnessing-ray-tracing-for</link><guid isPermaLink="false">https://danglingpointers.substack.com/p/rtspmspm-harnessing-ray-tracing-for</guid><dc:creator><![CDATA[Blake Pelton]]></dc:creator><pubDate>Wed, 01 Apr 2026 12:03:12 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!h7rK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe855c303-59fb-4b6f-9340-3c5bee6744ba_1969x463.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a href="https://dl.acm.org/doi/full/10.1145/3695053.3731072">RTSpMSpM: Harnessing Ray Tracing for Efficient Sparse Matrix Computations</a> Hongrui Zhang, Yunan Zhang, and Hung-Wei Tseng <em>ISCA'25</em></p><h2>&#8220;Life Finds a Way&#8221;</h2><p>I recall a couple of decades ago when Pat Hanrahan said something like &#8220;all hardware wants to be programmable&#8221;.  You can find a similar sentiment <a href="https://queue.acm.org/detail.cfm?id=1365496">here</a>:</p><div class="pullquote"><p>With most SGI machines, if you opened one up and looked at what was actually in there&#8212;processing vertexes in particular, but for some machines, processing the fragments&#8212;it was a programmable engine. 
It&#8217;s just that it was not programmable by you; it was programmable by me.</p></div><p>And now, twenty years later, GPU companies have bucked the programmability trend and added dedicated ray tracing hardware to their chips.  Little did they know, users would find a way to utilize this hardware for applications that have nothing to do with graphics.</p><h2>Sparse Matrix Multiply</h2><p>The task at hand is multiplying two (very) sparse matrices (<code>A</code> and <code>B</code>).  Each matrix can be partitioned into a 2D grid, where most cells in the grid contain all 0&#8217;s.  Cells in <code>A</code> with non-zero entries must be multiplied by specific cells in <code>B</code> with non-zero entries (using a dense matrix multiplication for each product of two cells).</p><h2>Ray Tracing?</h2><p>The core idea is elegantly simple, and is illustrated in Fig. 5:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!h7rK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe855c303-59fb-4b6f-9340-3c5bee6744ba_1969x463.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!h7rK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe855c303-59fb-4b6f-9340-3c5bee6744ba_1969x463.png 424w, https://substackcdn.com/image/fetch/$s_!h7rK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe855c303-59fb-4b6f-9340-3c5bee6744ba_1969x463.png 848w, https://substackcdn.com/image/fetch/$s_!h7rK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe855c303-59fb-4b6f-9340-3c5bee6744ba_1969x463.png 1272w, 
https://substackcdn.com/image/fetch/$s_!h7rK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe855c303-59fb-4b6f-9340-3c5bee6744ba_1969x463.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!h7rK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe855c303-59fb-4b6f-9340-3c5bee6744ba_1969x463.png" width="1456" height="342" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e855c303-59fb-4b6f-9340-3c5bee6744ba_1969x463.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:342,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:151267,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://danglingpointers.substack.com/i/181816106?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe855c303-59fb-4b6f-9340-3c5bee6744ba_1969x463.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!h7rK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe855c303-59fb-4b6f-9340-3c5bee6744ba_1969x463.png 424w, https://substackcdn.com/image/fetch/$s_!h7rK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe855c303-59fb-4b6f-9340-3c5bee6744ba_1969x463.png 848w, 
https://substackcdn.com/image/fetch/$s_!h7rK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe855c303-59fb-4b6f-9340-3c5bee6744ba_1969x463.png 1272w, https://substackcdn.com/image/fetch/$s_!h7rK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe855c303-59fb-4b6f-9340-3c5bee6744ba_1969x463.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Source: <a href="https://dl.acm.org/doi/full/10.1145/3695053.3731072">https://dl.acm.org/doi/full/10.1145/3695053.3731072</a></figcaption></figure></div><p>The steps are:</p><ol><li><p>Build a ray tracing acceleration structure corresponding to the non-zero cells in <code>B</code>.</p></li><li><p>For each non-zero cell in <code>A</code>:</p><ol><li><p>Trace a ray through <code>B</code> to determine if there are any non-zero cells in <code>B</code> that need to be multiplied by the current cell in <code>A</code>.</p></li></ol></li></ol><p>In Fig. 5, the coordinates of the non-zero cells in matrix <code>A</code> are: [(2, 1), (2, 3), (3, 3), (7, 1)]. The figure shows rays overlaid on top of the result matrix, but I find it easier to think of the rays as traced through matrix <code>B</code>.<br><br>The cell in <code>A</code> at (2, 1) has a column index of 1, so the algorithm traces a ray horizontally through <code>B</code> at row 1. The ray tracing hardware will find that this ray intersects with the cell from <code>B</code> at coordinate (1, 4). So, these cells are multiplied together to determine their contribution to the result.</p><h2>Results</h2><p>Fig. 7 has benchmark results.  All results are normalized to the performance of the <code>cuSPARSE</code> library (i.e., values greater than one represent a speedup).  <code>MKL</code> corresponds to the Intel MKL library running on a Core i7 14700K processor.  
The &#8220;w/o RT cores&#8221; bars show results from the same algorithm with ray tracing implemented in general CUDA code rather than using the ray tracing accelerators.</p><p>It is amazing that this beats <code>cuSPARSE</code> across the board.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!w9gC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f2ba700-ab47-4a29-8e4b-5c6688cb3df8_950x596.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!w9gC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f2ba700-ab47-4a29-8e4b-5c6688cb3df8_950x596.png 424w, https://substackcdn.com/image/fetch/$s_!w9gC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f2ba700-ab47-4a29-8e4b-5c6688cb3df8_950x596.png 848w, https://substackcdn.com/image/fetch/$s_!w9gC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f2ba700-ab47-4a29-8e4b-5c6688cb3df8_950x596.png 1272w, https://substackcdn.com/image/fetch/$s_!w9gC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f2ba700-ab47-4a29-8e4b-5c6688cb3df8_950x596.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!w9gC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f2ba700-ab47-4a29-8e4b-5c6688cb3df8_950x596.png" width="950" height="596" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5f2ba700-ab47-4a29-8e4b-5c6688cb3df8_950x596.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:596,&quot;width&quot;:950,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:87273,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://danglingpointers.substack.com/i/181816106?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f2ba700-ab47-4a29-8e4b-5c6688cb3df8_950x596.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!w9gC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f2ba700-ab47-4a29-8e4b-5c6688cb3df8_950x596.png 424w, https://substackcdn.com/image/fetch/$s_!w9gC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f2ba700-ab47-4a29-8e4b-5c6688cb3df8_950x596.png 848w, https://substackcdn.com/image/fetch/$s_!w9gC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f2ba700-ab47-4a29-8e4b-5c6688cb3df8_950x596.png 1272w, https://substackcdn.com/image/fetch/$s_!w9gC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f2ba700-ab47-4a29-8e4b-5c6688cb3df8_950x596.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a><figcaption class="image-caption">Source: <a href="https://dl.acm.org/doi/full/10.1145/3695053.3731072">https://dl.acm.org/doi/full/10.1145/3695053.3731072</a></figcaption></figure></div><h2>Dangling Pointers</h2><p>It seems like the core problem to be solved here is pointer-chasing.  
I wonder if a more general-purpose processor that is located closer to off-chip memory could provide similar benefits.</p>]]></content:encoded></item><item><title><![CDATA[Dissecting and Modeling the Architecture of Modern GPU Cores]]></title><description><![CDATA[NVIDIA GPU Nitty Gritty]]></description><link>https://danglingpointers.substack.com/p/dissecting-and-modeling-the-architecture</link><guid isPermaLink="false">https://danglingpointers.substack.com/p/dissecting-and-modeling-the-architecture</guid><dc:creator><![CDATA[Blake Pelton]]></dc:creator><pubDate>Tue, 24 Mar 2026 12:06:00 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!TurB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ea1b371-b35e-422a-8ceb-f4e67118b2cc_869x882.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a href="https://dl.acm.org/doi/10.1145/3725843.3756041">Dissecting and Modeling the Architecture of Modern GPU Cores</a> Rodrigo Huerta, Mojtaba Abaie Shoushtary, Jos&#233;-Lorenzo Cruz, and Antonio Gonzalez <em>MICRO'25</em></p><p>The purpose of this paper is to understand the microarchitecture of recent NVIDIA GPUs, to be able to update architectural simulators that are used for research purposes.  The authors uncovered lots of interesting tidbits.  Take this information with a grain of salt; it is derived from careful experimentation rather than NVIDIA documentation.</p><h2>Sub-Core</h2><p>The paper uses the term <em>sub-core</em> to represent the hardware module which can execute warp-wide instructions.  
Each SM comprises four sub-cores.  Fig. 3 illustrates the components within a sub-core and shows how 4 sub-cores share instruction and data caches:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TurB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ea1b371-b35e-422a-8ceb-f4e67118b2cc_869x882.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!TurB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ea1b371-b35e-422a-8ceb-f4e67118b2cc_869x882.png 424w, https://substackcdn.com/image/fetch/$s_!TurB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ea1b371-b35e-422a-8ceb-f4e67118b2cc_869x882.png 848w, https://substackcdn.com/image/fetch/$s_!TurB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ea1b371-b35e-422a-8ceb-f4e67118b2cc_869x882.png 1272w, https://substackcdn.com/image/fetch/$s_!TurB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ea1b371-b35e-422a-8ceb-f4e67118b2cc_869x882.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!TurB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ea1b371-b35e-422a-8ceb-f4e67118b2cc_869x882.png" width="869" height="882" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7ea1b371-b35e-422a-8ceb-f4e67118b2cc_869x882.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:882,&quot;width&quot;:869,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:151396,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://danglingpointers.substack.com/i/190060959?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ea1b371-b35e-422a-8ceb-f4e67118b2cc_869x882.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!TurB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ea1b371-b35e-422a-8ceb-f4e67118b2cc_869x882.png 424w, https://substackcdn.com/image/fetch/$s_!TurB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ea1b371-b35e-422a-8ceb-f4e67118b2cc_869x882.png 848w, https://substackcdn.com/image/fetch/$s_!TurB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ea1b371-b35e-422a-8ceb-f4e67118b2cc_869x882.png 1272w, https://substackcdn.com/image/fetch/$s_!TurB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ea1b371-b35e-422a-8ceb-f4e67118b2cc_869x882.png 1456w" sizes="100vw" fetchpriority="high"></picture>
</div></a><figcaption class="image-caption">Source: <a href="https://dl.acm.org/doi/10.1145/3725843.3756041">https://dl.acm.org/doi/10.1145/3725843.3756041</a></figcaption></figure></div><h2>Instruction Issue</h2><p>The responsibility of resolving inter-instruction hazards (within a given warp) is split between the compiler and the hardware.  There are two mechanisms the compiler can use to inform the hardware how it should avoid hazards:</p><ul><li><p>The instruction encoding allows any instruction to set the value of a per-warp <em>stall counter.</em>  When the hardware issues such an instruction, it sets the stall counter to the specified value.  On each clock cycle thereafter, the counter is decremented by one.  The hardware will not issue more instructions for the warp until the counter reaches zero.  
This is useful for handling hazards with a fixed latency.</p></li><li><p>Variable-latency hazards are resolved with <em>dependence counters</em>.  The hardware tracks the values of six dependence counters per warp.  The instruction encoding allows the compiler to specify up to two counters which should be incremented when an instruction is issued.  One of these counters is decremented when the instruction <em>writes to</em> the register file, and the other is decremented when the instruction <em>reads from</em> the register file (to resolve WAR hazards).  Additionally, the compiler can specify that a given instruction cannot issue until the values of specific dependence counters are zero.</p></li></ul><p>In Fig. 3 above, the values of these counters are checked in the <code>Issue</code> block, and the counters are incremented in the <code>Control</code> block.</p><p>The warp scheduler prefers to pick a warp and stick with it (i.e., it is not a round-robin scheduler).  If the current warp cannot be scheduled (e.g., the stall counter is greater than zero, or there was a cache miss), then the scheduler switches to another warp.</p><p>The warp scheduler issues instructions in program order (within a warp).  There is no out-of-order execution support.</p><h2>Register File Ports</h2><p>The register file has a limited number of ports, and instructions must be controlled to avoid attempting too many reads or writes in parallel.  Register file port contention is not handled by the warp scheduler; instead, it is handled further down the pipeline.  For example, the <code>Allocate</code> stage in Fig. 3 will stall fixed-latency instructions until register file read ports are available.</p><p>The <em>register file cache</em> (RFC) is a hardware component that reduces contention on the register file read ports.  The RFC has storage for 6 vectors (and tags).  
The compiler can mark a source operand of an instruction such that the hardware will store the source operand in the cache for a subsequent operation to use.  Note that the RFC does not store per-warp values and is only useful for caching data within one warp.  This plays nicely with the &#8220;pick a warp and stick to it&#8221; scheduling policy.</p><p>Listing 4 has some example code sequences demonstrating how the compiler can direct the operation of the RFC (e.g., <code>R2.reuse</code>):</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RCdT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd08555d8-2615-413f-97c1-5e971f5c7455_1077x682.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RCdT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd08555d8-2615-413f-97c1-5e971f5c7455_1077x682.png 424w, https://substackcdn.com/image/fetch/$s_!RCdT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd08555d8-2615-413f-97c1-5e971f5c7455_1077x682.png 848w, https://substackcdn.com/image/fetch/$s_!RCdT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd08555d8-2615-413f-97c1-5e971f5c7455_1077x682.png 1272w, https://substackcdn.com/image/fetch/$s_!RCdT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd08555d8-2615-413f-97c1-5e971f5c7455_1077x682.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!RCdT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd08555d8-2615-413f-97c1-5e971f5c7455_1077x682.png" width="1077" height="682" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d08555d8-2615-413f-97c1-5e971f5c7455_1077x682.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:682,&quot;width&quot;:1077,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:127071,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://danglingpointers.substack.com/i/190060959?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd08555d8-2615-413f-97c1-5e971f5c7455_1077x682.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!RCdT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd08555d8-2615-413f-97c1-5e971f5c7455_1077x682.png 424w, https://substackcdn.com/image/fetch/$s_!RCdT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd08555d8-2615-413f-97c1-5e971f5c7455_1077x682.png 848w, https://substackcdn.com/image/fetch/$s_!RCdT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd08555d8-2615-413f-97c1-5e971f5c7455_1077x682.png 1272w, https://substackcdn.com/image/fetch/$s_!RCdT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd08555d8-2615-413f-97c1-5e971f5c7455_1077x682.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Source: <a href="https://dl.acm.org/doi/10.1145/3725843.3756041">https://dl.acm.org/doi/10.1145/3725843.3756041</a></figcaption></figure></div><h2>Memory Access</h2><p>Most of the resources that are shared between sub-cores are shared for efficiency reasons.  A single sub-core will not generate memory requests at a high throughput, and there is locality of reference between the memory accesses in multiple sub-cores.  The <code>shared memory</code> block in fig. 
3 is shared in order to properly support thread group shared memory (as a thread group is spread across all sub-cores in an SM).</p><p>The shared memory access modules can handle one request every two cycles.  That means if all 4 sub-cores are contending on memory, each one can make a request every 8 cycles.  There is a FIFO of depth ~4 between each sub-core and the shared memory structures.  Typical read-after-write latency in shared memory is between 20 and 40 cycles.</p><h2>Results</h2><p>The authors built a simulation model based on their experiments.  <em>Mean absolute percentage error</em> (MAPE) is one metric for measuring how accurate a simulation model is compared to real hardware.  Table 4 shows that the model derived from the findings in this paper is a better performance model for recent NVIDIA GPUs than the baseline (<code>Accel-sim</code>):</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zkGz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb9e19d4-795c-48f1-8fd1-1823f99099e4_1364x462.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zkGz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb9e19d4-795c-48f1-8fd1-1823f99099e4_1364x462.png 424w, https://substackcdn.com/image/fetch/$s_!zkGz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb9e19d4-795c-48f1-8fd1-1823f99099e4_1364x462.png 848w, https://substackcdn.com/image/fetch/$s_!zkGz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb9e19d4-795c-48f1-8fd1-1823f99099e4_1364x462.png 1272w, 
https://substackcdn.com/image/fetch/$s_!zkGz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb9e19d4-795c-48f1-8fd1-1823f99099e4_1364x462.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zkGz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb9e19d4-795c-48f1-8fd1-1823f99099e4_1364x462.png" width="1364" height="462" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bb9e19d4-795c-48f1-8fd1-1823f99099e4_1364x462.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:462,&quot;width&quot;:1364,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:107094,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://danglingpointers.substack.com/i/190060959?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb9e19d4-795c-48f1-8fd1-1823f99099e4_1364x462.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!zkGz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb9e19d4-795c-48f1-8fd1-1823f99099e4_1364x462.png 424w, https://substackcdn.com/image/fetch/$s_!zkGz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb9e19d4-795c-48f1-8fd1-1823f99099e4_1364x462.png 848w, 
https://substackcdn.com/image/fetch/$s_!zkGz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb9e19d4-795c-48f1-8fd1-1823f99099e4_1364x462.png 1272w, https://substackcdn.com/image/fetch/$s_!zkGz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb9e19d4-795c-48f1-8fd1-1823f99099e4_1364x462.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Source: <a 
href="https://dl.acm.org/doi/10.1145/3725843.3756041">https://dl.acm.org/doi/10.1145/3725843.3756041</a></figcaption></figure></div>]]></content:encoded></item><item><title><![CDATA[Nexus Machine: An Energy-Efficient Active Message Inspired Reconfigurable Architecture]]></title><description><![CDATA[Active Messages Primer]]></description><link>https://danglingpointers.substack.com/p/nexus-machine-an-energy-efficient</link><guid isPermaLink="false">https://danglingpointers.substack.com/p/nexus-machine-an-energy-efficient</guid><dc:creator><![CDATA[Blake Pelton]]></dc:creator><pubDate>Thu, 19 Mar 2026 12:24:39 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!EkKi!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbf980a0-2ffb-4527-a083-e006acf7ccdc_789x805.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a href="https://dl.acm.org/doi/10.1145/3725843.3756091">Nexus Machine: An Energy-Efficient Active Message Inspired Reconfigurable Architecture</a> Rohan Juneja, Pranav Dangi, Thilini Kaushalya Bandara, Tulika Mitra, and Li-Shiuan Peh <em>MICRO'25</em></p><p>This paper presents an implementation of the <a href="https://en.wikipedia.org/wiki/Active_message">Active Message</a> (AM) architecture, as an alternative to FPGA/CGRA architectures.  AM architectures have been studied for a while; this was my first exposure.</p><h2>Spatial Computing</h2><p>An accelerator implemented on an FPGA or CGRA typically uses a spatial computing paradigm.  
Each &#8220;instruction&#8221; in the algorithm is pinned to a physical location on the chip, and data flows between the instructions.  I prefer to think of the data in motion as the local variables associated with threads that also move (using <a href="https://dl.acm.org/doi/10.1145/3656420">a specialized memory consistency model</a>).</p><p>The active message architecture flips that script.  Data structures are pinned, while <em>instructions move to the relevant data</em>.</p><h2>Active Messages</h2><p>Fig. 5 shows two <em>processing elements</em> (PEs), each of which contains two <em>active messages</em> (AMs).  An active message looks a lot like an instruction: it contains an opcode, source operands, and a result operand.  Throughout the computation, AMs move between PEs.  Each PE has a local ALU and local memory.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!EkKi!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbf980a0-2ffb-4527-a083-e006acf7ccdc_789x805.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!EkKi!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbf980a0-2ffb-4527-a083-e006acf7ccdc_789x805.png 424w, https://substackcdn.com/image/fetch/$s_!EkKi!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbf980a0-2ffb-4527-a083-e006acf7ccdc_789x805.png 848w, https://substackcdn.com/image/fetch/$s_!EkKi!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbf980a0-2ffb-4527-a083-e006acf7ccdc_789x805.png 1272w, 
https://substackcdn.com/image/fetch/$s_!EkKi!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbf980a0-2ffb-4527-a083-e006acf7ccdc_789x805.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!EkKi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbf980a0-2ffb-4527-a083-e006acf7ccdc_789x805.png" width="789" height="805" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bbf980a0-2ffb-4527-a083-e006acf7ccdc_789x805.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:805,&quot;width&quot;:789,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:120473,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://danglingpointers.substack.com/i/189702301?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6bd77ed-3363-42c2-82c0-67be6a3406e7_789x842.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!EkKi!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbf980a0-2ffb-4527-a083-e006acf7ccdc_789x805.png 424w, https://substackcdn.com/image/fetch/$s_!EkKi!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbf980a0-2ffb-4527-a083-e006acf7ccdc_789x805.png 848w, 
https://substackcdn.com/image/fetch/$s_!EkKi!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbf980a0-2ffb-4527-a083-e006acf7ccdc_789x805.png 1272w, https://substackcdn.com/image/fetch/$s_!EkKi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbf980a0-2ffb-4527-a083-e006acf7ccdc_789x805.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">Source: <a 
href="https://dl.acm.org/doi/10.1145/3725843.3756091">https://dl.acm.org/doi/10.1145/3725843.3756091</a></figcaption></figure></div><p>The AM at the top of the figure has <code>Opcode=LOAD</code> and <code>Op1=f</code>.  Here, <code>f</code> is an operand that is being carried around for future use.  The AM with a <code>LOAD</code> opcode will make its way through the chip until it arrives at the PE that contains the data to be loaded.  At this point, the load operation will execute, and a new AM will be created.  In the figure above, the new AM is the one at the bottom of PE0.  It has <code>Opcode=MUL</code>, <code>Op1=f</code>, and <code>Op2=h</code>.  <code>Op1</code> is forwarded unchanged from the predecessor AM.  The value of <code>Op2</code> is the value loaded from memory.  The new opcode is obtained from the <em>config memory</em>, which contains a description of the program that is being executed.</p><p>The next step is to multiply <code>f * h</code>.  One might expect PE0 to perform the multiplication, but in the figure above the AM is routed to <code>PE1</code>, which performs the multiplication.  This routing makes sense when many AMs are queued to access the data memory associated with PE0, but few AMs are queued to access the data memory associated with PE1.  In this situation, it is better to let PE0 perform loads for other AMs (because PE0 is the only PE that can fulfill that task) and find a PE that is currently idle to perform the multiplication (any PE can perform the multiplication).</p><h2>Results</h2><p>Now the question you should be asking is: what real-world applications exhibit load imbalances between PEs like this?  If a data structure were split evenly between all PEs, you would expect the load to be spread evenly across the PEs as well.  The answer is: irregular workloads like sparse matrix-vector multiplication.  Fig. 
6 shows how a source matrix, source vector, and result vector could be partitioned across 4 PEs.  You can imagine how the sparsity of the tensors being operated on would cause load imbalance between the PEs.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!I83G!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F447f1d4d-3f23-46c8-a13d-bbe817118d15_966x618.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!I83G!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F447f1d4d-3f23-46c8-a13d-bbe817118d15_966x618.png 424w, https://substackcdn.com/image/fetch/$s_!I83G!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F447f1d4d-3f23-46c8-a13d-bbe817118d15_966x618.png 848w, https://substackcdn.com/image/fetch/$s_!I83G!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F447f1d4d-3f23-46c8-a13d-bbe817118d15_966x618.png 1272w, https://substackcdn.com/image/fetch/$s_!I83G!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F447f1d4d-3f23-46c8-a13d-bbe817118d15_966x618.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!I83G!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F447f1d4d-3f23-46c8-a13d-bbe817118d15_966x618.png" width="966" height="618" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/447f1d4d-3f23-46c8-a13d-bbe817118d15_966x618.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:618,&quot;width&quot;:966,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:90579,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://danglingpointers.substack.com/i/189702301?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F447f1d4d-3f23-46c8-a13d-bbe817118d15_966x618.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!I83G!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F447f1d4d-3f23-46c8-a13d-bbe817118d15_966x618.png 424w, https://substackcdn.com/image/fetch/$s_!I83G!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F447f1d4d-3f23-46c8-a13d-bbe817118d15_966x618.png 848w, https://substackcdn.com/image/fetch/$s_!I83G!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F447f1d4d-3f23-46c8-a13d-bbe817118d15_966x618.png 1272w, https://substackcdn.com/image/fetch/$s_!I83G!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F447f1d4d-3f23-46c8-a13d-bbe817118d15_966x618.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Source: <a href="https://dl.acm.org/doi/10.1145/3725843.3756091">https://dl.acm.org/doi/10.1145/3725843.3756091</a></figcaption></figure></div><p>Fig. 11 compares the Nexus Machine against other architectures (each design has the same number of ALUs).  Fig. 
12 shows performance-per-watt.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cPPo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7f5a253-3bd6-4ed6-8bff-3796f4c3d882_1783x820.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cPPo!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7f5a253-3bd6-4ed6-8bff-3796f4c3d882_1783x820.png 424w, https://substackcdn.com/image/fetch/$s_!cPPo!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7f5a253-3bd6-4ed6-8bff-3796f4c3d882_1783x820.png 848w, https://substackcdn.com/image/fetch/$s_!cPPo!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7f5a253-3bd6-4ed6-8bff-3796f4c3d882_1783x820.png 1272w, https://substackcdn.com/image/fetch/$s_!cPPo!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7f5a253-3bd6-4ed6-8bff-3796f4c3d882_1783x820.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cPPo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7f5a253-3bd6-4ed6-8bff-3796f4c3d882_1783x820.png" width="1456" height="670" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c7f5a253-3bd6-4ed6-8bff-3796f4c3d882_1783x820.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:670,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:202329,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://danglingpointers.substack.com/i/189702301?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7f5a253-3bd6-4ed6-8bff-3796f4c3d882_1783x820.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!cPPo!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7f5a253-3bd6-4ed6-8bff-3796f4c3d882_1783x820.png 424w, https://substackcdn.com/image/fetch/$s_!cPPo!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7f5a253-3bd6-4ed6-8bff-3796f4c3d882_1783x820.png 848w, https://substackcdn.com/image/fetch/$s_!cPPo!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7f5a253-3bd6-4ed6-8bff-3796f4c3d882_1783x820.png 1272w, https://substackcdn.com/image/fetch/$s_!cPPo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7f5a253-3bd6-4ed6-8bff-3796f4c3d882_1783x820.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Source: <a href="https://dl.acm.org/doi/10.1145/3725843.3756091">https://dl.acm.org/doi/10.1145/3725843.3756091</a></figcaption></figure></div><h2>Dangling Pointers</h2><p>I imagine that AM architectures work best for algorithms that are insensitive to the order in which AMs are executed.  That would be the case for matrix/vector multiplication (assuming addition is associative).  </p><p>It seems like there is a large design space here related to PE capabilities.  
Data structures could be replicated across PEs to enable memory access AMs to be serviced by multiple PEs, or the ALUs inside of each PE could be heterogeneous (e.g., some PEs can do division, others cannot).</p>]]></content:encoded></item><item><title><![CDATA[TiNA: Tiered Network Buffer Architecture for Fast Networking in Chiplet-based CPUs]]></title><description><![CDATA[Sub-NUMA Clustering]]></description><link>https://danglingpointers.substack.com/p/tina-tiered-network-buffer-architecture</link><guid isPermaLink="false">https://danglingpointers.substack.com/p/tina-tiered-network-buffer-architecture</guid><dc:creator><![CDATA[Blake Pelton]]></dc:creator><pubDate>Tue, 17 Mar 2026 12:06:41 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!-JBE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedb30ae2-028a-4757-828d-0eecdbd48625_915x525.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a href="https://dl.acm.org/doi/10.1145/3760250.3762224">TiNA: Tiered Network Buffer Architecture for Fast Networking in Chiplet-based CPUs</a> Siddharth Agarwal, Tianchen Wang, Jinghan Huang, Saksham Agarwal, and Nam Sung Kim <em>ASPLOS'26</em></p><p>Here we <a href="https://danglingpointers.substack.com/p/ceio-a-cache-efficient-network-io">go</a> <a href="https://danglingpointers.substack.com/p/disentangling-the-dual-role-of-nic">again</a>, another paper in a top-tier conference on the classic CS problem: how to DMA received packets from NIC to host.  
It would be interesting to understand why this is such a hot topic these days.</p><p>This paper deals with the case where the host CPU comprises multiple chiplets.  If you get nothing else from this, I hope you will learn something about SNC mode (I had not heard of it before).</p><h2>SNC</h2><p>Recent Intel CPUs can be placed into <em>Sub-NUMA Clustering</em> mode (via a BIOS setting).  This causes each chiplet to appear as a separate NUMA node.  It is as if a single-socket CPU were transformed into a 4-socket CPU.  The DRAM address space is divided into four regions (one per chiplet), and the LLC slices within a chiplet only cache data from the region associated with that chiplet.  This can be advantageous for some applications, because it can lower average LLC and DRAM access latency (by avoiding inter-chiplet communication).  The downside is that the peak LLC capacity available to a single core is reduced.  Fig. 3 illustrates these tradeoffs:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-JBE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedb30ae2-028a-4757-828d-0eecdbd48625_915x525.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-JBE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedb30ae2-028a-4757-828d-0eecdbd48625_915x525.png 424w, https://substackcdn.com/image/fetch/$s_!-JBE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedb30ae2-028a-4757-828d-0eecdbd48625_915x525.png 848w, 
https://substackcdn.com/image/fetch/$s_!-JBE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedb30ae2-028a-4757-828d-0eecdbd48625_915x525.png 1272w, https://substackcdn.com/image/fetch/$s_!-JBE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedb30ae2-028a-4757-828d-0eecdbd48625_915x525.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-JBE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedb30ae2-028a-4757-828d-0eecdbd48625_915x525.png" width="915" height="525" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/edb30ae2-028a-4757-828d-0eecdbd48625_915x525.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:525,&quot;width&quot;:915,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:118807,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://danglingpointers.substack.com/i/189378958?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedb30ae2-028a-4757-828d-0eecdbd48625_915x525.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!-JBE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedb30ae2-028a-4757-828d-0eecdbd48625_915x525.png 424w, 
https://substackcdn.com/image/fetch/$s_!-JBE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedb30ae2-028a-4757-828d-0eecdbd48625_915x525.png 848w, https://substackcdn.com/image/fetch/$s_!-JBE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedb30ae2-028a-4757-828d-0eecdbd48625_915x525.png 1272w, https://substackcdn.com/image/fetch/$s_!-JBE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedb30ae2-028a-4757-828d-0eecdbd48625_915x525.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Source: <a href="https://dl.acm.org/doi/10.1145/3760250.3762224">https://dl.acm.org/doi/10.1145/3760250.3762224</a></figcaption></figure></div><h2>SNC and DDIO</h2><p>Recall that <a href="https://danglingpointers.substack.com/p/disentangling-the-dual-role-of-nic?utm_source=publication-search">DDIO</a> is a feature of Intel CPUs that allows a NIC to write received packets directly into the LLC, which the host CPU can then read.  PCIe lanes are distributed among chiplets.  This means that the NIC is directly connected to one chiplet.</p><p>One way to support DDIO with SNC is to allocate buffers for received packets in the memory region associated with the chiplet that the NIC is connected to.  This improves LLC bandwidth (for both the NIC and CPU cores) but decreases the LLC capacity available for network packets.</p><p>In practice, this means that longer bursts of network packets degrade performance more when SNC is enabled (i.e., leaky DMA is a larger problem in SNC mode).  Fig. 
6 has data from a microbenchmark to back this up:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dOY7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8883db83-3caf-45b6-899d-64da4a40afc3_567x350.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dOY7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8883db83-3caf-45b6-899d-64da4a40afc3_567x350.png 424w, https://substackcdn.com/image/fetch/$s_!dOY7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8883db83-3caf-45b6-899d-64da4a40afc3_567x350.png 848w, https://substackcdn.com/image/fetch/$s_!dOY7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8883db83-3caf-45b6-899d-64da4a40afc3_567x350.png 1272w, https://substackcdn.com/image/fetch/$s_!dOY7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8883db83-3caf-45b6-899d-64da4a40afc3_567x350.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dOY7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8883db83-3caf-45b6-899d-64da4a40afc3_567x350.png" width="567" height="350" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8883db83-3caf-45b6-899d-64da4a40afc3_567x350.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:350,&quot;width&quot;:567,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:54764,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://danglingpointers.substack.com/i/189378958?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8883db83-3caf-45b6-899d-64da4a40afc3_567x350.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!dOY7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8883db83-3caf-45b6-899d-64da4a40afc3_567x350.png 424w, https://substackcdn.com/image/fetch/$s_!dOY7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8883db83-3caf-45b6-899d-64da4a40afc3_567x350.png 848w, https://substackcdn.com/image/fetch/$s_!dOY7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8883db83-3caf-45b6-899d-64da4a40afc3_567x350.png 1272w, https://substackcdn.com/image/fetch/$s_!dOY7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8883db83-3caf-45b6-899d-64da4a40afc3_567x350.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Source: <a href="https://dl.acm.org/doi/10.1145/3760250.3762224">https://dl.acm.org/doi/10.1145/3760250.3762224</a></figcaption></figure></div><h2>TiNA</h2><p>The solution proposed by this paper requires a change to the NIC/driver interface.  Each ring buffer of received network packets is replaced by <code>N</code> ring buffers (where <code>N</code> is the number of chiplets).  Ring buffer <code>i</code> is placed in the memory region associated with chiplet <code>i</code>.</p><p>The NIC knows about all of these ring buffers and dynamically decides which one to use.  The NIC prefers to use the ring buffer associated with the chiplet that it is directly connected to.  
However, if a burst of traffic causes high utilization of the LLC capacity of that chiplet, then the NIC will fall back to using the other ring buffers.</p><p>The NIC estimates LLC utilization based on two competing rates:</p><ol><li><p>The rate at which received network packets are produced by the NIC</p></li><li><p>The rate at which received network packets are consumed by the host</p></li></ol><p>The first rate is easy for the NIC to compute, as it knows how fast it is sending bytes to the host.  The second rate is computed by networking software running on the host and periodically sent to the NIC.</p><p>The overall approach reminds me of <a href="https://danglingpointers.substack.com/p/ceio-a-cache-efficient-network-io">CEIO</a>.  The key difference is the set of memory segments available: CEIO uses NIC-local DRAM as the fallback path, whereas TiNA falls back to the memory regions of the other chiplets.</p><h2>Packet Ordering</h2><p>One complication of splitting a single ring buffer into multiple ring buffers is ensuring that the host processes received packets in order.  This paper proposes using sequence numbers associated with each packet.  Most protocols already use per-packet sequence numbers.  For other protocols (e.g., UDP), the NIC adds a sequence number based on the order in which packets were received.</p><p>When the host reads a packet from a logical ring buffer, it examines the sequence numbers of the packets at the head of each of the <code>N</code> physical ring buffers and chooses the packet with the lowest sequence number.</p><h2>Results</h2><p>Fig. 
9 has benchmark results: lower latency than SNC and non-SNC across a range of microbenchmarks.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2fJT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d3f6f0c-5961-4cf6-b2ea-bf325710bb2c_774x338.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2fJT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d3f6f0c-5961-4cf6-b2ea-bf325710bb2c_774x338.png 424w, https://substackcdn.com/image/fetch/$s_!2fJT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d3f6f0c-5961-4cf6-b2ea-bf325710bb2c_774x338.png 848w, https://substackcdn.com/image/fetch/$s_!2fJT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d3f6f0c-5961-4cf6-b2ea-bf325710bb2c_774x338.png 1272w, https://substackcdn.com/image/fetch/$s_!2fJT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d3f6f0c-5961-4cf6-b2ea-bf325710bb2c_774x338.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2fJT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d3f6f0c-5961-4cf6-b2ea-bf325710bb2c_774x338.png" width="774" height="338" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0d3f6f0c-5961-4cf6-b2ea-bf325710bb2c_774x338.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:338,&quot;width&quot;:774,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:58749,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://danglingpointers.substack.com/i/189378958?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d3f6f0c-5961-4cf6-b2ea-bf325710bb2c_774x338.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!2fJT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d3f6f0c-5961-4cf6-b2ea-bf325710bb2c_774x338.png 424w, https://substackcdn.com/image/fetch/$s_!2fJT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d3f6f0c-5961-4cf6-b2ea-bf325710bb2c_774x338.png 848w, https://substackcdn.com/image/fetch/$s_!2fJT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d3f6f0c-5961-4cf6-b2ea-bf325710bb2c_774x338.png 1272w, https://substackcdn.com/image/fetch/$s_!2fJT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d3f6f0c-5961-4cf6-b2ea-bf325710bb2c_774x338.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Source: <a href="https://dl.acm.org/doi/10.1145/3760250.3762224">https://dl.acm.org/doi/10.1145/3760250.3762224</a></figcaption></figure></div><h2>Dangling Pointers</h2><p>It would be nice if SNC allowed more fine-grained configuration.  
For example, there may be applications where ideal performance is achieved if each CPU <em>core</em> only has access to the L3 slice that is directly connected to it.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://danglingpointers.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://danglingpointers.substack.com/subscribe?"><span>Subscribe now</span></a></p><p></p>]]></content:encoded></item><item><title><![CDATA[Binary Compatible Critical Section Delegation]]></title><description><![CDATA[A sneak way to ship code to data]]></description><link>https://danglingpointers.substack.com/p/binary-compatible-critical-section</link><guid isPermaLink="false">https://danglingpointers.substack.com/p/binary-compatible-critical-section</guid><dc:creator><![CDATA[Blake Pelton]]></dc:creator><pubDate>Thu, 12 Mar 2026 12:06:47 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!zFN4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea170e8-5181-4a3b-b824-3c3de660da47_1293x787.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a href="https://dl.acm.org/doi/10.1145/3774934.3786439">Binary Compatible Critical Section Delegation</a> Junyao Zhang, Zhuo Wang, and Zhe Zhou <em>PPoPP'26</em></p><p>The futex design works great when contention is low but leaves much to be desired when contention is high.  I generally think that algorithms should be crafted to avoid high lock contention, but this paper offers a contrarian approach that improves performance <em>without code changes</em>.</p><h2>Contention Costs</h2><p>Acquiring a futex involves atomic operations on the cache lines that contain the futex state.  In the case of high contention, these cache lines violently bounce between cores.  
Also, user space code will eventually give up trying to acquire a lock the easy way and will call into the kernel, which has its own synchronization to protect the shared data structures that manage the queue of threads waiting to acquire the lock.</p><p>The problems don&#8217;t end when a lock is finally acquired.  A typical futex guards some specific application data.  The cache lines containing that data will also uncomfortably bounce between cores.</p><h2>Delegation</h2><p>The idea behind delegation is to <em>replace the queue of pending threads with a queue of pending operations</em>.  An operation comprises the code that will be executed under the lock, and the associated data.</p><p>This C++ code snippet shows how I think of delegation:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;cpp&quot;,&quot;nodeId&quot;:&quot;3e2cc4ea-eb0f-42fb-a3ef-f1e03428ba16&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-cpp">// Typical locking
uint32_t x = 3;
mutex.lock();
shared_counter += x;
mutex.unlock();

// Delegation
uint32_t x = 3;
auto fn = [x, &amp;shared_counter]()
{
    shared_counter += x;
};
mutex.delegate(fn); // returns after `fn` is executed by _some_ thread</code></pre></div><p>In the uncontended case, <code>mutex.delegate</code> will execute <code>fn</code> directly.  In the contended case, <code>fn</code> will be placed into a queue to be executed later.  After any thread finishes executing a function like <code>fn</code>, that thread will check the queue.  If the queue is not empty, that thread will go ahead and execute all of the functions contained in the queue.</p><p>In the example above, the data guarded by the lock is <code>shared_counter</code>.  If a particular thread calls 10 functions from the queue, then <code>shared_counter</code> remains local to the core that thread is running on, and the system will avoid moving <code>shared_counter</code> between cores 10 times.</p><h2>Automatic Delegation</h2><p>The magic of this paper is that it shows how to change the OS kernel to automatically implement delegation for any application that uses futexes.  When the futex code gives up trying to acquire a futex in user space, it calls the OS to wait on the futex.  The implementation of this system call is changed to implement automatic delegation.  Automatic delegation can fail (as illustrated by Fig. 
2), in which case the traditional futex waiting algorithm is used.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zFN4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea170e8-5181-4a3b-b824-3c3de660da47_1293x787.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zFN4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea170e8-5181-4a3b-b824-3c3de660da47_1293x787.png 424w, https://substackcdn.com/image/fetch/$s_!zFN4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea170e8-5181-4a3b-b824-3c3de660da47_1293x787.png 848w, https://substackcdn.com/image/fetch/$s_!zFN4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea170e8-5181-4a3b-b824-3c3de660da47_1293x787.png 1272w, https://substackcdn.com/image/fetch/$s_!zFN4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea170e8-5181-4a3b-b824-3c3de660da47_1293x787.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zFN4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea170e8-5181-4a3b-b824-3c3de660da47_1293x787.png" width="1293" height="787" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9ea170e8-5181-4a3b-b824-3c3de660da47_1293x787.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:787,&quot;width&quot;:1293,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:91679,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://danglingpointers.substack.com/i/189177871?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea170e8-5181-4a3b-b824-3c3de660da47_1293x787.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!zFN4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea170e8-5181-4a3b-b824-3c3de660da47_1293x787.png 424w, https://substackcdn.com/image/fetch/$s_!zFN4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea170e8-5181-4a3b-b824-3c3de660da47_1293x787.png 848w, https://substackcdn.com/image/fetch/$s_!zFN4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea170e8-5181-4a3b-b824-3c3de660da47_1293x787.png 1272w, https://substackcdn.com/image/fetch/$s_!zFN4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea170e8-5181-4a3b-b824-3c3de660da47_1293x787.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Source: <a href="https://dl.acm.org/doi/10.1145/3774934.3786439">https://dl.acm.org/doi/10.1145/3774934.3786439</a></figcaption></figure></div><p>This paper makes heavy use of the <a href="https://github.com/glarer/UserspaceBypass">Userspace Bypass</a> library (a.k.a. UB; paper <a href="https://www.usenix.org/system/files/osdi23-zhou-zhe.pdf">here</a>).  This library allows the kernel to safely execute user-mode code.  It was originally designed to optimize syscall heavy applications, by allowing the kernel to execute the small tidbits of user space code in-between system calls.  
UB uses binary translation to convert instructions that were meant to run in user space into instructions that can securely be executed by the kernel.</p><p>Binary compatible critical section delegation uses UB to translate the code inside of the critical section (i.e., the code between the futex lock and unlock calls) into code that can be safely executed by the kernel.  A pointer to this translated code is placed into a queue of delegated calls (the <em>vw queue</em>).  The threads that are trying to acquire the lock cooperatively execute the functions in the vw queue.  At any one time, at most one thread is elected to be the delegate thread.  It drains the vw queue by executing (in kernel space) all the delegated functions in the queue.  This works great in cases where the code inside of the critical section accesses a lot of shared state, because that shared state can happily reside in the cache of the core that is running the delegate thread, rather than bouncing between cores.</p><h2>Results</h2><p>The paper has impressive results from microbenchmarks, but I think real applications are more relevant.  Table 2 shows performance results for a few applications and a few locking strategies.  BCD is the approach proposed in this paper.  
TCS and TCB are prior work which have the drawback of not being compatible with existing binaries.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!aAqz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d2b69bb-ec1b-4bb0-b78e-21cd08a272cd_518x311.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!aAqz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d2b69bb-ec1b-4bb0-b78e-21cd08a272cd_518x311.png 424w, https://substackcdn.com/image/fetch/$s_!aAqz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d2b69bb-ec1b-4bb0-b78e-21cd08a272cd_518x311.png 848w, https://substackcdn.com/image/fetch/$s_!aAqz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d2b69bb-ec1b-4bb0-b78e-21cd08a272cd_518x311.png 1272w, https://substackcdn.com/image/fetch/$s_!aAqz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d2b69bb-ec1b-4bb0-b78e-21cd08a272cd_518x311.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!aAqz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d2b69bb-ec1b-4bb0-b78e-21cd08a272cd_518x311.png" width="518" height="311" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9d2b69bb-ec1b-4bb0-b78e-21cd08a272cd_518x311.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:311,&quot;width&quot;:518,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:42808,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://danglingpointers.substack.com/i/189177871?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d2b69bb-ec1b-4bb0-b78e-21cd08a272cd_518x311.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!aAqz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d2b69bb-ec1b-4bb0-b78e-21cd08a272cd_518x311.png 424w, https://substackcdn.com/image/fetch/$s_!aAqz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d2b69bb-ec1b-4bb0-b78e-21cd08a272cd_518x311.png 848w, https://substackcdn.com/image/fetch/$s_!aAqz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d2b69bb-ec1b-4bb0-b78e-21cd08a272cd_518x311.png 1272w, https://substackcdn.com/image/fetch/$s_!aAqz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d2b69bb-ec1b-4bb0-b78e-21cd08a272cd_518x311.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Source: <a href="https://dl.acm.org/doi/10.1145/3774934.3786439">https://dl.acm.org/doi/10.1145/3774934.3786439</a></figcaption></figure></div><h2>Dangling Pointers</h2><p>There is a hint here at another advantage of pipeline parallelism over data parallelism: allowing persistent data structures to remain local to a core.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://danglingpointers.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://danglingpointers.substack.com/subscribe?"><span>Subscribe now</span></a></p><p></p>]]></content:encoded></item><item><title><![CDATA[Radshield: Software Radiation Protection for Commodity Hardware in 
Space]]></title><description><![CDATA[Putting the space back in user space]]></description><link>https://danglingpointers.substack.com/p/radshield-software-radiation-protection</link><guid isPermaLink="false">https://danglingpointers.substack.com/p/radshield-software-radiation-protection</guid><dc:creator><![CDATA[Blake Pelton]]></dc:creator><pubDate>Tue, 10 Mar 2026 20:45:34 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!jegW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8df1e6f4-baa2-431f-93b6-03434f4f156a_735x418.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a href="https://dl.acm.org/doi/10.1145/3760250.3762218">Radshield: Software Radiation Protection for Commodity Hardware in Space</a> Haoda Wang, Steven Myint, Vandi Verma, Yonatan Winetraub, Junfeng Yang, and Asaf Cidon <em>ASPLOS'25</em></p><p>If you read no further, here are two interesting factoids about outer space from this paper:</p><p>Launch costs have fallen 60x, with the current cost to launch 1kg to space clocking in at $1,400 (see Fig. 1 below).</p><p>Many satellites orbiting the Earth and devices sent to Mars use Snapdragon CPUs!  
I assumed that all chips leaving planet Earth would be specialized for space; apparently not.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!jegW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8df1e6f4-baa2-431f-93b6-03434f4f156a_735x418.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!jegW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8df1e6f4-baa2-431f-93b6-03434f4f156a_735x418.png 424w, https://substackcdn.com/image/fetch/$s_!jegW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8df1e6f4-baa2-431f-93b6-03434f4f156a_735x418.png 848w, https://substackcdn.com/image/fetch/$s_!jegW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8df1e6f4-baa2-431f-93b6-03434f4f156a_735x418.png 1272w, https://substackcdn.com/image/fetch/$s_!jegW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8df1e6f4-baa2-431f-93b6-03434f4f156a_735x418.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!jegW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8df1e6f4-baa2-431f-93b6-03434f4f156a_735x418.png" width="735" height="418" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8df1e6f4-baa2-431f-93b6-03434f4f156a_735x418.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:418,&quot;width&quot;:735,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:47676,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://danglingpointers.substack.com/i/187818469?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8df1e6f4-baa2-431f-93b6-03434f4f156a_735x418.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!jegW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8df1e6f4-baa2-431f-93b6-03434f4f156a_735x418.png 424w, https://substackcdn.com/image/fetch/$s_!jegW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8df1e6f4-baa2-431f-93b6-03434f4f156a_735x418.png 848w, https://substackcdn.com/image/fetch/$s_!jegW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8df1e6f4-baa2-431f-93b6-03434f4f156a_735x418.png 1272w, https://substackcdn.com/image/fetch/$s_!jegW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8df1e6f4-baa2-431f-93b6-03434f4f156a_735x418.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Source: <a href="https://dl.acm.org/doi/10.1145/3760250.3762218">https://dl.acm.org/doi/10.1145/3760250.3762218</a></figcaption></figure></div><p>This paper describes software solutions to deal with two common problems that occur in outer space: <em>Single-Event Latchups</em> and <em>Single-Event Upsets</em>, both of which are caused by radiation interfering with the normal operation of a circuit.</p><h2>Single-Event Latchups</h2><p>A single-event latchup (SEL) causes one portion of the chip to heat up.  If left unmitigated, this can damage the chip.  The solution to this is to detect the problem and reboot.  The trick is in the detection.  </p><p>The classic detection method monitors chip current draw.  
However, this technique fails with a modern off-the-shelf CPU, which is designed to have wide variability in current draw.  When compute load increases, clock frequencies and voltages change, cores come out of sleep states, and power consumption naturally increases.  The point of this design is to save power during idle periods, which is especially important for satellites, which must get their power from the sun.</p><p>The solution proposed by this paper is called <em>ILD</em>.  The idea is to predict the expected current draw based on a simple model that uses CPU performance counters (e.g., cache hit rate, instruction execution rate) as input.  If the measured current draw is much larger than predicted, then the system is rebooted.</p><p>The model is not perfect, and the authors noticed that this scheme only works well when the CPU load is not too high.  For that reason, the &#8220;predict, check, reboot if necessary&#8221; cycle only runs during relatively calm periods of time.  The system is modified to force 3-second idle periods every 3 minutes to ensure that reliable measurements can be taken.  An SEL takes about 5 minutes to damage the chip, so the 3-minute period is chosen to be below that threshold.</p><h2>Single-Event Upsets</h2><p>A single-event upset (SEU) causes the value of a bit to flip (in memory, cache, the register file, etc.).  There are two common solutions to SEUs:</p><ol><li><p>Use ECC on stored data</p></li><li><p>Perform computations with <em>triple modular redundancy</em> (3-MR), which requires computing each result 3 times and taking the majority result if the three computations disagree</p></li></ol><p>This paper deals with mitigating SEUs that affect user &#8220;space&#8221; code.</p><p>The authors define the term <em>reliability frontier</em> to represent the interface between hardware components that support ECC and those that do not.  
For example, if flash storage has ECC but DRAM does not, then flash is considered part of the reliability frontier.</p><p>A <s>typical smartphone CPU</s> advanced satellite chip has multiple CPU cores.  One way to alleviate the compute cost of 3-MR is to compute all 3 results on 3 separate cores in parallel.  A problem with this approach is that the CPU cores may share unreliable hardware.  For example, the last level cache could be shared by all cores but not support ECC.  If a bit flips in the LLC, then all cores will see the corrupted value, and parallel 3-MR will not detect a problem.</p><p>The paper proposes an algorithm called EMR.  The idea is to break a computation into multiple tasks and associate metadata with each task that describes the subset of input data accessed by the task.  Fig. 6 shows a motivating example.  The task of analyzing an image may be decomposed into many tasks, where each task processes a subset of the input image.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!eddX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F462a26c8-e08a-4181-b971-687582192769_561x351.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!eddX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F462a26c8-e08a-4181-b971-687582192769_561x351.png 424w, https://substackcdn.com/image/fetch/$s_!eddX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F462a26c8-e08a-4181-b971-687582192769_561x351.png 848w, 
https://substackcdn.com/image/fetch/$s_!eddX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F462a26c8-e08a-4181-b971-687582192769_561x351.png 1272w, https://substackcdn.com/image/fetch/$s_!eddX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F462a26c8-e08a-4181-b971-687582192769_561x351.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!eddX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F462a26c8-e08a-4181-b971-687582192769_561x351.png" width="561" height="351" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/462a26c8-e08a-4181-b971-687582192769_561x351.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:351,&quot;width&quot;:561,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:257141,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://danglingpointers.substack.com/i/187818469?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F462a26c8-e08a-4181-b971-687582192769_561x351.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!eddX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F462a26c8-e08a-4181-b971-687582192769_561x351.png 424w, 
https://substackcdn.com/image/fetch/$s_!eddX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F462a26c8-e08a-4181-b971-687582192769_561x351.png 848w, https://substackcdn.com/image/fetch/$s_!eddX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F462a26c8-e08a-4181-b971-687582192769_561x351.png 1272w, https://substackcdn.com/image/fetch/$s_!eddX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F462a26c8-e08a-4181-b971-687582192769_561x351.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Source: <a href="https://dl.acm.org/doi/10.1145/3760250.3762218">https://dl.acm.org/doi/10.1145/3760250.3762218</a></figcaption></figure></div><p>In EMR, there is an API to explicitly create tasks and specify the set of input data that each task reads from.  EMR then runs tasks in multiple epochs.  Within an epoch, no two tasks read the same input data.  EMR invalidates caches up to the reliability frontier between epochs.  If there are many tasks, and few epochs, then this system works great (i.e., it has high CPU utilization and does not spend too much time invalidating caches).</p><h2>Results</h2><p>Table 2 compares ILD performance in detecting SELs against a random forest model and a model that simply compares current draw against a fixed value:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!C_9E!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10ff9878-1ed6-4cce-8ba9-b5dd50f8a11e_577x164.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!C_9E!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10ff9878-1ed6-4cce-8ba9-b5dd50f8a11e_577x164.png 424w, https://substackcdn.com/image/fetch/$s_!C_9E!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10ff9878-1ed6-4cce-8ba9-b5dd50f8a11e_577x164.png 848w, https://substackcdn.com/image/fetch/$s_!C_9E!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10ff9878-1ed6-4cce-8ba9-b5dd50f8a11e_577x164.png 1272w, 
https://substackcdn.com/image/fetch/$s_!C_9E!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10ff9878-1ed6-4cce-8ba9-b5dd50f8a11e_577x164.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!C_9E!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10ff9878-1ed6-4cce-8ba9-b5dd50f8a11e_577x164.png" width="577" height="164" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/10ff9878-1ed6-4cce-8ba9-b5dd50f8a11e_577x164.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:164,&quot;width&quot;:577,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:22510,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://danglingpointers.substack.com/i/187818469?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10ff9878-1ed6-4cce-8ba9-b5dd50f8a11e_577x164.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!C_9E!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10ff9878-1ed6-4cce-8ba9-b5dd50f8a11e_577x164.png 424w, https://substackcdn.com/image/fetch/$s_!C_9E!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10ff9878-1ed6-4cce-8ba9-b5dd50f8a11e_577x164.png 848w, 
https://substackcdn.com/image/fetch/$s_!C_9E!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10ff9878-1ed6-4cce-8ba9-b5dd50f8a11e_577x164.png 1272w, https://substackcdn.com/image/fetch/$s_!C_9E!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10ff9878-1ed6-4cce-8ba9-b5dd50f8a11e_577x164.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Source: <a href="https://dl.acm.org/doi/10.1145/3760250.3762218">https://dl.acm.org/doi/10.1145/3760250.3762218</a></figcaption></figure></div><p>Fig. 11 shows the performance impact of EMR.  Each result is normalized against a parallel version of 3-MR which ignores the problems associated with shared hardware.  The red bars represent 3-MR run on a single core; the blue bars represent EMR.  </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Bw6c!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc4f3222-1c61-47f5-a447-3d64494127c4_568x452.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Bw6c!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc4f3222-1c61-47f5-a447-3d64494127c4_568x452.png 424w, https://substackcdn.com/image/fetch/$s_!Bw6c!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc4f3222-1c61-47f5-a447-3d64494127c4_568x452.png 848w, 
https://substackcdn.com/image/fetch/$s_!Bw6c!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc4f3222-1c61-47f5-a447-3d64494127c4_568x452.png 1272w, https://substackcdn.com/image/fetch/$s_!Bw6c!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc4f3222-1c61-47f5-a447-3d64494127c4_568x452.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Bw6c!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc4f3222-1c61-47f5-a447-3d64494127c4_568x452.png" width="568" height="452" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fc4f3222-1c61-47f5-a447-3d64494127c4_568x452.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:452,&quot;width&quot;:568,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:32837,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://danglingpointers.substack.com/i/187818469?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc4f3222-1c61-47f5-a447-3d64494127c4_568x452.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Bw6c!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc4f3222-1c61-47f5-a447-3d64494127c4_568x452.png 424w, 
https://substackcdn.com/image/fetch/$s_!Bw6c!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc4f3222-1c61-47f5-a447-3d64494127c4_568x452.png 848w, https://substackcdn.com/image/fetch/$s_!Bw6c!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc4f3222-1c61-47f5-a447-3d64494127c4_568x452.png 1272w, https://substackcdn.com/image/fetch/$s_!Bw6c!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc4f3222-1c61-47f5-a447-3d64494127c4_568x452.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Source: <a href="https://dl.acm.org/doi/10.1145/3760250.3762218">https://dl.acm.org/doi/10.1145/3760250.3762218</a></figcaption></figure></div><h2>Dangling Pointers</h2><p>EMR would benefit from a system that detects when a programmer misspecifies the set of inputs that will be read.  Maybe hardware or software support could be added to detect this kind of bug.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://danglingpointers.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://danglingpointers.substack.com/subscribe?"><span>Subscribe now</span></a></p><p></p>]]></content:encoded></item><item><title><![CDATA[Cacheman: A Comprehensive Last-Level Cache Management System for Multi-tenant Clouds]]></title><description><![CDATA[Your L3 is more configurable than you thought]]></description><link>https://danglingpointers.substack.com/p/cacheman-a-comprehensive-last-level</link><guid isPermaLink="false">https://danglingpointers.substack.com/p/cacheman-a-comprehensive-last-level</guid><dc:creator><![CDATA[Blake Pelton]]></dc:creator><pubDate>Tue, 10 Mar 2026 12:03:38 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!G8JC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc967f956-9fd8-41b0-857c-960495d5e8f3_874x403.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a href="https://dl.acm.org/doi/10.1145/3774934.3786415">Cacheman: A Comprehensive Last-Level Cache Management System for Multi-tenant Clouds</a></p><p>I learned a lot about the LLC configuration and monitoring capabilities of modern CPUs from this paper, and I bet you will too.</p><h2>Shared L3 in the Cloud</h2><p>The problem this paper addresses 
is: how to avoid performance variability in cloud applications due to cross-VM contention for the last level cache (e.g., the L3 cache on a Xeon)?  In a typical CPU, the L1 and L2 caches are private to a core, but the L3 is shared.  In a cloud environment, the L3 is shared by multiple tenants, and is an avenue for a &#8220;noisy neighbor&#8221; to annoy its neighbors.</p><h2>CAT and CMT</h2><p>The work described by this paper builds upon <a href="https://github.com/intel/intel-cmt-cat/wiki">Intel CMT and CAT.</a>  <em>Cache Monitoring Technology</em> allows the hypervisor to track how much of the L3 cache is occupied by each VM.  <em>Cache Allocation Technology</em> allows the hypervisor to restrict a VM to only use a subset of the L3.</p><p>CAT allows a VM to be assigned to a <em>class of service</em> (CLOS), which defines the set of L3 <em>ways</em> accessible to the VM (<a href="https://www.sciencedirect.com/topics/computer-science/set-associative-cache">this page</a> defines the term &#8220;ways&#8221; if you are unfamiliar).  A typical CPU used by a cloud service provider has more CPU cores than L3 ways.  If a cloud server hosts many small VMs, then L3 ways must be shared amongst VMs.  The key problem solved by this paper is how to reduce performance variability given this constraint.</p><h2>Gradient-Based Sharing</h2><p>Fig. 1 illustrates the assignment of LLC ways to CLOS levels advocated by this paper.  Each row is a class of service, and each column is a way of the LLC.  CLOS[0] can access all ways; CLOS[1] can access all LLC ways except for one.  
CLOS[7] can only access a single way of the LLC.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!G8JC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc967f956-9fd8-41b0-857c-960495d5e8f3_874x403.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!G8JC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc967f956-9fd8-41b0-857c-960495d5e8f3_874x403.png 424w, https://substackcdn.com/image/fetch/$s_!G8JC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc967f956-9fd8-41b0-857c-960495d5e8f3_874x403.png 848w, https://substackcdn.com/image/fetch/$s_!G8JC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc967f956-9fd8-41b0-857c-960495d5e8f3_874x403.png 1272w, https://substackcdn.com/image/fetch/$s_!G8JC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc967f956-9fd8-41b0-857c-960495d5e8f3_874x403.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!G8JC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc967f956-9fd8-41b0-857c-960495d5e8f3_874x403.png" width="508" height="234.23798627002287" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c967f956-9fd8-41b0-857c-960495d5e8f3_874x403.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:403,&quot;width&quot;:874,&quot;resizeWidth&quot;:508,&quot;bytes&quot;:43825,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://danglingpointers.substack.com/i/187818327?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc967f956-9fd8-41b0-857c-960495d5e8f3_874x403.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!G8JC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc967f956-9fd8-41b0-857c-960495d5e8f3_874x403.png 424w, https://substackcdn.com/image/fetch/$s_!G8JC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc967f956-9fd8-41b0-857c-960495d5e8f3_874x403.png 848w, https://substackcdn.com/image/fetch/$s_!G8JC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc967f956-9fd8-41b0-857c-960495d5e8f3_874x403.png 1272w, https://substackcdn.com/image/fetch/$s_!G8JC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc967f956-9fd8-41b0-857c-960495d5e8f3_874x403.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Source: <a href="https://dl.acm.org/doi/10.1145/3774934.3786415">https://dl.acm.org/doi/10.1145/3774934.3786415</a></figcaption></figure></div><p>The hypervisor uses 
Intel CMT to monitor how much of the LLC is occupied by each VM.  Every 6 seconds, the hypervisor uses this information to change which CLOS each VM is assigned to.</p><p>The hypervisor computes a target LLC occupancy for each VM based on the number of cores assigned to the VM.  This target is compared against the measured LLC occupancy to classify each VM into one of three categories:</p><ul><li><p>Poor (the VM is starved for space)</p></li><li><p>Adequate (the VM is using just the right amount of cache)</p></li><li><p>Excess (the VM is hogging too much cache)</p></li></ul><p>VMs in the poor category are <em>de-suppressed</em> (i.e., assigned to a CLOS with access to more LLC ways).  Additionally, VMs in the excess category are suppressed (i.e., assigned to a CLOS with access to fewer ways), but only when there are VMs in the poor category.  </p><p>This policy means that cache-hungry VMs can use more than their fair share of the L3 during periods of low server utilization.  This can lead to higher mean performance, at the cost of a larger standard deviation.  The paper describes a fourth state (overflow), which is only applied to VMs that wish to be held back even if there is plenty of L3 space available.  These VMs are suppressed when they are found to be using too much L3, even if all other VMs on the system are getting enough cache space.</p><h2>Results</h2><p>Fig. 5 shows a case where this strategy works well compared to static allocation.  The server in question is running 5 VMs, each running a different application:</p><ul><li><p>VM1 - 32 cores</p></li><li><p>VM2 - 16 cores (but doesn&#8217;t fully utilize those cores)</p></li><li><p>VM3 - 8 cores</p></li><li><p>VM4 - 4 cores</p></li><li><p>VM5 - 4 cores</p></li></ul><p>The top of Fig. 5 shows a simple static partitioning of LLC ways.  VM1 is assigned to 6 ways, VM2 is assigned to 3 ways, VM3 is assigned to 2 ways, and VMs 4 and 5 must share 1 way.  
They have to share because allocating the LLC in units of whole ways is inherently coarse-grained.  </p><p>The two charts show measured LLC utilization over 10 minutes.  Notice the Y-axis.  The technique described in this paper (Cacheman) allows VM4 and VM5 to use far more aggregate LLC capacity than the static partitioning does.  Also notice that in the static partitioning, VM5 always uses more LLC than VM4 (because they are running different applications), whereas Cacheman allows for a more even balance between them.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Nee1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88186331-5fd5-4d38-bcb5-07bd0a55ec0b_724x646.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Nee1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88186331-5fd5-4d38-bcb5-07bd0a55ec0b_724x646.png 424w, https://substackcdn.com/image/fetch/$s_!Nee1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88186331-5fd5-4d38-bcb5-07bd0a55ec0b_724x646.png 848w, https://substackcdn.com/image/fetch/$s_!Nee1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88186331-5fd5-4d38-bcb5-07bd0a55ec0b_724x646.png 1272w, https://substackcdn.com/image/fetch/$s_!Nee1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88186331-5fd5-4d38-bcb5-07bd0a55ec0b_724x646.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!Nee1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88186331-5fd5-4d38-bcb5-07bd0a55ec0b_724x646.png" width="516" height="460.4088397790055" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/88186331-5fd5-4d38-bcb5-07bd0a55ec0b_724x646.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:646,&quot;width&quot;:724,&quot;resizeWidth&quot;:516,&quot;bytes&quot;:114965,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://danglingpointers.substack.com/i/187818327?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88186331-5fd5-4d38-bcb5-07bd0a55ec0b_724x646.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Nee1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88186331-5fd5-4d38-bcb5-07bd0a55ec0b_724x646.png 424w, https://substackcdn.com/image/fetch/$s_!Nee1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88186331-5fd5-4d38-bcb5-07bd0a55ec0b_724x646.png 848w, https://substackcdn.com/image/fetch/$s_!Nee1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88186331-5fd5-4d38-bcb5-07bd0a55ec0b_724x646.png 1272w, https://substackcdn.com/image/fetch/$s_!Nee1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88186331-5fd5-4d38-bcb5-07bd0a55ec0b_724x646.png 
1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Source: <a href="https://dl.acm.org/doi/10.1145/3774934.3786415">https://dl.acm.org/doi/10.1145/3774934.3786415</a></figcaption></figure></div><h2>Dangling Pointers</h2><p>While the L3 cache is logically a monolithic shared resource, it is physically partitioned across the chip (with a separate slice near each core).  
It seems like it could be more efficient if VMs could be assigned to nearby L3 slices rather than L3 ways.</p>]]></content:encoded></item><item><title><![CDATA[Hapax Locks: Scalable Value-Based Mutual Exclusion]]></title><description><![CDATA[Spin locks with less coherence traffic]]></description><link>https://danglingpointers.substack.com/p/hapax-locks-scalable-value-based</link><guid isPermaLink="false">https://danglingpointers.substack.com/p/hapax-locks-scalable-value-based</guid><dc:creator><![CDATA[Blake Pelton]]></dc:creator><pubDate>Thu, 05 Mar 2026 13:03:52 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!F4f1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F463df60c-f173-4d8f-a9ab-79ad13b3a406_1198x1132.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a href="https://dl.acm.org/doi/10.1145/3774934.3786443">Hapax Locks: Scalable Value-Based Mutual Exclusion</a> Dave Dice and Alex Kogan <em>PPoPP'26</em></p><p>This paper describes a locking algorithm intended for cases where spinning is acceptable (e.g., one thread per core systems).  It is similar to a <a href="https://en.wikipedia.org/wiki/Ticket_lock">ticket lock</a> but generates less coherence traffic.  
Each lock/unlock operation causes a <strong>constant</strong> number of cache lines to move between cores, regardless of the number of cores involved or how long they spin.</p><p>As we&#8217;ve seen in a <a href="https://danglingpointers.substack.com/p/re-architecting-end-host-networking">previous paper</a>, polling a value in memory is cheap if the cache line is already local to the polling core.</p><h2>Basic Algorithm</h2><p>A Hapax lock comprises two 64-bit fields:</p><ul><li><p><code>Arrive</code></p></li><li><p><code>Depart</code></p></li></ul><p>Additionally, there is a global (shared among all Hapax locks) 64-bit sequence number.  Each time a thread attempts to lock a Hapax lock, it generates a <em>Hapax value</em> which uniquely identifies the <em>locking episode</em>.  A locking episode is a single lock/unlock sequence performed by a specific thread.  A Hapax value is generated by atomically incrementing the sequence number.  It is assumed that the 64-bit counter will never overflow.</p><p>Next, the locking thread atomically exchanges the value of <code>Arrive</code> with the Hapax value it just generated.  This exchange operation establishes a total order among the Hapax values used with this lock.  It is a way for threads to cooperatively decide the order in which they will acquire the lock.</p><p>Say thread <code>A</code> generates Hapax value <code>N</code> and stores it into <code>Arrive</code> (via an atomic exchange operation).  Next, thread <code>B</code> generates Hapax value <code>M</code> and atomically exchanges the value of <code>Arrive</code> with <code>M</code>.  The result of the exchange operation performed by <code>B</code> will be <code>N</code>.  
At this point, thread <code>B</code> knows that it is directly behind thread <code>A</code> in the queue and must wait for thread <code>A</code> to release the lock.</p><p>To finish acquiring the lock, a thread continually polls <code>Depart</code>, waiting for it to equal the Hapax value of the preceding locking episode.  In the example above, thread <code>B</code> polls <code>Depart</code> until it sees the value <code>N</code>.  At this point, the lock has been acquired.  Unlocking is implemented by storing the Hapax value used by the unlocking thread into <code>Depart</code>.  In the running example, thread <code>B</code> would unlock the lock by storing the value <code>M</code> into <code>Depart</code>.</p><p>As described so far, this algorithm generates a lot of coherence traffic.  In particular, the cache line which holds the sequence number would move between cores each time a new Hapax value is generated.  Also, each store to <code>Depart</code> would send coherence traffic to each core which had recently polled the value of <code>Depart</code>.  The paper presents two techniques to address these issues.</p><h2>Hapax Value Amortization</h2><p>While the sequence number monotonically increases, the values stored in <code>Arrive</code> and <code>Depart</code> do not.  There are two reasons for this.  First, a single sequence number is shared among all Hapax locks.  Second, multiple threads can generate Hapax values and then race to perform the atomic exchange operation.  For example, thread <code>A</code> could generate a Hapax value of <code>N</code> while thread <code>B</code> generates a Hapax value of <code>N+1</code>.  They then race each other to atomically exchange their Hapax value with the value of <code>Arrive</code>.  
If thread <code>B</code> wins the race, then <code>Arrive</code> will first take on the value <code>N+1</code>, and then later it will have the value <code>N</code>.</p><p>Once you realize that the values of <code>Arrive</code> and <code>Depart</code> are not monotonically increasing, it is straightforward to see how the generation of Hapax values can be made cheap.  The algorithm only ever compares Hapax values for equality, so they must be unique but need not be handed out in order.  A thread can hoard a batch of Hapax values with a single atomic add operation.  For example, a thread could atomically increase the value of the sequence number by 1024.  At this point, the thread has allocated 1024 Hapax values for itself that it can use in the future without accessing the cache line which holds the shared sequence number.  The paper proposes allocating Hapax values in blocks of 64K.</p><h2>Depart Amortization</h2><p>The paper proposes adding an array which plays a role similar to <code>Depart</code>.  The number of elements in the array should be greater than the number of cores (the paper uses an array of 4096 values).  Like the sequence number, this array is shared among all Hapax locks.  </p><p>When a thread writes its Hapax value into <code>Depart</code>, the thread also stores its Hapax value into one of the 4096 elements.  The array index is determined by the Hapax value.  Many potential hash functions could be used.  The paper proposes hashing bits [27:16] of the Hapax value (12 bits, matching the 4096 elements).  The low bit position of 16 is related to the allocator block size (64K is 2<sup>16</sup>).</p><p>In the locking sequence, a thread loads the value of <code>Depart</code> once.  If the value of <code>Depart</code> does not match the expected Hapax value, then the locking thread polls the element of the shared array selected by the expected Hapax value.  The thread polls this element until its value changes.  If the new value is the expected value of <code>Depart</code>, then the lock has been acquired.  If not, then a hash collision has occurred (e.g., a locking episode associated with a different Hapax lock caused the value to be updated).  
In this case, the thread starts over by checking <code>Depart</code> and then polling the array element if necessary.</p><p>This scheme minimizes coherence traffic associated with polling.  When an unlocking core stores a value into an array element, the associated cache line will typically be present only in the cache of the next core in line.  Coherence traffic therefore involves only the unlocking core and the core that acquires the lock next.  Threads further back in the line poll other array elements and thus load from other cache lines, so the cores those threads are running on won&#8217;t see the coherence messages.</p><h2>Results</h2><p>Fig. 3 has results from a microbenchmark.  Hapax locks scale much better than ticket locks and go head-to-head with other state-of-the-art locking algorithms.  The Hapax implementation is so concise (about 100 lines) that the authors included C++ source code in the paper.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!F4f1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F463df60c-f173-4d8f-a9ab-79ad13b3a406_1198x1132.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!F4f1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F463df60c-f173-4d8f-a9ab-79ad13b3a406_1198x1132.png 424w, https://substackcdn.com/image/fetch/$s_!F4f1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F463df60c-f173-4d8f-a9ab-79ad13b3a406_1198x1132.png 848w, 
https://substackcdn.com/image/fetch/$s_!F4f1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F463df60c-f173-4d8f-a9ab-79ad13b3a406_1198x1132.png 1272w, https://substackcdn.com/image/fetch/$s_!F4f1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F463df60c-f173-4d8f-a9ab-79ad13b3a406_1198x1132.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!F4f1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F463df60c-f173-4d8f-a9ab-79ad13b3a406_1198x1132.png" width="1198" height="1132" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/463df60c-f173-4d8f-a9ab-79ad13b3a406_1198x1132.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1132,&quot;width&quot;:1198,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:196086,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://danglingpointers.substack.com/i/188224627?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F463df60c-f173-4d8f-a9ab-79ad13b3a406_1198x1132.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!F4f1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F463df60c-f173-4d8f-a9ab-79ad13b3a406_1198x1132.png 424w, 
https://substackcdn.com/image/fetch/$s_!F4f1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F463df60c-f173-4d8f-a9ab-79ad13b3a406_1198x1132.png 848w, https://substackcdn.com/image/fetch/$s_!F4f1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F463df60c-f173-4d8f-a9ab-79ad13b3a406_1198x1132.png 1272w, https://substackcdn.com/image/fetch/$s_!F4f1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F463df60c-f173-4d8f-a9ab-79ad13b3a406_1198x1132.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Source: https://dl.acm.org/doi/10.1145/3774934.3786443</figcaption></figure></div><h2>Dangling Pointers</h2><p>The big downside of spinning is that it wastes cycles in the case where there are other threads that the OS could schedule.  I wonder if there is a lightweight coordination mechanism available.  For example, the OS could write scheduling information into memory that is mapped read-only into user space.  This could be used to communicate to the spinning code whether or not there are other threads ready to run.</p>]]></content:encoded></item><item><title><![CDATA[Scalar Interpolation: A Better Balance between Vector and Scalar Execution for SuperScalar Architectures]]></title><description><![CDATA[Don't neglect your scalar ALUs]]></description><link>https://danglingpointers.substack.com/p/scalar-interpolation-a-better-balance</link><guid isPermaLink="false">https://danglingpointers.substack.com/p/scalar-interpolation-a-better-balance</guid><dc:creator><![CDATA[Blake Pelton]]></dc:creator><pubDate>Tue, 03 Mar 2026 13:03:03 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!xrsy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6885011e-143a-4c06-901a-7001b885ad79_913x763.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a href="https://dl.acm.org/doi/10.1145/3696443.3708950">Scalar Interpolation: A Better Balance between Vector and Scalar Execution for SuperScalar 
Architectures</a> Reza Ghanbari, Henry Kao, Jo&#227;o P. L. De Carvalho, Ehsan Amiri, and J. Nelson Amaral <em>CGO'25</em></p><p>This paper serves as a warning: don&#8217;t go overboard with vector instructions.  There is a non-trivial amount of performance to be had by balancing compute between scalar and vector instructions.  Even if you fear that automatic vectorization is fragile, this paper has some interesting lessons.</p><h2>Vectorization Example</h2><p>Listing 1 contains a vectorizable loop and listing 2 shows a vectorized implementation:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UzgV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F028153ae-6640-42d6-9c8f-6f9f61a1152d_883x208.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!UzgV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F028153ae-6640-42d6-9c8f-6f9f61a1152d_883x208.png 424w, https://substackcdn.com/image/fetch/$s_!UzgV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F028153ae-6640-42d6-9c8f-6f9f61a1152d_883x208.png 848w, https://substackcdn.com/image/fetch/$s_!UzgV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F028153ae-6640-42d6-9c8f-6f9f61a1152d_883x208.png 1272w, https://substackcdn.com/image/fetch/$s_!UzgV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F028153ae-6640-42d6-9c8f-6f9f61a1152d_883x208.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!UzgV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F028153ae-6640-42d6-9c8f-6f9f61a1152d_883x208.png" width="883" height="208" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/028153ae-6640-42d6-9c8f-6f9f61a1152d_883x208.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:208,&quot;width&quot;:883,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:29647,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://danglingpointers.substack.com/i/187023198?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F028153ae-6640-42d6-9c8f-6f9f61a1152d_883x208.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!UzgV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F028153ae-6640-42d6-9c8f-6f9f61a1152d_883x208.png 424w, https://substackcdn.com/image/fetch/$s_!UzgV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F028153ae-6640-42d6-9c8f-6f9f61a1152d_883x208.png 848w, https://substackcdn.com/image/fetch/$s_!UzgV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F028153ae-6640-42d6-9c8f-6f9f61a1152d_883x208.png 1272w, https://substackcdn.com/image/fetch/$s_!UzgV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F028153ae-6640-42d6-9c8f-6f9f61a1152d_883x208.png 1456w" 
sizes="100vw" fetchpriority="high"></picture><div></div></div></a><figcaption class="image-caption">Source: <a href="https://dl.acm.org/doi/10.1145/3696443.3708950">https://dl.acm.org/doi/10.1145/3696443.3708950</a></figcaption></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xrsy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6885011e-143a-4c06-901a-7001b885ad79_913x763.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xrsy!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6885011e-143a-4c06-901a-7001b885ad79_913x763.png 424w, https://substackcdn.com/image/fetch/$s_!xrsy!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6885011e-143a-4c06-901a-7001b885ad79_913x763.png 848w, https://substackcdn.com/image/fetch/$s_!xrsy!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6885011e-143a-4c06-901a-7001b885ad79_913x763.png 1272w, https://substackcdn.com/image/fetch/$s_!xrsy!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6885011e-143a-4c06-901a-7001b885ad79_913x763.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xrsy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6885011e-143a-4c06-901a-7001b885ad79_913x763.png" width="913" height="763" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6885011e-143a-4c06-901a-7001b885ad79_913x763.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:763,&quot;width&quot;:913,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:85240,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://danglingpointers.substack.com/i/187023198?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6885011e-143a-4c06-901a-7001b885ad79_913x763.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!xrsy!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6885011e-143a-4c06-901a-7001b885ad79_913x763.png 424w, https://substackcdn.com/image/fetch/$s_!xrsy!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6885011e-143a-4c06-901a-7001b885ad79_913x763.png 848w, https://substackcdn.com/image/fetch/$s_!xrsy!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6885011e-143a-4c06-901a-7001b885ad79_913x763.png 1272w, https://substackcdn.com/image/fetch/$s_!xrsy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6885011e-143a-4c06-901a-7001b885ad79_913x763.png 1456w" sizes="100vw"></picture></div></a><figcaption class="image-caption">Source: <a href="https://dl.acm.org/doi/10.1145/3696443.3708950">https://dl.acm.org/doi/10.1145/3696443.3708950</a></figcaption></figure></div><p>After achieving this result, one may be tempted to pat oneself on the back and call it a day.  If you were a workaholic, you might profile the optimized code.  
If you did, you would see something like the data in table 1:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7y3N!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5339898-45c0-489d-a307-ab8ea8d23572_830x757.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7y3N!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5339898-45c0-489d-a307-ab8ea8d23572_830x757.png 424w, https://substackcdn.com/image/fetch/$s_!7y3N!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5339898-45c0-489d-a307-ab8ea8d23572_830x757.png 848w, https://substackcdn.com/image/fetch/$s_!7y3N!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5339898-45c0-489d-a307-ab8ea8d23572_830x757.png 1272w, https://substackcdn.com/image/fetch/$s_!7y3N!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5339898-45c0-489d-a307-ab8ea8d23572_830x757.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7y3N!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5339898-45c0-489d-a307-ab8ea8d23572_830x757.png" width="527" height="480.64939759036145" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f5339898-45c0-489d-a307-ab8ea8d23572_830x757.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:757,&quot;width&quot;:830,&quot;resizeWidth&quot;:527,&quot;bytes&quot;:96255,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://danglingpointers.substack.com/i/187023198?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5339898-45c0-489d-a307-ab8ea8d23572_830x757.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!7y3N!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5339898-45c0-489d-a307-ab8ea8d23572_830x757.png 424w, https://substackcdn.com/image/fetch/$s_!7y3N!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5339898-45c0-489d-a307-ab8ea8d23572_830x757.png 848w, https://substackcdn.com/image/fetch/$s_!7y3N!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5339898-45c0-489d-a307-ab8ea8d23572_830x757.png 1272w, https://substackcdn.com/image/fetch/$s_!7y3N!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5339898-45c0-489d-a307-ab8ea8d23572_830x757.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 
20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Source: <a href="https://dl.acm.org/doi/10.1145/3696443.3708950">https://dl.acm.org/doi/10.1145/3696443.3708950</a></figcaption></figure></div><p>And you could conclude that this algorithm is compute-bound.  But what do we really mean by &#8220;compute-bound&#8221;?  A processor contains many execution ports, each with a unique set of capabilities.</p><p>In the running example, the execution ports capable of vector multiplication and addition are fully booked, but the other ports are sitting mostly idle!</p><h2>Scalar Interpolation</h2><p>Listing 3 shows a modified loop which tries to balance the load between the vector and scalar execution ports.  Each loop iteration processes 9 elements (8 via vector instructions, and 1 via scalar instructions).  
This assumes that the processor supports fast unaligned vector loads and stores.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Twv6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15ded3a5-bd41-4264-902e-c7548d7c71cb_940x822.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Twv6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15ded3a5-bd41-4264-902e-c7548d7c71cb_940x822.png 424w, https://substackcdn.com/image/fetch/$s_!Twv6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15ded3a5-bd41-4264-902e-c7548d7c71cb_940x822.png 848w, https://substackcdn.com/image/fetch/$s_!Twv6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15ded3a5-bd41-4264-902e-c7548d7c71cb_940x822.png 1272w, https://substackcdn.com/image/fetch/$s_!Twv6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15ded3a5-bd41-4264-902e-c7548d7c71cb_940x822.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Twv6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15ded3a5-bd41-4264-902e-c7548d7c71cb_940x822.png" width="940" height="822" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/15ded3a5-bd41-4264-902e-c7548d7c71cb_940x822.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:822,&quot;width&quot;:940,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:99442,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://danglingpointers.substack.com/i/187023198?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15ded3a5-bd41-4264-902e-c7548d7c71cb_940x822.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Twv6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15ded3a5-bd41-4264-902e-c7548d7c71cb_940x822.png 424w, https://substackcdn.com/image/fetch/$s_!Twv6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15ded3a5-bd41-4264-902e-c7548d7c71cb_940x822.png 848w, https://substackcdn.com/image/fetch/$s_!Twv6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15ded3a5-bd41-4264-902e-c7548d7c71cb_940x822.png 1272w, https://substackcdn.com/image/fetch/$s_!Twv6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15ded3a5-bd41-4264-902e-c7548d7c71cb_940x822.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Source: <a href="https://dl.acm.org/doi/10.1145/3696443.3708950">https://dl.acm.org/doi/10.1145/3696443.3708950</a></figcaption></figure></div><p>Section 3 has details on how to change LLVM to get it to do this transformation.</p><h2>Results</h2><p>Fig. 3 shows benchmark results.  
By my calculations, the geometric mean of the speedups is 8%.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tJ9m!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff06f3735-4c8c-4ad2-a363-1715f6025e9a_1823x636.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!tJ9m!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff06f3735-4c8c-4ad2-a363-1715f6025e9a_1823x636.png 424w, https://substackcdn.com/image/fetch/$s_!tJ9m!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff06f3735-4c8c-4ad2-a363-1715f6025e9a_1823x636.png 848w, https://substackcdn.com/image/fetch/$s_!tJ9m!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff06f3735-4c8c-4ad2-a363-1715f6025e9a_1823x636.png 1272w, https://substackcdn.com/image/fetch/$s_!tJ9m!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff06f3735-4c8c-4ad2-a363-1715f6025e9a_1823x636.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!tJ9m!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff06f3735-4c8c-4ad2-a363-1715f6025e9a_1823x636.png" width="1456" height="508" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f06f3735-4c8c-4ad2-a363-1715f6025e9a_1823x636.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:508,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:69545,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://danglingpointers.substack.com/i/187023198?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff06f3735-4c8c-4ad2-a363-1715f6025e9a_1823x636.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!tJ9m!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff06f3735-4c8c-4ad2-a363-1715f6025e9a_1823x636.png 424w, https://substackcdn.com/image/fetch/$s_!tJ9m!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff06f3735-4c8c-4ad2-a363-1715f6025e9a_1823x636.png 848w, https://substackcdn.com/image/fetch/$s_!tJ9m!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff06f3735-4c8c-4ad2-a363-1715f6025e9a_1823x636.png 1272w, https://substackcdn.com/image/fetch/$s_!tJ9m!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff06f3735-4c8c-4ad2-a363-1715f6025e9a_1823x636.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Source: <a href="https://dl.acm.org/doi/10.1145/3696443.3708950">https://dl.acm.org/doi/10.1145/3696443.3708950</a></figcaption></figure></div><h2>Dangling Pointers</h2><p>This paper builds on top of automatic vectorization.  In other words, the input source code is scalar and the compiler vectorizes loops while balancing the workload.  
An alternative would be to have the source code in a vectorized form and then let the compiler &#8220;devectorize&#8221; where it makes sense.</p><p></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://danglingpointers.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://danglingpointers.substack.com/subscribe?"><span>Subscribe now</span></a></p><p><br></p>]]></content:encoded></item><item><title><![CDATA[Flexible I/O for Database Management Systems with xNVMe]]></title><description><![CDATA[One storage API to rule them all]]></description><link>https://danglingpointers.substack.com/p/flexible-io-for-database-management</link><guid isPermaLink="false">https://danglingpointers.substack.com/p/flexible-io-for-database-management</guid><dc:creator><![CDATA[Blake Pelton]]></dc:creator><pubDate>Thu, 26 Feb 2026 13:10:22 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!CtdL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bb8a6a7-846e-4f91-a22f-57282f954b00_887x683.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a href="https://www.cidrdb.org/cidr2026/papers/p6-houlborg.pdf">Flexible I/O for Database Management Systems with xNVMe</a> Emil Houlborg, Simon A. F. Lund, Marcel Weisgut, Tilmann Rabl, Javier Gonz&#225;lez, Vivek Shah, P&#305;nar T&#246;z&#252;n <em>CIDR&#8217;26</em></p><p>This paper describes <a href="https://xnvme.io">xNVMe</a>, a storage library (developed by Samsung), and demonstrates how it can be integrated into DuckDB.  </p><h2>xNVMe</h2><p>Section 2 contains the hard sell for <code>xNVMe</code>.  The &#8220;x&#8221; prefix serves a similar role to the &#8220;X&#8221; in DirectX.  It is fast, while also being portable across operating systems and storage devices.  
</p><p>The C API will feel like home for folks who have experience with low-level graphics APIs (no shaders on the disk yet, sorry).  There are APIs to open a handle to a device, allocate buffers, and submit NVMe commands (synchronously or asynchronously).  Listing 3 has an example, which feels like &#8220;Mantle for NVMe&#8221;:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8Amk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff857145d-9aec-4cfc-b2f4-326449dce34b_846x241.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8Amk!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff857145d-9aec-4cfc-b2f4-326449dce34b_846x241.png 424w, https://substackcdn.com/image/fetch/$s_!8Amk!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff857145d-9aec-4cfc-b2f4-326449dce34b_846x241.png 848w, https://substackcdn.com/image/fetch/$s_!8Amk!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff857145d-9aec-4cfc-b2f4-326449dce34b_846x241.png 1272w, https://substackcdn.com/image/fetch/$s_!8Amk!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff857145d-9aec-4cfc-b2f4-326449dce34b_846x241.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8Amk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff857145d-9aec-4cfc-b2f4-326449dce34b_846x241.png" width="846" height="241" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f857145d-9aec-4cfc-b2f4-326449dce34b_846x241.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:241,&quot;width&quot;:846,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:51719,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://danglingpointers.substack.com/i/186325717?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff857145d-9aec-4cfc-b2f4-326449dce34b_846x241.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!8Amk!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff857145d-9aec-4cfc-b2f4-326449dce34b_846x241.png 424w, https://substackcdn.com/image/fetch/$s_!8Amk!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff857145d-9aec-4cfc-b2f4-326449dce34b_846x241.png 848w, https://substackcdn.com/image/fetch/$s_!8Amk!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff857145d-9aec-4cfc-b2f4-326449dce34b_846x241.png 1272w, https://substackcdn.com/image/fetch/$s_!8Amk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff857145d-9aec-4cfc-b2f4-326449dce34b_846x241.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Source: <a href="https://www.cidrdb.org/cidr2026/papers/p6-houlborg.pdf">https://www.cidrdb.org/cidr2026/papers/p6-houlborg.pdf</a></figcaption></figure></div><p>The <code>xNVMe</code> API works on Linux, FreeBSD, Windows, and macOS.  Some operating systems have multiple backends available (e.g., <code>libaio</code>, <code>io_uring</code>).</p><h2>DuckDB</h2><p>The point of this paper is that it is easy to drop <code>xNVMe</code> into an existing application.  The paper describes <code>nvmefs</code>, which is an implementation of the DuckDB <code>FileSystem</code> interface and uses <code>xNVMe</code>.  
<code>nvmefs</code> creates dedicated <code>xNVMe</code> queues for each DuckDB worker thread to avoid synchronization (similar tricks are used by applications calling graphics APIs in parallel).</p><p>The paper also describes how <code>xNVMe</code> supports shiny new NVMe features like <em>Flexible Data Placement</em> (FDP).  FDP allows DuckDB to pass hints to the SSD to colocate buffers with similar lifetimes (which improves garbage collection performance).</p><h2>Results</h2><p>Most of the results in the paper show comparable performance for <code>xNVMe</code> vs. the baseline DuckDB filesystem.  Fig. 5 shows one benchmark where <code>xNVMe</code> yields a significant improvement:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CtdL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bb8a6a7-846e-4f91-a22f-57282f954b00_887x683.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CtdL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bb8a6a7-846e-4f91-a22f-57282f954b00_887x683.png 424w, https://substackcdn.com/image/fetch/$s_!CtdL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bb8a6a7-846e-4f91-a22f-57282f954b00_887x683.png 848w, https://substackcdn.com/image/fetch/$s_!CtdL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bb8a6a7-846e-4f91-a22f-57282f954b00_887x683.png 1272w, 
https://substackcdn.com/image/fetch/$s_!CtdL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bb8a6a7-846e-4f91-a22f-57282f954b00_887x683.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CtdL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bb8a6a7-846e-4f91-a22f-57282f954b00_887x683.png" width="887" height="683" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7bb8a6a7-846e-4f91-a22f-57282f954b00_887x683.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:683,&quot;width&quot;:887,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:156777,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://danglingpointers.substack.com/i/186325717?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bb8a6a7-846e-4f91-a22f-57282f954b00_887x683.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!CtdL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bb8a6a7-846e-4f91-a22f-57282f954b00_887x683.png 424w, https://substackcdn.com/image/fetch/$s_!CtdL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bb8a6a7-846e-4f91-a22f-57282f954b00_887x683.png 848w, 
https://substackcdn.com/image/fetch/$s_!CtdL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bb8a6a7-846e-4f91-a22f-57282f954b00_887x683.png 1272w, https://substackcdn.com/image/fetch/$s_!CtdL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bb8a6a7-846e-4f91-a22f-57282f954b00_887x683.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Source: <a 
href="https://www.cidrdb.org/cidr2026/papers/p6-houlborg.pdf">https://www.cidrdb.org/cidr2026/papers/p6-houlborg.pdf</a></figcaption></figure></div><h2>Dangling Pointers</h2><p>I think the long-term success of <code>xNVMe</code> will depend on governance.  Potential members of the <code>xNVMe</code> ecosystem could be scared off by Samsung&#8217;s potential conflict of interest (i.e., will Samsung privilege Samsung SSDs in some way?)  There is a delicate balancing act between an API driven by a sluggish bureaucratic committee, and an API which is dominated by one vendor.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://danglingpointers.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://danglingpointers.substack.com/subscribe?"><span>Subscribe now</span></a></p><p></p>]]></content:encoded></item><item><title><![CDATA[A 1.27 fJ/B/transition Digital Compute-in-Memory Architecture for Non-Deterministic Finite Automata Evaluation]]></title><description><![CDATA[Bloom filters for faster regex evaluation]]></description><link>https://danglingpointers.substack.com/p/a-127-fjbtransition-digital-compute</link><guid isPermaLink="false">https://danglingpointers.substack.com/p/a-127-fjbtransition-digital-compute</guid><dc:creator><![CDATA[Blake Pelton]]></dc:creator><pubDate>Tue, 24 Feb 2026 12:15:08 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!VZvt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcad142a8-998d-4285-a877-f672274c3f08_780x549.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a href="https://dl.acm.org/doi/10.1145/3716368.3735157">A 1.27 fJ/B/transition Digital Compute-in-Memory Architecture for Non-Deterministic Finite Automata Evaluation</a> Christian Lanius, Florian Freye, and Tobias 
Gemmeke <em>GLVLSI'25</em></p><p>This paper ostensibly describes an ASIC accelerator for NFA evaluation (e.g., regex matching), but it also describes two orthogonal techniques for optimizing NFA evaluation that are applicable to more than just this ASIC.</p><h2>NFA Primer</h2><p>Any regular expression can be converted to a <em><a href="https://en.wikipedia.org/wiki/Nondeterministic_finite_automaton">non-deterministic finite automaton</a> (NFA)</em>.  Think of an NFA as a state machine where some inputs can trigger multiple transitions.  The state machine is defined by a set of <em>transitions</em>.  A transition is an (<code>input symbol</code>, <code>current state</code>, <code>next state</code>) tuple.  The &#8220;non-deterministic&#8221; name comes from the fact that multiple tuples may exist with identical (<code>input symbol</code>, <code>current state</code>) values; they differ only in their <code>next state</code> values.  This means that an NFA can be in multiple states at once.</p><p>One way to evaluate an NFA is to use a bitmap to track the set of active states.  For each new input symbol, the set of active states in the bitmap is used to determine which transitions apply.  Each activated transition sets one bit in the bitmap used to represent the active states for the next input symbol.</p><h2>Compute-in-Memory</h2><p>The hardware described in this paper uses a <em>compute-in-memory</em> (CIM) microarchitecture.  A set of columns stores the state machine, with each column storing one transition.  Dedicating a column to each transition assumes that the transition function is sparse (i.e., the number of transitions used is much lower than the maximum possible).  During initialization, the transitions are written into the CIM hardware.</p><p>An input symbol is processed by broadcasting it and the current state bitmap to all columns.  All columns evaluate whether their transition should be activated.  
The hardware then iterates (over multiple clock cycles) over all activated transitions and updates the state bitmap for the next input symbol.</p><p>The left side of Fig. 5 illustrates the hardware in each column, which compares the input symbol and current state against the stored tuple:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!VZvt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcad142a8-998d-4285-a877-f672274c3f08_780x549.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!VZvt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcad142a8-998d-4285-a877-f672274c3f08_780x549.png 424w, https://substackcdn.com/image/fetch/$s_!VZvt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcad142a8-998d-4285-a877-f672274c3f08_780x549.png 848w, https://substackcdn.com/image/fetch/$s_!VZvt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcad142a8-998d-4285-a877-f672274c3f08_780x549.png 1272w, https://substackcdn.com/image/fetch/$s_!VZvt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcad142a8-998d-4285-a877-f672274c3f08_780x549.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!VZvt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcad142a8-998d-4285-a877-f672274c3f08_780x549.png" width="780" height="549" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cad142a8-998d-4285-a877-f672274c3f08_780x549.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:549,&quot;width&quot;:780,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:108941,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://danglingpointers.substack.com/i/184797496?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcad142a8-998d-4285-a877-f672274c3f08_780x549.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!VZvt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcad142a8-998d-4285-a877-f672274c3f08_780x549.png 424w, https://substackcdn.com/image/fetch/$s_!VZvt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcad142a8-998d-4285-a877-f672274c3f08_780x549.png 848w, https://substackcdn.com/image/fetch/$s_!VZvt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcad142a8-998d-4285-a877-f672274c3f08_780x549.png 1272w, https://substackcdn.com/image/fetch/$s_!VZvt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcad142a8-998d-4285-a877-f672274c3f08_780x549.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Source: <a href="https://dl.acm.org/doi/10.1145/3716368.3735157">https://dl.acm.org/doi/10.1145/3716368.3735157</a></figcaption></figure></div><p>The algorithm described above processes at most one input symbol per cycle (and it is slower for inputs that activate multiple transitions).  The paper contains two tricks for overcoming this limitation.</p><h2>Cool Trick #1 - Two Symbols Per Cycle</h2><p>Fig. 4 illustrates how an NFA that accepts one symbol per cycle can be converted into an NFA which accepts two symbols per cycle.  For example, rather than consider <code>a</code> and <code>b</code> to be separate symbols, put them together into one mega-symbol: <code>ab</code>.  
This is feasible as long as your NFA implementation isn&#8217;t too sensitive to the number of bits per symbol.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!uZoh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10ca4b97-a996-4cdc-b477-83c2feddf9f3_927x372.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!uZoh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10ca4b97-a996-4cdc-b477-83c2feddf9f3_927x372.png 424w, https://substackcdn.com/image/fetch/$s_!uZoh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10ca4b97-a996-4cdc-b477-83c2feddf9f3_927x372.png 848w, https://substackcdn.com/image/fetch/$s_!uZoh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10ca4b97-a996-4cdc-b477-83c2feddf9f3_927x372.png 1272w, https://substackcdn.com/image/fetch/$s_!uZoh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10ca4b97-a996-4cdc-b477-83c2feddf9f3_927x372.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!uZoh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10ca4b97-a996-4cdc-b477-83c2feddf9f3_927x372.png" width="927" height="372" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/10ca4b97-a996-4cdc-b477-83c2feddf9f3_927x372.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:372,&quot;width&quot;:927,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:61459,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://danglingpointers.substack.com/i/184797496?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10ca4b97-a996-4cdc-b477-83c2feddf9f3_927x372.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!uZoh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10ca4b97-a996-4cdc-b477-83c2feddf9f3_927x372.png 424w, https://substackcdn.com/image/fetch/$s_!uZoh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10ca4b97-a996-4cdc-b477-83c2feddf9f3_927x372.png 848w, https://substackcdn.com/image/fetch/$s_!uZoh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10ca4b97-a996-4cdc-b477-83c2feddf9f3_927x372.png 1272w, https://substackcdn.com/image/fetch/$s_!uZoh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10ca4b97-a996-4cdc-b477-83c2feddf9f3_927x372.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Source: <a href="https://dl.acm.org/doi/10.1145/3716368.3735157">https://dl.acm.org/doi/10.1145/3716368.3735157</a></figcaption></figure></div><h2>Cool Trick #2 - Bloom Filter</h2><p>The target application for this hardware is monitoring network traffic for threats (e.g., <a href="https://www.snort.org/">Snort</a>).  A key observation is that most inputs (network packets) do not produce a match, so it is reasonable to assume that most of the time the NFA will be in the initial state, and most input symbols will not trigger any transitions.</p><p>If that assumption holds, then a <a href="https://en.wikipedia.org/wiki/Bloom_filter">bloom filter</a> can be used to quickly skip many input symbols before they even reach the core NFA evaluation hardware.  
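</p><p>To make the idea concrete, here is a minimal Python sketch of such a pre-filter.  It builds the filter from the transitions that leave the initial state and then tests incoming symbols against it.  The filter size, the number of indices, and the use of BLAKE2 as the hash are assumptions made for illustration; they are not the parameters of the paper's hardware.</p>

```python
import hashlib

FILTER_BITS = 1024  # assumed filter size, not taken from the paper
NUM_INDICES = 3     # assumed value of "N" (indices derived from one hash)


def bit_indices(symbol):
    # Hash the symbol once, then carve the digest into NUM_INDICES 16-bit chunks.
    digest = hashlib.blake2b(symbol, digest_size=2 * NUM_INDICES).digest()
    return [
        int.from_bytes(digest[2 * k:2 * k + 2], "big") % FILTER_BITS
        for k in range(NUM_INDICES)
    ]


def build_filter(transitions, initial_state):
    # transitions: iterable of (input_symbol, current_state, next_state) tuples.
    bits = [False] * FILTER_BITS
    for symbol, current_state, _next_state in transitions:
        if current_state == initial_state:
            for k in bit_indices(symbol):
                bits[k] = True
    return bits


def may_trigger(bits, symbol):
    # False means the symbol definitely has no transition out of the initial state.
    return all(bits[k] for k in bit_indices(symbol))


# Hypothetical three-transition NFA; state 0 is the initial state.
transitions = [(b"a", 0, 1), (b"b", 1, 2), (b"c", 0, 3)]
bloom = build_filter(transitions, initial_state=0)
print(may_trigger(bloom, b"a"), may_trigger(bloom, b"c"))  # prints "True True"
```

<p>A negative answer from the filter is definitive, while a positive answer may be a false positive, so symbols that pass the filter still need the full NFA evaluation.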
</p><p>The bloom filter is built when the NFA transition function changes.  To build the bloom filter, iterate over each transition for which <code>(current state == initial state)</code> holds.  For each such transition, compute a hash of the input symbol, decompose the hashed value into <code>N</code> indices, and set the corresponding <code>N</code> bits in the bloom filter.</p><p>To test an input symbol against the bloom filter, hash the input symbol, decompose the hashed value into <code>N</code> indices, and check to see if all of the <code>N</code> corresponding bits are set in the bloom filter.  If any bit is not set, then the input symbol does not trigger a transition from the initial state.  When that symbol finally arrives at the NFA hardware, it can be dropped if the NFA is in the initial state.</p><h2>Results</h2><p>Table 1 compares PPA results against other published NFA accelerators.  It is a bit apples-to-oranges as the various designs target different technology nodes.  The metric that stands out is the low power consumption of this design.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rzdC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2e41199-91fe-4f3e-93e7-fc08566d8537_1413x468.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rzdC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2e41199-91fe-4f3e-93e7-fc08566d8537_1413x468.png 424w, https://substackcdn.com/image/fetch/$s_!rzdC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2e41199-91fe-4f3e-93e7-fc08566d8537_1413x468.png 848w, 
https://substackcdn.com/image/fetch/$s_!rzdC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2e41199-91fe-4f3e-93e7-fc08566d8537_1413x468.png 1272w, https://substackcdn.com/image/fetch/$s_!rzdC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2e41199-91fe-4f3e-93e7-fc08566d8537_1413x468.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!rzdC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2e41199-91fe-4f3e-93e7-fc08566d8537_1413x468.png" width="1413" height="468" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f2e41199-91fe-4f3e-93e7-fc08566d8537_1413x468.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:468,&quot;width&quot;:1413,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:122606,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://danglingpointers.substack.com/i/184797496?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2e41199-91fe-4f3e-93e7-fc08566d8537_1413x468.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!rzdC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2e41199-91fe-4f3e-93e7-fc08566d8537_1413x468.png 424w, 
https://substackcdn.com/image/fetch/$s_!rzdC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2e41199-91fe-4f3e-93e7-fc08566d8537_1413x468.png 848w, https://substackcdn.com/image/fetch/$s_!rzdC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2e41199-91fe-4f3e-93e7-fc08566d8537_1413x468.png 1272w, https://substackcdn.com/image/fetch/$s_!rzdC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2e41199-91fe-4f3e-93e7-fc08566d8537_1413x468.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Source: <a href="https://dl.acm.org/doi/10.1145/3716368.3735157">https://dl.acm.org/doi/10.1145/3716368.3735157</a></figcaption></figure></div><h2>Dangling Pointers</h2><p>I wonder if the bloom filter trick can be extended.  For example, rather than assuming the NFA will always be in the initial state, the hardware could dynamically compute which states are the most frequent and then use bloom filters to drop input symbols which cannot trigger any transitions from those states.</p>]]></content:encoded></item><item><title><![CDATA[Better Memory Tiering, Right from the First Placement]]></title><description><![CDATA[Tricks to avoid migration]]></description><link>https://danglingpointers.substack.com/p/better-memory-tiering-right-from</link><guid isPermaLink="false">https://danglingpointers.substack.com/p/better-memory-tiering-right-from</guid><dc:creator><![CDATA[Blake Pelton]]></dc:creator><pubDate>Thu, 19 Feb 2026 13:05:20 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/$s_!o8Dj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbeba80b7-1f05-47ad-a72d-49afe69bc4f9_883x661.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a href="https://dl.acm.org/doi/10.1145/3676151.3719378">Better Memory Tiering, Right from the First Placement</a> Jo&#227;o P&#243;voas, Jo&#227;o Barreto, Bartosz Chomi&#324;ski, Andr&#233; Gon&#231;alves, Fedar Karabeinikau, Maciej Maciejewski, Jakub Schmiegel, and Kostiantyn Storozhuk <em>ICPE'25</em></p><p>This paper addresses the <em>first placement problem</em> in systems with multiple tiers of memory (e.g., DRAM paired with HBM, or local DRAM paired with remote DRAM accessed over CXL).  </p><p>The paper cites plenty of prior work which dynamically migrates pages/allocations out of suboptimal memory tiers.  What is different about this paper is that it attempts to avoid placing data in a suboptimal tier in the first place.  The key insight is: <em>statistics from one allocation can be used to generate better placements for similar allocations which will occur in the future.</em></p><p>Fig. 3 offers insight into how much waste there is in a policy which initially places all pages into a fast tier and then migrates them to a slower tier if they are accessed infrequently.  
The figure shows results from one migration policy, applied to three benchmarks.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!o8Dj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbeba80b7-1f05-47ad-a72d-49afe69bc4f9_883x661.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!o8Dj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbeba80b7-1f05-47ad-a72d-49afe69bc4f9_883x661.png 424w, https://substackcdn.com/image/fetch/$s_!o8Dj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbeba80b7-1f05-47ad-a72d-49afe69bc4f9_883x661.png 848w, https://substackcdn.com/image/fetch/$s_!o8Dj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbeba80b7-1f05-47ad-a72d-49afe69bc4f9_883x661.png 1272w, https://substackcdn.com/image/fetch/$s_!o8Dj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbeba80b7-1f05-47ad-a72d-49afe69bc4f9_883x661.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!o8Dj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbeba80b7-1f05-47ad-a72d-49afe69bc4f9_883x661.png" width="883" height="661" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/beba80b7-1f05-47ad-a72d-49afe69bc4f9_883x661.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:661,&quot;width&quot;:883,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:75573,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://danglingpointers.substack.com/i/184787069?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbeba80b7-1f05-47ad-a72d-49afe69bc4f9_883x661.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!o8Dj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbeba80b7-1f05-47ad-a72d-49afe69bc4f9_883x661.png 424w, https://substackcdn.com/image/fetch/$s_!o8Dj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbeba80b7-1f05-47ad-a72d-49afe69bc4f9_883x661.png 848w, https://substackcdn.com/image/fetch/$s_!o8Dj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbeba80b7-1f05-47ad-a72d-49afe69bc4f9_883x661.png 1272w, https://substackcdn.com/image/fetch/$s_!o8Dj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbeba80b7-1f05-47ad-a72d-49afe69bc4f9_883x661.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Source: <a href="https://dl.acm.org/doi/10.1145/3676151.3719378">https://dl.acm.org/doi/10.1145/3676151.3719378</a></figcaption></figure></div><h2>Allocation Contexts</h2><p>This paper proposes gathering statistics for each <em>allocation context</em>.  An allocation context is defined by the source code location of the allocation, the call stack at the moment of allocation, and the size of the allocation.  If two allocations match on these attributes, then they are considered part of the same context.</p><p>The system hooks heap allocation functions (e.g., <code>malloc</code>, <code>free</code>) to track all outstanding allocations associated with each allocation context.  The x86 PMU event <code>MEM_TRANS_RETIRED.LOAD_LATENCY_GT_16</code> is used to determine how frequently each allocation context is accessed.  
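</p><p>As a rough illustration of this bookkeeping, the Python sketch below models an allocation context as a (call site, call stack, size) key, tracks live allocations per context, and maps a sampled address back to its context.  The class, the method names, and the example addresses are all hypothetical; the real system hooks the allocator and reads PMU samples rather than calling Python methods.</p>

```python
import bisect
from collections import defaultdict


class ContextTracker:
    """Toy model of per-context allocation tracking (names are hypothetical)."""

    def __init__(self):
        self.live = {}                       # base address -> (size, context)
        self.bases = []                      # sorted base addresses of live allocations
        self.live_bytes = defaultdict(int)   # context -> total live bytes
        self.samples = defaultdict(int)      # context -> number of sampled accesses

    def on_malloc(self, address, size, call_site, call_stack):
        # Two allocations share a context only if all three attributes match.
        context = (call_site, tuple(call_stack), size)
        self.live[address] = (size, context)
        bisect.insort(self.bases, address)
        self.live_bytes[context] += size
        return context

    def on_free(self, address):
        size, context = self.live.pop(address)
        self.bases.remove(address)
        self.live_bytes[context] -= size

    def on_sample(self, address):
        # Find the live allocation (if any) that contains the sampled address.
        i = bisect.bisect_right(self.bases, address) - 1
        if i == -1:
            return None
        base = self.bases[i]
        size, context = self.live[base]
        if address >= base + size:
            return None
        self.samples[context] += 1
        return context


tracker = ContextTracker()
ctx_a = tracker.on_malloc(0x1000, 256, "parse.c:42", ["main", "parse"])
ctx_b = tracker.on_malloc(0x2000, 256, "parse.c:42", ["main", "parse"])
hit = tracker.on_sample(0x1010)   # falls inside the first allocation
miss = tracker.on_sample(0x1100)  # one byte past the end of the first allocation
```

<p>Keying on the exact call stack and size keeps the lookup trivial, at the cost of treating nearly identical allocation sites as unrelated contexts.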
</p><p>A tidbit I learned from this paper is that some x86 performance monitoring features do more than just count events.  For example, <code>MEM_TRANS_RETIRED.LOAD_LATENCY_GT_16</code> randomly samples load operations whose latency exceeds 16 cycles and emits the accessed (virtual) address.  Given the accessed address, it is straightforward to map back to the associated allocation context.</p><p>The <em>hotness</em> of an allocation context is the frequency of these access events divided by the total size of all allocations in the context.  Time is divided into epochs.  Each epoch, the hotness of each allocation context is recalculated.  When a new allocation occurs, the hotness of its allocation context (from the previous epoch) is used to determine which memory tier to place the allocation into.</p><p>The system tracks only large allocations (at least 64 bytes).  For smaller allocations, the juice is not worth the squeeze.  These allocations are assumed to be short-lived and frequently accessed.</p><h2>Kernel Mode Backstop</h2><p>This paper also describes a kernel component which complements the user space policy described so far.  Whereas the user space code deals with allocations, the kernel code deals with pages.  This is useful for allocations whose pages are not accessed uniformly.  It is also useful for detecting and correcting suboptimal initial placements.</p><p>All PTEs associated with all allocations are continually scanned.  The <em>accessed</em> bit indicates whether a page has been accessed since the last scan.  The <em>dirty</em> bit indicates whether a page has been written since the last scan.  After 10 scans, the system has a pretty good idea of how frequently a page is accessed.  These statistics are used to migrate pages between fast and slow tiers.</p><h2>Results</h2><p>Fig. 8 shows execution time for three benchmarks.  
<code>hmalloc+Ambix</code> represents the user and kernel solutions described by this paper.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NFKr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35dab34e-4cc7-4f36-bb31-ce031cb7f249_707x600.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NFKr!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35dab34e-4cc7-4f36-bb31-ce031cb7f249_707x600.png 424w, https://substackcdn.com/image/fetch/$s_!NFKr!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35dab34e-4cc7-4f36-bb31-ce031cb7f249_707x600.png 848w, https://substackcdn.com/image/fetch/$s_!NFKr!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35dab34e-4cc7-4f36-bb31-ce031cb7f249_707x600.png 1272w, https://substackcdn.com/image/fetch/$s_!NFKr!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35dab34e-4cc7-4f36-bb31-ce031cb7f249_707x600.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NFKr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35dab34e-4cc7-4f36-bb31-ce031cb7f249_707x600.png" width="707" height="600" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/35dab34e-4cc7-4f36-bb31-ce031cb7f249_707x600.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:600,&quot;width&quot;:707,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:69918,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://danglingpointers.substack.com/i/184787069?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35dab34e-4cc7-4f36-bb31-ce031cb7f249_707x600.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!NFKr!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35dab34e-4cc7-4f36-bb31-ce031cb7f249_707x600.png 424w, https://substackcdn.com/image/fetch/$s_!NFKr!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35dab34e-4cc7-4f36-bb31-ce031cb7f249_707x600.png 848w, https://substackcdn.com/image/fetch/$s_!NFKr!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35dab34e-4cc7-4f36-bb31-ce031cb7f249_707x600.png 1272w, https://substackcdn.com/image/fetch/$s_!NFKr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35dab34e-4cc7-4f36-bb31-ce031cb7f249_707x600.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Source: <a href="https://dl.acm.org/doi/10.1145/3676151.3719378">https://dl.acm.org/doi/10.1145/3676151.3719378</a></figcaption></figure></div><h2>Dangling Pointers</h2><p>I wasn&#8217;t able to find details in the paper about how PTE scanning works without interfering with other parts of the OS.  For example, doesn&#8217;t the OS use the dirty bit to determine if it needs to write pages back to disk?  I assume the PTE scanning described in this paper must reset the dirty bit on each scan.</p><p>The definition of an allocation context seems ripe for optimization.  I suspect that allowing some variability in call stack or allocation size would allow for better statistics.  
Maybe this is a good use case for machine learning?</p>]]></content:encoded></item><item><title><![CDATA[Contiguitas: The Pursuit of Physical Memory Contiguity in Datacenters]]></title><description><![CDATA[Software and hardware optimizations to improve TLB coverage]]></description><link>https://danglingpointers.substack.com/p/contiguitas-the-pursuit-of-physical</link><guid isPermaLink="false">https://danglingpointers.substack.com/p/contiguitas-the-pursuit-of-physical</guid><dc:creator><![CDATA[Blake Pelton]]></dc:creator><pubDate>Tue, 17 Feb 2026 13:01:47 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!I5cH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7f0378a-d636-4c3c-8d07-c516da464050_1388x645.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a href="https://dl.acm.org/doi/10.1145/3579371.3589079">Contiguitas: The Pursuit of Physical Memory Contiguity in Datacenters</a> Kaiyang Zhao, Kaiwen Xue, Ziqi Wang, Dan Schatzberg, Leon Yang, Antonis Manousis, Johannes Weiner, Rik Van Riel, Bikash Sharma, Chunqiang Tang, and Dimitrios Skarlatos <em>ISCA'23</em></p><p>This paper has a lot of great statistics from the Meta fleet.</p><h1>TLB Coverage Trends</h1><p>Memory capacity per server is growing over time, but TLB size is not, thus <em>TLB coverage </em>(the fraction of main memory that can be referenced by the TLB at any one time) is trending downward:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!I5cH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7f0378a-d636-4c3c-8d07-c516da464050_1388x645.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!I5cH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7f0378a-d636-4c3c-8d07-c516da464050_1388x645.png 424w, https://substackcdn.com/image/fetch/$s_!I5cH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7f0378a-d636-4c3c-8d07-c516da464050_1388x645.png 848w, https://substackcdn.com/image/fetch/$s_!I5cH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7f0378a-d636-4c3c-8d07-c516da464050_1388x645.png 1272w, https://substackcdn.com/image/fetch/$s_!I5cH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7f0378a-d636-4c3c-8d07-c516da464050_1388x645.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!I5cH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7f0378a-d636-4c3c-8d07-c516da464050_1388x645.png" width="1388" height="645" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e7f0378a-d636-4c3c-8d07-c516da464050_1388x645.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:645,&quot;width&quot;:1388,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:87919,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://danglingpointers.substack.com/i/168048566?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7f0378a-d636-4c3c-8d07-c516da464050_1388x645.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!I5cH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7f0378a-d636-4c3c-8d07-c516da464050_1388x645.png 424w, https://substackcdn.com/image/fetch/$s_!I5cH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7f0378a-d636-4c3c-8d07-c516da464050_1388x645.png 848w, https://substackcdn.com/image/fetch/$s_!I5cH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7f0378a-d636-4c3c-8d07-c516da464050_1388x645.png 1272w, https://substackcdn.com/image/fetch/$s_!I5cH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7f0378a-d636-4c3c-8d07-c516da464050_1388x645.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Source: <a href="https://dl.acm.org/doi/10.1145/3579371.3589079">https://dl.acm.org/doi/10.1145/3579371.3589079</a></figcaption></figure></div><p>As TLB coverage decreases, the amount of time the CPU spends handling TLB misses increases.  
With 4KiB pages, TLB misses can account for almost 20% of CPU cycles!</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!A7C8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cf0ca4e-c036-4607-a6eb-a3e2d13cc565_1420x673.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!A7C8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cf0ca4e-c036-4607-a6eb-a3e2d13cc565_1420x673.png 424w, https://substackcdn.com/image/fetch/$s_!A7C8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cf0ca4e-c036-4607-a6eb-a3e2d13cc565_1420x673.png 848w, https://substackcdn.com/image/fetch/$s_!A7C8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cf0ca4e-c036-4607-a6eb-a3e2d13cc565_1420x673.png 1272w, https://substackcdn.com/image/fetch/$s_!A7C8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cf0ca4e-c036-4607-a6eb-a3e2d13cc565_1420x673.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!A7C8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cf0ca4e-c036-4607-a6eb-a3e2d13cc565_1420x673.png" width="1420" height="673" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3cf0ca4e-c036-4607-a6eb-a3e2d13cc565_1420x673.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:673,&quot;width&quot;:1420,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:71513,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://danglingpointers.substack.com/i/168048566?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cf0ca4e-c036-4607-a6eb-a3e2d13cc565_1420x673.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!A7C8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cf0ca4e-c036-4607-a6eb-a3e2d13cc565_1420x673.png 424w, https://substackcdn.com/image/fetch/$s_!A7C8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cf0ca4e-c036-4607-a6eb-a3e2d13cc565_1420x673.png 848w, https://substackcdn.com/image/fetch/$s_!A7C8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cf0ca4e-c036-4607-a6eb-a3e2d13cc565_1420x673.png 1272w, https://substackcdn.com/image/fetch/$s_!A7C8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cf0ca4e-c036-4607-a6eb-a3e2d13cc565_1420x673.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Source: <a href="https://dl.acm.org/doi/10.1145/3579371.3589079">https://dl.acm.org/doi/10.1145/3579371.3589079</a></figcaption></figure></div><p>A larger page size could increase TLB coverage; however, large pages are only feasible if the OS can find (or create) contiguous ranges of physical memory.  This is shockingly difficult in the real world:</p><blockquote><p>We sample servers across the fleet and show that 23% of servers do not even have physical memory contiguity for a single 2MB huge page.</p></blockquote><p>The biggest culprit is unmovable (i.e., pinned) pages that the NIC accesses.  
The reason these pages must be pinned is that the NIC cannot gracefully handle a page fault (<a href="https://danglingpointers.substack.com/p/to-pri-or-not-to-pri-thats-the-question">here</a> is a paper that describes some problems associated with PCIe devices causing page faults).</p><p>The paper describes two solutions which enable the OS to defragment physical memory, thus making it feasible to use large pages.</p><h1>Segmentation</h1><p>The first solution only requires software changes.  The idea is to split main memory into two contiguous segments, one for unmovable allocations and one for movable allocations.  A movable allocation can become unmovable (e.g., when it is pinned), but allocations cannot migrate in the other direction.  A background resizing algorithm runs periodically to move the boundary between these two regions.  One drawback of this approach is that an unmovable allocation close to the boundary prevents the unmovable region from shrinking.  The paper doesn&#8217;t have a great software-only solution to this problem, other than making the memory allocator prefer allocations which are far from the boundary.  The ultimate solution is dedicated hardware support for moving allocations in the unmovable region. 
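</p><p>To make the boundary problem concrete, here is a minimal toy model in Python.  It is an illustration only, not the paper&#8217;s implementation: the class <code>SegmentedMemory</code>, its method names, and the layout (unmovable frames occupy <code>[0, boundary)</code>, so frame 0 is farthest from the boundary) are all assumptions made for this sketch.</p><pre><code class="language-python">class SegmentedMemory:
    """Hypothetical toy model: frames [0, boundary) form the unmovable region."""

    def __init__(self, total_frames, boundary):
        self.total_frames = total_frames
        self.boundary = boundary
        self.unmovable_used = set()  # frame numbers that hold pinned pages

    def alloc_unmovable(self):
        # Prefer the lowest free frame, i.e. the one farthest from the boundary.
        for frame in range(self.boundary):
            if frame not in self.unmovable_used:
                self.unmovable_used.add(frame)
                return frame
        return None  # region is full; a real system would grow the region

    def free_unmovable(self, frame):
        self.unmovable_used.discard(frame)

    def shrink(self):
        # The boundary cannot move below the highest frame that is still pinned.
        self.boundary = max(self.unmovable_used, default=-1) + 1
        return self.boundary


mem = SegmentedMemory(total_frames=16, boundary=8)
frames = [mem.alloc_unmovable() for _ in range(4)]  # takes frames 0, 1, 2, 3
mem.free_unmovable(0)
mem.free_unmovable(1)
mem.free_unmovable(2)
new_boundary = mem.shrink()  # frame 3 is still pinned
</code></pre><p>Because frame 3 is still pinned, <code>shrink()</code> can only move the boundary down to 4, even though frames 0 through 2 are free.  Allocating from the far end keeps such stragglers away from the boundary where possible, which is the allocator preference described above.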
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!uYdn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f8e0f66-d1df-4a85-827a-0b1f265dcfcb_1095x535.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!uYdn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f8e0f66-d1df-4a85-827a-0b1f265dcfcb_1095x535.png 424w, https://substackcdn.com/image/fetch/$s_!uYdn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f8e0f66-d1df-4a85-827a-0b1f265dcfcb_1095x535.png 848w, https://substackcdn.com/image/fetch/$s_!uYdn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f8e0f66-d1df-4a85-827a-0b1f265dcfcb_1095x535.png 1272w, https://substackcdn.com/image/fetch/$s_!uYdn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f8e0f66-d1df-4a85-827a-0b1f265dcfcb_1095x535.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!uYdn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f8e0f66-d1df-4a85-827a-0b1f265dcfcb_1095x535.png" width="1095" height="535" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5f8e0f66-d1df-4a85-827a-0b1f265dcfcb_1095x535.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:535,&quot;width&quot;:1095,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:94672,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://danglingpointers.substack.com/i/168048566?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f8e0f66-d1df-4a85-827a-0b1f265dcfcb_1095x535.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!uYdn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f8e0f66-d1df-4a85-827a-0b1f265dcfcb_1095x535.png 424w, https://substackcdn.com/image/fetch/$s_!uYdn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f8e0f66-d1df-4a85-827a-0b1f265dcfcb_1095x535.png 848w, https://substackcdn.com/image/fetch/$s_!uYdn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f8e0f66-d1df-4a85-827a-0b1f265dcfcb_1095x535.png 1272w, https://substackcdn.com/image/fetch/$s_!uYdn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f8e0f66-d1df-4a85-827a-0b1f265dcfcb_1095x535.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Source: <a href="https://dl.acm.org/doi/10.1145/3579371.3589079">https://dl.acm.org/doi/10.1145/3579371.3589079</a></figcaption></figure></div><h1>Hardware Page Migration</h1><p>The paper proposes adding dedicated hardware support (to the LLC specifically) for migrating a physical page.  The OS can use this support to defragment memory by moving &#8220;unmovable&#8221; pages.  The LLC is responsible for moving the bytes from the source physical page to the destination physical page, and for transparently handling all memory accesses targeting the source page during the migration.</p><p>The page copy occurs one cache line at a time.  During the copy operation, accesses to the old page continue to work.  When an access arrives at the LLC, the hardware determines whether the accessed cache line has been copied yet.  
If the cache line has been copied, then the access is serviced from the destination page.  Otherwise, the access is serviced by the source page.  </p><p>Once the copy operation has completed, the OS can asynchronously invalidate references to the old page from all relevant TLBs (e.g., in CPU cores or the IOMMU).  Once those TLB invalidations have completed, the OS can reuse the old page.  Note that this has the side benefit of making TLB shoot-downs asynchronous, because they are no longer in the critical path of any memory allocation operation.</p><h1>Results</h1><p>Fig. 10 has results for memory segmentation.  &#8220;Partial&#8221; represents a case where physical memory is partially fragmented, whereas &#8220;full&#8221; represents full fragmentation. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LymJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51d74d6e-bd83-4ca1-8887-a1aa49ea23c4_1023x507.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LymJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51d74d6e-bd83-4ca1-8887-a1aa49ea23c4_1023x507.png 424w, https://substackcdn.com/image/fetch/$s_!LymJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51d74d6e-bd83-4ca1-8887-a1aa49ea23c4_1023x507.png 848w, https://substackcdn.com/image/fetch/$s_!LymJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51d74d6e-bd83-4ca1-8887-a1aa49ea23c4_1023x507.png 1272w, 
https://substackcdn.com/image/fetch/$s_!LymJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51d74d6e-bd83-4ca1-8887-a1aa49ea23c4_1023x507.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LymJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51d74d6e-bd83-4ca1-8887-a1aa49ea23c4_1023x507.png" width="1023" height="507" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/51d74d6e-bd83-4ca1-8887-a1aa49ea23c4_1023x507.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:507,&quot;width&quot;:1023,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:62047,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://danglingpointers.substack.com/i/168048566?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51d74d6e-bd83-4ca1-8887-a1aa49ea23c4_1023x507.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!LymJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51d74d6e-bd83-4ca1-8887-a1aa49ea23c4_1023x507.png 424w, https://substackcdn.com/image/fetch/$s_!LymJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51d74d6e-bd83-4ca1-8887-a1aa49ea23c4_1023x507.png 848w, 
https://substackcdn.com/image/fetch/$s_!LymJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51d74d6e-bd83-4ca1-8887-a1aa49ea23c4_1023x507.png 1272w, https://substackcdn.com/image/fetch/$s_!LymJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51d74d6e-bd83-4ca1-8887-a1aa49ea23c4_1023x507.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Source: <a 
href="https://dl.acm.org/doi/10.1145/3579371.3589079">https://dl.acm.org/doi/10.1145/3579371.3589079</a></figcaption></figure></div><h1>Dangling Pointers</h1><p>If the NIC is the primary culprit, some vertical integration might be called for here.  For example, allocations used to send packets cycle through three states:</p><ol><li><p>Empty</p></li><li><p>CPU writing to it</p></li><li><p>NIC reading from it</p></li></ol><p>It seems like it would be fine for the OS to migrate a packet when it is not in state #3.</p>]]></content:encoded></item><item><title><![CDATA[Efficient Lossless Compression of Scientific Floating-Point Data on CPUs and GPUs]]></title><description><![CDATA[A family of algorithms]]></description><link>https://danglingpointers.substack.com/p/efficient-lossless-compression-of</link><guid isPermaLink="false">https://danglingpointers.substack.com/p/efficient-lossless-compression-of</guid><dc:creator><![CDATA[Blake Pelton]]></dc:creator><pubDate>Thu, 12 Feb 2026 13:07:58 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!pT2U!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dbfd655-c89f-4b48-b7de-d6263baacd24_936x523.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a href="https://dl.acm.org/doi/10.1145/3669940.3707280">Efficient Lossless Compression of Scientific Floating-Point Data on CPUs and GPUs</a> Noushin Azami, Alex Fallin, and Martin Burtscher <em>ASPLOS'25</em></p><p>This paper describes four (creatively named) lossless compression algorithms:</p><ul><li><p>SPspeed: 32-bit 
floating-point, optimized for speed</p></li><li><p>DPspeed: 64-bit floating-point, optimized for speed</p></li><li><p>SPratio: 32-bit floating-point, optimized for compression ratio</p></li><li><p>DPratio: 64-bit floating-point, optimized for compression ratio</p></li></ul><p>The claim to fame here is excellent performance on both CPUs and GPUs.</p><h2>Building Blocks</h2><p>Each compressor is implemented as a pipeline of transformations, illustrated in Fig. 1:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pT2U!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dbfd655-c89f-4b48-b7de-d6263baacd24_936x523.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!pT2U!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dbfd655-c89f-4b48-b7de-d6263baacd24_936x523.png 424w, https://substackcdn.com/image/fetch/$s_!pT2U!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dbfd655-c89f-4b48-b7de-d6263baacd24_936x523.png 848w, https://substackcdn.com/image/fetch/$s_!pT2U!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dbfd655-c89f-4b48-b7de-d6263baacd24_936x523.png 1272w, https://substackcdn.com/image/fetch/$s_!pT2U!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dbfd655-c89f-4b48-b7de-d6263baacd24_936x523.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!pT2U!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dbfd655-c89f-4b48-b7de-d6263baacd24_936x523.png" width="936" height="523" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3dbfd655-c89f-4b48-b7de-d6263baacd24_936x523.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:523,&quot;width&quot;:936,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:37655,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://danglingpointers.substack.com/i/167760183?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dbfd655-c89f-4b48-b7de-d6263baacd24_936x523.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!pT2U!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dbfd655-c89f-4b48-b7de-d6263baacd24_936x523.png 424w, https://substackcdn.com/image/fetch/$s_!pT2U!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dbfd655-c89f-4b48-b7de-d6263baacd24_936x523.png 848w, https://substackcdn.com/image/fetch/$s_!pT2U!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dbfd655-c89f-4b48-b7de-d6263baacd24_936x523.png 1272w, https://substackcdn.com/image/fetch/$s_!pT2U!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dbfd655-c89f-4b48-b7de-d6263baacd24_936x523.png 1456w" 
sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Source: <a href="https://dl.acm.org/doi/10.1145/3669940.3707280">https://dl.acm.org/doi/10.1145/3669940.3707280</a></figcaption></figure></div><p>Decompression applies the inverse of each stage, in reverse order.</p><h3>DIFFMS</h3><p>The goal of DIFFMS is to transform the data such that most of the upper bits of each element to be compressed are 0.  DIFFMS interprets inputs as integers (<code>int32</code> or <code>int64</code>) and replaces element <code>N</code> with the difference between element <code>N</code> and element <code>N-1</code>.  
Differences are stored in sign-magnitude format.  Neighboring values typically have small differences, so fewer bits are needed to store differences than to store raw values.  Converting to sign-magnitude causes the upper bits of small negative differences to be zero, whereas in two&#8217;s complement they would all be ones.  The sign bit is stored in the least significant position, to ensure that it doesn&#8217;t contaminate the most-significant bit with a one.</p><p>Sign-magnitude format can represent one fewer value than two&#8217;s complement, but I suppose that is OK because that situation can only arise if an input float is NaN.</p><h3>MPLG</h3><p>Next, MPLG (introduced in a previous paper) operates on chunks of elements.  For each chunk, MPLG finds the element with the highest-order non-zero bit and uses that to determine the number of leading zero bits which can safely be removed from each element in the chunk.  This number of leading bits is stored in per-chunk metadata, and the input stream is bit-packed after removing those leading bits.  Fig. 
3 illustrates the bit packing:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Zmcw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34a113fa-2ebd-440e-b377-3e7cd2fa11de_1361x372.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Zmcw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34a113fa-2ebd-440e-b377-3e7cd2fa11de_1361x372.png 424w, https://substackcdn.com/image/fetch/$s_!Zmcw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34a113fa-2ebd-440e-b377-3e7cd2fa11de_1361x372.png 848w, https://substackcdn.com/image/fetch/$s_!Zmcw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34a113fa-2ebd-440e-b377-3e7cd2fa11de_1361x372.png 1272w, https://substackcdn.com/image/fetch/$s_!Zmcw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34a113fa-2ebd-440e-b377-3e7cd2fa11de_1361x372.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Zmcw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34a113fa-2ebd-440e-b377-3e7cd2fa11de_1361x372.png" width="1361" height="372" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/34a113fa-2ebd-440e-b377-3e7cd2fa11de_1361x372.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:372,&quot;width&quot;:1361,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:39935,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blakepelton.substack.com/i/167760183?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34a113fa-2ebd-440e-b377-3e7cd2fa11de_1361x372.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Zmcw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34a113fa-2ebd-440e-b377-3e7cd2fa11de_1361x372.png 424w, https://substackcdn.com/image/fetch/$s_!Zmcw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34a113fa-2ebd-440e-b377-3e7cd2fa11de_1361x372.png 848w, https://substackcdn.com/image/fetch/$s_!Zmcw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34a113fa-2ebd-440e-b377-3e7cd2fa11de_1361x372.png 1272w, https://substackcdn.com/image/fetch/$s_!Zmcw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34a113fa-2ebd-440e-b377-3e7cd2fa11de_1361x372.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Source: <a href="https://dl.acm.org/doi/10.1145/3669940.3707280">https://dl.acm.org/doi/10.1145/3669940.3707280</a></figcaption></figure></div><h3>BIT</h3><p>The MPLG stage has the &#8220;one bad apple spoils the bunch&#8221; property.  The BIT stage addresses that with a transpose.  Each chunk is interpreted as a 2D array of bits, and that 2D array is transposed.</p><p>Say most elements in a chunk only require the least significant 8 bits, but one element requires 10 bits.  Then after the MPLG stage, most elements will have 2 leading zeros in them.  
After the BIT stage, these zeros will all be grouped together rather than spaced apart.</p><h3>Repeated Zero Elimination</h3><p>After BIT has arranged the data so that it contains many long runs of zero bits, Repeated Zero Elimination (RZE) finds and removes bytes which are equal to zero.  RZE produces both the output stream (with &#8220;zero bytes&#8221; removed) and a bitmap indicating which bytes were removed.</p><p>The authors found that RZE doesn&#8217;t work well for the low-order bits of double-precision data.  They address this with RAZE and RARE, which do not try to eliminate ranges of zeros from the low-order bits.</p><h2>Performance Characteristics</h2><p>Each stage in the pipeline operates on chunks of data.  The only dependency between chunks is that the output offset of chunk <em>N+1</em> cannot be computed until the size of the output produced by chunk <em>N</em> is known.  Efficient parallelization is possible in spite of this hazard on both CPU and GPU implementations.</p><p>As far as I can tell, there is no explicit attempt to ensure that the data passed between stages is kept on-chip.  It could be that CPU and GPU caches work well enough for this purpose.</p><h2>Results</h2><p>These algorithms extend the Pareto frontier for both CPU and GPU. 
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UuDq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bfcedb2-916d-4150-93fa-e82a5b629af5_771x680.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!UuDq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bfcedb2-916d-4150-93fa-e82a5b629af5_771x680.png 424w, https://substackcdn.com/image/fetch/$s_!UuDq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bfcedb2-916d-4150-93fa-e82a5b629af5_771x680.png 848w, https://substackcdn.com/image/fetch/$s_!UuDq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bfcedb2-916d-4150-93fa-e82a5b629af5_771x680.png 1272w, https://substackcdn.com/image/fetch/$s_!UuDq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bfcedb2-916d-4150-93fa-e82a5b629af5_771x680.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!UuDq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bfcedb2-916d-4150-93fa-e82a5b629af5_771x680.png" width="432" height="381.011673151751" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3bfcedb2-916d-4150-93fa-e82a5b629af5_771x680.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:680,&quot;width&quot;:771,&quot;resizeWidth&quot;:432,&quot;bytes&quot;:87231,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blakepelton.substack.com/i/167760183?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bfcedb2-916d-4150-93fa-e82a5b629af5_771x680.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!UuDq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bfcedb2-916d-4150-93fa-e82a5b629af5_771x680.png 424w, https://substackcdn.com/image/fetch/$s_!UuDq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bfcedb2-916d-4150-93fa-e82a5b629af5_771x680.png 848w, https://substackcdn.com/image/fetch/$s_!UuDq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bfcedb2-916d-4150-93fa-e82a5b629af5_771x680.png 1272w, https://substackcdn.com/image/fetch/$s_!UuDq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bfcedb2-916d-4150-93fa-e82a5b629af5_771x680.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Source: <a href="https://dl.acm.org/doi/10.1145/3669940.3707280">https://dl.acm.org/doi/10.1145/3669940.3707280</a></figcaption></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!N4ms!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b0e7bc9-84c7-471c-9994-2d250cd0c605_827x652.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!N4ms!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b0e7bc9-84c7-471c-9994-2d250cd0c605_827x652.png 424w, 
https://substackcdn.com/image/fetch/$s_!N4ms!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b0e7bc9-84c7-471c-9994-2d250cd0c605_827x652.png 848w, https://substackcdn.com/image/fetch/$s_!N4ms!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b0e7bc9-84c7-471c-9994-2d250cd0c605_827x652.png 1272w, https://substackcdn.com/image/fetch/$s_!N4ms!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b0e7bc9-84c7-471c-9994-2d250cd0c605_827x652.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!N4ms!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b0e7bc9-84c7-471c-9994-2d250cd0c605_827x652.png" width="434" height="342.1620314389359" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5b0e7bc9-84c7-471c-9994-2d250cd0c605_827x652.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:652,&quot;width&quot;:827,&quot;resizeWidth&quot;:434,&quot;bytes&quot;:84338,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blakepelton.substack.com/i/167760183?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b0e7bc9-84c7-471c-9994-2d250cd0c605_827x652.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!N4ms!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b0e7bc9-84c7-471c-9994-2d250cd0c605_827x652.png 424w, https://substackcdn.com/image/fetch/$s_!N4ms!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b0e7bc9-84c7-471c-9994-2d250cd0c605_827x652.png 848w, https://substackcdn.com/image/fetch/$s_!N4ms!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b0e7bc9-84c7-471c-9994-2d250cd0c605_827x652.png 1272w, https://substackcdn.com/image/fetch/$s_!N4ms!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b0e7bc9-84c7-471c-9994-2d250cd0c605_827x652.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Source: <a href="https://dl.acm.org/doi/10.1145/3669940.3707280">https://dl.acm.org/doi/10.1145/3669940.3707280</a></figcaption></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!z3gL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74feb2c8-1e97-4dbb-acaa-5a42694183aa_699x636.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!z3gL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74feb2c8-1e97-4dbb-acaa-5a42694183aa_699x636.png 424w, https://substackcdn.com/image/fetch/$s_!z3gL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74feb2c8-1e97-4dbb-acaa-5a42694183aa_699x636.png 848w, https://substackcdn.com/image/fetch/$s_!z3gL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74feb2c8-1e97-4dbb-acaa-5a42694183aa_699x636.png 1272w, https://substackcdn.com/image/fetch/$s_!z3gL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74feb2c8-1e97-4dbb-acaa-5a42694183aa_699x636.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!z3gL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74feb2c8-1e97-4dbb-acaa-5a42694183aa_699x636.png" width="430" height="391.24463519313304" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/74feb2c8-1e97-4dbb-acaa-5a42694183aa_699x636.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:636,&quot;width&quot;:699,&quot;resizeWidth&quot;:430,&quot;bytes&quot;:76217,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blakepelton.substack.com/i/167760183?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e8a5d79-fc5f-49e6-b7e0-db51b735f686_755x636.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!z3gL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74feb2c8-1e97-4dbb-acaa-5a42694183aa_699x636.png 424w, https://substackcdn.com/image/fetch/$s_!z3gL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74feb2c8-1e97-4dbb-acaa-5a42694183aa_699x636.png 848w, https://substackcdn.com/image/fetch/$s_!z3gL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74feb2c8-1e97-4dbb-acaa-5a42694183aa_699x636.png 1272w, https://substackcdn.com/image/fetch/$s_!z3gL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74feb2c8-1e97-4dbb-acaa-5a42694183aa_699x636.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Source: <a href="https://dl.acm.org/doi/10.1145/3669940.3707280">https://dl.acm.org/doi/10.1145/3669940.3707280</a></figcaption></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CipT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F694846e8-05a9-4db8-8cd1-8d93c4c23cde_685x667.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!CipT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F694846e8-05a9-4db8-8cd1-8d93c4c23cde_685x667.png 424w, https://substackcdn.com/image/fetch/$s_!CipT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F694846e8-05a9-4db8-8cd1-8d93c4c23cde_685x667.png 848w, https://substackcdn.com/image/fetch/$s_!CipT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F694846e8-05a9-4db8-8cd1-8d93c4c23cde_685x667.png 1272w, https://substackcdn.com/image/fetch/$s_!CipT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F694846e8-05a9-4db8-8cd1-8d93c4c23cde_685x667.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CipT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F694846e8-05a9-4db8-8cd1-8d93c4c23cde_685x667.png" width="429" height="417.7270072992701" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/694846e8-05a9-4db8-8cd1-8d93c4c23cde_685x667.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:667,&quot;width&quot;:685,&quot;resizeWidth&quot;:429,&quot;bytes&quot;:78223,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blakepelton.substack.com/i/167760183?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54655020-440c-4bed-ba08-0450b5275155_738x667.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!CipT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F694846e8-05a9-4db8-8cd1-8d93c4c23cde_685x667.png 424w, https://substackcdn.com/image/fetch/$s_!CipT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F694846e8-05a9-4db8-8cd1-8d93c4c23cde_685x667.png 848w, https://substackcdn.com/image/fetch/$s_!CipT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F694846e8-05a9-4db8-8cd1-8d93c4c23cde_685x667.png 1272w, https://substackcdn.com/image/fetch/$s_!CipT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F694846e8-05a9-4db8-8cd1-8d93c4c23cde_685x667.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Source: <a href="https://dl.acm.org/doi/10.1145/3669940.3707280">https://dl.acm.org/doi/10.1145/3669940.3707280</a></figcaption></figure></div><h1>Dangling Pointers</h1><p>The state of the art feels very handcrafted, somewhat analogous to image classification before <a href="https://www.bing.com/search?q=alexnet+paper&amp;cvid=c34b1aa875ca46ea9ad36d1ef9863d2c&amp;gs_lcrp=EgRlZGdlKgYIABBFGDkyBggAEEUYOTIGCAEQABhAMgYIAhAAGEAyBggDEC4YQDIGCAQQABhAMgYIBRAuGEAyBggGEAAYQDIGCAcQLhhAMgYICBAuGEDSAQgxNDMwajBqOagCCLACAQ&amp;FORM=ANAB01&amp;PC=DCTS">AlexNet</a> moved that field from handcrafted feature engineering to more generalized models.  In the text compression space, LZ-based compressors leave the same aftertaste.</p>]]></content:encoded></item><item><title><![CDATA[EDM: An Ultra-Low Latency Ethernet Fabric for Memory Disaggregation]]></title><description><![CDATA[Who needs CXL anyways?]]></description><link>https://danglingpointers.substack.com/p/edm-an-ultra-low-latency-ethernet</link><guid isPermaLink="false">https://danglingpointers.substack.com/p/edm-an-ultra-low-latency-ethernet</guid><dc:creator><![CDATA[Blake Pelton]]></dc:creator><pubDate>Tue, 10 Feb 2026 13:00:18 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!T0Vm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a3fd8c1-1d49-46fa-8d3d-525cba350c56_1045x813.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a href="https://dl.acm.org/doi/10.1145/3669940.3707221">EDM: An Ultra-Low Latency Ethernet Fabric for Memory Disaggregation</a> Weigao Su and Vishal Shrivastav <em>ASPLOS'25</em></p><p>This paper describes incremental changes to Ethernet NICs and switches to enable efficient disaggregation of memory without the need for a separate network (<em>e.g., </em>CXL) for memory traffic.</p><h2>Ethernet Fabric Latency</h2><p>Fig. 
1 shows the north star:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!T0Vm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a3fd8c1-1d49-46fa-8d3d-525cba350c56_1045x813.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!T0Vm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a3fd8c1-1d49-46fa-8d3d-525cba350c56_1045x813.png 424w, https://substackcdn.com/image/fetch/$s_!T0Vm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a3fd8c1-1d49-46fa-8d3d-525cba350c56_1045x813.png 848w, https://substackcdn.com/image/fetch/$s_!T0Vm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a3fd8c1-1d49-46fa-8d3d-525cba350c56_1045x813.png 1272w, https://substackcdn.com/image/fetch/$s_!T0Vm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a3fd8c1-1d49-46fa-8d3d-525cba350c56_1045x813.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!T0Vm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a3fd8c1-1d49-46fa-8d3d-525cba350c56_1045x813.png" width="1045" height="813" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8a3fd8c1-1d49-46fa-8d3d-525cba350c56_1045x813.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:813,&quot;width&quot;:1045,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:219525,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://danglingpointers.substack.com/i/183936829?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a3fd8c1-1d49-46fa-8d3d-525cba350c56_1045x813.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!T0Vm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a3fd8c1-1d49-46fa-8d3d-525cba350c56_1045x813.png 424w, https://substackcdn.com/image/fetch/$s_!T0Vm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a3fd8c1-1d49-46fa-8d3d-525cba350c56_1045x813.png 848w, https://substackcdn.com/image/fetch/$s_!T0Vm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a3fd8c1-1d49-46fa-8d3d-525cba350c56_1045x813.png 1272w, https://substackcdn.com/image/fetch/$s_!T0Vm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a3fd8c1-1d49-46fa-8d3d-525cba350c56_1045x813.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">Source: <a href="https://dl.acm.org/doi/10.1145/3669940.3707221">https://dl.acm.org/doi/10.1145/3669940.3707221</a></figcaption></figure></div><p>Servers are partitioned into <em>Compute Nodes</em> and <em>Memory Nodes</em>.  When a compute node wants to access remote memory, it issues a request to its local NIC, which sends the request to the correct memory node (via a switch).</p><p>The key problem this paper addresses is <em>Ethernet fabric latency</em> (i.e., the time taken for requests/responses to flow between NICs and switches).  The paper assumes that the latency between the processor and the NIC is low (and cites other papers which describe techniques for reducing this latency to below 100ns).  
Typical Ethernet fabric latency is measured in microseconds, which is much higher than a local memory access.</p><h2>Ethernet Hardware Stack Changes</h2><p>The Ethernet hardware stack can be decomposed into MAC and PHY layers.  The MAC is higher level and sits on top of the PHY.  The paper proposes implementing <em>EDM</em> (Ethernet Disaggregated Memory) with modifications to the PHY layer in both the NIC and the switch.  Normal network packets flow through the MAC and PHY as they usually would, but a side channel exists which allows remote memory accesses to be handled directly by the enhanced PHY layer.  Fig. 3 illustrates the hardware changes in Ethernet NICs and switches.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!x44A!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F680959aa-4e69-4595-bf50-717b45c2c59b_1173x496.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!x44A!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F680959aa-4e69-4595-bf50-717b45c2c59b_1173x496.png 424w, https://substackcdn.com/image/fetch/$s_!x44A!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F680959aa-4e69-4595-bf50-717b45c2c59b_1173x496.png 848w, https://substackcdn.com/image/fetch/$s_!x44A!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F680959aa-4e69-4595-bf50-717b45c2c59b_1173x496.png 1272w, 
https://substackcdn.com/image/fetch/$s_!x44A!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F680959aa-4e69-4595-bf50-717b45c2c59b_1173x496.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!x44A!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F680959aa-4e69-4595-bf50-717b45c2c59b_1173x496.png" width="1173" height="496" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/680959aa-4e69-4595-bf50-717b45c2c59b_1173x496.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:496,&quot;width&quot;:1173,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:62711,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://danglingpointers.substack.com/i/183936829?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F680959aa-4e69-4595-bf50-717b45c2c59b_1173x496.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!x44A!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F680959aa-4e69-4595-bf50-717b45c2c59b_1173x496.png 424w, https://substackcdn.com/image/fetch/$s_!x44A!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F680959aa-4e69-4595-bf50-717b45c2c59b_1173x496.png 848w, 
https://substackcdn.com/image/fetch/$s_!x44A!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F680959aa-4e69-4595-bf50-717b45c2c59b_1173x496.png 1272w, https://substackcdn.com/image/fetch/$s_!x44A!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F680959aa-4e69-4595-bf50-717b45c2c59b_1173x496.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Source: <a 
href="https://dl.acm.org/doi/10.1145/3669940.3707221">https://dl.acm.org/doi/10.1145/3669940.3707221</a></figcaption></figure></div><p>Remote memory access requests and responses are smaller than typical Ethernet packets.  Additionally, end-to-end application performance is more sensitive to remote memory access latency than to the latency of regular network traffic.  The bulk of the paper describes how EDM achieves low latency for remote memory traffic.</p><h2>Preemption</h2><p>The EDM PHY modifications allow a memory request to preempt a non-memory packet.  Say the MAC sends a 1KiB packet to the PHY, which begins to send the packet over the wire in 66-bit blocks.  If a memory request shows up in the middle of transmitting the network packet, the PHY can sneak the memory request onto the wire between 66-bit blocks, rather than waiting for the whole 1KiB to be sent.</p><h2>Inter-Frame Gap</h2><p>Standard Ethernet requires an idle gap of at least 96 bit times (12 bytes) on the wire between packets.  This overhead is small for large packets, but it is non-trivial for small packets (like remote memory access requests).  The EDM PHY modifications allow these idle bits to be used for remote memory accesses.  The MAC still sees the gaps, but on the wire the PHY fills them with memory traffic.  If you ask an LLM what could possibly go wrong by trying to use the inter-frame gap to send useful data, it will spit out a long list.  I couldn&#8217;t find much detail in the paper about how to ensure that this enhancement is robust.  The possible problems are limited to the PHY layer, however, because the MAC still sees the gap it expects.</p><h2>Scheduling</h2><p>To avoid congestion and dropping of memory requests, EDM uses an in-network scheduling algorithm somewhat like Priority Flow Control (PFC).  The EDM scheduler is in the PHY layer of the switch.  
Senders <em>notify</em> the switch when they have memory traffic to send, and the switch responds later with a <em>grant</em>, allowing a certain amount of data to be sent.</p><h2>Results</h2><p>The authors implemented EDM on FPGAs (acting as both NIC and switch).  Table 1 compares latencies for TCP/IP, RDMA, raw Ethernet packets, and EDM, breaking down latencies at each step:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fr_t!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4620c6e0-b699-4c3b-aee9-783c06c7102c_1497x995.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fr_t!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4620c6e0-b699-4c3b-aee9-783c06c7102c_1497x995.png 424w, https://substackcdn.com/image/fetch/$s_!fr_t!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4620c6e0-b699-4c3b-aee9-783c06c7102c_1497x995.png 848w, https://substackcdn.com/image/fetch/$s_!fr_t!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4620c6e0-b699-4c3b-aee9-783c06c7102c_1497x995.png 1272w, https://substackcdn.com/image/fetch/$s_!fr_t!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4620c6e0-b699-4c3b-aee9-783c06c7102c_1497x995.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fr_t!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4620c6e0-b699-4c3b-aee9-783c06c7102c_1497x995.png" width="1456" height="968" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4620c6e0-b699-4c3b-aee9-783c06c7102c_1497x995.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:968,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:308597,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://danglingpointers.substack.com/i/183936829?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4620c6e0-b699-4c3b-aee9-783c06c7102c_1497x995.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!fr_t!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4620c6e0-b699-4c3b-aee9-783c06c7102c_1497x995.png 424w, https://substackcdn.com/image/fetch/$s_!fr_t!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4620c6e0-b699-4c3b-aee9-783c06c7102c_1497x995.png 848w, https://substackcdn.com/image/fetch/$s_!fr_t!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4620c6e0-b699-4c3b-aee9-783c06c7102c_1497x995.png 1272w, https://substackcdn.com/image/fetch/$s_!fr_t!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4620c6e0-b699-4c3b-aee9-783c06c7102c_1497x995.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Source: <a href="https://dl.acm.org/doi/10.1145/3669940.3707221">https://dl.acm.org/doi/10.1145/3669940.3707221</a></figcaption></figure></div><p>Fig. 
7 throws CXL into the mix:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pQwn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c55e072-d5d7-4f44-8298-cea7bae0906b_730x421.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!pQwn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c55e072-d5d7-4f44-8298-cea7bae0906b_730x421.png 424w, https://substackcdn.com/image/fetch/$s_!pQwn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c55e072-d5d7-4f44-8298-cea7bae0906b_730x421.png 848w, https://substackcdn.com/image/fetch/$s_!pQwn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c55e072-d5d7-4f44-8298-cea7bae0906b_730x421.png 1272w, https://substackcdn.com/image/fetch/$s_!pQwn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c55e072-d5d7-4f44-8298-cea7bae0906b_730x421.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!pQwn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c55e072-d5d7-4f44-8298-cea7bae0906b_730x421.png" width="730" height="421" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5c55e072-d5d7-4f44-8298-cea7bae0906b_730x421.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:421,&quot;width&quot;:730,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:25211,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://danglingpointers.substack.com/i/183936829?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c55e072-d5d7-4f44-8298-cea7bae0906b_730x421.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!pQwn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c55e072-d5d7-4f44-8298-cea7bae0906b_730x421.png 424w, https://substackcdn.com/image/fetch/$s_!pQwn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c55e072-d5d7-4f44-8298-cea7bae0906b_730x421.png 848w, https://substackcdn.com/image/fetch/$s_!pQwn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c55e072-d5d7-4f44-8298-cea7bae0906b_730x421.png 1272w, https://substackcdn.com/image/fetch/$s_!pQwn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c55e072-d5d7-4f44-8298-cea7bae0906b_730x421.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Source: <a href="https://dl.acm.org/doi/10.1145/3669940.3707221">https://dl.acm.org/doi/10.1145/3669940.3707221</a></figcaption></figure></div><h2>Dangling Pointers</h2><p>Section 3.3 &#8220;Practical Concerns&#8221; has a discussion of what could go wrong (<em>e.g.,</em> fault tolerance and data corruption).  
It is hard to judge how much work is needed to make this into something that industry could rely on.</p>]]></content:encoded></item><item><title><![CDATA[An Analysis of User-space Idle State Instructions on x86 Processors]]></title><description><![CDATA[Guilt-free busy waiting]]></description><link>https://danglingpointers.substack.com/p/an-analysis-of-user-space-idle-state</link><guid isPermaLink="false">https://danglingpointers.substack.com/p/an-analysis-of-user-space-idle-state</guid><dc:creator><![CDATA[Blake Pelton]]></dc:creator><pubDate>Thu, 05 Feb 2026 13:04:02 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!FVjv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e61473a-c72c-4384-9662-9674008d9bd9_2127x862.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a href="https://dl.acm.org/doi/10.1145/3676151.3719370">An Analysis of User-space Idle State Instructions on x86 Processors</a> Malte-Christian Kuns, Hannes Tr&#246;pgen, and Robert Sch&#246;ne <em>ICPE'25</em></p><p>I&#8217;ve long believed that busy waiting is poor form.  The closest you should ever come to busy waiting is locking a <code>futex/CRITICAL_SECTION</code>, which will busy wait for a short while on your behalf.</p><p>If your primary concern is power consumption, then busy waiting may be less offensive on a modern processor.  
This paper describes newly added x86 instructions that enable low-power busy waiting from user space, and has a ton of data to help you sleep better at night.</p><h2>New Instructions</h2><p><code>TPAUSE</code> puts the processor into a low power state for a user-specified amount of time.  <code>TPAUSE</code> supports two low power states (<code>C0.1</code> and <code>C0.2</code>), which trade power consumption for wake-up latency.  <code>TPAUSE</code> can be called in user space but doesn&#8217;t wrest control of the core away from the OS.  The trick is that the OS can set a maximum timeout value, which gives the OS a chance to switch away from the busy waiting thread.</p><p>The <code>UMONITOR</code> and <code>UMWAIT</code> instructions are similar to <code>TPAUSE</code> but allow the processor to be woken up when a write occurs in a specified memory range.  <code>UMONITOR</code> sets up the memory range to be monitored, and <code>UMWAIT</code> causes the processor to enter a low power state.  <code>UMWAIT</code> accepts a timeout value and a target power state (just like <code>TPAUSE</code>).  AMD supports similar functionality via the <code>MONITORX</code> and <code>MWAITX</code> instructions.</p><h2>Results</h2><p>A key question the paper investigates is how closely the user-specified timeout is honored.  Fig. 
1 shows results for three Intel cores:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!FVjv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e61473a-c72c-4384-9662-9674008d9bd9_2127x862.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!FVjv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e61473a-c72c-4384-9662-9674008d9bd9_2127x862.png 424w, https://substackcdn.com/image/fetch/$s_!FVjv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e61473a-c72c-4384-9662-9674008d9bd9_2127x862.png 848w, https://substackcdn.com/image/fetch/$s_!FVjv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e61473a-c72c-4384-9662-9674008d9bd9_2127x862.png 1272w, https://substackcdn.com/image/fetch/$s_!FVjv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e61473a-c72c-4384-9662-9674008d9bd9_2127x862.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!FVjv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e61473a-c72c-4384-9662-9674008d9bd9_2127x862.png" width="1456" height="590" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8e61473a-c72c-4384-9662-9674008d9bd9_2127x862.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:590,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:211952,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://danglingpointers.substack.com/i/183717210?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e61473a-c72c-4384-9662-9674008d9bd9_2127x862.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!FVjv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e61473a-c72c-4384-9662-9674008d9bd9_2127x862.png 424w, https://substackcdn.com/image/fetch/$s_!FVjv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e61473a-c72c-4384-9662-9674008d9bd9_2127x862.png 848w, https://substackcdn.com/image/fetch/$s_!FVjv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e61473a-c72c-4384-9662-9674008d9bd9_2127x862.png 1272w, https://substackcdn.com/image/fetch/$s_!FVjv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e61473a-c72c-4384-9662-9674008d9bd9_2127x862.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">Source: <a href="https://dl.acm.org/doi/10.1145/3676151.3719370">https://dl.acm.org/doi/10.1145/3676151.3719370</a></figcaption></figure></div><p>Times are measured in timestamp counter cycles (roughly 3 GHz for Alder Lake).  The plateau at the top right is caused by the OS-specified maximum timeout.  The authors find that timeout values are quantized (<em>e.g.,</em> 83 cycles on Alder Lake P-core).  Additionally, for short timeouts the processor may ignore the user-requested power state (presumably because it doesn&#8217;t make sense to enter a deep sleep for a short amount of time).  On Alder Lake P-cores, the threshold below which the processor will not enter the lowest power state is around 23,000 TSC cycles.  Alder Lake E-cores seem to only support one low power state.</p><p>Fig. 
3 measures how much the processor can &#8220;oversleep&#8221; (wake up later than requested) depending on processor frequency and requested power state:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!sflO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffebc3bd8-4cb9-4955-9937-f40b96c13493_1163x952.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!sflO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffebc3bd8-4cb9-4955-9937-f40b96c13493_1163x952.png 424w, https://substackcdn.com/image/fetch/$s_!sflO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffebc3bd8-4cb9-4955-9937-f40b96c13493_1163x952.png 848w, https://substackcdn.com/image/fetch/$s_!sflO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffebc3bd8-4cb9-4955-9937-f40b96c13493_1163x952.png 1272w, https://substackcdn.com/image/fetch/$s_!sflO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffebc3bd8-4cb9-4955-9937-f40b96c13493_1163x952.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!sflO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffebc3bd8-4cb9-4955-9937-f40b96c13493_1163x952.png" width="1163" height="952" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/febc3bd8-4cb9-4955-9937-f40b96c13493_1163x952.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:952,&quot;width&quot;:1163,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:128202,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://danglingpointers.substack.com/i/183717210?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffebc3bd8-4cb9-4955-9937-f40b96c13493_1163x952.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!sflO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffebc3bd8-4cb9-4955-9937-f40b96c13493_1163x952.png 424w, https://substackcdn.com/image/fetch/$s_!sflO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffebc3bd8-4cb9-4955-9937-f40b96c13493_1163x952.png 848w, https://substackcdn.com/image/fetch/$s_!sflO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffebc3bd8-4cb9-4955-9937-f40b96c13493_1163x952.png 1272w, https://substackcdn.com/image/fetch/$s_!sflO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffebc3bd8-4cb9-4955-9937-f40b96c13493_1163x952.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Source: <a href="https://dl.acm.org/doi/10.1145/3676151.3719370">https://dl.acm.org/doi/10.1145/3676151.3719370</a></figcaption></figure></div><p>And finally, Table 2 shows measured power consumption for these new instructions vs old-fashioned busy wait loops that use the <code>PAUSE</code> instruction (which does not support a user-specified timeout):</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!haDm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F073da315-9081-4b8f-989a-bd3bf49cde19_1129x780.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!haDm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F073da315-9081-4b8f-989a-bd3bf49cde19_1129x780.png 424w, https://substackcdn.com/image/fetch/$s_!haDm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F073da315-9081-4b8f-989a-bd3bf49cde19_1129x780.png 848w, https://substackcdn.com/image/fetch/$s_!haDm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F073da315-9081-4b8f-989a-bd3bf49cde19_1129x780.png 1272w, https://substackcdn.com/image/fetch/$s_!haDm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F073da315-9081-4b8f-989a-bd3bf49cde19_1129x780.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!haDm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F073da315-9081-4b8f-989a-bd3bf49cde19_1129x780.png" width="1129" height="780" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/073da315-9081-4b8f-989a-bd3bf49cde19_1129x780.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:780,&quot;width&quot;:1129,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:143248,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://danglingpointers.substack.com/i/183717210?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F073da315-9081-4b8f-989a-bd3bf49cde19_1129x780.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!haDm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F073da315-9081-4b8f-989a-bd3bf49cde19_1129x780.png 424w, https://substackcdn.com/image/fetch/$s_!haDm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F073da315-9081-4b8f-989a-bd3bf49cde19_1129x780.png 848w, https://substackcdn.com/image/fetch/$s_!haDm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F073da315-9081-4b8f-989a-bd3bf49cde19_1129x780.png 1272w, https://substackcdn.com/image/fetch/$s_!haDm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F073da315-9081-4b8f-989a-bd3bf49cde19_1129x780.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" 
stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Source: <a href="https://dl.acm.org/doi/10.1145/3676151.3719370">https://dl.acm.org/doi/10.1145/3676151.3719370</a></figcaption></figure></div><p>I&#8217;m shocked by the advantage that AMD has here.  If CPU core power during busy waiting is your primary concern, then you should choose your chip carefully.  </p><h2>Dangling Pointers</h2><p><a href="https://en.cppreference.com/w/cpp/thread/condition_variable.html">Condition variables</a> are a general and useful abstraction.  It would be nice if code that used condition variables could automatically benefit from these instructions.  Maybe some compiler and/or hardware assistance is necessary to enable that.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://danglingpointers.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://danglingpointers.substack.com/subscribe?"><span>Subscribe now</span></a></p><p></p>]]></content:encoded></item><item><title><![CDATA[Re-architecting End-host Networking with CXL: Coherence, Memory, and Offloading]]></title><description><![CDATA[CXL Deep Dive]]></description><link>https://danglingpointers.substack.com/p/re-architecting-end-host-networking</link><guid isPermaLink="false">https://danglingpointers.substack.com/p/re-architecting-end-host-networking</guid><dc:creator><![CDATA[Blake Pelton]]></dc:creator><pubDate>Tue, 03 Feb 2026 13:02:45 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/$s_!aFs2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02b9f54d-ea63-4924-86f3-547bcdff870a_859x622.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a href="https://dl.acm.org/doi/10.1145/3725843.3756102">Re-architecting End-host Networking with CXL: Coherence, Memory, and Offloading</a> Houxiang Ji, Yifan Yuan, Yang Zhou, Ipoom Jeong, Ren Wang, Saksham Agarwal, and Nam Sung Kim <em>MICRO'25</em></p><p>This is the third paper I&#8217;ve posted about that deals with the subtleties of interfacing a NIC with a host CPU.  Here are links to the previous posts on this subject:</p><p><a href="https://danglingpointers.substack.com/p/disentangling-the-dual-role-of-nic">Disentangling the Dual Role of NIC Receive Rings</a></p><p><a href="https://danglingpointers.substack.com/p/ceio-a-cache-efficient-network-io">CEIO vs rxBisect: Fixing DDIO&#8217;s Leaky DMA Problem</a></p><p>The authors bring a new hammer to the construction site: <a href="https://en.wikipedia.org/wiki/Compute_Express_Link">CXL</a>, which offers some interesting efficiencies and simplifications.</p><h2>Two PCIe Problems</h2><p>This paper shows how CXL can address two specific problems with the HW/SW interface of a typical PCIe NIC:</p><ul><li><p>After the host prepares a packet to be transmitted, it notifies the NIC with an MMIO write.  This MMIO write is expensive because it introduces serialization into the host processor pipeline.</p></li><li><p>When the NIC sends a received packet to the host, ideally it would write data to the LLC rather than host DRAM.  However, if the host CPU cannot keep up, then the NIC should have a graceful fallback.</p></li></ul><h2>CXL Type-1</h2><p>CXL Type-1 devices are asymmetric: the device has coherent access to host memory, but the host does not have coherent access to device memory.  
Practically speaking, both packet descriptors and packet payloads must still be stored in host memory (no change from PCIe-based NICs).  </p><p>Because the NIC has coherent access to host memory, it can safely prefetch receive descriptors (RxDs) into an on-NIC cache.  When a packet arrives, the NIC can grab a descriptor from the cache and thus avoid an expensive host memory read to determine where to write packet data.  If the host CPU updates an RxD after the NIC has prefetched it, the CXL cache coherence protocol will notify the NIC that it must invalidate its cached data.</p><p>Coherence also enables the tail pointers for transmit ring buffers to be safely stored in host memory.  The host networking stack can update a tail pointer with a regular store instruction (rather than an MMIO write).  The NIC can continually poll this value, using coherent reads.  If the tail pointer has not been updated since the last poll, the NIC will read a cached value and not generate any PCIe traffic.</p><h2>CXL Type-2</h2><p>CXL Type-2 NICs allow packets and descriptors to be stored in NIC memory.  The host CPU can cache data read from the NIC, as the NIC will generate the necessary coherence traffic when it reads or writes this data.  The design space (what data goes into what memory) is large, and the results section has numbers for many possible configurations.</p><p>Section 5.3 of the paper describes how a Type-2 NIC can intelligently use the <code>NC-P</code> CXL operation to write received packet data directly into the host LLC.  This is similar to DDIO (described in the two papers linked at the top of this post), but the key difference is that the NIC is in the driver&#8217;s seat.  </p><p>The CEIO paper proposes monitoring LLC usage and falling back to storing received packets in DRAM local to the NIC if the LLC is too full.  
With CXL, the NIC has the option to write data to host memory directly (bypassing the LLC), thus avoiding the need for DRAM attached to the NIC.</p><h2>Results</h2><p>The authors implemented a CXL NIC on an Altera FPGA.  They compared it against an NVIDIA BlueField-3 PCIe NIC.  Fig. 10 compares loopback latency for the two devices, normalized to the BlueField-3 latency (lower is better) for a variety of CXL configurations.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!aFs2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02b9f54d-ea63-4924-86f3-547bcdff870a_859x622.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!aFs2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02b9f54d-ea63-4924-86f3-547bcdff870a_859x622.png 424w, https://substackcdn.com/image/fetch/$s_!aFs2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02b9f54d-ea63-4924-86f3-547bcdff870a_859x622.png 848w, https://substackcdn.com/image/fetch/$s_!aFs2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02b9f54d-ea63-4924-86f3-547bcdff870a_859x622.png 1272w, https://substackcdn.com/image/fetch/$s_!aFs2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02b9f54d-ea63-4924-86f3-547bcdff870a_859x622.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!aFs2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02b9f54d-ea63-4924-86f3-547bcdff870a_859x622.png" width="859" height="622" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/02b9f54d-ea63-4924-86f3-547bcdff870a_859x622.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:622,&quot;width&quot;:859,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:94278,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://danglingpointers.substack.com/i/181730844?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02b9f54d-ea63-4924-86f3-547bcdff870a_859x622.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!aFs2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02b9f54d-ea63-4924-86f3-547bcdff870a_859x622.png 424w, https://substackcdn.com/image/fetch/$s_!aFs2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02b9f54d-ea63-4924-86f3-547bcdff870a_859x622.png 848w, https://substackcdn.com/image/fetch/$s_!aFs2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02b9f54d-ea63-4924-86f3-547bcdff870a_859x622.png 1272w, https://substackcdn.com/image/fetch/$s_!aFs2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02b9f54d-ea63-4924-86f3-547bcdff870a_859x622.png 1456w" 
sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Source: <a href="https://dl.acm.org/doi/pdf/10.1145/3725843.3756102">https://dl.acm.org/doi/pdf/10.1145/3725843.3756102</a></figcaption></figure></div><h2>Dangling Pointers</h2><p>One fact I took away from this paper is that CXL coherence messages are much cheaper than MMIOs and interrupts.  Burning a CPU core polling a memory location seems wasteful to me.  
It would be nice if that CPU core could at least go into a low power state until a relevant coherence message arrives.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://danglingpointers.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Dangling Pointers! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Forest: Access-aware GPU UVM Management]]></title><description><![CDATA[Plugging holes in a leaky abstraction]]></description><link>https://danglingpointers.substack.com/p/forest-access-aware-gpu-uvm-management</link><guid isPermaLink="false">https://danglingpointers.substack.com/p/forest-access-aware-gpu-uvm-management</guid><dc:creator><![CDATA[Blake Pelton]]></dc:creator><pubDate>Thu, 29 Jan 2026 13:03:37 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ybf9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57abd938-72d4-4944-beb8-39b2350620b0_784x409.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a href="https://dl.acm.org/doi/10.1145/3695053.3731047">Forest: Access-aware GPU UVM Management</a> Mao Lin, Yuan Feng, Guilherme Cox, and Hyeran Jeon <em>ISCA'25</em></p><h2>Unified Virtual Memory</h2><p>Unified virtual memory is an abstraction which presents a single unified address space for both the CPU and GPU.  
This is a convenient programming model because it allows one device to create a complex data structure (with pointers) and pass that directly to the other device.</p><p>Maintaining this illusion in systems with discrete GPUs is a complex task.  The state-of-the-art involves initially placing allocations in host memory and then copying them to GPU memory to resolve GPU page faults (<em>far-faults</em>).  Prefetching can help avoid some far-faults.  A state-of-the-art prefetcher is called the <em>tree-based neighboring prefetcher</em> (TBNp).  Fig. 3 shows the TBNp data structure for a 512KiB allocation:</p><h2>Tree-based Neighboring Prefetcher</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ybf9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57abd938-72d4-4944-beb8-39b2350620b0_784x409.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ybf9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57abd938-72d4-4944-beb8-39b2350620b0_784x409.png 424w, https://substackcdn.com/image/fetch/$s_!ybf9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57abd938-72d4-4944-beb8-39b2350620b0_784x409.png 848w, https://substackcdn.com/image/fetch/$s_!ybf9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57abd938-72d4-4944-beb8-39b2350620b0_784x409.png 1272w, https://substackcdn.com/image/fetch/$s_!ybf9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57abd938-72d4-4944-beb8-39b2350620b0_784x409.png 1456w" 
sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ybf9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57abd938-72d4-4944-beb8-39b2350620b0_784x409.png" width="784" height="409" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/57abd938-72d4-4944-beb8-39b2350620b0_784x409.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:409,&quot;width&quot;:784,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:82033,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://danglingpointers.substack.com/i/181355196?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57abd938-72d4-4944-beb8-39b2350620b0_784x409.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ybf9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57abd938-72d4-4944-beb8-39b2350620b0_784x409.png 424w, https://substackcdn.com/image/fetch/$s_!ybf9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57abd938-72d4-4944-beb8-39b2350620b0_784x409.png 848w, https://substackcdn.com/image/fetch/$s_!ybf9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57abd938-72d4-4944-beb8-39b2350620b0_784x409.png 1272w, 
https://substackcdn.com/image/fetch/$s_!ybf9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57abd938-72d4-4944-beb8-39b2350620b0_784x409.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Source: <a href="https://dl.acm.org/doi/10.1145/3695053.3731047">https://dl.acm.org/doi/10.1145/3695053.3731047</a></figcaption></figure></div><p>TBNp tracks each 2MiB region of virtual memory with a binary tree containing 64KiB leaf nodes.  When a far-fault occurs on a 4 KiB page (yellow &#8220;1&#8221; in Fig. 
3) the remaining 60 KiB (blue &#8220;2&#8221; in Fig. 3) in the leaf are prefetched.  </p><p>Once the leaves contained by a particular sub-tree are &gt;= 50% resident in GPU memory, the remainder of the sub-tree is prefetched.  In Fig. 3 that situation occurs when the three left-most leaf nodes have been transferred to GPU memory.  At that point, the remaining leaf node in the subtree (green &#8220;7&#8221; in Fig. 3) is prefetched.</p><h2>Access Pattern Profiling</h2><p>This paper proposes customizing the TBNp trees associated with each object, with the help of access pattern profiling hardware.  Fig. 7 illustrates the design:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Eh_P!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9b6a0d5-30e3-41e1-90ea-b6ff74ace9dd_772x769.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Eh_P!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9b6a0d5-30e3-41e1-90ea-b6ff74ace9dd_772x769.png 424w, https://substackcdn.com/image/fetch/$s_!Eh_P!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9b6a0d5-30e3-41e1-90ea-b6ff74ace9dd_772x769.png 848w, https://substackcdn.com/image/fetch/$s_!Eh_P!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9b6a0d5-30e3-41e1-90ea-b6ff74ace9dd_772x769.png 1272w, https://substackcdn.com/image/fetch/$s_!Eh_P!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9b6a0d5-30e3-41e1-90ea-b6ff74ace9dd_772x769.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!Eh_P!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9b6a0d5-30e3-41e1-90ea-b6ff74ace9dd_772x769.png" width="772" height="769" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a9b6a0d5-30e3-41e1-90ea-b6ff74ace9dd_772x769.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:769,&quot;width&quot;:772,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:154234,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://danglingpointers.substack.com/i/181355196?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9b6a0d5-30e3-41e1-90ea-b6ff74ace9dd_772x769.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Eh_P!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9b6a0d5-30e3-41e1-90ea-b6ff74ace9dd_772x769.png 424w, https://substackcdn.com/image/fetch/$s_!Eh_P!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9b6a0d5-30e3-41e1-90ea-b6ff74ace9dd_772x769.png 848w, https://substackcdn.com/image/fetch/$s_!Eh_P!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9b6a0d5-30e3-41e1-90ea-b6ff74ace9dd_772x769.png 1272w, https://substackcdn.com/image/fetch/$s_!Eh_P!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9b6a0d5-30e3-41e1-90ea-b6ff74ace9dd_772x769.png 1456w" 
sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Source: <a href="https://dl.acm.org/doi/10.1145/3695053.3731047">https://dl.acm.org/doi/10.1145/3695053.3731047</a></figcaption></figure></div><p>The GPU MMU tracks access times for each object (i.e., each allocation created by <code>cudaMallocManaged</code>) and for each page.  
Once a healthy amount of profiling data has been collected, the GPU fires an interrupt which notifies the driver that it should grab the latest statistics.</p><h2>Access Pattern Classification</h2><p>The driver classifies access patterns for each object into one of the following four buckets:</p><ol><li><p>Linear/Streaming</p></li><li><p>Non-Linear, High-Coverage, High-Intensity</p></li><li><p>Non-Linear, High-Coverage, Low-Intensity</p></li><li><p>Non-Linear, Low-Coverage</p></li></ol><p>The paper uses tree-based prefetching like TBNp, but configures the trees differently for each object depending on the access pattern bucket.  This is where the name &#8220;Forest&#8221; comes from: each tree has a maximum size, so large objects are chopped up and tracked with multiple trees.</p><p>Streaming accesses are detected with linear regression.  If the R<sup>2</sup> value is close to 1, then the driver classifies the accesses as streaming.  The prefetching trees used for objects accessed in a streaming manner are 4MiB in size and contain 256KiB leaf nodes (relatively large).</p><p>If the driver determines accesses are not streaming, then the choice between the remaining three buckets is determined by the <em>access coverage</em> and <em>access intensity</em>.  Access coverage is computed based on the minimum and maximum page numbers accessed during profiling.  Access intensity is based on the number of accesses during profiling.</p><p>For objects with high access coverage and high access intensity, the associated prefetching trees are 512KiB in size and contain 64KiB leaf nodes.</p><p>For objects with high access coverage and low access intensity, the associated prefetching trees are 512KiB in size and contain 16KiB leaf nodes.</p><p>Finally, for objects with low access coverage, the associated prefetching trees are 2MiB in size and contain 64KiB leaf nodes.</p><h2>Results</h2><p>Fig. 
12 contains simulation results for a number of benchmarks:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!oE2n!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13f9cc4b-4536-40d2-aa08-1724e75f8c6c_1885x422.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!oE2n!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13f9cc4b-4536-40d2-aa08-1724e75f8c6c_1885x422.png 424w, https://substackcdn.com/image/fetch/$s_!oE2n!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13f9cc4b-4536-40d2-aa08-1724e75f8c6c_1885x422.png 848w, https://substackcdn.com/image/fetch/$s_!oE2n!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13f9cc4b-4536-40d2-aa08-1724e75f8c6c_1885x422.png 1272w, https://substackcdn.com/image/fetch/$s_!oE2n!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13f9cc4b-4536-40d2-aa08-1724e75f8c6c_1885x422.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!oE2n!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13f9cc4b-4536-40d2-aa08-1724e75f8c6c_1885x422.png" width="1456" height="326" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/13f9cc4b-4536-40d2-aa08-1724e75f8c6c_1885x422.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:326,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:45259,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://danglingpointers.substack.com/i/181355196?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13f9cc4b-4536-40d2-aa08-1724e75f8c6c_1885x422.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!oE2n!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13f9cc4b-4536-40d2-aa08-1724e75f8c6c_1885x422.png 424w, https://substackcdn.com/image/fetch/$s_!oE2n!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13f9cc4b-4536-40d2-aa08-1724e75f8c6c_1885x422.png 848w, https://substackcdn.com/image/fetch/$s_!oE2n!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13f9cc4b-4536-40d2-aa08-1724e75f8c6c_1885x422.png 1272w, https://substackcdn.com/image/fetch/$s_!oE2n!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13f9cc4b-4536-40d2-aa08-1724e75f8c6c_1885x422.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Source: <a 
href="https://dl.acm.org/doi/10.1145/3695053.3731047">https://dl.acm.org/doi/10.1145/3695053.3731047</a></figcaption></figure></div><p><code>Forest</code> is the design described in this paper.  <code>SpecForest</code> is a modification that avoids the overheads associated with initial profiling by trying to make better initial guesses about access patterns before profiling data is available.  </p><h2>Dangling Pointers</h2><p>I wonder how much vertical integration can help here.  Certainly, a number of applications have enough context to make smarter decisions than the driver relying on profiling information.</p>]]></content:encoded></item><item><title><![CDATA[UPP: Universal Predicate Pushdown to Smart Storage]]></title><description><![CDATA[Best-effort filtering for OLAP]]></description><link>https://danglingpointers.substack.com/p/upp-universal-predicate-pushdown</link><guid isPermaLink="false">https://danglingpointers.substack.com/p/upp-universal-predicate-pushdown</guid><dc:creator><![CDATA[Blake Pelton]]></dc:creator><pubDate>Tue, 27 Jan 2026 13:03:28 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ISya!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16920960-b7bb-4ef2-a131-4f6bc5d5083d_1463x315.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a href="https://dl.acm.org/doi/10.1145/3695053.3731005">UPP: Universal Predicate Pushdown to Smart Storage</a> Ipoom Jeong, Jinghan Huang, Chuxuan Hu, Dohyun Park, Jaeyoung Kang, Nam Sung Kim, and Yongjoo Park <em>ISCA'25</em></p><p>Working on hardware 
acceleration requires a healthy dose of honesty.  If you try hard enough, you can find clever ways to accelerate a given application.  Once that point is reached, it is helpful to step back and ask yourself: &#8220;are these ideas generally applicable to many hardware architectures, or only to the one I am targeting?&#8221;</p><p>This paper describes techniques for high-performance filtering of OLAP data, and an FPGA implementation.  I wonder if these ideas would also work well on other chips.</p><h2>Per-row Bitmaps</h2><p>This paper rests on two assumptions:</p><ol><li><p>Inputs are relatively static, which means that the cost of preprocessing can be amortized over many queries</p></li><li><p>Best-effort filtering is OK, because the system has another level of filtering to catch any false positives (rows which should have been removed, but were not)</p></li></ol><p>A preprocessing step generates a 256-bit <em>row vector</em> (RV) associated with each row.  These bits are partitioned among all columns (e.g., if there are 8 columns in the relation, then each column is represented with 32 bits per row).  When a query is run, the relevant filters from the query are converted into a set of 256-bit <em>query vectors</em> (QVs) and simple instructions which perform logical operations between the row vectors and query vectors.  The result of those instructions is a single bit per row which determines if the row can be safely removed.</p><h2>Numerical Filters</h2><p>Numerical expressions (e.g., <code>l_quantity &gt;= 20 and l_quantity &lt;= 30</code>) are supported for <a href="https://en.wikipedia.org/wiki/Monotonic_function">monotone</a> functions.  During the preprocessing step, the lower and upper bounds of each column are computed.  This range is divided into a fixed number of buckets.  For each value in a column, the associated bucket index is computed, and the associated bit in the row vector is set to 1.  
Hashing is used to handle the case where there are more buckets than bits allocated for a given column.</p><p>When a query is executed, the software can determine the set of buckets which the query references.  For example, say the filter expression is: </p><p><code>l_quantity &gt;= 20 and l_quantity &lt;= 30</code> </p><p>and the buckets for <code>l_quantity</code> are: </p><p><code>[0, 10), [10, 20), [20, 30), [30, 40), [40, 50)</code>. </p><p>The query touches buckets <code>[20, 30)</code> and <code>[30, 40)</code> (the value 30 falls in the latter), so the query vector which selects rows that should be kept is (LSB first):</p><p><code>00110</code></p><p>To determine if a row should be filtered, compute the bitwise AND of the row vector and the query vector.  If all bits of the result are zero, then the row can be removed.</p><h2>String Filters</h2><p>To convert a string into a row vector, the paper proposes tokenizing the string, and then hashing each token (i.e., word) to determine which bit in the row vector to set.  This means that multiple bits in a row vector can be set (one bit per word).  Only tokens which appear frequently in the dataset are hashed; the rest are ignored.</p><p>A query expression like <code>l_shipinstruct = 'DELIVER IN PERSON'</code> is decomposed into three tokens; each token is hashed, and the hash values determine which bits in the query vector are set.  Rows are accepted if they have all 3 bits set.  Note that this is best-effort filtering.  For example, if a row contains the string <code>'PERSON DELIVER IN'</code> in the <code>l_shipinstruct</code> column, that row will <strong>not</strong> be removed.</p><h2>Results</h2><p>Table 3 shows FPGA resource usage numbers for an FPGA accelerator which executes queries by performing bitwise operations on row and query vectors, and compacting tables based on the generated bitmaps.  These numbers seem pretty modest (i.e., good) to me.  
The authors argue that this implementation is small enough that it could be incorporated into a Smart SSD, allowing queries to be pushed down as far as possible.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Xp6J!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97ccd8b5-cef8-4b4f-8479-b784206ed8d5_853x266.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Xp6J!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97ccd8b5-cef8-4b4f-8479-b784206ed8d5_853x266.png 424w, https://substackcdn.com/image/fetch/$s_!Xp6J!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97ccd8b5-cef8-4b4f-8479-b784206ed8d5_853x266.png 848w, https://substackcdn.com/image/fetch/$s_!Xp6J!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97ccd8b5-cef8-4b4f-8479-b784206ed8d5_853x266.png 1272w, https://substackcdn.com/image/fetch/$s_!Xp6J!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97ccd8b5-cef8-4b4f-8479-b784206ed8d5_853x266.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Xp6J!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97ccd8b5-cef8-4b4f-8479-b784206ed8d5_853x266.png" width="853" height="266" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/97ccd8b5-cef8-4b4f-8479-b784206ed8d5_853x266.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:266,&quot;width&quot;:853,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:46826,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://danglingpointers.substack.com/i/181086978?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97ccd8b5-cef8-4b4f-8479-b784206ed8d5_853x266.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Xp6J!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97ccd8b5-cef8-4b4f-8479-b784206ed8d5_853x266.png 424w, https://substackcdn.com/image/fetch/$s_!Xp6J!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97ccd8b5-cef8-4b4f-8479-b784206ed8d5_853x266.png 848w, https://substackcdn.com/image/fetch/$s_!Xp6J!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97ccd8b5-cef8-4b4f-8479-b784206ed8d5_853x266.png 1272w, https://substackcdn.com/image/fetch/$s_!Xp6J!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97ccd8b5-cef8-4b4f-8479-b784206ed8d5_853x266.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Source: <a href="https://dl.acm.org/doi/10.1145/3695053.3731005">https://dl.acm.org/doi/10.1145/3695053.3731005</a></figcaption></figure></div><p>Fig. 6 shows TPC-H performance results.  Each pair of bars represents a particular query run on a baseline system, and on a system with filtering pushed down to a Smart SSD.  
Q21 doesn&#8217;t see a speedup because it is bound by join and aggregation, not filtering.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ISya!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16920960-b7bb-4ef2-a131-4f6bc5d5083d_1463x315.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ISya!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16920960-b7bb-4ef2-a131-4f6bc5d5083d_1463x315.png 424w, https://substackcdn.com/image/fetch/$s_!ISya!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16920960-b7bb-4ef2-a131-4f6bc5d5083d_1463x315.png 848w, https://substackcdn.com/image/fetch/$s_!ISya!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16920960-b7bb-4ef2-a131-4f6bc5d5083d_1463x315.png 1272w, https://substackcdn.com/image/fetch/$s_!ISya!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16920960-b7bb-4ef2-a131-4f6bc5d5083d_1463x315.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ISya!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16920960-b7bb-4ef2-a131-4f6bc5d5083d_1463x315.png" width="1456" height="313" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/16920960-b7bb-4ef2-a131-4f6bc5d5083d_1463x315.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:313,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:101786,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://danglingpointers.substack.com/i/181086978?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16920960-b7bb-4ef2-a131-4f6bc5d5083d_1463x315.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ISya!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16920960-b7bb-4ef2-a131-4f6bc5d5083d_1463x315.png 424w, https://substackcdn.com/image/fetch/$s_!ISya!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16920960-b7bb-4ef2-a131-4f6bc5d5083d_1463x315.png 848w, https://substackcdn.com/image/fetch/$s_!ISya!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16920960-b7bb-4ef2-a131-4f6bc5d5083d_1463x315.png 1272w, https://substackcdn.com/image/fetch/$s_!ISya!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16920960-b7bb-4ef2-a131-4f6bc5d5083d_1463x315.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Source: <a href="https://dl.acm.org/doi/10.1145/3695053.3731005">https://dl.acm.org/doi/10.1145/3695053.3731005</a></figcaption></figure></div><h2>Dangling 
Pointers</h2><p>I wonder how much of this is overfitting to TPC-H.  If you look at the <a href="https://www.tpc.org/tpc_documents_current_versions/current_specifications5.asp">TPC-H spec</a>, a lot of string columns are generated by randomly sampling from a very small set of possible tokens.  It would be great if the industry had a &#8220;held out test set&#8221; which could be used to evaluate OLAP performance on real-world yet hidden datasets which researchers could not directly see.</p>]]></content:encoded></item></channel></rss>