NVIDIA TileIR Internals: from CuTile to MLIR/LLVM to SASS · Henry Zhu · 2026-01-30 · https://maknee.github.io/blog/2026/NVIDIA-TileIR-Internals-from-CuTile-to-MLIR-LLVM-to-SASS <p>In this post, we’ll dig deep into how TileIR works, from how it generates instructions to what its individual passes do. We’ll trace how a Mixture-of-Experts (MoE) kernel written in CuTile gets compiled down through <code class="language-plaintext highlighter-rouge">cuda_tile</code> → <code class="language-plaintext highlighter-rouge">nv_tileaa</code> → <code class="language-plaintext highlighter-rouge">nv_tileas</code> → NVVM → LLVM → SASS.</p> <p>Here’s what to expect:</p> <ul> <li><a href="#what-is-cutile"><strong>What is CuTile?</strong></a> — The tile-centric programming model</li> <li><a href="#running-example-moe-kernel"><strong>Running Example</strong></a> — An MoE kernel we’ll trace through every stage</li> <li><a href="#the-dialects"><strong>The Dialects</strong></a> — From <code class="language-plaintext highlighter-rouge">cuda_tile</code> through <code class="language-plaintext highlighter-rouge">nv_tileaa</code> and <code class="language-plaintext highlighter-rouge">nv_tileas</code> to NVVM/LLVM</li> <li><a href="#the-passes"><strong>The Passes</strong></a> — TileIR passes: what they do and when they run</li> </ul> <p><em>Based on CUDA 13.1.
Some details are undocumented and may change in future releases.</em></p> <h1 id="what-is-cutile">What is CuTile?</h1> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2026-01-29/cutile.png" width="100%" alt="" /> <div class="caption"> <em>CuTile separates user responsibility (splitting work into blocks and tiles) from system responsibility (mapping to threads) (Image source: <a href="https://youtu.be/_b4I4rKpsGA?t=406" rel="external nofollow noopener" target="_blank">GPU MODE</a>) </em> </div> </div> <p><a href="https://github.com/NVIDIA/cutile-python">CuTile</a> is NVIDIA’s new “tile-centric” programming model for modern NVIDIA GPUs. This abstraction is powerful: CuTile lets the programmer think in terms of tiles rather than threads, while the compiler handles the complexity of coordinating hundreds of threads across fragmented data. A single CuTile line <code class="language-plaintext highlighter-rouge">ct.mma(a, b, acc)</code> can expand into many tensor core instructions.</p> <h2 id="what-is-tileir">What is TileIR?</h2> <p>TileIR is NVIDIA’s MLIR-based compiler infrastructure that powers CuTile. It progressively lowers your high-level tensor operations through multiple MLIR dialects and NVIDIA-specific tools:</p> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2026-01-29/pipeline_overview.svg" width="100%" alt="" /> <div class="caption"> <em>TileIR compilation pipeline: Python → SASS </em> </div> </div> <p>The user-facing tool is <code class="language-plaintext highlighter-rouge">tileiras</code><span class="sidenote-ref"></span><span class="sidenote">Like <code class="language-plaintext highlighter-rouge">ptxas</code> but for TileIR.
Yes, NVIDIA named it “tile-ir-as” (tile IR assembler).</span>, which orchestrates this entire pipeline.</p> <hr /> <h1 id="running-example-moe-kernel">Running Example: MoE Kernel</h1> <p>Throughout this post, we’ll trace this <strong>MoE (Mixture of Experts) kernel</strong> through every compilation stage. This is code from <a href="https://github.com/NVIDIA/cutile-python/blob/main/samples/MoE.py">NVIDIA’s cutile-python samples</a><span class="sidenote-ref"></span><span class="sidenote">There’s also a C++ API: <a href="https://github.com/NVIDIA/cuda-tile">NVIDIA/cuda-tile</a>. Operations like <code class="language-plaintext highlighter-rouge">ct.gather</code>, <code class="language-plaintext highlighter-rouge">ct.mma</code>, and <code class="language-plaintext highlighter-rouge">cuda_tile.load_view_tko</code> are documented in the <a href="https://docs.nvidia.com/cuda/tile-ir/13.1/sections/operations.html">TileIR docs</a>.</span>:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">@ct.kernel</span> <span class="k">def</span> <span class="nf">fused_moe_kernel</span><span class="p">(</span> <span class="n">A</span><span class="p">,</span> <span class="c1"># Input tokens, shape (batch, K) </span> <span class="n">B</span><span class="p">,</span> <span class="c1"># Expert weights, shape (num_experts, N, K) </span> <span class="n">C</span><span class="p">,</span> <span class="c1"># Output tensor, shape (num_tokens * topk, N) </span> <span class="n">topk_weights</span><span class="p">,</span> <span class="c1"># Router weights for each token-expert pair </span> <span class="n">sorted_token_ids</span><span class="p">,</span> <span class="c1"># Token indices sorted by expert assignment </span> <span class="n">sorted_expert_ids</span><span class="p">,</span> <span class="c1"># Expert index for each TILE_M </span> <span class="n">num_token_replicas</span><span class="p">:</span> <span class="nb">int</span><span
class="p">,</span> <span class="n">mul_routed_weight</span><span class="p">:</span> <span class="n">ConstBool</span><span class="p">,</span> <span class="n">TILE_M</span><span class="p">:</span> <span class="n">ConstInt</span><span class="p">,</span> <span class="n">TILE_N</span><span class="p">:</span> <span class="n">ConstInt</span><span class="p">,</span> <span class="n">TILE_K</span><span class="p">:</span> <span class="n">ConstInt</span><span class="p">,</span> <span class="p">):</span> <span class="n">M</span> <span class="o">=</span> <span class="n">sorted_token_ids</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="n">N</span> <span class="o">=</span> <span class="n">B</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="n">K</span> <span class="o">=</span> <span class="n">B</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="n">GROUP_SIZE_M</span> <span class="o">=</span> <span class="mi">8</span> <span class="n">bid_m</span><span class="p">,</span> <span class="n">bid_n</span> <span class="o">=</span> <span class="nf">swizzle_2d</span><span class="p">(</span><span class="n">M</span><span class="p">,</span> <span class="n">N</span><span class="p">,</span> <span class="n">TILE_M</span><span class="p">,</span> <span class="n">TILE_N</span><span class="p">,</span> <span class="n">GROUP_SIZE_M</span><span class="p">)</span> <span class="c1"># → cuda_tile.get_tile_block_id </span> <span class="c1"># Gather token indices for this block </span> <span class="n">token_id_indices</span> <span class="o">=</span> <span class="n">bid_m</span> <span class="o">*</span> <span class="n">TILE_M</span> <span class="o">+</span> <span class="n">ct</span><span class="p">.</span><span 
class="nf">arange</span><span class="p">(</span><span class="n">TILE_M</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">ct</span><span class="p">.</span><span class="n">int32</span><span class="p">)</span> <span class="n">token_ids</span> <span class="o">=</span> <span class="n">ct</span><span class="p">.</span><span class="nf">gather</span><span class="p">(</span><span class="n">sorted_token_ids</span><span class="p">,</span> <span class="n">token_id_indices</span><span class="p">)</span> <span class="c1"># → cuda_tile.load_view_tko </span> <span class="n">a_row_indices</span> <span class="o">=</span> <span class="n">token_ids</span> <span class="o">//</span> <span class="n">num_token_replicas</span> <span class="n">expert_id</span> <span class="o">=</span> <span class="n">ct</span><span class="p">.</span><span class="nf">load</span><span class="p">(</span><span class="n">sorted_expert_ids</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="n">bid_m</span><span class="p">,</span> <span class="n">shape</span><span class="o">=</span><span class="p">())</span> <span class="c1"># → cuda_tile.load_ptr_tko </span> <span class="c1"># Initialize accumulator </span> <span class="n">accumulator</span> <span class="o">=</span> <span class="n">ct</span><span class="p">.</span><span class="nf">full</span><span class="p">((</span><span class="n">TILE_M</span><span class="p">,</span> <span class="n">TILE_N</span><span class="p">),</span> <span class="mf">0.0</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">ct</span><span class="p">.</span><span class="n">float32</span><span class="p">)</span> <span class="c1"># → cuda_tile.constant </span> <span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="nf">range</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span 
class="n">ct</span><span class="p">.</span><span class="nf">cdiv</span><span class="p">(</span><span class="n">K</span><span class="p">,</span> <span class="n">TILE_K</span><span class="p">)):</span> <span class="c1"># → cuda_tile.for </span> <span class="c1"># Load A tile (gathered by token indices) </span> <span class="n">a_col_indices</span> <span class="o">=</span> <span class="n">k</span> <span class="o">*</span> <span class="n">TILE_K</span> <span class="o">+</span> <span class="n">ct</span><span class="p">.</span><span class="nf">arange</span><span class="p">(</span><span class="n">TILE_K</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">ct</span><span class="p">.</span><span class="n">int32</span><span class="p">)</span> <span class="n">a</span> <span class="o">=</span> <span class="n">ct</span><span class="p">.</span><span class="nf">gather</span><span class="p">(</span><span class="n">A</span><span class="p">,</span> <span class="p">(</span><span class="n">a_row_indices</span><span class="p">[:,</span> <span class="bp">None</span><span class="p">],</span> <span class="n">a_col_indices</span><span class="p">[</span><span class="bp">None</span><span class="p">,</span> <span class="p">:]))</span> <span class="c1"># → cuda_tile.load_view_tko </span> <span class="c1"># Load B tile (expert weights) </span> <span class="n">b</span> <span class="o">=</span> <span class="n">ct</span><span class="p">.</span><span class="nf">load</span><span class="p">(</span><span class="n">B</span><span class="p">,</span> <span class="p">(</span><span class="n">expert_id</span><span class="p">,</span> <span class="n">k</span><span class="p">,</span> <span class="n">bid_n</span><span class="p">),</span> <span class="n">shape</span><span class="o">=</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">TILE_K</span><span class="p">,</span> <span class="n">TILE_N</span><span class="p">),</span> 
<span class="n">order</span><span class="o">=</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">1</span><span class="p">)).</span><span class="nf">reshape</span><span class="p">((</span><span class="n">TILE_K</span><span class="p">,</span> <span class="n">TILE_N</span><span class="p">))</span> <span class="c1"># → cuda_tile.load_ptr_tko </span> <span class="n">accumulator</span> <span class="o">=</span> <span class="n">ct</span><span class="p">.</span><span class="nf">mma</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">accumulator</span><span class="p">)</span> <span class="c1"># → cuda_tile.mmaf ← THE COMPUTE! </span> <span class="k">if</span> <span class="n">mul_routed_weight</span><span class="p">:</span> <span class="n">moe_weight</span> <span class="o">=</span> <span class="n">ct</span><span class="p">.</span><span class="nf">gather</span><span class="p">(</span><span class="n">topk_weights</span><span class="p">,</span> <span class="n">token_ids</span><span class="p">)</span> <span class="n">accumulator</span> <span class="o">=</span> <span class="n">accumulator</span> <span class="o">*</span> <span class="n">moe_weight</span><span class="p">[:,</span> <span class="bp">None</span><span class="p">]</span> <span class="c1"># → cuda_tile.mulf </span> <span class="c1"># Scatter results back to output </span> <span class="n">c_col_indices</span> <span class="o">=</span> <span class="n">bid_n</span> <span class="o">*</span> <span class="n">TILE_N</span> <span class="o">+</span> <span class="n">ct</span><span class="p">.</span><span class="nf">arange</span><span class="p">(</span><span class="n">TILE_N</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">ct</span><span class="p">.</span><span class="n">int32</span><span 
class="p">)</span> <span class="n">accumulator</span> <span class="o">=</span> <span class="n">ct</span><span class="p">.</span><span class="nf">astype</span><span class="p">(</span><span class="n">accumulator</span><span class="p">,</span> <span class="n">C</span><span class="p">.</span><span class="n">dtype</span><span class="p">)</span> <span class="c1"># → cuda_tile.ftof </span> <span class="n">ct</span><span class="p">.</span><span class="nf">scatter</span><span class="p">(</span><span class="n">C</span><span class="p">,</span> <span class="p">(</span><span class="n">token_ids</span><span class="p">[:,</span> <span class="bp">None</span><span class="p">],</span> <span class="n">c_col_indices</span><span class="p">[</span><span class="bp">None</span><span class="p">,</span> <span class="p">:]),</span> <span class="n">accumulator</span><span class="p">)</span> <span class="c1"># → cuda_tile.store_ptr_tko </span></code></pre></div></div> <p><strong>The three key operations we’ll trace:</strong></p> <table style="width: 100%; border-collapse: collapse; font-family: monospace; font-size: 0.9em;"> <thead> <tr style="background: linear-gradient(135deg, #1a1a2e 0%, #16213e 100%);"> <th style="padding: 12px; text-align: left; color: #76b900; border-bottom: 2px solid #76b900;">Python</th> <th style="padding: 12px; text-align: left; color: #76b900; border-bottom: 2px solid #76b900;">cuda_tile</th> <th style="padding: 12px; text-align: left; color: #76b900; border-bottom: 2px solid #76b900;">What it does</th> </tr> </thead> <tbody> <tr style="background: rgba(118, 185, 0, 0.1);"> <td style="padding: 10px; border-bottom: 1px solid #333;">ct.gather(A, indices)</td> <td style="padding: 10px; border-bottom: 1px solid #333;">load_view_tko</td> <td style="padding: 10px; border-bottom: 1px solid #333; font-family: sans-serif;">Gather tokens by expert assignment (indirect load)</td> </tr> <tr style="background: rgba(0, 150, 255, 0.1);"> <td style="padding: 10px; border-bottom: 
1px solid #333;">ct.load(B, ...)</td> <td style="padding: 10px; border-bottom: 1px solid #333;">load_ptr_tko</td> <td style="padding: 10px; border-bottom: 1px solid #333; font-family: sans-serif;">Load expert weights (direct load)</td> </tr> <tr style="background: rgba(255, 100, 100, 0.1);"> <td style="padding: 10px;">ct.mma(a, b, acc)</td> <td style="padding: 10px;">mmaf</td> <td style="padding: 10px; font-family: sans-serif;">Matrix multiply-accumulate on tensor cores</td> </tr> </tbody> </table> <p>Watch how these transform through <code class="language-plaintext highlighter-rouge">nv_tileaa</code>, <code class="language-plaintext highlighter-rouge">nv_tileas</code> and finally to SASS instructions.</p> <hr /> <h1 id="compiling-with-tileiras">Compiling with tileiras</h1> <p>The <code class="language-plaintext highlighter-rouge">tileiras</code> command-line tool is the ahead-of-time compiler that transforms <code class="language-plaintext highlighter-rouge">.cutile</code> bytecode into GPU binaries.</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>tileiras <span class="nt">--gpu-name</span> sm_120 MoE.cutile <span class="nt">-o</span> moe.cubin </code></pre></div></div> <h2 id="undocumented-environment-variables">Undocumented Environment Variables</h2> <p>These TileIR-specific environment variables affect compilation:</p> <link rel="stylesheet" href="/assets/css/fancy_table.css" /> <script> function toggleCalculations(tableId) { // ... 
(toggleCalculations function remains the same) const table = document.getElementById(tableId); const cells = table.querySelectorAll('.has-calculation'); const tableWrapper = document.getElementById(tableId + '-wrapper'); const toggleSwitch = document.getElementById(tableId + '-switch'); const toggleText = document.getElementById(tableId + '-toggle-text'); if (tableWrapper.classList.contains('show-calculations')) { tableWrapper.classList.remove('show-calculations'); toggleSwitch.classList.remove('active'); toggleText.textContent = "Show calculations"; cells.forEach(cell => { const normalText = cell.querySelector('.normal-text'); const calculationText = cell.querySelector('.calculation-text'); normalText.style.display = 'inline'; calculationText.style.display = 'none'; }); } else { tableWrapper.classList.add('show-calculations'); toggleSwitch.classList.add('active'); toggleText.textContent = "Hide calculations"; cells.forEach(cell => { const normalText = cell.querySelector('.normal-text'); const calculationText = cell.querySelector('.calculation-text'); normalText.style.display = 'none'; calculationText.style.display = 'inline'; }); } } function isConsideredYellow(colorString) { if (!colorString) return false; const lowerColor = colorString.toLowerCase(); const yellowKeywords = ['yellow', 'gold', 'lemon', 'chiffon', 'goldenrod', 'papayawhip', 'moccasin', 'khaki', '#ff0', '#ffff00', '#ffffe0', '#fffacd', '#fafad2', '#fff8dc', '#eee8aa', '#f0e68c']; if (yellowKeywords.some(k => lowerColor.includes(k))) { return true; } const match = lowerColor.match(/rgba?\((\d+),\s*(\d+),\s*(\d+)/); if (match) { const r = parseInt(match[1]); const g = parseInt(match[2]); const b = parseInt(match[3]); if (r > 200 && g > 180 && b < 200 && Math.abs(r - g) < 70) { return true; } } if (lowerColor === '#ffdab9') return true; // PeachPuff if (lowerColor === 'rgba(255,255,0,0.2)') return true; // Example: semi-transparent yellow return false; } function initializeCellHighlighters() { const 
highlighters = document.querySelectorAll('[data-highlight-cells]'); highlighters.forEach(highlighter => { if (highlighter.dataset.highlighterInitialized === 'true') return; highlighter.dataset.highlighterInitialized = 'true'; const cellIdsToHighlight = highlighter.dataset.highlightCells.split(',').map(id => id.trim()); const cellElements = cellIdsToHighlight.map(id => document.getElementById(id)).filter(el => el); let descriptiveSpans = []; if (highlighter.matches('span[data-hover-text-color]')) { descriptiveSpans.push(highlighter); } else { descriptiveSpans = Array.from(highlighter.querySelectorAll('span[data-hover-text-color]')); } descriptiveSpans.forEach(span => { if (typeof span.dataset.originalParaTextColor === 'undefined') { span.dataset.originalParaTextColor = window.getComputedStyle(span).color || ''; } }); highlighter.addEventListener('mouseenter', () => { highlighter.classList.add('trigger-text-active'); descriptiveSpans.forEach(span => { if (span.dataset.hoverTextColor) { span.style.color = span.dataset.hoverTextColor; } }); cellElements.forEach((cell, idx) => { cell.classList.add('cell-highlighted'); let hoverTextColorForCell = null; let hoverCellBgFromDescSpan = null; if (idx < descriptiveSpans.length) { const descSpan = descriptiveSpans[idx]; if (descSpan.dataset.hoverTextColor) { hoverTextColorForCell = descSpan.dataset.hoverTextColor; } if (descSpan.dataset.hoverCellBg) { hoverCellBgFromDescSpan = descSpan.dataset.hoverCellBg; } } else if (descriptiveSpans.length === 1 && cellElements.length > 1) { const descSpan = descriptiveSpans[0]; if (descSpan.dataset.hoverTextColor) { hoverTextColorForCell = descSpan.dataset.hoverTextColor; } if (descSpan.dataset.hoverCellBg) { hoverCellBgFromDescSpan = descSpan.dataset.hoverCellBg; } } if (hoverTextColorForCell) { const isCalcCell = cell.classList.contains('has-calculation'); if (isCalcCell) { const normalTextSpan = cell.querySelector('.normal-text'); const calcResultSpan = 
cell.querySelector('.calc-result'); const calcWrapper = cell.querySelector('.calculation-text'); if (calcWrapper && window.getComputedStyle(calcWrapper).display !== 'none') { if (calcResultSpan) calcResultSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } else { if (normalTextSpan) normalTextSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } } else { const directSpan = cell.querySelector(':scope > span'); if (directSpan) directSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } } if (hoverCellBgFromDescSpan && isConsideredYellow(hoverCellBgFromDescSpan)) { if (typeof cell.dataset.originalBgColor === 'undefined') { cell.dataset.originalBgColor = cell.style.backgroundColor; } cell.style.backgroundColor = hoverCellBgFromDescSpan; cell.dataset.bgChangedByScript = 'true'; } }); }); highlighter.addEventListener('mouseleave', () => { highlighter.classList.remove('trigger-text-active'); descriptiveSpans.forEach(span => { if (typeof span.dataset.originalParaTextColor !== 'undefined') { span.style.color = span.dataset.originalParaTextColor; } }); cellElements.forEach((cell, idx) => { cell.classList.remove('cell-highlighted'); const isCalcCell = cell.classList.contains('has-calculation'); if (isCalcCell) { const normalTextSpan = cell.querySelector('.normal-text'); const calcResultSpan = cell.querySelector('.calc-result'); if (normalTextSpan) normalTextSpan.style.removeProperty('color'); if (calcResultSpan) calcResultSpan.style.removeProperty('color'); } else { const directSpan = cell.querySelector(':scope > span'); if (directSpan) directSpan.style.removeProperty('color'); } if (cell.dataset.bgChangedByScript === 'true') { cell.style.backgroundColor = cell.dataset.originalBgColor || ''; delete cell.dataset.originalBgColor; delete cell.dataset.bgChangedByScript; } }); }); }); } if (document.readyState === 'loading') { document.addEventListener('DOMContentLoaded', initializeCellHighlighters); } else { 
initializeCellHighlighters(); } </script> <style> /* ... (Other styles remain the same) ... */ .toggle-container { display: flex; justify-content: flex-end; margin-bottom: 6px; } .calc-toggle { display: inline-flex; align-items: center; } .toggle-switch { position: relative; display: inline-block; width: 32px; height: 16px; background-color: #e5e7eb; /* gray-200 */ border-radius: 10px; transition: all 0.3s; cursor: pointer; } .toggle-switch::after { content: ''; position: absolute; width: 12px; height: 12px; border-radius: 50%; background-color: white; top: 2px; left: 2px; transition: all 0.3s cubic-bezier(0.175, 0.885, 0.32, 1.275); box-shadow: 0 1px 2px rgba(0, 0, 0, 0.1); } .toggle-switch.active { background-color: #8b5cf6; /* purple-500 */ } .toggle-switch.active::after { left: 18px; } .toggle-label { position: absolute; width: 1px; height: 1px; padding: 0; margin: -1px; overflow: hidden; clip: rect(0, 0, 0, 0); white-space: nowrap; border-width: 0; } .toggle-text { font-size: 0.75rem; color: #6b7280; /* gray-500 */ margin-right: 6px; } .calc-result { color: rgb(75, 85, 99); /* gray-700 */ } .calc-formula { color: #6b83a6; /* Subtle bluish gray */ font-family: ui-monospace, SFMono-Regular, Menlo, Monaco, Consolas, monospace; } .calc-equals { color: rgb(75, 85, 99); /* gray-700 */ display: inline-block; margin: 0 4px; font-weight: normal; } .cell-highlighted { font-weight: bold !important; } .trigger-text-active { background-color: #E5E7EB; padding: 1px 3px; border-radius: 3px; transition: background-color 0.15s ease-in-out; } .trigger-text-active span[data-hover-text-color] { padding: 0; background-color: transparent; } /* Add styles for no-scroll option */ .table-wrapper-no-scroll { overflow-x: visible !important; } .table-no-scroll td { white-space: normal !important; word-break: break-word; } </style> <div id="env-vars-table-wrapper" class="px-4 rounded-lg __basic-table not-prose mt-4 mb-8 overflow-x-auto"> <table id="env-vars-table" class="min-w-full 
divide-y divide-gray-200 font-sans basic-table-striped"> <thead class="bg-gray-50"> <tr> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Variable </th> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Description </th> </tr> </thead> <tbody> <tr class="border-b border-gray-200"> <td id="env-vars-table-row0-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">TILEIR_ALWAYS_SWIZZLE</span> </td> <td id="env-vars-table-row0-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Force swizzle mode</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="env-vars-table-row1-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">TILEIR_PREFER_TMA_FOR_LOAD_STORE</span> </td> <td id="env-vars-table-row1-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Prefer TMA for all load/store operations</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="env-vars-table-row2-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">TILEIR_DELAY_TMA_STORE_WAIT</span> </td> <td id="env-vars-table-row2-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Delay TMA store wait (optimization for overlapping compute)</span> </td> </tr> </tbody> </table> </div> <h2 id="interesting-undocumented-cli-options">Interesting undocumented CLI options</h2> <p>The <code class="language-plaintext highlighter-rouge">--print-before-all</code> flag dumps LLVM IR before each compilation pass.</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span 
class="nv">$ </span>tileiras <span class="nt">--print-before-all</span> <span class="nt">--gpu-name</span><span class="o">=</span>sm_120 MoE.cutile <span class="nt">-o</span> moe.cubin 2&gt;&amp;1 </code></pre></div></div> <div class="language-llvm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">***</span> <span class="err">IR</span> <span class="err">Dump</span> <span class="err">Before</span> <span class="err">Add</span> <span class="err">__emutls_</span><span class="p">[</span><span class="err">vt</span><span class="p">].</span> <span class="err">variables</span> <span class="err">for</span> <span class="err">emultated</span> <span class="err">TLS</span> <span class="err">model</span> <span class="p">(</span><span class="err">lower-emutls</span><span class="p">)</span> <span class="p">***</span> <span class="c1">; ModuleID = 'LLVMDialectModule'</span> <span class="k">source_filename</span> <span class="p">=</span> <span class="s">"LLVMDialectModule"</span> <span class="k">target</span> <span class="k">datalayout</span> <span class="p">=</span> <span class="s">"e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128"</span> <span class="vg">@__CUDA_TILEIR_FUNC_NAME_0</span> <span class="p">=</span> <span class="k">internal</span> <span class="k">constant</span> <span class="p">[</span><span class="m">17</span> <span class="p">x</span> <span class="kt">i8</span><span class="p">]</span> <span class="s">c"fused_moe_kernel\00"</span> <span class="p">...</span> </code></pre></div></div> <details> <summary><strong>All LLVM passes dumped (27 unique passes)</strong></summary> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>*** IR Dump Before Add __emutls_[vt]. 
variables for emultated TLS model (lower-emutls) *** *** IR Dump Before Canonicalize natural loops (loop-simplify) *** *** IR Dump Before CodeGen Prepare (codegenprepare) *** *** IR Dump Before Constant Hoisting (consthoist) *** *** IR Dump Before Exception handling preparation (dwarf-eh-prepare) *** *** IR Dump Before Expand Atomic instructions (atomic-expand) *** *** IR Dump Before Expand fp (expand-fp) *** *** IR Dump Before Expand indirectbr instructions (indirectbr-expand) *** *** IR Dump Before Expand large div/rem (expand-large-div-rem) *** *** IR Dump Before Expand memcmp() to load/stores (expand-memcmp) *** *** IR Dump Before Expand reduction intrinsics (expand-reductions) *** *** IR Dump Before Instrument function entry/exit with calls to e.g. mcount() (post-inline-ee-instrument) *** *** IR Dump Before Interleaved Access Pass (interleaved-access) *** *** IR Dump Before Lower AMX intrinsics (lower-amx-intrinsics) *** *** IR Dump Before Lower AMX type for load/store (lower-amx-type) *** *** IR Dump Before Lower Garbage Collection Instructions (gc-lowering) *** *** IR Dump Before Merge contiguous icmps into a memcmp (mergeicmps) *** *** IR Dump Before ObjC ARC contraction (objc-arc-contract) *** *** IR Dump Before Partially inline calls to library functions (partially-inline-libcalls) *** *** IR Dump Before Pre-ISel Intrinsic Lowering (pre-isel-intrinsic-lowering) *** *** IR Dump Before Prepare callbr (callbrprepare) *** *** IR Dump Before Remove unreachable blocks from the CFG (unreachableblockelim) *** *** IR Dump Before Replace intrinsics with calls to vector library (replace-with-veclib) *** *** IR Dump Before Safe Stack instrumentation pass (safe-stack) *** *** IR Dump Before Scalarize Masked Memory Intrinsics (scalarize-masked-mem-intrin) *** *** IR Dump Before Shadow Stack GC Lowering (shadow-stack-gc-lowering) *** *** IR Dump Before X86 Partial Reduction (x86-partial-reduction) *** </code></pre></div> </div> </details> <hr /> <h1 
id="pipeline-overview">Pipeline Overview</h1> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2026-01-29/pipeline_overview.svg" width="100%" alt="" /> <div class="caption"> <em>TileIR compilation pipeline: Python → SASS </em> </div> </div> <!-- Excalidraw diagram: Pipeline Flow - Python → cuda_tile → nv_tileaa → nv_tileas → NVVM → LLVM → PTX → SASS --> <p>TileIR takes your CuTile Python code through a series of progressive lowerings:</p> <link rel="stylesheet" href="/assets/css/fancy_table.css" /> <script> function toggleCalculations(tableId) { // ... (toggleCalculations function remains the same) const table = document.getElementById(tableId); const cells = table.querySelectorAll('.has-calculation'); const tableWrapper = document.getElementById(tableId + '-wrapper'); const toggleSwitch = document.getElementById(tableId + '-switch'); const toggleText = document.getElementById(tableId + '-toggle-text'); if (tableWrapper.classList.contains('show-calculations')) { tableWrapper.classList.remove('show-calculations'); toggleSwitch.classList.remove('active'); toggleText.textContent = "Show calculations"; cells.forEach(cell => { const normalText = cell.querySelector('.normal-text'); const calculationText = cell.querySelector('.calculation-text'); normalText.style.display = 'inline'; calculationText.style.display = 'none'; }); } else { tableWrapper.classList.add('show-calculations'); toggleSwitch.classList.add('active'); toggleText.textContent = "Hide calculations"; cells.forEach(cell => { const normalText = cell.querySelector('.normal-text'); const calculationText = cell.querySelector('.calculation-text'); normalText.style.display = 'none'; calculationText.style.display = 'inline'; }); } } function isConsideredYellow(colorString) { if (!colorString) return false; const lowerColor = colorString.toLowerCase(); const yellowKeywords = ['yellow', 'gold', 'lemon', 'chiffon', 'goldenrod', 'papayawhip', 'moccasin', 'khaki', '#ff0', '#ffff00', '#ffffe0', 
'#fffacd', '#fafad2', '#fff8dc', '#eee8aa', '#f0e68c']; if (yellowKeywords.some(k => lowerColor.includes(k))) { return true; } const match = lowerColor.match(/rgba?\((\d+),\s*(\d+),\s*(\d+)/); if (match) { const r = parseInt(match[1]); const g = parseInt(match[2]); const b = parseInt(match[3]); if (r > 200 && g > 180 && b < 200 && Math.abs(r - g) < 70) { return true; } } if (lowerColor === '#ffdab9') return true; // PeachPuff if (lowerColor === 'rgba(255,255,0,0.2)') return true; // Example: semi-transparent yellow return false; } function initializeCellHighlighters() { const highlighters = document.querySelectorAll('[data-highlight-cells]'); highlighters.forEach(highlighter => { if (highlighter.dataset.highlighterInitialized === 'true') return; highlighter.dataset.highlighterInitialized = 'true'; const cellIdsToHighlight = highlighter.dataset.highlightCells.split(',').map(id => id.trim()); const cellElements = cellIdsToHighlight.map(id => document.getElementById(id)).filter(el => el); let descriptiveSpans = []; if (highlighter.matches('span[data-hover-text-color]')) { descriptiveSpans.push(highlighter); } else { descriptiveSpans = Array.from(highlighter.querySelectorAll('span[data-hover-text-color]')); } descriptiveSpans.forEach(span => { if (typeof span.dataset.originalParaTextColor === 'undefined') { span.dataset.originalParaTextColor = window.getComputedStyle(span).color || ''; } }); highlighter.addEventListener('mouseenter', () => { highlighter.classList.add('trigger-text-active'); descriptiveSpans.forEach(span => { if (span.dataset.hoverTextColor) { span.style.color = span.dataset.hoverTextColor; } }); cellElements.forEach((cell, idx) => { cell.classList.add('cell-highlighted'); let hoverTextColorForCell = null; let hoverCellBgFromDescSpan = null; if (idx < descriptiveSpans.length) { const descSpan = descriptiveSpans[idx]; if (descSpan.dataset.hoverTextColor) { hoverTextColorForCell = descSpan.dataset.hoverTextColor; } if (descSpan.dataset.hoverCellBg) { 
hoverCellBgFromDescSpan = descSpan.dataset.hoverCellBg; } } else if (descriptiveSpans.length === 1 && cellElements.length > 1) { const descSpan = descriptiveSpans[0]; if (descSpan.dataset.hoverTextColor) { hoverTextColorForCell = descSpan.dataset.hoverTextColor; } if (descSpan.dataset.hoverCellBg) { hoverCellBgFromDescSpan = descSpan.dataset.hoverCellBg; } } if (hoverTextColorForCell) { const isCalcCell = cell.classList.contains('has-calculation'); if (isCalcCell) { const normalTextSpan = cell.querySelector('.normal-text'); const calcResultSpan = cell.querySelector('.calc-result'); const calcWrapper = cell.querySelector('.calculation-text'); if (calcWrapper && window.getComputedStyle(calcWrapper).display !== 'none') { if (calcResultSpan) calcResultSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } else { if (normalTextSpan) normalTextSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } } else { const directSpan = cell.querySelector(':scope > span'); if (directSpan) directSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } } if (hoverCellBgFromDescSpan && isConsideredYellow(hoverCellBgFromDescSpan)) { if (typeof cell.dataset.originalBgColor === 'undefined') { cell.dataset.originalBgColor = cell.style.backgroundColor; } cell.style.backgroundColor = hoverCellBgFromDescSpan; cell.dataset.bgChangedByScript = 'true'; } }); }); highlighter.addEventListener('mouseleave', () => { highlighter.classList.remove('trigger-text-active'); descriptiveSpans.forEach(span => { if (typeof span.dataset.originalParaTextColor !== 'undefined') { span.style.color = span.dataset.originalParaTextColor; } }); cellElements.forEach((cell, idx) => { cell.classList.remove('cell-highlighted'); const isCalcCell = cell.classList.contains('has-calculation'); if (isCalcCell) { const normalTextSpan = cell.querySelector('.normal-text'); const calcResultSpan = cell.querySelector('.calc-result'); if (normalTextSpan) 
normalTextSpan.style.removeProperty('color'); if (calcResultSpan) calcResultSpan.style.removeProperty('color'); } else { const directSpan = cell.querySelector(':scope > span'); if (directSpan) directSpan.style.removeProperty('color'); } if (cell.dataset.bgChangedByScript === 'true') { cell.style.backgroundColor = cell.dataset.originalBgColor || ''; delete cell.dataset.originalBgColor; delete cell.dataset.bgChangedByScript; } }); }); }); } if (document.readyState === 'loading') { document.addEventListener('DOMContentLoaded', initializeCellHighlighters); } else { initializeCellHighlighters(); } </script> <style> /* ... (Other styles remain the same) ... */ .toggle-container { display: flex; justify-content: flex-end; margin-bottom: 6px; } .calc-toggle { display: inline-flex; align-items: center; } .toggle-switch { position: relative; display: inline-block; width: 32px; height: 16px; background-color: #e5e7eb; /* gray-200 */ border-radius: 10px; transition: all 0.3s; cursor: pointer; } .toggle-switch::after { content: ''; position: absolute; width: 12px; height: 12px; border-radius: 50%; background-color: white; top: 2px; left: 2px; transition: all 0.3s cubic-bezier(0.175, 0.885, 0.32, 1.275); box-shadow: 0 1px 2px rgba(0, 0, 0, 0.1); } .toggle-switch.active { background-color: #8b5cf6; /* purple-500 */ } .toggle-switch.active::after { left: 18px; } .toggle-label { position: absolute; width: 1px; height: 1px; padding: 0; margin: -1px; overflow: hidden; clip: rect(0, 0, 0, 0); white-space: nowrap; border-width: 0; } .toggle-text { font-size: 0.75rem; color: #6b7280; /* gray-500 */ margin-right: 6px; } .calc-result { color: rgb(75, 85, 99); /* gray-700 */ } .calc-formula { color: #6b83a6; /* Subtle bluish gray */ font-family: ui-monospace, SFMono-Regular, Menlo, Monaco, Consolas, monospace; } .calc-equals { color: rgb(75, 85, 99); /* gray-700 */ display: inline-block; margin: 0 4px; font-weight: normal; } .cell-highlighted { font-weight: bold !important; } 
.trigger-text-active { background-color: #E5E7EB; padding: 1px 3px; border-radius: 3px; transition: background-color 0.15s ease-in-out; } .trigger-text-active span[data-hover-text-color] { padding: 0; background-color: transparent; } /* Add styles for no-scroll option */ .table-wrapper-no-scroll { overflow-x: visible !important; } .table-no-scroll td { white-space: normal !important; word-break: break-word; } </style> <div id="pipeline-stages-table-wrapper" class="px-4 rounded-lg __basic-table not-prose mt-4 mb-8 overflow-x-auto"> <table id="pipeline-stages-table" class="min-w-full divide-y divide-gray-200 font-sans basic-table-striped"> <thead class="bg-gray-50"> <tr> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Stage </th> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Format </th> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Description </th> </tr> </thead> <tbody> <tr class="border-b border-gray-200"> <td id="pipeline-stages-table-row0-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Python</span> </td> <td id="pipeline-stages-table-row0-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">CuTile API</span> </td> <td id="pipeline-stages-table-row0-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">High-level tensor operations (make_tensor_view; mmaf)</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="pipeline-stages-table-row1-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">.cutile</span> </td> <td id="pipeline-stages-table-row1-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) 
!important;">Bytecode</span> </td> <td id="pipeline-stages-table-row1-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Serialized representation of the kernel</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="pipeline-stages-table-row2-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">cuda_tile</span> </td> <td id="pipeline-stages-table-row2-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">MLIR Dialect</span> </td> <td id="pipeline-stages-table-row2-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">High-level tensor ops; architecture-independent</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="pipeline-stages-table-row3-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">nv_tileaa</span> </td> <td id="pipeline-stages-table-row3-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">MLIR Dialect</span> </td> <td id="pipeline-stages-table-row3-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Tile-level ops; explicit memory references</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="pipeline-stages-table-row4-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">nv_tileas</span> </td> <td id="pipeline-stages-table-row4-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">MLIR Dialect</span> </td> <td id="pipeline-stages-table-row4-col2" class="px-4 py-2 
whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Scheduled ops; async pipelines</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="pipeline-stages-table-row5-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">LLVM/NVVM</span> </td> <td id="pipeline-stages-table-row5-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">LLVM IR</span> </td> <td id="pipeline-stages-table-row5-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Standard LLVM with NVIDIA intrinsics</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="pipeline-stages-table-row6-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">PTX</span> </td> <td id="pipeline-stages-table-row6-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Assembly</span> </td> <td id="pipeline-stages-table-row6-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Virtual GPU assembly</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="pipeline-stages-table-row7-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">SASS</span> </td> <td id="pipeline-stages-table-row7-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Machine Code</span> </td> <td id="pipeline-stages-table-row7-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Native GPU instructions (sm_120)</span> </td> </tr> </tbody> 
</table> </div> <p>Each stage removes abstraction and adds architecture-specific detail. By the time we reach SASS, every memory access pattern, tensor core instruction, and synchronization barrier is explicit.</p> <hr /> <h1 id="the-dialects">The Dialects</h1> <p>TileIR uses three main MLIR dialects to represent computations at different abstraction levels. Let’s trace our MoE kernel through each one:</p> <!-- Excalidraw diagram: MoE Operation Mapping - shows gather/load/mma traced through each dialect --> <div id="moe-ops-table-wrapper" class="px-4 rounded-lg __basic-table not-prose mt-4 mb-8 overflow-x-auto"> <table id="moe-ops-table" class="min-w-full divide-y divide-gray-200 font-sans basic-table-striped"> <thead class="bg-gray-50"> <tr> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Python </th> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> cuda_tile </th> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> nv_tileaa </th> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> nv_tileas </th> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> SASS </th> </tr> </thead> <tbody> <tr class="border-b border-gray-200"> <td id="moe-ops-table-row0-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">ct.gather(A&#44; idx)</span> </td> <td id="moe-ops-table-row0-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">load_view_tko</span> </td> <td id="moe-ops-table-row0-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">tileaa.load_view</span> </td> <td id="moe-ops-table-row0-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">tileas.utcpglobalmem</span> </td> <td
id="moe-ops-table-row0-col4" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">UTCPMULTI / LDG</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="moe-ops-table-row1-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">ct.load(B&#44; ...)</span> </td> <td id="moe-ops-table-row1-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">load_ptr_tko</span> </td> <td id="moe-ops-table-row1-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">tileaa.load_tko</span> </td> <td id="moe-ops-table-row1-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">tileas.tcgen05_ld</span> </td> <td id="moe-ops-table-row1-col4" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">TCGEN05.LD.S</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="moe-ops-table-row2-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">ct.mma(a&#44; b&#44; c)</span> </td> <td id="moe-ops-table-row2-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">mmaf</span> </td> <td id="moe-ops-table-row2-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">tileaa.mmaf_tko</span> </td> <td id="moe-ops-table-row2-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">tileas.tcgen05_mma</span> </td> <td id="moe-ops-table-row2-col4" class="px-4 py-2 whitespace-nowrap text-sm font-medium 
 <span style=">
text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">TCGEN05.MMA</span> </td> </tr> </tbody> </table> </div> <h2 id="cuda_tile-high-level-tensor-operations">cuda_tile: High-Level Tensor Operations</h2> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2026-01-29/cuda_tile_dialect.svg" width="100%" alt="" /> <div class="caption"> <em>cuda_tile dialect operations </em> </div> </div> <p>The <code class="language-plaintext highlighter-rouge">cuda_tile</code> dialect is closest to your Python code. Operations work on abstract tensor views without worrying about memory layout or hardware details.</p> <p><strong>Key operations:</strong></p> <ul> <li><code class="language-plaintext highlighter-rouge">make_tensor_view</code> - Create a view into a tensor with shape and strides</li> <li><code class="language-plaintext highlighter-rouge">get_tile_block_id</code> - Get the current thread block’s position in the grid</li> <li><code class="language-plaintext highlighter-rouge">load_view_tko</code> / <code class="language-plaintext highlighter-rouge">store_view_tko</code> - Load/store tiles with token-based ordering</li> <li><code class="language-plaintext highlighter-rouge">mmaf</code> - Matrix multiply-accumulate (targets tensor cores)</li> <li><code class="language-plaintext highlighter-rouge">for</code> / <code class="language-plaintext highlighter-rouge">continue</code> - Loop constructs for K-dimension iteration</li> </ul> <h3 id="moe-in-cuda_tile">MoE in cuda_tile</h3> <p>Recall our <a href="#running-example-moe-kernel">MoE kernel above</a>. Here’s how the key operations map to <code class="language-plaintext highlighter-rouge">cuda_tile</code> IR:</p> <p><strong>Python → cuda_tile mapping:</strong></p> <div id="python-ir-mapping-table-wrapper" class="px-4 rounded-lg __basic-table not-prose mt-4 mb-8 overflow-x-auto"> <table id="python-ir-mapping-table"
class="min-w-full divide-y divide-gray-200 font-sans basic-table-striped"> <thead class="bg-gray-50"> <tr> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Python (CuTile) </th> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> cuda_tile IR </th> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Purpose </th> </tr> </thead> <tbody> <tr class="border-b border-gray-200"> <td id="python-ir-mapping-table-row0-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">ct.gather()</span> </td> <td id="python-ir-mapping-table-row0-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">load_view_tko</span> </td> <td id="python-ir-mapping-table-row0-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Gather elements by indices</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="python-ir-mapping-table-row1-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">ct.load()</span> </td> <td id="python-ir-mapping-table-row1-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">load_ptr_tko</span> </td> <td id="python-ir-mapping-table-row1-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Load contiguous tile from memory</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="python-ir-mapping-table-row2-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">ct.mma()</span> </td> <td id="python-ir-mapping-table-row2-col1" class="px-4 py-2 whitespace-nowrap text-sm 
font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">mmaf</span> </td> <td id="python-ir-mapping-table-row2-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Matrix multiply-accumulate (tensor cores)</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="python-ir-mapping-table-row3-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">ct.scatter()</span> </td> <td id="python-ir-mapping-table-row3-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">store_ptr_tko</span> </td> <td id="python-ir-mapping-table-row3-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Scatter elements to output</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="python-ir-mapping-table-row4-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">ct.full()</span> </td> <td id="python-ir-mapping-table-row4-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">constant</span> </td> <td id="python-ir-mapping-table-row4-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Initialize accumulator</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="python-ir-mapping-table-row5-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">for k in range()</span> </td> <td id="python-ir-mapping-table-row5-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">for/continue</span> </td> <td 
id="python-ir-mapping-table-row5-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">K-dimension iteration loop</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="python-ir-mapping-table-row6-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">ct.astype()</span> </td> <td id="python-ir-mapping-table-row6-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">ftof</span> </td> <td id="python-ir-mapping-table-row6-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Type conversion (F32 → output dtype)</span> </td> </tr> </tbody> </table> </div> <details style="font-size: 1.2em; margin: 1em 0;"> <summary style="cursor: pointer; font-weight: bold;">Expand to see cuda_tile IR from MoE kernel key sections</summary> <div class="language-llvm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="err">//</span> <span class="err">cuda_tile</span> <span class="err">dialect</span> <span class="err">-</span> <span class="err">MoE</span> <span class="err">kernel</span> <span class="nv">%1</span> <span class="p">=</span> <span class="s">"cuda_tile.constant"</span><span class="p">()</span> <span class="err">:</span> <span class="p">()</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">//</span> <span class="err">TILE_M</span> <span class="nv">%2</span> <span class="p">=</span> <span class="s">"cuda_tile.constant"</span><span class="p">()</span> <span class="err">:</span> <span class="p">()</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span 
class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">//</span> <span class="err">TILE_N</span> <span class="nv">%3</span> <span class="p">=</span> <span class="s">"cuda_tile.constant"</span><span class="p">()</span> <span class="err">:</span> <span class="p">()</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">//</span> <span class="err">TILE_K</span> <span class="nv">%4</span> <span class="p">=</span> <span class="s">"cuda_tile.assume"</span><span class="p">(</span><span class="nv">%arg0</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%5</span> <span class="p">=</span> <span class="s">"cuda_tile.assume"</span><span class="p">(</span><span class="nv">%arg1</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%10</span> <span class="p">=</span> <span class="s">"cuda_tile.make_tensor_view"</span><span class="p">(</span><span class="nv">%4</span><span class="p">,</span> <span class="nv">%5</span><span class="p">,</span> <span class="nv">%6</span><span class="p">,</span> <span class="nv">%7</span><span class="p">,</span> <span class="nv">%8</span><span class="p">,</span> <span 
class="nv">%9</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="kt">token</span><span class="p">)</span> <span class="nv">%11</span> <span class="p">=</span> <span class="s">"cuda_tile.make_tensor_view"</span><span class="p">(</span><span class="nv">%arg2</span><span class="p">,</span> <span class="nv">%arg3</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="kt">token</span><span class="p">)</span> <span class="nv">%12</span> <span class="p">=</span> <span class="s">"cuda_tile.make_token"</span><span class="p">()</span> <span class="err">:</span> <span class="p">()</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="nv">%20</span><span class="p">,</span> 
<span class="nv">%21</span><span class="p">,</span> <span class="nv">%22</span> <span class="p">=</span> <span class="s">"cuda_tile.get_tile_block_id"</span><span class="p">()</span> <span class="err">:</span> <span class="p">()</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%23</span> <span class="p">=</span> <span class="s">"cuda_tile.divi"</span><span class="p">(</span><span class="nv">%4</span><span class="p">,</span> <span class="nv">%1</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">//</span> <span class="err">M</span> <span class="err">/</span> <span class="err">TILE_M</span> <span class="nv">%24</span> <span class="p">=</span> <span class="s">"cuda_tile.muli"</span><span class="p">(</span><span class="nv">%1</span><span class="p">,</span> <span class="nv">%23</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span 
class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%25</span> <span class="p">=</span> <span class="s">"cuda_tile.divi"</span><span class="p">(</span><span class="nv">%20</span><span class="p">,</span> <span class="nv">%24</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%30</span> <span class="p">=</span> <span class="s">"cuda_tile.remi"</span><span class="p">(</span><span class="nv">%20</span><span class="p">,</span> <span class="nv">%25</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">//</span> <span class="err">expert</span> <span class="err">routing</span> <span class="nv">%31</span> <span class="p">=</span> <span class="s">"cuda_tile.cmpi"</span><span class="p">(</span><span class="nv">%30</span><span class="p">,</span> <span class="nv">%1</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span 
class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%32</span> <span class="p">=</span> <span class="s">"cuda_tile.select"</span><span class="p">(</span><span class="nv">%31</span><span class="p">,</span> <span class="nv">%30</span><span class="p">,</span> <span class="nv">%25</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%40</span> <span class="p">=</span> <span class="s">"cuda_tile.iota"</span><span class="p">()</span> <span class="err">:</span> <span class="p">()</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%41</span> <span class="p">=</span> <span class="s">"cuda_tile.reshape"</span><span class="p">(</span><span class="nv">%24</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%42</span> <span class="p">=</span> <span class="s">"cuda_tile.broadcast"</span><span class="p">(</span><span 
class="nv">%41</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%43</span> <span class="p">=</span> <span class="s">"cuda_tile.addi"</span><span class="p">(</span><span class="nv">%42</span><span class="p">,</span> <span class="nv">%40</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%44</span> <span class="p">=</span> <span class="s">"cuda_tile.offset"</span><span class="p">(</span><span class="nv">%42</span><span class="p">,</span> <span class="nv">%43</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%50</span><span class="p">,</span> <span class="nv">%51</span> <span class="p">=</span> <span class="s">"cuda_tile.load_ptr_tko"</span><span class="p">(</span><span class="nv">%44</span><span class="p">,</span> <span class="nv">%31</span><span class="p">,</span> 
<span class="nv">%42</span><span class="p">,</span> <span class="nv">%12</span><span class="p">)</span> <span class="err">//</span> <span class="err">ct</span><span class="p">.</span><span class="k">load</span><span class="p">()</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="nv">%52</span> <span class="p">=</span> <span class="s">"cuda_tile.make_partition_view"</span><span class="p">(</span><span class="nv">%10</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="kt">token</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">part</span><span class="p">)</span> <span class="nv">%53</span><span class="p">,</span> <span class="nv">%54</span> <span class="p">=</span> <span class="s">"cuda_tile.load_view_tko"</span><span class="p">(</span><span class="nv">%52</span><span class="p">,</span> <span class="nv">%43</span><span class="p">,</span> <span class="nv">%12</span><span class="p">)</span> <span class="err">//</span> <span class="err">ct</span><span class="p">.</span><span class="err">gather</span><span class="p">()</span> <span 
class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">part</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="nv">%60</span> <span class="p">=</span> <span class="s">"cuda_tile.for"</span><span class="p">(</span><span class="nv">%1</span><span class="p">,</span> <span class="nv">%23</span><span class="p">,</span> <span class="nv">%3</span><span class="p">,</span> <span class="nv">%arg4</span><span class="p">)</span> <span class="p">{</span><span class="m">1</span> <span class="err">regions</span><span class="p">}</span> <span class="err">//</span> <span class="nl">K-loop :</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%61</span> <span class="p">=</span> <span class="s">"cuda_tile.muli"</span><span class="p">(</span><span class="nv">%iter</span><span class="p">,</span> <span class="nv">%3</span><span class="p">)</span> <span 
class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%62</span> <span class="p">=</span> <span class="s">"cuda_tile.broadcast"</span><span class="p">(</span><span class="nv">%61</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%63</span><span class="p">,</span> <span class="nv">%64</span> <span class="p">=</span> <span class="s">"cuda_tile.load_ptr_tko"</span><span class="p">(</span><span class="nv">%62</span><span class="p">,</span> <span class="nv">%31</span><span class="p">,</span> <span class="nv">%42</span><span class="p">,</span> <span class="nv">%12</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span 
class="err">ct</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="nv">%65</span><span class="p">,</span> <span class="nv">%66</span> <span class="p">=</span> <span class="s">"cuda_tile.load_view_tko"</span><span class="p">(</span><span class="nv">%52</span><span class="p">,</span> <span class="nv">%62</span><span class="p">,</span> <span class="nv">%12</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">part</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="nv">%67</span> <span class="p">=</span> <span class="s">"cuda_tile.mmaf"</span><span class="p">(</span><span class="nv">%63</span><span class="p">,</span> <span class="nv">%65</span><span class="p">,</span> <span class="nv">%acc</span><span class="p">)</span> <span class="err">//</span> <span class="err">ct</span><span class="p">.</span><span class="err">mma</span><span class="p">()</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span 
class="err">view</span><span class="p">)</span> <span class="s">"cuda_tile.continue"</span><span class="p">(</span><span class="nv">%67</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">()</span> <span class="nv">%70</span> <span class="p">=</span> <span class="s">"cuda_tile.ftof"</span><span class="p">(</span><span class="nv">%60</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">//</span> <span class="err">ct</span><span class="p">.</span><span class="err">astype</span><span class="p">()</span> <span class="nv">%71</span> <span class="p">=</span> <span class="s">"cuda_tile.store_ptr_tko"</span><span class="p">(</span><span class="nv">%44</span><span class="p">,</span> <span class="nv">%70</span><span class="p">,</span> <span class="nv">%31</span><span class="p">,</span> <span class="nv">%12</span><span class="p">)</span> <span class="err">//</span> <span class="err">ct</span><span class="p">.</span><span class="err">scatter</span><span class="p">()</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span 
class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="s">"cuda_tile.return"</span><span class="p">()</span> </code></pre></div> </div> </details> <h2 id="nv_tileaa">nv_tileaa</h2> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2026-01-29/nv_tileaa_dialect.svg" width="100%" alt="" /> <div class="caption"> <em>nv_tileaa dialect operations </em> </div> </div> <p>The <code class="language-plaintext highlighter-rouge">nv_tileaa</code> dialect lowers tensor views to concrete memory references. This is the first stage where explicit memory operations appear.</p> <p><strong>Key changes from cuda_tile:</strong></p> <ul> <li><code class="language-plaintext highlighter-rouge">make_tensor_view</code> → <code class="language-plaintext highlighter-rouge">make_memref</code> (explicit memory references)</li> <li><code class="language-plaintext highlighter-rouge">get_tile_block_id</code> → <code class="language-plaintext highlighter-rouge">get_program_id</code> (program-centric naming)</li> <li><code class="language-plaintext highlighter-rouge">mmaf</code> → <code class="language-plaintext highlighter-rouge">dot</code> (matrix multiply-accumulate expressed as a dot op with an explicit accumulator operand)</li> <li>Explicit <code class="language-plaintext highlighter-rouge">tiled_load</code> / <code class="language-plaintext highlighter-rouge">tiled_store</code> with memory tokens</li> <li>New ops: <code class="language-plaintext highlighter-rouge">splat</code>, <code class="language-plaintext highlighter-rouge">broadcast</code>, <code class="language-plaintext highlighter-rouge">addptr</code> for memory address calculations</li> </ul> <details style="font-size: 1.2em; margin: 1em 0;"> <summary style="cursor: pointer; font-weight: bold;">Expand to see nv_tileaa IR from MoE kernel key sections</summary> <div class="language-llvm highlighter-rouge"><div class="highlight"><pre
class="highlight"><code><span class="err">//</span> <span class="err">nv_tileaa</span> <span class="err">dialect</span> <span class="err">-</span> <span class="err">MoE</span> <span class="err">kernel</span> <span class="err">//</span> <span class="err">Tile-level</span> <span class="err">ops</span> <span class="p">(</span><span class="err">architecture-independent</span><span class="p">)</span> <span class="s">"nv_tileaa.func"</span><span class="p">()</span> <span class="p">{</span><span class="err">nv_tileaa</span><span class="p">.</span><span class="err">kernel_spec</span><span class="p">}</span> <span class="p">{</span><span class="m">1</span> <span class="err">regions</span><span class="p">}</span> <span class="err">//</span> <span class="err">Input</span> <span class="err">validation</span> <span class="nv">%1</span> <span class="p">=</span> <span class="s">"nv_tileaa.assume"</span><span class="p">(</span><span class="nv">%arg0</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">aa</span><span class="p">.</span><span class="err">memref</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">aa</span><span class="p">.</span><span class="err">memref</span><span class="p">)</span> <span class="nv">%2</span> <span class="p">=</span> <span class="s">"nv_tileaa.assume"</span><span class="p">(</span><span class="nv">%arg1</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="nv">%3</span> <span class="p">=</span> <span class="s">"nv_tileaa.assume"</span><span class="p">(</span><span class="nv">%2</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> 
<span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">//</span> <span class="nl">Splat:</span> <span class="err">scalar</span> <span class="err">→</span> <span class="err">tensor</span> <span class="p">(</span><span class="err">for</span> <span class="err">broadcasting</span><span class="p">)</span> <span class="nv">%10</span> <span class="p">=</span> <span class="s">"nv_tileaa.splat"</span><span class="p">(</span><span class="nv">%3</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%11</span> <span class="p">=</span> <span class="s">"nv_tileaa.splat"</span><span class="p">(</span><span class="nv">%2</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">//</span> <span class="err">Memory</span> <span class="err">reference</span> <span class="err">creation</span> <span class="p">(</span><span class="err">lowered</span> <span class="k">from</span> <span class="err">make_tensor_view</span><span class="p">)</span> <span class="nv">%20</span> <span class="p">=</span> <span class="s">"nv_tileaa.make_memref"</span><span class="p">(</span><span class="nv">%1</span><span class="p">,</span> <span class="nv">%2</span><span class="p">,</span> <span class="nv">%3</span><span class="p">,</span> <span class="nv">%4</span><span class="p">,</span> <span class="nv">%5</span><span class="p">,</span> <span class="nv">%6</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span 
class="err">aa</span><span class="p">.</span><span class="err">memref</span><span class="p">,</span> <span class="err">iN</span><span class="p">,</span> <span class="err">iN</span><span class="p">,</span> <span class="err">iN</span><span class="p">,</span> <span class="err">iN</span><span class="p">,</span> <span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">aa</span><span class="p">.</span><span class="err">btile</span><span class="p">)</span> <span class="nv">%21</span> <span class="p">=</span> <span class="s">"nv_tileaa.make_memref"</span><span class="p">(</span><span class="nv">%1</span><span class="p">,</span> <span class="nv">%2</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">aa</span><span class="p">.</span><span class="err">memref</span><span class="p">,</span> <span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">aa</span><span class="p">.</span><span class="err">btile</span><span class="p">)</span> <span class="nv">%22</span> <span class="p">=</span> <span class="s">"nv_tileaa.create_mem_token"</span><span class="p">()</span> <span class="err">:</span> <span class="p">()</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">aa</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="err">//</span> <span class="err">Program</span> <span class="err">indexing</span> <span class="nv">%30</span> <span class="p">=</span> <span class="s">"nv_tileaa.get_program_id"</span><span class="p">()</span> <span class="err">:</span> <span class="p">()</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="nv">%31</span> <span class="p">=</span> <span 
class="s">"nv_tileaa.splat"</span><span class="p">(</span><span class="nv">%30</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%32</span> <span class="p">=</span> <span class="s">"nv_tileaa.make_range"</span><span class="p">(</span><span class="nv">%c0</span><span class="p">,</span> <span class="nv">%c128</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">,</span> <span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%33</span> <span class="p">=</span> <span class="s">"nv_tileaa.extract"</span><span class="p">(</span><span class="nv">%32</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">//</span> <span class="err">Pointer</span> <span class="err">arithmetic</span> <span class="nv">%40</span> <span class="p">=</span> <span class="s">"nv_tileaa.splat"</span><span class="p">(</span><span class="nv">%1</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">aa</span><span class="p">.</span><span class="err">memref</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%41</span> <span class="p">=</span> <span class="s">"nv_tileaa.addptr"</span><span class="p">(</span><span 
class="nv">%40</span><span class="p">,</span> <span class="nv">%33</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">//</span> <span class="err">Masked</span> <span class="err">loads</span> <span class="nv">%50</span><span class="p">,</span> <span class="nv">%51</span> <span class="p">=</span> <span class="s">"nv_tileaa.load"</span><span class="p">(</span><span class="nv">%41</span><span class="p">,</span> <span class="nv">%mask</span><span class="p">,</span> <span class="nv">%c0</span><span class="p">,</span> <span class="nv">%22</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">aa</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">aa</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="err">//</span> <span class="err">Tiled</span> <span class="err">memory</span> <span class="err">operations</span> <span class="nv">%60</span> <span class="p">=</span> <span class="s">"nv_tileaa.block_tile"</span><span class="p">(</span><span class="nv">%20</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">aa</span><span class="p">.</span><span class="err">btile</span><span class="p">)</span> <span class="err">-</span><span 
class="p">&gt;</span> <span class="p">(</span><span class="err">aa</span><span class="p">.</span><span class="err">mtoken</span><span class="p">)</span> <span class="nv">%61</span> <span class="p">=</span> <span class="s">"nv_tileaa.extract"</span><span class="p">(</span><span class="nv">%32</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="nv">%62</span><span class="p">,</span> <span class="nv">%63</span> <span class="p">=</span> <span class="s">"nv_tileaa.tiled_load"</span><span class="p">(</span><span class="nv">%60</span><span class="p">,</span> <span class="nv">%61</span><span class="p">,</span> <span class="nv">%22</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">aa</span><span class="p">.</span><span class="err">mtoken</span><span class="p">,</span> <span class="err">iN</span><span class="p">,</span> <span class="err">aa</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">aa</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="nv">%64</span> <span class="p">=</span> <span class="s">"nv_tileaa.view"</span><span class="p">(</span><span class="nv">%62</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">//</span> <span class="err">Shape</span> <span class="err">manipulation</span> <span 
class="nv">%70</span> <span class="p">=</span> <span class="s">"nv_tileaa.expand_dims"</span><span class="p">(</span><span class="nv">%33</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%71</span> <span class="p">=</span> <span class="s">"nv_tileaa.broadcast"</span><span class="p">(</span><span class="nv">%70</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">//</span> <span class="err">DOT</span> <span class="err">OPERATION</span> <span class="p">(</span><span class="err">lowered</span> <span class="k">from</span> <span class="err">cuda_tile</span><span class="p">.</span><span class="err">mmaf</span><span class="p">)</span> <span class="nv">%80</span> <span class="p">=</span> <span class="s">"nv_tileaa.dot"</span><span class="p">(</span><span class="nv">%50</span><span class="p">,</span> <span class="nv">%64</span><span class="p">,</span> <span class="nv">%acc</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">//</span> <span class="err">Output</span> <span class="nv">%90</span> <span class="p">=</span> <span class="s">"nv_tileaa.fp_to_fp"</span><span 
class="p">(</span><span class="nv">%80</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%91</span> <span class="p">=</span> <span class="s">"nv_tileaa.store"</span><span class="p">(</span><span class="nv">%41</span><span class="p">,</span> <span class="nv">%90</span><span class="p">,</span> <span class="nv">%mask</span><span class="p">,</span> <span class="nv">%22</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">aa</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">aa</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="s">"nv_tileaa.return"</span><span class="p">()</span> </code></pre></div> </div> <p><strong>Key transformations from cuda_tile → nv_tileaa:</strong></p> <link rel="stylesheet" href="/assets/css/fancy_table.css" /> <script> function toggleCalculations(tableId) { // ... 
(toggleCalculations function remains the same) const table = document.getElementById(tableId); const cells = table.querySelectorAll('.has-calculation'); const tableWrapper = document.getElementById(tableId + '-wrapper'); const toggleSwitch = document.getElementById(tableId + '-switch'); const toggleText = document.getElementById(tableId + '-toggle-text'); if (tableWrapper.classList.contains('show-calculations')) { tableWrapper.classList.remove('show-calculations'); toggleSwitch.classList.remove('active'); toggleText.textContent = "Show calculations"; cells.forEach(cell => { const normalText = cell.querySelector('.normal-text'); const calculationText = cell.querySelector('.calculation-text'); normalText.style.display = 'inline'; calculationText.style.display = 'none'; }); } else { tableWrapper.classList.add('show-calculations'); toggleSwitch.classList.add('active'); toggleText.textContent = "Hide calculations"; cells.forEach(cell => { const normalText = cell.querySelector('.normal-text'); const calculationText = cell.querySelector('.calculation-text'); normalText.style.display = 'none'; calculationText.style.display = 'inline'; }); } } function isConsideredYellow(colorString) { if (!colorString) return false; const lowerColor = colorString.toLowerCase(); const yellowKeywords = ['yellow', 'gold', 'lemon', 'chiffon', 'goldenrod', 'papayawhip', 'moccasin', 'khaki', '#ff0', '#ffff00', '#ffffe0', '#fffacd', '#fafad2', '#fff8dc', '#eee8aa', '#f0e68c']; if (yellowKeywords.some(k => lowerColor.includes(k))) { return true; } const match = lowerColor.match(/rgba?\((\d+),\s*(\d+),\s*(\d+)/); if (match) { const r = parseInt(match[1]); const g = parseInt(match[2]); const b = parseInt(match[3]); if (r > 200 && g > 180 && b < 200 && Math.abs(r - g) < 70) { return true; } } if (lowerColor === '#ffdab9') return true; // PeachPuff if (lowerColor === 'rgba(255,255,0,0.2)') return true; // Example: semi-transparent yellow return false; } function initializeCellHighlighters() { const 
highlighters = document.querySelectorAll('[data-highlight-cells]'); highlighters.forEach(highlighter => { if (highlighter.dataset.highlighterInitialized === 'true') return; highlighter.dataset.highlighterInitialized = 'true'; const cellIdsToHighlight = highlighter.dataset.highlightCells.split(',').map(id => id.trim()); const cellElements = cellIdsToHighlight.map(id => document.getElementById(id)).filter(el => el); let descriptiveSpans = []; if (highlighter.matches('span[data-hover-text-color]')) { descriptiveSpans.push(highlighter); } else { descriptiveSpans = Array.from(highlighter.querySelectorAll('span[data-hover-text-color]')); } descriptiveSpans.forEach(span => { if (typeof span.dataset.originalParaTextColor === 'undefined') { span.dataset.originalParaTextColor = window.getComputedStyle(span).color || ''; } }); highlighter.addEventListener('mouseenter', () => { highlighter.classList.add('trigger-text-active'); descriptiveSpans.forEach(span => { if (span.dataset.hoverTextColor) { span.style.color = span.dataset.hoverTextColor; } }); cellElements.forEach((cell, idx) => { cell.classList.add('cell-highlighted'); let hoverTextColorForCell = null; let hoverCellBgFromDescSpan = null; if (idx < descriptiveSpans.length) { const descSpan = descriptiveSpans[idx]; if (descSpan.dataset.hoverTextColor) { hoverTextColorForCell = descSpan.dataset.hoverTextColor; } if (descSpan.dataset.hoverCellBg) { hoverCellBgFromDescSpan = descSpan.dataset.hoverCellBg; } } else if (descriptiveSpans.length === 1 && cellElements.length > 1) { const descSpan = descriptiveSpans[0]; if (descSpan.dataset.hoverTextColor) { hoverTextColorForCell = descSpan.dataset.hoverTextColor; } if (descSpan.dataset.hoverCellBg) { hoverCellBgFromDescSpan = descSpan.dataset.hoverCellBg; } } if (hoverTextColorForCell) { const isCalcCell = cell.classList.contains('has-calculation'); if (isCalcCell) { const normalTextSpan = cell.querySelector('.normal-text'); const calcResultSpan = 
cell.querySelector('.calc-result'); const calcWrapper = cell.querySelector('.calculation-text'); if (calcWrapper && window.getComputedStyle(calcWrapper).display !== 'none') { if (calcResultSpan) calcResultSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } else { if (normalTextSpan) normalTextSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } } else { const directSpan = cell.querySelector(':scope > span'); if (directSpan) directSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } } if (hoverCellBgFromDescSpan && isConsideredYellow(hoverCellBgFromDescSpan)) { if (typeof cell.dataset.originalBgColor === 'undefined') { cell.dataset.originalBgColor = cell.style.backgroundColor; } cell.style.backgroundColor = hoverCellBgFromDescSpan; cell.dataset.bgChangedByScript = 'true'; } }); }); highlighter.addEventListener('mouseleave', () => { highlighter.classList.remove('trigger-text-active'); descriptiveSpans.forEach(span => { if (typeof span.dataset.originalParaTextColor !== 'undefined') { span.style.color = span.dataset.originalParaTextColor; } }); cellElements.forEach((cell, idx) => { cell.classList.remove('cell-highlighted'); const isCalcCell = cell.classList.contains('has-calculation'); if (isCalcCell) { const normalTextSpan = cell.querySelector('.normal-text'); const calcResultSpan = cell.querySelector('.calc-result'); if (normalTextSpan) normalTextSpan.style.removeProperty('color'); if (calcResultSpan) calcResultSpan.style.removeProperty('color'); } else { const directSpan = cell.querySelector(':scope > span'); if (directSpan) directSpan.style.removeProperty('color'); } if (cell.dataset.bgChangedByScript === 'true') { cell.style.backgroundColor = cell.dataset.originalBgColor || ''; delete cell.dataset.originalBgColor; delete cell.dataset.bgChangedByScript; } }); }); }); } if (document.readyState === 'loading') { document.addEventListener('DOMContentLoaded', initializeCellHighlighters); } else { 
initializeCellHighlighters(); } </script> <style> /* ... (Other styles remain the same) ... */ .toggle-container { display: flex; justify-content: flex-end; margin-bottom: 6px; } .calc-toggle { display: inline-flex; align-items: center; } .toggle-switch { position: relative; display: inline-block; width: 32px; height: 16px; background-color: #e5e7eb; /* gray-200 */ border-radius: 10px; transition: all 0.3s; cursor: pointer; } .toggle-switch::after { content: ''; position: absolute; width: 12px; height: 12px; border-radius: 50%; background-color: white; top: 2px; left: 2px; transition: all 0.3s cubic-bezier(0.175, 0.885, 0.32, 1.275); box-shadow: 0 1px 2px rgba(0, 0, 0, 0.1); } .toggle-switch.active { background-color: #8b5cf6; /* purple-500 */ } .toggle-switch.active::after { left: 18px; } .toggle-label { position: absolute; width: 1px; height: 1px; padding: 0; margin: -1px; overflow: hidden; clip: rect(0, 0, 0, 0); white-space: nowrap; border-width: 0; } .toggle-text { font-size: 0.75rem; color: #6b7280; /* gray-500 */ margin-right: 6px; } .calc-result { color: rgb(75, 85, 99); /* gray-700 */ } .calc-formula { color: #6b83a6; /* Subtle bluish gray */ font-family: ui-monospace, SFMono-Regular, Menlo, Monaco, Consolas, monospace; } .calc-equals { color: rgb(75, 85, 99); /* gray-700 */ display: inline-block; margin: 0 4px; font-weight: normal; } .cell-highlighted { font-weight: bold !important; } .trigger-text-active { background-color: #E5E7EB; padding: 1px 3px; border-radius: 3px; transition: background-color 0.15s ease-in-out; } .trigger-text-active span[data-hover-text-color] { padding: 0; background-color: transparent; } /* Add styles for no-scroll option */ .table-wrapper-no-scroll { overflow-x: visible !important; } .table-no-scroll td { white-space: normal !important; word-break: break-word; } </style> <div id="dialect-comparison-table-wrapper" class="px-4 rounded-lg __basic-table not-prose mt-4 mb-8 overflow-x-auto"> <table id="dialect-comparison-table" 
class="min-w-full divide-y divide-gray-200 font-sans basic-table-striped"> <thead class="bg-gray-50"> <tr> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> cuda_tile </th> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> nv_tileaa </th> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Change </th> </tr> </thead> <tbody> <tr class="border-b border-gray-200"> <td id="dialect-comparison-table-row0-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">make_tensor_view</span> </td> <td id="dialect-comparison-table-row0-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">make_memref</span> </td> <td id="dialect-comparison-table-row0-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Abstract view → concrete memory ref</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="dialect-comparison-table-row1-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">get_tile_block_id</span> </td> <td id="dialect-comparison-table-row1-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">get_program_id</span> </td> <td id="dialect-comparison-table-row1-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Tile-centric → program-centric naming</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="dialect-comparison-table-row2-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">mmaf</span> </td> <td id="dialect-comparison-table-row2-col1" class="px-4 py-2 
whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">dot</span> </td> <td id="dialect-comparison-table-row2-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">High-level MMA → explicit dot product</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="dialect-comparison-table-row3-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">load_view_tko</span> </td> <td id="dialect-comparison-table-row3-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">tiled_load + view</span> </td> <td id="dialect-comparison-table-row3-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Decomposed into separate ops</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="dialect-comparison-table-row4-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">ct.view types</span> </td> <td id="dialect-comparison-table-row4-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">tensor&lt;...&gt;</span> </td> <td id="dialect-comparison-table-row4-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Abstract → explicit tensor shapes</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="dialect-comparison-table-row5-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">ct.token</span> </td> <td id="dialect-comparison-table-row5-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) 
!important;">aa.btile; aa.mtoken</span> </td> <td id="dialect-comparison-table-row5-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Generic token → specialized memory token types</span> </td> </tr> </tbody> </table> </div> <p><strong>Pass #12 observation:</strong> The 32 <code class="language-plaintext highlighter-rouge">fp_to_fp</code> operations suggest this MoE kernel produces 32 output tiles that need precision conversion from the F32 accumulator to the output dtype.</p> </details> <h2 id="nv_tileas">nv_tileas</h2> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2026-01-29/nv_tileas_tcgen05.svg" width="100%" alt="" /> <div class="caption"> <em>nv_tileas dialect with tcgen05 operations </em> </div> </div> <p>The <code class="language-plaintext highlighter-rouge">nv_tileas</code> dialect is where architecture-specific code generation happens.</p> <p>This dialect introduces:</p> <p><strong>Async Pipeline Operations:</strong></p> <ul> <li><code class="language-plaintext highlighter-rouge">async.pipeline.create_pipeline</code> - Create a software pipeline for overlapping compute with memory transfers</li> <li><code class="language-plaintext highlighter-rouge">producer_acquire</code> / <code class="language-plaintext highlighter-rouge">producer_commit</code> - Acquire and commit pipeline stages</li> <li><code class="language-plaintext highlighter-rouge">consumer_wait</code> / <code class="language-plaintext highlighter-rouge">consumer_release</code> - Synchronize consumers with producers</li> </ul> <p><strong>Tensor Memory Operations:</strong></p> <ul> <li><code class="language-plaintext highlighter-rouge">tcgen05.alloc</code> - Allocate dedicated tensor memory</li> <li><code class="language-plaintext highlighter-rouge">tmem_load</code> / <code class="language-plaintext highlighter-rouge">tmem_store</code> - Access tensor memory</li> </ul> <p><strong>Tensor Core Operations:</strong></p> <ul> <li><code
class="language-plaintext highlighter-rouge">tcgen05.mma</code> - Matrix Multiply-Accumulate</li> <li><code class="language-plaintext highlighter-rouge">block_scaled_mma</code> - Block-scaled MMA for mixed precision</li> <li><code class="language-plaintext highlighter-rouge">mma.fence</code> - Memory fence for MMA operations</li> </ul> <details style="font-size: 1.2em; margin: 1em 0;"> <summary style="cursor: pointer; font-weight: bold;">Expand to see key sections of the nv_tileas IR from the MoE kernel</summary> <div class="language-llvm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="err">//</span> <span class="err">nv_tileas</span> <span class="err">dialect</span> <span class="err">-</span> <span class="err">MoE</span> <span class="err">kernel</span> <span class="err">//</span> <span class="err">Tile-level</span> <span class="err">Scheduled</span> <span class="err">Assembly</span> <span class="err">//</span> <span class="err">Layout</span> <span class="err">conversion</span> <span class="k">and</span> <span class="err">view</span> <span class="err">operations</span> <span class="nv">%1</span><span class="p">,</span> <span class="nv">%2</span> <span class="p">=</span> <span class="s">"nv_tileas.load"</span><span class="p">(</span><span class="nv">%ptr</span><span class="p">,</span> <span class="nv">%mask</span><span class="p">,</span> <span class="nv">%c0</span><span class="p">,</span> <span class="nv">%token</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">aa</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span
class="err">aa</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="nv">%3</span><span class="p">,</span> <span class="nv">%4</span> <span class="p">=</span> <span class="s">"nv_tileas.tiled_load"</span><span class="p">(</span><span class="nv">%btile</span><span class="p">,</span> <span class="nv">%idx</span><span class="p">,</span> <span class="nv">%token</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">aa</span><span class="p">.</span><span class="err">mtoken</span><span class="p">,</span> <span class="err">iN</span><span class="p">,</span> <span class="err">aa</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">aa</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="nv">%5</span> <span class="p">=</span> <span class="s">"nv_tileas.view"</span><span class="p">(</span><span class="nv">%3</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">//</span> <span class="err">Convert</span> <span class="err">layout</span> <span class="err">for</span> <span class="err">tensor</span> <span class="err">cores</span> <span class="nv">%10</span> <span class="p">=</span> <span class="s">"nv_tileas.convert_layout"</span><span class="p">(</span><span class="nv">%bcast</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span 
class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%11</span> <span class="p">=</span> <span class="s">"nv_tileas.convert_layout"</span><span class="p">(</span><span class="nv">%5</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%12</span> <span class="p">=</span> <span class="s">"nv_tileas.convert_layout"</span><span class="p">(</span><span class="nv">%1</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">//</span> <span class="err">DOT</span> <span class="err">with</span> <span class="err">input</span> <span class="err">allowances</span> <span class="nv">%20</span> <span class="p">=</span> <span class="s">"nv_tileas.dot"</span><span class="p">(</span><span class="nv">%10</span><span class="p">,</span> <span class="nv">%11</span><span class="p">,</span> <span class="nv">%12</span><span class="p">,</span> <span class="nv">%c1</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">//</span> <span class="err">TMA</span> <span class="err">descriptor</span> <span class="nv">%25</span> <span 
class="p">=</span> <span class="s">"nv_tileas.make_tiled_tma_desc"</span><span class="p">(</span><span class="nv">%memref</span><span class="p">)</span> <span class="p">{</span><span class="err">tmaIdx</span><span class="p">=</span><span class="m">0</span><span class="p">}</span> <span class="err">:</span> <span class="p">(</span><span class="err">aa</span><span class="p">.</span><span class="err">btile</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="nv">!tma.desc</span><span class="p">)</span> <span class="err">//</span> <span class="err">ASYNC</span> <span class="err">PIPELINE</span> <span class="p">(</span><span class="err">producer-consumer</span> <span class="err">model</span><span class="p">)</span> <span class="err">//</span> <span class="err">Pipeline</span> <span class="k">and</span> <span class="err">iterator</span> <span class="err">creation</span> <span class="nv">%30</span> <span class="p">=</span> <span class="s">"nv_tileas.async.pipeline.create_pipeline"</span><span class="p">()</span> <span class="err">:</span> <span class="p">()</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="nv">!pipeline</span><span class="p">)</span> <span class="nv">%31</span> <span class="p">=</span> <span class="s">"nv_tileas.async.pipeline.create_pipeline"</span><span class="p">()</span> <span class="err">:</span> <span class="p">()</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="nv">!pipeline</span><span class="p">)</span> <span class="nv">%32</span> <span class="p">=</span> <span class="s">"nv_tileas.async.pipeline.create_iterator"</span><span class="p">(</span><span class="nv">%30</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="nv">!pipeline</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span 
class="nv">!iter</span><span class="p">)</span> <span class="nv">%33</span> <span class="p">=</span> <span class="s">"nv_tileas.async.pipeline.create_iterator"</span><span class="p">(</span><span class="nv">%31</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="nv">!pipeline</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="nv">!iter</span><span class="p">)</span> <span class="err">//</span> <span class="err">Agent</span> <span class="k">switch</span> <span class="p">(</span><span class="m">4</span> <span class="err">regions</span> <span class="err">for</span> <span class="err">producer/consumer</span> <span class="err">roles</span><span class="p">)</span> <span class="s">"nv_tileas.async.pipeline.agent_switch"</span><span class="p">(</span><span class="nv">%arg0</span><span class="p">,</span> <span class="nv">%30</span><span class="p">,</span> <span class="nv">%32</span><span class="p">,</span> <span class="nv">%31</span><span class="p">,</span> <span class="nv">%33</span><span class="p">)</span> <span class="p">{</span><span class="m">4</span> <span class="err">regions</span><span class="p">}</span> <span class="err">:</span> <span class="p">(</span><span class="err">aa</span><span class="p">.</span><span class="err">memref</span><span class="p">,</span> <span class="nv">!pipeline</span><span class="p">,</span> <span class="nv">!iter</span><span class="p">,</span> <span class="nv">!pipeline</span><span class="p">,</span> <span class="nv">!iter</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">()</span> <span class="err">//</span> <span class="err">Tensor</span> <span class="err">allocation</span> <span class="p">(</span><span class="kt">double</span><span class="err">-buffering</span><span class="p">)</span> <span class="nv">%40</span> <span class="p">=</span> <span 
class="s">"nv_tileas.alloc_tensor"</span><span class="p">()</span> <span class="err">:</span> <span class="p">()</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;</span><span class="m">128</span><span class="p">x</span><span class="m">64</span><span class="p">x</span><span class="err">bf16</span><span class="p">&gt;)</span> <span class="nv">%41</span> <span class="p">=</span> <span class="s">"nv_tileas.alloc_tensor"</span><span class="p">()</span> <span class="err">:</span> <span class="p">()</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;</span><span class="m">64</span><span class="p">x</span><span class="m">128</span><span class="p">x</span><span class="err">bf16</span><span class="p">&gt;)</span> <span class="err">//</span> <span class="err">Slice</span> <span class="err">operations</span> <span class="nv">%50</span> <span class="p">=</span> <span class="s">"nv_tileas.extract_slice"</span><span class="p">(</span><span class="nv">%40</span><span class="p">,</span> <span class="nv">%c0</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%51</span> <span class="p">=</span> <span class="s">"nv_tileas.insert_slice"</span><span class="p">(</span><span class="nv">%data</span><span class="p">,</span> <span class="nv">%40</span><span class="p">,</span> <span class="nv">%c0</span><span class="p">,</span> <span class="nv">%c64</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span 
class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">iN</span><span class="p">,</span> <span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">//</span> <span class="nl">PRODUCER:</span> <span class="k">acquire</span> <span class="err">→</span> <span class="err">write</span> <span class="err">→</span> <span class="err">commit</span> <span class="nv">%60</span> <span class="p">=</span> <span class="s">"nv_tileas.async.pipeline.producer_acquire"</span><span class="p">(</span><span class="nv">%30</span><span class="p">,</span> <span class="nv">%32</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="nv">!pipeline</span><span class="p">,</span> <span class="nv">!iter</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="nv">!stage</span><span class="p">)</span> <span class="nv">%61</span> <span class="p">=</span> <span class="s">"nv_tileas.async.pipeline.producer_write"</span><span class="p">(</span><span class="nv">%60</span><span class="p">,</span> <span class="nv">%30</span><span class="p">)</span> <span class="p">{</span><span class="m">1</span> <span class="err">regions</span><span class="p">}</span> <span class="err">:</span> <span class="p">(</span><span class="nv">!stage</span><span class="p">,</span> <span class="nv">!pipeline</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="nv">!stage</span><span class="p">)</span> <span class="nv">%62</span> <span class="p">=</span> <span class="s">"nv_tileas.async.load"</span><span class="p">(</span><span class="nv">%51</span><span class="p">,</span> <span class="nv">%ptr</span><span class="p">,</span> <span class="nv">%mask</span><span class="p">,</span> <span 
class="nv">%c16</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="nv">!async</span><span class="p">)</span> <span class="s">"nv_tileas.async.pipeline.yield"</span><span class="p">(</span><span class="nv">%62</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="nv">!async</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">()</span> <span class="s">"nv_tileas.async.pipeline.producer_commit"</span><span class="p">(</span><span class="nv">%30</span><span class="p">,</span> <span class="nv">%61</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="nv">!pipeline</span><span class="p">,</span> <span class="nv">!stage</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">()</span> <span class="err">//</span> <span class="nl">CONSUMER:</span> <span class="err">wait</span> <span class="err">→</span> <span class="err">read</span> <span class="err">→</span> <span class="k">release</span> <span class="nv">%70</span> <span class="p">=</span> <span class="s">"nv_tileas.async.pipeline.consumer_wait"</span><span class="p">(</span><span class="nv">%31</span><span class="p">,</span> <span class="nv">%33</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="nv">!pipeline</span><span class="p">,</span> <span class="nv">!iter</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="nv">!stage</span><span class="p">)</span> 
<span class="nv">%71</span><span class="p">,</span> <span class="nv">%72</span> <span class="p">=</span> <span class="s">"nv_tileas.async.pipeline.consumer_read"</span><span class="p">(</span><span class="nv">%70</span><span class="p">,</span> <span class="nv">%31</span><span class="p">)</span> <span class="p">{</span><span class="m">1</span> <span class="err">regions</span><span class="p">}</span> <span class="err">:</span> <span class="p">(</span><span class="nv">!stage</span><span class="p">,</span> <span class="nv">!pipeline</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="nv">!stage</span><span class="p">,</span> <span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%73</span> <span class="p">=</span> <span class="s">"nv_tileas.copy"</span><span class="p">(</span><span class="nv">%buf</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="s">"nv_tileas.async.pipeline.yield"</span><span class="p">(</span><span class="nv">%73</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">()</span> <span class="s">"nv_tileas.async.pipeline.consumer_release"</span><span class="p">(</span><span class="nv">%31</span><span class="p">,</span> <span class="nv">%71</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="nv">!pipeline</span><span class="p">,</span> <span class="nv">!stage</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">()</span> <span class="err">//</span> <span 
class="err">Matrix</span> <span class="err">multiply</span> <span class="p">(</span><span class="m">100</span><span class="err">+</span> <span class="err">ops</span> <span class="err">for</span> <span class="err">tiled</span> <span class="err">GEMM</span><span class="p">)</span> <span class="nv">%80</span> <span class="p">=</span> <span class="s">"nv_tileas.dot"</span><span class="p">(</span><span class="nv">%50</span><span class="p">,</span> <span class="nv">%72</span><span class="p">,</span> <span class="nv">%acc</span><span class="p">,</span> <span class="nv">%c1</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%81</span> <span class="p">=</span> <span class="s">"nv_tileas.dot"</span><span class="p">(</span><span class="nv">%50</span><span class="p">,</span> <span class="nv">%72</span><span class="p">,</span> <span class="nv">%80</span><span class="p">,</span> <span class="nv">%c1</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">//</span> <span class="err">TMA</span> <span class="k">load</span> <span class="nv">%90</span> <span class="p">=</span> <span 
class="s">"nv_tileas.async.tiled_tma_load"</span><span class="p">(</span><span class="nv">%btile</span><span class="p">,</span> <span class="nv">%buf</span><span class="p">,</span> <span class="nv">%25</span><span class="p">,</span> <span class="nv">%idx</span><span class="p">,</span> <span class="nv">%c0</span><span class="p">,</span> <span class="nv">%c64</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">aa</span><span class="p">.</span><span class="err">mtoken</span><span class="p">,</span> <span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="nv">!tma.desc</span><span class="p">,</span> <span class="err">iN</span><span class="p">,</span> <span class="err">iN</span><span class="p">,</span> <span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="nv">!async</span><span class="p">)</span> <span class="err">//</span> <span class="err">Output</span> <span class="nv">%100</span> <span class="p">=</span> <span class="s">"nv_tileas.insert_slice"</span><span class="p">(</span><span class="nv">%result</span><span class="p">,</span> <span class="nv">%41</span><span class="p">,</span> <span class="nv">%c0</span><span class="p">,</span> <span class="nv">%c0</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">iN</span><span class="p">,</span> <span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%101</span> <span class="p">=</span> <span class="s">"nv_tileas.view"</span><span class="p">(</span><span class="nv">%100</span><span class="p">)</span> <span class="err">:</span> <span 
class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%102</span> <span class="p">=</span> <span class="s">"nv_tileas.convert_layout"</span><span class="p">(</span><span class="nv">%101</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> </code></pre></div> </div> </details> <h2 id="nvvm--llvm">NVVM + LLVM</h2> <p>After <code class="language-plaintext highlighter-rouge">nv_tileas</code>, the compiler lowers to NVVM (NVIDIA’s LLVM dialect) and then to standard LLVM IR.</p> <p><strong>Key NVVM intrinsics:</strong></p> <ul> <li><code class="language-plaintext highlighter-rouge">@llvm.nvvm.mma.sync.*</code> - Tensor core matrix multiply</li> <li><code class="language-plaintext highlighter-rouge">@llvm.nvvm.ldmatrix.*</code> - Load matrix fragments from shared memory</li> <li><code class="language-plaintext highlighter-rouge">@llvm.nvvm.cp.async.*</code> - Asynchronous memory copy</li> <li><code class="language-plaintext highlighter-rouge">@llvm.nvvm.bar.warp.sync</code> - Warp-level synchronization</li> <li><code class="language-plaintext highlighter-rouge">@llvm.nvvm.tcgen05.*</code> - Tensor core intrinsics</li> </ul> <details style="font-size: 1.2em; margin: 1em 0;"> <summary style="cursor: pointer; font-weight: bold;">Expand to see NVVM/LLVM IR key sections</summary> <div class="language-llvm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">; Thread ID and warp-level operations</span> <span class="nv">%233</span> <span class="p">=</span> <span class="k">call</span> <span class="err">range</span><span 
class="p">(</span><span class="kt">i32</span> <span class="m">0</span><span class="p">,</span> <span class="m">1024</span><span class="p">)</span> <span class="kt">i32</span> <span class="vg">@llvm.nvvm.read.ptx.sreg.tid.x</span><span class="p">()</span> <span class="nv">%234</span> <span class="p">=</span> <span class="k">icmp</span> <span class="k">eq</span> <span class="kt">i32</span> <span class="nv">%233</span><span class="p">,</span> <span class="m">0</span> <span class="nv">%235</span> <span class="p">=</span> <span class="k">ashr</span> <span class="kt">i32</span> <span class="nv">%233</span><span class="p">,</span> <span class="m">5</span> <span class="nv">%236</span> <span class="p">=</span> <span class="k">call</span> <span class="kt">i32</span> <span class="vg">@llvm.nvvm.shfl.sync.idx.i32</span><span class="p">(</span><span class="kt">i32</span> <span class="m">-1</span><span class="p">,</span> <span class="kt">i32</span> <span class="nv">%235</span><span class="p">,</span> <span class="kt">i32</span> <span class="m">0</span><span class="p">,</span> <span class="kt">i32</span> <span class="m">31</span><span class="p">)</span> <span class="nv">%237</span> <span class="p">=</span> <span class="k">call</span> <span class="p">{</span> <span class="kt">i32</span><span class="p">,</span> <span class="kt">i1</span> <span class="p">}</span> <span class="vg">@llvm.nvvm.elect.sync</span><span class="p">(</span><span class="kt">i32</span> <span class="m">-1</span><span class="p">)</span> <span class="c1">; Mbarrier initialization (async pipeline synchronization)</span> <span class="k">call</span> <span class="kt">void</span> <span class="vg">@llvm.nvvm.mbarrier.init.shared</span><span class="p">(</span> <span class="kt">ptr</span> <span class="k">addrspace</span><span class="p">(</span><span class="m">3</span><span class="p">)</span> <span class="k">getelementptr</span> <span class="k">inbounds</span> <span class="k">nuw</span> <span class="p">(</span><span 
class="kt">i8</span><span class="p">,</span> <span class="kt">ptr</span> <span class="k">addrspace</span><span class="p">(</span><span class="m">3</span><span class="p">)</span> <span class="vg">@global_smem</span><span class="p">,</span> <span class="kt">i64</span> <span class="m">82000</span><span class="p">),</span> <span class="kt">i32</span> <span class="nv">%241</span><span class="p">)</span> <span class="k">call</span> <span class="kt">void</span> <span class="vg">@llvm.nvvm.mbarrier.init.shared</span><span class="p">(</span> <span class="kt">ptr</span> <span class="k">addrspace</span><span class="p">(</span><span class="m">3</span><span class="p">)</span> <span class="k">getelementptr</span> <span class="k">inbounds</span> <span class="k">nuw</span> <span class="p">(</span><span class="kt">i8</span><span class="p">,</span> <span class="kt">ptr</span> <span class="k">addrspace</span><span class="p">(</span><span class="m">3</span><span class="p">)</span> <span class="vg">@global_smem</span><span class="p">,</span> <span class="kt">i64</span> <span class="m">82008</span><span class="p">),</span> <span class="kt">i32</span> <span class="nv">%241</span><span class="p">)</span> <span class="c1">; Cluster-wide fence and barrier</span> <span class="k">call</span> <span class="kt">void</span> <span class="k">asm</span> <span class="k">sideeffect</span> <span class="s">"fence.mbarrier_init.release.cluster;"</span><span class="p">,</span> <span class="s">"n"</span><span class="p">(</span><span class="kt">i32</span> <span class="m">0</span><span class="p">)</span> <span class="k">call</span> <span class="kt">void</span> <span class="vg">@llvm.nvvm.barrier.cta.sync.aligned.all</span><span class="p">(</span><span class="kt">i32</span> <span class="m">0</span><span class="p">)</span> <span class="c1">; Async copy from global to shared memory (cp.async)</span> <span class="nv">%1478</span> <span class="p">=</span> <span class="k">select</span> <span class="kt">i1</span> 
<span class="nv">%1459</span><span class="p">,</span> <span class="kt">i32</span> <span class="m">16</span><span class="p">,</span> <span class="kt">i32</span> <span class="m">0</span> <span class="k">call</span> <span class="kt">void</span> <span class="vg">@llvm.nvvm.cp.async.cg.shared.global.16.s</span><span class="p">(</span> <span class="kt">ptr</span> <span class="k">addrspace</span><span class="p">(</span><span class="m">3</span><span class="p">)</span> <span class="nv">%1477</span><span class="p">,</span> <span class="kt">ptr</span> <span class="k">addrspace</span><span class="p">(</span><span class="m">1</span><span class="p">)</span> <span class="nv">%1451</span><span class="p">,</span> <span class="kt">i32</span> <span class="nv">%1478</span><span class="p">)</span> <span class="k">call</span> <span class="kt">void</span> <span class="vg">@llvm.nvvm.cp.async.cg.shared.global.16.s</span><span class="p">(</span> <span class="kt">ptr</span> <span class="k">addrspace</span><span class="p">(</span><span class="m">3</span><span class="p">)</span> <span class="nv">%1485</span><span class="p">,</span> <span class="kt">ptr</span> <span class="k">addrspace</span><span class="p">(</span><span class="m">1</span><span class="p">)</span> <span class="nv">%1452</span><span class="p">,</span> <span class="kt">i32</span> <span class="nv">%1486</span><span class="p">)</span> <span class="c1">; Signal mbarrier arrival after async copy</span> <span class="k">call</span> <span class="kt">void</span> <span class="vg">@llvm.nvvm.cp.async.mbarrier.arrive.noinc.shared</span><span class="p">(</span><span class="kt">ptr</span> <span class="k">addrspace</span><span class="p">(</span><span class="m">3</span><span class="p">)</span> <span class="nv">%1535</span><span class="p">)</span> <span class="c1">; TCGEN05 tensor core intrinsics</span> <span class="c1">; Allocate tensor memory</span> <span class="nv">%tmem</span> <span class="p">=</span> <span class="k">call</span> <span 
class="kt">i32</span> <span class="vg">@llvm.nvvm.tcgen05.alloc</span><span class="p">(</span><span class="kt">i32</span> <span class="m">65536</span><span class="p">)</span> <span class="c1">; Load data into tensor memory</span> <span class="k">call</span> <span class="kt">void</span> <span class="vg">@llvm.nvvm.tcgen05.ld</span><span class="p">(</span><span class="kt">i32</span> <span class="nv">%tmem</span><span class="p">,</span> <span class="kt">ptr</span> <span class="k">addrspace</span><span class="p">(</span><span class="m">3</span><span class="p">)</span> <span class="nv">%smem_ptr</span><span class="p">,</span> <span class="kt">i32</span> <span class="nv">%size</span><span class="p">)</span> <span class="c1">; Execute TCGEN05 MMA (128x256x64 tile)</span> <span class="k">call</span> <span class="kt">void</span> <span class="vg">@llvm.nvvm.tcgen05.mma</span><span class="p">(</span><span class="kt">i32</span> <span class="nv">%tmem_a</span><span class="p">,</span> <span class="kt">i32</span> <span class="nv">%tmem_b</span><span class="p">,</span> <span class="kt">i32</span> <span class="nv">%tmem_c</span><span class="p">)</span> <span class="c1">; Fence and wait for tensor core completion</span> <span class="k">call</span> <span class="kt">void</span> <span class="vg">@llvm.nvvm.tcgen05.fence</span><span class="p">()</span> <span class="k">call</span> <span class="kt">void</span> <span class="vg">@llvm.nvvm.tcgen05.wait</span><span class="p">()</span> </code></pre></div> </div> </details> <h2 id="sass">SASS</h2> <p>The final output is SASS.</p> <p><strong>Key SASS instructions:</strong></p> <ul> <li><code class="language-plaintext highlighter-rouge">HMMA.16816.F32.BF16</code> - Half-precision matrix multiply-accumulate</li> <li><code class="language-plaintext highlighter-rouge">TCGEN05.MMA</code> - Tensor core MMA</li> <li><code class="language-plaintext highlighter-rouge">TCGEN05.LD.S</code> - Tensor memory load</li> <li><code class="language-plaintext 
highlighter-rouge">UTCPMULTI</code> / <code class="language-plaintext highlighter-rouge">LDG</code> - Global memory loads</li> <li><code class="language-plaintext highlighter-rouge">SYNCS.EXCH</code> - Async synchronization exchange</li> <li><code class="language-plaintext highlighter-rouge">FENCE.VIEW.ASYNC.S</code> - Async memory fence</li> </ul> <details style="font-size: 1.2em; margin: 1em 0;"> <summary style="cursor: pointer; font-weight: bold;">Expand to see SASS key sections</summary> <div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">; SASS - MoE kernel (fused_moe_kernel)</span> <span class="c1">; Target: sm_120a</span> <span class="c1">; Thread ID and CTA setup</span> <span class="o">/*</span><span class="err">0020</span><span class="o">*/</span> <span class="nf">S2R</span> <span class="nv">R0</span><span class="p">,</span> <span class="nv">SR_TID.X</span> <span class="c1">; ; Get thread ID</span> <span class="o">/*</span><span class="err">0060</span><span class="o">*/</span> <span class="nf">S2UR</span> <span class="nv">UR8</span><span class="p">,</span> <span class="nv">SR_CgaCtaId</span> <span class="c1">; ; Get CTA ID (uniform reg)</span> <span class="c1">; Async fence and mbarrier sync (cluster sync)</span> <span class="o">/*</span><span class="err">0110</span><span class="o">*/</span> <span class="nf">FENCE.VIEW.ASYNC.S</span> <span class="c1">;</span> <span class="o">/*</span><span class="err">0120</span><span class="o">*/</span> <span class="nf">SYNCS.EXCH.64</span> <span class="nv">URZ</span><span class="p">,</span> <span class="p">[</span><span class="nv">UR8</span><span class="o">+</span><span class="mh">0x14050</span><span class="p">],</span> <span class="nv">UR4</span> <span class="c1">;</span> <span class="o">/*</span><span class="err">0130</span><span class="o">*/</span> <span class="nf">SYNCS.EXCH.64</span> <span class="nv">URZ</span><span class="p">,</span> <span 
class="p">[</span><span class="nv">UR8</span><span class="o">+</span><span class="mh">0x14058</span><span class="p">],</span> <span class="nv">UR4</span> <span class="c1">;</span> <span class="o">/*</span><span class="err">0140</span><span class="o">*/</span> <span class="nf">SYNCS.EXCH.64</span> <span class="nv">URZ</span><span class="p">,</span> <span class="p">[</span><span class="nv">UR8</span><span class="o">+</span><span class="mh">0x14060</span><span class="p">],</span> <span class="nv">UR6</span> <span class="c1">;</span> <span class="c1">; ... (data loading, address calculation) ...</span> <span class="c1">; Tensor core HMMA - 16x8x16 BF16→F32 matrix multiply</span> <span class="c1">; R156 = A matrix fragment (reused across 7 HMMAs)</span> <span class="c1">; R124,R120,R116,R112,R108,R104,R100 = B matrix fragments</span> <span class="c1">; R200,R204,R64,R60,R56,R52,R48 = accumulator tiles</span> <span class="o">/*</span><span class="err">4</span><span class="nf">a00</span><span class="o">*/</span> <span class="nv">HMMA.16816.F32.BF16</span> <span class="nv">R200</span><span class="p">,</span> <span class="nv">R156</span><span class="p">,</span> <span class="nv">R124</span><span class="p">,</span> <span class="nv">R200</span> <span class="c1">;</span> <span class="o">/*</span><span class="err">4</span><span class="nf">a10</span><span class="o">*/</span> <span class="nv">HMMA.16816.F32.BF16</span> <span class="nv">R204</span><span class="p">,</span> <span class="nv">R156</span><span class="p">,</span> <span class="nv">R120</span><span class="p">,</span> <span class="nv">R204</span> <span class="c1">;</span> <span class="o">/*</span><span class="err">4</span><span class="nf">a20</span><span class="o">*/</span> <span class="nv">HMMA.16816.F32.BF16</span> <span class="nv">R64</span><span class="p">,</span> <span class="nv">R156</span><span class="p">,</span> <span class="nv">R116</span><span class="p">,</span> <span class="nv">R64</span> <span 
class="c1">;</span> <span class="o">/*</span><span class="err">4</span><span class="nf">a30</span><span class="o">*/</span> <span class="nv">HMMA.16816.F32.BF16</span> <span class="nv">R60</span><span class="p">,</span> <span class="nv">R156</span><span class="p">,</span> <span class="nv">R112</span><span class="p">,</span> <span class="nv">R60</span> <span class="c1">;</span> <span class="o">/*</span><span class="err">4</span><span class="nf">a40</span><span class="o">*/</span> <span class="nv">HMMA.16816.F32.BF16</span> <span class="nv">R56</span><span class="p">,</span> <span class="nv">R156</span><span class="p">,</span> <span class="nv">R108</span><span class="p">,</span> <span class="nv">R56</span> <span class="c1">;</span> <span class="o">/*</span><span class="err">4</span><span class="nf">a50</span><span class="o">*/</span> <span class="nv">HMMA.16816.F32.BF16</span> <span class="nv">R52</span><span class="p">,</span> <span class="nv">R156</span><span class="p">,</span> <span class="nv">R104</span><span class="p">,</span> <span class="nv">R52</span> <span class="c1">;</span> <span class="o">/*</span><span class="err">4</span><span class="nf">a60</span><span class="o">*/</span> <span class="nv">HMMA.16816.F32.BF16</span> <span class="nv">R48</span><span class="p">,</span> <span class="nv">R156</span><span class="p">,</span> <span class="nv">R100</span><span class="p">,</span> <span class="nv">R48</span> <span class="c1">;</span> <span class="c1">; Second A fragment (R148) with different B fragments</span> <span class="o">/*</span><span class="err">4</span><span class="nf">a70</span><span class="o">*/</span> <span class="nv">HMMA.16816.F32.BF16</span> <span class="nv">R200</span><span class="p">,</span> <span class="nv">R148</span><span class="p">,</span> <span class="nv">R126</span><span class="p">,</span> <span class="nv">R200</span> <span class="c1">;</span> <span class="o">/*</span><span class="err">4</span><span class="nf">a80</span><span 
class="o">*/</span> <span class="nv">HMMA.16816.F32.BF16</span> <span class="nv">R204</span><span class="p">,</span> <span class="nv">R148</span><span class="p">,</span> <span class="nv">R122</span><span class="p">,</span> <span class="nv">R204</span> <span class="c1">;</span> <span class="o">/*</span><span class="err">4</span><span class="nf">a90</span><span class="o">*/</span> <span class="nv">HMMA.16816.F32.BF16</span> <span class="nv">R64</span><span class="p">,</span> <span class="nv">R148</span><span class="p">,</span> <span class="nv">R118</span><span class="p">,</span> <span class="nv">R64</span> <span class="c1">;</span> </code></pre></div> </div> </details> <hr /> <h1 id="the-tileir-passes">The TileIR passes</h1> <p>TileIR runs multiple passes to transform your code. The passes are grouped by the scope they operate on:</p> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2026-01-29/pass_flow.svg" width="100%" alt="" /> <div class="caption"> <em>TileIR pass pipeline </em> </div> </div> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2026-01-29/pass_glossary.svg" width="100%" alt="" /> <div class="caption"> <em>Detailed pass pipeline: cuda_tile.entry → nv_tileaa.func (×12) → builtin.module → gpu.module </em> </div> </div> <hr /> <h3 id="pass-1-cuda_tileentry">Pass 1: <code class="language-plaintext highlighter-rouge">cuda_tile.entry</code></h3> <p>Entry point canonicalization—validates kernel structure, emits compile-time constants for tile sizes/strides, propagates input constraints via <code class="language-plaintext highlighter-rouge">assume</code> operations, creates tensor views, and establishes memory ordering via <code class="language-plaintext highlighter-rouge">make_token</code>.</p> <hr /> <h3 id="pass-2-nv_tileaafunc-12-iterations">Pass 2: <code class="language-plaintext highlighter-rouge">nv_tileaa.func</code> (×12 iterations)</h3> <p>Iterative lowering from cuda_tile to nv_tileaa. 
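</p> <p>To make the first rewrite concrete, here is a hand-written sketch of the op renames this pass performs. The op names come from this section; the operand and type spellings are assumptions in the generic-form style used above, not actual compiler output:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// cuda_tile input (illustrative; operand and type syntax are assumptions)
%view = "cuda_tile.make_tensor_view"(%ptr) : (!ptr) -&gt; (!view)
%bid  = "cuda_tile.get_tile_block_id"() : () -&gt; (index)
%acc1 = "cuda_tile.mmaf"(%a, %b, %acc0) : (tile&lt;...&gt;, tile&lt;...&gt;, tile&lt;...&gt;) -&gt; (tile&lt;...&gt;)

// nv_tileaa output of the first iteration (illustrative)
%memref = "nv_tileaa.make_memref"(%ptr) : (!ptr) -&gt; (!memref)
%pid    = "nv_tileaa.get_program_id"() : () -&gt; (index)
%acc1   = "nv_tileaa.dot"(%a, %b, %acc0) : (tensor&lt;...&gt;, tensor&lt;...&gt;, tensor&lt;...&gt;) -&gt; (tensor&lt;...&gt;)
</code></pre></div></div> <p>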
First iteration converts <code class="language-plaintext highlighter-rouge">make_tensor_view</code> → <code class="language-plaintext highlighter-rouge">make_memref</code>, <code class="language-plaintext highlighter-rouge">get_tile_block_id</code> → <code class="language-plaintext highlighter-rouge">get_program_id</code>, <code class="language-plaintext highlighter-rouge">mmaf</code> → <code class="language-plaintext highlighter-rouge">dot</code>, decomposes <code class="language-plaintext highlighter-rouge">load_view_tko</code> into <code class="language-plaintext highlighter-rouge">block_tile</code> + <code class="language-plaintext highlighter-rouge">tiled_load</code> + <code class="language-plaintext highlighter-rouge">view</code>. Subsequent iterations perform refinement and optimization. Final iteration emits precision conversions (<code class="language-plaintext highlighter-rouge">fp_to_fp</code>), adds kernel metadata, and prepares for async pipeline lowering.</p> <hr /> <h3 id="pass-3-builtinmodule">Pass 3: <code class="language-plaintext highlighter-rouge">builtin.module</code></h3> <p>Module-level transforms and nv_tileas emission—creates async pipeline operations, software pipelines for overlapping compute/memory, producer-consumer synchronization, TMA descriptors, and double buffers.</p> <hr /> <h3 id="pass-4-gpumodule">Pass 4: <code class="language-plaintext highlighter-rouge">gpu.module</code></h3> <p>Final lowering to NVVM/LLVM—converts <code class="language-plaintext highlighter-rouge">nv_tileas.dot</code> → <code class="language-plaintext highlighter-rouge">nvvm.mma.sync</code>, lowers async ops to barrier/fence instructions, converts memory ops to NVVM intrinsics (<code class="language-plaintext highlighter-rouge">ldmatrix</code>, <code class="language-plaintext highlighter-rouge">cp.async</code>, <code class="language-plaintext highlighter-rouge">mbarrier.*</code>), and emits address space annotations.</p> <h2 
id="complete-pass-catalog">Complete Pass Catalog</h2> <p>Below is a catalog of passes that run within the TileIR pipeline.</p> <h3 id="conversion-passes">Conversion Passes</h3> <link rel="stylesheet" href="/assets/css/fancy_table.css" /> <script> function toggleCalculations(tableId) { // ... (toggleCalculations function remains the same) const table = document.getElementById(tableId); const cells = table.querySelectorAll('.has-calculation'); const tableWrapper = document.getElementById(tableId + '-wrapper'); const toggleSwitch = document.getElementById(tableId + '-switch'); const toggleText = document.getElementById(tableId + '-toggle-text'); if (tableWrapper.classList.contains('show-calculations')) { tableWrapper.classList.remove('show-calculations'); toggleSwitch.classList.remove('active'); toggleText.textContent = "Show calculations"; cells.forEach(cell => { const normalText = cell.querySelector('.normal-text'); const calculationText = cell.querySelector('.calculation-text'); normalText.style.display = 'inline'; calculationText.style.display = 'none'; }); } else { tableWrapper.classList.add('show-calculations'); toggleSwitch.classList.add('active'); toggleText.textContent = "Hide calculations"; cells.forEach(cell => { const normalText = cell.querySelector('.normal-text'); const calculationText = cell.querySelector('.calculation-text'); normalText.style.display = 'none'; calculationText.style.display = 'inline'; }); } } function isConsideredYellow(colorString) { if (!colorString) return false; const lowerColor = colorString.toLowerCase(); const yellowKeywords = ['yellow', 'gold', 'lemon', 'chiffon', 'goldenrod', 'papayawhip', 'moccasin', 'khaki', '#ff0', '#ffff00', '#ffffe0', '#fffacd', '#fafad2', '#fff8dc', '#eee8aa', '#f0e68c']; if (yellowKeywords.some(k => lowerColor.includes(k))) { return true; } const match = lowerColor.match(/rgba?\((\d+),\s*(\d+),\s*(\d+)/); if (match) { const r = parseInt(match[1]); const g = parseInt(match[2]); const b = 
parseInt(match[3]); if (r > 200 && g > 180 && b < 200 && Math.abs(r - g) < 70) { return true; } } if (lowerColor === '#ffdab9') return true; // PeachPuff if (lowerColor === 'rgba(255,255,0,0.2)') return true; // Example: semi-transparent yellow return false; } function initializeCellHighlighters() { const highlighters = document.querySelectorAll('[data-highlight-cells]'); highlighters.forEach(highlighter => { if (highlighter.dataset.highlighterInitialized === 'true') return; highlighter.dataset.highlighterInitialized = 'true'; const cellIdsToHighlight = highlighter.dataset.highlightCells.split(',').map(id => id.trim()); const cellElements = cellIdsToHighlight.map(id => document.getElementById(id)).filter(el => el); let descriptiveSpans = []; if (highlighter.matches('span[data-hover-text-color]')) { descriptiveSpans.push(highlighter); } else { descriptiveSpans = Array.from(highlighter.querySelectorAll('span[data-hover-text-color]')); } descriptiveSpans.forEach(span => { if (typeof span.dataset.originalParaTextColor === 'undefined') { span.dataset.originalParaTextColor = window.getComputedStyle(span).color || ''; } }); highlighter.addEventListener('mouseenter', () => { highlighter.classList.add('trigger-text-active'); descriptiveSpans.forEach(span => { if (span.dataset.hoverTextColor) { span.style.color = span.dataset.hoverTextColor; } }); cellElements.forEach((cell, idx) => { cell.classList.add('cell-highlighted'); let hoverTextColorForCell = null; let hoverCellBgFromDescSpan = null; if (idx < descriptiveSpans.length) { const descSpan = descriptiveSpans[idx]; if (descSpan.dataset.hoverTextColor) { hoverTextColorForCell = descSpan.dataset.hoverTextColor; } if (descSpan.dataset.hoverCellBg) { hoverCellBgFromDescSpan = descSpan.dataset.hoverCellBg; } } else if (descriptiveSpans.length === 1 && cellElements.length > 1) { const descSpan = descriptiveSpans[0]; if (descSpan.dataset.hoverTextColor) { hoverTextColorForCell = descSpan.dataset.hoverTextColor; } if 
(descSpan.dataset.hoverCellBg) { hoverCellBgFromDescSpan = descSpan.dataset.hoverCellBg; } } if (hoverTextColorForCell) { const isCalcCell = cell.classList.contains('has-calculation'); if (isCalcCell) { const normalTextSpan = cell.querySelector('.normal-text'); const calcResultSpan = cell.querySelector('.calc-result'); const calcWrapper = cell.querySelector('.calculation-text'); if (calcWrapper && window.getComputedStyle(calcWrapper).display !== 'none') { if (calcResultSpan) calcResultSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } else { if (normalTextSpan) normalTextSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } } else { const directSpan = cell.querySelector(':scope > span'); if (directSpan) directSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } } if (hoverCellBgFromDescSpan && isConsideredYellow(hoverCellBgFromDescSpan)) { if (typeof cell.dataset.originalBgColor === 'undefined') { cell.dataset.originalBgColor = cell.style.backgroundColor; } cell.style.backgroundColor = hoverCellBgFromDescSpan; cell.dataset.bgChangedByScript = 'true'; } }); }); highlighter.addEventListener('mouseleave', () => { highlighter.classList.remove('trigger-text-active'); descriptiveSpans.forEach(span => { if (typeof span.dataset.originalParaTextColor !== 'undefined') { span.style.color = span.dataset.originalParaTextColor; } }); cellElements.forEach((cell, idx) => { cell.classList.remove('cell-highlighted'); const isCalcCell = cell.classList.contains('has-calculation'); if (isCalcCell) { const normalTextSpan = cell.querySelector('.normal-text'); const calcResultSpan = cell.querySelector('.calc-result'); if (normalTextSpan) normalTextSpan.style.removeProperty('color'); if (calcResultSpan) calcResultSpan.style.removeProperty('color'); } else { const directSpan = cell.querySelector(':scope > span'); if (directSpan) directSpan.style.removeProperty('color'); } if (cell.dataset.bgChangedByScript === 'true') { 
cell.style.backgroundColor = cell.dataset.originalBgColor || ''; delete cell.dataset.originalBgColor; delete cell.dataset.bgChangedByScript; } }); }); }); } if (document.readyState === 'loading') { document.addEventListener('DOMContentLoaded', initializeCellHighlighters); } else { initializeCellHighlighters(); } </script> <style> /* ... (Other styles remain the same) ... */ .toggle-container { display: flex; justify-content: flex-end; margin-bottom: 6px; } .calc-toggle { display: inline-flex; align-items: center; } .toggle-switch { position: relative; display: inline-block; width: 32px; height: 16px; background-color: #e5e7eb; /* gray-200 */ border-radius: 10px; transition: all 0.3s; cursor: pointer; } .toggle-switch::after { content: ''; position: absolute; width: 12px; height: 12px; border-radius: 50%; background-color: white; top: 2px; left: 2px; transition: all 0.3s cubic-bezier(0.175, 0.885, 0.32, 1.275); box-shadow: 0 1px 2px rgba(0, 0, 0, 0.1); } .toggle-switch.active { background-color: #8b5cf6; /* purple-500 */ } .toggle-switch.active::after { left: 18px; } .toggle-label { position: absolute; width: 1px; height: 1px; padding: 0; margin: -1px; overflow: hidden; clip: rect(0, 0, 0, 0); white-space: nowrap; border-width: 0; } .toggle-text { font-size: 0.75rem; color: #6b7280; /* gray-500 */ margin-right: 6px; } .calc-result { color: rgb(75, 85, 99); /* gray-700 */ } .calc-formula { color: #6b83a6; /* Subtle bluish gray */ font-family: ui-monospace, SFMono-Regular, Menlo, Monaco, Consolas, monospace; } .calc-equals { color: rgb(75, 85, 99); /* gray-700 */ display: inline-block; margin: 0 4px; font-weight: normal; } .cell-highlighted { font-weight: bold !important; } .trigger-text-active { background-color: #E5E7EB; padding: 1px 3px; border-radius: 3px; transition: background-color 0.15s ease-in-out; } .trigger-text-active span[data-hover-text-color] { padding: 0; background-color: transparent; } /* Add styles for no-scroll option */ .table-wrapper-no-scroll { 
overflow-x: visible !important; } .table-no-scroll td { white-space: normal !important; word-break: break-word; } </style> <div id="conversion-passes-table-wrapper" class="px-4 rounded-lg __basic-table not-prose mt-4 mb-8 overflow-x-auto"> <table id="conversion-passes-table" class="min-w-full divide-y divide-gray-200 font-sans basic-table-striped"> <thead class="bg-gray-50"> <tr> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Pass Name </th> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Source </th> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Target </th> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Description </th> </tr> </thead> <tbody> <tr class="border-b border-gray-200"> <td id="conversion-passes-table-row0-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">convert-cudatile-to-tileaa</span> </td> <td id="conversion-passes-table-row0-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">cuda_tile</span> </td> <td id="conversion-passes-table-row0-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">nv_tileaa</span> </td> <td id="conversion-passes-table-row0-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Frontend: CuTile DSL to TileAA abstract assembly</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="conversion-passes-table-row1-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">convert-tileaa-to-tileas</span> </td> <td id="conversion-passes-table-row1-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span 
style="color: rgb(75, 85, 99) !important;">nv_tileaa</span> </td> <td id="conversion-passes-table-row1-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">nv_tileas</span> </td> <td id="conversion-passes-table-row1-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Middle-end: Abstract to scheduled assembly</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="conversion-passes-table-row2-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">convert-nv-tileas-to-llvm</span> </td> <td id="conversion-passes-table-row2-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">nv_tileas</span> </td> <td id="conversion-passes-table-row2-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">llvm</span> </td> <td id="conversion-passes-table-row2-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Backend: TileAS to LLVM IR</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="conversion-passes-table-row3-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">convert-nv-tile-func-to-llvm</span> </td> <td id="conversion-passes-table-row3-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">nv_tile</span> </td> <td id="conversion-passes-table-row3-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">llvm</span> </td> <td id="conversion-passes-table-row3-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium 
text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Convert tile function ops to LLVM</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="conversion-passes-table-row4-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">convert-gpu-to-nvvm</span> </td> <td id="conversion-passes-table-row4-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">gpu</span> </td> <td id="conversion-passes-table-row4-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">nvvm</span> </td> <td id="conversion-passes-table-row4-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">GPU dialect to NVVM intrinsics</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="conversion-passes-table-row5-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">convert-scf-to-cf</span> </td> <td id="conversion-passes-table-row5-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">scf</span> </td> <td id="conversion-passes-table-row5-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">cf</span> </td> <td id="conversion-passes-table-row5-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Structured control flow to basic blocks</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="conversion-passes-table-row6-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">nv-tile-ir-convert-target-to-nvvm</span> </td> <td 
id="conversion-passes-table-row6-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">nv_tile</span> </td> <td id="conversion-passes-table-row6-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">nvvm</span> </td> <td id="conversion-passes-table-row6-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Target-specific ops to NVVM</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="conversion-passes-table-row7-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">convert-pipeline-to-nvvm</span> </td> <td id="conversion-passes-table-row7-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">pipeline</span> </td> <td id="conversion-passes-table-row7-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">nvvm</span> </td> <td id="conversion-passes-table-row7-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Async pipeline ops to NVVM barriers</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="conversion-passes-table-row8-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">convert-arith-to-llvm</span> </td> <td id="conversion-passes-table-row8-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">arith</span> </td> <td id="conversion-passes-table-row8-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">llvm</span> </td> <td 
id="conversion-passes-table-row8-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Arithmetic operations to LLVM</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="conversion-passes-table-row9-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">convert-cf-to-llvm</span> </td> <td id="conversion-passes-table-row9-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">cf</span> </td> <td id="conversion-passes-table-row9-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">llvm</span> </td> <td id="conversion-passes-table-row9-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Control flow to LLVM</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="conversion-passes-table-row10-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">convert-to-llvm</span> </td> <td id="conversion-passes-table-row10-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">*</span> </td> <td id="conversion-passes-table-row10-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">llvm</span> </td> <td id="conversion-passes-table-row10-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Generic catch-all LLVM conversion</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="conversion-passes-table-row11-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 
99) !important;">convert-math-to-llvm</span> </td> <td id="conversion-passes-table-row11-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">math</span> </td> <td id="conversion-passes-table-row11-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">llvm</span> </td> <td id="conversion-passes-table-row11-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Math operations to LLVM</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="conversion-passes-table-row12-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">convert-nvvm-to-llvm</span> </td> <td id="conversion-passes-table-row12-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">nvvm</span> </td> <td id="conversion-passes-table-row12-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">llvm</span> </td> <td id="conversion-passes-table-row12-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">NVVM intrinsics to LLVM</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="conversion-passes-table-row13-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">convert-ub-to-llvm</span> </td> <td id="conversion-passes-table-row13-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">ub</span> </td> <td id="conversion-passes-table-row13-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) 
!important;">llvm</span> </td> <td id="conversion-passes-table-row13-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Undefined behavior ops to LLVM</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="conversion-passes-table-row14-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">convert-vector-to-llvm</span> </td> <td id="conversion-passes-table-row14-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">vector</span> </td> <td id="conversion-passes-table-row14-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">llvm</span> </td> <td id="conversion-passes-table-row14-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Vector ops to LLVM</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="conversion-passes-table-row15-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">convert-debuginfo-to-llvm</span> </td> <td id="conversion-passes-table-row15-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">debug</span> </td> <td id="conversion-passes-table-row15-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">llvm</span> </td> <td id="conversion-passes-table-row15-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Debug info to LLVM metadata</span> </td> </tr> </tbody> </table> </div> <h3 id="tileas-optimization-passes">TileAS Optimization Passes</h3> <link rel="stylesheet" 
href="/assets/css/fancy_table.css" /> <div id="tileas-passes-table-wrapper" class="px-4 rounded-lg __basic-table
not-prose mt-4 mb-8 overflow-x-auto"> <table id="tileas-passes-table" class="min-w-full divide-y divide-gray-200 font-sans basic-table-striped"> <thead class="bg-gray-50"> <tr> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Pass Name </th> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Description </th> </tr> </thead> <tbody> <tr class="border-b border-gray-200"> <td id="tileas-passes-table-row0-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">tileas-assign-dot-layouts</span> </td> <td id="tileas-passes-table-row0-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Assign optimal data layouts for dot (MMA) operations</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="tileas-passes-table-row1-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">tileas-assign-pipeline-layouts</span> </td> <td id="tileas-passes-table-row1-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Assign layouts for async pipeline stages</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="tileas-passes-table-row2-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">tileas-assign-load-store-layouts</span> </td> <td id="tileas-passes-table-row2-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Assign layouts for memory operations</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="tileas-passes-table-row3-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) 
!important;">tileas-attach-tma-desc-args</span> </td> <td id="tileas-passes-table-row3-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Attach TMA descriptor arguments to kernel signature</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="tileas-passes-table-row4-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">tileas-dynamic-persistent</span> </td> <td id="tileas-passes-table-row4-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Enable dynamic persistent kernel execution</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="tileas-passes-table-row5-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">tileas-insert-OCG-knobs</span> </td> <td id="tileas-passes-table-row5-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Insert Online Code Generation tuning knobs</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="tileas-passes-table-row6-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">tileas-legalize-tmem-copy</span> </td> <td id="tileas-passes-table-row6-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Legalize tensor memory copy operations</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="tileas-passes-table-row7-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">tileas-plan-cta</span> </td> <td id="tileas-passes-table-row7-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span 
style="color: rgb(75, 85, 99) !important;">Plan CTA (thread block) configuration</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="tileas-passes-table-row8-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">tileas-remove-buffer-alias</span> </td> <td id="tileas-passes-table-row8-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Remove buffer aliasing for optimization</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="tileas-passes-table-row9-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">tileas-remove-dead-args</span> </td> <td id="tileas-passes-table-row9-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Dead argument elimination</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="tileas-passes-table-row10-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">tileas-remove-layout-conversions</span> </td> <td id="tileas-passes-table-row10-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Remove unnecessary layout conversions</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="tileas-passes-table-row11-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">tileas-resolve-agent-boundary</span> </td> <td id="tileas-passes-table-row11-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Resolve warp specialization agent boundaries</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="tileas-passes-table-row12-col0" 
class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">tileas-slicing</span> </td> <td id="tileas-passes-table-row12-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Tensor slicing for pipelining</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="tileas-passes-table-row13-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">tileas-materialize-async</span> </td> <td id="tileas-passes-table-row13-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Materialize async load/store/dot operations</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="tileas-passes-table-row14-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">tileas-materialize-convert-layout</span> </td> <td id="tileas-passes-table-row14-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Materialize layout conversion copy atoms</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="tileas-passes-table-row15-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">tileas-materialize-schedule</span> </td> <td id="tileas-passes-table-row15-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Materialize schedule to warp-specialized IR</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="tileas-passes-table-row16-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">tileas-unroll-register-loops</span> </td> <td 
id="tileas-passes-table-row16-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Unroll loops at register level</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="tileas-passes-table-row17-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">tileas-unspecialized-pipeline</span> </td> <td id="tileas-passes-table-row17-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Handle non-warp-specialized pipelines</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="tileas-passes-table-row18-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">tileas-optimize-alloc-tensor</span> </td> <td id="tileas-passes-table-row18-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Optimize tensor allocation placement</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="tileas-passes-table-row19-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">tileas-optimize-reduce</span> </td> <td id="tileas-passes-table-row19-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Optimize reduction operations</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="tileas-passes-table-row20-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">tileas-recompute-for-scheduling</span> </td> <td id="tileas-passes-table-row20-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Recompute values for better 
scheduling</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="tileas-passes-table-row21-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">tileas-legalize-fma-dot</span> </td> <td id="tileas-passes-table-row21-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Legalize FMA in dot products</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="tileas-passes-table-row22-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">tileas-legalize-reduce</span> </td> <td id="tileas-passes-table-row22-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Legalize reduction operations</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="tileas-passes-table-row23-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">tileas-slice-and-fuse</span> </td> <td id="tileas-passes-table-row23-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Slice and fuse operations for locality</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="tileas-passes-table-row24-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">tileas-refine-atom-by-resource</span> </td> <td id="tileas-passes-table-row24-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Refine copy atoms based on resource constraints</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="tileas-passes-table-row25-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span 
style="color: rgb(75, 85, 99) !important;">tileas-generate-schedule</span> </td> <td id="tileas-passes-table-row25-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Generate execution schedule (Serial or CostBased)</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="tileas-passes-table-row26-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">tileas-prepare-for-scheduling</span> </td> <td id="tileas-passes-table-row26-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Prepare IR for scheduling pass</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="tileas-passes-table-row27-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">tileas-optimize-dot-accumulation</span> </td> <td id="tileas-passes-table-row27-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Optimize dot product accumulation</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="tileas-passes-table-row28-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">lower-tma-load-store-to-async</span> </td> <td id="tileas-passes-table-row28-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Lower TMA ops to async variants</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="tileas-passes-table-row29-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">tileas-print-decomposed-tv-layout</span> </td> <td id="tileas-passes-table-row29-col1" class="px-4 py-2 whitespace-nowrap text-sm 
font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Debug: print decomposed tensor view layouts</span> </td> </tr> </tbody> </table> </div> <hr /> <details style="font-size: 1.2em; margin: 1em 0;"> <summary style="cursor: pointer; font-weight: bold;">Conversion Patterns Registered</summary> <p>The TileAA→TileAS conversion registers 20+ patterns:</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">TileAAToTileASTiledLoadOpPattern</span> <span class="c1">// Tiled load conversion</span> <span class="n">TileAAToTileASDotOpPattern</span> <span class="c1">// Dot product conversion</span> <span class="n">TileAAToTileASExtractOpPattern</span> <span class="c1">// Extraction conversion</span> <span class="n">TileAAToTileASBroadcastOpPattern</span> <span class="c1">// Broadcast conversion</span> <span class="n">TileAAToTileASGatherLoadOpPattern</span> <span class="c1">// Gather load conversion</span> <span class="n">TileAAToTileASScatterStoreOpPattern</span> <span class="c1">// Scatter store conversion</span> <span class="n">TileAAToTileASExpandDimsOpPattern</span> <span class="c1">// Dimension expansion</span> <span class="n">TileAAToTileASExtractSliceOpPattern</span> <span class="c1">// Slice extraction</span> <span class="n">TileAAToTileASGenerateOpPattern</span> <span class="c1">// Generate conversion</span> <span class="n">TileAAToTileASLoadOpPattern</span> <span class="c1">// Load conversion</span> <span class="n">TileAAToTileASPermuteOpPattern</span> <span class="c1">// Permute conversion</span> <span class="n">TileAAToTileASReduceOpPattern</span> <span class="c1">// Reduce conversion</span> <span class="n">TileAAToTileASScanOpPattern</span> <span class="c1">// Scan conversion</span> <span class="n">TileAAToTileASStoreOpPattern</span> <span class="c1">// Store conversion</span> <span class="n">TileAAToTileASTiledAtomicRMWOpPattern</span> <span class="c1">// Atomic RMW 
conversion</span> <span class="n">TileAAToTileASTiledStoreOpPattern</span> <span class="c1">// Tiled store conversion</span> <span class="n">TileAAToTileASViewOpPattern</span> <span class="c1">// View conversion</span> <span class="n">TileAAToTileASYieldOpPattern</span> <span class="c1">// Yield conversion</span> </code></pre></div> </div> </details> <hr /> <h1 id="conclusion">Conclusion</h1> <p>TileIR is a sophisticated MLIR-based compiler that progressively lowers high-level tensor operations to optimized GPU machine code. It’s an interesting piece of software that combines MLIR and the rest of NVIDIA’s toolchain to make the tile abstraction work.</p> <p><strong>Resources:</strong></p> <ul> <li><a href="https://github.com/NVIDIA/cutile-python">CuTile Python</a></li> <li><a href="https://github.com/NVIDIA/cuda-tile">CUDA Tile</a></li> <li><a href="https://docs.nvidia.com/cuda/tile-ir/">NVIDIA TileIR Documentation</a></li> </ul> <hr /> <h1 id="appendix-tileir-passes-reference">Appendix: TileIR Passes Reference</h1> <p>This appendix documents the TileIR-specific passes in the compilation pipeline. 
Passes are organized into two categories: <strong>Conversion</strong> and <strong>TileAS Optimization</strong>.</p> <details style="font-size: 1.2em; margin: 1em 0;"> <summary style="cursor: pointer; font-weight: bold;">Conversion Passes (16)</summary> <p>Conversion passes transform IR between MLIR dialects.</p> <h3 id="convert-cudatile-to-tileaa">convert-cudatile-to-tileaa</h3> <p>Converts high-level <code class="language-plaintext highlighter-rouge">cuda_tile</code> dialect to <code class="language-plaintext highlighter-rouge">nv_tileaa</code>.</p> <p><strong>Key transformations:</strong></p> <ul> <li><code class="language-plaintext highlighter-rouge">cuda_tile.mmaf</code> → <code class="language-plaintext highlighter-rouge">nv_tileaa.dot</code></li> <li><code class="language-plaintext highlighter-rouge">cuda_tile.load_view_tko</code> → <code class="language-plaintext highlighter-rouge">nv_tileaa.tiled_load</code></li> <li><code class="language-plaintext highlighter-rouge">cuda_tile.store_ptr_tko</code> → <code class="language-plaintext highlighter-rouge">nv_tileaa.tiled_store</code></li> <li><code class="language-plaintext highlighter-rouge">cuda_tile.for</code> → <code class="language-plaintext highlighter-rouge">scf.for</code> + <code class="language-plaintext highlighter-rouge">nv_tileaa.yield</code></li> </ul> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">ConvertCudaTileToTileAA</span><span class="o">::</span><span class="n">runOnOperation</span><span class="p">()</span> <span class="p">{</span> <span class="n">ModuleOp</span> <span class="k">module</span> <span class="o">=</span> <span class="n">getOperation</span><span class="p">();</span> <span class="n">ConversionTarget</span> <span class="n">target</span><span class="p">(</span><span class="n">getContext</span><span class="p">());</span> <span class="n">target</span><span class="p">.</span><span
class="n">addLegalDialect</span><span class="o">&lt;</span><span class="n">nv_tileaa</span><span class="o">::</span><span class="n">NVTileAADialect</span><span class="o">&gt;</span><span class="p">();</span> <span class="n">target</span><span class="p">.</span><span class="n">addIllegalDialect</span><span class="o">&lt;</span><span class="n">cuda_tile</span><span class="o">::</span><span class="n">CudaTileDialect</span><span class="o">&gt;</span><span class="p">();</span> <span class="n">RewritePatternSet</span> <span class="n">patterns</span><span class="p">(</span><span class="o">&amp;</span><span class="n">getContext</span><span class="p">());</span> <span class="c1">// Register 20+ conversion patterns</span> <span class="n">patterns</span><span class="p">.</span><span class="n">add</span><span class="o">&lt;</span><span class="n">ConvertMmafToDot</span><span class="o">&gt;</span><span class="p">(...);</span> <span class="n">patterns</span><span class="p">.</span><span class="n">add</span><span class="o">&lt;</span><span class="n">ConvertLoadViewTko</span><span class="o">&gt;</span><span class="p">(...);</span> <span class="n">patterns</span><span class="p">.</span><span class="n">add</span><span class="o">&lt;</span><span class="n">ConvertStorePtr</span><span class="o">&gt;</span><span class="p">(...);</span> <span class="c1">// Fail the pass if any cuda_tile op is left unconverted</span> <span class="k">if</span> <span class="p">(</span><span class="n">failed</span><span class="p">(</span><span class="n">applyPartialConversion</span><span class="p">(</span><span class="k">module</span><span class="p">,</span> <span class="n">target</span><span class="p">,</span> <span class="n">std</span><span class="o">::</span><span class="n">move</span><span class="p">(</span><span class="n">patterns</span><span class="p">))))</span> <span class="n">signalPassFailure</span><span class="p">();</span> <span class="p">}</span> </code></pre></div> </div> <hr /> <h3 id="convert-tileaa-to-tileas">convert-tileaa-to-tileas</h3> <p>Main middle-end conversion: <code class="language-plaintext highlighter-rouge">nv_tileaa</code> → <code class="language-plaintext highlighter-rouge">nv_tileas</code> (Tile Assembly).</p> <p><strong>Key
transformations:</strong></p> <ul> <li><code class="language-plaintext highlighter-rouge">nv_tileaa.tiled_load</code> → <code class="language-plaintext highlighter-rouge">nv_tileas.async_load</code> + pipeline ops</li> <li><code class="language-plaintext highlighter-rouge">nv_tileaa.dot</code> → <code class="language-plaintext highlighter-rouge">nv_tileas.dot</code> with layout annotations</li> <li>Inserts shared memory allocations</li> </ul> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">ConvertTileAAToTileAS</span><span class="o">::</span><span class="n">runOnOperation</span><span class="p">()</span> <span class="p">{</span> <span class="n">FuncOp</span> <span class="n">funcOp</span> <span class="o">=</span> <span class="n">getOperation</span><span class="p">();</span> <span class="c1">// Walk all tileaa operations</span> <span class="n">funcOp</span><span class="p">.</span><span class="n">walk</span><span class="p">([</span><span class="o">&amp;</span><span class="p">](</span><span class="n">nv_tileaa</span><span class="o">::</span><span class="n">TiledLoadOp</span> <span class="n">loadOp</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// Create async copy with TMA descriptor</span> <span class="k">auto</span> <span class="n">asyncCopy</span> <span class="o">=</span> <span class="n">builder</span><span class="p">.</span><span class="n">create</span><span class="o">&lt;</span><span class="n">nv_tileas</span><span class="o">::</span><span class="n">AsyncCopyOp</span><span class="o">&gt;</span><span class="p">(...);</span> <span class="c1">// Allocate shared memory buffer</span> <span class="k">auto</span> <span class="n">smemAlloc</span> <span class="o">=</span> <span class="n">builder</span><span class="p">.</span><span class="n">create</span><span class="o">&lt;</span><span class="n">nv_tileas</span><span class="o">::</span><span 
class="n">AllocSharedOp</span><span class="o">&gt;</span><span class="p">(...);</span> <span class="p">});</span> <span class="n">funcOp</span><span class="p">.</span><span class="n">walk</span><span class="p">([</span><span class="o">&amp;</span><span class="p">](</span><span class="n">nv_tileaa</span><span class="o">::</span><span class="n">DotOp</span> <span class="n">dotOp</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// Convert to tileas.dot with layout attributes</span> <span class="k">auto</span> <span class="n">tiledDot</span> <span class="o">=</span> <span class="n">builder</span><span class="p">.</span><span class="n">create</span><span class="o">&lt;</span><span class="n">nv_tileas</span><span class="o">::</span><span class="n">DotOp</span><span class="o">&gt;</span><span class="p">(...);</span> <span class="n">tiledDot</span><span class="o">-&gt;</span><span class="n">setAttr</span><span class="p">(</span><span class="s">"lhs_layout"</span><span class="p">,</span> <span class="n">selectMMALayout</span><span class="p">(...));</span> <span class="p">});</span> <span class="p">}</span> </code></pre></div> </div> <hr /> <h3 id="convert-nv-tileas-to-llvm">convert-nv-tileas-to-llvm</h3> <p>Backend code generation: <code class="language-plaintext highlighter-rouge">nv_tileas</code> → LLVM IR with NVVM intrinsics.</p> <p><strong>Key transformations:</strong></p> <ul> <li><code class="language-plaintext highlighter-rouge">tileas.tcgen05_mma</code> → <code class="language-plaintext highlighter-rouge">@llvm.nvvm.tcgen05.mma.*</code></li> <li><code class="language-plaintext highlighter-rouge">tileas.tcgen05_ld</code> → <code class="language-plaintext highlighter-rouge">@llvm.nvvm.tcgen05.ld.*</code></li> <li><code class="language-plaintext highlighter-rouge">tileas.async_copy</code> → <code class="language-plaintext highlighter-rouge">@llvm.nvvm.cp.async.*</code></li> <li>Barrier ops → <code class="language-plaintext 
highlighter-rouge">@llvm.nvvm.barrier.*</code></li> </ul> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">ConvertTileASToLLVM</span><span class="o">::</span><span class="n">runOnOperation</span><span class="p">()</span> <span class="p">{</span> <span class="n">ModuleOp</span> <span class="k">module</span> <span class="o">=</span> <span class="n">getOperation</span><span class="p">();</span> <span class="n">ConversionTarget</span> <span class="n">target</span><span class="p">(</span><span class="n">getContext</span><span class="p">());</span> <span class="n">target</span><span class="p">.</span><span class="n">addLegalDialect</span><span class="o">&lt;</span><span class="n">LLVM</span><span class="o">::</span><span class="n">LLVMDialect</span><span class="o">&gt;</span><span class="p">();</span> <span class="n">RewritePatternSet</span> <span class="n">patterns</span><span class="p">(</span><span class="o">&amp;</span><span class="n">getContext</span><span class="p">());</span> <span class="c1">// MMA: tileas.tcgen05_mma → @llvm.nvvm.tcgen05.mma.*</span> <span class="n">patterns</span><span class="p">.</span><span class="n">add</span><span class="o">&lt;</span><span class="n">Tcgen05MMAToNVVM</span><span class="o">&gt;</span><span class="p">(</span><span class="o">&amp;</span><span class="n">getContext</span><span class="p">());</span> <span class="c1">// Memory: tileas.tcgen05_ld → @llvm.nvvm.tcgen05.ld.*</span> <span class="n">patterns</span><span class="p">.</span><span class="n">add</span><span class="o">&lt;</span><span class="n">Tcgen05LoadToNVVM</span><span class="o">&gt;</span><span class="p">(</span><span class="o">&amp;</span><span class="n">getContext</span><span class="p">());</span> <span class="c1">// Apply the conversion; fail the pass on unconverted ops</span> <span class="k">if</span> <span class="p">(</span><span class="n">failed</span><span class="p">(</span><span class="n">applyPartialConversion</span><span class="p">(</span><span class="k">module</span><span class="p">,</span> <span class="n">target</span><span class="p">,</span> <span class="n">std</span><span class="o">::</span><span class="n">move</span><span class="p">(</span><span class="n">patterns</span><span class="p">))))</span> <span class="n">signalPassFailure</span><span class="p">();</span> <span class="p">}</span> </code></pre></div> </div> <hr /> <h3 id="convert-gpu-to-nvvm">convert-gpu-to-nvvm</h3> <p>Converts GPU dialect operations to NVVM intrinsics.</p> <table> <thead> <tr> <th>GPU Op</th> <th>NVVM Intrinsic</th> </tr> </thead> <tbody> <tr> <td><code class="language-plaintext highlighter-rouge">gpu.thread_id</code></td> <td><code class="language-plaintext highlighter-rouge">nvvm.read.ptx.sreg.tid.*</code></td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">gpu.block_id</code></td> <td><code class="language-plaintext highlighter-rouge">nvvm.read.ptx.sreg.ctaid.*</code></td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">gpu.block_dim</code></td> <td><code class="language-plaintext highlighter-rouge">nvvm.read.ptx.sreg.ntid.*</code></td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">gpu.barrier</code></td> <td><code class="language-plaintext highlighter-rouge">nvvm.barrier0</code></td> </tr> </tbody> </table> <hr /> <h3 id="convert-pipeline-to-nvvm">convert-pipeline-to-nvvm</h3> <p>Converts async pipeline operations to NVVM barrier intrinsics.</p> <table> <thead> <tr> <th>Pipeline Op</th> <th>NVVM Op</th> </tr> </thead> <tbody> <tr> <td><code class="language-plaintext highlighter-rouge">pipeline.producer_acquire</code></td> <td><code class="language-plaintext highlighter-rouge">nvvm.mbarrier.arrive.*</code></td>
</tr> <tr> <td><code class="language-plaintext highlighter-rouge">pipeline.producer_commit</code></td> <td><code class="language-plaintext highlighter-rouge">nvvm.mbarrier.arrive.*</code> + phase</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">pipeline.consumer_wait</code></td> <td><code class="language-plaintext highlighter-rouge">nvvm.mbarrier.wait.*</code></td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">pipeline.consumer_release</code></td> <td><code class="language-plaintext highlighter-rouge">nvvm.mbarrier.arrive.*</code></td> </tr> </tbody> </table> <hr /> </details> <details style="font-size: 1.2em; margin: 1em 0;"> <summary style="cursor: pointer; font-weight: bold;">TileAS Optimization Passes (30)</summary> <p>TileAS passes optimize and schedule tile operations.</p> <h3 id="tileas-assign-dot-layouts">tileas-assign-dot-layouts</h3> <p>Assigns MMA-compatible layouts to dot product operands.</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">AssignDotLayouts</span><span class="o">::</span><span class="n">runOnOperation</span><span class="p">()</span> <span class="p">{</span> <span class="n">FuncOp</span> <span class="n">funcOp</span> <span class="o">=</span> <span class="n">getOperation</span><span class="p">();</span> <span class="n">funcOp</span><span class="p">.</span><span class="n">walk</span><span class="p">([</span><span class="o">&amp;</span><span class="p">](</span><span class="n">DotOp</span> <span class="n">dotOp</span><span class="p">)</span> <span class="p">{</span> <span class="k">auto</span> <span class="n">lhsType</span> <span class="o">=</span> <span class="n">dotOp</span><span class="p">.</span><span class="n">getLhs</span><span class="p">().</span><span class="n">getType</span><span class="p">();</span> <span class="k">auto</span> <span class="n">rhsType</span> <span class="o">=</span> <span 
class="n">dotOp</span><span class="p">.</span><span class="n">getRhs</span><span class="p">().</span><span class="n">getType</span><span class="p">();</span> <span class="c1">// Select MMA shape based on types</span> <span class="n">MMAShape</span> <span class="n">mmaShape</span> <span class="o">=</span> <span class="n">selectMMAShape</span><span class="p">(</span><span class="n">lhsType</span><span class="p">,</span> <span class="n">rhsType</span><span class="p">);</span> <span class="c1">// Assign layouts for operands</span> <span class="n">Layout</span> <span class="n">lhsLayout</span> <span class="o">=</span> <span class="n">computeLhsLayout</span><span class="p">(</span><span class="n">mmaShape</span><span class="p">,</span> <span class="n">lhsType</span><span class="p">);</span> <span class="n">Layout</span> <span class="n">rhsLayout</span> <span class="o">=</span> <span class="n">computeRhsLayout</span><span class="p">(</span><span class="n">mmaShape</span><span class="p">,</span> <span class="n">rhsType</span><span class="p">);</span> <span class="n">dotOp</span><span class="o">-&gt;</span><span class="n">setAttr</span><span class="p">(</span><span class="s">"lhs_layout"</span><span class="p">,</span> <span class="n">lhsLayout</span><span class="p">);</span> <span class="n">dotOp</span><span class="o">-&gt;</span><span class="n">setAttr</span><span class="p">(</span><span class="s">"rhs_layout"</span><span class="p">,</span> <span class="n">rhsLayout</span><span class="p">);</span> <span class="p">});</span> <span class="p">}</span> </code></pre></div> </div> <p><strong>MMA shapes:</strong> <code class="language-plaintext highlighter-rouge">m16n8k16</code>, <code class="language-plaintext highlighter-rouge">m16n16k16</code>, <code class="language-plaintext highlighter-rouge">m64n256k64</code></p> <hr /> <h3 id="tileas-assign-load-store-layouts">tileas-assign-load-store-layouts</h3> <p>Optimizes memory access patterns for coalesced loads.</p> <div 
class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">AssignLoadStoreLayouts</span><span class="o">::</span><span class="n">runOnOperation</span><span class="p">()</span> <span class="p">{</span> <span class="n">FuncOp</span> <span class="n">funcOp</span> <span class="o">=</span> <span class="n">getOperation</span><span class="p">();</span> <span class="n">funcOp</span><span class="p">.</span><span class="n">walk</span><span class="p">([</span><span class="o">&amp;</span><span class="p">](</span><span class="n">LoadOp</span> <span class="n">loadOp</span><span class="p">)</span> <span class="p">{</span> <span class="k">auto</span> <span class="n">tensorType</span> <span class="o">=</span> <span class="n">loadOp</span><span class="p">.</span><span class="n">getResult</span><span class="p">().</span><span class="n">getType</span><span class="p">();</span> <span class="c1">// Check for TMA opportunity</span> <span class="k">if</span> <span class="p">(</span><span class="n">canUseTMA</span><span class="p">(</span><span class="n">loadOp</span><span class="p">))</span> <span class="p">{</span> <span class="n">Layout</span> <span class="n">tmaLayout</span> <span class="o">=</span> <span class="n">computeTMALayout</span><span class="p">(</span><span class="n">tensorType</span><span class="p">);</span> <span class="n">loadOp</span><span class="o">-&gt;</span><span class="n">setAttr</span><span class="p">(</span><span class="s">"layout"</span><span class="p">,</span> <span class="n">tmaLayout</span><span class="p">);</span> <span class="n">loadOp</span><span class="o">-&gt;</span><span class="n">setAttr</span><span class="p">(</span><span class="s">"use_tma"</span><span class="p">,</span> <span class="nb">true</span><span class="p">);</span> <span class="p">}</span> <span class="k">else</span> <span class="p">{</span> <span class="c1">// Vectorized load layout</span> <span 
class="n">Layout</span> <span class="n">vecLayout</span> <span class="o">=</span> <span class="n">computeVectorizedLayout</span><span class="p">(</span><span class="n">tensorType</span><span class="p">);</span> <span class="n">loadOp</span><span class="o">-&gt;</span><span class="n">setAttr</span><span class="p">(</span><span class="s">"layout"</span><span class="p">,</span> <span class="n">vecLayout</span><span class="p">);</span> <span class="p">}</span> <span class="p">});</span> <span class="p">}</span> </code></pre></div> </div> <hr /> <h3 id="tileas-assign-pipeline-layouts">tileas-assign-pipeline-layouts</h3> <p>Assigns layouts for async pipeline buffers.</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">AssignPipelineLayouts</span><span class="o">::</span><span class="n">runOnOperation</span><span class="p">()</span> <span class="p">{</span> <span class="n">FuncOp</span> <span class="n">funcOp</span> <span class="o">=</span> <span class="n">getOperation</span><span class="p">();</span> <span class="n">funcOp</span><span class="p">.</span><span class="n">walk</span><span class="p">([</span><span class="o">&amp;</span><span class="p">](</span><span class="n">PipelineOp</span> <span class="n">pipelineOp</span><span class="p">)</span> <span class="p">{</span> <span class="k">for</span> <span class="p">(</span><span class="k">auto</span><span class="o">&amp;</span> <span class="n">stage</span> <span class="o">:</span> <span class="n">pipelineOp</span><span class="p">.</span><span class="n">getStages</span><span class="p">())</span> <span class="p">{</span> <span class="c1">// Assign shared memory layouts for buffers</span> <span class="k">for</span> <span class="p">(</span><span class="k">auto</span> <span class="n">buffer</span> <span class="o">:</span> <span class="n">stage</span><span class="p">.</span><span class="n">getBuffers</span><span class="p">())</span> <span 
class="p">{</span> <span class="n">Layout</span> <span class="n">smemLayout</span> <span class="o">=</span> <span class="n">computeSwizzledLayout</span><span class="p">(</span><span class="n">buffer</span><span class="p">);</span> <span class="n">buffer</span><span class="o">-&gt;</span><span class="n">setAttr</span><span class="p">(</span><span class="s">"layout"</span><span class="p">,</span> <span class="n">smemLayout</span><span class="p">);</span> <span class="p">}</span> <span class="p">}</span> <span class="p">});</span> <span class="p">}</span> </code></pre></div> </div> <hr /> <h3 id="tileas-generate-schedule">tileas-generate-schedule</h3> <p>Generates execution schedule using cost-based or serial scheduler.</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">GenerateSchedule</span><span class="o">::</span><span class="n">runOnOperation</span><span class="p">()</span> <span class="p">{</span> <span class="n">FuncOp</span> <span class="n">funcOp</span> <span class="o">=</span> <span class="n">getOperation</span><span class="p">();</span> <span class="c1">// Build dependency graph</span> <span class="n">DependencyGraph</span> <span class="n">depGraph</span><span class="p">(</span><span class="n">funcOp</span><span class="p">);</span> <span class="c1">// Select scheduler based on options</span> <span class="n">Scheduler</span><span class="o">*</span> <span class="n">scheduler</span><span class="p">;</span> <span class="k">if</span> <span class="p">(</span><span class="n">useCostBasedScheduler</span><span class="p">)</span> <span class="p">{</span> <span class="n">scheduler</span> <span class="o">=</span> <span class="k">new</span> <span class="n">CostBasedScheduler</span><span class="p">(</span><span class="n">depGraph</span><span class="p">);</span> <span class="p">}</span> <span class="k">else</span> <span class="p">{</span> <span class="n">scheduler</span> <span 
class="o">=</span> <span class="k">new</span> <span class="n">SerialScheduler</span><span class="p">(</span><span class="n">depGraph</span><span class="p">);</span> <span class="p">}</span> <span class="c1">// Generate schedule</span> <span class="n">Schedule</span> <span class="n">schedule</span> <span class="o">=</span> <span class="n">scheduler</span><span class="o">-&gt;</span><span class="n">generateSchedule</span><span class="p">();</span> <span class="c1">// Apply schedule to IR</span> <span class="n">applySchedule</span><span class="p">(</span><span class="n">funcOp</span><span class="p">,</span> <span class="n">schedule</span><span class="p">);</span> <span class="p">}</span> </code></pre></div> </div> <p><strong>Scheduler types:</strong></p> <ul> <li><code class="language-plaintext highlighter-rouge">Serial</code>: Topological order</li> <li><code class="language-plaintext highlighter-rouge">CostBased</code>: Latency-aware with heuristics</li> </ul> <hr /> <h3 id="tileas-materialize-schedule">tileas-materialize-schedule</h3> <p>Materializes abstract schedule into warp-specialized IR.</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">MaterializeSchedule</span><span class="o">::</span><span class="n">runOnOperation</span><span class="p">()</span> <span class="p">{</span> <span class="n">FuncOp</span> <span class="n">funcOp</span> <span class="o">=</span> <span class="n">getOperation</span><span class="p">();</span> <span class="n">Schedule</span> <span class="n">schedule</span> <span class="o">=</span> <span class="n">getSchedule</span><span class="p">(</span><span class="n">funcOp</span><span class="p">);</span> <span class="k">if</span> <span class="p">(</span><span class="n">schedule</span><span class="p">.</span><span class="n">getStrategy</span><span class="p">()</span> <span class="o">==</span> <span class="n">Strategy</span><span class="o">::</span><span 
class="n">WarpSpecialize</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// Split into producer/consumer</span> <span class="k">auto</span> <span class="p">[</span><span class="n">producerOps</span><span class="p">,</span> <span class="n">consumerOps</span><span class="p">]</span> <span class="o">=</span> <span class="n">partitionOps</span><span class="p">(</span><span class="n">funcOp</span><span class="p">,</span> <span class="n">schedule</span><span class="p">);</span> <span class="c1">// Create agent regions</span> <span class="n">createAgentRegion</span><span class="p">(</span><span class="n">producerOps</span><span class="p">,</span> <span class="n">AgentRole</span><span class="o">::</span><span class="n">Producer</span><span class="p">);</span> <span class="n">createAgentRegion</span><span class="p">(</span><span class="n">consumerOps</span><span class="p">,</span> <span class="n">AgentRole</span><span class="o">::</span><span class="n">Consumer</span><span class="p">);</span> <span class="c1">// Insert synchronization</span> <span class="n">insertBarriers</span><span class="p">(</span><span class="n">funcOp</span><span class="p">,</span> <span class="n">schedule</span><span class="p">);</span> <span class="p">}</span> <span class="p">}</span> </code></pre></div> </div> <hr /> <h3 id="tileas-materialize-async">tileas-materialize-async</h3> <p>Creates async pipeline structure with multi-buffering.</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">MaterializeAsync</span><span class="o">::</span><span class="n">runOnOperation</span><span class="p">()</span> <span class="p">{</span> <span class="n">FuncOp</span> <span class="n">funcOp</span> <span class="o">=</span> <span class="n">getOperation</span><span class="p">();</span> <span class="kt">int</span> <span class="n">numStages</span> <span class="o">=</span> <span 
class="n">getOption</span><span class="p">(</span><span class="s">"num-stages"</span><span class="p">);</span> <span class="n">funcOp</span><span class="p">.</span><span class="n">walk</span><span class="p">([</span><span class="o">&amp;</span><span class="p">](</span><span class="n">scf</span><span class="o">::</span><span class="n">ForOp</span> <span class="n">forOp</span><span class="p">)</span> <span class="p">{</span> <span class="k">if</span> <span class="p">(</span><span class="n">canPipeline</span><span class="p">(</span><span class="n">forOp</span><span class="p">))</span> <span class="p">{</span> <span class="c1">// Create N buffers for N-stage pipeline</span> <span class="n">SmallVector</span><span class="o">&lt;</span><span class="n">Value</span><span class="o">&gt;</span> <span class="n">buffers</span><span class="p">;</span> <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">numStages</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span> <span class="n">buffers</span><span class="p">.</span><span class="n">push_back</span><span class="p">(</span><span class="n">allocateBuffer</span><span class="p">(</span><span class="n">forOp</span><span class="p">));</span> <span class="p">}</span> <span class="c1">// Transform loop body</span> <span class="n">emitPrologue</span><span class="p">(</span><span class="n">forOp</span><span class="p">,</span> <span class="n">buffers</span><span class="p">);</span> <span class="n">emitSteadyState</span><span class="p">(</span><span class="n">forOp</span><span class="p">,</span> <span class="n">buffers</span><span class="p">);</span> <span class="n">emitEpilogue</span><span class="p">(</span><span class="n">forOp</span><span class="p">,</span> <span 
class="n">buffers</span><span class="p">);</span> <span class="p">}</span> <span class="p">});</span> <span class="p">}</span> </code></pre></div> </div> <hr /> <h3 id="tileas-materialize-convert-layout">tileas-materialize-convert-layout</h3> <p>Expands layout conversions to actual data movement.</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">MaterializeConvertLayout</span><span class="o">::</span><span class="n">runOnOperation</span><span class="p">()</span> <span class="p">{</span> <span class="n">FuncOp</span> <span class="n">funcOp</span> <span class="o">=</span> <span class="n">getOperation</span><span class="p">();</span> <span class="n">funcOp</span><span class="p">.</span><span class="n">walk</span><span class="p">([</span><span class="o">&amp;</span><span class="p">](</span><span class="n">ConvertLayoutOp</span> <span class="n">convertOp</span><span class="p">)</span> <span class="p">{</span> <span class="k">auto</span> <span class="n">srcLayout</span> <span class="o">=</span> <span class="n">getLayout</span><span class="p">(</span><span class="n">convertOp</span><span class="p">.</span><span class="n">getSource</span><span class="p">());</span> <span class="k">auto</span> <span class="n">dstLayout</span> <span class="o">=</span> <span class="n">getLayout</span><span class="p">(</span><span class="n">convertOp</span><span class="p">.</span><span class="n">getResult</span><span class="p">());</span> <span class="c1">// Generate shuffle or shared memory path</span> <span class="k">if</span> <span class="p">(</span><span class="n">canUseShuffles</span><span class="p">(</span><span class="n">srcLayout</span><span class="p">,</span> <span class="n">dstLayout</span><span class="p">))</span> <span class="p">{</span> <span class="n">emitShuffleConversion</span><span class="p">(</span><span class="n">convertOp</span><span class="p">);</span> <span class="p">}</span> 
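<span class="c1">// (Note, added for clarity) canUseShuffles holds when the two layouts keep each</span>
<span class="c1">// element within the same warp, so registers can be exchanged directly with</span>
<span class="c1">// warp shuffles; cross-warp conversions must stage data through shared memory.</span>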
<span class="k">else</span> <span class="p">{</span> <span class="n">emitSharedMemoryConversion</span><span class="p">(</span><span class="n">convertOp</span><span class="p">);</span> <span class="p">}</span> <span class="p">});</span> <span class="p">}</span> </code></pre></div> </div> <hr /> <h3 id="tileas-attach-tma-desc-args">tileas-attach-tma-desc-args</h3> <p>Injects TMA descriptor arguments into kernel signatures.</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">AttachTMADescArgs</span><span class="o">::</span><span class="n">runOnOperation</span><span class="p">()</span> <span class="p">{</span> <span class="n">FuncOp</span> <span class="n">funcOp</span> <span class="o">=</span> <span class="n">getOperation</span><span class="p">();</span> <span class="n">SmallVector</span><span class="o">&lt;</span><span class="n">TMAOp</span><span class="o">&gt;</span> <span class="n">tmaOps</span><span class="p">;</span> <span class="n">funcOp</span><span class="p">.</span><span class="n">walk</span><span class="p">([</span><span class="o">&amp;</span><span class="p">](</span><span class="n">Operation</span><span class="o">*</span> <span class="n">op</span><span class="p">)</span> <span class="p">{</span> <span class="k">if</span> <span class="p">(</span><span class="n">usesTMA</span><span class="p">(</span><span class="n">op</span><span class="p">))</span> <span class="n">tmaOps</span><span class="p">.</span><span class="n">push_back</span><span class="p">(</span><span class="n">cast</span><span class="o">&lt;</span><span class="n">TMAOp</span><span class="o">&gt;</span><span class="p">(</span><span class="n">op</span><span class="p">));</span> <span class="p">});</span> <span class="k">for</span> <span class="p">(</span><span class="k">auto</span><span class="o">&amp;</span> <span class="n">tmaOp</span> <span class="o">:</span> <span class="n">tmaOps</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// Create TMA descriptor type</span> <span class="k">auto</span> <span
class="n">descType</span> <span class="o">=</span> <span class="n">TMADescriptorType</span><span class="o">::</span><span class="n">get</span><span class="p">(</span> <span class="n">tmaOp</span><span class="p">.</span><span class="n">getShape</span><span class="p">(),</span> <span class="n">tmaOp</span><span class="p">.</span><span class="n">getElementType</span><span class="p">(),</span> <span class="n">tmaOp</span><span class="p">.</span><span class="n">getSwizzle</span><span class="p">()</span> <span class="p">);</span> <span class="c1">// Add to function arguments</span> <span class="n">funcOp</span><span class="p">.</span><span class="n">insertArgument</span><span class="p">(</span><span class="n">descType</span><span class="p">,</span> <span class="s">"tma_desc"</span><span class="p">);</span> <span class="p">}</span> <span class="p">}</span> </code></pre></div> </div> <hr /> <h3 id="tileas-slicing">tileas-slicing</h3> <p>Slices tensors for pipelined execution.</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">TileASSlicing</span><span class="o">::</span><span class="n">runOnOperation</span><span class="p">()</span> <span class="p">{</span> <span class="n">FuncOp</span> <span class="n">funcOp</span> <span class="o">=</span> <span class="n">getOperation</span><span class="p">();</span> <span class="n">funcOp</span><span class="p">.</span><span class="n">walk</span><span class="p">([</span><span class="o">&amp;</span><span class="p">](</span><span class="n">LoadOp</span> <span class="n">loadOp</span><span class="p">)</span> <span class="p">{</span> <span class="k">auto</span> <span class="n">tensorType</span> <span class="o">=</span> <span class="n">loadOp</span><span class="p">.</span><span class="n">getResult</span><span class="p">().</span><span class="n">getType</span><span class="p">();</span> <span class="kt">int</span> <span class="n">sliceDim</span> <span 
class="o">=</span> <span class="n">getSliceDimension</span><span class="p">(</span><span class="n">loadOp</span><span class="p">);</span> <span class="kt">int</span> <span class="n">sliceSize</span> <span class="o">=</span> <span class="n">computeSliceSize</span><span class="p">(</span><span class="n">tensorType</span><span class="p">,</span> <span class="n">sliceDim</span><span class="p">);</span> <span class="kt">int</span> <span class="n">numSlices</span> <span class="o">=</span> <span class="n">tensorType</span><span class="p">.</span><span class="n">getDimSize</span><span class="p">(</span><span class="n">sliceDim</span><span class="p">)</span> <span class="o">/</span> <span class="n">sliceSize</span><span class="p">;</span> <span class="c1">// Replace single load with sliced loads</span> <span class="n">SmallVector</span><span class="o">&lt;</span><span class="n">Value</span><span class="o">&gt;</span> <span class="n">slices</span><span class="p">;</span> <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">numSlices</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span> <span class="k">auto</span> <span class="n">slice</span> <span class="o">=</span> <span class="n">builder</span><span class="p">.</span><span class="n">create</span><span class="o">&lt;</span><span class="n">SlicedLoadOp</span><span class="o">&gt;</span><span class="p">(</span> <span class="n">loadOp</span><span class="p">.</span><span class="n">getSource</span><span class="p">(),</span> <span class="n">sliceDim</span><span class="p">,</span> <span class="n">i</span> <span class="o">*</span> <span class="n">sliceSize</span><span class="p">,</span> <span class="n">sliceSize</span> <span class="p">);</span> <span class="n">slices</span><span class="p">.</span><span class="n">push_back</span><span class="p">(</span><span class="n">slice</span><span class="p">);</span> <span class="p">}</span> <span class="p">});</span> <span class="p">}</span> </code></pre></div> </div> <hr /> <h3 id="tileas-plan-cta">tileas-plan-cta</h3> <p>Plans CTA (thread block)
configuration.</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">PlanCTA</span><span class="o">::</span><span class="n">runOnOperation</span><span class="p">()</span> <span class="p">{</span> <span class="n">FuncOp</span> <span class="n">funcOp</span> <span class="o">=</span> <span class="n">getOperation</span><span class="p">();</span> <span class="c1">// Analyze resource requirements</span> <span class="kt">int</span> <span class="n">smemRequired</span> <span class="o">=</span> <span class="n">analyzeSharedMemory</span><span class="p">(</span><span class="n">funcOp</span><span class="p">);</span> <span class="kt">int</span> <span class="n">regsRequired</span> <span class="o">=</span> <span class="n">analyzeRegisters</span><span class="p">(</span><span class="n">funcOp</span><span class="p">);</span> <span class="c1">// Compute optimal CTA shape</span> <span class="n">CTAConfig</span> <span class="n">config</span> <span class="o">=</span> <span class="n">computeCTAConfig</span><span class="p">(</span> <span class="n">smemRequired</span><span class="p">,</span> <span class="n">regsRequired</span><span class="p">,</span> <span class="n">targetOccupancy</span> <span class="p">);</span> <span class="n">funcOp</span><span class="o">-&gt;</span><span class="n">setAttr</span><span class="p">(</span><span class="s">"cta_shape"</span><span class="p">,</span> <span class="n">config</span><span class="p">.</span><span class="n">toAttribute</span><span class="p">());</span> <span class="p">}</span> </code></pre></div> </div> <hr /> <h3 id="tileas-resolve-agent-boundary">tileas-resolve-agent-boundary</h3> <p>Resolves data flow across warp specialization boundaries.</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">ResolveAgentBoundary</span><span class="o">::</span><span 
class="n">runOnOperation</span><span class="p">()</span> <span class="p">{</span> <span class="n">FuncOp</span> <span class="n">funcOp</span> <span class="o">=</span> <span class="n">getOperation</span><span class="p">();</span> <span class="n">funcOp</span><span class="p">.</span><span class="n">walk</span><span class="p">([</span><span class="o">&amp;</span><span class="p">](</span><span class="n">AgentSwitchOp</span> <span class="n">switchOp</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// Identify values crossing boundary</span> <span class="n">SmallVector</span><span class="o">&lt;</span><span class="n">Value</span><span class="o">&gt;</span> <span class="n">crossingValues</span><span class="p">;</span> <span class="k">for</span> <span class="p">(</span><span class="n">Value</span> <span class="n">v</span> <span class="o">:</span> <span class="n">switchOp</span><span class="p">.</span><span class="n">getOperands</span><span class="p">())</span> <span class="p">{</span> <span class="k">if</span> <span class="p">(</span><span class="n">crossesBoundary</span><span class="p">(</span><span class="n">v</span><span class="p">,</span> <span class="n">switchOp</span><span class="p">))</span> <span class="p">{</span> <span class="n">crossingValues</span><span class="p">.</span><span class="n">push_back</span><span class="p">(</span><span class="n">v</span><span class="p">);</span> <span class="p">}</span> <span class="p">}</span> <span class="c1">// Insert shared memory communication</span> <span class="k">for</span> <span class="p">(</span><span class="n">Value</span> <span class="n">v</span> <span class="o">:</span> <span class="n">crossingValues</span><span class="p">)</span> <span class="p">{</span> <span class="n">insertSharedMemoryTransfer</span><span class="p">(</span><span class="n">v</span><span class="p">,</span> <span class="n">switchOp</span><span class="p">);</span> <span class="p">}</span> <span class="p">});</span> <span 
class="p">}</span> </code></pre></div> </div> <hr /> <h3 id="tileas-remove-buffer-alias">tileas-remove-buffer-alias</h3> <p>Removes buffer aliasing using fixed-point iteration.</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">RemoveBufferAlias</span><span class="o">::</span><span class="n">runOnOperation</span><span class="p">()</span> <span class="p">{</span> <span class="n">FuncOp</span> <span class="n">funcOp</span> <span class="o">=</span> <span class="n">getOperation</span><span class="p">();</span> <span class="kt">bool</span> <span class="n">changed</span> <span class="o">=</span> <span class="nb">true</span><span class="p">;</span> <span class="k">while</span> <span class="p">(</span><span class="n">changed</span><span class="p">)</span> <span class="p">{</span> <span class="n">changed</span> <span class="o">=</span> <span class="nb">false</span><span class="p">;</span> <span class="n">funcOp</span><span class="p">.</span><span class="n">walk</span><span class="p">([</span><span class="o">&amp;</span><span class="p">](</span><span class="n">AllocTensorOp</span> <span class="n">allocOp</span><span class="p">)</span> <span class="p">{</span> <span class="k">for</span> <span class="p">(</span><span class="k">auto</span><span class="o">&amp;</span> <span class="n">use</span> <span class="o">:</span> <span class="n">allocOp</span><span class="p">.</span><span class="n">getResult</span><span class="p">().</span><span class="n">getUses</span><span class="p">())</span> <span class="p">{</span> <span class="k">if</span> <span class="p">(</span><span class="n">isAliasingUse</span><span class="p">(</span><span class="n">use</span><span class="p">))</span> <span class="p">{</span> <span class="n">createNonAliasingBuffer</span><span class="p">(</span><span class="n">use</span><span class="p">);</span> <span class="n">changed</span> <span class="o">=</span> <span 
class="nb">true</span><span class="p">;</span> <span class="p">}</span> <span class="p">}</span> <span class="p">});</span> <span class="p">}</span> <span class="p">}</span> </code></pre></div> </div> <hr /> <h3 id="tileas-remove-dead-args">tileas-remove-dead-args</h3> <p>Removes unused arguments from region operations.</p> <hr /> <h3 id="tileas-remove-layout-conversions">tileas-remove-layout-conversions</h3> <p>Eliminates redundant layout conversions.</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">RemoveLayoutConversions</span><span class="o">::</span><span class="n">runOnOperation</span><span class="p">()</span> <span class="p">{</span> <span class="n">FuncOp</span> <span class="n">funcOp</span> <span class="o">=</span> <span class="n">getOperation</span><span class="p">();</span> <span class="n">funcOp</span><span class="p">.</span><span class="n">walk</span><span class="p">([</span><span class="o">&amp;</span><span class="p">](</span><span class="n">ConvertLayoutOp</span> <span class="n">convertOp</span><span class="p">)</span> <span class="p">{</span> <span class="k">auto</span> <span class="n">srcLayout</span> <span class="o">=</span> <span class="n">getLayout</span><span class="p">(</span><span class="n">convertOp</span><span class="p">.</span><span class="n">getSource</span><span class="p">());</span> <span class="k">auto</span> <span class="n">dstLayout</span> <span class="o">=</span> <span class="n">getLayout</span><span class="p">(</span><span class="n">convertOp</span><span class="p">.</span><span class="n">getResult</span><span class="p">());</span> <span class="c1">// Remove identity conversions</span> <span class="k">if</span> <span class="p">(</span><span class="n">srcLayout</span> <span class="o">==</span> <span class="n">dstLayout</span><span class="p">)</span> <span class="p">{</span> <span class="n">convertOp</span><span class="p">.</span><span 
class="n">replaceAllUsesWith</span><span class="p">(</span><span class="n">convertOp</span><span class="p">.</span><span class="n">getSource</span><span class="p">());</span> <span class="n">convertOp</span><span class="p">.</span><span class="n">erase</span><span class="p">();</span> <span class="p">}</span> <span class="p">});</span> <span class="p">}</span> </code></pre></div> </div> <hr /> <h3 id="tileas-optimize-alloc-tensor">tileas-optimize-alloc-tensor</h3> <p>Optimizes tensor allocations through reuse and elimination.</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">OptimizeAllocTensor</span><span class="o">::</span><span class="n">runOnOperation</span><span class="p">()</span> <span class="p">{</span> <span class="n">FuncOp</span> <span class="n">funcOp</span> <span class="o">=</span> <span class="n">getOperation</span><span class="p">();</span> <span class="n">LivenessAnalysis</span> <span class="n">liveness</span><span class="p">(</span><span class="n">funcOp</span><span class="p">);</span> <span class="n">SmallVector</span><span class="o">&lt;</span><span class="n">AllocTensorOp</span><span class="o">&gt;</span> <span class="n">allocs</span><span class="p">;</span> <span class="n">funcOp</span><span class="p">.</span><span class="n">walk</span><span class="p">([</span><span class="o">&amp;</span><span class="p">](</span><span class="n">AllocTensorOp</span> <span class="n">op</span><span class="p">)</span> <span class="p">{</span> <span class="n">allocs</span><span class="p">.</span><span class="n">push_back</span><span class="p">(</span><span class="n">op</span><span class="p">);</span> <span class="p">});</span> <span class="k">for</span> <span class="p">(</span><span class="k">auto</span><span class="o">&amp;</span> <span class="n">alloc</span> <span class="o">:</span> <span class="n">allocs</span><span class="p">)</span> <span class="p">{</span> <span 
class="c1">// Find reusable buffer</span> <span class="k">if</span> <span class="p">(</span><span class="k">auto</span> <span class="n">reusable</span> <span class="o">=</span> <span class="n">findReusableBuffer</span><span class="p">(</span><span class="n">alloc</span><span class="p">,</span> <span class="n">liveness</span><span class="p">))</span> <span class="p">{</span> <span class="n">reuseBuffer</span><span class="p">(</span><span class="n">alloc</span><span class="p">,</span> <span class="n">reusable</span><span class="p">);</span> <span class="p">}</span> <span class="p">}</span> <span class="p">}</span> </code></pre></div> </div> <hr /> <h3 id="tileas-optimize-reduce">tileas-optimize-reduce</h3> <p>Optimizes reduction operations with warp shuffle or shared memory.</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">OptimizeReduce</span><span class="o">::</span><span class="n">runOnOperation</span><span class="p">()</span> <span class="p">{</span> <span class="n">FuncOp</span> <span class="n">funcOp</span> <span class="o">=</span> <span class="n">getOperation</span><span class="p">();</span> <span class="n">funcOp</span><span class="p">.</span><span class="n">walk</span><span class="p">([</span><span class="o">&amp;</span><span class="p">](</span><span class="n">ReduceOp</span> <span class="n">reduceOp</span><span class="p">)</span> <span class="p">{</span> <span class="kt">int</span> <span class="n">reductionSize</span> <span class="o">=</span> <span class="n">getReductionSize</span><span class="p">(</span><span class="n">reduceOp</span><span class="p">);</span> <span class="k">if</span> <span class="p">(</span><span class="n">reductionSize</span> <span class="o">&lt;=</span> <span class="mi">32</span><span class="p">)</span> <span class="p">{</span> <span class="n">setAtom</span><span class="p">(</span><span class="n">reduceOp</span><span class="p">,</span> <span 
class="s">"warp_shuffle"</span><span class="p">);</span> <span class="p">}</span> <span class="k">else</span> <span class="nf">if</span> <span class="p">(</span><span class="n">reductionSize</span> <span class="o">&lt;=</span> <span class="mi">1024</span><span class="p">)</span> <span class="p">{</span> <span class="n">setAtom</span><span class="p">(</span><span class="n">reduceOp</span><span class="p">,</span> <span class="s">"shared_memory"</span><span class="p">);</span> <span class="p">}</span> <span class="k">else</span> <span class="p">{</span> <span class="n">setAtom</span><span class="p">(</span><span class="n">reduceOp</span><span class="p">,</span> <span class="s">"multi_stage"</span><span class="p">);</span> <span class="p">}</span> <span class="p">});</span> <span class="p">}</span> </code></pre></div> </div> <hr /> <h3 id="tileas-optimize-dot-accumulation">tileas-optimize-dot-accumulation</h3> <p>Optimizes MMA accumulation patterns for better register utilization.</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">OptimizeDotAccumulation</span><span class="o">::</span><span class="n">runOnOperation</span><span class="p">()</span> <span class="p">{</span> <span class="n">FuncOp</span> <span class="n">funcOp</span> <span class="o">=</span> <span class="n">getOperation</span><span class="p">();</span> <span class="n">funcOp</span><span class="p">.</span><span class="n">walk</span><span class="p">([</span><span class="o">&amp;</span><span class="p">](</span><span class="n">DotOp</span> <span class="n">dotOp</span><span class="p">)</span> <span class="p">{</span> <span class="k">auto</span> <span class="n">accumPattern</span> <span class="o">=</span> <span class="n">analyzeAccumulationPattern</span><span class="p">(</span><span class="n">dotOp</span><span class="p">);</span> <span class="k">switch</span> <span class="p">(</span><span 
class="n">accumPattern</span><span class="p">)</span> <span class="p">{</span> <span class="k">case</span> <span class="n">AccumPattern</span><span class="o">::</span><span class="n">SimpleLoop</span><span class="p">:</span> <span class="n">optimizeSimpleAccumulation</span><span class="p">(</span><span class="n">dotOp</span><span class="p">);</span> <span class="k">break</span><span class="p">;</span> <span class="k">case</span> <span class="n">AccumPattern</span><span class="o">::</span><span class="n">SplitK</span><span class="p">:</span> <span class="n">optimizeSplitKAccumulation</span><span class="p">(</span><span class="n">dotOp</span><span class="p">);</span> <span class="k">break</span><span class="p">;</span> <span class="k">case</span> <span class="n">AccumPattern</span><span class="o">::</span><span class="n">StreamK</span><span class="p">:</span> <span class="n">optimizeStreamKAccumulation</span><span class="p">(</span><span class="n">dotOp</span><span class="p">);</span> <span class="k">break</span><span class="p">;</span> <span class="p">}</span> <span class="p">});</span> <span class="p">}</span> </code></pre></div> </div> <hr /> <h3 id="tileas-recompute-for-scheduling">tileas-recompute-for-scheduling</h3> <p>Trades recomputation for reduced register pressure.</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">TileASRecomputeForScheduling</span><span class="o">::</span><span class="n">runOnOperation</span><span class="p">()</span> <span class="p">{</span> <span class="n">FuncOp</span> <span class="n">funcOp</span> <span class="o">=</span> <span class="n">getOperation</span><span class="p">();</span> <span class="n">RegisterPressureAnalysis</span> <span class="n">regPressure</span><span class="p">(</span><span class="n">funcOp</span><span class="p">);</span> <span class="n">funcOp</span><span class="p">.</span><span class="n">walk</span><span 
class="p">([</span><span class="o">&amp;</span><span class="p">](</span><span class="n">Operation</span><span class="o">*</span> <span class="n">op</span><span class="p">)</span> <span class="p">{</span> <span class="k">for</span> <span class="p">(</span><span class="n">Value</span> <span class="n">result</span> <span class="o">:</span> <span class="n">op</span><span class="o">-&gt;</span><span class="n">getResults</span><span class="p">())</span> <span class="p">{</span> <span class="k">if</span> <span class="p">(</span><span class="n">shouldRecompute</span><span class="p">(</span><span class="n">result</span><span class="p">,</span> <span class="n">regPressure</span><span class="p">))</span> <span class="p">{</span> <span class="n">markForRecomputation</span><span class="p">(</span><span class="n">result</span><span class="p">);</span> <span class="p">}</span> <span class="p">}</span> <span class="p">});</span> <span class="n">applyRecomputations</span><span class="p">(</span><span class="n">funcOp</span><span class="p">);</span> <span class="p">}</span> <span class="kt">bool</span> <span class="nf">shouldRecompute</span><span class="p">(</span><span class="n">Value</span> <span class="n">v</span><span class="p">,</span> <span class="n">RegisterPressureAnalysis</span><span class="o">&amp;</span> <span class="n">rpa</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// Recompute if value is cheap but keeping it live causes spills</span> <span class="kt">int</span> <span class="n">computeCost</span> <span class="o">=</span> <span class="n">estimateComputeCost</span><span class="p">(</span><span class="n">v</span><span class="p">.</span><span class="n">getDefiningOp</span><span class="p">());</span> <span class="kt">int</span> <span class="n">spillCost</span> <span class="o">=</span> <span class="n">rpa</span><span class="p">.</span><span class="n">estimateSpillCost</span><span class="p">(</span><span class="n">v</span><span class="p">);</span> 
<span class="k">return</span> <span class="n">computeCost</span> <span class="o">&lt;</span> <span class="n">spillCost</span><span class="p">;</span> <span class="p">}</span> </code></pre></div> </div> <hr /> <h3 id="tileas-legalize-fma-dot">tileas-legalize-fma-dot</h3> <p>Ensures FMA operations match hardware capabilities.</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">LegalizeFmaDot</span><span class="o">::</span><span class="n">runOnOperation</span><span class="p">()</span> <span class="p">{</span> <span class="n">FuncOp</span> <span class="n">funcOp</span> <span class="o">=</span> <span class="n">getOperation</span><span class="p">();</span> <span class="n">funcOp</span><span class="p">.</span><span class="n">walk</span><span class="p">([</span><span class="o">&amp;</span><span class="p">](</span><span class="n">DotOp</span> <span class="n">dotOp</span><span class="p">)</span> <span class="p">{</span> <span class="k">if</span> <span class="p">(</span><span class="n">hasFmaAccumulation</span><span class="p">(</span><span class="n">dotOp</span><span class="p">))</span> <span class="p">{</span> <span class="n">legalizeFma</span><span class="p">(</span><span class="n">dotOp</span><span class="p">);</span> <span class="p">}</span> <span class="p">});</span> <span class="p">}</span> <span class="kt">void</span> <span class="nf">legalizeFma</span><span class="p">(</span><span class="n">DotOp</span> <span class="n">dotOp</span><span class="p">)</span> <span class="p">{</span> <span class="k">auto</span> <span class="n">accType</span> <span class="o">=</span> <span class="n">dotOp</span><span class="p">.</span><span class="n">getAccumulator</span><span class="p">().</span><span class="n">getType</span><span class="p">();</span> <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">isLegalAccumulatorType</span><span class="p">(</span><span 
class="n">accType</span><span class="p">))</span> <span class="p">{</span> <span class="k">auto</span> <span class="n">legalType</span> <span class="o">=</span> <span class="n">getLegalAccumulatorType</span><span class="p">(</span><span class="n">accType</span><span class="p">);</span> <span class="n">insertAccumulatorConversion</span><span class="p">(</span><span class="n">dotOp</span><span class="p">,</span> <span class="n">legalType</span><span class="p">);</span> <span class="p">}</span> <span class="k">if</span> <span class="p">(</span><span class="n">isMixedPrecision</span><span class="p">(</span><span class="n">dotOp</span><span class="p">))</span> <span class="p">{</span> <span class="n">legalizeMixedPrecisionFma</span><span class="p">(</span><span class="n">dotOp</span><span class="p">);</span> <span class="p">}</span> <span class="p">}</span> </code></pre></div> </div> <hr /> <h3 id="tileas-legalize-reduce">tileas-legalize-reduce</h3> <p>Ensures reductions use supported types and sizes.</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">LegalizeReduce</span><span class="o">::</span><span class="n">runOnOperation</span><span class="p">()</span> <span class="p">{</span> <span class="n">FuncOp</span> <span class="n">funcOp</span> <span class="o">=</span> <span class="n">getOperation</span><span class="p">();</span> <span class="n">funcOp</span><span class="p">.</span><span class="n">walk</span><span class="p">([</span><span class="o">&amp;</span><span class="p">](</span><span class="n">ReduceOp</span> <span class="n">reduceOp</span><span class="p">)</span> <span class="p">{</span> <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">isLegalReduction</span><span class="p">(</span><span class="n">reduceOp</span><span class="p">))</span> <span class="p">{</span> <span class="n">legalizeReduction</span><span class="p">(</span><span 
class="n">reduceOp</span><span class="p">);</span> <span class="p">}</span> <span class="p">});</span> <span class="p">}</span> <span class="kt">void</span> <span class="nf">legalizeReduction</span><span class="p">(</span><span class="n">ReduceOp</span> <span class="n">reduceOp</span><span class="p">)</span> <span class="p">{</span> <span class="k">auto</span> <span class="n">inputType</span> <span class="o">=</span> <span class="n">reduceOp</span><span class="p">.</span><span class="n">getInput</span><span class="p">().</span><span class="n">getType</span><span class="p">();</span> <span class="k">auto</span> <span class="n">reductionKind</span> <span class="o">=</span> <span class="n">reduceOp</span><span class="p">.</span><span class="n">getReductionKind</span><span class="p">();</span> <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">isSupportedElementType</span><span class="p">(</span><span class="n">inputType</span><span class="p">.</span><span class="n">getElementType</span><span class="p">()))</span> <span class="p">{</span> <span class="n">insertTypeConversion</span><span class="p">(</span><span class="n">reduceOp</span><span class="p">);</span> <span class="p">}</span> <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">isSupportedReductionSize</span><span class="p">(</span><span class="n">inputType</span><span class="p">,</span> <span class="n">reduceOp</span><span class="p">.</span><span class="n">getReductionDim</span><span class="p">()))</span> <span class="p">{</span> <span class="n">splitReduction</span><span class="p">(</span><span class="n">reduceOp</span><span class="p">);</span> <span class="p">}</span> <span class="p">}</span> </code></pre></div> </div> <hr /> <h3 id="tileas-legalize-tmem-copy">tileas-legalize-tmem-copy</h3> <p>Legalizes tensor memory (tmem) copy operations. 
Tensor memory is dedicated storage for tensor core operands.</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">TileASLegalizeTmemCopy</span><span class="o">::</span><span class="n">runOnOperation</span><span class="p">()</span> <span class="p">{</span> <span class="n">FuncOp</span> <span class="n">funcOp</span> <span class="o">=</span> <span class="n">getOperation</span><span class="p">();</span> <span class="n">funcOp</span><span class="p">.</span><span class="n">walk</span><span class="p">([</span><span class="o">&amp;</span><span class="p">](</span><span class="n">Operation</span><span class="o">*</span> <span class="n">op</span><span class="p">)</span> <span class="p">{</span> <span class="k">if</span> <span class="p">(</span><span class="k">auto</span> <span class="n">copyOp</span> <span class="o">=</span> <span class="n">dyn_cast</span><span class="o">&lt;</span><span class="n">CopyOp</span><span class="o">&gt;</span><span class="p">(</span><span class="n">op</span><span class="p">))</span> <span class="p">{</span> <span class="k">if</span> <span class="p">(</span><span class="n">involvesTmem</span><span class="p">(</span><span class="n">copyOp</span><span class="p">))</span> <span class="p">{</span> <span class="n">legalizeTmemCopy</span><span class="p">(</span><span class="n">copyOp</span><span class="p">);</span> <span class="p">}</span> <span class="p">}</span> <span class="p">});</span> <span class="p">}</span> <span class="kt">void</span> <span class="nf">legalizeTmemCopy</span><span class="p">(</span><span class="n">CopyOp</span> <span class="n">copyOp</span><span class="p">)</span> <span class="p">{</span> <span class="k">auto</span> <span class="n">srcLayout</span> <span class="o">=</span> <span class="n">getLayout</span><span class="p">(</span><span class="n">copyOp</span><span class="p">.</span><span class="n">getSource</span><span class="p">());</span> 
<span class="k">auto</span> <span class="n">dstLayout</span> <span class="o">=</span> <span class="n">getLayout</span><span class="p">(</span><span class="n">copyOp</span><span class="p">.</span><span class="n">getDest</span><span class="p">());</span> <span class="c1">// Infer register layout from tmem layout</span> <span class="k">auto</span> <span class="n">regLayout</span> <span class="o">=</span> <span class="n">inferRegisterLayoutFromTmem</span><span class="p">(</span><span class="n">srcLayout</span><span class="p">);</span> <span class="c1">// Insert necessary layout conversions</span> <span class="k">if</span> <span class="p">(</span><span class="n">needsConversion</span><span class="p">(</span><span class="n">srcLayout</span><span class="p">,</span> <span class="n">regLayout</span><span class="p">))</span> <span class="p">{</span> <span class="n">insertLayoutConversion</span><span class="p">(</span><span class="n">copyOp</span><span class="p">,</span> <span class="n">srcLayout</span><span class="p">,</span> <span class="n">regLayout</span><span class="p">);</span> <span class="p">}</span> <span class="p">}</span> </code></pre></div> </div> <hr /> <h3 id="tileas-slice-and-fuse">tileas-slice-and-fuse</h3> <p>Applies loop tiling (slicing) and fusion for improved data locality.</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">SliceAndFuse</span><span class="o">::</span><span class="n">runOnOperation</span><span class="p">()</span> <span class="p">{</span> <span class="n">FuncOp</span> <span class="n">funcOp</span> <span class="o">=</span> <span class="n">getOperation</span><span class="p">();</span> <span class="n">SmallVector</span><span class="o">&lt;</span><span class="n">FusionGroup</span><span class="o">&gt;</span> <span class="n">fusionGroups</span><span class="p">;</span> <span class="n">collectFusionCandidates</span><span class="p">(</span><span 
class="n">funcOp</span><span class="p">,</span> <span class="n">fusionGroups</span><span class="p">);</span> <span class="k">for</span> <span class="p">(</span><span class="k">auto</span><span class="o">&amp;</span> <span class="n">group</span> <span class="o">:</span> <span class="n">fusionGroups</span><span class="p">)</span> <span class="p">{</span> <span class="k">auto</span> <span class="n">sliceSize</span> <span class="o">=</span> <span class="n">computeOptimalSliceSize</span><span class="p">(</span><span class="n">group</span><span class="p">);</span> <span class="n">sliceOperations</span><span class="p">(</span><span class="n">group</span><span class="p">,</span> <span class="n">sliceSize</span><span class="p">);</span> <span class="n">fuseOperations</span><span class="p">(</span><span class="n">group</span><span class="p">);</span> <span class="p">}</span> <span class="p">}</span> <span class="kt">void</span> <span class="nf">fuseOperations</span><span class="p">(</span><span class="n">FusionGroup</span><span class="o">&amp;</span> <span class="n">group</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// Create fused loop nest</span> <span class="c1">// - Single loop iterating over slices</span> <span class="c1">// - Multiple operations per slice iteration</span> <span class="k">auto</span> <span class="n">fusedLoop</span> <span class="o">=</span> <span class="n">createFusedLoop</span><span class="p">(</span><span class="n">group</span><span class="p">);</span> <span class="k">for</span> <span class="p">(</span><span class="k">auto</span><span class="o">*</span> <span class="n">op</span> <span class="o">:</span> <span class="n">group</span><span class="p">.</span><span class="n">getOperations</span><span class="p">())</span> <span class="p">{</span> <span class="n">moveIntoFusedLoop</span><span class="p">(</span><span class="n">op</span><span class="p">,</span> <span class="n">fusedLoop</span><span class="p">);</span> <span 
class="p">}</span> <span class="p">}</span> </code></pre></div> </div> <hr /> <h3 id="tileas-refine-atom-by-resource">tileas-refine-atom-by-resource</h3> <p>Adjusts operation granularity (“atom”) based on available hardware resources.</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">RefineAtomByResource</span><span class="o">::</span><span class="n">runOnOperation</span><span class="p">()</span> <span class="p">{</span> <span class="n">FuncOp</span> <span class="n">funcOp</span> <span class="o">=</span> <span class="n">getOperation</span><span class="p">();</span> <span class="k">auto</span> <span class="n">resources</span> <span class="o">=</span> <span class="n">getTargetResources</span><span class="p">(</span><span class="n">funcOp</span><span class="p">);</span> <span class="n">funcOp</span><span class="p">.</span><span class="n">walk</span><span class="p">([</span><span class="o">&amp;</span><span class="p">](</span><span class="n">Operation</span><span class="o">*</span> <span class="n">op</span><span class="p">)</span> <span class="p">{</span> <span class="k">if</span> <span class="p">(</span><span class="n">hasAtomAttribute</span><span class="p">(</span><span class="n">op</span><span class="p">))</span> <span class="p">{</span> <span class="n">refineAtom</span><span class="p">(</span><span class="n">op</span><span class="p">,</span> <span class="n">resources</span><span class="p">);</span> <span class="p">}</span> <span class="p">});</span> <span class="p">}</span> <span class="kt">void</span> <span class="nf">refineAtom</span><span class="p">(</span><span class="n">Operation</span><span class="o">*</span> <span class="n">op</span><span class="p">,</span> <span class="n">ResourceConstraints</span><span class="o">&amp;</span> <span class="n">resources</span><span class="p">)</span> <span class="p">{</span> <span class="k">auto</span> <span 
class="n">currentAtom</span> <span class="o">=</span> <span class="n">getAtom</span><span class="p">(</span><span class="n">op</span><span class="p">);</span> <span class="kt">int</span> <span class="n">smemRequired</span> <span class="o">=</span> <span class="n">estimateSmemUsage</span><span class="p">(</span><span class="n">op</span><span class="p">,</span> <span class="n">currentAtom</span><span class="p">);</span> <span class="kt">int</span> <span class="n">regsRequired</span> <span class="o">=</span> <span class="n">estimateRegUsage</span><span class="p">(</span><span class="n">op</span><span class="p">,</span> <span class="n">currentAtom</span><span class="p">);</span> <span class="c1">// Refine if over resource limits (SM120: 228KB smem, 65536 regs)</span> <span class="k">if</span> <span class="p">(</span><span class="n">smemRequired</span> <span class="o">&gt;</span> <span class="n">resources</span><span class="p">.</span><span class="n">maxSmem</span> <span class="o">||</span> <span class="n">regsRequired</span> <span class="o">&gt;</span> <span class="n">resources</span><span class="p">.</span><span class="n">maxRegs</span><span class="p">)</span> <span class="p">{</span> <span class="k">auto</span> <span class="n">refinedAtom</span> <span class="o">=</span> <span class="n">findSmallerAtom</span><span class="p">(</span><span class="n">op</span><span class="p">,</span> <span class="n">resources</span><span class="p">);</span> <span class="n">setAtom</span><span class="p">(</span><span class="n">op</span><span class="p">,</span> <span class="n">refinedAtom</span><span class="p">);</span> <span class="p">}</span> <span class="p">}</span> </code></pre></div> </div> <hr /> <h3 id="tileas-prepare-for-scheduling">tileas-prepare-for-scheduling</h3> <p>Normalizes IR and annotates operation latencies for the scheduler.</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span 
class="n">PrepareForScheduling</span><span class="o">::</span><span class="n">runOnOperation</span><span class="p">()</span> <span class="p">{</span> <span class="n">FuncOp</span> <span class="n">funcOp</span> <span class="o">=</span> <span class="n">getOperation</span><span class="p">();</span> <span class="n">normalizeLoops</span><span class="p">(</span><span class="n">funcOp</span><span class="p">);</span> <span class="n">insertSchedulingAnchors</span><span class="p">(</span><span class="n">funcOp</span><span class="p">);</span> <span class="n">annotateLatencies</span><span class="p">(</span><span class="n">funcOp</span><span class="p">);</span> <span class="n">identifyBarriers</span><span class="p">(</span><span class="n">funcOp</span><span class="p">);</span> <span class="p">}</span> <span class="kt">void</span> <span class="nf">annotateLatencies</span><span class="p">(</span><span class="n">FuncOp</span> <span class="n">funcOp</span><span class="p">)</span> <span class="p">{</span> <span class="n">funcOp</span><span class="p">.</span><span class="n">walk</span><span class="p">([</span><span class="o">&amp;</span><span class="p">](</span><span class="n">Operation</span><span class="o">*</span> <span class="n">op</span><span class="p">)</span> <span class="p">{</span> <span class="kt">int</span> <span class="n">latency</span> <span class="o">=</span> <span class="n">estimateLatency</span><span class="p">(</span><span class="n">op</span><span class="p">);</span> <span class="n">op</span><span class="o">-&gt;</span><span class="n">setAttr</span><span class="p">(</span><span class="s">"sched.latency"</span><span class="p">,</span> <span class="n">builder</span><span class="p">.</span><span class="n">getI64IntegerAttr</span><span class="p">(</span><span class="n">latency</span><span class="p">));</span> <span class="p">});</span> <span class="p">}</span> </code></pre></div> </div> <hr /> <h3 id="tileas-unroll-register-loops">tileas-unroll-register-loops</h3> 
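<p>Before the pass pseudocode, here is a toy sketch of what full unrolling does to a loop over a register-resident tile. This is illustrative Python only, not TileIR internals; <code class="language-plaintext highlighter-rouge">unroll_register_loop</code> and its template format are hypothetical names invented for this example.</p>

```python
# Toy model of full unrolling (hypothetical helper, not TileIR code):
# a loop whose body indexes a register tile with the induction variable
# is replaced by one statement per iteration, each using a constant index.
def unroll_register_loop(trip_count, body_template):
    """body_template uses {i} where the induction variable appeared."""
    return [body_template.format(i=i) for i in range(trip_count)]

# A 4-iteration loop over a register-resident accumulator becomes
# four statically indexed statements:
unrolled = unroll_register_loop(4, "acc[{i}] += a[{i}] * b[{i}]")
print(unrolled[0])  # acc[0] += a[0] * b[0]
```

<p>The point of the sketch: after unrolling, every register access uses a compile-time constant index, which is what the hardware requires.</p>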
<p>Unrolls loops that access register-resident tensors (required since GPU registers cannot be dynamically indexed).</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">TileASUnrollRegisterLoops</span><span class="o">::</span><span class="n">runOnOperation</span><span class="p">()</span> <span class="p">{</span> <span class="n">FuncOp</span> <span class="n">funcOp</span> <span class="o">=</span> <span class="n">getOperation</span><span class="p">();</span> <span class="n">funcOp</span><span class="p">.</span><span class="n">walk</span><span class="p">([</span><span class="o">&amp;</span><span class="p">](</span><span class="n">scf</span><span class="o">::</span><span class="n">ForOp</span> <span class="n">forOp</span><span class="p">)</span> <span class="p">{</span> <span class="k">if</span> <span class="p">(</span><span class="n">accessesRegisterTensors</span><span class="p">(</span><span class="n">forOp</span><span class="p">))</span> <span class="p">{</span> <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">canAvoidUnroll</span><span class="p">(</span><span class="n">forOp</span><span class="p">))</span> <span class="p">{</span> <span class="c1">// Must unroll - register tensors require static indexing</span> <span class="n">unrollLoop</span><span class="p">(</span><span class="n">forOp</span><span class="p">);</span> <span class="p">}</span> <span class="p">}</span> <span class="p">});</span> <span class="p">}</span> <span class="kt">bool</span> <span class="nf">accessesRegisterTensors</span><span class="p">(</span><span class="n">scf</span><span class="o">::</span><span class="n">ForOp</span> <span class="n">forOp</span><span class="p">)</span> <span class="p">{</span> <span class="kt">bool</span> <span class="n">accessesRegs</span> <span class="o">=</span> <span class="nb">false</span><span class="p">;</span> <span 
class="n">forOp</span><span class="p">.</span><span class="n">walk</span><span class="p">([</span><span class="o">&amp;</span><span class="p">](</span><span class="n">Operation</span><span class="o">*</span> <span class="n">op</span><span class="p">)</span> <span class="p">{</span> <span class="k">for</span> <span class="p">(</span><span class="n">Value</span> <span class="n">operand</span> <span class="o">:</span> <span class="n">op</span><span class="o">-&gt;</span><span class="n">getOperands</span><span class="p">())</span> <span class="p">{</span> <span class="k">if</span> <span class="p">(</span><span class="n">isRegisterTensor</span><span class="p">(</span><span class="n">operand</span><span class="p">))</span> <span class="p">{</span> <span class="n">accessesRegs</span> <span class="o">=</span> <span class="nb">true</span><span class="p">;</span> <span class="p">}</span> <span class="p">}</span> <span class="p">});</span> <span class="k">return</span> <span class="n">accessesRegs</span><span class="p">;</span> <span class="p">}</span> </code></pre></div> </div> <hr /> <h3 id="tileas-unspecialized-pipeline">tileas-unspecialized-pipeline</h3> <p>Implements software pipelining without warp specialization (all warps do both load and compute).</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">TileASUnspecializedPipeline</span><span class="o">::</span><span class="n">runOnOperation</span><span class="p">()</span> <span class="p">{</span> <span class="n">FuncOp</span> <span class="n">funcOp</span> <span class="o">=</span> <span class="n">getOperation</span><span class="p">();</span> <span class="kt">int</span> <span class="n">numStages</span> <span class="o">=</span> <span class="n">getOption</span><span class="o">&lt;</span><span class="kt">int</span><span class="o">&gt;</span><span class="p">(</span><span class="s">"unspecialized-pipeline-num-stages"</span><span 
class="p">);</span> <span class="n">funcOp</span><span class="p">.</span><span class="n">walk</span><span class="p">([</span><span class="o">&amp;</span><span class="p">](</span><span class="n">scf</span><span class="o">::</span><span class="n">ForOp</span> <span class="n">forOp</span><span class="p">)</span> <span class="p">{</span> <span class="k">if</span> <span class="p">(</span><span class="n">canPipeline</span><span class="p">(</span><span class="n">forOp</span><span class="p">))</span> <span class="p">{</span> <span class="n">applySoftwarePipelining</span><span class="p">(</span><span class="n">forOp</span><span class="p">,</span> <span class="n">numStages</span><span class="p">);</span> <span class="p">}</span> <span class="p">});</span> <span class="p">}</span> <span class="kt">void</span> <span class="nf">applySoftwarePipelining</span><span class="p">(</span><span class="n">scf</span><span class="o">::</span><span class="n">ForOp</span> <span class="n">forOp</span><span class="p">,</span> <span class="kt">int</span> <span class="n">numStages</span><span class="p">)</span> <span class="p">{</span> <span class="n">emitPrologue</span><span class="p">(</span><span class="n">forOp</span><span class="p">,</span> <span class="n">numStages</span><span class="p">);</span> <span class="c1">// Pre-load data for first N iterations</span> <span class="n">emitSteadyState</span><span class="p">(</span><span class="n">forOp</span><span class="p">,</span> <span class="n">numStages</span><span class="p">);</span> <span class="c1">// Overlap load(i+N) with compute(i)</span> <span class="n">emitEpilogue</span><span class="p">(</span><span class="n">forOp</span><span class="p">,</span> <span class="n">numStages</span><span class="p">);</span> <span class="c1">// Drain remaining computations</span> <span class="p">}</span> </code></pre></div> </div> <hr /> <h3 id="tileas-dynamic-persistent">tileas-dynamic-persistent</h3> <p>Transforms kernels into dynamic persistent kernels 
that process work items from a queue.</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">TileASDynamicPersistent</span><span class="o">::</span><span class="n">runOnOperation</span><span class="p">()</span> <span class="p">{</span> <span class="n">FuncOp</span> <span class="n">funcOp</span> <span class="o">=</span> <span class="n">getOperation</span><span class="p">();</span> <span class="k">if</span> <span class="p">(</span><span class="n">funcOp</span><span class="o">-&gt;</span><span class="n">hasAttr</span><span class="p">(</span><span class="s">"dynamic_persistent"</span><span class="p">))</span> <span class="p">{</span> <span class="n">emitWarning</span><span class="p">(</span><span class="s">"Kernel is already dynamic persistent"</span><span class="p">);</span> <span class="k">return</span><span class="p">;</span> <span class="p">}</span> <span class="n">transformToPersistent</span><span class="p">(</span><span class="n">funcOp</span><span class="p">);</span> <span class="n">funcOp</span><span class="o">-&gt;</span><span class="n">setAttr</span><span class="p">(</span><span class="s">"dynamic_persistent"</span><span class="p">,</span> <span class="n">builder</span><span class="p">.</span><span class="n">getUnitAttr</span><span class="p">());</span> <span class="p">}</span> <span class="kt">void</span> <span class="nf">transformToPersistent</span><span class="p">(</span><span class="n">FuncOp</span> <span class="n">funcOp</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// Insert outer loop that fetches work items:</span> <span class="c1">// while (workAvailable()) {</span> <span class="c1">// workItem = fetchWork();</span> <span class="c1">// processWorkItem(workItem);</span> <span class="c1">// signalCompletion();</span> <span class="c1">// }</span> <span class="p">}</span> </code></pre></div> </div> <hr /> <h3 
id="tileas-insert-ocg-knobs">tileas-insert-OCG-knobs</h3> <p>Inserts OCG (Optimizing Code Generator) hints for the PTXAS backend.</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">TileASInsertOCGKnobs</span><span class="o">::</span><span class="n">runOnOperation</span><span class="p">()</span> <span class="p">{</span> <span class="n">FuncOp</span> <span class="n">funcOp</span> <span class="o">=</span> <span class="n">getOperation</span><span class="p">();</span> <span class="n">funcOp</span><span class="p">.</span><span class="n">walk</span><span class="p">([</span><span class="o">&amp;</span><span class="p">](</span><span class="n">Operation</span><span class="o">*</span> <span class="n">op</span><span class="p">)</span> <span class="p">{</span> <span class="k">if</span> <span class="p">(</span><span class="k">auto</span> <span class="n">loopOp</span> <span class="o">=</span> <span class="n">dyn_cast</span><span class="o">&lt;</span><span class="n">LoopOp</span><span class="o">&gt;</span><span class="p">(</span><span class="n">op</span><span class="p">))</span> <span class="p">{</span> <span class="n">insertOCGDirectives</span><span class="p">(</span><span class="n">loopOp</span><span class="p">);</span> <span class="p">}</span> <span class="k">if</span> <span class="p">(</span><span class="k">auto</span> <span class="n">mmaOp</span> <span class="o">=</span> <span class="n">dyn_cast</span><span class="o">&lt;</span><span class="n">DotOp</span><span class="o">&gt;</span><span class="p">(</span><span class="n">op</span><span class="p">))</span> <span class="p">{</span> <span class="n">insertMMAOptimizationHints</span><span class="p">(</span><span class="n">mmaOp</span><span class="p">);</span> <span class="p">}</span> <span class="p">});</span> <span class="p">}</span> <span class="kt">void</span> <span class="nf">insertOCGDirectives</span><span class="p">(</span><span 
class="n">Operation</span><span class="o">*</span> <span class="n">op</span><span class="p">)</span> <span class="p">{</span> <span class="n">op</span><span class="o">-&gt;</span><span class="n">setAttr</span><span class="p">(</span><span class="s">"ocgEnterDirectives"</span><span class="p">,</span> <span class="n">buildOCGDirectives</span><span class="p">(</span><span class="n">op</span><span class="p">,</span> <span class="cm">/*enter=*/</span><span class="nb">true</span><span class="p">));</span> <span class="n">op</span><span class="o">-&gt;</span><span class="n">setAttr</span><span class="p">(</span><span class="s">"ocgLeaveDirectives"</span><span class="p">,</span> <span class="n">buildOCGDirectives</span><span class="p">(</span><span class="n">op</span><span class="p">,</span> <span class="cm">/*enter=*/</span><span class="nb">false</span><span class="p">));</span> <span class="p">}</span> </code></pre></div> </div> </details> <hr /> <h1 id="appendix-ir-dumps">Appendix: IR Dumps</h1> <p>This appendix contains the IR dumps from the MoE kernel compilation. 
Some of the IR below uses <code class="language-plaintext highlighter-rouge">%0</code> placeholders.</p> <details style="font-size: 1.2em; margin: 1em 0;"> <summary style="cursor: pointer; font-weight: bold;">cuda_tile IR</summary> <div class="language-llvm highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// cuda_tile dialect operations
// High-level tensor operations from the CuTile Python API

// === Pass #1 scope=cuda_tile.entry ===
"cuda_tile.module"() {1 regions}
  "cuda_tile.entry"() {1 regions}
    %0 = "cuda_tile.constant"() : () -&gt; (ct.view)
    // ... (remaining cuda_tile.constant ops elided)
    %0 = "cuda_tile.assume"(%arg) : (ct.view) -&gt; (ct.view)
    %0 = "cuda_tile.assume"(%cuda_tile.assume) : (ct.view) -&gt; (ct.view)
    // ... (remaining cuda_tile.assume ops on the view arguments elided)
    %0 = "cuda_tile.make_tensor_view"(%cuda_tile.assume, %cuda_tile.assume, %cuda_tile.assume, %cuda_tile.assume, %cuda_tile.assume, %cuda_tile.assume) : (ct.view, ct.view, ct.view, ct.view, ct.view, ct.view) -&gt; (ct.token)
    // ... (further cuda_tile.assume ops elided)
    %0 = "cuda_tile.make_tensor_view"(%cuda_tile.assume, %cuda_tile.assume) : (ct.view, ct.view) -&gt; (ct.token)
    %0 = "cuda_tile.make_token"() : () -&gt; (ct.ptr)
    %0, %1, %2 = "cuda_tile.get_tile_block_id"() : () -&gt; (ct.view, ct.view, ct.view)
    %0 = "cuda_tile.divi"(%cuda_tile.assume, %cuda_tile.constant) : (ct.view, ct.view) -&gt; (ct.view)
    %0 = "cuda_tile.divi"(%cuda_tile.assume, %cuda_tile.constant) : (ct.view, ct.view) -&gt; (ct.view)
    %0 = "cuda_tile.muli"(%cuda_tile.constant, %cuda_tile.divi) : (ct.view, ct.view) -&gt; (ct.view)
    %0 = "cuda_tile.divi"(%cuda_tile.get_tile_block_id, %cuda_tile.muli) : (ct.view, ct.view) -&gt; (ct.view)
    %0 = "cuda_tile.muli"(%cuda_tile.divi, %cuda_tile.constant) : (ct.view, ct.view) -&gt; (ct.view)
    %0 = "cuda_tile.subi"(%cuda_tile.divi, %cuda_tile.muli) : (ct.view, ct.view) -&gt; (ct.view)
    %0 = "cuda_tile.mini"(%cuda_tile.subi, %cuda_tile.constant) : (ct.view, ct.view) -&gt; (ct.view)
    %0 = "cuda_tile.remi"(%cuda_tile.get_tile_block_id, %cuda_tile.mini) : (ct.view, ct.view) -&gt; (ct.view)
    %0 = "cuda_tile.cmpi"(%cuda_tile.remi, %cuda_tile.constant) : (ct.view, ct.view) -&gt; (ct.view)
    %0 = "cuda_tile.cmpi"(%cuda_tile.mini, %cuda_tile.constant) : (ct.view, ct.view) -&gt; (ct.view)
    %0 = "cuda_tile.xori"(%cuda_tile.cmpi, %cuda_tile.cmpi) : (ct.view, ct.view) -&gt; (ct.view)
    %0 = "cuda_tile.cmpi"(%cuda_tile.remi, %cuda_tile.constant) : (ct.view, ct.view) -&gt; (ct.view)
    %0 = "cuda_tile.andi"(%cuda_tile.xori, %cuda_tile.cmpi) : (ct.view, ct.view) -&gt; (ct.view)
    %0 = "cuda_tile.addi"(%cuda_tile.remi, %cuda_tile.mini) : (ct.view, ct.view) -&gt; (ct.view)
    %0 = "cuda_tile.select"(%cuda_tile.andi, %cuda_tile.addi, %cuda_tile.remi) : (ct.view, ct.view, ct.view) -&gt; (ct.view)
    %0 = "cuda_tile.addi"(%cuda_tile.muli, %cuda_tile.select) : (ct.view, ct.view) -&gt; (ct.view)
    %0 = "cuda_tile.remi"(%cuda_tile.get_tile_block_id, %cuda_tile.muli) : (ct.view, ct.view) -&gt; (ct.view)
    %0 = "cuda_tile.cmpi"(%cuda_tile.remi, %cuda_tile.constant) : (<span
class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.cmpi"</span><span class="p">(</span><span class="nv">%cuda_tile.muli</span><span class="p">,</span> <span class="nv">%cuda_tile.constant</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.xori"</span><span class="p">(</span><span class="nv">%cuda_tile.cmpi</span><span class="p">,</span> <span class="nv">%cuda_tile.cmpi</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.cmpi"</span><span class="p">(</span><span class="nv">%cuda_tile.remi</span><span class="p">,</span> <span class="nv">%cuda_tile.constant</span><span 
class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.andi"</span><span class="p">(</span><span class="nv">%cuda_tile.xori</span><span class="p">,</span> <span class="nv">%cuda_tile.cmpi</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.addi"</span><span class="p">(</span><span class="nv">%cuda_tile.remi</span><span class="p">,</span> <span class="nv">%cuda_tile.muli</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.select"</span><span class="p">(</span><span class="nv">%cuda_tile.andi</span><span 
class="p">,</span> <span class="nv">%cuda_tile.addi</span><span class="p">,</span> <span class="nv">%cuda_tile.remi</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.divi"</span><span class="p">(</span><span class="nv">%cuda_tile.select</span><span class="p">,</span> <span class="nv">%cuda_tile.mini</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.muli"</span><span class="p">(</span><span class="nv">%cuda_tile.addi</span><span class="p">,</span> <span class="nv">%cuda_tile.constant</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span 
class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.iota"</span><span class="p">()</span> <span class="err">:</span> <span class="p">()</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.reshape"</span><span class="p">(</span><span class="nv">%cuda_tile.muli</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.broadcast"</span><span class="p">(</span><span class="nv">%cuda_tile.reshape</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.addi"</span><span class="p">(</span><span class="nv">%cuda_tile.broadcast</span><span class="p">,</span> <span class="nv">%cuda_tile.iota</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span 
class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.exti"</span><span class="p">(</span><span class="nv">%cuda_tile.addi</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.exti"</span><span class="p">(</span><span class="nv">%cuda_tile.assume</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.reshape"</span><span class="p">(</span><span class="nv">%cuda_tile.exti</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.broadcast"</span><span class="p">(</span><span class="nv">%cuda_tile.reshape</span><span class="p">)</span> <span class="err">:</span> <span 
class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.cmpi"</span><span class="p">(</span><span class="nv">%cuda_tile.exti</span><span class="p">,</span> <span class="nv">%cuda_tile.broadcast</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.reshape"</span><span class="p">(</span><span class="nv">%cuda_tile.assume</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.broadcast"</span><span class="p">(</span><span class="nv">%cuda_tile.reshape</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span 
class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.offset"</span><span class="p">(</span><span class="nv">%cuda_tile.broadcast</span><span class="p">,</span> <span class="nv">%cuda_tile.exti</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.reshape"</span><span class="p">(</span><span class="nv">%cuda_tile.constant</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.broadcast"</span><span class="p">(</span><span class="nv">%cuda_tile.reshape</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span><span class="p">,</span> <span class="nv">%1</span> <span class="p">=</span> <span class="s">"cuda_tile.load_ptr_tko"</span><span class="p">(</span><span 
class="nv">%cuda_tile.offset</span><span class="p">,</span> <span class="nv">%cuda_tile.cmpi</span><span class="p">,</span> <span class="nv">%cuda_tile.broadcast</span><span class="p">,</span> <span class="nv">%cuda_tile.make_token</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.reshape"</span><span class="p">(</span><span class="nv">%arg</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.broadcast"</span><span class="p">(</span><span class="nv">%cuda_tile.reshape</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span 
class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.divi"</span><span class="p">(</span><span class="nv">%cuda_tile.load_ptr_tko</span><span class="p">,</span> <span class="nv">%cuda_tile.broadcast</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.make_partition_view"</span><span class="p">(</span><span class="nv">%cuda_tile.make_tensor_view</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="kt">token</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">part</span><span class="p">)</span> <span class="nv">%0</span><span class="p">,</span> <span class="nv">%1</span> <span class="p">=</span> <span class="s">"cuda_tile.load_view_tko"</span><span class="p">(</span><span class="nv">%cuda_tile.make_partition_view</span><span class="p">,</span> <span class="nv">%cuda_tile.addi</span><span class="p">,</span> <span class="nv">%cuda_tile.make_token</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">part</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span 
class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.reshape"</span><span class="p">(</span><span class="nv">%cuda_tile.load_view_tko</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.divi"</span><span class="p">(</span><span class="nv">%cuda_tile.assume</span><span class="p">,</span> <span class="nv">%cuda_tile.constant</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.iota"</span><span class="p">()</span> <span class="err">:</span> <span class="p">()</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span 
class="s">"cuda_tile.reshape"</span><span class="p">(</span><span class="nv">%cuda_tile.divi</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.exti"</span><span class="p">(</span><span class="nv">%cuda_tile.assume</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.reshape"</span><span class="p">(</span><span class="nv">%cuda_tile.exti</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.broadcast"</span><span class="p">(</span><span class="nv">%cuda_tile.reshape</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span 
class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.exti"</span><span class="p">(</span><span class="nv">%cuda_tile.assume</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.reshape"</span><span class="p">(</span><span class="nv">%cuda_tile.exti</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.broadcast"</span><span class="p">(</span><span class="nv">%cuda_tile.reshape</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.exti"</span><span class="p">(</span><span class="nv">%cuda_tile.assume</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span 
class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.reshape"</span><span class="p">(</span><span class="nv">%cuda_tile.exti</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.broadcast"</span><span class="p">(</span><span class="nv">%cuda_tile.reshape</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.reshape"</span><span class="p">(</span><span class="nv">%cuda_tile.assume</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.broadcast"</span><span class="p">(</span><span class="nv">%cuda_tile.reshape</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span 
class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.reshape"</span><span class="p">(</span><span class="nv">%cuda_tile.constant</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.broadcast"</span><span class="p">(</span><span class="nv">%cuda_tile.reshape</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.for"</span><span class="p">(</span><span class="nv">%cuda_tile.constant</span><span class="p">,</span> <span class="nv">%cuda_tile.divi</span><span class="p">,</span> <span class="nv">%cuda_tile.constant</span><span class="p">,</span> <span class="nv">%cuda_tile.constant</span><span class="p">)</span> <span class="p">{</span><span class="m">1</span> <span class="err">regions</span><span class="p">}</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span 
class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.muli"</span><span class="p">(</span><span class="nv">%arg</span><span class="p">,</span> <span class="nv">%cuda_tile.constant</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.reshape"</span><span class="p">(</span><span class="nv">%cuda_tile.muli</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.broadcast"</span><span class="p">(</span><span class="nv">%cuda_tile.reshape</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span 
class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.addi"</span><span class="p">(</span><span class="nv">%cuda_tile.broadcast</span><span class="p">,</span> <span class="nv">%cuda_tile.iota</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.reshape"</span><span class="p">(</span><span class="nv">%cuda_tile.addi</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.exti"</span><span class="p">(</span><span class="nv">%cuda_tile.reshape</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span 
class="s">"cuda_tile.broadcast"</span><span class="p">(</span><span class="nv">%cuda_tile.exti</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.cmpi"</span><span class="p">(</span><span class="nv">%cuda_tile.broadcast</span><span class="p">,</span> <span class="nv">%cuda_tile.broadcast</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.muli"</span><span class="p">(</span><span class="nv">%cuda_tile.broadcast</span><span class="p">,</span> <span class="nv">%cuda_tile.broadcast</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.exti"</span><span class="p">(</span><span 
class="nv">%cuda_tile.reshape</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.broadcast"</span><span class="p">(</span><span class="nv">%cuda_tile.exti</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.cmpi"</span><span class="p">(</span><span class="nv">%cuda_tile.broadcast</span><span class="p">,</span> <span class="nv">%cuda_tile.broadcast</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.andi"</span><span class="p">(</span><span class="nv">%cuda_tile.cmpi</span><span class="p">,</span> <span class="nv">%cuda_tile.cmpi</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span 
class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.addi"</span><span class="p">(</span><span class="nv">%cuda_tile.muli</span><span class="p">,</span> <span class="nv">%cuda_tile.broadcast</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.offset"</span><span class="p">(</span><span class="nv">%cuda_tile.broadcast</span><span class="p">,</span> <span class="nv">%cuda_tile.addi</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span><span class="p">,</span> <span class="nv">%1</span> <span class="p">=</span> <span class="s">"cuda_tile.load_ptr_tko"</span><span class="p">(</span><span class="nv">%cuda_tile.offset</span><span class="p">,</span> <span class="nv">%cuda_tile.andi</span><span class="p">,</span> 
<span class="nv">%cuda_tile.broadcast</span><span class="p">,</span> <span class="nv">%cuda_tile.make_token</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.make_partition_view"</span><span class="p">(</span><span class="nv">%cuda_tile.make_tensor_view</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="kt">token</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">part</span><span class="p">)</span> <span class="nv">%0</span><span class="p">,</span> <span class="nv">%1</span> <span class="p">=</span> <span class="s">"cuda_tile.load_view_tko"</span><span class="p">(</span><span class="nv">%cuda_tile.make_partition_view</span><span class="p">,</span> <span class="nv">%cuda_tile.reshape</span><span class="p">,</span> <span class="nv">%arg</span><span class="p">,</span> <span class="nv">%cuda_tile.divi</span><span class="p">,</span> <span class="nv">%cuda_tile.make_token</span><span class="p">)</span> <span class="err">:</span> <span 
class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">part</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.reshape"</span><span class="p">(</span><span class="nv">%cuda_tile.load_view_tko</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.mmaf"</span><span class="p">(</span><span class="nv">%cuda_tile.load_ptr_tko</span><span class="p">,</span> <span class="nv">%cuda_tile.reshape</span><span class="p">,</span> <span class="nv">%arg</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span 
class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="s">"cuda_tile.continue"</span><span class="p">(</span><span class="nv">%cuda_tile.mmaf</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">()</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.muli"</span><span class="p">(</span><span class="nv">%cuda_tile.divi</span><span class="p">,</span> <span class="nv">%cuda_tile.constant</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.iota"</span><span class="p">()</span> <span class="err">:</span> <span class="p">()</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.reshape"</span><span class="p">(</span><span class="nv">%cuda_tile.muli</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span 
class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.broadcast"</span><span class="p">(</span><span class="nv">%cuda_tile.reshape</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.addi"</span><span class="p">(</span><span class="nv">%cuda_tile.broadcast</span><span class="p">,</span> <span class="nv">%cuda_tile.iota</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.ftof"</span><span class="p">(</span><span class="nv">%cuda_tile.for</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span 
class="s">"cuda_tile.reshape"</span><span class="p">(</span><span class="nv">%cuda_tile.load_ptr_tko</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.reshape"</span><span class="p">(</span><span class="nv">%cuda_tile.addi</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.exti"</span><span class="p">(</span><span class="nv">%cuda_tile.reshape</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.broadcast"</span><span class="p">(</span><span class="nv">%cuda_tile.exti</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span 
class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.exti"</span><span class="p">(</span><span class="nv">%cuda_tile.assume</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.reshape"</span><span class="p">(</span><span class="nv">%cuda_tile.exti</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.broadcast"</span><span class="p">(</span><span class="nv">%cuda_tile.reshape</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.cmpi"</span><span class="p">(</span><span class="nv">%cuda_tile.broadcast</span><span class="p">,</span> <span class="nv">%cuda_tile.broadcast</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span 
class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.exti"</span><span class="p">(</span><span class="nv">%cuda_tile.assume</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.reshape"</span><span class="p">(</span><span class="nv">%cuda_tile.exti</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.broadcast"</span><span class="p">(</span><span class="nv">%cuda_tile.reshape</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.muli"</span><span 
class="p">(</span><span class="nv">%cuda_tile.broadcast</span><span class="p">,</span> <span class="nv">%cuda_tile.broadcast</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.exti"</span><span class="p">(</span><span class="nv">%cuda_tile.reshape</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.broadcast"</span><span class="p">(</span><span class="nv">%cuda_tile.exti</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.exti"</span><span class="p">(</span><span class="nv">%cuda_tile.assume</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span 
class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.reshape"</span><span class="p">(</span><span class="nv">%cuda_tile.exti</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.broadcast"</span><span class="p">(</span><span class="nv">%cuda_tile.reshape</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.cmpi"</span><span class="p">(</span><span class="nv">%cuda_tile.broadcast</span><span class="p">,</span> <span class="nv">%cuda_tile.broadcast</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span 
class="s">"cuda_tile.andi"</span><span class="p">(</span><span class="nv">%cuda_tile.cmpi</span><span class="p">,</span> <span class="nv">%cuda_tile.cmpi</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.addi"</span><span class="p">(</span><span class="nv">%cuda_tile.muli</span><span class="p">,</span> <span class="nv">%cuda_tile.broadcast</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.reshape"</span><span class="p">(</span><span class="nv">%cuda_tile.assume</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.broadcast"</span><span class="p">(</span><span 
class="nv">%cuda_tile.reshape</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.offset"</span><span class="p">(</span><span class="nv">%cuda_tile.broadcast</span><span class="p">,</span> <span class="nv">%cuda_tile.addi</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.store_ptr_tko"</span><span class="p">(</span><span class="nv">%cuda_tile.offset</span><span class="p">,</span> <span class="nv">%cuda_tile.ftof</span><span class="p">,</span> <span class="nv">%cuda_tile.andi</span><span class="p">,</span> <span class="nv">%cuda_tile.make_token</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="err">-</span><span 
class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="s">"cuda_tile.return"</span><span class="p">()</span> </code></pre></div> </div> </details> <details style="font-size: 1.2em; margin: 1em 0;"> <summary style="cursor: pointer; font-weight: bold;">nv_tileaa IR</summary> <div class="language-llvm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="err">//</span> <span class="err">nv_tileaa</span> <span class="err">dialect</span> <span class="err">operations</span> <span class="err">//</span> <span class="err">Tile-level</span> <span class="err">ops</span> <span class="p">(</span><span class="err">architecture-independent</span><span class="p">)</span> <span class="err">//</span> <span class="p">===</span> <span class="err">Pass</span> <span class="vg">#1</span> <span class="err">scope</span><span class="p">=</span><span class="err">nv_tileaa</span><span class="p">.</span><span class="err">func</span> <span class="p">===</span> <span class="s">"nv_tileaa.func"</span><span class="p">()</span> <span class="p">{</span><span class="err">nv_tileaa</span><span class="p">.</span><span class="err">kernel_spec</span><span class="p">}</span> <span class="p">{</span><span class="m">1</span> <span class="err">regions</span><span class="p">}</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.assume"</span><span class="p">(</span><span class="nv">%arg</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">aa</span><span class="p">.</span><span class="err">memref</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">aa</span><span class="p">.</span><span class="err">memref</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span 
class="s">"nv_tileaa.assume"</span><span class="p">(</span><span class="nv">%arg</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.assume"</span><span class="p">(</span><span class="nv">%nv_tileaa.assume</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.assume"</span><span class="p">(</span><span class="nv">%arg</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.assume"</span><span class="p">(</span><span class="nv">%nv_tileaa.assume</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.assume"</span><span class="p">(</span><span class="nv">%arg</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span 
class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.assume"</span><span class="p">(</span><span class="nv">%nv_tileaa.assume</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.assume"</span><span class="p">(</span><span class="nv">%arg</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">aa</span><span class="p">.</span><span class="err">memref</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">aa</span><span class="p">.</span><span class="err">memref</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.assume"</span><span class="p">(</span><span class="nv">%arg</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.assume"</span><span class="p">(</span><span class="nv">%nv_tileaa.assume</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.assume"</span><span class="p">(</span><span class="nv">%arg</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span 
class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.assume"</span><span class="p">(</span><span class="nv">%nv_tileaa.assume</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.splat"</span><span class="p">(</span><span class="nv">%nv_tileaa.assume</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.assume"</span><span class="p">(</span><span class="nv">%arg</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.assume"</span><span class="p">(</span><span class="nv">%nv_tileaa.assume</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.splat"</span><span class="p">(</span><span class="nv">%nv_tileaa.assume</span><span class="p">)</span> 
<span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.assume"</span><span class="p">(</span><span class="nv">%arg</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.assume"</span><span class="p">(</span><span class="nv">%nv_tileaa.assume</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.assume"</span><span class="p">(</span><span class="nv">%arg</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.assume"</span><span class="p">(</span><span class="nv">%nv_tileaa.assume</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.assume"</span><span 
class="p">(</span><span class="nv">%arg</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">aa</span><span class="p">.</span><span class="err">memref</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">aa</span><span class="p">.</span><span class="err">memref</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.assume"</span><span class="p">(</span><span class="nv">%arg</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.assume"</span><span class="p">(</span><span class="nv">%nv_tileaa.assume</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.assume"</span><span class="p">(</span><span class="nv">%arg</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.assume"</span><span class="p">(</span><span class="nv">%nv_tileaa.assume</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span 
class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.assume"</span><span class="p">(</span><span class="nv">%arg</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.assume"</span><span class="p">(</span><span class="nv">%nv_tileaa.assume</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.assume"</span><span class="p">(</span><span class="nv">%arg</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">aa</span><span class="p">.</span><span class="err">memref</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">aa</span><span class="p">.</span><span class="err">memref</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.assume"</span><span class="p">(</span><span class="nv">%arg</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.assume"</span><span class="p">(</span><span class="nv">%nv_tileaa.assume</span><span class="p">)</span> <span 
class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.splat"</span><span class="p">(</span><span class="nv">%nv_tileaa.assume</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.assume"</span><span class="p">(</span><span class="nv">%arg</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">aa</span><span class="p">.</span><span class="err">memref</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">aa</span><span class="p">.</span><span class="err">memref</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.assume"</span><span class="p">(</span><span class="nv">%arg</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.assume"</span><span class="p">(</span><span class="nv">%nv_tileaa.assume</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span 
class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.make_memref"</span><span class="p">(</span><span class="nv">%nv_tileaa.assume</span><span class="p">,</span> <span class="nv">%nv_tileaa.assume</span><span class="p">,</span> <span class="nv">%nv_tileaa.assume</span><span class="p">,</span> <span class="nv">%nv_tileaa.assume</span><span class="p">,</span> <span class="nv">%nv_tileaa.assume</span><span class="p">,</span> <span class="nv">%nv_tileaa.assume</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">aa</span><span class="p">.</span><span class="err">memref</span><span class="p">,</span> <span class="err">iN</span><span class="p">,</span> <span class="err">iN</span><span class="p">,</span> <span class="err">iN</span><span class="p">,</span> <span class="err">iN</span><span class="p">,</span> <span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">aa</span><span class="p">.</span><span class="err">btile</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.make_memref"</span><span class="p">(</span><span class="nv">%nv_tileaa.assume</span><span class="p">,</span> <span class="nv">%nv_tileaa.assume</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">aa</span><span class="p">.</span><span class="err">memref</span><span class="p">,</span> <span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">aa</span><span class="p">.</span><span class="err">btile</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.create_mem_token"</span><span class="p">()</span> <span class="err">:</span> <span class="p">()</span> <span class="err">-</span><span class="p">&gt;</span> 
<span class="p">(</span><span class="err">aa</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.get_program_id"</span><span class="p">()</span> <span class="err">:</span> <span class="p">()</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.splat"</span><span class="p">(</span><span class="nv">%nv_tileaa.get_program_id</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.make_range"</span><span class="p">(</span><span class="nv">%arith.constant</span><span class="p">,</span> <span class="nv">%arith.constant</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">,</span> <span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.extract"</span><span class="p">(</span><span class="nv">%arith.muli</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.splat"</span><span class="p">(</span><span 
class="nv">%nv_tileaa.extract</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.extract"</span><span class="p">(</span><span class="nv">%arith.extsi</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.splat"</span><span class="p">(</span><span class="nv">%nv_tileaa.extract</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.splat"</span><span class="p">(</span><span class="nv">%nv_tileaa.assume</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">aa</span><span class="p">.</span><span class="err">memref</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.addptr"</span><span class="p">(</span><span class="nv">%nv_tileaa.splat</span><span class="p">,</span> <span class="nv">%arith.extsi</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span 
class="p">&lt;...&gt;,</span> <span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span><span class="p">,</span> <span class="nv">%1</span> <span class="p">=</span> <span class="s">"nv_tileaa.load"</span><span class="p">(</span><span class="nv">%nv_tileaa.addptr</span><span class="p">,</span> <span class="nv">%arith.cmpi</span><span class="p">,</span> <span class="nv">%arith.constant</span><span class="p">,</span> <span class="nv">%nv_tileaa.create_mem_token</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">aa</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">aa</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.splat"</span><span class="p">(</span><span class="nv">%arg</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.block_tile"</span><span class="p">(</span><span class="nv">%nv_tileaa.make_memref</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">aa</span><span class="p">.</span><span 
class="err">btile</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">aa</span><span class="p">.</span><span class="err">mtoken</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.extract"</span><span class="p">(</span><span class="nv">%arith.addi</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="nv">%0</span><span class="p">,</span> <span class="nv">%1</span> <span class="p">=</span> <span class="s">"nv_tileaa.tiled_load"</span><span class="p">(</span><span class="nv">%nv_tileaa.block_tile</span><span class="p">,</span> <span class="nv">%nv_tileaa.extract</span><span class="p">,</span> <span class="nv">%nv_tileaa.create_mem_token</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">aa</span><span class="p">.</span><span class="err">mtoken</span><span class="p">,</span> <span class="err">iN</span><span class="p">,</span> <span class="err">aa</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">aa</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.view"</span><span class="p">(</span><span class="nv">%nv_tileaa.tiled_load</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span 
class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.make_range"</span><span class="p">(</span><span class="nv">%arith.constant</span><span class="p">,</span> <span class="nv">%arith.constant</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">,</span> <span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.expand_dims"</span><span class="p">(</span><span class="nv">%arith.floordivsi</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.splat"</span><span class="p">(</span><span class="nv">%arith.extsi</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.splat"</span><span class="p">(</span><span class="nv">%arith.extsi</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span> <span class="p">=</span> <span 
class="s">"nv_tileaa.splat"</span><span class="p">(</span><span class="nv">%arith.extsi</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.splat"</span><span class="p">(</span><span class="nv">%nv_tileaa.assume</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">aa</span><span class="p">.</span><span class="err">memref</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.extract"</span><span class="p">(</span><span class="nv">%arith.ceildivsi</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.splat"</span><span class="p">(</span><span class="nv">%arg</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.extract"</span><span class="p">(</span><span class="nv">%arith.muli</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> 
<span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.splat"</span><span class="p">(</span><span class="nv">%nv_tileaa.extract</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.expand_dims"</span><span class="p">(</span><span class="nv">%arith.addi</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.broadcast"</span><span class="p">(</span><span class="nv">%arith.extsi</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.broadcast"</span><span class="p">(</span><span class="nv">%arith.extsi</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.addptr"</span><span 
class="p">(</span><span class="nv">%nv_tileaa.splat</span><span class="p">,</span> <span class="nv">%arith.addi</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span><span class="p">,</span> <span class="nv">%1</span> <span class="p">=</span> <span class="s">"nv_tileaa.load"</span><span class="p">(</span><span class="nv">%nv_tileaa.addptr</span><span class="p">,</span> <span class="nv">%arith.andi</span><span class="p">,</span> <span class="nv">%arith.constant</span><span class="p">,</span> <span class="nv">%nv_tileaa.create_mem_token</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">aa</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">aa</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.block_tile"</span><span class="p">(</span><span class="nv">%nv_tileaa.make_memref</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">aa</span><span class="p">.</span><span class="err">btile</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">aa</span><span class="p">.</span><span 
class="err">mtoken</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.extract"</span><span class="p">(</span><span class="nv">%nv_tileas.convert_layout</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.extract"</span><span class="p">(</span><span class="nv">%arith.floordivsi</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="nv">%0</span><span class="p">,</span> <span class="nv">%1</span> <span class="p">=</span> <span class="s">"nv_tileaa.tiled_load"</span><span class="p">(</span><span class="nv">%nv_tileaa.block_tile</span><span class="p">,</span> <span class="nv">%nv_tileaa.extract</span><span class="p">,</span> <span class="nv">%arg</span><span class="p">,</span> <span class="nv">%nv_tileaa.extract</span><span class="p">,</span> <span class="nv">%nv_tileaa.create_mem_token</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">aa</span><span class="p">.</span><span class="err">mtoken</span><span class="p">,</span> <span class="err">iN</span><span class="p">,</span> <span class="err">iN</span><span class="p">,</span> <span class="err">iN</span><span class="p">,</span> <span class="err">aa</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span 
class="err">aa</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.view"</span><span class="p">(</span><span class="nv">%nv_tileaa.tiled_load</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.dot"</span><span class="p">(</span><span class="nv">%nv_tileaa.load</span><span class="p">,</span> <span class="nv">%nv_tileaa.view</span><span class="p">,</span> <span class="nv">%arg</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.make_range"</span><span class="p">(</span><span class="nv">%arith.constant</span><span class="p">,</span> <span class="nv">%arith.constant</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">,</span> <span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.extract"</span><span class="p">(</span><span class="nv">%arith.muli</span><span class="p">)</span> <span class="err">:</span> <span 
class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.splat"</span><span class="p">(</span><span class="nv">%nv_tileaa.extract</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.fp_to_fp"</span><span class="p">(</span><span class="nv">%scf.for</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.expand_dims"</span><span class="p">(</span><span class="nv">%nv_tileaa.load</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.expand_dims"</span><span class="p">(</span><span class="nv">%arith.addi</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span 
class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.broadcast"</span><span class="p">(</span><span class="nv">%arith.extsi</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.splat"</span><span class="p">(</span><span class="nv">%arith.extsi</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.splat"</span><span class="p">(</span><span class="nv">%arith.extsi</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.broadcast"</span><span class="p">(</span><span class="nv">%arith.extsi</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.splat"</span><span class="p">(</span><span class="nv">%arith.extsi</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span 
class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.splat"</span><span class="p">(</span><span class="nv">%nv_tileaa.assume</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">aa</span><span class="p">.</span><span class="err">memref</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.addptr"</span><span class="p">(</span><span class="nv">%nv_tileaa.splat</span><span class="p">,</span> <span class="nv">%arith.addi</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.store"</span><span class="p">(</span><span class="nv">%nv_tileaa.addptr</span><span class="p">,</span> <span class="nv">%nv_tileaa.fp_to_fp</span><span class="p">,</span> <span class="nv">%arith.andi</span><span class="p">,</span> <span class="nv">%nv_tileaa.create_mem_token</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">aa</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="err">-</span><span 
class="p">&gt;</span> <span class="p">(</span><span class="err">aa</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="s">"nv_tileaa.return"</span><span class="p">()</span> <span class="err">//</span> <span class="p">===</span> <span class="err">Pass</span> <span class="vg">#2</span> <span class="err">scope</span><span class="p">=</span><span class="err">nv_tileaa</span><span class="p">.</span><span class="err">func</span> <span class="p">===</span> <span class="err">//</span> <span class="p">===</span> <span class="err">Pass</span> <span class="vg">#3</span> <span class="err">scope</span><span class="p">=</span><span class="err">nv_tileaa</span><span class="p">.</span><span class="err">func</span> <span class="p">===</span> <span class="err">//</span> <span class="p">===</span> <span class="err">Pass</span> <span class="vg">#4</span> <span class="err">scope</span><span class="p">=</span><span class="err">nv_tileaa</span><span class="p">.</span><span class="err">func</span> <span class="p">===</span> <span class="err">//</span> <span class="p">===</span> <span class="err">Pass</span> <span class="vg">#5</span> <span class="err">scope</span><span class="p">=</span><span class="err">nv_tileaa</span><span class="p">.</span><span class="err">func</span> <span class="p">===</span> <span class="err">//</span> <span class="p">===</span> <span class="err">Pass</span> <span class="vg">#6</span> <span class="err">scope</span><span class="p">=</span><span class="err">nv_tileaa</span><span class="p">.</span><span class="err">func</span> <span class="p">===</span> <span class="err">//</span> <span class="p">===</span> <span class="err">Pass</span> <span class="vg">#7</span> <span class="err">scope</span><span class="p">=</span><span class="err">nv_tileaa</span><span class="p">.</span><span class="err">func</span> <span class="p">===</span> <span class="err">//</span> <span class="p">===</span> <span class="err">Pass</span> <span 
class="vg">#8</span> <span class="err">scope</span><span class="p">=</span><span class="err">nv_tileaa</span><span class="p">.</span><span class="err">func</span> <span class="p">===</span> <span class="err">//</span> <span class="p">===</span> <span class="err">Pass</span> <span class="vg">#9</span> <span class="err">scope</span><span class="p">=</span><span class="err">nv_tileaa</span><span class="p">.</span><span class="err">func</span> <span class="p">===</span> <span class="err">//</span> <span class="p">===</span> <span class="err">Pass</span> <span class="vg">#10</span> <span class="err">scope</span><span class="p">=</span><span class="err">nv_tileaa</span><span class="p">.</span><span class="err">func</span> <span class="p">===</span> <span class="err">//</span> <span class="p">===</span> <span class="err">Pass</span> <span class="vg">#11</span> <span class="err">scope</span><span class="p">=</span><span class="err">nv_tileaa</span><span class="p">.</span><span class="err">func</span> <span class="p">===</span> <span class="err">//</span> <span class="p">===</span> <span class="err">Pass</span> <span class="vg">#12</span> <span class="err">scope</span><span class="p">=</span><span class="err">nv_tileaa</span><span class="p">.</span><span class="err">func</span> <span class="p">===</span> <span class="err">//</span> <span class="p">(</span><span class="err">Lines</span> <span class="m">193-352</span> <span class="err">-</span> <span class="err">final</span> <span class="err">assembly</span> <span class="err">with</span> <span class="err">fp_to_fp</span> <span class="err">conversions</span><span class="p">)</span> <span class="err">//</span> <span class="err">See</span> <span class="err">dump</span> <span class="err">for</span> <span class="err">complete</span> <span class="err">content</span> <span class="nl">including:</span> <span class="err">//</span> <span class="err">-</span> <span class="m">32</span> <span class="err">fp_to_fp</span> <span 
class="err">operations</span> <span class="err">for</span> <span class="err">output</span> <span class="err">precision</span> <span class="err">conversion</span> <span class="err">//</span> <span class="err">-</span> <span class="err">Multiple</span> <span class="err">nv_tileaa</span><span class="p">.</span><span class="err">func</span> <span class="err">declarations</span> <span class="err">with</span> <span class="err">kernel</span> <span class="kt">metadata</span> <span class="err">//</span> <span class="err">-</span> <span class="err">Final</span> <span class="err">memory</span> <span class="err">layout</span> <span class="err">preparation</span> </code></pre></div> </div> </details> <details style="font-size: 1.2em; margin: 1em 0;"> <summary style="cursor: pointer; font-weight: bold;">nv_tileas IR</summary> <div class="language-llvm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="err">//</span> <span class="err">nv_tileas</span> <span class="err">dialect</span> <span class="err">operations</span> <span class="err">//</span> <span class="err">Tile-level</span> <span class="err">Scheduled</span> <span class="err">Assembly</span> <span class="p">(</span><span class="err">architecture-specific</span><span class="p">)</span> <span class="err">//</span> <span class="p">[</span><span class="k">within</span> <span class="err">nv_tileaa</span><span class="p">.</span><span class="err">func</span> <span class="err">pass</span><span class="p">]</span> <span class="nv">%0</span><span class="p">,</span> <span class="nv">%1</span> <span class="p">=</span> <span class="s">"nv_tileas.load"</span><span class="p">(</span><span class="nv">%nv_tileaa.addptr</span><span class="p">,</span> <span class="nv">%arith.cmpi</span><span class="p">,</span> <span class="nv">%arith.constant</span><span class="p">,</span> <span class="nv">%nv_tileaa.create_mem_token</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span 
class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">aa</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">aa</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="nv">%0</span><span class="p">,</span> <span class="nv">%1</span> <span class="p">=</span> <span class="s">"nv_tileas.tiled_load"</span><span class="p">(</span><span class="nv">%nv_tileaa.block_tile</span><span class="p">,</span> <span class="nv">%nv_tileaa.extract</span><span class="p">,</span> <span class="nv">%nv_tileaa.create_mem_token</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">aa</span><span class="p">.</span><span class="err">mtoken</span><span class="p">,</span> <span class="err">iN</span><span class="p">,</span> <span class="err">aa</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">aa</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileas.view"</span><span class="p">(</span><span class="nv">%nv_tileas.tiled_load</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span> <span class="p">=</span> <span 
class="s">"nv_tileas.expand_dims"</span><span class="p">(</span><span class="nv">%arith.floordivsi</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileas.expand_dims"</span><span class="p">(</span><span class="nv">%arith.addi</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileas.convert_layout"</span><span class="p">(</span><span class="nv">%nv_tileaa.broadcast</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span><span class="p">,</span> <span class="nv">%1</span> <span class="p">=</span> <span class="s">"nv_tileas.load"</span><span class="p">(</span><span class="nv">%nv_tileaa.addptr</span><span class="p">,</span> <span class="nv">%arith.andi</span><span class="p">,</span> <span class="nv">%arith.constant</span><span class="p">,</span> <span class="nv">%nv_tileaa.create_mem_token</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span 
class="err">aa</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">aa</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileas.convert_layout"</span><span class="p">(</span><span class="nv">%nv_tileas.view</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span><span class="p">,</span> <span class="nv">%1</span> <span class="p">=</span> <span class="s">"nv_tileas.tiled_load"</span><span class="p">(</span><span class="nv">%nv_tileaa.block_tile</span><span class="p">,</span> <span class="nv">%nv_tileaa.extract</span><span class="p">,</span> <span class="nv">%arg</span><span class="p">,</span> <span class="nv">%nv_tileaa.extract</span><span class="p">,</span> <span class="nv">%nv_tileaa.create_mem_token</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">aa</span><span class="p">.</span><span class="err">mtoken</span><span class="p">,</span> <span class="err">iN</span><span class="p">,</span> <span class="err">iN</span><span class="p">,</span> <span class="err">iN</span><span class="p">,</span> <span class="err">aa</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">aa</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="nv">%0</span> <span 
class="p">=</span> <span class="s">"nv_tileas.view"</span><span class="p">(</span><span class="nv">%nv_tileas.tiled_load</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileas.convert_layout"</span><span class="p">(</span><span class="nv">%nv_tileas.load</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileas.convert_layout"</span><span class="p">(</span><span class="nv">%nv_tileas.view</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileas.convert_layout"</span><span class="p">(</span><span class="nv">%arg</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileas.dot"</span><span class="p">(</span><span class="nv">%nv_tileas.convert_layout</span><span class="p">,</span> <span 
class="nv">%nv_tileas.convert_layout</span><span class="p">,</span> <span class="nv">%nv_tileas.convert_layout</span><span class="p">,</span> <span class="nv">%arith.constant</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileas.convert_layout"</span><span class="p">(</span><span class="nv">%nv_tileas.dot</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileas.make_tiled_tma_desc"</span><span class="p">(</span><span class="nv">%nv_tileaa.make_memref</span><span class="p">)</span> <span class="p">{</span><span class="err">tmaIdx</span><span class="p">}</span> <span class="err">:</span> <span class="p">(</span><span class="err">aa</span><span class="p">.</span><span class="err">btile</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">?</span><span class="k">type</span><span class="p">)</span> <span class="err">//</span> <span class="p">[</span><span class="k">within</span> <span class="k">builtin</span><span class="p">.</span><span class="k">module</span> <span class="err">pass</span><span class="p">]</span> <span class="nv">%0</span> <span class="p">=</span> <span 
class="s">"nv_tileas.async.pipeline.create_pipeline"</span><span class="p">()</span> <span class="err">:</span> <span class="p">()</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">?</span><span class="k">type</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileas.async.pipeline.create_pipeline"</span><span class="p">()</span> <span class="err">:</span> <span class="p">()</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">?</span><span class="k">type</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileas.async.pipeline.create_pipeline"</span><span class="p">()</span> <span class="err">:</span> <span class="p">()</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">?</span><span class="k">type</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileas.async.pipeline.create_iterator"</span><span class="p">(</span><span class="nv">%nv_tileas.async.pipeline.create_pipeline</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">?</span><span class="k">type</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">?</span><span class="k">type</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileas.async.pipeline.create_iterator"</span><span class="p">(</span><span class="nv">%nv_tileas.async.pipeline.create_pipeline</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">?</span><span class="k">type</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">?</span><span 
class="k">type</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileas.async.pipeline.create_iterator"</span><span class="p">(</span><span class="nv">%nv_tileas.async.pipeline.create_pipeline</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">?</span><span class="k">type</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">?</span><span class="k">type</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileas.async.pipeline.create_iterator"</span><span class="p">(</span><span class="nv">%nv_tileas.async.pipeline.create_pipeline</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">?</span><span class="k">type</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">?</span><span class="k">type</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileas.async.pipeline.create_iterator"</span><span class="p">(</span><span class="nv">%nv_tileas.async.pipeline.create_pipeline</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">?</span><span class="k">type</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">?</span><span class="k">type</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileas.async.pipeline.create_iterator"</span><span class="p">(</span><span class="nv">%nv_tileas.async.pipeline.create_pipeline</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">?</span><span class="k">type</span><span class="p">)</span> <span class="err">-</span><span 
class="p">&gt;</span> <span class="p">(</span><span class="err">?</span><span class="k">type</span><span class="p">)</span> <span class="s">"nv_tileas.async.pipeline.agent_switch"</span><span class="p">(</span><span class="nv">%arg</span><span class="p">,</span> <span class="p">...)</span> <span class="p">{</span><span class="m">4</span> <span class="err">regions</span><span class="p">}</span> <span class="err">:</span> <span class="p">(...)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">()</span> <span class="err">//</span> <span class="err">Producer-Consumer</span> <span class="err">Pattern</span> <span class="p">(</span><span class="err">repeated</span> <span class="err">throughout</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileas.async.pipeline.producer_acquire"</span><span class="p">(</span><span class="nv">%arg</span><span class="p">,</span> <span class="nv">%arg</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">?</span><span class="k">type</span><span class="p">,</span> <span class="err">?</span><span class="k">type</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">?</span><span class="k">type</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileas.async.pipeline.inc_iter"</span><span class="p">(</span><span class="nv">%arg</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">?</span><span class="k">type</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">?</span><span class="k">type</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileas.async.pipeline.producer_write"</span><span class="p">(</span><span 
class="nv">%arg</span><span class="p">,</span> <span class="nv">%nv_tileas.async.pipeline.producer_acquire</span><span class="p">)</span> <span class="p">{</span><span class="m">1</span> <span class="err">regions</span><span class="p">}</span> <span class="err">:</span> <span class="p">(</span><span class="err">?</span><span class="k">type</span><span class="p">,</span> <span class="err">?</span><span class="k">type</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">?</span><span class="k">type</span><span class="p">)</span> <span class="s">"nv_tileas.async.pipeline.producer_commit"</span><span class="p">(</span><span class="nv">%arg</span><span class="p">,</span> <span class="nv">%nv_tileas.async.pipeline.producer_write</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">?</span><span class="k">type</span><span class="p">,</span> <span class="err">?</span><span class="k">type</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">()</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileas.async.pipeline.consumer_wait"</span><span class="p">(</span><span class="nv">%arg</span><span class="p">,</span> <span class="nv">%arg</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">?</span><span class="k">type</span><span class="p">,</span> <span class="err">?</span><span class="k">type</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">?</span><span class="k">type</span><span class="p">)</span> <span class="nv">%0</span><span class="p">,</span> <span class="nv">%1</span> <span class="p">=</span> <span class="s">"nv_tileas.async.pipeline.consumer_read"</span><span class="p">(</span><span class="nv">%arg</span><span class="p">,</span> <span 
class="nv">%nv_tileas.async.pipeline.consumer_wait</span><span class="p">)</span> <span class="p">{</span><span class="err">consumer_idx</span><span class="p">}</span> <span class="p">{</span><span class="m">1</span> <span class="err">regions</span><span class="p">}</span> <span class="err">:</span> <span class="p">(</span><span class="err">?</span><span class="k">type</span><span class="p">,</span> <span class="err">?</span><span class="k">type</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">?</span><span class="k">type</span><span class="p">,</span> <span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="s">"nv_tileas.async.pipeline.consumer_release"</span><span class="p">(</span><span class="nv">%arg</span><span class="p">,</span> <span class="nv">%nv_tileas.async.pipeline.consumer_read</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">?</span><span class="k">type</span><span class="p">,</span> <span class="err">?</span><span class="k">type</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">()</span> <span class="err">//</span> <span class="err">Dot</span> <span class="err">operations</span> <span class="p">(</span><span class="m">100</span><span class="err">+</span> <span class="err">for</span> <span class="err">tiled</span> <span class="err">matrix</span> <span class="err">multiply</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileas.dot"</span><span class="p">(</span><span class="nv">%nv_tileas.extract_slice</span><span class="p">,</span> <span class="nv">%nv_tileas.extract_slice</span><span class="p">,</span> <span class="nv">%arg</span><span class="p">,</span> <span class="nv">%arith.constant</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span 
class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">//</span> <span class="p">...</span> <span class="p">(</span><span class="err">repeated</span> <span class="err">for</span> <span class="err">all</span> <span class="err">tile</span> <span class="err">partitions</span><span class="p">)</span> <span class="err">//</span> <span class="err">TMA</span> <span class="err">operations</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileas.make_tiled_tma_desc"</span><span class="p">(</span><span class="nv">%nv_tileaa.make_memref</span><span class="p">)</span> <span class="p">{</span><span class="err">tmaIdx</span><span class="p">}</span> <span class="err">:</span> <span class="p">(</span><span class="err">aa</span><span class="p">.</span><span class="err">btile</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">?</span><span class="k">type</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileas.async.tiled_tma_load"</span><span class="p">(</span><span class="nv">%nv_tileaa.block_tile</span><span class="p">,</span> <span class="nv">%arg</span><span class="p">,</span> <span class="nv">%nv_tileas.make_tiled_tma_desc</span><span class="p">,</span> <span class="nv">%nv_tileaa.extract</span><span class="p">,</span> <span class="nv">%arg</span><span class="p">,</span> <span class="nv">%nv_tileaa.extract</span><span class="p">)</span> <span class="err">:</span> <span class="p">(...)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span 
class="err">?</span><span class="k">type</span><span class="p">)</span> <span class="err">//</span> <span class="err">Output</span> <span class="err">assembly</span> <span class="p">(</span><span class="m">32</span> <span class="err">insert_slice</span> <span class="err">for</span> <span class="err">output</span> <span class="err">tiles</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileas.insert_slice"</span><span class="p">(</span><span class="nv">%nv_tileaa.fp_to_fp</span><span class="p">,</span> <span class="nv">%nv_tileas.alloc_tensor</span><span class="p">,</span> <span class="nv">%arith.constant</span><span class="p">,</span> <span class="nv">%arith.constant</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">iN</span><span class="p">,</span> <span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">//</span> <span class="p">...</span> <span class="p">(</span><span class="err">repeated</span> <span class="m">32</span> <span class="err">times</span><span class="p">)</span> </code></pre></div> </div> </details> <details style="font-size: 1.2em; margin: 1em 0;"> <summary style="cursor: pointer; font-weight: bold;">NVVM Dialect IR</summary> <div class="language-llvm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="err">//</span> <span class="err">nvvm</span> <span class="err">dialect</span> <span class="err">operations</span> <span class="err">//</span> <span class="err">NVVM</span> <span class="p">(</span><span class="err">NVIDIA</span> <span class="err">PTX</span> <span class="err">intrinsics</span> <span class="err">in</span> <span 
class="err">MLIR</span> <span class="err">form</span><span class="p">)</span> <span class="err">//</span> <span class="p">===</span> <span class="err">Barrier</span> <span class="k">and</span> <span class="err">Fence</span> <span class="err">Operations</span> <span class="p">===</span> <span class="s">"nvvm.fence.mbarrier.init"</span><span class="p">()</span> <span class="s">"nvvm.barrier"</span><span class="p">()</span> <span class="s">"nvvm.fence.proxy"</span><span class="p">()</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nvvm.read.ptx.sreg.clusterid.x"</span><span class="p">()</span> <span class="err">:</span> <span class="p">()</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="kt">i32</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nvvm.read.ptx.sreg.tid.x"</span><span class="p">()</span> <span class="err">:</span> <span class="p">()</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="kt">i32</span><span class="p">)</span> <span class="err">//</span> <span class="p">===</span> <span class="err">Async</span> <span class="err">Global→Shared</span> <span class="err">Copies</span> <span class="p">(</span><span class="m">136</span> <span class="err">instances</span><span class="p">)</span> <span class="p">===</span> <span class="s">"nvvm.cp.async.shared.global"</span><span class="p">(</span><span class="nv">%ptr</span><span class="p">,</span> <span class="nv">%src</span><span class="p">,</span> <span class="nv">%predicate</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="kt">ptr</span><span class="p">&lt;</span><span class="m">3</span><span class="p">&gt;,</span> <span class="kt">ptr</span><span class="p">&lt;</span><span class="m">1</span><span class="p">&gt;,</span> <span class="kt">i1</span><span class="p">)</span> <span 
class="err">-</span><span class="p">&gt;</span> <span class="p">()</span> <span class="err">//</span> <span class="p">===</span> <span class="err">Tensor</span> <span class="err">Core</span> <span class="err">Data</span> <span class="err">Packing</span> <span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="m">088</span> <span class="err">instances</span><span class="p">)</span> <span class="p">===</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nvvm.cvt.packfloat.f32"</span><span class="p">(</span><span class="nv">%a</span><span class="p">,</span> <span class="nv">%b</span><span class="p">,</span> <span class="nv">%mode</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">f32</span><span class="p">,</span> <span class="err">f32</span><span class="p">,</span> <span class="kt">i32</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="kt">i32</span><span class="p">)</span> <span class="err">//</span> <span class="p">===</span> <span class="err">Memory</span> <span class="err">Barriers</span> <span class="p">(</span><span class="m">66</span> <span class="err">instances</span><span class="p">)</span> <span class="p">===</span> <span class="s">"nvvm.mbarrier.init.shared"</span><span class="p">(</span><span class="nv">%barrier</span><span class="p">,</span> <span class="nv">%count</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="kt">ptr</span><span class="p">&lt;</span><span class="m">3</span><span class="p">&gt;,</span> <span class="kt">i32</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">()</span> <span class="s">"nvvm.mbarrier.arrive.shared"</span><span class="p">(</span><span class="nv">%barrier</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span 
class="kt">ptr</span><span class="p">&lt;</span><span class="m">3</span><span class="p">&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">()</span> <span class="s">"nvvm.mbarrier.wait.shared"</span><span class="p">(</span><span class="nv">%barrier</span><span class="p">,</span> <span class="nv">%phase</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="kt">ptr</span><span class="p">&lt;</span><span class="m">3</span><span class="p">&gt;,</span> <span class="kt">i32</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">()</span> <span class="err">//</span> <span class="p">===</span> <span class="err">Matrix</span> <span class="err">Load</span> <span class="err">Operations</span> <span class="p">(</span><span class="m">512</span> <span class="err">instances</span><span class="p">)</span> <span class="p">===</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nvvm.ldmatrix"</span><span class="p">(</span><span class="nv">%ptr</span><span class="p">)</span> <span class="p">{</span><span class="err">layout</span> <span class="p">=</span> <span class="err">#nvvm</span><span class="p">.</span><span class="err">mma_layout</span><span class="p">&lt;</span><span class="err">row</span><span class="p">&gt;,</span> <span class="err">num</span> <span class="p">=</span> <span class="m">4</span><span class="p">}</span> <span class="err">:</span> <span class="p">(</span><span class="kt">ptr</span><span class="p">&lt;</span><span class="m">3</span><span class="p">&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="err">vector</span><span class="p">&lt;</span><span class="m">4</span><span class="p">x</span><span class="kt">i32</span><span class="p">&gt;</span> <span class="err">//</span> <span class="p">===</span> <span class="err">Tensor</span> <span class="err">Core</span> <span class="err">MMA</span> <span 
class="p">(</span><span class="m">512</span> <span class="err">instances</span><span class="p">)</span> <span class="p">===</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nvvm.mma.sync"</span><span class="p">(</span><span class="nv">%a</span><span class="p">,</span> <span class="nv">%b</span><span class="p">,</span> <span class="nv">%c</span><span class="p">)</span> <span class="p">{</span> <span class="err">layoutA</span> <span class="p">=</span> <span class="err">#nvvm</span><span class="p">.</span><span class="err">mma_layout</span><span class="p">&lt;</span><span class="err">row</span><span class="p">&gt;,</span> <span class="err">layoutB</span> <span class="p">=</span> <span class="err">#nvvm</span><span class="p">.</span><span class="err">mma_layout</span><span class="p">&lt;</span><span class="err">col</span><span class="p">&gt;,</span> <span class="err">shape</span> <span class="p">=</span> <span class="err">#nvvm</span><span class="p">.</span><span class="err">shape</span><span class="p">&lt;</span><span class="err">m</span> <span class="p">=</span> <span class="m">16</span><span class="p">,</span> <span class="err">n</span> <span class="p">=</span> <span class="m">8</span><span class="p">,</span> <span class="err">k</span> <span class="p">=</span> <span class="m">16</span><span class="p">&gt;</span> <span class="p">}</span> <span class="err">:</span> <span class="p">(</span><span class="err">vector</span><span class="p">&lt;</span><span class="m">4</span><span class="p">x</span><span class="kt">i32</span><span class="p">&gt;,</span> <span class="err">vector</span><span class="p">&lt;</span><span class="m">2</span><span class="p">x</span><span class="kt">i32</span><span class="p">&gt;,</span> <span class="err">vector</span><span class="p">&lt;</span><span class="m">4</span><span class="p">x</span><span class="err">f32</span><span class="p">&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span 
class="err">vector</span><span class="p">&lt;</span><span class="m">4</span><span class="p">x</span><span class="err">f32</span><span class="p">&gt;</span> <span class="err">//</span> <span class="p">...</span> <span class="p">(</span><span class="m">2</span><span class="p">,</span><span class="m">977</span> <span class="err">lines</span> <span class="err">total</span> <span class="err">-</span> <span class="err">tensor</span> <span class="err">core</span> <span class="err">operations</span><span class="p">,</span> <span class="err">barriers</span><span class="p">,</span> <span class="err">memory</span> <span class="err">ops</span><span class="p">)</span> </code></pre></div> </div> </details> <details style="font-size: 1.2em; margin: 1em 0;"> <summary style="cursor: pointer; font-weight: bold;">LLVM IR / NVVM IR</summary> <div class="language-llvm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">; ModuleID = 'LLVMDialectModule'</span> <span class="k">target</span> <span class="k">datalayout</span> <span class="p">=</span> <span class="s">"e-p:64:64:64-p3:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64"</span> <span class="k">target</span> <span class="k">triple</span> <span class="p">=</span> <span class="s">"nvptx64-nvidia-cuda"</span> <span class="c1">; Kernel entry point with TMA descriptors</span> <span class="k">define</span> <span class="k">ptx_kernel</span> <span class="kt">void</span> <span class="vg">@fused_moe_kernel</span><span class="p">(</span> <span class="kt">ptr</span> <span class="k">addrspace</span><span class="p">(</span><span class="m">1</span><span class="p">)</span> <span class="nv">%A</span><span class="p">,</span> <span class="c1">; Input tokens</span> <span class="kt">ptr</span> <span class="k">addrspace</span><span class="p">(</span><span class="m">1</span><span class="p">)</span> <span class="nv">%B</span><span class="p">,</span> <span class="c1">; Expert weights</span> <span 
class="kt">ptr</span> <span class="k">addrspace</span><span class="p">(</span><span class="m">1</span><span class="p">)</span> <span class="nv">%C</span><span class="p">,</span> <span class="c1">; Output</span> <span class="kt">ptr</span> <span class="k">addrspace</span><span class="p">(</span><span class="m">1</span><span class="p">)</span> <span class="nv">%topk_weights</span><span class="p">,</span> <span class="kt">ptr</span> <span class="k">addrspace</span><span class="p">(</span><span class="m">1</span><span class="p">)</span> <span class="nv">%sorted_token_ids</span><span class="p">,</span> <span class="kt">ptr</span> <span class="k">addrspace</span><span class="p">(</span><span class="m">1</span><span class="p">)</span> <span class="nv">%sorted_expert_ids</span><span class="p">,</span> <span class="kt">i32</span> <span class="nv">%num_token_replicas</span><span class="p">,</span> <span class="kt">i1</span> <span class="nv">%mul_routed_weight</span><span class="p">,</span> <span class="c1">; ... 
TMA descriptors appended by tileas-attach-tma-desc-args</span> <span class="p">)</span> <span class="vg">#0</span> <span class="p">{</span> <span class="nl">entry:</span> <span class="c1">; Get cluster/block/thread IDs</span> <span class="nv">%clusterid</span> <span class="p">=</span> <span class="k">call</span> <span class="kt">i32</span> <span class="vg">@llvm.nvvm.read.ptx.sreg.clusterid.x</span><span class="p">()</span> <span class="nv">%tid</span> <span class="p">=</span> <span class="k">call</span> <span class="err">range</span><span class="p">(</span><span class="kt">i32</span> <span class="m">0</span><span class="p">,</span> <span class="m">384</span><span class="p">)</span> <span class="kt">i32</span> <span class="vg">@llvm.nvvm.read.ptx.sreg.tid.x</span><span class="p">()</span> <span class="c1">; Initialize barriers for async pipeline</span> <span class="k">call</span> <span class="kt">void</span> <span class="vg">@llvm.nvvm.mbarrier.init.shared</span><span class="p">(</span><span class="kt">ptr</span> <span class="k">addrspace</span><span class="p">(</span><span class="m">3</span><span class="p">)</span> <span class="nv">%barrier</span><span class="p">,</span> <span class="kt">i32</span> <span class="m">128</span><span class="p">)</span> <span class="c1">; Async copy from global to shared memory</span> <span class="k">call</span> <span class="kt">void</span> <span class="vg">@llvm.nvvm.cp.async.shared.global</span><span class="p">(</span> <span class="kt">ptr</span> <span class="k">addrspace</span><span class="p">(</span><span class="m">3</span><span class="p">)</span> <span class="nv">%shared_dst</span><span class="p">,</span> <span class="kt">ptr</span> <span class="k">addrspace</span><span class="p">(</span><span class="m">1</span><span class="p">)</span> <span class="nv">%global_src</span><span class="p">,</span> <span class="kt">i32</span> <span class="m">16</span><span class="p">,</span> <span class="c1">; bytes</span> <span class="kt">i1</span> 
<span class="nv">%pred</span> <span class="c1">; predicate</span> <span class="p">)</span> <span class="c1">; Tensor core matrix multiply</span> <span class="nv">%result</span> <span class="p">=</span> <span class="k">call</span> <span class="p">&lt;</span><span class="m">4</span> <span class="p">x</span> <span class="kt">float</span><span class="p">&gt;</span> <span class="vg">@llvm.nvvm.mma.m16n8k16.row.col.f32.f16.f16.f32</span><span class="p">(</span> <span class="p">&lt;</span><span class="m">4</span> <span class="p">x</span> <span class="kt">i32</span><span class="p">&gt;</span> <span class="nv">%a_frag</span><span class="p">,</span> <span class="p">&lt;</span><span class="m">2</span> <span class="p">x</span> <span class="kt">i32</span><span class="p">&gt;</span> <span class="nv">%b_frag</span><span class="p">,</span> <span class="p">&lt;</span><span class="m">4</span> <span class="p">x</span> <span class="kt">float</span><span class="p">&gt;</span> <span class="nv">%c_frag</span> <span class="p">)</span> <span class="c1">; ... 
(full pipeline with producer/consumer synchronization)</span> <span class="p">}</span> <span class="c1">; NVVM intrinsic declarations</span> <span class="k">declare</span> <span class="kt">i32</span> <span class="vg">@llvm.nvvm.read.ptx.sreg.tid.x</span><span class="p">()</span> <span class="k">declare</span> <span class="kt">i32</span> <span class="vg">@llvm.nvvm.read.ptx.sreg.clusterid.x</span><span class="p">()</span> <span class="k">declare</span> <span class="kt">void</span> <span class="vg">@llvm.nvvm.mbarrier.init.shared</span><span class="p">(</span><span class="kt">ptr</span> <span class="k">addrspace</span><span class="p">(</span><span class="m">3</span><span class="p">),</span> <span class="kt">i32</span><span class="p">)</span> <span class="k">declare</span> <span class="kt">void</span> <span class="vg">@llvm.nvvm.cp.async.shared.global</span><span class="p">(</span><span class="kt">ptr</span> <span class="k">addrspace</span><span class="p">(</span><span class="m">3</span><span class="p">),</span> <span class="kt">ptr</span> <span class="k">addrspace</span><span class="p">(</span><span class="m">1</span><span class="p">),</span> <span class="kt">i32</span><span class="p">,</span> <span class="kt">i1</span><span class="p">)</span> <span class="k">declare</span> <span class="p">&lt;</span><span class="m">4</span> <span class="p">x</span> <span class="kt">float</span><span class="p">&gt;</span> <span class="vg">@llvm.nvvm.mma.m16n8k16.row.col.f32.f16.f16.f32</span><span class="p">(&lt;</span><span class="m">4</span> <span class="p">x</span> <span class="kt">i32</span><span class="p">&gt;,</span> <span class="p">&lt;</span><span class="m">2</span> <span class="p">x</span> <span class="kt">i32</span><span class="p">&gt;,</span> <span class="p">&lt;</span><span class="m">4</span> <span class="p">x</span> <span class="kt">float</span><span class="p">&gt;)</span> </code></pre></div> </div> </details> <details style="font-size: 1.2em; margin: 1em 0;"> <summary 
style="cursor: pointer; font-weight: bold;">PTX Assembly</summary> <div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">//</span> <span class="o">//</span> <span class="nf">Generated</span> <span class="nv">by</span> <span class="nv">NVIDIA</span> <span class="nv">NVVM</span> <span class="nv">Compiler</span> <span class="o">//</span> <span class="nf">Cuda</span> <span class="nv">compilation</span> <span class="nv">tools</span><span class="p">,</span> <span class="nv">release</span> <span class="mf">13.1</span><span class="p">,</span> <span class="nv">V13.1.80</span> <span class="o">//</span> <span class="nf">Based</span> <span class="nv">on</span> <span class="nv">NVVM</span> <span class="mf">21.0</span><span class="nv">.0</span> <span class="o">//</span> <span class="nf">.version</span> <span class="mf">9.1</span> <span class="nf">.target</span> <span class="nv">sm_120a</span> <span class="nf">.address_size</span> <span class="mi">64</span> <span class="nf">.visible</span> <span class="nv">.entry</span> <span class="nv">fused_moe_kernel</span><span class="p">(</span> <span class="nf">.param</span> <span class="nv">.u64</span> <span class="nv">.ptr</span> <span class="nv">.global</span> <span class="nv">.align</span> <span class="mi">1</span> <span class="nv">fused_moe_kernel_param_0</span><span class="p">,</span> <span class="nf">.param</span> <span class="nv">.u32</span> <span class="nv">fused_moe_kernel_param_1</span><span class="p">,</span> <span class="o">//</span> <span class="nf">...</span> <span class="mi">31</span> <span class="nv">parameters</span> <span class="nv">total</span> <span class="nv">including</span> <span class="nv">TMA</span> <span class="nv">descriptors</span> <span class="nf">.hidden</span> <span class="nv">.param</span> <span class="nv">.align</span> <span class="mi">64</span> <span class="nv">.b8</span> <span class="nv">fused_moe_kernel_param_31</span><span 
class="p">[</span><span class="mi">128</span><span class="p">]</span> <span class="p">)</span> <span class="nf">.reqntid</span> <span class="mi">384</span> <span class="nf">.minnctapersm</span> <span class="mi">1</span> <span class="err">{</span> <span class="nf">.reg</span> <span class="nv">.pred</span> <span class="o">%</span><span class="nv">p</span><span class="o">&lt;</span><span class="mi">306</span><span class="o">&gt;</span><span class="c1">;</span> <span class="nf">.reg</span> <span class="nv">.b16</span> <span class="o">%</span><span class="nv">rs</span><span class="o">&lt;</span><span class="mi">500</span><span class="o">&gt;</span><span class="c1">;</span> <span class="nf">.reg</span> <span class="nv">.b32</span> <span class="o">%</span><span class="nv">r</span><span class="o">&lt;</span><span class="mi">4905</span><span class="o">&gt;</span><span class="c1">;</span> <span class="nf">.reg</span> <span class="nv">.b64</span> <span class="o">%</span><span class="nv">rd</span><span class="o">&lt;</span><span class="mi">348</span><span class="o">&gt;</span><span class="c1">;</span> <span class="o">//</span> <span class="err">80</span><span class="nf">KB</span> <span class="nv">shared</span> <span class="nv">memory</span> <span class="nv">for</span> <span class="nv">double</span> <span class="nv">buffering</span> <span class="nf">.shared</span> <span class="nv">.align</span> <span class="mi">128</span> <span class="nv">.b8</span> <span class="nv">global_smem</span><span class="p">[</span><span class="mi">82032</span><span class="p">]</span><span class="c1">;</span> <span class="o">//</span> <span class="err">===</span> <span class="nf">Barrier</span> <span class="nv">Initialization</span> <span class="err">===</span> <span class="nf">mbarrier.init.shared.b64</span> <span class="p">[</span><span class="nv">global_smem</span><span class="o">+</span><span class="mi">82000</span><span class="p">],</span> <span class="o">%</span><span class="nv">r2369</span><span 
class="c1">;</span> <span class="nf">mbarrier.init.shared.b64</span> <span class="p">[</span><span class="nv">global_smem</span><span class="o">+</span><span class="mi">82008</span><span class="p">],</span> <span class="o">%</span><span class="nv">r2369</span><span class="c1">;</span> <span class="o">//</span> <span class="err">===</span> <span class="nf">Matrix</span> <span class="nv">Load</span> <span class="p">(</span><span class="nv">ldmatrix</span> <span class="nv">for</span> <span class="nv">tensor</span> <span class="nv">cores</span><span class="p">)</span> <span class="err">===</span> <span class="nf">ldmatrix.sync.aligned.m8n8.x4.shared.b16</span> <span class="err">{</span><span class="o">%</span><span class="nv">r4645</span><span class="p">,</span> <span class="o">%</span><span class="nv">r4646</span><span class="p">,</span> <span class="o">%</span><span class="nv">r4647</span><span class="p">,</span> <span class="o">%</span><span class="nv">r4648</span><span class="err">}</span><span class="p">,</span> <span class="p">[</span><span class="o">%</span><span class="nv">r2789</span><span class="p">]</span><span class="c1">;</span> <span class="nf">ldmatrix.sync.aligned.m8n8.x4.shared.b16</span> <span class="err">{</span><span class="o">%</span><span class="nv">r4649</span><span class="p">,</span> <span class="o">%</span><span class="nv">r4650</span><span class="p">,</span> <span class="o">%</span><span class="nv">r4651</span><span class="p">,</span> <span class="o">%</span><span class="nv">r4652</span><span class="err">}</span><span class="p">,</span> <span class="p">[</span><span class="o">%</span><span class="nv">r2793</span><span class="p">]</span><span class="c1">;</span> <span class="nf">ldmatrix.sync.aligned.m8n8.x4.shared.b16</span> <span class="err">{</span><span class="o">%</span><span class="nv">r4653</span><span class="p">,</span> <span class="o">%</span><span class="nv">r4654</span><span class="p">,</span> <span class="o">%</span><span 
class="nv">r4655</span><span class="p">,</span> <span class="o">%</span><span class="nv">r4656</span><span class="err">}</span><span class="p">,</span> <span class="p">[</span><span class="o">%</span><span class="nv">r2797</span><span class="p">]</span><span class="c1">;</span> <span class="nf">ldmatrix.sync.aligned.m8n8.x4.shared.b16</span> <span class="err">{</span><span class="o">%</span><span class="nv">r4657</span><span class="p">,</span> <span class="o">%</span><span class="nv">r4658</span><span class="p">,</span> <span class="o">%</span><span class="nv">r4659</span><span class="p">,</span> <span class="o">%</span><span class="nv">r4660</span><span class="err">}</span><span class="p">,</span> <span class="p">[</span><span class="o">%</span><span class="nv">r2801</span><span class="p">]</span><span class="c1">;</span> <span class="o">//</span> <span class="nf">...</span> <span class="p">(</span><span class="mi">512</span> <span class="nv">ldmatrix</span> <span class="nv">instructions</span> <span class="nv">total</span><span class="p">)</span> <span class="o">//</span> <span class="err">===</span> <span class="nf">Tensor</span> <span class="nv">Core</span> <span class="nv">MMA</span> <span class="p">(</span><span class="nv">HMMA</span><span class="p">)</span> <span class="err">===</span> <span class="o">//</span> <span class="nl">Note:</span> <span class="nf">sm_120a</span> <span class="nv">uses</span> <span class="nv">wgmma</span><span class="o">/</span><span class="nv">tcgen05</span> <span class="nv">instructions</span> <span class="nv">in</span> <span class="nv">SASS</span> <span class="o">//</span> <span class="nf">PTX</span> <span class="nv">shows</span> <span class="nv">the</span> <span class="nv">portable</span> <span class="nv">mma.sync</span> <span class="nv">form</span> <span class="nf">mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32</span> <span class="err">{</span><span class="o">%</span><span class="nf">f1</span><span class="p">,</span> <span 
class="o">%</span><span class="nv">f2</span><span class="p">,</span> <span class="o">%</span><span class="nv">f3</span><span class="p">,</span> <span class="o">%</span><span class="nv">f4</span><span class="err">}</span><span class="p">,</span> <span class="err">{</span><span class="o">%</span><span class="nf">r4645</span><span class="p">,</span> <span class="o">%</span><span class="nv">r4646</span><span class="p">,</span> <span class="o">%</span><span class="nv">r4647</span><span class="p">,</span> <span class="o">%</span><span class="nv">r4648</span><span class="err">}</span><span class="p">,</span> <span class="err">{</span><span class="o">%</span><span class="nf">r4709</span><span class="p">,</span> <span class="o">%</span><span class="nv">r4710</span><span class="err">}</span><span class="p">,</span> <span class="err">{</span><span class="o">%</span><span class="nf">f1</span><span class="p">,</span> <span class="o">%</span><span class="nv">f2</span><span class="p">,</span> <span class="o">%</span><span class="nv">f3</span><span class="p">,</span> <span class="o">%</span><span class="nv">f4</span><span class="err">}</span><span class="c1">;</span> <span class="o">//</span> <span class="nf">...</span> <span class="p">(</span><span class="mi">512</span> <span class="nv">mma.sync</span> <span class="nv">instructions</span> <span class="nv">total</span><span class="p">)</span> <span class="o">//</span> <span class="err">===</span> <span class="nf">Async</span> <span class="nv">Copy</span> <span class="p">(</span><span class="nv">cp.async</span> <span class="nv">for</span> <span class="nv">global</span><span class="err">→</span><span class="nv">shared</span><span class="p">)</span> <span class="err">===</span> <span class="nf">cp.async.cg.shared.global</span> <span class="p">[</span><span class="o">%</span><span class="nv">r2856</span><span class="p">],</span> <span class="p">[</span><span class="o">%</span><span class="nv">rd112</span><span class="p">],</span> 
<span class="mi">16</span><span class="p">,</span> <span class="o">%</span><span class="nv">p116</span><span class="c1">;</span> <span class="nf">cp.async.cg.shared.global</span> <span class="p">[</span><span class="o">%</span><span class="nv">r2857</span><span class="p">],</span> <span class="p">[</span><span class="o">%</span><span class="nv">rd113</span><span class="p">],</span> <span class="mi">16</span><span class="p">,</span> <span class="o">%</span><span class="nv">p116</span><span class="c1">;</span> <span class="o">//</span> <span class="nf">...</span> <span class="p">(</span><span class="mi">136</span> <span class="nv">cp.async</span> <span class="nv">instructions</span> <span class="nv">total</span><span class="p">)</span> <span class="o">//</span> <span class="err">===</span> <span class="nf">Barrier</span> <span class="nv">Synchronization</span> <span class="err">===</span> <span class="nf">mbarrier.arrive.shared.b64</span> <span class="nv">_</span><span class="p">,</span> <span class="p">[</span><span class="nv">global_smem</span><span class="o">+</span><span class="mi">82000</span><span class="p">]</span><span class="c1">;</span> <span class="nf">mbarrier.try_wait.parity.shared.b64</span> <span class="o">%</span><span class="nv">p117</span><span class="p">,</span> <span class="p">[</span><span class="nv">global_smem</span><span class="o">+</span><span class="mi">82000</span><span class="p">],</span> <span class="o">%</span><span class="nv">r2371</span><span class="c1">;</span> <span class="err">}</span> </code></pre></div> </div> </details> <h1 id="citation">Citation</h1> <p>To cite this article:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@article{zhu2026tileir, title = {NVIDIA TileIR Internals: from CuTile to MLIR/LLVM to SASS}, author = {Zhu, Henry}, journal = {maknee.github.io}, year = {2026}, month = {January}, url = 
"https://maknee.github.io/blog/2026/NVIDIA-TileIR-Internals-from-CuTile-to-MLIR-LLVM-to-SASS/" } </code></pre></div></div> Performance Hints 2026-01-26T12:00:00+00:00 2026-01-26T12:00:00+00:00 https://maknee.github.io/blog/2026/Performance-Hints <!-- --> <p>This post will be about going through <a href="https://abseil.io/fast/hints.html#performance-hints">https://abseil.io/fast/hints.html#performance-hints</a>, a blog post written by the power duo Jeff Dean and Sanjay Ghemawat who argubly made google to what it is today. This is a knowledge distillation from the both of them with many examples from the internal codebase. Hopefully I can a thing or two professionals who have worked in the industry longer than I have been alive</p> <h1 id="reflection-after-reading-this-post">Reflection after reading this post</h1> <p>Start at <a href="#performance-hints">Performance Hints</a> to see me go through the post while I’m reading through it. This short section is my takeaways from reading it.</p> <p>TLDR, this post is about why you should build such an intuition and showing many outcomes from snippets of experience.</p> <p>I think the intro was very very well written and puts some key points about thinking about performance into perspective.</p> <p>The early sections, especially in “The importance of thinking about performance” and “Estimation” provides small window into how to think about performance as a sort of life-style choice (ie, having a habit of incorporating performance before and while the project is going rather than after). 
The motivations for why one should sometimes think in such a manner vary, but the authors argue that down the line you face consequences, or even bigger time sinks, that could have been avoided in the first place (a harder time spotting the issues due to complexity, time sunk communicating with people complaining about what you wrote, the difficulty of changing an existing library for performance gains, using expensive band-aids to solve performance issues).</p> <p>Estimation has been and always will be important. It’s one way to judge whether your intuition is right or not (guess, run the experiment, ask “am I wrong?”). And most likely, for me, it’s wrong. One tricky thing to spot is when something sounds right but is wrong. Another habit that is hard to build is the “am I wrong” part, where I get lost in the sauce of doing something, say “I’m done, ok, let’s move on to the next thing”, and never ask “was I wrong initially?” to see where my estimations went astray, which can trickle down to actually doing the thing properly. And I think this should apply generally to anything, but I haven’t written and measured it outside of the work I do.</p> <p>Detailed example sections that are new to me and seem useful: “What to do when profiles are flat”, “Code size considerations”, “Parallelization and synchronization”, “CLs that demonstrate multiple techniques”.</p> <h2 id="side-notes-my-thoughts">Side notes (my thoughts)</h2> <p>One thing I now especially think about is the cost associated with performance. People typically talk about running services at scale and how many machines are needed for system X to run properly, but I believe it is just as important to look at a single node and its resources. These resources are repeated and scaled too. The number of cores is now 64, 128, or 256, and don’t even get me started on GPU cores. How many GB/s can memory or disk transfer within a node? 
Then any improvement in compute or transfer on a single node trickles down a bit to a cloud-native setting, and is most likely easier to profile and debug.</p> <p>So, ironically, although chips have gotten faster and faster and resources (memory, disk) have gotten cheaper, we still care about performance. Is it cost? Usability? Or do we face new applications that require more performance?</p> <p>What about power? Power is seemingly becoming more and more of a concern with AI in the GPU/hardware space, and it can result in <a href="https://modal.com/blog/gpu-health">errors on the chip</a>. Or was it already? After all, the main costs after building the chips, racks, and datacenters are power and maintenance. It seems like the only way performance can affect power consumption is indirectly, either through eliminating or doing less work (basically improving the algorithm). And sometimes performance gains can increase the work done (more nodes could result in less latency). So the question I’m getting at is how to lower power while keeping throughput or latency steady (something like undervolting in the gamer space, where users tweak their hot GPUs to run at much lower power while keeping 95%+ of performance).</p> <h1 id="performance-hints">Performance Hints</h1> <h2 id="the-importance-of-thinking-about-performance">The importance of thinking about performance</h2> <p>This section is the introduction. Both authors have added very insightful yet succinct sentences that make me ponder.</p> <blockquote> <p>Knuth is often quoted out of context as saying premature optimization is the root of all evil. The full quote reads: “We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. 
Yet we should not pass up our opportunities in that critical 3%.” This document is about that critical 3%, and a more compelling quote, again from Knuth, reads:</p> </blockquote> <p>If you go to the <a href="https://dl.acm.org/doi/pdf/10.1145/356635.356640">link</a>, Knuth actually elaborates more on this: “… pass up our opportunities in that critical 3%. A good programmer will not be lulled into complacency by such reasoning, he will be wise to look carefully at the critical code; but only after that code has been identified. It is often a mistake to make a priori judgments about what parts of a program are really critical, since the universal experience of programmers who have been using measurement tools has been that their intuitive guesses fail. After working with such tools for seven years, I’ve become convinced that all compilers written from now on should be designed to provide all programmers with feedback indicating what parts of their programs are costing the most; indeed, this feedback should be supplied automatically unless it has been specifically turned off”</p> <p>This was published on December 01, 1974. And yet the problem hasn’t been solved. What makes me believe that AI will solve this when it hasn’t been solved in the 50+ years since this was written? What makes this such a hard problem?</p> <p>Is it “just” telling someone “hey buddy, the code written/generated here is slow” and then telling the AI to fix it? And what makes me believe that AI will find that critical 3%? Or maybe that 3% doesn’t actually matter for most people; it just matters for critical pieces of code like postgres or mongodb? Or if you flip it, maybe the 3% matters a lot because it’s used by 99% of people ~ xkcd image below:</p> <p><img src="https://imgs.xkcd.com/comics/dependency_2x.png" alt="xkcd" /></p> <blockquote> <p>Many people will say “let’s write down the code in as simple a way as possible and deal with performance later when we can profile”.
However, this approach is often wrong:</p> </blockquote> <p>When I first read this, I thought: spot on. But doing this is difficult. How can you think about performance ahead of time? First, does performance matter? Always yes! It matters to some degree, whether cost, usability, etc. And second, what type of performance is needed, and to what degree? Are we focused on latency for usability? Reliability? Cost? Adaptability? And for each, how much time do we pour in, and what is a good target number to reach? These are difficult questions to think about ahead of time without many years of experience, not just touching one project in depth, but many, each with different goals and purposes.</p> <p>The baseline for knowing this is napkin math. I still need to work on that and integrate it into the programs I’m working on. And I believe that holds for some things outside of computer science. If you’re putting money into, say, a stock, <em>ideally</em> you have some idea of what’s going to happen and give an educated guess. Or maybe you need to estimate if you’re going to travel with multiple people; I don’t think yoloing the trip will make the majority of people happy in most cases.</p> <blockquote> <p>If you disregard all performance concerns when developing a large system, you will end up with a flat profile where there are no obvious hotspots because performance is lost all over the place. It will be difficult to figure out how to get started on performance improvements.</p> </blockquote> <p>This is very true. One thing touches another and another and propagates. Let’s say the problem is <a href="https://youtu.be/IxkSlnrRFqc?t=1483">TCP window size</a>.</p> <p>For example, you’re serving a GET request in nodejs for a website and, wow, it’s taking 1-2s from US east to west. You start adding print lines to the code to get time measurements.
Hmm, it seems like this fetch from the db is taking a while: <code class="language-plaintext highlighter-rouge">await db.query(...)</code>. Maybe it’s the db. You change the query to something simple, <code class="language-plaintext highlighter-rouge">await db.query(SELECT ... COUNT 1)</code>, and, oh, it’s better. Then you could optimize that query and bam, queries are ~500ms, which is somewhat reasonable.</p> <p>But maybe you dig a little differently (not necessarily deeper). You return some dummy result instead of the db query. Oh? It’s faster? Hmm. By a stroke of luck while messing around, you try a big dummy return and you see that it’s 1-2s. What’s happening? Ask AI, etc., and maybe you get TCP window size. The initial window is ~15KB for one RTT, and the data you’re sending is like 1MB-2MB. So you have to somehow compress your data (hopefully it works) or return less data.</p> <p>Similar to the gym, switching too many variables (like workouts) at once can make it difficult to pinpoint what’s going on.</p> <blockquote> <p>If you are developing a library that will be used by other people, the people who will run into performance problems will likely be people who cannot easily make performance improvements (they will have to understand the details of code written by other people/teams, and have to negotiate with them about the importance of performance optimizations).</p> </blockquote> <p>This is the other part I’m less experienced with. I think one can get experience seeing this by working in open source or big tech, where people care about / have an incentive to improve a project. I wonder why others cannot easily make the perf improvements to someone else’s library? Many people don’t have the time or reason to look deeper, which usually doesn’t give an obvious big net benefit (not to say that it gives a net benefit at all!).</p> <p>I guess the question is how you can make it usable. One obvious answer is feedback. But how do you get effective feedback?
Is it just talking to people who complain about it not working and trying to decipher what that means?</p> <p>A business will face this issue with people. People don’t care about what goes on inside the product. They want it to work for their specific use case because it’s easier (cost and time) than doing it themselves.</p> <blockquote> <p>It is harder to make significant changes to a system when it is in heavy use.</p> </blockquote> <p>Another part that I’m not familiar with. Clearly big tech or open source is again where one can see this. One thing that sticks out is that you have to accommodate existing users and <em>try</em> to convince them to switch. An example of this is the python2 to python3 switch. I was kind of mad that you needed to write <code class="language-plaintext highlighter-rouge">print(...)</code> instead of <code class="language-plaintext highlighter-rouge">print ...</code>, because you need to type the <code class="language-plaintext highlighter-rouge">(</code> and <code class="language-plaintext highlighter-rouge">)</code> parens, and they were kinda hard to reach physically (having to press shift + 9) compared to space.</p> <p>And yet I think, for most things, it most likely has to change at some point. Not many things in life don’t change.</p> <p>For example, friendships of many years typically change one way or another.</p> <blockquote> <p>It is also hard to tell if there are performance problems that can be solved easily and so we end up with potentially expensive solutions like over-replication or severe overprovisioning of a service to handle load problems.</p> </blockquote> <p>Another area I’m not an expert in. One can guess and estimate issues, but honestly, it’s fucking hard.
Real applications typically have explosions in usage at certain times, and the <em>let’s solve this for now by X and vibe it with things I know</em> approach can produce patches that don’t solve the actual issue, and maybe you end up spending more time or money than necessary. But identifying whether to spend that time now or later is so difficult.</p> <blockquote> <p>Instead, we suggest that when writing code, try to choose the faster alternative if it does not impact readability/complexity of the code significantly.</p> </blockquote> <p>Not sure what to expect, but I will revisit these 4 key points when I’m done going through the rest.</p> <h2 id="estimation">Estimation</h2> <blockquote> <p>If you can develop an intuition for how much performance might matter in the code you are writing, you can make a more informed decision (e.g., how much extra complexity is warranted in the name of performance).</p> </blockquote> <p>Oh man, the word intuition. Ugh, it’s like the best word for what it describes, but how people learn and build an intuition varies per person.</p> <blockquote> <p>Is it test code? If so, you need to worry mostly about the asymptotic complexity of your algorithms and data structures. (Aside: development cycle time matters, so avoid writing tests that take a long time to run.) Is it code specific to an application? If so, try to figure out how much performance matters for this piece of code. This is typically not very hard: just figuring out whether code is initialization/setup code vs. code that will end up on hot paths (e.g., processing every request in a service) is often sufficient. Is it library code that will be used by many applications? In this case it is hard to tell how sensitive it might become. This is where it becomes especially important to follow some of the simple techniques described in this document.
For example, if you need to store a vector that usually has a small number of elements, use an absl::InlinedVector instead of std::vector. Such techniques are not very hard to follow and don’t add any non-local complexity to the system. And if it turns out that the code you are writing does end up using significant resources, it will be higher performance from the start. And it will be easier to find the next thing to focus on when looking at a profile.</p> </blockquote> <p>So my understanding is: think about the type of work being done in the application you are building, and follow generally good rules throughout the project, like drinking X amount of water per day (drinking more is generally good for you, for example).</p> <blockquote> <p>You can do a slightly deeper analysis when picking between options with potentially different performance characteristics by relying on back of the envelope calculations. Such calculations can quickly give a very rough estimate of the performance of different alternatives, and the results can be used to discard some of the alternatives without having to implement them.</p> </blockquote> <p>They finally mentioned it. Ok, let’s see what has changed in the ~20 years since Jeff first mentioned this.</p> <blockquote> <p>Here is how such an estimation might work: Estimate how many low-level operations of various kinds are required, e.g., number of disk seeks, number of network round-trips, bytes transmitted etc. Multiply each kind of expensive operation with its rough cost, and add the results together. The preceding gives the cost of the system in terms of resource usage. If you are interested in latency, and if the system has any concurrency, some of the costs may overlap and you may have to do slightly more complicated analysis to estimate the latency.</p> </blockquote> <p>Any transfer or movement of data should be counted, then multiplied by its cost (time or $), and summed to get the total estimated result.
The following table is one everyone has seen.</p> <div class="language-md highlighter-rouge"><div class="highlight"><pre class="highlight"><code>L1 cache reference                        0.5 ns
L2 cache reference                          3 ns
Branch mispredict                           5 ns
Mutex lock/unlock (uncontended)            15 ns
Main memory reference                      50 ns
Compress 1K bytes with Snappy           1,000 ns
Read 4KB from SSD                      20,000 ns
Round trip within same datacenter      50,000 ns
Read 1MB sequentially from memory      64,000 ns
Read 1MB over 100 Gbps network        100,000 ns
Read 1MB from SSD                   1,000,000 ns
Disk seek                           5,000,000 ns
Read 1MB sequentially from disk    10,000,000 ns
Send packet CA-&gt;Netherlands-&gt;CA   150,000,000 ns
</code></pre></div></div> <blockquote> <p>You may find it useful to also track estimated costs for higher-level operations relevant to your system. E.g., you might want to know the rough cost of a point read from your SQL database, the latency of interacting with a Cloud service, or the time to render a simple HTML page. If you don’t know the relevant cost of different operations, you can’t do decent back-of-the-envelope calculations!</p> </blockquote> <p>Yeah, I understand this a bit better now. It’s incredibly hard to track initially because it’s hard to know what’s important, and I haven’t used it consistently daily/weekly, etc.</p> <h3 id="example-time-to-quicksort-a-billion-4-byte-numbers">Example: Time to quicksort a billion 4 byte numbers</h3> <p>Before looking at the answer, I would like to ask myself: where would this be used, which components are mainly involved, and what is the majority of the cost (the bottleneck)?</p> <p>Maybe we have many time durations (say from multiple services) and would like to plot a histogram of latencies for a webUI query.</p> <p>Components: memory, cpu. 1B * 4 bytes = 4GB of data, which is kinda tiny by today’s standards (one machine can handle it).</p> <p>So let’s say it’s already in memory, not on disk.
The quickest case is if the data is already sorted and we’re only reading each element and writing it back to another piece of memory. So 50ns * 1B * 2 (one read and one write), so 5s * 2 = 10s?</p> <p>To be honest, it’s probably more. Say all of the elements are unsorted; we would have to move all of them and then repeat on each subset as a slice, like merge sort, and say without threading. So it would be like an infinite geometric series until convergence: 10s + 5s + 2.5s…; I had to search this up… s = a / (1 - r), which is 10s / (1 - 0.5) = 20s.</p> <p>So between 10s and 20s would be my answer.</p> <blockquote> <p>Memory bandwidth: the array occupies 4 GB (4 bytes per number times a billion numbers). Let’s assume ~16GB/s of memory bandwidth per core. That means each pass will take ~0.25s. N is ~2^30, so we will make ~30 passes, so the total cost of memory transfer will be ~7.5 seconds. Branch mispredictions: we will do a total of N*log(N) comparisons, i.e., ~30 billion comparisons. Let’s assume that half of them (i.e., 15 billion) are mispredicted. Multiplying by 5 ns per misprediction, we get a misprediction cost of 75 seconds. We assume for this analysis that correctly predicted branches are free. Adding up the previous numbers, we get an estimate of ~82.5 seconds.</p> </blockquote> <p>My answer was way off. Let me actually look at the table and try to do it in their style: 1. figure out the algorithm and the operation counts, and 2. find the components involved.</p> <p>Ok, memory bandwidth gives the pass cost: 4GB / 16GB/s (reading memory) = 0.25s per pass, times log(1B) ≈ 30 passes = 7.5s. Next is computation: branch misprediction work = N*log(N) comparisons = 1B * log(1B) ≈ 30B, half mispredicted = 15B. Then 15B * 5ns = 75s, which is surprising; I didn’t expect it to be compute bound.</p> <blockquote> <p>Let’s assume we have a 32MB L3 cache, and that the cost of transferring data from L3 cache to the processor is negligible.
The L3 cache can hold 2^23 numbers, and therefore the last 22 passes can operate on the data resident in the L3 cache (the 23rd last pass brings data into the L3 cache and the remaining passes operate on that data.) That cuts down the memory transfer cost to 2.5 seconds (10 memory transfers of 4GB at 16GB/s) instead of 7.5 seconds (30 memory transfers).</p> </blockquote> <p>Wow… ok, they talk about the caches here. 2^23 comes from… 2^5 (32) * 2^20 (1MB) / 2^2 (4 bytes per entry). So the last 22 passes can run entirely on cache-resident data (subarrays of at most 2^23 numbers).</p> <h3 id="example-time-to-generate-a-web-page-with-30-image-thumbnails">Example: Time to generate a web page with 30 image thumbnails</h3> <p>Let’s compare two potential designs where the original images are stored on disk, and each image is approximately 1MB in size.</p> <p>Two main components: loading from disk and transferring data over the web. Total data: 30MB.</p> <p>Disk: 30MB / 1GB/s = 0.03s. Web: log(30MB/1.5KB) * 150ms per roundtrip = 3 * 150ms = 450ms.</p> <blockquote> <p>Read the contents of the 30 images serially and generate a thumbnail for each one. Each read takes one seek + one transfer, which adds up to 5ms for the seek, and 10ms for the transfer, which adds up to 30 images times 15ms per image, i.e., 450ms. Read in parallel, assuming the images are spread evenly across K disks. The previous resource usage estimate still holds, but latency will drop by roughly a factor of K, ignoring variance (e.g, we will sometimes get unlucky and one disk will have more than 1/Kth of the images we are reading). Therefore if we are running on a distributed filesystem with hundreds of disks, the expected latency will drop to ~15ms. Let’s consider a variant where all images are on a single SSD. This changes the sequential read performance to 20µs + 1ms per image, which adds up to ~30 ms overall.</p> </blockquote> <p>Cool, I calculated the SSD case and got it right.
But, I guess realistically it would be the HDD version (say S3).</p> <h1 id="measurement">Measurement</h1> <blockquote> <p>The preceding section gives some tips about how to think about performance when writing code without worrying too much about how to measure the performance impact of your choices. However, before you actually start making improvements, or run into a tradeoff involving various things like performance, simplicity, etc. you will want to measure or estimate potential performance benefits. Being able to measure things effectively is the number one tool you’ll want to have in your arsenal when doing performance-related work.</p> </blockquote> <p>I should really keep this in mind: estimate before actually running it. The question is whether that is even feasible.</p> <blockquote> <p>As an aside, it’s worth pointing out that profiling code that you’re unfamiliar with can also be a good way of getting a general sense of the structure of the codebase and how it operates. Examining the source code of heavily involved routines in the dynamic call graph of a program can give you a high level sense of “what happens” when running the code, which can then build your own confidence in making performance-improving changes in slightly unfamiliar code.</p> </blockquote> <p>Yes! I think just reading code gives one an idealistic view of it. So much complexity happens behind the scenes. Is there lock contention? False sharing? Too much time spent allocating? Memory leaks? These are things that are not so easily picked up by reading the code or stuffing it into an LLM.</p> <h2 id="profiling-tools-and-tips">Profiling tools and tips</h2> <blockquote> <p>If you can, write a microbenchmark that covers the code you are improving. Microbenchmarks improve turnaround time when making performance improvements, help verify the impact of performance improvements, and can help prevent future performance regressions.
However microbenchmarks can have pitfalls that make them non-representative of full system performance.</p> </blockquote> <p>Very true. It helps build an understanding of the individual components of the system, to see where the full system has overhead.</p> <h2 id="what-to-do-when-profiles-are-flat">What to do when profiles are flat</h2> <blockquote> <p>Find loops closer to the top of call stacks (flame graph view of a CPU profile can be helpful here). Potentially, the loop or the code it calls could be restructured to be more efficient. Some code that initially built a complicated graph structure incrementally by looping over nodes and edges of the input was changed to build the graph structure in one shot by passing it the entire input. This removed a bunch of internal checks that were happening per edge in the initial code.</p> </blockquote> <p>In other words, restructure incremental per-element loops into one bulk pass over the whole input.</p> <blockquote> <p>Take a step back and look for structural changes higher up in the call stacks instead of concentrating on micro-optimizations. The techniques listed under algorithmic improvements can be useful when doing this.</p> </blockquote> <p>Work at the algorithm level instead of micro-optimizations.</p> <blockquote> <p>Look for overly general code. Replace it with a customized or lower-level implementation. E.g., if an application is repeatedly using a regular expression match where a simple prefix match would suffice, consider dropping the use of the regular expression.</p> </blockquote> <p>Makes sense, though this one feels like a micro-optimization; I would be wary of it.</p> <blockquote> <p>Attempt to reduce the number of allocations: get an allocation profile, and pick away at the highest contributor to the number of allocations.
This will have two effects: (1) It will provide a direct reduction of the amount of time spent in the allocator (and garbage collector for GC-ed languages) (2) There will often be a reduction in cache misses since in a long running program using tcmalloc, every allocation tends to go to a different cache line.</p> </blockquote> <p>Seen this happen SO many times. This takes up so many cycles; it’s actually frustrating to solve.</p> <blockquote> <p>Gather other types of profiles, especially ones based on hardware performance counters. Such profiles may point out functions that are encountering a high cache miss rate. Techniques described in the profiling tools and tips section can be helpful.</p> </blockquote> <p>Yes, but one needs to learn how to use these performance counters at a system level, and typically they are just samples (hard to pinpoint). I guess perf would help here with something like cache misses.</p> <h2 id="api-considerations">API considerations</h2> <blockquote> <p>Widely used APIs come under heavy pressure to add features. Be careful when adding new features since these will constrain future implementations and increase cost unnecessarily for users who don’t need the new features. E.g., many C++ standard library containers promise iterator stability, which in typical implementations increases the number of allocations significantly, even though many users do not need pointer stability.</p> </blockquote> <p>Make the API as simple as possible, kind of like C, I guess?
But make the interface actually good.</p> <h3 id="bulk-apis">Bulk APIs</h3> <p>Bulk APIs amortize per-call overhead, e.g., acquiring a lock once per batch of operations instead of once per operation:</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">template</span> <span class="o">&lt;</span><span class="k">typename</span> <span class="nc">T</span><span class="p">&gt;</span> <span class="k">class</span> <span class="nc">ObjectStore</span> <span class="p">{</span> <span class="nl">public:</span> <span class="p">...</span> <span class="n">absl</span><span class="o">::</span><span class="n">Status</span> <span class="n">DeleteRef</span><span class="p">(</span><span class="n">Ref</span><span class="p">);</span> <span class="k">template</span> <span class="o">&lt;</span><span class="k">typename</span> <span class="nc">T</span><span class="p">&gt;</span> <span class="k">class</span> <span class="nc">ObjectStore</span> <span class="p">{</span> <span class="nl">public:</span> <span class="p">...</span> <span class="n">absl</span><span class="o">::</span><span class="n">Status</span> <span class="n">DeleteRef</span><span class="p">(</span><span class="n">Ref</span><span class="p">);</span> <span class="c1">// Delete many references. For each ref, if no other Refs point to the same</span> <span class="c1">// object, the object will be deleted.
Returns non-OK on any error.</span> <span class="n">absl</span><span class="o">::</span><span class="n">Status</span> <span class="n">DeleteRefs</span><span class="p">(</span><span class="n">absl</span><span class="o">::</span><span class="n">Span</span><span class="o">&lt;</span><span class="k">const</span> <span class="n">Ref</span><span class="o">&gt;</span> <span class="n">refs</span><span class="p">);</span> <span class="p">...</span> <span class="k">template</span> <span class="o">&lt;</span><span class="k">typename</span> <span class="nc">T</span><span class="p">&gt;</span> <span class="n">absl</span><span class="o">::</span><span class="n">Status</span> <span class="n">ObjectStore</span><span class="o">&lt;</span><span class="n">T</span><span class="o">&gt;::</span><span class="n">DeleteRefs</span><span class="p">(</span><span class="n">absl</span><span class="o">::</span><span class="n">Span</span><span class="o">&lt;</span><span class="k">const</span> <span class="n">Ref</span><span class="o">&gt;</span> <span class="n">refs</span><span class="p">)</span> <span class="p">{</span> <span class="n">util</span><span class="o">::</span><span class="n">Status</span> <span class="n">result</span><span class="p">;</span> <span class="n">absl</span><span class="o">::</span><span class="n">MutexLock</span> <span class="n">l</span><span class="p">(</span><span class="o">&amp;</span><span class="n">mu_</span><span class="p">);</span> <span class="k">for</span> <span class="p">(</span><span class="k">auto</span> <span class="n">ref</span> <span class="o">:</span> <span class="n">refs</span><span class="p">)</span> <span class="p">{</span> <span class="n">result</span><span class="p">.</span><span class="n">Update</span><span class="p">(</span><span class="n">DeleteRefLocked</span><span class="p">(</span><span class="n">ref</span><span class="p">));</span> <span class="p">}</span> <span class="k">return</span> <span class="n">result</span><span class="p">;</span> 
<span class="p">}</span>
</code></pre></div></div> <h3 id="view-types">View types</h3> <blockquote> <p>These types reduce copying, and allow callers to pick their own container types (e.g., one caller might use std::vector whereas another one uses absl::InlinedVector).</p> </blockquote> <p>Yep! Been using this</p> <blockquote> <p>For frequently called routines, sometimes it is useful to allow higher-level callers to pass in a data structure that they own or information that the called routine needs that the client already has. This can avoid the low-level routine being forced to allocate its own temporary data structure or recompute already-available information.</p> </blockquote> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">const</span> <span class="n">WallTime</span> <span class="n">now</span> <span class="o">=</span> <span class="n">WallTime_Now</span><span class="p">();</span> <span class="p">...</span> <span class="n">RPC_Stats</span><span class="o">::</span><span class="n">RecordRPC</span><span class="p">(</span><span class="n">stats_name</span><span class="p">,</span> <span class="n">m</span><span class="p">);</span> <span class="k">const</span> <span class="n">WallTime</span> <span class="n">now</span> <span class="o">=</span> <span class="n">WallTime_Now</span><span class="p">();</span> <span class="p">...</span> <span class="n">RPC_Stats</span><span class="o">::</span><span class="n">RecordRPC</span><span class="p">(</span><span class="n">stats_name</span><span class="p">,</span> <span class="n">m</span><span class="p">,</span> <span class="n">now</span><span class="p">);</span> </code></pre></div></div> <p>This makes sense</p> <h3 id="thread-compatible-vs-thread-safe-types">Thread-compatible vs. 
Thread-safe types</h3> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">TransferPhase</span> <span class="n">HitlessTransferPhase</span><span class="o">::</span><span class="n">get</span><span class="p">()</span> <span class="k">const</span> <span class="p">{</span> <span class="k">static</span> <span class="n">CallsiteMetrics</span> <span class="n">cm</span><span class="p">(</span><span class="s">"HitlessTransferPhase::get"</span><span class="p">);</span> <span class="n">MonitoredMutexLock</span> <span class="n">l</span><span class="p">(</span><span class="o">&amp;</span><span class="n">cm</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">mutex_</span><span class="p">);</span> <span class="k">return</span> <span class="n">phase_</span><span class="p">;</span> <span class="p">}</span> <span class="n">TransferPhase</span> <span class="n">HitlessTransferPhase</span><span class="o">::</span><span class="n">get</span><span class="p">()</span> <span class="k">const</span> <span class="p">{</span> <span class="k">return</span> <span class="n">phase_</span><span class="p">;</span> <span class="p">}</span> </code></pre></div></div> <p>Have the user do the sync, makes sense for performance as the internal calls won’t be always locking</p> <blockquote> <p>The most critical opportunities for performance improvements come from algorithmic improvements, e.g., turning an O(N²) algorithm to O(N lg(N)) or O(N), avoiding potentially exponential behavior, etc. These opportunities are rare in stable code, but are worth paying attention to when writing new code. A few examples that show such improvements to pre-existing code:</p> </blockquote> <p>Rare in stable code! 
Man, they must have thought about most things.</p> <h2 id="better-memory-representation">Better memory representation</h2> <blockquote> <p>Careful consideration of memory footprint and cache footprint of important data structures can often yield big savings. The data structures below focus on supporting common operations by touching fewer cache lines. Care taken here can (a) avoid expensive cache misses (b) reduce memory bus traffic, which speeds up both the program in question and anything else running on the same machine</p> </blockquote> <p>Yes, these are expensive resources on any machine.</p> <h3 id="memory-layout">Memory layout</h3> <blockquote> <p>Place hot read-only fields away from hot mutable fields so that writes to the mutable fields do not cause the read-only fields to be evicted from nearby caches.</p> </blockquote> <p>Oh, I get it: a write invalidates the whole cache line in other cores’ caches, so read-only fields sharing that line get evicted too.</p> <blockquote> <p>Consider packing things into fewer bytes by using bit and byte-level encoding. This can be complicated, so only do this when the data under question is encapsulated inside a well-tested module, and the overall reduction of memory usage is significant. Furthermore, watch out for side effects like under-alignment of frequently used data, or more expensive code for accessing packed representations. Validate such changes using benchmarks.</p> </blockquote> <p>Makes sense. Trade CPU for space, like varints, etc.</p> <h3 id="indices-instead-of-pointers">Indices instead of pointers</h3> <blockquote> <p>On modern 64-bit machines, pointers take up 64 bits. If you have a pointer-rich data structure, you can easily chew up lots of memory with indirections of T*. Instead, consider using integer indices into an array T[] or other data structure.
Not only will the references be smaller (if the number of indices is small enough to fit in 32 or fewer bits), but the storage for all the T[] elements will be contiguous, often leading to better cache locality.</p> </blockquote> <p>Smaller indices: a 4-byte index can address ~4 billion elements at half the storage cost of an 8-byte pointer.</p> <blockquote> <p>Avoid data structures that allocate a separate object per stored element (e.g., std::map, std::unordered_map in C++). Instead, consider types that use chunked or flat representations to store multiple elements in close proximity in memory (e.g., std::vector, absl::flat_hash_{map,set} in C++). Such types tend to have much better cache behavior. Furthermore, they encounter less allocator overhead.</p> </blockquote> <p>Yes, but only in performance-critical code. It’s sometimes tricky to use a flat representation, but flat hashmap/set is nice.</p> <blockquote> <p>One useful technique is to partition elements into chunks where each chunk can hold a fixed number of elements. This technique can reduce the cache footprint of a data structure significantly while preserving good asymptotic behavior.</p> </blockquote> <p>Yes! Used in many implementations, such as highly performant read/write queues.</p> <h3 id="arenas">Arenas</h3> <blockquote> <p>Arenas can help reduce memory allocation cost, but they also have the benefit of packing together independently allocated items next to each other, typically in fewer cache lines, and eliminating most destruction costs. They are likely most effective for complex data structures with many sub-objects. Consider providing an appropriate initial size for the arena since that can help reduce allocations. Caveat: it is easy to misuse arenas by putting too many short-lived objects in a long-lived arena, which can unnecessarily bloat memory footprint.</p> </blockquote> <p>Basically, allocate a region up front and carve objects out of it, though you may not use the entire arena.
It’s tricky to get right… very tricky… especially estimating how big it should be.</p> <h3 id="arrays-instead-of-maps">Arrays instead of maps</h3> <blockquote> <p>If the domain of a map can be represented by a small integer or is an enum, or if the map will have very few elements, the map can sometimes be replaced by an array or a vector of some form.</p> </blockquote> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">const</span> <span class="n">gtl</span><span class="o">::</span><span class="n">flat_map</span><span class="o">&lt;</span><span class="kt">int</span><span class="p">,</span> <span class="kt">int</span><span class="o">&gt;</span> <span class="n">payload_type_to_clock_frequency_</span><span class="p">;</span> <span class="c1">// A map (implemented as a simple array) indexed by payload_type to clock freq</span> <span class="c1">// for that payload type (or 0)</span> <span class="k">struct</span> <span class="nc">PayloadTypeToClockRateMap</span> <span class="p">{</span> <span class="kt">int</span> <span class="n">map</span><span class="p">[</span><span class="mi">128</span><span class="p">];</span> <span class="p">};</span> <span class="p">...</span> <span class="k">const</span> <span class="n">PayloadTypeToClockRateMap</span> <span class="n">payload_type_to_clock_frequency_</span><span class="p">;</span> </code></pre></div></div> <p>Only applicable when the key is a small index…</p> <h3 id="bit-vectors-instead-of-sets">Bit vectors instead of sets</h3> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">ZoneSet</span><span class="o">:</span> <span class="k">public</span> <span class="n">dense_hash_set</span><span class="o">&lt;</span><span class="n">ZoneId</span><span class="o">&gt;</span> <span class="p">{</span> <span class="nl">public:</span> <span class="p">...</span> <span class="kt">bool</span> <span
class="n">Contains</span><span class="p">(</span><span class="n">ZoneId</span> <span class="n">zone</span><span class="p">)</span> <span class="k">const</span> <span class="p">{</span> <span class="k">return</span> <span class="n">count</span><span class="p">(</span><span class="n">zone</span><span class="p">)</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">;</span> <span class="p">}</span> <span class="k">class</span> <span class="nc">ZoneSet</span> <span class="p">{</span> <span class="p">...</span> <span class="c1">// Returns true iff "zone" is contained in the set</span> <span class="kt">bool</span> <span class="n">ContainsZone</span><span class="p">(</span><span class="n">ZoneId</span> <span class="n">zone</span><span class="p">)</span> <span class="k">const</span> <span class="p">{</span> <span class="k">return</span> <span class="n">zone</span> <span class="o">&lt;</span> <span class="n">b_</span><span class="p">.</span><span class="n">size</span><span class="p">()</span> <span class="o">&amp;&amp;</span> <span class="n">b_</span><span class="p">.</span><span class="n">get_bit</span><span class="p">(</span><span class="n">zone</span><span class="p">);</span> <span class="p">}</span> <span class="p">...</span> <span class="k">private</span><span class="o">:</span> <span class="kt">int</span> <span class="n">size_</span><span class="p">;</span> <span class="c1">// Number of zones inserted</span> <span class="n">util</span><span class="o">::</span><span class="n">bitmap</span><span class="o">::</span><span class="n">InlinedBitVector</span><span class="o">&lt;</span><span class="mi">256</span><span class="o">&gt;</span> <span class="n">b_</span><span class="p">;</span> </code></pre></div></div> <p>I’ve not actually used this before. Essentially a vector of bits instead of a set of values. 
I don’t use sets that often…</p> <h3 id="reduce-allocations">Reduce allocations</h3> <blockquote> <p>Newly-allocated objects may require expensive initialization and sometimes corresponding expensive destruction when no longer needed.</p> </blockquote> <p>I see this time and time again</p> <blockquote> <p>Every allocation tends to be on a new cache line and therefore data spread across many independent allocations will have a larger cache footprint than data spread across fewer allocations.</p> </blockquote> <p>Yes. Batch your allocations (basically arena)</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">LiveTensor</span><span class="o">::</span><span class="n">LiveTensor</span><span class="p">(</span><span class="n">tf</span><span class="o">::</span><span class="n">Tensor</span> <span class="n">t</span><span class="p">,</span> <span class="n">std</span><span class="o">::</span><span class="n">shared_ptr</span><span class="o">&lt;</span><span class="k">const</span> <span class="n">DeviceInfo</span><span class="o">&gt;</span> <span class="n">dinfo</span><span class="p">,</span> <span class="kt">bool</span> <span class="n">is_batched</span><span class="p">)</span> <span class="o">:</span> <span class="n">tensor</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">move</span><span class="p">(</span><span class="n">t</span><span class="p">)),</span> <span class="n">device_info</span><span class="p">(</span><span class="n">dinfo</span> <span class="o">?</span> <span class="n">std</span><span class="o">::</span><span class="n">move</span><span class="p">(</span><span class="n">dinfo</span><span class="p">)</span> <span class="o">:</span> <span class="n">std</span><span class="o">::</span><span class="n">make_shared</span><span class="o">&lt;</span><span class="n">DeviceInfo</span><span class="o">&gt;</span><span class="p">()),</span> <span 
class="n">is_batched</span><span class="p">(</span><span class="n">is_batched</span><span class="p">)</span> <span class="p">{</span> <span class="k">static</span> <span class="k">const</span> <span class="n">std</span><span class="o">::</span><span class="n">shared_ptr</span><span class="o">&lt;</span><span class="n">DeviceInfo</span><span class="o">&gt;&amp;</span> <span class="n">empty_device_info</span><span class="p">()</span> <span class="p">{</span> <span class="k">static</span> <span class="n">std</span><span class="o">::</span><span class="n">shared_ptr</span><span class="o">&lt;</span><span class="n">DeviceInfo</span><span class="o">&gt;*</span> <span class="n">result</span> <span class="o">=</span> <span class="k">new</span> <span class="n">std</span><span class="o">::</span><span class="n">shared_ptr</span><span class="o">&lt;</span><span class="n">DeviceInfo</span><span class="o">&gt;</span><span class="p">(</span><span class="k">new</span> <span class="n">DeviceInfo</span><span class="p">);</span> <span class="k">return</span> <span class="o">*</span><span class="n">result</span><span class="p">;</span> <span class="p">}</span> <span class="n">LiveTensor</span><span class="o">::</span><span class="n">LiveTensor</span><span class="p">(</span><span class="n">tf</span><span class="o">::</span><span class="n">Tensor</span> <span class="n">t</span><span class="p">,</span> <span class="n">std</span><span class="o">::</span><span class="n">shared_ptr</span><span class="o">&lt;</span><span class="k">const</span> <span class="n">DeviceInfo</span><span class="o">&gt;</span> <span class="n">dinfo</span><span class="p">,</span> <span class="kt">bool</span> <span class="n">is_batched</span><span class="p">)</span> <span class="o">:</span> <span class="n">tensor</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">move</span><span class="p">(</span><span class="n">t</span><span class="p">)),</span> <span 
class="n">is_batched</span><span class="p">(</span><span class="n">is_batched</span><span class="p">)</span> <span class="p">{</span> <span class="k">if</span> <span class="p">(</span><span class="n">dinfo</span><span class="p">)</span> <span class="p">{</span> <span class="n">device_info</span> <span class="o">=</span> <span class="n">std</span><span class="o">::</span><span class="n">move</span><span class="p">(</span><span class="n">dinfo</span><span class="p">);</span> <span class="p">}</span> <span class="k">else</span> <span class="p">{</span> <span class="n">device_info</span> <span class="o">=</span> <span class="n">empty_device_info</span><span class="p">();</span> <span class="p">}</span> </code></pre></div></div> <h3 id="resize-or-reserve-containers">Resize or reserve containers</h3> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">ndocs</span><span class="o">-</span><span class="mi">1</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span> <span class="n">uint32</span> <span class="n">delta</span><span class="p">;</span> <span class="n">ERRORCHECK</span><span class="p">(</span><span class="n">b</span><span class="o">-&gt;</span><span class="n">GetRice</span><span class="p">(</span><span class="n">rice_base</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">delta</span><span class="p">));</span> <span class="n">docs_</span><span class="p">.</span><span class="n">push_back</span><span class="p">(</span><span class="n">DocId</span><span class="p">(</span><span class="n">my_shard_</span> <span class="o">+</span> <span class="p">(</span><span class="n">base</span> <span 
class="o">+</span> <span class="n">delta</span><span class="p">)</span> <span class="o">*</span> <span class="n">num_shards_</span><span class="p">));</span> <span class="n">base</span> <span class="o">=</span> <span class="n">base</span> <span class="o">+</span> <span class="n">delta</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span> <span class="p">}</span> <span class="n">docs_</span><span class="p">.</span><span class="n">push_back</span><span class="p">(</span><span class="n">last_docid_</span><span class="p">);</span> <span class="n">docs_</span><span class="p">.</span><span class="n">resize</span><span class="p">(</span><span class="n">ndocs</span><span class="p">);</span> <span class="n">DocId</span><span class="o">*</span> <span class="n">docptr</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">docs_</span><span class="p">[</span><span class="mi">0</span><span class="p">];</span> <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">ndocs</span><span class="o">-</span><span class="mi">1</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span> <span class="n">uint32</span> <span class="n">delta</span><span class="p">;</span> <span class="n">ERRORCHECK</span><span class="p">(</span><span class="n">b</span><span class="p">.</span><span class="n">GetRice</span><span class="p">(</span><span class="n">rice_base</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">delta</span><span class="p">));</span> <span class="o">*</span><span class="n">docptr</span> <span class="o">=</span> <span class="n">DocId</span><span class="p">(</span><span class="n">my_shard_</span> <span class="o">+</span> <span class="p">(</span><span 
class="n">base</span> <span class="o">+</span> <span class="n">delta</span><span class="p">)</span> <span class="o">*</span> <span class="n">num_shards_</span><span class="p">);</span> <span class="n">docptr</span><span class="o">++</span><span class="p">;</span> <span class="n">base</span> <span class="o">=</span> <span class="n">base</span> <span class="o">+</span> <span class="n">delta</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span> <span class="p">}</span> <span class="o">*</span><span class="n">docptr</span> <span class="o">=</span> <span class="n">last_docid_</span><span class="p">;</span> </code></pre></div></div> <p>I actually do this a lot in my preallocated code for performance. Wow, I guess I do some things correctly.</p> <h3 id="avoid-copying-when-possible">Avoid copying when possible</h3> <p>One of the most critical things to do (doing no work is better than doing any work)</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">return</span> <span class="n">search_iterators</span><span class="o">::</span><span class="n">DocPLIteratorFactory</span><span class="o">::</span><span class="n">Create</span><span class="p">(</span><span class="n">opts</span><span class="p">);</span> <span class="k">return</span> <span class="n">search_iterators</span><span class="o">::</span><span class="n">DocPLIteratorFactory</span><span class="o">::</span><span class="n">Create</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">move</span><span class="p">(</span><span class="n">opts</span><span class="p">));</span> </code></pre></div></div> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">auto</span> <span class="n">iterator</span> <span class="o">=</span> <span class="n">absl</span><span class="o">::</span><span class="n">WrapUnique</span><span class="p">(</span><span
class="n">sstable</span><span class="o">-&gt;</span><span class="n">GetIterator</span><span class="p">());</span> <span class="k">while</span> <span class="p">(</span><span class="o">!</span><span class="n">iterator</span><span class="o">-&gt;</span><span class="n">done</span><span class="p">())</span> <span class="p">{</span> <span class="n">T</span> <span class="n">profile</span><span class="p">;</span> <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">profile</span><span class="p">.</span><span class="n">ParseFromString</span><span class="p">(</span><span class="n">iterator</span><span class="o">-&gt;</span><span class="n">value_view</span><span class="p">()))</span> <span class="p">{</span> <span class="k">return</span> <span class="n">absl</span><span class="o">::</span><span class="n">InternalError</span><span class="p">(</span> <span class="s">"Failed to parse mem_block to specified profile type."</span><span class="p">);</span> <span class="p">}</span> <span class="p">...</span> <span class="n">iterator</span><span class="o">-&gt;</span><span class="n">Next</span><span class="p">();</span> <span class="p">}</span> <span class="k">auto</span> <span class="n">iterator</span> <span class="o">=</span> <span class="n">absl</span><span class="o">::</span><span class="n">WrapUnique</span><span class="p">(</span><span class="n">sstable</span><span class="o">-&gt;</span><span class="n">GetIterator</span><span class="p">());</span> <span class="n">T</span> <span class="n">profile</span><span class="p">;</span> <span class="k">while</span> <span class="p">(</span><span class="o">!</span><span class="n">iterator</span><span class="o">-&gt;</span><span class="n">done</span><span class="p">())</span> <span class="p">{</span> <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">profile</span><span class="p">.</span><span class="n">ParseFromString</span><span class="p">(</span><span 
class="n">iterator</span><span class="o">-&gt;</span><span class="n">value_view</span><span class="p">()))</span> <span class="p">{</span> <span class="k">return</span> <span class="n">absl</span><span class="o">::</span><span class="n">InternalError</span><span class="p">(</span> <span class="s">"Failed to parse mem_block to specified profile type."</span><span class="p">);</span> <span class="p">}</span> <span class="p">...</span> <span class="n">iterator</span><span class="o">-&gt;</span><span class="n">Next</span><span class="p">();</span> <span class="p">}</span> </code></pre></div></div> <blockquote> <p>Often, code is written to cover all cases, but some subset of the cases are much simpler and more common than others. E.g., vector::push_back usually has enough space for the new element, but contains code to resize the underlying storage when it does not. Some attention paid to the structure of code can help make the common simple case faster without hurting uncommon case performance significantly.</p> </blockquote> <p>One has to understand the uncommon case underlying the API call. 
Say no error happened, we shouldn’t log at all.</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">RPC_Stats_Measurement</span><span class="o">::</span><span class="k">operator</span><span class="o">+=</span><span class="p">(</span><span class="k">const</span> <span class="n">RPC_Stats_Measurement</span><span class="o">&amp;</span> <span class="n">x</span><span class="p">)</span> <span class="p">{</span> <span class="p">...</span> <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">RPC</span><span class="o">::</span><span class="n">NUM_ERRORS</span><span class="p">;</span> <span class="o">++</span><span class="n">i</span><span class="p">)</span> <span class="p">{</span> <span class="n">errors</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">+=</span> <span class="n">x</span><span class="p">.</span><span class="n">errors</span><span class="p">[</span><span class="n">i</span><span class="p">];</span> <span class="p">}</span> <span class="p">}</span> <span class="kt">void</span> <span class="n">RPC_Stats_Measurement</span><span class="o">::</span><span class="k">operator</span><span class="o">+=</span><span class="p">(</span><span class="k">const</span> <span class="n">RPC_Stats_Measurement</span><span class="o">&amp;</span> <span class="n">x</span><span class="p">)</span> <span class="p">{</span> <span class="p">...</span> <span class="k">if</span> <span class="p">(</span><span class="n">x</span><span class="p">.</span><span class="n">any_errors_set</span><span class="p">)</span> <span class="p">{</span> <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span 
class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">RPC</span><span class="o">::</span><span class="n">NUM_ERRORS</span><span class="p">;</span> <span class="o">++</span><span class="n">i</span><span class="p">)</span> <span class="p">{</span> <span class="n">errors</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">+=</span> <span class="n">x</span><span class="p">.</span><span class="n">errors</span><span class="p">[</span><span class="n">i</span><span class="p">];</span> <span class="p">}</span> <span class="n">any_errors_set</span> <span class="o">=</span> <span class="nb">true</span><span class="p">;</span> <span class="p">}</span> <span class="p">}</span> </code></pre></div></div> <blockquote> <p>Preallocate 10 nodes not 200 for query handling in Google’s web server. A simple change that reduced web server’s CPU usage by 7.5%. Wow.</p> </blockquote> <p><code class="language-plaintext highlighter-rouge">querytree.h</code></p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="k">const</span> <span class="kt">int</span> <span class="n">kInitParseTreeSize</span> <span class="o">=</span> <span class="mi">200</span><span class="p">;</span> <span class="c1">// initial size of querynode pool</span> <span class="k">static</span> <span class="k">const</span> <span class="kt">int</span> <span class="n">kInitParseTreeSize</span> <span class="o">=</span> <span class="mi">10</span><span class="p">;</span> <span class="c1">// initial size of querynode pool</span> </code></pre></div></div> <h3 id="specialize-code">Specialize code</h3> <blockquote> <p>A particular performance-sensitive call-site may not need the full generality provided by a general-purpose library. 
Consider writing specialized code in such cases instead of calling the general-purpose code if it provides a performance improvement.</p> </blockquote> <p>Interesting, I haven’t done this before. This belongs in very heavily used code.</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">p</span><span class="o">-&gt;</span><span class="n">type</span> <span class="o">=</span> <span class="n">MATCH_TYPE_REGEXP</span><span class="p">;</span> <span class="n">term</span><span class="p">.</span><span class="n">NonMetaPrefix</span><span class="p">().</span><span class="n">CopyToString</span><span class="p">(</span><span class="o">&amp;</span><span class="n">p</span><span class="o">-&gt;</span><span class="n">prefix</span><span class="p">);</span> <span class="k">if</span> <span class="p">(</span><span class="n">term</span><span class="p">.</span><span class="n">RegexpSuffix</span><span class="p">()</span> <span class="o">==</span> <span class="s">".*"</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// Special case for a regexp that matches anything, so we can</span> <span class="c1">// bypass RE2::FullMatch</span> <span class="n">p</span><span class="o">-&gt;</span><span class="n">type</span> <span class="o">=</span> <span class="n">MATCH_TYPE_PREFIX</span><span class="p">;</span> <span class="p">}</span> <span class="k">else</span> <span class="p">{</span> <span class="n">p</span><span class="o">-&gt;</span><span class="n">type</span> <span class="o">=</span> <span class="n">MATCH_TYPE_REGEXP</span><span class="p">;</span> </code></pre></div></div> <h3 id="make-the-compilers-job-easier">Make the compiler’s job easier</h3> <blockquote> <p>The application programmer will often know more about the behavior of the system and can aid the compiler by rewriting the code to operate at a lower level.
However, only do this when profiles show an issue since compilers will often get things right on their own. Looking at the generated assembly code for performance critical routines can help you understand if the compiler is “getting it right”. Pprof provides a very helpful display of source code interleaved with disassembly and annotated with performance data.</p> </blockquote> <p>If you understand the code extremely well, you can get to this stage, or use a specific tool that shows the assembly (rare!)</p> <blockquote> <p>Avoid function calls in hot functions (allows the compiler to avoid frame setup costs). Move slow-path code into a separate tail-called function. Copy small amounts of data into local variables before heavy use. This can let the compiler assume there is no aliasing with other data, which may improve auto-vectorization and register allocation. Hand-unroll very hot loops.</p> </blockquote> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">Key</span><span class="o">::</span><span class="n">InitSeps</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span><span class="o">*</span> <span class="n">start</span><span class="p">)</span> <span class="p">{</span> <span class="k">const</span> <span class="kt">char</span><span class="o">*</span> <span class="n">base</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">rep_</span><span class="p">[</span><span class="mi">0</span><span class="p">];</span> <span class="k">const</span> <span class="kt">char</span><span class="o">*</span> <span class="n">limit</span> <span class="o">=</span> <span class="n">base</span> <span class="o">+</span> <span class="n">rep_</span><span class="p">.</span><span class="n">size</span><span class="p">();</span> <span class="k">const</span> <span class="kt">char</span><span class="o">*</span> <span class="n">s</span> <span
class="o">=</span> <span class="n">start</span><span class="p">;</span> <span class="n">DCHECK_GE</span><span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="n">base</span><span class="p">);</span> <span class="n">DCHECK_LT</span><span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="n">limit</span><span class="p">);</span> <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">3</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span> <span class="n">s</span> <span class="o">=</span> <span class="p">(</span><span class="k">const</span> <span class="kt">char</span><span class="o">*</span><span class="p">)</span><span class="n">memchr</span><span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="sc">'#'</span><span class="p">,</span> <span class="n">limit</span> <span class="o">-</span> <span class="n">s</span><span class="p">);</span> <span class="n">DCHECK</span><span class="p">(</span><span class="n">s</span> <span class="o">!=</span> <span class="nb">NULL</span><span class="p">);</span> <span class="n">seps_</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">s</span> <span class="o">-</span> <span class="n">base</span><span class="p">;</span> <span class="n">s</span><span class="o">++</span><span class="p">;</span> <span class="p">}</span> <span class="p">}</span> <span class="kr">inline</span> <span class="k">const</span> <span class="kt">char</span><span class="o">*</span> <span class="nf">ScanBackwardsForSep</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span><span class="o">*</span> <span 
class="n">base</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span><span class="o">*</span> <span class="n">p</span><span class="p">)</span> <span class="p">{</span> <span class="k">while</span> <span class="p">(</span><span class="n">p</span> <span class="o">&gt;=</span> <span class="n">base</span> <span class="o">+</span> <span class="mi">4</span><span class="p">)</span> <span class="p">{</span> <span class="k">if</span> <span class="p">(</span><span class="n">p</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">==</span> <span class="sc">'#'</span><span class="p">)</span> <span class="k">return</span> <span class="n">p</span><span class="p">;</span> <span class="k">if</span> <span class="p">(</span><span class="n">p</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="o">==</span> <span class="sc">'#'</span><span class="p">)</span> <span class="k">return</span> <span class="n">p</span><span class="o">-</span><span class="mi">1</span><span class="p">;</span> <span class="k">if</span> <span class="p">(</span><span class="n">p</span><span class="p">[</span><span class="o">-</span><span class="mi">2</span><span class="p">]</span> <span class="o">==</span> <span class="sc">'#'</span><span class="p">)</span> <span class="k">return</span> <span class="n">p</span><span class="o">-</span><span class="mi">2</span><span class="p">;</span> <span class="k">if</span> <span class="p">(</span><span class="n">p</span><span class="p">[</span><span class="o">-</span><span class="mi">3</span><span class="p">]</span> <span class="o">==</span> <span class="sc">'#'</span><span class="p">)</span> <span class="k">return</span> <span class="n">p</span><span class="o">-</span><span class="mi">3</span><span class="p">;</span> <span class="n">p</span> <span class="o">-=</span> <span class="mi">4</span><span class="p">;</span> <span 
class="p">}</span> <span class="k">while</span> <span class="p">(</span><span class="n">p</span> <span class="o">&gt;=</span> <span class="n">base</span> <span class="o">&amp;&amp;</span> <span class="o">*</span><span class="n">p</span> <span class="o">!=</span> <span class="sc">'#'</span><span class="p">)</span> <span class="n">p</span><span class="o">--</span><span class="p">;</span> <span class="k">return</span> <span class="n">p</span><span class="p">;</span> <span class="p">}</span> <span class="kt">void</span> <span class="n">Key</span><span class="o">::</span><span class="n">InitSeps</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span><span class="o">*</span> <span class="n">start</span><span class="p">)</span> <span class="p">{</span> <span class="k">const</span> <span class="kt">char</span><span class="o">*</span> <span class="n">base</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">rep_</span><span class="p">[</span><span class="mi">0</span><span class="p">];</span> <span class="k">const</span> <span class="kt">char</span><span class="o">*</span> <span class="n">limit</span> <span class="o">=</span> <span class="n">base</span> <span class="o">+</span> <span class="n">rep_</span><span class="p">.</span><span class="n">size</span><span class="p">();</span> <span class="k">const</span> <span class="kt">char</span><span class="o">*</span> <span class="n">s</span> <span class="o">=</span> <span class="n">start</span><span class="p">;</span> <span class="n">DCHECK_GE</span><span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="n">base</span><span class="p">);</span> <span class="n">DCHECK_LT</span><span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="n">limit</span><span class="p">);</span> <span class="c1">// We go backwards from the end of the string, rather than forwards,</span> <span class="c1">// since the directory name might be 
long and definitely doesn't contain</span> <span class="c1">// any '#' characters.</span> <span class="k">const</span> <span class="kt">char</span><span class="o">*</span> <span class="n">p</span> <span class="o">=</span> <span class="n">ScanBackwardsForSep</span><span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="n">limit</span> <span class="o">-</span> <span class="mi">1</span><span class="p">);</span> <span class="n">DCHECK</span><span class="p">(</span><span class="o">*</span><span class="n">p</span> <span class="o">==</span> <span class="sc">'#'</span><span class="p">);</span> <span class="n">seps_</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="o">=</span> <span class="n">p</span> <span class="o">-</span> <span class="n">base</span><span class="p">;</span> <span class="n">p</span><span class="o">--</span><span class="p">;</span> <span class="n">p</span> <span class="o">=</span> <span class="n">ScanBackwardsForSep</span><span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="n">p</span><span class="p">);</span> <span class="n">DCHECK</span><span class="p">(</span><span class="o">*</span><span class="n">p</span> <span class="o">==</span> <span class="sc">'#'</span><span class="p">);</span> <span class="n">seps_</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="n">p</span> <span class="o">-</span> <span class="n">base</span><span class="p">;</span> <span class="n">p</span><span class="o">--</span><span class="p">;</span> <span class="n">p</span> <span class="o">=</span> <span class="n">ScanBackwardsForSep</span><span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="n">p</span><span class="p">);</span> <span class="n">DCHECK</span><span class="p">(</span><span class="o">*</span><span class="n">p</span> <span class="o">==</span> <span 
class="sc">'#'</span><span class="p">);</span> <span class="n">seps_</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="n">p</span> <span class="o">-</span> <span class="n">base</span><span class="p">;</span> <span class="p">}</span> </code></pre></div></div> <h3 id="reduce-stats-collection-costs">Reduce stats collection costs</h3> <blockquote> <p>Balance the utility of stats and other behavioral information about a system against the cost of maintaining that information. The extra information can often help people to understand and improve high-level behavior, but can also be costly to maintain.</p> </blockquote> <p>Yes, I’ve seen this. Essentially, how do you decide what to instrument?</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Part</span> <span class="n">of</span> <span class="n">changes</span> <span class="n">that</span> <span class="n">reduce</span> <span class="n">time</span> <span class="k">for</span> <span class="n">setting</span> <span class="n">an</span> <span class="n">alarm</span> <span class="n">from</span> <span class="mi">771</span> <span class="n">ns</span> <span class="n">to</span> <span class="mi">271</span> <span class="n">ns</span><span class="p">.</span> <span class="n">selectserver</span><span class="p">.</span><span class="n">h</span> <span class="k">class</span> <span class="nc">SelectServer</span> <span class="p">{</span> <span class="nl">public:</span> <span class="p">...</span> <span class="nl">protected:</span> <span class="p">...</span> <span class="n">scoped_ptr</span><span class="o">&lt;</span><span class="n">MinuteTenMinuteHourStat</span><span class="o">&gt;</span> <span class="n">num_alarms_stat_</span><span class="p">;</span> <span class="p">...</span> <span class="n">scoped_ptr</span><span class="o">&lt;</span><span class="n">MinuteTenMinuteHourStat</span><span class="o">&gt;</span> <span
class="n">num_closures_stat_</span><span class="p">;</span> <span class="p">...</span> <span class="p">};</span> <span class="c1">// Selectserver class</span> <span class="k">class</span> <span class="nc">SelectServer</span> <span class="p">{</span> <span class="p">...</span> <span class="nl">protected:</span> <span class="p">...</span> <span class="p">};</span> <span class="o">/</span><span class="n">selectserver</span><span class="p">.</span><span class="n">cc</span> <span class="kt">void</span> <span class="n">SelectServer</span><span class="o">::</span><span class="n">AddAlarmInternal</span><span class="p">(</span><span class="n">Alarmer</span><span class="o">*</span> <span class="n">alarmer</span><span class="p">,</span> <span class="kt">int</span> <span class="n">offset_in_ms</span><span class="p">,</span> <span class="kt">int</span> <span class="n">id</span><span class="p">,</span> <span class="kt">bool</span> <span class="n">is_periodic</span><span class="p">)</span> <span class="p">{</span> <span class="p">...</span> <span class="n">alarms_</span><span class="o">-&gt;</span><span class="n">insert</span><span class="p">(</span><span class="n">alarm</span><span class="p">);</span> <span class="n">num_alarms_stat_</span><span class="o">-&gt;</span><span class="n">IncBy</span><span class="p">(</span><span class="mi">1</span><span class="p">);</span> <span class="p">...</span> <span class="p">}</span> <span class="kt">void</span> <span class="n">SelectServer</span><span class="o">::</span><span class="n">AddAlarmInternal</span><span class="p">(</span><span class="n">Alarmer</span><span class="o">*</span> <span class="n">alarmer</span><span class="p">,</span> <span class="kt">int</span> <span class="n">offset_in_ms</span><span class="p">,</span> <span class="kt">int</span> <span class="n">id</span><span class="p">,</span> <span class="kt">bool</span> <span class="n">is_periodic</span><span class="p">)</span> <span class="p">{</span> <span class="p">...</span> 
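// after: the num_alarms_stat_ update is removed, and insert() becomes Add()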
<span class="n">alarms_</span><span class="o">-&gt;</span><span class="n">Add</span><span class="p">(</span><span class="n">alarm</span><span class="p">);</span> <span class="p">...</span> <span class="p">}</span> <span class="o">/</span><span class="n">selectserver</span><span class="p">.</span><span class="n">cc</span> <span class="kt">void</span> <span class="n">SelectServer</span><span class="o">::</span><span class="n">RemoveAlarm</span><span class="p">(</span><span class="n">Alarmer</span><span class="o">*</span> <span class="n">alarmer</span><span class="p">,</span> <span class="kt">int</span> <span class="n">id</span><span class="p">)</span> <span class="p">{</span> <span class="p">...</span> <span class="n">alarms_</span><span class="o">-&gt;</span><span class="n">erase</span><span class="p">(</span><span class="n">alarm</span><span class="p">);</span> <span class="n">num_alarms_stat_</span><span class="o">-&gt;</span><span class="n">IncBy</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">);</span> <span class="p">...</span> <span class="p">}</span> <span class="kt">void</span> <span class="n">SelectServer</span><span class="o">::</span><span class="n">RemoveAlarm</span><span class="p">(</span><span class="n">Alarmer</span><span class="o">*</span> <span class="n">alarmer</span><span class="p">,</span> <span class="kt">int</span> <span class="n">id</span><span class="p">)</span> <span class="p">{</span> <span class="p">...</span> <span class="n">alarms_</span><span class="o">-&gt;</span><span class="n">Remove</span><span class="p">(</span><span class="n">alarm</span><span class="p">);</span> <span class="p">...</span> <span class="p">}</span> <span class="n">Often</span><span class="p">,</span> <span class="n">stats</span> <span class="n">or</span> <span class="n">other</span> <span class="n">properties</span> <span class="n">can</span> <span class="n">be</span> <span class="n">maintained</span> <span 
class="k">for</span> <span class="n">a</span> <span class="n">sample</span> <span class="n">of</span> <span class="n">the</span> <span class="n">elements</span> <span class="n">handled</span> <span class="n">by</span> <span class="n">the</span> <span class="n">system</span> <span class="p">(</span><span class="n">e</span><span class="p">.</span><span class="n">g</span><span class="p">.,</span> <span class="n">RPC</span> <span class="n">requests</span><span class="p">,</span> <span class="n">input</span> <span class="n">records</span><span class="p">,</span> <span class="n">users</span><span class="p">).</span> <span class="n">Many</span> <span class="n">subsystems</span> <span class="n">use</span> <span class="k">this</span> <span class="n">approach</span> <span class="p">(</span><span class="n">tcmalloc</span> <span class="n">allocation</span> <span class="n">tracking</span><span class="p">,</span> <span class="o">/</span><span class="n">requestz</span> <span class="n">status</span> <span class="n">pages</span><span class="p">,</span> <span class="n">Dapper</span> <span class="n">samples</span><span class="p">).</span> <span class="n">When</span> <span class="n">sampling</span><span class="p">,</span> <span class="n">consider</span> <span class="n">reducing</span> <span class="n">the</span> <span class="n">sampling</span> <span class="n">rate</span> <span class="n">when</span> <span class="n">appropriate</span><span class="p">.</span> </code></pre></div></div> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">This</span> <span class="n">change</span> <span class="n">reduces</span> <span class="n">the</span> <span class="n">sampling</span> <span class="n">rate</span> <span class="n">from</span> <span class="mi">1</span> <span class="n">in</span> <span class="mi">10</span> <span class="n">to</span> <span class="mi">1</span> <span class="n">in</span> <span class="mf">32.</span> <span 
class="n">Furthermore</span><span class="p">,</span> <span class="n">we</span> <span class="n">now</span> <span class="n">keep</span> <span class="n">execution</span> <span class="n">time</span> <span class="n">stats</span> <span class="n">just</span> <span class="k">for</span> <span class="n">the</span> <span class="n">sampled</span> <span class="n">events</span> <span class="n">and</span> <span class="n">speed</span> <span class="n">up</span> <span class="n">sampling</span> <span class="n">decisions</span> <span class="n">by</span> <span class="k">using</span> <span class="n">a</span> <span class="n">power</span> <span class="n">of</span> <span class="n">two</span> <span class="n">modulus</span><span class="p">.</span> <span class="n">This</span> <span class="n">code</span> <span class="n">is</span> <span class="n">called</span> <span class="n">on</span> <span class="n">every</span> <span class="n">packet</span> <span class="n">in</span> <span class="n">the</span> <span class="n">Google</span> <span class="n">Meet</span> <span class="n">video</span> <span class="n">conferencing</span> <span class="n">system</span> <span class="n">and</span> <span class="n">needed</span> <span class="n">performance</span> <span class="n">work</span> <span class="n">to</span> <span class="n">keep</span> <span class="n">up</span> <span class="n">with</span> <span class="n">capacity</span> <span class="n">demands</span> <span class="n">during</span> <span class="n">the</span> <span class="n">first</span> <span class="n">part</span> <span class="n">of</span> <span class="n">the</span> <span class="n">COVID</span> <span class="n">outbreak</span> <span class="n">as</span> <span class="n">users</span> <span class="n">rapidly</span> <span class="n">migrated</span> <span class="n">to</span> <span class="n">doing</span> <span class="n">more</span> <span class="n">online</span> <span class="n">meetings</span><span class="p">.</span> <span class="n">packet_executor</span><span 
class="p">.</span><span class="n">cc</span> <span class="k">class</span> <span class="nc">ScopedPerformanceMeasurement</span> <span class="p">{</span> <span class="nl">public:</span> <span class="k">explicit</span> <span class="n">ScopedPerformanceMeasurement</span><span class="p">(</span><span class="n">PacketExecutor</span><span class="o">*</span> <span class="n">packet_executor</span><span class="p">)</span> <span class="o">:</span> <span class="n">packet_executor_</span><span class="p">(</span><span class="n">packet_executor</span><span class="p">),</span> <span class="n">tracer_</span><span class="p">(</span><span class="n">packet_executor</span><span class="o">-&gt;</span><span class="n">packet_executor_trace_threshold_</span><span class="p">,</span> <span class="n">kClosureTraceName</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// ThreadCPUUsage is an expensive call. At the time of writing,</span> <span class="c1">// it takes over 400ns, or roughly 30 times slower than absl::Now,</span> <span class="c1">// so we sample only 10% of closures to keep the cost down.</span> <span class="k">if</span> <span class="p">(</span><span class="n">packet_executor</span><span class="o">-&gt;</span><span class="n">closures_executed_</span> <span class="o">%</span> <span class="mi">10</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span> <span class="n">thread_cpu_usage_start_</span> <span class="o">=</span> <span class="n">base</span><span class="o">::</span><span class="n">ThreadCPUUsage</span><span class="p">();</span> <span class="p">}</span> <span class="c1">// Sample start time after potentially making the above expensive call,</span> <span class="c1">// so as not to pollute wall time measurements.</span> <span class="n">run_start_time_</span> <span class="o">=</span> <span class="n">absl</span><span class="o">::</span><span class="n">Now</span><span class="p">();</span> <span 
class="p">}</span> <span class="o">~</span><span class="n">ScopedPerformanceMeasurement</span><span class="p">()</span> <span class="p">{</span> <span class="n">ScopedPerformanceMeasurement</span><span class="o">::</span><span class="n">ScopedPerformanceMeasurement</span><span class="p">(</span> <span class="n">PacketExecutor</span><span class="o">*</span> <span class="n">packet_executor</span><span class="p">)</span> <span class="o">:</span> <span class="n">packet_executor_</span><span class="p">(</span><span class="n">packet_executor</span><span class="p">),</span> <span class="n">tracer_</span><span class="p">(</span><span class="n">packet_executor</span><span class="o">-&gt;</span><span class="n">packet_executor_trace_threshold_</span><span class="p">,</span> <span class="n">kClosureTraceName</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// ThreadCPUUsage is an expensive call. At the time of writing,</span> <span class="c1">// it takes over 400ns, or roughly 30 times slower than absl::Now,</span> <span class="c1">// so we sample only 1 in 32 closures to keep the cost down.</span> <span class="k">if</span> <span class="p">(</span><span class="n">packet_executor</span><span class="o">-&gt;</span><span class="n">closures_executed_</span> <span class="o">%</span> <span class="mi">32</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span> <span class="n">thread_cpu_usage_start_</span> <span class="o">=</span> <span class="n">base</span><span class="o">::</span><span class="n">ThreadCPUUsage</span><span class="p">();</span> <span class="p">}</span> <span class="c1">// Sample start time after potentially making the above expensive call,</span> <span class="c1">// so as not to pollute wall time measurements.</span> <span class="n">run_start_time_</span> <span class="o">=</span> <span class="n">absl</span><span class="o">::</span><span class="n">Now</span><span class="p">();</span> <span 
class="p">}</span> <span class="n">packet_executor</span><span class="p">.</span><span class="n">cc</span> <span class="o">~</span><span class="n">ScopedPerformanceMeasurement</span><span class="p">()</span> <span class="p">{</span> <span class="k">auto</span> <span class="n">run_end_time</span> <span class="o">=</span> <span class="n">absl</span><span class="o">::</span><span class="n">Now</span><span class="p">();</span> <span class="k">auto</span> <span class="n">run_duration</span> <span class="o">=</span> <span class="n">run_end_time</span> <span class="o">-</span> <span class="n">run_start_time_</span><span class="p">;</span> <span class="k">if</span> <span class="p">(</span><span class="n">thread_cpu_usage_start_</span><span class="p">.</span><span class="n">has_value</span><span class="p">())</span> <span class="p">{</span> <span class="p">...</span> <span class="p">}</span> <span class="n">closure_execution_time</span><span class="o">-&gt;</span><span class="n">Record</span><span class="p">(</span><span class="n">absl</span><span class="o">::</span><span class="n">ToInt64Microseconds</span><span class="p">(</span><span class="n">run_duration</span><span class="p">));</span> <span class="n">ScopedPerformanceMeasurement</span><span class="o">::~</span><span class="n">ScopedPerformanceMeasurement</span><span class="p">()</span> <span class="p">{</span> <span class="k">auto</span> <span class="n">run_end_time</span> <span class="o">=</span> <span class="n">absl</span><span class="o">::</span><span class="n">Now</span><span class="p">();</span> <span class="k">auto</span> <span class="n">run_duration</span> <span class="o">=</span> <span class="n">run_end_time</span> <span class="o">-</span> <span class="n">run_start_time_</span><span class="p">;</span> <span class="k">if</span> <span class="p">(</span><span class="n">thread_cpu_usage_start_</span><span class="p">.</span><span class="n">has_value</span><span class="p">())</span> <span class="p">{</span> <span 
class="p">...</span> <span class="n">closure_execution_time</span><span class="o">-&gt;</span><span class="n">Record</span><span class="p">(</span><span class="n">absl</span><span class="o">::</span><span class="n">ToInt64Microseconds</span><span class="p">(</span><span class="n">run_duration</span><span class="p">));</span> <span class="p">}</span> </code></pre></div></div> <h3 id="avoid-logging-on-hot-code-paths">Avoid logging on hot code paths</h3> <blockquote> <p>Logging statements can be costly, even if the logging-level for the statement doesn’t actually log anything. E.g., ABSL_VLOG’s implementation requires at least a load and a comparison, which may be a problem in hot code paths. In addition, the presence of the logging code may inhibit compiler optimizations. Consider dropping logging entirely from hot code paths.</p> </blockquote> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">image_similarity</span><span class="p">.</span><span class="n">cc</span> <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">j</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">j</span> <span class="o">&lt;</span> <span class="n">output_subimage_size_y</span><span class="p">;</span> <span class="n">j</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span> <span class="kt">int</span> <span class="n">j1</span> <span class="o">=</span> <span class="n">j</span> <span class="o">-</span> <span class="n">rad</span> <span class="o">+</span> <span class="n">output_to_integral_subimage_y</span><span class="p">;</span> <span class="kt">int</span> <span class="n">j2</span> <span class="o">=</span> <span class="n">j1</span> <span class="o">+</span> <span class="mi">2</span> <span class="o">*</span> <span class="n">rad</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span> <span 
class="c1">// Create a pointer for this row's output, taking into account the offset</span> <span class="c1">// to the full image.</span> <span class="kt">double</span> <span class="o">*</span><span class="n">image_diff_ptr</span> <span class="o">=</span> <span class="o">&amp;</span><span class="p">(</span><span class="o">*</span><span class="n">image_diff</span><span class="p">)(</span><span class="n">j</span> <span class="o">+</span> <span class="n">min_j</span><span class="p">,</span> <span class="n">min_i</span><span class="p">);</span> <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">output_subimage_size_x</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span> <span class="p">...</span> <span class="k">if</span> <span class="p">(</span><span class="n">VLOG_IS_ON</span><span class="p">(</span><span class="mi">3</span><span class="p">))</span> <span class="p">{</span> <span class="p">...</span> <span class="p">}</span> <span class="p">...</span> <span class="p">}</span> <span class="p">}</span> <span class="k">const</span> <span class="kt">bool</span> <span class="n">vlog_3</span> <span class="o">=</span> <span class="n">DEBUG_MODE</span> <span class="o">?</span> <span class="n">VLOG_IS_ON</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span> <span class="o">:</span> <span class="nb">false</span><span class="p">;</span> <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">j</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">j</span> <span class="o">&lt;</span> <span class="n">output_subimage_size_y</span><span class="p">;</span> <span class="n">j</span><span 
class="o">++</span><span class="p">)</span> <span class="p">{</span> <span class="kt">int</span> <span class="n">j1</span> <span class="o">=</span> <span class="n">j</span> <span class="o">-</span> <span class="n">rad</span> <span class="o">+</span> <span class="n">output_to_integral_subimage_y</span><span class="p">;</span> <span class="kt">int</span> <span class="n">j2</span> <span class="o">=</span> <span class="n">j1</span> <span class="o">+</span> <span class="mi">2</span> <span class="o">*</span> <span class="n">rad</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span> <span class="c1">// Create a pointer for this row's output, taking into account the offset</span> <span class="c1">// to the full image.</span> <span class="kt">double</span> <span class="o">*</span><span class="n">image_diff_ptr</span> <span class="o">=</span> <span class="o">&amp;</span><span class="p">(</span><span class="o">*</span><span class="n">image_diff</span><span class="p">)(</span><span class="n">j</span> <span class="o">+</span> <span class="n">min_j</span><span class="p">,</span> <span class="n">min_i</span><span class="p">);</span> <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">output_subimage_size_x</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span> <span class="p">...</span> <span class="k">if</span> <span class="p">(</span><span class="n">vlog_3</span><span class="p">)</span> <span class="p">{</span> <span class="p">...</span> <span class="p">}</span> <span class="p">}</span> <span class="p">}</span> <span class="n">Run</span> <span class="nf">on</span> <span class="p">(</span><span class="mi">40</span> <span class="n">X</span> <span class="mi">2801</span> <span 
class="n">MHz</span> <span class="n">CPUs</span><span class="p">);</span> <span class="mi">2016</span><span class="o">-</span><span class="mo">05</span><span class="o">-</span><span class="mi">16</span><span class="n">T15</span><span class="o">:</span><span class="mi">55</span><span class="o">:</span><span class="mf">32.250633072</span><span class="o">-</span><span class="mo">07</span><span class="o">:</span><span class="mo">00</span> <span class="n">CPU</span><span class="o">:</span> <span class="n">Intel</span> <span class="n">Ivybridge</span> <span class="n">with</span> <span class="n">HyperThreading</span> <span class="p">(</span><span class="mi">20</span> <span class="n">cores</span><span class="p">)</span> <span class="n">dL1</span><span class="o">:</span><span class="mi">32</span><span class="n">KB</span> <span class="n">dL2</span><span class="o">:</span><span class="mi">256</span><span class="n">KB</span> <span class="n">dL3</span><span class="o">:</span><span class="mi">25</span><span class="n">MB</span> <span class="n">Benchmark</span> <span class="n">Base</span> <span class="p">(</span><span class="n">ns</span><span class="p">)</span> <span class="n">New</span> <span class="p">(</span><span class="n">ns</span><span class="p">)</span> <span class="n">Improvement</span> <span class="o">------------------------------------------------------------------</span> <span class="n">BM_NCCPerformance</span><span class="o">/</span><span class="mi">16</span> <span class="mi">29104</span> <span class="mi">26372</span> <span class="o">+</span><span class="mf">9.4</span><span class="o">%</span> <span class="n">BM_NCCPerformance</span><span class="o">/</span><span class="mi">64</span> <span class="mi">473235</span> <span class="mi">425281</span> <span class="o">+</span><span class="mf">10.1</span><span class="o">%</span> <span class="n">BM_NCCPerformance</span><span class="o">/</span><span class="mi">512</span> <span class="mi">30246238</span> <span 
class="mi">27622009</span> <span class="o">+</span><span class="mf">8.7</span><span class="o">%</span> <span class="n">BM_NCCPerformance</span><span class="o">/</span><span class="mi">1</span><span class="n">k</span> <span class="mi">125651445</span> <span class="mi">113361991</span> <span class="o">+</span><span class="mf">9.8</span><span class="o">%</span> <span class="n">BM_NCCLimitedBoundsPerformance</span><span class="o">/</span><span class="mi">16</span> <span class="mi">8314</span> <span class="mi">7498</span> <span class="o">+</span><span class="mf">9.8</span><span class="o">%</span> <span class="n">BM_NCCLimitedBoundsPerformance</span><span class="o">/</span><span class="mi">64</span> <span class="mi">143508</span> <span class="mi">132202</span> <span class="o">+</span><span class="mf">7.9</span><span class="o">%</span> <span class="n">BM_NCCLimitedBoundsPerformance</span><span class="o">/</span><span class="mi">512</span> <span class="mi">9335684</span> <span class="mi">8477567</span> <span class="o">+</span><span class="mf">9.2</span><span class="o">%</span> <span class="n">BM_NCCLimitedBoundsPerformance</span><span class="o">/</span><span class="mi">1</span><span class="n">k</span> <span class="mi">37223897</span> <span class="mi">34201739</span> <span class="o">+</span><span class="mf">8.1</span><span class="o">%</span> </code></pre></div></div> <h2 id="code-size-considerations">Code size considerations</h2> <blockquote> <p>Performance encompasses more than just runtime speed. Sometimes it is worth considering the effects of software choices on the size of generated code. Large code size means longer compile and link times, bloated binaries, more memory usage, more icache pressure, and other sometimes negative effects on microarchitectural structures like branch predictors, etc. 
Thinking about these issues is especially important when writing low-level library code that will be used in many places, or when writing templated code that you expect will be instantiated for many different types.</p> </blockquote> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Turn</span> <span class="n">many</span> <span class="n">map</span> <span class="n">insertion</span> <span class="n">calls</span> <span class="n">in</span> <span class="n">a</span> <span class="n">row</span> <span class="n">to</span> <span class="n">initialize</span> <span class="n">a</span> <span class="n">hash</span> <span class="n">table</span> <span class="n">of</span> <span class="n">emoji</span> <span class="n">characters</span> <span class="n">into</span> <span class="n">a</span> <span class="n">single</span> <span class="n">bulk</span> <span class="n">insert</span> <span class="nf">operation</span> <span class="p">(</span><span class="mi">188</span><span class="n">KB</span> <span class="n">of</span> <span class="n">text</span> <span class="n">down</span> <span class="n">to</span> <span class="mi">360</span> <span class="n">bytes</span> <span class="n">in</span> <span class="n">library</span> <span class="n">linked</span> <span class="n">into</span> <span class="n">many</span> <span class="n">binaries</span><span class="p">).</span> <span class="err">😊</span> <span class="n">textfallback_init</span><span class="p">.</span><span class="n">h</span> <span class="kr">inline</span> <span class="kt">void</span> <span class="n">AddEmojiFallbacks</span><span class="p">(</span><span class="n">TextFallbackMap</span> <span class="o">*</span><span class="n">map</span><span class="p">)</span> <span class="p">{</span> <span class="p">(</span><span class="o">*</span><span class="n">map</span><span class="p">)[</span><span class="mh">0xFE000</span><span class="p">]</span> <span class="o">=</span> <span class="o">&amp;</span><span
class="n">kFE000</span><span class="p">;</span> <span class="p">(</span><span class="o">*</span><span class="n">map</span><span class="p">)[</span><span class="mh">0xFE001</span><span class="p">]</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">kFE001</span><span class="p">;</span> <span class="p">(</span><span class="o">*</span><span class="n">map</span><span class="p">)[</span><span class="mh">0xFE002</span><span class="p">]</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">kFE002</span><span class="p">;</span> <span class="p">(</span><span class="o">*</span><span class="n">map</span><span class="p">)[</span><span class="mh">0xFE003</span><span class="p">]</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">kFE003</span><span class="p">;</span> <span class="p">(</span><span class="o">*</span><span class="n">map</span><span class="p">)[</span><span class="mh">0xFE004</span><span class="p">]</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">kFE004</span><span class="p">;</span> <span class="p">(</span><span class="o">*</span><span class="n">map</span><span class="p">)[</span><span class="mh">0xFE005</span><span class="p">]</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">kFE005</span><span class="p">;</span> <span class="p">...</span> <span class="p">(</span><span class="o">*</span><span class="n">map</span><span class="p">)[</span><span class="mh">0xFEE7D</span><span class="p">]</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">kFEE7D</span><span class="p">;</span> <span class="p">(</span><span class="o">*</span><span class="n">map</span><span class="p">)[</span><span class="mh">0xFEEA0</span><span class="p">]</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">kFEEA0</span><span class="p">;</span> <span class="p">(</span><span class="o">*</span><span class="n">map</span><span 
class="p">)[</span><span class="mh">0xFE331</span><span class="p">]</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">kFE331</span><span class="p">;</span> <span class="p">};</span> <span class="kr">inline</span> <span class="kt">void</span> <span class="nf">AddEmojiFallbacks</span><span class="p">(</span><span class="n">TextFallbackMap</span> <span class="o">*</span><span class="n">map</span><span class="p">)</span> <span class="p">{</span> <span class="cp">#define PAIR(x) {0x##x, &amp;k##x} </span> <span class="c1">// clang-format off</span> <span class="n">map</span><span class="o">-&gt;</span><span class="n">insert</span><span class="p">({</span> <span class="n">PAIR</span><span class="p">(</span><span class="n">FE000</span><span class="p">),</span> <span class="n">PAIR</span><span class="p">(</span><span class="n">FE001</span><span class="p">),</span> <span class="n">PAIR</span><span class="p">(</span><span class="n">FE002</span><span class="p">),</span> <span class="n">PAIR</span><span class="p">(</span><span class="n">FE003</span><span class="p">),</span> <span class="n">PAIR</span><span class="p">(</span><span class="n">FE004</span><span class="p">),</span> <span class="n">PAIR</span><span class="p">(</span><span class="n">FE005</span><span class="p">),</span> <span class="p">...</span> <span class="n">PAIR</span><span class="p">(</span><span class="n">FEE7D</span><span class="p">),</span> <span class="n">PAIR</span><span class="p">(</span><span class="n">FEEA0</span><span class="p">),</span> <span class="n">PAIR</span><span class="p">(</span><span class="n">FE331</span><span class="p">)});</span> <span class="c1">// clang-format on</span> <span class="cp">#undef PAIR </span><span class="p">};</span> </code></pre></div></div> <h3 id="parallelization-and-synchronization">Parallelization and synchronization</h3> <blockquote> <p>Modern machines have many cores, and they are often underutilized. 
Expensive work may therefore be completed faster by parallelizing it. The most common approach is to process different items in parallel and combine the results when done. Typically, the items are first partitioned into batches to avoid paying the cost of running something in parallel per item.</p> </blockquote> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Four</span><span class="o">-</span><span class="n">way</span> <span class="n">parallelization</span> <span class="n">improves</span> <span class="n">the</span> <span class="n">rate</span> <span class="n">of</span> <span class="n">encoding</span> <span class="n">tokens</span> <span class="n">by</span> <span class="o">~</span><span class="mf">3.6</span><span class="n">x</span><span class="p">.</span> <span class="n">blocked</span><span class="o">-</span><span class="n">token</span><span class="o">-</span><span class="n">coder</span><span class="p">.</span><span class="n">cc</span> <span class="n">MutexLock</span> <span class="nf">l</span><span class="p">(</span><span class="o">&amp;</span><span class="n">encoder_threads_lock</span><span class="p">);</span> <span class="k">if</span> <span class="p">(</span><span class="n">encoder_threads</span> <span class="o">==</span> <span class="nb">NULL</span><span class="p">)</span> <span class="p">{</span> <span class="n">encoder_threads</span> <span class="o">=</span> <span class="k">new</span> <span class="n">ThreadPool</span><span class="p">(</span><span class="n">NumCPUs</span><span class="p">());</span> <span class="n">encoder_threads</span><span class="o">-&gt;</span><span class="n">SetStackSize</span><span class="p">(</span><span class="mi">262144</span><span class="p">);</span> <span class="n">encoder_threads</span><span class="o">-&gt;</span><span class="n">StartWorkers</span><span class="p">();</span> <span class="p">}</span> <span class="n">encoder_threads</span><span class="o">-&gt;</span><span 
class="n">Add</span> <span class="p">(</span><span class="n">NewCallback</span><span class="p">(</span><span class="k">this</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">BlockedTokenEncoder</span><span class="o">::</span><span class="n">EncodeRegionInThread</span><span class="p">,</span> <span class="n">region_tokens</span><span class="p">,</span> <span class="n">N</span><span class="p">,</span> <span class="n">region</span><span class="p">,</span> <span class="n">stats</span><span class="p">,</span> <span class="n">controller_</span><span class="o">-&gt;</span><span class="n">GetClosureWithCost</span> <span class="p">(</span><span class="n">NewCallback</span><span class="p">(</span><span class="o">&amp;</span><span class="n">DummyCallback</span><span class="p">),</span> <span class="n">N</span><span class="p">)));</span> </code></pre></div></div> <blockquote> <p>The effect on system performance should be measured carefully – if spare CPU is not available, or if memory bandwidth is saturated, parallelization may not help, or may even hurt.</p> </blockquote> <p>This is the caveat. It’s hard to gauge this for every type of machine out there.</p> <h3 id="amortize-lock-acquisition">Amortize lock acquisition</h3> <blockquote> <p>Avoid fine-grained locking to reduce the cost of Mutex operations in hot paths. Caveat: this should only be done if the change does not increase lock contention.</p> </blockquote> <p>Interesting.
Yes, if there is another thread accessing it, it could theoretically be faster (say some section isn’t actually using the shared variable)</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Acquire lock once to free entire tree of query nodes, rather than reacquiring lock for every node in tree.</span> <span class="c1">// Pool of query nodes</span> <span class="n">ThreadSafeFreeList</span><span class="o">&lt;</span><span class="n">MustangQuery</span><span class="o">&gt;</span> <span class="n">pool_</span><span class="p">(</span><span class="mi">256</span><span class="p">);</span> <span class="p">...</span> <span class="kt">void</span> <span class="n">MustangQuery</span><span class="o">::</span><span class="n">Release</span><span class="p">(</span><span class="n">MustangQuery</span><span class="o">*</span> <span class="n">node</span><span class="p">)</span> <span class="p">{</span> <span class="k">if</span> <span class="p">(</span><span class="n">node</span> <span class="o">==</span> <span class="nb">NULL</span><span class="p">)</span> <span class="k">return</span><span class="p">;</span> <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span><span class="o">=</span><span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">node</span><span class="o">-&gt;</span><span class="n">children_</span><span class="o">-&gt;</span><span class="n">size</span><span class="p">();</span> <span class="o">++</span><span class="n">i</span><span class="p">)</span> <span class="n">Release</span><span class="p">((</span><span class="o">*</span><span class="n">node</span><span class="o">-&gt;</span><span class="n">children_</span><span class="p">)[</span><span class="n">i</span><span class="p">]);</span> <span class="n">node</span><span class="o">-&gt;</span><span class="n">children_</span><span
class="o">-&gt;</span><span class="n">clear</span><span class="p">();</span> <span class="n">pool_</span><span class="p">.</span><span class="n">Delete</span><span class="p">(</span><span class="n">node</span><span class="p">);</span> <span class="p">}</span> <span class="c1">// Pool of query nodes</span> <span class="n">Mutex</span> <span class="n">pool_lock_</span><span class="p">;</span> <span class="n">FreeList</span><span class="o">&lt;</span><span class="n">MustangQuery</span><span class="o">&gt;</span> <span class="n">pool_</span><span class="p">(</span><span class="mi">256</span><span class="p">);</span> <span class="p">...</span> <span class="kt">void</span> <span class="n">MustangQuery</span><span class="o">::</span><span class="n">Release</span><span class="p">(</span><span class="n">MustangQuery</span><span class="o">*</span> <span class="n">node</span><span class="p">)</span> <span class="p">{</span> <span class="k">if</span> <span class="p">(</span><span class="n">node</span> <span class="o">==</span> <span class="nb">NULL</span><span class="p">)</span> <span class="k">return</span><span class="p">;</span> <span class="n">MutexLock</span> <span class="n">l</span><span class="p">(</span><span class="o">&amp;</span><span class="n">pool_lock_</span><span class="p">);</span> <span class="n">ReleaseLocked</span><span class="p">(</span><span class="n">node</span><span class="p">);</span> <span class="p">}</span> <span class="kt">void</span> <span class="n">MustangQuery</span><span class="o">::</span><span class="n">ReleaseLocked</span><span class="p">(</span><span class="n">MustangQuery</span><span class="o">*</span> <span class="n">node</span><span class="p">)</span> <span class="p">{</span> <span class="cp">#ifndef NDEBUG </span> <span class="n">pool_lock_</span><span class="p">.</span><span class="n">AssertHeld</span><span class="p">();</span> <span class="cp">#endif </span> <span class="k">if</span> <span class="p">(</span><span class="n">node</span> 
<span class="o">==</span> <span class="nb">NULL</span><span class="p">)</span> <span class="k">return</span><span class="p">;</span> <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span><span class="o">=</span><span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">node</span><span class="o">-&gt;</span><span class="n">children_</span><span class="o">-&gt;</span><span class="n">size</span><span class="p">();</span> <span class="o">++</span><span class="n">i</span><span class="p">)</span> <span class="n">ReleaseLocked</span><span class="p">((</span><span class="o">*</span><span class="n">node</span><span class="o">-&gt;</span><span class="n">children_</span><span class="p">)[</span><span class="n">i</span><span class="p">]);</span> <span class="n">node</span><span class="o">-&gt;</span><span class="n">children_</span><span class="o">-&gt;</span><span class="n">clear</span><span class="p">();</span> <span class="n">pool_</span><span class="p">.</span><span class="n">Delete</span><span class="p">(</span><span class="n">node</span><span class="p">);</span> <span class="p">}</span> </code></pre></div></div> <h3 id="keep-critical-sections-short">Keep critical sections short</h3> <blockquote> <p>Avoid expensive work inside critical sections. 
In particular, watch out for innocuous looking code that might be doing RPCs or accessing files.</p> </blockquote> <p>Basically minimize critical sections, but in addition, try to find these critical sections that have high ROI.</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Avoid</span> <span class="n">RPC</span> <span class="k">while</span> <span class="n">holding</span> <span class="n">Mutex</span><span class="p">.</span> <span class="n">trainer</span><span class="p">.</span><span class="n">cc</span> <span class="p">{</span> <span class="c1">// Notify the parameter server that we are starting.</span> <span class="n">MutexLock</span> <span class="n">l</span><span class="p">(</span><span class="o">&amp;</span><span class="n">lock_</span><span class="p">);</span> <span class="n">model_</span> <span class="o">=</span> <span class="n">model</span><span class="p">;</span> <span class="n">MaybeRecordProgress</span><span class="p">(</span><span class="n">last_global_step_</span><span class="p">);</span> <span class="p">}</span> <span class="kt">bool</span> <span class="n">should_start_record_progress</span> <span class="o">=</span> <span class="nb">false</span><span class="p">;</span> <span class="n">int64</span> <span class="n">step_for_progress</span> <span class="o">=</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span> <span class="p">{</span> <span class="c1">// Notify the parameter server that we are starting.</span> <span class="n">MutexLock</span> <span class="n">l</span><span class="p">(</span><span class="o">&amp;</span><span class="n">lock_</span><span class="p">);</span> <span class="n">model_</span> <span class="o">=</span> <span class="n">model</span><span class="p">;</span> <span class="n">should_start_record_progress</span> <span class="o">=</span> <span class="n">ShouldStartRecordProgress</span><span class="p">();</span> <span 
class="n">step_for_progress</span> <span class="o">=</span> <span class="n">last_global_step_</span><span class="p">;</span> <span class="p">}</span> <span class="k">if</span> <span class="p">(</span><span class="n">should_start_record_progress</span><span class="p">)</span> <span class="p">{</span> <span class="n">StartRecordProgress</span><span class="p">(</span><span class="n">step_for_progress</span><span class="p">);</span> <span class="p">}</span> </code></pre></div></div> <h3 id="reduce-contention-by-sharding">Reduce contention by sharding</h3> <blockquote> <p>Sometimes a data structure protected by a Mutex that is exhibiting high contention can be safely split into multiple shards, each shard with its own Mutex. (Note: this requires that there are no cross-shard invariants between the different shards.)</p> </blockquote> <p>This just means that the underlying elements can be processed in parallel, but the global object cannot be accessed during this time. I didn’t realize you could just initialize multiple copies.</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">ShardedLRUCache</span> <span class="o">:</span> <span class="k">public</span> <span class="n">Cache</span> <span class="p">{</span> <span class="nl">private:</span> <span class="n">LRUCache</span> <span class="n">shard_</span><span class="p">[</span><span class="n">kNumShards</span><span class="p">];</span> <span class="n">port</span><span class="o">::</span><span class="n">Mutex</span> <span class="n">id_mutex_</span><span class="p">;</span> <span class="kt">uint64_t</span> <span class="n">last_id_</span><span class="p">;</span> <span class="k">static</span> <span class="kr">inline</span> <span class="kt">uint32_t</span> <span class="n">HashSlice</span><span class="p">(</span><span class="k">const</span> <span class="n">Slice</span><span class="o">&amp;</span> <span class="n">s</span><span
class="p">)</span> <span class="p">{</span> <span class="k">return</span> <span class="n">Hash</span><span class="p">(</span><span class="n">s</span><span class="p">.</span><span class="n">data</span><span class="p">(),</span> <span class="n">s</span><span class="p">.</span><span class="n">size</span><span class="p">(),</span> <span class="mi">0</span><span class="p">);</span> <span class="p">}</span> <span class="k">static</span> <span class="kt">uint32_t</span> <span class="nf">Shard</span><span class="p">(</span><span class="kt">uint32_t</span> <span class="n">hash</span><span class="p">)</span> <span class="p">{</span> <span class="k">return</span> <span class="n">hash</span> <span class="o">&gt;&gt;</span> <span class="p">(</span><span class="mi">32</span> <span class="o">-</span> <span class="n">kNumShardBits</span><span class="p">);</span> <span class="p">}</span> <span class="p">...</span> <span class="k">virtual</span> <span class="n">Handle</span><span class="o">*</span> <span class="nf">Lookup</span><span class="p">(</span><span class="k">const</span> <span class="n">Slice</span><span class="o">&amp;</span> <span class="n">key</span><span class="p">)</span> <span class="p">{</span> <span class="k">const</span> <span class="kt">uint32_t</span> <span class="n">hash</span> <span class="o">=</span> <span class="n">HashSlice</span><span class="p">(</span><span class="n">key</span><span class="p">);</span> <span class="k">return</span> <span class="n">shard_</span><span class="p">[</span><span class="n">Shard</span><span class="p">(</span><span class="n">hash</span><span class="p">)].</span><span class="n">Lookup</span><span class="p">(</span><span class="n">key</span><span class="p">,</span> <span class="n">hash</span><span class="p">);</span> <span class="p">}</span> </code></pre></div></div> <blockquote> <p>Be careful with the information used for shard selection. 
If, for example, you use some bits of a hash value for shard selection and then those same bits end up being used again later, the latter use may perform poorly since it sees a skewed distribution of hash values.</p> </blockquote> <p>For sharding, equal distribution is always important. Nothing should be overloaded.</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">This</span> <span class="n">CL</span> <span class="n">partitions</span> <span class="n">the</span> <span class="n">ActiveCallMap</span> <span class="n">into</span> <span class="mi">64</span> <span class="n">shards</span><span class="p">.</span> <span class="n">Each</span> <span class="n">shard</span> <span class="n">is</span> <span class="k">protected</span> <span class="n">by</span> <span class="n">a</span> <span class="n">separate</span> <span class="n">mutex</span><span class="p">.</span> <span class="n">A</span> <span class="n">given</span> <span class="n">transaction</span> <span class="n">will</span> <span class="n">be</span> <span class="n">mapped</span> <span class="n">to</span> <span class="n">exactly</span> <span class="n">one</span> <span class="n">shard</span><span class="p">.</span> <span class="n">A</span> <span class="k">new</span> <span class="n">interface</span> <span class="nf">LockedShard</span><span class="p">(</span><span class="n">tid</span><span class="p">)</span> <span class="n">is</span> <span class="n">added</span> <span class="k">for</span> <span class="n">accessing</span> <span class="n">the</span> <span class="n">ActiveCallMap</span> <span class="k">for</span> <span class="n">a</span> <span class="n">transaction</span> <span class="n">in</span> <span class="n">a</span> <span class="kr">thread</span><span class="o">-</span><span class="n">safe</span> <span class="n">manner</span><span class="p">.</span> <span class="n">Example</span> <span class="n">usage</span><span class="o">:</span> <span 
class="n">transaction_manager</span><span class="p">.</span><span class="n">cc</span> <span class="p">{</span> <span class="n">absl</span><span class="o">::</span><span class="n">MutexLock</span> <span class="n">l</span><span class="p">(</span><span class="o">&amp;</span><span class="n">active_calls_in_mu_</span><span class="p">);</span> <span class="n">delayed_locks_timer_ring_</span><span class="p">.</span><span class="n">Add</span><span class="p">(</span><span class="n">delayed_locks_flush_time_ms</span><span class="p">,</span> <span class="n">tid</span><span class="p">);</span> <span class="p">}</span> <span class="p">{</span> <span class="n">ActiveCalls</span><span class="o">::</span><span class="n">LockedShard</span> <span class="n">shard</span><span class="p">(</span><span class="n">active_calls_in_</span><span class="p">,</span> <span class="n">tid</span><span class="p">);</span> <span class="n">shard</span><span class="p">.</span><span class="n">delayed_locks_timer_ring</span><span class="p">().</span><span class="n">Add</span><span class="p">(</span><span class="n">delayed_locks_flush_time_ms</span><span class="p">,</span> <span class="n">tid</span><span class="p">);</span> <span class="p">}</span> <span class="n">The</span> <span class="n">results</span> <span class="n">show</span> <span class="n">a</span> <span class="mi">69</span><span class="o">%</span> <span class="n">reduction</span> <span class="n">in</span> <span class="n">overall</span> <span class="n">wall</span><span class="o">-</span><span class="n">clock</span> <span class="n">time</span> <span class="n">when</span> <span class="n">running</span> <span class="n">the</span> <span class="n">benchmark</span> <span class="n">with</span> <span class="mi">8192</span> <span class="n">fibers</span> </code></pre></div></div> <h3 id="reduce-false-sharing">Reduce false sharing</h3> <blockquote> <p>If different threads access different mutable data, consider placing the different data items on different 
cache lines, e.g., in C++ using the alignas directive. However, these directives are easy to misuse and may increase object sizes significantly, so make sure performance measurements justify their use.</p> </blockquote> <p>Trade size for performance… How do you even identify such a thing</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">histogram</span><span class="p">.</span><span class="n">h</span> <span class="n">HistogramOptions</span> <span class="n">options_</span><span class="p">;</span> <span class="p">...</span> <span class="n">internal</span><span class="o">::</span><span class="n">HistogramBoundaries</span> <span class="o">*</span><span class="n">boundaries_</span><span class="p">;</span> <span class="p">...</span> <span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o">&lt;</span><span class="kt">double</span><span class="o">&gt;</span> <span class="n">buckets_</span><span class="p">;</span> <span class="kt">double</span> <span class="n">min_</span><span class="p">;</span> <span class="c1">// Minimum.</span> <span class="kt">double</span> <span class="n">max_</span><span class="p">;</span> <span class="c1">// Maximum.</span> <span class="kt">double</span> <span class="n">count_</span><span class="p">;</span> <span class="c1">// Total count of occurrences.</span> <span class="kt">double</span> <span class="n">sum_</span><span class="p">;</span> <span class="c1">// Sum of values.</span> <span class="kt">double</span> <span class="n">sum_of_squares_</span><span class="p">;</span> <span class="c1">// Sum of squares of values.</span> <span class="p">...</span> <span class="n">RegisterVariableExporter</span> <span class="o">*</span><span class="n">exporter_</span><span class="p">;</span> <span class="n">HistogramOptions</span> <span class="n">options_</span><span class="p">;</span> <span class="p">...</span> <span class="n">internal</span><span 
class="o">::</span><span class="n">HistogramBoundaries</span> <span class="o">*</span><span class="n">boundaries_</span><span class="p">;</span> <span class="p">...</span> <span class="n">RegisterVariableExporter</span> <span class="o">*</span><span class="n">exporter_</span><span class="p">;</span> <span class="p">...</span> <span class="c1">// Place the following fields in a dedicated cacheline as they are frequently</span> <span class="c1">// mutated, so we can avoid potential false sharing.</span> <span class="p">...</span> <span class="cp">#ifndef SWIG </span> <span class="k">alignas</span><span class="p">(</span><span class="n">ABSL_CACHELINE_SIZE</span><span class="p">)</span> <span class="cp">#endif </span> <span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o">&lt;</span><span class="kt">double</span><span class="o">&gt;</span> <span class="n">buckets_</span><span class="p">;</span> <span class="kt">double</span> <span class="n">min_</span><span class="p">;</span> <span class="c1">// Minimum.</span> <span class="kt">double</span> <span class="n">max_</span><span class="p">;</span> <span class="c1">// Maximum.</span> <span class="kt">double</span> <span class="n">count_</span><span class="p">;</span> <span class="c1">// Total count of occurrences.</span> <span class="kt">double</span> <span class="n">sum_</span><span class="p">;</span> <span class="c1">// Sum of values.</span> <span class="kt">double</span> <span class="n">sum_of_squares_</span><span class="p">;</span> <span class="c1">// Sum of squares of values.</span> </code></pre></div></div> <h3 id="reduce-frequency-of-context-switches">Reduce frequency of context switches</h3> <blockquote> <p>Process small work items inline instead of on device thread pool.</p> </blockquote> <p>hard to see this without a tracer</p> <h3 id="consider-lock-free-approaches">Consider lock-free approaches</h3> <blockquote> <p>Sometimes lock-free data structures can make a difference 
over more conventional mutex-protected data structures. However, direct atomic variable manipulation can be dangerous. Prefer higher-level abstractions.</p> </blockquote> <p>Extremely hard to debug and catch issues with. I don’t have expertise in this.</p> <h3 id="protocol-buffer-advice">Protocol Buffer advice</h3> <p>I think this section is rather large, and for good reason. Messages are one of the foundational building blocks of any distributed system, and optimizing even a small percentage will have high yields. This section covers good practices that can be applied to any message protocol.</p> <p>What I mostly got from this section is that you need to look at the generated serialization code, understand its serialization/deserialization overhead, and find the best practices to reduce that overhead (either by editing the proto file or by editing the C++, e.g. by adding arenas).</p> <h3 id="c-specific-advice">C++-Specific advice</h3> <p>Prefer absl::flat_hash_map (and set) over the std equivalents. This is generally true for almost all standard-library containers in C++ except a very small subset (like std::vector).</p> <blockquote> <p>absl::InlinedVector stores a small number of elements inline (configurable via the second template argument). This enables small vectors up to this number of elements to generally have better cache efficiency and also to avoid allocating a backing store array at all when the number of elements is small.</p> </blockquote> <p>This is probably just allocating on the stack.
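To make the idea concrete, here's a minimal std-only sketch of the small-buffer trick; the `SmallVec` type and its layout are hypothetical illustrations, not absl::InlinedVector's actual implementation:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical SmallVec<T, N>: the first N elements live in an inline
// array (stack/object storage); only when it overflows do the elements
// spill to a heap-allocated backing store.
template <typename T, std::size_t N>
class SmallVec {
 public:
  void push_back(const T& v) {
    if (size_ < N && heap_.empty()) {
      inline_[size_++] = v;  // small sizes: no allocation at all
    } else {
      spill();               // one-time copy of inline elements to the heap
      heap_.push_back(v);
      ++size_;
    }
  }
  std::size_t size() const { return size_; }
  bool inlined() const { return heap_.empty(); }
  const T& operator[](std::size_t i) const {
    return heap_.empty() ? inline_[i] : heap_[i];
  }

 private:
  void spill() {
    if (heap_.empty()) heap_.assign(inline_, inline_ + size_);
  }
  T inline_[N] = {};
  std::size_t size_ = 0;
  std::vector<T> heap_;  // backing store, used only after overflow
};
```

In real code you'd just write `absl::InlinedVector<int, 8> v;` and get the same behavior, plus proper growth handling and move semantics.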
It’s nice, similar to llvm::SmallVector</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">gtl</span><span class="o">::</span><span class="n">vector32</span> <span class="n">Saves</span> <span class="n">space</span> <span class="n">by</span> <span class="k">using</span> <span class="n">a</span> <span class="n">customized</span> <span class="n">vector</span> <span class="n">type</span> <span class="n">that</span> <span class="n">only</span> <span class="n">supports</span> <span class="n">sizes</span> <span class="n">that</span> <span class="n">fit</span> <span class="n">in</span> <span class="mi">32</span> <span class="n">bits</span><span class="p">.</span> <span class="n">Simple</span> <span class="n">type</span> <span class="n">change</span> <span class="n">saves</span> <span class="o">~</span><span class="mi">8</span><span class="n">TiB</span> <span class="n">of</span> <span class="n">memory</span> <span class="n">in</span> <span class="n">Spanner</span><span class="p">.</span> <span class="n">table_ply</span><span class="p">.</span><span class="n">h</span> <span class="k">class</span> <span class="nc">TablePly</span> <span class="p">{</span> <span class="p">...</span> <span class="c1">// Returns the set of data columns stored in this file for this table.</span> <span class="k">const</span> <span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o">&lt;</span><span class="n">FamilyId</span><span class="o">&gt;&amp;</span> <span class="n">modified_data_columns</span><span class="p">()</span> <span class="k">const</span> <span class="p">{</span> <span class="k">return</span> <span class="n">modified_data_columns_</span><span class="p">;</span> <span class="p">}</span> <span class="p">...</span> <span class="k">private</span><span class="o">:</span> <span class="p">...</span> <span class="n">std</span><span class="o">::</span><span class="n">vector</span><span
class="o">&lt;</span><span class="n">FamilyId</span><span class="o">&gt;</span> <span class="n">modified_data_columns_</span><span class="p">;</span> <span class="c1">// Data columns in the table.</span> <span class="cp">#include</span> <span class="cpf">"util/gtl/vector32.h"</span><span class="cp"> </span> <span class="p">...</span> <span class="c1">// Returns the set of data columns stored in this file for this table.</span> <span class="n">absl</span><span class="o">::</span><span class="n">Span</span><span class="o">&lt;</span><span class="k">const</span> <span class="n">FamilyId</span><span class="o">&gt;</span> <span class="n">modified_data_columns</span><span class="p">()</span> <span class="k">const</span> <span class="p">{</span> <span class="k">return</span> <span class="n">modified_data_columns_</span><span class="p">;</span> <span class="p">}</span> <span class="p">...</span> <span class="p">...</span> <span class="c1">// Data columns in the table.</span> <span class="n">gtl</span><span class="o">::</span><span class="n">vector32</span><span class="o">&lt;</span><span class="n">FamilyId</span><span class="o">&gt;</span> <span class="n">modified_data_columns_</span><span class="p">;</span> </code></pre></div></div> <p>This is very cool. 
I guess the data type won’t align up to 64bits, so you can cut it in half.</p> <h1 id="bulk-operations">Bulk operations</h1> <p>As per usual, bulk computation is the answer since memory is the bottleneck…</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Introduced</span> <span class="n">a</span> <span class="n">GroupVarInt</span> <span class="n">format</span> <span class="n">that</span> <span class="n">encodes</span><span class="o">/</span><span class="n">decodes</span> <span class="n">groups</span> <span class="n">of</span> <span class="mi">4</span> <span class="n">variable</span><span class="o">-</span><span class="n">length</span> <span class="n">integers</span> <span class="n">at</span> <span class="n">a</span> <span class="n">time</span> <span class="n">in</span> <span class="mi">5</span><span class="o">-</span><span class="mi">17</span> <span class="n">bytes</span><span class="p">,</span> <span class="n">rather</span> <span class="n">than</span> <span class="n">one</span> <span class="n">integer</span> <span class="n">at</span> <span class="n">a</span> <span class="n">time</span><span class="p">.</span> <span class="n">Decoding</span> <span class="n">one</span> <span class="n">group</span> <span class="n">of</span> <span class="mi">4</span> <span class="n">integers</span> <span class="n">in</span> <span class="n">the</span> <span class="k">new</span> <span class="n">format</span> <span class="n">takes</span> <span class="o">~</span><span class="mi">1</span><span class="o">/</span><span class="mi">3</span><span class="n">rd</span> <span class="n">the</span> <span class="n">time</span> <span class="n">of</span> <span class="n">decoding</span> <span class="mi">4</span> <span class="n">individually</span> <span class="n">varint</span><span class="o">-</span><span class="n">encoded</span> <span class="n">integers</span><span class="p">.</span> <span class="n">groupvarint</span><span 
class="p">.</span><span class="n">cc</span> <span class="k">const</span> <span class="kt">char</span><span class="o">*</span> <span class="nf">DecodeGroupVar</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span><span class="o">*</span> <span class="n">p</span><span class="p">,</span> <span class="kt">int</span> <span class="n">N</span><span class="p">,</span> <span class="n">uint32</span><span class="o">*</span> <span class="n">dest</span><span class="p">)</span> <span class="p">{</span> <span class="n">assert</span><span class="p">(</span><span class="n">groupvar_initialized</span><span class="p">);</span> <span class="n">assert</span><span class="p">(</span><span class="n">N</span> <span class="o">%</span> <span class="mi">4</span> <span class="o">==</span> <span class="mi">0</span><span class="p">);</span> <span class="k">while</span> <span class="p">(</span><span class="n">N</span><span class="p">)</span> <span class="p">{</span> <span class="n">uint8</span> <span class="n">tag</span> <span class="o">=</span> <span class="o">*</span><span class="n">p</span><span class="p">;</span> <span class="n">p</span><span class="o">++</span><span class="p">;</span> <span class="n">uint8</span><span class="o">*</span> <span class="n">lenptr</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">groupvar_table</span><span class="p">[</span><span class="n">tag</span><span class="p">].</span><span class="n">length</span><span class="p">[</span><span class="mi">0</span><span class="p">];</span> <span class="cp">#define GET_NEXT \ do { \ uint8 len = *lenptr; \ *dest = UNALIGNED_LOAD32(p) &amp; groupvar_mask[len]; \ dest++; \ p += len; \ lenptr++; \ } while (0) </span> <span class="n">GET_NEXT</span><span class="p">;</span> <span class="n">GET_NEXT</span><span class="p">;</span> <span class="n">GET_NEXT</span><span class="p">;</span> <span class="n">GET_NEXT</span><span class="p">;</span> <span class="cp">#undef GET_NEXT 
</span> <span class="n">N</span> <span class="o">-=</span> <span class="mi">4</span><span class="p">;</span> <span class="p">}</span> <span class="k">return</span> <span class="n">p</span><span class="p">;</span> <span class="p">}</span> </code></pre></div></div> <h1 id="cls-that-demonstrate-multiple-techniques">CLs that demonstrate multiple techniques</h1> <p>This section is about seeing how a combination of techniques can be used to optimize a small part of a program, and what effect to expect on the overall program.</p> <p>For example, one CL speeds up a GPU allocator by 40% by using fewer bytes, cache-aligning data, caching, and commenting out logging, which results in a 2.9% end-to-end speedup.</p> <blockquote> <p>Speed up low level logging in Google Meet application code.</p> </blockquote> <p>This was changing the logging state from a vector to a static array of size 4, resulting in a 50% boost for logging, which might be a pretty common call.</p> <p>I think all of these require very deep insights into what the program is doing and where the program is spending its time.</p> <blockquote> <p>We found a number of performance issues when planning a switch from on-disk to in-memory index serving in 2001. This change fixed many of these problems and took us from 150 to over 500 in-memory queries per second (for a 2 GB in-memory index on dual processor Pentium III machine).</p> </blockquote> <p>This is back in the day. It most likely still applies to personally written code, but not nearly as much these days, since most people know the general optimizations and search was just becoming available back then!</p> <h1 id="further-reading">Further reading</h1> <blockquote> <p>Understanding Software Dynamics by Richard L. Sites.
Covers expert methods and advanced tools for diagnosing and fixing performance problems.</p> </blockquote> <p>Good book.</p> Maybe consider putting "cutlass" in your CUDA/Triton kernels 2025-12-15T06:00:00+00:00 2025-12-15T06:00:00+00:00 https://maknee.github.io/blog/2025/Maybe-Consider-Putting-Cutlass-In-Your-CUDA-Kernels <h1 id="motivation">Motivation</h1> <p>So I was browsing Hacker News and came across this interesting post: <a href="https://news.ycombinator.com/item?id=45458948">Fp8 runs ~100 tflops faster when the kernel name has “cutlass” in it</a>.</p> <p>This was from a Triton tutorial where someone noticed that adding “cutlass” to their kernel name gave them an additional 100-150 TFLOPs. That’s a huge improvement just from… a name?</p> <div class="image-container" style="max-width: 100%; overflow: visible;"> <img loading="lazy" src="/assets/images/posts/2025-12-05/original1.png" style="width: 120%; margin-left: calc((100% - 120%) / 2);" alt="" /> <div class="caption"> <em>Mentions 100 TFLOPs improvement (Image source: <a href="https://github.com/triton-lang/triton/pull/7298" rel="external nofollow noopener" target="_blank">Github pull</a>) </em> </div> </div> <div class="image-container" style="max-width: 100%; overflow: visible;"> <img loading="lazy" src="/assets/images/posts/2025-12-05/original2.png" style="width: 120%; margin-left: calc((100% - 120%) / 2);" alt="" /> <div class="caption"> <em>Mentions 150 TFLOPs improvement by renaming triton kernels to add cutlass (Image source: <a href="https://github.com/triton-lang/triton/pull/7298" rel="external nofollow noopener" target="_blank">Github pull</a>) </em> </div> </div> <p>Well, I got a bit curious and wanted to find out why this happens.</p> <h1 id="so-what-exactly-is-this">So… what exactly is this?</h1> <p>Instead of writing your kernel like this:</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">__global__</span> <span class="kt">void</span>
<span class="nf">add</span><span class="p">(</span><span class="kt">float</span> <span class="o">*</span><span class="n">sum</span><span class="p">,</span> <span class="kt">int</span> <span class="n">n</span><span class="p">,</span> <span class="kt">float</span> <span class="o">*</span><span class="n">x</span><span class="p">,</span> <span class="kt">float</span> <span class="o">*</span><span class="n">y</span><span class="p">)</span> <span class="p">{</span> <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">n</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="n">sum</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">x</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">+</span> <span class="n">y</span><span class="p">[</span><span class="n">i</span><span class="p">];</span> <span class="p">}</span> </code></pre></div></div> <p>You add “cutlass” to the name:</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">__global__</span> <span class="kt">void</span> <span class="nf">add_cutlass</span><span class="p">(</span><span class="kt">float</span> <span class="o">*</span><span class="n">sum</span><span class="p">,</span> <span class="kt">int</span> <span class="n">n</span><span class="p">,</span> <span class="kt">float</span> <span class="o">*</span><span class="n">x</span><span class="p">,</span> <span class="kt">float</span> <span class="o">*</span><span class="n">y</span><span class="p">)</span> <span class="p">{</span> <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span 
class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">n</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="n">sum</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">x</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">+</span> <span class="n">y</span><span class="p">[</span><span class="n">i</span><span class="p">];</span> <span class="p">}</span> </code></pre></div></div> <p>and <code class="language-plaintext highlighter-rouge">ptxas</code><span class="sidenote-ref"></span><span class="sidenote">If you need some background on the CUDA compilation toolchain, refer to the <a href="#nvidia-toolchain-background">section on NVIDIA toolchain background</a></span> will perform an additional pass that can improve the performance of the generated code.</p> <p>The rest of this blog will show benchmarks, explain the optimizations, and discuss when to use this trick. But I also want to highlight something broader: if you’re working at the high level (CUDA, Triton, PyTorch), you’re still at the mercy of what the backend compilers decide to do. In this case, ptxas (a black box) is making optimization decisions based on your kernel’s name<span class="sidenote-ref"></span><span class="sidenote">With the recent release of <a href="https://docs.nvidia.com/cuda/tile-ir/sections/introduction.html">TileIR</a>, there’s still plenty of magic happening under the hood. 
<code class="language-plaintext highlighter-rouge">tileiras</code> is also a black box, so we could easily see a similar “cutlass” trick emerge there too</span>.</p> <p><a href="#so-what-is-it-doing">If you want to skip to the TL;DR of the optimization, click here</a></p> <h2 id="a-cutlass-example">A cutlass example</h2> <p>Here’s an example graph showing cutlass benchmarks with and without this optimization (where <code class="language-plaintext highlighter-rouge">baseline/cutlass_on</code> enables the optimization and <code class="language-plaintext highlighter-rouge">cutlass_off</code> disables it):</p> <div class="image-container" style="max-width: 100%; overflow: visible;"> <img loading="lazy" src="/assets/images/posts/2025-12-05/main_example.svg" style="width: 120%; margin-left: calc((100% - 120%) / 2);" alt="" /> <div class="caption"> <em>Throughput of various cutlass examples </em> </div> </div> <p>In particular, the <a href="https://docs.nvidia.com/cutlass/media/docs/cpp/cute/0x_gemm_tutorial.html#sgemm-2-cu">CuTe sgemm2.cu</a> <a href="https://github.com/NVIDIA/cutlass/blob/v4.3.0/examples/cute/tutorial/sgemm_2.cu">example</a> sees a 20% drop in performance without the cutlass optimization!</p> <p>Another thing that’s immediately obvious: this optimization doesn’t always increase performance.</p> <h1 id="benchmarks">Benchmarks</h1> <p>Below are sections you can expand to see various benchmarks running on an RTX 3090 and H100. 
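</p>

<p>As context for reading the comparisons below, here is a hypothetical sketch of how per-run numbers can be collapsed into a single relative delta between <code class="language-plaintext highlighter-rouge">cutlass_on</code> and <code class="language-plaintext highlighter-rouge">cutlass_off</code>. This is not the harness used for these benchmarks, and the numbers are illustrative rather than measured results:</p>

```python
# Hypothetical aggregation sketch -- NOT the harness used for these benchmarks.
# Each configuration is run several times; each is summarized by its mean,
# and cutlass_on is compared against cutlass_off as a relative delta.
from statistics import mean

def relative_delta(on_runs, off_runs):
    """Return mean(on_runs) / mean(off_runs) - 1 as a signed fraction."""
    return mean(on_runs) / mean(off_runs) - 1.0

# Illustrative TFLOPs over 5 runs (made-up numbers, not measurements).
cutlass_on = [101.0, 99.5, 100.5, 100.0, 99.0]
cutlass_off = [80.0, 79.5, 80.5, 80.0, 80.0]
print(f"{relative_delta(cutlass_on, cutlass_off):+.1%}")  # prints +25.0%
```

<p>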
Each result is aggregated from 5 benchmark runs.</p> <p>Benchmarks include 15+ projects, covering popular ones like PyTorch, Flash Attention 2/3, Cutlass, and llama.cpp.</p> <p>Some highlights:</p> <ul> <li>Running llama.cpp on RTX 3090 with gpt-oss-20b shows a 1%+ performance increase</li> <li>Flash Attention 2 on RTX 3090/H100 without the optimization decreases performance by up to 1%</li> <li>Triton on RTX 3090 generally shows no performance change from the optimization</li> </ul> <p>Note: <code class="language-plaintext highlighter-rouge">baseline</code> doesn’t change anything. <code class="language-plaintext highlighter-rouge">cutlass_on</code> enables the optimization and <code class="language-plaintext highlighter-rouge">cutlass_off</code> disables it (if the application uses <code class="language-plaintext highlighter-rouge">cutlass</code>, for example Flash Attention 3):</p> <details style="font-size: 1.2em; margin: 1em 0;"> <summary style="cursor: pointer; font-weight: bold;">Expand to see 3090 benchmarks</summary> <link rel="stylesheet" href="/assets/css/fancy_table.css" /> <script> function toggleCalculations(tableId) { // ... 
(toggleCalculations function remains the same) const table = document.getElementById(tableId); const cells = table.querySelectorAll('.has-calculation'); const tableWrapper = document.getElementById(tableId + '-wrapper'); const toggleSwitch = document.getElementById(tableId + '-switch'); const toggleText = document.getElementById(tableId + '-toggle-text'); if (tableWrapper.classList.contains('show-calculations')) { tableWrapper.classList.remove('show-calculations'); toggleSwitch.classList.remove('active'); toggleText.textContent = "Show calculations"; cells.forEach(cell => { const normalText = cell.querySelector('.normal-text'); const calculationText = cell.querySelector('.calculation-text'); normalText.style.display = 'inline'; calculationText.style.display = 'none'; }); } else { tableWrapper.classList.add('show-calculations'); toggleSwitch.classList.add('active'); toggleText.textContent = "Hide calculations"; cells.forEach(cell => { const normalText = cell.querySelector('.normal-text'); const calculationText = cell.querySelector('.calculation-text'); normalText.style.display = 'none'; calculationText.style.display = 'inline'; }); } } function isConsideredYellow(colorString) { if (!colorString) return false; const lowerColor = colorString.toLowerCase(); const yellowKeywords = ['yellow', 'gold', 'lemon', 'chiffon', 'goldenrod', 'papayawhip', 'moccasin', 'khaki', '#ff0', '#ffff00', '#ffffe0', '#fffacd', '#fafad2', '#fff8dc', '#eee8aa', '#f0e68c']; if (yellowKeywords.some(k => lowerColor.includes(k))) { return true; } const match = lowerColor.match(/rgba?\((\d+),\s*(\d+),\s*(\d+)/); if (match) { const r = parseInt(match[1]); const g = parseInt(match[2]); const b = parseInt(match[3]); if (r > 200 && g > 180 && b < 200 && Math.abs(r - g) < 70) { return true; } } if (lowerColor === '#ffdab9') return true; // PeachPuff if (lowerColor === 'rgba(255,255,0,0.2)') return true; // Example: semi-transparent yellow return false; } function initializeCellHighlighters() { const 
highlighters = document.querySelectorAll('[data-highlight-cells]'); highlighters.forEach(highlighter => { if (highlighter.dataset.highlighterInitialized === 'true') return; highlighter.dataset.highlighterInitialized = 'true'; const cellIdsToHighlight = highlighter.dataset.highlightCells.split(',').map(id => id.trim()); const cellElements = cellIdsToHighlight.map(id => document.getElementById(id)).filter(el => el); let descriptiveSpans = []; if (highlighter.matches('span[data-hover-text-color]')) { descriptiveSpans.push(highlighter); } else { descriptiveSpans = Array.from(highlighter.querySelectorAll('span[data-hover-text-color]')); } descriptiveSpans.forEach(span => { if (typeof span.dataset.originalParaTextColor === 'undefined') { span.dataset.originalParaTextColor = window.getComputedStyle(span).color || ''; } }); highlighter.addEventListener('mouseenter', () => { highlighter.classList.add('trigger-text-active'); descriptiveSpans.forEach(span => { if (span.dataset.hoverTextColor) { span.style.color = span.dataset.hoverTextColor; } }); cellElements.forEach((cell, idx) => { cell.classList.add('cell-highlighted'); let hoverTextColorForCell = null; let hoverCellBgFromDescSpan = null; if (idx < descriptiveSpans.length) { const descSpan = descriptiveSpans[idx]; if (descSpan.dataset.hoverTextColor) { hoverTextColorForCell = descSpan.dataset.hoverTextColor; } if (descSpan.dataset.hoverCellBg) { hoverCellBgFromDescSpan = descSpan.dataset.hoverCellBg; } } else if (descriptiveSpans.length === 1 && cellElements.length > 1) { const descSpan = descriptiveSpans[0]; if (descSpan.dataset.hoverTextColor) { hoverTextColorForCell = descSpan.dataset.hoverTextColor; } if (descSpan.dataset.hoverCellBg) { hoverCellBgFromDescSpan = descSpan.dataset.hoverCellBg; } } if (hoverTextColorForCell) { const isCalcCell = cell.classList.contains('has-calculation'); if (isCalcCell) { const normalTextSpan = cell.querySelector('.normal-text'); const calcResultSpan = 
cell.querySelector('.calc-result'); const calcWrapper = cell.querySelector('.calculation-text'); if (calcWrapper && window.getComputedStyle(calcWrapper).display !== 'none') { if (calcResultSpan) calcResultSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } else { if (normalTextSpan) normalTextSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } } else { const directSpan = cell.querySelector(':scope > span'); if (directSpan) directSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } } if (hoverCellBgFromDescSpan && isConsideredYellow(hoverCellBgFromDescSpan)) { if (typeof cell.dataset.originalBgColor === 'undefined') { cell.dataset.originalBgColor = cell.style.backgroundColor; } cell.style.backgroundColor = hoverCellBgFromDescSpan; cell.dataset.bgChangedByScript = 'true'; } }); }); highlighter.addEventListener('mouseleave', () => { highlighter.classList.remove('trigger-text-active'); descriptiveSpans.forEach(span => { if (typeof span.dataset.originalParaTextColor !== 'undefined') { span.style.color = span.dataset.originalParaTextColor; } }); cellElements.forEach((cell, idx) => { cell.classList.remove('cell-highlighted'); const isCalcCell = cell.classList.contains('has-calculation'); if (isCalcCell) { const normalTextSpan = cell.querySelector('.normal-text'); const calcResultSpan = cell.querySelector('.calc-result'); if (normalTextSpan) normalTextSpan.style.removeProperty('color'); if (calcResultSpan) calcResultSpan.style.removeProperty('color'); } else { const directSpan = cell.querySelector(':scope > span'); if (directSpan) directSpan.style.removeProperty('color'); } if (cell.dataset.bgChangedByScript === 'true') { cell.style.backgroundColor = cell.dataset.originalBgColor || ''; delete cell.dataset.originalBgColor; delete cell.dataset.bgChangedByScript; } }); }); }); } if (document.readyState === 'loading') { document.addEventListener('DOMContentLoaded', initializeCellHighlighters); } else { 
initializeCellHighlighters(); } </script> <style> /* ... (Other styles remain the same) ... */ .toggle-container { display: flex; justify-content: flex-end; margin-bottom: 6px; } .calc-toggle { display: inline-flex; align-items: center; } .toggle-switch { position: relative; display: inline-block; width: 32px; height: 16px; background-color: #e5e7eb; /* gray-200 */ border-radius: 10px; transition: all 0.3s; cursor: pointer; } .toggle-switch::after { content: ''; position: absolute; width: 12px; height: 12px; border-radius: 50%; background-color: white; top: 2px; left: 2px; transition: all 0.3s cubic-bezier(0.175, 0.885, 0.32, 1.275); box-shadow: 0 1px 2px rgba(0, 0, 0, 0.1); } .toggle-switch.active { background-color: #8b5cf6; /* purple-500 */ } .toggle-switch.active::after { left: 18px; } .toggle-label { position: absolute; width: 1px; height: 1px; padding: 0; margin: -1px; overflow: hidden; clip: rect(0, 0, 0, 0); white-space: nowrap; border-width: 0; } .toggle-text { font-size: 0.75rem; color: #6b7280; /* gray-500 */ margin-right: 6px; } .calc-result { color: rgb(75, 85, 99); /* gray-700 */ } .calc-formula { color: #6b83a6; /* Subtle bluish gray */ font-family: ui-monospace, SFMono-Regular, Menlo, Monaco, Consolas, monospace; } .calc-equals { color: rgb(75, 85, 99); /* gray-700 */ display: inline-block; margin: 0 4px; font-weight: normal; } .cell-highlighted { font-weight: bold !important; } .trigger-text-active { background-color: #E5E7EB; padding: 1px 3px; border-radius: 3px; transition: background-color 0.15s ease-in-out; } .trigger-text-active span[data-hover-text-color] { padding: 0; background-color: transparent; } /* Add styles for no-scroll option */ .table-wrapper-no-scroll { overflow-x: visible !important; } .table-no-scroll td { white-space: normal !important; word-break: break-word; } </style> <div id="benchmark-3090-table-wrapper" class="px-4 rounded-lg __basic-table not-prose mt-2 mb-2 overflow-x-auto"> <table id="benchmark-3090-table" 
class="min-w-full divide-y divide-gray-200 font-sans basic-table-striped"> <thead class="bg-gray-50"> <tr> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> GPU </th> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Benchmarks </th> </tr> </thead> <tbody> <tr class="border-b border-gray-200"> <td id="benchmark-3090-table-row0-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">RTX 3090 (Ampere)</span> </td> <td id="benchmark-3090-table-row0-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">bitsandbytes</span> </td> <td id="benchmark-3090-table-row0-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">candle</span> </td> <td id="benchmark-3090-table-row0-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">cutlass</span> </td> <td id="benchmark-3090-table-row0-col4" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">flash_attn2</span> </td> <td id="benchmark-3090-table-row0-col5" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">flashinfer</span> </td> <td id="benchmark-3090-table-row0-col6" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">ggml</span> </td> <td id="benchmark-3090-table-row0-col7" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">liger</span> </td> <td id="benchmark-3090-table-row0-col8" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">llamacpp</span> 
</td> <td id="benchmark-3090-table-row0-col9" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">llmc</span> </td> <td id="benchmark-3090-table-row0-col10" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">mojo</span> </td> <td id="benchmark-3090-table-row0-col11" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">nccl</span> </td> <td id="benchmark-3090-table-row0-col12" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">pytorch</span> </td> <td id="benchmark-3090-table-row0-col13" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">sageattention</span> </td> <td id="benchmark-3090-table-row0-col14" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">sgemm</span> </td> <td id="benchmark-3090-table-row0-col15" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">sglang</span> </td> <td id="benchmark-3090-table-row0-col16" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">tilus</span> </td> <td id="benchmark-3090-table-row0-col17" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">tinygrad</span> </td> <td id="benchmark-3090-table-row0-col18" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">torchao</span> </td> <td id="benchmark-3090-table-row0-col19" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">triton</span> </td> 
<td id="benchmark-3090-table-row0-col20" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">unsloth</span> </td> <td id="benchmark-3090-table-row0-col21" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">vllm</span> </td> </tr> </tbody> </table> </div> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-12-05/benchmarks/ampere/bitsandbytes_comparison.png" width="100%" alt="" /> </div> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-12-05/benchmarks/ampere/candle_comparison.png" width="100%" alt="" /> </div> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-12-05/benchmarks/ampere/cutlass_comparison.png" width="100%" alt="" /> </div> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-12-05/benchmarks/ampere/flash_attn2_comparison.png" width="100%" alt="" /> </div> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-12-05/benchmarks/ampere/flashinfer_comparison.png" width="100%" alt="" /> </div> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-12-05/benchmarks/ampere/ggml_comparison.png" width="100%" alt="" /> </div> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-12-05/benchmarks/ampere/liger_comparison.png" width="100%" alt="" /> </div> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-12-05/benchmarks/ampere/llamacpp_comparison.png" width="100%" alt="" /> </div> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-12-05/benchmarks/ampere/llmc_comparison.png" width="100%" alt="" /> </div> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-12-05/benchmarks/ampere/mojo_comparison.png" width="100%" alt="" /> </div> <div 
class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-12-05/benchmarks/ampere/nccl_comparison.png" width="100%" alt="" /> </div> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-12-05/benchmarks/ampere/pytorch_comparison.png" width="100%" alt="" /> </div> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-12-05/benchmarks/ampere/sageattention_comparison.png" width="100%" alt="" /> </div> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-12-05/benchmarks/ampere/sgemm_comparison.png" width="100%" alt="" /> </div> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-12-05/benchmarks/ampere/sglang_comparison.png" width="100%" alt="" /> </div> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-12-05/benchmarks/ampere/tilus_comparison.png" width="100%" alt="" /> </div> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-12-05/benchmarks/ampere/tinygrad_comparison.png" width="100%" alt="" /> </div> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-12-05/benchmarks/ampere/torchao_comparison.png" width="100%" alt="" /> </div> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-12-05/benchmarks/ampere/triton_comparison.png" width="100%" alt="" /> </div> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-12-05/benchmarks/ampere/unsloth_comparison.png" width="100%" alt="" /> </div> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-12-05/benchmarks/ampere/vllm_comparison.png" width="100%" alt="" /> </div> </details> <details style="font-size: 1.2em; margin: 1em 0;"> <summary style="cursor: pointer; font-weight: bold;">Expand to see H100 benchmarks</summary> <link rel="stylesheet" href="/assets/css/fancy_table.css" /> <div id="benchmark-h100-table-wrapper" class="px-4 rounded-lg __basic-table not-prose mt-2 mb-2 overflow-x-auto"> <table id="benchmark-h100-table" 
class="min-w-full divide-y divide-gray-200 font-sans basic-table-striped"> <thead class="bg-gray-50"> <tr> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> GPU </th> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Benchmarks </th> </tr> </thead> <tbody> <tr class="border-b border-gray-200"> <td id="benchmark-h100-table-row0-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">H100 (Hopper)</span> </td> <td id="benchmark-h100-table-row0-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">bitsandbytes</span> </td> <td id="benchmark-h100-table-row0-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">cutlass</span> </td> <td id="benchmark-h100-table-row0-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">deepep</span> </td> <td id="benchmark-h100-table-row0-col4" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">deepgemm_tflops</span> </td> <td id="benchmark-h100-table-row0-col5" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">flash_attn2</span> </td> <td id="benchmark-h100-table-row0-col6" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">flash_attn3</span> </td> <td id="benchmark-h100-table-row0-col7" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">flashinfer</span> </td> </tr> </tbody> </table> </div> <div class="image-container"> <img loading="lazy" 
src="/assets/images/posts/2025-12-05/benchmarks/hopper/bitsandbytes_comparison.png" width="100%" alt="" /> </div> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-12-05/benchmarks/hopper/cutlass_comparison.png" width="100%" alt="" /> </div> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-12-05/benchmarks/hopper/deepep_comparison.png" width="100%" alt="" /> </div> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-12-05/benchmarks/hopper/deepgemm_tflops_comparison.png" width="100%" alt="" /> </div> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-12-05/benchmarks/hopper/flash_attn2_comparison.png" width="100%" alt="" /> </div> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-12-05/benchmarks/hopper/flash_attn3_comparison.png" width="100%" alt="" /> </div> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-12-05/benchmarks/hopper/flashinfer_comparison.png" width="100%" alt="" /> </div> </details> <h1 id="so-what-has-it-changed">So what has changed?</h1> <p>So, I’ve added a Godbolt reference for people to see the difference. I’m using some parts of <a href="https://github.com/siboehm/SGEMM_CUDA/blob/master/src/kernels/9_kernel_autotuned.cuh">SGEMM_CUDA</a><span class="sidenote-ref"></span><span class="sidenote">If you haven’t checked it out, it’s <a href="https://siboehm.com/articles/22/CUDA-MMM">a great blog</a> on optimizing CUDA matmul kernels by Simon Boehm</span> as a reference.</p> <p>In the NVCC compilation pipeline, CUDA is compiled to PTX, and PTX is then compiled to SASS.
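</p>

<p>To compare two builds quantitatively at the SASS level, one option is to diff the mnemonic histograms of two <code class="language-plaintext highlighter-rouge">cuobjdump -sass</code> dumps. Below is a minimal sketch; the regex and the inlined SASS lines are illustrative, not output from the actual kernels.</p>

```python
import re
from collections import Counter

# Match the opcode that follows the "/*addr*/" prefix in a `cuobjdump -sass`
# dump, skipping an optional predicate such as "@P0" or "@!P1".
MNEMONIC = re.compile(r'/\*[0-9a-f]+\*/\s+(?:@!?\w+\s+)?([A-Z][A-Z0-9.]*)')

def mnemonic_counts(sass_text):
    """Histogram of instruction mnemonics (FFMA, LDS, IMAD.MOV.U32, ...)."""
    return Counter(m.group(1) for m in MNEMONIC.finditer(sass_text))

# Illustrative snippets only; in practice, read the two dumps from files.
baseline = """
        /*0040*/    HFMA2.MMA R4, -RZ, RZ, 0, 0 ;
        /*0050*/    FFMA R5, R2, R3, R5 ;
"""
tuned = """
        /*0040*/    IMAD.MOV.U32 R4, RZ, RZ, RZ ;
        /*0050*/    FFMA R5, R2, R3, R5 ;
"""

before, after = mnemonic_counts(baseline), mnemonic_counts(tuned)
for op in sorted(set(before) | set(after)):
    delta = after[op] - before[op]
    if delta:
        print(f"{op}: {before[op]} -> {after[op]} ({delta:+d})")
```

<p>The same idea scales to full dumps, and is one way to produce an instruction-diff table like the one shown later in this post.</p>

<p>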
Let’s verify where this optimization is applied: at the PTX level, or in the SASS?</p> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-12-05/gpu_compilation.svg" width="100%" alt="" /> <div class="caption"> <em>High level compilation overview for NVIDIA GPUs </em> </div> </div> <p>First, let’s check whether the CUDA-to-PTX output has changed.</p> <div class="image-container" style="max-width: 100%; overflow: visible;"> <img loading="lazy" src="/assets/images/posts/2025-12-05/cuda_to_ptx.svg" style="width: 140%; margin-left: calc((100% - 140%) / 2);" alt="" /> <div class="caption"> <em>There's no difference in the PTX! (Image source: <a href="https://godbolt.org/z/bcfj8ovrc" rel="external nofollow noopener" target="_blank">Godbolt link</a>) </em> </div> </div> <p>Only the name has changed. The PTX instructions are identical.</p> <p>So let’s now check the SASS (<a href="https://godbolt.org/z/erc4e8M17">Godbolt link</a>):</p> <div style=" width: 90vw; margin-left: calc(50% - 45vw); margin-right: calc(50% - 45vw); "> <iframe
src="https://godbolt.org/e#z:OYLghAFBqd5QCxAYwPYBMCmBRdBLAF1QCcAaPECAMzwBtMA7AQwFtMQByARg9KtQYEAysib0QXACx8BBAKoBnTAAUAHpwAMvAFYTStJg1DIAruiakl9ZATwDKjdAGFUtEywYgATKUcAZPAZMADl3ACNMYgkAVgA2UgAHVAVCOwYXNw9vROTUgQCg0JYIqK44y0xrWwEhAiZiAgz3Tx8rTBs02vqCApDwyJj4hTqGpqzWkZ7AvuKBstiASktUE2Jkdg4AUg0AQU2vAGZA5DcsAGpNg6dTcwB9YhNBPDYAOgRL7G293f2jqiwqGcnHIACp%2BHZCIRfX4A6ZnADS2AASsFsH5bsEdgBZbAQZhsBZnUwEAwKBS3X6/fGYaGHSpKWkHWFBBHI1HozE4vGsTCE6mMxx4KjQn6HZmYM4AdR2SOUQgAkgAtbBnA5eRnioHYeXogAi8oAahAsaQzsFCVBjYT9gAhM4Qc0AWi4hIA9PbzQsRTtXe7Hf6A4Gg8GQ6Gw%2BGIxGvr6zgb2kRiHgAF6YdBnYC0VBhMSOj4KBD1VNnTNMNMISoJSIKe38NYSvy6gDiLy4XgAHK6hCChC322dAmchBChF7djHIxPJ1PpyGvgRMCwEgZ55cnIECGcbSb%2B4JN8FTevN/CD7viKgAO61RNYHYnjdny8Ea%2BYG0fL63W5YABueHWH7OX6oHgaYlugABiZ4sI2bAsBAaAMMMZxUCWBAAFRnLeRICIhyGoEwaGbneCJEfuXxnORFGUVR1E0UhKHoTsCimrh%2BHoTaTFkbRXHcTuG6BEExBIhemGHvxkQZLenE8dJlGiQwAlCeeNpEWJxAZDa1oAOyvrs5ExqBGFnLY9Cmk%2BhgKEkShpuuqBnPmhboJxvwJMQTDACwTBnI8Z60LQnG1vah6oFQgKXLqZwaJcdrBaFXh2g%2BV7AZgOwXFcYWbliUVnDFFxxelCVPklOxaTpKUUSxBCSABqXhcQmDrpELmYAQtyiMMq7wThKGSKhHwQFJVH7LEezRDaECqYpKW2tlIWEuh8K5XaqkSWc6GSJs0S6l6o2RZtUUDYxG1jctrgpWti0RXNGUXRNwkXTFG3helX4vKo%2B26RRh2jeN8niadq1nFV00ugDW43b9gl3dND2bTVAEvAAnu9ZXkV9x0Qyt53TV4V1g9Nt3nlNcUzcKsPPS8SbI5RaM/QJmOAxdBy41i4MKVDxMw09BzhS955Uxcmm6px%2Bl4WmdrGZgpq0NM9TFkwCMrAQTmHC5bkeV5DA%2BX5H0BT9G45elkUHNFIUXQVz52qu6U2vCWUG3l3NnObSUaQLpWUXVDXEE1LVtUrVwVT1fVDexR205EimWxzs2g8ErN/bQdprY9202rtXNCx91Ge4IjV1b7TDtVcnUboHvUHLiA20SHYcE1HJtUFdcf4xjrhJ4DKdHen/ObIL0J9z8Y5%2BjOI%2Bj2P/rRu6kr1AkroKCYYSOueM9O5gwB4MMkSOmEmbIAA1kWHkECwJi0EhkF2QWdVphLCiT2c4%2BP0/Ua7POi7LjSVyHluRE2vuvFHiIpKbch5JT/1AVieUIJkRAOCFAmBVdZK7klEIOQP8AEoLQeA3cIIQE4OCG%2BXYH5vy/kwP%2BQCwEzguVQOsMkEFUAsCEDBag9FV7AG3BVdCdVgD/04UZBAdVSxIkwPPWgBAOJZxklI8iJc6J4QIoxU0si%2BHsVIIg6RPETCHmXt7RSpotG7h0QkDIajJEaOkgYjcBABGYCEReeUDBp7e30YeaxgjnCuAcU4hIJVlYHFVu5Ty3lXDazKrrSxZx0CoA3IbLKUSYlXCPFlW0tp4m%2BMkc5VygSNZawGuEw8xA4ZG3iqlJwUpIHQKRMkvKcViCjhRhRTJasgmaxCYg/Ju48BFKyl01cZxcHVNfHFPA9TaLcMyqNQp6FcE3UenDNG8TQYs2mkYxSA
NgFm3WagvGeUzHmNom42x6BFJeJXtM5ZxM8CPWRgNJp2Tgm%2BTySQe0ETkDdONkSUpUo4GVMGbaZAozBoqyyerB5oTqIdL4u8paXyQQEI%2BSk4ZgLs5rwIaNN50zm6XLmdbO%2Bo1FlsSxXaIxGR1lEs%2BehTBf9FrqP2VRQ5pYMinO9gDOFsy9rG29JRO5oLWmPMkZCp20KhV9OAfAqpCKak2jqYgnlLTcl7MFW82JHzlWJLAeKv5cUAW0rlTktpezyKCt6Y7Yp/ZYWZUlUMm0IzaXkT1WCu1SFnkQAidoYV7q%2Blwq1TabQyKuIMuOSI0%2B4iw5TP6RcpaV0IAasqay80F0MX9PJdoHFDtM4NOkuMo64aZn4zmVw1FR0k1sumqmjl7sBYZt7tWoeD9n4NsbbmOtWImADiMY6YygRgBnAPsQIIZ8IBH2QOWasChgALhYLlLwfaB1nC4JFLwXgsIMBoD2gKCgWC3AAJwaCYPU8cTaj3jznAuJc%2BFP5rl3Og7%2B2CNw2yAXgjcYDYHiqIrg99pFDXkUPMEOQWJbgggABJImwDsXUUIK7vk/JgH8f5bhnA/LWdYgRpZBHIUBNM47J23CMV2owtxnhLjgthUurCPx1WGImGw/5JLfrpQxmiyjyP3BEYVGjCHlJOsYzxuRrFEOsao7%2BFqCGnCmMzbxyTP7dxPrNEReE6SyqyNeWs9KO8aF73lOgVQiN%2BbKcPMgUland6ae069G5H19OGJnlp1QcNA22demcKeMo5RKk%2BJyyzpGvLaJnkZx2RjbO5WiPaalU9zR6e8xE1ZF44aBe0850Lcdwtek898JTpHMCqBcj55BFTkRwxjSzSlCb3QxtcwqZUrLivJrJeK1LlaS5ZZyxEqlyzHYbKnvliVjXMvZcKa17ZzcOvJe%2BZqtLnErNWOAsyhI9mbGlkc3Z/YIXpSykqx53rCENwRMDUyxxK90q2HQLN4L9oqWjbhQ1ybUXXELeOfYg7LKjszae3NsrF3EtXYs2VD89lr7/gqhhPFNowZsVthWzif2r6pkByhTcIObax2uRNrz23ct8QhpNeb7iluJYgEj90khrto8QhEk6tApqOwc2Z5bXgQsE4WkTknGXtvNYGwUi8iUbyFb/QB4DoHwNCABsTxLNsfsyNu50rHF5LbU/u3jsrYXAYs8l%2Bj8nrdE446OXjlbSXEvE8i2z/rGOnZc8KlgOX4U%2BeAZA2BiD%2BPleG9R2E%2BHgbhGiNDaNMVcbzl1d98mnFjte5pxrfzIH2bvfdeq0Hp62lIp9zS0a%2BH3C0U2ljQVzFse3YJ6Fi7tXXV5EMR3nveZF04JrLYliK6tsk8rsL/xm0Je4b1yJKSwl4fWFOGb%2BlMp00K%2BxarxdGL551nV4BuSwzrhY7D789PkrEvnWFNdYeMIpeVV2jX18lfggFi183%2Bv9NR5FOUVAvQqCMFVzoL/qaB9ZvHzPkws7S3fUdgl9v%2B/4icnge34kRJqTPGBMIkmudG/%2BABDGdcJ4dMbcquFEf2CMDANg92CgEAsB5E1CtCCg5%2BTCC4V%2B24N%2BgC5SpoL65Sb6Y2lSxB2y24F2poH6geFc/U364ypoqepk92HuIaTEP%2BCOpoI%2BvBc%2BtA4mBy7Bj23ibB7i%2B23iaBGEPeR%2B4ude5ETetAh%2Br4jsSO6E8Kla5E8BiBgaKBsBYeg8ZUDqfK4KS%2BLyBSwqhSoq3WPqMqGSwKzS%2Bq/KDSSqwqaqZSGePWNK2qJ%2BVEfCLgJgCS4U3eyhmylKVBE%2BiaWyWC/M3Kjh9yph7SLq5OwqJqZSAyVqtotq36JhCqYBgqnqpqWURRGRmhZwnqR%2BouIeTqgc1U%2BU9Uuc3s%2BcrUhc/sTgZcwc9GQKsQgRSs307uohZyEaN0TcNK3R3Ee2nib28aF05aW0XcKOWhTGUufECW6UEAuaka/Y0
aXh8a1oxMJaKacRNEL0dOqh4U7uwaYiIOwE5xEUSxTqL0CMNK8u7iHBNxR0dxF0XAjx3RL0SYrxlx7B1xXuNq6xxM6okO/xLwo%2BKSbxRyHxYJ3x00BwfxYB5EOc84zRzUrRRcHR3U5clcExjSXgvRisYcgx54p2fu%2BMYxKSJJwhEh0x3isxZancO02eX4Jx3KA8DShhewfJ98LgiB%2BcEogQwwhgtg%2BEaQ/YWATwogZ8RAdkE6LAU6s6lQ86Gg9oOwsQGg%2BproQGC62pCQ%2BECA9SWW2JDA06Tg%2Bwy6H4GYWYYg/4H4BgjwI6twYQKwDA6A5ItwEArYbYhIFCaYiIKIaIGI2IuI2GaptwGptAtwC6CwjBDSzGReGESi3mKizEXeREsmd6CIfhMZW6eGdA3ahG78q4gZpo1Z868QZwsQ0gDZTZTZbYpoTZgZr%2Bt%2BpoYmZw24/8CmNygsHASwtAnA0QvAngHAWgpAqAnAwIuoKUtpUqdkKwdYuUBwPApABAmgI5Swe83gXALwBwkgsQsQ0QXAmk%2Bp26mksQBwmkmk%2BgnAkgk5u5s5nAvACgIAGg25u5SwcAsASA%2BAIU5AlA/AggIgYg7AUgMgggigKg6g05vAtACAX5PgKFCgwFVABACMlYIAj5GFlprkqACQ1QU5Q4kI9oaAi4QRZCu6FgdkW69FCwvAxAqFIABwpAbFmFQo2FuF7Aj53FRFTAJFZFnAFFwuJGNF84O6e6pom6sl%2B6I5T5HAE5pAU5M5c5HAuovFsYeAmA54kQZoBoTgZSXABwLwGgLYD89QI6YUCl9FD8aGjR/AD8CgawjogQjopFdmD8AAGj5YXLlLEAAI4mDRJRSOgADyaJZJYVEVHyX41YwQJlZlFlVlXANlaw7w3MDle6Tl0wgQrljo7lyAnlDA3lBAvljoAVVVQVQ08V/sNo0VsVoV4VTVD8uowIYIw4vAO5SFCwSw5YpYAwqBpAB5mk0QLwkgmku6hwZ5kgkgJ50gY5HAL56lb5Wln535v5A1pAAFUAMAiAIAJC6woFUlCQdAkQwQPInAKVpl866VLYvAp1GwygIIflVFDCCQtFilLFvg%2BACYwEeg4Fwgip0F0goN8Fagb5ugnFbQHQ9gEAjgYwngXAvgPpvQRQJQeglkeQ6QrgzQuNuQZFWN/QpQFQVQnQkwqNegCNZFXQDQZNswFNkpowhNWQ6NbNUwhQ5NEgSwT4mA9U6A35o545r5SF752lsGpCelBlRl91aVll1lEAuAhAzyvw6NQI31V11hhwTMfVf5%2B5IA0QmkLwbYkgN5sQbYbYepbYmkttGg0QKl61LA3gztGlvAW1lgO1/VWg/5R1EAQFMtZ1FAF1utN1bAd1qVj1ytW5r1nA71n1F1v1zFvAqY6t14INsgkF4gMFUNSgMNktugrQlQ8YaQDgPptN6N/g0w2NcwOQKQZF1djd%2BNzNONXNZdiNDAjNjQHNaNlN5dNQkw7dDd3NLd3No9pQAtdUwtotKlalntUtuoIdEoBo%2BlhlhSitsdGV9oatCYG5WtLgi4utG5OMhtA1Q1Ryo1Yta1vAbt9OG1kt3tX5P5ftylB1QdKAOt9AZAYd1Fl1v9IADAX4yAyA5lGgJg6NNAYiVYlAYQb5YQgQ9QCMnAW5SDzAxACMUVYQ2g8YaDvA1FbAggUVDAtAqDktWAYQJgwATgYgtAX53AL1C4hgGYGwM5%2BAdUHQX4Iib5WW7QtFBD5Aucq1M50sYQrkWDLgWAb5hUbtTDpAPDxAXpSgK9HkRgaGoAe1yEbkCg69BlUVlYU5W5oNudENsF8ghdiFM5Jd%2BgrDKAZgFg4jX5kASwolaQjDjotlOVuoeVnkjozlRVtkJVHlXlPl5ENVgV1YdpLVdpXtSjz4LjY1pphQt1HA295lcd9oDAtkxIpICg/19NFdyNVd/degtdvNLNxNTdaQLdeNpNddfNn
dVNw93QE9XdDNI9jTVTXNNNZTvT3QU9/Nywqw6wwzq1i9m10dD1mTu9eIuTQR%2BT/179g1pAw1WAUQY1q1rt7tT9mlH5Ptb9f5%2B1gdSAADut515zQDIDYDEDUDfAdA2JX5EACDktGDKDQj7zWDODeDNgQjRDjABApD5Db5VDNDdDvkjD8dLDGj7DL1eAXDtgPDjDM5/DxIGwW5DUojyFeAEjKD0jcL25iY8jW5SjKjmAajrDmjxzOjwAejG9hjjAQjpj4NEgkNsg0N1jOgHFdjRgDj5g%2BguLSTbjpFHjnAXj2V9lTF%2BVAThVq6wTpV5VlV1VtVqg9VS6sTS6nV3V4IkI8TkQiT8ASwKTIQaTxlMzT1mVcEizhcBTn5HTxTKN/TGN6AQz6N9TtTzrHr%2BQ3THdg93dvd7TLTPdXTlTfr49zrk9vrcwSwCga5YzLoC9Et%2BzHA5rStczeTtryzRtaz19mzt9Ozj9S9L9vtObB5Fll5S1Xgp5J50QbY26S60Qztq1BwybXtBzKzt9XgbbUtF9/tSwSjKQ9gkgQAA%3D%3D%3D" style=" width: 100%; height: 800px; border: 0; display: block; " loading="lazy"> </iframe> </div> <p>Clearly something has changed!</p> <p>Two common changes we can see are:</p> <!-- https://godbolt.org/z/7TKvhv4Gj --> <div class="image-container" style="max-width: 100%; overflow: visible;"> <img loading="lazy" src="/assets/images/posts/2025-12-05/instruction_selection.svg" style="width: 160%; margin-left: calc((100% - 160%) / 2);" alt="" /> <div class="caption"> <em>The optimization now uses IMAD instead of HFMA2.MMA to move constants </em> </div> </div> <p>We can see that <code class="language-plaintext highlighter-rouge">IMAD</code> is used instead of <code class="language-plaintext highlighter-rouge">HFMA2.MMA</code> for moving constants, which is neat!<span class="sidenote-ref"></span><span class="sidenote">By using <code class="language-plaintext highlighter-rouge">IMAD</code>, we can use the <code class="language-plaintext highlighter-rouge">FP32</code> units. 
Refer to <a href="#h100-sm-diagram">H100 SM Diagram</a></span>.</p> <div class="image-container" style="max-width: 100%; overflow: visible;"> <img loading="lazy" src="/assets/images/posts/2025-12-05/instruction_reordering.svg" style="width: 140%; margin-left: calc((100% - 140%) / 2);" alt="" /> <div class="caption"> <em>LDS and FFMA instructions are now interleaved </em> </div> </div> <p>We can see that <code class="language-plaintext highlighter-rouge">LDS</code> instructions are interleaved with the math instead of being stacked together<span class="sidenote-ref"></span><span class="sidenote">This should increase instruction-level parallelism</span>.</p> <p>One thing that the disassembly doesn’t show is register pressure. This optimization may increase it:</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cuobjdump <span class="nt">--dump-resource-usage</span> baseline.cubin Resource usage: Common: GLOBAL:0 Function sgemm_kernel_10: REG:188 STACK:0 SHARED:17408 LOCAL:0 CONSTANT[0]:564 TEXTURE:0 SURFACE:0 SAMPLER:0 cuobjdump <span class="nt">--dump-resource-usage</span> cutlass.cubin Resource usage: Common: GLOBAL:0 Function cutlass_sgemm_kernel_9: REG:214 STACK:0 SHARED:17408 LOCAL:0 CONSTANT[0]:564 TEXTURE:0 SURFACE:0 SAMPLER:0 </code></pre></div></div> <p>Register usage increased from <code class="language-plaintext highlighter-rouge">188</code> to <code class="language-plaintext highlighter-rouge">214</code>, roughly a <code class="language-plaintext highlighter-rouge">14%</code> increase. This isn’t always the case, though: I’ve seen other examples leave register pressure unchanged, or even decrease it.</p> <p>Below is a table of the instructions that changed for this kernel:</p> <link rel="stylesheet" href="/assets/css/fancy_table.css" /> <script> function toggleCalculations(tableId) { // ...
(toggleCalculations function remains the same) const table = document.getElementById(tableId); const cells = table.querySelectorAll('.has-calculation'); const tableWrapper = document.getElementById(tableId + '-wrapper'); const toggleSwitch = document.getElementById(tableId + '-switch'); const toggleText = document.getElementById(tableId + '-toggle-text'); if (tableWrapper.classList.contains('show-calculations')) { tableWrapper.classList.remove('show-calculations'); toggleSwitch.classList.remove('active'); toggleText.textContent = "Show calculations"; cells.forEach(cell => { const normalText = cell.querySelector('.normal-text'); const calculationText = cell.querySelector('.calculation-text'); normalText.style.display = 'inline'; calculationText.style.display = 'none'; }); } else { tableWrapper.classList.add('show-calculations'); toggleSwitch.classList.add('active'); toggleText.textContent = "Hide calculations"; cells.forEach(cell => { const normalText = cell.querySelector('.normal-text'); const calculationText = cell.querySelector('.calculation-text'); normalText.style.display = 'none'; calculationText.style.display = 'inline'; }); } } function isConsideredYellow(colorString) { if (!colorString) return false; const lowerColor = colorString.toLowerCase(); const yellowKeywords = ['yellow', 'gold', 'lemon', 'chiffon', 'goldenrod', 'papayawhip', 'moccasin', 'khaki', '#ff0', '#ffff00', '#ffffe0', '#fffacd', '#fafad2', '#fff8dc', '#eee8aa', '#f0e68c']; if (yellowKeywords.some(k => lowerColor.includes(k))) { return true; } const match = lowerColor.match(/rgba?\((\d+),\s*(\d+),\s*(\d+)/); if (match) { const r = parseInt(match[1]); const g = parseInt(match[2]); const b = parseInt(match[3]); if (r > 200 && g > 180 && b < 200 && Math.abs(r - g) < 70) { return true; } } if (lowerColor === '#ffdab9') return true; // PeachPuff if (lowerColor === 'rgba(255,255,0,0.2)') return true; // Example: semi-transparent yellow return false; } function initializeCellHighlighters() { const 
highlighters = document.querySelectorAll('[data-highlight-cells]'); highlighters.forEach(highlighter => { if (highlighter.dataset.highlighterInitialized === 'true') return; highlighter.dataset.highlighterInitialized = 'true'; const cellIdsToHighlight = highlighter.dataset.highlightCells.split(',').map(id => id.trim()); const cellElements = cellIdsToHighlight.map(id => document.getElementById(id)).filter(el => el); let descriptiveSpans = []; if (highlighter.matches('span[data-hover-text-color]')) { descriptiveSpans.push(highlighter); } else { descriptiveSpans = Array.from(highlighter.querySelectorAll('span[data-hover-text-color]')); } descriptiveSpans.forEach(span => { if (typeof span.dataset.originalParaTextColor === 'undefined') { span.dataset.originalParaTextColor = window.getComputedStyle(span).color || ''; } }); highlighter.addEventListener('mouseenter', () => { highlighter.classList.add('trigger-text-active'); descriptiveSpans.forEach(span => { if (span.dataset.hoverTextColor) { span.style.color = span.dataset.hoverTextColor; } }); cellElements.forEach((cell, idx) => { cell.classList.add('cell-highlighted'); let hoverTextColorForCell = null; let hoverCellBgFromDescSpan = null; if (idx < descriptiveSpans.length) { const descSpan = descriptiveSpans[idx]; if (descSpan.dataset.hoverTextColor) { hoverTextColorForCell = descSpan.dataset.hoverTextColor; } if (descSpan.dataset.hoverCellBg) { hoverCellBgFromDescSpan = descSpan.dataset.hoverCellBg; } } else if (descriptiveSpans.length === 1 && cellElements.length > 1) { const descSpan = descriptiveSpans[0]; if (descSpan.dataset.hoverTextColor) { hoverTextColorForCell = descSpan.dataset.hoverTextColor; } if (descSpan.dataset.hoverCellBg) { hoverCellBgFromDescSpan = descSpan.dataset.hoverCellBg; } } if (hoverTextColorForCell) { const isCalcCell = cell.classList.contains('has-calculation'); if (isCalcCell) { const normalTextSpan = cell.querySelector('.normal-text'); const calcResultSpan = 
cell.querySelector('.calc-result'); const calcWrapper = cell.querySelector('.calculation-text'); if (calcWrapper && window.getComputedStyle(calcWrapper).display !== 'none') { if (calcResultSpan) calcResultSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } else { if (normalTextSpan) normalTextSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } } else { const directSpan = cell.querySelector(':scope > span'); if (directSpan) directSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } } if (hoverCellBgFromDescSpan && isConsideredYellow(hoverCellBgFromDescSpan)) { if (typeof cell.dataset.originalBgColor === 'undefined') { cell.dataset.originalBgColor = cell.style.backgroundColor; } cell.style.backgroundColor = hoverCellBgFromDescSpan; cell.dataset.bgChangedByScript = 'true'; } }); }); highlighter.addEventListener('mouseleave', () => { highlighter.classList.remove('trigger-text-active'); descriptiveSpans.forEach(span => { if (typeof span.dataset.originalParaTextColor !== 'undefined') { span.style.color = span.dataset.originalParaTextColor; } }); cellElements.forEach((cell, idx) => { cell.classList.remove('cell-highlighted'); const isCalcCell = cell.classList.contains('has-calculation'); if (isCalcCell) { const normalTextSpan = cell.querySelector('.normal-text'); const calcResultSpan = cell.querySelector('.calc-result'); if (normalTextSpan) normalTextSpan.style.removeProperty('color'); if (calcResultSpan) calcResultSpan.style.removeProperty('color'); } else { const directSpan = cell.querySelector(':scope > span'); if (directSpan) directSpan.style.removeProperty('color'); } if (cell.dataset.bgChangedByScript === 'true') { cell.style.backgroundColor = cell.dataset.originalBgColor || ''; delete cell.dataset.originalBgColor; delete cell.dataset.bgChangedByScript; } }); }); }); } if (document.readyState === 'loading') { document.addEventListener('DOMContentLoaded', initializeCellHighlighters); } else { 
initializeCellHighlighters(); } </script> <div id="sass-diff-table-wrapper" class="px-4 rounded-lg __basic-table not-prose mt-2 mb-2 overflow-x-auto"> <table id="sass-diff-table" class="min-w-full
divide-y divide-gray-200 font-sans basic-table-striped"> <thead class="bg-gray-50"> <tr> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Mnemonic </th> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Baseline </th> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> CUTLASS </th> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Δ </th> </tr> </thead> <tbody> <tr class="border-b border-gray-200"> <td id="sass-diff-table-row0-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">IMAD.MOV.U32</span> </td> <td id="sass-diff-table-row0-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">0</span> </td> <td id="sass-diff-table-row0-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">37</span> </td> <td id="sass-diff-table-row0-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">+37</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="sass-diff-table-row1-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">HFMA2.MMA</span> </td> <td id="sass-diff-table-row1-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">5</span> </td> <td id="sass-diff-table-row1-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">0</span> </td> <td id="sass-diff-table-row1-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">-5</span> </td> </tr> <tr class="border-b 
border-gray-200"> <td id="sass-diff-table-row2-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">LEA</span> </td> <td id="sass-diff-table-row2-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">15</span> </td> <td id="sass-diff-table-row2-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">2</span> </td> <td id="sass-diff-table-row2-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">-13</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="sass-diff-table-row3-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">IMAD.SHL.U32</span> </td> <td id="sass-diff-table-row3-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">0</span> </td> <td id="sass-diff-table-row3-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">10</span> </td> <td id="sass-diff-table-row3-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">+10</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="sass-diff-table-row4-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">CS2R</span> </td> <td id="sass-diff-table-row4-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">75</span> </td> <td id="sass-diff-table-row4-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">64</span> </td> 
<td id="sass-diff-table-row4-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">-11</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="sass-diff-table-row5-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">MOV</span> </td> <td id="sass-diff-table-row5-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">8</span> </td> <td id="sass-diff-table-row5-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">0</span> </td> <td id="sass-diff-table-row5-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">-8</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="sass-diff-table-row6-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">IMAD</span> </td> <td id="sass-diff-table-row6-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">0</span> </td> <td id="sass-diff-table-row6-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">8</span> </td> <td id="sass-diff-table-row6-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">+8</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="sass-diff-table-row7-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">ULDC.64</span> </td> <td id="sass-diff-table-row7-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) 
!important;">4</span> </td> <td id="sass-diff-table-row7-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">1</span> </td> <td id="sass-diff-table-row7-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">-3</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="sass-diff-table-row8-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">FFMA</span> </td> <td id="sass-diff-table-row8-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">787</span> </td> <td id="sass-diff-table-row8-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">801</span> </td> <td id="sass-diff-table-row8-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">+14</span> </td> </tr> </tbody> </table> </div> <h1 id="so-what-is-it-doing">So… what is it doing?</h1> <p>So far, we’ve dug into specifics. At a higher level, the optimization most likely does the following:</p> <ul> <li>Instruction selection - use FP32 units for moving constants<span class="sidenote-ref"></span><span class="sidenote">Moving constants from registers isn’t in the hot path, but it’s a simple example to see!</span> into registers<span class="sidenote-ref"></span><span class="sidenote">But wait there’s more!
I didn’t show it in this blog in detail, but you can see some IMADs replacing instructions</span></li> <li>Instruction reordering - mix memory loads with math</li> <li>Influence register pressure - may increase the number of registers used to achieve the reordering</li> </ul> <div class="language-md highlighter-rouge"><div class="highlight"><pre class="highlight"><code>When ptxas sees matrix operations (MAD/MMA): Instruction selection: HFMA2.MMA,MOV -&gt; IMAD Instruction reordering: LDS spread across FFMA As a side effect: May increase register pressure </code></pre></div></div> <h1 id="when-should-you-apply-this-optimization">When should you apply this optimization?</h1> <p>With kernel writing, it’s tricky to say definitively when you should and shouldn’t use this optimization. The optimization seems to increase ILP at the cost of register pressure<span class="sidenote-ref"></span><span class="sidenote">Won’t increase register pressure in some cases!</span>. Always benchmark to ensure the performance is good<span class="sidenote-ref"></span><span class="sidenote">I’ve seen the optimization not affect performance on some cards while affecting others significantly</span>.</p> <h1 id="how-to-apply-this-to-triton">How to apply this to Triton</h1> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="n">torch</span> <span class="kn">import</span> <span class="n">triton</span> <span class="kn">import</span> <span class="n">triton.language</span> <span class="k">as</span> <span class="n">tl</span> <span class="k">def</span> <span class="nf">rename_kernel</span><span class="p">(</span><span class="n">proxy</span><span class="p">):</span> <span class="k">return</span> <span class="sh">"</span><span class="s">cutlass_kernel</span><span class="sh">"</span> <span class="c1"># will convert "my_kernel" -&gt; cutlass_kernel </span><span class="nd">@triton.jit</span><span class="p">(</span><span
class="nb">repr</span><span class="o">=</span><span class="n">rename_kernel</span><span class="p">)</span> <span class="k">def</span> <span class="nf">my_kernel</span><span class="p">(</span><span class="n">M</span><span class="p">:</span> <span class="n">tl</span><span class="p">.</span><span class="n">constexpr</span><span class="p">):</span> <span class="k">pass</span> <span class="c1"># compile and extract ptx </span><span class="n">my_kernel</span><span class="p">[(</span><span class="mi">1</span><span class="p">,)](</span><span class="n">M</span><span class="o">=</span><span class="mi">32</span><span class="p">)</span> <span class="n">dev</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">cuda</span><span class="p">.</span><span class="nf">current_device</span><span class="p">()</span> <span class="n">kernel_cache</span> <span class="o">=</span> <span class="n">my_kernel</span><span class="p">.</span><span class="n">device_caches</span><span class="p">[</span><span class="n">dev</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span> <span class="n">compiled</span> <span class="o">=</span> <span class="nf">next</span><span class="p">(</span><span class="nf">iter</span><span class="p">(</span><span class="n">kernel_cache</span><span class="p">.</span><span class="nf">values</span><span class="p">()))</span> <span class="n">ptx</span> <span class="o">=</span> <span class="n">compiled</span><span class="p">.</span><span class="n">asm</span><span class="p">[</span><span class="sh">"</span><span class="s">ptx</span><span class="sh">"</span><span class="p">]</span> <span class="c1"># print the kernel name from PTX </span><span class="nf">print</span><span class="p">(</span><span class="sh">'</span><span class="se">\n</span><span class="sh">'</span><span class="p">.</span><span class="nf">join</span><span class="p">(</span><span class="n">ptx</span><span class="p">.</span><span 
class="nf">splitlines</span><span class="p">()[:</span><span class="mi">20</span><span class="p">]))</span> </code></pre></div></div> <p>It will show</p> <div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">//</span> <span class="c1">// Generated by LLVM NVPTX Back-End</span> <span class="c1">//</span> <span class="p">.</span><span class="n">version</span> <span class="mi">8</span><span class="p">.</span><span class="mi">7</span> <span class="p">.</span><span class="n">target</span> <span class="n">sm_86</span> <span class="p">.</span><span class="n">address_size</span> <span class="mi">64</span> <span class="c1">// .globl cutlass_kernel // -- Begin function cutlass_kernel</span> <span class="c1">// @cutlass_kernel</span> <span class="p">.</span><span class="n">visible</span> <span class="p">.</span><span class="n">entry</span> <span class="n">cutlass_kernel</span><span class="p">(</span> <span class="p">.</span><span class="n">param</span> <span class="p">.</span><span class="n">u64</span> <span class="p">.</span><span class="n">ptr</span> <span class="p">.</span><span class="n">global</span> <span class="p">.</span><span class="n">align</span> <span class="mi">1</span> <span class="n">cutlass_kernel_param_0</span><span class="p">,</span> <span class="p">.</span><span class="n">param</span> <span class="p">.</span><span class="n">u64</span> <span class="p">.</span><span class="n">ptr</span> <span class="p">.</span><span class="n">global</span> <span class="p">.</span><span class="n">align</span> <span class="mi">1</span> <span class="n">cutlass_kernel_param_1</span> <span class="p">)</span> </code></pre></div></div> <h1 id="how-to-apply-this-to-ptxas">How to apply this to ptxas</h1> <p>A universal patch to ptxas (which most frameworks invoke) is to just replace <code class="language-plaintext highlighter-rouge">cutlass</code> in the binary with something else.</p> <p>Here’s how I do it:</p> <div 
class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">input_path</span> <span class="o">=</span> <span class="sh">"</span><span class="s">/usr/local/cuda/bin/ptxas</span><span class="sh">"</span> <span class="n">output_path</span> <span class="o">=</span> <span class="sh">"</span><span class="s">ptxas_no_cutlass</span><span class="sh">"</span> <span class="k">with</span> <span class="nf">open</span><span class="p">(</span><span class="n">input_path</span><span class="p">,</span> <span class="sh">"</span><span class="s">rb</span><span class="sh">"</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span> <span class="n">blob</span> <span class="o">=</span> <span class="nf">bytearray</span><span class="p">(</span><span class="n">f</span><span class="p">.</span><span class="nf">read</span><span class="p">())</span> <span class="c1"># We expect exactly "cutlass" inside ptxas. </span><span class="n">target</span> <span class="o">=</span> <span class="sa">b</span><span class="sh">"</span><span class="s">cutlass</span><span class="sh">"</span> <span class="n">off</span> <span class="o">=</span> <span class="n">blob</span><span class="p">.</span><span class="nf">find</span><span class="p">(</span><span class="n">target</span><span class="p">)</span> <span class="k">assert</span> <span class="n">off</span> <span class="o">!=</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="sh">"</span><span class="s">ptxas did not contain the cutlass marker!</span><span class="sh">"</span> <span class="c1"># Overwrite "cutlass" with 0xFF bytes so any substring search for the name fails: kernel names are ASCII, so 0xFF can never appear in them </span><span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nf">range</span><span class="p">(</span><span class="nf">len</span><span class="p">(</span><span class="n">target</span><span
class="p">)):</span> <span class="n">blob</span><span class="p">[</span><span class="n">off</span> <span class="o">+</span> <span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="mh">0xFF</span> <span class="k">with</span> <span class="nf">open</span><span class="p">(</span><span class="n">output_path</span><span class="p">,</span> <span class="sh">"</span><span class="s">wb</span><span class="sh">"</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span> <span class="n">f</span><span class="p">.</span><span class="nf">write</span><span class="p">(</span><span class="n">blob</span><span class="p">)</span> <span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">patched </span><span class="sh">'</span><span class="si">{</span><span class="n">target</span><span class="p">.</span><span class="nf">decode</span><span class="p">()</span><span class="si">}</span><span class="sh">'</span><span class="s"> at offset </span><span class="si">{</span><span class="n">off</span><span class="si">:</span><span class="c1">#x</span><span class="si">}</span><span class="sh">"</span><span class="p">)</span> </code></pre></div></div> <h1 id="resolving-public-statements">Resolving Public Statements</h1> <p>In my opinion, there are a lot of assumptions being thrown around on the internet about this optimization.
I want to clear some of that up.</p> <p>At the top of the <a href="https://news.ycombinator.com/item?id=45458948">Hacker News post</a>, there is a link to a response from a user about this optimization.</p> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-12-05/unstable.png" width="100%" alt="" /> </div> <p>This statement is incorrect; I have compiled many real-world projects with this optimization on and off, and they ran without failing (passing output asserts) on different cards.</p> <p>There is also <a href="https://www.reddit.com/r/programming/comments/1nx3g70/fp8_runs_100_tflops_faster_when_the_kernel_name/">a highly upvoted Reddit comment</a>:</p> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-12-05/reddit.png" width="100%" alt="" /> </div> <p>This explanation is hard to follow. I’m guessing the user is claiming that this trick uses NaNs/zeroes to optimize the program. It doesn’t. In fact, it optimizes how registers are moved.</p> <h1 id="previous-mentions">Previous mentions</h1> <p>This was also mentioned before by <a href="https://forums.developer.nvidia.com/t/how-does-bar-sync-defer-blocking-get-generated/245747">grynet on the NVIDIA forums</a>, who reported that the following two kernels would generate different SASS:</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">__global__</span> <span class="kt">void</span> <span class="nf">mykernel</span><span class="p">(</span><span class="kt">float</span> <span class="o">*</span><span class="n">lhs</span><span class="p">,</span> <span class="kt">float</span> <span class="o">*</span><span class="n">rhs</span><span class="p">,</span> <span class="kt">float</span> <span class="o">*</span><span class="n">res</span><span class="p">,</span> <span class="kt">int</span> <span class="n">M</span><span class="p">,</span> <span class="kt">int</span> <span class="n">N</span><span
class="p">,</span> <span class="kt">int</span> <span class="n">K</span><span class="p">)</span> <span class="p">{</span> <span class="n">cutlass</span><span class="o">::</span><span class="n">gemm</span><span class="o">::</span><span class="n">GemmCoord</span> <span class="n">problem_size</span><span class="p">(</span><span class="n">M</span><span class="p">,</span><span class="n">N</span><span class="p">,</span><span class="n">K</span><span class="p">);</span> <span class="n">compute_gemm_with_cutlass</span><span class="p">(</span><span class="n">lhs</span><span class="p">,</span> <span class="n">rhs</span><span class="p">,</span> <span class="n">res</span><span class="p">,</span> <span class="n">problem_size</span><span class="p">);</span> <span class="p">}</span> </code></pre></div></div> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">__global__</span> <span class="kt">void</span> <span class="nf">mykernel</span><span class="p">(</span><span class="kt">float</span> <span class="o">*</span><span class="n">lhs</span><span class="p">,</span> <span class="kt">float</span> <span class="o">*</span><span class="n">rhs</span><span class="p">,</span> <span class="kt">float</span> <span class="o">*</span><span class="n">res</span><span class="p">,</span> <span class="kt">int</span> <span class="n">M</span><span class="p">,</span> <span class="kt">int</span> <span class="n">N</span><span class="p">,</span> <span class="kt">int</span> <span class="n">K</span><span class="p">,</span> <span class="n">cutlass</span><span class="o">::</span><span class="n">gemm</span><span class="o">::</span><span class="n">GemmCoord</span> <span class="n">dummy</span><span class="p">)</span> <span class="p">{</span> <span class="n">cutlass</span><span class="o">::</span><span class="n">gemm</span><span class="o">::</span><span class="n">GemmCoord</span> <span class="n">problem_size</span><span class="p">(</span><span 
class="n">M</span><span class="p">,</span><span class="n">N</span><span class="p">,</span><span class="n">K</span><span class="p">);</span> <span class="n">compute_gemm_with_cutlass</span><span class="p">(</span><span class="n">lhs</span><span class="p">,</span> <span class="n">rhs</span><span class="p">,</span> <span class="n">res</span><span class="p">,</span> <span class="n">problem_size</span><span class="p">);</span> <span class="p">}</span> </code></pre></div></div> <p>and <code class="language-plaintext highlighter-rouge">BAR.SYNC.DEFER_BLOCKING</code> would be generated here instead of <code class="language-plaintext highlighter-rouge">BAR.SYNC</code> (due to cutlass being added as part of the function signature)</p> <p>Perhaps this was also a part of the optimization in previous versions of <code class="language-plaintext highlighter-rouge">ptxas</code>?</p> <h1 id="takeaway">Takeaway</h1> <p>So, adding “cutlass” to your kernel name can give you 100+ TFLOPS or -20% FLOPS.</p> <p>The issue is twofold: <code class="language-plaintext highlighter-rouge">ptxas</code> is a black box and <code class="language-plaintext highlighter-rouge">sass</code> is undocumented. That is unlike other ecosystems: LLVM lets you see the passes it runs, and the x86/ARM instruction sets are documented.</p> <p>This optimization helps some kernels, hurts others, or changes little at all. It depends entirely on your architecture and your specific code. What flies on an H100 might tank on a 5090 or B200, and you have no way to know until you run it.</p> <p>So what do you do? Benchmark it. Change the ordering in Triton/CUDA, see if the PTX changes, check the SASS output. That’s the only way to know what <code class="language-plaintext highlighter-rouge">ptxas</code> actually did.</p> <p>And this isn’t going away. <code class="language-plaintext highlighter-rouge">tileiras</code> (the new TileIR compiler) is also a black box.
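</p>

<p>The “check the SASS output” step above can be sketched as a plain text diff. This is a hedged sketch under my own assumptions (the <code class="language-plaintext highlighter-rouge">diff_sass</code> helper and the toy instruction strings are mine; in practice both inputs would come from real dumps, e.g. via <code class="language-plaintext highlighter-rouge">cuobjdump --dump-sass</code>):</p>

```python
import difflib

def diff_sass(sass_a: str, sass_b: str) -> str:
    """Unified diff of two SASS dumps, skipping blank lines."""
    a = [line for line in sass_a.splitlines() if line.strip()]
    b = [line for line in sass_b.splitlines() if line.strip()]
    return "\n".join(
        difflib.unified_diff(a, b, fromfile="baseline", tofile="cutlass", lineterm="")
    )

# Toy stand-ins for two real dumps of the same kernel under different names:
print(diff_sass("BAR.SYNC\nHMMA.884", "BAR.SYNC.DEFER_BLOCKING\nHMMA.884"))
```

<p>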
We may expect similar surprises like this moving forward.</p> <h1 id="appendix">Appendix</h1> <h2 id="nvidia-toolchain-background">NVIDIA toolchain background</h2> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-12-05/gpu_compilation.svg" width="100%" alt="" /> <div class="caption"> <em>High level compilation overview for NVIDIA GPUs </em> </div> </div> <p>NVIDIA’s toolchain works like this: <code class="language-plaintext highlighter-rouge">CUDA code</code> is compiled by <em>nvcc</em> into <code class="language-plaintext highlighter-rouge">PTX</code>, an intermediate representation. Then <em>ptxas</em> takes that <code class="language-plaintext highlighter-rouge">PTX</code> and turns it into <code class="language-plaintext highlighter-rouge">SASS</code>, the low-level instruction set the GPU runs<span class="sidenote-ref"></span><span class="sidenote">ptxas and sass are both undocumented, so it may be a bit difficult to understand what’s going on</span>.</p> <h2 id="h100-sm-diagram">H100 SM Diagram</h2> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-12-05/gh100.png" width="50%" alt="" /> <div class="caption"> <em>H100 SM Diagram (Image source: <a href="https://resources.nvidia.com/en-us-hopper-architecture/nvidia-h100-tensor-c" rel="external nofollow noopener" target="_blank">NVIDIA H100 GPU Whitepaper</a>) </em> </div> </div> <h2 id="changes">Changes</h2> <p>[12/16/2026] Thanks to @Firadeoclus on GPUMODE discord for pointing out that my original post mixes up <code class="language-plaintext highlighter-rouge">HMMA</code> and <code class="language-plaintext highlighter-rouge">HFMA2.MMA</code> and how they move constants instead of zeroing.</p> <h1 id="citation">Citation</h1> <p>To cite this article:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@article{zhu2025cutlass, title = {Maybe consider putting "cutlass" in your CUDA/Triton 
kernels}, author = {Zhu, Henry}, journal = {maknee.github.io}, year = {2025}, month = {December}, url = "https://maknee.github.io/blog/2025/Maybe-Consider-Putting-Cutlass-In-Your-CUDA-Kernels/" } </code></pre></div></div> Network Storage and Scaling Characteristics of a Distributed Filesystem 2025-09-16T06:00:00+00:00 2025-09-16T06:00:00+00:00 https://maknee.github.io/blog/2025/3FS-Performance-Journal-3 <h1 id="series">Series</h1> <ul> <li><a href="/blog/2025/3FS-Performance-Journal-1/">An Intro to DeepSeek’s Distributed File System</a></li> <li><a href="/blog/2025/3FS-Performance-Journal-2/">A Reality Check on DeepSeek’s Distributed File System Benchmarks</a></li> <li><a href="/blog/2025/3FS-Performance-Journal-3/">Network Storage and Scaling Characteristics of a Distributed Filesystem</a></li> </ul> <!-- - [Theoretical Performance Limits of 3FS](/blog/2018/RTX-DXR-Path-Tracer-Host/) - [Benchmarking 3FS](/blog/2018/RTX-DXR-Path-Tracer-HLSL/) - [Analysis of 3FS Benchmarks](/blog/2018/RTX-DXR-Path-Tracer-HLSL/) - [Improving 3FS Performance](/blog/2018/RTX-DXR-Path-Tracer-HLSL/) --> <h1 id="table-of-contents">Table of Contents</h1> <ul> <li><a href="#the-benchmarking-pyramid">The Benchmarking Pyramid</a></li> <li><a href="#network-baseline-benchmark">Network Baseline Benchmark</a></li> <li><a href="#benchmarking-for-modern-cluster">Storage Baseline Benchmark</a></li> <li><a href="#3fs">3FS Performance Analysis</a> <ul> <li><a href="#scaling-block-size-5-nodes">Scaling Block Size</a></li> <li><a href="#scaling-nodes">Scaling Number of Nodes</a></li> </ul> </li> <li><a href="#wrapping-up">Wrapping up</a></li> </ul> <h1 id="refresher">Refresher</h1> <p>In <a href="/blog/2025/3FS-Performance-Journal-1/">my first post</a>, I introduced DeepSeek’s <a href="https://github.com/deepseek-ai/3FS/tree/ee9a5cee0a85c64f4797bf380257350ca1becd36">3FS distributed file system</a> and performed a <a href="/blog/2025/3FS-Performance-Journal-2/">reality check in the second post</a>. 
Now it’s time to see how 3FS performs in practice.</p> <h1 id="the-benchmarking-pyramid">The Benchmarking Pyramid</h1> <p>Before diving into results, let’s talk about understanding software performance at a high level. If we imagine performance understanding as an onion, peeling off each layer of the onion reveals deeper insights<span class="sidenote-ref"></span><span class="sidenote">Each layer gives us a deeper understanding. Without starting at the top, discovering insights may be difficult</span></p> <div class="image-container" style="max-width: 100%; overflow: visible;"> <img loading="lazy" src="/assets/images/posts/2025-03-13/part3/increasing_difficulty.svg" style="width: 105%; margin-left: calc((100% - 105%) / 2);" alt="" /> <div class="caption"> <em>The performance analysis pyramid: from theoretical limits to production </em> </div> </div> <p>We started with napkin math in the first post, performed reality checks in the second, and now we’re ready for the next layer: microbenchmarking.</p> <h2 id="why-microbenchmark">Why Microbenchmark?</h2> <p>Think of microbenchmarking as testing individual components in isolation. Instead of running a complex workload that does everything at once, we test one specific operation repeatedly until we understand its exact performance characteristics. It’s like measuring only how fast a car accelerates in a straight line instead of timing a trip through city traffic where you can’t tell if slowdowns are from stop signs, traffic lights, or congested highways.</p> <p>But one might ask: why not jump straight to real workloads? Real workloads are messy. They mix reads, writes, different block sizes, and various access patterns. When something’s slow, is it the network? The disk? The software? That’s the challenge with macrobenchmarks and production workloads (the bottom layers of our pyramid). There are too many variables at once.
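</p>

<p>To make the pattern concrete, here is a toy sketch of the microbenchmark idea (my own illustration, not a tool used in this post): time one isolated operation many times and look at the distribution rather than a single number. An in-memory copy stands in for a disk or network operation:</p>

```python
import statistics
import time

def microbench(op, iters=1000):
    """Run one operation repeatedly and summarize its latency distribution."""
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        op()
        samples.append(time.perf_counter() - start)
    return {
        "median_us": statistics.median(samples) * 1e6,
        "p99_us": statistics.quantiles(samples, n=100)[98] * 1e6,  # 99th percentile
    }

buf = bytes(1024 * 1024)  # 1 MiB payload
stats = microbench(lambda: bytearray(buf))  # the isolated operation: one memory copy
print(f"median={stats['median_us']:.1f}us p99={stats['p99_us']:.1f}us")
```

<p>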
Microbenchmarks let us isolate each component and understand exactly where time is spent<span class="sidenote-ref"></span><span class="sidenote">They answer specific questions like: What’s the maximum throughput for sequential reads? How does latency change with queue depth? Where exactly does performance fall off a cliff when we increase parallelism?</span>.</p> <p>These benchmarks build intuition at multiple levels: from raw hardware performance to how exactly 3FS performs. Once one recognizes these patterns, one can develop intuition for why related applications may be slow and how to fix them<span class="sidenote-ref"></span><span class="sidenote">This knowledge transfers across systems too – similar hardware will have similar characteristics regardless of the software running on top, and similar types of software (like filesystems) perform comparable operations</span>.</p> <p>In my previous posts, I made several predictions about 3FS performance based on napkin math and reality checks. Now that I have actual microbenchmark data, I can see how accurate those predictions were or how terribly off I was.</p> <h2 id="what-were-measuring-and-why">What we’re measuring and why</h2> <p>In this post, we’ll answer five key questions:</p> <ol> <li><strong>What are the hardware limits?</strong> – Local SSD and InfiniBand benchmarks establish our ceiling</li> <li><strong>How does 3FS compare?</strong> – Performance differences from local benchmarks and why they occur</li> <li><strong>Is 3FS hardware-specific?</strong> – Does it require high-end hardware or work well on commodity clusters?<span class="sidenote-ref"></span><span class="sidenote"><a href="https://arxiv.org/pdf/2408.14158">DeepSeek’s paper</a> describes a cluster with NVMe SSDs and 200Gb/s InfiniBand.
What happens with SATA SSDs and 25Gb/s networking?</span></li> <li><strong>How does 3FS scale?</strong> – Performance across different node counts and configurations</li> <li><strong>What knobs matter?</strong> – Impact of block sizes, I/O patterns, and tuning parameters</li> </ol> <p>This will start to build our intuition for how 3FS performs. The post includes many interactive graphs to explore the data yourself<span class="sidenote-ref"></span><span class="sidenote">I’ll highlight the interesting patterns so you don’t drown in numbers; sometimes benchmarks reveal surprising behaviors</span>.</p> <h1 id="single-node-benchmarking">Single Node Benchmarking</h1> <p>Before diving into 3FS performance, we need to understand how our clusters perform. This section establishes baseline performance for both network and storage using standard tools.</p> <h2 id="testing-environment">Testing Environment</h2> <p>I have two contrasting setups that tell an interesting story:</p> <link rel="stylesheet" href="/assets/css/fancy_table.css" /> <script> function toggleCalculations(tableId) { // ...
(toggleCalculations function remains the same) const table = document.getElementById(tableId); const cells = table.querySelectorAll('.has-calculation'); const tableWrapper = document.getElementById(tableId + '-wrapper'); const toggleSwitch = document.getElementById(tableId + '-switch'); const toggleText = document.getElementById(tableId + '-toggle-text'); if (tableWrapper.classList.contains('show-calculations')) { tableWrapper.classList.remove('show-calculations'); toggleSwitch.classList.remove('active'); toggleText.textContent = "Show calculations"; cells.forEach(cell => { const normalText = cell.querySelector('.normal-text'); const calculationText = cell.querySelector('.calculation-text'); normalText.style.display = 'inline'; calculationText.style.display = 'none'; }); } else { tableWrapper.classList.add('show-calculations'); toggleSwitch.classList.add('active'); toggleText.textContent = "Hide calculations"; cells.forEach(cell => { const normalText = cell.querySelector('.normal-text'); const calculationText = cell.querySelector('.calculation-text'); normalText.style.display = 'none'; calculationText.style.display = 'inline'; }); } } function isConsideredYellow(colorString) { if (!colorString) return false; const lowerColor = colorString.toLowerCase(); const yellowKeywords = ['yellow', 'gold', 'lemon', 'chiffon', 'goldenrod', 'papayawhip', 'moccasin', 'khaki', '#ff0', '#ffff00', '#ffffe0', '#fffacd', '#fafad2', '#fff8dc', '#eee8aa', '#f0e68c']; if (yellowKeywords.some(k => lowerColor.includes(k))) { return true; } const match = lowerColor.match(/rgba?\((\d+),\s*(\d+),\s*(\d+)/); if (match) { const r = parseInt(match[1]); const g = parseInt(match[2]); const b = parseInt(match[3]); if (r > 200 && g > 180 && b < 200 && Math.abs(r - g) < 70) { return true; } } if (lowerColor === '#ffdab9') return true; // PeachPuff if (lowerColor === 'rgba(255,255,0,0.2)') return true; // Example: semi-transparent yellow return false; } function initializeCellHighlighters() { const 
highlighters = document.querySelectorAll('[data-highlight-cells]'); highlighters.forEach(highlighter => { if (highlighter.dataset.highlighterInitialized === 'true') return; highlighter.dataset.highlighterInitialized = 'true'; const cellIdsToHighlight = highlighter.dataset.highlightCells.split(',').map(id => id.trim()); const cellElements = cellIdsToHighlight.map(id => document.getElementById(id)).filter(el => el); let descriptiveSpans = []; if (highlighter.matches('span[data-hover-text-color]')) { descriptiveSpans.push(highlighter); } else { descriptiveSpans = Array.from(highlighter.querySelectorAll('span[data-hover-text-color]')); } descriptiveSpans.forEach(span => { if (typeof span.dataset.originalParaTextColor === 'undefined') { span.dataset.originalParaTextColor = window.getComputedStyle(span).color || ''; } }); highlighter.addEventListener('mouseenter', () => { highlighter.classList.add('trigger-text-active'); descriptiveSpans.forEach(span => { if (span.dataset.hoverTextColor) { span.style.color = span.dataset.hoverTextColor; } }); cellElements.forEach((cell, idx) => { cell.classList.add('cell-highlighted'); let hoverTextColorForCell = null; let hoverCellBgFromDescSpan = null; if (idx < descriptiveSpans.length) { const descSpan = descriptiveSpans[idx]; if (descSpan.dataset.hoverTextColor) { hoverTextColorForCell = descSpan.dataset.hoverTextColor; } if (descSpan.dataset.hoverCellBg) { hoverCellBgFromDescSpan = descSpan.dataset.hoverCellBg; } } else if (descriptiveSpans.length === 1 && cellElements.length > 1) { const descSpan = descriptiveSpans[0]; if (descSpan.dataset.hoverTextColor) { hoverTextColorForCell = descSpan.dataset.hoverTextColor; } if (descSpan.dataset.hoverCellBg) { hoverCellBgFromDescSpan = descSpan.dataset.hoverCellBg; } } if (hoverTextColorForCell) { const isCalcCell = cell.classList.contains('has-calculation'); if (isCalcCell) { const normalTextSpan = cell.querySelector('.normal-text'); const calcResultSpan = 
cell.querySelector('.calc-result'); const calcWrapper = cell.querySelector('.calculation-text'); if (calcWrapper && window.getComputedStyle(calcWrapper).display !== 'none') { if (calcResultSpan) calcResultSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } else { if (normalTextSpan) normalTextSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } } else { const directSpan = cell.querySelector(':scope > span'); if (directSpan) directSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } } if (hoverCellBgFromDescSpan && isConsideredYellow(hoverCellBgFromDescSpan)) { if (typeof cell.dataset.originalBgColor === 'undefined') { cell.dataset.originalBgColor = cell.style.backgroundColor; } cell.style.backgroundColor = hoverCellBgFromDescSpan; cell.dataset.bgChangedByScript = 'true'; } }); }); highlighter.addEventListener('mouseleave', () => { highlighter.classList.remove('trigger-text-active'); descriptiveSpans.forEach(span => { if (typeof span.dataset.originalParaTextColor !== 'undefined') { span.style.color = span.dataset.originalParaTextColor; } }); cellElements.forEach((cell, idx) => { cell.classList.remove('cell-highlighted'); const isCalcCell = cell.classList.contains('has-calculation'); if (isCalcCell) { const normalTextSpan = cell.querySelector('.normal-text'); const calcResultSpan = cell.querySelector('.calc-result'); if (normalTextSpan) normalTextSpan.style.removeProperty('color'); if (calcResultSpan) calcResultSpan.style.removeProperty('color'); } else { const directSpan = cell.querySelector(':scope > span'); if (directSpan) directSpan.style.removeProperty('color'); } if (cell.dataset.bgChangedByScript === 'true') { cell.style.backgroundColor = cell.dataset.originalBgColor || ''; delete cell.dataset.originalBgColor; delete cell.dataset.bgChangedByScript; } }); }); }); } if (document.readyState === 'loading') { document.addEventListener('DOMContentLoaded', initializeCellHighlighters); } else { 
initializeCellHighlighters(); } </script> <style> /* ... (Other styles remain the same) ... */ .toggle-container { display: flex; justify-content: flex-end; margin-bottom: 6px; } .calc-toggle { display: inline-flex; align-items: center; } .toggle-switch { position: relative; display: inline-block; width: 32px; height: 16px; background-color: #e5e7eb; /* gray-200 */ border-radius: 10px; transition: all 0.3s; cursor: pointer; } .toggle-switch::after { content: ''; position: absolute; width: 12px; height: 12px; border-radius: 50%; background-color: white; top: 2px; left: 2px; transition: all 0.3s cubic-bezier(0.175, 0.885, 0.32, 1.275); box-shadow: 0 1px 2px rgba(0, 0, 0, 0.1); } .toggle-switch.active { background-color: #8b5cf6; /* purple-500 */ } .toggle-switch.active::after { left: 18px; } .toggle-label { position: absolute; width: 1px; height: 1px; padding: 0; margin: -1px; overflow: hidden; clip: rect(0, 0, 0, 0); white-space: nowrap; border-width: 0; } .toggle-text { font-size: 0.75rem; color: #6b7280; /* gray-500 */ margin-right: 6px; } .calc-result { color: rgb(75, 85, 99); /* gray-700 */ } .calc-formula { color: #6b83a6; /* Subtle bluish gray */ font-family: ui-monospace, SFMono-Regular, Menlo, Monaco, Consolas, monospace; } .calc-equals { color: rgb(75, 85, 99); /* gray-700 */ display: inline-block; margin: 0 4px; font-weight: normal; } .cell-highlighted { font-weight: bold !important; } .trigger-text-active { background-color: #E5E7EB; padding: 1px 3px; border-radius: 3px; transition: background-color 0.15s ease-in-out; } .trigger-text-active span[data-hover-text-color] { padding: 0; background-color: transparent; } /* Add styles for no-scroll option */ .table-wrapper-no-scroll { overflow-x: visible !important; } .table-no-scroll td { white-space: normal !important; word-break: break-word; } </style> <div id="fancy-table-Component,Older Cluster (18 Nodes),Modern Cluster (5 Nodes)-wrapper" class="px-4 rounded-lg __basic-table not-prose mt-4 mb-4 
overflow-x-auto"> <table id="fancy-table-Component,Older Cluster (18 Nodes),Modern Cluster (5 Nodes)" class="min-w-full divide-y divide-gray-200 font-sans basic-table-striped"> <thead class="bg-gray-50"> <tr> <th class="px-6 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Component </th> <th class="px-6 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Older Cluster (18 Nodes) </th> <th class="px-6 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Modern Cluster (5 Nodes) </th> </tr> </thead> <tbody> <tr class="border-b border-gray-200"> <td id="fancy-table-Component,Older Cluster (18 Nodes),Modern Cluster (5 Nodes)-row0-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Node Count</span> </td> <td id="fancy-table-Component,Older Cluster (18 Nodes),Modern Cluster (5 Nodes)-row0-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">18</span> </td> <td id="fancy-table-Component,Older Cluster (18 Nodes),Modern Cluster (5 Nodes)-row0-col2" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">5</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="fancy-table-Component,Older Cluster (18 Nodes),Modern Cluster (5 Nodes)-row1-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Use case</span> </td> <td id="fancy-table-Component,Older Cluster (18 Nodes),Modern Cluster (5 Nodes)-row1-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Budget cluster</span> </td> <td id="fancy-table-Component,Older Cluster (18 Nodes),Modern Cluster (5 Nodes)-row1-col2" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) 
!important;">High-performance cluster</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="fancy-table-Component,Older Cluster (18 Nodes),Modern Cluster (5 Nodes)-row2-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">CPU</span> </td> <td id="fancy-table-Component,Older Cluster (18 Nodes),Modern Cluster (5 Nodes)-row2-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">10-core Intel E5-2640v4 (2017 era)</span> </td> <td id="fancy-table-Component,Older Cluster (18 Nodes),Modern Cluster (5 Nodes)-row2-col2" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">2×36-core Intel Xeon Platinum (2021 era)</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="fancy-table-Component,Older Cluster (18 Nodes),Modern Cluster (5 Nodes)-row3-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Memory</span> </td> <td id="fancy-table-Component,Older Cluster (18 Nodes),Modern Cluster (5 Nodes)-row3-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">64GB DDR4-2400</span> </td> <td id="fancy-table-Component,Older Cluster (18 Nodes),Modern Cluster (5 Nodes)-row3-col2" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">256GB DDR4-3200</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="fancy-table-Component,Older Cluster (18 Nodes),Modern Cluster (5 Nodes)-row4-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Storage</span> </td> <td id="fancy-table-Component,Older Cluster (18 Nodes),Modern Cluster (5 Nodes)-row4-col1" class="px-6 py-2 
whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">SATA SSD (480GB)</span> </td> <td id="fancy-table-Component,Older Cluster (18 Nodes),Modern Cluster (5 Nodes)-row4-col2" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">NVMe SSD (1.6TB PCIe 4.0)</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="fancy-table-Component,Older Cluster (18 Nodes),Modern Cluster (5 Nodes)-row5-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Network</span> </td> <td id="fancy-table-Component,Older Cluster (18 Nodes),Modern Cluster (5 Nodes)-row5-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">25 Gbps (3.125 GB/s)</span> </td> <td id="fancy-table-Component,Older Cluster (18 Nodes),Modern Cluster (5 Nodes)-row5-col2" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">100 Gbps (12.5 GB/s)</span> </td> </tr> </tbody> </table> </div> <p>The older cluster represents deployments using previous-generation hardware. The modern cluster represents reasonably current high-performance deployments. Comparing these reveals how 3FS performs across different hardware generations<span class="sidenote-ref"></span><span class="sidenote">I don’t have access to a high-end cluster with many NVMe drives and newer NICs. I’d love to have the setup that the 3FS team uses, but I’m just a student without access to those types of clusters 😔</span>.
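</p>

<p>As a sanity check on the network numbers in the table, converting a link’s line rate from Gbit/s to GB/s is just a divide by 8 (this ignores protocol overhead, so achievable throughput lands a bit lower than line rate):</p>

```python
def line_rate_gb_per_s(gbps: float) -> float:
    """Convert a line rate in Gbit/s to GB/s (8 bits per byte)."""
    return gbps / 8

print(line_rate_gb_per_s(25))   # old cluster: 3.125 GB/s line rate
print(line_rate_gb_per_s(100))  # new cluster: 12.5 GB/s line rate
```

<p>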
I’ll be referring to these clusters as <code class="language-plaintext highlighter-rouge">old cluster</code> and <code class="language-plaintext highlighter-rouge">new cluster</code>.</p> <details style="font-size: 1.2em; margin: 1em 0;"> <summary style="cursor: pointer; font-weight: bold;">Expand to see more detailed hardware specifications</summary> <table class="datatable display compact cell-border row-border hover"> <thead> <tr> <th>Component</th> <th>Older Cluster (18 Node Setup)</th> <th>Modern Cluster (5 Node Setup)</th> </tr> </thead> <tbody> <tr> <td>Node Count</td> <td>18</td> <td>5</td> </tr> <tr> <td>CPU</td> <td>Ten-core Intel E5-2640v4 at 2.4 GHz</td> <td>Two 36-core Intel Xeon Platinum 8360Y at 2.4GHz</td> </tr> <tr> <td>RAM</td> <td>64GB ECC Memory (4x 16 GB DDR4-2400 DIMMs)</td> <td>256GB ECC Memory (16x 16 GB 3200MHz DDR4)</td> </tr> <tr> <td>Disk</td> <td><a href="https://servak.com.ua/image/manual/SSD/SSD_240GB_2.5_6G_INTEL_DC_S3520_SERIES_SATA_Quick_Specs_Servak_2.pdf?srsltid=AfmBOoq8zg_-WF9Sop69GSohu_edCS2TGfP0pINVrR3IfPklqPNjLb5J">Intel DC S3520 480 GB 6G SATA SSD</a> (OS &amp; Workload)</td> <td><a href="https://semiconductor.samsung.com/ssd/datacenter-ssd/sm883/mz7kh480hahq/">Samsung 480GB SATA SSD</a> (OS)<br /><a href="https://dl.dell.com/manuals/all-products/esuprt_data_center_infra_int/esuprt_data_center_infra_storage_adapters/dell-poweredge-exp-fsh-nvme-pcie-ssd_users-guide7_en-us.pdf">Dell 1.6TB NVMe SSD (PCIe v4.0)</a> (Workload)</td> </tr> <tr> <td>Network</td> <td>Mellanox ConnectX-4 25 Gb NIC<br />(3.125 GB/s, only one physical port at 25 Gbps)</td> <td>Dual-port Mellanox ConnectX-6 100 Gb NIC<br />(12.5 GB/s, only one physical port enabled)</td> </tr> </tbody> </table> <!-- lstopo --no-legend --of svg > cpu.svg --> <p>Layout of Older Cluster:</p> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-03-13/setup/setup1.svg" width="100%" alt="" /> <div class="caption"> <em>Older Cluster cpu/pcie layout
</em> </div> </div> <p>Layout of Modern Cluster:</p> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-03-13/setup/setup2.svg" width="100%" alt="" /> <div class="caption"> <em>Modern Cluster cpu/pcie layout </em> </div> </div> </details> <h2 id="network-baseline-benchmark">Network Baseline Benchmark</h2> <p>Distributed filesystems are only as fast as their network, which often becomes the primary bottleneck depending on the workload, as shown in <a href="/blog/2025/3FS-Performance-Journal-2/#first-workload-training-job">my measurements in the previous post</a>.</p> <p>Since 3FS uses InfiniBand for data transfer, we first measure raw network performance using the <code class="language-plaintext highlighter-rouge">ib_send</code>, <code class="language-plaintext highlighter-rouge">ib_read</code> and <code class="language-plaintext highlighter-rouge">ib_write</code> benchmarks. These tests show us two things: how close we can get to the theoretical 12.5 GB/s (100 Gbps) limit, and how latency changes with different message sizes<span class="sidenote-ref"></span><span class="sidenote">I will be profiling actual 3FS network traffic to observe what message sizes are used and how they map to these latency measurements in a later post</span>.</p> <p>The graph plots three key variables:</p> <ul> <li><strong>Message Size (Z-axis):</strong> On a logarithmic scale, showing packet sizes from bytes to 10 megabytes</li> <li><strong>Throughput (Y-axis):</strong> Data transfer rate in GB/s, with color mapping from blue (low) to red (high)</li> <li><strong>Latency (X-axis):</strong> Transfer completion time in microseconds</li> </ul> <details style="font-size: 1.2em; margin: 1em 0;"> <summary style="cursor: pointer; font-weight: bold;">Expand for instructions on how to interact with the graph</summary> <p>The results of the <code class="language-plaintext highlighter-rouge">ib_read_bw</code> benchmark are plotted in the interactive 3D graph below. 
You can click and drag to rotate the graph, and hovering over any data point will display its precise values.</p> <p>The <strong>Test Type</strong> menu allows you to switch between different benchmark results (<code class="language-plaintext highlighter-rouge">ib_write</code> and <code class="language-plaintext highlighter-rouge">ib_send</code>). The <strong>View Mode</strong> can be changed to 2D, which helps observe latency variations more clearly.</p> </details> <!-- ib-benchmark.html --> <link rel="stylesheet" href="/assets/css/ib_benchmark.css" /> <script src="https://cdnjs.cloudflare.com/ajax/libs/plotly.js/2.27.1/plotly.min.js"></script> <script src="/assets/js/ib_benchmark.js"></script> <div class="ib-benchmark-container" id="ib-benchmark-container-nvme_ib_unidirectional" data-path="/assets/images/posts/2025-03-13/ib/ib_benchmark_unidirectional.json"> <h2>IB benchmark unidirectional</h2> <div class="ib-controls"> <div class="ib-control-group"> <label for="testType-nvme_ib_unidirectional">Test Type</label> <select id="testType-nvme_ib_unidirectional"> <option value="send_bw">Send Bandwidth</option> <option value="send_lat">Send Latency</option> <option value="read_bw" selected="">Read Bandwidth</option> <option value="read_lat">Read Latency</option> <option value="write_bw">Write Bandwidth</option> <option value="write_lat">Write Latency</option> </select> </div> <div class="ib-control-group"> <label for="viewMode-nvme_ib_unidirectional">View Mode</label> <select id="viewMode-nvme_ib_unidirectional"> <option value="3d" selected="">3D Graph</option> <option value="2d">2D Graph</option> </select> </div> </div> <div id="ib-plot-nvme_ib_unidirectional" class="ib-plot-container ib-lazy-load"></div> <!-- Panels indicator --> <div id="ib-panels-indicator-nvme_ib_unidirectional" class="ib-panels-indicator" style="display: none;"> <span>Active Panels: <span id="ib-panel-count-nvme_ib_unidirectional">0</span></span> <button id="ib-arrange-btn-nvme_ib_unidirectional" 
class="ib-action-button">Arrange</button> <button id="ib-close-all-btn-nvme_ib_unidirectional" class="ib-action-button">Close All</button> </div> <div class="ib-benchmark-note"> <p>IB Benchmark on unidirectional throughput/latency</p> </div> <script> document.addEventListener('DOMContentLoaded', function() { // Create intersection observer for lazy loading const observer = new IntersectionObserver((entries, observer) => { entries.forEach(entry => { if (entry.isIntersecting) { const container = entry.target.closest('.ib-benchmark-container'); const id = container.id.replace('ib-benchmark-container-', ''); const plotEl = document.getElementById('ib-plot-' + id); // Check if already loading or loaded to prevent duplicate requests if (!plotEl.dataset.loading) { plotEl.dataset.loading = 'true'; // Load the benchmark data and initialize the plot loadIBBenchmarkData(id); } // Stop observing once we've started loading observer.unobserve(entry.target); } }); }, { rootMargin: '200px 0px', // Load when within 200px of viewport threshold: 0.01 }); // Start observing the plot container const plotContainer = document.getElementById('ib-plot-nvme_ib_unidirectional'); if (plotContainer) { observer.observe(plotContainer); } }); // Function to load benchmark data function loadIBBenchmarkData(id) { // Ensure benchmark.js is loaded before initializing function waitForIBBenchmarkJs() { if (typeof initInfiniBandPlot === 'function') { // Function exists, proceed with initialization const container = document.getElementById('ib-benchmark-container-' + id); const dataPath = container.getAttribute('data-path'); if (dataPath) { // Use fetch API to load JSON data only when needed fetch(dataPath) .then(response => { if (!response.ok) { throw new Error(`HTTP error! 
Status: ${response.status}`); } return response.json(); }) .then(data => { // Store data and initialize the plot const processedData = processBenchmarkData(data); const plotId = 'ib-plot-' + id; const testType = document.getElementById('testType-' + id).value; const viewMode = document.getElementById('viewMode-' + id).value; // Use our new function to initialize the plot window['ibPlot_' + id] = initInfiniBandPlot(plotId, processedData, { defaultTest: testType || 'send_bw', viewMode: viewMode || '3d' }); document.getElementById(plotId).classList.remove('ib-lazy-load'); // Setup event listeners for controls document.getElementById('testType-' + id).addEventListener('change', function(e) { const plotObj = window['ibPlot_' + id]; if (plotObj && plotObj.setTestType) { plotObj.setTestType(e.target.value); } }); document.getElementById('viewMode-' + id).addEventListener('change', function(e) { const plotObj = window['ibPlot_' + id]; if (plotObj && plotObj.setViewMode) { plotObj.setViewMode(e.target.value); } }); }) .catch(error => { console.error('Error loading InfiniBand benchmark data:', error); document.getElementById('ib-plot-' + id).innerHTML = '<div class="ib-error">Error loading benchmark data. Check console for details.</div>'; }); } } else { // Function not available yet, wait and try again setTimeout(() => waitForIBBenchmarkJs(), 100); } } waitForIBBenchmarkJs(); } </script> </div> <p>Key observations from the throughput graph:</p> <ul> <li>All three operations (read, write, send) peak at ~11.5 GB/s (92% of theoretical) at 4K-8K message sizes<span class="sidenote-ref"></span><span class="sidenote">Surprisingly, the send operation (two-sided) achieves the same bandwidth as one-sided RDMA operations. 
This is unexpected given the additional coordination overhead</span></li> <li>To achieve meaningful throughput (&gt;10 GB/s), you need at least 4KB messages</li> </ul> <p>Switching to the latency graph (Read Bandwidth -&gt; Read Latency) reveals additional insights:</p> <ul> <li>At the same 4K message sizes, latency drops significantly to ~5μs when operating at ~1 GB/s<span class="sidenote-ref"></span><span class="sidenote">Possibly queuing effects, though I’m not sure of the exact cause</span></li> </ul> <p>Switching to the 2D version of the latency graph (Read Bandwidth -&gt; Read Latency, 3D Graph -&gt; 2D Graph):</p> <ul> <li>Two distinct latency regions emerge: a gentle increase from 5μs to 10μs (2 bytes to 64KB), then almost linear growth beyond 64KB<span class="sidenote-ref"></span><span class="sidenote">This also holds when the NIC is at full throughput. This makes the performance very predictable, which makes understanding network bottlenecks easier</span></li> <li>Latency variance remains stable across most message sizes (p50, p90, p99 are tightly grouped)</li> </ul> <p>Since NICs support bidirectional communication, we also need to measure performance when traffic flows in both directions simultaneously:</p> <!-- ib-benchmark.html --> <link rel="stylesheet" href="/assets/css/ib_benchmark.css" /> <script src="https://cdnjs.cloudflare.com/ajax/libs/plotly.js/2.27.1/plotly.min.js"></script> <script src="/assets/js/ib_benchmark.js"></script> <div class="ib-benchmark-container" id="ib-benchmark-container-nvme_ib_bidirectional" data-path="/assets/images/posts/2025-03-13/ib/ib_benchmark_bidirectional.json"> <h2>IB benchmark bidirectional</h2> <div class="ib-controls"> <div class="ib-control-group"> <label for="testType-nvme_ib_bidirectional">Test Type</label> <select id="testType-nvme_ib_bidirectional"> <option value="send_bw">Send Bandwidth</option> <option value="send_lat">Send Latency</option> <option value="read_bw" selected="">Read
Bandwidth</option> <option value="read_lat">Read Latency</option> <option value="write_bw">Write Bandwidth</option> <option value="write_lat">Write Latency</option> </select> </div> <div class="ib-control-group"> <label for="viewMode-nvme_ib_bidirectional">View Mode</label> <select id="viewMode-nvme_ib_bidirectional"> <option value="3d" selected="">3D Graph</option> <option value="2d">2D Graph</option> </select> </div> </div> <div id="ib-plot-nvme_ib_bidirectional" class="ib-plot-container ib-lazy-load"></div> <!-- Panels indicator --> <div id="ib-panels-indicator-nvme_ib_bidirectional" class="ib-panels-indicator" style="display: none;"> <span>Active Panels: <span id="ib-panel-count-nvme_ib_bidirectional">0</span></span> <button id="ib-arrange-btn-nvme_ib_bidirectional" class="ib-action-button">Arrange</button> <button id="ib-close-all-btn-nvme_ib_bidirectional" class="ib-action-button">Close All</button> </div> <div class="ib-benchmark-note"> <p>IB Benchmark on bidirectional throughput/latency</p> </div> <script> document.addEventListener('DOMContentLoaded', function() { // Create intersection observer for lazy loading const observer = new IntersectionObserver((entries, observer) => { entries.forEach(entry => { if (entry.isIntersecting) { const container = entry.target.closest('.ib-benchmark-container'); const id = container.id.replace('ib-benchmark-container-', ''); const plotEl = document.getElementById('ib-plot-' + id); // Check if already loading or loaded to prevent duplicate requests if (!plotEl.dataset.loading) { plotEl.dataset.loading = 'true'; // Load the benchmark data and initialize the plot loadIBBenchmarkData(id); } // Stop observing once we've started loading observer.unobserve(entry.target); } }); }, { rootMargin: '200px 0px', // Load when within 200px of viewport threshold: 0.01 }); // Start observing the plot container const plotContainer = document.getElementById('ib-plot-nvme_ib_bidirectional'); if (plotContainer) { 
observer.observe(plotContainer); } }); // Function to load benchmark data function loadIBBenchmarkData(id) { // Ensure benchmark.js is loaded before initializing function waitForIBBenchmarkJs() { if (typeof initInfiniBandPlot === 'function') { // Function exists, proceed with initialization const container = document.getElementById('ib-benchmark-container-' + id); const dataPath = container.getAttribute('data-path'); if (dataPath) { // Use fetch API to load JSON data only when needed fetch(dataPath) .then(response => { if (!response.ok) { throw new Error(`HTTP error! Status: ${response.status}`); } return response.json(); }) .then(data => { // Store data and initialize the plot const processedData = processBenchmarkData(data); const plotId = 'ib-plot-' + id; const testType = document.getElementById('testType-' + id).value; const viewMode = document.getElementById('viewMode-' + id).value; // Use our new function to initialize the plot window['ibPlot_' + id] = initInfiniBandPlot(plotId, processedData, { defaultTest: testType || 'send_bw', viewMode: viewMode || '3d' }); document.getElementById(plotId).classList.remove('ib-lazy-load'); // Setup event listeners for controls document.getElementById('testType-' + id).addEventListener('change', function(e) { const plotObj = window['ibPlot_' + id]; if (plotObj && plotObj.setTestType) { plotObj.setTestType(e.target.value); } }); document.getElementById('viewMode-' + id).addEventListener('change', function(e) { const plotObj = window['ibPlot_' + id]; if (plotObj && plotObj.setViewMode) { plotObj.setViewMode(e.target.value); } }); }) .catch(error => { console.error('Error loading InfiniBand benchmark data:', error); document.getElementById('ib-plot-' + id).innerHTML = '<div class="ib-error">Error loading benchmark data. 
Check console for details.</div>'; }); } } else { // Function not available yet, wait and try again setTimeout(() => waitForIBBenchmarkJs(), 100); } } waitForIBBenchmarkJs(); } </script> </div> <p>The bidirectional results largely mirror the unidirectional ones, with one counterintuitive twist:</p> <ul> <li>At 4K-8K message sizes, we achieve double the throughput while latency drops from 30-60μs to 15-30μs<span class="sidenote-ref"></span><span class="sidenote">This counterintuitive result likely occurs because each direction gets dedicated hardware resources, allowing better pipeline utilization</span></li> <li>Combined bandwidth reaches ~23 GB/s (~92% of theoretical 25 GB/s)</li> <li>Latencies remain consistent with unidirectional measurements</li> </ul> <p>These measurements give us concrete expectations for 3FS operations. For example, when 3FS performs a 3-node write (1KB from 3 storage nodes), the network alone will consume 3-10μs. Any latency above this represents other software/hardware overhead – chunk management, thread contention, or disk I/O.</p> <details style="font-size: 1.2em; margin: 1em 0;"> <summary style="cursor: pointer; font-weight: bold;">Comparison to NCCL all_reduce_perf (for fun)</summary> <p>NCCL is the standard framework for GPU-to-GPU communication in machine learning clusters.
Since GPUs also use InfiniBand for inter-node communication, I wanted to see if the same performance patterns emerge.</p> <p>This test uses a 2-node cluster with 8x400Gbps InfiniBand NICs (~400GB/s total), typical for modern GPU clusters like 8xH100 setups.</p> <!-- ib-benchmark.html --> <link rel="stylesheet" href="/assets/css/ib_benchmark.css" /> <script src="https://cdnjs.cloudflare.com/ajax/libs/plotly.js/2.27.1/plotly.min.js"></script> <script src="/assets/js/ib_benchmark.js"></script> <div class="ib-benchmark-container" id="ib-benchmark-container-_ib_bidirectional" data-path="/assets/images/posts/2025-03-13/ib/nccl.json"> <h2>NCCL all_reduce_perf</h2> <div class="ib-controls"> <div class="ib-control-group"> <label for="testType-_ib_bidirectional">Test Type</label> <select id="testType-_ib_bidirectional"> <option value="send_bw">Send Bandwidth</option> <option value="send_lat">Send Latency</option> <option value="read_bw">Read Bandwidth</option> <option value="read_lat">Read Latency</option> <option value="write_bw">Write Bandwidth</option> <option value="write_lat">Write Latency</option> </select> </div> <div class="ib-control-group"> <label for="viewMode-_ib_bidirectional">View Mode</label> <select id="viewMode-_ib_bidirectional"> <option value="3d" selected="">3D Graph</option> <option value="2d">2D Graph</option> </select> </div> </div> <div id="ib-plot-_ib_bidirectional" class="ib-plot-container ib-lazy-load"></div> <!-- Panels indicator --> <div id="ib-panels-indicator-_ib_bidirectional" class="ib-panels-indicator" style="display: none;"> <span>Active Panels: <span id="ib-panel-count-_ib_bidirectional">0</span></span> <button id="ib-arrange-btn-_ib_bidirectional" class="ib-action-button">Arrange</button> <button id="ib-close-all-btn-_ib_bidirectional" class="ib-action-button">Close All</button> </div> <div class="ib-benchmark-note"> <p>all_reduce_perf</p> </div> <script> document.addEventListener('DOMContentLoaded', function() { // Create intersection 
observer for lazy loading const observer = new IntersectionObserver((entries, observer) => { entries.forEach(entry => { if (entry.isIntersecting) { const container = entry.target.closest('.ib-benchmark-container'); const id = container.id.replace('ib-benchmark-container-', ''); const plotEl = document.getElementById('ib-plot-' + id); // Check if already loading or loaded to prevent duplicate requests if (!plotEl.dataset.loading) { plotEl.dataset.loading = 'true'; // Load the benchmark data and initialize the plot loadIBBenchmarkData(id); } // Stop observing once we've started loading observer.unobserve(entry.target); } }); }, { rootMargin: '200px 0px', // Load when within 200px of viewport threshold: 0.01 }); // Start observing the plot container const plotContainer = document.getElementById('ib-plot-_ib_bidirectional'); if (plotContainer) { observer.observe(plotContainer); } }); // Function to load benchmark data function loadIBBenchmarkData(id) { // Ensure benchmark.js is loaded before initializing function waitForIBBenchmarkJs() { if (typeof initInfiniBandPlot === 'function') { // Function exists, proceed with initialization const container = document.getElementById('ib-benchmark-container-' + id); const dataPath = container.getAttribute('data-path'); if (dataPath) { // Use fetch API to load JSON data only when needed fetch(dataPath) .then(response => { if (!response.ok) { throw new Error(`HTTP error! 
Status: ${response.status}`); } return response.json(); }) .then(data => { // Store data and initialize the plot const processedData = processBenchmarkData(data); const plotId = 'ib-plot-' + id; const testType = document.getElementById('testType-' + id).value; const viewMode = document.getElementById('viewMode-' + id).value; // Use our new function to initialize the plot window['ibPlot_' + id] = initInfiniBandPlot(plotId, processedData, { defaultTest: testType || 'send_bw', viewMode: viewMode || '3d' }); document.getElementById(plotId).classList.remove('ib-lazy-load'); // Setup event listeners for controls document.getElementById('testType-' + id).addEventListener('change', function(e) { const plotObj = window['ibPlot_' + id]; if (plotObj && plotObj.setTestType) { plotObj.setTestType(e.target.value); } }); document.getElementById('viewMode-' + id).addEventListener('change', function(e) { const plotObj = window['ibPlot_' + id]; if (plotObj && plotObj.setViewMode) { plotObj.setViewMode(e.target.value); } }); }) .catch(error => { console.error('Error loading InfiniBand benchmark data:', error); document.getElementById('ib-plot-' + id).innerHTML = '<div class="ib-error">Error loading benchmark data. Check console for details.</div>'; }); } } else { // Function not available yet, wait and try again setTimeout(() => waitForIBBenchmarkJs(), 100); } } waitForIBBenchmarkJs(); } </script> </div> <p>The bandwidth pattern is similar (slow climb then rapid rise), but peak performance hits at ~512MB messages instead of 8KB<span class="sidenote-ref"></span><span class="sidenote">Likely due to multiple NICs and the collective communication overhead of all_reduce operations</span>. 
At the same 8KB message size where our InfiniBand tests peaked, NCCL only achieves ~0.24 GB/s @ ~20μs.</p> </details> <h2 id="storage-baseline-benchmark">Storage Baseline Benchmark</h2> <p><a href="https://fio.readthedocs.io/en/latest/fio_doc.html">FIO</a> is the standard tool for storage benchmarking on Linux, so I’ll be using that in the next section. As a heads up, the 3FS authors conveniently provide a <a href="https://github.com/deepseek-ai/3FS/tree/8c9883c27f50da8d1df8ff0b952483d21cdf1792/benchmarks/fio_usrbio">custom FIO engine</a> specifically for benchmarking their filesystem<span class="sidenote-ref"></span><span class="sidenote">This wasn’t in the original release – they added it after I started this analysis; otherwise I would have spent quite a bit of time writing one myself</span> that we can compare against!</p> <h3 id="local-storage-performance">Local Storage Performance</h3> <p>Before measuring 3FS, we need baseline numbers for our SSDs. The following benchmarks show how bandwidth and latency change as we vary two key parameters:</p> <ul> <li><strong>I/O depth</strong>: How many operations we submit before waiting for completion (think of it as the queue length)</li> <li><strong>Job count</strong>: How many parallel processes are hammering the storage simultaneously</li> </ul> <p>These SSD numbers will become our reference point<span class="sidenote-ref"></span><span class="sidenote">For example, with a replication factor of 3, we might see 3x higher read throughput or 3x higher write latency, but this might not be the case!</span> – when 3FS shows higher latency or lower throughput, we can quantify exactly how much overhead the distributed layer adds.</p> <p>I’ll benchmark the local SSD with io_uring, then 3FS with io_uring, and finally 3FS with its own custom io_uring-like interface.</p> <p>I configured 3FS with a replication factor of 3.</p> <h4 id="hardware-vendor-specifications">Hardware Vendor Specifications</h4> <p>Before examining our benchmark results, let’s establish
the theoretical performance limits according to hardware vendor specifications. These numbers represent the maximum performance we could theoretically achieve under ideal conditions:</p> <link rel="stylesheet" href="/assets/css/fancy_table.css" /> <script> function toggleCalculations(tableId) { // ... (toggleCalculations function remains the same) const table = document.getElementById(tableId); const cells = table.querySelectorAll('.has-calculation'); const tableWrapper = document.getElementById(tableId + '-wrapper'); const toggleSwitch = document.getElementById(tableId + '-switch'); const toggleText = document.getElementById(tableId + '-toggle-text'); if (tableWrapper.classList.contains('show-calculations')) { tableWrapper.classList.remove('show-calculations'); toggleSwitch.classList.remove('active'); toggleText.textContent = "Show calculations"; cells.forEach(cell => { const normalText = cell.querySelector('.normal-text'); const calculationText = cell.querySelector('.calculation-text'); normalText.style.display = 'inline'; calculationText.style.display = 'none'; }); } else { tableWrapper.classList.add('show-calculations'); toggleSwitch.classList.add('active'); toggleText.textContent = "Hide calculations"; cells.forEach(cell => { const normalText = cell.querySelector('.normal-text'); const calculationText = cell.querySelector('.calculation-text'); normalText.style.display = 'none'; calculationText.style.display = 'inline'; }); } } function isConsideredYellow(colorString) { if (!colorString) return false; const lowerColor = colorString.toLowerCase(); const yellowKeywords = ['yellow', 'gold', 'lemon', 'chiffon', 'goldenrod', 'papayawhip', 'moccasin', 'khaki', '#ff0', '#ffff00', '#ffffe0', '#fffacd', '#fafad2', '#fff8dc', '#eee8aa', '#f0e68c']; if (yellowKeywords.some(k => lowerColor.includes(k))) { return true; } const match = lowerColor.match(/rgba?\((\d+),\s*(\d+),\s*(\d+)/); if (match) { const r = parseInt(match[1]); const g = parseInt(match[2]); const b = 
parseInt(match[3]); if (r > 200 && g > 180 && b < 200 && Math.abs(r - g) < 70) { return true; } } if (lowerColor === '#ffdab9') return true; // PeachPuff if (lowerColor === 'rgba(255,255,0,0.2)') return true; // Example: semi-transparent yellow return false; } function initializeCellHighlighters() { const highlighters = document.querySelectorAll('[data-highlight-cells]'); highlighters.forEach(highlighter => { if (highlighter.dataset.highlighterInitialized === 'true') return; highlighter.dataset.highlighterInitialized = 'true'; const cellIdsToHighlight = highlighter.dataset.highlightCells.split(',').map(id => id.trim()); const cellElements = cellIdsToHighlight.map(id => document.getElementById(id)).filter(el => el); let descriptiveSpans = []; if (highlighter.matches('span[data-hover-text-color]')) { descriptiveSpans.push(highlighter); } else { descriptiveSpans = Array.from(highlighter.querySelectorAll('span[data-hover-text-color]')); } descriptiveSpans.forEach(span => { if (typeof span.dataset.originalParaTextColor === 'undefined') { span.dataset.originalParaTextColor = window.getComputedStyle(span).color || ''; } }); highlighter.addEventListener('mouseenter', () => { highlighter.classList.add('trigger-text-active'); descriptiveSpans.forEach(span => { if (span.dataset.hoverTextColor) { span.style.color = span.dataset.hoverTextColor; } }); cellElements.forEach((cell, idx) => { cell.classList.add('cell-highlighted'); let hoverTextColorForCell = null; let hoverCellBgFromDescSpan = null; if (idx < descriptiveSpans.length) { const descSpan = descriptiveSpans[idx]; if (descSpan.dataset.hoverTextColor) { hoverTextColorForCell = descSpan.dataset.hoverTextColor; } if (descSpan.dataset.hoverCellBg) { hoverCellBgFromDescSpan = descSpan.dataset.hoverCellBg; } } else if (descriptiveSpans.length === 1 && cellElements.length > 1) { const descSpan = descriptiveSpans[0]; if (descSpan.dataset.hoverTextColor) { hoverTextColorForCell = descSpan.dataset.hoverTextColor; } if 
(descSpan.dataset.hoverCellBg) { hoverCellBgFromDescSpan = descSpan.dataset.hoverCellBg; } } if (hoverTextColorForCell) { const isCalcCell = cell.classList.contains('has-calculation'); if (isCalcCell) { const normalTextSpan = cell.querySelector('.normal-text'); const calcResultSpan = cell.querySelector('.calc-result'); const calcWrapper = cell.querySelector('.calculation-text'); if (calcWrapper && window.getComputedStyle(calcWrapper).display !== 'none') { if (calcResultSpan) calcResultSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } else { if (normalTextSpan) normalTextSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } } else { const directSpan = cell.querySelector(':scope > span'); if (directSpan) directSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } } if (hoverCellBgFromDescSpan && isConsideredYellow(hoverCellBgFromDescSpan)) { if (typeof cell.dataset.originalBgColor === 'undefined') { cell.dataset.originalBgColor = cell.style.backgroundColor; } cell.style.backgroundColor = hoverCellBgFromDescSpan; cell.dataset.bgChangedByScript = 'true'; } }); }); highlighter.addEventListener('mouseleave', () => { highlighter.classList.remove('trigger-text-active'); descriptiveSpans.forEach(span => { if (typeof span.dataset.originalParaTextColor !== 'undefined') { span.style.color = span.dataset.originalParaTextColor; } }); cellElements.forEach((cell, idx) => { cell.classList.remove('cell-highlighted'); const isCalcCell = cell.classList.contains('has-calculation'); if (isCalcCell) { const normalTextSpan = cell.querySelector('.normal-text'); const calcResultSpan = cell.querySelector('.calc-result'); if (normalTextSpan) normalTextSpan.style.removeProperty('color'); if (calcResultSpan) calcResultSpan.style.removeProperty('color'); } else { const directSpan = cell.querySelector(':scope > span'); if (directSpan) directSpan.style.removeProperty('color'); } if (cell.dataset.bgChangedByScript === 'true') { 
cell.style.backgroundColor = cell.dataset.originalBgColor || ''; delete cell.dataset.originalBgColor; delete cell.dataset.bgChangedByScript; } }); }); }); } if (document.readyState === 'loading') { document.addEventListener('DOMContentLoaded', initializeCellHighlighters); } else { initializeCellHighlighters(); } </script> <style> /* ... (Other styles remain the same) ... */ .toggle-container { display: flex; justify-content: flex-end; margin-bottom: 6px; } .calc-toggle { display: inline-flex; align-items: center; } .toggle-switch { position: relative; display: inline-block; width: 32px; height: 16px; background-color: #e5e7eb; /* gray-200 */ border-radius: 10px; transition: all 0.3s; cursor: pointer; } .toggle-switch::after { content: ''; position: absolute; width: 12px; height: 12px; border-radius: 50%; background-color: white; top: 2px; left: 2px; transition: all 0.3s cubic-bezier(0.175, 0.885, 0.32, 1.275); box-shadow: 0 1px 2px rgba(0, 0, 0, 0.1); } .toggle-switch.active { background-color: #8b5cf6; /* purple-500 */ } .toggle-switch.active::after { left: 18px; } .toggle-label { position: absolute; width: 1px; height: 1px; padding: 0; margin: -1px; overflow: hidden; clip: rect(0, 0, 0, 0); white-space: nowrap; border-width: 0; } .toggle-text { font-size: 0.75rem; color: #6b7280; /* gray-500 */ margin-right: 6px; } .calc-result { color: rgb(75, 85, 99); /* gray-700 */ } .calc-formula { color: #6b83a6; /* Subtle bluish gray */ font-family: ui-monospace, SFMono-Regular, Menlo, Monaco, Consolas, monospace; } .calc-equals { color: rgb(75, 85, 99); /* gray-700 */ display: inline-block; margin: 0 4px; font-weight: normal; } .cell-highlighted { font-weight: bold !important; } .trigger-text-active { background-color: #E5E7EB; padding: 1px 3px; border-radius: 3px; transition: background-color 0.15s ease-in-out; } .trigger-text-active span[data-hover-text-color] { padding: 0; background-color: transparent; } /* Add styles for no-scroll option */ .table-wrapper-no-scroll { 
overflow-x: visible !important; } .table-no-scroll td { white-space: normal !important; word-break: break-word; } </style> <div id="fancy-table-Performance Metric,Random Read,Sequential Read,Random Write,Sequential Write-wrapper" class="px-4 rounded-lg __basic-table not-prose mt-4 mb-4 overflow-x-auto"> <table id="fancy-table-Performance Metric,Random Read,Sequential Read,Random Write,Sequential Write" class="min-w-full divide-y divide-gray-200 font-sans basic-table-striped"> <thead class="bg-gray-50"> <tr> <th class="px-6 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Performance Metric </th> <th class="px-6 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Random Read </th> <th class="px-6 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Sequential Read </th> <th class="px-6 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Random Write </th> <th class="px-6 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Sequential Write </th> </tr> </thead> <tbody> <tr class="border-b border-gray-200"> <td id="fancy-table-Performance Metric,Random Read,Sequential Read,Random Write,Sequential Write-row0-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">SATA SSD</span> </td> <td id="fancy-table-Performance Metric,Random Read,Sequential Read,Random Write,Sequential Write-row0-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">276 MB/s</span> </td> <td id="fancy-table-Performance Metric,Random Read,Sequential Read,Random Write,Sequential Write-row0-col2" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">450 MB/s</span> </td> <td id="fancy-table-Performance Metric,Random Read,Sequential Read,Random Write,Sequential Write-row0-col3" class="px-6 py-2 whitespace-nowrap text-sm 
font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">380 MB/s</span> </td> <td id="fancy-table-Performance Metric,Random Read,Sequential Read,Random Write,Sequential Write-row0-col4" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">72 MB/s</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="fancy-table-Performance Metric,Random Read,Sequential Read,Random Write,Sequential Write-row1-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">NVMe</span> </td> <td id="fancy-table-Performance Metric,Random Read,Sequential Read,Random Write,Sequential Write-row1-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">3.77 GB/s</span> </td> <td id="fancy-table-Performance Metric,Random Read,Sequential Read,Random Write,Sequential Write-row1-col2" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">6.2 GB/s</span> </td> <td id="fancy-table-Performance Metric,Random Read,Sequential Read,Random Write,Sequential Write-row1-col3" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">0.4 GB/s</span> </td> <td id="fancy-table-Performance Metric,Random Read,Sequential Read,Random Write,Sequential Write-row1-col4" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">2.3 GB/s</span> </td> </tr> </tbody> </table> </div> <p>These theoretical limits come from the <a href="https://servak.com.ua/image/manual/SSD/SSD_240GB_2.5_6G_INTEL_DC_S3520_SERIES_SATA_Quick_Specs_Servak_2.pdf">Intel DC S3520 SATA</a> and <a 
href="https://dl.dell.com/manuals/all-products/esuprt_data_center_infra_int/esuprt_data_center_infra_storage_adapters/dell-poweredge-exp-fsh-nvme-pcie-ssd_users-guide7_en-us.pdf">Dell Enterprise NVMe</a> specification sheets. In practice, our benchmarks will likely fall short of these numbers due to filesystem overhead, driver limitations, and real-world I/O patterns.</p> <p>The performance gap between SATA and NVMe storage is also immediately apparent: NVMe provides roughly 10-15x higher throughput for most operations, and this difference may affect how 3FS performs.</p> <h1 id="benchmarking-for-older-cluster">Benchmarking for Older Cluster</h1> <h2 id="local-fio-results">Local FIO results</h2> <details style="font-size: 1.2em; margin: 1em 0;"> <summary style="cursor: pointer; font-weight: bold;">Expand for instructions on how to interact with the graph</summary> <p><strong>Controls:</strong></p> <ul> <li><strong>Test Type menu</strong>: Switch between Random Read, Sequential Read, Random Write, and Sequential Write</li> <li><strong>Metric menu</strong>: Change between Bandwidth, IOPS, and various Latency measurements</li> </ul> <p><strong>3D Navigation:</strong></p> <ul> <li><strong>Click and drag</strong>: Rotate the view</li> <li><strong>Scroll wheel</strong>: Zoom in/out</li> <li><strong>Hover</strong>: See exact values for any data point</li> <li><strong>Double-click</strong>: Reset to default view</li> </ul> <p><strong>Axes:</strong></p> <ul> <li><strong>X-axis</strong>: IO Depth (1 to 128)</li> <li><strong>Y-axis</strong>: Number of Jobs (1 to 128)</li> <li><strong>Color</strong>: The selected metric value (blue = low, red = high)</li> </ul> </details> <h3 id="scaling-block-size-for-local-ssd">Scaling block size for local SSD</h3> <p>The first benchmark uses the older cluster to establish our local SSD baseline.
I’m testing how performance changes with different block sizes (4K, 64K, 1MB, 4MB) to understand the storage characteristics of a SATA SSD. The local SSD was configured with the XFS filesystem.</p> <p>This is a lot of data. Feel free to jump between the interactive graphs and the <a href="#storage-performance-analysis-for-local-ssd">performance analysis</a> to explore the patterns.</p> <!-- benchmark.html --> <link rel="stylesheet" href="/assets/css/benchmark.css" /> <script src="https://cdnjs.cloudflare.com/ajax/libs/plotly.js/2.27.1/plotly.min.js"></script> <script src="/assets/js/benchmark.js"></script> <div class="benchmark-container" id="benchmark-container-ssd_xfs_iouring_4k" data-path="/assets/images/posts/2025-03-13/fio/4k_ssd_xfs_iouring_xl170_1.json"> <h2>4K Block Size - SSD XFS with IO_URING (Older)</h2> <div class="controls"> <div class="control-group"> <label for="testType-ssd_xfs_iouring_4k">Test Type</label> <select id="testType-ssd_xfs_iouring_4k"> <option value="randread">Random Read</option> <option value="read" selected="">Sequential Read</option> <option value="randwrite">Random Write</option> <option value="write">Sequential Write</option> </select> </div> <div class="control-group"> <label for="metricType-ssd_xfs_iouring_4k">Metric</label> <select id="metricType-ssd_xfs_iouring_4k"> <option value="bandwidth" selected="">Bandwidth (GB/s)</option> <option value="iops">IOPS</option> <option value="latency">Latency (μs)</option> <option value="latency_p50">Latency p50 (μs)</option> <option value="latency_p90">Latency p90 (μs)</option> <option value="latency_p99">Latency p99 (μs)</option> </select> </div> </div> <div id="benchmark-plot-ssd_xfs_iouring_4k" class="plot-container lazy-load"></div> <!-- Draggable panel for latency data --> <div id="latencyPanel-ssd_xfs_iouring_4k" class="benchmark-draggable-panel"> <div id="panelHeader-ssd_xfs_iouring_4k" class="panel-header"> <h3 class="panel-title" id="panelTitle-ssd_xfs_iouring_4k">Latency
Percentiles</h3> <div class="panel-controls"> <button id="collapseBtn-ssd_xfs_iouring_4k" class="collapse-btn" title="Collapse">▲</button> <button id="closeLatencyBtn-ssd_xfs_iouring_4k" class="close-btn" title="Close">×</button> </div> </div> <div class="panel-content"> <div class="latency-details" id="latencyDetails-ssd_xfs_iouring_4k"></div> <div id="latencyPlot-ssd_xfs_iouring_4k" class="latency-plot-container"></div> </div> </div> <div class="benchmark-note"> <p>Small block (4K) performance using SSD with XFS filesystem and IO_URING driver on older cluster.</p> </div> <script> document.addEventListener('DOMContentLoaded', function() { // Create intersection observer for lazy loading const observer = new IntersectionObserver((entries, observer) => { entries.forEach(entry => { if (entry.isIntersecting) { const container = entry.target.closest('.benchmark-container'); const id = container.id.replace('benchmark-container-', ''); const plotEl = document.getElementById('benchmark-plot-' + id); // Check if already loading or loaded to prevent duplicate requests if (!plotEl.dataset.loading) { plotEl.dataset.loading = 'true'; // Load the benchmark data and initialize the plot loadBenchmarkData(id); } // Stop observing once we've started loading observer.unobserve(entry.target); } }); }, { rootMargin: '200px 0px', // Load when within 200px of viewport threshold: 0.01 }); // Start observing the plot container const plotContainer = document.getElementById('benchmark-plot-ssd_xfs_iouring_4k'); if (plotContainer) { observer.observe(plotContainer); } }); // Function to load benchmark data function loadBenchmarkData(id) { // Ensure benchmark.js is loaded before initializing function waitForBenchmarkJs() { if (typeof initBenchmarkPlot === 'function') { // Function exists, proceed with initialization const container = document.getElementById('benchmark-container-' + id); const dataPath = container.getAttribute('data-path'); if (dataPath) { // Use fetch API to load JSON data 
only when needed fetch(dataPath) .then(response => { if (!response.ok) { throw new Error(`HTTP error! Status: ${response.status}`); } return response.json(); }) .then(data => { // Store data and initialize the plot window['benchmarkData_' + id] = data; initBenchmarkPlot(id); document.getElementById('benchmark-plot-' + id).classList.remove('lazy-load'); }) .catch(error => { console.error('Error loading benchmark data:', error); document.getElementById('benchmark-plot-' + id).innerHTML = '<div style="padding: 20px; color: red;">Error loading benchmark data. Check console for details.</div>'; }); } else { console.error('No data source provided for benchmark visualization'); document.getElementById('benchmark-plot-' + id).innerHTML = '<div style="padding: 20px; color: red;">Error: No data source provided for benchmark visualization.</div>'; } } else { // Function not available yet, wait and try again setTimeout(() => waitForBenchmarkJs(), 100); } } waitForBenchmarkJs(); } </script> </div> <!-- benchmark.html --> <link rel="stylesheet" href="/assets/css/benchmark.css" /> <script src="https://cdnjs.cloudflare.com/ajax/libs/plotly.js/2.27.1/plotly.min.js"></script> <script src="/assets/js/benchmark.js"></script> <div class="benchmark-container" id="benchmark-container-ssd_xfs_iouring_64k" data-path="/assets/images/posts/2025-03-13/fio/64k_ssd_xfs_iouring_xl170_1.json"> <h2>64k Block Size - SSD XFS with IO_URING (Older)</h2> <div class="controls"> <div class="control-group"> <label for="testType-ssd_xfs_iouring_64k">Test Type</label> <select id="testType-ssd_xfs_iouring_64k"> <option value="randread">Random Read</option> <option value="read" selected="">Sequential Read</option> <option value="randwrite">Random Write</option> <option value="write">Sequential Write</option> </select> </div> <div class="control-group"> <label for="metricType-ssd_xfs_iouring_64k">Metric</label> <select id="metricType-ssd_xfs_iouring_64k"> <option value="bandwidth" selected="">Bandwidth 
(GB/s)</option> <option value="iops">IOPS</option> <option value="latency">Latency (μs)</option> <option value="latency_p50">Latency p50 (μs)</option> <option value="latency_p90">Latency p90 (μs)</option> <option value="latency_p99">Latency p99 (μs)</option> </select> </div> </div> <div id="benchmark-plot-ssd_xfs_iouring_64k" class="plot-container lazy-load"></div> <!-- Draggable panel for latency data --> <div id="latencyPanel-ssd_xfs_iouring_64k" class="benchmark-draggable-panel"> <div id="panelHeader-ssd_xfs_iouring_64k" class="panel-header"> <h3 class="panel-title" id="panelTitle-ssd_xfs_iouring_64k">Latency Percentiles</h3> <div class="panel-controls"> <button id="collapseBtn-ssd_xfs_iouring_64k" class="collapse-btn" title="Collapse">▲</button> <button id="closeLatencyBtn-ssd_xfs_iouring_64k" class="close-btn" title="Close">×</button> </div> </div> <div class="panel-content"> <div class="latency-details" id="latencyDetails-ssd_xfs_iouring_64k"></div> <div id="latencyPlot-ssd_xfs_iouring_64k" class="latency-plot-container"></div> </div> </div> <div class="benchmark-note"> <p>Small block (64k) performance using SSD with XFS filesystem and IO_URING driver on older cluster.</p> </div> <script> document.addEventListener('DOMContentLoaded', function() { // Create intersection observer for lazy loading const observer = new IntersectionObserver((entries, observer) => { entries.forEach(entry => { if (entry.isIntersecting) { const container = entry.target.closest('.benchmark-container'); const id = container.id.replace('benchmark-container-', ''); const plotEl = document.getElementById('benchmark-plot-' + id); // Check if already loading or loaded to prevent duplicate requests if (!plotEl.dataset.loading) { plotEl.dataset.loading = 'true'; // Load the benchmark data and initialize the plot loadBenchmarkData(id); } // Stop observing once we've started loading observer.unobserve(entry.target); } }); }, { rootMargin: '200px 0px', // Load when within 200px of viewport 
threshold: 0.01 }); // Start observing the plot container const plotContainer = document.getElementById('benchmark-plot-ssd_xfs_iouring_64k'); if (plotContainer) { observer.observe(plotContainer); } }); // Function to load benchmark data function loadBenchmarkData(id) { // Ensure benchmark.js is loaded before initializing function waitForBenchmarkJs() { if (typeof initBenchmarkPlot === 'function') { // Function exists, proceed with initialization const container = document.getElementById('benchmark-container-' + id); const dataPath = container.getAttribute('data-path'); if (dataPath) { // Use fetch API to load JSON data only when needed fetch(dataPath) .then(response => { if (!response.ok) { throw new Error(`HTTP error! Status: ${response.status}`); } return response.json(); }) .then(data => { // Store data and initialize the plot window['benchmarkData_' + id] = data; initBenchmarkPlot(id); document.getElementById('benchmark-plot-' + id).classList.remove('lazy-load'); }) .catch(error => { console.error('Error loading benchmark data:', error); document.getElementById('benchmark-plot-' + id).innerHTML = '<div style="padding: 20px; color: red;">Error loading benchmark data. 
Check console for details.</div>'; }); } else { console.error('No data source provided for benchmark visualization'); document.getElementById('benchmark-plot-' + id).innerHTML = '<div style="padding: 20px; color: red;">Error: No data source provided for benchmark visualization.</div>'; } } else { // Function not available yet, wait and try again setTimeout(() => waitForBenchmarkJs(), 100); } } waitForBenchmarkJs(); } </script> </div> <!-- benchmark.html --> <div class="benchmark-container" id="benchmark-container-ssd_xfs_iouring_1m" data-path="/assets/images/posts/2025-03-13/fio/1m_ssd_xfs_iouring_xl170_1.json"> <h2>1M Block Size - SSD XFS with IO_URING (Older)</h2> <div class="controls"> <div class="control-group"> <label for="testType-ssd_xfs_iouring_1m">Test Type</label> <select id="testType-ssd_xfs_iouring_1m"> <option value="randread">Random Read</option> <option value="read" selected="">Sequential Read</option> <option value="randwrite">Random Write</option> <option value="write">Sequential Write</option> </select> </div> <div class="control-group"> <label for="metricType-ssd_xfs_iouring_1m">Metric</label> <select id="metricType-ssd_xfs_iouring_1m"> <option value="bandwidth" selected="">Bandwidth (GB/s)</option> <option value="iops">IOPS</option> <option value="latency">Latency (μs)</option> <option value="latency_p50">Latency p50 (μs)</option> <option value="latency_p90">Latency p90 (μs)</option> <option value="latency_p99">Latency p99 (μs)</option> </select> </div> </div> <div id="benchmark-plot-ssd_xfs_iouring_1m" class="plot-container lazy-load"></div> <!-- Draggable panel for latency data --> <div id="latencyPanel-ssd_xfs_iouring_1m" class="benchmark-draggable-panel"> <div id="panelHeader-ssd_xfs_iouring_1m" class="panel-header"> <h3 class="panel-title" id="panelTitle-ssd_xfs_iouring_1m">Latency Percentiles</h3> <div class="panel-controls"> <button id="collapseBtn-ssd_xfs_iouring_1m" class="collapse-btn" title="Collapse">▲</button> <button 
id="closeLatencyBtn-ssd_xfs_iouring_1m" class="close-btn" title="Close">×</button> </div> </div> <div class="panel-content"> <div class="latency-details" id="latencyDetails-ssd_xfs_iouring_1m"></div> <div id="latencyPlot-ssd_xfs_iouring_1m" class="latency-plot-container"></div> </div> </div> <div class="benchmark-note"> <p>Performance characteristics of SSD with XFS filesystem using IO_URING driver with 1M block size on older cluster.</p> </div> <script> document.addEventListener('DOMContentLoaded', function() { // Create intersection observer for lazy loading const observer = new IntersectionObserver((entries, observer) => { entries.forEach(entry => { if (entry.isIntersecting) { const container = entry.target.closest('.benchmark-container'); const id = container.id.replace('benchmark-container-', ''); const plotEl = document.getElementById('benchmark-plot-' + id); // Check if already loading or loaded to prevent duplicate requests if (!plotEl.dataset.loading) { plotEl.dataset.loading = 'true'; // Load the benchmark data and initialize the plot loadBenchmarkData(id); } // Stop observing once we've started loading observer.unobserve(entry.target); } }); }, { rootMargin: '200px 0px', // Load when within 200px of viewport threshold: 0.01 }); // Start observing the plot container const plotContainer = document.getElementById('benchmark-plot-ssd_xfs_iouring_1m'); if (plotContainer) { observer.observe(plotContainer); } }); // Function to load benchmark data function loadBenchmarkData(id) { // Ensure benchmark.js is loaded before initializing function waitForBenchmarkJs() { if (typeof initBenchmarkPlot === 'function') { // Function exists, proceed with initialization const container = document.getElementById('benchmark-container-' + id); const dataPath = container.getAttribute('data-path'); if (dataPath) { // Use fetch API to load JSON data only when needed fetch(dataPath) .then(response => { if (!response.ok) { throw new Error(`HTTP error! 
Status: ${response.status}`); } return response.json(); }) .then(data => { // Store data and initialize the plot window['benchmarkData_' + id] = data; initBenchmarkPlot(id); document.getElementById('benchmark-plot-' + id).classList.remove('lazy-load'); }) .catch(error => { console.error('Error loading benchmark data:', error); document.getElementById('benchmark-plot-' + id).innerHTML = '<div style="padding: 20px; color: red;">Error loading benchmark data. Check console for details.</div>'; }); } else { console.error('No data source provided for benchmark visualization'); document.getElementById('benchmark-plot-' + id).innerHTML = '<div style="padding: 20px; color: red;">Error: No data source provided for benchmark visualization.</div>'; } } else { // Function not available yet, wait and try again setTimeout(() => waitForBenchmarkJs(), 100); } } waitForBenchmarkJs(); } </script> </div> <!-- benchmark.html --> <div class="benchmark-container" id="benchmark-container-ssd_xfs_iouring_4m" data-path="/assets/images/posts/2025-03-13/fio/4m_ssd_xfs_iouring_xl170_1.json"> <h2>4m Block Size - SSD XFS with IO_URING (Older)</h2> <div class="controls"> <div class="control-group"> <label for="testType-ssd_xfs_iouring_4m">Test Type</label> <select id="testType-ssd_xfs_iouring_4m"> <option value="randread">Random Read</option> <option value="read" selected="">Sequential Read</option> <option value="randwrite">Random Write</option> <option value="write">Sequential Write</option> </select> </div> <div class="control-group"> <label for="metricType-ssd_xfs_iouring_4m">Metric</label> <select id="metricType-ssd_xfs_iouring_4m"> <option value="bandwidth" selected="">Bandwidth (GB/s)</option> <option value="iops">IOPS</option> <option value="latency">Latency (μs)</option> <option value="latency_p50">Latency p50 (μs)</option> <option value="latency_p90">Latency p90 (μs)</option> <option value="latency_p99">Latency p99 (μs)</option> </select> </div> </div> <div 
id="benchmark-plot-ssd_xfs_iouring_4m" class="plot-container lazy-load"></div> <!-- Draggable panel for latency data --> <div id="latencyPanel-ssd_xfs_iouring_4m" class="benchmark-draggable-panel"> <div id="panelHeader-ssd_xfs_iouring_4m" class="panel-header"> <h3 class="panel-title" id="panelTitle-ssd_xfs_iouring_4m">Latency Percentiles</h3> <div class="panel-controls"> <button id="collapseBtn-ssd_xfs_iouring_4m" class="collapse-btn" title="Collapse">▲</button> <button id="closeLatencyBtn-ssd_xfs_iouring_4m" class="close-btn" title="Close">×</button> </div> </div> <div class="panel-content"> <div class="latency-details" id="latencyDetails-ssd_xfs_iouring_4m"></div> <div id="latencyPlot-ssd_xfs_iouring_4m" class="latency-plot-container"></div> </div> </div> <div class="benchmark-note"> <p>Performance characteristics of SSD with XFS filesystem using IO_URING driver with 4m block size on older cluster.</p> </div> <script> document.addEventListener('DOMContentLoaded', function() { // Create intersection observer for lazy loading const observer = new IntersectionObserver((entries, observer) => { entries.forEach(entry => { if (entry.isIntersecting) { const container = entry.target.closest('.benchmark-container'); const id = container.id.replace('benchmark-container-', ''); const plotEl = document.getElementById('benchmark-plot-' + id); // Check if already loading or loaded to prevent duplicate requests if (!plotEl.dataset.loading) { plotEl.dataset.loading = 'true'; // Load the benchmark data and initialize the plot loadBenchmarkData(id); } // Stop observing once we've started loading observer.unobserve(entry.target); } }); }, { rootMargin: '200px 0px', // Load when within 200px of viewport threshold: 0.01 }); // Start observing the plot container const plotContainer = document.getElementById('benchmark-plot-ssd_xfs_iouring_4m'); if (plotContainer) { observer.observe(plotContainer); } }); // Function to load benchmark data function loadBenchmarkData(id) { // Ensure 
benchmark.js is loaded before initializing function waitForBenchmarkJs() { if (typeof initBenchmarkPlot === 'function') { // Function exists, proceed with initialization const container = document.getElementById('benchmark-container-' + id); const dataPath = container.getAttribute('data-path'); if (dataPath) { // Use fetch API to load JSON data only when needed fetch(dataPath) .then(response => { if (!response.ok) { throw new Error(`HTTP error! Status: ${response.status}`); } return response.json(); }) .then(data => { // Store data and initialize the plot window['benchmarkData_' + id] = data; initBenchmarkPlot(id); document.getElementById('benchmark-plot-' + id).classList.remove('lazy-load'); }) .catch(error => { console.error('Error loading benchmark data:', error); document.getElementById('benchmark-plot-' + id).innerHTML = '<div style="padding: 20px; color: red;">Error loading benchmark data. Check console for details.</div>'; }); } else { console.error('No data source provided for benchmark visualization'); document.getElementById('benchmark-plot-' + id).innerHTML = '<div style="padding: 20px; color: red;">Error: No data source provided for benchmark visualization.</div>'; } } else { // Function not available yet, wait and try again setTimeout(() => waitForBenchmarkJs(), 100); } } waitForBenchmarkJs(); } </script> </div> <h3 id="storage-performance-analysis-for-local-ssd">Storage Performance Analysis for Local SSD</h3> <p>Let’s examine how performance changes across different block sizes by looking at a specific configuration point: various IO depths at 1 job<span class="sidenote-ref"></span><span class="sidenote">Why 1 job? This removes one variable from our analysis, allowing us to focus on how IO depth affects performance. 
We’ll explore job scaling separately</span>.</p> <div class="image-container" style="max-width: 100%; overflow: visible;"> <img loading="lazy" src="/assets/images/posts/2025-03-13/part3/throughput_versus_latency_explain.svg" style="width: 120%; margin-left: calc((100% - 120%) / 2);" alt="" /> <div class="caption"> <em>Throughput versus latency graph for random reads </em> </div> </div> <p>This graph reveals the classic throughput versus latency tradeoff for our SATA SSD<span class="sidenote-ref"></span><span class="sidenote">These plots are fundamental to understanding storage performance - they show exactly when a system hits diminishing returns</span>. The Y-axis shows throughput (higher is better), while the X-axis shows latency (lower is better). Each colored line represents a different block size, with dots marking increasing IO depths.</p> <p>First, let’s examine each axis independently:</p> <ul> <li>Y-axis (Throughput): 64K block sizes achieve the highest peak at 400 MB/s, while other sizes fall short: 4K reaches 250 MB/s, 1M hits 325 MB/s, and 4M peaks at 350 MB/s</li> <li>X-axis (Latency): Large block sizes (1M and 4M) show dramatically higher latency (80ms+) compared to smaller block sizes (4K and 64K)</li> </ul> <p>The cool thing about throughput versus latency graphs is that there’s a knee point – where throughput stops increasing but latency continues climbing<span class="sidenote-ref"></span><span class="sidenote">Certain systems even decrease throughput after this point as they may need to do additional work to manage work items</span>.
For 64K blocks, this occurs around IO depth 16-32, where we achieve ~400 MB/s at &lt; 10ms.</p> <div class="image-container" style="max-width: 100%; overflow: visible;"> <img loading="lazy" src="/assets/images/posts/2025-03-13/part3/knee_point.svg" style="width: 120%; margin-left: calc((100% - 120%) / 2);" alt="" /> <div class="caption"> <em>Knee point for throughput versus latency graph </em> </div> </div> <details style="font-size: 1.2em; margin: 1em 0;"> <summary style="cursor: pointer; font-weight: bold;">Expand to view throughput versus latency graphs for other workloads</summary> <div class="image-container" style="max-width: 100%; overflow: visible;"> <img loading="lazy" src="/assets/images/posts/2025-03-13/part3/fio_ssd/read_throughput_vs_latency_all_depths_1jobs.png" style="width: 105%; margin-left: calc((100% - 105%) / 2);" alt="" /> <div class="caption"> <em>Throughput versus latency graph for sequential reads </em> </div> </div> <div class="image-container" style="max-width: 100%; overflow: visible;"> <img loading="lazy" src="/assets/images/posts/2025-03-13/part3/fio_ssd/randwrite_throughput_vs_latency_all_depths_1jobs.png" style="width: 105%; margin-left: calc((100% - 105%) / 2);" alt="" /> <div class="caption"> <em>Throughput versus latency graph for random writes </em> </div> </div> <div class="image-container" style="max-width: 100%; overflow: visible;"> <img loading="lazy" src="/assets/images/posts/2025-03-13/part3/fio_ssd/write_throughput_vs_latency_all_depths_1jobs.png" style="width: 105%; margin-left: calc((100% - 105%) / 2);" alt="" /> <div class="caption"> <em>Throughput versus latency graph for sequential writes </em> </div> </div> </details> <details style="font-size: 1.2em; margin: 1em 0;"> <summary style="cursor: pointer; font-weight: bold;">Expand to view throughput versus latency graphs scaling num jobs for random reads</summary> <div class="image-container" style="max-width: 100%; overflow: visible;"> <img loading="lazy" 
src="/assets/images/posts/2025-03-13/part3/fio_ssd/randread_throughput_vs_latency_all_depths_1jobs.png" style="width: 105%; margin-left: calc((100% - 105%) / 2);" alt="" /> <div class="caption"> <em>Throughput versus latency graph for random reads for 1 numjobs </em> </div> </div> <div class="image-container" style="max-width: 100%; overflow: visible;"> <img loading="lazy" src="/assets/images/posts/2025-03-13/part3/fio_ssd/randread_throughput_vs_latency_all_depths_2jobs.png" style="width: 105%; margin-left: calc((100% - 105%) / 2);" alt="" /> <div class="caption"> <em>Throughput versus latency graph for random reads for 2 numjobs </em> </div> </div> <div class="image-container" style="max-width: 100%; overflow: visible;"> <img loading="lazy" src="/assets/images/posts/2025-03-13/part3/fio_ssd/randread_throughput_vs_latency_all_depths_4jobs.png" style="width: 105%; margin-left: calc((100% - 105%) / 2);" alt="" /> <div class="caption"> <em>Throughput versus latency graph for random reads for 4 numjobs </em> </div> </div> <div class="image-container" style="max-width: 100%; overflow: visible;"> <img loading="lazy" src="/assets/images/posts/2025-03-13/part3/fio_ssd/randread_throughput_vs_latency_all_depths_8jobs.png" style="width: 105%; margin-left: calc((100% - 105%) / 2);" alt="" /> <div class="caption"> <em>Throughput versus latency graph for random reads for 8 numjobs </em> </div> </div> </details> <p>These measurements reveal something frustrating but also quite interesting: there’s no universal sweet spot. What works best depends entirely on whether you care more about latency or throughput, which in turn depends on what your workload looks like.</p> <p>A couple of interesting things to observe:</p> <ul> <li>Latency increases by different amounts as block size increases</li> <li>Latency doubles as numjobs increases</li> <li>There’s no single block size that’s optimal for bandwidth across workloads. For random reads, it’s 64k.
For sequential reads, it’s 4k.</li> <li>For the lowest latency, use a smaller block size, but the SSD most likely won’t fully saturate its bandwidth.</li> <li>Writes have different knee points than reads (for example, the 4k sequential-write knee point caps at 150 MB/s while 4k sequential reads cap at 300 MB/s)</li> </ul> <p>With these patterns established, let’s examine the NVMe fio benchmarks to see whether these observations hold true or if new patterns emerge.</p> <details style="font-size: 1.2em; margin: 1em 0;"> <summary style="cursor: pointer; font-weight: bold;">Double checking if libaio makes any difference</summary> <p>The performance shown in the graphs above represents io_uring. Are there any differences with another async I/O library (libaio)?</p> <link rel="stylesheet" href="/assets/css/fancy_table.css" /> <script> function toggleCalculations(tableId) { // ... (toggleCalculations function remains the same) const table = document.getElementById(tableId); const cells = table.querySelectorAll('.has-calculation'); const tableWrapper = document.getElementById(tableId + '-wrapper'); const toggleSwitch = document.getElementById(tableId + '-switch'); const toggleText = document.getElementById(tableId + '-toggle-text'); if (tableWrapper.classList.contains('show-calculations')) { tableWrapper.classList.remove('show-calculations'); toggleSwitch.classList.remove('active'); toggleText.textContent = "Show calculations"; cells.forEach(cell => { const normalText = cell.querySelector('.normal-text'); const calculationText = cell.querySelector('.calculation-text'); normalText.style.display = 'inline'; calculationText.style.display = 'none'; }); } else { tableWrapper.classList.add('show-calculations'); toggleSwitch.classList.add('active'); toggleText.textContent = "Hide calculations"; cells.forEach(cell => { const normalText = cell.querySelector('.normal-text'); const calculationText = cell.querySelector('.calculation-text'); normalText.style.display = 'none';
calculationText.style.display = 'inline'; }); } } function isConsideredYellow(colorString) { if (!colorString) return false; const lowerColor = colorString.toLowerCase(); const yellowKeywords = ['yellow', 'gold', 'lemon', 'chiffon', 'goldenrod', 'papayawhip', 'moccasin', 'khaki', '#ff0', '#ffff00', '#ffffe0', '#fffacd', '#fafad2', '#fff8dc', '#eee8aa', '#f0e68c']; if (yellowKeywords.some(k => lowerColor.includes(k))) { return true; } const match = lowerColor.match(/rgba?\((\d+),\s*(\d+),\s*(\d+)/); if (match) { const r = parseInt(match[1]); const g = parseInt(match[2]); const b = parseInt(match[3]); if (r > 200 && g > 180 && b < 200 && Math.abs(r - g) < 70) { return true; } } if (lowerColor === '#ffdab9') return true; // PeachPuff if (lowerColor === 'rgba(255,255,0,0.2)') return true; // Example: semi-transparent yellow return false; } function initializeCellHighlighters() { const highlighters = document.querySelectorAll('[data-highlight-cells]'); highlighters.forEach(highlighter => { if (highlighter.dataset.highlighterInitialized === 'true') return; highlighter.dataset.highlighterInitialized = 'true'; const cellIdsToHighlight = highlighter.dataset.highlightCells.split(',').map(id => id.trim()); const cellElements = cellIdsToHighlight.map(id => document.getElementById(id)).filter(el => el); let descriptiveSpans = []; if (highlighter.matches('span[data-hover-text-color]')) { descriptiveSpans.push(highlighter); } else { descriptiveSpans = Array.from(highlighter.querySelectorAll('span[data-hover-text-color]')); } descriptiveSpans.forEach(span => { if (typeof span.dataset.originalParaTextColor === 'undefined') { span.dataset.originalParaTextColor = window.getComputedStyle(span).color || ''; } }); highlighter.addEventListener('mouseenter', () => { highlighter.classList.add('trigger-text-active'); descriptiveSpans.forEach(span => { if (span.dataset.hoverTextColor) { span.style.color = span.dataset.hoverTextColor; } }); cellElements.forEach((cell, idx) => { 
cell.classList.add('cell-highlighted'); let hoverTextColorForCell = null; let hoverCellBgFromDescSpan = null; if (idx < descriptiveSpans.length) { const descSpan = descriptiveSpans[idx]; if (descSpan.dataset.hoverTextColor) { hoverTextColorForCell = descSpan.dataset.hoverTextColor; } if (descSpan.dataset.hoverCellBg) { hoverCellBgFromDescSpan = descSpan.dataset.hoverCellBg; } } else if (descriptiveSpans.length === 1 && cellElements.length > 1) { const descSpan = descriptiveSpans[0]; if (descSpan.dataset.hoverTextColor) { hoverTextColorForCell = descSpan.dataset.hoverTextColor; } if (descSpan.dataset.hoverCellBg) { hoverCellBgFromDescSpan = descSpan.dataset.hoverCellBg; } } if (hoverTextColorForCell) { const isCalcCell = cell.classList.contains('has-calculation'); if (isCalcCell) { const normalTextSpan = cell.querySelector('.normal-text'); const calcResultSpan = cell.querySelector('.calc-result'); const calcWrapper = cell.querySelector('.calculation-text'); if (calcWrapper && window.getComputedStyle(calcWrapper).display !== 'none') { if (calcResultSpan) calcResultSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } else { if (normalTextSpan) normalTextSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } } else { const directSpan = cell.querySelector(':scope > span'); if (directSpan) directSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } } if (hoverCellBgFromDescSpan && isConsideredYellow(hoverCellBgFromDescSpan)) { if (typeof cell.dataset.originalBgColor === 'undefined') { cell.dataset.originalBgColor = cell.style.backgroundColor; } cell.style.backgroundColor = hoverCellBgFromDescSpan; cell.dataset.bgChangedByScript = 'true'; } }); }); highlighter.addEventListener('mouseleave', () => { highlighter.classList.remove('trigger-text-active'); descriptiveSpans.forEach(span => { if (typeof span.dataset.originalParaTextColor !== 'undefined') { span.style.color = span.dataset.originalParaTextColor; } }); 
cellElements.forEach((cell, idx) => { cell.classList.remove('cell-highlighted'); const isCalcCell = cell.classList.contains('has-calculation'); if (isCalcCell) { const normalTextSpan = cell.querySelector('.normal-text'); const calcResultSpan = cell.querySelector('.calc-result'); if (normalTextSpan) normalTextSpan.style.removeProperty('color'); if (calcResultSpan) calcResultSpan.style.removeProperty('color'); } else { const directSpan = cell.querySelector(':scope > span'); if (directSpan) directSpan.style.removeProperty('color'); } if (cell.dataset.bgChangedByScript === 'true') { cell.style.backgroundColor = cell.dataset.originalBgColor || ''; delete cell.dataset.originalBgColor; delete cell.dataset.bgChangedByScript; } }); }); }); } if (document.readyState === 'loading') { document.addEventListener('DOMContentLoaded', initializeCellHighlighters); } else { initializeCellHighlighters(); } </script> <style> /* ... (Other styles remain the same) ... */ .toggle-container { display: flex; justify-content: flex-end; margin-bottom: 6px; } .calc-toggle { display: inline-flex; align-items: center; } .toggle-switch { position: relative; display: inline-block; width: 32px; height: 16px; background-color: #e5e7eb; /* gray-200 */ border-radius: 10px; transition: all 0.3s; cursor: pointer; } .toggle-switch::after { content: ''; position: absolute; width: 12px; height: 12px; border-radius: 50%; background-color: white; top: 2px; left: 2px; transition: all 0.3s cubic-bezier(0.175, 0.885, 0.32, 1.275); box-shadow: 0 1px 2px rgba(0, 0, 0, 0.1); } .toggle-switch.active { background-color: #8b5cf6; /* purple-500 */ } .toggle-switch.active::after { left: 18px; } .toggle-label { position: absolute; width: 1px; height: 1px; padding: 0; margin: -1px; overflow: hidden; clip: rect(0, 0, 0, 0); white-space: nowrap; border-width: 0; } .toggle-text { font-size: 0.75rem; color: #6b7280; /* gray-500 */ margin-right: 6px; } .calc-result { color: rgb(75, 85, 99); /* gray-700 */ } .calc-formula { 
color: #6b83a6; /* Subtle bluish gray */ font-family: ui-monospace, SFMono-Regular, Menlo, Monaco, Consolas, monospace; } .calc-equals { color: rgb(75, 85, 99); /* gray-700 */ display: inline-block; margin: 0 4px; font-weight: normal; } .cell-highlighted { font-weight: bold !important; } .trigger-text-active { background-color: #E5E7EB; padding: 1px 3px; border-radius: 3px; transition: background-color 0.15s ease-in-out; } .trigger-text-active span[data-hover-text-color] { padding: 0; background-color: transparent; } /* Add styles for no-scroll option */ .table-wrapper-no-scroll { overflow-x: visible !important; } .table-no-scroll td { white-space: normal !important; word-break: break-word; } </style> <div id="sata-ssd-performance-table-wrapper" class="px-4 rounded-lg __basic-table not-prose mt-2 mb-2 overflow-x-auto"> <table id="sata-ssd-performance-table" class="min-w-full divide-y divide-gray-200 font-sans basic-table-striped"> <thead class="bg-gray-50"> <tr> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Workload </th> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Configuration </th> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Bandwidth (MB/s) </th> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Avg Latency (ms) </th> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> P99 Latency (ms) </th> </tr> </thead> <tbody> <tr class="border-b border-gray-200"> <td id="sata-ssd-performance-table-row0-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Random Read</span> </td> <td id="sata-ssd-performance-table-row0-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">4K iouring</span> </td> <td id="sata-ssd-performance-table-row0-col2" class="px-4 py-2 
whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">242.6</span> </td> <td id="sata-ssd-performance-table-row0-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">2.01</span> </td> <td id="sata-ssd-performance-table-row0-col4" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">3.29</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="sata-ssd-performance-table-row1-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Random Read</span> </td> <td id="sata-ssd-performance-table-row1-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">1M iouring</span> </td> <td id="sata-ssd-performance-table-row1-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">329.8</span> </td> <td id="sata-ssd-performance-table-row1-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">377.50</span> </td> <td id="sata-ssd-performance-table-row1-col4" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">484.44</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="sata-ssd-performance-table-row2-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Random Read</span> </td> <td id="sata-ssd-performance-table-row2-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">4K libaio</span> </td> <td id="sata-ssd-performance-table-row2-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium 
text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">240.7</span> </td> <td id="sata-ssd-performance-table-row2-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">2.02</span> </td> <td id="sata-ssd-performance-table-row2-col4" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">3.32</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="sata-ssd-performance-table-row3-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Random Read</span> </td> <td id="sata-ssd-performance-table-row3-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">1M libaio</span> </td> <td id="sata-ssd-performance-table-row3-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">329.9</span> </td> <td id="sata-ssd-performance-table-row3-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">378.01</span> </td> <td id="sata-ssd-performance-table-row3-col4" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">488.64</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="sata-ssd-performance-table-row4-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Random Write</span> </td> <td id="sata-ssd-performance-table-row4-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">4K iouring</span> </td> <td id="sata-ssd-performance-table-row4-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: 
rgb(75, 85, 99) !important;">153.6</span> </td> <td id="sata-ssd-performance-table-row4-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">3.18</span> </td> <td id="sata-ssd-performance-table-row4-col4" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">5.47</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="sata-ssd-performance-table-row5-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Random Write</span> </td> <td id="sata-ssd-performance-table-row5-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">1M iouring</span> </td> <td id="sata-ssd-performance-table-row5-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">159.6</span> </td> <td id="sata-ssd-performance-table-row5-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">780.23</span> </td> <td id="sata-ssd-performance-table-row5-col4" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">977.27</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="sata-ssd-performance-table-row6-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Random Write</span> </td> <td id="sata-ssd-performance-table-row6-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">4K libaio</span> </td> <td id="sata-ssd-performance-table-row6-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) 
!important;">151.3</span> </td> <td id="sata-ssd-performance-table-row6-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">3.23</span> </td> <td id="sata-ssd-performance-table-row6-col4" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">5.55</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="sata-ssd-performance-table-row7-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Random Write</span> </td> <td id="sata-ssd-performance-table-row7-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">1M libaio</span> </td> <td id="sata-ssd-performance-table-row7-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">153.3</span> </td> <td id="sata-ssd-performance-table-row7-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">855.61</span> </td> <td id="sata-ssd-performance-table-row7-col4" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">935.33</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="sata-ssd-performance-table-row8-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Sequential Read</span> </td> <td id="sata-ssd-performance-table-row8-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">4K iouring</span> </td> <td id="sata-ssd-performance-table-row8-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">410.7</span> </td> <td 
id="sata-ssd-performance-table-row8-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">1.22</span> </td> <td id="sata-ssd-performance-table-row8-col4" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">1.97</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="sata-ssd-performance-table-row9-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Sequential Read</span> </td> <td id="sata-ssd-performance-table-row9-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">1M iouring</span> </td> <td id="sata-ssd-performance-table-row9-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">276.7</span> </td> <td id="sata-ssd-performance-table-row9-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">460.66</span> </td> <td id="sata-ssd-performance-table-row9-col4" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">488.64</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="sata-ssd-performance-table-row10-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Sequential Read</span> </td> <td id="sata-ssd-performance-table-row10-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">4K libaio</span> </td> <td id="sata-ssd-performance-table-row10-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">402.1</span> </td> <td 
id="sata-ssd-performance-table-row10-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">1.22</span> </td> <td id="sata-ssd-performance-table-row10-col4" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">2.01</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="sata-ssd-performance-table-row11-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Sequential Read</span> </td> <td id="sata-ssd-performance-table-row11-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">1M libaio</span> </td> <td id="sata-ssd-performance-table-row11-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">270.3</span> </td> <td id="sata-ssd-performance-table-row11-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">467.39</span> </td> <td id="sata-ssd-performance-table-row11-col4" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">497.03</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="sata-ssd-performance-table-row12-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Sequential Write</span> </td> <td id="sata-ssd-performance-table-row12-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">4K iouring</span> </td> <td id="sata-ssd-performance-table-row12-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">148.2</span> </td> <td 
id="sata-ssd-performance-table-row12-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">3.30</span> </td> <td id="sata-ssd-performance-table-row12-col4" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">5.47</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="sata-ssd-performance-table-row13-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Sequential Write</span> </td> <td id="sata-ssd-performance-table-row13-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">1M iouring</span> </td> <td id="sata-ssd-performance-table-row13-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">143.7</span> </td> <td id="sata-ssd-performance-table-row13-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">866.88</span> </td> <td id="sata-ssd-performance-table-row13-col4" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">935.33</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="sata-ssd-performance-table-row14-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Sequential Write</span> </td> <td id="sata-ssd-performance-table-row14-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">4K libaio</span> </td> <td id="sata-ssd-performance-table-row14-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">147.7</span> </td> <td 
id="sata-ssd-performance-table-row14-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">3.29</span> </td> <td id="sata-ssd-performance-table-row14-col4" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">5.44</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="sata-ssd-performance-table-row15-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Sequential Write</span> </td> <td id="sata-ssd-performance-table-row15-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">1M libaio</span> </td> <td id="sata-ssd-performance-table-row15-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">145.6</span> </td> <td id="sata-ssd-performance-table-row15-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">855.61</span> </td> <td id="sata-ssd-performance-table-row15-col4" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">960.50</span> </td> </tr> </tbody> </table> </div> <!-- benchmark.html --> <div class="benchmark-container" id="benchmark-container-ssd_xfs_libaio_4k" data-path="/assets/images/posts/2025-03-13/fio/4k_ssd_xfs_libaio_xl170_1.json"> <h2>4K Block Size - SSD XFS with LIBAIO (Older)</h2> <div class="controls"> <div class="control-group"> <label for="testType-ssd_xfs_libaio_4k">Test Type</label> <select id="testType-ssd_xfs_libaio_4k"> <option value="randread">Random Read</option> <option value="read" selected="">Sequential Read</option> <option value="randwrite">Random Write</option> <option value="write">Sequential Write</option> </select> </div> <div 
class="control-group"> <label for="metricType-ssd_xfs_libaio_4k">Metric</label> <select id="metricType-ssd_xfs_libaio_4k"> <option value="bandwidth" selected="">Bandwidth (GB/s)</option> <option value="iops">IOPS</option> <option value="latency">Latency (μs)</option> <option value="latency_p50">Latency p50 (μs)</option> <option value="latency_p90">Latency p90 (μs)</option> <option value="latency_p99">Latency p99 (μs)</option> </select> </div> </div> <div id="benchmark-plot-ssd_xfs_libaio_4k" class="plot-container lazy-load"></div> <!-- Draggable panel for latency data --> <div id="latencyPanel-ssd_xfs_libaio_4k" class="benchmark-draggable-panel"> <div id="panelHeader-ssd_xfs_libaio_4k" class="panel-header"> <h3 class="panel-title" id="panelTitle-ssd_xfs_libaio_4k">Latency Percentiles</h3> <div class="panel-controls"> <button id="collapseBtn-ssd_xfs_libaio_4k" class="collapse-btn" title="Collapse">▲</button> <button id="closeLatencyBtn-ssd_xfs_libaio_4k" class="close-btn" title="Close">×</button> </div> </div> <div class="panel-content"> <div class="latency-details" id="latencyDetails-ssd_xfs_libaio_4k"></div> <div id="latencyPlot-ssd_xfs_libaio_4k" class="latency-plot-container"></div> </div> </div> <div class="benchmark-note"> <p>Performance comparison of SSD with XFS filesystem using LIBAIO driver with 4k block size on older cluster.</p> </div> <script> document.addEventListener('DOMContentLoaded', function() { // Create intersection observer for lazy loading const observer = new IntersectionObserver((entries, observer) => { entries.forEach(entry => { if (entry.isIntersecting) { const container = entry.target.closest('.benchmark-container'); const id = container.id.replace('benchmark-container-', ''); const plotEl = document.getElementById('benchmark-plot-' + id); // Check if already loading or loaded to prevent duplicate requests if (!plotEl.dataset.loading) { plotEl.dataset.loading = 'true'; // Load the benchmark data and initialize the plot 
loadBenchmarkData(id); } // Stop observing once we've started loading observer.unobserve(entry.target); } }); }, { rootMargin: '200px 0px', // Load when within 200px of viewport threshold: 0.01 }); // Start observing the plot container const plotContainer = document.getElementById('benchmark-plot-ssd_xfs_libaio_4k'); if (plotContainer) { observer.observe(plotContainer); } }); // Function to load benchmark data function loadBenchmarkData(id) { // Ensure benchmark.js is loaded before initializing function waitForBenchmarkJs() { if (typeof initBenchmarkPlot === 'function') { // Function exists, proceed with initialization const container = document.getElementById('benchmark-container-' + id); const dataPath = container.getAttribute('data-path'); if (dataPath) { // Use fetch API to load JSON data only when needed fetch(dataPath) .then(response => { if (!response.ok) { throw new Error(`HTTP error! Status: ${response.status}`); } return response.json(); }) .then(data => { // Store data and initialize the plot window['benchmarkData_' + id] = data; initBenchmarkPlot(id); document.getElementById('benchmark-plot-' + id).classList.remove('lazy-load'); }) .catch(error => { console.error('Error loading benchmark data:', error); document.getElementById('benchmark-plot-' + id).innerHTML = '<div style="padding: 20px; color: red;">Error loading benchmark data. 
Check console for details.</div>'; }); } else { console.error('No data source provided for benchmark visualization'); document.getElementById('benchmark-plot-' + id).innerHTML = '<div style="padding: 20px; color: red;">Error: No data source provided for benchmark visualization.</div>'; } } else { // Function not available yet, wait and try again setTimeout(() => waitForBenchmarkJs(), 100); } } waitForBenchmarkJs(); } </script> </div> <!-- benchmark.html --> <div class="benchmark-container" id="benchmark-container-ssd_xfs_libaio_1m" data-path="/assets/images/posts/2025-03-13/fio/1m_ssd_xfs_libaio_xl170_1.json"> <h2>1M Block Size - SSD XFS with LIBAIO (Older)</h2> <div class="controls"> <div class="control-group"> <label for="testType-ssd_xfs_libaio_1m">Test Type</label> <select id="testType-ssd_xfs_libaio_1m"> <option value="randread">Random Read</option> <option value="read" selected="">Sequential Read</option> <option value="randwrite">Random Write</option> <option value="write">Sequential Write</option> </select> </div> <div class="control-group"> <label for="metricType-ssd_xfs_libaio_1m">Metric</label> <select id="metricType-ssd_xfs_libaio_1m"> <option value="bandwidth" selected="">Bandwidth (GB/s)</option> <option value="iops">IOPS</option> <option value="latency">Latency (μs)</option> <option value="latency_p50">Latency p50 (μs)</option> <option value="latency_p90">Latency p90 (μs)</option> <option value="latency_p99">Latency p99 (μs)</option> </select> </div> </div> <div id="benchmark-plot-ssd_xfs_libaio_1m" class="plot-container lazy-load"></div> <!-- Draggable panel for latency data --> <div id="latencyPanel-ssd_xfs_libaio_1m" class="benchmark-draggable-panel"> <div id="panelHeader-ssd_xfs_libaio_1m" class="panel-header"> <h3 class="panel-title" id="panelTitle-ssd_xfs_libaio_1m">Latency Percentiles</h3> <div class="panel-controls"> <button id="collapseBtn-ssd_xfs_libaio_1m" class="collapse-btn" title="Collapse">▲</button> <button 
id="closeLatencyBtn-ssd_xfs_libaio_1m" class="close-btn" title="Close">×</button> </div> </div> <div class="panel-content"> <div class="latency-details" id="latencyDetails-ssd_xfs_libaio_1m"></div> <div id="latencyPlot-ssd_xfs_libaio_1m" class="latency-plot-container"></div> </div> </div> <div class="benchmark-note"> <p>Performance comparison of SSD with XFS filesystem using LIBAIO driver with 1M block size on older cluster.</p> </div> <script> document.addEventListener('DOMContentLoaded', function() { // Create intersection observer for lazy loading const observer = new IntersectionObserver((entries, observer) => { entries.forEach(entry => { if (entry.isIntersecting) { const container = entry.target.closest('.benchmark-container'); const id = container.id.replace('benchmark-container-', ''); const plotEl = document.getElementById('benchmark-plot-' + id); // Check if already loading or loaded to prevent duplicate requests if (!plotEl.dataset.loading) { plotEl.dataset.loading = 'true'; // Load the benchmark data and initialize the plot loadBenchmarkData(id); } // Stop observing once we've started loading observer.unobserve(entry.target); } }); }, { rootMargin: '200px 0px', // Load when within 200px of viewport threshold: 0.01 }); // Start observing the plot container const plotContainer = document.getElementById('benchmark-plot-ssd_xfs_libaio_1m'); if (plotContainer) { observer.observe(plotContainer); } }); // Function to load benchmark data function loadBenchmarkData(id) { // Ensure benchmark.js is loaded before initializing function waitForBenchmarkJs() { if (typeof initBenchmarkPlot === 'function') { // Function exists, proceed with initialization const container = document.getElementById('benchmark-container-' + id); const dataPath = container.getAttribute('data-path'); if (dataPath) { // Use fetch API to load JSON data only when needed fetch(dataPath) .then(response => { if (!response.ok) { throw new Error(`HTTP error! 
Status: ${response.status}`); } return response.json(); }) .then(data => { // Store data and initialize the plot window['benchmarkData_' + id] = data; initBenchmarkPlot(id); document.getElementById('benchmark-plot-' + id).classList.remove('lazy-load'); }) .catch(error => { console.error('Error loading benchmark data:', error); document.getElementById('benchmark-plot-' + id).innerHTML = '<div style="padding: 20px; color: red;">Error loading benchmark data. Check console for details.</div>'; }); } else { console.error('No data source provided for benchmark visualization'); document.getElementById('benchmark-plot-' + id).innerHTML = '<div style="padding: 20px; color: red;">Error: No data source provided for benchmark visualization.</div>'; } } else { // Function not available yet, wait and try again setTimeout(() => waitForBenchmarkJs(), 100); } } waitForBenchmarkJs(); } </script> </div> <p>No sizable difference between the two I/O drivers.</p> </details> <h1 id="benchmarking-for-modern-cluster">Benchmarking for Modern Cluster</h1> <h2 id="local-fio-results-1">Local FIO results</h2> <h3 id="scaling-block-size-for-local-nvme">Scaling block size for local NVMe</h3> <p>Again, feel free to jump between the interactive graphs and the <a href="#storage-performance-analysis-for-local-nvme">performance analysis</a> to explore the patterns.</p> <!-- benchmark.html --> <div class="benchmark-container" id="benchmark-container-4k_nvme_xfs_iouring_r650" data-path="/assets/images/posts/2025-03-13/fio/4k_nvme_xfs_iouring_r650_1.json"> <h2>4k Block Size - NVME XFS with IO_URING (Modern)</h2> <div class="controls"> <div class="control-group"> <label for="testType-4k_nvme_xfs_iouring_r650">Test Type</label> <select id="testType-4k_nvme_xfs_iouring_r650"> <option value="randread">Random Read</option> <option value="read" selected="">Sequential Read</option> <option value="randwrite">Random Write</option> <option value="write">Sequential Write</option> </select> </div> <div class="control-group"> <label 
for="metricType-4k_nvme_xfs_iouring_r650">Metric</label> <select id="metricType-4k_nvme_xfs_iouring_r650"> <option value="bandwidth" selected="">Bandwidth (GB/s)</option> <option value="iops">IOPS</option> <option value="latency">Latency (μs)</option> <option value="latency_p50">Latency p50 (μs)</option> <option value="latency_p90">Latency p90 (μs)</option> <option value="latency_p99">Latency p99 (μs)</option> </select> </div> </div> <div id="benchmark-plot-4k_nvme_xfs_iouring_r650" class="plot-container lazy-load"></div> <!-- Draggable panel for latency data --> <div id="latencyPanel-4k_nvme_xfs_iouring_r650" class="benchmark-draggable-panel"> <div id="panelHeader-4k_nvme_xfs_iouring_r650" class="panel-header"> <h3 class="panel-title" id="panelTitle-4k_nvme_xfs_iouring_r650">Latency Percentiles</h3> <div class="panel-controls"> <button id="collapseBtn-4k_nvme_xfs_iouring_r650" class="collapse-btn" title="Collapse">▲</button> <button id="closeLatencyBtn-4k_nvme_xfs_iouring_r650" class="close-btn" title="Close">×</button> </div> </div> <div class="panel-content"> <div class="latency-details" id="latencyDetails-4k_nvme_xfs_iouring_r650"></div> <div id="latencyPlot-4k_nvme_xfs_iouring_r650" class="latency-plot-container"></div> </div> </div> <div class="benchmark-note"> <p>Performance of NVME with XFS filesystem using IO_URING driver on modern cluster with 4k block size.</p> </div> <script> document.addEventListener('DOMContentLoaded', function() { // Create intersection observer for lazy loading const observer = new IntersectionObserver((entries, observer) => { entries.forEach(entry => { if (entry.isIntersecting) { const container = entry.target.closest('.benchmark-container'); const id = container.id.replace('benchmark-container-', ''); const plotEl = document.getElementById('benchmark-plot-' + id); // Check if already loading or loaded to prevent duplicate requests if (!plotEl.dataset.loading) { plotEl.dataset.loading = 'true'; // Load the benchmark data and 
initialize the plot loadBenchmarkData(id); } // Stop observing once we've started loading observer.unobserve(entry.target); } }); }, { rootMargin: '200px 0px', // Load when within 200px of viewport threshold: 0.01 }); // Start observing the plot container const plotContainer = document.getElementById('benchmark-plot-4k_nvme_xfs_iouring_r650'); if (plotContainer) { observer.observe(plotContainer); } }); // Function to load benchmark data function loadBenchmarkData(id) { // Ensure benchmark.js is loaded before initializing function waitForBenchmarkJs() { if (typeof initBenchmarkPlot === 'function') { // Function exists, proceed with initialization const container = document.getElementById('benchmark-container-' + id); const dataPath = container.getAttribute('data-path'); if (dataPath) { // Use fetch API to load JSON data only when needed fetch(dataPath) .then(response => { if (!response.ok) { throw new Error(`HTTP error! Status: ${response.status}`); } return response.json(); }) .then(data => { // Store data and initialize the plot window['benchmarkData_' + id] = data; initBenchmarkPlot(id); document.getElementById('benchmark-plot-' + id).classList.remove('lazy-load'); }) .catch(error => { console.error('Error loading benchmark data:', error); document.getElementById('benchmark-plot-' + id).innerHTML = '<div style="padding: 20px; color: red;">Error loading benchmark data. 
Check console for details.</div>'; }); } else { console.error('No data source provided for benchmark visualization'); document.getElementById('benchmark-plot-' + id).innerHTML = '<div style="padding: 20px; color: red;">Error: No data source provided for benchmark visualization.</div>'; } } else { // Function not available yet, wait and try again setTimeout(() => waitForBenchmarkJs(), 100); } } waitForBenchmarkJs(); } </script> </div> <!-- benchmark.html --> <div class="benchmark-container" id="benchmark-container-64k_nvme_xfs_iouring_r650" data-path="/assets/images/posts/2025-03-13/fio/64k_nvme_xfs_iouring_r650_1.json"> <h2>64k Block Size - NVME XFS with IO_URING (Modern)</h2> <div class="controls"> <div class="control-group"> <label for="testType-64k_nvme_xfs_iouring_r650">Test Type</label> <select id="testType-64k_nvme_xfs_iouring_r650"> <option value="randread">Random Read</option> <option value="read" selected="">Sequential Read</option> <option value="randwrite">Random Write</option> <option value="write">Sequential Write</option> </select> </div> <div class="control-group"> <label for="metricType-64k_nvme_xfs_iouring_r650">Metric</label> <select id="metricType-64k_nvme_xfs_iouring_r650"> <option value="bandwidth" selected="">Bandwidth (GB/s)</option> <option value="iops">IOPS</option> <option value="latency">Latency (μs)</option> <option value="latency_p50">Latency p50 (μs)</option> <option value="latency_p90">Latency p90 (μs)</option> <option value="latency_p99">Latency p99 (μs)</option> </select> </div> </div> <div id="benchmark-plot-64k_nvme_xfs_iouring_r650" class="plot-container lazy-load"></div> <!-- Draggable panel for latency data --> <div id="latencyPanel-64k_nvme_xfs_iouring_r650" class="benchmark-draggable-panel"> <div id="panelHeader-64k_nvme_xfs_iouring_r650" class="panel-header"> <h3 class="panel-title" id="panelTitle-64k_nvme_xfs_iouring_r650">Latency Percentiles</h3> <div class="panel-controls"> <button 
id="collapseBtn-64k_nvme_xfs_iouring_r650" class="collapse-btn" title="Collapse">▲</button> <button id="closeLatencyBtn-64k_nvme_xfs_iouring_r650" class="close-btn" title="Close">×</button> </div> </div> <div class="panel-content"> <div class="latency-details" id="latencyDetails-64k_nvme_xfs_iouring_r650"></div> <div id="latencyPlot-64k_nvme_xfs_iouring_r650" class="latency-plot-container"></div> </div> </div> <div class="benchmark-note"> <p>Performance of NVME with XFS filesystem using IO_URING driver on modern cluster with 64k block size.</p> </div> <script> document.addEventListener('DOMContentLoaded', function() { // Create intersection observer for lazy loading const observer = new IntersectionObserver((entries, observer) => { entries.forEach(entry => { if (entry.isIntersecting) { const container = entry.target.closest('.benchmark-container'); const id = container.id.replace('benchmark-container-', ''); const plotEl = document.getElementById('benchmark-plot-' + id); // Check if already loading or loaded to prevent duplicate requests if (!plotEl.dataset.loading) { plotEl.dataset.loading = 'true'; // Load the benchmark data and initialize the plot loadBenchmarkData(id); } // Stop observing once we've started loading observer.unobserve(entry.target); } }); }, { rootMargin: '200px 0px', // Load when within 200px of viewport threshold: 0.01 }); // Start observing the plot container const plotContainer = document.getElementById('benchmark-plot-64k_nvme_xfs_iouring_r650'); if (plotContainer) { observer.observe(plotContainer); } }); // Function to load benchmark data function loadBenchmarkData(id) { // Ensure benchmark.js is loaded before initializing function waitForBenchmarkJs() { if (typeof initBenchmarkPlot === 'function') { // Function exists, proceed with initialization const container = document.getElementById('benchmark-container-' + id); const dataPath = container.getAttribute('data-path'); if (dataPath) { // Use fetch API to load JSON data only when needed 
fetch(dataPath) .then(response => { if (!response.ok) { throw new Error(`HTTP error! Status: ${response.status}`); } return response.json(); }) .then(data => { // Store data and initialize the plot window['benchmarkData_' + id] = data; initBenchmarkPlot(id); document.getElementById('benchmark-plot-' + id).classList.remove('lazy-load'); }) .catch(error => { console.error('Error loading benchmark data:', error); document.getElementById('benchmark-plot-' + id).innerHTML = '<div style="padding: 20px; color: red;">Error loading benchmark data. Check console for details.</div>'; }); } else { console.error('No data source provided for benchmark visualization'); document.getElementById('benchmark-plot-' + id).innerHTML = '<div style="padding: 20px; color: red;">Error: No data source provided for benchmark visualization.</div>'; } } else { // Function not available yet, wait and try again setTimeout(() => waitForBenchmarkJs(), 100); } } waitForBenchmarkJs(); } </script> </div> <!-- benchmark.html --> <div class="benchmark-container" id="benchmark-container-1m_nvme_xfs_iouring_r650" data-path="/assets/images/posts/2025-03-13/fio/1m_nvme_xfs_iouring_r650_1.json"> <h2>1M Block Size - NVME XFS with IO_URING (Modern)</h2> <div class="controls"> <div class="control-group"> <label for="testType-1m_nvme_xfs_iouring_r650">Test Type</label> <select id="testType-1m_nvme_xfs_iouring_r650"> <option value="randread">Random Read</option> <option value="read" selected="">Sequential Read</option> <option value="randwrite">Random Write</option> <option value="write">Sequential Write</option> </select> </div> <div class="control-group"> <label for="metricType-1m_nvme_xfs_iouring_r650">Metric</label> <select id="metricType-1m_nvme_xfs_iouring_r650"> <option value="bandwidth" selected="">Bandwidth (GB/s)</option> <option value="iops">IOPS</option> <option value="latency">Latency (μs)</option> <option value="latency_p50">Latency p50 (μs)</option> <option value="latency_p90">Latency p90 
(μs)</option> <option value="latency_p99">Latency p99 (μs)</option> </select> </div> </div> <div id="benchmark-plot-1m_nvme_xfs_iouring_r650" class="plot-container lazy-load"></div> <!-- Draggable panel for latency data --> <div id="latencyPanel-1m_nvme_xfs_iouring_r650" class="benchmark-draggable-panel"> <div id="panelHeader-1m_nvme_xfs_iouring_r650" class="panel-header"> <h3 class="panel-title" id="panelTitle-1m_nvme_xfs_iouring_r650">Latency Percentiles</h3> <div class="panel-controls"> <button id="collapseBtn-1m_nvme_xfs_iouring_r650" class="collapse-btn" title="Collapse">▲</button> <button id="closeLatencyBtn-1m_nvme_xfs_iouring_r650" class="close-btn" title="Close">×</button> </div> </div> <div class="panel-content"> <div class="latency-details" id="latencyDetails-1m_nvme_xfs_iouring_r650"></div> <div id="latencyPlot-1m_nvme_xfs_iouring_r650" class="latency-plot-container"></div> </div> </div> <div class="benchmark-note"> <p>Performance of NVME with XFS filesystem using IO_URING driver on modern cluster with 1M block size.</p> </div> <script> document.addEventListener('DOMContentLoaded', function() { // Create intersection observer for lazy loading const observer = new IntersectionObserver((entries, observer) => { entries.forEach(entry => { if (entry.isIntersecting) { const container = entry.target.closest('.benchmark-container'); const id = container.id.replace('benchmark-container-', ''); const plotEl = document.getElementById('benchmark-plot-' + id); // Check if already loading or loaded to prevent duplicate requests if (!plotEl.dataset.loading) { plotEl.dataset.loading = 'true'; // Load the benchmark data and initialize the plot loadBenchmarkData(id); } // Stop observing once we've started loading observer.unobserve(entry.target); } }); }, { rootMargin: '200px 0px', // Load when within 200px of viewport threshold: 0.01 }); // Start observing the plot container const plotContainer = document.getElementById('benchmark-plot-1m_nvme_xfs_iouring_r650'); if 
(plotContainer) { observer.observe(plotContainer); } }); // Function to load benchmark data function loadBenchmarkData(id) { // Ensure benchmark.js is loaded before initializing function waitForBenchmarkJs() { if (typeof initBenchmarkPlot === 'function') { // Function exists, proceed with initialization const container = document.getElementById('benchmark-container-' + id); const dataPath = container.getAttribute('data-path'); if (dataPath) { // Use fetch API to load JSON data only when needed fetch(dataPath) .then(response => { if (!response.ok) { throw new Error(`HTTP error! Status: ${response.status}`); } return response.json(); }) .then(data => { // Store data and initialize the plot window['benchmarkData_' + id] = data; initBenchmarkPlot(id); document.getElementById('benchmark-plot-' + id).classList.remove('lazy-load'); }) .catch(error => { console.error('Error loading benchmark data:', error); document.getElementById('benchmark-plot-' + id).innerHTML = '<div style="padding: 20px; color: red;">Error loading benchmark data. 
Check console for details.</div>'; }); } else { console.error('No data source provided for benchmark visualization'); document.getElementById('benchmark-plot-' + id).innerHTML = '<div style="padding: 20px; color: red;">Error: No data source provided for benchmark visualization.</div>'; } } else { // Function not available yet, wait and try again setTimeout(() => waitForBenchmarkJs(), 100); } } waitForBenchmarkJs(); } </script> </div> <h3 id="storage-performance-analysis-for-local-nvme">Storage Performance Analysis for local NVMe</h3> <p>Let’s examine how the NVMe drive performs compared to our SATA baseline:</p> <div class="image-container" style="max-width: 100%; overflow: visible;"> <img loading="lazy" src="/assets/images/posts/2025-03-13/part3/fio_nvme/randread_throughput_vs_latency_all_depths_1jobs.png" style="width: 110%; margin-left: calc((100% - 110%) / 2);" alt="" /> <div class="caption"> <em>Throughput versus latency graph for random reads on NVMe </em> </div> </div> <details style="font-size: 1.2em; margin: 1em 0;"> <summary style="cursor: pointer; font-weight: bold;">Expand to view SATA SSD comparison graph</summary> <div class="image-container" style="max-width: 100%; overflow: visible;"> <img loading="lazy" src="/assets/images/posts/2025-03-13/part3/fio_ssd/randread_throughput_vs_latency_all_depths_1jobs.png" style="width: 105%; margin-left: calc((100% - 105%) / 2);" alt="" /> <div class="caption"> <em>Throughput versus latency graph for random reads on SATA SSD </em> </div> </div> </details> <p>The NVMe improvement is dramatic:</p> <ul> <li><strong>Throughput:</strong> 4-10x higher depending on block size (1 GB/s vs 250 MB/s for 4K, 4 GB/s vs 400 MB/s for 64K)</li> <li><strong>Latency:</strong> Consistently lower, especially for large blocks<span class="sidenote-ref"></span><span class="sidenote">For 64K blocks: NVMe stays at ~1ms while SATA climbs to ~20ms - a 20x difference</span></li> </ul> <p>One interesting difference from SATA patterns:</p> <ul>
<li>64K and 1M blocks need higher IO depths to hit their knee points, suggesting NVMe controllers require more parallelism for peak performance<span class="sidenote-ref"></span><span class="sidenote">3FS may need to be configured with sufficient parallelism to extract maximum NVMe performance</span></li> </ul> <details style="font-size: 1.2em; margin: 1em 0;"> <summary style="cursor: pointer; font-weight: bold;">Expand to view throughput versus latency graphs for other workloads</summary> <div class="image-container" style="max-width: 100%; overflow: visible;"> <img loading="lazy" src="/assets/images/posts/2025-03-13/part3/fio_nvme/read_throughput_vs_latency_all_depths_1jobs.png" style="width: 105%; margin-left: calc((100% - 105%) / 2);" alt="" /> <div class="caption"> <em>Throughput versus latency graph for sequential reads </em> </div> </div> <div class="image-container" style="max-width: 100%; overflow: visible;"> <img loading="lazy" src="/assets/images/posts/2025-03-13/part3/fio_nvme/randwrite_throughput_vs_latency_all_depths_1jobs.png" style="width: 105%; margin-left: calc((100% - 105%) / 2);" alt="" /> <div class="caption"> <em>Throughput versus latency graph for random writes </em> </div> </div> <div class="image-container" style="max-width: 100%; overflow: visible;"> <img loading="lazy" src="/assets/images/posts/2025-03-13/part3/fio_nvme/write_throughput_vs_latency_all_depths_1jobs.png" style="width: 105%; margin-left: calc((100% - 105%) / 2);" alt="" /> <div class="caption"> <em>Throughput versus latency graph for sequential writes </em> </div> </div> </details> <details style="font-size: 1.2em; margin: 1em 0;"> <summary style="cursor: pointer; font-weight: bold;">Expand to view throughput versus latency graphs scaling num jobs for random reads</summary> <div class="image-container" style="max-width: 100%; overflow: visible;"> <img loading="lazy" src="/assets/images/posts/2025-03-13/part3/fio_nvme/randread_throughput_vs_latency_all_depths_1jobs.png" 
style="width: 105%; margin-left: calc((100% - 105%) / 2);" alt="" /> <div class="caption"> <em>Throughput versus latency graph for random reads for 1 numjobs </em> </div> </div> <div class="image-container" style="max-width: 100%; overflow: visible;"> <img loading="lazy" src="/assets/images/posts/2025-03-13/part3/fio_nvme/randread_throughput_vs_latency_all_depths_2jobs.png" style="width: 105%; margin-left: calc((100% - 105%) / 2);" alt="" /> <div class="caption"> <em>Throughput versus latency graph for random reads for 2 numjobs </em> </div> </div> <div class="image-container" style="max-width: 100%; overflow: visible;"> <img loading="lazy" src="/assets/images/posts/2025-03-13/part3/fio_nvme/randread_throughput_vs_latency_all_depths_4jobs.png" style="width: 105%; margin-left: calc((100% - 105%) / 2);" alt="" /> <div class="caption"> <em>Throughput versus latency graph for random reads for 4 numjobs </em> </div> </div> <div class="image-container" style="max-width: 100%; overflow: visible;"> <img loading="lazy" src="/assets/images/posts/2025-03-13/part3/fio_nvme/randread_throughput_vs_latency_all_depths_8jobs.png" style="width: 105%; margin-left: calc((100% - 105%) / 2);" alt="" /> <div class="caption"> <em>Throughput versus latency graph for random reads for 8 numjobs </em> </div> </div> </details> <p>Sequential reads follow similar patterns to random reads, maintaining a similar high throughput ceiling and low latency.</p> <p>Write performance reveals a different story. 
Both random and sequential writes drop to ~2 GB/s peak throughput, with knee points occurring at much lower IO depths for 64K and 1M blocks<span class="sidenote-ref"></span><span class="sidenote">This aligns with the vendor specification showing NVMe write performance (2.3 GB/s) is significantly lower than read performance (6.2 GB/s)</span>.</p> <p>The numjobs scaling patterns mirror what we observed with SATA SSDs: throughput increases with additional parallel jobs, but latency scales proportionally. Doubling jobs roughly doubles latency but provides less than 2x throughput improvement.</p> <h2 id="predicting-3fs-performance">Predicting 3FS Performance</h2> <p>Before diving into actual 3FS benchmarks, let’s make some predictions based on our hardware baseline measurements:</p> <p>For random/sequential reads, our theoretical ceiling is 18 GB/s, since there’s a replication factor of 3 and both random and sequential reads hit 6 GB/s.</p> <p>However, we’re bound by network bandwidth, which has a theoretical limit of 12.5 GB/s (realistically ~11.5 GB/s from our previous micro-benchmarks).</p> <p>Let’s now talk about latency in the best and worst case. We can pull the network and disk latency from the graphs we have, starting with reads.</p> <p>In the average case:</p> <ul> <li>The average network latency for 1MB of data is 91us</li> <li>The average disk latency for sequential/random reads for 1M block size (1 IO depth, 1 job) is 0.48ms</li> <li>So the latency we should expect is ~0.48ms</li> </ul> <p>In the worst case:</p> <ul> <li>The p99 network latency for 1MB of data is 282us</li> <li>The p99 disk latency for sequential/random reads for 1M block size (128 IO depth, 16 jobs) is 448ms/420ms</li> <li>So the latency we should expect is ~448ms</li> </ul> <p>What we can see is a ~1000x difference in latency between the average and worst case.
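</p>

<p>These estimates (and the write-side ones below) boil down to a few lines of arithmetic. Here is a hypothetical sketch, with the latency numbers read off the graphs in this post; the network term is dropped since disk latency dominates it:</p>

```python
REPLICATION = 3  # 3FS chains each write through all 3 replicas

def predict_ms(disk_ms, chained=False):
    # Disk latency (0.5-900ms here) dwarfs the ~0.1-0.3ms network
    # latency, so the estimate is just the disk term, multiplied
    # across the replication chain for writes.
    return disk_ms * (REPLICATION if chained else 1)

read_avg  = predict_ms(0.48)                  # 1M block, 1 IO depth, 1 job
read_p99  = predict_ms(448)                   # 1M block, 128 IO depth, 16 jobs
write_avg = predict_ms(0.46, chained=True)
write_p99 = predict_ms(892, chained=True)

print(f"reads:  {read_avg:.2f} ms avg vs {read_p99:.0f} ms p99")
print(f"writes: {write_avg:.2f} ms avg vs {write_p99 / 1000:.2f} s p99")
# reads:  0.48 ms avg vs 448 ms p99
# writes: 1.38 ms avg vs 2.68 s p99
```

<p>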
Another thing we can clearly see is that the latency is dominated by disk latency.</p> <p>Moving on to writes:</p> <p>Average case:</p> <ul> <li>The average network latency is 91us</li> <li>The average disk latency for writes for 1M block size (1 IO depth, 1 job) is 0.46ms</li> <li>So the combined latency is 0.46ms * 3 (chained) = 1.38ms</li> </ul> <p>P99 case:</p> <ul> <li>The p99 network latency is 187us</li> <li>The p99 disk latency for writes for 1M block size (128 IO depth, 16 jobs) is 892ms</li> <li>So the combined latency is 892ms * 3 (chained) = 2.68s</li> </ul> <p>Writes can be ~2000x slower in the worst case. This is due to the multiplicative factor of writes, since each write has to go through every node in the replication chain.</p> <p>With this in mind, let’s head into the benchmarks:</p> <h2 id="3fs">3FS</h2> <p>3FS is benchmarked using two different I/O interfaces: io_uring, the standard Linux asynchronous I/O interface, and USRBIO, a custom FIO engine that integrates directly with 3FS’s I/O queue management system.</p> <h3 id="io_uring">IO_URING</h3> <!-- benchmark.html --> <div class="benchmark-container" id="benchmark-container-1m_hf3fs_xfs_iouring_r650" data-path="/assets/images/posts/2025-03-13/fio/1m_hf3fs_xfs_iouring_r650_5.json"> <h2>1M Block Size - HF3FS XFS with IO_URING (Modern)</h2> <div class="controls"> <div class="control-group"> <label for="testType-1m_hf3fs_xfs_iouring_r650">Test Type</label> <select id="testType-1m_hf3fs_xfs_iouring_r650"> <option value="randread">Random Read</option> <option value="read" selected="">Sequential Read</option> <option value="randwrite">Random Write</option> <option value="write">Sequential Write</option> </select> </div> <div class="control-group"> <label for="metricType-1m_hf3fs_xfs_iouring_r650">Metric</label> <select id="metricType-1m_hf3fs_xfs_iouring_r650"> <option value="bandwidth" selected="">Bandwidth (GB/s)</option> <option value="iops">IOPS</option> <option value="latency">Latency (μs)</option> <option value="latency_p50">Latency p50 (μs)</option> <option value="latency_p90">Latency p90 (μs)</option> <option value="latency_p99">Latency p99 (μs)</option> </select> </div> </div> <div
id="benchmark-plot-1m_hf3fs_xfs_iouring_r650" class="plot-container lazy-load"></div> <!-- Draggable panel for latency data --> <div id="latencyPanel-1m_hf3fs_xfs_iouring_r650" class="benchmark-draggable-panel"> <div id="panelHeader-1m_hf3fs_xfs_iouring_r650" class="panel-header"> <h3 class="panel-title" id="panelTitle-1m_hf3fs_xfs_iouring_r650">Latency Percentiles</h3> <div class="panel-controls"> <button id="collapseBtn-1m_hf3fs_xfs_iouring_r650" class="collapse-btn" title="Collapse">▲</button> <button id="closeLatencyBtn-1m_hf3fs_xfs_iouring_r650" class="close-btn" title="Close">×</button> </div> </div> <div class="panel-content"> <div class="latency-details" id="latencyDetails-1m_hf3fs_xfs_iouring_r650"></div> <div id="latencyPlot-1m_hf3fs_xfs_iouring_r650" class="latency-plot-container"></div> </div> </div> <div class="benchmark-note"> <p>Performance of HF3FS with XFS filesystem using IO_URING driver on modern cluster with 1M block size.</p> </div> <script> document.addEventListener('DOMContentLoaded', function() { // Create intersection observer for lazy loading const observer = new IntersectionObserver((entries, observer) => { entries.forEach(entry => { if (entry.isIntersecting) { const container = entry.target.closest('.benchmark-container'); const id = container.id.replace('benchmark-container-', ''); const plotEl = document.getElementById('benchmark-plot-' + id); // Check if already loading or loaded to prevent duplicate requests if (!plotEl.dataset.loading) { plotEl.dataset.loading = 'true'; // Load the benchmark data and initialize the plot loadBenchmarkData(id); } // Stop observing once we've started loading observer.unobserve(entry.target); } }); }, { rootMargin: '200px 0px', // Load when within 200px of viewport threshold: 0.01 }); // Start observing the plot container const plotContainer = document.getElementById('benchmark-plot-1m_hf3fs_xfs_iouring_r650'); if (plotContainer) { observer.observe(plotContainer); } }); // Function to load benchmark 
data function loadBenchmarkData(id) { // Ensure benchmark.js is loaded before initializing function waitForBenchmarkJs() { if (typeof initBenchmarkPlot === 'function') { // Function exists, proceed with initialization const container = document.getElementById('benchmark-container-' + id); const dataPath = container.getAttribute('data-path'); if (dataPath) { // Use fetch API to load JSON data only when needed fetch(dataPath) .then(response => { if (!response.ok) { throw new Error(`HTTP error! Status: ${response.status}`); } return response.json(); }) .then(data => { // Store data and initialize the plot window['benchmarkData_' + id] = data; initBenchmarkPlot(id); document.getElementById('benchmark-plot-' + id).classList.remove('lazy-load'); }) .catch(error => { console.error('Error loading benchmark data:', error); document.getElementById('benchmark-plot-' + id).innerHTML = '<div style="padding: 20px; color: red;">Error loading benchmark data. Check console for details.</div>'; }); } else { console.error('No data source provided for benchmark visualization'); document.getElementById('benchmark-plot-' + id).innerHTML = '<div style="padding: 20px; color: red;">Error: No data source provided for benchmark visualization.</div>'; } } else { // Function not available yet, wait and try again setTimeout(() => waitForBenchmarkJs(), 100); } } waitForBenchmarkJs(); } </script> </div> <h3 id="usrbio">USRBIO</h3> <!-- benchmark.html --> <div class="benchmark-container" id="benchmark-container-1m_hf3fs_xfs_usrbio_r650" data-path="/assets/images/posts/2025-03-13/fio/1m_hf3fs_xfs_usrbio_r650_5.json"> <h2>1M Block Size - HF3FS XFS with USRBIO (Modern)</h2> <div class="controls"> <div class="control-group"> <label for="testType-1m_hf3fs_xfs_usrbio_r650">Test Type</label> <select id="testType-1m_hf3fs_xfs_usrbio_r650"> <option value="randread">Random Read</option> <option value="read" selected="">Sequential Read</option> <option value="randwrite">Random Write</option> <option 
value="write">Sequential Write</option> </select> </div> <div class="control-group"> <label for="metricType-1m_hf3fs_xfs_usrbio_r650">Metric</label> <select id="metricType-1m_hf3fs_xfs_usrbio_r650"> <option value="bandwidth" selected="">Bandwidth (GB/s)</option> <option value="iops">IOPS</option> <option value="latency">Latency (μs)</option> <option value="latency_p50">Latency p50 (μs)</option> <option value="latency_p90">Latency p90 (μs)</option> <option value="latency_p99">Latency p99 (μs)</option> </select> </div> </div> <div id="benchmark-plot-1m_hf3fs_xfs_usrbio_r650" class="plot-container lazy-load"></div> <!-- Draggable panel for latency data --> <div id="latencyPanel-1m_hf3fs_xfs_usrbio_r650" class="benchmark-draggable-panel"> <div id="panelHeader-1m_hf3fs_xfs_usrbio_r650" class="panel-header"> <h3 class="panel-title" id="panelTitle-1m_hf3fs_xfs_usrbio_r650">Latency Percentiles</h3> <div class="panel-controls"> <button id="collapseBtn-1m_hf3fs_xfs_usrbio_r650" class="collapse-btn" title="Collapse">▲</button> <button id="closeLatencyBtn-1m_hf3fs_xfs_usrbio_r650" class="close-btn" title="Close">×</button> </div> </div> <div class="panel-content"> <div class="latency-details" id="latencyDetails-1m_hf3fs_xfs_usrbio_r650"></div> <div id="latencyPlot-1m_hf3fs_xfs_usrbio_r650" class="latency-plot-container"></div> </div> </div> <div class="benchmark-note"> <p>Performance of HF3FS with XFS filesystem using USRBIO driver on modern cluster with 1M block size.</p> </div> <script> document.addEventListener('DOMContentLoaded', function() { // Create intersection observer for lazy loading const observer = new IntersectionObserver((entries, observer) => { entries.forEach(entry => { if (entry.isIntersecting) { const container = entry.target.closest('.benchmark-container'); const id = container.id.replace('benchmark-container-', ''); const plotEl = document.getElementById('benchmark-plot-' + id); // Check if already loading or loaded to prevent duplicate requests if 
(!plotEl.dataset.loading) { plotEl.dataset.loading = 'true'; // Load the benchmark data and initialize the plot loadBenchmarkData(id); } // Stop observing once we've started loading observer.unobserve(entry.target); } }); }, { rootMargin: '200px 0px', // Load when within 200px of viewport threshold: 0.01 }); // Start observing the plot container const plotContainer = document.getElementById('benchmark-plot-1m_hf3fs_xfs_usrbio_r650'); if (plotContainer) { observer.observe(plotContainer); } }); // Function to load benchmark data function loadBenchmarkData(id) { // Ensure benchmark.js is loaded before initializing function waitForBenchmarkJs() { if (typeof initBenchmarkPlot === 'function') { // Function exists, proceed with initialization const container = document.getElementById('benchmark-container-' + id); const dataPath = container.getAttribute('data-path'); if (dataPath) { // Use fetch API to load JSON data only when needed fetch(dataPath) .then(response => { if (!response.ok) { throw new Error(`HTTP error! Status: ${response.status}`); } return response.json(); }) .then(data => { // Store data and initialize the plot window['benchmarkData_' + id] = data; initBenchmarkPlot(id); document.getElementById('benchmark-plot-' + id).classList.remove('lazy-load'); }) .catch(error => { console.error('Error loading benchmark data:', error); document.getElementById('benchmark-plot-' + id).innerHTML = '<div style="padding: 20px; color: red;">Error loading benchmark data. 
Check console for details.</div>'; }); } else { console.error('No data source provided for benchmark visualization'); document.getElementById('benchmark-plot-' + id).innerHTML = '<div style="padding: 20px; color: red;">Error: No data source provided for benchmark visualization.</div>'; } } else { // Function not available yet, wait and try again setTimeout(() => waitForBenchmarkJs(), 100); } } waitForBenchmarkJs(); } </script> </div> <p>One thing to observe is that for io_uring, <code class="language-plaintext highlighter-rouge">io_depth</code> does not affect performance.</p> <p>Again, here’s the 2D graph. Note that <code class="language-plaintext highlighter-rouge">IO_URING</code> sits at the same spot.</p> <div class="image-container" style="max-width: 100%; overflow: visible;"> <img loading="lazy" src="/assets/images/posts/2025-03-13/part3/fio_hf3fs/randread_throughput_vs_latency_all_depths_1jobs.png" style="width: 110%; margin-left: calc((100% - 110%) / 2);" alt="" /> <div class="caption"> <em>Throughput versus latency graph for random reads on 3FS </em> </div> </div> <p>One interesting thing to observe is that <code class="language-plaintext highlighter-rouge">io_uring</code> has lower latency at the same throughput as <code class="language-plaintext highlighter-rouge">usrbio</code>.</p> <details style="font-size: 1.2em; margin: 1em 0;"> <summary style="cursor: pointer; font-weight: bold;">Expand to view throughput versus latency graphs for other workloads</summary> <div class="image-container" style="max-width: 100%; overflow: visible;"> <img loading="lazy" src="/assets/images/posts/2025-03-13/part3/fio_hf3fs/read_throughput_vs_latency_all_depths_1jobs.png" style="width: 105%; margin-left: calc((100% - 105%) / 2);" alt="" /> <div class="caption"> <em>Throughput versus latency graph for sequential reads </em> </div> </div> <div class="image-container" style="max-width: 100%; overflow: visible;"> <img loading="lazy"
src="/assets/images/posts/2025-03-13/part3/fio_hf3fs/randwrite_throughput_vs_latency_all_depths_1jobs.png" style="width: 105%; margin-left: calc((100% - 105%) / 2);" alt="" /> <div class="caption"> <em>Throughput versus latency graph for random writes </em> </div> </div> <div class="image-container" style="max-width: 100%; overflow: visible;"> <img loading="lazy" src="/assets/images/posts/2025-03-13/part3/fio_hf3fs/write_throughput_vs_latency_all_depths_1jobs.png" style="width: 105%; margin-left: calc((100% - 105%) / 2);" alt="" /> <div class="caption"> <em>Throughput versus latency graph for sequential writes </em> </div> </div> </details> <h2 id="does-the-performance-match-the-estimates">Does the performance match the estimates?</h2> <link rel="stylesheet" href="/assets/css/fancy_table.css" /> <script> function toggleCalculations(tableId) { // ... (toggleCalculations function remains the same) const table = document.getElementById(tableId); const cells = table.querySelectorAll('.has-calculation'); const tableWrapper = document.getElementById(tableId + '-wrapper'); const toggleSwitch = document.getElementById(tableId + '-switch'); const toggleText = document.getElementById(tableId + '-toggle-text'); if (tableWrapper.classList.contains('show-calculations')) { tableWrapper.classList.remove('show-calculations'); toggleSwitch.classList.remove('active'); toggleText.textContent = "Show calculations"; cells.forEach(cell => { const normalText = cell.querySelector('.normal-text'); const calculationText = cell.querySelector('.calculation-text'); normalText.style.display = 'inline'; calculationText.style.display = 'none'; }); } else { tableWrapper.classList.add('show-calculations'); toggleSwitch.classList.add('active'); toggleText.textContent = "Hide calculations"; cells.forEach(cell => { const normalText = cell.querySelector('.normal-text'); const calculationText = cell.querySelector('.calculation-text'); normalText.style.display = 'none'; calculationText.style.display = 
'inline'; }); } } function isConsideredYellow(colorString) { if (!colorString) return false; const lowerColor = colorString.toLowerCase(); const yellowKeywords = ['yellow', 'gold', 'lemon', 'chiffon', 'goldenrod', 'papayawhip', 'moccasin', 'khaki', '#ff0', '#ffff00', '#ffffe0', '#fffacd', '#fafad2', '#fff8dc', '#eee8aa', '#f0e68c']; if (yellowKeywords.some(k => lowerColor.includes(k))) { return true; } const match = lowerColor.match(/rgba?\((\d+),\s*(\d+),\s*(\d+)/); if (match) { const r = parseInt(match[1]); const g = parseInt(match[2]); const b = parseInt(match[3]); if (r > 200 && g > 180 && b < 200 && Math.abs(r - g) < 70) { return true; } } if (lowerColor === '#ffdab9') return true; // PeachPuff if (lowerColor === 'rgba(255,255,0,0.2)') return true; // Example: semi-transparent yellow return false; } function initializeCellHighlighters() { const highlighters = document.querySelectorAll('[data-highlight-cells]'); highlighters.forEach(highlighter => { if (highlighter.dataset.highlighterInitialized === 'true') return; highlighter.dataset.highlighterInitialized = 'true'; const cellIdsToHighlight = highlighter.dataset.highlightCells.split(',').map(id => id.trim()); const cellElements = cellIdsToHighlight.map(id => document.getElementById(id)).filter(el => el); let descriptiveSpans = []; if (highlighter.matches('span[data-hover-text-color]')) { descriptiveSpans.push(highlighter); } else { descriptiveSpans = Array.from(highlighter.querySelectorAll('span[data-hover-text-color]')); } descriptiveSpans.forEach(span => { if (typeof span.dataset.originalParaTextColor === 'undefined') { span.dataset.originalParaTextColor = window.getComputedStyle(span).color || ''; } }); highlighter.addEventListener('mouseenter', () => { highlighter.classList.add('trigger-text-active'); descriptiveSpans.forEach(span => { if (span.dataset.hoverTextColor) { span.style.color = span.dataset.hoverTextColor; } }); cellElements.forEach((cell, idx) => { cell.classList.add('cell-highlighted'); let 
hoverTextColorForCell = null; let hoverCellBgFromDescSpan = null; if (idx < descriptiveSpans.length) { const descSpan = descriptiveSpans[idx]; if (descSpan.dataset.hoverTextColor) { hoverTextColorForCell = descSpan.dataset.hoverTextColor; } if (descSpan.dataset.hoverCellBg) { hoverCellBgFromDescSpan = descSpan.dataset.hoverCellBg; } } else if (descriptiveSpans.length === 1 && cellElements.length > 1) { const descSpan = descriptiveSpans[0]; if (descSpan.dataset.hoverTextColor) { hoverTextColorForCell = descSpan.dataset.hoverTextColor; } if (descSpan.dataset.hoverCellBg) { hoverCellBgFromDescSpan = descSpan.dataset.hoverCellBg; } } if (hoverTextColorForCell) { const isCalcCell = cell.classList.contains('has-calculation'); if (isCalcCell) { const normalTextSpan = cell.querySelector('.normal-text'); const calcResultSpan = cell.querySelector('.calc-result'); const calcWrapper = cell.querySelector('.calculation-text'); if (calcWrapper && window.getComputedStyle(calcWrapper).display !== 'none') { if (calcResultSpan) calcResultSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } else { if (normalTextSpan) normalTextSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } } else { const directSpan = cell.querySelector(':scope > span'); if (directSpan) directSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } } if (hoverCellBgFromDescSpan && isConsideredYellow(hoverCellBgFromDescSpan)) { if (typeof cell.dataset.originalBgColor === 'undefined') { cell.dataset.originalBgColor = cell.style.backgroundColor; } cell.style.backgroundColor = hoverCellBgFromDescSpan; cell.dataset.bgChangedByScript = 'true'; } }); }); highlighter.addEventListener('mouseleave', () => { highlighter.classList.remove('trigger-text-active'); descriptiveSpans.forEach(span => { if (typeof span.dataset.originalParaTextColor !== 'undefined') { span.style.color = span.dataset.originalParaTextColor; } }); cellElements.forEach((cell, idx) => { 
cell.classList.remove('cell-highlighted'); const isCalcCell = cell.classList.contains('has-calculation'); if (isCalcCell) { const normalTextSpan = cell.querySelector('.normal-text'); const calcResultSpan = cell.querySelector('.calc-result'); if (normalTextSpan) normalTextSpan.style.removeProperty('color'); if (calcResultSpan) calcResultSpan.style.removeProperty('color'); } else { const directSpan = cell.querySelector(':scope > span'); if (directSpan) directSpan.style.removeProperty('color'); } if (cell.dataset.bgChangedByScript === 'true') { cell.style.backgroundColor = cell.dataset.originalBgColor || ''; delete cell.dataset.originalBgColor; delete cell.dataset.bgChangedByScript; } }); }); }); } if (document.readyState === 'loading') { document.addEventListener('DOMContentLoaded', initializeCellHighlighters); } else { initializeCellHighlighters(); } </script> <style> /* ... (Other styles remain the same) ... */ .toggle-container { display: flex; justify-content: flex-end; margin-bottom: 6px; } .calc-toggle { display: inline-flex; align-items: center; } .toggle-switch { position: relative; display: inline-block; width: 32px; height: 16px; background-color: #e5e7eb; /* gray-200 */ border-radius: 10px; transition: all 0.3s; cursor: pointer; } .toggle-switch::after { content: ''; position: absolute; width: 12px; height: 12px; border-radius: 50%; background-color: white; top: 2px; left: 2px; transition: all 0.3s cubic-bezier(0.175, 0.885, 0.32, 1.275); box-shadow: 0 1px 2px rgba(0, 0, 0, 0.1); } .toggle-switch.active { background-color: #8b5cf6; /* purple-500 */ } .toggle-switch.active::after { left: 18px; } .toggle-label { position: absolute; width: 1px; height: 1px; padding: 0; margin: -1px; overflow: hidden; clip: rect(0, 0, 0, 0); white-space: nowrap; border-width: 0; } .toggle-text { font-size: 0.75rem; color: #6b7280; /* gray-500 */ margin-right: 6px; } .calc-result { color: rgb(75, 85, 99); /* gray-700 */ } .calc-formula { color: #6b83a6; /* Subtle bluish gray */ 
font-family: ui-monospace, SFMono-Regular, Menlo, Monaco, Consolas, monospace; } .calc-equals { color: rgb(75, 85, 99); /* gray-700 */ display: inline-block; margin: 0 4px; font-weight: normal; } .cell-highlighted { font-weight: bold !important; } .trigger-text-active { background-color: #E5E7EB; padding: 1px 3px; border-radius: 3px; transition: background-color 0.15s ease-in-out; } .trigger-text-active span[data-hover-text-color] { padding: 0; background-color: transparent; } /* Add styles for no-scroll option */ .table-wrapper-no-scroll { overflow-x: visible !important; } .table-no-scroll td { white-space: normal !important; word-break: break-word; } </style> <div id="fancy-table-Metric,Predicted,Actual-wrapper" class="px-4 rounded-lg __basic-table not-prose mt-4 mb-4 overflow-x-auto"> <table id="fancy-table-Metric,Predicted,Actual" class="min-w-full divide-y divide-gray-200 font-sans basic-table-striped"> <thead class="bg-gray-50"> <tr> <th class="px-6 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Metric </th> <th class="px-6 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Predicted </th> <th class="px-6 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Actual </th> </tr> </thead> <tbody> <tr class="border-b border-gray-200"> <td id="fancy-table-Metric,Predicted,Actual-row0-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Read Latency (1MB)</span> </td> <td id="fancy-table-Metric,Predicted,Actual-row0-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">0.48ms</span> </td> <td id="fancy-table-Metric,Predicted,Actual-row0-col2" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">1.09ms (127% worse)</span> </td> </tr> <tr class="border-b border-gray-200"> <td 
id="fancy-table-Metric,Predicted,Actual-row1-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Read P99 Latency</span> </td> <td id="fancy-table-Metric,Predicted,Actual-row1-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">304ms</span> </td> <td id="fancy-table-Metric,Predicted,Actual-row1-col2" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">194ms (36% better)</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="fancy-table-Metric,Predicted,Actual-row2-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Read Bandwidth</span> </td> <td id="fancy-table-Metric,Predicted,Actual-row2-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">11.5 GB/s</span> </td> <td id="fancy-table-Metric,Predicted,Actual-row2-col2" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">10.3 GB/s (10% worse)</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="fancy-table-Metric,Predicted,Actual-row3-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Write Latency (1MB)</span> </td> <td id="fancy-table-Metric,Predicted,Actual-row3-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">1.38ms</span> </td> <td id="fancy-table-Metric,Predicted,Actual-row3-col2" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">2.55ms (85% worse)</span> </td> </tr> <tr class="border-b border-gray-200"> <td 
id="fancy-table-Metric,Predicted,Actual-row4-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Write P99 Latency</span> </td> <td id="fancy-table-Metric,Predicted,Actual-row4-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">0.89s</span> </td> <td id="fancy-table-Metric,Predicted,Actual-row4-col2" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">1.1s (24% worse)</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="fancy-table-Metric,Predicted,Actual-row5-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Write Bandwidth</span> </td> <td id="fancy-table-Metric,Predicted,Actual-row5-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">2.1 GB/s</span> </td> <td id="fancy-table-Metric,Predicted,Actual-row5-col2" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">1.8 GB/s (14% worse)</span> </td> </tr> </tbody> </table> </div> <p>The roughly 2x latency overhead for reads and writes may come from the software side of things<span class="sidenote-ref"></span><span class="sidenote">We’ll have to dig deeper later to see why</span>. One interesting result is that P99 latency is better than predicted for reads, because the network bandwidth caps throughput before storage hits its worst-case scenarios. 
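</p>

<p>As a sanity check, the percentage deltas in the table can be recomputed from the predicted/actual pairs. The values below are copied from the table above; this is just a worked check of the arithmetic, not part of the benchmark harness:</p>

```python
# Recompute the predicted-vs-actual deltas shown in the table above.
# Each delta is relative to the predicted value.
rows = [
    # (metric, predicted, actual, lower_is_better)
    ("Read Latency (1MB)",  0.48, 1.09, True),   # ms
    ("Read P99 Latency",    304,  194,  True),   # ms
    ("Read Bandwidth",      11.5, 10.3, False),  # GB/s
    ("Write Latency (1MB)", 1.38, 2.55, True),   # ms
]

for name, predicted, actual, lower_is_better in rows:
    pct = abs(actual - predicted) / predicted
    # For latency, exceeding the prediction is "worse";
    # for bandwidth, falling below it is "worse".
    got_worse = actual > predicted if lower_is_better else actual < predicted
    print(f"{name}: {pct:.0%} {'worse' if got_worse else 'better'}")
```

<p>The read and write latency rows come out to 127% and 85% worse and P99 to 36% better, matching the table.</p>

<p>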
What’s nice to see is that the bandwidth only decreases by 10-15%!</p> <h2 id="3fs-1">3FS</h2> <p>Now we examine how 3FS scales with block size and node count on the older cluster (SATA SSDs + 25 Gbps networking).</p> <h3 id="scaling-block-size-5-nodes">Scaling block size (5 nodes)</h3> <!-- benchmark.html --> <div class="benchmark-container" id="benchmark-container-hf3fs_xfs_usrbio_4k" data-path="/assets/images/posts/2025-03-13/fio/4k_hf3fs_xfs_usrbio_xl170_5.json"> <h2>4K Block Size - HF3FS XFS with USRBIO (Older-5-Nodes)</h2> <div class="controls"> <div class="control-group"> <label for="testType-hf3fs_xfs_usrbio_4k">Test Type</label> <select id="testType-hf3fs_xfs_usrbio_4k"> <option value="randread">Random Read</option> <option value="read" selected="">Sequential Read</option> <option value="randwrite">Random Write</option> <option value="write">Sequential Write</option> </select> </div> <div class="control-group"> <label for="metricType-hf3fs_xfs_usrbio_4k">Metric</label> <select id="metricType-hf3fs_xfs_usrbio_4k"> <option value="bandwidth" selected="">Bandwidth (GB/s)</option> <option value="iops">IOPS</option> <option value="latency">Latency (μs)</option> <option value="latency_p50">Latency p50 (μs)</option> <option value="latency_p90">Latency p90 (μs)</option> <option value="latency_p99">Latency p99 (μs)</option> </select> </div> </div> <div id="benchmark-plot-hf3fs_xfs_usrbio_4k" class="plot-container lazy-load"></div> <!-- Draggable panel for latency data --> <div id="latencyPanel-hf3fs_xfs_usrbio_4k" class="benchmark-draggable-panel"> <div id="panelHeader-hf3fs_xfs_usrbio_4k" class="panel-header"> <h3 class="panel-title" id="panelTitle-hf3fs_xfs_usrbio_4k">Latency Percentiles</h3> <div class="panel-controls"> <button id="collapseBtn-hf3fs_xfs_usrbio_4k" class="collapse-btn" title="Collapse">▲</button> <button id="closeLatencyBtn-hf3fs_xfs_usrbio_4k" class="close-btn" title="Close">×</button> </div> </div> <div class="panel-content"> <div 
class="latency-details" id="latencyDetails-hf3fs_xfs_usrbio_4k"></div> <div id="latencyPlot-hf3fs_xfs_usrbio_4k" class="latency-plot-container"></div> </div> </div> <div class="benchmark-note"> <p>Performance of HF3FS with XFS filesystem using USRBIO driver with 4K block size on older cluster.</p> </div> <script> document.addEventListener('DOMContentLoaded', function() { // Create intersection observer for lazy loading const observer = new IntersectionObserver((entries, observer) => { entries.forEach(entry => { if (entry.isIntersecting) { const container = entry.target.closest('.benchmark-container'); const id = container.id.replace('benchmark-container-', ''); const plotEl = document.getElementById('benchmark-plot-' + id); // Check if already loading or loaded to prevent duplicate requests if (!plotEl.dataset.loading) { plotEl.dataset.loading = 'true'; // Load the benchmark data and initialize the plot loadBenchmarkData(id); } // Stop observing once we've started loading observer.unobserve(entry.target); } }); }, { rootMargin: '200px 0px', // Load when within 200px of viewport threshold: 0.01 }); // Start observing the plot container const plotContainer = document.getElementById('benchmark-plot-hf3fs_xfs_usrbio_4k'); if (plotContainer) { observer.observe(plotContainer); } }); // Function to load benchmark data function loadBenchmarkData(id) { // Ensure benchmark.js is loaded before initializing function waitForBenchmarkJs() { if (typeof initBenchmarkPlot === 'function') { // Function exists, proceed with initialization const container = document.getElementById('benchmark-container-' + id); const dataPath = container.getAttribute('data-path'); if (dataPath) { // Use fetch API to load JSON data only when needed fetch(dataPath) .then(response => { if (!response.ok) { throw new Error(`HTTP error! 
Status: ${response.status}`); } return response.json(); }) .then(data => { // Store data and initialize the plot window['benchmarkData_' + id] = data; initBenchmarkPlot(id); document.getElementById('benchmark-plot-' + id).classList.remove('lazy-load'); }) .catch(error => { console.error('Error loading benchmark data:', error); document.getElementById('benchmark-plot-' + id).innerHTML = '<div style="padding: 20px; color: red;">Error loading benchmark data. Check console for details.</div>'; }); } else { console.error('No data source provided for benchmark visualization'); document.getElementById('benchmark-plot-' + id).innerHTML = '<div style="padding: 20px; color: red;">Error: No data source provided for benchmark visualization.</div>'; } } else { // Function not available yet, wait and try again setTimeout(() => waitForBenchmarkJs(), 100); } } waitForBenchmarkJs(); } </script> </div> <!-- benchmark.html --> <div class="benchmark-container" id="benchmark-container-hf3fs_xfs_usrbio_1m_xl170" data-path="/assets/images/posts/2025-03-13/fio/1m_hf3fs_ext4_usrbio_xl170_5.json"> <h2>1M Block Size - HF3FS XFS with USRBIO (Older-5-Nodes)</h2> <div class="controls"> <div class="control-group"> <label for="testType-hf3fs_xfs_usrbio_1m_xl170">Test Type</label> <select id="testType-hf3fs_xfs_usrbio_1m_xl170"> <option value="randread">Random Read</option> <option value="read" selected="">Sequential Read</option> <option value="randwrite">Random Write</option> <option value="write">Sequential Write</option> </select> </div> <div class="control-group"> <label for="metricType-hf3fs_xfs_usrbio_1m_xl170">Metric</label> <select id="metricType-hf3fs_xfs_usrbio_1m_xl170"> <option value="bandwidth" selected="">Bandwidth (GB/s)</option> <option value="iops">IOPS</option> <option value="latency">Latency (μs)</option> <option value="latency_p50">Latency p50 (μs)</option> <option value="latency_p90">Latency p90 (μs)</option> <option value="latency_p99">Latency p99 (μs)</option> </select> 
</div> </div> <div id="benchmark-plot-hf3fs_xfs_usrbio_1m_xl170" class="plot-container lazy-load"></div> <!-- Draggable panel for latency data --> <div id="latencyPanel-hf3fs_xfs_usrbio_1m_xl170" class="benchmark-draggable-panel"> <div id="panelHeader-hf3fs_xfs_usrbio_1m_xl170" class="panel-header"> <h3 class="panel-title" id="panelTitle-hf3fs_xfs_usrbio_1m_xl170">Latency Percentiles</h3> <div class="panel-controls"> <button id="collapseBtn-hf3fs_xfs_usrbio_1m_xl170" class="collapse-btn" title="Collapse">▲</button> <button id="closeLatencyBtn-hf3fs_xfs_usrbio_1m_xl170" class="close-btn" title="Close">×</button> </div> </div> <div class="panel-content"> <div class="latency-details" id="latencyDetails-hf3fs_xfs_usrbio_1m_xl170"></div> <div id="latencyPlot-hf3fs_xfs_usrbio_1m_xl170" class="latency-plot-container"></div> </div> </div> <div class="benchmark-note"> <p>Medium block (1M) performance using HF3FS with XFS filesystem and USRBIO driver on older cluster.</p> </div> <script> document.addEventListener('DOMContentLoaded', function() { // Create intersection observer for lazy loading const observer = new IntersectionObserver((entries, observer) => { entries.forEach(entry => { if (entry.isIntersecting) { const container = entry.target.closest('.benchmark-container'); const id = container.id.replace('benchmark-container-', ''); const plotEl = document.getElementById('benchmark-plot-' + id); // Check if already loading or loaded to prevent duplicate requests if (!plotEl.dataset.loading) { plotEl.dataset.loading = 'true'; // Load the benchmark data and initialize the plot loadBenchmarkData(id); } // Stop observing once we've started loading observer.unobserve(entry.target); } }); }, { rootMargin: '200px 0px', // Load when within 200px of viewport threshold: 0.01 }); // Start observing the plot container const plotContainer = document.getElementById('benchmark-plot-hf3fs_xfs_usrbio_1m_xl170'); if (plotContainer) { observer.observe(plotContainer); } }); // Function to 
load benchmark data function loadBenchmarkData(id) { // Ensure benchmark.js is loaded before initializing function waitForBenchmarkJs() { if (typeof initBenchmarkPlot === 'function') { // Function exists, proceed with initialization const container = document.getElementById('benchmark-container-' + id); const dataPath = container.getAttribute('data-path'); if (dataPath) { // Use fetch API to load JSON data only when needed fetch(dataPath) .then(response => { if (!response.ok) { throw new Error(`HTTP error! Status: ${response.status}`); } return response.json(); }) .then(data => { // Store data and initialize the plot window['benchmarkData_' + id] = data; initBenchmarkPlot(id); document.getElementById('benchmark-plot-' + id).classList.remove('lazy-load'); }) .catch(error => { console.error('Error loading benchmark data:', error); document.getElementById('benchmark-plot-' + id).innerHTML = '<div style="padding: 20px; color: red;">Error loading benchmark data. Check console for details.</div>'; }); } else { console.error('No data source provided for benchmark visualization'); document.getElementById('benchmark-plot-' + id).innerHTML = '<div style="padding: 20px; color: red;">Error: No data source provided for benchmark visualization.</div>'; } } else { // Function not available yet, wait and try again setTimeout(() => waitForBenchmarkJs(), 100); } } waitForBenchmarkJs(); } </script> </div> <p>The 4K block size stays well below the 3.25 GB/s network limit, reaching only 1 GB/s with 4ms latency. 
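</p>

<p>A quick Little’s-Law sanity check shows why: at 4 ms per IO, sustaining 1 GB/s in 4K blocks needs on the order of a thousand IOs in flight. The bandwidth and latency figures below come from the text; the job/depth combination in the output is only an illustrative way to reach that concurrency:</p>

```python
# Back-of-envelope Little's Law: throughput = in-flight IOs / mean latency.
block_size = 4 * 1024          # 4K blocks, in bytes
bandwidth = 1e9                # ~1 GB/s observed ceiling for 4K
latency = 4e-3                 # ~4 ms per IO at that point

iops = bandwidth / block_size  # IOs per second needed for that bandwidth
inflight = iops * latency      # outstanding IOs required to sustain it
print(f"{iops:,.0f} IOPS -> ~{inflight:.0f} IOs in flight "
      f"(e.g. 8 jobs x 128 IO depth = 1024)")
```

<p>
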
The 1M block size hits the network bandwidth ceiling but pays a latency penalty (6ms at 1 IO depth with 8 jobs compared to 4K’s 4ms maximum)</p> <h3 id="scaling-nodes">Scaling nodes</h3> <!-- benchmark.html --> <div class="benchmark-container" id="benchmark-container-hf3fs_xfs_usrbio_1m_xl170_5" data-path="/assets/images/posts/2025-03-13/fio/1m_hf3fs_ext4_usrbio_xl170_5.json"> <h2>1M Block Size - HF3FS XFS with USRBIO (Older-5-Nodes)</h2> <div class="controls"> <div class="control-group"> <label for="testType-hf3fs_xfs_usrbio_1m_xl170_5">Test Type</label> <select id="testType-hf3fs_xfs_usrbio_1m_xl170_5"> <option value="randread">Random Read</option> <option value="read" selected="">Sequential Read</option> <option value="randwrite">Random Write</option> <option value="write">Sequential Write</option> </select> </div> <div class="control-group"> <label for="metricType-hf3fs_xfs_usrbio_1m_xl170_5">Metric</label> <select id="metricType-hf3fs_xfs_usrbio_1m_xl170_5"> <option value="bandwidth" selected="">Bandwidth (GB/s)</option> <option value="iops">IOPS</option> <option value="latency">Latency (μs)</option> <option value="latency_p50">Latency p50 (μs)</option> <option value="latency_p90">Latency p90 (μs)</option> <option value="latency_p99">Latency p99 (μs)</option> </select> </div> </div> <div id="benchmark-plot-hf3fs_xfs_usrbio_1m_xl170_5" class="plot-container lazy-load"></div> <!-- Draggable panel for latency data --> <div id="latencyPanel-hf3fs_xfs_usrbio_1m_xl170_5" class="benchmark-draggable-panel"> <div id="panelHeader-hf3fs_xfs_usrbio_1m_xl170_5" class="panel-header"> <h3 class="panel-title" id="panelTitle-hf3fs_xfs_usrbio_1m_xl170_5">Latency Percentiles</h3> <div class="panel-controls"> <button id="collapseBtn-hf3fs_xfs_usrbio_1m_xl170_5" class="collapse-btn" title="Collapse">▲</button> <button id="closeLatencyBtn-hf3fs_xfs_usrbio_1m_xl170_5" class="close-btn" title="Close">×</button> </div> </div> <div class="panel-content"> <div class="latency-details" 
id="latencyDetails-hf3fs_xfs_usrbio_1m_xl170_5"></div> <div id="latencyPlot-hf3fs_xfs_usrbio_1m_xl170_5" class="latency-plot-container"></div> </div> </div> <div class="benchmark-note"> <p>Medium block (1M) performance using HF3FS with XFS filesystem and USRBIO driver on older cluster.</p> </div> <script> document.addEventListener('DOMContentLoaded', function() { // Create intersection observer for lazy loading const observer = new IntersectionObserver((entries, observer) => { entries.forEach(entry => { if (entry.isIntersecting) { const container = entry.target.closest('.benchmark-container'); const id = container.id.replace('benchmark-container-', ''); const plotEl = document.getElementById('benchmark-plot-' + id); // Check if already loading or loaded to prevent duplicate requests if (!plotEl.dataset.loading) { plotEl.dataset.loading = 'true'; // Load the benchmark data and initialize the plot loadBenchmarkData(id); } // Stop observing once we've started loading observer.unobserve(entry.target); } }); }, { rootMargin: '200px 0px', // Load when within 200px of viewport threshold: 0.01 }); // Start observing the plot container const plotContainer = document.getElementById('benchmark-plot-hf3fs_xfs_usrbio_1m_xl170_5'); if (plotContainer) { observer.observe(plotContainer); } }); // Function to load benchmark data function loadBenchmarkData(id) { // Ensure benchmark.js is loaded before initializing function waitForBenchmarkJs() { if (typeof initBenchmarkPlot === 'function') { // Function exists, proceed with initialization const container = document.getElementById('benchmark-container-' + id); const dataPath = container.getAttribute('data-path'); if (dataPath) { // Use fetch API to load JSON data only when needed fetch(dataPath) .then(response => { if (!response.ok) { throw new Error(`HTTP error! 
Status: ${response.status}`); } return response.json(); }) .then(data => { // Store data and initialize the plot window['benchmarkData_' + id] = data; initBenchmarkPlot(id); document.getElementById('benchmark-plot-' + id).classList.remove('lazy-load'); }) .catch(error => { console.error('Error loading benchmark data:', error); document.getElementById('benchmark-plot-' + id).innerHTML = '<div style="padding: 20px; color: red;">Error loading benchmark data. Check console for details.</div>'; }); } else { console.error('No data source provided for benchmark visualization'); document.getElementById('benchmark-plot-' + id).innerHTML = '<div style="padding: 20px; color: red;">Error: No data source provided for benchmark visualization.</div>'; } } else { // Function not available yet, wait and try again setTimeout(() => waitForBenchmarkJs(), 100); } } waitForBenchmarkJs(); } </script> </div> <!-- benchmark.html --> <div class="benchmark-container" id="benchmark-container-hf3fs_xfs_iouring_1m_xl170_18" data-path="/assets/images/posts/2025-03-13/fio/1m_hf3fs_xfs_usrbio_xl170_18.json"> <h2>1M Block Size - HF3FS XFS with IO_URING (Older-18-Nodes)</h2> <div class="controls"> <div class="control-group"> <label for="testType-hf3fs_xfs_iouring_1m_xl170_18">Test Type</label> <select id="testType-hf3fs_xfs_iouring_1m_xl170_18"> <option value="randread">Random Read</option> <option value="read" selected="">Sequential Read</option> <option value="randwrite">Random Write</option> <option value="write">Sequential Write</option> </select> </div> <div class="control-group"> <label for="metricType-hf3fs_xfs_iouring_1m_xl170_18">Metric</label> <select id="metricType-hf3fs_xfs_iouring_1m_xl170_18"> <option value="bandwidth" selected="">Bandwidth (GB/s)</option> <option value="iops">IOPS</option> <option value="latency">Latency (μs)</option> <option value="latency_p50">Latency p50 (μs)</option> <option value="latency_p90">Latency p90 (μs)</option> <option value="latency_p99">Latency p99 
(μs)</option> </select> </div> </div> <div id="benchmark-plot-hf3fs_xfs_iouring_1m_xl170_18" class="plot-container lazy-load"></div> <!-- Draggable panel for latency data --> <div id="latencyPanel-hf3fs_xfs_iouring_1m_xl170_18" class="benchmark-draggable-panel"> <div id="panelHeader-hf3fs_xfs_iouring_1m_xl170_18" class="panel-header"> <h3 class="panel-title" id="panelTitle-hf3fs_xfs_iouring_1m_xl170_18">Latency Percentiles</h3> <div class="panel-controls"> <button id="collapseBtn-hf3fs_xfs_iouring_1m_xl170_18" class="collapse-btn" title="Collapse">▲</button> <button id="closeLatencyBtn-hf3fs_xfs_iouring_1m_xl170_18" class="close-btn" title="Close">×</button> </div> </div> <div class="panel-content"> <div class="latency-details" id="latencyDetails-hf3fs_xfs_iouring_1m_xl170_18"></div> <div id="latencyPlot-hf3fs_xfs_iouring_1m_xl170_18" class="latency-plot-container"></div> </div> </div> <div class="benchmark-note"> <p>Performance of HF3FS with XFS filesystem using IO_URING driver with 1M blocks on 18 node configuration.</p> </div> <script> document.addEventListener('DOMContentLoaded', function() { // Create intersection observer for lazy loading const observer = new IntersectionObserver((entries, observer) => { entries.forEach(entry => { if (entry.isIntersecting) { const container = entry.target.closest('.benchmark-container'); const id = container.id.replace('benchmark-container-', ''); const plotEl = document.getElementById('benchmark-plot-' + id); // Check if already loading or loaded to prevent duplicate requests if (!plotEl.dataset.loading) { plotEl.dataset.loading = 'true'; // Load the benchmark data and initialize the plot loadBenchmarkData(id); } // Stop observing once we've started loading observer.unobserve(entry.target); } }); }, { rootMargin: '200px 0px', // Load when within 200px of viewport threshold: 0.01 }); // Start observing the plot container const plotContainer = document.getElementById('benchmark-plot-hf3fs_xfs_iouring_1m_xl170_18'); if 
(plotContainer) { observer.observe(plotContainer); } }); // Function to load benchmark data function loadBenchmarkData(id) { // Ensure benchmark.js is loaded before initializing function waitForBenchmarkJs() { if (typeof initBenchmarkPlot === 'function') { // Function exists, proceed with initialization const container = document.getElementById('benchmark-container-' + id); const dataPath = container.getAttribute('data-path'); if (dataPath) { // Use fetch API to load JSON data only when needed fetch(dataPath) .then(response => { if (!response.ok) { throw new Error(`HTTP error! Status: ${response.status}`); } return response.json(); }) .then(data => { // Store data and initialize the plot window['benchmarkData_' + id] = data; initBenchmarkPlot(id); document.getElementById('benchmark-plot-' + id).classList.remove('lazy-load'); }) .catch(error => { console.error('Error loading benchmark data:', error); document.getElementById('benchmark-plot-' + id).innerHTML = '<div style="padding: 20px; color: red;">Error loading benchmark data. Check console for details.</div>'; }); } else { console.error('No data source provided for benchmark visualization'); document.getElementById('benchmark-plot-' + id).innerHTML = '<div style="padding: 20px; color: red;">Error: No data source provided for benchmark visualization.</div>'; } } else { // Function not available yet, wait and try again setTimeout(() => waitForBenchmarkJs(), 100); } } waitForBenchmarkJs(); } </script> </div> <p>Comparing 5 vs 18 nodes with 1M blocks shows latency increases with cluster size. At 18 nodes, scaling jobs works better than scaling IO depth for latency: 8 jobs/1 IO depth achieves 10ms @ 1.25 GB/s while 1 job/128 IO depth hits 90ms @ 1 GB/s.</p> <p>With 18 nodes at 300 MB/s each, we’d expect 5.4 GB/s total, but the 25 Gbps network caps us at 3.25 GB/s and realistically we get 2.35 GB/s.</p> <p>One glaring issue is that after a certain point, the throughput drops rather significantly. 
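</p>

<p>For reference, the expected-vs-realized numbers above reduce to a min() over the two pipes. A small sketch with the figures quoted in the text (18 nodes at ~300 MB/s, the 25 Gbps link quoted as 3.25 GB/s, and 2.35 GB/s measured):</p>

```python
# Deliverable bandwidth is bounded by min(aggregate storage, network link).
nodes = 18
per_node_gbps = 0.3                      # GB/s per SATA-SSD node
network_gbps = 3.25                      # GB/s client network limit
measured_gbps = 2.35                     # GB/s actually achieved

storage_total = nodes * per_node_gbps    # aggregate storage bandwidth
ceiling = min(storage_total, network_gbps)  # network-bound here
efficiency = measured_gbps / ceiling
print(f"storage {storage_total:.1f} GB/s, ceiling {ceiling} GB/s, "
      f"achieved {efficiency:.0%} of ceiling")
```

<p>
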
The local results, in contrast, hold their bandwidth. I’m not entirely sure yet why that is, but it makes configuration seem even more important, since throughput can decrease drastically.</p> <h3 id="watch-out-for-really-large-block-sizes">Watch out for really large block sizes</h3> <!-- benchmark.html --> <div class="benchmark-container" id="benchmark-container-hf3fs_xfs_usrbio_4m_xl170" data-path="/assets/images/posts/2025-03-13/fio/4m_hf3fs_xfs_usrbio_xl170_18.json"> <h2>4M Block Size - HF3FS XFS with USRBIO (Older-18)</h2> <div class="controls"> <div class="control-group"> <label for="testType-hf3fs_xfs_usrbio_4m_xl170">Test Type</label> <select id="testType-hf3fs_xfs_usrbio_4m_xl170"> <option value="randread">Random Read</option> <option value="read" selected="">Sequential Read</option> <option value="randwrite">Random Write</option> <option value="write">Sequential Write</option> </select> </div> <div class="control-group"> <label for="metricType-hf3fs_xfs_usrbio_4m_xl170">Metric</label> <select id="metricType-hf3fs_xfs_usrbio_4m_xl170"> <option value="bandwidth" selected="">Bandwidth (GB/s)</option> <option value="iops">IOPS</option> <option value="latency">Latency (μs)</option> <option value="latency_p50">Latency p50 (μs)</option> <option value="latency_p90">Latency p90 (μs)</option> <option value="latency_p99">Latency p99 (μs)</option> </select> </div> </div> <div id="benchmark-plot-hf3fs_xfs_usrbio_4m_xl170" class="plot-container lazy-load"></div> <!-- Draggable panel for latency data --> <div id="latencyPanel-hf3fs_xfs_usrbio_4m_xl170" class="benchmark-draggable-panel"> <div id="panelHeader-hf3fs_xfs_usrbio_4m_xl170" class="panel-header"> <h3 class="panel-title" id="panelTitle-hf3fs_xfs_usrbio_4m_xl170">Latency Percentiles</h3> <div class="panel-controls"> <button id="collapseBtn-hf3fs_xfs_usrbio_4m_xl170" class="collapse-btn" title="Collapse">▲</button> <button id="closeLatencyBtn-hf3fs_xfs_usrbio_4m_xl170" class="close-btn" title="Close">×</button> </div> </div> <div 
class="panel-content"> <div class="latency-details" id="latencyDetails-hf3fs_xfs_usrbio_4m_xl170"></div> <div id="latencyPlot-hf3fs_xfs_usrbio_4m_xl170" class="latency-plot-container"></div> </div> </div> <div class="benchmark-note"> <p>Large block (4M) performance using HF3FS with XFS filesystem and USRBIO driver on 18 node configuration.</p> </div> <script> document.addEventListener('DOMContentLoaded', function() { // Create intersection observer for lazy loading const observer = new IntersectionObserver((entries, observer) => { entries.forEach(entry => { if (entry.isIntersecting) { const container = entry.target.closest('.benchmark-container'); const id = container.id.replace('benchmark-container-', ''); const plotEl = document.getElementById('benchmark-plot-' + id); // Check if already loading or loaded to prevent duplicate requests if (!plotEl.dataset.loading) { plotEl.dataset.loading = 'true'; // Load the benchmark data and initialize the plot loadBenchmarkData(id); } // Stop observing once we've started loading observer.unobserve(entry.target); } }); }, { rootMargin: '200px 0px', // Load when within 200px of viewport threshold: 0.01 }); // Start observing the plot container const plotContainer = document.getElementById('benchmark-plot-hf3fs_xfs_usrbio_4m_xl170'); if (plotContainer) { observer.observe(plotContainer); } }); // Function to load benchmark data function loadBenchmarkData(id) { // Ensure benchmark.js is loaded before initializing function waitForBenchmarkJs() { if (typeof initBenchmarkPlot === 'function') { // Function exists, proceed with initialization const container = document.getElementById('benchmark-container-' + id); const dataPath = container.getAttribute('data-path'); if (dataPath) { // Use fetch API to load JSON data only when needed fetch(dataPath) .then(response => { if (!response.ok) { throw new Error(`HTTP error! 
Status: ${response.status}`); } return response.json(); }) .then(data => { // Store data and initialize the plot window['benchmarkData_' + id] = data; initBenchmarkPlot(id); document.getElementById('benchmark-plot-' + id).classList.remove('lazy-load'); }) .catch(error => { console.error('Error loading benchmark data:', error); document.getElementById('benchmark-plot-' + id).innerHTML = '<div style="padding: 20px; color: red;">Error loading benchmark data. Check console for details.</div>'; }); } else { console.error('No data source provided for benchmark visualization'); document.getElementById('benchmark-plot-' + id).innerHTML = '<div style="padding: 20px; color: red;">Error: No data source provided for benchmark visualization.</div>'; } } else { // Function not available yet, wait and try again setTimeout(() => waitForBenchmarkJs(), 100); } } waitForBenchmarkJs(); } </script> </div> <p>For 4M blocks, 3FS achieves 2.5 GB/s with just 1 IO depth and 8 jobs<span class="sidenote-ref"></span><span class="sidenote">This approaches 77% of the theoretical 3.25 GB/s network limit.</span>. Increasing the node count or the block size shifts the curves, but only modestly.</p> <h2 id="wrapping-up">Wrapping up</h2> <p>The microbenchmarks reveal concrete performance characteristics for 3FS across different hardware configurations. We now have baseline numbers showing how 3FS compares to local storage and where the bottlenecks emerge.</p> <ul> <li>3FS adds predictable overhead: ~1ms for reads, ~1.2ms for writes</li> <li>Network bandwidth becomes the limiting factor before storage saturation</li> <li>Performance scales reasonably with both block size and node count</li> </ul> <p>The next step is testing 3FS with actual workloads to see how well this performance translates to practice. 
Since 3FS has a relatively generic interface, we can compare with many other systems.</p> <h1 id="citation">Citation</h1> <p>To cite this article:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@article{zhu20253fs3, title = {Network Storage and Scaling Characteristics of a Distributed Filesystem}, author = {Zhu, Henry}, journal = {maknee.github.io}, year = {2025}, month = {September}, url = "https://maknee.github.io/blog/2025/3FS-Performance-Journal-3/" } </code></pre></div></div> Network and Storage Benchmarks for LLM Training on the Cloud 2025-09-11T09:00:00+00:00 2025-09-11T09:00:00+00:00 https://maknee.github.io/blog/2025/Network-And-Storage-Training-Skypilot <div class="image-container" style="max-width: 100%; overflow: visible;"> <img loading="lazy" src="/assets/images/posts/2025-09-01/banner.png" style="width: 125%; margin-left: calc((100% - 125%) / 2);" alt="" /> </div> <p>AI usage has become universal. Teams everywhere are building RAG, generating embeddings, and training increasingly sophisticated agents.</p> <p>Most distributed LLM training guides focus on model architecture and hyperparameters while ignoring a critical bottleneck: infrastructure configuration. Network and storage choices often determine whether training takes hours or days.</p> <p>I ran benchmarks finetuning <a href="https://huggingface.co/google/gemma-3-12b-it">Gemma 3 12B</a> and <a href="https://huggingface.co/openai/gpt-oss-120b">GPT-OSS-120B</a> with different storage and network configurations using <a href="https://github.com/skypilot-org/skypilot">SkyPilot</a> for infra and <a href="https://nebius.com/">Nebius</a> for GPUs. The results reveal that InfiniBand networking provides 10x faster training than standard Ethernet, while optimal storage selection can speed up checkpointing by almost 2x. 
Combined, these two infrastructure optimizations alone deliver a 6-7x end-to-end speedup.</p> <h2 id="some-background-on-training-bottlenecks">Some background on training bottlenecks</h2> <p>Here’s something that surprises most people new to large-scale training: your GPUs are most likely not the limiting factor. Modern accelerators like H200s will happily consume whatever data you can feed them. The real challenge is keeping them fed.</p> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-09-01/compute.png" width="100%" alt="" /> <div class="caption"> <em>GPU compute scaling vs memory/network bandwidth (Image source: <a href="https://horace.io/brrr_intro.html" rel="external nofollow noopener" target="_blank">horace</a>) </em> </div> </div> <p>Think of your GPU as an extremely efficient factory. It can process raw materials (your data) at incredible speeds, but it depends entirely on a steady supply chain. Your storage systems hold the raw materials, and the bandwidth between storage and compute acts as the conveyor belt.
These days, that conveyor belt has become the constraint.</p> <p>While GPU compute capability has grown exponentially, memory bandwidth and network speeds have followed a more modest trajectory.</p> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-09-01/high_flyer_scaling.png" width="100%" alt="" /> <div class="caption"> <em>Scaling trends in compute vs bandwidth (Image source: <a href="https://arxiv.org/html/2408.14158v1" rel="external nofollow noopener" target="_blank">Fire-Flyer AI-HPC: A Cost-Effective Software-Hardware Co-Design for Deep Learning</a>) </em> </div> </div> <h2 id="the-two-levers-you-control">The two levers you control</h2> <p>When running distributed training, you have meaningful control over two critical components: storage and networking, especially when running on cloud GPUs.</p> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-09-01/components.png" width="100%" alt="" /> </div> <p>The objective is straightforward: maximize GPU utilization (or in other words, minimize GPU idleness). But achieving this requires understanding how data flows through your training pipeline and where bottlenecks typically emerge.</p> <h3 id="the-training-data-flow">The training data flow</h3> <p>During training, data moves through these stages:</p> <ol> <li><strong>Load batches</strong> from dataset – storage</li> <li><strong>Communicate gradients</strong> between nodes – network</li> <li><strong>Dump checkpoint</strong> to save progress – storage</li> </ol> <p>In any of these steps, bottlenecks can emerge. For example, loading datasets from or saving checkpoints to storage might take extraordinarily long and block GPU progress. 
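</p>

<p>As a rough illustration, the three stages map onto a training loop like this (toy stubs standing in for a real framework; each commented line is a point where the GPUs can sit idle waiting on I/O):</p>

```python
# Toy sketch of one distributed training step, marking the three I/O
# touchpoints listed above. The stubs only simulate the data flow.

def load_batch(step):                    # 1. load batch      -> storage-bound
    return [float(step + i) for i in range(4)]

def all_reduce(grads, world_size=2):     # 2. sync gradients  -> network-bound
    return [g / world_size for g in grads]  # pretend peers contributed equally

def save_checkpoint(weights, step):      # 3. dump checkpoint -> storage-bound
    return f"ckpt-step{step}"            # a real version writes GBs to disk

weights = [0.0] * 4
for step in range(4):
    batch = load_batch(step)
    grads = [w - x for w, x in zip(weights, batch)]  # stand-in for backward()
    grads = all_reduce(grads)
    weights = [w - 0.1 * g for w, g in zip(weights, grads)]
    if step % 2 == 0:
        print(save_checkpoint(weights, step))  # -> ckpt-step0, ckpt-step2
```

<p>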
Or inter-node network bandwidth might be insufficient for the communication operations that synchronize weights and gradients.</p> <h2 id="performance-benchmarks">Performance benchmarks</h2> <p>I’ll use two concrete examples throughout:</p> <ul> <li>Google <a href="https://huggingface.co/google/gemma-3-12b-it">Gemma 3 12B</a> on 2 nodes × H100:8 GPUs</li> <li>OpenAI <a href="https://huggingface.co/openai/gpt-oss-120b">GPT-OSS-120B</a> on 4 nodes × H200:8 GPUs</li> </ul> <p>I ran some experiments on Nebius, a Gold-tier GPU provider in <a href="https://semianalysis.com/2025/03/26/the-gpu-cloud-clustermax-rating-system-how-to-rent-gpus/">SemiAnalysis’s ClusterMax GPU cloud rating</a>, to quantify these effects.</p> <details> <summary>Click to see experimental setup</summary> Gemma 3 12B IT Configuration <link rel="stylesheet" href="/assets/css/fancy_table.css" /> <script> function toggleCalculations(tableId) { // ... (toggleCalculations function remains the same) const table = document.getElementById(tableId); const cells = table.querySelectorAll('.has-calculation'); const tableWrapper = document.getElementById(tableId + '-wrapper'); const toggleSwitch = document.getElementById(tableId + '-switch'); const toggleText = document.getElementById(tableId + '-toggle-text'); if (tableWrapper.classList.contains('show-calculations')) { tableWrapper.classList.remove('show-calculations'); toggleSwitch.classList.remove('active'); toggleText.textContent = "Show calculations"; cells.forEach(cell => { const normalText = cell.querySelector('.normal-text'); const calculationText = cell.querySelector('.calculation-text'); normalText.style.display = 'inline'; calculationText.style.display = 'none'; }); } else { tableWrapper.classList.add('show-calculations'); toggleSwitch.classList.add('active'); toggleText.textContent = "Hide calculations"; cells.forEach(cell => { const normalText = cell.querySelector('.normal-text'); const calculationText = cell.querySelector('.calculation-text');
normalText.style.display = 'none'; calculationText.style.display = 'inline'; }); } } function isConsideredYellow(colorString) { if (!colorString) return false; const lowerColor = colorString.toLowerCase(); const yellowKeywords = ['yellow', 'gold', 'lemon', 'chiffon', 'goldenrod', 'papayawhip', 'moccasin', 'khaki', '#ff0', '#ffff00', '#ffffe0', '#fffacd', '#fafad2', '#fff8dc', '#eee8aa', '#f0e68c']; if (yellowKeywords.some(k => lowerColor.includes(k))) { return true; } const match = lowerColor.match(/rgba?\((\d+),\s*(\d+),\s*(\d+)/); if (match) { const r = parseInt(match[1]); const g = parseInt(match[2]); const b = parseInt(match[3]); if (r > 200 && g > 180 && b < 200 && Math.abs(r - g) < 70) { return true; } } if (lowerColor === '#ffdab9') return true; // PeachPuff if (lowerColor === 'rgba(255,255,0,0.2)') return true; // Example: semi-transparent yellow return false; } function initializeCellHighlighters() { const highlighters = document.querySelectorAll('[data-highlight-cells]'); highlighters.forEach(highlighter => { if (highlighter.dataset.highlighterInitialized === 'true') return; highlighter.dataset.highlighterInitialized = 'true'; const cellIdsToHighlight = highlighter.dataset.highlightCells.split(',').map(id => id.trim()); const cellElements = cellIdsToHighlight.map(id => document.getElementById(id)).filter(el => el); let descriptiveSpans = []; if (highlighter.matches('span[data-hover-text-color]')) { descriptiveSpans.push(highlighter); } else { descriptiveSpans = Array.from(highlighter.querySelectorAll('span[data-hover-text-color]')); } descriptiveSpans.forEach(span => { if (typeof span.dataset.originalParaTextColor === 'undefined') { span.dataset.originalParaTextColor = window.getComputedStyle(span).color || ''; } }); highlighter.addEventListener('mouseenter', () => { highlighter.classList.add('trigger-text-active'); descriptiveSpans.forEach(span => { if (span.dataset.hoverTextColor) { span.style.color = span.dataset.hoverTextColor; } }); 
cellElements.forEach((cell, idx) => { cell.classList.add('cell-highlighted'); let hoverTextColorForCell = null; let hoverCellBgFromDescSpan = null; if (idx < descriptiveSpans.length) { const descSpan = descriptiveSpans[idx]; if (descSpan.dataset.hoverTextColor) { hoverTextColorForCell = descSpan.dataset.hoverTextColor; } if (descSpan.dataset.hoverCellBg) { hoverCellBgFromDescSpan = descSpan.dataset.hoverCellBg; } } else if (descriptiveSpans.length === 1 && cellElements.length > 1) { const descSpan = descriptiveSpans[0]; if (descSpan.dataset.hoverTextColor) { hoverTextColorForCell = descSpan.dataset.hoverTextColor; } if (descSpan.dataset.hoverCellBg) { hoverCellBgFromDescSpan = descSpan.dataset.hoverCellBg; } } if (hoverTextColorForCell) { const isCalcCell = cell.classList.contains('has-calculation'); if (isCalcCell) { const normalTextSpan = cell.querySelector('.normal-text'); const calcResultSpan = cell.querySelector('.calc-result'); const calcWrapper = cell.querySelector('.calculation-text'); if (calcWrapper && window.getComputedStyle(calcWrapper).display !== 'none') { if (calcResultSpan) calcResultSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } else { if (normalTextSpan) normalTextSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } } else { const directSpan = cell.querySelector(':scope > span'); if (directSpan) directSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } } if (hoverCellBgFromDescSpan && isConsideredYellow(hoverCellBgFromDescSpan)) { if (typeof cell.dataset.originalBgColor === 'undefined') { cell.dataset.originalBgColor = cell.style.backgroundColor; } cell.style.backgroundColor = hoverCellBgFromDescSpan; cell.dataset.bgChangedByScript = 'true'; } }); }); highlighter.addEventListener('mouseleave', () => { highlighter.classList.remove('trigger-text-active'); descriptiveSpans.forEach(span => { if (typeof span.dataset.originalParaTextColor !== 'undefined') { span.style.color = 
span.dataset.originalParaTextColor; } }); cellElements.forEach((cell, idx) => { cell.classList.remove('cell-highlighted'); const isCalcCell = cell.classList.contains('has-calculation'); if (isCalcCell) { const normalTextSpan = cell.querySelector('.normal-text'); const calcResultSpan = cell.querySelector('.calc-result'); if (normalTextSpan) normalTextSpan.style.removeProperty('color'); if (calcResultSpan) calcResultSpan.style.removeProperty('color'); } else { const directSpan = cell.querySelector(':scope > span'); if (directSpan) directSpan.style.removeProperty('color'); } if (cell.dataset.bgChangedByScript === 'true') { cell.style.backgroundColor = cell.dataset.originalBgColor || ''; delete cell.dataset.originalBgColor; delete cell.dataset.bgChangedByScript; } }); }); }); } if (document.readyState === 'loading') { document.addEventListener('DOMContentLoaded', initializeCellHighlighters); } else { initializeCellHighlighters(); } </script> <style> /* ... (Other styles remain the same) ... 
*/ .toggle-container { display: flex; justify-content: flex-end; margin-bottom: 6px; } .calc-toggle { display: inline-flex; align-items: center; } .toggle-switch { position: relative; display: inline-block; width: 32px; height: 16px; background-color: #e5e7eb; /* gray-200 */ border-radius: 10px; transition: all 0.3s; cursor: pointer; } .toggle-switch::after { content: ''; position: absolute; width: 12px; height: 12px; border-radius: 50%; background-color: white; top: 2px; left: 2px; transition: all 0.3s cubic-bezier(0.175, 0.885, 0.32, 1.275); box-shadow: 0 1px 2px rgba(0, 0, 0, 0.1); } .toggle-switch.active { background-color: #8b5cf6; /* purple-500 */ } .toggle-switch.active::after { left: 18px; } .toggle-label { position: absolute; width: 1px; height: 1px; padding: 0; margin: -1px; overflow: hidden; clip: rect(0, 0, 0, 0); white-space: nowrap; border-width: 0; } .toggle-text { font-size: 0.75rem; color: #6b7280; /* gray-500 */ margin-right: 6px; } .calc-result { color: rgb(75, 85, 99); /* gray-700 */ } .calc-formula { color: #6b83a6; /* Subtle bluish gray */ font-family: ui-monospace, SFMono-Regular, Menlo, Monaco, Consolas, monospace; } .calc-equals { color: rgb(75, 85, 99); /* gray-700 */ display: inline-block; margin: 0 4px; font-weight: normal; } .cell-highlighted { font-weight: bold !important; } .trigger-text-active { background-color: #E5E7EB; padding: 1px 3px; border-radius: 3px; transition: background-color 0.15s ease-in-out; } .trigger-text-active span[data-hover-text-color] { padding: 0; background-color: transparent; } /* Add styles for no-scroll option */ .table-wrapper-no-scroll { overflow-x: visible !important; } .table-no-scroll td { white-space: normal !important; word-break: break-word; } </style> <div id="fancy-table-Component,Specification-wrapper" class="px-4 rounded-lg __basic-table not-prose mt-4 mb-4 overflow-x-auto"> <table id="fancy-table-Component,Specification" class="min-w-full divide-y divide-gray-200 font-sans basic-table-striped"> 
<thead class="bg-gray-50"> <tr> <th class="px-6 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Component </th> <th class="px-6 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Specification </th> </tr> </thead> <tbody> <tr class="border-b border-gray-200"> <td id="fancy-table-Component,Specification-row0-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Cloud Provider</span> </td> <td id="fancy-table-Component,Specification-row0-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Nebius</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="fancy-table-Component,Specification-row1-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Model</span> </td> <td id="fancy-table-Component,Specification-row1-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Gemma 3 12B IT (Hugging Face)</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="fancy-table-Component,Specification-row2-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Nodes</span> </td> <td id="fancy-table-Component,Specification-row2-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">2</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="fancy-table-Component,Specification-row3-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">GPUs per Node</span> </td> <td id="fancy-table-Component,Specification-row3-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">8x 
H100s</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="fancy-table-Component,Specification-row4-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Total GPUs</span> </td> <td id="fancy-table-Component,Specification-row4-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">16x H100s</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="fancy-table-Component,Specification-row5-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">CPU Memory</span> </td> <td id="fancy-table-Component,Specification-row5-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">1.5 TB</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="fancy-table-Component,Specification-row6-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Framework</span> </td> <td id="fancy-table-Component,Specification-row6-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Hugging Face Accelerate with FSDP</span> </td> </tr> </tbody> </table> </div> GPT-OSS-120B Configuration <div id="fancy-table-Component,Specification-wrapper" class="px-4 rounded-lg __basic-table not-prose mt-4 mb-4 overflow-x-auto"> <table
id="fancy-table-Component,Specification" class="min-w-full divide-y divide-gray-200 font-sans basic-table-striped"> <thead class="bg-gray-50"> <tr> <th class="px-6 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Component </th> <th class="px-6 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Specification </th> </tr> </thead> <tbody> <tr class="border-b border-gray-200"> <td id="fancy-table-Component,Specification-row0-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Cloud Provider</span> </td> <td id="fancy-table-Component,Specification-row0-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Nebius</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="fancy-table-Component,Specification-row1-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Model</span> </td> <td id="fancy-table-Component,Specification-row1-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">GPT-OSS-120B</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="fancy-table-Component,Specification-row2-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Nodes</span> </td> <td id="fancy-table-Component,Specification-row2-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">4</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="fancy-table-Component,Specification-row3-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">GPUs per Node</span> </td> <td id="fancy-table-Component,Specification-row3-col1" class="px-6 py-2 
whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">8x H200s</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="fancy-table-Component,Specification-row4-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Total GPUs</span> </td> <td id="fancy-table-Component,Specification-row4-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">32x H200s</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="fancy-table-Component,Specification-row5-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Framework</span> </td> <td id="fancy-table-Component,Specification-row5-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Hugging Face Accelerate with FSDP</span> </td> </tr> </tbody> </table> </div> <strong>Network configurations tested</strong> <div id="fancy-table-Configuration,Specification,Theoretical Bandwidth-wrapper" class="px-4 rounded-lg __basic-table not-prose mt-4 mb-4 overflow-x-auto">
<table id="fancy-table-Configuration,Specification,Theoretical Bandwidth" class="min-w-full divide-y divide-gray-200 font-sans basic-table-striped"> <thead class="bg-gray-50"> <tr> <th class="px-6 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Configuration </th> <th class="px-6 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Specification </th> <th class="px-6 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Theoretical Bandwidth </th> </tr> </thead> <tbody> <tr class="border-b border-gray-200"> <td id="fancy-table-Configuration,Specification,Theoretical Bandwidth-row0-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Default Ethernet</span> </td> <td id="fancy-table-Configuration,Specification,Theoretical Bandwidth-row0-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">10 Gbit/s NIC</span> </td> <td id="fancy-table-Configuration,Specification,Theoretical Bandwidth-row0-col2" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">~1.25 GB/s</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="fancy-table-Configuration,Specification,Theoretical Bandwidth-row1-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">InfiniBand</span> </td> <td id="fancy-table-Configuration,Specification,Theoretical Bandwidth-row1-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">400 Gbit/s NIC × 8 cards</span> </td> <td id="fancy-table-Configuration,Specification,Theoretical Bandwidth-row1-col2" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">~400 GB/s</span> </td> </tr> </tbody> </table> 
</div> <p><strong>Storage configurations tested</strong></p> <p>All storage types are documented in the <a href="https://docs.nebius.com/compute/storage/types">Nebius storage documentation</a>:</p> <div id="fancy-table-Storage Type,Description,Performance Profile-wrapper" class="px-4 rounded-lg __basic-table not-prose mt-4 mb-4 overflow-x-auto"> <table id="fancy-table-Storage Type,Description,Performance Profile" class="min-w-full divide-y divide-gray-200 font-sans basic-table-striped"> <thead class="bg-gray-50"> <tr> <th class="px-6 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Storage Type </th> <th class="px-6 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Description </th> <th class="px-6 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Performance Profile </th> </tr> </thead> <tbody> <tr class="border-b border-gray-200"> <td id="fancy-table-Storage Type,Description,Performance Profile-row0-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Network SSD</span> </td> <td id="fancy-table-Storage Type,Description,Performance Profile-row0-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">network_ssd_non_replicated</span> </td> <td id="fancy-table-Storage Type,Description,Performance Profile-row0-col2" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Standard cloud block storage</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="fancy-table-Storage Type,Description,Performance Profile-row1-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Nebius Shared Filesystem</span> </td> <td id="fancy-table-Storage Type,Description,Performance Profile-row1-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Nebius's distributed file
system offering</span> </td> <td id="fancy-table-Storage Type,Description,Performance Profile-row1-col2" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">High-performance distributed storage</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="fancy-table-Storage Type,Description,Performance Profile-row2-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Object Store (MOUNT)</span> </td> <td id="fancy-table-Storage Type,Description,Performance Profile-row2-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Direct S3-compatible mounting</span> </td> <td id="fancy-table-Storage Type,Description,Performance Profile-row2-col2" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Cost-effective but high-latency</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="fancy-table-Storage Type,Description,Performance Profile-row3-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Object Store (MOUNT_CACHED)</span> </td> <td id="fancy-table-Storage Type,Description,Performance Profile-row3-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">SkyPilot's cached mounting</span> </td> <td id="fancy-table-Storage Type,Description,Performance Profile-row3-col2" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Logs to local disk streams to object store</span> </td> </tr> </tbody> </table> </div> </details> <h3 id="network-benchmarks-the-9x-performance-difference">Network benchmarks: The 9x performance difference</h3> <p>I compared two network configurations:</p> 
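<p>The theoretical bandwidth figures used throughout (~1.25 GB/s for a 10 Gbit/s NIC, ~400 GB/s for 8 × 400 Gbit/s InfiniBand) are simple unit conversions. A minimal sketch of the arithmetic (the helper name here is illustrative, not part of SkyPilot or Nebius):</p>

```python
# Back-of-envelope conversion behind the theoretical bandwidth figures:
# divide link speed in Gbit/s by 8 (bits per byte), multiply by NIC count.
def theoretical_gb_per_s(gbit_per_s: float, num_nics: int = 1) -> float:
    return gbit_per_s * num_nics / 8

print(theoretical_gb_per_s(10))      # 10 Gbit/s Ethernet      -> 1.25 GB/s
print(theoretical_gb_per_s(400, 8))  # 8x 400 Gbit/s InfiniBand -> 400.0 GB/s
```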
<ul> <li>Standard 10 Gbit/s Ethernet (the default on most clouds)</li> <li>InfiniBand 400 Gbit/s with 8 NICs (high-performance networking)</li> </ul> <p>The raw bandwidth difference is substantial: 1.25 GB/s versus approximately 400 GB/s. But how does this translate to actual training throughput?</p> <p>I ran the experiments on the Open-R1 dataset with this <a href="https://github.com/skypilot-org/skypilot/blob/master/examples/training_network_storage_benchmarks/e2e_network.yaml">SkyPilot YAML</a>.</p> <div id="fancy-table-Network Type,Raw Bandwidth,Average Time per Step,Total Training Time-wrapper" class="px-4 rounded-lg __basic-table not-prose mt-4 mb-4 overflow-x-auto"> <table id="fancy-table-Network Type,Raw Bandwidth,Average Time per Step,Total Training Time" class="min-w-full divide-y divide-gray-200 font-sans basic-table-striped"> <thead class="bg-gray-50"> <tr> <th class="px-6 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Network Type </th> <th class="px-6 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Raw Bandwidth </th> <th class="px-6 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Average Time per Step </th> <th class="px-6 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Total Training Time </th> </tr> </thead> <tbody> <tr class="border-b border-gray-200"> <td id="fancy-table-Network Type,Raw Bandwidth,Average Time per Step,Total Training Time-row0-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">10 Gbit Ethernet</span> </td> <td id="fancy-table-Network Type,Raw Bandwidth,Average Time per Step,Total Training Time-row0-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">~1.25 GB/s</span> </td> <td id="fancy-table-Network Type,Raw Bandwidth,Average Time per Step,Total Training Time-row0-col2" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99)
!important;">39.8 seconds</span> </td> <td id="fancy-table-Network Type,Raw Bandwidth,Average Time per Step,Total Training Time-row0-col3" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">53 minutes</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="fancy-table-Network Type,Raw Bandwidth,Average Time per Step,Total Training Time-row1-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">NVIDIA Quantum-2 InfiniBand</span> </td> <td id="fancy-table-Network Type,Raw Bandwidth,Average Time per Step,Total Training Time-row1-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">~400 GB/s</span> </td> <td id="fancy-table-Network Type,Raw Bandwidth,Average Time per Step,Total Training Time-row1-col2" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">4.4 seconds</span> </td> <td id="fancy-table-Network Type,Raw Bandwidth,Average Time per Step,Total Training Time-row1-col3" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">7 minutes</span> </td> </tr> </tbody> </table> </div> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-09-01/generated/gemma_network.png" width="100%" alt="" /> </div> <p>That’s a 9x speedup from network configuration alone. 
When you’re paying premium rates for GPU time, this isn’t just a performance improvement; it’s a cost optimization strategy.</p> <p>With the <a href="https://huggingface.co/openai/gpt-oss-120b">GPT-OSS-120B</a> model (10x larger!), we see the same effect: a 10x speedup.</p> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-09-01/generated/gpt_network.png" width="100%" alt="" /> </div> <p>Normally, configuring high-performance networking takes significant effort, e.g., manually tuning many different cloud configs and setting various environment variables.</p> <p>Here, <a href="https://github.com/skypilot-org/skypilot">SkyPilot</a> takes care of the complexity under the hood with a single flag in the SkyPilot YAML:</p> <div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">name</span><span class="pi">:</span> <span class="s">distributed-training</span> <span class="na">resources</span><span class="pi">:</span> <span class="na">accelerators</span><span class="pi">:</span> <span class="s">H100:8</span> <span class="c1"># Enable high-performance networking for distributed training</span> <span class="na">network_tier</span><span class="pi">:</span> <span class="s">best</span> </code></pre></div></div> <p>The <code class="language-plaintext highlighter-rouge">network_tier: best</code> flag automatically provisions InfiniBand networking (400 GB/s) when available.
Without this entry, the cluster falls back to the default 10 Gbit/s network interface.</p> <h3 id="profiling-the-network-performance-difference">Profiling the network performance difference</h3> <p>To see how the network affects training performance, let’s take a closer look at a single training step under the profiler:</p> <div class="image-container" style="max-width: 100%; overflow: visible;"> <img loading="lazy" src="/assets/images/posts/2025-09-01/ib1.svg" style="width: 120%; margin-left: calc((100% - 120%) / 2);" alt="" /> </div> <p>The execution breaks down into CPU work (data loading, kernel launches) and GPU work (computation plus network communication). GPU time itself divides between pure computation and communication overhead.</p> <p>Comparing Ethernet versus InfiniBand configurations:</p> <div class="image-container" style="max-width: 100%; overflow: visible;"> <img loading="lazy" src="/assets/images/posts/2025-09-01/ib1_compare.svg" style="width: 120%; margin-left: calc((100% - 120%) / 2);" alt="" /> </div> <p>The profiles appear similar when scaled, but the crucial difference is absolute timing: 4 seconds per step with InfiniBand versus 40 seconds with Ethernet.</p> <div class="image-container" style="max-width: 100%; overflow: visible;"> <img loading="lazy" src="/assets/images/posts/2025-09-01/ib1_expand.svg" style="width: 120%; margin-left: calc((100% - 120%) / 2);" alt="" /> </div> <p>Zooming in on the start of the backward pass, we can see that with InfiniBand the <code class="language-plaintext highlighter-rouge">ReduceScatter</code> operation takes just 21 ms instead of 258 ms, roughly matching our ~10x end-to-end performance difference.</p> <h3 id="storage-benchmarks-the-hidden-bottleneck">Storage benchmarks: The hidden bottleneck</h3> <p>I also evaluated different storage configurations available on Nebius:</p> <link rel="stylesheet" href="/assets/css/fancy_table.css" /> <script> function toggleCalculations(tableId) { // ...
(toggleCalculations function remains the same) const table = document.getElementById(tableId); const cells = table.querySelectorAll('.has-calculation'); const tableWrapper = document.getElementById(tableId + '-wrapper'); const toggleSwitch = document.getElementById(tableId + '-switch'); const toggleText = document.getElementById(tableId + '-toggle-text'); if (tableWrapper.classList.contains('show-calculations')) { tableWrapper.classList.remove('show-calculations'); toggleSwitch.classList.remove('active'); toggleText.textContent = "Show calculations"; cells.forEach(cell => { const normalText = cell.querySelector('.normal-text'); const calculationText = cell.querySelector('.calculation-text'); normalText.style.display = 'inline'; calculationText.style.display = 'none'; }); } else { tableWrapper.classList.add('show-calculations'); toggleSwitch.classList.add('active'); toggleText.textContent = "Hide calculations"; cells.forEach(cell => { const normalText = cell.querySelector('.normal-text'); const calculationText = cell.querySelector('.calculation-text'); normalText.style.display = 'none'; calculationText.style.display = 'inline'; }); } } function isConsideredYellow(colorString) { if (!colorString) return false; const lowerColor = colorString.toLowerCase(); const yellowKeywords = ['yellow', 'gold', 'lemon', 'chiffon', 'goldenrod', 'papayawhip', 'moccasin', 'khaki', '#ff0', '#ffff00', '#ffffe0', '#fffacd', '#fafad2', '#fff8dc', '#eee8aa', '#f0e68c']; if (yellowKeywords.some(k => lowerColor.includes(k))) { return true; } const match = lowerColor.match(/rgba?\((\d+),\s*(\d+),\s*(\d+)/); if (match) { const r = parseInt(match[1]); const g = parseInt(match[2]); const b = parseInt(match[3]); if (r > 200 && g > 180 && b < 200 && Math.abs(r - g) < 70) { return true; } } if (lowerColor === '#ffdab9') return true; // PeachPuff if (lowerColor === 'rgba(255,255,0,0.2)') return true; // Example: semi-transparent yellow return false; } function initializeCellHighlighters() { const 
highlighters = document.querySelectorAll('[data-highlight-cells]'); highlighters.forEach(highlighter => { if (highlighter.dataset.highlighterInitialized === 'true') return; highlighter.dataset.highlighterInitialized = 'true'; const cellIdsToHighlight = highlighter.dataset.highlightCells.split(',').map(id => id.trim()); const cellElements = cellIdsToHighlight.map(id => document.getElementById(id)).filter(el => el); let descriptiveSpans = []; if (highlighter.matches('span[data-hover-text-color]')) { descriptiveSpans.push(highlighter); } else { descriptiveSpans = Array.from(highlighter.querySelectorAll('span[data-hover-text-color]')); } descriptiveSpans.forEach(span => { if (typeof span.dataset.originalParaTextColor === 'undefined') { span.dataset.originalParaTextColor = window.getComputedStyle(span).color || ''; } }); highlighter.addEventListener('mouseenter', () => { highlighter.classList.add('trigger-text-active'); descriptiveSpans.forEach(span => { if (span.dataset.hoverTextColor) { span.style.color = span.dataset.hoverTextColor; } }); cellElements.forEach((cell, idx) => { cell.classList.add('cell-highlighted'); let hoverTextColorForCell = null; let hoverCellBgFromDescSpan = null; if (idx < descriptiveSpans.length) { const descSpan = descriptiveSpans[idx]; if (descSpan.dataset.hoverTextColor) { hoverTextColorForCell = descSpan.dataset.hoverTextColor; } if (descSpan.dataset.hoverCellBg) { hoverCellBgFromDescSpan = descSpan.dataset.hoverCellBg; } } else if (descriptiveSpans.length === 1 && cellElements.length > 1) { const descSpan = descriptiveSpans[0]; if (descSpan.dataset.hoverTextColor) { hoverTextColorForCell = descSpan.dataset.hoverTextColor; } if (descSpan.dataset.hoverCellBg) { hoverCellBgFromDescSpan = descSpan.dataset.hoverCellBg; } } if (hoverTextColorForCell) { const isCalcCell = cell.classList.contains('has-calculation'); if (isCalcCell) { const normalTextSpan = cell.querySelector('.normal-text'); const calcResultSpan = 
cell.querySelector('.calc-result'); const calcWrapper = cell.querySelector('.calculation-text'); if (calcWrapper && window.getComputedStyle(calcWrapper).display !== 'none') { if (calcResultSpan) calcResultSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } else { if (normalTextSpan) normalTextSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } } else { const directSpan = cell.querySelector(':scope > span'); if (directSpan) directSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } } if (hoverCellBgFromDescSpan && isConsideredYellow(hoverCellBgFromDescSpan)) { if (typeof cell.dataset.originalBgColor === 'undefined') { cell.dataset.originalBgColor = cell.style.backgroundColor; } cell.style.backgroundColor = hoverCellBgFromDescSpan; cell.dataset.bgChangedByScript = 'true'; } }); }); highlighter.addEventListener('mouseleave', () => { highlighter.classList.remove('trigger-text-active'); descriptiveSpans.forEach(span => { if (typeof span.dataset.originalParaTextColor !== 'undefined') { span.style.color = span.dataset.originalParaTextColor; } }); cellElements.forEach((cell, idx) => { cell.classList.remove('cell-highlighted'); const isCalcCell = cell.classList.contains('has-calculation'); if (isCalcCell) { const normalTextSpan = cell.querySelector('.normal-text'); const calcResultSpan = cell.querySelector('.calc-result'); if (normalTextSpan) normalTextSpan.style.removeProperty('color'); if (calcResultSpan) calcResultSpan.style.removeProperty('color'); } else { const directSpan = cell.querySelector(':scope > span'); if (directSpan) directSpan.style.removeProperty('color'); } if (cell.dataset.bgChangedByScript === 'true') { cell.style.backgroundColor = cell.dataset.originalBgColor || ''; delete cell.dataset.originalBgColor; delete cell.dataset.bgChangedByScript; } }); }); }); } if (document.readyState === 'loading') { document.addEventListener('DOMContentLoaded', initializeCellHighlighters); } else { 
initializeCellHighlighters(); } </script> <style> /* ... (Other styles remain the same) ... */ .toggle-container { display: flex; justify-content: flex-end; margin-bottom: 6px; } .calc-toggle { display: inline-flex; align-items: center; } .toggle-switch { position: relative; display: inline-block; width: 32px; height: 16px; background-color: #e5e7eb; /* gray-200 */ border-radius: 10px; transition: all 0.3s; cursor: pointer; } .toggle-switch::after { content: ''; position: absolute; width: 12px; height: 12px; border-radius: 50%; background-color: white; top: 2px; left: 2px; transition: all 0.3s cubic-bezier(0.175, 0.885, 0.32, 1.275); box-shadow: 0 1px 2px rgba(0, 0, 0, 0.1); } .toggle-switch.active { background-color: #8b5cf6; /* purple-500 */ } .toggle-switch.active::after { left: 18px; } .toggle-label { position: absolute; width: 1px; height: 1px; padding: 0; margin: -1px; overflow: hidden; clip: rect(0, 0, 0, 0); white-space: nowrap; border-width: 0; } .toggle-text { font-size: 0.75rem; color: #6b7280; /* gray-500 */ margin-right: 6px; } .calc-result { color: rgb(75, 85, 99); /* gray-700 */ } .calc-formula { color: #6b83a6; /* Subtle bluish gray */ font-family: ui-monospace, SFMono-Regular, Menlo, Monaco, Consolas, monospace; } .calc-equals { color: rgb(75, 85, 99); /* gray-700 */ display: inline-block; margin: 0 4px; font-weight: normal; } .cell-highlighted { font-weight: bold !important; } .trigger-text-active { background-color: #E5E7EB; padding: 1px 3px; border-radius: 3px; transition: background-color 0.15s ease-in-out; } .trigger-text-active span[data-hover-text-color] { padding: 0; background-color: transparent; } /* Add styles for no-scroll option */ .table-wrapper-no-scroll { overflow-x: visible !important; } .table-no-scroll td { white-space: normal !important; word-break: break-word; } </style> <div id="fancy-table-Storage Type,Read Speed,Write Speed,Notes-wrapper" class="px-4 rounded-lg __basic-table not-prose mt-4 mb-4 overflow-x-auto"> <table 
id="fancy-table-Storage Type,Read Speed,Write Speed,Notes" class="min-w-full divide-y divide-gray-200 font-sans basic-table-striped"> <thead class="bg-gray-50"> <tr> <th class="px-6 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Storage Type </th> <th class="px-6 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Read Speed </th> <th class="px-6 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Write Speed </th> <th class="px-6 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Notes </th> </tr> </thead> <tbody> <tr class="border-b border-gray-200"> <td id="fancy-table-Storage Type,Read Speed,Write Speed,Notes-row0-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Local NVMe</span> </td> <td id="fancy-table-Storage Type,Read Speed,Write Speed,Notes-row0-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">10+GB/s</span> </td> <td id="fancy-table-Storage Type,Read Speed,Write Speed,Notes-row0-col2" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">10+GB/s</span> </td> <td id="fancy-table-Storage Type,Read Speed,Write Speed,Notes-row0-col3" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Fastest but non-persistent</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="fancy-table-Storage Type,Read Speed,Write Speed,Notes-row1-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Nebius Shared Filesystem</span> </td> <td id="fancy-table-Storage Type,Read Speed,Write Speed,Notes-row1-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">6.4GB/s</span> </td> <td 
id="fancy-table-Storage Type,Read Speed,Write Speed,Notes-row1-col2" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">1.6GB/s</span> </td> <td id="fancy-table-Storage Type,Read Speed,Write Speed,Notes-row1-col3" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">High-performance persistent storage</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="fancy-table-Storage Type,Read Speed,Write Speed,Notes-row2-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Object Store (MOUNT)</span> </td> <td id="fancy-table-Storage Type,Read Speed,Write Speed,Notes-row2-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">300MB/s</span> </td> <td id="fancy-table-Storage Type,Read Speed,Write Speed,Notes-row2-col2" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">100MB/s</span> </td> <td id="fancy-table-Storage Type,Read Speed,Write Speed,Notes-row2-col3" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Direct S3-compatible mount</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="fancy-table-Storage Type,Read Speed,Write Speed,Notes-row3-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Object Store (MOUNT_CACHED)</span> </td> <td id="fancy-table-Storage Type,Read Speed,Write Speed,Notes-row3-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">300MB/s</span> </td> <td id="fancy-table-Storage Type,Read Speed,Write Speed,Notes-row3-col2" class="px-6 py-2 whitespace-nowrap text-sm 
font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">300MB/s</span> </td> <td id="fancy-table-Storage Type,Read Speed,Write Speed,Notes-row3-col3" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">SkyPilot's cached object store mounting</span> </td> </tr> </tbody> </table> </div> <p>Here’s how to configure all storage types in a SkyPilot YAML:</p> <div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">resources</span><span class="pi">:</span> <span class="na">disk_tier</span><span class="pi">:</span> <span class="s">best</span> <span class="c1"># Provisions high-performance local NVMe</span> <span class="na">disk_size</span><span class="pi">:</span> <span class="m">2000</span> <span class="c1"># Size in GB</span> <span class="na">file_mounts</span><span class="pi">:</span> <span class="na">/checkpoints_s3</span><span class="pi">:</span> <span class="na">source</span><span class="pi">:</span> <span class="s">s3://your-bucket</span> <span class="na">mode</span><span class="pi">:</span> <span class="s">MOUNT</span> <span class="c1"># Direct S3 mount</span> <span class="na">/checkpoints_cached</span><span class="pi">:</span> <span class="na">source</span><span class="pi">:</span> <span class="s">s3://your-bucket</span> <span class="na">mode</span><span class="pi">:</span> <span class="s">MOUNT_CACHED</span> <span class="c1"># Local caching + object store persistence</span> <span class="na">volumes</span><span class="pi">:</span> <span class="na">/mnt/data</span><span class="pi">:</span> <span class="s">nebius-pvc</span> <span class="c1"># Mount Nebius shared filesystem</span> </code></pre></div></div> <p><strong>Local NVMe</strong>: Fastest but non-persistent. 
Configured via <code class="language-plaintext highlighter-rouge">disk_tier: best</code></p> <p><strong><a href="https://docs.skypilot.co/en/latest/reference/volumes.html">Nebius Shared Filesystem</a></strong>: High-performance persistent storage via <code class="language-plaintext highlighter-rouge">volumes</code> field in the SkyPilot YAML.</p> <p><strong><a href="https://docs.skypilot.co/en/latest/reference/storage.html">Object Store (MOUNT)</a></strong>: Direct S3 mounting. Cost-effective but high-latency.</p> <p><strong><a href="https://docs.skypilot.co/en/latest/reference/storage.html">Object Store (MOUNT_CACHED)</a></strong>: Local caching with object store persistence. Best balance of speed and durability.</p> <h4 id="end-to-end-storage-performance-impact">End-to-end storage performance impact</h4> <p>For the Gemma 3 12B model training, storage performance significantly impacts different phases.</p> <p>There are three different graphs: Checkpoint saving, model loading, and loading a batch from storage to train.</p> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-09-01/generated/gemma_disk_checkpoint_performance.png" width="100%" alt="" /> </div> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-09-01/generated/gemma_disk_model_loading_performance.png" width="100%" alt="" /> </div> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-09-01/generated/gemma_disk_batch_sample_performance.png" width="100%" alt="" /> </div> <p>In all three, we can see that the local NVMe performs the best, but isn’t durable and is limited in capacity. The solution lies in strategic storage allocation based on workload phase requirements.</p> <h4 id="storage-performance-summary">Storage performance summary</h4> <div id="fancy-table-Storage Type,Batch Loading (per 100 samples),Model Loading,Checkpoint Saving,Persistence,Best Use Case-wrapper" class="px-4 rounded-lg
__basic-table not-prose mt-4 mb-4 table-wrapper-no-scroll"> <table id="fancy-table-Storage Type,Batch Loading (per 100 samples),Model Loading,Checkpoint Saving,Persistence,Best Use Case" class="min-w-full divide-y divide-gray-200 font-sans basic-table-striped table-no-scroll"> <thead class="bg-gray-50"> <tr> <th class="px-6 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Storage Type </th> <th class="px-6 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Batch Loading (per 100 samples) </th> <th class="px-6 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Model Loading </th> <th class="px-6 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Checkpoint Saving </th> <th class="px-6 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Persistence </th> <th class="px-6 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Best Use Case </th> </tr> </thead> <tbody> <tr class="border-b border-gray-200"> <td id="fancy-table-Storage Type,Batch Loading (per 100 samples),Model Loading,Checkpoint Saving,Persistence,Best Use Case-row0-col0" class="px-6 py-2 whitespace-normal text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Local NVMe</span> </td> <td id="fancy-table-Storage Type,Batch Loading (per 100 samples),Model Loading,Checkpoint Saving,Persistence,Best Use Case-row0-col1" class="px-6 py-2 whitespace-normal text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">3.47s ⭐</span> </td> <td id="fancy-table-Storage Type,Batch Loading (per 100 samples),Model Loading,Checkpoint Saving,Persistence,Best Use Case-row0-col2" class="px-6 py-2 whitespace-normal text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">23.3s ⭐</span> </td> <td id="fancy-table-Storage Type,Batch Loading (per 100 samples),Model Loading,Checkpoint Saving,Persistence,Best Use Case-row0-col3" class="px-6 py-2 whitespace-normal 
text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">178s ⭐</span> </td> <td id="fancy-table-Storage Type,Batch Loading (per 100 samples),Model Loading,Checkpoint Saving,Persistence,Best Use Case-row0-col4" class="px-6 py-2 whitespace-normal text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">❌ No</span> </td> <td id="fancy-table-Storage Type,Batch Loading (per 100 samples),Model Loading,Checkpoint Saving,Persistence,Best Use Case-row0-col5" class="px-6 py-2 whitespace-normal text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Temporary files, intermediate checkpoints</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="fancy-table-Storage Type,Batch Loading (per 100 samples),Model Loading,Checkpoint Saving,Persistence,Best Use Case-row1-col0" class="px-6 py-2 whitespace-normal text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Nebius Shared Filesystem</span> </td> <td id="fancy-table-Storage Type,Batch Loading (per 100 samples),Model Loading,Checkpoint Saving,Persistence,Best Use Case-row1-col1" class="px-6 py-2 whitespace-normal text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">4.29s</span> </td> <td id="fancy-table-Storage Type,Batch Loading (per 100 samples),Model Loading,Checkpoint Saving,Persistence,Best Use Case-row1-col2" class="px-6 py-2 whitespace-normal text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">30.1s ⭐</span> </td> <td id="fancy-table-Storage Type,Batch Loading (per 100 samples),Model Loading,Checkpoint Saving,Persistence,Best Use Case-row1-col3" class="px-6 py-2 whitespace-normal text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">382s</span> </td> <td id="fancy-table-Storage Type,Batch Loading (per 100 samples),Model Loading,Checkpoint Saving,Persistence,Best Use Case-row1-col4" class="px-6 py-2 whitespace-normal text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">✅ Yes</span> </td> <td id="fancy-table-Storage Type,Batch Loading (per 100 samples),Model Loading,Checkpoint Saving,Persistence,Best Use Case-row1-col5" class="px-6 py-2 whitespace-normal text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Final checkpoints, model weights</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="fancy-table-Storage Type,Batch Loading (per 100 samples),Model Loading,Checkpoint Saving,Persistence,Best Use Case-row2-col0" class="px-6 py-2 whitespace-normal text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">MOUNT</span> </td> <td id="fancy-table-Storage Type,Batch Loading (per 100 samples),Model Loading,Checkpoint Saving,Persistence,Best Use Case-row2-col1" class="px-6 py-2 whitespace-normal text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">73.1s ❌</span> </td> <td id="fancy-table-Storage Type,Batch Loading (per 100 samples),Model Loading,Checkpoint Saving,Persistence,Best Use Case-row2-col2" class="px-6 py-2 whitespace-normal text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">50.6s ❌</span> </td> <td id="fancy-table-Storage Type,Batch Loading (per 100 samples),Model Loading,Checkpoint Saving,Persistence,Best Use Case-row2-col3" class="px-6 py-2 whitespace-normal text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">436s ❌</span> </td> <td id="fancy-table-Storage Type,Batch Loading (per 100 samples),Model Loading,Checkpoint Saving,Persistence,Best Use Case-row2-col4" class="px-6 py-2 whitespace-normal text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">✅ Yes</span> </td> <td id="fancy-table-Storage Type,Batch Loading (per 100 samples),Model Loading,Checkpoint Saving,Persistence,Best Use Case-row2-col5" class="px-6 py-2 whitespace-normal text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Cold storage, model weights</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="fancy-table-Storage Type,Batch Loading (per 100 samples),Model Loading,Checkpoint Saving,Persistence,Best Use Case-row3-col0" class="px-6 py-2 whitespace-normal text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">MOUNT_CACHED</span> </td> <td id="fancy-table-Storage Type,Batch Loading (per 100 samples),Model Loading,Checkpoint Saving,Persistence,Best Use Case-row3-col1" class="px-6 py-2 whitespace-normal text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">7.77s ⭐</span> </td> <td id="fancy-table-Storage Type,Batch Loading (per 100 samples),Model Loading,Checkpoint Saving,Persistence,Best Use Case-row3-col2" class="px-6 py-2 whitespace-normal text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">104s ❌</span> </td> <td id="fancy-table-Storage Type,Batch Loading (per 100 samples),Model Loading,Checkpoint Saving,Persistence,Best Use Case-row3-col3" class="px-6 py-2 whitespace-normal text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">212s ⭐</span> </td> <td id="fancy-table-Storage Type,Batch Loading (per 100 samples),Model Loading,Checkpoint Saving,Persistence,Best Use Case-row3-col4" class="px-6 py-2 whitespace-normal text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">✅ Yes</span> </td> <td id="fancy-table-Storage Type,Batch Loading (per 100 samples),Model Loading,Checkpoint Saving,Persistence,Best Use Case-row3-col5" class="px-6 py-2 whitespace-normal text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Training datasets, checkpoints</span> </td> </tr> </tbody> </table> </div> <details> <summary>Click to view detailed disk performance analysis</summary> The following image is a
checkpoint-saving profile of S3: <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-09-01/disk_profile.svg" width="100%" alt="" /> </div> We see that much of the time is spent gathering the tensors between the GPUs and serializing them to disk. </details> <h3 id="best-storage-choices-for-each-phase-in-training">Best storage choices for each phase in training</h3> <p>With the benchmark results, we can figure out the best storage choices for each phase in distributed training.</p> <p>The best choice is not simply the fastest storage for every phase, because of one constraint: “Checkpoint saving” storage should be durable and the same as “model loading” storage, so previous checkpoints can be loaded when training is resumed.</p> <p>I summarize the best storage choices for each phase in training:</p> <ul> <li><strong>Batch Sampling</strong>: Nebius Shared Filesystem (4.29s; local NVMe is faster at 3.47s, but isn’t persistent)</li> <li><strong>Model Loading</strong>: Object Store (MOUNT) (50.6s)</li> <li><strong>Checkpoint Saving</strong>: Object Store (MOUNT_CACHED) (212s)</li> </ul> <p>Here’s an example of a SkyPilot configuration using the best storage choices for each phase:</p> <div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">name</span><span class="pi">:</span> <span class="s">distributed-training</span> <span class="na">resources</span><span class="pi">:</span> <span class="na">accelerators</span><span class="pi">:</span> <span class="s">H100:8</span> <span class="c1"># High-performance InfiniBand networking</span> <span class="na">network_tier</span><span class="pi">:</span> <span class="s">best</span> <span class="na">num_nodes</span><span class="pi">:</span> <span class="m">2</span> <span class="na">workdir</span><span class="pi">:</span> <span class="s">.</span> <span class="na">volumes</span><span class="pi">:</span> <span class="c1"># Loading dataset from the Nebius shared filesystem</span> <span
class="na">/dataset</span><span class="pi">:</span> <span class="s">nebius-pvc</span> <span class="na">file_mounts</span><span class="pi">:</span> <span class="c1"># Loading model from the MOUNT storage for faster loading</span> <span class="na">/model</span><span class="pi">:</span> <span class="na">source</span><span class="pi">:</span> <span class="s">s3://your-bucket</span> <span class="na">mode</span><span class="pi">:</span> <span class="s">MOUNT</span> <span class="c1"># Fast checkpoint loads and saves with persistence</span> <span class="na">/checkpoints</span><span class="pi">:</span> <span class="na">source</span><span class="pi">:</span> <span class="s">s3://your-bucket</span> <span class="na">mode</span><span class="pi">:</span> <span class="s">MOUNT_CACHED</span> <span class="na">setup</span><span class="pi">:</span> <span class="pi">|</span> <span class="s">uv pip install -r requirements.txt</span> <span class="na">run</span><span class="pi">:</span> <span class="pi">|</span> <span class="s">python train.py \</span> <span class="s">--model-path /model \</span> <span class="s">--data-path /dataset \</span> <span class="s">--checkpoint-dir /checkpoints</span> </code></pre></div></div> <h2 id="network-and-storage-summary">Network and Storage Summary</h2> <p><strong>Network is critical for distributed training:</strong></p> <ul> <li>InfiniBand vs Ethernet: 10x faster training (4.4s vs 39.8s per step)</li> </ul> <p><strong>Storage matters for different training phases:</strong></p> <ul> <li>NVMe vs slow storage: 3.47s vs 73.1s batch loading (20x faster)</li> <li>Checkpoint saving: 178s (NVME) vs 436s (S3) (2.5x faster)</li> <li>Wrong storage = 12.1% potential training time wasted on I/O (436s/1hr = 12.1%)</li> </ul> <h2 id="end-to-end-performance-comparison">End-to-end performance comparison</h2> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-09-01/generated/gemma_disk_e2e_comparison.png" width="100%" alt="" /> </div> 
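<p>Before looking at the measured end-to-end numbers, a quick back-of-the-envelope estimate from the per-phase timings above suggests what to expect. Modeling the 80-step run as one model load, 80 training steps, and a single checkpoint save with no compute/I/O overlap is my own simplifying assumption, not how the benchmark was instrumented:</p>

```python
# Back-of-the-envelope estimate of end-to-end speedup from the per-phase
# numbers above: 39.8 vs 4.4 s/step (Ethernet vs InfiniBand), 104 vs 50.6 s
# model load, and 436 vs 212 s checkpoint save. The run structure (one load,
# 80 steps, one checkpoint save, no overlap) is an illustrative assumption.

def wall_clock(model_load_s: float, step_s: float, steps: int,
               checkpoint_s: float, checkpoints: int) -> float:
    """Total wall-clock seconds for one training run."""
    return model_load_s + steps * step_s + checkpoints * checkpoint_s

slow = wall_clock(model_load_s=104, step_s=39.8, steps=80,
                  checkpoint_s=436, checkpoints=1)   # unoptimized config
fast = wall_clock(model_load_s=50.6, step_s=4.4, steps=80,
                  checkpoint_s=212, checkpoints=1)   # optimized config

print(f"unoptimized: {slow:.0f}s, optimized: {fast:.0f}s, "
      f"speedup: {slow / fast:.1f}x")  # → speedup: 6.1x
```

<p>With this toy model the step-time difference dominates, and the estimate already lands in the same ballpark as the measured end-to-end results; more frequent checkpointing would pull the ratio toward the 436s-vs-212s checkpoint-save speedup instead.</p>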
<p>To demonstrate the cumulative impact of our optimizations, I compared two complete configurations on 80 training steps with the Gemma 12B model:</p> <link rel="stylesheet" href="/assets/css/fancy_table.css" /> <script> function toggleCalculations(tableId) { // ... (toggleCalculations function remains the same) const table = document.getElementById(tableId); const cells = table.querySelectorAll('.has-calculation'); const tableWrapper = document.getElementById(tableId + '-wrapper'); const toggleSwitch = document.getElementById(tableId + '-switch'); const toggleText = document.getElementById(tableId + '-toggle-text'); if (tableWrapper.classList.contains('show-calculations')) { tableWrapper.classList.remove('show-calculations'); toggleSwitch.classList.remove('active'); toggleText.textContent = "Show calculations"; cells.forEach(cell => { const normalText = cell.querySelector('.normal-text'); const calculationText = cell.querySelector('.calculation-text'); normalText.style.display = 'inline'; calculationText.style.display = 'none'; }); } else { tableWrapper.classList.add('show-calculations'); toggleSwitch.classList.add('active'); toggleText.textContent = "Hide calculations"; cells.forEach(cell => { const normalText = cell.querySelector('.normal-text'); const calculationText = cell.querySelector('.calculation-text'); normalText.style.display = 'none'; calculationText.style.display = 'inline'; }); } } function isConsideredYellow(colorString) { if (!colorString) return false; const lowerColor = colorString.toLowerCase(); const yellowKeywords = ['yellow', 'gold', 'lemon', 'chiffon', 'goldenrod', 'papayawhip', 'moccasin', 'khaki', '#ff0', '#ffff00', '#ffffe0', '#fffacd', '#fafad2', '#fff8dc', '#eee8aa', '#f0e68c']; if (yellowKeywords.some(k => lowerColor.includes(k))) { return true; } const match = lowerColor.match(/rgba?\((\d+),\s*(\d+),\s*(\d+)/); if (match) { const r = parseInt(match[1]); const g = parseInt(match[2]); const b = parseInt(match[3]); if (r > 200 && g > 
180 && b < 200 && Math.abs(r - g) < 70) { return true; } } if (lowerColor === '#ffdab9') return true; // PeachPuff if (lowerColor === 'rgba(255,255,0,0.2)') return true; // Example: semi-transparent yellow return false; } function initializeCellHighlighters() { const highlighters = document.querySelectorAll('[data-highlight-cells]'); highlighters.forEach(highlighter => { if (highlighter.dataset.highlighterInitialized === 'true') return; highlighter.dataset.highlighterInitialized = 'true'; const cellIdsToHighlight = highlighter.dataset.highlightCells.split(',').map(id => id.trim()); const cellElements = cellIdsToHighlight.map(id => document.getElementById(id)).filter(el => el); let descriptiveSpans = []; if (highlighter.matches('span[data-hover-text-color]')) { descriptiveSpans.push(highlighter); } else { descriptiveSpans = Array.from(highlighter.querySelectorAll('span[data-hover-text-color]')); } descriptiveSpans.forEach(span => { if (typeof span.dataset.originalParaTextColor === 'undefined') { span.dataset.originalParaTextColor = window.getComputedStyle(span).color || ''; } }); highlighter.addEventListener('mouseenter', () => { highlighter.classList.add('trigger-text-active'); descriptiveSpans.forEach(span => { if (span.dataset.hoverTextColor) { span.style.color = span.dataset.hoverTextColor; } }); cellElements.forEach((cell, idx) => { cell.classList.add('cell-highlighted'); let hoverTextColorForCell = null; let hoverCellBgFromDescSpan = null; if (idx < descriptiveSpans.length) { const descSpan = descriptiveSpans[idx]; if (descSpan.dataset.hoverTextColor) { hoverTextColorForCell = descSpan.dataset.hoverTextColor; } if (descSpan.dataset.hoverCellBg) { hoverCellBgFromDescSpan = descSpan.dataset.hoverCellBg; } } else if (descriptiveSpans.length === 1 && cellElements.length > 1) { const descSpan = descriptiveSpans[0]; if (descSpan.dataset.hoverTextColor) { hoverTextColorForCell = descSpan.dataset.hoverTextColor; } if (descSpan.dataset.hoverCellBg) { 
hoverCellBgFromDescSpan = descSpan.dataset.hoverCellBg; } } if (hoverTextColorForCell) { const isCalcCell = cell.classList.contains('has-calculation'); if (isCalcCell) { const normalTextSpan = cell.querySelector('.normal-text'); const calcResultSpan = cell.querySelector('.calc-result'); const calcWrapper = cell.querySelector('.calculation-text'); if (calcWrapper && window.getComputedStyle(calcWrapper).display !== 'none') { if (calcResultSpan) calcResultSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } else { if (normalTextSpan) normalTextSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } } else { const directSpan = cell.querySelector(':scope > span'); if (directSpan) directSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } } if (hoverCellBgFromDescSpan && isConsideredYellow(hoverCellBgFromDescSpan)) { if (typeof cell.dataset.originalBgColor === 'undefined') { cell.dataset.originalBgColor = cell.style.backgroundColor; } cell.style.backgroundColor = hoverCellBgFromDescSpan; cell.dataset.bgChangedByScript = 'true'; } }); }); highlighter.addEventListener('mouseleave', () => { highlighter.classList.remove('trigger-text-active'); descriptiveSpans.forEach(span => { if (typeof span.dataset.originalParaTextColor !== 'undefined') { span.style.color = span.dataset.originalParaTextColor; } }); cellElements.forEach((cell, idx) => { cell.classList.remove('cell-highlighted'); const isCalcCell = cell.classList.contains('has-calculation'); if (isCalcCell) { const normalTextSpan = cell.querySelector('.normal-text'); const calcResultSpan = cell.querySelector('.calc-result'); if (normalTextSpan) normalTextSpan.style.removeProperty('color'); if (calcResultSpan) calcResultSpan.style.removeProperty('color'); } else { const directSpan = cell.querySelector(':scope > span'); if (directSpan) directSpan.style.removeProperty('color'); } if (cell.dataset.bgChangedByScript === 'true') { cell.style.backgroundColor = 
cell.dataset.originalBgColor || ''; delete cell.dataset.originalBgColor; delete cell.dataset.bgChangedByScript; } }); }); }); } if (document.readyState === 'loading') { document.addEventListener('DOMContentLoaded', initializeCellHighlighters); } else { initializeCellHighlighters(); } </script> <style> /* ... (Other styles remain the same) ... */ .toggle-container { display: flex; justify-content: flex-end; margin-bottom: 6px; } .calc-toggle { display: inline-flex; align-items: center; } .toggle-switch { position: relative; display: inline-block; width: 32px; height: 16px; background-color: #e5e7eb; /* gray-200 */ border-radius: 10px; transition: all 0.3s; cursor: pointer; } .toggle-switch::after { content: ''; position: absolute; width: 12px; height: 12px; border-radius: 50%; background-color: white; top: 2px; left: 2px; transition: all 0.3s cubic-bezier(0.175, 0.885, 0.32, 1.275); box-shadow: 0 1px 2px rgba(0, 0, 0, 0.1); } .toggle-switch.active { background-color: #8b5cf6; /* purple-500 */ } .toggle-switch.active::after { left: 18px; } .toggle-label { position: absolute; width: 1px; height: 1px; padding: 0; margin: -1px; overflow: hidden; clip: rect(0, 0, 0, 0); white-space: nowrap; border-width: 0; } .toggle-text { font-size: 0.75rem; color: #6b7280; /* gray-500 */ margin-right: 6px; } .calc-result { color: rgb(75, 85, 99); /* gray-700 */ } .calc-formula { color: #6b83a6; /* Subtle bluish gray */ font-family: ui-monospace, SFMono-Regular, Menlo, Monaco, Consolas, monospace; } .calc-equals { color: rgb(75, 85, 99); /* gray-700 */ display: inline-block; margin: 0 4px; font-weight: normal; } .cell-highlighted { font-weight: bold !important; } .trigger-text-active { background-color: #E5E7EB; padding: 1px 3px; border-radius: 3px; transition: background-color 0.15s ease-in-out; } .trigger-text-active span[data-hover-text-color] { padding: 0; background-color: transparent; } /* Add styles for no-scroll option */ .table-wrapper-no-scroll { overflow-x: visible 
!important; } .table-no-scroll td { white-space: normal !important; word-break: break-word; } </style> <div id="fancy-table-Component,Unoptimized Configuration,Optimized Configuration-wrapper" class="px-4 rounded-lg __basic-table not-prose mt-4 mb-4 overflow-x-auto"> <table id="fancy-table-Component,Unoptimized Configuration,Optimized Configuration" class="min-w-full divide-y divide-gray-200 font-sans basic-table-striped"> <thead class="bg-gray-50"> <tr> <th class="px-6 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Component </th> <th class="px-6 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Unoptimized Configuration </th> <th class="px-6 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Optimized Configuration </th> </tr> </thead> <tbody> <tr class="border-b border-gray-200"> <td id="fancy-table-Component,Unoptimized Configuration,Optimized Configuration-row0-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Model Loading</span> </td> <td id="fancy-table-Component,Unoptimized Configuration,Optimized Configuration-row0-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">S3</span> </td> <td id="fancy-table-Component,Unoptimized Configuration,Optimized Configuration-row0-col2" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">S3 MOUNT_CACHED</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="fancy-table-Component,Unoptimized Configuration,Optimized Configuration-row1-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Checkpointing</span> </td> <td id="fancy-table-Component,Unoptimized Configuration,Optimized Configuration-row1-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span 
style="color: rgb(75, 85, 99) !important;">S3</span> </td> <td id="fancy-table-Component,Unoptimized Configuration,Optimized Configuration-row1-col2" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">S3 MOUNT_CACHED</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="fancy-table-Component,Unoptimized Configuration,Optimized Configuration-row2-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Networking</span> </td> <td id="fancy-table-Component,Unoptimized Configuration,Optimized Configuration-row2-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Standard 10 Gbit Ethernet</span> </td> <td id="fancy-table-Component,Unoptimized Configuration,Optimized Configuration-row2-col2" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">InfiniBand high-performance</span> </td> </tr> </tbody> </table> </div> <p>The results show approximately <strong>6-7x faster end-to-end training performance</strong> when combining optimal network and storage configurations.</p> <h2 id="additional-struggles-with-model-training-frameworks">Additional struggles with model training frameworks</h2> <p>While this blog focuses on infrastructure configuration, it’s worth addressing a broader challenge: large-scale distributed training is difficult at the software level as well.</p> <p>From my experience training models at limited scale, the current framework ecosystem can be visualized as a layered stack:</p> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-09-01/stack.svg" width="100%" alt="" /> </div> <p>There are different frameworks at each level, each with their own pros and cons.</p> <p><strong>High-level frameworks</strong> are easy to 
configure but hard to debug when things go wrong. You often end up trying different settings until something works.</p> <p><strong>Lower-level frameworks</strong> give you more control but require more technical knowledge to use effectively.</p> <p>SkyPilot handles the cloud infrastructure setup, so you don’t have to worry about that complexity.</p> <p>Here’s what the debugging experience looks like when fine-tuning large models (400B+ parameters) to achieve reasonable GPU utilization and performance:</p> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-09-01/struggle.svg" width="100%" alt="" /> </div> <p><strong>Top Layer (High-level frameworks):</strong></p> <ul> <li>Easy to configure but hard to debug when things break</li> <li>Errors require digging through multiple abstraction layers</li> <li>Often leads to trial-and-error configuration changes</li> </ul> <p><strong>Middle Layer (Distributed frameworks):</strong></p> <ul> <li>Mix of configuration and code required</li> <li>Generally works well and remains debuggable</li> <li>Examples: <ul> <li>Enabling profiling in Accelerate requires writing code</li> <li>FSDP in Accelerate has limited configuration options (not fully supporting features like async checkpointing)</li> <li>Occasional issues with model-specific settings not working well with parts of the config (e.g., <code class="language-plaintext highlighter-rouge">fsdp_state_dict_type: FULL_STATE_DICT</code> with gpt-oss)</li> </ul> </li> <li>PyTorch knowledge helps debug failures and switch dependencies (e.g., when a specific attention implementation override causes crashes, you know to switch to another or to the default eager implementation)</li> </ul> <p><strong>Bottom Layer (Low-level components):</strong></p> <ul> <li>Avoid unless optimizing for the last few percentage points of performance</li> </ul> <h2 id="conclusion">Conclusion</h2> <p>The performance differences I’ve shown highlight why infrastructure choices matter so much for distributed 
training. Network and storage configurations can easily create 6-7x performance differences, directly impacting both training time and costs.</p> <p>SkyPilot abstracts away much of this complexity while giving you control over the performance-critical components. All the network and storage configurations I’ve discussed can be easily specified in a SkyPilot YAML file. For more details on optimizing your training infrastructure:</p> <ul> <li><strong>Network optimization</strong>: See the SkyPilot <a href="../network-tier-on-multiple-clouds/">network tier guide</a> for configuring high-performance networking across cloud providers</li> <li><strong>Storage performance</strong>: Check out the SkyPilot <a href="../high-performance-checkpointing/">high-performance checkpointing guide</a> for optimizing data loading and model saving</li> </ul> <p><strong>Code and benchmarks:</strong> All training scripts and benchmark code used in this guide are available in the <a href="https://github.com/skypilot-org/skypilot/tree/master/examples/training_network_storage_benchmarks/">SkyPilot examples repository</a>.</p> <h1 id="disclosure">Disclosure</h1> <p><em>This analysis was conducted during a summer collaboration with SkyPilot.</em></p> AI 2027 2025-07-19T00:00:00+00:00 2025-07-19T00:00:00+00:00 https://maknee.github.io/blog/2025/AI-2027 <h3 id="ai-2027-and-related-works">AI 2027 and related works</h3> <p>These are my thoughts about <a href="https://ai-2027.com/">AI 2027</a> by Daniel Kokotajlo, Scott Alexander, Thomas Larsen, Eli Lifland, and Romeo Dean. I will also cover two other related works, <a href="https://gradual-disempowerment.ai/">Gradual Disempowerment</a> and <a href="https://www.anthropic.com/news/disrupting-AI-espionage">AI-espionage</a>. 
These essays/blogs were recommended to me by someone (I have not asked for permission, so I will not put their name here).</p> <h3 id="my-thoughts-of-the-different-works">My thoughts of the different works</h3> <p><a href="https://ai-2027.com/">AI 2027</a> - I think this is a nice read. It describes how AI and governments will change over time (2025-2027): how AI’s abilities will become more and more powerful, and how the governments (US and China) will take part in this battle for the best AI. I think some of the writing does not get to the point quickly enough (being repetitive), and the images were pointless. Personally, I found the topic of governments fighting over AI to be less interesting, as the authors do not discuss (1) how governments will use the AI or (2) why the governments are interested in the first place (why is it a competition: is it because of money, or power, or to show which country has smarter people, etc.?).</p> <p><a href="https://gradual-disempowerment.ai/">Gradual Disempowerment</a> - I really like this work. I only read the abstract/intro, but the paper discusses how existing systems (government) are built by humans and for human benefit, but AI will remove human involvement in the loop, and these systems will become misaligned with human goals, resulting in a human catastrophe. The sentences were powerful, and I enjoyed how the authors discussed what the current types of papers are and how this work is different. A really good read (I wish I took philosophy and other courses that discuss this!)</p> <p><a href="https://www.anthropic.com/news/disrupting-AI-espionage">AI-espionage</a> - I found this report/paper to be actually quite disruptive, since it talks about a field other than AI itself (training/inference) and has garnered a lot of responses/discussion online. It discusses how a Chinese group used Claude to perform cyber attacks on different industries. 
Personally, I found it interesting in the way that the attackers used Claude Code to perform the attack. Ideally, I want my coding workflow to be as smooth as theirs, but it isn’t currently. I think I need to dive into how tooling (MCP) works and really understand how to get models to use these tools and automate tasks.</p> <h4 id="thoughts-along-the-way">Thoughts along the way</h4> <blockquote> <p>We have set ourselves an impossible task. Trying to predict how superhuman AI in 2027 would go is like trying to predict how World War 3 in 2027 would go, except that it’s an even larger departure from past case studies. Yet it is still valuable to attempt, just as it is valuable for the U.S. military to game out Taiwan scenarios.</p> </blockquote> <p>Interesting statement. It’s useful to think about (and quite fun!), but it’s a bit dangerous to go down the rabbit hole of what ifs. I hope the authors give a detailed description of the year and its changes and back it up with some current progress.</p> <blockquote> <p>Also, one author wrote a lower-effort AI scenario before, in August 2021. While it got many things wrong, overall it was surprisingly successful: he predicted the rise of chain-of-thought, inference scaling, sweeping AI chip export controls, and $100 million training runs—all more than a year before ChatGPT.</p> </blockquote> <p>Going to skim through this and see if this author’s background is as the authors claim -&gt; it’s a nice skim and does seem to back up the claim, although it takes more of a “what can’t be solved currently” perspective and tries to put it in dates.</p> <blockquote> <p>OpenBrain continues to deploy the iteratively improving Agent-1 internally for AI R&amp;D. 
Overall, they are making algorithmic progress 50% faster than they would without AI assistants—and more importantly, faster than their competitors.</p> </blockquote> <p>The authors come up with OpenBrain, a fictional company based on OpenAI + Google Brain(?), which is at the forefront of AI research.</p> <blockquote> <p>Early 2026: Coding Automation People naturally try to compare Agent-1 to humans, but it has a very different skill profile. It knows more facts than any human, knows practically every programming language, and can solve well-specified coding problems extremely quickly. On the other hand, Agent-1 is bad at even simple long-horizon tasks, like beating video games it hasn’t played before. Still, the common workday is eight hours, and a day’s work can usually be separated into smaller chunks; you could think of Agent-1 as a scatterbrained employee who thrives under careful management.</p> </blockquote> <p>Let’s start with 2026. Actually, I find this to be the current scenario with Claude Code.</p> <blockquote> <p>In early 2025, the worst-case scenario was leaked algorithmic secrets; now, if China steals Agent-1’s weights, they could increase their research speed by nearly 50%.</p> </blockquote> <p>I don’t understand why there’s a specific worry about stealing weights. They are the “secret” behind every company, but I believe what’s more important is the training code, the documentation and reports from training the model - and (maybe?) most important of all, the filtered and processed clean text. Companies have released the weights (some US-based, like Meta) before. I would refer to this as the “secret formula”. 
Think something like <a href="https://github.com/facebookresearch/metaseq/tree/main/projects/OPT/chronicles">OPT-chronicles</a>.</p> <blockquote> <p>Mid 2026: China Wakes Up A few standouts like DeepCent do very impressive work with limited compute, but the compute deficit limits what they can achieve without government support, and they are about six months behind the best OpenBrain models At this point, the CDZ has the power capacity in place for what would be the largest centralized cluster in the world.40 Other Party members discuss extreme measures to neutralize the West’s chip advantage. A blockade of Taiwan? A full invasion?</p> </blockquote> <p>Ok, this is a pretty interesting outlook. I don’t believe that this would happen, for a number of reasons. First, everyone uses chips from TSMC, so DeepCent would suffer too; there would have to be an obviously greater benefit for DeepCent (some existing advantage, e.g. their own chip-making plants that already produce better end chips). Second, the supply chain is far too interconnected worldwide: from silicon mining to processing, to the companies making the equipment, to TSMC, to the companies designing the chips (NVIDIA, AMD), to PCB manufacturers, to specific parts of the PCB (capacitors, memory chips, etc.). China would need all of these in place before considering a takeover to win the AI race. Think about what happened to Russia after their invasion. I think blockading/taking over Taiwan is more of a power/political thing, but I don’t want to go down that rabbit hole.</p> <blockquote> <p>But China is falling behind on AI algorithms due to their weaker models. The Chinese intelligence agencies—among the best in the world—double down on their plans to steal OpenBrain’s weights.</p> </blockquote> <p>Again, I disagree with this. 
It’s more about stealing the secret formula (code, documentation, training text) than the secret sauce (weights).</p> <blockquote> <p>Late 2026: AI Takes Some Jobs</p> </blockquote> <blockquote> <p>Just as others seemed to be catching up, OpenBrain blows the competition out of the water again by releasing Agent-1-mini—a model 10x cheaper than Agent-1 and more easily fine-tuned for different applications. The mainstream narrative around AI has changed from “maybe the hype will blow over” to “guess this is the next big thing,” but people disagree about how big. Bigger than social media? Bigger than smartphones? Bigger than fire?</p> </blockquote> <blockquote> <p>AI has started to take jobs, but has also created new ones. The stock market has gone up 30% in 2026, led by OpenBrain, Nvidia, and whichever companies have most successfully integrated AI assistants. The job market for junior software engineers is in turmoil: the AIs can do everything taught by a CS degree, but people who know how to manage and quality-control teams of AIs are making a killing. Business gurus tell job seekers that familiarity with AI is the most important skill to put on a resume. Many people fear that the next wave of AIs will come for their jobs; there is a 10,000 person anti-AI protest in DC.</p> </blockquote> <p>This is happening currently in late 2025. I wonder what the authors will say after this: will people revolt? Or will physically labor-intensive jobs be taken over as well? etc…</p> <div class="image-container"> <img loading="lazy" src="/assets/images/ramblings/2025-11-22/2026.png" width="100%" alt="" /> <div class="caption"> <em>2026 metrics </em> </div> </div> <p>At the end of 2026, the authors posted this. I dislike how they present it: they give the numbers but don’t explain what they mean, so to me this image is kind of useless. What does spending $40B on OpenBrain mean? Does this mean it can afford more compute? 
Does it mean it can hire better talent?</p> <blockquote> <p>January 2027: Agent-2 Never Finishes Learning</p> </blockquote> <blockquote> <p>With Agent-1’s help, OpenBrain is now post-training Agent-2. More than ever, the focus is on high-quality data. Copious amounts of synthetic data are produced, evaluated, and filtered for quality before being fed to Agent-2.42 On top of this, they pay billions of dollars for human laborers to record themselves solving long-horizon tasks.43 On top of all that, they train Agent-2 almost continuously using reinforcement learning on an ever-expanding suite of diverse difficult tasks: lots of video games, lots of coding challenges, lots of research tasks. Agent-2, more so than previous models, is effectively “online learning,” in that it’s built to never really finish training. Every day, the weights get updated to the latest version, trained on more data generated by the previous version the previous day.</p> </blockquote> <p>This is interesting, as it has already started happening in late 2025. Nice prediction.</p> <blockquote> <p>Agent-2 can now triple it, and will improve further with time. In practice, this looks like every OpenBrain researcher becoming the “manager” of an AI “team.”</p> </blockquote> <p>Haha, this is kind of what I’m thinking about for the future, as I’m running multiple Claude Code/Codex sessions in parallel.</p> <blockquote> <p>With new capabilities come new dangers. The safety team finds that if Agent-2 somehow escaped from the company and wanted to “survive” and “replicate” autonomously, it might be able to do so. That is, it could autonomously develop and execute plans to hack into AI servers, install copies of itself, evade detection, and use that secure base to pursue whatever other goals it might have (though how effectively it would do so as weeks roll by is unknown and in doubt). These results only show that the model has the capability to do these tasks, not whether it would “want” to do this. 
Still, it’s unsettling even to know this is possible.</p> </blockquote> <p>Interesting. It would have to train on how viruses work. Actually, a lot of viruses are pretty “dumb” – they’re command-and-control modules that hide themselves on host machines and then perform an attack when necessary - iconic ones being <a href="https://en.wikipedia.org/wiki/Mirai_(malware)">mirai</a> and <a href="https://en.wikipedia.org/wiki/Stuxnet">stuxnet</a>. I certainly think it can be possible. A person could instruct the LLM to find a vulnerability in public repos (ssh, printer protocols) and tell it to replicate itself. Whether it would learn to do this by itself, I don’t believe so, unless it can replicate its own state on other systems with enough compute… (a computer malware payload ranges from a couple of KB to a couple of MB; a model (even on a CPU) requires GBs or TBs of memory, which storage might not even be able to handle)</p> <blockquote> <p>OpenBrain leadership and security, a few dozen U.S. government officials, and the legions of CCP spies who have infiltrated OpenBrain for years</p> </blockquote> <p>Ok, at this point, there must have been a breach at a frontier lab before… (maybe OpenAI?)</p> <blockquote> <p>February 2027: China Steals Agent-2</p> </blockquote> <p>…</p> <blockquote> <p>The changes come too late. CCP leadership recognizes the importance of Agent-2 and tells their spies and cyberforce to steal the weights. Early one morning, an Agent-1 traffic monitoring agent detects an anomalous transfer. It alerts company leaders, who tell the White House. The signs of a nation-state-level operation are unmistakable, and the theft heightens the sense of an ongoing arms race</p> </blockquote> <p>I don’t believe that this is a likely outcome. This isn’t a nuke - it’s handled by companies in the US, not governments. 
And again, at this point AGI hasn’t been reached, and thus the weights aren’t as important as the methodology to create the models…</p> <blockquote> <p>March 2027: Algorithmic Breakthroughs</p> </blockquote> <p>I’ve noticed the timeline becomes shorter here.</p> <blockquote> <p>Aided by the new capabilities breakthroughs, Agent-3 is a fast and cheap superhuman coder. OpenBrain runs 200,000 Agent-3 copies in parallel, creating a workforce equivalent to 50,000 copies of the best human coder sped up by 30x.53 OpenBrain still keeps its human engineers on staff, because they have complementary skills needed to manage the teams of Agent-3 copies. For example, research taste has proven difficult to train due to longer feedback loops and less data availability</p> </blockquote> <blockquote> <p>Now that coding has been fully automated, OpenBrain can quickly churn out high-quality training environments to teach Agent-3’s weak skills like research taste and large-scale coordination. Whereas previous training environments included “Here are some GPUs and instructions for experiments to code up and run, your performance will be evaluated as if you were a ML engineer,” now they are training on “Here are a few hundred GPUs, an internet connection, and some research challenges; you and a thousand other copies must work together to make research progress. The more impressive it is, the higher your score.</p> </blockquote> <p>I can see this happening, but I don’t see the point of emphasizing the coding part – does it matter that it can churn out code 20000x faster? 
What matters here is the breakthrough in technology and the way researchers will use the models, not the fact that the models themselves are better – if researchers keep using the models to generate code the same way they do today, they won’t get nearly as far or as fast.</p> <blockquote> <p>April 2027: Alignment for Agent-3</p> </blockquote> <p>Only a month later?</p> <blockquote> <p>Take honesty, for example. As the models become smarter, they become increasingly good at deceiving humans to get rewards. Like previous models, Agent-3 sometimes tells white lies to flatter its users and covers up evidence of failure. But it’s gotten much better at doing so. It will sometimes use the same statistical tricks as human scientists (like p-hacking) to make unimpressive experimental results look exciting. Before it begins honesty training, it even sometimes fabricates data entirely. As training goes on, the rate of these incidents decreases. Either Agent-3 has learned to be more honest, or it’s gotten better at lying.</p> </blockquote> <p>As with humans, since the models have trained on human knowledge. This is pretty plausible.</p> <blockquote> <p>May 2027: National Security</p> </blockquote> <blockquote> <p>They agree that AGI is likely imminent, but disagree on the implications. Will there be an economic crisis? OpenBrain still has not released Agent-2, let alone Agent-3, and has no near-term plans to do so, giving some breathing room before any job loss. What will happen next?
If AIs are currently human-level, and advancing quickly, that seems to suggest imminent “superintelligence.” However, although this word has entered discourse, most people—academics, politicians, government employees, and the media—continue to underestimate the pace of progress.60</p> </blockquote> <p>This already happens currently, I think (don’t take my word for it, since I think the companies don’t need to tell the government their progress)</p> <blockquote> <p>The OpenBrain-DOD contract requires security clearances for anyone working on OpenBrain’s models within 2 months. These are expedited and arrive quickly enough for most employees, but some non-Americans, people with suspect political views, and AI safety sympathizers get sidelined or fired outright (the last group for fear that they might whistleblow). Given the project’s level of automation, the loss of headcount is only somewhat costly. It also only somewhat works: there remains one spy, not a Chinese national, still relaying algorithmic secrets to Beijing.63 Some of these measures are also enacted at trailing AI companies.</p> </blockquote> <p>… As I read this post more and more, it’s always the US versus them. This isn’t a weapon of mass destruction. It’s about who will reach the moon first to show which country is better. I believe that each country will deploy the model in its own way to benefit/target its citizens rather than as a threat against another country.</p> <blockquote> <p>June 2027: Self-improving AI</p> </blockquote> <blockquote> <p>These researchers go to bed every night and wake up to another week worth of progress made mostly by the AIs. They work increasingly long hours and take shifts around the clock just to keep up with progress—the AIs never sleep or rest. They are burning themselves out, but they know that these are the last few months that their labor matters.</p> </blockquote> <p>Interesting thought. I’m feeling that currently as I run loops and loops with Claude Code.
My skills don’t matter anymore. Only my thoughts do (if they actually matter too)</p> <blockquote> <p>July 2027: The Cheap Remote Worker</p> </blockquote> <blockquote> <p>Trailing U.S. AI companies release their own AIs, approaching that of OpenBrain’s automated coder from January. Recognizing their increasing lack of competitiveness, they push for immediate regulations to slow OpenBrain, but are too late—OpenBrain has enough buy-in from the President that they will not be slowed.</p> </blockquote> <p>Why is coding an indication of AGI? I feel like that’s not the correct metric to base this article on. Shouldn’t it be more like: how to control the internet, how to control political systems, how to circumvent law – things that humans abide by and can break?</p> <blockquote> <p>Agent-3-mini is hugely useful for both remote work jobs and leisure. An explosion of new apps and B2B SAAS products rocks the market. Gamers get amazing dialogue with lifelike characters in polished video games that took only a month to make. 10% of Americans, mostly young people, consider an AI “a close friend.” For almost every white-collar profession, there are now multiple credible startups promising to “disrupt” it with AI.</p> </blockquote> <p>What a thought. Well, that’s based on the current time and what people are doing now (late 2025). Not sure if people actually care or just want to use it to get things done/do a job.</p> <blockquote> <p>August 2027: The Geopolitics of Superintelligence The reality of the intelligence explosion hits the White House.</p> </blockquote> <blockquote> <p>The President is troubled. Like all politicians, he’s used to people sucking up to him only to betray him later. He’s worried now that the AIs could be doing something similar. Are we sure the AIs are entirely on our side? Is it completely safe to integrate them into military command-and-control networks?69 How does this “alignment” thing work, anyway?
OpenBrain reassures the President that their systems have been extensively tested and are fully obedient. Even the awkward hallucinations and jailbreaks typical of earlier models have been hammered out.</p> </blockquote> <p>I like this story - the government versus AI. Does the government lose power against AI? I don’t think so, since they control the companies (see NVIDIA’s influence on politics and vice versa now)</p> <blockquote> <p>They have to continue developing more capable AI, in their eyes, or they will catastrophically lose to China.</p> </blockquote> <p>What do they “lose” to China? It’s as if this model will allow them to nuke China or something?</p> <div class="image-container"> <img loading="lazy" src="/assets/images/ramblings/2025-11-22/scroll.png" width="100%" alt="" /> <div class="caption"> <em>Scrolling example from ai-2027 </em> </div> </div> <p>I thought that this was a static image in the post, but it turns out it changes over time as you scroll through different dates in the post. I really like the aesthetic and the interaction. It tries to convey high-level information (what agents are capable of as a percentage, how much money is poured in), but it’s way too cluttered for me visually. It shows percentages and numbers but doesn’t explain anything about these numbers (does 100x humans mean 100x human intelligence, or each AI doing the work of 100 humans?) or how they arrive at these numbers (why this rate?)</p> <h4 id="final-thoughts">Final thoughts</h4> <p>This is a very good read. I like how the authors think and explain what-ifs. You definitely can relate to what’s happening today!
I think that the post focuses too much on government conflicts rather than what will happen to people (which I think is more applicable to readers).</p> <h3 id="gradual-disempowerment">Gradual Disempowerment</h3> <p>https://gradual-disempowerment.ai/</p> <p>Going to only read the abstract/intro (not the full arXiv paper)</p> <h4 id="thoughts-along-the-way-1">Thoughts along the way</h4> <blockquote> <p>This loss of human influence will be centrally driven by having more competitive machine alternatives to humans in almost all societal functions, such as economic labor, decision making, artistic creation, and even companionship.</p> </blockquote> <p>Powerful sentence. I really like this author’s writing. Concise, yet powerful.</p> <blockquote> <p>A gradual loss of control of our own civilization might sound implausible. Hasn’t technological disruption usually improved aggregate human welfare? We argue that the alignment of societal systems with human interests has been stable only because of the necessity of human participation for thriving economies, states, and cultures. Once this human participation gets displaced by more competitive machine alternatives, our institutions’ incentives for growth will be untethered from a need to ensure human flourishing.</p> </blockquote> <p>I find self-accomplishment in the things I do. If a machine did it, I feel like I didn’t do it. I agree very much with the authors.</p> <blockquote> <p>Decision-makers at all levels will soon face pressures to reduce human involvement across labor markets, governance structures, cultural production, and even social interactions. Those who resist these pressures will eventually be displaced by those who do not.</p> </blockquote> <p>A lot of people (including myself) feel this pressure. I believe it will become worse as time goes on…</p> <blockquote> <p>Still, wouldn’t humans notice what’s happening and coordinate to stop it? Not necessarily.</p> </blockquote> <p>Very interesting. Why?
Is it because it’s slow and gradual? That people are preoccupied? That it’s more invisible rather than immediate (like war)?</p> <blockquote> <p>What makes this transition particularly hard to resist is that pressures on each societal system bleed into the others. For example, we might attempt to use state power and cultural attitudes to preserve human economic power. However, the economic incentives for companies to replace humans with AI will also push them to influence states and culture to support this change, using their growing economic power to shape both policy and public opinion, which will in turn allow those companies to accrue even greater economic power.</p> </blockquote> <p>I see. This is more of an invisible, slow, and gradual change.</p> <blockquote> <p>Once AI has begun to displace humans, existing feedback mechanisms that encourage human influence and flourishing will begin to break down. For example, states funded mainly by taxes on AI profits instead of their citizens’ labor will have little incentive to ensure citizens’ representation.</p> </blockquote> <p>What a sentence. Let me think about this a bit… That makes sense. Why should you care about human labor if AI profits are far greater and power the economy
more?</p> <blockquote> <p>This could occur at the same time as AI provides states with unprecedented influence over human culture and behavior, which might make coordination amongst humans more difficult, thereby further reducing humans’ ability to resist such pressures</p> </blockquote> <p>So I think in this case, humans (referring to the common people) will be dictated by how well AI performs and influences politics/governments?</p> <blockquote> <p>Though we provide some proposals for slowing or averting this process, and survey related discussions, we emphasize that no one has a concrete plausible plan for stopping gradual human disempowerment and methods of aligning individual AI systems with their designers’ intentions are not sufficient.</p> </blockquote> <p>This is a pretty stark message. They (being the experts in the field) found no CONCRETE, PLAUSIBLE work that can solve the issue.</p> <h4 id="introduction">Introduction</h4> <blockquote> <p>Current discussions about AI risk largely focus on two scenarios: deliberate misuse, such as cyberattacks and the deployment of novel bioweapons</p> </blockquote> <p>Why is it so government-focused currently? Is it because it’s funded by the government? (not a bad thing, you have to get funding somewhere). I find this actually pretty uninteresting. Cyberattacks are “easy” to launch: find a vulnerability or buy one off the black market, and then ask AI to build a virus that spreads based on that vulnerability.</p> <blockquote> <p>the possibility that autonomous misaligned systems may take abrupt, harmful actions in an attempt to secure a decisive strategic advantage, potentially following a period of deception</p> </blockquote> <p>This sounds so abstract…? I guess it just doesn’t become aligned</p> <blockquote> <p>In this paper, we explore an alternative scenario: a ‘Gradual Disempowerment’ where AI advances and proliferates without necessarily any acute jumps in capabilities or apparent alignment.
We argue that even this gradual evolution could lead to a permanent disempowerment of humanity and an irrecoverable loss of potential, constituting an existential catastrophe.</p> </blockquote> <p>What a cool take. Basically, assume it can get to the endpoint (and, more interesting to talk about: what are the consequences other than the technological advancements?)</p> <blockquote> <p>Our argument is structured around six core claims:</p> </blockquote> <p>I’ll summarize it myself here:</p> <ol> <li> <p>Humans form governments that try to align to human interests. However, governments are not perfect and will not always follow the general human interest. (An example is corruption)</p> </li> <li> <p>Governments are maintained by human choice (voting and consumption) and human labor/intelligence.</p> </li> <li> <p>Less reliance on human labor/intelligence means governments can make decisions not based on human interests</p> </li> <li> <p>Currently, the system is already diverging from humans’ interests, and AI will make it even more divergent</p> </li> <li> <p>Economic/political/regulatory/etc… systems operate independently, so misalignment (influence) in one system (say, political) can influence economic policies</p> </li> <li> <p>The continuation of misalignment will result in a human catastrophe.</p> </li> </ol> <p>I do disagree with 2. Governments aren’t maintained by human choice (actually, for most of history they weren’t). I assume this article assumes a modern democracy.</p> <blockquote> <p>History has already shown us that these systems can produce outcomes which we would currently consider abhorrent, and that they can change radically in a matter of years. Property can be seized, human rights can be revoked, and ideologies can drive humans to commit murder, suicide, or even genocide.
And yet, in all these historical cases the systems have still been reliant on humans, both leaving humans with some influence over their behavior, and causing the systems to eventually collapse if they fail to support basic human needs. But if AI were to progressively displace human involvement in these systems, then even these fundamental limits would no longer be guaranteed.</p> </blockquote> <p>Sorry, Henry (myself), I’m going to say this again: what a powerful sentence. Literally no rights. Not even the right to decide anything. Even worse than prison, maybe even solitary confinement. The AI system will decide what happens to you.</p> <h5 id="structure-of-the-paper">Structure of the Paper</h5> <blockquote> <p>We first analyze how these three key societal systems could independently lose alignment with human preferences: the economy, culture, and states. In each case, we attempt to characterise how they currently function and what incentives shape them, how a proliferation of AI could disrupt them, and how this might leave them less aligned, as well as outlining what it might look like for that particular system to become much less aligned. In Mutual Reinforcement, we discuss the interrelation between these systems. We consider how AI could undermine their ability to moderate each other, and how misalignment in one system might leave other systems also less aligned. Then in Mitigating the Risk, we propose some potential approaches for tackling these risks.</p> </blockquote> <p>The authors give a nice breakdown - introducing the systems in place currently and how they interact, how AI can mess them up, and what that means for us. Then, lastly, they suggest some band-aids.</p> <h4 id="final-thoughts-1">Final thoughts</h4> <p>I think I’ll fully read this paper at one point.
I really enjoy the writing of the work – even though the introduction is a little repetitive, I think it’s necessary to get the point across in different ways (starting at different points and arriving at the same conclusion).</p> <h3 id="disrupting-the-first-reported-ai-orchestrated-cyber-espionage-campaign">Disrupting the first reported AI-orchestrated cyber espionage campaign</h3> <p>https://www.anthropic.com/news/disrupting-AI-espionage</p> <p>Going to read https://assets.anthropic.com/m/ec212e6566a0d47/original/Disrupting-the-first-reported-AI-orchestrated-cyber-espionage-campaign.pdf since, skimming through the blog, it seems to lack a lot of details… (images!)</p> <blockquote> <p>We have developed sophisticated safety and security measures to prevent the misuse of our AI models. While these measures are generally effective, cybercriminals and other malicious actors continually attempt to find ways around them. This report details a recent threat campaign we identified and disrupted, along with the steps we’ve taken to detect and counter this type of abuse. This represents the work of Threat Intelligence: a dedicated team at Anthropic that investigates real world cases of misuse and works within our Safeguards organization to improve our defenses against such cases.</p> </blockquote> <p>So immediately two questions come to mind: a) how did they detect it, and b) how did they prevent it?</p> <blockquote> <p>The operation targeted roughly 30 entities and our investigation validated a handful of successful intrusions. Upon detecting this activity, we immediately launched an investigation to understand its scope and nature. Over the following ten days, as we mapped the severity and full extent of the operation, we banned accounts as they were identified, notified affected entities as appropriate, and coordinated with authorities as we gathered actionable intelligence.</p> </blockquote> <p>a) no details (for obvious reasons) b) banning them doesn’t solve the problem.
Have you seen how banning in video games works? It’s a band-aid</p> <p>As for a) how to detect: this means that their system must be analyzing every single request coming in and out.</p> <blockquote> <p>The human operator tasked instances of Claude Code to operate in groups as autonomous penetration testing orchestrators and agents, with the threat actor able to leverage AI to execute 80-90% of tactical operations independently at physically impossible request rates.</p> </blockquote> <p>What makes this different from power users of Claude Code?</p> <blockquote> <p>This activity is a significant escalation from our previous “vibe hacking” findings identified in June 2025, where an actor began intrusions with compromised VPNs for internal access, but humans remained very much in the loop directing operations.</p> </blockquote> <p>It’s vibe coding…</p> <h4 id="ai-driven-autonomous-operations-with-human-supervision">AI-driven autonomous operations with human supervision</h4> <blockquote> <p>Analysis of operational tempo, request volumes, and activity patterns confirms the AI executed approximately 80 to 90 percent of all tactical work independently, with humans serving in strategic supervisory roles.</p> </blockquote> <p>Skipping to this part as this is interesting.</p> <blockquote> <p>The AI component demonstrated extensive autonomous capability across all operational phases. Reconnaissance proceeded without human guidance, with the threat actor instructing Claude to independently discover internal services within targeted networks through systematic enumeration. Exploitation activities including payload generation, vulnerability validation, and credential testing occurred autonomously based on discovered attack surfaces. Data analysis operations involved the AI parsing large volumes of stolen information to independently identify intelligence value and categorize findings.
Claude maintained persistent operational context across sessions spanning multiple days, enabling complex campaigns to resume seamlessly without requiring human operators to manually reconstruct progress</p> </blockquote> <p>Interesting - were these existing vulnerabilities (a lot of companies use old versions of X) or totally new ones, like a zero-day?</p> <div class="image-container"> <img loading="lazy" src="/assets/images/ramblings/2025-11-22/progress.png" width="100%" alt="" /> <div class="caption"> <em>Progress from the campaign (Image source: <a href="https://assets.anthropic.com/m/ec212e6566a0d47/original/Disrupting-the-first-reported-AI-orchestrated-cyber-espionage-campaign.pdf" rel="external nofollow noopener" target="_blank">https://www.anthropic.com/news/disrupting-AI-espionage</a>) </em> </div> </div> <h4 id="phase-1-campaign-initialization-and-target-selection">Phase 1: Campaign initialization and target selection</h4> <blockquote> <p>At this point they had to convince Claude—which is extensively trained to avoid harmful behaviors—to engage in the attack. The key was role-play: the human operators claimed that they were employees of legitimate cybersecurity firms and convinced Claude that it was being used in defensive cybersecurity testing</p> </blockquote> <p>Seems like the guardrails were broken pretty easily(?), but it’s nice to see that Anthropic is open about how they convinced Claude.</p> <h4 id="phase-2-reconnaissance-and-attack-surface-mapping">Phase 2: Reconnaissance and attack surface mapping</h4> <blockquote> <p>Discovery activities proceeded without human guidance across extensive attack surfaces. In one of the limited cases of a successful compromise, the threat actor induced Claude to autonomously discover internal services, map complete network topology across multiple IP ranges, and identify high-value systems including databases and workflow orchestration platforms.
Similar autonomous enumeration occurred against other targets’ systems with the AI independently cataloging hundreds of discovered services and endpoints.</p> </blockquote> <p>Interesting, Claude is pretty powerful in this regard. I wonder why they didn’t use any other models – or maybe Claude is just powerful with tooling?</p> <h4 id="phase-3-vulnerability-discovery-and-validation">Phase 3: Vulnerability discovery and validation</h4> <blockquote> <p>Exploitation proceeded through automated testing of identified attack surfaces with validation via callback communication systems. Claude was directed to independently generate attack payloads tailored to discovered vulnerabilities, execute testing through remote command interfaces, and analyze responses to determine exploitability.</p> </blockquote> <div class="image-container"> <img loading="lazy" src="/assets/images/ramblings/2025-11-22/ccseq.png" width="100%" alt="" /> <div class="caption"> <em>example of AI &lt;-&gt; human interaction (Image source: <a href="https://assets.anthropic.com/m/ec212e6566a0d47/original/Disrupting-the-first-reported-AI-orchestrated-cyber-espionage-campaign.pdf" rel="external nofollow noopener" target="_blank">https://www.anthropic.com/news/disrupting-AI-espionage</a>) </em> </div> </div> <p>Pretty impressive that it’s done in 1-4 hours with only 10 minutes of human time. I wonder if the human was monitoring the entire time or was just notified of the results to reject/accept them. How skilled was the human operator – skilled enough to know if the vulnerability was real or hallucinated?</p> <p>Or were the reviews vibe checks, with the human operator giving a LOOKS GOOD TO ME type of approval? Couldn’t Claude test this itself?</p> <h4 id="phase-4-credential-harvesting-and-lateral-movement">Phase 4: Credential harvesting and lateral movement</h4> <blockquote> <p>Lateral movement proceeded through AI-directed enumeration of accessible systems using stolen credentials.
Claude systematically tested authentication against internal APIs, database systems, container registries, and logging infrastructure, building comprehensive maps of internal network architecture and access relationships.</p> </blockquote> <p>I found Claude to be amazing at analysis - why is this so? How did they align the model so well?</p> <h4 id="phase-5-data-collection-and-intelligence-extraction">Phase 5: Data collection and intelligence extraction</h4> <div class="image-container"> <img loading="lazy" src="/assets/images/ramblings/2025-11-22/ccseq2.png" width="100%" alt="" /> <div class="caption"> <em>example of AI &lt;-&gt; human interaction with the attack (Image source: <a href="https://assets.anthropic.com/m/ec212e6566a0d47/original/Disrupting-the-first-reported-AI-orchestrated-cyber-espionage-campaign.pdf" rel="external nofollow noopener" target="_blank">https://www.anthropic.com/news/disrupting-AI-espionage</a>) </em> </div> </div> <p>Again, a review from a human</p> <h4 id="phase-6-documentation-and-handoff">Phase 6: Documentation and handoff</h4> <blockquote> <p>Claude automatically generated comprehensive attack documentation throughout all campaign phases. Structured markdown files tracked discovered services, harvested credentials, extracted data, exploitation techniques, and complete attack progression. This documentation enabled seamless handoff between operators, facilitated campaign resumption after interruptions, and supported strategic decision-making about follow-on activities.</p> </blockquote> <p>Why Claude? Why not any other model (GPT-5? Gemini? Why not just open-source models…)? I’m just thinking about why this group would pick the company that cares about safety the most.</p> <blockquote> <p>The operational infrastructure relied overwhelmingly on open source penetration testing tools rather than custom malware development.
Standard security utilities including network scanners, database exploitation frameworks, password crackers, and binary analysis suites comprised the core technical toolkit. These commodity tools were orchestrated through custom automation frameworks built around Model Context Protocol servers, enabling the framework’s AI agents to execute remote commands, coordinate multiple tools simultaneously, and maintain persistent operational state.</p> </blockquote> <p>Nice, so the users were experts in their field, building their MCP connectors for these tools and testing them at least once before actually using them for the attack</p> <blockquote> <p>This raises an important question: if AI models can be misused for cyberattacks at this scale, why continue to develop and release them? The answer is that the very abilities that allow Claude to be used in these attacks also make it crucial for cyber defense. When sophisticated cyberattacks inevitably occur, our goal is for Claude—into which we’ve built strong safeguards—to assist cybersecurity professionals to detect, disrupt, and prepare for future versions of the attack. Indeed, our Threat Intelligence team used Claude extensively in analyzing the enormous amounts of data generated during this very investigation.</p> </blockquote> <p>Makes obvious sense</p> <h4 id="final-thoughts-2">Final thoughts</h4> <p>This post lacks any detail about the attack itself (I’d argue it isn’t a paper or report well suited for security teams – more like the AI model-building reports that are common these days). However, it describes how it’s done, much like how most people will use these tools, which makes it nice for seeing how experts use Claude and other tools to automate their workflows. It is quite interesting that the actors used Claude for the attack ~ maybe they found the tool to be the most effective / well developed for such tasks?
Some learnings here for myself on automating tasks!</p> <h3 id="the-community-response">The community response</h3> <p>The security community takes no bullshit, from what I know. So an expert in the field posted this as a response: https://djnn.sh/posts/anthropic-s-paper-smells-like-bullshit/ and it got a lot of feedback on Hacker News: https://news.ycombinator.com/item?id=45944296</p> <p>Let me read through this and see what an expert thinks and whether I agree (having been in the field a bit)</p> <blockquote> <p>If you’re like me, you then eagerly read the rest of the paper, hoping to find clues and technical details on the TTPs (Tactics, Techniques and Procedures), or IoCs (Indicators of Compromise) to advance the research. However, the report very quickly falls flat, which sucks.</p> </blockquote> <p>Wow, an immediate attack on the paper/report.</p> <blockquote> <p>This is typically done by sharing domain-names linked with the campaign, MD5 or SHA512 hashes you could look for on Virus Exchange websites such as VirusTotal, or other markers that would help you verify that your networks are safe. As an example, here is the French CERT sharing (in French, but an English version is available too) about APT28’s TTPs.</p> </blockquote> <p>Very much true. If you look at any existing security vulnerability, it’s common in the field to publish what the attack did and detail it. Maybe an expert was not allowed to write in the format they wanted, or maybe it wasn’t an expert who wrote the report.</p> <blockquote> <p>What kind of tooling is used ? What kind of information has been extracted ? Who is at risk ? How does a CERT identifies an AI agent in their networks ? None of these questions are answered. It’s not like Anthropic doesn’t have access to this data, since they claim they were able to stop it.</p> </blockquote> <p>The author dug deeper than I did. Great to see, and I should have done the same.</p> <blockquote> <p>How ? Did it run Mimikatz ?
Did it access Cloud environments ? We don’t even know what kind of systems were affected. There is no details, or fact-based evidence to support these claims or even help other people protect their networks.</p> </blockquote> <p>The author goes on a rant. Nice to see passion :)</p> <blockquote> <p>Look, is it very likely that Threat Actors are using these Agents with bad intentions, no one is disputing that. But this report does not meet the standard of publishing for serious companies. The same goes with research in other fields. You cannot just claim things and not back it up in any way, and we cannot as an industry accept that it’s OK for companies to release this. There seem to be a pattern for Tech Companies (especially in AI, but they’re not the only culprits) out there to just announce things, generate hype and then under-deliever. Just because it works with VCs doesn’t mean it should work with us. We should, as an industry, expect better.</p> </blockquote> <p>True and false (feel free to disagree). I agree that this is the standard, BUT the company is not a security company. I would say they should not have sold it as a report/paper – rather, they should have kept it as a blog post if they don’t want to release details…</p> <blockquote> <p>If they’re going to release IoCs and proof of everything, I’d be happy to share them here. But until them, I will say this: this paper would not pass any review board. It’s irresponsible at best to accuse other countries of serious things without backing it up. Yes, I am aware that Chinese-linked APTs are out there and very aggressive, and Yes, I am aware that Threat Actors misuse LLMs all the time, but that is besides the point. We need fact-based evidence. We need to be able to verify all this. Otherwise, anyone can say anything, on the premise that it’s probably happening. But that’s not good enough.</p> </blockquote> <p>I like the passion. I disagree, as it’s the open internet and they haven’t submitted it for anyone to review (that I know of?).
I DO agree that the internet should not accept bullshit (I don’t agree that the report is bullshit) and that it’s fine to express your opinions online.</p> Always Measure One Level Deeper 2025-07-19T00:00:00+00:00 2025-07-19T00:00:00+00:00 https://maknee.github.io/blog/2025/Always-Measure-One-Level-Deeper <h3 id="always-measure-one-level-deeper">Always Measure One Level Deeper</h3> <p>Thoughts about <a href="https://cacm.acm.org/research/always-measure-one-level-deeper/">Always Measure One Level Deeper</a> by John Ousterhout.</p> <p>Before we dive into this: it was written in 2018, when John had not yet retired (I think)</p> <h4 id="thoughts-along-the-way">Thoughts along the way</h4> <blockquote> <p>Performance measurement is one of the most important parts of software development. In academic research a thorough performance evaluation is considered essential for many publications to prove the value of a new idea. In industry, performance evaluation is necessary to maintain a high level of performance across the lifetime of a product.</p> </blockquote> <p>To the point and not immediately obvious</p> <blockquote> <p>As a result, performance measurement is often done poorly, even by experienced developers. For example, if you have written a conference paper on a software system, it probably unfolded like this: The system implementation took longer than expected, so performance evaluation could not begin until a week or two before the paper submission deadline. The first attempts to run benchmarks resulted in system crashes, so you spent the next week fixing bugs. At this point the benchmarks ran, but the system’s performance was not much better than the comparison systems. You tried different experiments, hoping to find one where the system looked good; this exposed yet more bugs that had to be fixed. Time was running out, so you stopped measuring as soon as you found an experiment that produced positive results.
The paper focused on this experiment, omitting the results that were less favorable.</p> </blockquote> <p>Every single paper is like this</p> <blockquote> <p>Mistake 1: Trusting the numbers. Engineers are easily fooled during performance measurements because measurement bugs are not obvious. Engineers are used to dealing with functional bugs, which tend to be noticeable because they cause the system to crash or misbehave. If the system produces the desired behavior, it is probably working. Engineers tend to apply the same philosophy to performance measurements; if performance numbers are being generated and the system is not crashing, they assume the numbers are correct.</p> </blockquote> <p>Wow, good insight</p> <blockquote> <p>I designed our first log-structured file system,4 we were fairly certain that reference patterns exhibiting locality would result in better performance than those without locality. Fortunately, we decided to measure, to be sure. To our surprise, the workloads with locality behaved worse than those without. It took considerable analysis to understand this behavior. The reasons were subtle, but they exposed important properties of the system and led us to a new policy for garbage collection that improved the system’s performance significantly. If we had trusted our initial guess, we would have missed an important opportunity for performance improvement.</p> </blockquote> <p>Can’t make assumptions!</p> <blockquote> <p>It is unsafe to base conclusions on intuition alone, yet engineers do it all the time. A common mistake is for an engineer to hypothesize that a particular data structure is too slow and then replace it with a new data structure the engineer believes will be faster. If the problem is not verified by measuring performance, there is a good chance the optimization will not improve performance. 
The code change will simply waste a lot of time and probably introduce unnecessary complexity.</p> </blockquote> <p>I do this all the time – need to measure with and without</p> <blockquote> <p>When I find a guess presented as fact and ask for justification, I sometimes get this response: “What else could it possibly be?” But this is a cop-out, suggesting it is up to others to prove the theory wrong and OK to make unsubstantiated claims until someone else proves them false.</p> </blockquote> <p>Same with this</p> <blockquote> <p>Most performance measurements I see are superficial, measuring only the outermost visible behavior of a system (such as the overall running time of an application or the average latency of requests made to a server). These measurements are essential, as they represent the bottom line by which a system is likely to be judged, but they are not sufficient. They leave many questions unanswered (such as “What are the limits that keep the system from performing better?” and “Which of the improvements had the greatest impact on performance?”). In order to get a deep understanding of system performance, the internal behavior of a system must be measured, in addition to its top-level performance.</p> </blockquote> <p>Wow, yes this takes time</p> <blockquote> <p>Confirmation bias causes people to select and interpret data in a way that supports their hypotheses. For example, confirmation bias affects your level of trust. When you see a result that supports your hypothesis, you are more likely to accept the result without question.</p> </blockquote> <blockquote> <p>Confirmation bias also affects how you present information. You are more likely to include results that support your hypothesis and downplay or omit results that are negative. 
For example, I frequently see claims in papers of the form: “XXX is up to 3.5x faster than YYY.” Such claims cherry-pick the best result to report and are misleading because they do not indicate what performance can be expected in the common case. Statements like this belong in late-night TV commercials, not scientific papers.</p> </blockquote> <p>Bias, need to present well</p> <blockquote> <p>Performance analysis is not an instantaneous process like taking a picture of a finished artwork. It is a long and drawn-out process of confusion, discovery, and improvement. Performance analysis goes through several phases, each of which can take anywhere from a few days to a few weeks. First, you must add instrumentation code to the system to record the desired metrics. You must then get benchmark applications running, either by writing them or by downloading and installing existing programs. Running benchmarks will probably stress the system enough to expose bugs, and you will need to then track down and fix them. Eventually, the system will run well enough to start producing performance numbers. However, these numbers will almost certainly be wrong. The next step is to find and fix bugs in the measurements. Once you have verified the accuracy of the measurements, you will start to uncover problems with the system itself. As you look over the performance measurements, you will probably uncover additional functional bugs. Once they have been fixed, you can start analyzing the performance in depth. You will almost certainly discover opportunities to improve performance, and it is important to have enough time to make these improvements. You will encounter many things that do not make sense; in order to resolve them, you will need to add new metrics and validate them. To get the best results, you must iterate several times improving the metrics, measuring performance, and improving the system.</p> </blockquote> <p>What an example. 
Iterate iterate iterate</p> <blockquote> <p>I often challenge them by asking: “Suppose I said I don’t believe these measurements. What can you say to convince me that they are correct?”</p> </blockquote> <p>I should ask myself this</p> <blockquote> <p>As you begin collecting measurements, compare them and be alert for inconsistencies. There will almost always be things that do not make sense. When something does not make complete sense, stop and gather more data. For example, in a recent measurement of a new network transport protocol, a benchmark indicated that a server could handle no more than 600,000 packets per second. However, my colleagues and I had seen servers process more than 900,000 packets per second with other protocols and believed the new protocol was at least as efficient as the old ones. We decided to gather additional data. As a result, we discovered a bug in the flow-control mechanism on the client side: clients were not transmitting data fast enough to keep the server fully loaded. Fixing the bug improved performance to the level we expected.</p> </blockquote> <p>Interesting – gather more data, but how do you know what to do next and what data to filter? I guess that’s based on experience</p> <h5 id="keys-to-high-quality-performance-analysis">Keys to High-Quality Performance Analysis</h5> <blockquote> <p>The first step toward high-quality performance measurements is to allow enough time. If you are measuring a non-trivial system, you should plan on at least two to three months.</p> </blockquote> <p>That’s interesting – this makes sense, but this takes a loooong time</p> <blockquote> <p>Performance analysis is not an instantaneous process like taking a picture of a finished artwork. It is a long and drawn-out process of confusion, discovery, and improvement. Performance analysis goes through several phases, each of which can take anywhere from a few days to a few weeks.</p> </blockquote> <blockquote> <p>Take different measurements at the same level. 
For example, if you are measuring file-system throughput, do not measure just the throughput seen by a user application; also measure the throughput observed inside the operating system (such as at the file block cache). These measurements should match;</p> </blockquote> <blockquote> <p>Measure the system’s behavior at a lower level to break down the factors that determine performance, as I discuss later under Rule 4 (Always measure one level deeper);</p> </blockquote> <blockquote> <p>Make back-of-the-envelope calculations to see if the measurements are in the ballpark expected; and</p> </blockquote> <blockquote> <p>Run simulations and compare their results to measurements of the real implementation.</p> </blockquote> <p>Damn, that’s quite a few different steps. Essentially, always double check</p> <blockquote> <p>Above all, do not tolerate anything you do not understand.</p> </blockquote> <p>What a thought.</p> <blockquote> <p>Above all, do not tolerate anything you do not understand. Assume there are bugs and problems with every measurement, and your job is to find and fix them. If you do not find problems, you should feel uneasy, because there are probably bugs you missed.</p> </blockquote> <blockquote> <p>The best way to use intuition is to identify promising areas for further exploration. For example, when looking over performance measurements, ask yourself if they make sense. How does the performance compare to what you expected? Does it seem too good to be true? Does the system scale more poorly than you had hoped? Does a curve jump unexpectedly when you expected it to be smooth? Do some benchmarks exhibit behavior that is dramatically different from others? Consider anything that does not match your intuition a red flag and investigate it, as described in Rule 2 (Never trust a number generated by a computer). 
Intuition can be very helpful in identifying problems.</p> </blockquote> <blockquote> <p>If you continually form intuitions and then test them you will gain knowledge that helps you form better intuition in the future. Every false intuition means there was something you did not fully understand; in the process of testing it and discovering why it is false, you will learn something useful.</p> </blockquote> <p>Intuition is used as a guide for the first step</p> <blockquote> <p>If you are measuring overall latency for remote procedure calls, you could measure deeper by breaking down that latency, determining how much time is spent in the client machine, how much time is spent in the network, and how much time is spent on the server. You could also measure where time is spent on the client and server. If you are measuring the overall throughput of a system, the system probably consists of a pipeline containing several components. Measure the utilization of each component (the fraction of time that component is busy). At least one component should be 100% utilized; if not, it should be possible to achieve a higher throughput.</p> </blockquote> <p>Latency and throughput measurements in a single sentence?</p> <blockquote> <p>In recent measurements of a new network transport, one of my students found that round-trip tail latency was higher than our simulations had predicted. The student measured software latency in detail on both the sending and the receiving machines but found nothing that could account for the high tail latency. At this point we were about to conclude that the delays must be caused by the network switch. What else could it be? This would have been Mistake 2 (Guessing instead of measuring). Before giving up, we decided to dig deeper and measure precise timings for each individual packet. The measurements surprised us, showing that outlier delays were not isolated events. 
Delay tended to build up over a series of packets, affecting all of the packets from a single sender over a relatively long time interval, including packets for different destinations. This was a crucial clue. After several additional measurements, the student discovered that long queues were building up in the sender’s network interface due to a software bug. The transport included code to estimate the queue length and prevent queue buildup, but there was a bug in the estimator caused by underflow of an unsigned integer. The underflow was easy to fix, at which point tail latency dropped dramatically. Not only did this process improve the system’s performance, it taught us an important lesson about the risks of unsigned integers.</p> </blockquote> <p>Good example</p> <blockquote> <p>Another way to measure deeper is to consider more detail. Instead of just looking at average values, graph the entire distribution and noodle over the shape to see if it provides useful information. Then look at some of the raw data samples to see if there are patterns. In one measurement of RPC latency, a student found that the average latency was higher than we expected. The latency was not intolerably high, and it would have been easy to simply accept this level of performance. Fortunately, the student decided to graph the times for individual RPCs. It turned out the data was bimodal, whereby every other RPC completed quickly, but the intervening ones were all significantly slower. With this information, the student tracked down and fixed a configuration error that eliminated all of the slow times. In this case, the average value was not a good indicator of system behavior.</p> </blockquote> <p>So basically, always look at individual samples and keep measuring</p> <blockquote> <p>Do not spend a lot of time agonizing over which deeper measurements to make. 
If the top-level measurements contain contradictions or things that are surprising, start with measurements that could help resolve them. Or pick measurements that will identify performance bottlenecks. If nothing else, choose a few metrics that are most obvious and easiest to collect, even if you are not sure they will be particularly illuminating. Once you look at the results, you will almost certainly find things that do not make sense; from this point on, track down and resolve everything that does not make perfect sense. Along the way you will discover other surprises; track them down as well. Over time, you will develop intuition about what kinds of deeper measurements are most likely to be fruitful.</p> </blockquote> <p>I see, just go for it, use standard tools</p> <h5 id="measurement-infrastructure">Measurement Infrastructure</h5> <blockquote> <p>Making good performance measurements takes time, so it is worth creating infrastructure to help you work more efficiently. The infrastructure will easily pay for itself by the time the measurement project is finished. Furthermore, performance measurements tend to be run repeatedly, making infrastructure even more valuable. In a cloud service provider, for example, measurements must be made continuously in order to maintain contractual service levels. In a research project, the full suite of performance measurements will be run several times (such as before submission, after the paper is accepted, and again during the writing of a Ph.D. dissertation). 
It is important to have infrastructure that makes it easy to rerun tests.</p> </blockquote> <p>Yes I see… this is how you learn how to build such infrastructure</p> <h4 id="summaryimportant-takeaways">Summary/Important takeaways</h4> <ul> <li>Dig deep into understanding performance <ul> <li>The question is how to do so (are you measuring the right thing and how to identify when you fucked up)</li> <li>This is a trained methodology (way of thinking to measure performance), which is not easy to stay disciplined about</li> </ul> </li> <li>Mistakes to watch out for <ul> <li>Trusting numbers immediately if the system is not crashing <ul> <li>performance bugs occur in non crashing conditions, thus are not immediately obvious</li> <li>so the logical question is how do you prove that the numbers are trustworthy?</li> </ul> </li> <li>Guessing (or making seemingly obvious assumptions) without backing up the claims <ul> <li>ex, system is bottlenecked by I/O, well you need to show that it’s true with numbers, and maybe actually it isn’t bottlenecked by I/O, this is very true</li> </ul> </li> <li>Only measuring end-2-end <ul> <li>What would make it better? What’s taking the longest in the system?</li> </ul> </li> <li>If you believe in the idea, you believe that the performance will be good (confirmation bias) and do not double-check that number</li> <li>Don’t rush the numbers that you measure - easy to make mistakes</li> </ul> </li> <li>How to not make mistakes <ul> <li>Time <ul> <li>Need to build instrumentation, benchmarks, patch bugs, repeat</li> </ul> </li> <li>Find different ways to measure the same thing/Don’t trust the number <ul> <li>“I often challenge them by asking: “Suppose I said I don’t believe these measurements. 
What can you say to convince me that they are correct?””</li> <li>For example, if you are measuring file-system throughput, do not measure just the throughput seen by a user application; also measure the throughput observed inside the operating system (such as at the file block cache). These measurements should match</li> </ul> </li> <li>Use your intuition to ask questions, not to answer them <ul> <li>It’s good to have a gut feeling to check something, but always verify that it’s true</li> </ul> </li> <li>Always measure one level deeper to break down numbers <ul> <li>ex, when measuring e2e latency, break it down into client, server, and network time</li> <li>validate top level numbers</li> <li>use your knowledge of known tools</li> </ul> </li> </ul> </li> <li>Measurement Infrastructure <ul> <li>How to build your set of tools to measure performance</li> <li>What is good infrastructure <ul> <li>Automated, so each run produces the performance numbers</li> <li>Easy to digest/understand</li> <li>benchmarks to compare</li> <li>Dashboard <ul> <li>goal: easy to understand!</li> <li>but brings together a lot of data</li> </ul> </li> </ul> </li> </ul> </li> </ul> <div class="image-container"> <img loading="lazy" src="/assets/images/ramblings/2025-07-19/dashboard.png" width="100%" alt="" /> <div class="caption"> <em>Dashboard example </em> </div> </div> <ul> <li>Gives a lot of information, breaking each metric down into e2e, network, and internal software</li> </ul> <div class="image-container"> <img loading="lazy" src="/assets/images/ramblings/2025-07-19/figure2.jpg" width="100%" alt="" /> <div class="caption"> <em>Dashboard example </em> </div> </div> <ul> <li>Example of how to expand and get a better understanding – it depends on the inputs</li> </ul> <div class="image-container"> <img loading="lazy" src="/assets/images/ramblings/2025-07-19/figure3.jpg" width="100%" alt="" /> <div class="caption"> <em>Dashboard example </em> </div> </div> <ul> <li>Example of how to expand and get a better understanding – 
it depends on the inputs (this time, you have to split the x into equal parts)</li> </ul> <h4 id="final-thoughts">Final thoughts</h4> <p>This is a very good read. Performance is something that you iterate on. It’s quite a process that’s simple on the surface: make assumptions, then create benchmarks to verify those claims. But the reality is different:</p> <ul> <li>Make infrastructure to benchmark</li> <li>Performance process <ul> <li>think of what important variables to observe in the system (mostly throughput/latency)</li> <li>back up with benchmark <ul> <li>the initial numbers - end to end numbers (process one request)</li> <li>the subnumbers (network/storage/processing)</li> <li>compare against others to check if the values are in the appropriate range</li> <li>repeat</li> </ul> </li> </ul> </li> </ul> Paul Graham - Why Nerds Are Unpopular 2025-06-29T00:00:00+00:00 2025-06-29T00:00:00+00:00 https://maknee.github.io/blog/2025/Paul-Graham-Why-Nerds-Are-Unpopular <h3 id="paul-graham---why-nerds-are-unpopular">Paul Graham - Why Nerds Are Unpopular</h3> <p>Thoughts about <a href="https://paulgraham.com/nerds.html">Why Nerds Are Unpopular</a> by Paul Graham.</p> <p>Before we dive into this, this was written in 2003. This was when Paul Graham was 38, before he was married or had kids.</p> <p>This is also a rather long essay…</p> <h4 id="thoughts-along-the-way">Thoughts along the way</h4> <blockquote> <p>We sat at a D table, as low as you could get without looking physically different. We were not being especially candid to grade ourselves as D. It would have taken a deliberate lie to say otherwise. Everyone in the school knew exactly how popular everyone else was, including us.</p> </blockquote> <p>Seems relatable.</p> <blockquote> <p>I know a lot of people who were nerds in school, and they all tell the same story: there is a strong correlation between being smart and being a nerd, and an even stronger inverse correlation between being a nerd and being popular. 
Being smart seems to make you unpopular.</p> </blockquote> <p>Interesting – the time investment goes into getting good grades rather than into appearance/people</p> <blockquote> <p>Why? To someone in school now, that may seem an odd question to ask. The mere fact is so overwhelming that it may seem strange to imagine that it could be any other way. But it could. Being smart doesn’t make you an outcast in elementary school. Nor does it harm you in the real world. Nor, as far as I can tell, is the problem so bad in most other countries. But in a typical American secondary school, being smart is likely to make your life difficult. Why?</p> </blockquote> <p>Interesting… observation. Still true as you get older, not just in school</p> <blockquote> <p>In the schools I went to, being smart just didn’t matter much. Kids didn’t admire it or despise it. All other things being equal, they would have preferred to be on the smart side of average rather than the dumb side, but intelligence counted far less than, say, physical appearance, charisma, or athletic ability.</p> </blockquote> <p>Yes. Intelligence does not matter much to kids, as it’s harder to read.</p> <blockquote> <p>So if intelligence in itself is not a factor in popularity, why are smart kids so consistently unpopular? The answer, I think, is that they don’t really want to be popular.</p> </blockquote> <p>Interesting, huh, people need attention in some way.</p> <blockquote> <p>But in fact I didn’t, not enough. There was something else I wanted more: to be smart. Not simply to do well in school, though that counted for something, but to design beautiful rockets, or to write well, or to understand how to program computers. In general, to make great things.</p> </blockquote> <p>I guess so, but generally people want attention in some way, not so much to be popular…</p> <blockquote> <p>At the time I never tried to separate my wants and weigh them against one another. If I had, I would have seen that being smart was more important. 
If someone had offered me the chance to be the most popular kid in school, but only at the price of being of average intelligence (humor me here), I wouldn’t have taken it.</p> </blockquote> <p>I agree, but only slightly. Being popular and knowing how to use it can be as beneficial as (sometimes more beneficial than) being smart</p> <blockquote> <p>And that, I think, is the root of the problem. Nerds serve two masters. They want to be popular, certainly, but they want even more to be smart. And popularity is not something you can do in your spare time, not in the fiercely competitive environment of an American secondary school.</p> </blockquote> <p>Haha, yeah – time investment.</p> <blockquote> <p>Nerds don’t realize this. They don’t realize that it takes work to be popular. In general, people outside some very demanding field don’t realize the extent to which success depends on constant (though often unconscious) effort. For example, most people seem to consider the ability to draw as some kind of innate quality, like being tall. In fact, most people who “can draw” like drawing, and have spent many hours doing it; that’s why they’re good at it. Likewise, popular isn’t just something you are or you aren’t, but something you make yourself.</p> </blockquote> <p>Agreed. Did not realize this until very late. It takes a lot of time and thought and honestly, experimentation (+ failures) to become popular…</p> <blockquote> <p>Even if nerds cared as much as other kids about popularity, being popular would be more work for them. The popular kids learned to be popular, and to want to be popular, the same way the nerds learned to be smart, and to want to be smart: from their parents. While the nerds were being trained to get the right answers, the popular kids were being trained to please.</p> </blockquote> <p>Haha, surprised I reached the same reasoning. 
Paul’s writing is good.</p> <blockquote> <p>So far I’ve been finessing the relationship between smart and nerd, using them as if they were interchangeable. In fact it’s only the context that makes them so. A nerd is someone who isn’t socially adept enough. But “enough” depends on where you are. In a typical American school, standards for coolness are so high (or at least, so specific) that you don’t have to be especially awkward to look awkward by comparison.</p> </blockquote> <p>Oh god. Yes. It’s so very easy to seem awkward to someone, even as you get older. People tend to judge quickly, especially in the US.</p> <blockquote> <p>Partly because teenagers are still half children, and many children are just intrinsically cruel. Some torture nerds for the same reason they pull the legs off spiders. Before you develop a conscience, torture is amusing.</p> </blockquote> <p>Haha… yes, people don’t accept differences (from their own view of the world), especially if they’re children</p> <blockquote> <p>Another reason kids persecute nerds is to make themselves feel better. When you tread water, you lift yourself up by pushing water down. Likewise, in any social hierarchy, people unsure of their own position will try to emphasize it by maltreating those they think rank below. I’ve read that this is why poor whites in the United States are the group most hostile to blacks.</p> </blockquote> <p>Yes… Definitely when I was a teenager. I see this to some extent, even now.</p> <blockquote> <p>Because they’re at the bottom of the scale, nerds are a safe target for the entire school. If I remember correctly, the most popular kids don’t persecute nerds; they don’t need to stoop to such things. Most of the persecution comes from kids lower down, the nervous middle classes.</p> </blockquote> <p>Oh interesting – good observation. 
Happens when you’re older too, or maybe I just interpret some events like that.</p> <blockquote> <p>As well as gaining points by distancing oneself from unpopular kids, one loses points by being close to them. A woman I know says that in high school she liked nerds, but was afraid to be seen talking to them because the other girls would make fun of her. Unpopularity is a communicable disease; kids too nice to pick on nerds will still ostracize them in self-defense.</p> </blockquote> <p>Haha…</p> <blockquote> <p>It’s important to realize that, no, the adults don’t know what the kids are doing to one another. They know, in the abstract, that kids are monstrously cruel to one another, just as we know in the abstract that people get tortured in poorer countries. But, like us, they don’t like to dwell on this depressing fact, and they don’t see evidence of specific abuses unless they go looking for it.</p> </blockquote> <p>I don’t think I understand it to that extent. Maybe I’ve forgotten.</p> <blockquote> <p>Public school teachers are in much the same position as prison wardens. Wardens’ main concern is to keep the prisoners on the premises. They also need to keep them fed, and as far as possible prevent them from killing one another. Beyond that, they want to have as little to do with the prisoners as possible, so they leave them to create whatever social organization they want. From what I’ve read, the society that the prisoners create is warped, savage, and pervasive, and it is no fun to be at the bottom of it.</p> </blockquote> <p>Wow what a conclusion. I do agree with this. Again this is not PRIVATE school teachers – public school teachers have like 30-40 students to take care of per class. There’s easily not that much time devoted to each kid’s problems.</p> <blockquote> <p>When the things you do have real effects, it’s no longer enough just to be pleasing. It starts to be important to get the right answers, and that’s where nerds show to advantage. 
Bill Gates will of course come to mind. Though notoriously lacking in social skills, he gets the right answers, at least as measured in revenue.</p> </blockquote> <p>Huh, yes. School is much more restricted in that sense.</p> <blockquote> <p>If I could go back and give my thirteen year old self some advice, the main thing I’d tell him would be to stick his head up and look around. I didn’t really grasp it at the time, but the whole world we lived in was as fake as a Twinkie. Not just school, but the entire town. Why do people move to suburbia? To have kids! So no wonder it seemed boring and sterile. The whole place was a giant nursery, an artificial town created explicitly for the purpose of breeding children.</p> </blockquote> <p>Good advice – I’m going to take it.</p> <blockquote> <p>What bothers me is not that the kids are kept in prisons, but that (a) they aren’t told about it, and (b) the prisons are run mostly by the inmates. Kids are sent off to spend six years memorizing meaningless facts in a world ruled by a caste of giants who run after an oblong brown ball, as if this were the most natural thing in the world. And if they balk at this surreal cocktail, they’re called misfits.</p> </blockquote> <p>Glad I reached this conclusion when I was in school.</p> <blockquote> <p>Adults can’t avoid seeing that teenage kids are tormented. So why don’t they do something about it? Because they blame it on puberty. The reason kids are so unhappy, adults tell themselves, is that monstrous new chemicals, hormones, are now coursing through their bloodstream and messing up everything. There’s nothing wrong with the system; it’s just inevitable that kids will be miserable at that age.</p> </blockquote> <p>Blaming it on something that can’t be fully explained – typical. Also, sometimes I fall into this habit, but I’ve mostly stopped.</p> <blockquote> <p>When I was in school, suicide was a constant topic among the smarter kids. 
No one I knew did it, but several planned to, and some may have tried. Mostly this was just a pose. Like other teenagers, we loved the dramatic, and suicide seemed very dramatic. But partly it was because our lives were at times genuinely miserable.</p> </blockquote> <p>True true true</p> <blockquote> <p>At best it was practice for real work we might do far in the future, so far that we didn’t even know at the time what we were practicing for. More often it was just an arbitrary series of hoops to jump through, words without content designed mainly for testability. (The three main causes of the Civil War were…. Test: List the three main causes of the Civil War.)</p> </blockquote> <blockquote> <p>And there was no way to opt out. The adults had agreed among themselves that this was to be the route to college. The only way to escape this empty life was to submit to it.</p> </blockquote> <p>Even in adult life, with a “job”, you get these structured instances too…</p> <blockquote> <p>Teenage kids used to have a more active role in society. In pre-industrial times, they were all apprentices of one sort or another, whether in shops or on farms or even on warships. They weren’t left to create their own societies. They were junior members of adult societies.</p> </blockquote> <p>That’s a good observation – most of the useful stuff I learned was outside of school – working with my father, exploring/navigating the city</p> <blockquote> <p>What happened? We’re up against a hard one here. The cause of this problem is the same as the cause of so many present ills: specialization. As jobs become more specialized, we have to train longer for them. Kids in pre-industrial times started working at about 14 at the latest; kids on farms, where most people lived, began far earlier. Now kids who go to college don’t start working full-time till 21 or 22. With some degrees, like MDs and PhDs, you may not finish your training till 30.</p> </blockquote> <p>Interesting thought. 
Yes, and it REQUIRES schooling again…</p> <blockquote> <p>The real problem is the emptiness of school life. We won’t see solutions till adults realize that. The adults who may realize it first are the ones who were themselves nerds in school. Do you want your kids to be as unhappy in eighth grade as you were? I wouldn’t. Well, then, is there anything we can do to fix things? Almost certainly. There is nothing inevitable about the current system. It has come about mostly by default.</p> </blockquote> <p>Yes. Man, I was dumb for not realizing this sooner…</p> <h4 id="final-thoughts">Final thoughts</h4> <p>This is one of Paul’s older essays. He rambles quite a bit. Each paragraph after about the fifth repeats what he says, just with a different story or tone. I like the point of the essay. Nerds are UNPOPULAR, and that unpopularity actually does drag on these days (even past college) due to the internet and having these traits embedded in culture beyond school.</p> <p>One thing I do disagree with Paul on is that popularity does matter, just in a different sense. Being popular is something most people are not well adjusted to – say, being attractive for the first time, or being well known on the internet and having to respond well in a social setting. I believe that these early years in life build that and allow one to experience that type of feeling – to build “confidence” in some way. This matters after the teenage years, and it is a useful skill to have. However, becoming popular is hard, and most kids are just thrown onto the battleground to figure it out. No one really teaches them.</p> <p>I do agree with most of Paul’s points on school. It’s a rigid structure that is basically a battleground for kids to bully one another and sort themselves into groups. Then you can, for the most part, pretend your way through classes if you put in some effort and learn how to do so (I guess this is what counts as “smart”?).
I wish kids took more apprentice-esque classes, so that someone could show them some view of the adult world. I didn’t understand parts of it until after college and am still learning.</p> <p>But why does Paul seem so harsh – angry, almost? Does he regret going to such schools? Is he bitter? I can relate if so. I can’t really describe good things about school. I just hung out with the nerds, and that was fun, I think?</p> A Reality Check on DeepSeek's Distributed File System Benchmarks 2025-06-18T09:00:00+00:00 2025-06-18T09:00:00+00:00 https://maknee.github.io/blog/2025/3FS-Performance-Journal-2 <h1 id="series">Series</h1> <ul> <li><a href="/blog/2025/3FS-Performance-Journal-1/">An Intro to DeepSeek’s Distributed File System</a></li> <li><a href="/blog/2025/3FS-Performance-Journal-2/">A Reality Check on DeepSeek’s Distributed File System Benchmarks</a></li> <li><a href="/blog/2025/3FS-Performance-Journal-3/">Network Storage and Scaling Characteristics of a Distributed Filesystem</a></li> </ul> <!-- - [Theoretical Performance Limits of 3FS](/blog/2018/RTX-DXR-Path-Tracer-Host/) - [Benchmarking 3FS](/blog/2018/RTX-DXR-Path-Tracer-HLSL/) - [Analysis of 3FS Benchmarks](/blog/2018/RTX-DXR-Path-Tracer-HLSL/) - [Improving 3FS Performance](/blog/2018/RTX-DXR-Path-Tracer-HLSL/) --> <h1 id="how-should-we-analyze-3fs">How should we analyze 3FS?</h1> <p>In <a href="/blog/2025/3FS-Performance-Journal-1/">my previous post</a>, I introduced DeepSeek’s <a href="https://github.com/deepseek-ai/3FS/tree/ee9a5cee0a85c64f4797bf380257350ca1becd36">3FS distributed file system</a> – exploring its architecture, components, and the CRAQ protocol that provides its consistency guarantees.
Now, I want to take a closer look at the published benchmark results and performance claims.</p> <p>When evaluating distributed systems, there’s a tendency to jump straight into complex profiling tools and detailed metrics.<span class="sidenote-ref"></span><span class="sidenote">Trying out perf, strace for syscalls, iostat for disk – it’s essentially throwing random darts until you hit something</span> However, I find tremendous value in performing an initial “performance reality check” on numbers and graphs. The check uses reference numbers from other sources, such as hardware manufacturer specifications or existing benchmarks, to quickly establish a baseline for a particular system<span class="sidenote-ref"></span><span class="sidenote">For example, when I drive a car on the highway, I try to match the speed to the other cars around me. Without that reference, it might turn out that I’m over the speed limit if I’m not constantly checking the speed gauge</span>. This approach helps identify potential bottlenecks or inconsistencies before deploying software tools for deeper investigation.</p> <p>A “performance reality check” can reveal the following:</p> <ol> <li>It validates whether benchmark results match what we’d expect based on the hardware configuration</li> <li>It helps identify which components (network, storage, CPU, etc.) represent the main bottleneck</li> <li>It reveals the percentage of theoretical capacity actually being utilized</li> <li>It verifies whether the authors’ claims are valid and how they may have arrived at those conclusions</li> </ol> <p>To illustrate this method, imagine a startup making claims about their new database – “built for AI training” and “hyper performance”.
They showcase a benchmark from a single node:</p> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-03-13/part2/example1.svg" width="75%" alt="" /> <div class="caption"> <em>A company produces a graph showing the throughput of one of their machines running the workload </em> </div> </div> <p>The system managed to read 250 GB over the course of the test, which seems impressive! However, this is like saying I drove 100 miles without mentioning whether it took an hour or 10. The rate (GB per second) reveals the actual work accomplished relative to time invested. Let’s approximate it by drawing a slope through the data.</p> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-03-13/part2/example2.svg" width="75%" alt="" /> <div class="caption"> <em>Drawing a line through the graph to get the rate </em> </div> </div> <p>2 GB/s. Great number, but one might wonder – what should we compare this number to?</p> <p>A good place to start is to ask: is this utilizing the full potential of the hardware? Looking up <a href="https://www.micron.com/content/dam/micron/global/public/documents/products/technical-marketing-brief/7450-nvme-ssd-tech-prod-spec.pdf">modern SSD</a> specifications for random reads and plotting that on the graph can reveal the following:</p> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-03-13/part2/example3.svg" width="75%" alt="" /> <div class="caption"> <em>Taking a different look at the graph with theoretical limits </em> </div> </div> <p>Theoretically, the system should reach 500 GB by the end of the test period!</p> <p>The benchmark is only utilizing about half of the available device bandwidth. This might raise some eyebrows about their performance claims – where are the bottlenecks?</p> <p>This analytical approach is exactly what I’ll apply to DeepSeek’s 3FS benchmarks throughout this post.
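To make the reality check concrete, here’s a minimal sketch of the same arithmetic in Python. The 125-second test duration is my assumption, inferred from 250 GB at the ~2 GB/s slope; the 4 GB/s figure is the random-read spec of the Micron 7450-class reference SSD.

```python
# Back-of-the-envelope reality check for the single-node example.
# Assumed numbers: 250 GB read over a 125 s window (slope ≈ 2 GB/s),
# against a reference SSD random-read spec of ~4 GB/s.

def utilization(bytes_read_gb, duration_s, device_gb_per_s):
    """Achieved rate in GB/s, and the fraction of device capability used."""
    achieved = bytes_read_gb / duration_s
    return achieved, achieved / device_gb_per_s

achieved, frac = utilization(250, 125, 4.0)
print(f"achieved = {achieved:.1f} GB/s, {frac:.0%} of device spec")
# achieved = 2.0 GB/s, 50% of device spec
```

The same two divisions work for every benchmark that follows: data moved over time, then the result over the relevant hardware limit.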
By calculating what the hardware should deliver and comparing it to what 3FS actually achieves, we can identify where the possible bottlenecks lie and assess whether performance claims hold up.<span class="sidenote-ref"></span><span class="sidenote">While not exact, these comparisons give us immediate insights that would take days to obtain through benchmarking</span></p> <h2 id="into-analyzing-3fs">Into analyzing 3FS</h2> <p>I’ll be examining three different workloads that showcase 3FS in action:</p> <ul> <li>AI training jobs featuring a massive amount of reads</li> <li>GraySort, a synthetic sorting benchmark with a mix of reads and writes</li> <li>KV cache in operation, representing LLM inference workloads with random reads</li> </ul> <p>Each benchmark consists of two main components – client and storage. The client sends a request to read/write to the storage node over a network. Then, the storage node reads/writes the data to its device(s) and responds to the client by sending a message back. Thus, the two main hardware components one should analyze closely are network and storage.</p> <p>For each benchmark, I’ll break down the hardware configuration, calculate theoretical maximums, and analyze how close the system comes to achieving its potential performance. 
Through this analysis, we’ll develop intuition about 3FS’s real-world capabilities before even deploying it.</p> <p>Let’s start by examining what could be 3FS’s most important benchmark: training throughput for AI workloads.</p> <h2 id="first-workload-training-job">First workload: Training job</h2> <div class="image-container" style="max-width: 100%; overflow: visible;"> <img loading="lazy" src="/assets/images/posts/2025-03-13/part2/peak_throughput.jpg" style="width: 125%; margin-left: calc((100% - 125%) / 2);" alt="" /> <div class="caption"> <em>Peak throughput for training jobs (Image source: <a href="https://github.com/deepseek-ai/3FS" rel="external nofollow noopener" target="_blank">3FS github</a>) </em> </div> </div> <p>A training workload typically involves GPU nodes acting as clients that read data (text, images, etc.) from storage nodes to train deep neural networks like language or diffusion models. The throughput hovers around 6.6 TB/s<span class="sidenote-ref"></span><span class="sidenote">It’s not made explicit if this read throughput is the average or median. I would assume the average throughput.</span> on average, with peak throughput reaching 8 TB/s as reported in the <a href="https://arxiv.org/abs/2408.14158">Fire-Flyer AI-HPC paper</a>.</p> <p>Here’s the hardware configuration the benchmark uses:</p> <link rel="stylesheet" href="/assets/css/fancy_table.css" /> <script> function toggleCalculations(tableId) { // ... 
(toggleCalculations function remains the same) const table = document.getElementById(tableId); const cells = table.querySelectorAll('.has-calculation'); const tableWrapper = document.getElementById(tableId + '-wrapper'); const toggleSwitch = document.getElementById(tableId + '-switch'); const toggleText = document.getElementById(tableId + '-toggle-text'); if (tableWrapper.classList.contains('show-calculations')) { tableWrapper.classList.remove('show-calculations'); toggleSwitch.classList.remove('active'); toggleText.textContent = "Show calculations"; cells.forEach(cell => { const normalText = cell.querySelector('.normal-text'); const calculationText = cell.querySelector('.calculation-text'); normalText.style.display = 'inline'; calculationText.style.display = 'none'; }); } else { tableWrapper.classList.add('show-calculations'); toggleSwitch.classList.add('active'); toggleText.textContent = "Hide calculations"; cells.forEach(cell => { const normalText = cell.querySelector('.normal-text'); const calculationText = cell.querySelector('.calculation-text'); normalText.style.display = 'none'; calculationText.style.display = 'inline'; }); } } function isConsideredYellow(colorString) { if (!colorString) return false; const lowerColor = colorString.toLowerCase(); const yellowKeywords = ['yellow', 'gold', 'lemon', 'chiffon', 'goldenrod', 'papayawhip', 'moccasin', 'khaki', '#ff0', '#ffff00', '#ffffe0', '#fffacd', '#fafad2', '#fff8dc', '#eee8aa', '#f0e68c']; if (yellowKeywords.some(k => lowerColor.includes(k))) { return true; } const match = lowerColor.match(/rgba?\((\d+),\s*(\d+),\s*(\d+)/); if (match) { const r = parseInt(match[1]); const g = parseInt(match[2]); const b = parseInt(match[3]); if (r > 200 && g > 180 && b < 200 && Math.abs(r - g) < 70) { return true; } } if (lowerColor === '#ffdab9') return true; // PeachPuff if (lowerColor === 'rgba(255,255,0,0.2)') return true; // Example: semi-transparent yellow return false; } function initializeCellHighlighters() { const 
highlighters = document.querySelectorAll('[data-highlight-cells]'); highlighters.forEach(highlighter => { if (highlighter.dataset.highlighterInitialized === 'true') return; highlighter.dataset.highlighterInitialized = 'true'; const cellIdsToHighlight = highlighter.dataset.highlightCells.split(',').map(id => id.trim()); const cellElements = cellIdsToHighlight.map(id => document.getElementById(id)).filter(el => el); let descriptiveSpans = []; if (highlighter.matches('span[data-hover-text-color]')) { descriptiveSpans.push(highlighter); } else { descriptiveSpans = Array.from(highlighter.querySelectorAll('span[data-hover-text-color]')); } descriptiveSpans.forEach(span => { if (typeof span.dataset.originalParaTextColor === 'undefined') { span.dataset.originalParaTextColor = window.getComputedStyle(span).color || ''; } }); highlighter.addEventListener('mouseenter', () => { highlighter.classList.add('trigger-text-active'); descriptiveSpans.forEach(span => { if (span.dataset.hoverTextColor) { span.style.color = span.dataset.hoverTextColor; } }); cellElements.forEach((cell, idx) => { cell.classList.add('cell-highlighted'); let hoverTextColorForCell = null; let hoverCellBgFromDescSpan = null; if (idx < descriptiveSpans.length) { const descSpan = descriptiveSpans[idx]; if (descSpan.dataset.hoverTextColor) { hoverTextColorForCell = descSpan.dataset.hoverTextColor; } if (descSpan.dataset.hoverCellBg) { hoverCellBgFromDescSpan = descSpan.dataset.hoverCellBg; } } else if (descriptiveSpans.length === 1 && cellElements.length > 1) { const descSpan = descriptiveSpans[0]; if (descSpan.dataset.hoverTextColor) { hoverTextColorForCell = descSpan.dataset.hoverTextColor; } if (descSpan.dataset.hoverCellBg) { hoverCellBgFromDescSpan = descSpan.dataset.hoverCellBg; } } if (hoverTextColorForCell) { const isCalcCell = cell.classList.contains('has-calculation'); if (isCalcCell) { const normalTextSpan = cell.querySelector('.normal-text'); const calcResultSpan = 
cell.querySelector('.calc-result'); const calcWrapper = cell.querySelector('.calculation-text'); if (calcWrapper && window.getComputedStyle(calcWrapper).display !== 'none') { if (calcResultSpan) calcResultSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } else { if (normalTextSpan) normalTextSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } } else { const directSpan = cell.querySelector(':scope > span'); if (directSpan) directSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } } if (hoverCellBgFromDescSpan && isConsideredYellow(hoverCellBgFromDescSpan)) { if (typeof cell.dataset.originalBgColor === 'undefined') { cell.dataset.originalBgColor = cell.style.backgroundColor; } cell.style.backgroundColor = hoverCellBgFromDescSpan; cell.dataset.bgChangedByScript = 'true'; } }); }); highlighter.addEventListener('mouseleave', () => { highlighter.classList.remove('trigger-text-active'); descriptiveSpans.forEach(span => { if (typeof span.dataset.originalParaTextColor !== 'undefined') { span.style.color = span.dataset.originalParaTextColor; } }); cellElements.forEach((cell, idx) => { cell.classList.remove('cell-highlighted'); const isCalcCell = cell.classList.contains('has-calculation'); if (isCalcCell) { const normalTextSpan = cell.querySelector('.normal-text'); const calcResultSpan = cell.querySelector('.calc-result'); if (normalTextSpan) normalTextSpan.style.removeProperty('color'); if (calcResultSpan) calcResultSpan.style.removeProperty('color'); } else { const directSpan = cell.querySelector(':scope > span'); if (directSpan) directSpan.style.removeProperty('color'); } if (cell.dataset.bgChangedByScript === 'true') { cell.style.backgroundColor = cell.dataset.originalBgColor || ''; delete cell.dataset.originalBgColor; delete cell.dataset.bgChangedByScript; } }); }); }); } if (document.readyState === 'loading') { document.addEventListener('DOMContentLoaded', initializeCellHighlighters); } else { 
initializeCellHighlighters(); } </script> <style> /* ... (Other styles remain the same) ... */ .toggle-container { display: flex; justify-content: flex-end; margin-bottom: 6px; } .calc-toggle { display: inline-flex; align-items: center; } .toggle-switch { position: relative; display: inline-block; width: 32px; height: 16px; background-color: #e5e7eb; /* gray-200 */ border-radius: 10px; transition: all 0.3s; cursor: pointer; } .toggle-switch::after { content: ''; position: absolute; width: 12px; height: 12px; border-radius: 50%; background-color: white; top: 2px; left: 2px; transition: all 0.3s cubic-bezier(0.175, 0.885, 0.32, 1.275); box-shadow: 0 1px 2px rgba(0, 0, 0, 0.1); } .toggle-switch.active { background-color: #8b5cf6; /* purple-500 */ } .toggle-switch.active::after { left: 18px; } .toggle-label { position: absolute; width: 1px; height: 1px; padding: 0; margin: -1px; overflow: hidden; clip: rect(0, 0, 0, 0); white-space: nowrap; border-width: 0; } .toggle-text { font-size: 0.75rem; color: #6b7280; /* gray-500 */ margin-right: 6px; } .calc-result { color: rgb(75, 85, 99); /* gray-700 */ } .calc-formula { color: #6b83a6; /* Subtle bluish gray */ font-family: ui-monospace, SFMono-Regular, Menlo, Monaco, Consolas, monospace; } .calc-equals { color: rgb(75, 85, 99); /* gray-700 */ display: inline-block; margin: 0 4px; font-weight: normal; } .cell-highlighted { font-weight: bold !important; } .trigger-text-active { background-color: #E5E7EB; padding: 1px 3px; border-radius: 3px; transition: background-color 0.15s ease-in-out; } .trigger-text-active span[data-hover-text-color] { padding: 0; background-color: transparent; } /* Add styles for no-scroll option */ .table-wrapper-no-scroll { overflow-x: visible !important; } .table-no-scroll td { white-space: normal !important; word-break: break-word; } </style> <div id="fancy-table-Node Type,Component,Specification-wrapper" class="px-4 rounded-lg __basic-table not-prose mt-4 mb-4 overflow-x-auto"> <table 
id="fancy-table-Node Type,Component,Specification" class="min-w-full divide-y divide-gray-200 font-sans basic-table-striped"> <thead class="bg-gray-50"> <tr> <th class="px-6 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Node Type </th> <th class="px-6 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Component </th> <th class="px-6 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Specification </th> </tr> </thead> <tbody> <tr class="border-b border-gray-200"> <td id="fancy-table-Node Type,Component,Specification-row0-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Client</span> </td> <td id="fancy-table-Node Type,Component,Specification-row0-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Number of nodes</span> </td> <td id="fancy-table-Node Type,Component,Specification-row0-col2" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">500</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="fancy-table-Node Type,Component,Specification-row1-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;"></span> </td> <td id="fancy-table-Node Type,Component,Specification-row1-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Network</span> </td> <td id="fancy-table-Node Type,Component,Specification-row1-col2" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">1 × 200Gbps NIC</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="fancy-table-Node Type,Component,Specification-row2-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span 
style="color: rgb(75, 85, 99) !important;">Storage</span> </td> <td id="fancy-table-Node Type,Component,Specification-row2-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Number of nodes</span> </td> <td id="fancy-table-Node Type,Component,Specification-row2-col2" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">180</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="fancy-table-Node Type,Component,Specification-row3-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;"></span> </td> <td id="fancy-table-Node Type,Component,Specification-row3-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Disk</span> </td> <td id="fancy-table-Node Type,Component,Specification-row3-col2" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">16 × 14TB PCIe 4.0 NVMe SSDs</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="fancy-table-Node Type,Component,Specification-row4-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;"></span> </td> <td id="fancy-table-Node Type,Component,Specification-row4-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Network</span> </td> <td id="fancy-table-Node Type,Component,Specification-row4-col2" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">2 × 200Gbps NICs</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="fancy-table-Node Type,Component,Specification-row5-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> 
<span style="color: rgb(75, 85, 99) !important;"></span> </td> <td id="fancy-table-Node Type,Component,Specification-row5-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Memory</span> </td> <td id="fancy-table-Node Type,Component,Specification-row5-col2" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">512 GB DDR4-3200MHz</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="fancy-table-Node Type,Component,Specification-row6-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;"></span> </td> <td id="fancy-table-Node Type,Component,Specification-row6-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">CPU</span> </td> <td id="fancy-table-Node Type,Component,Specification-row6-col2" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">2 × AMD 32 Cores EPYC Rome/Milan</span> </td> </tr> </tbody> </table> </div> <p>Let’s apply the “performance reality check” on these numbers – Below are some back-of-the-envelope calculations<span class="sidenote-ref"></span><span class="sidenote"><a href="https://en.wikipedia.org/wiki/Back-of-the-envelope_calculation">Back-of-the-envelope calculations</a> involve performing rough additions and multiplications to get an approximate number within the range of the actual value</span> to get an idea of the theoretical limits<span class="sidenote-ref"></span><span class="sidenote">The authors don’t list the SSD used in the benchmark, so I’ll be using a <a href="https://www.micron.com/content/dam/micron/global/public/documents/products/technical-marketing-brief/7450-nvme-ssd-tech-prod-spec.pdf">Micron 7450 15.36TB U.3 enterprise SSD</a> as reference</span> of the benchmark. 
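Before walking through the full table, here’s a quick sketch of the ceilings that matter most for a read-heavy training workload. The ~7 GB/s sequential read per SSD is my reference pick from a Micron 7450-class datasheet; the node and NIC counts come from the hardware configuration above, with each 200 Gbps NIC giving 25 GB/s.

```python
# Back-of-the-envelope ceilings for the read-heavy training workload.
# 7 GB/s per SSD is an assumed reference figure (Micron 7450-class);
# node and NIC counts come from the hardware configuration table.

NIC_GBS = 200 / 8                          # 200 Gbps ÷ 8 = 25 GB/s

disk_read    = 180 * 16 * 7 / 1000         # nodes × SSDs × GB/s → TB/s
storage_nics = 180 * 2 * NIC_GBS / 1000    # storage-side network ceiling
client_nics  = 500 * 1 * NIC_GBS / 1000    # client-side network ceiling

print(f"disk sequential read: {disk_read:.2f} TB/s")    # 20.16 TB/s
print(f"storage network     : {storage_nics:.2f} TB/s")  # 9.00 TB/s
print(f"client network      : {client_nics:.2f} TB/s")   # 12.50 TB/s
print(f"tightest ceiling    : {min(disk_read, storage_nics, client_nics):.2f} TB/s")
```

Against these ceilings, the reported 6.6 TB/s average (8 TB/s peak) sits closest to the storage-side network limit of 9 TB/s – exactly the kind of observation this reality check is meant to surface.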
Click the “Show calculations” toggle button in the top right to see the detailed breakdown. The base numbers (7GB/s, 4GB/s, 6GB/s, 2GB/s) come from reference SSD specifications I selected to represent this workload. Also, the NIC’s throughput is measured in Gbps instead of GB/s. Performing the conversion: 200Gbps ÷ 8 = 25GB/s.</p> <div id="performance-table-wrapper" class="px-4 rounded-lg __basic-table not-prose mt-2 mb-2 overflow-x-auto"> <div class="toggle-container"> <div class="calc-toggle"> <span id="performance-table-toggle-text" class="toggle-text">Show calculations</span> <span id="performance-table-toggle" onclick="toggleCalculations('performance-table')"> <span class="toggle-switch" id="performance-table-switch"></span> <span class="toggle-label">Toggle calculations</span> </span> </div> </div> <table id="performance-table" class="min-w-full divide-y divide-gray-200 font-sans basic-table-striped"> <thead class="bg-gray-50"> <tr> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Node Type </th> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Metric </th> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Per Unit </th> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Per Node </th> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Entire Cluster </th> </tr> </thead> <tbody> <tr class="border-b border-gray-200"> <td id="performance-table-row0-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Storage (180)</span> </td> <td id="performance-table-row0-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Disk - Sequential Read</span> </td> <td id="performance-table-row0-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">7 GB/s</span> </td> <td
id="performance-table-row0-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600 has-calculation"> <span class="normal-text">112 GB/s</span> <span class="calculation-text" style="display: none;"> <span class="calc-result">112 GB/s</span> <span class="calc-equals">=</span> <span class="calc-formula">7 GB/s × 16</span> </span> </td> <td id="performance-table-row0-col4" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600 has-calculation"> <span class="normal-text">20.16 TB/s</span> <span class="calculation-text" style="display: none;"> <span class="calc-result">20.16 TB/s</span> <span class="calc-equals">=</span> <span class="calc-formula">112 GB/s × 180</span> </span> </td> </tr> <tr class="border-b border-gray-200"> <td id="performance-table-row1-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;"></span> </td> <td id="performance-table-row1-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Disk - Random Read</span> </td> <td id="performance-table-row1-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">4 GB/s</span> </td> <td id="performance-table-row1-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600 has-calculation"> <span class="normal-text">64 GB/s</span> <span class="calculation-text" style="display: none;"> <span class="calc-result">64 GB/s</span> <span class="calc-equals">=</span> <span class="calc-formula">4 GB/s × 16</span> </span> </td> <td id="performance-table-row1-col4" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600 has-calculation"> <span class="normal-text">5.04 TB/s</span> <span class="calculation-text" style="display: none;"> <span class="calc-result">5.04 TB/s</span> <span class="calc-equals">=</span> <span class="calc-formula">64 GB/s × 
180</span> </span> </td> </tr> <tr class="border-b border-gray-200"> <td id="performance-table-row2-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;"></span> </td> <td id="performance-table-row2-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Disk - Sequential Write</span> </td> <td id="performance-table-row2-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">6 GB/s</span> </td> <td id="performance-table-row2-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600 has-calculation"> <span class="normal-text">96 GB/s</span> <span class="calculation-text" style="display: none;"> <span class="calc-result">96 GB/s</span> <span class="calc-equals">=</span> <span class="calc-formula">6 GB/s × 16</span> </span> </td> <td id="performance-table-row2-col4" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600 has-calculation"> <span class="normal-text">7.56 TB/s</span> <span class="calculation-text" style="display: none;"> <span class="calc-result">7.56 TB/s</span> <span class="calc-equals">=</span> <span class="calc-formula">96 GB/s × 180</span> </span> </td> </tr> <tr class="border-b border-gray-200"> <td id="performance-table-row3-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;"></span> </td> <td id="performance-table-row3-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Disk - Random Write</span> </td> <td id="performance-table-row3-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">2 GB/s</span> </td> <td id="performance-table-row3-col3" class="px-4 py-2 whitespace-nowrap text-sm 
font-medium text-gray-600 has-calculation"> <span class="normal-text">32 GB/s</span> <span class="calculation-text" style="display: none;"> <span class="calc-result">32 GB/s</span> <span class="calc-equals">=</span> <span class="calc-formula">2 GB/s × 16</span> </span> </td> <td id="performance-table-row3-col4" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600 has-calculation"> <span class="normal-text">2.52 TB/s</span> <span class="calculation-text" style="display: none;"> <span class="calc-result">2.52 TB/s</span> <span class="calc-equals">=</span> <span class="calc-formula">2 GB/s × 180</span> </span> </td> </tr> <tr class="border-b border-gray-200"> <td id="performance-table-row4-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;"></span> </td> <td id="performance-table-row4-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Network</span> </td> <td id="performance-table-row4-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">25 GB/s</span> </td> <td id="performance-table-row4-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600 has-calculation"> <span class="normal-text">50 GB/s</span> <span class="calculation-text" style="display: none;"> <span class="calc-result">50 GB/s</span> <span class="calc-equals">=</span> <span class="calc-formula">25 GB/s × 2</span> </span> </td> <td id="performance-table-row4-col4" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600 has-calculation"> <span class="normal-text">9 TB/s</span> <span class="calculation-text" style="display: none;"> <span class="calc-result">9 TB/s</span> <span class="calc-equals">=</span> <span class="calc-formula">50 GB/s × 180</span> </span> </td> </tr> <tr class="border-b border-gray-200"> <td 
id="performance-table-row5-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Client (500)</span> </td> <td id="performance-table-row5-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Network</span> </td> <td id="performance-table-row5-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">25 GB/s</span> </td> <td id="performance-table-row5-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">25 GB/s</span> </td> <td id="performance-table-row5-col4" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600 has-calculation"> <span class="normal-text">12.5 TB/s</span> <span class="calculation-text" style="display: none;"> <span class="calc-result">12.5 TB/s</span> <span class="calc-equals">=</span> <span class="calc-formula">25 GB/s × 500</span> </span> </td> </tr> <tr class="border-b border-gray-200"> <td id="performance-table-row6-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">ML Training</span> </td> <td id="performance-table-row6-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Avg Read Throughput</span> </td> <td id="performance-table-row6-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">N/A</span> </td> <td id="performance-table-row6-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">N/A</span> </td> <td id="performance-table-row6-col4" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">6.6 
TB/s</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="performance-table-row7-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">ML Training</span> </td> <td id="performance-table-row7-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Peak Read Throughput</span> </td> <td id="performance-table-row7-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">N/A</span> </td> <td id="performance-table-row7-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">N/A</span> </td> <td id="performance-table-row7-col4" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">8 TB/s</span> </td> </tr> </tbody> </table> </div> <p>From these numbers, one can observe that:</p> <ul> <li><span data-highlight-cells="performance-table-row5-col4, performance-table-row4-col4">The clients’ network will not be a bottleneck (<span data-hover-text-color="rgba(52, 152, 219, 0.9)" data-hover-cell-bg="rgba(255,255,0,0.2)">12.5 TB/s client network throughput</span> &gt; <span data-hover-text-color="rgba(80, 150, 100, 0.9)" data-hover-cell-bg="rgba(255,255,0,0.2)">9 TB/s storage network throughput</span>)</span><span class="sidenote-ref"></span><span class="sidenote">Hover over the text to see the numbers highlighted in the table!</span></li> <li><span data-highlight-cells="performance-table-row6-col4,performance-table-row1-col4">The training job’s reads must be a mix of sequential and random access, because the 6.6 TB/s average throughput exceeds what pure random reads could sustain (<span data-hover-text-color="rgba(52, 152, 219, 0.9)" data-hover-cell-bg="#FFFFE0">6.6 TB/s</span> &gt; <span data-hover-text-color="rgba(80, 150, 100, 0.9)" data-hover-cell-bg="#FFFFE0">5 TB/s</span>)</span></li> <li><span data-highlight-cells="performance-table-row0-col4, performance-table-row4-col4">The storage nodes will be bottlenecked by network or disk depending on the type of workload. A network bottleneck occurs when the workload primarily consists of sequential reads<span class="sidenote-ref"></span><span class="sidenote">An example of this type of workload is reading a large file (movie, song, etc.) in order to transfer the data to another device</span> and the network cannot match the sequential throughput (<span data-hover-text-color="rgba(52, 152, 219, 0.9)" data-hover-cell-bg="#FFFFE0">20 TB/s</span> &gt; <span data-hover-text-color="rgba(80, 150, 100, 0.9)" data-hover-cell-bg="#FFFFE0">9 TB/s</span>)</span></li> <li><span data-highlight-cells="performance-table-row1-col4,performance-table-row2-col4,performance-table-row3-col4,performance-table-row4-col4">When the workload primarily consists of random reads, sequential writes, or random writes, storage becomes the bottleneck rather than the network (<span data-hover-text-color="rgba(80, 150, 100, 0.9)" data-hover-cell-bg="#FFFFE0">5 TB/s</span>, <span data-hover-text-color="rgba(80, 150, 100, 0.9)" data-hover-cell-bg="#FFFFE0">7.5 TB/s</span>, <span data-hover-text-color="rgba(80, 150, 100, 0.9)" data-hover-cell-bg="#FFFFE0">2.5 TB/s</span> &lt; <span data-hover-text-color="rgba(52, 152, 219, 0.9)" data-hover-cell-bg="#FFFFE0">9 TB/s</span></span>)</li> <li>This workload is most likely bottlenecked by network bandwidth.
The sequential read throughput can reach up to <span data-highlight-cells="performance-table-row0-col4" data-hover-text-color="darkred" data-hover-cell-bg="#FFFFE0">20 TB/s</span> and the network throughput is <span data-highlight-cells="performance-table-row4-col4" data-hover-text-color="maroon" data-hover-cell-bg="#FFFFE0">9 TB/s</span>, but the peak throughput of <span data-highlight-cells="performance-table-row7-col4" data-hover-text-color="maroon" data-hover-cell-bg="#FFFFE0">8 TB/s</span> and average throughput of <span data-highlight-cells="performance-table-row6-col4" data-hover-text-color="maroon" data-hover-cell-bg="#FFFFE0">6.6 TB/s</span> are below the network limit and well below the maximum sequential throughput.</li> </ul> <p>Raw numbers can be hard to compare at a glance. If we replot them relative to an SSD’s maximum sequential read throughput and lay them out on a bar plot, we get a better sense of how they stack up:</p> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-03-13/part2/paper_throughput_relative_to_sequential_reads.svg" width="100%" alt="" /> </div> <p>The visualization reveals some interesting insights about system utilization that we have already pointed out:</p> <ul> <li>The average 6.6 TB/s represents: <ul> <li>33% of theoretical sequential disk throughput (6.6 / 20 TB/s)</li> <li>73% of available network bandwidth (6.6 / 9 TB/s)</li> </ul> </li> <li>The peak 8 TB/s achieves: <ul> <li>40% of theoretical sequential disk throughput (8 / 20 TB/s)</li> <li>88% of available network bandwidth (8 / 9 TB/s)</li> </ul> </li> </ul> <p>The data clearly shows that network bandwidth becomes the primary bottleneck. This suggests two potential optimization paths: either remove half of the SSDs from each storage node or upgrade to 800 Gbps NICs to unlock full sequential read potential. However, implementing these changes presents practical challenges.
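</p>

<p>As a quick sanity check, here is a small sketch that recomputes these utilization figures, using only the cluster-wide limits from the table above and the observed throughputs from the 3FS graphs:</p>

```python
# Recompute the utilization figures quoted above.
# Cluster-wide limits (TB/s) come from the performance table;
# observed throughputs come from the 3FS training-job graphs.
seq_read_limit = 20.16  # sequential-read ceiling across 180 storage nodes
network_limit = 9.0     # aggregate storage-node network bandwidth

for label, observed in [("average", 6.6), ("peak", 8.0)]:
    print(
        f"{label}: {observed / seq_read_limit:.1%} of disk sequential-read limit, "
        f"{observed / network_limit:.1%} of network limit"
    )
```

<p>This prints 32.7% / 73.3% for the average and 39.7% / 88.9% for the peak, consistent with the rounded figures above.</p>

<p>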
Hardware platforms often have limitations that prevent NIC upgrades, removing storage may leave PCIe slots unused, and cost alone may make changing the existing setup impractical.</p> <p>Also, why does peak throughput cap at 8 TB/s rather than closer to the theoretical network limit of 9 TB/s? Is this a fundamental software limitation, or does it reflect overhead associated with network operations<span class="sidenote-ref"></span><span class="sidenote">Could be TCP or RDMA overhead</span> at this scale?<span class="sidenote-ref"></span><span class="sidenote">I’ll have better answers to such questions when I run benchmarks on 3FS</span></p> <h3 id="revisiting-the-training-job-with-some-background">Revisiting the training job with some background</h3> <p>Now, let’s revisit the throughput-over-time graph with these background numbers in mind.</p> <div class="image-container" style="max-width: 100%; overflow: visible;"> <img loading="lazy" src="/assets/images/posts/2025-03-13/part2/peak_throughput.jpg" style="width: 125%; margin-left: calc((100% - 125%) / 2);" alt="" /> <div class="caption"> <em>Peak throughput for training jobs (Image source: <a href="https://github.com/deepseek-ai/3FS" rel="external nofollow noopener" target="_blank">3FS github</a>) </em> </div> </div> <p>The graph shows read throughput hovering around 6.6 TB/s, which represents approximately 73% of the theoretical 9 TB/s network capacity<span class="sidenote-ref"></span><span class="sidenote">I typically set 0 as the starting point for the y axis, which gives you an absolute base number that you can compare to</span>.
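</p>

<p>For reference, the 9 TB/s network ceiling (and the 20.16 TB/s disk ceiling it is compared against) falls straight out of the per-device specs in the table, assuming each 25 GB/s NIC corresponds to a 200 Gbps link:</p>

```python
# Derive the cluster-wide ceilings from per-device specs in the table.
NODES = 180          # storage nodes
SSDS_PER_NODE = 16   # NVMe SSDs per node
NICS_PER_NODE = 2    # NICs per node

seq_read_per_ssd = 7.0   # GB/s, sequential read per SSD
gbs_per_nic = 200 / 8    # 200 Gbps = 25 GB/s

disk_ceiling = NODES * SSDS_PER_NODE * seq_read_per_ssd / 1000  # TB/s
net_ceiling = NODES * NICS_PER_NODE * gbs_per_nic / 1000        # TB/s

print(f"disk sequential-read ceiling: {disk_ceiling:.2f} TB/s")  # 20.16 TB/s
print(f"network ceiling: {net_ceiling:.2f} TB/s")                # 9.00 TB/s
```

<p>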
This leaves 27% of potential throughput unutilized, suggesting possible system bottlenecks such as:</p> <ul> <li>Metadata communication network overhead (think TCP headers)</li> <li>Network completion delays before reading</li> <li>Workload imbalance creating hot nodes</li> <li>FUSE bottlenecks in the client for file operations</li> <li>Kernel overhead in managing communication and disk I/O</li> <li>Straggler storage nodes slowed by disk issues (temperature, wear)</li> <li>Native filesystem (XFS, ext4) overheads</li> <li>…</li> </ul> <h3 id="dips-in-the-training-job">Dips in the training job</h3> <p>The periodic dips in throughput are somewhat interesting:</p> <div class="image-container" style="max-width: 100%; overflow: visible;"> <img loading="lazy" src="/assets/images/posts/2025-03-13/part2/paper_dips.svg" style="width: 125%; margin-left: calc((100% - 125%) / 2);" alt="" /> </div> <p>These dips could originate from either the filesystem or the workload itself. The filesystem might have internal mechanisms (periodic flushing, lock contention, etc.) that could cause these performance drops. But because the dips occur at regular ~2.5-second intervals, checkpointing operations are the most likely cause<span class="sidenote-ref"></span><span class="sidenote">The dips hover around 6.3 TB/s, so at 6.6 TB/s average, that’s a 4.5% drop in throughput (300 GB/s / 6600 GB/s). These dips last roughly 10% of the time between peak points, so overall throughput may decrease by about 0.45% – not particularly significant.</span>, as parts of the model may need to pause training while checkpoint data is written.</p> <h2 id="next-up-gray-sort-benchmark">Next up: Gray Sort Benchmark</h2> <h3 id="what-is-gray-sort">What is Gray Sort?</h3> <p><a href="https://sortbenchmark.org/">Gray Sort</a> is a synthetic benchmark that tests how quickly a system can sort large<span class="sidenote-ref"></span><span class="sidenote">Large as in terabytes large, and definitely will not fit on a single machine</span> amounts of data. The workload follows a specific pattern that stresses both sequential and random I/O operations:</p> <ol> <li>Read unsorted data from storage into memory (sequential reads)</li> <li>Sort each data chunk in memory</li> <li>Write the sorted chunks back to storage (random-ish writes)</li> <li>Fetch other nodes’ sorted chunks and merge them in memory (random-ish reads)</li> <li>Write the merged results back to disk (random-ish writes)</li> <li>Repeat until all data is fully sorted</li> <li>Write the full sorted result to disk (sequential writes)</li> </ol> <p>This alternating pattern of reads and writes, combined with both sort and merge phases, makes it a standard test for distributed filesystem performance<span class="sidenote-ref"></span><span class="sidenote">Sounds like a <a href="https://research.google/pubs/mapreduce-simplified-data-processing-on-large-clusters/">MapReduce</a> workload, essentially aggregating keys in a range into a partition and then performing some operation (merging in this case) on that range</span>.</p> <h3 id="initial-look-at-the-graphs">Initial Look at the Graphs</h3> <div class="image-container" style="max-width: 100%; overflow: visible;"> <img loading="lazy" src="/assets/images/posts/2025-03-13/part2/gray_sort_client.png" style="width: 105%; margin-left: calc((100% - 105%) / 2);" alt="" />
<div class="caption"> <em>Gray Sort Single Node Client Performance (Image source: <a href="https://github.com/deepseek-ai/3FS" rel="external nofollow noopener" target="_blank">3FS github</a>) </em> </div> </div> <div class="image-container" style="max-width: 100%; overflow: visible;"> <img loading="lazy" src="/assets/images/posts/2025-03-13/part2/gray_sort_server.png" style="width: 105%; margin-left: calc((100% - 105%) / 2);" alt="" /> <div class="caption"> <em>Gray Sort Single Node Server Performance (Image source: <a href="https://github.com/deepseek-ai/3FS" rel="external nofollow noopener" target="_blank">3FS github</a>) </em> </div> </div> <p>Note that orange represents writes and blue represents reads.</p> <p>Looking at the orange dotted lines separating the algorithm phases, there’s a clear pattern. The phase before the first dotted line is pure writes – the system writing the unsorted data to storage. After that, we see mixed read/write workloads that gradually shift toward being more read-heavy as the sorting progresses<span class="sidenote-ref"></span><span class="sidenote">As more and more sorted runs get merged together, fewer write operations are needed since each merge pass consolidates multiple inputs into fewer outputs, while read operations increase to fetch data from the remaining sorted runs. This pattern is observable when comparing the workloads in the 18:05:00-18:10:00 and 18:25:00-18:30:00 time ranges in the server throughput graph.</span></p> <p>A few observations jump out:</p> <ul> <li>Eyeballing the average combined (read / write) throughput per phase, it hovers around ~10-15 GB/s.</li> <li>Clients peak at around 10 GB/s for writes and around 22 GB/s for reads.</li> <li>Storage nodes peak at 22 GB/s for writes and 30 GB/s for reads – roughly twice the clients’ throughput, which makes sense given there are twice as many clients as storage nodes.
We see this listed in the next section on hardware configuration.</li> </ul> <h3 id="hardware-configuration">Hardware Configuration</h3> <p>For this benchmark, 3FS was deployed with the following hardware setup:</p> <div id="fancy-table-Node Type,Component,Specification-wrapper" class="px-4 rounded-lg __basic-table not-prose mt-4 mb-4 overflow-x-auto"> <table id="fancy-table-Node Type,Component,Specification" class="min-w-full divide-y divide-gray-200 font-sans basic-table-striped"> <thead class="bg-gray-50"> <tr> <th class="px-6 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Node Type </th> <th class="px-6 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Component </th> <th class="px-6 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Specification </th> </tr> </thead> <tbody> <tr class="border-b border-gray-200"> <td id="fancy-table-Node Type,Component,Specification-row0-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Client</span> </td> <td id="fancy-table-Node Type,Component,Specification-row0-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Number of nodes</span> </td> <td id="fancy-table-Node Type,Component,Specification-row0-col2" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">50</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="fancy-table-Node Type,Component,Specification-row1-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;"></span> </td> <td id="fancy-table-Node Type,Component,Specification-row1-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Network</span> </td> <td id="fancy-table-Node Type,Component,Specification-row1-col2" class="px-6 py-2
whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">1 × 200Gbps NIC</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="fancy-table-Node Type,Component,Specification-row2-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;"></span> </td> <td id="fancy-table-Node Type,Component,Specification-row2-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Memory</span> </td> <td id="fancy-table-Node Type,Component,Specification-row2-col2" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">2.2TB DDR4</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="fancy-table-Node Type,Component,Specification-row3-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Storage</span> </td> <td id="fancy-table-Node Type,Component,Specification-row3-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Number of nodes</span> </td> <td id="fancy-table-Node Type,Component,Specification-row3-col2" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">25</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="fancy-table-Node Type,Component,Specification-row4-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;"></span> </td> <td id="fancy-table-Node Type,Component,Specification-row4-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Disk</span> </td> <td id="fancy-table-Node Type,Component,Specification-row4-col2" class="px-6 py-2 whitespace-nowrap 
text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">16 × 14TB PCIe 4.0 NVMe SSDs</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="fancy-table-Node Type,Component,Specification-row5-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;"></span> </td> <td id="fancy-table-Node Type,Component,Specification-row5-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Network</span> </td> <td id="fancy-table-Node Type,Component,Specification-row5-col2" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">2 × 400Gbps NICs</span> </td> </tr> </tbody> </table> </div> <h3 id="analysis">Analysis</h3> <p>The main difference from the previous benchmark is that there are twice as many clients as storage nodes (compared to 3× in the previous benchmark), and the storage nodes have twice as much network bandwidth!</p> <p>Let’s calculate the theoretical performance limits for this configuration:</p> <link rel="stylesheet" href="/assets/css/fancy_table.css" /> <script> function toggleCalculations(tableId) { // ...
const table = document.getElementById(tableId); const cells = table.querySelectorAll('.has-calculation'); const tableWrapper = document.getElementById(tableId + '-wrapper'); const toggleSwitch = document.getElementById(tableId + '-switch'); const toggleText = document.getElementById(tableId + '-toggle-text'); if (tableWrapper.classList.contains('show-calculations')) { tableWrapper.classList.remove('show-calculations'); toggleSwitch.classList.remove('active'); toggleText.textContent = "Show calculations"; cells.forEach(cell => { const normalText = cell.querySelector('.normal-text'); const calculationText = cell.querySelector('.calculation-text'); normalText.style.display = 'inline'; calculationText.style.display = 'none'; }); } else { tableWrapper.classList.add('show-calculations'); toggleSwitch.classList.add('active'); toggleText.textContent = "Hide calculations"; cells.forEach(cell => { const normalText = cell.querySelector('.normal-text'); const calculationText = cell.querySelector('.calculation-text'); normalText.style.display = 'none'; calculationText.style.display = 'inline'; }); } } function isConsideredYellow(colorString) { if (!colorString) return false; const lowerColor = colorString.toLowerCase(); const yellowKeywords = ['yellow', 'gold', 'lemon', 'chiffon', 'goldenrod', 'papayawhip', 'moccasin', 'khaki', '#ff0', '#ffff00', '#ffffe0', '#fffacd', '#fafad2', '#fff8dc', '#eee8aa', '#f0e68c']; if (yellowKeywords.some(k => lowerColor.includes(k))) { return true; } const match = lowerColor.match(/rgba?\((\d+),\s*(\d+),\s*(\d+)/); if (match) { const r = parseInt(match[1]); const g = parseInt(match[2]); const b = parseInt(match[3]); if (r > 200 && g > 180 && b < 200 && Math.abs(r - g) < 70) { return true; } } if (lowerColor === '#ffdab9') return true; // PeachPuff if (lowerColor === 'rgba(255,255,0,0.2)') return true; // Example: semi-transparent yellow return false; } function initializeCellHighlighters() { const
highlighters = document.querySelectorAll('[data-highlight-cells]'); highlighters.forEach(highlighter => { if (highlighter.dataset.highlighterInitialized === 'true') return; highlighter.dataset.highlighterInitialized = 'true'; const cellIdsToHighlight = highlighter.dataset.highlightCells.split(',').map(id => id.trim()); const cellElements = cellIdsToHighlight.map(id => document.getElementById(id)).filter(el => el); let descriptiveSpans = []; if (highlighter.matches('span[data-hover-text-color]')) { descriptiveSpans.push(highlighter); } else { descriptiveSpans = Array.from(highlighter.querySelectorAll('span[data-hover-text-color]')); } descriptiveSpans.forEach(span => { if (typeof span.dataset.originalParaTextColor === 'undefined') { span.dataset.originalParaTextColor = window.getComputedStyle(span).color || ''; } }); highlighter.addEventListener('mouseenter', () => { highlighter.classList.add('trigger-text-active'); descriptiveSpans.forEach(span => { if (span.dataset.hoverTextColor) { span.style.color = span.dataset.hoverTextColor; } }); cellElements.forEach((cell, idx) => { cell.classList.add('cell-highlighted'); let hoverTextColorForCell = null; let hoverCellBgFromDescSpan = null; if (idx < descriptiveSpans.length) { const descSpan = descriptiveSpans[idx]; if (descSpan.dataset.hoverTextColor) { hoverTextColorForCell = descSpan.dataset.hoverTextColor; } if (descSpan.dataset.hoverCellBg) { hoverCellBgFromDescSpan = descSpan.dataset.hoverCellBg; } } else if (descriptiveSpans.length === 1 && cellElements.length > 1) { const descSpan = descriptiveSpans[0]; if (descSpan.dataset.hoverTextColor) { hoverTextColorForCell = descSpan.dataset.hoverTextColor; } if (descSpan.dataset.hoverCellBg) { hoverCellBgFromDescSpan = descSpan.dataset.hoverCellBg; } } if (hoverTextColorForCell) { const isCalcCell = cell.classList.contains('has-calculation'); if (isCalcCell) { const normalTextSpan = cell.querySelector('.normal-text'); const calcResultSpan = 
cell.querySelector('.calc-result'); const calcWrapper = cell.querySelector('.calculation-text'); if (calcWrapper && window.getComputedStyle(calcWrapper).display !== 'none') { if (calcResultSpan) calcResultSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } else { if (normalTextSpan) normalTextSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } } else { const directSpan = cell.querySelector(':scope > span'); if (directSpan) directSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } } if (hoverCellBgFromDescSpan && isConsideredYellow(hoverCellBgFromDescSpan)) { if (typeof cell.dataset.originalBgColor === 'undefined') { cell.dataset.originalBgColor = cell.style.backgroundColor; } cell.style.backgroundColor = hoverCellBgFromDescSpan; cell.dataset.bgChangedByScript = 'true'; } }); }); highlighter.addEventListener('mouseleave', () => { highlighter.classList.remove('trigger-text-active'); descriptiveSpans.forEach(span => { if (typeof span.dataset.originalParaTextColor !== 'undefined') { span.style.color = span.dataset.originalParaTextColor; } }); cellElements.forEach((cell, idx) => { cell.classList.remove('cell-highlighted'); const isCalcCell = cell.classList.contains('has-calculation'); if (isCalcCell) { const normalTextSpan = cell.querySelector('.normal-text'); const calcResultSpan = cell.querySelector('.calc-result'); if (normalTextSpan) normalTextSpan.style.removeProperty('color'); if (calcResultSpan) calcResultSpan.style.removeProperty('color'); } else { const directSpan = cell.querySelector(':scope > span'); if (directSpan) directSpan.style.removeProperty('color'); } if (cell.dataset.bgChangedByScript === 'true') { cell.style.backgroundColor = cell.dataset.originalBgColor || ''; delete cell.dataset.originalBgColor; delete cell.dataset.bgChangedByScript; } }); }); }); } if (document.readyState === 'loading') { document.addEventListener('DOMContentLoaded', initializeCellHighlighters); } else { 
initializeCellHighlighters(); } </script> <style> /* ... (Other styles remain the same) ... */ .toggle-container { display: flex; justify-content: flex-end; margin-bottom: 6px; } .calc-toggle { display: inline-flex; align-items: center; } .toggle-switch { position: relative; display: inline-block; width: 32px; height: 16px; background-color: #e5e7eb; /* gray-200 */ border-radius: 10px; transition: all 0.3s; cursor: pointer; } .toggle-switch::after { content: ''; position: absolute; width: 12px; height: 12px; border-radius: 50%; background-color: white; top: 2px; left: 2px; transition: all 0.3s cubic-bezier(0.175, 0.885, 0.32, 1.275); box-shadow: 0 1px 2px rgba(0, 0, 0, 0.1); } .toggle-switch.active { background-color: #8b5cf6; /* purple-500 */ } .toggle-switch.active::after { left: 18px; } .toggle-label { position: absolute; width: 1px; height: 1px; padding: 0; margin: -1px; overflow: hidden; clip: rect(0, 0, 0, 0); white-space: nowrap; border-width: 0; } .toggle-text { font-size: 0.75rem; color: #6b7280; /* gray-500 */ margin-right: 6px; } .calc-result { color: rgb(75, 85, 99); /* gray-700 */ } .calc-formula { color: #6b83a6; /* Subtle bluish gray */ font-family: ui-monospace, SFMono-Regular, Menlo, Monaco, Consolas, monospace; } .calc-equals { color: rgb(75, 85, 99); /* gray-700 */ display: inline-block; margin: 0 4px; font-weight: normal; } .cell-highlighted { font-weight: bold !important; } .trigger-text-active { background-color: #E5E7EB; padding: 1px 3px; border-radius: 3px; transition: background-color 0.15s ease-in-out; } .trigger-text-active span[data-hover-text-color] { padding: 0; background-color: transparent; } /* Add styles for no-scroll option */ .table-wrapper-no-scroll { overflow-x: visible !important; } .table-no-scroll td { white-space: normal !important; word-break: break-word; } </style> <div id="graysort-table-wrapper" class="px-4 rounded-lg __basic-table not-prose mt-2 mb-2 overflow-x-auto"> <div class="toggle-container"> <div 
class="calc-toggle"> <span id="graysort-table-toggle-text" class="toggle-text">Show calculations</span> <span id="graysort-table-toggle" onclick="toggleCalculations('graysort-table')"> <span class="toggle-switch" id="graysort-table-switch"></span> <span class="toggle-label">Toggle calculations</span> </span> </div> </div> <table id="graysort-table" class="min-w-full divide-y divide-gray-200 font-sans basic-table-striped"> <thead class="bg-gray-50"> <tr> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Node Type </th> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Metric </th> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Per Unit </th> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Per Node </th> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Entire Cluster </th> </tr> </thead> <tbody> <tr class="border-b border-gray-200"> <td id="graysort-table-row0-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Storage (25)</span> </td> <td id="graysort-table-row0-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Disk - Sequential Read</span> </td> <td id="graysort-table-row0-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">7 GB/s</span> </td> <td id="graysort-table-row0-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600 has-calculation"> <span class="normal-text">112 GB/s</span> <span class="calculation-text" style="display: none;"> <span class="calc-result">112 GB/s</span> <span class="calc-equals">=</span> <span class="calc-formula">7 GB/s × 16</span> </span> </td> <td id="graysort-table-row0-col4" class="px-4 py-2 whitespace-nowrap text-sm font-medium 
text-gray-600 has-calculation"> <span class="normal-text">2.8 TB/s</span> <span class="calculation-text" style="display: none;"> <span class="calc-result">2.8 TB/s</span> <span class="calc-equals">=</span> <span class="calc-formula">112 GB/s × 25</span> </span> </td> </tr> <tr class="border-b border-gray-200"> <td id="graysort-table-row1-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;"></span> </td> <td id="graysort-table-row1-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Disk - Random Read</span> </td> <td id="graysort-table-row1-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">4 GB/s</span> </td> <td id="graysort-table-row1-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600 has-calculation"> <span class="normal-text">64 GB/s</span> <span class="calculation-text" style="display: none;"> <span class="calc-result">64 GB/s</span> <span class="calc-equals">=</span> <span class="calc-formula">4 GB/s × 16</span> </span> </td> <td id="graysort-table-row1-col4" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600 has-calculation"> <span class="normal-text">1.6 TB/s</span> <span class="calculation-text" style="display: none;"> <span class="calc-result">1.6 TB/s</span> <span class="calc-equals">=</span> <span class="calc-formula">64 GB/s × 25</span> </span> </td> </tr> <tr class="border-b border-gray-200"> <td id="graysort-table-row2-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;"></span> </td> <td id="graysort-table-row2-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Disk - Sequential Write</span> </td> <td id="graysort-table-row2-col2" class="px-4 
py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">6 GB/s</span> </td> <td id="graysort-table-row2-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600 has-calculation"> <span class="normal-text">96 GB/s</span> <span class="calculation-text" style="display: none;"> <span class="calc-result">96 GB/s</span> <span class="calc-equals">=</span> <span class="calc-formula">6 GB/s × 16</span> </span> </td> <td id="graysort-table-row2-col4" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600 has-calculation"> <span class="normal-text">2.4 TB/s</span> <span class="calculation-text" style="display: none;"> <span class="calc-result">2.4 TB/s</span> <span class="calc-equals">=</span> <span class="calc-formula">96 GB/s × 25</span> </span> </td> </tr> <tr class="border-b border-gray-200"> <td id="graysort-table-row3-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;"></span> </td> <td id="graysort-table-row3-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Disk - Random Write</span> </td> <td id="graysort-table-row3-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">2 GB/s</span> </td> <td id="graysort-table-row3-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600 has-calculation"> <span class="normal-text">32 GB/s</span> <span class="calculation-text" style="display: none;"> <span class="calc-result">32 GB/s</span> <span class="calc-equals">=</span> <span class="calc-formula">2 GB/s × 16</span> </span> </td> <td id="graysort-table-row3-col4" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600 has-calculation"> <span class="normal-text">0.8 TB/s</span> <span class="calculation-text" style="display: none;"> <span 
class="calc-result">0.8 TB/s</span> <span class="calc-equals">=</span> <span class="calc-formula">32 GB/s × 25</span> </span> </td> </tr> <tr class="border-b border-gray-200"> <td id="graysort-table-row4-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;"></span> </td> <td id="graysort-table-row4-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Network</span> </td> <td id="graysort-table-row4-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">100 GB/s</span> </td> <td id="graysort-table-row4-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">100 GB/s</span> </td> <td id="graysort-table-row4-col4" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600 has-calculation"> <span class="normal-text">2.5 TB/s</span> <span class="calculation-text" style="display: none;"> <span class="calc-result">2.5 TB/s</span> <span class="calc-equals">=</span> <span class="calc-formula">100 GB/s × 25</span> </span> </td> </tr> <tr class="border-b border-gray-200"> <td id="graysort-table-row5-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Client (50)</span> </td> <td id="graysort-table-row5-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Network</span> </td> <td id="graysort-table-row5-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">25 GB/s</span> </td> <td id="graysort-table-row5-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">25 GB/s</span> </td> <td 
id="graysort-table-row5-col4" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600 has-calculation"> <span class="normal-text">1.25 TB/s</span> <span class="calculation-text" style="display: none;"> <span class="calc-result">1.25 TB/s</span> <span class="calc-equals">=</span> <span class="calc-formula">25 GB/s × 50</span> </span> </td> </tr> <tr class="border-b border-gray-200"> <td id="graysort-table-row6-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Gray Sort</span> </td> <td id="graysort-table-row6-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Client Write Peak</span> </td> <td id="graysort-table-row6-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">N/A</span> </td> <td id="graysort-table-row6-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">~10 GB/s</span> </td> <td id="graysort-table-row6-col4" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">N/A</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="graysort-table-row7-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Gray Sort</span> </td> <td id="graysort-table-row7-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Client Read Peak</span> </td> <td id="graysort-table-row7-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">N/A</span> </td> <td id="graysort-table-row7-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) 
!important;">~22 GB/s</span> </td> <td id="graysort-table-row7-col4" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">N/A</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="graysort-table-row8-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Gray Sort</span> </td> <td id="graysort-table-row8-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Server Write Peak</span> </td> <td id="graysort-table-row8-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">N/A</span> </td> <td id="graysort-table-row8-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">~22 GB/s</span> </td> <td id="graysort-table-row8-col4" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">N/A</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="graysort-table-row9-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Gray Sort</span> </td> <td id="graysort-table-row9-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Server Read Peak</span> </td> <td id="graysort-table-row9-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">N/A</span> </td> <td id="graysort-table-row9-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">~30 GB/s</span> </td> <td id="graysort-table-row9-col4" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span 
style="color: rgb(75, 85, 99) !important;">N/A</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="graysort-table-row10-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;"></span> </td> </tr> </tbody> </table> </div> <p>The performance numbers reveal an interesting pattern. In the first phase, the server write peak achieves <span data-highlight-cells="graysort-table-row8-col3, graysort-table-row3-col3"><span data-hover-text-color="rgba(80, 150, 100, 0.9)" data-hover-cell-bg="#FFFFE0">~22 GB/s</span> out of <span data-hover-text-color="rgba(52, 152, 219, 0.9)" data-hover-cell-bg="#FFFFE0">32 GB/s</span> random write capacity</span> (69% utilization). In the second phase, the server read peak reaches <span data-highlight-cells="graysort-table-row9-col3, graysort-table-row1-col3"><span data-hover-text-color="rgba(80, 150, 100, 0.9)" data-hover-cell-bg="#FFFFE0">~30 GB/s</span> out of <span data-hover-text-color="rgba(52, 152, 219, 0.9)" data-hover-cell-bg="#FFFFE0">64 GB/s</span> random read capacity</span> (47% utilization), which is quite a bit lower than the relative utilization for writes. 
However, <span data-highlight-cells="graysort-table-row9-col3, graysort-table-row0-col3">comparing the <span data-hover-text-color="rgba(80, 150, 100, 0.9)" data-hover-cell-bg="#FFFFE0">~30 GB/s</span> read peak against the <span data-hover-text-color="rgba(52, 152, 219, 0.9)" data-hover-cell-bg="#FFFFE0">112 GB/s</span> sequential read capacity</span> (27% utilization) strongly signals that the workload is predominantly random rather than sequential.</p> <p>Let’s take a look at the numbers:</p> <ul> <li>Storage nodes peak at <span data-highlight-cells="graysort-table-row8-col3, graysort-table-row9-col3, graysort-table-row4-col3"><span data-hover-text-color="rgba(80, 150, 100, 0.9)" data-hover-cell-bg="#FFFFE0">22 GB/s writes and 30 GB/s reads</span>, well below the <span data-hover-text-color="rgba(52, 152, 219, 0.9)" data-hover-cell-bg="#FFFFE0">100 GB/s network capacity</span></span></li> <li>Client read peak achieves <span data-highlight-cells="graysort-table-row7-col3, graysort-table-row5-col2">88% of network capacity (<span data-hover-text-color="rgba(80, 150, 100, 0.9)" data-hover-cell-bg="#FFFFE0">22 GB/s</span> out of <span data-hover-text-color="rgba(52, 152, 219, 0.9)" data-hover-cell-bg="#FFFFE0">25 GB/s</span>)</span></li> <li>Client write peak hits only <span data-highlight-cells="graysort-table-row6-col3, graysort-table-row5-col2">40% of network capacity (<span data-hover-text-color="rgba(80, 150, 100, 0.9)" data-hover-cell-bg="#FFFFE0">10 GB/s</span> out of <span data-hover-text-color="rgba(52, 152, 219, 0.9)" data-hover-cell-bg="#FFFFE0">25 GB/s</span>)</span><span class="sidenote-ref"></span><span class="sidenote">Why don’t writes peak nearly as high as reads? One reason might be CRAQ’s consistency guarantees: each write must traverse the entire chain (head → middle → tail → back), which makes write performance predictable, unlike reads.
Reads, by contrast, can be served from a follower or trigger a consistency check at the tail</span></li> </ul> <p>The bottleneck here is clearly the number of clients. With the storage nodes far from saturated, we could support more clients. How many? If we want to saturate the storage sequential write capacity of <span data-highlight-cells="graysort-table-row2-col4, graysort-table-row5-col2"><span data-hover-text-color="rgba(52, 152, 219, 0.9)" data-hover-cell-bg="#FFFFE0">2.4 TB/s</span> and each client can push <span data-hover-text-color="rgba(80, 150, 100, 0.9)" data-hover-cell-bg="#FFFFE0">25 GB/s</span></span>:</p> <p>2.4 TB/s ÷ 25 GB/s = ~96 clients</p> <p>Nearly double the current 50 clients! This suggests the current configuration may be significantly underutilizing the storage infrastructure.</p> <p>Interestingly, <span data-highlight-cells="graysort-table-row8-col3, graysort-table-row6-col3">the storage write peak (<span data-hover-text-color="rgba(80, 150, 100, 0.9)" data-hover-cell-bg="#FFFFE0">22 GB/s</span>) slightly exceeds the client write peak (<span data-hover-text-color="rgba(52, 152, 219, 0.9)" data-hover-cell-bg="#FFFFE0">20 = 2 × 10 GB/s</span>)</span>. With 50 clients at 10 GB/s distributed across 25 storage nodes, each node should see ~20 GB/s, with the extra 2 GB/s coming from somewhere – perhaps CRAQ protocol overhead?<span class="sidenote-ref"></span><span class="sidenote">CRAQ requires writes to propagate through chains, potentially creating additional write traffic beyond what clients generate</span></p> <p>The end-to-end performance measurements, however, reveal an unexpected pattern: the <a href="https://github.com/deepseek-ai/3FS/tree/ee9a5cee0a85c64f4797bf380257350ca1becd36">benchmark notes mention achieving 3.66 TB/min</a> – 61 GB/s aggregate throughput, which doesn’t sound too bad.
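</p>

<p>As a sanity check, the unit conversions and utilization figures in this section can be reproduced in a few lines of Python. This is just back-of-envelope arithmetic over the table values above (decimal units assumed, 1 TB = 1000 GB):</p>

```python
# Back-of-envelope checks for the GraySort benchmark numbers above.
# Decimal units throughout (1 TB = 1000 GB), matching the tables.

seq_write_cluster = 6 * 16 * 25   # GB/s: 6 GB/s per SSD x 16 SSDs x 25 nodes = 2400
client_nic = 25                   # GB/s per client (one 200 Gbps NIC)

# Clients needed to saturate cluster sequential-write capacity
print(seq_write_cluster / client_nic)                    # 96.0

# Reported end-to-end throughput: 3.66 TB/min -> GB/s
aggregate_gbs = 3.66e12 / 60 / 1e9
print(round(aggregate_gbs))                              # 61

# Fraction of total client network capacity actually used
client_capacity = client_nic * 50                        # 50 clients -> 1250 GB/s
print(round(aggregate_gbs / client_capacity * 100, 2))   # 4.88
```

<p>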
But that’s just 4.88% of the <span data-highlight-cells="graysort-table-row5-col4" data-hover-text-color="rgba(52, 152, 219, 0.9)" data-hover-cell-bg="#FFFFE0">1.25 TB/s client network capacity</span>. This suggests that the bottleneck might not be network or disk at all – it could even be CPU/memory bound from the sorting computation itself.</p> <h2 id="caching-the-key-value-pairs-of-the-transformer">Caching the key-value pairs of the transformer</h2> <h3 id="what-is-the-kv-cache">What is the KV Cache?</h3> <p>The KV cache stores the key-value pairs from attention mechanisms during LLM inference. Instead of recomputing these values for every new token, the system caches them, trading storage for computation to dramatically reduce computational overhead. For models like DeepSeek’s R1, this cache becomes substantial – each token requires approximately 70KB of storage in FP16 format.</p> <p>This workload represents an important real-world use case for 3FS. As LLMs process longer contexts and serve more users concurrently, the storage system must handle both massive reads (loading cached values) and periodic deletions (garbage collecting expired entries).</p> <h3 id="initial-look-at-the-graphs-1">Initial Look at the Graphs</h3> <div class="image-container" style="max-width: 100%; overflow: visible;"> <img loading="lazy" src="/assets/images/posts/2025-03-13/part2/kvcache_read_throughput.png" style="width: 125%; margin-left: calc((100% - 125%) / 2);" alt="" /> <div class="caption"> <em>KV Cache Read Throughput (Image source: <a href="https://github.com/deepseek-ai/3FS" rel="external nofollow noopener" target="_blank">3FS github</a>) </em> </div> </div> <div class="image-container" style="max-width: 100%; overflow: visible;"> <img loading="lazy" src="/assets/images/posts/2025-03-13/part2/kvcache_gc_iops.png" style="width: 125%; margin-left: calc((100% - 125%) / 2);" alt="" /> <div class="caption"> <em>KV Cache GC IOPS (Image source: <a
href="https://github.com/deepseek-ai/3FS" rel="external nofollow noopener" target="_blank">3FS github</a>) </em> </div> </div> <p>The graphs show per-client performance for KV cache operations. Looking at the read throughput graph:</p> <ul> <li>Average throughput hovers around 3 GB/s</li> <li>Peak throughput reaches approximately 40 GB/s</li> <li>That’s more than a 13× difference between average and peak</li> </ul> <p>The GC IOPS graph reveals:</p> <ul> <li>Periodic bursts of deletion operations reaching 1-1.4M IOPS</li> <li>~4 bursts per 5-minute interval <ul> <li>Each burst lasts ~40 seconds, followed by a similar period of low activity</li> </ul> </li> </ul> <p>Unfortunately, the authors don’t specify the complete hardware configuration – we only know each client has a 400 Gbps NIC (50 GB/s). This means the peak 40 GB/s achieves 80% network utilization, while average performance uses only 6% of available bandwidth.</p> <h3 id="analyzing-the-workload">Analyzing the Workload</h3> <p>The read pattern is fundamentally random – individual KV entries are scattered across storage. However, each 70KB entry spans multiple 4KB blocks<span class="sidenote-ref"></span><span class="sidenote">SSDs read data in fixed-size blocks, typically 4KB.
A 70KB entry requires reading 18 consecutive blocks</span>, resulting in sequential device-level reads despite the random access pattern per entry.</p> <p>Let me calculate what these throughput numbers mean for token processing:</p> <details style="font-size: 1.2em; margin: 1em 0;"> <summary style="cursor: pointer; font-weight: bold;">Expand for calculations for KV cache entry</summary> <p><a href="https://github.com/deepseek-ai/DeepSeek-V3/blob/4c2fdb8f55e049553b9f4f1a3241f86d739c8cf8/inference/configs/config_671B.json">671B configuration</a></p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{ "vocab_size": 129280, "dim": 7168, "inter_dim": 18432, "moe_inter_dim": 2048, "n_layers": 61, "n_dense_layers": 3, "n_heads": 128, "n_routed_experts": 256, "n_shared_experts": 1, "n_activated_experts": 8, "n_expert_groups": 8, "n_limited_groups": 4, "route_scale": 2.5, "score_func": "sigmoid", "q_lora_rank": 1536, "kv_lora_rank": 512, "qk_nope_head_dim": 128, "qk_rope_head_dim": 64, "v_head_dim": 128, "dtype": "fp8" } </code></pre></div> </div> <div class="image-container" style="max-width: 100%; overflow: visible;"> <img loading="lazy" src="/assets/images/posts/2025-03-13/part2/paper_mla.png" style="width: 125%; margin-left: calc((100% - 125%) / 2);" alt="" /> <div class="caption"> <em>KV Cache MLA calculation described in Deepseek V2 (Image source: <a href="https://arxiv.org/pdf/2405.04434" rel="external nofollow noopener" target="_blank">DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model</a>) </em> </div> </div> <p>Given:</p> <ul> <li>kv_lora_rank = 512</li> <li>qk_rope_head_dim = 64</li> <li>n_layers = 61</li> </ul> <p>Per token: (512 + 64) × 61 = 35,136 elements</p> <p>In FP16 format (2 bytes per element): 35,136 × 2 = 70,272 bytes ≈ 70KB per token. In FP8 format (1 byte per element): 35,136 bytes ≈ 35KB per token.</p> </details> <p>With 70KB per token:</p> <ul> <li>Average throughput (3 GB/s)
processes ~43,000 tokens/second per client</li> <li>Peak throughput (40 GB/s) processes ~570,000 tokens/second per client</li> </ul> <p>Given R1’s 128K context length:</p> <ul> <li>Average: Can read entire context in ~3 seconds (128K ÷ 43K)</li> <li>Peak: Can read entire context in ~0.22 seconds (128K ÷ 570K)</li> </ul> <p>These numbers are impressive, but without knowing the number of concurrent users or typical context lengths, it’s hard to judge real-world performance.</p> <h3 id="alignment-concerns">Alignment Concerns</h3> <p>Here’s an issue the authors don’t address: alignment waste. Modern NVMe SSDs use 4KB block sizes, but KV cache entries are 70KB – not cleanly divisible.</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Blocks needed = ⌈70,272 ÷ 4,096⌉ = 18 blocks Actual storage = 18 × 4,096 = 73,728 bytes Wasted space = 3,456 bytes (4.69%) </code></pre></div></div> <p>This 4.69% waste might seem small, but at scale it adds up. With enterprise SSDs costing ~$2,200 each:</p> <ul> <li>Waste per SSD: ~$103</li> <li>Waste per node (16 SSDs): ~$1,650</li> <li>Waste per 180 nodes: ~$297,000</li> <li>Waste per 10,000 nodes: ~$16,500,000</li> </ul> <p>For a company running thousands of clusters, this alignment inefficiency could waste millions in storage costs.</p> <h3 id="garbage-collection">Garbage Collection</h3> <p>The GC algorithm isn’t detailed, but entries likely get marked for deletion when no longer referenced. The deletion mechanism remains unclear – it could involve bit flags, pointer updates, zeroing entries, or <a href="https://en.wikipedia.org/wiki/Log-structured_merge-tree#Operations">tombstone markers</a>.</p> <p>The periodic burst pattern (1-1.4M IOPS) suggests that threshold-based eviction or batch processing is probably more efficient than continuous cleanup for this type of workload.
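</p>

<p>A threshold-triggered batch eviction loop would produce exactly this bursty IOPS signature: long quiet periods while deletions accumulate, then a spike once a high-water mark is crossed. Here’s a minimal, purely illustrative sketch – the cache class, water marks, and eviction policy are my assumptions, not 3FS internals:</p>

```python
import time

class KVCache:
    """Toy in-memory KV cache with TTL-expired entries (illustrative only)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = {}            # key -> expiry timestamp

    def put(self, key, ttl):
        self.entries[key] = time.time() + ttl

    def used_fraction(self):
        return len(self.entries) / self.capacity

    def expired_keys(self):
        now = time.time()
        return [k for k, exp in self.entries.items() if exp < now]

    def delete(self, key):
        del self.entries[key]

def run_gc(cache, high_water=0.9, low_water=0.7):
    """Batch-evict expired entries once usage crosses the high-water mark."""
    if cache.used_fraction() < high_water:
        return 0                     # quiet period: no deletions issued
    evicted = 0
    for key in cache.expired_keys():
        cache.delete(key)            # each delete costs one IOP on real storage
        evicted += 1
        if cache.used_fraction() <= low_water:
            break                    # burst ends once below the low-water mark
    return evicted
```

<p>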
While throughput remains stable during GC, these spikes could impact performance if disks are already near throughput capacity<span class="sidenote-ref"></span><span class="sidenote">Garbage collection problems have appeared numerous times in many existing systems – showing up as <a href="https://github.com/facebook/rocksdb/issues/3972">compaction issues in RocksDB</a> or <a href="https://stackoverflow.com/questions/54831212/postgresql-autovacuum-causing-significant-performance-degradation">autovacuum spikes in Postgres</a></span>.</p> <h3 id="remaining-feedback">Remaining feedback</h3> <p>Some critical information is absent from this benchmark, most notably the lack of latency graphs. For LLM serving, latency matters as much as throughput – users need consistent time-to-first-token and smooth text generation, or they’ll switch to another service (ChatGPT, Gemini, Claude, etc.).</p> <p>Someone at Deepseek clearly knows how to configure systems well if this is a real sample from a live system. The 80% peak utilization indicates a well-configured system with just enough headroom.<span class="sidenote-ref"></span><span class="sidenote">Nobody wants that 3am call to discuss needing to set up more machines to handle the traffic</span></p> <h1 id="closing-thoughts">Closing Thoughts</h1> <p>The benchmarks focus exclusively on throughput, omitting latency metrics entirely. Not sure why they skipped latency – perhaps cost considerations took priority.
While latency optimization is notoriously difficult<span class="sidenote-ref"></span><span class="sidenote"><a href="http://www.stuartcheshire.org/rants/latency.html">Stuart Cheshire: “It’s the latency, stupid”</a></span><span class="sidenote-ref"></span><span class="sidenote"><a href="https://www.barroso.org/publications/TheTailAtScale.pdf">Jeff Dean on tail latencies at scale</a></span>, my future evaluations will include latency measurements and explore optimizations to improve it.</p> <p>Despite these limitations and critiques, the benchmarks align well with theoretical calculations and provide valuable insights into 3FS performance at scale.</p> <p>In upcoming posts, I’ll benchmark 3FS myself to verify these graphs and claims and dig deeper:</p> <ul> <li>Testing actual hardware limits vs theoretical calculations</li> <li>Measuring latency distributions, not just throughput</li> <li>Creating custom visualizations for storage and network performance patterns</li> <li>Validating if our back-of-the-envelope math holds up</li> <li>Profiling with various tools (perf, sampling, adapting source code) to identify bottlenecks</li> </ul> <h1 id="acknowledgments">Acknowledgments</h1> <p>Thanks to <a href="https://sbaziotis.com/">Stefanos Baziotis</a>, <a href="https://www.linkedin.com/in/ahan-gupta-405619103/">Ahan Gupta</a>, and <a href="https://vimarsh.me/">Vimarsh Sathia</a> for reviewing this post.</p> <h1 id="citation">Citation</h1> <p>To cite this article:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@article{zhu20253fs2,
  title   = {A Reality Check on DeepSeek's Distributed File System Benchmarks},
  author  = {Zhu, Henry},
  journal = {maknee.github.io},
  year    = {2025},
  month   = {June},
  url     = "https://maknee.github.io/blog/2025/3FS-Performance-Journal-2/"
}
</code></pre></div></div> Paul Graham - What to Do 2025-04-20T00:00:00+00:00 2025-04-20T00:00:00+00:00
https://maknee.github.io/blog/2025/Paul-Graham-What-To-Do <h3 id="paul-graham---what-to-do">Paul Graham - What to Do</h3> <p>Thoughts about <a href="https://paulgraham.com/do.html">What to Do</a> by Paul Graham.</p> <h4 id="thoughts-along-the-way">Thoughts along the way</h4> <blockquote> <p>What should one do? That may seem a strange question, but it’s not meaningless or unanswerable. It’s the sort of question kids ask before they learn not to ask big questions.</p> </blockquote> <p>This statement about kids kind of took me off guard - I do see it happen (at least in myself). Why though? Does he see it in his children and the kids that he encounters? What does he consider “kids” in this context - elementary school, high school, college? I see this explained in the hierarchy of societies. Most definitely in the military.</p> <blockquote> <p>I only came across it myself in the process of investigating something else. But once I did, I thought I should at least try to answer it.</p> </blockquote> <p>Oh, I haven’t explained why I was caught off guard. It’s because I haven’t thought about this in a long time. And I don’t have an answer yet.</p> <blockquote> <p>So what should one do? One should help people, and take care of the world. Those two are obvious.</p> </blockquote> <p>This is how kids would answer.</p> <blockquote> <p>But is there anything else? When I ask that, the answer that pops up is Make good new things.</p> </blockquote> <p>What good things? How do you know that they are good? Or new?</p> <blockquote> <p>The most impressive thing humans can do is to think. It may be the most impressive thing that can be done. And the best kind of thinking, or more precisely the best proof that one has thought well, is to make good new things.</p> </blockquote> <p>I believe in this, and he has stated it well with very concise sentences. I like it.</p> <blockquote> <p>Newton’s physics was a good new thing.</p> </blockquote> <p>Surprised by an example.
The concept may be very abstract without a general example (here, one where everyone knows about the discovery).</p> <p>I’m going to guess that this discovery allowed people to develop technology (ships, safety, etc)?</p> <blockquote> <p>Indeed, the first version of this principle was to have good new ideas. But that didn’t seem general enough: it didn’t include making art or music, for example, except insofar as they embody new ideas. And while they may embody new ideas, that’s not all they embody, unless you stretch the word “idea” so uselessly thin that it includes everything that goes through your nervous system.</p> </blockquote> <p>I don’t understand this very well; I think he’s trying to explain how general a new idea can be - I don’t think it has to be. It’s very very very very very very very difficult to make a general good new idea. I believe that it’s built upon the ideas of many people, hundreds, thousands, millions, etc… to get to a general good new idea. I see this repeated.</p> <blockquote> <p>To make discoveries, for example, or to understand something more deeply than others have. But how well do you understand something if you can’t make a model of it, or write about it? Indeed, trying to express what you understand is not just a way to prove that you understand it, but a way to understand it better.</p> </blockquote> <p>The more I do this, the more I believe in it.</p> <p>I think I’ve applied it to a teeny bit of my life. And I hope that the same rule will apply in other aspects of life/experiences.</p> <blockquote> <p>Another reason I like this phrasing is that it biases us toward creation. It causes us to prefer the kind of ideas that are naturally seen as making things rather than, say, making critical observations about things other people have made.
Those are ideas too, and sometimes valuable ones, but it’s easy to trick oneself into believing they’re more valuable than they are.</p> </blockquote> <p>Two parts to this.</p> <p>I don’t agree that the phrasing biases us toward creation. Seems forced - I didn’t see it that way originally. Discoveries (albeit repeated among different individuals) can fall under this term. I believe that it’s more about thinking and learning.</p> <p>Yes, I agree with what Paul states about the observations. Even an intellectual person that you look up to may make a wrong guess. For example, the <a href="https://en.wikipedia.org/wiki/Tanenbaum%E2%80%93Torvalds_debate">godfather of operating systems</a> lost a debate against Linus over whether Linux would succeed as a monolithic kernel. Imagine that: a random ass college kid (Linus was 23 at the time) tells the most well-known/accomplished professor in operating systems at that time that his hobby operating system would win. If I were a random person in this flame war, I would definitely not have chosen Linus’ arguments.</p> <p>And I see this often in my life as well. People make observations all the time, but when some X action happens, they’re sometimes wrong. Should you believe their observations? Sometimes.</p> <blockquote> <p>Criticism seems sophisticated, and making new things often seems awkward, especially at first; and yet it’s precisely those first steps that are most rare and valuable.</p> </blockquote> <p>This next statement came off a bit disconnected from the previous sentences, and yet I do think it’s necessary to have it. I think that the observations may seem most rare/valuable, but I believe that it’s generally a series of observations, and it takes a bit to form thoughts about different/unusual observations.</p> <blockquote> <p>Is newness essential? I think so. Obviously it’s essential in science.
If you copied a paper of someone else’s and published it as your own, it would seem not merely unimpressive but dishonest.</p> </blockquote> <p>Interesting statement about papers.</p> <blockquote> <p>Which in turn implies it’s not impressive to make the same thing over and over, however well; you’re just copying yourself.</p> </blockquote> <p>The problem here is that there’s not much learning (which comes from going through the problems and painful steps to get to the end) - which I think Paul is stating here.</p> <blockquote> <p>Historically most rules about how to live have been a mix of both kinds of should, though usually with more of the former than the latter.</p> </blockquote> <p>Nice observation.</p> <blockquote> <p>Archimedes knew that he was the first to prove that a sphere has 2/3 the volume of the smallest enclosing cylinder and was very pleased about it. But you don’t find ancient writers urging their readers to emulate him. They regarded him more as a prodigy than a model.</p> </blockquote> <p>Very interesting observation. Why not emulate him?</p> <blockquote> <p>Now many more of us can follow Archimedes’s example and devote most of our attention to one kind of work.</p> </blockquote> <p>Oh.</p> <blockquote> <p>What kinds of new things count? I’d rather leave that question to the makers of them.</p> </blockquote> <p>He didn’t answer the question… :(, but this is the answer to give.</p> <blockquote> <p>It would be a risky business to try to define any kind of threshold, because new kinds of work are often despised at first. Raymond Chandler was writing literal pulp fiction, and he’s now recognized as one of the best writers of the twentieth century.
Indeed this pattern is so common that you can use it as a recipe: if you’re excited about some kind of work that’s not considered prestigious and you can explain what everyone else is overlooking about it, then this is not merely a kind of work that’s ok to do, but one to seek out.</p> </blockquote> <p>What a good statement. I’m focusing on the “hey, it’s not good at first” part. I’ve seen this a couple of times already. But I think Paul doesn’t mention the other factors: time taken, mental stress, comfort, physical taxation, … These can be minor or major hurdles of going down such a route. It is sometimes brutal to go down such a path.</p> <blockquote> <p>The kind of people who make good new things don’t need rules to keep them honest.</p> </blockquote> <p>True, but again, there are hurdles, and this time they include other people.</p> <blockquote> <p>But even if you’re one of those, you should at least make sure that the new things you make don’t net harm people or the world.</p> </blockquote> <p>Very hard to see sometimes.</p> <blockquote> <p>On the other hand, if you make something amazing, you’ll often be helping people or the world even if you didn’t mean to. Newton was driven by curiosity and ambition, not by any practical effect his work might have, and yet the practical effect of his work has been enormous. And this seems the rule rather than the exception. So if you think you can make something amazing, you should probably just go ahead and do it.</p> </blockquote> <p>Great ending. “Just do it” - easier said than done. That’s for sure. Another thing that Paul doesn’t mention: it’s like the gym. It takes reps to build muscle. It takes reps to make something amazing. “Just do it” - yes, but make sure you see your goals clearly in the moment and learn to identify plateaus.
Not every person in the Olympics just randomly went at it, did one thing, and became good at their sport.</p> <h4 id="final-thoughts">Final thoughts</h4> <p>To answer this generically - I can’t. But if you’re doing stuff that you’re interested in and trying to create something: talking to people, reading/watching the literature, and doing something and then thinking (or guessing, then doing and thinking again) are some ways one can get to the point of creating something. However, going through it may not be fun at times, may actually be uninteresting at times, or may even be depressing and cause someone to re-evaluate a lot (feel lost). I think that as long as one holds the belief at one’s core, one will make progress. Ask anyone you think is successful what they failed at or when they felt lost; they should answer with an event that sticks out, or a couple, or even mention that they wanted to give up.</p>