NVIDIA TileIR Internals: from CuTile to MLIR/LLVM to SASS · Henry Zhu · 2026-01-30 · https://maknee.github.io/blog/2026/NVIDIA-TileIR-Internals-from-CuTile-to-MLIR-LLVM-to-SASS <p>In this post, we’ll dig deep into how TileIR works, from how it generates instructions to what its individual passes do. We’ll trace how a Mixture-of-Experts (MoE) kernel written in CuTile gets compiled down through <code class="language-plaintext highlighter-rouge">cuda_tile</code> → <code class="language-plaintext highlighter-rouge">nv_tileaa</code> → <code class="language-plaintext highlighter-rouge">nv_tileas</code> → NVVM → LLVM → SASS.</p> <p>Here’s what to expect:</p> <ul> <li><a href="#what-is-cutile"><strong>What is CuTile?</strong></a> — The tile-centric programming model</li> <li><a href="#running-example-moe-kernel"><strong>Running Example</strong></a> — An MoE kernel we’ll trace through every stage</li> <li><a href="#the-dialects"><strong>The Dialects</strong></a> — From <code class="language-plaintext highlighter-rouge">cuda_tile</code> through <code class="language-plaintext highlighter-rouge">nv_tileaa</code> and <code class="language-plaintext highlighter-rouge">nv_tileas</code> to NVVM/LLVM</li> <li><a href="#the-passes"><strong>The Passes</strong></a> — TileIR passes: what they do and when they run</li> </ul> <p><em>Based on CUDA 13.1.
Some details are undocumented and may change in future releases.</em></p> <h1 id="what-is-cutile">What is CuTile?</h1> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2026-01-29/cutile.png" width="100%" alt="" /> <div class="caption"> <em>CuTile separates user responsibility (splitting work into blocks and tiles) from system responsibility (mapping to threads) (Image source: <a href="https://youtu.be/_b4I4rKpsGA?t=406" rel="external nofollow noopener" target="_blank">GPU MODE</a>) </em> </div> </div> <p><a href="https://github.com/NVIDIA/cutile-python">CuTile</a> is NVIDIA’s new “tile-centric” programming model for modern NVIDIA GPUs. This abstraction is powerful: CuTile lets the programmer think in terms of tiles rather than threads, while the compiler handles the complexity of coordinating hundreds of threads across fragmented data. A single CuTile line <code class="language-plaintext highlighter-rouge">ct.mma(a, b, acc)</code> can expand into many tensor core instructions.</p> <h2 id="what-is-tileir">What is TileIR?</h2> <p>TileIR is NVIDIA’s MLIR-based compiler infrastructure that powers CuTile. It progressively lowers your high-level tensor operations through multiple MLIR dialects and NVIDIA-specific tools:</p> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2026-01-29/pipeline_overview.svg" width="100%" alt="" /> <div class="caption"> <em>TileIR compilation pipeline: Python → SASS </em> </div> </div> <p>The user-facing tool is <code class="language-plaintext highlighter-rouge">tileiras</code><span class="sidenote-ref"></span><span class="sidenote">Like <code class="language-plaintext highlighter-rouge">ptxas</code> but for TileIR.
Yes, NVIDIA named it “tile-ir-as” (tile IR assembler).</span>, which orchestrates this entire pipeline.</p> <hr /> <h1 id="running-example-moe-kernel">Running Example: MoE Kernel</h1> <p>Throughout this post, we’ll trace this <strong>MoE (Mixture of Experts) kernel</strong> through every compilation stage. This is code from <a href="https://github.com/NVIDIA/cutile-python/blob/main/samples/MoE.py">NVIDIA’s cutile-python samples</a><span class="sidenote-ref"></span><span class="sidenote">There’s also a C++ API: <a href="https://github.com/NVIDIA/cuda-tile">NVIDIA/cuda-tile</a>. Operations like <code class="language-plaintext highlighter-rouge">ct.gather</code>, <code class="language-plaintext highlighter-rouge">ct.mma</code>, and <code class="language-plaintext highlighter-rouge">cuda_tile.load_view_tko</code> are documented in the <a href="https://docs.nvidia.com/cuda/tile-ir/13.1/sections/operations.html">TileIR docs</a>.</span>:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">@ct.kernel</span> <span class="k">def</span> <span class="nf">fused_moe_kernel</span><span class="p">(</span> <span class="n">A</span><span class="p">,</span> <span class="c1"># Input tokens, shape (batch, K) </span> <span class="n">B</span><span class="p">,</span> <span class="c1"># Expert weights, shape (num_experts, N, K) </span> <span class="n">C</span><span class="p">,</span> <span class="c1"># Output tensor, shape (num_tokens * topk, N) </span> <span class="n">topk_weights</span><span class="p">,</span> <span class="c1"># Router weights for each token-expert pair </span> <span class="n">sorted_token_ids</span><span class="p">,</span> <span class="c1"># Token indices sorted by expert assignment </span> <span class="n">sorted_expert_ids</span><span class="p">,</span> <span class="c1"># Expert index for each TILE_M </span> <span class="n">num_token_replicas</span><span class="p">:</span> <span class="nb">int</span><span
class="p">,</span> <span class="n">mul_routed_weight</span><span class="p">:</span> <span class="n">ConstBool</span><span class="p">,</span> <span class="n">TILE_M</span><span class="p">:</span> <span class="n">ConstInt</span><span class="p">,</span> <span class="n">TILE_N</span><span class="p">:</span> <span class="n">ConstInt</span><span class="p">,</span> <span class="n">TILE_K</span><span class="p">:</span> <span class="n">ConstInt</span><span class="p">,</span> <span class="p">):</span> <span class="n">M</span> <span class="o">=</span> <span class="n">sorted_token_ids</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="n">N</span> <span class="o">=</span> <span class="n">B</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="n">K</span> <span class="o">=</span> <span class="n">B</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="n">GROUP_SIZE_M</span> <span class="o">=</span> <span class="mi">8</span> <span class="n">bid_m</span><span class="p">,</span> <span class="n">bid_n</span> <span class="o">=</span> <span class="nf">swizzle_2d</span><span class="p">(</span><span class="n">M</span><span class="p">,</span> <span class="n">N</span><span class="p">,</span> <span class="n">TILE_M</span><span class="p">,</span> <span class="n">TILE_N</span><span class="p">,</span> <span class="n">GROUP_SIZE_M</span><span class="p">)</span> <span class="c1"># → cuda_tile.get_tile_block_id </span> <span class="c1"># Gather token indices for this block </span> <span class="n">token_id_indices</span> <span class="o">=</span> <span class="n">bid_m</span> <span class="o">*</span> <span class="n">TILE_M</span> <span class="o">+</span> <span class="n">ct</span><span class="p">.</span><span 
class="nf">arange</span><span class="p">(</span><span class="n">TILE_M</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">ct</span><span class="p">.</span><span class="n">int32</span><span class="p">)</span> <span class="n">token_ids</span> <span class="o">=</span> <span class="n">ct</span><span class="p">.</span><span class="nf">gather</span><span class="p">(</span><span class="n">sorted_token_ids</span><span class="p">,</span> <span class="n">token_id_indices</span><span class="p">)</span> <span class="c1"># → cuda_tile.load_view_tko </span> <span class="n">a_row_indices</span> <span class="o">=</span> <span class="n">token_ids</span> <span class="o">//</span> <span class="n">num_token_replicas</span> <span class="n">expert_id</span> <span class="o">=</span> <span class="n">ct</span><span class="p">.</span><span class="nf">load</span><span class="p">(</span><span class="n">sorted_expert_ids</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="n">bid_m</span><span class="p">,</span> <span class="n">shape</span><span class="o">=</span><span class="p">())</span> <span class="c1"># → cuda_tile.load_ptr_tko </span> <span class="c1"># Initialize accumulator </span> <span class="n">accumulator</span> <span class="o">=</span> <span class="n">ct</span><span class="p">.</span><span class="nf">full</span><span class="p">((</span><span class="n">TILE_M</span><span class="p">,</span> <span class="n">TILE_N</span><span class="p">),</span> <span class="mf">0.0</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">ct</span><span class="p">.</span><span class="n">float32</span><span class="p">)</span> <span class="c1"># → cuda_tile.constant </span> <span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="nf">range</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span 
class="n">ct</span><span class="p">.</span><span class="nf">cdiv</span><span class="p">(</span><span class="n">K</span><span class="p">,</span> <span class="n">TILE_K</span><span class="p">)):</span> <span class="c1"># → cuda_tile.for </span> <span class="c1"># Load A tile (gathered by token indices) </span> <span class="n">a_col_indices</span> <span class="o">=</span> <span class="n">k</span> <span class="o">*</span> <span class="n">TILE_K</span> <span class="o">+</span> <span class="n">ct</span><span class="p">.</span><span class="nf">arange</span><span class="p">(</span><span class="n">TILE_K</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">ct</span><span class="p">.</span><span class="n">int32</span><span class="p">)</span> <span class="n">a</span> <span class="o">=</span> <span class="n">ct</span><span class="p">.</span><span class="nf">gather</span><span class="p">(</span><span class="n">A</span><span class="p">,</span> <span class="p">(</span><span class="n">a_row_indices</span><span class="p">[:,</span> <span class="bp">None</span><span class="p">],</span> <span class="n">a_col_indices</span><span class="p">[</span><span class="bp">None</span><span class="p">,</span> <span class="p">:]))</span> <span class="c1"># → cuda_tile.load_view_tko </span> <span class="c1"># Load B tile (expert weights) </span> <span class="n">b</span> <span class="o">=</span> <span class="n">ct</span><span class="p">.</span><span class="nf">load</span><span class="p">(</span><span class="n">B</span><span class="p">,</span> <span class="p">(</span><span class="n">expert_id</span><span class="p">,</span> <span class="n">k</span><span class="p">,</span> <span class="n">bid_n</span><span class="p">),</span> <span class="n">shape</span><span class="o">=</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">TILE_K</span><span class="p">,</span> <span class="n">TILE_N</span><span class="p">),</span> 
<span class="n">order</span><span class="o">=</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">1</span><span class="p">)).</span><span class="nf">reshape</span><span class="p">((</span><span class="n">TILE_K</span><span class="p">,</span> <span class="n">TILE_N</span><span class="p">))</span> <span class="c1"># → cuda_tile.load_ptr_tko </span> <span class="n">accumulator</span> <span class="o">=</span> <span class="n">ct</span><span class="p">.</span><span class="nf">mma</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">accumulator</span><span class="p">)</span> <span class="c1"># → cuda_tile.mmaf ← THE COMPUTE! </span> <span class="k">if</span> <span class="n">mul_routed_weight</span><span class="p">:</span> <span class="n">moe_weight</span> <span class="o">=</span> <span class="n">ct</span><span class="p">.</span><span class="nf">gather</span><span class="p">(</span><span class="n">topk_weights</span><span class="p">,</span> <span class="n">token_ids</span><span class="p">)</span> <span class="n">accumulator</span> <span class="o">=</span> <span class="n">accumulator</span> <span class="o">*</span> <span class="n">moe_weight</span><span class="p">[:,</span> <span class="bp">None</span><span class="p">]</span> <span class="c1"># → cuda_tile.mulf </span> <span class="c1"># Scatter results back to output </span> <span class="n">c_col_indices</span> <span class="o">=</span> <span class="n">bid_n</span> <span class="o">*</span> <span class="n">TILE_N</span> <span class="o">+</span> <span class="n">ct</span><span class="p">.</span><span class="nf">arange</span><span class="p">(</span><span class="n">TILE_N</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">ct</span><span class="p">.</span><span class="n">int32</span><span 
class="p">)</span> <span class="n">accumulator</span> <span class="o">=</span> <span class="n">ct</span><span class="p">.</span><span class="nf">astype</span><span class="p">(</span><span class="n">accumulator</span><span class="p">,</span> <span class="n">C</span><span class="p">.</span><span class="n">dtype</span><span class="p">)</span> <span class="c1"># → cuda_tile.ftof </span> <span class="n">ct</span><span class="p">.</span><span class="nf">scatter</span><span class="p">(</span><span class="n">C</span><span class="p">,</span> <span class="p">(</span><span class="n">token_ids</span><span class="p">[:,</span> <span class="bp">None</span><span class="p">],</span> <span class="n">c_col_indices</span><span class="p">[</span><span class="bp">None</span><span class="p">,</span> <span class="p">:]),</span> <span class="n">accumulator</span><span class="p">)</span> <span class="c1"># → cuda_tile.store_ptr_tko </span></code></pre></div></div> <p><strong>The three key operations we’ll trace:</strong></p> <table style="width: 100%; border-collapse: collapse; font-family: monospace; font-size: 0.9em;"> <thead> <tr style="background: linear-gradient(135deg, #1a1a2e 0%, #16213e 100%);"> <th style="padding: 12px; text-align: left; color: #76b900; border-bottom: 2px solid #76b900;">Python</th> <th style="padding: 12px; text-align: left; color: #76b900; border-bottom: 2px solid #76b900;">cuda_tile</th> <th style="padding: 12px; text-align: left; color: #76b900; border-bottom: 2px solid #76b900;">What it does</th> </tr> </thead> <tbody> <tr style="background: rgba(118, 185, 0, 0.1);"> <td style="padding: 10px; border-bottom: 1px solid #333;">ct.gather(A, indices)</td> <td style="padding: 10px; border-bottom: 1px solid #333;">load_view_tko</td> <td style="padding: 10px; border-bottom: 1px solid #333; font-family: sans-serif;">Gather tokens by expert assignment (indirect load)</td> </tr> <tr style="background: rgba(0, 150, 255, 0.1);"> <td style="padding: 10px; border-bottom: 
1px solid #333;">ct.load(B, ...)</td> <td style="padding: 10px; border-bottom: 1px solid #333;">load_ptr_tko</td> <td style="padding: 10px; border-bottom: 1px solid #333; font-family: sans-serif;">Load expert weights (direct load)</td> </tr> <tr style="background: rgba(255, 100, 100, 0.1);"> <td style="padding: 10px;">ct.mma(a, b, acc)</td> <td style="padding: 10px;">mmaf</td> <td style="padding: 10px; font-family: sans-serif;">Matrix multiply-accumulate on tensor cores</td> </tr> </tbody> </table> <p>Watch how these transform through <code class="language-plaintext highlighter-rouge">nv_tileaa</code>, <code class="language-plaintext highlighter-rouge">nv_tileas</code> and finally to SASS instructions.</p> <hr /> <h1 id="compiling-with-tileiras">Compiling with tileiras</h1> <p>The <code class="language-plaintext highlighter-rouge">tileiras</code> command-line tool is the ahead-of-time compiler that transforms <code class="language-plaintext highlighter-rouge">.cutile</code> bytecode into GPU binaries.</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>tileiras <span class="nt">--gpu-name</span> sm_120 MoE.cutile <span class="nt">-o</span> moe.cubin </code></pre></div></div> <h2 id="undocumented-environment-variables">Undocumented Environment Variables</h2> <p>These TileIR-specific environment variables affect compilation:</p> <link rel="stylesheet" href="/assets/css/fancy_table.css" /> <script> function toggleCalculations(tableId) { // ... 
(toggleCalculations function remains the same) const table = document.getElementById(tableId); const cells = table.querySelectorAll('.has-calculation'); const tableWrapper = document.getElementById(tableId + '-wrapper'); const toggleSwitch = document.getElementById(tableId + '-switch'); const toggleText = document.getElementById(tableId + '-toggle-text'); if (tableWrapper.classList.contains('show-calculations')) { tableWrapper.classList.remove('show-calculations'); toggleSwitch.classList.remove('active'); toggleText.textContent = "Show calculations"; cells.forEach(cell => { const normalText = cell.querySelector('.normal-text'); const calculationText = cell.querySelector('.calculation-text'); normalText.style.display = 'inline'; calculationText.style.display = 'none'; }); } else { tableWrapper.classList.add('show-calculations'); toggleSwitch.classList.add('active'); toggleText.textContent = "Hide calculations"; cells.forEach(cell => { const normalText = cell.querySelector('.normal-text'); const calculationText = cell.querySelector('.calculation-text'); normalText.style.display = 'none'; calculationText.style.display = 'inline'; }); } } function isConsideredYellow(colorString) { if (!colorString) return false; const lowerColor = colorString.toLowerCase(); const yellowKeywords = ['yellow', 'gold', 'lemon', 'chiffon', 'goldenrod', 'papayawhip', 'moccasin', 'khaki', '#ff0', '#ffff00', '#ffffe0', '#fffacd', '#fafad2', '#fff8dc', '#eee8aa', '#f0e68c']; if (yellowKeywords.some(k => lowerColor.includes(k))) { return true; } const match = lowerColor.match(/rgba?\((\d+),\s*(\d+),\s*(\d+)/); if (match) { const r = parseInt(match[1]); const g = parseInt(match[2]); const b = parseInt(match[3]); if (r > 200 && g > 180 && b < 200 && Math.abs(r - g) < 70) { return true; } } if (lowerColor === '#ffdab9') return true; // PeachPuff if (lowerColor === 'rgba(255,255,0,0.2)') return true; // Example: semi-transparent yellow return false; } function initializeCellHighlighters() { const 
highlighters = document.querySelectorAll('[data-highlight-cells]'); highlighters.forEach(highlighter => { if (highlighter.dataset.highlighterInitialized === 'true') return; highlighter.dataset.highlighterInitialized = 'true'; const cellIdsToHighlight = highlighter.dataset.highlightCells.split(',').map(id => id.trim()); const cellElements = cellIdsToHighlight.map(id => document.getElementById(id)).filter(el => el); let descriptiveSpans = []; if (highlighter.matches('span[data-hover-text-color]')) { descriptiveSpans.push(highlighter); } else { descriptiveSpans = Array.from(highlighter.querySelectorAll('span[data-hover-text-color]')); } descriptiveSpans.forEach(span => { if (typeof span.dataset.originalParaTextColor === 'undefined') { span.dataset.originalParaTextColor = window.getComputedStyle(span).color || ''; } }); highlighter.addEventListener('mouseenter', () => { highlighter.classList.add('trigger-text-active'); descriptiveSpans.forEach(span => { if (span.dataset.hoverTextColor) { span.style.color = span.dataset.hoverTextColor; } }); cellElements.forEach((cell, idx) => { cell.classList.add('cell-highlighted'); let hoverTextColorForCell = null; let hoverCellBgFromDescSpan = null; if (idx < descriptiveSpans.length) { const descSpan = descriptiveSpans[idx]; if (descSpan.dataset.hoverTextColor) { hoverTextColorForCell = descSpan.dataset.hoverTextColor; } if (descSpan.dataset.hoverCellBg) { hoverCellBgFromDescSpan = descSpan.dataset.hoverCellBg; } } else if (descriptiveSpans.length === 1 && cellElements.length > 1) { const descSpan = descriptiveSpans[0]; if (descSpan.dataset.hoverTextColor) { hoverTextColorForCell = descSpan.dataset.hoverTextColor; } if (descSpan.dataset.hoverCellBg) { hoverCellBgFromDescSpan = descSpan.dataset.hoverCellBg; } } if (hoverTextColorForCell) { const isCalcCell = cell.classList.contains('has-calculation'); if (isCalcCell) { const normalTextSpan = cell.querySelector('.normal-text'); const calcResultSpan = 
cell.querySelector('.calc-result'); const calcWrapper = cell.querySelector('.calculation-text'); if (calcWrapper && window.getComputedStyle(calcWrapper).display !== 'none') { if (calcResultSpan) calcResultSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } else { if (normalTextSpan) normalTextSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } } else { const directSpan = cell.querySelector(':scope > span'); if (directSpan) directSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } } if (hoverCellBgFromDescSpan && isConsideredYellow(hoverCellBgFromDescSpan)) { if (typeof cell.dataset.originalBgColor === 'undefined') { cell.dataset.originalBgColor = cell.style.backgroundColor; } cell.style.backgroundColor = hoverCellBgFromDescSpan; cell.dataset.bgChangedByScript = 'true'; } }); }); highlighter.addEventListener('mouseleave', () => { highlighter.classList.remove('trigger-text-active'); descriptiveSpans.forEach(span => { if (typeof span.dataset.originalParaTextColor !== 'undefined') { span.style.color = span.dataset.originalParaTextColor; } }); cellElements.forEach((cell, idx) => { cell.classList.remove('cell-highlighted'); const isCalcCell = cell.classList.contains('has-calculation'); if (isCalcCell) { const normalTextSpan = cell.querySelector('.normal-text'); const calcResultSpan = cell.querySelector('.calc-result'); if (normalTextSpan) normalTextSpan.style.removeProperty('color'); if (calcResultSpan) calcResultSpan.style.removeProperty('color'); } else { const directSpan = cell.querySelector(':scope > span'); if (directSpan) directSpan.style.removeProperty('color'); } if (cell.dataset.bgChangedByScript === 'true') { cell.style.backgroundColor = cell.dataset.originalBgColor || ''; delete cell.dataset.originalBgColor; delete cell.dataset.bgChangedByScript; } }); }); }); } if (document.readyState === 'loading') { document.addEventListener('DOMContentLoaded', initializeCellHighlighters); } else { 
initializeCellHighlighters(); } </script> <style> /* ... (Other styles remain the same) ... */ .toggle-container { display: flex; justify-content: flex-end; margin-bottom: 6px; } .calc-toggle { display: inline-flex; align-items: center; } .toggle-switch { position: relative; display: inline-block; width: 32px; height: 16px; background-color: #e5e7eb; /* gray-200 */ border-radius: 10px; transition: all 0.3s; cursor: pointer; } .toggle-switch::after { content: ''; position: absolute; width: 12px; height: 12px; border-radius: 50%; background-color: white; top: 2px; left: 2px; transition: all 0.3s cubic-bezier(0.175, 0.885, 0.32, 1.275); box-shadow: 0 1px 2px rgba(0, 0, 0, 0.1); } .toggle-switch.active { background-color: #8b5cf6; /* purple-500 */ } .toggle-switch.active::after { left: 18px; } .toggle-label { position: absolute; width: 1px; height: 1px; padding: 0; margin: -1px; overflow: hidden; clip: rect(0, 0, 0, 0); white-space: nowrap; border-width: 0; } .toggle-text { font-size: 0.75rem; color: #6b7280; /* gray-500 */ margin-right: 6px; } .calc-result { color: rgb(75, 85, 99); /* gray-700 */ } .calc-formula { color: #6b83a6; /* Subtle bluish gray */ font-family: ui-monospace, SFMono-Regular, Menlo, Monaco, Consolas, monospace; } .calc-equals { color: rgb(75, 85, 99); /* gray-700 */ display: inline-block; margin: 0 4px; font-weight: normal; } .cell-highlighted { font-weight: bold !important; } .trigger-text-active { background-color: #E5E7EB; padding: 1px 3px; border-radius: 3px; transition: background-color 0.15s ease-in-out; } .trigger-text-active span[data-hover-text-color] { padding: 0; background-color: transparent; } /* Add styles for no-scroll option */ .table-wrapper-no-scroll { overflow-x: visible !important; } .table-no-scroll td { white-space: normal !important; word-break: break-word; } </style> <div id="env-vars-table-wrapper" class="px-4 rounded-lg __basic-table not-prose mt-4 mb-8 overflow-x-auto"> <table id="env-vars-table" class="min-w-full 
divide-y divide-gray-200 font-sans basic-table-striped"> <thead class="bg-gray-50"> <tr> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Variable </th> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Description </th> </tr> </thead> <tbody> <tr class="border-b border-gray-200"> <td id="env-vars-table-row0-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">TILEIR_ALWAYS_SWIZZLE</span> </td> <td id="env-vars-table-row0-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Force swizzle mode</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="env-vars-table-row1-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">TILEIR_PREFER_TMA_FOR_LOAD_STORE</span> </td> <td id="env-vars-table-row1-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Prefer TMA for all load/store operations</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="env-vars-table-row2-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">TILEIR_DELAY_TMA_STORE_WAIT</span> </td> <td id="env-vars-table-row2-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Delay TMA store wait (optimization for overlapping compute)</span> </td> </tr> </tbody> </table> </div> <h2 id="interesting-undocumented-cli-options">Interesting undocumented CLI options</h2> <p>The <code class="language-plaintext highlighter-rouge">--print-before-all</code> flag dumps LLVM IR before each compilation pass.</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span 
class="nv">$ </span>tileiras <span class="nt">--print-before-all</span> <span class="nt">--gpu-name</span><span class="o">=</span>sm_120 MoE.cutile <span class="nt">-o</span> moe.cubin 2&gt;&amp;1 </code></pre></div></div> <div class="language-llvm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">***</span> <span class="err">IR</span> <span class="err">Dump</span> <span class="err">Before</span> <span class="err">Add</span> <span class="err">__emutls_</span><span class="p">[</span><span class="err">vt</span><span class="p">].</span> <span class="err">variables</span> <span class="err">for</span> <span class="err">emultated</span> <span class="err">TLS</span> <span class="err">model</span> <span class="p">(</span><span class="err">lower-emutls</span><span class="p">)</span> <span class="p">***</span> <span class="c1">; ModuleID = 'LLVMDialectModule'</span> <span class="k">source_filename</span> <span class="p">=</span> <span class="s">"LLVMDialectModule"</span> <span class="k">target</span> <span class="k">datalayout</span> <span class="p">=</span> <span class="s">"e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128"</span> <span class="vg">@__CUDA_TILEIR_FUNC_NAME_0</span> <span class="p">=</span> <span class="k">internal</span> <span class="k">constant</span> <span class="p">[</span><span class="m">17</span> <span class="p">x</span> <span class="kt">i8</span><span class="p">]</span> <span class="s">c"fused_moe_kernel\00"</span> <span class="p">...</span> </code></pre></div></div> <details> <summary><strong>All LLVM passes dumped (27 unique passes)</strong></summary> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>*** IR Dump Before Add __emutls_[vt]. 
variables for emultated TLS model (lower-emutls) *** *** IR Dump Before Canonicalize natural loops (loop-simplify) *** *** IR Dump Before CodeGen Prepare (codegenprepare) *** *** IR Dump Before Constant Hoisting (consthoist) *** *** IR Dump Before Exception handling preparation (dwarf-eh-prepare) *** *** IR Dump Before Expand Atomic instructions (atomic-expand) *** *** IR Dump Before Expand fp (expand-fp) *** *** IR Dump Before Expand indirectbr instructions (indirectbr-expand) *** *** IR Dump Before Expand large div/rem (expand-large-div-rem) *** *** IR Dump Before Expand memcmp() to load/stores (expand-memcmp) *** *** IR Dump Before Expand reduction intrinsics (expand-reductions) *** *** IR Dump Before Instrument function entry/exit with calls to e.g. mcount() (post-inline-ee-instrument) *** *** IR Dump Before Interleaved Access Pass (interleaved-access) *** *** IR Dump Before Lower AMX intrinsics (lower-amx-intrinsics) *** *** IR Dump Before Lower AMX type for load/store (lower-amx-type) *** *** IR Dump Before Lower Garbage Collection Instructions (gc-lowering) *** *** IR Dump Before Merge contiguous icmps into a memcmp (mergeicmps) *** *** IR Dump Before ObjC ARC contraction (objc-arc-contract) *** *** IR Dump Before Partially inline calls to library functions (partially-inline-libcalls) *** *** IR Dump Before Pre-ISel Intrinsic Lowering (pre-isel-intrinsic-lowering) *** *** IR Dump Before Prepare callbr (callbrprepare) *** *** IR Dump Before Remove unreachable blocks from the CFG (unreachableblockelim) *** *** IR Dump Before Replace intrinsics with calls to vector library (replace-with-veclib) *** *** IR Dump Before Safe Stack instrumentation pass (safe-stack) *** *** IR Dump Before Scalarize Masked Memory Intrinsics (scalarize-masked-mem-intrin) *** *** IR Dump Before Shadow Stack GC Lowering (shadow-stack-gc-lowering) *** *** IR Dump Before X86 Partial Reduction (x86-partial-reduction) *** </code></pre></div> </div> </details> <hr /> <h1 
id="pipeline-overview">Pipeline Overview</h1> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2026-01-29/pipeline_overview.svg" width="100%" alt="" /> <div class="caption"> <em>TileIR compilation pipeline: Python → SASS </em> </div> </div> <!-- Excalidraw diagram: Pipeline Flow - Python → cuda_tile → nv_tileaa → nv_tileas → NVVM → LLVM → PTX → SASS --> <p>TileIR takes your CuTile Python code through a series of progressive lowerings:</p> <link rel="stylesheet" href="/assets/css/fancy_table.css" /> <script> function toggleCalculations(tableId) { // ... (toggleCalculations function remains the same) const table = document.getElementById(tableId); const cells = table.querySelectorAll('.has-calculation'); const tableWrapper = document.getElementById(tableId + '-wrapper'); const toggleSwitch = document.getElementById(tableId + '-switch'); const toggleText = document.getElementById(tableId + '-toggle-text'); if (tableWrapper.classList.contains('show-calculations')) { tableWrapper.classList.remove('show-calculations'); toggleSwitch.classList.remove('active'); toggleText.textContent = "Show calculations"; cells.forEach(cell => { const normalText = cell.querySelector('.normal-text'); const calculationText = cell.querySelector('.calculation-text'); normalText.style.display = 'inline'; calculationText.style.display = 'none'; }); } else { tableWrapper.classList.add('show-calculations'); toggleSwitch.classList.add('active'); toggleText.textContent = "Hide calculations"; cells.forEach(cell => { const normalText = cell.querySelector('.normal-text'); const calculationText = cell.querySelector('.calculation-text'); normalText.style.display = 'none'; calculationText.style.display = 'inline'; }); } } function isConsideredYellow(colorString) { if (!colorString) return false; const lowerColor = colorString.toLowerCase(); const yellowKeywords = ['yellow', 'gold', 'lemon', 'chiffon', 'goldenrod', 'papayawhip', 'moccasin', 'khaki', '#ff0', '#ffff00', '#ffffe0', 
'#fffacd', '#fafad2', '#fff8dc', '#eee8aa', '#f0e68c']; if (yellowKeywords.some(k => lowerColor.includes(k))) { return true; } const match = lowerColor.match(/rgba?\((\d+),\s*(\d+),\s*(\d+)/); if (match) { const r = parseInt(match[1]); const g = parseInt(match[2]); const b = parseInt(match[3]); if (r > 200 && g > 180 && b < 200 && Math.abs(r - g) < 70) { return true; } } if (lowerColor === '#ffdab9') return true; // PeachPuff if (lowerColor === 'rgba(255,255,0,0.2)') return true; // Example: semi-transparent yellow return false; } function initializeCellHighlighters() { const highlighters = document.querySelectorAll('[data-highlight-cells]'); highlighters.forEach(highlighter => { if (highlighter.dataset.highlighterInitialized === 'true') return; highlighter.dataset.highlighterInitialized = 'true'; const cellIdsToHighlight = highlighter.dataset.highlightCells.split(',').map(id => id.trim()); const cellElements = cellIdsToHighlight.map(id => document.getElementById(id)).filter(el => el); let descriptiveSpans = []; if (highlighter.matches('span[data-hover-text-color]')) { descriptiveSpans.push(highlighter); } else { descriptiveSpans = Array.from(highlighter.querySelectorAll('span[data-hover-text-color]')); } descriptiveSpans.forEach(span => { if (typeof span.dataset.originalParaTextColor === 'undefined') { span.dataset.originalParaTextColor = window.getComputedStyle(span).color || ''; } }); highlighter.addEventListener('mouseenter', () => { highlighter.classList.add('trigger-text-active'); descriptiveSpans.forEach(span => { if (span.dataset.hoverTextColor) { span.style.color = span.dataset.hoverTextColor; } }); cellElements.forEach((cell, idx) => { cell.classList.add('cell-highlighted'); let hoverTextColorForCell = null; let hoverCellBgFromDescSpan = null; if (idx < descriptiveSpans.length) { const descSpan = descriptiveSpans[idx]; if (descSpan.dataset.hoverTextColor) { hoverTextColorForCell = descSpan.dataset.hoverTextColor; } if (descSpan.dataset.hoverCellBg) { 
hoverCellBgFromDescSpan = descSpan.dataset.hoverCellBg; } } else if (descriptiveSpans.length === 1 && cellElements.length > 1) { const descSpan = descriptiveSpans[0]; if (descSpan.dataset.hoverTextColor) { hoverTextColorForCell = descSpan.dataset.hoverTextColor; } if (descSpan.dataset.hoverCellBg) { hoverCellBgFromDescSpan = descSpan.dataset.hoverCellBg; } } if (hoverTextColorForCell) { const isCalcCell = cell.classList.contains('has-calculation'); if (isCalcCell) { const normalTextSpan = cell.querySelector('.normal-text'); const calcResultSpan = cell.querySelector('.calc-result'); const calcWrapper = cell.querySelector('.calculation-text'); if (calcWrapper && window.getComputedStyle(calcWrapper).display !== 'none') { if (calcResultSpan) calcResultSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } else { if (normalTextSpan) normalTextSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } } else { const directSpan = cell.querySelector(':scope > span'); if (directSpan) directSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } } if (hoverCellBgFromDescSpan && isConsideredYellow(hoverCellBgFromDescSpan)) { if (typeof cell.dataset.originalBgColor === 'undefined') { cell.dataset.originalBgColor = cell.style.backgroundColor; } cell.style.backgroundColor = hoverCellBgFromDescSpan; cell.dataset.bgChangedByScript = 'true'; } }); }); highlighter.addEventListener('mouseleave', () => { highlighter.classList.remove('trigger-text-active'); descriptiveSpans.forEach(span => { if (typeof span.dataset.originalParaTextColor !== 'undefined') { span.style.color = span.dataset.originalParaTextColor; } }); cellElements.forEach((cell, idx) => { cell.classList.remove('cell-highlighted'); const isCalcCell = cell.classList.contains('has-calculation'); if (isCalcCell) { const normalTextSpan = cell.querySelector('.normal-text'); const calcResultSpan = cell.querySelector('.calc-result'); if (normalTextSpan) 
normalTextSpan.style.removeProperty('color'); if (calcResultSpan) calcResultSpan.style.removeProperty('color'); } else { const directSpan = cell.querySelector(':scope > span'); if (directSpan) directSpan.style.removeProperty('color'); } if (cell.dataset.bgChangedByScript === 'true') { cell.style.backgroundColor = cell.dataset.originalBgColor || ''; delete cell.dataset.originalBgColor; delete cell.dataset.bgChangedByScript; } }); }); }); } if (document.readyState === 'loading') { document.addEventListener('DOMContentLoaded', initializeCellHighlighters); } else { initializeCellHighlighters(); } </script> <style> /* ... (Other styles remain the same) ... */ .toggle-container { display: flex; justify-content: flex-end; margin-bottom: 6px; } .calc-toggle { display: inline-flex; align-items: center; } .toggle-switch { position: relative; display: inline-block; width: 32px; height: 16px; background-color: #e5e7eb; /* gray-200 */ border-radius: 10px; transition: all 0.3s; cursor: pointer; } .toggle-switch::after { content: ''; position: absolute; width: 12px; height: 12px; border-radius: 50%; background-color: white; top: 2px; left: 2px; transition: all 0.3s cubic-bezier(0.175, 0.885, 0.32, 1.275); box-shadow: 0 1px 2px rgba(0, 0, 0, 0.1); } .toggle-switch.active { background-color: #8b5cf6; /* purple-500 */ } .toggle-switch.active::after { left: 18px; } .toggle-label { position: absolute; width: 1px; height: 1px; padding: 0; margin: -1px; overflow: hidden; clip: rect(0, 0, 0, 0); white-space: nowrap; border-width: 0; } .toggle-text { font-size: 0.75rem; color: #6b7280; /* gray-500 */ margin-right: 6px; } .calc-result { color: rgb(75, 85, 99); /* gray-700 */ } .calc-formula { color: #6b83a6; /* Subtle bluish gray */ font-family: ui-monospace, SFMono-Regular, Menlo, Monaco, Consolas, monospace; } .calc-equals { color: rgb(75, 85, 99); /* gray-700 */ display: inline-block; margin: 0 4px; font-weight: normal; } .cell-highlighted { font-weight: bold !important; } 
.trigger-text-active { background-color: #E5E7EB; padding: 1px 3px; border-radius: 3px; transition: background-color 0.15s ease-in-out; } .trigger-text-active span[data-hover-text-color] { padding: 0; background-color: transparent; } /* Add styles for no-scroll option */ .table-wrapper-no-scroll { overflow-x: visible !important; } .table-no-scroll td { white-space: normal !important; word-break: break-word; } </style> <div id="pipeline-stages-table-wrapper" class="px-4 rounded-lg __basic-table not-prose mt-4 mb-8 overflow-x-auto"> <table id="pipeline-stages-table" class="min-w-full divide-y divide-gray-200 font-sans basic-table-striped"> <thead class="bg-gray-50"> <tr> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Stage </th> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Format </th> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Description </th> </tr> </thead> <tbody> <tr class="border-b border-gray-200"> <td id="pipeline-stages-table-row0-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Python</span> </td> <td id="pipeline-stages-table-row0-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">CuTile API</span> </td> <td id="pipeline-stages-table-row0-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">High-level tensor operations (make_tensor_view; mmaf)</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="pipeline-stages-table-row1-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">.cutile</span> </td> <td id="pipeline-stages-table-row1-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) 
!important;">Bytecode</span> </td> <td id="pipeline-stages-table-row1-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Serialized representation of the kernel</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="pipeline-stages-table-row2-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">cuda_tile</span> </td> <td id="pipeline-stages-table-row2-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">MLIR Dialect</span> </td> <td id="pipeline-stages-table-row2-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">High-level tensor ops; architecture-independent</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="pipeline-stages-table-row3-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">nv_tileaa</span> </td> <td id="pipeline-stages-table-row3-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">MLIR Dialect</span> </td> <td id="pipeline-stages-table-row3-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Tile-level ops; explicit memory references</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="pipeline-stages-table-row4-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">nv_tileas</span> </td> <td id="pipeline-stages-table-row4-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">MLIR Dialect</span> </td> <td id="pipeline-stages-table-row4-col2" class="px-4 py-2 
whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Scheduled ops; async pipelines</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="pipeline-stages-table-row5-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">LLVM/NVVM</span> </td> <td id="pipeline-stages-table-row5-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">LLVM IR</span> </td> <td id="pipeline-stages-table-row5-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Standard LLVM with NVIDIA intrinsics</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="pipeline-stages-table-row6-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">PTX</span> </td> <td id="pipeline-stages-table-row6-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Assembly</span> </td> <td id="pipeline-stages-table-row6-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Virtual GPU assembly</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="pipeline-stages-table-row7-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">SASS</span> </td> <td id="pipeline-stages-table-row7-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Machine Code</span> </td> <td id="pipeline-stages-table-row7-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Native GPU instructions (sm_120)</span> </td> </tr> </tbody> 
</table> </div> <p>Each stage removes abstraction and adds architecture-specific detail. By the time we reach SASS, every memory access pattern, tensor core instruction, and synchronization barrier is explicit.</p> <hr /> <h1 id="the-dialects">The Dialects</h1> <p>TileIR uses three main MLIR dialects to represent computations at different abstraction levels. Let’s trace our MoE kernel through each one:</p> <!-- Excalidraw diagram: MoE Operation Mapping - shows gather/load/mma traced through each dialect --> <div id="moe-ops-table-wrapper" class="px-4 rounded-lg __basic-table not-prose mt-4 mb-8 overflow-x-auto"> <table id="moe-ops-table" class="min-w-full divide-y divide-gray-200 font-sans basic-table-striped"> <thead class="bg-gray-50"> <tr> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Python </th> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> cuda_tile </th> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> nv_tileaa </th> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> nv_tileas </th> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> SASS </th> </tr> </thead> <tbody> <tr class="border-b border-gray-200"> <td id="moe-ops-table-row0-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">ct.gather(A&#44; idx)</span> </td> <td id="moe-ops-table-row0-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">load_view_tko</span> </td> <td id="moe-ops-table-row0-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">tileaa.load_view</span> </td> <td id="moe-ops-table-row0-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">tileas.utcpglobalmem</span> </td> <td
id="moe-ops-table-row0-col4" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">UTCPMULTI / LDG</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="moe-ops-table-row1-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">ct.load(B&#44; ...)</span> </td> <td id="moe-ops-table-row1-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">load_ptr_tko</span> </td> <td id="moe-ops-table-row1-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">tileaa.load_tko</span> </td> <td id="moe-ops-table-row1-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">tileas.tcgen05_ld</span> </td> <td id="moe-ops-table-row1-col4" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">TCGEN05.LD.S</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="moe-ops-table-row2-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">ct.mma(a&#44; b&#44; c)</span> </td> <td id="moe-ops-table-row2-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">mmaf</span> </td> <td id="moe-ops-table-row2-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">tileaa.mmaf_tko</span> </td> <td id="moe-ops-table-row2-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">tileas.tcgen05_mma</span> </td> <td id="moe-ops-table-row2-col4" class="px-4 py-2 whitespace-nowrap text-sm font-medium 
 <span style=">
text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">TCGEN05.MMA</span> </td> </tr> </tbody> </table> </div> <h2 id="cuda_tile-high-level-tensor-operations">cuda_tile: High-Level Tensor Operations</h2> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2026-01-29/cuda_tile_dialect.svg" width="100%" alt="" /> <div class="caption"> <em>cuda_tile dialect operations </em> </div> </div> <p>The <code class="language-plaintext highlighter-rouge">cuda_tile</code> dialect is closest to your Python code. Operations work on abstract tensor views without worrying about memory layout or hardware details.</p> <p><strong>Key operations:</strong></p> <ul> <li><code class="language-plaintext highlighter-rouge">make_tensor_view</code> - Create a view into a tensor with shape and strides</li> <li><code class="language-plaintext highlighter-rouge">get_tile_block_id</code> - Get the current thread block’s position in the grid</li> <li><code class="language-plaintext highlighter-rouge">load_view_tko</code> / <code class="language-plaintext highlighter-rouge">store_view_tko</code> - Load/store tiles with token-based ordering</li> <li><code class="language-plaintext highlighter-rouge">mmaf</code> - Matrix multiply-accumulate (targets tensor cores)</li> <li><code class="language-plaintext highlighter-rouge">for</code> / <code class="language-plaintext highlighter-rouge">continue</code> - Loop constructs for K-dimension iteration</li> </ul> <h3 id="moe-in-cuda_tile">MoE in cuda_tile</h3> <p>Recall our <a href="#running-example-moe-kernel">MoE kernel above</a>. Here’s how the key operations map to <code class="language-plaintext highlighter-rouge">cuda_tile</code> IR:</p> <p><strong>Python → cuda_tile mapping:</strong></p> <div id="python-ir-mapping-table-wrapper" class="px-4 rounded-lg __basic-table not-prose mt-4 mb-8 overflow-x-auto"> <table id="python-ir-mapping-table"
class="min-w-full divide-y divide-gray-200 font-sans basic-table-striped"> <thead class="bg-gray-50"> <tr> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Python (CuTile) </th> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> cuda_tile IR </th> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Purpose </th> </tr> </thead> <tbody> <tr class="border-b border-gray-200"> <td id="python-ir-mapping-table-row0-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">ct.gather()</span> </td> <td id="python-ir-mapping-table-row0-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">load_view_tko</span> </td> <td id="python-ir-mapping-table-row0-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Gather elements by indices</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="python-ir-mapping-table-row1-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">ct.load()</span> </td> <td id="python-ir-mapping-table-row1-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">load_ptr_tko</span> </td> <td id="python-ir-mapping-table-row1-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Load contiguous tile from memory</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="python-ir-mapping-table-row2-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">ct.mma()</span> </td> <td id="python-ir-mapping-table-row2-col1" class="px-4 py-2 whitespace-nowrap text-sm 
font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">mmaf</span> </td> <td id="python-ir-mapping-table-row2-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Matrix multiply-accumulate (tensor cores)</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="python-ir-mapping-table-row3-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">ct.scatter()</span> </td> <td id="python-ir-mapping-table-row3-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">store_ptr_tko</span> </td> <td id="python-ir-mapping-table-row3-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Scatter elements to output</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="python-ir-mapping-table-row4-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">ct.full()</span> </td> <td id="python-ir-mapping-table-row4-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">constant</span> </td> <td id="python-ir-mapping-table-row4-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Initialize accumulator</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="python-ir-mapping-table-row5-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">for k in range()</span> </td> <td id="python-ir-mapping-table-row5-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">for/continue</span> </td> <td 
id="python-ir-mapping-table-row5-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">K-dimension iteration loop</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="python-ir-mapping-table-row6-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">ct.astype()</span> </td> <td id="python-ir-mapping-table-row6-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">ftof</span> </td> <td id="python-ir-mapping-table-row6-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Type conversion (F32 → output dtype)</span> </td> </tr> </tbody> </table> </div> <details style="font-size: 1.2em; margin: 1em 0;"> <summary style="cursor: pointer; font-weight: bold;">Expand to see cuda_tile IR from MoE kernel key sections</summary> <div class="language-llvm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="err">//</span> <span class="err">cuda_tile</span> <span class="err">dialect</span> <span class="err">-</span> <span class="err">MoE</span> <span class="err">kernel</span> <span class="nv">%1</span> <span class="p">=</span> <span class="s">"cuda_tile.constant"</span><span class="p">()</span> <span class="err">:</span> <span class="p">()</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">//</span> <span class="err">TILE_M</span> <span class="nv">%2</span> <span class="p">=</span> <span class="s">"cuda_tile.constant"</span><span class="p">()</span> <span class="err">:</span> <span class="p">()</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span 
class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">//</span> <span class="err">TILE_N</span> <span class="nv">%3</span> <span class="p">=</span> <span class="s">"cuda_tile.constant"</span><span class="p">()</span> <span class="err">:</span> <span class="p">()</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">//</span> <span class="err">TILE_K</span> <span class="nv">%4</span> <span class="p">=</span> <span class="s">"cuda_tile.assume"</span><span class="p">(</span><span class="nv">%arg0</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%5</span> <span class="p">=</span> <span class="s">"cuda_tile.assume"</span><span class="p">(</span><span class="nv">%arg1</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%10</span> <span class="p">=</span> <span class="s">"cuda_tile.make_tensor_view"</span><span class="p">(</span><span class="nv">%4</span><span class="p">,</span> <span class="nv">%5</span><span class="p">,</span> <span class="nv">%6</span><span class="p">,</span> <span class="nv">%7</span><span class="p">,</span> <span class="nv">%8</span><span class="p">,</span> <span 
class="nv">%9</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="kt">token</span><span class="p">)</span> <span class="nv">%11</span> <span class="p">=</span> <span class="s">"cuda_tile.make_tensor_view"</span><span class="p">(</span><span class="nv">%arg2</span><span class="p">,</span> <span class="nv">%arg3</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="kt">token</span><span class="p">)</span> <span class="nv">%12</span> <span class="p">=</span> <span class="s">"cuda_tile.make_token"</span><span class="p">()</span> <span class="err">:</span> <span class="p">()</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="nv">%20</span><span class="p">,</span> 
<span class="nv">%21</span><span class="p">,</span> <span class="nv">%22</span> <span class="p">=</span> <span class="s">"cuda_tile.get_tile_block_id"</span><span class="p">()</span> <span class="err">:</span> <span class="p">()</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%23</span> <span class="p">=</span> <span class="s">"cuda_tile.divi"</span><span class="p">(</span><span class="nv">%4</span><span class="p">,</span> <span class="nv">%1</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">//</span> <span class="err">M</span> <span class="err">/</span> <span class="err">TILE_M</span> <span class="nv">%24</span> <span class="p">=</span> <span class="s">"cuda_tile.muli"</span><span class="p">(</span><span class="nv">%1</span><span class="p">,</span> <span class="nv">%23</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span 
class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%25</span> <span class="p">=</span> <span class="s">"cuda_tile.divi"</span><span class="p">(</span><span class="nv">%20</span><span class="p">,</span> <span class="nv">%24</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%30</span> <span class="p">=</span> <span class="s">"cuda_tile.remi"</span><span class="p">(</span><span class="nv">%20</span><span class="p">,</span> <span class="nv">%25</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">//</span> <span class="err">expert</span> <span class="err">routing</span> <span class="nv">%31</span> <span class="p">=</span> <span class="s">"cuda_tile.cmpi"</span><span class="p">(</span><span class="nv">%30</span><span class="p">,</span> <span class="nv">%1</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span 
class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%32</span> <span class="p">=</span> <span class="s">"cuda_tile.select"</span><span class="p">(</span><span class="nv">%31</span><span class="p">,</span> <span class="nv">%30</span><span class="p">,</span> <span class="nv">%25</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%40</span> <span class="p">=</span> <span class="s">"cuda_tile.iota"</span><span class="p">()</span> <span class="err">:</span> <span class="p">()</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%41</span> <span class="p">=</span> <span class="s">"cuda_tile.reshape"</span><span class="p">(</span><span class="nv">%24</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%42</span> <span class="p">=</span> <span class="s">"cuda_tile.broadcast"</span><span class="p">(</span><span 
class="nv">%41</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%43</span> <span class="p">=</span> <span class="s">"cuda_tile.addi"</span><span class="p">(</span><span class="nv">%42</span><span class="p">,</span> <span class="nv">%40</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%44</span> <span class="p">=</span> <span class="s">"cuda_tile.offset"</span><span class="p">(</span><span class="nv">%42</span><span class="p">,</span> <span class="nv">%43</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%50</span><span class="p">,</span> <span class="nv">%51</span> <span class="p">=</span> <span class="s">"cuda_tile.load_ptr_tko"</span><span class="p">(</span><span class="nv">%44</span><span class="p">,</span> <span class="nv">%31</span><span class="p">,</span> 
<span class="nv">%42</span><span class="p">,</span> <span class="nv">%12</span><span class="p">)</span> <span class="err">//</span> <span class="err">ct</span><span class="p">.</span><span class="k">load</span><span class="p">()</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="nv">%52</span> <span class="p">=</span> <span class="s">"cuda_tile.make_partition_view"</span><span class="p">(</span><span class="nv">%10</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="kt">token</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">part</span><span class="p">)</span> <span class="nv">%53</span><span class="p">,</span> <span class="nv">%54</span> <span class="p">=</span> <span class="s">"cuda_tile.load_view_tko"</span><span class="p">(</span><span class="nv">%52</span><span class="p">,</span> <span class="nv">%43</span><span class="p">,</span> <span class="nv">%12</span><span class="p">)</span> <span class="err">//</span> <span class="err">ct</span><span class="p">.</span><span class="err">gather</span><span class="p">()</span> <span 
class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">part</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="nv">%60</span> <span class="p">=</span> <span class="s">"cuda_tile.for"</span><span class="p">(</span><span class="nv">%1</span><span class="p">,</span> <span class="nv">%23</span><span class="p">,</span> <span class="nv">%3</span><span class="p">,</span> <span class="nv">%arg4</span><span class="p">)</span> <span class="p">{</span><span class="m">1</span> <span class="err">regions</span><span class="p">}</span> <span class="err">//</span> <span class="nl">K-loop :</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%61</span> <span class="p">=</span> <span class="s">"cuda_tile.muli"</span><span class="p">(</span><span class="nv">%iter</span><span class="p">,</span> <span class="nv">%3</span><span class="p">)</span> <span 
class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%62</span> <span class="p">=</span> <span class="s">"cuda_tile.broadcast"</span><span class="p">(</span><span class="nv">%61</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%63</span><span class="p">,</span> <span class="nv">%64</span> <span class="p">=</span> <span class="s">"cuda_tile.load_ptr_tko"</span><span class="p">(</span><span class="nv">%62</span><span class="p">,</span> <span class="nv">%31</span><span class="p">,</span> <span class="nv">%42</span><span class="p">,</span> <span class="nv">%12</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span 
class="err">ct</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="nv">%65</span><span class="p">,</span> <span class="nv">%66</span> <span class="p">=</span> <span class="s">"cuda_tile.load_view_tko"</span><span class="p">(</span><span class="nv">%52</span><span class="p">,</span> <span class="nv">%62</span><span class="p">,</span> <span class="nv">%12</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">part</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="nv">%67</span> <span class="p">=</span> <span class="s">"cuda_tile.mmaf"</span><span class="p">(</span><span class="nv">%63</span><span class="p">,</span> <span class="nv">%65</span><span class="p">,</span> <span class="nv">%acc</span><span class="p">)</span> <span class="err">//</span> <span class="err">ct</span><span class="p">.</span><span class="err">mma</span><span class="p">()</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span 
class="err">view</span><span class="p">)</span> <span class="s">"cuda_tile.continue"</span><span class="p">(</span><span class="nv">%67</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">()</span> <span class="nv">%70</span> <span class="p">=</span> <span class="s">"cuda_tile.ftof"</span><span class="p">(</span><span class="nv">%60</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">//</span> <span class="err">ct</span><span class="p">.</span><span class="err">astype</span><span class="p">()</span> <span class="nv">%71</span> <span class="p">=</span> <span class="s">"cuda_tile.store_ptr_tko"</span><span class="p">(</span><span class="nv">%44</span><span class="p">,</span> <span class="nv">%70</span><span class="p">,</span> <span class="nv">%31</span><span class="p">,</span> <span class="nv">%12</span><span class="p">)</span> <span class="err">//</span> <span class="err">ct</span><span class="p">.</span><span class="err">scatter</span><span class="p">()</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span 
class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="s">"cuda_tile.return"</span><span class="p">()</span> </code></pre></div> </div> </details> <h2 id="nv_tileaa">nv_tileaa</h2> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2026-01-29/nv_tileaa_dialect.svg" width="100%" alt="" /> <div class="caption"> <em>nv_tileaa dialect operations </em> </div> </div> <p>The <code class="language-plaintext highlighter-rouge">nv_tileaa</code> dialect lowers tensor views to concrete memory references. This is the first stage where explicit memory operations appear.</p> <p><strong>Key changes from cuda_tile:</strong></p> <ul> <li><code class="language-plaintext highlighter-rouge">make_tensor_view</code> → <code class="language-plaintext highlighter-rouge">make_memref</code> (explicit memory references)</li> <li><code class="language-plaintext highlighter-rouge">get_tile_block_id</code> → <code class="language-plaintext highlighter-rouge">get_program_id</code> (program-centric naming)</li> <li><code class="language-plaintext highlighter-rouge">mmaf</code> → <code class="language-plaintext highlighter-rouge">dot</code> (matrix multiply-accumulate expressed as a dot op with an explicit accumulator operand)</li> <li>Explicit <code class="language-plaintext highlighter-rouge">tiled_load</code> / <code class="language-plaintext highlighter-rouge">tiled_store</code> with memory tokens</li> <li>New ops: <code class="language-plaintext highlighter-rouge">splat</code>, <code class="language-plaintext highlighter-rouge">broadcast</code>, <code class="language-plaintext highlighter-rouge">addptr</code> for memory address calculations</li> </ul> <details style="font-size: 1.2em; margin: 1em 0;"> <summary style="cursor: pointer; font-weight: bold;">Expand to see nv_tileaa IR from MoE kernel key sections</summary> <div class="language-llvm highlighter-rouge"><div class="highlight"><pre
class="highlight"><code><span class="err">//</span> <span class="err">nv_tileaa</span> <span class="err">dialect</span> <span class="err">-</span> <span class="err">MoE</span> <span class="err">kernel</span> <span class="err">//</span> <span class="err">Tile-level</span> <span class="err">ops</span> <span class="p">(</span><span class="err">architecture-independent</span><span class="p">)</span> <span class="s">"nv_tileaa.func"</span><span class="p">()</span> <span class="p">{</span><span class="err">nv_tileaa</span><span class="p">.</span><span class="err">kernel_spec</span><span class="p">}</span> <span class="p">{</span><span class="m">1</span> <span class="err">regions</span><span class="p">}</span> <span class="err">//</span> <span class="err">Input</span> <span class="err">validation</span> <span class="nv">%1</span> <span class="p">=</span> <span class="s">"nv_tileaa.assume"</span><span class="p">(</span><span class="nv">%arg0</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">aa</span><span class="p">.</span><span class="err">memref</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">aa</span><span class="p">.</span><span class="err">memref</span><span class="p">)</span> <span class="nv">%2</span> <span class="p">=</span> <span class="s">"nv_tileaa.assume"</span><span class="p">(</span><span class="nv">%arg1</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="nv">%3</span> <span class="p">=</span> <span class="s">"nv_tileaa.assume"</span><span class="p">(</span><span class="nv">%2</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> 
<span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">//</span> <span class="nl">Splat:</span> <span class="err">scalar</span> <span class="err">→</span> <span class="err">tensor</span> <span class="p">(</span><span class="err">for</span> <span class="err">broadcasting</span><span class="p">)</span> <span class="nv">%10</span> <span class="p">=</span> <span class="s">"nv_tileaa.splat"</span><span class="p">(</span><span class="nv">%3</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%11</span> <span class="p">=</span> <span class="s">"nv_tileaa.splat"</span><span class="p">(</span><span class="nv">%2</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">//</span> <span class="err">Memory</span> <span class="err">reference</span> <span class="err">creation</span> <span class="p">(</span><span class="err">lowered</span> <span class="k">from</span> <span class="err">make_tensor_view</span><span class="p">)</span> <span class="nv">%20</span> <span class="p">=</span> <span class="s">"nv_tileaa.make_memref"</span><span class="p">(</span><span class="nv">%1</span><span class="p">,</span> <span class="nv">%2</span><span class="p">,</span> <span class="nv">%3</span><span class="p">,</span> <span class="nv">%4</span><span class="p">,</span> <span class="nv">%5</span><span class="p">,</span> <span class="nv">%6</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span 
class="err">aa</span><span class="p">.</span><span class="err">memref</span><span class="p">,</span> <span class="err">iN</span><span class="p">,</span> <span class="err">iN</span><span class="p">,</span> <span class="err">iN</span><span class="p">,</span> <span class="err">iN</span><span class="p">,</span> <span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">aa</span><span class="p">.</span><span class="err">btile</span><span class="p">)</span> <span class="nv">%21</span> <span class="p">=</span> <span class="s">"nv_tileaa.make_memref"</span><span class="p">(</span><span class="nv">%1</span><span class="p">,</span> <span class="nv">%2</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">aa</span><span class="p">.</span><span class="err">memref</span><span class="p">,</span> <span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">aa</span><span class="p">.</span><span class="err">btile</span><span class="p">)</span> <span class="nv">%22</span> <span class="p">=</span> <span class="s">"nv_tileaa.create_mem_token"</span><span class="p">()</span> <span class="err">:</span> <span class="p">()</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">aa</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="err">//</span> <span class="err">Program</span> <span class="err">indexing</span> <span class="nv">%30</span> <span class="p">=</span> <span class="s">"nv_tileaa.get_program_id"</span><span class="p">()</span> <span class="err">:</span> <span class="p">()</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="nv">%31</span> <span class="p">=</span> <span 
class="s">"nv_tileaa.splat"</span><span class="p">(</span><span class="nv">%30</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%32</span> <span class="p">=</span> <span class="s">"nv_tileaa.make_range"</span><span class="p">(</span><span class="nv">%c0</span><span class="p">,</span> <span class="nv">%c128</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">,</span> <span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%33</span> <span class="p">=</span> <span class="s">"nv_tileaa.extract"</span><span class="p">(</span><span class="nv">%32</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">//</span> <span class="err">Pointer</span> <span class="err">arithmetic</span> <span class="nv">%40</span> <span class="p">=</span> <span class="s">"nv_tileaa.splat"</span><span class="p">(</span><span class="nv">%1</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">aa</span><span class="p">.</span><span class="err">memref</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%41</span> <span class="p">=</span> <span class="s">"nv_tileaa.addptr"</span><span class="p">(</span><span 
class="nv">%40</span><span class="p">,</span> <span class="nv">%33</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">//</span> <span class="err">Masked</span> <span class="err">loads</span> <span class="nv">%50</span><span class="p">,</span> <span class="nv">%51</span> <span class="p">=</span> <span class="s">"nv_tileaa.load"</span><span class="p">(</span><span class="nv">%41</span><span class="p">,</span> <span class="nv">%mask</span><span class="p">,</span> <span class="nv">%c0</span><span class="p">,</span> <span class="nv">%22</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">aa</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">aa</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="err">//</span> <span class="err">Tiled</span> <span class="err">memory</span> <span class="err">operations</span> <span class="nv">%60</span> <span class="p">=</span> <span class="s">"nv_tileaa.block_tile"</span><span class="p">(</span><span class="nv">%20</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">aa</span><span class="p">.</span><span class="err">btile</span><span class="p">)</span> <span class="err">-</span><span 
class="p">&gt;</span> <span class="p">(</span><span class="err">aa</span><span class="p">.</span><span class="err">mtoken</span><span class="p">)</span> <span class="nv">%61</span> <span class="p">=</span> <span class="s">"nv_tileaa.extract"</span><span class="p">(</span><span class="nv">%32</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="nv">%62</span><span class="p">,</span> <span class="nv">%63</span> <span class="p">=</span> <span class="s">"nv_tileaa.tiled_load"</span><span class="p">(</span><span class="nv">%60</span><span class="p">,</span> <span class="nv">%61</span><span class="p">,</span> <span class="nv">%22</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">aa</span><span class="p">.</span><span class="err">mtoken</span><span class="p">,</span> <span class="err">iN</span><span class="p">,</span> <span class="err">aa</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">aa</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="nv">%64</span> <span class="p">=</span> <span class="s">"nv_tileaa.view"</span><span class="p">(</span><span class="nv">%62</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">//</span> <span class="err">Shape</span> <span class="err">manipulation</span> <span 
class="nv">%70</span> <span class="p">=</span> <span class="s">"nv_tileaa.expand_dims"</span><span class="p">(</span><span class="nv">%33</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%71</span> <span class="p">=</span> <span class="s">"nv_tileaa.broadcast"</span><span class="p">(</span><span class="nv">%70</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">//</span> <span class="err">DOT</span> <span class="err">OPERATION</span> <span class="p">(</span><span class="err">lowered</span> <span class="k">from</span> <span class="err">cuda_tile</span><span class="p">.</span><span class="err">mmaf</span><span class="p">)</span> <span class="nv">%80</span> <span class="p">=</span> <span class="s">"nv_tileaa.dot"</span><span class="p">(</span><span class="nv">%50</span><span class="p">,</span> <span class="nv">%64</span><span class="p">,</span> <span class="nv">%acc</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">//</span> <span class="err">Output</span> <span class="nv">%90</span> <span class="p">=</span> <span class="s">"nv_tileaa.fp_to_fp"</span><span 
class="p">(</span><span class="nv">%80</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%91</span> <span class="p">=</span> <span class="s">"nv_tileaa.store"</span><span class="p">(</span><span class="nv">%41</span><span class="p">,</span> <span class="nv">%90</span><span class="p">,</span> <span class="nv">%mask</span><span class="p">,</span> <span class="nv">%22</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">aa</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">aa</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="s">"nv_tileaa.return"</span><span class="p">()</span> </code></pre></div> </div> <p><strong>Key transformations from cuda_tile → nv_tileaa:</strong></p> <link rel="stylesheet" href="/assets/css/fancy_table.css" /> <script> function toggleCalculations(tableId) { // ... 
(toggleCalculations function remains the same) const table = document.getElementById(tableId); const cells = table.querySelectorAll('.has-calculation'); const tableWrapper = document.getElementById(tableId + '-wrapper'); const toggleSwitch = document.getElementById(tableId + '-switch'); const toggleText = document.getElementById(tableId + '-toggle-text'); if (tableWrapper.classList.contains('show-calculations')) { tableWrapper.classList.remove('show-calculations'); toggleSwitch.classList.remove('active'); toggleText.textContent = "Show calculations"; cells.forEach(cell => { const normalText = cell.querySelector('.normal-text'); const calculationText = cell.querySelector('.calculation-text'); normalText.style.display = 'inline'; calculationText.style.display = 'none'; }); } else { tableWrapper.classList.add('show-calculations'); toggleSwitch.classList.add('active'); toggleText.textContent = "Hide calculations"; cells.forEach(cell => { const normalText = cell.querySelector('.normal-text'); const calculationText = cell.querySelector('.calculation-text'); normalText.style.display = 'none'; calculationText.style.display = 'inline'; }); } } function isConsideredYellow(colorString) { if (!colorString) return false; const lowerColor = colorString.toLowerCase(); const yellowKeywords = ['yellow', 'gold', 'lemon', 'chiffon', 'goldenrod', 'papayawhip', 'moccasin', 'khaki', '#ff0', '#ffff00', '#ffffe0', '#fffacd', '#fafad2', '#fff8dc', '#eee8aa', '#f0e68c']; if (yellowKeywords.some(k => lowerColor.includes(k))) { return true; } const match = lowerColor.match(/rgba?\((\d+),\s*(\d+),\s*(\d+)/); if (match) { const r = parseInt(match[1]); const g = parseInt(match[2]); const b = parseInt(match[3]); if (r > 200 && g > 180 && b < 200 && Math.abs(r - g) < 70) { return true; } } if (lowerColor === '#ffdab9') return true; // PeachPuff if (lowerColor === 'rgba(255,255,0,0.2)') return true; // Example: semi-transparent yellow return false; } function initializeCellHighlighters() { const 
highlighters = document.querySelectorAll('[data-highlight-cells]'); highlighters.forEach(highlighter => { if (highlighter.dataset.highlighterInitialized === 'true') return; highlighter.dataset.highlighterInitialized = 'true'; const cellIdsToHighlight = highlighter.dataset.highlightCells.split(',').map(id => id.trim()); const cellElements = cellIdsToHighlight.map(id => document.getElementById(id)).filter(el => el); let descriptiveSpans = []; if (highlighter.matches('span[data-hover-text-color]')) { descriptiveSpans.push(highlighter); } else { descriptiveSpans = Array.from(highlighter.querySelectorAll('span[data-hover-text-color]')); } descriptiveSpans.forEach(span => { if (typeof span.dataset.originalParaTextColor === 'undefined') { span.dataset.originalParaTextColor = window.getComputedStyle(span).color || ''; } }); highlighter.addEventListener('mouseenter', () => { highlighter.classList.add('trigger-text-active'); descriptiveSpans.forEach(span => { if (span.dataset.hoverTextColor) { span.style.color = span.dataset.hoverTextColor; } }); cellElements.forEach((cell, idx) => { cell.classList.add('cell-highlighted'); let hoverTextColorForCell = null; let hoverCellBgFromDescSpan = null; if (idx < descriptiveSpans.length) { const descSpan = descriptiveSpans[idx]; if (descSpan.dataset.hoverTextColor) { hoverTextColorForCell = descSpan.dataset.hoverTextColor; } if (descSpan.dataset.hoverCellBg) { hoverCellBgFromDescSpan = descSpan.dataset.hoverCellBg; } } else if (descriptiveSpans.length === 1 && cellElements.length > 1) { const descSpan = descriptiveSpans[0]; if (descSpan.dataset.hoverTextColor) { hoverTextColorForCell = descSpan.dataset.hoverTextColor; } if (descSpan.dataset.hoverCellBg) { hoverCellBgFromDescSpan = descSpan.dataset.hoverCellBg; } } if (hoverTextColorForCell) { const isCalcCell = cell.classList.contains('has-calculation'); if (isCalcCell) { const normalTextSpan = cell.querySelector('.normal-text'); const calcResultSpan = 
cell.querySelector('.calc-result'); const calcWrapper = cell.querySelector('.calculation-text'); if (calcWrapper && window.getComputedStyle(calcWrapper).display !== 'none') { if (calcResultSpan) calcResultSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } else { if (normalTextSpan) normalTextSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } } else { const directSpan = cell.querySelector(':scope > span'); if (directSpan) directSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } } if (hoverCellBgFromDescSpan && isConsideredYellow(hoverCellBgFromDescSpan)) { if (typeof cell.dataset.originalBgColor === 'undefined') { cell.dataset.originalBgColor = cell.style.backgroundColor; } cell.style.backgroundColor = hoverCellBgFromDescSpan; cell.dataset.bgChangedByScript = 'true'; } }); }); highlighter.addEventListener('mouseleave', () => { highlighter.classList.remove('trigger-text-active'); descriptiveSpans.forEach(span => { if (typeof span.dataset.originalParaTextColor !== 'undefined') { span.style.color = span.dataset.originalParaTextColor; } }); cellElements.forEach((cell, idx) => { cell.classList.remove('cell-highlighted'); const isCalcCell = cell.classList.contains('has-calculation'); if (isCalcCell) { const normalTextSpan = cell.querySelector('.normal-text'); const calcResultSpan = cell.querySelector('.calc-result'); if (normalTextSpan) normalTextSpan.style.removeProperty('color'); if (calcResultSpan) calcResultSpan.style.removeProperty('color'); } else { const directSpan = cell.querySelector(':scope > span'); if (directSpan) directSpan.style.removeProperty('color'); } if (cell.dataset.bgChangedByScript === 'true') { cell.style.backgroundColor = cell.dataset.originalBgColor || ''; delete cell.dataset.originalBgColor; delete cell.dataset.bgChangedByScript; } }); }); }); } if (document.readyState === 'loading') { document.addEventListener('DOMContentLoaded', initializeCellHighlighters); } else { 
initializeCellHighlighters(); } </script> <style> /* ... (Other styles remain the same) ... */ .toggle-container { display: flex; justify-content: flex-end; margin-bottom: 6px; } .calc-toggle { display: inline-flex; align-items: center; } .toggle-switch { position: relative; display: inline-block; width: 32px; height: 16px; background-color: #e5e7eb; /* gray-200 */ border-radius: 10px; transition: all 0.3s; cursor: pointer; } .toggle-switch::after { content: ''; position: absolute; width: 12px; height: 12px; border-radius: 50%; background-color: white; top: 2px; left: 2px; transition: all 0.3s cubic-bezier(0.175, 0.885, 0.32, 1.275); box-shadow: 0 1px 2px rgba(0, 0, 0, 0.1); } .toggle-switch.active { background-color: #8b5cf6; /* purple-500 */ } .toggle-switch.active::after { left: 18px; } .toggle-label { position: absolute; width: 1px; height: 1px; padding: 0; margin: -1px; overflow: hidden; clip: rect(0, 0, 0, 0); white-space: nowrap; border-width: 0; } .toggle-text { font-size: 0.75rem; color: #6b7280; /* gray-500 */ margin-right: 6px; } .calc-result { color: rgb(75, 85, 99); /* gray-700 */ } .calc-formula { color: #6b83a6; /* Subtle bluish gray */ font-family: ui-monospace, SFMono-Regular, Menlo, Monaco, Consolas, monospace; } .calc-equals { color: rgb(75, 85, 99); /* gray-700 */ display: inline-block; margin: 0 4px; font-weight: normal; } .cell-highlighted { font-weight: bold !important; } .trigger-text-active { background-color: #E5E7EB; padding: 1px 3px; border-radius: 3px; transition: background-color 0.15s ease-in-out; } .trigger-text-active span[data-hover-text-color] { padding: 0; background-color: transparent; } /* Add styles for no-scroll option */ .table-wrapper-no-scroll { overflow-x: visible !important; } .table-no-scroll td { white-space: normal !important; word-break: break-word; } </style> <div id="dialect-comparison-table-wrapper" class="px-4 rounded-lg __basic-table not-prose mt-4 mb-8 overflow-x-auto"> <table id="dialect-comparison-table" 
class="min-w-full divide-y divide-gray-200 font-sans basic-table-striped"> <thead class="bg-gray-50"> <tr> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> cuda_tile </th> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> nv_tileaa </th> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Change </th> </tr> </thead> <tbody> <tr class="border-b border-gray-200"> <td id="dialect-comparison-table-row0-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">make_tensor_view</span> </td> <td id="dialect-comparison-table-row0-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">make_memref</span> </td> <td id="dialect-comparison-table-row0-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Abstract view → concrete memory ref</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="dialect-comparison-table-row1-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">get_tile_block_id</span> </td> <td id="dialect-comparison-table-row1-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">get_program_id</span> </td> <td id="dialect-comparison-table-row1-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Tile-centric → program-centric naming</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="dialect-comparison-table-row2-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">mmaf</span> </td> <td id="dialect-comparison-table-row2-col1" class="px-4 py-2 
whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">dot</span> </td> <td id="dialect-comparison-table-row2-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">High-level MMA → explicit dot product</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="dialect-comparison-table-row3-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">load_view_tko</span> </td> <td id="dialect-comparison-table-row3-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">tiled_load + view</span> </td> <td id="dialect-comparison-table-row3-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Decomposed into separate ops</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="dialect-comparison-table-row4-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">ct.view types</span> </td> <td id="dialect-comparison-table-row4-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">tensor&lt;...&gt;</span> </td> <td id="dialect-comparison-table-row4-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Abstract → explicit tensor shapes</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="dialect-comparison-table-row5-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">ct.token</span> </td> <td id="dialect-comparison-table-row5-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) 
!important;">aa.btile; aa.mtoken</span> </td> <td id="dialect-comparison-table-row5-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Generic token → specialized memory token types</span> </td> </tr> </tbody> </table> </div> <p><strong>Pass #12 observation:</strong> The 32 <code class="language-plaintext highlighter-rouge">fp_to_fp</code> operations suggest this MoE kernel produces 32 output tiles that need precision conversion from the F32 accumulator to the output dtype.</p> </details> <h2 id="nv_tileas">nv_tileas</h2> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2026-01-29/nv_tileas_tcgen05.svg" width="100%" alt="" /> <div class="caption"> <em>nv_tileas dialect with tcgen05 operations </em> </div> </div> <p>The <code class="language-plaintext highlighter-rouge">nv_tileas</code> dialect is where architecture-specific code generation happens.</p> <p>This dialect introduces:</p> <p><strong>Async Pipeline Operations:</strong></p> <ul> <li><code class="language-plaintext highlighter-rouge">async.pipeline.create_pipeline</code> - Create a software pipeline for overlapping compute with memory transfers</li> <li><code class="language-plaintext highlighter-rouge">producer_acquire</code> / <code class="language-plaintext highlighter-rouge">producer_commit</code> - Acquire and commit pipeline stages</li> <li><code class="language-plaintext highlighter-rouge">consumer_wait</code> / <code class="language-plaintext highlighter-rouge">consumer_release</code> - Synchronize consumers with producers</li> </ul> <p><strong>Tensor Memory Operations:</strong></p> <ul> <li><code class="language-plaintext highlighter-rouge">tcgen05.alloc</code> - Allocate dedicated tensor memory</li> <li><code class="language-plaintext highlighter-rouge">tmem_load</code> / <code class="language-plaintext highlighter-rouge">tmem_store</code> - Access tensor memory</li> </ul> <p><strong>Tensor Core Operations:</strong></p> <ul> <li><code
class="language-plaintext highlighter-rouge">tcgen05.mma</code> - Matrix Multiply-Accumulate</li> <li><code class="language-plaintext highlighter-rouge">block_scaled_mma</code> - Block-scaled MMA for mixed precision</li> <li><code class="language-plaintext highlighter-rouge">mma.fence</code> - Memory fence for MMA operations</li> </ul> <details style="font-size: 1.2em; margin: 1em 0;"> <summary style="cursor: pointer; font-weight: bold;">Expand to see key sections of the nv_tileas IR from the MoE kernel</summary> <div class="language-llvm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="err">//</span> <span class="err">nv_tileas</span> <span class="err">dialect</span> <span class="err">-</span> <span class="err">MoE</span> <span class="err">kernel</span> <span class="err">//</span> <span class="err">Tile-level</span> <span class="err">Scheduled</span> <span class="err">Assembly</span> <span class="err">//</span> <span class="err">Layout</span> <span class="err">conversion</span> <span class="k">and</span> <span class="err">view</span> <span class="err">operations</span> <span class="nv">%1</span><span class="p">,</span> <span class="nv">%2</span> <span class="p">=</span> <span class="s">"nv_tileas.load"</span><span class="p">(</span><span class="nv">%ptr</span><span class="p">,</span> <span class="nv">%mask</span><span class="p">,</span> <span class="nv">%c0</span><span class="p">,</span> <span class="nv">%token</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">aa</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span
class="err">aa</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="nv">%3</span><span class="p">,</span> <span class="nv">%4</span> <span class="p">=</span> <span class="s">"nv_tileas.tiled_load"</span><span class="p">(</span><span class="nv">%btile</span><span class="p">,</span> <span class="nv">%idx</span><span class="p">,</span> <span class="nv">%token</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">aa</span><span class="p">.</span><span class="err">mtoken</span><span class="p">,</span> <span class="err">iN</span><span class="p">,</span> <span class="err">aa</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">aa</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="nv">%5</span> <span class="p">=</span> <span class="s">"nv_tileas.view"</span><span class="p">(</span><span class="nv">%3</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">//</span> <span class="err">Convert</span> <span class="err">layout</span> <span class="err">for</span> <span class="err">tensor</span> <span class="err">cores</span> <span class="nv">%10</span> <span class="p">=</span> <span class="s">"nv_tileas.convert_layout"</span><span class="p">(</span><span class="nv">%bcast</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span 
class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%11</span> <span class="p">=</span> <span class="s">"nv_tileas.convert_layout"</span><span class="p">(</span><span class="nv">%5</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%12</span> <span class="p">=</span> <span class="s">"nv_tileas.convert_layout"</span><span class="p">(</span><span class="nv">%1</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">//</span> <span class="err">DOT</span> <span class="err">with</span> <span class="err">input</span> <span class="err">allowances</span> <span class="nv">%20</span> <span class="p">=</span> <span class="s">"nv_tileas.dot"</span><span class="p">(</span><span class="nv">%10</span><span class="p">,</span> <span class="nv">%11</span><span class="p">,</span> <span class="nv">%12</span><span class="p">,</span> <span class="nv">%c1</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">//</span> <span class="err">TMA</span> <span class="err">descriptor</span> <span class="nv">%25</span> <span 
class="p">=</span> <span class="s">"nv_tileas.make_tiled_tma_desc"</span><span class="p">(</span><span class="nv">%memref</span><span class="p">)</span> <span class="p">{</span><span class="err">tmaIdx</span><span class="p">=</span><span class="m">0</span><span class="p">}</span> <span class="err">:</span> <span class="p">(</span><span class="err">aa</span><span class="p">.</span><span class="err">btile</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="nv">!tma.desc</span><span class="p">)</span> <span class="err">//</span> <span class="err">ASYNC</span> <span class="err">PIPELINE</span> <span class="p">(</span><span class="err">producer-consumer</span> <span class="err">model</span><span class="p">)</span> <span class="err">//</span> <span class="err">Pipeline</span> <span class="k">and</span> <span class="err">iterator</span> <span class="err">creation</span> <span class="nv">%30</span> <span class="p">=</span> <span class="s">"nv_tileas.async.pipeline.create_pipeline"</span><span class="p">()</span> <span class="err">:</span> <span class="p">()</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="nv">!pipeline</span><span class="p">)</span> <span class="nv">%31</span> <span class="p">=</span> <span class="s">"nv_tileas.async.pipeline.create_pipeline"</span><span class="p">()</span> <span class="err">:</span> <span class="p">()</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="nv">!pipeline</span><span class="p">)</span> <span class="nv">%32</span> <span class="p">=</span> <span class="s">"nv_tileas.async.pipeline.create_iterator"</span><span class="p">(</span><span class="nv">%30</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="nv">!pipeline</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span 
class="nv">!iter</span><span class="p">)</span> <span class="nv">%33</span> <span class="p">=</span> <span class="s">"nv_tileas.async.pipeline.create_iterator"</span><span class="p">(</span><span class="nv">%31</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="nv">!pipeline</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="nv">!iter</span><span class="p">)</span> <span class="err">//</span> <span class="err">Agent</span> <span class="k">switch</span> <span class="p">(</span><span class="m">4</span> <span class="err">regions</span> <span class="err">for</span> <span class="err">producer/consumer</span> <span class="err">roles</span><span class="p">)</span> <span class="s">"nv_tileas.async.pipeline.agent_switch"</span><span class="p">(</span><span class="nv">%arg0</span><span class="p">,</span> <span class="nv">%30</span><span class="p">,</span> <span class="nv">%32</span><span class="p">,</span> <span class="nv">%31</span><span class="p">,</span> <span class="nv">%33</span><span class="p">)</span> <span class="p">{</span><span class="m">4</span> <span class="err">regions</span><span class="p">}</span> <span class="err">:</span> <span class="p">(</span><span class="err">aa</span><span class="p">.</span><span class="err">memref</span><span class="p">,</span> <span class="nv">!pipeline</span><span class="p">,</span> <span class="nv">!iter</span><span class="p">,</span> <span class="nv">!pipeline</span><span class="p">,</span> <span class="nv">!iter</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">()</span> <span class="err">//</span> <span class="err">Tensor</span> <span class="err">allocation</span> <span class="p">(</span><span class="kt">double</span><span class="err">-buffering</span><span class="p">)</span> <span class="nv">%40</span> <span class="p">=</span> <span 
class="s">"nv_tileas.alloc_tensor"</span><span class="p">()</span> <span class="err">:</span> <span class="p">()</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;</span><span class="m">128</span><span class="p">x</span><span class="m">64</span><span class="p">x</span><span class="err">bf16</span><span class="p">&gt;)</span> <span class="nv">%41</span> <span class="p">=</span> <span class="s">"nv_tileas.alloc_tensor"</span><span class="p">()</span> <span class="err">:</span> <span class="p">()</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;</span><span class="m">64</span><span class="p">x</span><span class="m">128</span><span class="p">x</span><span class="err">bf16</span><span class="p">&gt;)</span> <span class="err">//</span> <span class="err">Slice</span> <span class="err">operations</span> <span class="nv">%50</span> <span class="p">=</span> <span class="s">"nv_tileas.extract_slice"</span><span class="p">(</span><span class="nv">%40</span><span class="p">,</span> <span class="nv">%c0</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%51</span> <span class="p">=</span> <span class="s">"nv_tileas.insert_slice"</span><span class="p">(</span><span class="nv">%data</span><span class="p">,</span> <span class="nv">%40</span><span class="p">,</span> <span class="nv">%c0</span><span class="p">,</span> <span class="nv">%c64</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span 
class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">iN</span><span class="p">,</span> <span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">//</span> <span class="nl">PRODUCER:</span> <span class="k">acquire</span> <span class="err">→</span> <span class="err">write</span> <span class="err">→</span> <span class="err">commit</span> <span class="nv">%60</span> <span class="p">=</span> <span class="s">"nv_tileas.async.pipeline.producer_acquire"</span><span class="p">(</span><span class="nv">%30</span><span class="p">,</span> <span class="nv">%32</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="nv">!pipeline</span><span class="p">,</span> <span class="nv">!iter</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="nv">!stage</span><span class="p">)</span> <span class="nv">%61</span> <span class="p">=</span> <span class="s">"nv_tileas.async.pipeline.producer_write"</span><span class="p">(</span><span class="nv">%60</span><span class="p">,</span> <span class="nv">%30</span><span class="p">)</span> <span class="p">{</span><span class="m">1</span> <span class="err">regions</span><span class="p">}</span> <span class="err">:</span> <span class="p">(</span><span class="nv">!stage</span><span class="p">,</span> <span class="nv">!pipeline</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="nv">!stage</span><span class="p">)</span> <span class="nv">%62</span> <span class="p">=</span> <span class="s">"nv_tileas.async.load"</span><span class="p">(</span><span class="nv">%51</span><span class="p">,</span> <span class="nv">%ptr</span><span class="p">,</span> <span class="nv">%mask</span><span class="p">,</span> <span 
class="nv">%c16</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="nv">!async</span><span class="p">)</span> <span class="s">"nv_tileas.async.pipeline.yield"</span><span class="p">(</span><span class="nv">%62</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="nv">!async</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">()</span> <span class="s">"nv_tileas.async.pipeline.producer_commit"</span><span class="p">(</span><span class="nv">%30</span><span class="p">,</span> <span class="nv">%61</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="nv">!pipeline</span><span class="p">,</span> <span class="nv">!stage</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">()</span> <span class="err">//</span> <span class="nl">CONSUMER:</span> <span class="err">wait</span> <span class="err">→</span> <span class="err">read</span> <span class="err">→</span> <span class="k">release</span> <span class="nv">%70</span> <span class="p">=</span> <span class="s">"nv_tileas.async.pipeline.consumer_wait"</span><span class="p">(</span><span class="nv">%31</span><span class="p">,</span> <span class="nv">%33</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="nv">!pipeline</span><span class="p">,</span> <span class="nv">!iter</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="nv">!stage</span><span class="p">)</span> 
<span class="nv">%71</span><span class="p">,</span> <span class="nv">%72</span> <span class="p">=</span> <span class="s">"nv_tileas.async.pipeline.consumer_read"</span><span class="p">(</span><span class="nv">%70</span><span class="p">,</span> <span class="nv">%31</span><span class="p">)</span> <span class="p">{</span><span class="m">1</span> <span class="err">regions</span><span class="p">}</span> <span class="err">:</span> <span class="p">(</span><span class="nv">!stage</span><span class="p">,</span> <span class="nv">!pipeline</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="nv">!stage</span><span class="p">,</span> <span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%73</span> <span class="p">=</span> <span class="s">"nv_tileas.copy"</span><span class="p">(</span><span class="nv">%buf</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="s">"nv_tileas.async.pipeline.yield"</span><span class="p">(</span><span class="nv">%73</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">()</span> <span class="s">"nv_tileas.async.pipeline.consumer_release"</span><span class="p">(</span><span class="nv">%31</span><span class="p">,</span> <span class="nv">%71</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="nv">!pipeline</span><span class="p">,</span> <span class="nv">!stage</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">()</span> <span class="err">//</span> <span 
class="err">Matrix</span> <span class="err">multiply</span> <span class="p">(</span><span class="m">100</span><span class="err">+</span> <span class="err">ops</span> <span class="err">for</span> <span class="err">tiled</span> <span class="err">GEMM</span><span class="p">)</span> <span class="nv">%80</span> <span class="p">=</span> <span class="s">"nv_tileas.dot"</span><span class="p">(</span><span class="nv">%50</span><span class="p">,</span> <span class="nv">%72</span><span class="p">,</span> <span class="nv">%acc</span><span class="p">,</span> <span class="nv">%c1</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%81</span> <span class="p">=</span> <span class="s">"nv_tileas.dot"</span><span class="p">(</span><span class="nv">%50</span><span class="p">,</span> <span class="nv">%72</span><span class="p">,</span> <span class="nv">%80</span><span class="p">,</span> <span class="nv">%c1</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">//</span> <span class="err">TMA</span> <span class="k">load</span> <span class="nv">%90</span> <span class="p">=</span> <span 
class="s">"nv_tileas.async.tiled_tma_load"</span><span class="p">(</span><span class="nv">%btile</span><span class="p">,</span> <span class="nv">%buf</span><span class="p">,</span> <span class="nv">%25</span><span class="p">,</span> <span class="nv">%idx</span><span class="p">,</span> <span class="nv">%c0</span><span class="p">,</span> <span class="nv">%c64</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">aa</span><span class="p">.</span><span class="err">mtoken</span><span class="p">,</span> <span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="nv">!tma.desc</span><span class="p">,</span> <span class="err">iN</span><span class="p">,</span> <span class="err">iN</span><span class="p">,</span> <span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="nv">!async</span><span class="p">)</span> <span class="err">//</span> <span class="err">Output</span> <span class="nv">%100</span> <span class="p">=</span> <span class="s">"nv_tileas.insert_slice"</span><span class="p">(</span><span class="nv">%result</span><span class="p">,</span> <span class="nv">%41</span><span class="p">,</span> <span class="nv">%c0</span><span class="p">,</span> <span class="nv">%c0</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">iN</span><span class="p">,</span> <span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%101</span> <span class="p">=</span> <span class="s">"nv_tileas.view"</span><span class="p">(</span><span class="nv">%100</span><span class="p">)</span> <span class="err">:</span> <span 
class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%102</span> <span class="p">=</span> <span class="s">"nv_tileas.convert_layout"</span><span class="p">(</span><span class="nv">%101</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> </code></pre></div> </div> </details> <h2 id="nvvm--llvm">NVVM + LLVM</h2> <p>After <code class="language-plaintext highlighter-rouge">nv_tileas</code>, the compiler lowers to NVVM (NVIDIA’s LLVM dialect) and then to standard LLVM IR.</p> <p><strong>Key NVVM intrinsics:</strong></p> <ul> <li><code class="language-plaintext highlighter-rouge">@llvm.nvvm.mma.sync.*</code> - Tensor core matrix multiply</li> <li><code class="language-plaintext highlighter-rouge">@llvm.nvvm.ldmatrix.*</code> - Load matrix fragments from shared memory</li> <li><code class="language-plaintext highlighter-rouge">@llvm.nvvm.cp.async.*</code> - Asynchronous memory copy</li> <li><code class="language-plaintext highlighter-rouge">@llvm.nvvm.bar.warp.sync</code> - Warp-level synchronization</li> <li><code class="language-plaintext highlighter-rouge">@llvm.nvvm.tcgen05.*</code> - Tensor core intrinsics</li> </ul> <details style="font-size: 1.2em; margin: 1em 0;"> <summary style="cursor: pointer; font-weight: bold;">Expand to see NVVM/LLVM IR key sections</summary> <div class="language-llvm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">; Thread ID and warp-level operations</span> <span class="nv">%233</span> <span class="p">=</span> <span class="k">call</span> <span class="err">range</span><span 
class="p">(</span><span class="kt">i32</span> <span class="m">0</span><span class="p">,</span> <span class="m">1024</span><span class="p">)</span> <span class="kt">i32</span> <span class="vg">@llvm.nvvm.read.ptx.sreg.tid.x</span><span class="p">()</span> <span class="nv">%234</span> <span class="p">=</span> <span class="k">icmp</span> <span class="k">eq</span> <span class="kt">i32</span> <span class="nv">%233</span><span class="p">,</span> <span class="m">0</span> <span class="nv">%235</span> <span class="p">=</span> <span class="k">ashr</span> <span class="kt">i32</span> <span class="nv">%233</span><span class="p">,</span> <span class="m">5</span> <span class="nv">%236</span> <span class="p">=</span> <span class="k">call</span> <span class="kt">i32</span> <span class="vg">@llvm.nvvm.shfl.sync.idx.i32</span><span class="p">(</span><span class="kt">i32</span> <span class="m">-1</span><span class="p">,</span> <span class="kt">i32</span> <span class="nv">%235</span><span class="p">,</span> <span class="kt">i32</span> <span class="m">0</span><span class="p">,</span> <span class="kt">i32</span> <span class="m">31</span><span class="p">)</span> <span class="nv">%237</span> <span class="p">=</span> <span class="k">call</span> <span class="p">{</span> <span class="kt">i32</span><span class="p">,</span> <span class="kt">i1</span> <span class="p">}</span> <span class="vg">@llvm.nvvm.elect.sync</span><span class="p">(</span><span class="kt">i32</span> <span class="m">-1</span><span class="p">)</span> <span class="c1">; Mbarrier initialization (async pipeline synchronization)</span> <span class="k">call</span> <span class="kt">void</span> <span class="vg">@llvm.nvvm.mbarrier.init.shared</span><span class="p">(</span> <span class="kt">ptr</span> <span class="k">addrspace</span><span class="p">(</span><span class="m">3</span><span class="p">)</span> <span class="k">getelementptr</span> <span class="k">inbounds</span> <span class="k">nuw</span> <span class="p">(</span><span 
class="kt">i8</span><span class="p">,</span> <span class="kt">ptr</span> <span class="k">addrspace</span><span class="p">(</span><span class="m">3</span><span class="p">)</span> <span class="vg">@global_smem</span><span class="p">,</span> <span class="kt">i64</span> <span class="m">82000</span><span class="p">),</span> <span class="kt">i32</span> <span class="nv">%241</span><span class="p">)</span> <span class="k">call</span> <span class="kt">void</span> <span class="vg">@llvm.nvvm.mbarrier.init.shared</span><span class="p">(</span> <span class="kt">ptr</span> <span class="k">addrspace</span><span class="p">(</span><span class="m">3</span><span class="p">)</span> <span class="k">getelementptr</span> <span class="k">inbounds</span> <span class="k">nuw</span> <span class="p">(</span><span class="kt">i8</span><span class="p">,</span> <span class="kt">ptr</span> <span class="k">addrspace</span><span class="p">(</span><span class="m">3</span><span class="p">)</span> <span class="vg">@global_smem</span><span class="p">,</span> <span class="kt">i64</span> <span class="m">82008</span><span class="p">),</span> <span class="kt">i32</span> <span class="nv">%241</span><span class="p">)</span> <span class="c1">; Cluster-wide fence and barrier</span> <span class="k">call</span> <span class="kt">void</span> <span class="k">asm</span> <span class="k">sideeffect</span> <span class="s">"fence.mbarrier_init.release.cluster;"</span><span class="p">,</span> <span class="s">"n"</span><span class="p">(</span><span class="kt">i32</span> <span class="m">0</span><span class="p">)</span> <span class="k">call</span> <span class="kt">void</span> <span class="vg">@llvm.nvvm.barrier.cta.sync.aligned.all</span><span class="p">(</span><span class="kt">i32</span> <span class="m">0</span><span class="p">)</span> <span class="c1">; Async copy from global to shared memory (cp.async)</span> <span class="nv">%1478</span> <span class="p">=</span> <span class="k">select</span> <span class="kt">i1</span> 
<span class="nv">%1459</span><span class="p">,</span> <span class="kt">i32</span> <span class="m">16</span><span class="p">,</span> <span class="kt">i32</span> <span class="m">0</span> <span class="k">call</span> <span class="kt">void</span> <span class="vg">@llvm.nvvm.cp.async.cg.shared.global.16.s</span><span class="p">(</span> <span class="kt">ptr</span> <span class="k">addrspace</span><span class="p">(</span><span class="m">3</span><span class="p">)</span> <span class="nv">%1477</span><span class="p">,</span> <span class="kt">ptr</span> <span class="k">addrspace</span><span class="p">(</span><span class="m">1</span><span class="p">)</span> <span class="nv">%1451</span><span class="p">,</span> <span class="kt">i32</span> <span class="nv">%1478</span><span class="p">)</span> <span class="k">call</span> <span class="kt">void</span> <span class="vg">@llvm.nvvm.cp.async.cg.shared.global.16.s</span><span class="p">(</span> <span class="kt">ptr</span> <span class="k">addrspace</span><span class="p">(</span><span class="m">3</span><span class="p">)</span> <span class="nv">%1485</span><span class="p">,</span> <span class="kt">ptr</span> <span class="k">addrspace</span><span class="p">(</span><span class="m">1</span><span class="p">)</span> <span class="nv">%1452</span><span class="p">,</span> <span class="kt">i32</span> <span class="nv">%1486</span><span class="p">)</span> <span class="c1">; Signal mbarrier arrival after async copy</span> <span class="k">call</span> <span class="kt">void</span> <span class="vg">@llvm.nvvm.cp.async.mbarrier.arrive.noinc.shared</span><span class="p">(</span><span class="kt">ptr</span> <span class="k">addrspace</span><span class="p">(</span><span class="m">3</span><span class="p">)</span> <span class="nv">%1535</span><span class="p">)</span> <span class="c1">; TCGEN05 tensor core intrinsics</span> <span class="c1">; Allocate tensor memory</span> <span class="nv">%tmem</span> <span class="p">=</span> <span class="k">call</span> <span 
class="kt">i32</span> <span class="vg">@llvm.nvvm.tcgen05.alloc</span><span class="p">(</span><span class="kt">i32</span> <span class="m">65536</span><span class="p">)</span> <span class="c1">; Load data into tensor memory</span> <span class="k">call</span> <span class="kt">void</span> <span class="vg">@llvm.nvvm.tcgen05.ld</span><span class="p">(</span><span class="kt">i32</span> <span class="nv">%tmem</span><span class="p">,</span> <span class="kt">ptr</span> <span class="k">addrspace</span><span class="p">(</span><span class="m">3</span><span class="p">)</span> <span class="nv">%smem_ptr</span><span class="p">,</span> <span class="kt">i32</span> <span class="nv">%size</span><span class="p">)</span> <span class="c1">; Execute TCGEN05 MMA (128x256x64 tile)</span> <span class="k">call</span> <span class="kt">void</span> <span class="vg">@llvm.nvvm.tcgen05.mma</span><span class="p">(</span><span class="kt">i32</span> <span class="nv">%tmem_a</span><span class="p">,</span> <span class="kt">i32</span> <span class="nv">%tmem_b</span><span class="p">,</span> <span class="kt">i32</span> <span class="nv">%tmem_c</span><span class="p">)</span> <span class="c1">; Fence and wait for tensor core completion</span> <span class="k">call</span> <span class="kt">void</span> <span class="vg">@llvm.nvvm.tcgen05.fence</span><span class="p">()</span> <span class="k">call</span> <span class="kt">void</span> <span class="vg">@llvm.nvvm.tcgen05.wait</span><span class="p">()</span> </code></pre></div> </div> </details> <h2 id="sass">SASS</h2> <p>The final output is SASS.</p> <p><strong>Key SASS instructions:</strong></p> <ul> <li><code class="language-plaintext highlighter-rouge">HMMA.16816.F32.BF16</code> - Half-precision matrix multiply-accumulate</li> <li><code class="language-plaintext highlighter-rouge">TCGEN05.MMA</code> - Tensor core MMA</li> <li><code class="language-plaintext highlighter-rouge">TCGEN05.LD.S</code> - Tensor memory load</li> <li><code class="language-plaintext 
highlighter-rouge">UTCPMULTI</code> / <code class="language-plaintext highlighter-rouge">LDG</code> - Global memory loads</li> <li><code class="language-plaintext highlighter-rouge">SYNCS.EXCH</code> - Async synchronization exchange</li> <li><code class="language-plaintext highlighter-rouge">FENCE.VIEW.ASYNC.S</code> - Async memory fence</li> </ul> <details style="font-size: 1.2em; margin: 1em 0;"> <summary style="cursor: pointer; font-weight: bold;">Expand to see SASS key sections</summary> <div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">; SASS - MoE kernel (fused_moe_kernel)</span> <span class="c1">; Target: sm_120a</span> <span class="c1">; Thread ID and CTA setup</span> <span class="o">/*</span><span class="err">0020</span><span class="o">*/</span> <span class="nf">S2R</span> <span class="nv">R0</span><span class="p">,</span> <span class="nv">SR_TID.X</span> <span class="c1">; ; Get thread ID</span> <span class="o">/*</span><span class="err">0060</span><span class="o">*/</span> <span class="nf">S2UR</span> <span class="nv">UR8</span><span class="p">,</span> <span class="nv">SR_CgaCtaId</span> <span class="c1">; ; Get CTA ID (uniform reg)</span> <span class="c1">; Async fence and mbarrier sync (cluster sync)</span> <span class="o">/*</span><span class="err">0110</span><span class="o">*/</span> <span class="nf">FENCE.VIEW.ASYNC.S</span> <span class="c1">;</span> <span class="o">/*</span><span class="err">0120</span><span class="o">*/</span> <span class="nf">SYNCS.EXCH.64</span> <span class="nv">URZ</span><span class="p">,</span> <span class="p">[</span><span class="nv">UR8</span><span class="o">+</span><span class="mh">0x14050</span><span class="p">],</span> <span class="nv">UR4</span> <span class="c1">;</span> <span class="o">/*</span><span class="err">0130</span><span class="o">*/</span> <span class="nf">SYNCS.EXCH.64</span> <span class="nv">URZ</span><span class="p">,</span> <span 
class="p">[</span><span class="nv">UR8</span><span class="o">+</span><span class="mh">0x14058</span><span class="p">],</span> <span class="nv">UR4</span> <span class="c1">;</span> <span class="o">/*</span><span class="err">0140</span><span class="o">*/</span> <span class="nf">SYNCS.EXCH.64</span> <span class="nv">URZ</span><span class="p">,</span> <span class="p">[</span><span class="nv">UR8</span><span class="o">+</span><span class="mh">0x14060</span><span class="p">],</span> <span class="nv">UR6</span> <span class="c1">;</span> <span class="c1">; ... (data loading, address calculation) ...</span> <span class="c1">; Tensor core HMMA - 16x8x16 BF16→F32 matrix multiply</span> <span class="c1">; R156 = A matrix fragment (reused across 7 HMMAs)</span> <span class="c1">; R124,R120,R116,R112,R108,R104,R100 = B matrix fragments</span> <span class="c1">; R200,R204,R64,R60,R56,R52,R48 = accumulator tiles</span> <span class="o">/*</span><span class="err">4</span><span class="nf">a00</span><span class="o">*/</span> <span class="nv">HMMA.16816.F32.BF16</span> <span class="nv">R200</span><span class="p">,</span> <span class="nv">R156</span><span class="p">,</span> <span class="nv">R124</span><span class="p">,</span> <span class="nv">R200</span> <span class="c1">;</span> <span class="o">/*</span><span class="err">4</span><span class="nf">a10</span><span class="o">*/</span> <span class="nv">HMMA.16816.F32.BF16</span> <span class="nv">R204</span><span class="p">,</span> <span class="nv">R156</span><span class="p">,</span> <span class="nv">R120</span><span class="p">,</span> <span class="nv">R204</span> <span class="c1">;</span> <span class="o">/*</span><span class="err">4</span><span class="nf">a20</span><span class="o">*/</span> <span class="nv">HMMA.16816.F32.BF16</span> <span class="nv">R64</span><span class="p">,</span> <span class="nv">R156</span><span class="p">,</span> <span class="nv">R116</span><span class="p">,</span> <span class="nv">R64</span> <span 
class="c1">;</span> <span class="o">/*</span><span class="err">4</span><span class="nf">a30</span><span class="o">*/</span> <span class="nv">HMMA.16816.F32.BF16</span> <span class="nv">R60</span><span class="p">,</span> <span class="nv">R156</span><span class="p">,</span> <span class="nv">R112</span><span class="p">,</span> <span class="nv">R60</span> <span class="c1">;</span> <span class="o">/*</span><span class="err">4</span><span class="nf">a40</span><span class="o">*/</span> <span class="nv">HMMA.16816.F32.BF16</span> <span class="nv">R56</span><span class="p">,</span> <span class="nv">R156</span><span class="p">,</span> <span class="nv">R108</span><span class="p">,</span> <span class="nv">R56</span> <span class="c1">;</span> <span class="o">/*</span><span class="err">4</span><span class="nf">a50</span><span class="o">*/</span> <span class="nv">HMMA.16816.F32.BF16</span> <span class="nv">R52</span><span class="p">,</span> <span class="nv">R156</span><span class="p">,</span> <span class="nv">R104</span><span class="p">,</span> <span class="nv">R52</span> <span class="c1">;</span> <span class="o">/*</span><span class="err">4</span><span class="nf">a60</span><span class="o">*/</span> <span class="nv">HMMA.16816.F32.BF16</span> <span class="nv">R48</span><span class="p">,</span> <span class="nv">R156</span><span class="p">,</span> <span class="nv">R100</span><span class="p">,</span> <span class="nv">R48</span> <span class="c1">;</span> <span class="c1">; Second A fragment (R148) with different B fragments</span> <span class="o">/*</span><span class="err">4</span><span class="nf">a70</span><span class="o">*/</span> <span class="nv">HMMA.16816.F32.BF16</span> <span class="nv">R200</span><span class="p">,</span> <span class="nv">R148</span><span class="p">,</span> <span class="nv">R126</span><span class="p">,</span> <span class="nv">R200</span> <span class="c1">;</span> <span class="o">/*</span><span class="err">4</span><span class="nf">a80</span><span 
class="o">*/</span> <span class="nv">HMMA.16816.F32.BF16</span> <span class="nv">R204</span><span class="p">,</span> <span class="nv">R148</span><span class="p">,</span> <span class="nv">R122</span><span class="p">,</span> <span class="nv">R204</span> <span class="c1">;</span> <span class="o">/*</span><span class="err">4</span><span class="nf">a90</span><span class="o">*/</span> <span class="nv">HMMA.16816.F32.BF16</span> <span class="nv">R64</span><span class="p">,</span> <span class="nv">R148</span><span class="p">,</span> <span class="nv">R118</span><span class="p">,</span> <span class="nv">R64</span> <span class="c1">;</span> </code></pre></div> </div> </details> <hr /> <h1 id="the-tileir-passes">The TileIR passes</h1> <p>TileIR runs multiple passes to transform your code. The passes are grouped by the scope they operate on:</p> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2026-01-29/pass_flow.svg" width="100%" alt="" /> <div class="caption"> <em>TileIR pass pipeline </em> </div> </div> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2026-01-29/pass_glossary.svg" width="100%" alt="" /> <div class="caption"> <em>Detailed pass pipeline: cuda_tile.entry → nv_tileaa.func (×12) → builtin.module → gpu.module </em> </div> </div> <hr /> <h3 id="pass-1-cuda_tileentry">Pass 1: <code class="language-plaintext highlighter-rouge">cuda_tile.entry</code></h3> <p>Entry point canonicalization—validates kernel structure, emits compile-time constants for tile sizes/strides, propagates input constraints via <code class="language-plaintext highlighter-rouge">assume</code> operations, creates tensor views, and establishes memory ordering via <code class="language-plaintext highlighter-rouge">make_token</code>.</p> <hr /> <h3 id="pass-2-nv_tileaafunc-12-iterations">Pass 2: <code class="language-plaintext highlighter-rouge">nv_tileaa.func</code> (×12 iterations)</h3> <p>Iterative lowering from cuda_tile to nv_tileaa. 
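</p> <p>To make the first rewrite concrete, here is a hand-written sketch of the op renames this pass performs. The op names come from this section; the operand and type spellings are assumptions in the generic-form style used above, not actual compiler output:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// cuda_tile input (illustrative; operand and type syntax are assumptions)
%view = "cuda_tile.make_tensor_view"(%ptr) : (!ptr) -&gt; (!view)
%bid  = "cuda_tile.get_tile_block_id"() : () -&gt; (index)
%acc1 = "cuda_tile.mmaf"(%a, %b, %acc0) : (tile&lt;...&gt;, tile&lt;...&gt;, tile&lt;...&gt;) -&gt; (tile&lt;...&gt;)

// nv_tileaa output of the first iteration (illustrative)
%memref = "nv_tileaa.make_memref"(%ptr) : (!ptr) -&gt; (!memref)
%pid    = "nv_tileaa.get_program_id"() : () -&gt; (index)
%acc1   = "nv_tileaa.dot"(%a, %b, %acc0) : (tensor&lt;...&gt;, tensor&lt;...&gt;, tensor&lt;...&gt;) -&gt; (tensor&lt;...&gt;)
</code></pre></div></div> <p>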
First iteration converts <code class="language-plaintext highlighter-rouge">make_tensor_view</code> → <code class="language-plaintext highlighter-rouge">make_memref</code>, <code class="language-plaintext highlighter-rouge">get_tile_block_id</code> → <code class="language-plaintext highlighter-rouge">get_program_id</code>, <code class="language-plaintext highlighter-rouge">mmaf</code> → <code class="language-plaintext highlighter-rouge">dot</code>, decomposes <code class="language-plaintext highlighter-rouge">load_view_tko</code> into <code class="language-plaintext highlighter-rouge">block_tile</code> + <code class="language-plaintext highlighter-rouge">tiled_load</code> + <code class="language-plaintext highlighter-rouge">view</code>. Subsequent iterations perform refinement and optimization. Final iteration emits precision conversions (<code class="language-plaintext highlighter-rouge">fp_to_fp</code>), adds kernel metadata, and prepares for async pipeline lowering.</p> <hr /> <h3 id="pass-3-builtinmodule">Pass 3: <code class="language-plaintext highlighter-rouge">builtin.module</code></h3> <p>Module-level transforms and nv_tileas emission—creates async pipeline operations, software pipelines for overlapping compute/memory, producer-consumer synchronization, TMA descriptors, and double buffers.</p> <hr /> <h3 id="pass-4-gpumodule">Pass 4: <code class="language-plaintext highlighter-rouge">gpu.module</code></h3> <p>Final lowering to NVVM/LLVM—converts <code class="language-plaintext highlighter-rouge">nv_tileas.dot</code> → <code class="language-plaintext highlighter-rouge">nvvm.mma.sync</code>, lowers async ops to barrier/fence instructions, converts memory ops to NVVM intrinsics (<code class="language-plaintext highlighter-rouge">ldmatrix</code>, <code class="language-plaintext highlighter-rouge">cp.async</code>, <code class="language-plaintext highlighter-rouge">mbarrier.*</code>), and emits address space annotations.</p> <h2 
id="complete-pass-catalog">Complete Pass Catalog</h2> <p>Below is a catalog of passes that run within the TileIR pipeline.</p> <h3 id="conversion-passes">Conversion Passes</h3> <link rel="stylesheet" href="/assets/css/fancy_table.css" /> <script> function toggleCalculations(tableId) { // ... (toggleCalculations function remains the same) const table = document.getElementById(tableId); const cells = table.querySelectorAll('.has-calculation'); const tableWrapper = document.getElementById(tableId + '-wrapper'); const toggleSwitch = document.getElementById(tableId + '-switch'); const toggleText = document.getElementById(tableId + '-toggle-text'); if (tableWrapper.classList.contains('show-calculations')) { tableWrapper.classList.remove('show-calculations'); toggleSwitch.classList.remove('active'); toggleText.textContent = "Show calculations"; cells.forEach(cell => { const normalText = cell.querySelector('.normal-text'); const calculationText = cell.querySelector('.calculation-text'); normalText.style.display = 'inline'; calculationText.style.display = 'none'; }); } else { tableWrapper.classList.add('show-calculations'); toggleSwitch.classList.add('active'); toggleText.textContent = "Hide calculations"; cells.forEach(cell => { const normalText = cell.querySelector('.normal-text'); const calculationText = cell.querySelector('.calculation-text'); normalText.style.display = 'none'; calculationText.style.display = 'inline'; }); } } function isConsideredYellow(colorString) { if (!colorString) return false; const lowerColor = colorString.toLowerCase(); const yellowKeywords = ['yellow', 'gold', 'lemon', 'chiffon', 'goldenrod', 'papayawhip', 'moccasin', 'khaki', '#ff0', '#ffff00', '#ffffe0', '#fffacd', '#fafad2', '#fff8dc', '#eee8aa', '#f0e68c']; if (yellowKeywords.some(k => lowerColor.includes(k))) { return true; } const match = lowerColor.match(/rgba?\((\d+),\s*(\d+),\s*(\d+)/); if (match) { const r = parseInt(match[1]); const g = parseInt(match[2]); const b = 
parseInt(match[3]); if (r > 200 && g > 180 && b < 200 && Math.abs(r - g) < 70) { return true; } } if (lowerColor === '#ffdab9') return true; // PeachPuff if (lowerColor === 'rgba(255,255,0,0.2)') return true; // Example: semi-transparent yellow return false; } function initializeCellHighlighters() { const highlighters = document.querySelectorAll('[data-highlight-cells]'); highlighters.forEach(highlighter => { if (highlighter.dataset.highlighterInitialized === 'true') return; highlighter.dataset.highlighterInitialized = 'true'; const cellIdsToHighlight = highlighter.dataset.highlightCells.split(',').map(id => id.trim()); const cellElements = cellIdsToHighlight.map(id => document.getElementById(id)).filter(el => el); let descriptiveSpans = []; if (highlighter.matches('span[data-hover-text-color]')) { descriptiveSpans.push(highlighter); } else { descriptiveSpans = Array.from(highlighter.querySelectorAll('span[data-hover-text-color]')); } descriptiveSpans.forEach(span => { if (typeof span.dataset.originalParaTextColor === 'undefined') { span.dataset.originalParaTextColor = window.getComputedStyle(span).color || ''; } }); highlighter.addEventListener('mouseenter', () => { highlighter.classList.add('trigger-text-active'); descriptiveSpans.forEach(span => { if (span.dataset.hoverTextColor) { span.style.color = span.dataset.hoverTextColor; } }); cellElements.forEach((cell, idx) => { cell.classList.add('cell-highlighted'); let hoverTextColorForCell = null; let hoverCellBgFromDescSpan = null; if (idx < descriptiveSpans.length) { const descSpan = descriptiveSpans[idx]; if (descSpan.dataset.hoverTextColor) { hoverTextColorForCell = descSpan.dataset.hoverTextColor; } if (descSpan.dataset.hoverCellBg) { hoverCellBgFromDescSpan = descSpan.dataset.hoverCellBg; } } else if (descriptiveSpans.length === 1 && cellElements.length > 1) { const descSpan = descriptiveSpans[0]; if (descSpan.dataset.hoverTextColor) { hoverTextColorForCell = descSpan.dataset.hoverTextColor; } if 
(descSpan.dataset.hoverCellBg) { hoverCellBgFromDescSpan = descSpan.dataset.hoverCellBg; } } if (hoverTextColorForCell) { const isCalcCell = cell.classList.contains('has-calculation'); if (isCalcCell) { const normalTextSpan = cell.querySelector('.normal-text'); const calcResultSpan = cell.querySelector('.calc-result'); const calcWrapper = cell.querySelector('.calculation-text'); if (calcWrapper && window.getComputedStyle(calcWrapper).display !== 'none') { if (calcResultSpan) calcResultSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } else { if (normalTextSpan) normalTextSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } } else { const directSpan = cell.querySelector(':scope > span'); if (directSpan) directSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } } if (hoverCellBgFromDescSpan && isConsideredYellow(hoverCellBgFromDescSpan)) { if (typeof cell.dataset.originalBgColor === 'undefined') { cell.dataset.originalBgColor = cell.style.backgroundColor; } cell.style.backgroundColor = hoverCellBgFromDescSpan; cell.dataset.bgChangedByScript = 'true'; } }); }); highlighter.addEventListener('mouseleave', () => { highlighter.classList.remove('trigger-text-active'); descriptiveSpans.forEach(span => { if (typeof span.dataset.originalParaTextColor !== 'undefined') { span.style.color = span.dataset.originalParaTextColor; } }); cellElements.forEach((cell, idx) => { cell.classList.remove('cell-highlighted'); const isCalcCell = cell.classList.contains('has-calculation'); if (isCalcCell) { const normalTextSpan = cell.querySelector('.normal-text'); const calcResultSpan = cell.querySelector('.calc-result'); if (normalTextSpan) normalTextSpan.style.removeProperty('color'); if (calcResultSpan) calcResultSpan.style.removeProperty('color'); } else { const directSpan = cell.querySelector(':scope > span'); if (directSpan) directSpan.style.removeProperty('color'); } if (cell.dataset.bgChangedByScript === 'true') { 
cell.style.backgroundColor = cell.dataset.originalBgColor || ''; delete cell.dataset.originalBgColor; delete cell.dataset.bgChangedByScript; } }); }); }); } if (document.readyState === 'loading') { document.addEventListener('DOMContentLoaded', initializeCellHighlighters); } else { initializeCellHighlighters(); } </script> <style> /* ... (Other styles remain the same) ... */ .toggle-container { display: flex; justify-content: flex-end; margin-bottom: 6px; } .calc-toggle { display: inline-flex; align-items: center; } .toggle-switch { position: relative; display: inline-block; width: 32px; height: 16px; background-color: #e5e7eb; /* gray-200 */ border-radius: 10px; transition: all 0.3s; cursor: pointer; } .toggle-switch::after { content: ''; position: absolute; width: 12px; height: 12px; border-radius: 50%; background-color: white; top: 2px; left: 2px; transition: all 0.3s cubic-bezier(0.175, 0.885, 0.32, 1.275); box-shadow: 0 1px 2px rgba(0, 0, 0, 0.1); } .toggle-switch.active { background-color: #8b5cf6; /* purple-500 */ } .toggle-switch.active::after { left: 18px; } .toggle-label { position: absolute; width: 1px; height: 1px; padding: 0; margin: -1px; overflow: hidden; clip: rect(0, 0, 0, 0); white-space: nowrap; border-width: 0; } .toggle-text { font-size: 0.75rem; color: #6b7280; /* gray-500 */ margin-right: 6px; } .calc-result { color: rgb(75, 85, 99); /* gray-700 */ } .calc-formula { color: #6b83a6; /* Subtle bluish gray */ font-family: ui-monospace, SFMono-Regular, Menlo, Monaco, Consolas, monospace; } .calc-equals { color: rgb(75, 85, 99); /* gray-700 */ display: inline-block; margin: 0 4px; font-weight: normal; } .cell-highlighted { font-weight: bold !important; } .trigger-text-active { background-color: #E5E7EB; padding: 1px 3px; border-radius: 3px; transition: background-color 0.15s ease-in-out; } .trigger-text-active span[data-hover-text-color] { padding: 0; background-color: transparent; } /* Add styles for no-scroll option */ .table-wrapper-no-scroll { 
overflow-x: visible !important; } .table-no-scroll td { white-space: normal !important; word-break: break-word; } </style> <div id="conversion-passes-table-wrapper" class="px-4 rounded-lg __basic-table not-prose mt-4 mb-8 overflow-x-auto"> <table id="conversion-passes-table" class="min-w-full divide-y divide-gray-200 font-sans basic-table-striped"> <thead class="bg-gray-50"> <tr> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Pass Name </th> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Source </th> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Target </th> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Description </th> </tr> </thead> <tbody> <tr class="border-b border-gray-200"> <td id="conversion-passes-table-row0-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">convert-cudatile-to-tileaa</span> </td> <td id="conversion-passes-table-row0-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">cuda_tile</span> </td> <td id="conversion-passes-table-row0-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">nv_tileaa</span> </td> <td id="conversion-passes-table-row0-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Frontend: CuTile DSL to TileAA abstract assembly</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="conversion-passes-table-row1-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">convert-tileaa-to-tileas</span> </td> <td id="conversion-passes-table-row1-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span 
style="color: rgb(75, 85, 99) !important;">nv_tileaa</span> </td> <td id="conversion-passes-table-row1-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">nv_tileas</span> </td> <td id="conversion-passes-table-row1-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Middle-end: Abstract to scheduled assembly</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="conversion-passes-table-row2-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">convert-nv-tileas-to-llvm</span> </td> <td id="conversion-passes-table-row2-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">nv_tileas</span> </td> <td id="conversion-passes-table-row2-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">llvm</span> </td> <td id="conversion-passes-table-row2-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Backend: TileAS to LLVM IR</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="conversion-passes-table-row3-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">convert-nv-tile-func-to-llvm</span> </td> <td id="conversion-passes-table-row3-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">nv_tile</span> </td> <td id="conversion-passes-table-row3-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">llvm</span> </td> <td id="conversion-passes-table-row3-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium 
text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Convert tile function ops to LLVM</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="conversion-passes-table-row4-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">convert-gpu-to-nvvm</span> </td> <td id="conversion-passes-table-row4-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">gpu</span> </td> <td id="conversion-passes-table-row4-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">nvvm</span> </td> <td id="conversion-passes-table-row4-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">GPU dialect to NVVM intrinsics</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="conversion-passes-table-row5-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">convert-scf-to-cf</span> </td> <td id="conversion-passes-table-row5-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">scf</span> </td> <td id="conversion-passes-table-row5-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">cf</span> </td> <td id="conversion-passes-table-row5-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Structured control flow to basic blocks</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="conversion-passes-table-row6-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">nv-tile-ir-convert-target-to-nvvm</span> </td> <td 
id="conversion-passes-table-row6-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">nv_tile</span> </td> <td id="conversion-passes-table-row6-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">nvvm</span> </td> <td id="conversion-passes-table-row6-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Target-specific ops to NVVM</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="conversion-passes-table-row7-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">convert-pipeline-to-nvvm</span> </td> <td id="conversion-passes-table-row7-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">pipeline</span> </td> <td id="conversion-passes-table-row7-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">nvvm</span> </td> <td id="conversion-passes-table-row7-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Async pipeline ops to NVVM barriers</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="conversion-passes-table-row8-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">convert-arith-to-llvm</span> </td> <td id="conversion-passes-table-row8-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">arith</span> </td> <td id="conversion-passes-table-row8-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">llvm</span> </td> <td 
id="conversion-passes-table-row8-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Arithmetic operations to LLVM</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="conversion-passes-table-row9-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">convert-cf-to-llvm</span> </td> <td id="conversion-passes-table-row9-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">cf</span> </td> <td id="conversion-passes-table-row9-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">llvm</span> </td> <td id="conversion-passes-table-row9-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Control flow to LLVM</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="conversion-passes-table-row10-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">convert-to-llvm</span> </td> <td id="conversion-passes-table-row10-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">*</span> </td> <td id="conversion-passes-table-row10-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">llvm</span> </td> <td id="conversion-passes-table-row10-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Generic catch-all LLVM conversion</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="conversion-passes-table-row11-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 
99) !important;">convert-math-to-llvm</span> </td> <td id="conversion-passes-table-row11-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">math</span> </td> <td id="conversion-passes-table-row11-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">llvm</span> </td> <td id="conversion-passes-table-row11-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Math operations to LLVM</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="conversion-passes-table-row12-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">convert-nvvm-to-llvm</span> </td> <td id="conversion-passes-table-row12-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">nvvm</span> </td> <td id="conversion-passes-table-row12-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">llvm</span> </td> <td id="conversion-passes-table-row12-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">NVVM intrinsics to LLVM</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="conversion-passes-table-row13-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">convert-ub-to-llvm</span> </td> <td id="conversion-passes-table-row13-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">ub</span> </td> <td id="conversion-passes-table-row13-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) 
!important;">llvm</span> </td> <td id="conversion-passes-table-row13-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Undefined behavior ops to LLVM</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="conversion-passes-table-row14-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">convert-vector-to-llvm</span> </td> <td id="conversion-passes-table-row14-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">vector</span> </td> <td id="conversion-passes-table-row14-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">llvm</span> </td> <td id="conversion-passes-table-row14-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Vector ops to LLVM</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="conversion-passes-table-row15-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">convert-debuginfo-to-llvm</span> </td> <td id="conversion-passes-table-row15-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">debug</span> </td> <td id="conversion-passes-table-row15-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">llvm</span> </td> <td id="conversion-passes-table-row15-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Debug info to LLVM metadata</span> </td> </tr> </tbody> </table> </div> <h3 id="tileas-optimization-passes">TileAS Optimization Passes</h3> <link rel="stylesheet" 
href="/assets/css/fancy_table.css" /> <div id="tileas-passes-table-wrapper" class="px-4 rounded-lg __basic-table
not-prose mt-4 mb-8 overflow-x-auto"> <table id="tileas-passes-table" class="min-w-full divide-y divide-gray-200 font-sans basic-table-striped"> <thead class="bg-gray-50"> <tr> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Pass Name </th> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Description </th> </tr> </thead> <tbody> <tr class="border-b border-gray-200"> <td id="tileas-passes-table-row0-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">tileas-assign-dot-layouts</span> </td> <td id="tileas-passes-table-row0-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Assign optimal data layouts for dot (MMA) operations</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="tileas-passes-table-row1-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">tileas-assign-pipeline-layouts</span> </td> <td id="tileas-passes-table-row1-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Assign layouts for async pipeline stages</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="tileas-passes-table-row2-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">tileas-assign-load-store-layouts</span> </td> <td id="tileas-passes-table-row2-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Assign layouts for memory operations</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="tileas-passes-table-row3-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) 
!important;">tileas-attach-tma-desc-args</span> </td> <td id="tileas-passes-table-row3-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Attach TMA descriptor arguments to kernel signature</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="tileas-passes-table-row4-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">tileas-dynamic-persistent</span> </td> <td id="tileas-passes-table-row4-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Enable dynamic persistent kernel execution</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="tileas-passes-table-row5-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">tileas-insert-OCG-knobs</span> </td> <td id="tileas-passes-table-row5-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Insert Online Code Generation tuning knobs</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="tileas-passes-table-row6-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">tileas-legalize-tmem-copy</span> </td> <td id="tileas-passes-table-row6-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Legalize tensor memory copy operations</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="tileas-passes-table-row7-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">tileas-plan-cta</span> </td> <td id="tileas-passes-table-row7-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span 
style="color: rgb(75, 85, 99) !important;">Plan CTA (thread block) configuration</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="tileas-passes-table-row8-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">tileas-remove-buffer-alias</span> </td> <td id="tileas-passes-table-row8-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Remove buffer aliasing for optimization</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="tileas-passes-table-row9-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">tileas-remove-dead-args</span> </td> <td id="tileas-passes-table-row9-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Dead argument elimination</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="tileas-passes-table-row10-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">tileas-remove-layout-conversions</span> </td> <td id="tileas-passes-table-row10-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Remove unnecessary layout conversions</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="tileas-passes-table-row11-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">tileas-resolve-agent-boundary</span> </td> <td id="tileas-passes-table-row11-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Resolve warp specialization agent boundaries</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="tileas-passes-table-row12-col0" 
class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">tileas-slicing</span> </td> <td id="tileas-passes-table-row12-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Tensor slicing for pipelining</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="tileas-passes-table-row13-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">tileas-materialize-async</span> </td> <td id="tileas-passes-table-row13-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Materialize async load/store/dot operations</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="tileas-passes-table-row14-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">tileas-materialize-convert-layout</span> </td> <td id="tileas-passes-table-row14-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Materialize layout conversion copy atoms</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="tileas-passes-table-row15-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">tileas-materialize-schedule</span> </td> <td id="tileas-passes-table-row15-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Materialize schedule to warp-specialized IR</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="tileas-passes-table-row16-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">tileas-unroll-register-loops</span> </td> <td 
id="tileas-passes-table-row16-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Unroll loops at register level</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="tileas-passes-table-row17-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">tileas-unspecialized-pipeline</span> </td> <td id="tileas-passes-table-row17-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Handle non-warp-specialized pipelines</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="tileas-passes-table-row18-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">tileas-optimize-alloc-tensor</span> </td> <td id="tileas-passes-table-row18-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Optimize tensor allocation placement</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="tileas-passes-table-row19-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">tileas-optimize-reduce</span> </td> <td id="tileas-passes-table-row19-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Optimize reduction operations</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="tileas-passes-table-row20-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">tileas-recompute-for-scheduling</span> </td> <td id="tileas-passes-table-row20-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Recompute values for better 
scheduling</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="tileas-passes-table-row21-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">tileas-legalize-fma-dot</span> </td> <td id="tileas-passes-table-row21-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Legalize FMA in dot products</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="tileas-passes-table-row22-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">tileas-legalize-reduce</span> </td> <td id="tileas-passes-table-row22-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Legalize reduction operations</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="tileas-passes-table-row23-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">tileas-slice-and-fuse</span> </td> <td id="tileas-passes-table-row23-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Slice and fuse operations for locality</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="tileas-passes-table-row24-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">tileas-refine-atom-by-resource</span> </td> <td id="tileas-passes-table-row24-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Refine copy atoms based on resource constraints</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="tileas-passes-table-row25-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span 
style="color: rgb(75, 85, 99) !important;">tileas-generate-schedule</span> </td> <td id="tileas-passes-table-row25-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Generate execution schedule (Serial or CostBased)</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="tileas-passes-table-row26-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">tileas-prepare-for-scheduling</span> </td> <td id="tileas-passes-table-row26-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Prepare IR for scheduling pass</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="tileas-passes-table-row27-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">tileas-optimize-dot-accumulation</span> </td> <td id="tileas-passes-table-row27-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Optimize dot product accumulation</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="tileas-passes-table-row28-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">lower-tma-load-store-to-async</span> </td> <td id="tileas-passes-table-row28-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Lower TMA ops to async variants</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="tileas-passes-table-row29-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">tileas-print-decomposed-tv-layout</span> </td> <td id="tileas-passes-table-row29-col1" class="px-4 py-2 whitespace-nowrap text-sm 
font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Debug: print decomposed tensor view layouts</span> </td> </tr> </tbody> </table> </div> <hr /> <details style="font-size: 1.2em; margin: 1em 0;"> <summary style="cursor: pointer; font-weight: bold;">Conversion Patterns Registered</summary> <p>The TileAA→TileAS conversion registers 20+ patterns:</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">TileAAToTileASTiledLoadOpPattern</span> <span class="c1">// Tiled load conversion</span> <span class="n">TileAAToTileASDotOpPattern</span> <span class="c1">// Dot product conversion</span> <span class="n">TileAAToTileASExtractOpPattern</span> <span class="c1">// Extraction conversion</span> <span class="n">TileAAToTileASBroadcastOpPattern</span> <span class="c1">// Broadcast conversion</span> <span class="n">TileAAToTileASGatherLoadOpPattern</span> <span class="c1">// Gather load conversion</span> <span class="n">TileAAToTileASScatterStoreOpPattern</span> <span class="c1">// Scatter store conversion</span> <span class="n">TileAAToTileASExpandDimsOpPattern</span> <span class="c1">// Dimension expansion</span> <span class="n">TileAAToTileASExtractSliceOpPattern</span> <span class="c1">// Slice extraction</span> <span class="n">TileAAToTileASGenerateOpPattern</span> <span class="c1">// Generate conversion</span> <span class="n">TileAAToTileASLoadOpPattern</span> <span class="c1">// Load conversion</span> <span class="n">TileAAToTileASPermuteOpPattern</span> <span class="c1">// Permute conversion</span> <span class="n">TileAAToTileASReduceOpPattern</span> <span class="c1">// Reduce conversion</span> <span class="n">TileAAToTileASScanOpPattern</span> <span class="c1">// Scan conversion</span> <span class="n">TileAAToTileASStoreOpPattern</span> <span class="c1">// Store conversion</span> <span class="n">TileAAToTileASTiledAtomicRMWOpPattern</span> <span class="c1">// Atomic RMW 
conversion</span> <span class="n">TileAAToTileASTiledStoreOpPattern</span> <span class="c1">// Tiled store conversion</span> <span class="n">TileAAToTileASViewOpPattern</span> <span class="c1">// View conversion</span> <span class="n">TileAAToTileASYieldOpPattern</span> <span class="c1">// Yield conversion</span> </code></pre></div> </div> </details> <hr /> <h1 id="conclusion">Conclusion</h1> <p>TileIR is a sophisticated MLIR-based compiler that progressively lowers high-level tensor operations to optimized GPU machine code. It’s an interesting piece of software that combines MLIR and the rest of NVIDIA’s toolchain to make the tile abstraction work.</p> <p><strong>Resources:</strong></p> <ul> <li><a href="https://github.com/NVIDIA/cutile-python">CuTile Python</a></li> <li><a href="https://github.com/NVIDIA/cuda-tile">CUDA Tile</a></li> <li><a href="https://docs.nvidia.com/cuda/tile-ir/">NVIDIA TileIR Documentation</a></li> </ul> <hr /> <h1 id="appendix-tileir-passes-reference">Appendix: TileIR Passes Reference</h1> <p>This appendix documents the TileIR-specific passes in the compilation pipeline. 
Passes are organized into two categories: <strong>Conversion</strong> and <strong>TileAS Optimization</strong>.</p> <details style="font-size: 1.2em; margin: 1em 0;"> <summary style="cursor: pointer; font-weight: bold;">Conversion Passes (16)</summary> <p>Conversion passes transform IR between MLIR dialects.</p> <h3 id="convert-cudatile-to-tileaa">convert-cudatile-to-tileaa</h3> <p>Converts high-level <code class="language-plaintext highlighter-rouge">cuda_tile</code> dialect to <code class="language-plaintext highlighter-rouge">nv_tileaa</code>.</p> <p><strong>Key transformations:</strong></p> <ul> <li><code class="language-plaintext highlighter-rouge">cuda_tile.mmaf</code> → <code class="language-plaintext highlighter-rouge">nv_tileaa.dot</code></li> <li><code class="language-plaintext highlighter-rouge">cuda_tile.load_view_tko</code> → <code class="language-plaintext highlighter-rouge">nv_tileaa.tiled_load</code></li> <li><code class="language-plaintext highlighter-rouge">cuda_tile.store_ptr_tko</code> → <code class="language-plaintext highlighter-rouge">nv_tileaa.tiled_store</code></li> <li><code class="language-plaintext highlighter-rouge">cuda_tile.for</code> → <code class="language-plaintext highlighter-rouge">scf.for</code> + <code class="language-plaintext highlighter-rouge">nv_tileaa.yield</code></li> </ul> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">ConvertCudaTileToTileAA</span><span class="o">::</span><span class="n">runOnOperation</span><span class="p">()</span> <span class="p">{</span> <span class="n">ModuleOp</span> <span class="k">module</span> <span class="o">=</span> <span class="n">getOperation</span><span class="p">();</span> <span class="n">ConversionTarget</span> <span class="n">target</span><span class="p">(</span><span class="n">getContext</span><span class="p">());</span> <span class="n">target</span><span class="p">.</span><span
class="n">addLegalDialect</span><span class="o">&lt;</span><span class="n">nv_tileaa</span><span class="o">::</span><span class="n">NVTileAADialect</span><span class="o">&gt;</span><span class="p">();</span> <span class="n">target</span><span class="p">.</span><span class="n">addIllegalDialect</span><span class="o">&lt;</span><span class="n">cuda_tile</span><span class="o">::</span><span class="n">CudaTileDialect</span><span class="o">&gt;</span><span class="p">();</span> <span class="n">RewritePatternSet</span> <span class="n">patterns</span><span class="p">(</span><span class="o">&amp;</span><span class="n">getContext</span><span class="p">());</span> <span class="c1">// Register 20+ conversion patterns</span> <span class="n">patterns</span><span class="p">.</span><span class="n">add</span><span class="o">&lt;</span><span class="n">ConvertMmafToDot</span><span class="o">&gt;</span><span class="p">(...);</span> <span class="n">patterns</span><span class="p">.</span><span class="n">add</span><span class="o">&lt;</span><span class="n">ConvertLoadViewTko</span><span class="o">&gt;</span><span class="p">(...);</span> <span class="n">patterns</span><span class="p">.</span><span class="n">add</span><span class="o">&lt;</span><span class="n">ConvertStorePtr</span><span class="o">&gt;</span><span class="p">(...);</span> <span class="c1">// Fail the pass if any cuda_tile op is left unconverted</span> <span class="k">if</span> <span class="p">(</span><span class="n">failed</span><span class="p">(</span><span class="n">applyPartialConversion</span><span class="p">(</span><span class="k">module</span><span class="p">,</span> <span class="n">target</span><span class="p">,</span> <span class="n">std</span><span class="o">::</span><span class="n">move</span><span class="p">(</span><span class="n">patterns</span><span class="p">))))</span> <span class="n">signalPassFailure</span><span class="p">();</span> <span class="p">}</span> </code></pre></div> </div> <hr /> <h3 id="convert-tileaa-to-tileas">convert-tileaa-to-tileas</h3> <p>Main middle-end conversion: <code class="language-plaintext highlighter-rouge">nv_tileaa</code> → <code class="language-plaintext highlighter-rouge">nv_tileas</code> (Tile Assembly).</p> <p><strong>Key
transformations:</strong></p> <ul> <li><code class="language-plaintext highlighter-rouge">nv_tileaa.tiled_load</code> → <code class="language-plaintext highlighter-rouge">nv_tileas.async_load</code> + pipeline ops</li> <li><code class="language-plaintext highlighter-rouge">nv_tileaa.dot</code> → <code class="language-plaintext highlighter-rouge">nv_tileas.dot</code> with layout annotations</li> <li>Inserts shared memory allocations</li> </ul> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">ConvertTileAAToTileAS</span><span class="o">::</span><span class="n">runOnOperation</span><span class="p">()</span> <span class="p">{</span> <span class="n">FuncOp</span> <span class="n">funcOp</span> <span class="o">=</span> <span class="n">getOperation</span><span class="p">();</span> <span class="c1">// Walk all tileaa operations</span> <span class="n">funcOp</span><span class="p">.</span><span class="n">walk</span><span class="p">([</span><span class="o">&amp;</span><span class="p">](</span><span class="n">nv_tileaa</span><span class="o">::</span><span class="n">TiledLoadOp</span> <span class="n">loadOp</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// Create async copy with TMA descriptor</span> <span class="k">auto</span> <span class="n">asyncCopy</span> <span class="o">=</span> <span class="n">builder</span><span class="p">.</span><span class="n">create</span><span class="o">&lt;</span><span class="n">nv_tileas</span><span class="o">::</span><span class="n">AsyncCopyOp</span><span class="o">&gt;</span><span class="p">(...);</span> <span class="c1">// Allocate shared memory buffer</span> <span class="k">auto</span> <span class="n">smemAlloc</span> <span class="o">=</span> <span class="n">builder</span><span class="p">.</span><span class="n">create</span><span class="o">&lt;</span><span class="n">nv_tileas</span><span class="o">::</span><span 
class="n">AllocSharedOp</span><span class="o">&gt;</span><span class="p">(...);</span> <span class="p">});</span> <span class="n">funcOp</span><span class="p">.</span><span class="n">walk</span><span class="p">([</span><span class="o">&amp;</span><span class="p">](</span><span class="n">nv_tileaa</span><span class="o">::</span><span class="n">DotOp</span> <span class="n">dotOp</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// Convert to tileas.dot with layout attributes</span> <span class="k">auto</span> <span class="n">tiledDot</span> <span class="o">=</span> <span class="n">builder</span><span class="p">.</span><span class="n">create</span><span class="o">&lt;</span><span class="n">nv_tileas</span><span class="o">::</span><span class="n">DotOp</span><span class="o">&gt;</span><span class="p">(...);</span> <span class="n">tiledDot</span><span class="o">-&gt;</span><span class="n">setAttr</span><span class="p">(</span><span class="s">"lhs_layout"</span><span class="p">,</span> <span class="n">selectMMALayout</span><span class="p">(...));</span> <span class="p">});</span> <span class="p">}</span> </code></pre></div> </div> <hr /> <h3 id="convert-nv-tileas-to-llvm">convert-nv-tileas-to-llvm</h3> <p>Backend code generation: <code class="language-plaintext highlighter-rouge">nv_tileas</code> → LLVM IR with NVVM intrinsics.</p> <p><strong>Key transformations:</strong></p> <ul> <li><code class="language-plaintext highlighter-rouge">tileas.tcgen05_mma</code> → <code class="language-plaintext highlighter-rouge">@llvm.nvvm.tcgen05.mma.*</code></li> <li><code class="language-plaintext highlighter-rouge">tileas.tcgen05_ld</code> → <code class="language-plaintext highlighter-rouge">@llvm.nvvm.tcgen05.ld.*</code></li> <li><code class="language-plaintext highlighter-rouge">tileas.async_copy</code> → <code class="language-plaintext highlighter-rouge">@llvm.nvvm.cp.async.*</code></li> <li>Barrier ops → <code class="language-plaintext 
highlighter-rouge">@llvm.nvvm.barrier.*</code></li> </ul> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">ConvertTileASToLLVM</span><span class="o">::</span><span class="n">runOnOperation</span><span class="p">()</span> <span class="p">{</span> <span class="n">ModuleOp</span> <span class="k">module</span> <span class="o">=</span> <span class="n">getOperation</span><span class="p">();</span> <span class="n">ConversionTarget</span> <span class="n">target</span><span class="p">(</span><span class="n">getContext</span><span class="p">());</span> <span class="n">target</span><span class="p">.</span><span class="n">addLegalDialect</span><span class="o">&lt;</span><span class="n">LLVM</span><span class="o">::</span><span class="n">LLVMDialect</span><span class="o">&gt;</span><span class="p">();</span> <span class="n">RewritePatternSet</span> <span class="n">patterns</span><span class="p">(</span><span class="o">&amp;</span><span class="n">getContext</span><span class="p">());</span> <span class="c1">// MMA: tileas.tcgen05_mma → @llvm.nvvm.tcgen05.mma.*</span> <span class="n">patterns</span><span class="p">.</span><span class="n">add</span><span class="o">&lt;</span><span class="n">Tcgen05MMAToNVVM</span><span class="o">&gt;</span><span class="p">(</span><span class="o">&amp;</span><span class="n">getContext</span><span class="p">());</span> <span class="c1">// Memory: tileas.tcgen05_ld → @llvm.nvvm.tcgen05.ld.*</span> <span class="n">patterns</span><span class="p">.</span><span class="n">add</span><span class="o">&lt;</span><span class="n">Tcgen05LoadToNVVM</span><span class="o">&gt;</span><span class="p">(</span><span class="o">&amp;</span><span class="n">getContext</span><span class="p">());</span> <span class="c1">// Apply the conversion; fail the pass on unconverted ops</span> <span class="k">if</span> <span class="p">(</span><span class="n">failed</span><span class="p">(</span><span class="n">applyPartialConversion</span><span class="p">(</span><span class="k">module</span><span class="p">,</span> <span class="n">target</span><span class="p">,</span> <span class="n">std</span><span class="o">::</span><span class="n">move</span><span class="p">(</span><span class="n">patterns</span><span class="p">))))</span> <span class="n">signalPassFailure</span><span class="p">();</span> <span class="p">}</span> </code></pre></div> </div> <hr /> <h3 id="convert-gpu-to-nvvm">convert-gpu-to-nvvm</h3> <p>Converts GPU dialect operations to NVVM intrinsics.</p> <table> <thead> <tr> <th>GPU Op</th> <th>NVVM Intrinsic</th> </tr> </thead> <tbody> <tr> <td><code class="language-plaintext highlighter-rouge">gpu.thread_id</code></td> <td><code class="language-plaintext highlighter-rouge">nvvm.read.ptx.sreg.tid.*</code></td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">gpu.block_id</code></td> <td><code class="language-plaintext highlighter-rouge">nvvm.read.ptx.sreg.ctaid.*</code></td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">gpu.block_dim</code></td> <td><code class="language-plaintext highlighter-rouge">nvvm.read.ptx.sreg.ntid.*</code></td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">gpu.barrier</code></td> <td><code class="language-plaintext highlighter-rouge">nvvm.barrier0</code></td> </tr> </tbody> </table> <hr /> <h3 id="convert-pipeline-to-nvvm">convert-pipeline-to-nvvm</h3> <p>Converts async pipeline operations to NVVM barrier intrinsics.</p> <table> <thead> <tr> <th>Pipeline Op</th> <th>NVVM Op</th> </tr> </thead> <tbody> <tr> <td><code class="language-plaintext highlighter-rouge">pipeline.producer_acquire</code></td> <td><code class="language-plaintext highlighter-rouge">nvvm.mbarrier.arrive.*</code></td>
</tr> <tr> <td><code class="language-plaintext highlighter-rouge">pipeline.producer_commit</code></td> <td><code class="language-plaintext highlighter-rouge">nvvm.mbarrier.arrive.*</code> + phase</td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">pipeline.consumer_wait</code></td> <td><code class="language-plaintext highlighter-rouge">nvvm.mbarrier.wait.*</code></td> </tr> <tr> <td><code class="language-plaintext highlighter-rouge">pipeline.consumer_release</code></td> <td><code class="language-plaintext highlighter-rouge">nvvm.mbarrier.arrive.*</code></td> </tr> </tbody> </table> <hr /> </details> <details style="font-size: 1.2em; margin: 1em 0;"> <summary style="cursor: pointer; font-weight: bold;">TileAS Optimization Passes (30)</summary> <p>TileAS passes optimize and schedule tile operations.</p> <h3 id="tileas-assign-dot-layouts">tileas-assign-dot-layouts</h3> <p>Assigns MMA-compatible layouts to dot product operands.</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">AssignDotLayouts</span><span class="o">::</span><span class="n">runOnOperation</span><span class="p">()</span> <span class="p">{</span> <span class="n">FuncOp</span> <span class="n">funcOp</span> <span class="o">=</span> <span class="n">getOperation</span><span class="p">();</span> <span class="n">funcOp</span><span class="p">.</span><span class="n">walk</span><span class="p">([</span><span class="o">&amp;</span><span class="p">](</span><span class="n">DotOp</span> <span class="n">dotOp</span><span class="p">)</span> <span class="p">{</span> <span class="k">auto</span> <span class="n">lhsType</span> <span class="o">=</span> <span class="n">dotOp</span><span class="p">.</span><span class="n">getLhs</span><span class="p">().</span><span class="n">getType</span><span class="p">();</span> <span class="k">auto</span> <span class="n">rhsType</span> <span class="o">=</span> <span 
class="n">dotOp</span><span class="p">.</span><span class="n">getRhs</span><span class="p">().</span><span class="n">getType</span><span class="p">();</span> <span class="c1">// Select MMA shape based on types</span> <span class="n">MMAShape</span> <span class="n">mmaShape</span> <span class="o">=</span> <span class="n">selectMMAShape</span><span class="p">(</span><span class="n">lhsType</span><span class="p">,</span> <span class="n">rhsType</span><span class="p">);</span> <span class="c1">// Assign layouts for operands</span> <span class="n">Layout</span> <span class="n">lhsLayout</span> <span class="o">=</span> <span class="n">computeLhsLayout</span><span class="p">(</span><span class="n">mmaShape</span><span class="p">,</span> <span class="n">lhsType</span><span class="p">);</span> <span class="n">Layout</span> <span class="n">rhsLayout</span> <span class="o">=</span> <span class="n">computeRhsLayout</span><span class="p">(</span><span class="n">mmaShape</span><span class="p">,</span> <span class="n">rhsType</span><span class="p">);</span> <span class="n">dotOp</span><span class="o">-&gt;</span><span class="n">setAttr</span><span class="p">(</span><span class="s">"lhs_layout"</span><span class="p">,</span> <span class="n">lhsLayout</span><span class="p">);</span> <span class="n">dotOp</span><span class="o">-&gt;</span><span class="n">setAttr</span><span class="p">(</span><span class="s">"rhs_layout"</span><span class="p">,</span> <span class="n">rhsLayout</span><span class="p">);</span> <span class="p">});</span> <span class="p">}</span> </code></pre></div> </div> <p><strong>MMA shapes:</strong> <code class="language-plaintext highlighter-rouge">m16n8k16</code>, <code class="language-plaintext highlighter-rouge">m16n16k16</code>, <code class="language-plaintext highlighter-rouge">m64n256k64</code></p> <hr /> <h3 id="tileas-assign-load-store-layouts">tileas-assign-load-store-layouts</h3> <p>Optimizes memory access patterns for coalesced loads.</p> <div 
class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">AssignLoadStoreLayouts</span><span class="o">::</span><span class="n">runOnOperation</span><span class="p">()</span> <span class="p">{</span> <span class="n">FuncOp</span> <span class="n">funcOp</span> <span class="o">=</span> <span class="n">getOperation</span><span class="p">();</span> <span class="n">funcOp</span><span class="p">.</span><span class="n">walk</span><span class="p">([</span><span class="o">&amp;</span><span class="p">](</span><span class="n">LoadOp</span> <span class="n">loadOp</span><span class="p">)</span> <span class="p">{</span> <span class="k">auto</span> <span class="n">tensorType</span> <span class="o">=</span> <span class="n">loadOp</span><span class="p">.</span><span class="n">getResult</span><span class="p">().</span><span class="n">getType</span><span class="p">();</span> <span class="c1">// Check for TMA opportunity</span> <span class="k">if</span> <span class="p">(</span><span class="n">canUseTMA</span><span class="p">(</span><span class="n">loadOp</span><span class="p">))</span> <span class="p">{</span> <span class="n">Layout</span> <span class="n">tmaLayout</span> <span class="o">=</span> <span class="n">computeTMALayout</span><span class="p">(</span><span class="n">tensorType</span><span class="p">);</span> <span class="n">loadOp</span><span class="o">-&gt;</span><span class="n">setAttr</span><span class="p">(</span><span class="s">"layout"</span><span class="p">,</span> <span class="n">tmaLayout</span><span class="p">);</span> <span class="n">loadOp</span><span class="o">-&gt;</span><span class="n">setAttr</span><span class="p">(</span><span class="s">"use_tma"</span><span class="p">,</span> <span class="nb">true</span><span class="p">);</span> <span class="p">}</span> <span class="k">else</span> <span class="p">{</span> <span class="c1">// Vectorized load layout</span> <span 
class="n">Layout</span> <span class="n">vecLayout</span> <span class="o">=</span> <span class="n">computeVectorizedLayout</span><span class="p">(</span><span class="n">tensorType</span><span class="p">);</span> <span class="n">loadOp</span><span class="o">-&gt;</span><span class="n">setAttr</span><span class="p">(</span><span class="s">"layout"</span><span class="p">,</span> <span class="n">vecLayout</span><span class="p">);</span> <span class="p">}</span> <span class="p">});</span> <span class="p">}</span> </code></pre></div> </div> <hr /> <h3 id="tileas-assign-pipeline-layouts">tileas-assign-pipeline-layouts</h3> <p>Assigns layouts for async pipeline buffers.</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">AssignPipelineLayouts</span><span class="o">::</span><span class="n">runOnOperation</span><span class="p">()</span> <span class="p">{</span> <span class="n">FuncOp</span> <span class="n">funcOp</span> <span class="o">=</span> <span class="n">getOperation</span><span class="p">();</span> <span class="n">funcOp</span><span class="p">.</span><span class="n">walk</span><span class="p">([</span><span class="o">&amp;</span><span class="p">](</span><span class="n">PipelineOp</span> <span class="n">pipelineOp</span><span class="p">)</span> <span class="p">{</span> <span class="k">for</span> <span class="p">(</span><span class="k">auto</span><span class="o">&amp;</span> <span class="n">stage</span> <span class="o">:</span> <span class="n">pipelineOp</span><span class="p">.</span><span class="n">getStages</span><span class="p">())</span> <span class="p">{</span> <span class="c1">// Assign shared memory layouts for buffers</span> <span class="k">for</span> <span class="p">(</span><span class="k">auto</span> <span class="n">buffer</span> <span class="o">:</span> <span class="n">stage</span><span class="p">.</span><span class="n">getBuffers</span><span class="p">())</span> <span 
class="p">{</span> <span class="n">Layout</span> <span class="n">smemLayout</span> <span class="o">=</span> <span class="n">computeSwizzledLayout</span><span class="p">(</span><span class="n">buffer</span><span class="p">);</span> <span class="n">buffer</span><span class="o">-&gt;</span><span class="n">setAttr</span><span class="p">(</span><span class="s">"layout"</span><span class="p">,</span> <span class="n">smemLayout</span><span class="p">);</span> <span class="p">}</span> <span class="p">}</span> <span class="p">});</span> <span class="p">}</span> </code></pre></div> </div> <hr /> <h3 id="tileas-generate-schedule">tileas-generate-schedule</h3> <p>Generates execution schedule using cost-based or serial scheduler.</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">GenerateSchedule</span><span class="o">::</span><span class="n">runOnOperation</span><span class="p">()</span> <span class="p">{</span> <span class="n">FuncOp</span> <span class="n">funcOp</span> <span class="o">=</span> <span class="n">getOperation</span><span class="p">();</span> <span class="c1">// Build dependency graph</span> <span class="n">DependencyGraph</span> <span class="n">depGraph</span><span class="p">(</span><span class="n">funcOp</span><span class="p">);</span> <span class="c1">// Select scheduler based on options</span> <span class="n">Scheduler</span><span class="o">*</span> <span class="n">scheduler</span><span class="p">;</span> <span class="k">if</span> <span class="p">(</span><span class="n">useCostBasedScheduler</span><span class="p">)</span> <span class="p">{</span> <span class="n">scheduler</span> <span class="o">=</span> <span class="k">new</span> <span class="n">CostBasedScheduler</span><span class="p">(</span><span class="n">depGraph</span><span class="p">);</span> <span class="p">}</span> <span class="k">else</span> <span class="p">{</span> <span class="n">scheduler</span> <span 
class="o">=</span> <span class="k">new</span> <span class="n">SerialScheduler</span><span class="p">(</span><span class="n">depGraph</span><span class="p">);</span> <span class="p">}</span> <span class="c1">// Generate schedule</span> <span class="n">Schedule</span> <span class="n">schedule</span> <span class="o">=</span> <span class="n">scheduler</span><span class="o">-&gt;</span><span class="n">generateSchedule</span><span class="p">();</span> <span class="c1">// Apply schedule to IR</span> <span class="n">applySchedule</span><span class="p">(</span><span class="n">funcOp</span><span class="p">,</span> <span class="n">schedule</span><span class="p">);</span> <span class="p">}</span> </code></pre></div> </div> <p><strong>Scheduler types:</strong></p> <ul> <li><code class="language-plaintext highlighter-rouge">Serial</code>: Topological order</li> <li><code class="language-plaintext highlighter-rouge">CostBased</code>: Latency-aware with heuristics</li> </ul> <hr /> <h3 id="tileas-materialize-schedule">tileas-materialize-schedule</h3> <p>Materializes abstract schedule into warp-specialized IR.</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">MaterializeSchedule</span><span class="o">::</span><span class="n">runOnOperation</span><span class="p">()</span> <span class="p">{</span> <span class="n">FuncOp</span> <span class="n">funcOp</span> <span class="o">=</span> <span class="n">getOperation</span><span class="p">();</span> <span class="n">Schedule</span> <span class="n">schedule</span> <span class="o">=</span> <span class="n">getSchedule</span><span class="p">(</span><span class="n">funcOp</span><span class="p">);</span> <span class="k">if</span> <span class="p">(</span><span class="n">schedule</span><span class="p">.</span><span class="n">getStrategy</span><span class="p">()</span> <span class="o">==</span> <span class="n">Strategy</span><span class="o">::</span><span 
class="n">WarpSpecialize</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// Split into producer/consumer</span> <span class="k">auto</span> <span class="p">[</span><span class="n">producerOps</span><span class="p">,</span> <span class="n">consumerOps</span><span class="p">]</span> <span class="o">=</span> <span class="n">partitionOps</span><span class="p">(</span><span class="n">funcOp</span><span class="p">,</span> <span class="n">schedule</span><span class="p">);</span> <span class="c1">// Create agent regions</span> <span class="n">createAgentRegion</span><span class="p">(</span><span class="n">producerOps</span><span class="p">,</span> <span class="n">AgentRole</span><span class="o">::</span><span class="n">Producer</span><span class="p">);</span> <span class="n">createAgentRegion</span><span class="p">(</span><span class="n">consumerOps</span><span class="p">,</span> <span class="n">AgentRole</span><span class="o">::</span><span class="n">Consumer</span><span class="p">);</span> <span class="c1">// Insert synchronization</span> <span class="n">insertBarriers</span><span class="p">(</span><span class="n">funcOp</span><span class="p">,</span> <span class="n">schedule</span><span class="p">);</span> <span class="p">}</span> <span class="p">}</span> </code></pre></div> </div> <hr /> <h3 id="tileas-materialize-async">tileas-materialize-async</h3> <p>Creates async pipeline structure with multi-buffering.</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">MaterializeAsync</span><span class="o">::</span><span class="n">runOnOperation</span><span class="p">()</span> <span class="p">{</span> <span class="n">FuncOp</span> <span class="n">funcOp</span> <span class="o">=</span> <span class="n">getOperation</span><span class="p">();</span> <span class="kt">int</span> <span class="n">numStages</span> <span class="o">=</span> <span 
class="n">getOption</span><span class="p">(</span><span class="s">"num-stages"</span><span class="p">);</span> <span class="n">funcOp</span><span class="p">.</span><span class="n">walk</span><span class="p">([</span><span class="o">&amp;</span><span class="p">](</span><span class="n">scf</span><span class="o">::</span><span class="n">ForOp</span> <span class="n">forOp</span><span class="p">)</span> <span class="p">{</span> <span class="k">if</span> <span class="p">(</span><span class="n">canPipeline</span><span class="p">(</span><span class="n">forOp</span><span class="p">))</span> <span class="p">{</span> <span class="c1">// Create N buffers for N-stage pipeline</span> <span class="n">SmallVector</span><span class="o">&lt;</span><span class="n">Value</span><span class="o">&gt;</span> <span class="n">buffers</span><span class="p">;</span> <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">numStages</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span> <span class="n">buffers</span><span class="p">.</span><span class="n">push_back</span><span class="p">(</span><span class="n">allocateBuffer</span><span class="p">(</span><span class="n">forOp</span><span class="p">));</span> <span class="p">}</span> <span class="c1">// Transform loop body</span> <span class="n">emitPrologue</span><span class="p">(</span><span class="n">forOp</span><span class="p">,</span> <span class="n">buffers</span><span class="p">);</span> <span class="n">emitSteadyState</span><span class="p">(</span><span class="n">forOp</span><span class="p">,</span> <span class="n">buffers</span><span class="p">);</span> <span class="n">emitEpilogue</span><span class="p">(</span><span class="n">forOp</span><span class="p">,</span> <span 
class="n">buffers</span><span class="p">);</span> <span class="p">}</span> <span class="p">});</span> <span class="p">}</span> </code></pre></div> </div> <hr /> <h3 id="tileas-materialize-convert-layout">tileas-materialize-convert-layout</h3> <p>Expands layout conversions to actual data movement.</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">MaterializeConvertLayout</span><span class="o">::</span><span class="n">runOnOperation</span><span class="p">()</span> <span class="p">{</span> <span class="n">FuncOp</span> <span class="n">funcOp</span> <span class="o">=</span> <span class="n">getOperation</span><span class="p">();</span> <span class="n">funcOp</span><span class="p">.</span><span class="n">walk</span><span class="p">([</span><span class="o">&amp;</span><span class="p">](</span><span class="n">ConvertLayoutOp</span> <span class="n">convertOp</span><span class="p">)</span> <span class="p">{</span> <span class="k">auto</span> <span class="n">srcLayout</span> <span class="o">=</span> <span class="n">getLayout</span><span class="p">(</span><span class="n">convertOp</span><span class="p">.</span><span class="n">getSource</span><span class="p">());</span> <span class="k">auto</span> <span class="n">dstLayout</span> <span class="o">=</span> <span class="n">getLayout</span><span class="p">(</span><span class="n">convertOp</span><span class="p">.</span><span class="n">getResult</span><span class="p">());</span> <span class="c1">// Generate shuffle or shared memory path</span> <span class="k">if</span> <span class="p">(</span><span class="n">canUseShuffles</span><span class="p">(</span><span class="n">srcLayout</span><span class="p">,</span> <span class="n">dstLayout</span><span class="p">))</span> <span class="p">{</span> <span class="n">emitShuffleConversion</span><span class="p">(</span><span class="n">convertOp</span><span class="p">);</span> <span class="p">}</span> 
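<span class="c1">// (Note, added for clarity) canUseShuffles holds when the two layouts keep each</span>
<span class="c1">// element within the same warp, so registers can be exchanged directly with</span>
<span class="c1">// warp shuffles; cross-warp conversions must stage data through shared memory.</span>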
<span class="k">else</span> <span class="p">{</span> <span class="n">emitSharedMemoryConversion</span><span class="p">(</span><span class="n">convertOp</span><span class="p">);</span> <span class="p">}</span> <span class="p">});</span> <span class="p">}</span> </code></pre></div> </div> <hr /> <h3 id="tileas-attach-tma-desc-args">tileas-attach-tma-desc-args</h3> <p>Injects TMA descriptor arguments into kernel signatures.</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">AttachTMADescArgs</span><span class="o">::</span><span class="n">runOnOperation</span><span class="p">()</span> <span class="p">{</span> <span class="n">FuncOp</span> <span class="n">funcOp</span> <span class="o">=</span> <span class="n">getOperation</span><span class="p">();</span> <span class="n">SmallVector</span><span class="o">&lt;</span><span class="n">TMAOp</span><span class="o">&gt;</span> <span class="n">tmaOps</span><span class="p">;</span> <span class="n">funcOp</span><span class="p">.</span><span class="n">walk</span><span class="p">([</span><span class="o">&amp;</span><span class="p">](</span><span class="n">Operation</span><span class="o">*</span> <span class="n">op</span><span class="p">)</span> <span class="p">{</span> <span class="k">if</span> <span class="p">(</span><span class="n">usesTMA</span><span class="p">(</span><span class="n">op</span><span class="p">))</span> <span class="n">tmaOps</span><span class="p">.</span><span class="n">push_back</span><span class="p">(</span><span class="n">cast</span><span class="o">&lt;</span><span class="n">TMAOp</span><span class="o">&gt;</span><span class="p">(</span><span class="n">op</span><span class="p">));</span> <span class="p">});</span> <span class="k">for</span> <span class="p">(</span><span class="k">auto</span><span class="o">&amp;</span> <span class="n">tmaOp</span> <span class="o">:</span> <span class="n">tmaOps</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// Create TMA descriptor type</span> <span class="k">auto</span> <span
class="n">descType</span> <span class="o">=</span> <span class="n">TMADescriptorType</span><span class="o">::</span><span class="n">get</span><span class="p">(</span> <span class="n">tmaOp</span><span class="p">.</span><span class="n">getShape</span><span class="p">(),</span> <span class="n">tmaOp</span><span class="p">.</span><span class="n">getElementType</span><span class="p">(),</span> <span class="n">tmaOp</span><span class="p">.</span><span class="n">getSwizzle</span><span class="p">()</span> <span class="p">);</span> <span class="c1">// Add to function arguments</span> <span class="n">funcOp</span><span class="p">.</span><span class="n">insertArgument</span><span class="p">(</span><span class="n">descType</span><span class="p">,</span> <span class="s">"tma_desc"</span><span class="p">);</span> <span class="p">}</span> <span class="p">}</span> </code></pre></div> </div> <hr /> <h3 id="tileas-slicing">tileas-slicing</h3> <p>Slices tensors for pipelined execution.</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">TileASSlicing</span><span class="o">::</span><span class="n">runOnOperation</span><span class="p">()</span> <span class="p">{</span> <span class="n">FuncOp</span> <span class="n">funcOp</span> <span class="o">=</span> <span class="n">getOperation</span><span class="p">();</span> <span class="n">funcOp</span><span class="p">.</span><span class="n">walk</span><span class="p">([</span><span class="o">&amp;</span><span class="p">](</span><span class="n">LoadOp</span> <span class="n">loadOp</span><span class="p">)</span> <span class="p">{</span> <span class="k">auto</span> <span class="n">tensorType</span> <span class="o">=</span> <span class="n">loadOp</span><span class="p">.</span><span class="n">getResult</span><span class="p">().</span><span class="n">getType</span><span class="p">();</span> <span class="kt">int</span> <span class="n">sliceDim</span> <span 
class="o">=</span> <span class="n">getSliceDimension</span><span class="p">(</span><span class="n">loadOp</span><span class="p">);</span> <span class="kt">int</span> <span class="n">sliceSize</span> <span class="o">=</span> <span class="n">computeSliceSize</span><span class="p">(</span><span class="n">tensorType</span><span class="p">,</span> <span class="n">sliceDim</span><span class="p">);</span> <span class="kt">int</span> <span class="n">numSlices</span> <span class="o">=</span> <span class="n">tensorType</span><span class="p">.</span><span class="n">getDimSize</span><span class="p">(</span><span class="n">sliceDim</span><span class="p">)</span> <span class="o">/</span> <span class="n">sliceSize</span><span class="p">;</span> <span class="c1">// Replace single load with sliced loads</span> <span class="n">SmallVector</span><span class="o">&lt;</span><span class="n">Value</span><span class="o">&gt;</span> <span class="n">slices</span><span class="p">;</span> <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">numSlices</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span> <span class="k">auto</span> <span class="n">slice</span> <span class="o">=</span> <span class="n">builder</span><span class="p">.</span><span class="n">create</span><span class="o">&lt;</span><span class="n">SlicedLoadOp</span><span class="o">&gt;</span><span class="p">(</span> <span class="n">loadOp</span><span class="p">.</span><span class="n">getSource</span><span class="p">(),</span> <span class="n">sliceDim</span><span class="p">,</span> <span class="n">i</span> <span class="o">*</span> <span class="n">sliceSize</span><span class="p">,</span> <span class="n">sliceSize</span> <span class="p">);</span> <span class="n">slices</span><span class="p">.</span><span class="n">push_back</span><span class="p">(</span><span class="n">slice</span><span class="p">);</span> <span class="p">}</span> <span class="p">});</span> <span class="p">}</span> </code></pre></div> </div> <hr /> <h3 id="tileas-plan-cta">tileas-plan-cta</h3> <p>Plans CTA (thread block)
configuration.</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">PlanCTA</span><span class="o">::</span><span class="n">runOnOperation</span><span class="p">()</span> <span class="p">{</span> <span class="n">FuncOp</span> <span class="n">funcOp</span> <span class="o">=</span> <span class="n">getOperation</span><span class="p">();</span> <span class="c1">// Analyze resource requirements</span> <span class="kt">int</span> <span class="n">smemRequired</span> <span class="o">=</span> <span class="n">analyzeSharedMemory</span><span class="p">(</span><span class="n">funcOp</span><span class="p">);</span> <span class="kt">int</span> <span class="n">regsRequired</span> <span class="o">=</span> <span class="n">analyzeRegisters</span><span class="p">(</span><span class="n">funcOp</span><span class="p">);</span> <span class="c1">// Compute optimal CTA shape</span> <span class="n">CTAConfig</span> <span class="n">config</span> <span class="o">=</span> <span class="n">computeCTAConfig</span><span class="p">(</span> <span class="n">smemRequired</span><span class="p">,</span> <span class="n">regsRequired</span><span class="p">,</span> <span class="n">targetOccupancy</span> <span class="p">);</span> <span class="n">funcOp</span><span class="o">-&gt;</span><span class="n">setAttr</span><span class="p">(</span><span class="s">"cta_shape"</span><span class="p">,</span> <span class="n">config</span><span class="p">.</span><span class="n">toAttribute</span><span class="p">());</span> <span class="p">}</span> </code></pre></div> </div> <hr /> <h3 id="tileas-resolve-agent-boundary">tileas-resolve-agent-boundary</h3> <p>Resolves data flow across warp specialization boundaries.</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">ResolveAgentBoundary</span><span class="o">::</span><span 
class="n">runOnOperation</span><span class="p">()</span> <span class="p">{</span> <span class="n">FuncOp</span> <span class="n">funcOp</span> <span class="o">=</span> <span class="n">getOperation</span><span class="p">();</span> <span class="n">funcOp</span><span class="p">.</span><span class="n">walk</span><span class="p">([</span><span class="o">&amp;</span><span class="p">](</span><span class="n">AgentSwitchOp</span> <span class="n">switchOp</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// Identify values crossing boundary</span> <span class="n">SmallVector</span><span class="o">&lt;</span><span class="n">Value</span><span class="o">&gt;</span> <span class="n">crossingValues</span><span class="p">;</span> <span class="k">for</span> <span class="p">(</span><span class="n">Value</span> <span class="n">v</span> <span class="o">:</span> <span class="n">switchOp</span><span class="p">.</span><span class="n">getOperands</span><span class="p">())</span> <span class="p">{</span> <span class="k">if</span> <span class="p">(</span><span class="n">crossesBoundary</span><span class="p">(</span><span class="n">v</span><span class="p">,</span> <span class="n">switchOp</span><span class="p">))</span> <span class="p">{</span> <span class="n">crossingValues</span><span class="p">.</span><span class="n">push_back</span><span class="p">(</span><span class="n">v</span><span class="p">);</span> <span class="p">}</span> <span class="p">}</span> <span class="c1">// Insert shared memory communication</span> <span class="k">for</span> <span class="p">(</span><span class="n">Value</span> <span class="n">v</span> <span class="o">:</span> <span class="n">crossingValues</span><span class="p">)</span> <span class="p">{</span> <span class="n">insertSharedMemoryTransfer</span><span class="p">(</span><span class="n">v</span><span class="p">,</span> <span class="n">switchOp</span><span class="p">);</span> <span class="p">}</span> <span class="p">});</span> <span 
class="p">}</span> </code></pre></div> </div> <hr /> <h3 id="tileas-remove-buffer-alias">tileas-remove-buffer-alias</h3> <p>Removes buffer aliasing using fixed-point iteration.</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">RemoveBufferAlias</span><span class="o">::</span><span class="n">runOnOperation</span><span class="p">()</span> <span class="p">{</span> <span class="n">FuncOp</span> <span class="n">funcOp</span> <span class="o">=</span> <span class="n">getOperation</span><span class="p">();</span> <span class="kt">bool</span> <span class="n">changed</span> <span class="o">=</span> <span class="nb">true</span><span class="p">;</span> <span class="k">while</span> <span class="p">(</span><span class="n">changed</span><span class="p">)</span> <span class="p">{</span> <span class="n">changed</span> <span class="o">=</span> <span class="nb">false</span><span class="p">;</span> <span class="n">funcOp</span><span class="p">.</span><span class="n">walk</span><span class="p">([</span><span class="o">&amp;</span><span class="p">](</span><span class="n">AllocTensorOp</span> <span class="n">allocOp</span><span class="p">)</span> <span class="p">{</span> <span class="k">for</span> <span class="p">(</span><span class="k">auto</span><span class="o">&amp;</span> <span class="n">use</span> <span class="o">:</span> <span class="n">allocOp</span><span class="p">.</span><span class="n">getResult</span><span class="p">().</span><span class="n">getUses</span><span class="p">())</span> <span class="p">{</span> <span class="k">if</span> <span class="p">(</span><span class="n">isAliasingUse</span><span class="p">(</span><span class="n">use</span><span class="p">))</span> <span class="p">{</span> <span class="n">createNonAliasingBuffer</span><span class="p">(</span><span class="n">use</span><span class="p">);</span> <span class="n">changed</span> <span class="o">=</span> <span 
class="nb">true</span><span class="p">;</span> <span class="p">}</span> <span class="p">}</span> <span class="p">});</span> <span class="p">}</span> <span class="p">}</span> </code></pre></div> </div> <hr /> <h3 id="tileas-remove-dead-args">tileas-remove-dead-args</h3> <p>Removes unused arguments from region operations.</p> <hr /> <h3 id="tileas-remove-layout-conversions">tileas-remove-layout-conversions</h3> <p>Eliminates redundant layout conversions.</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">RemoveLayoutConversions</span><span class="o">::</span><span class="n">runOnOperation</span><span class="p">()</span> <span class="p">{</span> <span class="n">FuncOp</span> <span class="n">funcOp</span> <span class="o">=</span> <span class="n">getOperation</span><span class="p">();</span> <span class="n">funcOp</span><span class="p">.</span><span class="n">walk</span><span class="p">([</span><span class="o">&amp;</span><span class="p">](</span><span class="n">ConvertLayoutOp</span> <span class="n">convertOp</span><span class="p">)</span> <span class="p">{</span> <span class="k">auto</span> <span class="n">srcLayout</span> <span class="o">=</span> <span class="n">getLayout</span><span class="p">(</span><span class="n">convertOp</span><span class="p">.</span><span class="n">getSource</span><span class="p">());</span> <span class="k">auto</span> <span class="n">dstLayout</span> <span class="o">=</span> <span class="n">getLayout</span><span class="p">(</span><span class="n">convertOp</span><span class="p">.</span><span class="n">getResult</span><span class="p">());</span> <span class="c1">// Remove identity conversions</span> <span class="k">if</span> <span class="p">(</span><span class="n">srcLayout</span> <span class="o">==</span> <span class="n">dstLayout</span><span class="p">)</span> <span class="p">{</span> <span class="n">convertOp</span><span class="p">.</span><span 
class="n">replaceAllUsesWith</span><span class="p">(</span><span class="n">convertOp</span><span class="p">.</span><span class="n">getSource</span><span class="p">());</span> <span class="n">convertOp</span><span class="p">.</span><span class="n">erase</span><span class="p">();</span> <span class="p">}</span> <span class="p">});</span> <span class="p">}</span> </code></pre></div> </div> <hr /> <h3 id="tileas-optimize-alloc-tensor">tileas-optimize-alloc-tensor</h3> <p>Optimizes tensor allocations through reuse and elimination.</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">OptimizeAllocTensor</span><span class="o">::</span><span class="n">runOnOperation</span><span class="p">()</span> <span class="p">{</span> <span class="n">FuncOp</span> <span class="n">funcOp</span> <span class="o">=</span> <span class="n">getOperation</span><span class="p">();</span> <span class="n">LivenessAnalysis</span> <span class="n">liveness</span><span class="p">(</span><span class="n">funcOp</span><span class="p">);</span> <span class="n">SmallVector</span><span class="o">&lt;</span><span class="n">AllocTensorOp</span><span class="o">&gt;</span> <span class="n">allocs</span><span class="p">;</span> <span class="n">funcOp</span><span class="p">.</span><span class="n">walk</span><span class="p">([</span><span class="o">&amp;</span><span class="p">](</span><span class="n">AllocTensorOp</span> <span class="n">op</span><span class="p">)</span> <span class="p">{</span> <span class="n">allocs</span><span class="p">.</span><span class="n">push_back</span><span class="p">(</span><span class="n">op</span><span class="p">);</span> <span class="p">});</span> <span class="k">for</span> <span class="p">(</span><span class="k">auto</span><span class="o">&amp;</span> <span class="n">alloc</span> <span class="o">:</span> <span class="n">allocs</span><span class="p">)</span> <span class="p">{</span> <span 
class="c1">// Find reusable buffer</span> <span class="k">if</span> <span class="p">(</span><span class="k">auto</span> <span class="n">reusable</span> <span class="o">=</span> <span class="n">findReusableBuffer</span><span class="p">(</span><span class="n">alloc</span><span class="p">,</span> <span class="n">liveness</span><span class="p">))</span> <span class="p">{</span> <span class="n">reuseBuffer</span><span class="p">(</span><span class="n">alloc</span><span class="p">,</span> <span class="n">reusable</span><span class="p">);</span> <span class="p">}</span> <span class="p">}</span> <span class="p">}</span> </code></pre></div> </div> <hr /> <h3 id="tileas-optimize-reduce">tileas-optimize-reduce</h3> <p>Optimizes reduction operations with warp shuffle or shared memory.</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">OptimizeReduce</span><span class="o">::</span><span class="n">runOnOperation</span><span class="p">()</span> <span class="p">{</span> <span class="n">FuncOp</span> <span class="n">funcOp</span> <span class="o">=</span> <span class="n">getOperation</span><span class="p">();</span> <span class="n">funcOp</span><span class="p">.</span><span class="n">walk</span><span class="p">([</span><span class="o">&amp;</span><span class="p">](</span><span class="n">ReduceOp</span> <span class="n">reduceOp</span><span class="p">)</span> <span class="p">{</span> <span class="kt">int</span> <span class="n">reductionSize</span> <span class="o">=</span> <span class="n">getReductionSize</span><span class="p">(</span><span class="n">reduceOp</span><span class="p">);</span> <span class="k">if</span> <span class="p">(</span><span class="n">reductionSize</span> <span class="o">&lt;=</span> <span class="mi">32</span><span class="p">)</span> <span class="p">{</span> <span class="n">setAtom</span><span class="p">(</span><span class="n">reduceOp</span><span class="p">,</span> <span 
class="s">"warp_shuffle"</span><span class="p">);</span> <span class="p">}</span> <span class="k">else</span> <span class="nf">if</span> <span class="p">(</span><span class="n">reductionSize</span> <span class="o">&lt;=</span> <span class="mi">1024</span><span class="p">)</span> <span class="p">{</span> <span class="n">setAtom</span><span class="p">(</span><span class="n">reduceOp</span><span class="p">,</span> <span class="s">"shared_memory"</span><span class="p">);</span> <span class="p">}</span> <span class="k">else</span> <span class="p">{</span> <span class="n">setAtom</span><span class="p">(</span><span class="n">reduceOp</span><span class="p">,</span> <span class="s">"multi_stage"</span><span class="p">);</span> <span class="p">}</span> <span class="p">});</span> <span class="p">}</span> </code></pre></div> </div> <hr /> <h3 id="tileas-optimize-dot-accumulation">tileas-optimize-dot-accumulation</h3> <p>Optimizes MMA accumulation patterns for better register utilization.</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">OptimizeDotAccumulation</span><span class="o">::</span><span class="n">runOnOperation</span><span class="p">()</span> <span class="p">{</span> <span class="n">FuncOp</span> <span class="n">funcOp</span> <span class="o">=</span> <span class="n">getOperation</span><span class="p">();</span> <span class="n">funcOp</span><span class="p">.</span><span class="n">walk</span><span class="p">([</span><span class="o">&amp;</span><span class="p">](</span><span class="n">DotOp</span> <span class="n">dotOp</span><span class="p">)</span> <span class="p">{</span> <span class="k">auto</span> <span class="n">accumPattern</span> <span class="o">=</span> <span class="n">analyzeAccumulationPattern</span><span class="p">(</span><span class="n">dotOp</span><span class="p">);</span> <span class="k">switch</span> <span class="p">(</span><span 
class="n">accumPattern</span><span class="p">)</span> <span class="p">{</span> <span class="k">case</span> <span class="n">AccumPattern</span><span class="o">::</span><span class="n">SimpleLoop</span><span class="p">:</span> <span class="n">optimizeSimpleAccumulation</span><span class="p">(</span><span class="n">dotOp</span><span class="p">);</span> <span class="k">break</span><span class="p">;</span> <span class="k">case</span> <span class="n">AccumPattern</span><span class="o">::</span><span class="n">SplitK</span><span class="p">:</span> <span class="n">optimizeSplitKAccumulation</span><span class="p">(</span><span class="n">dotOp</span><span class="p">);</span> <span class="k">break</span><span class="p">;</span> <span class="k">case</span> <span class="n">AccumPattern</span><span class="o">::</span><span class="n">StreamK</span><span class="p">:</span> <span class="n">optimizeStreamKAccumulation</span><span class="p">(</span><span class="n">dotOp</span><span class="p">);</span> <span class="k">break</span><span class="p">;</span> <span class="p">}</span> <span class="p">});</span> <span class="p">}</span> </code></pre></div> </div> <hr /> <h3 id="tileas-recompute-for-scheduling">tileas-recompute-for-scheduling</h3> <p>Trades recomputation for reduced register pressure.</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">TileASRecomputeForScheduling</span><span class="o">::</span><span class="n">runOnOperation</span><span class="p">()</span> <span class="p">{</span> <span class="n">FuncOp</span> <span class="n">funcOp</span> <span class="o">=</span> <span class="n">getOperation</span><span class="p">();</span> <span class="n">RegisterPressureAnalysis</span> <span class="n">regPressure</span><span class="p">(</span><span class="n">funcOp</span><span class="p">);</span> <span class="n">funcOp</span><span class="p">.</span><span class="n">walk</span><span 
class="p">([</span><span class="o">&amp;</span><span class="p">](</span><span class="n">Operation</span><span class="o">*</span> <span class="n">op</span><span class="p">)</span> <span class="p">{</span> <span class="k">for</span> <span class="p">(</span><span class="n">Value</span> <span class="n">result</span> <span class="o">:</span> <span class="n">op</span><span class="o">-&gt;</span><span class="n">getResults</span><span class="p">())</span> <span class="p">{</span> <span class="k">if</span> <span class="p">(</span><span class="n">shouldRecompute</span><span class="p">(</span><span class="n">result</span><span class="p">,</span> <span class="n">regPressure</span><span class="p">))</span> <span class="p">{</span> <span class="n">markForRecomputation</span><span class="p">(</span><span class="n">result</span><span class="p">);</span> <span class="p">}</span> <span class="p">}</span> <span class="p">});</span> <span class="n">applyRecomputations</span><span class="p">(</span><span class="n">funcOp</span><span class="p">);</span> <span class="p">}</span> <span class="kt">bool</span> <span class="nf">shouldRecompute</span><span class="p">(</span><span class="n">Value</span> <span class="n">v</span><span class="p">,</span> <span class="n">RegisterPressureAnalysis</span><span class="o">&amp;</span> <span class="n">rpa</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// Recompute if value is cheap but keeping it live causes spills</span> <span class="kt">int</span> <span class="n">computeCost</span> <span class="o">=</span> <span class="n">estimateComputeCost</span><span class="p">(</span><span class="n">v</span><span class="p">.</span><span class="n">getDefiningOp</span><span class="p">());</span> <span class="kt">int</span> <span class="n">spillCost</span> <span class="o">=</span> <span class="n">rpa</span><span class="p">.</span><span class="n">estimateSpillCost</span><span class="p">(</span><span class="n">v</span><span class="p">);</span> 
<span class="k">return</span> <span class="n">computeCost</span> <span class="o">&lt;</span> <span class="n">spillCost</span><span class="p">;</span> <span class="p">}</span> </code></pre></div> </div> <hr /> <h3 id="tileas-legalize-fma-dot">tileas-legalize-fma-dot</h3> <p>Ensures FMA operations match hardware capabilities.</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">LegalizeFmaDot</span><span class="o">::</span><span class="n">runOnOperation</span><span class="p">()</span> <span class="p">{</span> <span class="n">FuncOp</span> <span class="n">funcOp</span> <span class="o">=</span> <span class="n">getOperation</span><span class="p">();</span> <span class="n">funcOp</span><span class="p">.</span><span class="n">walk</span><span class="p">([</span><span class="o">&amp;</span><span class="p">](</span><span class="n">DotOp</span> <span class="n">dotOp</span><span class="p">)</span> <span class="p">{</span> <span class="k">if</span> <span class="p">(</span><span class="n">hasFmaAccumulation</span><span class="p">(</span><span class="n">dotOp</span><span class="p">))</span> <span class="p">{</span> <span class="n">legalizeFma</span><span class="p">(</span><span class="n">dotOp</span><span class="p">);</span> <span class="p">}</span> <span class="p">});</span> <span class="p">}</span> <span class="kt">void</span> <span class="nf">legalizeFma</span><span class="p">(</span><span class="n">DotOp</span> <span class="n">dotOp</span><span class="p">)</span> <span class="p">{</span> <span class="k">auto</span> <span class="n">accType</span> <span class="o">=</span> <span class="n">dotOp</span><span class="p">.</span><span class="n">getAccumulator</span><span class="p">().</span><span class="n">getType</span><span class="p">();</span> <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">isLegalAccumulatorType</span><span class="p">(</span><span 
class="n">accType</span><span class="p">))</span> <span class="p">{</span> <span class="k">auto</span> <span class="n">legalType</span> <span class="o">=</span> <span class="n">getLegalAccumulatorType</span><span class="p">(</span><span class="n">accType</span><span class="p">);</span> <span class="n">insertAccumulatorConversion</span><span class="p">(</span><span class="n">dotOp</span><span class="p">,</span> <span class="n">legalType</span><span class="p">);</span> <span class="p">}</span> <span class="k">if</span> <span class="p">(</span><span class="n">isMixedPrecision</span><span class="p">(</span><span class="n">dotOp</span><span class="p">))</span> <span class="p">{</span> <span class="n">legalizeMixedPrecisionFma</span><span class="p">(</span><span class="n">dotOp</span><span class="p">);</span> <span class="p">}</span> <span class="p">}</span> </code></pre></div> </div> <hr /> <h3 id="tileas-legalize-reduce">tileas-legalize-reduce</h3> <p>Ensures reductions use supported types and sizes.</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">LegalizeReduce</span><span class="o">::</span><span class="n">runOnOperation</span><span class="p">()</span> <span class="p">{</span> <span class="n">FuncOp</span> <span class="n">funcOp</span> <span class="o">=</span> <span class="n">getOperation</span><span class="p">();</span> <span class="n">funcOp</span><span class="p">.</span><span class="n">walk</span><span class="p">([</span><span class="o">&amp;</span><span class="p">](</span><span class="n">ReduceOp</span> <span class="n">reduceOp</span><span class="p">)</span> <span class="p">{</span> <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">isLegalReduction</span><span class="p">(</span><span class="n">reduceOp</span><span class="p">))</span> <span class="p">{</span> <span class="n">legalizeReduction</span><span class="p">(</span><span 
class="n">reduceOp</span><span class="p">);</span> <span class="p">}</span> <span class="p">});</span> <span class="p">}</span> <span class="kt">void</span> <span class="nf">legalizeReduction</span><span class="p">(</span><span class="n">ReduceOp</span> <span class="n">reduceOp</span><span class="p">)</span> <span class="p">{</span> <span class="k">auto</span> <span class="n">inputType</span> <span class="o">=</span> <span class="n">reduceOp</span><span class="p">.</span><span class="n">getInput</span><span class="p">().</span><span class="n">getType</span><span class="p">();</span> <span class="k">auto</span> <span class="n">reductionKind</span> <span class="o">=</span> <span class="n">reduceOp</span><span class="p">.</span><span class="n">getReductionKind</span><span class="p">();</span> <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">isSupportedElementType</span><span class="p">(</span><span class="n">inputType</span><span class="p">.</span><span class="n">getElementType</span><span class="p">()))</span> <span class="p">{</span> <span class="n">insertTypeConversion</span><span class="p">(</span><span class="n">reduceOp</span><span class="p">);</span> <span class="p">}</span> <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">isSupportedReductionSize</span><span class="p">(</span><span class="n">inputType</span><span class="p">,</span> <span class="n">reduceOp</span><span class="p">.</span><span class="n">getReductionDim</span><span class="p">()))</span> <span class="p">{</span> <span class="n">splitReduction</span><span class="p">(</span><span class="n">reduceOp</span><span class="p">);</span> <span class="p">}</span> <span class="p">}</span> </code></pre></div> </div> <hr /> <h3 id="tileas-legalize-tmem-copy">tileas-legalize-tmem-copy</h3> <p>Legalizes tensor memory (tmem) copy operations. 
Tensor memory is dedicated storage for tensor core operands.</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">TileASLegalizeTmemCopy</span><span class="o">::</span><span class="n">runOnOperation</span><span class="p">()</span> <span class="p">{</span> <span class="n">FuncOp</span> <span class="n">funcOp</span> <span class="o">=</span> <span class="n">getOperation</span><span class="p">();</span> <span class="n">funcOp</span><span class="p">.</span><span class="n">walk</span><span class="p">([</span><span class="o">&amp;</span><span class="p">](</span><span class="n">Operation</span><span class="o">*</span> <span class="n">op</span><span class="p">)</span> <span class="p">{</span> <span class="k">if</span> <span class="p">(</span><span class="k">auto</span> <span class="n">copyOp</span> <span class="o">=</span> <span class="n">dyn_cast</span><span class="o">&lt;</span><span class="n">CopyOp</span><span class="o">&gt;</span><span class="p">(</span><span class="n">op</span><span class="p">))</span> <span class="p">{</span> <span class="k">if</span> <span class="p">(</span><span class="n">involvesTmem</span><span class="p">(</span><span class="n">copyOp</span><span class="p">))</span> <span class="p">{</span> <span class="n">legalizeTmemCopy</span><span class="p">(</span><span class="n">copyOp</span><span class="p">);</span> <span class="p">}</span> <span class="p">}</span> <span class="p">});</span> <span class="p">}</span> <span class="kt">void</span> <span class="nf">legalizeTmemCopy</span><span class="p">(</span><span class="n">CopyOp</span> <span class="n">copyOp</span><span class="p">)</span> <span class="p">{</span> <span class="k">auto</span> <span class="n">srcLayout</span> <span class="o">=</span> <span class="n">getLayout</span><span class="p">(</span><span class="n">copyOp</span><span class="p">.</span><span class="n">getSource</span><span class="p">());</span> 
<span class="k">auto</span> <span class="n">dstLayout</span> <span class="o">=</span> <span class="n">getLayout</span><span class="p">(</span><span class="n">copyOp</span><span class="p">.</span><span class="n">getDest</span><span class="p">());</span> <span class="c1">// Infer register layout from tmem layout</span> <span class="k">auto</span> <span class="n">regLayout</span> <span class="o">=</span> <span class="n">inferRegisterLayoutFromTmem</span><span class="p">(</span><span class="n">srcLayout</span><span class="p">);</span> <span class="c1">// Insert necessary layout conversions</span> <span class="k">if</span> <span class="p">(</span><span class="n">needsConversion</span><span class="p">(</span><span class="n">srcLayout</span><span class="p">,</span> <span class="n">regLayout</span><span class="p">))</span> <span class="p">{</span> <span class="n">insertLayoutConversion</span><span class="p">(</span><span class="n">copyOp</span><span class="p">,</span> <span class="n">srcLayout</span><span class="p">,</span> <span class="n">regLayout</span><span class="p">);</span> <span class="p">}</span> <span class="p">}</span> </code></pre></div> </div> <hr /> <h3 id="tileas-slice-and-fuse">tileas-slice-and-fuse</h3> <p>Applies loop tiling (slicing) and fusion for improved data locality.</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">SliceAndFuse</span><span class="o">::</span><span class="n">runOnOperation</span><span class="p">()</span> <span class="p">{</span> <span class="n">FuncOp</span> <span class="n">funcOp</span> <span class="o">=</span> <span class="n">getOperation</span><span class="p">();</span> <span class="n">SmallVector</span><span class="o">&lt;</span><span class="n">FusionGroup</span><span class="o">&gt;</span> <span class="n">fusionGroups</span><span class="p">;</span> <span class="n">collectFusionCandidates</span><span class="p">(</span><span 
class="n">funcOp</span><span class="p">,</span> <span class="n">fusionGroups</span><span class="p">);</span> <span class="k">for</span> <span class="p">(</span><span class="k">auto</span><span class="o">&amp;</span> <span class="n">group</span> <span class="o">:</span> <span class="n">fusionGroups</span><span class="p">)</span> <span class="p">{</span> <span class="k">auto</span> <span class="n">sliceSize</span> <span class="o">=</span> <span class="n">computeOptimalSliceSize</span><span class="p">(</span><span class="n">group</span><span class="p">);</span> <span class="n">sliceOperations</span><span class="p">(</span><span class="n">group</span><span class="p">,</span> <span class="n">sliceSize</span><span class="p">);</span> <span class="n">fuseOperations</span><span class="p">(</span><span class="n">group</span><span class="p">);</span> <span class="p">}</span> <span class="p">}</span> <span class="kt">void</span> <span class="nf">fuseOperations</span><span class="p">(</span><span class="n">FusionGroup</span><span class="o">&amp;</span> <span class="n">group</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// Create fused loop nest</span> <span class="c1">// - Single loop iterating over slices</span> <span class="c1">// - Multiple operations per slice iteration</span> <span class="k">auto</span> <span class="n">fusedLoop</span> <span class="o">=</span> <span class="n">createFusedLoop</span><span class="p">(</span><span class="n">group</span><span class="p">);</span> <span class="k">for</span> <span class="p">(</span><span class="k">auto</span><span class="o">*</span> <span class="n">op</span> <span class="o">:</span> <span class="n">group</span><span class="p">.</span><span class="n">getOperations</span><span class="p">())</span> <span class="p">{</span> <span class="n">moveIntoFusedLoop</span><span class="p">(</span><span class="n">op</span><span class="p">,</span> <span class="n">fusedLoop</span><span class="p">);</span> <span 
class="p">}</span> <span class="p">}</span> </code></pre></div> </div> <hr /> <h3 id="tileas-refine-atom-by-resource">tileas-refine-atom-by-resource</h3> <p>Adjusts operation granularity (“atom”) based on available hardware resources.</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">RefineAtomByResource</span><span class="o">::</span><span class="n">runOnOperation</span><span class="p">()</span> <span class="p">{</span> <span class="n">FuncOp</span> <span class="n">funcOp</span> <span class="o">=</span> <span class="n">getOperation</span><span class="p">();</span> <span class="k">auto</span> <span class="n">resources</span> <span class="o">=</span> <span class="n">getTargetResources</span><span class="p">(</span><span class="n">funcOp</span><span class="p">);</span> <span class="n">funcOp</span><span class="p">.</span><span class="n">walk</span><span class="p">([</span><span class="o">&amp;</span><span class="p">](</span><span class="n">Operation</span><span class="o">*</span> <span class="n">op</span><span class="p">)</span> <span class="p">{</span> <span class="k">if</span> <span class="p">(</span><span class="n">hasAtomAttribute</span><span class="p">(</span><span class="n">op</span><span class="p">))</span> <span class="p">{</span> <span class="n">refineAtom</span><span class="p">(</span><span class="n">op</span><span class="p">,</span> <span class="n">resources</span><span class="p">);</span> <span class="p">}</span> <span class="p">});</span> <span class="p">}</span> <span class="kt">void</span> <span class="nf">refineAtom</span><span class="p">(</span><span class="n">Operation</span><span class="o">*</span> <span class="n">op</span><span class="p">,</span> <span class="n">ResourceConstraints</span><span class="o">&amp;</span> <span class="n">resources</span><span class="p">)</span> <span class="p">{</span> <span class="k">auto</span> <span 
class="n">currentAtom</span> <span class="o">=</span> <span class="n">getAtom</span><span class="p">(</span><span class="n">op</span><span class="p">);</span> <span class="kt">int</span> <span class="n">smemRequired</span> <span class="o">=</span> <span class="n">estimateSmemUsage</span><span class="p">(</span><span class="n">op</span><span class="p">,</span> <span class="n">currentAtom</span><span class="p">);</span> <span class="kt">int</span> <span class="n">regsRequired</span> <span class="o">=</span> <span class="n">estimateRegUsage</span><span class="p">(</span><span class="n">op</span><span class="p">,</span> <span class="n">currentAtom</span><span class="p">);</span> <span class="c1">// Refine if over resource limits (SM120: 228KB smem, 65536 regs)</span> <span class="k">if</span> <span class="p">(</span><span class="n">smemRequired</span> <span class="o">&gt;</span> <span class="n">resources</span><span class="p">.</span><span class="n">maxSmem</span> <span class="o">||</span> <span class="n">regsRequired</span> <span class="o">&gt;</span> <span class="n">resources</span><span class="p">.</span><span class="n">maxRegs</span><span class="p">)</span> <span class="p">{</span> <span class="k">auto</span> <span class="n">refinedAtom</span> <span class="o">=</span> <span class="n">findSmallerAtom</span><span class="p">(</span><span class="n">op</span><span class="p">,</span> <span class="n">resources</span><span class="p">);</span> <span class="n">setAtom</span><span class="p">(</span><span class="n">op</span><span class="p">,</span> <span class="n">refinedAtom</span><span class="p">);</span> <span class="p">}</span> <span class="p">}</span> </code></pre></div> </div> <hr /> <h3 id="tileas-prepare-for-scheduling">tileas-prepare-for-scheduling</h3> <p>Normalizes IR and annotates operation latencies for the scheduler.</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span 
class="n">PrepareForScheduling</span><span class="o">::</span><span class="n">runOnOperation</span><span class="p">()</span> <span class="p">{</span> <span class="n">FuncOp</span> <span class="n">funcOp</span> <span class="o">=</span> <span class="n">getOperation</span><span class="p">();</span> <span class="n">normalizeLoops</span><span class="p">(</span><span class="n">funcOp</span><span class="p">);</span> <span class="n">insertSchedulingAnchors</span><span class="p">(</span><span class="n">funcOp</span><span class="p">);</span> <span class="n">annotateLatencies</span><span class="p">(</span><span class="n">funcOp</span><span class="p">);</span> <span class="n">identifyBarriers</span><span class="p">(</span><span class="n">funcOp</span><span class="p">);</span> <span class="p">}</span> <span class="kt">void</span> <span class="nf">annotateLatencies</span><span class="p">(</span><span class="n">FuncOp</span> <span class="n">funcOp</span><span class="p">)</span> <span class="p">{</span> <span class="n">funcOp</span><span class="p">.</span><span class="n">walk</span><span class="p">([</span><span class="o">&amp;</span><span class="p">](</span><span class="n">Operation</span><span class="o">*</span> <span class="n">op</span><span class="p">)</span> <span class="p">{</span> <span class="kt">int</span> <span class="n">latency</span> <span class="o">=</span> <span class="n">estimateLatency</span><span class="p">(</span><span class="n">op</span><span class="p">);</span> <span class="n">op</span><span class="o">-&gt;</span><span class="n">setAttr</span><span class="p">(</span><span class="s">"sched.latency"</span><span class="p">,</span> <span class="n">builder</span><span class="p">.</span><span class="n">getI64IntegerAttr</span><span class="p">(</span><span class="n">latency</span><span class="p">));</span> <span class="p">});</span> <span class="p">}</span> </code></pre></div> </div> <hr /> <h3 id="tileas-unroll-register-loops">tileas-unroll-register-loops</h3> 
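<p>Before the pass pseudocode, here is a toy sketch of what full unrolling does to a loop over a register-resident tile. This is illustrative Python only, not TileIR internals; <code class="language-plaintext highlighter-rouge">unroll_register_loop</code> and its template format are hypothetical names invented for this example.</p>

```python
# Toy model of full unrolling (hypothetical helper, not TileIR code):
# a loop whose body indexes a register tile with the induction variable
# is replaced by one statement per iteration, each using a constant index.
def unroll_register_loop(trip_count, body_template):
    """body_template uses {i} where the induction variable appeared."""
    return [body_template.format(i=i) for i in range(trip_count)]

# A 4-iteration loop over a register-resident accumulator becomes
# four statically indexed statements:
unrolled = unroll_register_loop(4, "acc[{i}] += a[{i}] * b[{i}]")
print(unrolled[0])  # acc[0] += a[0] * b[0]
```

<p>The point of the sketch: after unrolling, every register access uses a compile-time constant index, which is what the hardware requires.</p>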
<p>Unrolls loops that access register-resident tensors (required since GPU registers cannot be dynamically indexed).</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">TileASUnrollRegisterLoops</span><span class="o">::</span><span class="n">runOnOperation</span><span class="p">()</span> <span class="p">{</span> <span class="n">FuncOp</span> <span class="n">funcOp</span> <span class="o">=</span> <span class="n">getOperation</span><span class="p">();</span> <span class="n">funcOp</span><span class="p">.</span><span class="n">walk</span><span class="p">([</span><span class="o">&amp;</span><span class="p">](</span><span class="n">scf</span><span class="o">::</span><span class="n">ForOp</span> <span class="n">forOp</span><span class="p">)</span> <span class="p">{</span> <span class="k">if</span> <span class="p">(</span><span class="n">accessesRegisterTensors</span><span class="p">(</span><span class="n">forOp</span><span class="p">))</span> <span class="p">{</span> <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">canAvoidUnroll</span><span class="p">(</span><span class="n">forOp</span><span class="p">))</span> <span class="p">{</span> <span class="c1">// Must unroll - register tensors require static indexing</span> <span class="n">unrollLoop</span><span class="p">(</span><span class="n">forOp</span><span class="p">);</span> <span class="p">}</span> <span class="p">}</span> <span class="p">});</span> <span class="p">}</span> <span class="kt">bool</span> <span class="nf">accessesRegisterTensors</span><span class="p">(</span><span class="n">scf</span><span class="o">::</span><span class="n">ForOp</span> <span class="n">forOp</span><span class="p">)</span> <span class="p">{</span> <span class="kt">bool</span> <span class="n">accessesRegs</span> <span class="o">=</span> <span class="nb">false</span><span class="p">;</span> <span 
class="n">forOp</span><span class="p">.</span><span class="n">walk</span><span class="p">([</span><span class="o">&amp;</span><span class="p">](</span><span class="n">Operation</span><span class="o">*</span> <span class="n">op</span><span class="p">)</span> <span class="p">{</span> <span class="k">for</span> <span class="p">(</span><span class="n">Value</span> <span class="n">operand</span> <span class="o">:</span> <span class="n">op</span><span class="o">-&gt;</span><span class="n">getOperands</span><span class="p">())</span> <span class="p">{</span> <span class="k">if</span> <span class="p">(</span><span class="n">isRegisterTensor</span><span class="p">(</span><span class="n">operand</span><span class="p">))</span> <span class="p">{</span> <span class="n">accessesRegs</span> <span class="o">=</span> <span class="nb">true</span><span class="p">;</span> <span class="p">}</span> <span class="p">}</span> <span class="p">});</span> <span class="k">return</span> <span class="n">accessesRegs</span><span class="p">;</span> <span class="p">}</span> </code></pre></div> </div> <hr /> <h3 id="tileas-unspecialized-pipeline">tileas-unspecialized-pipeline</h3> <p>Implements software pipelining without warp specialization (all warps do both load and compute).</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">TileASUnspecializedPipeline</span><span class="o">::</span><span class="n">runOnOperation</span><span class="p">()</span> <span class="p">{</span> <span class="n">FuncOp</span> <span class="n">funcOp</span> <span class="o">=</span> <span class="n">getOperation</span><span class="p">();</span> <span class="kt">int</span> <span class="n">numStages</span> <span class="o">=</span> <span class="n">getOption</span><span class="o">&lt;</span><span class="kt">int</span><span class="o">&gt;</span><span class="p">(</span><span class="s">"unspecialized-pipeline-num-stages"</span><span 
class="p">);</span> <span class="n">funcOp</span><span class="p">.</span><span class="n">walk</span><span class="p">([</span><span class="o">&amp;</span><span class="p">](</span><span class="n">scf</span><span class="o">::</span><span class="n">ForOp</span> <span class="n">forOp</span><span class="p">)</span> <span class="p">{</span> <span class="k">if</span> <span class="p">(</span><span class="n">canPipeline</span><span class="p">(</span><span class="n">forOp</span><span class="p">))</span> <span class="p">{</span> <span class="n">applySoftwarePipelining</span><span class="p">(</span><span class="n">forOp</span><span class="p">,</span> <span class="n">numStages</span><span class="p">);</span> <span class="p">}</span> <span class="p">});</span> <span class="p">}</span> <span class="kt">void</span> <span class="nf">applySoftwarePipelining</span><span class="p">(</span><span class="n">scf</span><span class="o">::</span><span class="n">ForOp</span> <span class="n">forOp</span><span class="p">,</span> <span class="kt">int</span> <span class="n">numStages</span><span class="p">)</span> <span class="p">{</span> <span class="n">emitPrologue</span><span class="p">(</span><span class="n">forOp</span><span class="p">,</span> <span class="n">numStages</span><span class="p">);</span> <span class="c1">// Pre-load data for first N iterations</span> <span class="n">emitSteadyState</span><span class="p">(</span><span class="n">forOp</span><span class="p">,</span> <span class="n">numStages</span><span class="p">);</span> <span class="c1">// Overlap load(i+N) with compute(i)</span> <span class="n">emitEpilogue</span><span class="p">(</span><span class="n">forOp</span><span class="p">,</span> <span class="n">numStages</span><span class="p">);</span> <span class="c1">// Drain remaining computations</span> <span class="p">}</span> </code></pre></div> </div> <hr /> <h3 id="tileas-dynamic-persistent">tileas-dynamic-persistent</h3> <p>Transforms kernels into dynamic persistent kernels 
that process work items from a queue.</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">TileASDynamicPersistent</span><span class="o">::</span><span class="n">runOnOperation</span><span class="p">()</span> <span class="p">{</span> <span class="n">FuncOp</span> <span class="n">funcOp</span> <span class="o">=</span> <span class="n">getOperation</span><span class="p">();</span> <span class="k">if</span> <span class="p">(</span><span class="n">funcOp</span><span class="o">-&gt;</span><span class="n">hasAttr</span><span class="p">(</span><span class="s">"dynamic_persistent"</span><span class="p">))</span> <span class="p">{</span> <span class="n">emitWarning</span><span class="p">(</span><span class="s">"Kernel is already dynamic persistent"</span><span class="p">);</span> <span class="k">return</span><span class="p">;</span> <span class="p">}</span> <span class="n">transformToPersistent</span><span class="p">(</span><span class="n">funcOp</span><span class="p">);</span> <span class="n">funcOp</span><span class="o">-&gt;</span><span class="n">setAttr</span><span class="p">(</span><span class="s">"dynamic_persistent"</span><span class="p">,</span> <span class="n">builder</span><span class="p">.</span><span class="n">getUnitAttr</span><span class="p">());</span> <span class="p">}</span> <span class="kt">void</span> <span class="nf">transformToPersistent</span><span class="p">(</span><span class="n">FuncOp</span> <span class="n">funcOp</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// Insert outer loop that fetches work items:</span> <span class="c1">// while (workAvailable()) {</span> <span class="c1">// workItem = fetchWork();</span> <span class="c1">// processWorkItem(workItem);</span> <span class="c1">// signalCompletion();</span> <span class="c1">// }</span> <span class="p">}</span> </code></pre></div> </div> <hr /> <h3 
id="tileas-insert-ocg-knobs">tileas-insert-OCG-knobs</h3> <p>Inserts OCG (Optimizing Code Generator) hints for the PTXAS backend.</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">TileASInsertOCGKnobs</span><span class="o">::</span><span class="n">runOnOperation</span><span class="p">()</span> <span class="p">{</span> <span class="n">FuncOp</span> <span class="n">funcOp</span> <span class="o">=</span> <span class="n">getOperation</span><span class="p">();</span> <span class="n">funcOp</span><span class="p">.</span><span class="n">walk</span><span class="p">([</span><span class="o">&amp;</span><span class="p">](</span><span class="n">Operation</span><span class="o">*</span> <span class="n">op</span><span class="p">)</span> <span class="p">{</span> <span class="k">if</span> <span class="p">(</span><span class="k">auto</span> <span class="n">loopOp</span> <span class="o">=</span> <span class="n">dyn_cast</span><span class="o">&lt;</span><span class="n">LoopOp</span><span class="o">&gt;</span><span class="p">(</span><span class="n">op</span><span class="p">))</span> <span class="p">{</span> <span class="n">insertOCGDirectives</span><span class="p">(</span><span class="n">loopOp</span><span class="p">);</span> <span class="p">}</span> <span class="k">if</span> <span class="p">(</span><span class="k">auto</span> <span class="n">mmaOp</span> <span class="o">=</span> <span class="n">dyn_cast</span><span class="o">&lt;</span><span class="n">DotOp</span><span class="o">&gt;</span><span class="p">(</span><span class="n">op</span><span class="p">))</span> <span class="p">{</span> <span class="n">insertMMAOptimizationHints</span><span class="p">(</span><span class="n">mmaOp</span><span class="p">);</span> <span class="p">}</span> <span class="p">});</span> <span class="p">}</span> <span class="kt">void</span> <span class="nf">insertOCGDirectives</span><span class="p">(</span><span 
class="n">Operation</span><span class="o">*</span> <span class="n">op</span><span class="p">)</span> <span class="p">{</span> <span class="n">op</span><span class="o">-&gt;</span><span class="n">setAttr</span><span class="p">(</span><span class="s">"ocgEnterDirectives"</span><span class="p">,</span> <span class="n">buildOCGDirectives</span><span class="p">(</span><span class="n">op</span><span class="p">,</span> <span class="cm">/*enter=*/</span><span class="nb">true</span><span class="p">));</span> <span class="n">op</span><span class="o">-&gt;</span><span class="n">setAttr</span><span class="p">(</span><span class="s">"ocgLeaveDirectives"</span><span class="p">,</span> <span class="n">buildOCGDirectives</span><span class="p">(</span><span class="n">op</span><span class="p">,</span> <span class="cm">/*enter=*/</span><span class="nb">false</span><span class="p">));</span> <span class="p">}</span> </code></pre></div> </div> </details> <hr /> <h1 id="appendix-ir-dumps">Appendix: IR Dumps</h1> <p>This appendix contains the IR dumps from the MoE kernel compilation. 
Some of the IR below uses <code class="language-plaintext highlighter-rouge">%0</code> placeholders.</p> <details style="font-size: 1.2em; margin: 1em 0;"> <summary style="cursor: pointer; font-weight: bold;">cuda_tile IR</summary> <div class="language-llvm highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// cuda_tile dialect operations
// High-level tensor operations from the CuTile Python API

// === Pass #1 scope=cuda_tile.entry ===
"cuda_tile.module"() {1 regions}
  "cuda_tile.entry"() {1 regions}
    %0 = "cuda_tile.constant"() : () -&gt; (ct.view)
    // ... (remaining cuda_tile.constant ops elided)
    %0 = "cuda_tile.assume"(%arg) : (ct.view) -&gt; (ct.view)
    %0 = "cuda_tile.assume"(%cuda_tile.assume) : (ct.view) -&gt; (ct.view)
    // ... (remaining cuda_tile.assume ops on the view arguments elided)
    %0 = "cuda_tile.make_tensor_view"(%cuda_tile.assume, %cuda_tile.assume, %cuda_tile.assume, %cuda_tile.assume, %cuda_tile.assume, %cuda_tile.assume) : (ct.view, ct.view, ct.view, ct.view, ct.view, ct.view) -&gt; (ct.token)
    // ... (further cuda_tile.assume ops elided)
    %0 = "cuda_tile.make_tensor_view"(%cuda_tile.assume, %cuda_tile.assume) : (ct.view, ct.view) -&gt; (ct.token)
    %0 = "cuda_tile.make_token"() : () -&gt; (ct.ptr)
    %0, %1, %2 = "cuda_tile.get_tile_block_id"() : () -&gt; (ct.view, ct.view, ct.view)
    %0 = "cuda_tile.divi"(%cuda_tile.assume, %cuda_tile.constant) : (ct.view, ct.view) -&gt; (ct.view)
    %0 = "cuda_tile.divi"(%cuda_tile.assume, %cuda_tile.constant) : (ct.view, ct.view) -&gt; (ct.view)
    %0 = "cuda_tile.muli"(%cuda_tile.constant, %cuda_tile.divi) : (ct.view, ct.view) -&gt; (ct.view)
    %0 = "cuda_tile.divi"(%cuda_tile.get_tile_block_id, %cuda_tile.muli) : (ct.view, ct.view) -&gt; (ct.view)
    %0 = "cuda_tile.muli"(%cuda_tile.divi, %cuda_tile.constant) : (ct.view, ct.view) -&gt; (ct.view)
    %0 = "cuda_tile.subi"(%cuda_tile.divi, %cuda_tile.muli) : (ct.view, ct.view) -&gt; (ct.view)
    %0 = "cuda_tile.mini"(%cuda_tile.subi, %cuda_tile.constant) : (ct.view, ct.view) -&gt; (ct.view)
    %0 = "cuda_tile.remi"(%cuda_tile.get_tile_block_id, %cuda_tile.mini) : (ct.view, ct.view) -&gt; (ct.view)
    %0 = "cuda_tile.cmpi"(%cuda_tile.remi, %cuda_tile.constant) : (ct.view, ct.view) -&gt; (ct.view)
    %0 = "cuda_tile.cmpi"(%cuda_tile.mini, %cuda_tile.constant) : (ct.view, ct.view) -&gt; (ct.view)
    %0 = "cuda_tile.xori"(%cuda_tile.cmpi, %cuda_tile.cmpi) : (ct.view, ct.view) -&gt; (ct.view)
    %0 = "cuda_tile.cmpi"(%cuda_tile.remi, %cuda_tile.constant) : (ct.view, ct.view) -&gt; (ct.view)
    %0 = "cuda_tile.andi"(%cuda_tile.xori, %cuda_tile.cmpi) : (ct.view, ct.view) -&gt; (ct.view)
    %0 = "cuda_tile.addi"(%cuda_tile.remi, %cuda_tile.mini) : (ct.view, ct.view) -&gt; (ct.view)
    %0 = "cuda_tile.select"(%cuda_tile.andi, %cuda_tile.addi, %cuda_tile.remi) : (ct.view, ct.view, ct.view) -&gt; (ct.view)
    %0 = "cuda_tile.addi"(%cuda_tile.muli, %cuda_tile.select) : (ct.view, ct.view) -&gt; (ct.view)
    %0 = "cuda_tile.remi"(%cuda_tile.get_tile_block_id, %cuda_tile.muli) : (ct.view, ct.view) -&gt; (ct.view)
    %0 = "cuda_tile.cmpi"(%cuda_tile.remi, %cuda_tile.constant) : (<span
class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.cmpi"</span><span class="p">(</span><span class="nv">%cuda_tile.muli</span><span class="p">,</span> <span class="nv">%cuda_tile.constant</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.xori"</span><span class="p">(</span><span class="nv">%cuda_tile.cmpi</span><span class="p">,</span> <span class="nv">%cuda_tile.cmpi</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.cmpi"</span><span class="p">(</span><span class="nv">%cuda_tile.remi</span><span class="p">,</span> <span class="nv">%cuda_tile.constant</span><span 
class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.andi"</span><span class="p">(</span><span class="nv">%cuda_tile.xori</span><span class="p">,</span> <span class="nv">%cuda_tile.cmpi</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.addi"</span><span class="p">(</span><span class="nv">%cuda_tile.remi</span><span class="p">,</span> <span class="nv">%cuda_tile.muli</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.select"</span><span class="p">(</span><span class="nv">%cuda_tile.andi</span><span 
class="p">,</span> <span class="nv">%cuda_tile.addi</span><span class="p">,</span> <span class="nv">%cuda_tile.remi</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.divi"</span><span class="p">(</span><span class="nv">%cuda_tile.select</span><span class="p">,</span> <span class="nv">%cuda_tile.mini</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.muli"</span><span class="p">(</span><span class="nv">%cuda_tile.addi</span><span class="p">,</span> <span class="nv">%cuda_tile.constant</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span 
class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.iota"</span><span class="p">()</span> <span class="err">:</span> <span class="p">()</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.reshape"</span><span class="p">(</span><span class="nv">%cuda_tile.muli</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.broadcast"</span><span class="p">(</span><span class="nv">%cuda_tile.reshape</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.addi"</span><span class="p">(</span><span class="nv">%cuda_tile.broadcast</span><span class="p">,</span> <span class="nv">%cuda_tile.iota</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span 
class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.exti"</span><span class="p">(</span><span class="nv">%cuda_tile.addi</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.exti"</span><span class="p">(</span><span class="nv">%cuda_tile.assume</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.reshape"</span><span class="p">(</span><span class="nv">%cuda_tile.exti</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.broadcast"</span><span class="p">(</span><span class="nv">%cuda_tile.reshape</span><span class="p">)</span> <span class="err">:</span> <span 
class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.cmpi"</span><span class="p">(</span><span class="nv">%cuda_tile.exti</span><span class="p">,</span> <span class="nv">%cuda_tile.broadcast</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.reshape"</span><span class="p">(</span><span class="nv">%cuda_tile.assume</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.broadcast"</span><span class="p">(</span><span class="nv">%cuda_tile.reshape</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span 
class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.offset"</span><span class="p">(</span><span class="nv">%cuda_tile.broadcast</span><span class="p">,</span> <span class="nv">%cuda_tile.exti</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.reshape"</span><span class="p">(</span><span class="nv">%cuda_tile.constant</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.broadcast"</span><span class="p">(</span><span class="nv">%cuda_tile.reshape</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span><span class="p">,</span> <span class="nv">%1</span> <span class="p">=</span> <span class="s">"cuda_tile.load_ptr_tko"</span><span class="p">(</span><span 
class="nv">%cuda_tile.offset</span><span class="p">,</span> <span class="nv">%cuda_tile.cmpi</span><span class="p">,</span> <span class="nv">%cuda_tile.broadcast</span><span class="p">,</span> <span class="nv">%cuda_tile.make_token</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.reshape"</span><span class="p">(</span><span class="nv">%arg</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.broadcast"</span><span class="p">(</span><span class="nv">%cuda_tile.reshape</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span 
class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.divi"</span><span class="p">(</span><span class="nv">%cuda_tile.load_ptr_tko</span><span class="p">,</span> <span class="nv">%cuda_tile.broadcast</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.make_partition_view"</span><span class="p">(</span><span class="nv">%cuda_tile.make_tensor_view</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="kt">token</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">part</span><span class="p">)</span> <span class="nv">%0</span><span class="p">,</span> <span class="nv">%1</span> <span class="p">=</span> <span class="s">"cuda_tile.load_view_tko"</span><span class="p">(</span><span class="nv">%cuda_tile.make_partition_view</span><span class="p">,</span> <span class="nv">%cuda_tile.addi</span><span class="p">,</span> <span class="nv">%cuda_tile.make_token</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">part</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span 
class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.reshape"</span><span class="p">(</span><span class="nv">%cuda_tile.load_view_tko</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.divi"</span><span class="p">(</span><span class="nv">%cuda_tile.assume</span><span class="p">,</span> <span class="nv">%cuda_tile.constant</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.iota"</span><span class="p">()</span> <span class="err">:</span> <span class="p">()</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span 
class="s">"cuda_tile.reshape"</span><span class="p">(</span><span class="nv">%cuda_tile.divi</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.exti"</span><span class="p">(</span><span class="nv">%cuda_tile.assume</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.reshape"</span><span class="p">(</span><span class="nv">%cuda_tile.exti</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.broadcast"</span><span class="p">(</span><span class="nv">%cuda_tile.reshape</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span 
class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.exti"</span><span class="p">(</span><span class="nv">%cuda_tile.assume</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.reshape"</span><span class="p">(</span><span class="nv">%cuda_tile.exti</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.broadcast"</span><span class="p">(</span><span class="nv">%cuda_tile.reshape</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.exti"</span><span class="p">(</span><span class="nv">%cuda_tile.assume</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span 
class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.reshape"</span><span class="p">(</span><span class="nv">%cuda_tile.exti</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.broadcast"</span><span class="p">(</span><span class="nv">%cuda_tile.reshape</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.reshape"</span><span class="p">(</span><span class="nv">%cuda_tile.assume</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.broadcast"</span><span class="p">(</span><span class="nv">%cuda_tile.reshape</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span 
class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.reshape"</span><span class="p">(</span><span class="nv">%cuda_tile.constant</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.broadcast"</span><span class="p">(</span><span class="nv">%cuda_tile.reshape</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.for"</span><span class="p">(</span><span class="nv">%cuda_tile.constant</span><span class="p">,</span> <span class="nv">%cuda_tile.divi</span><span class="p">,</span> <span class="nv">%cuda_tile.constant</span><span class="p">,</span> <span class="nv">%cuda_tile.constant</span><span class="p">)</span> <span class="p">{</span><span class="m">1</span> <span class="err">regions</span><span class="p">}</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span 
class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.muli"</span><span class="p">(</span><span class="nv">%arg</span><span class="p">,</span> <span class="nv">%cuda_tile.constant</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.reshape"</span><span class="p">(</span><span class="nv">%cuda_tile.muli</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.broadcast"</span><span class="p">(</span><span class="nv">%cuda_tile.reshape</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span 
class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.addi"</span><span class="p">(</span><span class="nv">%cuda_tile.broadcast</span><span class="p">,</span> <span class="nv">%cuda_tile.iota</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.reshape"</span><span class="p">(</span><span class="nv">%cuda_tile.addi</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.exti"</span><span class="p">(</span><span class="nv">%cuda_tile.reshape</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span 
class="s">"cuda_tile.broadcast"</span><span class="p">(</span><span class="nv">%cuda_tile.exti</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.cmpi"</span><span class="p">(</span><span class="nv">%cuda_tile.broadcast</span><span class="p">,</span> <span class="nv">%cuda_tile.broadcast</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.muli"</span><span class="p">(</span><span class="nv">%cuda_tile.broadcast</span><span class="p">,</span> <span class="nv">%cuda_tile.broadcast</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.exti"</span><span class="p">(</span><span 
class="nv">%cuda_tile.reshape</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.broadcast"</span><span class="p">(</span><span class="nv">%cuda_tile.exti</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.cmpi"</span><span class="p">(</span><span class="nv">%cuda_tile.broadcast</span><span class="p">,</span> <span class="nv">%cuda_tile.broadcast</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.andi"</span><span class="p">(</span><span class="nv">%cuda_tile.cmpi</span><span class="p">,</span> <span class="nv">%cuda_tile.cmpi</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span 
class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.addi"</span><span class="p">(</span><span class="nv">%cuda_tile.muli</span><span class="p">,</span> <span class="nv">%cuda_tile.broadcast</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.offset"</span><span class="p">(</span><span class="nv">%cuda_tile.broadcast</span><span class="p">,</span> <span class="nv">%cuda_tile.addi</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span><span class="p">,</span> <span class="nv">%1</span> <span class="p">=</span> <span class="s">"cuda_tile.load_ptr_tko"</span><span class="p">(</span><span class="nv">%cuda_tile.offset</span><span class="p">,</span> <span class="nv">%cuda_tile.andi</span><span class="p">,</span> 
<span class="nv">%cuda_tile.broadcast</span><span class="p">,</span> <span class="nv">%cuda_tile.make_token</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.make_partition_view"</span><span class="p">(</span><span class="nv">%cuda_tile.make_tensor_view</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="kt">token</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">part</span><span class="p">)</span> <span class="nv">%0</span><span class="p">,</span> <span class="nv">%1</span> <span class="p">=</span> <span class="s">"cuda_tile.load_view_tko"</span><span class="p">(</span><span class="nv">%cuda_tile.make_partition_view</span><span class="p">,</span> <span class="nv">%cuda_tile.reshape</span><span class="p">,</span> <span class="nv">%arg</span><span class="p">,</span> <span class="nv">%cuda_tile.divi</span><span class="p">,</span> <span class="nv">%cuda_tile.make_token</span><span class="p">)</span> <span class="err">:</span> <span 
class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">part</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.reshape"</span><span class="p">(</span><span class="nv">%cuda_tile.load_view_tko</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.mmaf"</span><span class="p">(</span><span class="nv">%cuda_tile.load_ptr_tko</span><span class="p">,</span> <span class="nv">%cuda_tile.reshape</span><span class="p">,</span> <span class="nv">%arg</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span 
class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="s">"cuda_tile.continue"</span><span class="p">(</span><span class="nv">%cuda_tile.mmaf</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">()</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.muli"</span><span class="p">(</span><span class="nv">%cuda_tile.divi</span><span class="p">,</span> <span class="nv">%cuda_tile.constant</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.iota"</span><span class="p">()</span> <span class="err">:</span> <span class="p">()</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.reshape"</span><span class="p">(</span><span class="nv">%cuda_tile.muli</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span 
class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.broadcast"</span><span class="p">(</span><span class="nv">%cuda_tile.reshape</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.addi"</span><span class="p">(</span><span class="nv">%cuda_tile.broadcast</span><span class="p">,</span> <span class="nv">%cuda_tile.iota</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.ftof"</span><span class="p">(</span><span class="nv">%cuda_tile.for</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span 
class="s">"cuda_tile.reshape"</span><span class="p">(</span><span class="nv">%cuda_tile.load_ptr_tko</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.reshape"</span><span class="p">(</span><span class="nv">%cuda_tile.addi</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.exti"</span><span class="p">(</span><span class="nv">%cuda_tile.reshape</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.broadcast"</span><span class="p">(</span><span class="nv">%cuda_tile.exti</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span 
class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.exti"</span><span class="p">(</span><span class="nv">%cuda_tile.assume</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.reshape"</span><span class="p">(</span><span class="nv">%cuda_tile.exti</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.broadcast"</span><span class="p">(</span><span class="nv">%cuda_tile.reshape</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.cmpi"</span><span class="p">(</span><span class="nv">%cuda_tile.broadcast</span><span class="p">,</span> <span class="nv">%cuda_tile.broadcast</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span 
class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.exti"</span><span class="p">(</span><span class="nv">%cuda_tile.assume</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.reshape"</span><span class="p">(</span><span class="nv">%cuda_tile.exti</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.broadcast"</span><span class="p">(</span><span class="nv">%cuda_tile.reshape</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.muli"</span><span 
class="p">(</span><span class="nv">%cuda_tile.broadcast</span><span class="p">,</span> <span class="nv">%cuda_tile.broadcast</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.exti"</span><span class="p">(</span><span class="nv">%cuda_tile.reshape</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.broadcast"</span><span class="p">(</span><span class="nv">%cuda_tile.exti</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.exti"</span><span class="p">(</span><span class="nv">%cuda_tile.assume</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span 
class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.reshape"</span><span class="p">(</span><span class="nv">%cuda_tile.exti</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.broadcast"</span><span class="p">(</span><span class="nv">%cuda_tile.reshape</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.cmpi"</span><span class="p">(</span><span class="nv">%cuda_tile.broadcast</span><span class="p">,</span> <span class="nv">%cuda_tile.broadcast</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span 
class="s">"cuda_tile.andi"</span><span class="p">(</span><span class="nv">%cuda_tile.cmpi</span><span class="p">,</span> <span class="nv">%cuda_tile.cmpi</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.addi"</span><span class="p">(</span><span class="nv">%cuda_tile.muli</span><span class="p">,</span> <span class="nv">%cuda_tile.broadcast</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.reshape"</span><span class="p">(</span><span class="nv">%cuda_tile.assume</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.broadcast"</span><span class="p">(</span><span 
class="nv">%cuda_tile.reshape</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.offset"</span><span class="p">(</span><span class="nv">%cuda_tile.broadcast</span><span class="p">,</span> <span class="nv">%cuda_tile.addi</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"cuda_tile.store_ptr_tko"</span><span class="p">(</span><span class="nv">%cuda_tile.offset</span><span class="p">,</span> <span class="nv">%cuda_tile.ftof</span><span class="p">,</span> <span class="nv">%cuda_tile.andi</span><span class="p">,</span> <span class="nv">%cuda_tile.make_token</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="err">view</span><span class="p">,</span> <span class="err">ct</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="err">-</span><span 
class="p">&gt;</span> <span class="p">(</span><span class="err">ct</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="s">"cuda_tile.return"</span><span class="p">()</span> </code></pre></div> </div> </details> <details style="font-size: 1.2em; margin: 1em 0;"> <summary style="cursor: pointer; font-weight: bold;">nv_tileaa IR</summary> <div class="language-llvm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="err">//</span> <span class="err">nv_tileaa</span> <span class="err">dialect</span> <span class="err">operations</span> <span class="err">//</span> <span class="err">Tile-level</span> <span class="err">ops</span> <span class="p">(</span><span class="err">architecture-independent</span><span class="p">)</span> <span class="err">//</span> <span class="p">===</span> <span class="err">Pass</span> <span class="vg">#1</span> <span class="err">scope</span><span class="p">=</span><span class="err">nv_tileaa</span><span class="p">.</span><span class="err">func</span> <span class="p">===</span> <span class="s">"nv_tileaa.func"</span><span class="p">()</span> <span class="p">{</span><span class="err">nv_tileaa</span><span class="p">.</span><span class="err">kernel_spec</span><span class="p">}</span> <span class="p">{</span><span class="m">1</span> <span class="err">regions</span><span class="p">}</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.assume"</span><span class="p">(</span><span class="nv">%arg</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">aa</span><span class="p">.</span><span class="err">memref</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">aa</span><span class="p">.</span><span class="err">memref</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span 
class="s">"nv_tileaa.assume"</span><span class="p">(</span><span class="nv">%arg</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.assume"</span><span class="p">(</span><span class="nv">%nv_tileaa.assume</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.assume"</span><span class="p">(</span><span class="nv">%arg</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.assume"</span><span class="p">(</span><span class="nv">%nv_tileaa.assume</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.assume"</span><span class="p">(</span><span class="nv">%arg</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span 
class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.assume"</span><span class="p">(</span><span class="nv">%nv_tileaa.assume</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.assume"</span><span class="p">(</span><span class="nv">%arg</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">aa</span><span class="p">.</span><span class="err">memref</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">aa</span><span class="p">.</span><span class="err">memref</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.assume"</span><span class="p">(</span><span class="nv">%arg</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.assume"</span><span class="p">(</span><span class="nv">%nv_tileaa.assume</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.assume"</span><span class="p">(</span><span class="nv">%arg</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span 
class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.assume"</span><span class="p">(</span><span class="nv">%nv_tileaa.assume</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.splat"</span><span class="p">(</span><span class="nv">%nv_tileaa.assume</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.assume"</span><span class="p">(</span><span class="nv">%arg</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.assume"</span><span class="p">(</span><span class="nv">%nv_tileaa.assume</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.splat"</span><span class="p">(</span><span class="nv">%nv_tileaa.assume</span><span class="p">)</span> 
<span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.assume"</span><span class="p">(</span><span class="nv">%arg</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.assume"</span><span class="p">(</span><span class="nv">%nv_tileaa.assume</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.assume"</span><span class="p">(</span><span class="nv">%arg</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.assume"</span><span class="p">(</span><span class="nv">%nv_tileaa.assume</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.assume"</span><span 
class="p">(</span><span class="nv">%arg</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">aa</span><span class="p">.</span><span class="err">memref</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">aa</span><span class="p">.</span><span class="err">memref</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.assume"</span><span class="p">(</span><span class="nv">%arg</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.assume"</span><span class="p">(</span><span class="nv">%nv_tileaa.assume</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.assume"</span><span class="p">(</span><span class="nv">%arg</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.assume"</span><span class="p">(</span><span class="nv">%nv_tileaa.assume</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span 
class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.assume"</span><span class="p">(</span><span class="nv">%arg</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.assume"</span><span class="p">(</span><span class="nv">%nv_tileaa.assume</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.assume"</span><span class="p">(</span><span class="nv">%arg</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">aa</span><span class="p">.</span><span class="err">memref</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">aa</span><span class="p">.</span><span class="err">memref</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.assume"</span><span class="p">(</span><span class="nv">%arg</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.assume"</span><span class="p">(</span><span class="nv">%nv_tileaa.assume</span><span class="p">)</span> <span 
class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.splat"</span><span class="p">(</span><span class="nv">%nv_tileaa.assume</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.assume"</span><span class="p">(</span><span class="nv">%arg</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">aa</span><span class="p">.</span><span class="err">memref</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">aa</span><span class="p">.</span><span class="err">memref</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.assume"</span><span class="p">(</span><span class="nv">%arg</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.assume"</span><span class="p">(</span><span class="nv">%nv_tileaa.assume</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span 
class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.make_memref"</span><span class="p">(</span><span class="nv">%nv_tileaa.assume</span><span class="p">,</span> <span class="nv">%nv_tileaa.assume</span><span class="p">,</span> <span class="nv">%nv_tileaa.assume</span><span class="p">,</span> <span class="nv">%nv_tileaa.assume</span><span class="p">,</span> <span class="nv">%nv_tileaa.assume</span><span class="p">,</span> <span class="nv">%nv_tileaa.assume</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">aa</span><span class="p">.</span><span class="err">memref</span><span class="p">,</span> <span class="err">iN</span><span class="p">,</span> <span class="err">iN</span><span class="p">,</span> <span class="err">iN</span><span class="p">,</span> <span class="err">iN</span><span class="p">,</span> <span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">aa</span><span class="p">.</span><span class="err">btile</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.make_memref"</span><span class="p">(</span><span class="nv">%nv_tileaa.assume</span><span class="p">,</span> <span class="nv">%nv_tileaa.assume</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">aa</span><span class="p">.</span><span class="err">memref</span><span class="p">,</span> <span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">aa</span><span class="p">.</span><span class="err">btile</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.create_mem_token"</span><span class="p">()</span> <span class="err">:</span> <span class="p">()</span> <span class="err">-</span><span class="p">&gt;</span> 
<span class="p">(</span><span class="err">aa</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.get_program_id"</span><span class="p">()</span> <span class="err">:</span> <span class="p">()</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.splat"</span><span class="p">(</span><span class="nv">%nv_tileaa.get_program_id</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.make_range"</span><span class="p">(</span><span class="nv">%arith.constant</span><span class="p">,</span> <span class="nv">%arith.constant</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">,</span> <span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.extract"</span><span class="p">(</span><span class="nv">%arith.muli</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.splat"</span><span class="p">(</span><span 
class="nv">%nv_tileaa.extract</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.extract"</span><span class="p">(</span><span class="nv">%arith.extsi</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.splat"</span><span class="p">(</span><span class="nv">%nv_tileaa.extract</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.splat"</span><span class="p">(</span><span class="nv">%nv_tileaa.assume</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">aa</span><span class="p">.</span><span class="err">memref</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.addptr"</span><span class="p">(</span><span class="nv">%nv_tileaa.splat</span><span class="p">,</span> <span class="nv">%arith.extsi</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span 
class="p">&lt;...&gt;,</span> <span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span><span class="p">,</span> <span class="nv">%1</span> <span class="p">=</span> <span class="s">"nv_tileaa.load"</span><span class="p">(</span><span class="nv">%nv_tileaa.addptr</span><span class="p">,</span> <span class="nv">%arith.cmpi</span><span class="p">,</span> <span class="nv">%arith.constant</span><span class="p">,</span> <span class="nv">%nv_tileaa.create_mem_token</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">aa</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">aa</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.splat"</span><span class="p">(</span><span class="nv">%arg</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.block_tile"</span><span class="p">(</span><span class="nv">%nv_tileaa.make_memref</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">aa</span><span class="p">.</span><span 
class="err">btile</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">aa</span><span class="p">.</span><span class="err">mtoken</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.extract"</span><span class="p">(</span><span class="nv">%arith.addi</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="nv">%0</span><span class="p">,</span> <span class="nv">%1</span> <span class="p">=</span> <span class="s">"nv_tileaa.tiled_load"</span><span class="p">(</span><span class="nv">%nv_tileaa.block_tile</span><span class="p">,</span> <span class="nv">%nv_tileaa.extract</span><span class="p">,</span> <span class="nv">%nv_tileaa.create_mem_token</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">aa</span><span class="p">.</span><span class="err">mtoken</span><span class="p">,</span> <span class="err">iN</span><span class="p">,</span> <span class="err">aa</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">aa</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.view"</span><span class="p">(</span><span class="nv">%nv_tileaa.tiled_load</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span 
class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.make_range"</span><span class="p">(</span><span class="nv">%arith.constant</span><span class="p">,</span> <span class="nv">%arith.constant</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">,</span> <span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.expand_dims"</span><span class="p">(</span><span class="nv">%arith.floordivsi</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.splat"</span><span class="p">(</span><span class="nv">%arith.extsi</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.splat"</span><span class="p">(</span><span class="nv">%arith.extsi</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span> <span class="p">=</span> <span 
class="s">"nv_tileaa.splat"</span><span class="p">(</span><span class="nv">%arith.extsi</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.splat"</span><span class="p">(</span><span class="nv">%nv_tileaa.assume</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">aa</span><span class="p">.</span><span class="err">memref</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.extract"</span><span class="p">(</span><span class="nv">%arith.ceildivsi</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.splat"</span><span class="p">(</span><span class="nv">%arg</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.extract"</span><span class="p">(</span><span class="nv">%arith.muli</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> 
<span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.splat"</span><span class="p">(</span><span class="nv">%nv_tileaa.extract</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.expand_dims"</span><span class="p">(</span><span class="nv">%arith.addi</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.broadcast"</span><span class="p">(</span><span class="nv">%arith.extsi</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.broadcast"</span><span class="p">(</span><span class="nv">%arith.extsi</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.addptr"</span><span 
class="p">(</span><span class="nv">%nv_tileaa.splat</span><span class="p">,</span> <span class="nv">%arith.addi</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span><span class="p">,</span> <span class="nv">%1</span> <span class="p">=</span> <span class="s">"nv_tileaa.load"</span><span class="p">(</span><span class="nv">%nv_tileaa.addptr</span><span class="p">,</span> <span class="nv">%arith.andi</span><span class="p">,</span> <span class="nv">%arith.constant</span><span class="p">,</span> <span class="nv">%nv_tileaa.create_mem_token</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">aa</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">aa</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.block_tile"</span><span class="p">(</span><span class="nv">%nv_tileaa.make_memref</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">aa</span><span class="p">.</span><span class="err">btile</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">aa</span><span class="p">.</span><span 
class="err">mtoken</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.extract"</span><span class="p">(</span><span class="nv">%nv_tileas.convert_layout</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.extract"</span><span class="p">(</span><span class="nv">%arith.floordivsi</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="nv">%0</span><span class="p">,</span> <span class="nv">%1</span> <span class="p">=</span> <span class="s">"nv_tileaa.tiled_load"</span><span class="p">(</span><span class="nv">%nv_tileaa.block_tile</span><span class="p">,</span> <span class="nv">%nv_tileaa.extract</span><span class="p">,</span> <span class="nv">%arg</span><span class="p">,</span> <span class="nv">%nv_tileaa.extract</span><span class="p">,</span> <span class="nv">%nv_tileaa.create_mem_token</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">aa</span><span class="p">.</span><span class="err">mtoken</span><span class="p">,</span> <span class="err">iN</span><span class="p">,</span> <span class="err">iN</span><span class="p">,</span> <span class="err">iN</span><span class="p">,</span> <span class="err">aa</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span 
class="err">aa</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.view"</span><span class="p">(</span><span class="nv">%nv_tileaa.tiled_load</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.dot"</span><span class="p">(</span><span class="nv">%nv_tileaa.load</span><span class="p">,</span> <span class="nv">%nv_tileaa.view</span><span class="p">,</span> <span class="nv">%arg</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.make_range"</span><span class="p">(</span><span class="nv">%arith.constant</span><span class="p">,</span> <span class="nv">%arith.constant</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">,</span> <span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.extract"</span><span class="p">(</span><span class="nv">%arith.muli</span><span class="p">)</span> <span class="err">:</span> <span 
class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.splat"</span><span class="p">(</span><span class="nv">%nv_tileaa.extract</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.fp_to_fp"</span><span class="p">(</span><span class="nv">%scf.for</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.expand_dims"</span><span class="p">(</span><span class="nv">%nv_tileaa.load</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.expand_dims"</span><span class="p">(</span><span class="nv">%arith.addi</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span 
class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.broadcast"</span><span class="p">(</span><span class="nv">%arith.extsi</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.splat"</span><span class="p">(</span><span class="nv">%arith.extsi</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.splat"</span><span class="p">(</span><span class="nv">%arith.extsi</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.broadcast"</span><span class="p">(</span><span class="nv">%arith.extsi</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.splat"</span><span class="p">(</span><span class="nv">%arith.extsi</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">iN</span><span 
class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.splat"</span><span class="p">(</span><span class="nv">%nv_tileaa.assume</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">aa</span><span class="p">.</span><span class="err">memref</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.addptr"</span><span class="p">(</span><span class="nv">%nv_tileaa.splat</span><span class="p">,</span> <span class="nv">%arith.addi</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileaa.store"</span><span class="p">(</span><span class="nv">%nv_tileaa.addptr</span><span class="p">,</span> <span class="nv">%nv_tileaa.fp_to_fp</span><span class="p">,</span> <span class="nv">%arith.andi</span><span class="p">,</span> <span class="nv">%nv_tileaa.create_mem_token</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">aa</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="err">-</span><span 
class="p">&gt;</span> <span class="p">(</span><span class="err">aa</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="s">"nv_tileaa.return"</span><span class="p">()</span> <span class="err">//</span> <span class="p">===</span> <span class="err">Pass</span> <span class="vg">#2</span> <span class="err">scope</span><span class="p">=</span><span class="err">nv_tileaa</span><span class="p">.</span><span class="err">func</span> <span class="p">===</span> <span class="err">//</span> <span class="p">===</span> <span class="err">Pass</span> <span class="vg">#3</span> <span class="err">scope</span><span class="p">=</span><span class="err">nv_tileaa</span><span class="p">.</span><span class="err">func</span> <span class="p">===</span> <span class="err">//</span> <span class="p">===</span> <span class="err">Pass</span> <span class="vg">#4</span> <span class="err">scope</span><span class="p">=</span><span class="err">nv_tileaa</span><span class="p">.</span><span class="err">func</span> <span class="p">===</span> <span class="err">//</span> <span class="p">===</span> <span class="err">Pass</span> <span class="vg">#5</span> <span class="err">scope</span><span class="p">=</span><span class="err">nv_tileaa</span><span class="p">.</span><span class="err">func</span> <span class="p">===</span> <span class="err">//</span> <span class="p">===</span> <span class="err">Pass</span> <span class="vg">#6</span> <span class="err">scope</span><span class="p">=</span><span class="err">nv_tileaa</span><span class="p">.</span><span class="err">func</span> <span class="p">===</span> <span class="err">//</span> <span class="p">===</span> <span class="err">Pass</span> <span class="vg">#7</span> <span class="err">scope</span><span class="p">=</span><span class="err">nv_tileaa</span><span class="p">.</span><span class="err">func</span> <span class="p">===</span> <span class="err">//</span> <span class="p">===</span> <span class="err">Pass</span> <span 
class="vg">#8</span> <span class="err">scope</span><span class="p">=</span><span class="err">nv_tileaa</span><span class="p">.</span><span class="err">func</span> <span class="p">===</span> <span class="err">//</span> <span class="p">===</span> <span class="err">Pass</span> <span class="vg">#9</span> <span class="err">scope</span><span class="p">=</span><span class="err">nv_tileaa</span><span class="p">.</span><span class="err">func</span> <span class="p">===</span> <span class="err">//</span> <span class="p">===</span> <span class="err">Pass</span> <span class="vg">#10</span> <span class="err">scope</span><span class="p">=</span><span class="err">nv_tileaa</span><span class="p">.</span><span class="err">func</span> <span class="p">===</span> <span class="err">//</span> <span class="p">===</span> <span class="err">Pass</span> <span class="vg">#11</span> <span class="err">scope</span><span class="p">=</span><span class="err">nv_tileaa</span><span class="p">.</span><span class="err">func</span> <span class="p">===</span> <span class="err">//</span> <span class="p">===</span> <span class="err">Pass</span> <span class="vg">#12</span> <span class="err">scope</span><span class="p">=</span><span class="err">nv_tileaa</span><span class="p">.</span><span class="err">func</span> <span class="p">===</span> <span class="err">//</span> <span class="p">(</span><span class="err">Lines</span> <span class="m">193-352</span> <span class="err">-</span> <span class="err">final</span> <span class="err">assembly</span> <span class="err">with</span> <span class="err">fp_to_fp</span> <span class="err">conversions</span><span class="p">)</span> <span class="err">//</span> <span class="err">See</span> <span class="err">dump</span> <span class="err">for</span> <span class="err">complete</span> <span class="err">content</span> <span class="nl">including:</span> <span class="err">//</span> <span class="err">-</span> <span class="m">32</span> <span class="err">fp_to_fp</span> <span 
class="err">operations</span> <span class="err">for</span> <span class="err">output</span> <span class="err">precision</span> <span class="err">conversion</span> <span class="err">//</span> <span class="err">-</span> <span class="err">Multiple</span> <span class="err">nv_tileaa</span><span class="p">.</span><span class="err">func</span> <span class="err">declarations</span> <span class="err">with</span> <span class="err">kernel</span> <span class="kt">metadata</span> <span class="err">//</span> <span class="err">-</span> <span class="err">Final</span> <span class="err">memory</span> <span class="err">layout</span> <span class="err">preparation</span> </code></pre></div> </div> </details> <details style="font-size: 1.2em; margin: 1em 0;"> <summary style="cursor: pointer; font-weight: bold;">nv_tileas IR</summary> <div class="language-llvm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="err">//</span> <span class="err">nv_tileas</span> <span class="err">dialect</span> <span class="err">operations</span> <span class="err">//</span> <span class="err">Tile-level</span> <span class="err">Scheduled</span> <span class="err">Assembly</span> <span class="p">(</span><span class="err">architecture-specific</span><span class="p">)</span> <span class="err">//</span> <span class="p">[</span><span class="k">within</span> <span class="err">nv_tileaa</span><span class="p">.</span><span class="err">func</span> <span class="err">pass</span><span class="p">]</span> <span class="nv">%0</span><span class="p">,</span> <span class="nv">%1</span> <span class="p">=</span> <span class="s">"nv_tileas.load"</span><span class="p">(</span><span class="nv">%nv_tileaa.addptr</span><span class="p">,</span> <span class="nv">%arith.cmpi</span><span class="p">,</span> <span class="nv">%arith.constant</span><span class="p">,</span> <span class="nv">%nv_tileaa.create_mem_token</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span 
class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">aa</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">aa</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="nv">%0</span><span class="p">,</span> <span class="nv">%1</span> <span class="p">=</span> <span class="s">"nv_tileas.tiled_load"</span><span class="p">(</span><span class="nv">%nv_tileaa.block_tile</span><span class="p">,</span> <span class="nv">%nv_tileaa.extract</span><span class="p">,</span> <span class="nv">%nv_tileaa.create_mem_token</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">aa</span><span class="p">.</span><span class="err">mtoken</span><span class="p">,</span> <span class="err">iN</span><span class="p">,</span> <span class="err">aa</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">aa</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileas.view"</span><span class="p">(</span><span class="nv">%nv_tileas.tiled_load</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span> <span class="p">=</span> <span 
class="s">"nv_tileas.expand_dims"</span><span class="p">(</span><span class="nv">%arith.floordivsi</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileas.expand_dims"</span><span class="p">(</span><span class="nv">%arith.addi</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileas.convert_layout"</span><span class="p">(</span><span class="nv">%nv_tileaa.broadcast</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span><span class="p">,</span> <span class="nv">%1</span> <span class="p">=</span> <span class="s">"nv_tileas.load"</span><span class="p">(</span><span class="nv">%nv_tileaa.addptr</span><span class="p">,</span> <span class="nv">%arith.andi</span><span class="p">,</span> <span class="nv">%arith.constant</span><span class="p">,</span> <span class="nv">%nv_tileaa.create_mem_token</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span 
class="err">aa</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">aa</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileas.convert_layout"</span><span class="p">(</span><span class="nv">%nv_tileas.view</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span><span class="p">,</span> <span class="nv">%1</span> <span class="p">=</span> <span class="s">"nv_tileas.tiled_load"</span><span class="p">(</span><span class="nv">%nv_tileaa.block_tile</span><span class="p">,</span> <span class="nv">%nv_tileaa.extract</span><span class="p">,</span> <span class="nv">%arg</span><span class="p">,</span> <span class="nv">%nv_tileaa.extract</span><span class="p">,</span> <span class="nv">%nv_tileaa.create_mem_token</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">aa</span><span class="p">.</span><span class="err">mtoken</span><span class="p">,</span> <span class="err">iN</span><span class="p">,</span> <span class="err">iN</span><span class="p">,</span> <span class="err">iN</span><span class="p">,</span> <span class="err">aa</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">aa</span><span class="p">.</span><span class="kt">ptr</span><span class="p">)</span> <span class="nv">%0</span> <span 
class="p">=</span> <span class="s">"nv_tileas.view"</span><span class="p">(</span><span class="nv">%nv_tileas.tiled_load</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileas.convert_layout"</span><span class="p">(</span><span class="nv">%nv_tileas.load</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileas.convert_layout"</span><span class="p">(</span><span class="nv">%nv_tileas.view</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileas.convert_layout"</span><span class="p">(</span><span class="nv">%arg</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileas.dot"</span><span class="p">(</span><span class="nv">%nv_tileas.convert_layout</span><span class="p">,</span> <span 
class="nv">%nv_tileas.convert_layout</span><span class="p">,</span> <span class="nv">%nv_tileas.convert_layout</span><span class="p">,</span> <span class="nv">%arith.constant</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileas.convert_layout"</span><span class="p">(</span><span class="nv">%nv_tileas.dot</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileas.make_tiled_tma_desc"</span><span class="p">(</span><span class="nv">%nv_tileaa.make_memref</span><span class="p">)</span> <span class="p">{</span><span class="err">tmaIdx</span><span class="p">}</span> <span class="err">:</span> <span class="p">(</span><span class="err">aa</span><span class="p">.</span><span class="err">btile</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">?</span><span class="k">type</span><span class="p">)</span> <span class="err">//</span> <span class="p">[</span><span class="k">within</span> <span class="k">builtin</span><span class="p">.</span><span class="k">module</span> <span class="err">pass</span><span class="p">]</span> <span class="nv">%0</span> <span class="p">=</span> <span 
class="s">"nv_tileas.async.pipeline.create_pipeline"</span><span class="p">()</span> <span class="err">:</span> <span class="p">()</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">?</span><span class="k">type</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileas.async.pipeline.create_pipeline"</span><span class="p">()</span> <span class="err">:</span> <span class="p">()</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">?</span><span class="k">type</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileas.async.pipeline.create_pipeline"</span><span class="p">()</span> <span class="err">:</span> <span class="p">()</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">?</span><span class="k">type</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileas.async.pipeline.create_iterator"</span><span class="p">(</span><span class="nv">%nv_tileas.async.pipeline.create_pipeline</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">?</span><span class="k">type</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">?</span><span class="k">type</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileas.async.pipeline.create_iterator"</span><span class="p">(</span><span class="nv">%nv_tileas.async.pipeline.create_pipeline</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">?</span><span class="k">type</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">?</span><span 
class="k">type</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileas.async.pipeline.create_iterator"</span><span class="p">(</span><span class="nv">%nv_tileas.async.pipeline.create_pipeline</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">?</span><span class="k">type</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">?</span><span class="k">type</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileas.async.pipeline.create_iterator"</span><span class="p">(</span><span class="nv">%nv_tileas.async.pipeline.create_pipeline</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">?</span><span class="k">type</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">?</span><span class="k">type</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileas.async.pipeline.create_iterator"</span><span class="p">(</span><span class="nv">%nv_tileas.async.pipeline.create_pipeline</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">?</span><span class="k">type</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">?</span><span class="k">type</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileas.async.pipeline.create_iterator"</span><span class="p">(</span><span class="nv">%nv_tileas.async.pipeline.create_pipeline</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">?</span><span class="k">type</span><span class="p">)</span> <span class="err">-</span><span 
class="p">&gt;</span> <span class="p">(</span><span class="err">?</span><span class="k">type</span><span class="p">)</span> <span class="s">"nv_tileas.async.pipeline.agent_switch"</span><span class="p">(</span><span class="nv">%arg</span><span class="p">,</span> <span class="p">...)</span> <span class="p">{</span><span class="m">4</span> <span class="err">regions</span><span class="p">}</span> <span class="err">:</span> <span class="p">(...)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">()</span> <span class="err">//</span> <span class="err">Producer-Consumer</span> <span class="err">Pattern</span> <span class="p">(</span><span class="err">repeated</span> <span class="err">throughout</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileas.async.pipeline.producer_acquire"</span><span class="p">(</span><span class="nv">%arg</span><span class="p">,</span> <span class="nv">%arg</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">?</span><span class="k">type</span><span class="p">,</span> <span class="err">?</span><span class="k">type</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">?</span><span class="k">type</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileas.async.pipeline.inc_iter"</span><span class="p">(</span><span class="nv">%arg</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">?</span><span class="k">type</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">?</span><span class="k">type</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileas.async.pipeline.producer_write"</span><span class="p">(</span><span 
class="nv">%arg</span><span class="p">,</span> <span class="nv">%nv_tileas.async.pipeline.producer_acquire</span><span class="p">)</span> <span class="p">{</span><span class="m">1</span> <span class="err">regions</span><span class="p">}</span> <span class="err">:</span> <span class="p">(</span><span class="err">?</span><span class="k">type</span><span class="p">,</span> <span class="err">?</span><span class="k">type</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">?</span><span class="k">type</span><span class="p">)</span> <span class="s">"nv_tileas.async.pipeline.producer_commit"</span><span class="p">(</span><span class="nv">%arg</span><span class="p">,</span> <span class="nv">%nv_tileas.async.pipeline.producer_write</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">?</span><span class="k">type</span><span class="p">,</span> <span class="err">?</span><span class="k">type</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">()</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileas.async.pipeline.consumer_wait"</span><span class="p">(</span><span class="nv">%arg</span><span class="p">,</span> <span class="nv">%arg</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">?</span><span class="k">type</span><span class="p">,</span> <span class="err">?</span><span class="k">type</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">?</span><span class="k">type</span><span class="p">)</span> <span class="nv">%0</span><span class="p">,</span> <span class="nv">%1</span> <span class="p">=</span> <span class="s">"nv_tileas.async.pipeline.consumer_read"</span><span class="p">(</span><span class="nv">%arg</span><span class="p">,</span> <span 
class="nv">%nv_tileas.async.pipeline.consumer_wait</span><span class="p">)</span> <span class="p">{</span><span class="err">consumer_idx</span><span class="p">}</span> <span class="p">{</span><span class="m">1</span> <span class="err">regions</span><span class="p">}</span> <span class="err">:</span> <span class="p">(</span><span class="err">?</span><span class="k">type</span><span class="p">,</span> <span class="err">?</span><span class="k">type</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">?</span><span class="k">type</span><span class="p">,</span> <span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="s">"nv_tileas.async.pipeline.consumer_release"</span><span class="p">(</span><span class="nv">%arg</span><span class="p">,</span> <span class="nv">%nv_tileas.async.pipeline.consumer_read</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">?</span><span class="k">type</span><span class="p">,</span> <span class="err">?</span><span class="k">type</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">()</span> <span class="err">//</span> <span class="err">Dot</span> <span class="err">operations</span> <span class="p">(</span><span class="m">100</span><span class="err">+</span> <span class="err">for</span> <span class="err">tiled</span> <span class="err">matrix</span> <span class="err">multiply</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileas.dot"</span><span class="p">(</span><span class="nv">%nv_tileas.extract_slice</span><span class="p">,</span> <span class="nv">%nv_tileas.extract_slice</span><span class="p">,</span> <span class="nv">%arg</span><span class="p">,</span> <span class="nv">%arith.constant</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span 
class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">//</span> <span class="p">...</span> <span class="p">(</span><span class="err">repeated</span> <span class="err">for</span> <span class="err">all</span> <span class="err">tile</span> <span class="err">partitions</span><span class="p">)</span> <span class="err">//</span> <span class="err">TMA</span> <span class="err">operations</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileas.make_tiled_tma_desc"</span><span class="p">(</span><span class="nv">%nv_tileaa.make_memref</span><span class="p">)</span> <span class="p">{</span><span class="err">tmaIdx</span><span class="p">}</span> <span class="err">:</span> <span class="p">(</span><span class="err">aa</span><span class="p">.</span><span class="err">btile</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">?</span><span class="k">type</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileas.async.tiled_tma_load"</span><span class="p">(</span><span class="nv">%nv_tileaa.block_tile</span><span class="p">,</span> <span class="nv">%arg</span><span class="p">,</span> <span class="nv">%nv_tileas.make_tiled_tma_desc</span><span class="p">,</span> <span class="nv">%nv_tileaa.extract</span><span class="p">,</span> <span class="nv">%arg</span><span class="p">,</span> <span class="nv">%nv_tileaa.extract</span><span class="p">)</span> <span class="err">:</span> <span class="p">(...)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span 
class="err">?</span><span class="k">type</span><span class="p">)</span> <span class="err">//</span> <span class="err">Output</span> <span class="err">assembly</span> <span class="p">(</span><span class="m">32</span> <span class="err">insert_slice</span> <span class="err">for</span> <span class="err">output</span> <span class="err">tiles</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nv_tileas.insert_slice"</span><span class="p">(</span><span class="nv">%nv_tileaa.fp_to_fp</span><span class="p">,</span> <span class="nv">%nv_tileas.alloc_tensor</span><span class="p">,</span> <span class="nv">%arith.constant</span><span class="p">,</span> <span class="nv">%arith.constant</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">tensor</span><span class="p">&lt;...&gt;,</span> <span class="err">iN</span><span class="p">,</span> <span class="err">iN</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="err">tensor</span><span class="p">&lt;...&gt;)</span> <span class="err">//</span> <span class="p">...</span> <span class="p">(</span><span class="err">repeated</span> <span class="m">32</span> <span class="err">times</span><span class="p">)</span> </code></pre></div> </div> </details> <details style="font-size: 1.2em; margin: 1em 0;"> <summary style="cursor: pointer; font-weight: bold;">NVVM Dialect IR</summary> <div class="language-llvm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="err">//</span> <span class="err">nvvm</span> <span class="err">dialect</span> <span class="err">operations</span> <span class="err">//</span> <span class="err">NVVM</span> <span class="p">(</span><span class="err">NVIDIA</span> <span class="err">PTX</span> <span class="err">intrinsics</span> <span class="err">in</span> <span 
class="err">MLIR</span> <span class="err">form</span><span class="p">)</span> <span class="err">//</span> <span class="p">===</span> <span class="err">Barrier</span> <span class="k">and</span> <span class="err">Fence</span> <span class="err">Operations</span> <span class="p">===</span> <span class="s">"nvvm.fence.mbarrier.init"</span><span class="p">()</span> <span class="s">"nvvm.barrier"</span><span class="p">()</span> <span class="s">"nvvm.fence.proxy"</span><span class="p">()</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nvvm.read.ptx.sreg.clusterid.x"</span><span class="p">()</span> <span class="err">:</span> <span class="p">()</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="kt">i32</span><span class="p">)</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nvvm.read.ptx.sreg.tid.x"</span><span class="p">()</span> <span class="err">:</span> <span class="p">()</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="kt">i32</span><span class="p">)</span> <span class="err">//</span> <span class="p">===</span> <span class="err">Async</span> <span class="err">Global→Shared</span> <span class="err">Copies</span> <span class="p">(</span><span class="m">136</span> <span class="err">instances</span><span class="p">)</span> <span class="p">===</span> <span class="s">"nvvm.cp.async.shared.global"</span><span class="p">(</span><span class="nv">%ptr</span><span class="p">,</span> <span class="nv">%src</span><span class="p">,</span> <span class="nv">%predicate</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="kt">ptr</span><span class="p">&lt;</span><span class="m">3</span><span class="p">&gt;,</span> <span class="kt">ptr</span><span class="p">&lt;</span><span class="m">1</span><span class="p">&gt;,</span> <span class="kt">i1</span><span class="p">)</span> <span 
class="err">-</span><span class="p">&gt;</span> <span class="p">()</span> <span class="err">//</span> <span class="p">===</span> <span class="err">Tensor</span> <span class="err">Core</span> <span class="err">Data</span> <span class="err">Packing</span> <span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="m">088</span> <span class="err">instances</span><span class="p">)</span> <span class="p">===</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nvvm.cvt.packfloat.f32"</span><span class="p">(</span><span class="nv">%a</span><span class="p">,</span> <span class="nv">%b</span><span class="p">,</span> <span class="nv">%mode</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="err">f32</span><span class="p">,</span> <span class="err">f32</span><span class="p">,</span> <span class="kt">i32</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">(</span><span class="kt">i32</span><span class="p">)</span> <span class="err">//</span> <span class="p">===</span> <span class="err">Memory</span> <span class="err">Barriers</span> <span class="p">(</span><span class="m">66</span> <span class="err">instances</span><span class="p">)</span> <span class="p">===</span> <span class="s">"nvvm.mbarrier.init.shared"</span><span class="p">(</span><span class="nv">%barrier</span><span class="p">,</span> <span class="nv">%count</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="kt">ptr</span><span class="p">&lt;</span><span class="m">3</span><span class="p">&gt;,</span> <span class="kt">i32</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">()</span> <span class="s">"nvvm.mbarrier.arrive.shared"</span><span class="p">(</span><span class="nv">%barrier</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span 
class="kt">ptr</span><span class="p">&lt;</span><span class="m">3</span><span class="p">&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">()</span> <span class="s">"nvvm.mbarrier.wait.shared"</span><span class="p">(</span><span class="nv">%barrier</span><span class="p">,</span> <span class="nv">%phase</span><span class="p">)</span> <span class="err">:</span> <span class="p">(</span><span class="kt">ptr</span><span class="p">&lt;</span><span class="m">3</span><span class="p">&gt;,</span> <span class="kt">i32</span><span class="p">)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="p">()</span> <span class="err">//</span> <span class="p">===</span> <span class="err">Matrix</span> <span class="err">Load</span> <span class="err">Operations</span> <span class="p">(</span><span class="m">512</span> <span class="err">instances</span><span class="p">)</span> <span class="p">===</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nvvm.ldmatrix"</span><span class="p">(</span><span class="nv">%ptr</span><span class="p">)</span> <span class="p">{</span><span class="err">layout</span> <span class="p">=</span> <span class="err">#nvvm</span><span class="p">.</span><span class="err">mma_layout</span><span class="p">&lt;</span><span class="err">row</span><span class="p">&gt;,</span> <span class="err">num</span> <span class="p">=</span> <span class="m">4</span><span class="p">}</span> <span class="err">:</span> <span class="p">(</span><span class="kt">ptr</span><span class="p">&lt;</span><span class="m">3</span><span class="p">&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span class="err">vector</span><span class="p">&lt;</span><span class="m">4</span><span class="p">x</span><span class="kt">i32</span><span class="p">&gt;</span> <span class="err">//</span> <span class="p">===</span> <span class="err">Tensor</span> <span class="err">Core</span> <span class="err">MMA</span> <span 
class="p">(</span><span class="m">512</span> <span class="err">instances</span><span class="p">)</span> <span class="p">===</span> <span class="nv">%0</span> <span class="p">=</span> <span class="s">"nvvm.mma.sync"</span><span class="p">(</span><span class="nv">%a</span><span class="p">,</span> <span class="nv">%b</span><span class="p">,</span> <span class="nv">%c</span><span class="p">)</span> <span class="p">{</span> <span class="err">layoutA</span> <span class="p">=</span> <span class="err">#nvvm</span><span class="p">.</span><span class="err">mma_layout</span><span class="p">&lt;</span><span class="err">row</span><span class="p">&gt;,</span> <span class="err">layoutB</span> <span class="p">=</span> <span class="err">#nvvm</span><span class="p">.</span><span class="err">mma_layout</span><span class="p">&lt;</span><span class="err">col</span><span class="p">&gt;,</span> <span class="err">shape</span> <span class="p">=</span> <span class="err">#nvvm</span><span class="p">.</span><span class="err">shape</span><span class="p">&lt;</span><span class="err">m</span> <span class="p">=</span> <span class="m">16</span><span class="p">,</span> <span class="err">n</span> <span class="p">=</span> <span class="m">8</span><span class="p">,</span> <span class="err">k</span> <span class="p">=</span> <span class="m">16</span><span class="p">&gt;</span> <span class="p">}</span> <span class="err">:</span> <span class="p">(</span><span class="err">vector</span><span class="p">&lt;</span><span class="m">4</span><span class="p">x</span><span class="kt">i32</span><span class="p">&gt;,</span> <span class="err">vector</span><span class="p">&lt;</span><span class="m">2</span><span class="p">x</span><span class="kt">i32</span><span class="p">&gt;,</span> <span class="err">vector</span><span class="p">&lt;</span><span class="m">4</span><span class="p">x</span><span class="err">f32</span><span class="p">&gt;)</span> <span class="err">-</span><span class="p">&gt;</span> <span 
class="err">vector</span><span class="p">&lt;</span><span class="m">4</span><span class="p">x</span><span class="err">f32</span><span class="p">&gt;</span> <span class="err">//</span> <span class="p">...</span> <span class="p">(</span><span class="m">2</span><span class="p">,</span><span class="m">977</span> <span class="err">lines</span> <span class="err">total</span> <span class="err">-</span> <span class="err">tensor</span> <span class="err">core</span> <span class="err">operations</span><span class="p">,</span> <span class="err">barriers</span><span class="p">,</span> <span class="err">memory</span> <span class="err">ops</span><span class="p">)</span> </code></pre></div> </div> </details> <details style="font-size: 1.2em; margin: 1em 0;"> <summary style="cursor: pointer; font-weight: bold;">LLVM IR / NVVM IR</summary> <div class="language-llvm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">; ModuleID = 'LLVMDialectModule'</span> <span class="k">target</span> <span class="k">datalayout</span> <span class="p">=</span> <span class="s">"e-p:64:64:64-p3:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64"</span> <span class="k">target</span> <span class="k">triple</span> <span class="p">=</span> <span class="s">"nvptx64-nvidia-cuda"</span> <span class="c1">; Kernel entry point with TMA descriptors</span> <span class="k">define</span> <span class="k">ptx_kernel</span> <span class="kt">void</span> <span class="vg">@fused_moe_kernel</span><span class="p">(</span> <span class="kt">ptr</span> <span class="k">addrspace</span><span class="p">(</span><span class="m">1</span><span class="p">)</span> <span class="nv">%A</span><span class="p">,</span> <span class="c1">; Input tokens</span> <span class="kt">ptr</span> <span class="k">addrspace</span><span class="p">(</span><span class="m">1</span><span class="p">)</span> <span class="nv">%B</span><span class="p">,</span> <span class="c1">; Expert weights</span> <span 
class="kt">ptr</span> <span class="k">addrspace</span><span class="p">(</span><span class="m">1</span><span class="p">)</span> <span class="nv">%C</span><span class="p">,</span> <span class="c1">; Output</span> <span class="kt">ptr</span> <span class="k">addrspace</span><span class="p">(</span><span class="m">1</span><span class="p">)</span> <span class="nv">%topk_weights</span><span class="p">,</span> <span class="kt">ptr</span> <span class="k">addrspace</span><span class="p">(</span><span class="m">1</span><span class="p">)</span> <span class="nv">%sorted_token_ids</span><span class="p">,</span> <span class="kt">ptr</span> <span class="k">addrspace</span><span class="p">(</span><span class="m">1</span><span class="p">)</span> <span class="nv">%sorted_expert_ids</span><span class="p">,</span> <span class="kt">i32</span> <span class="nv">%num_token_replicas</span><span class="p">,</span> <span class="kt">i1</span> <span class="nv">%mul_routed_weight</span><span class="p">,</span> <span class="c1">; ... 
TMA descriptors appended by tileas-attach-tma-desc-args</span> <span class="p">)</span> <span class="vg">#0</span> <span class="p">{</span> <span class="nl">entry:</span> <span class="c1">; Get cluster/block/thread IDs</span> <span class="nv">%clusterid</span> <span class="p">=</span> <span class="k">call</span> <span class="kt">i32</span> <span class="vg">@llvm.nvvm.read.ptx.sreg.clusterid.x</span><span class="p">()</span> <span class="nv">%tid</span> <span class="p">=</span> <span class="k">call</span> <span class="err">range</span><span class="p">(</span><span class="kt">i32</span> <span class="m">0</span><span class="p">,</span> <span class="m">384</span><span class="p">)</span> <span class="kt">i32</span> <span class="vg">@llvm.nvvm.read.ptx.sreg.tid.x</span><span class="p">()</span> <span class="c1">; Initialize barriers for async pipeline</span> <span class="k">call</span> <span class="kt">void</span> <span class="vg">@llvm.nvvm.mbarrier.init.shared</span><span class="p">(</span><span class="kt">ptr</span> <span class="k">addrspace</span><span class="p">(</span><span class="m">3</span><span class="p">)</span> <span class="nv">%barrier</span><span class="p">,</span> <span class="kt">i32</span> <span class="m">128</span><span class="p">)</span> <span class="c1">; Async copy from global to shared memory</span> <span class="k">call</span> <span class="kt">void</span> <span class="vg">@llvm.nvvm.cp.async.shared.global</span><span class="p">(</span> <span class="kt">ptr</span> <span class="k">addrspace</span><span class="p">(</span><span class="m">3</span><span class="p">)</span> <span class="nv">%shared_dst</span><span class="p">,</span> <span class="kt">ptr</span> <span class="k">addrspace</span><span class="p">(</span><span class="m">1</span><span class="p">)</span> <span class="nv">%global_src</span><span class="p">,</span> <span class="kt">i32</span> <span class="m">16</span><span class="p">,</span> <span class="c1">; bytes</span> <span class="kt">i1</span> 
<span class="nv">%pred</span> <span class="c1">; predicate</span> <span class="p">)</span> <span class="c1">; Tensor core matrix multiply</span> <span class="nv">%result</span> <span class="p">=</span> <span class="k">call</span> <span class="p">&lt;</span><span class="m">4</span> <span class="p">x</span> <span class="kt">float</span><span class="p">&gt;</span> <span class="vg">@llvm.nvvm.mma.m16n8k16.row.col.f32.f16.f16.f32</span><span class="p">(</span> <span class="p">&lt;</span><span class="m">4</span> <span class="p">x</span> <span class="kt">i32</span><span class="p">&gt;</span> <span class="nv">%a_frag</span><span class="p">,</span> <span class="p">&lt;</span><span class="m">2</span> <span class="p">x</span> <span class="kt">i32</span><span class="p">&gt;</span> <span class="nv">%b_frag</span><span class="p">,</span> <span class="p">&lt;</span><span class="m">4</span> <span class="p">x</span> <span class="kt">float</span><span class="p">&gt;</span> <span class="nv">%c_frag</span> <span class="p">)</span> <span class="c1">; ... 
(full pipeline with producer/consumer synchronization)</span> <span class="p">}</span> <span class="c1">; NVVM intrinsic declarations</span> <span class="k">declare</span> <span class="kt">i32</span> <span class="vg">@llvm.nvvm.read.ptx.sreg.tid.x</span><span class="p">()</span> <span class="k">declare</span> <span class="kt">i32</span> <span class="vg">@llvm.nvvm.read.ptx.sreg.clusterid.x</span><span class="p">()</span> <span class="k">declare</span> <span class="kt">void</span> <span class="vg">@llvm.nvvm.mbarrier.init.shared</span><span class="p">(</span><span class="kt">ptr</span> <span class="k">addrspace</span><span class="p">(</span><span class="m">3</span><span class="p">),</span> <span class="kt">i32</span><span class="p">)</span> <span class="k">declare</span> <span class="kt">void</span> <span class="vg">@llvm.nvvm.cp.async.shared.global</span><span class="p">(</span><span class="kt">ptr</span> <span class="k">addrspace</span><span class="p">(</span><span class="m">3</span><span class="p">),</span> <span class="kt">ptr</span> <span class="k">addrspace</span><span class="p">(</span><span class="m">1</span><span class="p">),</span> <span class="kt">i32</span><span class="p">,</span> <span class="kt">i1</span><span class="p">)</span> <span class="k">declare</span> <span class="p">&lt;</span><span class="m">4</span> <span class="p">x</span> <span class="kt">float</span><span class="p">&gt;</span> <span class="vg">@llvm.nvvm.mma.m16n8k16.row.col.f32.f16.f16.f32</span><span class="p">(&lt;</span><span class="m">4</span> <span class="p">x</span> <span class="kt">i32</span><span class="p">&gt;,</span> <span class="p">&lt;</span><span class="m">2</span> <span class="p">x</span> <span class="kt">i32</span><span class="p">&gt;,</span> <span class="p">&lt;</span><span class="m">4</span> <span class="p">x</span> <span class="kt">float</span><span class="p">&gt;)</span> </code></pre></div> </div> </details> <details style="font-size: 1.2em; margin: 1em 0;"> <summary 
style="cursor: pointer; font-weight: bold;">PTX Assembly</summary> <div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">//</span> <span class="o">//</span> <span class="nf">Generated</span> <span class="nv">by</span> <span class="nv">NVIDIA</span> <span class="nv">NVVM</span> <span class="nv">Compiler</span> <span class="o">//</span> <span class="nf">Cuda</span> <span class="nv">compilation</span> <span class="nv">tools</span><span class="p">,</span> <span class="nv">release</span> <span class="mf">13.1</span><span class="p">,</span> <span class="nv">V13.1.80</span> <span class="o">//</span> <span class="nf">Based</span> <span class="nv">on</span> <span class="nv">NVVM</span> <span class="mf">21.0</span><span class="nv">.0</span> <span class="o">//</span> <span class="nf">.version</span> <span class="mf">9.1</span> <span class="nf">.target</span> <span class="nv">sm_120a</span> <span class="nf">.address_size</span> <span class="mi">64</span> <span class="nf">.visible</span> <span class="nv">.entry</span> <span class="nv">fused_moe_kernel</span><span class="p">(</span> <span class="nf">.param</span> <span class="nv">.u64</span> <span class="nv">.ptr</span> <span class="nv">.global</span> <span class="nv">.align</span> <span class="mi">1</span> <span class="nv">fused_moe_kernel_param_0</span><span class="p">,</span> <span class="nf">.param</span> <span class="nv">.u32</span> <span class="nv">fused_moe_kernel_param_1</span><span class="p">,</span> <span class="o">//</span> <span class="nf">...</span> <span class="mi">31</span> <span class="nv">parameters</span> <span class="nv">total</span> <span class="nv">including</span> <span class="nv">TMA</span> <span class="nv">descriptors</span> <span class="nf">.hidden</span> <span class="nv">.param</span> <span class="nv">.align</span> <span class="mi">64</span> <span class="nv">.b8</span> <span class="nv">fused_moe_kernel_param_31</span><span 
class="p">[</span><span class="mi">128</span><span class="p">]</span> <span class="p">)</span> <span class="nf">.reqntid</span> <span class="mi">384</span> <span class="nf">.minnctapersm</span> <span class="mi">1</span> <span class="err">{</span> <span class="nf">.reg</span> <span class="nv">.pred</span> <span class="o">%</span><span class="nv">p</span><span class="o">&lt;</span><span class="mi">306</span><span class="o">&gt;</span><span class="c1">;</span> <span class="nf">.reg</span> <span class="nv">.b16</span> <span class="o">%</span><span class="nv">rs</span><span class="o">&lt;</span><span class="mi">500</span><span class="o">&gt;</span><span class="c1">;</span> <span class="nf">.reg</span> <span class="nv">.b32</span> <span class="o">%</span><span class="nv">r</span><span class="o">&lt;</span><span class="mi">4905</span><span class="o">&gt;</span><span class="c1">;</span> <span class="nf">.reg</span> <span class="nv">.b64</span> <span class="o">%</span><span class="nv">rd</span><span class="o">&lt;</span><span class="mi">348</span><span class="o">&gt;</span><span class="c1">;</span> <span class="o">//</span> <span class="err">80</span><span class="nf">KB</span> <span class="nv">shared</span> <span class="nv">memory</span> <span class="nv">for</span> <span class="nv">double</span> <span class="nv">buffering</span> <span class="nf">.shared</span> <span class="nv">.align</span> <span class="mi">128</span> <span class="nv">.b8</span> <span class="nv">global_smem</span><span class="p">[</span><span class="mi">82032</span><span class="p">]</span><span class="c1">;</span> <span class="o">//</span> <span class="err">===</span> <span class="nf">Barrier</span> <span class="nv">Initialization</span> <span class="err">===</span> <span class="nf">mbarrier.init.shared.b64</span> <span class="p">[</span><span class="nv">global_smem</span><span class="o">+</span><span class="mi">82000</span><span class="p">],</span> <span class="o">%</span><span class="nv">r2369</span><span 
class="c1">;</span> <span class="nf">mbarrier.init.shared.b64</span> <span class="p">[</span><span class="nv">global_smem</span><span class="o">+</span><span class="mi">82008</span><span class="p">],</span> <span class="o">%</span><span class="nv">r2369</span><span class="c1">;</span> <span class="o">//</span> <span class="err">===</span> <span class="nf">Matrix</span> <span class="nv">Load</span> <span class="p">(</span><span class="nv">ldmatrix</span> <span class="nv">for</span> <span class="nv">tensor</span> <span class="nv">cores</span><span class="p">)</span> <span class="err">===</span> <span class="nf">ldmatrix.sync.aligned.m8n8.x4.shared.b16</span> <span class="err">{</span><span class="o">%</span><span class="nv">r4645</span><span class="p">,</span> <span class="o">%</span><span class="nv">r4646</span><span class="p">,</span> <span class="o">%</span><span class="nv">r4647</span><span class="p">,</span> <span class="o">%</span><span class="nv">r4648</span><span class="err">}</span><span class="p">,</span> <span class="p">[</span><span class="o">%</span><span class="nv">r2789</span><span class="p">]</span><span class="c1">;</span> <span class="nf">ldmatrix.sync.aligned.m8n8.x4.shared.b16</span> <span class="err">{</span><span class="o">%</span><span class="nv">r4649</span><span class="p">,</span> <span class="o">%</span><span class="nv">r4650</span><span class="p">,</span> <span class="o">%</span><span class="nv">r4651</span><span class="p">,</span> <span class="o">%</span><span class="nv">r4652</span><span class="err">}</span><span class="p">,</span> <span class="p">[</span><span class="o">%</span><span class="nv">r2793</span><span class="p">]</span><span class="c1">;</span> <span class="nf">ldmatrix.sync.aligned.m8n8.x4.shared.b16</span> <span class="err">{</span><span class="o">%</span><span class="nv">r4653</span><span class="p">,</span> <span class="o">%</span><span class="nv">r4654</span><span class="p">,</span> <span class="o">%</span><span 
class="nv">r4655</span><span class="p">,</span> <span class="o">%</span><span class="nv">r4656</span><span class="err">}</span><span class="p">,</span> <span class="p">[</span><span class="o">%</span><span class="nv">r2797</span><span class="p">]</span><span class="c1">;</span> <span class="nf">ldmatrix.sync.aligned.m8n8.x4.shared.b16</span> <span class="err">{</span><span class="o">%</span><span class="nv">r4657</span><span class="p">,</span> <span class="o">%</span><span class="nv">r4658</span><span class="p">,</span> <span class="o">%</span><span class="nv">r4659</span><span class="p">,</span> <span class="o">%</span><span class="nv">r4660</span><span class="err">}</span><span class="p">,</span> <span class="p">[</span><span class="o">%</span><span class="nv">r2801</span><span class="p">]</span><span class="c1">;</span> <span class="o">//</span> <span class="nf">...</span> <span class="p">(</span><span class="mi">512</span> <span class="nv">ldmatrix</span> <span class="nv">instructions</span> <span class="nv">total</span><span class="p">)</span> <span class="o">//</span> <span class="err">===</span> <span class="nf">Tensor</span> <span class="nv">Core</span> <span class="nv">MMA</span> <span class="p">(</span><span class="nv">HMMA</span><span class="p">)</span> <span class="err">===</span> <span class="o">//</span> <span class="nl">Note:</span> <span class="nf">sm_120a</span> <span class="nv">uses</span> <span class="nv">wgmma</span><span class="o">/</span><span class="nv">tcgen05</span> <span class="nv">instructions</span> <span class="nv">in</span> <span class="nv">SASS</span> <span class="o">//</span> <span class="nf">PTX</span> <span class="nv">shows</span> <span class="nv">the</span> <span class="nv">portable</span> <span class="nv">mma.sync</span> <span class="nv">form</span> <span class="nf">mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32</span> <span class="err">{</span><span class="o">%</span><span class="nf">f1</span><span class="p">,</span> <span 
class="o">%</span><span class="nv">f2</span><span class="p">,</span> <span class="o">%</span><span class="nv">f3</span><span class="p">,</span> <span class="o">%</span><span class="nv">f4</span><span class="err">}</span><span class="p">,</span> <span class="err">{</span><span class="o">%</span><span class="nf">r4645</span><span class="p">,</span> <span class="o">%</span><span class="nv">r4646</span><span class="p">,</span> <span class="o">%</span><span class="nv">r4647</span><span class="p">,</span> <span class="o">%</span><span class="nv">r4648</span><span class="err">}</span><span class="p">,</span> <span class="err">{</span><span class="o">%</span><span class="nf">r4709</span><span class="p">,</span> <span class="o">%</span><span class="nv">r4710</span><span class="err">}</span><span class="p">,</span> <span class="err">{</span><span class="o">%</span><span class="nf">f1</span><span class="p">,</span> <span class="o">%</span><span class="nv">f2</span><span class="p">,</span> <span class="o">%</span><span class="nv">f3</span><span class="p">,</span> <span class="o">%</span><span class="nv">f4</span><span class="err">}</span><span class="c1">;</span> <span class="o">//</span> <span class="nf">...</span> <span class="p">(</span><span class="mi">512</span> <span class="nv">mma.sync</span> <span class="nv">instructions</span> <span class="nv">total</span><span class="p">)</span> <span class="o">//</span> <span class="err">===</span> <span class="nf">Async</span> <span class="nv">Copy</span> <span class="p">(</span><span class="nv">cp.async</span> <span class="nv">for</span> <span class="nv">global</span><span class="err">→</span><span class="nv">shared</span><span class="p">)</span> <span class="err">===</span> <span class="nf">cp.async.cg.shared.global</span> <span class="p">[</span><span class="o">%</span><span class="nv">r2856</span><span class="p">],</span> <span class="p">[</span><span class="o">%</span><span class="nv">rd112</span><span class="p">],</span> 
<span class="mi">16</span><span class="p">,</span> <span class="o">%</span><span class="nv">p116</span><span class="c1">;</span> <span class="nf">cp.async.cg.shared.global</span> <span class="p">[</span><span class="o">%</span><span class="nv">r2857</span><span class="p">],</span> <span class="p">[</span><span class="o">%</span><span class="nv">rd113</span><span class="p">],</span> <span class="mi">16</span><span class="p">,</span> <span class="o">%</span><span class="nv">p116</span><span class="c1">;</span> <span class="o">//</span> <span class="nf">...</span> <span class="p">(</span><span class="mi">136</span> <span class="nv">cp.async</span> <span class="nv">instructions</span> <span class="nv">total</span><span class="p">)</span> <span class="o">//</span> <span class="err">===</span> <span class="nf">Barrier</span> <span class="nv">Synchronization</span> <span class="err">===</span> <span class="nf">mbarrier.arrive.shared.b64</span> <span class="nv">_</span><span class="p">,</span> <span class="p">[</span><span class="nv">global_smem</span><span class="o">+</span><span class="mi">82000</span><span class="p">]</span><span class="c1">;</span> <span class="nf">mbarrier.try_wait.parity.shared.b64</span> <span class="o">%</span><span class="nv">p117</span><span class="p">,</span> <span class="p">[</span><span class="nv">global_smem</span><span class="o">+</span><span class="mi">82000</span><span class="p">],</span> <span class="o">%</span><span class="nv">r2371</span><span class="c1">;</span> <span class="err">}</span> </code></pre></div> </div> </details> <h1 id="citation">Citation</h1> <p>To cite this article:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@article{zhu2026tileir, title = {NVIDIA TileIR Internals: from CuTile to MLIR/LLVM to SASS}, author = {Zhu, Henry}, journal = {maknee.github.io}, year = {2026}, month = {January}, url = 
"https://maknee.github.io/blog/2026/NVIDIA-TileIR-Internals-from-CuTile-to-MLIR-LLVM-to-SASS/" } </code></pre></div></div> Performance Hints 2026-01-26T12:00:00+00:00 2026-01-26T12:00:00+00:00 https://maknee.github.io/blog/2026/Performance-Hints <!-- --> <p>This post will be about going through <a href="https://abseil.io/fast/hints.html#performance-hints">https://abseil.io/fast/hints.html#performance-hints</a>, a blog post written by the power duo Jeff Dean and Sanjay Ghemawat who argubly made google to what it is today. This is a knowledge distillation from the both of them with many examples from the internal codebase. Hopefully I can a thing or two professionals who have worked in the industry longer than I have been alive</p> <h1 id="reflection-after-reading-this-post">Reflection after reading this post</h1> <p>Start at <a href="#performance-hints">Performance Hints</a> to see me go through the post while I’m reading through it. This short section is my takeaways from reading it.</p> <p>TLDR, this post is about why you should build such an intuition and showing many outcomes from snippets of experience.</p> <p>I think the intro was very very well written and puts some key points about thinking about performance into perspective.</p> <p>The early sections, especially in “The importance of thinking about performance” and “Estimation” provides small window into how to think about performance as a sort of life-style choice (ie, having a habit of incorporating performance before and while the project is going rather than after). 
The motivations for why one should sometimes think in such a manner vary, but the authors argue that down the line you face consequences, or even bigger time sinks, that could have been avoided in the first place (a harder time spotting the issues due to complexity, time sunk communicating with people complaining about what you wrote, the difficulty of changing an existing library for performance gains, using expensive band-aids to solve performance issues).</p> <p>Estimation has been and always will be important. It’s one way to judge whether your intuition is right or not (guess, run the experiment, ask “am I wrong?”). And most likely, for me, it’s wrong. One tricky thing to spot is when something sounds right but is wrong. Another habit that is hard to build is the “am I wrong” part, where I get lost in the sauce of doing something, say “I’m done, ok, let’s move on to the next thing”, and never ask “was I wrong initially?” to see where my estimations went astray, which can trickle down to actually doing the thing properly. And I think this should apply generally to anything, but I haven’t written and measured it outside of the work I do.</p> <p>Detailed example sections that are new to me and seem useful: “What to do when profiles are flat”, “Code size considerations”, “Parallelization and synchronization”, “CLs that demonstrate multiple techniques”.</p> <h2 id="side-notes-my-thoughts">Side notes (my thoughts)</h2> <p>One thing I now especially think about is the cost associated with performance. People typically talk about running services at scale and how many machines are needed for system X to run properly, but I believe it is just as important to look at a single node and its resources. These resources are repeated and scaled too. The number of cores is now 64, 128, or 256, and don’t even get me started on GPU cores. How many GB/s can memory or disk transfer within a node? 
Then any improvement in compute or transfer on a single node trickles down a bit to a cloud-native setting, and is most likely easier to profile and debug.</p> <p>So, ironically, although chips have gotten faster and faster and resources (memory, disk) have gotten cheaper, we still care about performance. Is it cost? Usability? Or do we face new applications that require more performance?</p> <p>What about power? Power is seemingly becoming more and more of a concern with AI in the GPU/hardware space, and it can result in <a href="https://modal.com/blog/gpu-health">errors on the chip</a>. Or was it already? After all, the main costs after building the chips, racks, and datacenters are power and maintenance. It seems like the only way performance can affect power consumption is indirectly, either through eliminating or doing less work (basically improving the algorithm). And sometimes performance gains can increase the work done (more nodes could result in less latency). So the question I’m getting at is how to lower power while keeping throughput or latency steady (something like undervolting in the gamer space, where users tweak their hot GPUs to run at much lower power while keeping 95%+ of performance).</p> <h1 id="performance-hints">Performance Hints</h1> <h2 id="the-importance-of-thinking-about-performance">The importance of thinking about performance</h2> <p>This section is the introduction. Both authors have added very insightful yet succinct sentences that make me ponder.</p> <blockquote> <p>Knuth is often quoted out of context as saying premature optimization is the root of all evil. The full quote reads: “We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. 
Yet we should not pass up our opportunities in that critical 3%.” This document is about that critical 3%, and a more compelling quote, again from Knuth, reads:</p> </blockquote> <p>If you go to the <a href="https://dl.acm.org/doi/pdf/10.1145/356635.356640">link</a>, Knuth actually elaborates more on this: “… pass up our opportunities in that critical 3%. A good programmer will not be lulled into complacency by such reasoning, he will be wise to look carefully at the critical code; but only after that code has been identified. It is often a mistake to make a priori judgments about what parts of a program are really critical, since the universal experience of programmers who have been using measurement tools has been that their intuitive guesses fail. After working with such tools for seven years, I’ve become convinced that all compilers written from now on should be designed to provide all programmers with feedback indicating what parts of their programs are costing the most; indeed, this feedback should be supplied automatically unless it has been specifically turned off”</p> <p>This was published on December 01, 1974. And yet the problem hasn’t been solved. What makes me believe that AI will solve this when it hasn’t been solved in the 50+ years since this was written? What makes this such a hard problem?</p> <p>Is it “just” telling someone “hey buddy, the code written/generated here is slow” and then telling the AI to fix it? And what makes me believe that AI will find that critical 3%? Or maybe that 3% doesn’t actually matter for most people; it just matters for critical pieces of code like postgres or mongodb? Or if you flip it, maybe the 3% matters a lot because it’s used by 99% of people ~ xkcd image below:</p> <p><img src="https://imgs.xkcd.com/comics/dependency_2x.png" alt="xkcd" /></p> <blockquote> <p>Many people will say “let’s write down the code in as simple a way as possible and deal with performance later when we can profile”.
However, this approach is often wrong:</p> </blockquote> <p>When I first read this, I thought: spot on. But doing this is difficult. How can you think about performance ahead of time? First, does performance matter? Always yes! It matters to some degree, whether cost, usability, etc. And second, what type of performance is needed, and to what degree? Are we focused on latency for usability? Reliability? Cost? Adaptability? And for each, how much time do we pour in, and what is a good target number to reach? These are difficult questions to think about ahead of time without many years of experience, not just touching one project in depth, but many, each with different goals and purposes.</p> <p>The baseline for knowing this is napkin math. I still need to work on that and integrate it into the programs I’m working on. And I believe that holds for some things outside of computer science. If you’re putting money into, say, a stock, <em>ideally</em> you have some idea of what’s going to happen and give an educated guess. Or maybe you need to estimate if you’re going to travel with multiple people; I don’t think yoloing the trip will make the majority of people happy in most cases.</p> <blockquote> <p>If you disregard all performance concerns when developing a large system, you will end up with a flat profile where there are no obvious hotspots because performance is lost all over the place. It will be difficult to figure out how to get started on performance improvements.</p> </blockquote> <p>This is very true. One thing touches another and another and propagates. Let’s say the problem is <a href="https://youtu.be/IxkSlnrRFqc?t=1483">TCP window size</a>.</p> <p>For example, you’re serving a GET request in nodejs for a website and, wow, it’s taking 1-2s from US east to west. You start adding print lines to the code to get time measurements.
Hmm, it seems like this fetch from the db is taking a while: <code class="language-plaintext highlighter-rouge">await db.query(...)</code>. Maybe it’s the db. You change the query to something simple, <code class="language-plaintext highlighter-rouge">await db.query(SELECT ... COUNT 1)</code>, and, oh, it’s better. Then you could optimize that query and bam, queries are ~500ms, which is somewhat reasonable.</p> <p>But maybe you dig a little differently (not necessarily deeper). You return some dummy result instead of the db query. Oh? It’s faster? Hmm. By a stroke of luck while messing around, you try a big dummy return and you see that it’s 1-2s. What’s happening? Ask AI, etc., and maybe you get TCP window size. The initial window is ~15KB for one RTT, and the data you’re sending is like 1MB-2MB. So you have to somehow compress your data (hopefully it works) or return less data.</p> <p>Similar to the gym, switching too many variables (like workouts) at once can make it difficult to pinpoint what’s going on.</p> <blockquote> <p>If you are developing a library that will be used by other people, the people who will run into performance problems will likely be people who cannot easily make performance improvements (they will have to understand the details of code written by other people/teams, and have to negotiate with them about the importance of performance optimizations).</p> </blockquote> <p>This is the other part I’m less experienced with. I think one can get experience seeing this by working in open source or big tech, where people care about / have an incentive to improve a project. I wonder why others cannot easily make the perf improvements to someone else’s library? Many people don’t have the time or reason to look deeper, which usually doesn’t give an obvious big net benefit (not to say that it gives a net benefit at all!).</p> <p>I guess the question is how you can make it usable. One obvious answer is feedback. But how do you get effective feedback?
Is it just talking to people who complain about it not working and trying to decipher what that means?</p> <p>A business will face this issue with people. People don’t care about what goes on inside the product. They want it to work for their specific use case because it’s easier (cost and time) than doing it themselves.</p> <blockquote> <p>It is harder to make significant changes to a system when it is in heavy use.</p> </blockquote> <p>Another part that I’m not familiar with. Clearly big tech or open source is again where one can see this. One thing that sticks out is that you have to accommodate existing users and <em>try</em> to convince them to switch. An example of this is the python2 to python3 switch. I was kind of mad that you needed to write <code class="language-plaintext highlighter-rouge">print(...)</code> instead of <code class="language-plaintext highlighter-rouge">print ...</code>, because you need to type the <code class="language-plaintext highlighter-rouge">(</code> and <code class="language-plaintext highlighter-rouge">)</code> parens, and they were kinda hard to reach physically (having to press shift + 9) compared to space.</p> <p>And yet I think, for most things, it most likely has to change at some point. Not many things in life don’t change.</p> <p>For example, friendships of many years typically change one way or another.</p> <blockquote> <p>It is also hard to tell if there are performance problems that can be solved easily and so we end up with potentially expensive solutions like over-replication or severe overprovisioning of a service to handle load problems.</p> </blockquote> <p>Another area I’m not an expert in. One can guess and estimate issues, but honestly, it’s fucking hard.
Real applications typically have explosions in usage at certain times, and the <em>let’s solve this for now by X and vibe it with things I know</em> approach can produce patches that don’t solve the actual issue, and maybe you end up spending more time or money than necessary. But identifying whether to spend that time now or later is so difficult.</p> <blockquote> <p>Instead, we suggest that when writing code, try to choose the faster alternative if it does not impact readability/complexity of the code significantly.</p> </blockquote> <p>Not sure what to expect, but I will revisit these 4 key points when I’m done going through the rest.</p> <h2 id="estimation">Estimation</h2> <blockquote> <p>If you can develop an intuition for how much performance might matter in the code you are writing, you can make a more informed decision (e.g., how much extra complexity is warranted in the name of performance).</p> </blockquote> <p>Oh man, the word intuition. Ugh, it’s like the best word for what it describes, but how people learn and build an intuition varies per person.</p> <blockquote> <p>Is it test code? If so, you need to worry mostly about the asymptotic complexity of your algorithms and data structures. (Aside: development cycle time matters, so avoid writing tests that take a long time to run.) Is it code specific to an application? If so, try to figure out how much performance matters for this piece of code. This is typically not very hard: just figuring out whether code is initialization/setup code vs. code that will end up on hot paths (e.g., processing every request in a service) is often sufficient. Is it library code that will be used by many applications? In this case it is hard to tell how sensitive it might become. This is where it becomes especially important to follow some of the simple techniques described in this document.
For example, if you need to store a vector that usually has a small number of elements, use an absl::InlinedVector instead of std::vector. Such techniques are not very hard to follow and don’t add any non-local complexity to the system. And if it turns out that the code you are writing does end up using significant resources, it will be higher performance from the start. And it will be easier to find the next thing to focus on when looking at a profile.</p> </blockquote> <p>So my understanding is: think about the type of work being done in the application you are building, and follow generally good rules throughout the project, like drinking X amount of water per day (drinking more is generally good for you, for example).</p> <blockquote> <p>You can do a slightly deeper analysis when picking between options with potentially different performance characteristics by relying on back of the envelope calculations. Such calculations can quickly give a very rough estimate of the performance of different alternatives, and the results can be used to discard some of the alternatives without having to implement them.</p> </blockquote> <p>They finally mentioned it. Ok, let’s see what has changed in the ~20 years since Jeff first mentioned this.</p> <blockquote> <p>Here is how such an estimation might work: Estimate how many low-level operations of various kinds are required, e.g., number of disk seeks, number of network round-trips, bytes transmitted etc. Multiply each kind of expensive operation with its rough cost, and add the results together. The preceding gives the cost of the system in terms of resource usage. If you are interested in latency, and if the system has any concurrency, some of the costs may overlap and you may have to do slightly more complicated analysis to estimate the latency.</p> </blockquote> <p>Any transfer or movement of data should be counted, then multiplied by its cost (time or $), and summed to get the total estimated result.
The following table is one everyone has seen.</p> <div class="language-md highlighter-rouge"><div class="highlight"><pre class="highlight"><code>L1 cache reference                        0.5 ns
L2 cache reference                          3 ns
Branch mispredict                           5 ns
Mutex lock/unlock (uncontended)            15 ns
Main memory reference                      50 ns
Compress 1K bytes with Snappy           1,000 ns
Read 4KB from SSD                      20,000 ns
Round trip within same datacenter      50,000 ns
Read 1MB sequentially from memory      64,000 ns
Read 1MB over 100 Gbps network        100,000 ns
Read 1MB from SSD                   1,000,000 ns
Disk seek                           5,000,000 ns
Read 1MB sequentially from disk    10,000,000 ns
Send packet CA-&gt;Netherlands-&gt;CA   150,000,000 ns
</code></pre></div></div> <blockquote> <p>You may find it useful to also track estimated costs for higher-level operations relevant to your system. E.g., you might want to know the rough cost of a point read from your SQL database, the latency of interacting with a Cloud service, or the time to render a simple HTML page. If you don’t know the relevant cost of different operations, you can’t do decent back-of-the-envelope calculations!</p> </blockquote> <p>Yeah, I understand this a bit better now. It’s incredibly hard to track initially because it’s hard to know what’s important, and I haven’t used it consistently daily/weekly, etc.</p> <h3 id="example-time-to-quicksort-a-billion-4-byte-numbers">Example: Time to quicksort a billion 4 byte numbers</h3> <p>Before looking at the answer, I would like to ask myself: where would this be used, which components are mainly involved, and what is the majority of the cost (the bottleneck)?</p> <p>Maybe we have many time durations (say from multiple services) and would like to plot a histogram of latencies for a webUI query.</p> <p>Components: memory, cpu. 1B * 4 bytes = 4GB of data, which is kinda tiny by today’s standards (one machine can handle it).</p> <p>So let’s say it’s already in memory, not on disk.
The quickest case is if the data is already sorted and we’re only reading each element and writing it back to another piece of memory. So 50ns * 1B * 2 (one read and one write), so 5s * 2 = 10s?</p> <p>To be honest, it’s probably more. Say all of the elements are unsorted; we would have to move all of them and then repeat on each subset as a slice, like merge sort, and say without threading. So it would be like an infinite geometric series until convergence: 10s + 5s + 2.5s…; I had to search this up… s = a / (1 - r), which is 10s / (1 - 0.5) = 20s.</p> <p>So between 10s and 20s would be my answer.</p> <blockquote> <p>Memory bandwidth: the array occupies 4 GB (4 bytes per number times a billion numbers). Let’s assume ~16GB/s of memory bandwidth per core. That means each pass will take ~0.25s. N is ~2^30, so we will make ~30 passes, so the total cost of memory transfer will be ~7.5 seconds. Branch mispredictions: we will do a total of N*log(N) comparisons, i.e., ~30 billion comparisons. Let’s assume that half of them (i.e., 15 billion) are mispredicted. Multiplying by 5 ns per misprediction, we get a misprediction cost of 75 seconds. We assume for this analysis that correctly predicted branches are free. Adding up the previous numbers, we get an estimate of ~82.5 seconds.</p> </blockquote> <p>My answer was way off. Let me actually look at the table and try to do it in their style: 1. figure out the algorithm and the operation counts, and 2. find the components involved.</p> <p>Ok, memory bandwidth gives the pass cost: 4GB / 16GB/s (reading memory) = 0.25s per pass, times log(1B) ≈ 30 passes = 7.5s. Next is computation: branch misprediction work = N*log(N) comparisons = 1B * log(1B) ≈ 30B, half mispredicted = 15B. Then 15B * 5ns = 75s, which is surprising; I didn’t expect it to be compute bound.</p> <blockquote> <p>Let’s assume we have a 32MB L3 cache, and that the cost of transferring data from L3 cache to the processor is negligible.
The L3 cache can hold 2^23 numbers, and therefore the last 22 passes can operate on the data resident in the L3 cache (the 23rd last pass brings data into the L3 cache and the remaining passes operate on that data.) That cuts down the memory transfer cost to 2.5 seconds (10 memory transfers of 4GB at 16GB/s) instead of 7.5 seconds (30 memory transfers).</p> </blockquote> <p>Wow… ok, they talk about the caches here. 2^23 comes from… 2^5 (32) * 2^20 (1MB) / 2^2 (4 bytes per entry). So the last 22 passes can run entirely on cache-resident data (subarrays of at most 2^23 numbers).</p> <h3 id="example-time-to-generate-a-web-page-with-30-image-thumbnails">Example: Time to generate a web page with 30 image thumbnails</h3> <p>Let’s compare two potential designs where the original images are stored on disk, and each image is approximately 1MB in size.</p> <p>Two main components: loading from disk and transferring data over the web. Total data: 30MB.</p> <p>Disk: 30MB / 1GB/s = 0.03s. Web: log(30MB/1.5KB) * 150ms per roundtrip = 3 * 150ms = 450ms.</p> <blockquote> <p>Read the contents of the 30 images serially and generate a thumbnail for each one. Each read takes one seek + one transfer, which adds up to 5ms for the seek, and 10ms for the transfer, which adds up to 30 images times 15ms per image, i.e., 450ms. Read in parallel, assuming the images are spread evenly across K disks. The previous resource usage estimate still holds, but latency will drop by roughly a factor of K, ignoring variance (e.g, we will sometimes get unlucky and one disk will have more than 1/Kth of the images we are reading). Therefore if we are running on a distributed filesystem with hundreds of disks, the expected latency will drop to ~15ms. Let’s consider a variant where all images are on a single SSD. This changes the sequential read performance to 20µs + 1ms per image, which adds up to ~30 ms overall.</p> </blockquote> <p>Cool, I calculated the SSD case and got it right.
But, I guess realistically it would be the HDD version (say S3).</p> <h1 id="measurement">Measurement</h1> <blockquote> <p>The preceding section gives some tips about how to think about performance when writing code without worrying too much about how to measure the performance impact of your choices. However, before you actually start making improvements, or run into a tradeoff involving various things like performance, simplicity, etc. you will want to measure or estimate potential performance benefits. Being able to measure things effectively is the number one tool you’ll want to have in your arsenal when doing performance-related work.</p> </blockquote> <p>I should really keep this in mind: estimate before actually running it. The question is whether that is even feasible.</p> <blockquote> <p>As an aside, it’s worth pointing out that profiling code that you’re unfamiliar with can also be a good way of getting a general sense of the structure of the codebase and how it operates. Examining the source code of heavily involved routines in the dynamic call graph of a program can give you a high level sense of “what happens” when running the code, which can then build your own confidence in making performance-improving changes in slightly unfamiliar code.</p> </blockquote> <p>Yes! I think just reading code gives one an idealistic view of it. So much complexity happens behind the scenes. Is there lock contention? False sharing? Too much time spent allocating? Memory leaks? These are things that are not so easily picked up by reading the code or stuffing it into an LLM.</p> <h2 id="profiling-tools-and-tips">Profiling tools and tips</h2> <blockquote> <p>If you can, write a microbenchmark that covers the code you are improving. Microbenchmarks improve turnaround time when making performance improvements, help verify the impact of performance improvements, and can help prevent future performance regressions.
However microbenchmarks can have pitfalls that make them non-representative of full system performance.</p> </blockquote> <p>Very true. It helps build an understanding of the individual components of the system, to see where the full system has overhead.</p> <h2 id="what-to-do-when-profiles-are-flat">What to do when profiles are flat</h2> <blockquote> <p>Find loops closer to the top of call stacks (flame graph view of a CPU profile can be helpful here). Potentially, the loop or the code it calls could be restructured to be more efficient. Some code that initially built a complicated graph structure incrementally by looping over nodes and edges of the input was changed to build the graph structure in one shot by passing it the entire input. This removed a bunch of internal checks that were happening per edge in the initial code.</p> </blockquote> <p>In other words, restructure incremental per-element loops into one bulk pass over the whole input.</p> <blockquote> <p>Take a step back and look for structural changes higher up in the call stacks instead of concentrating on micro-optimizations. The techniques listed under algorithmic improvements can be useful when doing this.</p> </blockquote> <p>Work at the algorithm level instead of micro-optimizations.</p> <blockquote> <p>Look for overly general code. Replace it with a customized or lower-level implementation. E.g., if an application is repeatedly using a regular expression match where a simple prefix match would suffice, consider dropping the use of the regular expression.</p> </blockquote> <p>Makes sense, though this one feels like a micro-optimization; I would be wary of it.</p> <blockquote> <p>Attempt to reduce the number of allocations: get an allocation profile, and pick away at the highest contributor to the number of allocations.
This will have two effects: (1) It will provide a direct reduction of the amount of time spent in the allocator (and garbage collector for GC-ed languages) (2) There will often be a reduction in cache misses since in a long running program using tcmalloc, every allocation tends to go to a different cache line.</p> </blockquote> <p>Seen this happen SO many times. This takes up so many cycles; it’s actually frustrating to solve.</p> <blockquote> <p>Gather other types of profiles, especially ones based on hardware performance counters. Such profiles may point out functions that are encountering a high cache miss rate. Techniques described in the profiling tools and tips section can be helpful.</p> </blockquote> <p>Yes, but one needs to learn how to use these performance counters at a system level, and typically they are just samples (hard to pinpoint). I guess perf would help here with something like cache misses.</p> <h2 id="api-considerations">API considerations</h2> <blockquote> <p>Widely used APIs come under heavy pressure to add features. Be careful when adding new features since these will constrain future implementations and increase cost unnecessarily for users who don’t need the new features. E.g., many C++ standard library containers promise iterator stability, which in typical implementations increases the number of allocations significantly, even though many users do not need pointer stability.</p> </blockquote> <p>Make the API as simple as possible, kind of like C, I guess?
But make the interface actually good.</p> <h3 id="bulk-apis">Bulk APIs</h3> <p>Bulk APIs amortize per-call overhead, e.g., acquiring a lock once per batch of operations instead of once per operation:</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">template</span> <span class="o">&lt;</span><span class="k">typename</span> <span class="nc">T</span><span class="p">&gt;</span> <span class="k">class</span> <span class="nc">ObjectStore</span> <span class="p">{</span> <span class="nl">public:</span> <span class="p">...</span> <span class="n">absl</span><span class="o">::</span><span class="n">Status</span> <span class="n">DeleteRef</span><span class="p">(</span><span class="n">Ref</span><span class="p">);</span> <span class="k">template</span> <span class="o">&lt;</span><span class="k">typename</span> <span class="nc">T</span><span class="p">&gt;</span> <span class="k">class</span> <span class="nc">ObjectStore</span> <span class="p">{</span> <span class="nl">public:</span> <span class="p">...</span> <span class="n">absl</span><span class="o">::</span><span class="n">Status</span> <span class="n">DeleteRef</span><span class="p">(</span><span class="n">Ref</span><span class="p">);</span> <span class="c1">// Delete many references. For each ref, if no other Refs point to the same</span> <span class="c1">// object, the object will be deleted.
Returns non-OK on any error.</span> <span class="n">absl</span><span class="o">::</span><span class="n">Status</span> <span class="n">DeleteRefs</span><span class="p">(</span><span class="n">absl</span><span class="o">::</span><span class="n">Span</span><span class="o">&lt;</span><span class="k">const</span> <span class="n">Ref</span><span class="o">&gt;</span> <span class="n">refs</span><span class="p">);</span> <span class="p">...</span> <span class="k">template</span> <span class="o">&lt;</span><span class="k">typename</span> <span class="nc">T</span><span class="p">&gt;</span> <span class="n">absl</span><span class="o">::</span><span class="n">Status</span> <span class="n">ObjectStore</span><span class="o">&lt;</span><span class="n">T</span><span class="o">&gt;::</span><span class="n">DeleteRefs</span><span class="p">(</span><span class="n">absl</span><span class="o">::</span><span class="n">Span</span><span class="o">&lt;</span><span class="k">const</span> <span class="n">Ref</span><span class="o">&gt;</span> <span class="n">refs</span><span class="p">)</span> <span class="p">{</span> <span class="n">util</span><span class="o">::</span><span class="n">Status</span> <span class="n">result</span><span class="p">;</span> <span class="n">absl</span><span class="o">::</span><span class="n">MutexLock</span> <span class="n">l</span><span class="p">(</span><span class="o">&amp;</span><span class="n">mu_</span><span class="p">);</span> <span class="k">for</span> <span class="p">(</span><span class="k">auto</span> <span class="n">ref</span> <span class="o">:</span> <span class="n">refs</span><span class="p">)</span> <span class="p">{</span> <span class="n">result</span><span class="p">.</span><span class="n">Update</span><span class="p">(</span><span class="n">DeleteRefLocked</span><span class="p">(</span><span class="n">ref</span><span class="p">));</span> <span class="p">}</span> <span class="k">return</span> <span class="n">result</span><span class="p">;</span> 
<span class="p">}</span>
</code></pre></div></div> <h3 id="view-types">View types</h3> <blockquote> <p>These types reduce copying, and allow callers to pick their own container types (e.g., one caller might use std::vector whereas another one uses absl::InlinedVector).</p> </blockquote> <p>Yep! Been using this</p> <blockquote> <p>For frequently called routines, sometimes it is useful to allow higher-level callers to pass in a data structure that they own or information that the called routine needs that the client already has. This can avoid the low-level routine being forced to allocate its own temporary data structure or recompute already-available information.</p> </blockquote> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">const</span> <span class="n">WallTime</span> <span class="n">now</span> <span class="o">=</span> <span class="n">WallTime_Now</span><span class="p">();</span> <span class="p">...</span> <span class="n">RPC_Stats</span><span class="o">::</span><span class="n">RecordRPC</span><span class="p">(</span><span class="n">stats_name</span><span class="p">,</span> <span class="n">m</span><span class="p">);</span> <span class="k">const</span> <span class="n">WallTime</span> <span class="n">now</span> <span class="o">=</span> <span class="n">WallTime_Now</span><span class="p">();</span> <span class="p">...</span> <span class="n">RPC_Stats</span><span class="o">::</span><span class="n">RecordRPC</span><span class="p">(</span><span class="n">stats_name</span><span class="p">,</span> <span class="n">m</span><span class="p">,</span> <span class="n">now</span><span class="p">);</span> </code></pre></div></div> <p>This makes sense</p> <h3 id="thread-compatible-vs-thread-safe-types">Thread-compatible vs. 
Thread-safe types</h3> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">TransferPhase</span> <span class="n">HitlessTransferPhase</span><span class="o">::</span><span class="n">get</span><span class="p">()</span> <span class="k">const</span> <span class="p">{</span> <span class="k">static</span> <span class="n">CallsiteMetrics</span> <span class="n">cm</span><span class="p">(</span><span class="s">"HitlessTransferPhase::get"</span><span class="p">);</span> <span class="n">MonitoredMutexLock</span> <span class="n">l</span><span class="p">(</span><span class="o">&amp;</span><span class="n">cm</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">mutex_</span><span class="p">);</span> <span class="k">return</span> <span class="n">phase_</span><span class="p">;</span> <span class="p">}</span> <span class="n">TransferPhase</span> <span class="n">HitlessTransferPhase</span><span class="o">::</span><span class="n">get</span><span class="p">()</span> <span class="k">const</span> <span class="p">{</span> <span class="k">return</span> <span class="n">phase_</span><span class="p">;</span> <span class="p">}</span> </code></pre></div></div> <p>Have the user do the sync, makes sense for performance as the internal calls won’t be always locking</p> <blockquote> <p>The most critical opportunities for performance improvements come from algorithmic improvements, e.g., turning an O(N²) algorithm to O(N lg(N)) or O(N), avoiding potentially exponential behavior, etc. These opportunities are rare in stable code, but are worth paying attention to when writing new code. A few examples that show such improvements to pre-existing code:</p> </blockquote> <p>Rare in stable code! 
Man, they must have thought about most things.</p> <h2 id="better-memory-representation">Better memory representation</h2> <blockquote> <p>Careful consideration of memory footprint and cache footprint of important data structures can often yield big savings. The data structures below focus on supporting common operations by touching fewer cache lines. Care taken here can (a) avoid expensive cache misses (b) reduce memory bus traffic, which speeds up both the program in question and anything else running on the same machine</p> </blockquote> <p>Yes, these are expensive resources on any machine.</p> <h3 id="memory-layout">Memory layout</h3> <blockquote> <p>Place hot read-only fields away from hot mutable fields so that writes to the mutable fields do not cause the read-only fields to be evicted from nearby caches.</p> </blockquote> <p>Oh, I get it: a write invalidates the whole cache line in other cores’ caches, so read-only fields sharing that line get evicted too.</p> <blockquote> <p>Consider packing things into fewer bytes by using bit and byte-level encoding. This can be complicated, so only do this when the data under question is encapsulated inside a well-tested module, and the overall reduction of memory usage is significant. Furthermore, watch out for side effects like under-alignment of frequently used data, or more expensive code for accessing packed representations. Validate such changes using benchmarks.</p> </blockquote> <p>Makes sense. Trade CPU for space, like varints, etc.</p> <h3 id="indices-instead-of-pointers">Indices instead of pointers</h3> <blockquote> <p>On modern 64-bit machines, pointers take up 64 bits. If you have a pointer-rich data structure, you can easily chew up lots of memory with indirections of T*. Instead, consider using integer indices into an array T[] or other data structure.
Not only will the references be smaller (if the number of indices is small enough to fit in 32 or fewer bits), but the storage for all the T[] elements will be contiguous, often leading to better cache locality.</p> </blockquote> <p>Smaller indices: a 4-byte index can address ~4 billion elements at half the storage cost of an 8-byte pointer.</p> <blockquote> <p>Avoid data structures that allocate a separate object per stored element (e.g., std::map, std::unordered_map in C++). Instead, consider types that use chunked or flat representations to store multiple elements in close proximity in memory (e.g., std::vector, absl::flat_hash_{map,set} in C++). Such types tend to have much better cache behavior. Furthermore, they encounter less allocator overhead.</p> </blockquote> <p>Yes, but only in performance-critical code. It’s sometimes tricky to use a flat representation, but flat hashmap/set is nice.</p> <blockquote> <p>One useful technique is to partition elements into chunks where each chunk can hold a fixed number of elements. This technique can reduce the cache footprint of a data structure significantly while preserving good asymptotic behavior.</p> </blockquote> <p>Yes! Used in many implementations, such as highly performant read/write queues.</p> <h3 id="arenas">Arenas</h3> <blockquote> <p>Arenas can help reduce memory allocation cost, but they also have the benefit of packing together independently allocated items next to each other, typically in fewer cache lines, and eliminating most destruction costs. They are likely most effective for complex data structures with many sub-objects. Consider providing an appropriate initial size for the arena since that can help reduce allocations. Caveat: it is easy to misuse arenas by putting too many short-lived objects in a long-lived arena, which can unnecessarily bloat memory footprint.</p> </blockquote> <p>Basically, allocate a region up front and carve objects out of it, though you may not use the entire arena.
It’s tricky to get right… very tricky… especially estimating how big it should be.</p> <h3 id="arrays-instead-of-maps">Arrays instead of maps</h3> <blockquote> <p>If the domain of a map can be represented by a small integer or is an enum, or if the map will have very few elements, the map can sometimes be replaced by an array or a vector of some form.</p> </blockquote> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">const</span> <span class="n">gtl</span><span class="o">::</span><span class="n">flat_map</span><span class="o">&lt;</span><span class="kt">int</span><span class="p">,</span> <span class="kt">int</span><span class="o">&gt;</span> <span class="n">payload_type_to_clock_frequency_</span><span class="p">;</span> <span class="c1">// A map (implemented as a simple array) indexed by payload_type to clock freq</span> <span class="c1">// for that payload type (or 0)</span> <span class="k">struct</span> <span class="nc">PayloadTypeToClockRateMap</span> <span class="p">{</span> <span class="kt">int</span> <span class="n">map</span><span class="p">[</span><span class="mi">128</span><span class="p">];</span> <span class="p">};</span> <span class="p">...</span> <span class="k">const</span> <span class="n">PayloadTypeToClockRateMap</span> <span class="n">payload_type_to_clock_frequency_</span><span class="p">;</span> </code></pre></div></div> <p>Only applicable when the key is a small index…</p> <h3 id="bit-vectors-instead-of-sets">Bit vectors instead of sets</h3> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">ZoneSet</span><span class="o">:</span> <span class="k">public</span> <span class="n">dense_hash_set</span><span class="o">&lt;</span><span class="n">ZoneId</span><span class="o">&gt;</span> <span class="p">{</span> <span class="nl">public:</span> <span class="p">...</span> <span class="kt">bool</span> <span
class="n">Contains</span><span class="p">(</span><span class="n">ZoneId</span> <span class="n">zone</span><span class="p">)</span> <span class="k">const</span> <span class="p">{</span> <span class="k">return</span> <span class="n">count</span><span class="p">(</span><span class="n">zone</span><span class="p">)</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">;</span> <span class="p">}</span> <span class="k">class</span> <span class="nc">ZoneSet</span> <span class="p">{</span> <span class="p">...</span> <span class="c1">// Returns true iff "zone" is contained in the set</span> <span class="kt">bool</span> <span class="n">ContainsZone</span><span class="p">(</span><span class="n">ZoneId</span> <span class="n">zone</span><span class="p">)</span> <span class="k">const</span> <span class="p">{</span> <span class="k">return</span> <span class="n">zone</span> <span class="o">&lt;</span> <span class="n">b_</span><span class="p">.</span><span class="n">size</span><span class="p">()</span> <span class="o">&amp;&amp;</span> <span class="n">b_</span><span class="p">.</span><span class="n">get_bit</span><span class="p">(</span><span class="n">zone</span><span class="p">);</span> <span class="p">}</span> <span class="p">...</span> <span class="k">private</span><span class="o">:</span> <span class="kt">int</span> <span class="n">size_</span><span class="p">;</span> <span class="c1">// Number of zones inserted</span> <span class="n">util</span><span class="o">::</span><span class="n">bitmap</span><span class="o">::</span><span class="n">InlinedBitVector</span><span class="o">&lt;</span><span class="mi">256</span><span class="o">&gt;</span> <span class="n">b_</span><span class="p">;</span> </code></pre></div></div> <p>I’ve not actually used this before. Essentially a vector of bits instead of a set of values. 
I don’t use sets that often…</p> <h3 id="reduce-allocations">Reduce allocations</h3> <blockquote> <p>Newly-allocated objects may require expensive initialization and sometimes corresponding expensive destruction when no longer needed.</p> </blockquote> <p>I see this time and time again</p> <blockquote> <p>Every allocation tends to be on a new cache line and therefore data spread across many independent allocations will have a larger cache footprint than data spread across fewer allocations.</p> </blockquote> <p>Yes. Batch your allocations (basically arena)</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">LiveTensor</span><span class="o">::</span><span class="n">LiveTensor</span><span class="p">(</span><span class="n">tf</span><span class="o">::</span><span class="n">Tensor</span> <span class="n">t</span><span class="p">,</span> <span class="n">std</span><span class="o">::</span><span class="n">shared_ptr</span><span class="o">&lt;</span><span class="k">const</span> <span class="n">DeviceInfo</span><span class="o">&gt;</span> <span class="n">dinfo</span><span class="p">,</span> <span class="kt">bool</span> <span class="n">is_batched</span><span class="p">)</span> <span class="o">:</span> <span class="n">tensor</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">move</span><span class="p">(</span><span class="n">t</span><span class="p">)),</span> <span class="n">device_info</span><span class="p">(</span><span class="n">dinfo</span> <span class="o">?</span> <span class="n">std</span><span class="o">::</span><span class="n">move</span><span class="p">(</span><span class="n">dinfo</span><span class="p">)</span> <span class="o">:</span> <span class="n">std</span><span class="o">::</span><span class="n">make_shared</span><span class="o">&lt;</span><span class="n">DeviceInfo</span><span class="o">&gt;</span><span class="p">()),</span> <span 
class="n">is_batched</span><span class="p">(</span><span class="n">is_batched</span><span class="p">)</span> <span class="p">{</span> <span class="k">static</span> <span class="k">const</span> <span class="n">std</span><span class="o">::</span><span class="n">shared_ptr</span><span class="o">&lt;</span><span class="n">DeviceInfo</span><span class="o">&gt;&amp;</span> <span class="n">empty_device_info</span><span class="p">()</span> <span class="p">{</span> <span class="k">static</span> <span class="n">std</span><span class="o">::</span><span class="n">shared_ptr</span><span class="o">&lt;</span><span class="n">DeviceInfo</span><span class="o">&gt;*</span> <span class="n">result</span> <span class="o">=</span> <span class="k">new</span> <span class="n">std</span><span class="o">::</span><span class="n">shared_ptr</span><span class="o">&lt;</span><span class="n">DeviceInfo</span><span class="o">&gt;</span><span class="p">(</span><span class="k">new</span> <span class="n">DeviceInfo</span><span class="p">);</span> <span class="k">return</span> <span class="o">*</span><span class="n">result</span><span class="p">;</span> <span class="p">}</span> <span class="n">LiveTensor</span><span class="o">::</span><span class="n">LiveTensor</span><span class="p">(</span><span class="n">tf</span><span class="o">::</span><span class="n">Tensor</span> <span class="n">t</span><span class="p">,</span> <span class="n">std</span><span class="o">::</span><span class="n">shared_ptr</span><span class="o">&lt;</span><span class="k">const</span> <span class="n">DeviceInfo</span><span class="o">&gt;</span> <span class="n">dinfo</span><span class="p">,</span> <span class="kt">bool</span> <span class="n">is_batched</span><span class="p">)</span> <span class="o">:</span> <span class="n">tensor</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">move</span><span class="p">(</span><span class="n">t</span><span class="p">)),</span> <span 
class="n">is_batched</span><span class="p">(</span><span class="n">is_batched</span><span class="p">)</span> <span class="p">{</span> <span class="k">if</span> <span class="p">(</span><span class="n">dinfo</span><span class="p">)</span> <span class="p">{</span> <span class="n">device_info</span> <span class="o">=</span> <span class="n">std</span><span class="o">::</span><span class="n">move</span><span class="p">(</span><span class="n">dinfo</span><span class="p">);</span> <span class="p">}</span> <span class="k">else</span> <span class="p">{</span> <span class="n">device_info</span> <span class="o">=</span> <span class="n">empty_device_info</span><span class="p">();</span> <span class="p">}</span> </code></pre></div></div> <h3 id="resize-or-reserve-containers">Resize or reserve containers</h3> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">ndocs</span><span class="o">-</span><span class="mi">1</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span> <span class="n">uint32</span> <span class="n">delta</span><span class="p">;</span> <span class="n">ERRORCHECK</span><span class="p">(</span><span class="n">b</span><span class="o">-&gt;</span><span class="n">GetRice</span><span class="p">(</span><span class="n">rice_base</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">delta</span><span class="p">));</span> <span class="n">docs_</span><span class="p">.</span><span class="n">push_back</span><span class="p">(</span><span class="n">DocId</span><span class="p">(</span><span class="n">my_shard_</span> <span class="o">+</span> <span class="p">(</span><span class="n">base</span> <span 
class="o">+</span> <span class="n">delta</span><span class="p">)</span> <span class="o">*</span> <span class="n">num_shards_</span><span class="p">));</span> <span class="n">base</span> <span class="o">=</span> <span class="n">base</span> <span class="o">+</span> <span class="n">delta</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span> <span class="p">}</span> <span class="n">docs_</span><span class="p">.</span><span class="n">push_back</span><span class="p">(</span><span class="n">last_docid_</span><span class="p">);</span> <span class="n">docs_</span><span class="p">.</span><span class="n">resize</span><span class="p">(</span><span class="n">ndocs</span><span class="p">);</span> <span class="n">DocId</span><span class="o">*</span> <span class="n">docptr</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">docs_</span><span class="p">[</span><span class="mi">0</span><span class="p">];</span> <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">ndocs</span><span class="o">-</span><span class="mi">1</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span> <span class="n">uint32</span> <span class="n">delta</span><span class="p">;</span> <span class="n">ERRORCHECK</span><span class="p">(</span><span class="n">b</span><span class="p">.</span><span class="n">GetRice</span><span class="p">(</span><span class="n">rice_base</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">delta</span><span class="p">));</span> <span class="o">*</span><span class="n">docptr</span> <span class="o">=</span> <span class="n">DocId</span><span class="p">(</span><span class="n">my_shard_</span> <span class="o">+</span> <span class="p">(</span><span 
class="n">base</span> <span class="o">+</span> <span class="n">delta</span><span class="p">)</span> <span class="o">*</span> <span class="n">num_shards_</span><span class="p">);</span> <span class="n">docptr</span><span class="o">++</span><span class="p">;</span> <span class="n">base</span> <span class="o">=</span> <span class="n">base</span> <span class="o">+</span> <span class="n">delta</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span> <span class="p">}</span> <span class="o">*</span><span class="n">docptr</span> <span class="o">=</span> <span class="n">last_docid_</span><span class="p">;</span> </code></pre></div></div> <p>I actually do this a lot in my preallocated code for performance. Wow, I guess I do some things correctly.</p> <h3 id="avoid-copying-when-possible">Avoid copying when possible</h3> <p>One of the most critical things to do (doing no work is better than doing any work)</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">return</span> <span class="n">search_iterators</span><span class="o">::</span><span class="n">DocPLIteratorFactory</span><span class="o">::</span><span class="n">Create</span><span class="p">(</span><span class="n">opts</span><span class="p">);</span> <span class="k">return</span> <span class="n">search_iterators</span><span class="o">::</span><span class="n">DocPLIteratorFactory</span><span class="o">::</span><span class="n">Create</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">move</span><span class="p">(</span><span class="n">opts</span><span class="p">));</span> </code></pre></div></div> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">auto</span> <span class="n">iterator</span> <span class="o">=</span> <span class="n">absl</span><span class="o">::</span><span class="n">WrapUnique</span><span class="p">(</span><span
class="n">sstable</span><span class="o">-&gt;</span><span class="n">GetIterator</span><span class="p">());</span> <span class="k">while</span> <span class="p">(</span><span class="o">!</span><span class="n">iterator</span><span class="o">-&gt;</span><span class="n">done</span><span class="p">())</span> <span class="p">{</span> <span class="n">T</span> <span class="n">profile</span><span class="p">;</span> <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">profile</span><span class="p">.</span><span class="n">ParseFromString</span><span class="p">(</span><span class="n">iterator</span><span class="o">-&gt;</span><span class="n">value_view</span><span class="p">()))</span> <span class="p">{</span> <span class="k">return</span> <span class="n">absl</span><span class="o">::</span><span class="n">InternalError</span><span class="p">(</span> <span class="s">"Failed to parse mem_block to specified profile type."</span><span class="p">);</span> <span class="p">}</span> <span class="p">...</span> <span class="n">iterator</span><span class="o">-&gt;</span><span class="n">Next</span><span class="p">();</span> <span class="p">}</span> <span class="k">auto</span> <span class="n">iterator</span> <span class="o">=</span> <span class="n">absl</span><span class="o">::</span><span class="n">WrapUnique</span><span class="p">(</span><span class="n">sstable</span><span class="o">-&gt;</span><span class="n">GetIterator</span><span class="p">());</span> <span class="n">T</span> <span class="n">profile</span><span class="p">;</span> <span class="k">while</span> <span class="p">(</span><span class="o">!</span><span class="n">iterator</span><span class="o">-&gt;</span><span class="n">done</span><span class="p">())</span> <span class="p">{</span> <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">profile</span><span class="p">.</span><span class="n">ParseFromString</span><span class="p">(</span><span 
class="n">iterator</span><span class="o">-&gt;</span><span class="n">value_view</span><span class="p">()))</span> <span class="p">{</span> <span class="k">return</span> <span class="n">absl</span><span class="o">::</span><span class="n">InternalError</span><span class="p">(</span> <span class="s">"Failed to parse mem_block to specified profile type."</span><span class="p">);</span> <span class="p">}</span> <span class="p">...</span> <span class="n">iterator</span><span class="o">-&gt;</span><span class="n">Next</span><span class="p">();</span> <span class="p">}</span> </code></pre></div></div> <blockquote> <p>Often, code is written to cover all cases, but some subset of the cases are much simpler and more common than others. E.g., vector::push_back usually has enough space for the new element, but contains code to resize the underlying storage when it does not. Some attention paid to the structure of code can help make the common simple case faster without hurting uncommon case performance significantly.</p> </blockquote> <p>One has to understand the uncommon case underlying the API call. 
Say no error happened, we shouldn’t log at all.</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">RPC_Stats_Measurement</span><span class="o">::</span><span class="k">operator</span><span class="o">+=</span><span class="p">(</span><span class="k">const</span> <span class="n">RPC_Stats_Measurement</span><span class="o">&amp;</span> <span class="n">x</span><span class="p">)</span> <span class="p">{</span> <span class="p">...</span> <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">RPC</span><span class="o">::</span><span class="n">NUM_ERRORS</span><span class="p">;</span> <span class="o">++</span><span class="n">i</span><span class="p">)</span> <span class="p">{</span> <span class="n">errors</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">+=</span> <span class="n">x</span><span class="p">.</span><span class="n">errors</span><span class="p">[</span><span class="n">i</span><span class="p">];</span> <span class="p">}</span> <span class="p">}</span> <span class="kt">void</span> <span class="n">RPC_Stats_Measurement</span><span class="o">::</span><span class="k">operator</span><span class="o">+=</span><span class="p">(</span><span class="k">const</span> <span class="n">RPC_Stats_Measurement</span><span class="o">&amp;</span> <span class="n">x</span><span class="p">)</span> <span class="p">{</span> <span class="p">...</span> <span class="k">if</span> <span class="p">(</span><span class="n">x</span><span class="p">.</span><span class="n">any_errors_set</span><span class="p">)</span> <span class="p">{</span> <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span 
class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">RPC</span><span class="o">::</span><span class="n">NUM_ERRORS</span><span class="p">;</span> <span class="o">++</span><span class="n">i</span><span class="p">)</span> <span class="p">{</span> <span class="n">errors</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">+=</span> <span class="n">x</span><span class="p">.</span><span class="n">errors</span><span class="p">[</span><span class="n">i</span><span class="p">];</span> <span class="p">}</span> <span class="n">any_errors_set</span> <span class="o">=</span> <span class="nb">true</span><span class="p">;</span> <span class="p">}</span> <span class="p">}</span> </code></pre></div></div> <blockquote> <p>Preallocate 10 nodes not 200 for query handling in Google’s web server. A simple change that reduced web server’s CPU usage by 7.5%. Wow.</p> </blockquote> <p><code class="language-plaintext highlighter-rouge">querytree.h</code></p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="k">const</span> <span class="kt">int</span> <span class="n">kInitParseTreeSize</span> <span class="o">=</span> <span class="mi">200</span><span class="p">;</span> <span class="c1">// initial size of querynode pool</span> <span class="k">static</span> <span class="k">const</span> <span class="kt">int</span> <span class="n">kInitParseTreeSize</span> <span class="o">=</span> <span class="mi">10</span><span class="p">;</span> <span class="c1">// initial size of querynode pool</span> </code></pre></div></div> <h3 id="specialize-code">Specialize code</h3> <blockquote> <p>A particular performance-sensitive call-site may not need the full generality provided by a general-purpose library. 
Consider writing specialized code in such cases instead of calling the general-purpose code if it provides a performance improvement.</p> </blockquote> <p>Interesting, I haven’t done this before. This belongs in very heavily used code.</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">p</span><span class="o">-&gt;</span><span class="n">type</span> <span class="o">=</span> <span class="n">MATCH_TYPE_REGEXP</span><span class="p">;</span> <span class="n">term</span><span class="p">.</span><span class="n">NonMetaPrefix</span><span class="p">().</span><span class="n">CopyToString</span><span class="p">(</span><span class="o">&amp;</span><span class="n">p</span><span class="o">-&gt;</span><span class="n">prefix</span><span class="p">);</span> <span class="k">if</span> <span class="p">(</span><span class="n">term</span><span class="p">.</span><span class="n">RegexpSuffix</span><span class="p">()</span> <span class="o">==</span> <span class="s">".*"</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// Special case for a regexp that matches anything, so we can</span> <span class="c1">// bypass RE2::FullMatch</span> <span class="n">p</span><span class="o">-&gt;</span><span class="n">type</span> <span class="o">=</span> <span class="n">MATCH_TYPE_PREFIX</span><span class="p">;</span> <span class="p">}</span> <span class="k">else</span> <span class="p">{</span> <span class="n">p</span><span class="o">-&gt;</span><span class="n">type</span> <span class="o">=</span> <span class="n">MATCH_TYPE_REGEXP</span><span class="p">;</span> </code></pre></div></div> <h3 id="make-the-compilers-job-easier">Make the compiler’s job easier</h3> <blockquote> <p>The application programmer will often know more about the behavior of the system and can aid the compiler by rewriting the code to operate at a lower level.
However, only do this when profiles show an issue since compilers will often get things right on their own. Looking at the generated assembly code for performance critical routines can help you understand if the compiler is “getting it right”. Pprof provides a very helpful display of source code interleaved with disassembly and annotated with performance data.</p> </blockquote> <p>If you understand the code extremely well, you can get to this stage, or use a specific tool that shows the assembly (rare!)</p> <blockquote> <p>Avoid function calls in hot functions (allows the compiler to avoid frame setup costs). Move slow-path code into a separate tail-called function. Copy small amounts of data into local variables before heavy use. This can let the compiler assume there is no aliasing with other data, which may improve auto-vectorization and register allocation. Hand-unroll very hot loops.</p> </blockquote> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">Key</span><span class="o">::</span><span class="n">InitSeps</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span><span class="o">*</span> <span class="n">start</span><span class="p">)</span> <span class="p">{</span> <span class="k">const</span> <span class="kt">char</span><span class="o">*</span> <span class="n">base</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">rep_</span><span class="p">[</span><span class="mi">0</span><span class="p">];</span> <span class="k">const</span> <span class="kt">char</span><span class="o">*</span> <span class="n">limit</span> <span class="o">=</span> <span class="n">base</span> <span class="o">+</span> <span class="n">rep_</span><span class="p">.</span><span class="n">size</span><span class="p">();</span> <span class="k">const</span> <span class="kt">char</span><span class="o">*</span> <span class="n">s</span> <span
class="o">=</span> <span class="n">start</span><span class="p">;</span> <span class="n">DCHECK_GE</span><span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="n">base</span><span class="p">);</span> <span class="n">DCHECK_LT</span><span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="n">limit</span><span class="p">);</span> <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">3</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span> <span class="n">s</span> <span class="o">=</span> <span class="p">(</span><span class="k">const</span> <span class="kt">char</span><span class="o">*</span><span class="p">)</span><span class="n">memchr</span><span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="sc">'#'</span><span class="p">,</span> <span class="n">limit</span> <span class="o">-</span> <span class="n">s</span><span class="p">);</span> <span class="n">DCHECK</span><span class="p">(</span><span class="n">s</span> <span class="o">!=</span> <span class="nb">NULL</span><span class="p">);</span> <span class="n">seps_</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">s</span> <span class="o">-</span> <span class="n">base</span><span class="p">;</span> <span class="n">s</span><span class="o">++</span><span class="p">;</span> <span class="p">}</span> <span class="p">}</span> <span class="kr">inline</span> <span class="k">const</span> <span class="kt">char</span><span class="o">*</span> <span class="nf">ScanBackwardsForSep</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span><span class="o">*</span> <span 
class="n">base</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span><span class="o">*</span> <span class="n">p</span><span class="p">)</span> <span class="p">{</span> <span class="k">while</span> <span class="p">(</span><span class="n">p</span> <span class="o">&gt;=</span> <span class="n">base</span> <span class="o">+</span> <span class="mi">4</span><span class="p">)</span> <span class="p">{</span> <span class="k">if</span> <span class="p">(</span><span class="n">p</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">==</span> <span class="sc">'#'</span><span class="p">)</span> <span class="k">return</span> <span class="n">p</span><span class="p">;</span> <span class="k">if</span> <span class="p">(</span><span class="n">p</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="o">==</span> <span class="sc">'#'</span><span class="p">)</span> <span class="k">return</span> <span class="n">p</span><span class="o">-</span><span class="mi">1</span><span class="p">;</span> <span class="k">if</span> <span class="p">(</span><span class="n">p</span><span class="p">[</span><span class="o">-</span><span class="mi">2</span><span class="p">]</span> <span class="o">==</span> <span class="sc">'#'</span><span class="p">)</span> <span class="k">return</span> <span class="n">p</span><span class="o">-</span><span class="mi">2</span><span class="p">;</span> <span class="k">if</span> <span class="p">(</span><span class="n">p</span><span class="p">[</span><span class="o">-</span><span class="mi">3</span><span class="p">]</span> <span class="o">==</span> <span class="sc">'#'</span><span class="p">)</span> <span class="k">return</span> <span class="n">p</span><span class="o">-</span><span class="mi">3</span><span class="p">;</span> <span class="n">p</span> <span class="o">-=</span> <span class="mi">4</span><span class="p">;</span> <span 
class="p">}</span> <span class="k">while</span> <span class="p">(</span><span class="n">p</span> <span class="o">&gt;=</span> <span class="n">base</span> <span class="o">&amp;&amp;</span> <span class="o">*</span><span class="n">p</span> <span class="o">!=</span> <span class="sc">'#'</span><span class="p">)</span> <span class="n">p</span><span class="o">--</span><span class="p">;</span> <span class="k">return</span> <span class="n">p</span><span class="p">;</span> <span class="p">}</span> <span class="kt">void</span> <span class="n">Key</span><span class="o">::</span><span class="n">InitSeps</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span><span class="o">*</span> <span class="n">start</span><span class="p">)</span> <span class="p">{</span> <span class="k">const</span> <span class="kt">char</span><span class="o">*</span> <span class="n">base</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">rep_</span><span class="p">[</span><span class="mi">0</span><span class="p">];</span> <span class="k">const</span> <span class="kt">char</span><span class="o">*</span> <span class="n">limit</span> <span class="o">=</span> <span class="n">base</span> <span class="o">+</span> <span class="n">rep_</span><span class="p">.</span><span class="n">size</span><span class="p">();</span> <span class="k">const</span> <span class="kt">char</span><span class="o">*</span> <span class="n">s</span> <span class="o">=</span> <span class="n">start</span><span class="p">;</span> <span class="n">DCHECK_GE</span><span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="n">base</span><span class="p">);</span> <span class="n">DCHECK_LT</span><span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="n">limit</span><span class="p">);</span> <span class="c1">// We go backwards from the end of the string, rather than forwards,</span> <span class="c1">// since the directory name might be 
long and definitely doesn't contain</span> <span class="c1">// any '#' characters.</span> <span class="k">const</span> <span class="kt">char</span><span class="o">*</span> <span class="n">p</span> <span class="o">=</span> <span class="n">ScanBackwardsForSep</span><span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="n">limit</span> <span class="o">-</span> <span class="mi">1</span><span class="p">);</span> <span class="n">DCHECK</span><span class="p">(</span><span class="o">*</span><span class="n">p</span> <span class="o">==</span> <span class="sc">'#'</span><span class="p">);</span> <span class="n">seps_</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="o">=</span> <span class="n">p</span> <span class="o">-</span> <span class="n">base</span><span class="p">;</span> <span class="n">p</span><span class="o">--</span><span class="p">;</span> <span class="n">p</span> <span class="o">=</span> <span class="n">ScanBackwardsForSep</span><span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="n">p</span><span class="p">);</span> <span class="n">DCHECK</span><span class="p">(</span><span class="o">*</span><span class="n">p</span> <span class="o">==</span> <span class="sc">'#'</span><span class="p">);</span> <span class="n">seps_</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="n">p</span> <span class="o">-</span> <span class="n">base</span><span class="p">;</span> <span class="n">p</span><span class="o">--</span><span class="p">;</span> <span class="n">p</span> <span class="o">=</span> <span class="n">ScanBackwardsForSep</span><span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="n">p</span><span class="p">);</span> <span class="n">DCHECK</span><span class="p">(</span><span class="o">*</span><span class="n">p</span> <span class="o">==</span> <span 
class="sc">'#'</span><span class="p">);</span> <span class="n">seps_</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="n">p</span> <span class="o">-</span> <span class="n">base</span><span class="p">;</span> <span class="p">}</span> </code></pre></div></div> <h3 id="reduce-stats-collection-costs">Reduce stats collection costs</h3> <blockquote> <p>Balance the utility of stats and other behavioral information about a system against the cost of maintaining that information. The extra information can often help people to understand and improve high-level behavior, but can also be costly to maintain.</p> </blockquote> <p>Yes, I’ve seen this. Essentially, how do you decide what to instrument?</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Part</span> <span class="n">of</span> <span class="n">changes</span> <span class="n">that</span> <span class="n">reduce</span> <span class="n">time</span> <span class="k">for</span> <span class="n">setting</span> <span class="n">an</span> <span class="n">alarm</span> <span class="n">from</span> <span class="mi">771</span> <span class="n">ns</span> <span class="n">to</span> <span class="mi">271</span> <span class="n">ns</span><span class="p">.</span> <span class="n">selectserver</span><span class="p">.</span><span class="n">h</span> <span class="k">class</span> <span class="nc">SelectServer</span> <span class="p">{</span> <span class="nl">public:</span> <span class="p">...</span> <span class="nl">protected:</span> <span class="p">...</span> <span class="n">scoped_ptr</span><span class="o">&lt;</span><span class="n">MinuteTenMinuteHourStat</span><span class="o">&gt;</span> <span class="n">num_alarms_stat_</span><span class="p">;</span> <span class="p">...</span> <span class="n">scoped_ptr</span><span class="o">&lt;</span><span class="n">MinuteTenMinuteHourStat</span><span class="o">&gt;</span> <span
class="n">num_closures_stat_</span><span class="p">;</span> <span class="p">...</span> <span class="p">};</span> <span class="c1">// Selectserver class</span> <span class="k">class</span> <span class="nc">SelectServer</span> <span class="p">{</span> <span class="p">...</span> <span class="nl">protected:</span> <span class="p">...</span> <span class="p">};</span> <span class="o">/</span><span class="n">selectserver</span><span class="p">.</span><span class="n">cc</span> <span class="kt">void</span> <span class="n">SelectServer</span><span class="o">::</span><span class="n">AddAlarmInternal</span><span class="p">(</span><span class="n">Alarmer</span><span class="o">*</span> <span class="n">alarmer</span><span class="p">,</span> <span class="kt">int</span> <span class="n">offset_in_ms</span><span class="p">,</span> <span class="kt">int</span> <span class="n">id</span><span class="p">,</span> <span class="kt">bool</span> <span class="n">is_periodic</span><span class="p">)</span> <span class="p">{</span> <span class="p">...</span> <span class="n">alarms_</span><span class="o">-&gt;</span><span class="n">insert</span><span class="p">(</span><span class="n">alarm</span><span class="p">);</span> <span class="n">num_alarms_stat_</span><span class="o">-&gt;</span><span class="n">IncBy</span><span class="p">(</span><span class="mi">1</span><span class="p">);</span> <span class="p">...</span> <span class="p">}</span> <span class="kt">void</span> <span class="n">SelectServer</span><span class="o">::</span><span class="n">AddAlarmInternal</span><span class="p">(</span><span class="n">Alarmer</span><span class="o">*</span> <span class="n">alarmer</span><span class="p">,</span> <span class="kt">int</span> <span class="n">offset_in_ms</span><span class="p">,</span> <span class="kt">int</span> <span class="n">id</span><span class="p">,</span> <span class="kt">bool</span> <span class="n">is_periodic</span><span class="p">)</span> <span class="p">{</span> <span class="p">...</span> 
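// after: the num_alarms_stat_ update is removed, and insert() becomes Add()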
<span class="n">alarms_</span><span class="o">-&gt;</span><span class="n">Add</span><span class="p">(</span><span class="n">alarm</span><span class="p">);</span> <span class="p">...</span> <span class="p">}</span> <span class="o">/</span><span class="n">selectserver</span><span class="p">.</span><span class="n">cc</span> <span class="kt">void</span> <span class="n">SelectServer</span><span class="o">::</span><span class="n">RemoveAlarm</span><span class="p">(</span><span class="n">Alarmer</span><span class="o">*</span> <span class="n">alarmer</span><span class="p">,</span> <span class="kt">int</span> <span class="n">id</span><span class="p">)</span> <span class="p">{</span> <span class="p">...</span> <span class="n">alarms_</span><span class="o">-&gt;</span><span class="n">erase</span><span class="p">(</span><span class="n">alarm</span><span class="p">);</span> <span class="n">num_alarms_stat_</span><span class="o">-&gt;</span><span class="n">IncBy</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">);</span> <span class="p">...</span> <span class="p">}</span> <span class="kt">void</span> <span class="n">SelectServer</span><span class="o">::</span><span class="n">RemoveAlarm</span><span class="p">(</span><span class="n">Alarmer</span><span class="o">*</span> <span class="n">alarmer</span><span class="p">,</span> <span class="kt">int</span> <span class="n">id</span><span class="p">)</span> <span class="p">{</span> <span class="p">...</span> <span class="n">alarms_</span><span class="o">-&gt;</span><span class="n">Remove</span><span class="p">(</span><span class="n">alarm</span><span class="p">);</span> <span class="p">...</span> <span class="p">}</span> <span class="n">Often</span><span class="p">,</span> <span class="n">stats</span> <span class="n">or</span> <span class="n">other</span> <span class="n">properties</span> <span class="n">can</span> <span class="n">be</span> <span class="n">maintained</span> <span 
class="k">for</span> <span class="n">a</span> <span class="n">sample</span> <span class="n">of</span> <span class="n">the</span> <span class="n">elements</span> <span class="n">handled</span> <span class="n">by</span> <span class="n">the</span> <span class="n">system</span> <span class="p">(</span><span class="n">e</span><span class="p">.</span><span class="n">g</span><span class="p">.,</span> <span class="n">RPC</span> <span class="n">requests</span><span class="p">,</span> <span class="n">input</span> <span class="n">records</span><span class="p">,</span> <span class="n">users</span><span class="p">).</span> <span class="n">Many</span> <span class="n">subsystems</span> <span class="n">use</span> <span class="k">this</span> <span class="n">approach</span> <span class="p">(</span><span class="n">tcmalloc</span> <span class="n">allocation</span> <span class="n">tracking</span><span class="p">,</span> <span class="o">/</span><span class="n">requestz</span> <span class="n">status</span> <span class="n">pages</span><span class="p">,</span> <span class="n">Dapper</span> <span class="n">samples</span><span class="p">).</span> <span class="n">When</span> <span class="n">sampling</span><span class="p">,</span> <span class="n">consider</span> <span class="n">reducing</span> <span class="n">the</span> <span class="n">sampling</span> <span class="n">rate</span> <span class="n">when</span> <span class="n">appropriate</span><span class="p">.</span> </code></pre></div></div> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">This</span> <span class="n">change</span> <span class="n">reduces</span> <span class="n">the</span> <span class="n">sampling</span> <span class="n">rate</span> <span class="n">from</span> <span class="mi">1</span> <span class="n">in</span> <span class="mi">10</span> <span class="n">to</span> <span class="mi">1</span> <span class="n">in</span> <span class="mf">32.</span> <span 
class="n">Furthermore</span><span class="p">,</span> <span class="n">we</span> <span class="n">now</span> <span class="n">keep</span> <span class="n">execution</span> <span class="n">time</span> <span class="n">stats</span> <span class="n">just</span> <span class="k">for</span> <span class="n">the</span> <span class="n">sampled</span> <span class="n">events</span> <span class="n">and</span> <span class="n">speed</span> <span class="n">up</span> <span class="n">sampling</span> <span class="n">decisions</span> <span class="n">by</span> <span class="k">using</span> <span class="n">a</span> <span class="n">power</span> <span class="n">of</span> <span class="n">two</span> <span class="n">modulus</span><span class="p">.</span> <span class="n">This</span> <span class="n">code</span> <span class="n">is</span> <span class="n">called</span> <span class="n">on</span> <span class="n">every</span> <span class="n">packet</span> <span class="n">in</span> <span class="n">the</span> <span class="n">Google</span> <span class="n">Meet</span> <span class="n">video</span> <span class="n">conferencing</span> <span class="n">system</span> <span class="n">and</span> <span class="n">needed</span> <span class="n">performance</span> <span class="n">work</span> <span class="n">to</span> <span class="n">keep</span> <span class="n">up</span> <span class="n">with</span> <span class="n">capacity</span> <span class="n">demands</span> <span class="n">during</span> <span class="n">the</span> <span class="n">first</span> <span class="n">part</span> <span class="n">of</span> <span class="n">the</span> <span class="n">COVID</span> <span class="n">outbreak</span> <span class="n">as</span> <span class="n">users</span> <span class="n">rapidly</span> <span class="n">migrated</span> <span class="n">to</span> <span class="n">doing</span> <span class="n">more</span> <span class="n">online</span> <span class="n">meetings</span><span class="p">.</span> <span class="n">packet_executor</span><span 
class="p">.</span><span class="n">cc</span> <span class="k">class</span> <span class="nc">ScopedPerformanceMeasurement</span> <span class="p">{</span> <span class="nl">public:</span> <span class="k">explicit</span> <span class="n">ScopedPerformanceMeasurement</span><span class="p">(</span><span class="n">PacketExecutor</span><span class="o">*</span> <span class="n">packet_executor</span><span class="p">)</span> <span class="o">:</span> <span class="n">packet_executor_</span><span class="p">(</span><span class="n">packet_executor</span><span class="p">),</span> <span class="n">tracer_</span><span class="p">(</span><span class="n">packet_executor</span><span class="o">-&gt;</span><span class="n">packet_executor_trace_threshold_</span><span class="p">,</span> <span class="n">kClosureTraceName</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// ThreadCPUUsage is an expensive call. At the time of writing,</span> <span class="c1">// it takes over 400ns, or roughly 30 times slower than absl::Now,</span> <span class="c1">// so we sample only 10% of closures to keep the cost down.</span> <span class="k">if</span> <span class="p">(</span><span class="n">packet_executor</span><span class="o">-&gt;</span><span class="n">closures_executed_</span> <span class="o">%</span> <span class="mi">10</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span> <span class="n">thread_cpu_usage_start_</span> <span class="o">=</span> <span class="n">base</span><span class="o">::</span><span class="n">ThreadCPUUsage</span><span class="p">();</span> <span class="p">}</span> <span class="c1">// Sample start time after potentially making the above expensive call,</span> <span class="c1">// so as not to pollute wall time measurements.</span> <span class="n">run_start_time_</span> <span class="o">=</span> <span class="n">absl</span><span class="o">::</span><span class="n">Now</span><span class="p">();</span> <span 
class="p">}</span> <span class="o">~</span><span class="n">ScopedPerformanceMeasurement</span><span class="p">()</span> <span class="p">{</span> <span class="n">ScopedPerformanceMeasurement</span><span class="o">::</span><span class="n">ScopedPerformanceMeasurement</span><span class="p">(</span> <span class="n">PacketExecutor</span><span class="o">*</span> <span class="n">packet_executor</span><span class="p">)</span> <span class="o">:</span> <span class="n">packet_executor_</span><span class="p">(</span><span class="n">packet_executor</span><span class="p">),</span> <span class="n">tracer_</span><span class="p">(</span><span class="n">packet_executor</span><span class="o">-&gt;</span><span class="n">packet_executor_trace_threshold_</span><span class="p">,</span> <span class="n">kClosureTraceName</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// ThreadCPUUsage is an expensive call. At the time of writing,</span> <span class="c1">// it takes over 400ns, or roughly 30 times slower than absl::Now,</span> <span class="c1">// so we sample only 1 in 32 closures to keep the cost down.</span> <span class="k">if</span> <span class="p">(</span><span class="n">packet_executor</span><span class="o">-&gt;</span><span class="n">closures_executed_</span> <span class="o">%</span> <span class="mi">32</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span> <span class="n">thread_cpu_usage_start_</span> <span class="o">=</span> <span class="n">base</span><span class="o">::</span><span class="n">ThreadCPUUsage</span><span class="p">();</span> <span class="p">}</span> <span class="c1">// Sample start time after potentially making the above expensive call,</span> <span class="c1">// so as not to pollute wall time measurements.</span> <span class="n">run_start_time_</span> <span class="o">=</span> <span class="n">absl</span><span class="o">::</span><span class="n">Now</span><span class="p">();</span> <span 
class="p">}</span> <span class="n">packet_executor</span><span class="p">.</span><span class="n">cc</span> <span class="o">~</span><span class="n">ScopedPerformanceMeasurement</span><span class="p">()</span> <span class="p">{</span> <span class="k">auto</span> <span class="n">run_end_time</span> <span class="o">=</span> <span class="n">absl</span><span class="o">::</span><span class="n">Now</span><span class="p">();</span> <span class="k">auto</span> <span class="n">run_duration</span> <span class="o">=</span> <span class="n">run_end_time</span> <span class="o">-</span> <span class="n">run_start_time_</span><span class="p">;</span> <span class="k">if</span> <span class="p">(</span><span class="n">thread_cpu_usage_start_</span><span class="p">.</span><span class="n">has_value</span><span class="p">())</span> <span class="p">{</span> <span class="p">...</span> <span class="p">}</span> <span class="n">closure_execution_time</span><span class="o">-&gt;</span><span class="n">Record</span><span class="p">(</span><span class="n">absl</span><span class="o">::</span><span class="n">ToInt64Microseconds</span><span class="p">(</span><span class="n">run_duration</span><span class="p">));</span> <span class="n">ScopedPerformanceMeasurement</span><span class="o">::~</span><span class="n">ScopedPerformanceMeasurement</span><span class="p">()</span> <span class="p">{</span> <span class="k">auto</span> <span class="n">run_end_time</span> <span class="o">=</span> <span class="n">absl</span><span class="o">::</span><span class="n">Now</span><span class="p">();</span> <span class="k">auto</span> <span class="n">run_duration</span> <span class="o">=</span> <span class="n">run_end_time</span> <span class="o">-</span> <span class="n">run_start_time_</span><span class="p">;</span> <span class="k">if</span> <span class="p">(</span><span class="n">thread_cpu_usage_start_</span><span class="p">.</span><span class="n">has_value</span><span class="p">())</span> <span class="p">{</span> <span 
class="p">...</span> <span class="n">closure_execution_time</span><span class="o">-&gt;</span><span class="n">Record</span><span class="p">(</span><span class="n">absl</span><span class="o">::</span><span class="n">ToInt64Microseconds</span><span class="p">(</span><span class="n">run_duration</span><span class="p">));</span> <span class="p">}</span> </code></pre></div></div> <h3 id="avoid-logging-on-hot-code-paths">Avoid logging on hot code paths</h3> <blockquote> <p>Logging statements can be costly, even if the logging-level for the statement doesn’t actually log anything. E.g., ABSL_VLOG’s implementation requires at least a load and a comparison, which may be a problem in hot code paths. In addition, the presence of the logging code may inhibit compiler optimizations. Consider dropping logging entirely from hot code paths.</p> </blockquote> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">image_similarity</span><span class="p">.</span><span class="n">cc</span> <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">j</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">j</span> <span class="o">&lt;</span> <span class="n">output_subimage_size_y</span><span class="p">;</span> <span class="n">j</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span> <span class="kt">int</span> <span class="n">j1</span> <span class="o">=</span> <span class="n">j</span> <span class="o">-</span> <span class="n">rad</span> <span class="o">+</span> <span class="n">output_to_integral_subimage_y</span><span class="p">;</span> <span class="kt">int</span> <span class="n">j2</span> <span class="o">=</span> <span class="n">j1</span> <span class="o">+</span> <span class="mi">2</span> <span class="o">*</span> <span class="n">rad</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span> <span 
class="c1">// Create a pointer for this row's output, taking into account the offset</span> <span class="c1">// to the full image.</span> <span class="kt">double</span> <span class="o">*</span><span class="n">image_diff_ptr</span> <span class="o">=</span> <span class="o">&amp;</span><span class="p">(</span><span class="o">*</span><span class="n">image_diff</span><span class="p">)(</span><span class="n">j</span> <span class="o">+</span> <span class="n">min_j</span><span class="p">,</span> <span class="n">min_i</span><span class="p">);</span> <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">output_subimage_size_x</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span> <span class="p">...</span> <span class="k">if</span> <span class="p">(</span><span class="n">VLOG_IS_ON</span><span class="p">(</span><span class="mi">3</span><span class="p">))</span> <span class="p">{</span> <span class="p">...</span> <span class="p">}</span> <span class="p">...</span> <span class="p">}</span> <span class="p">}</span> <span class="k">const</span> <span class="kt">bool</span> <span class="n">vlog_3</span> <span class="o">=</span> <span class="n">DEBUG_MODE</span> <span class="o">?</span> <span class="n">VLOG_IS_ON</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span> <span class="o">:</span> <span class="nb">false</span><span class="p">;</span> <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">j</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">j</span> <span class="o">&lt;</span> <span class="n">output_subimage_size_y</span><span class="p">;</span> <span class="n">j</span><span 
class="o">++</span><span class="p">)</span> <span class="p">{</span> <span class="kt">int</span> <span class="n">j1</span> <span class="o">=</span> <span class="n">j</span> <span class="o">-</span> <span class="n">rad</span> <span class="o">+</span> <span class="n">output_to_integral_subimage_y</span><span class="p">;</span> <span class="kt">int</span> <span class="n">j2</span> <span class="o">=</span> <span class="n">j1</span> <span class="o">+</span> <span class="mi">2</span> <span class="o">*</span> <span class="n">rad</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span> <span class="c1">// Create a pointer for this row's output, taking into account the offset</span> <span class="c1">// to the full image.</span> <span class="kt">double</span> <span class="o">*</span><span class="n">image_diff_ptr</span> <span class="o">=</span> <span class="o">&amp;</span><span class="p">(</span><span class="o">*</span><span class="n">image_diff</span><span class="p">)(</span><span class="n">j</span> <span class="o">+</span> <span class="n">min_j</span><span class="p">,</span> <span class="n">min_i</span><span class="p">);</span> <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">output_subimage_size_x</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span> <span class="p">...</span> <span class="k">if</span> <span class="p">(</span><span class="n">vlog_3</span><span class="p">)</span> <span class="p">{</span> <span class="p">...</span> <span class="p">}</span> <span class="p">}</span> <span class="p">}</span> <span class="n">Run</span> <span class="nf">on</span> <span class="p">(</span><span class="mi">40</span> <span class="n">X</span> <span class="mi">2801</span> <span 
class="n">MHz</span> <span class="n">CPUs</span><span class="p">);</span> <span class="mi">2016</span><span class="o">-</span><span class="mo">05</span><span class="o">-</span><span class="mi">16</span><span class="n">T15</span><span class="o">:</span><span class="mi">55</span><span class="o">:</span><span class="mf">32.250633072</span><span class="o">-</span><span class="mo">07</span><span class="o">:</span><span class="mo">00</span> <span class="n">CPU</span><span class="o">:</span> <span class="n">Intel</span> <span class="n">Ivybridge</span> <span class="n">with</span> <span class="n">HyperThreading</span> <span class="p">(</span><span class="mi">20</span> <span class="n">cores</span><span class="p">)</span> <span class="n">dL1</span><span class="o">:</span><span class="mi">32</span><span class="n">KB</span> <span class="n">dL2</span><span class="o">:</span><span class="mi">256</span><span class="n">KB</span> <span class="n">dL3</span><span class="o">:</span><span class="mi">25</span><span class="n">MB</span> <span class="n">Benchmark</span> <span class="n">Base</span> <span class="p">(</span><span class="n">ns</span><span class="p">)</span> <span class="n">New</span> <span class="p">(</span><span class="n">ns</span><span class="p">)</span> <span class="n">Improvement</span> <span class="o">------------------------------------------------------------------</span> <span class="n">BM_NCCPerformance</span><span class="o">/</span><span class="mi">16</span> <span class="mi">29104</span> <span class="mi">26372</span> <span class="o">+</span><span class="mf">9.4</span><span class="o">%</span> <span class="n">BM_NCCPerformance</span><span class="o">/</span><span class="mi">64</span> <span class="mi">473235</span> <span class="mi">425281</span> <span class="o">+</span><span class="mf">10.1</span><span class="o">%</span> <span class="n">BM_NCCPerformance</span><span class="o">/</span><span class="mi">512</span> <span class="mi">30246238</span> <span 
class="mi">27622009</span> <span class="o">+</span><span class="mf">8.7</span><span class="o">%</span> <span class="n">BM_NCCPerformance</span><span class="o">/</span><span class="mi">1</span><span class="n">k</span> <span class="mi">125651445</span> <span class="mi">113361991</span> <span class="o">+</span><span class="mf">9.8</span><span class="o">%</span> <span class="n">BM_NCCLimitedBoundsPerformance</span><span class="o">/</span><span class="mi">16</span> <span class="mi">8314</span> <span class="mi">7498</span> <span class="o">+</span><span class="mf">9.8</span><span class="o">%</span> <span class="n">BM_NCCLimitedBoundsPerformance</span><span class="o">/</span><span class="mi">64</span> <span class="mi">143508</span> <span class="mi">132202</span> <span class="o">+</span><span class="mf">7.9</span><span class="o">%</span> <span class="n">BM_NCCLimitedBoundsPerformance</span><span class="o">/</span><span class="mi">512</span> <span class="mi">9335684</span> <span class="mi">8477567</span> <span class="o">+</span><span class="mf">9.2</span><span class="o">%</span> <span class="n">BM_NCCLimitedBoundsPerformance</span><span class="o">/</span><span class="mi">1</span><span class="n">k</span> <span class="mi">37223897</span> <span class="mi">34201739</span> <span class="o">+</span><span class="mf">8.1</span><span class="o">%</span> </code></pre></div></div> <h2 id="code-size-considerations">Code size considerations</h2> <blockquote> <p>Performance encompasses more than just runtime speed. Sometimes it is worth considering the effects of software choices on the size of generated code. Large code size means longer compile and link times, bloated binaries, more memory usage, more icache pressure, and other sometimes negative effects on microarchitectural structures like branch predictors, etc. 
Thinking about these issues is especially important when writing low-level library code that will be used in many places, or when writing templated code that you expect will be instantiated for many different types.</p> </blockquote> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Turn</span> <span class="n">many</span> <span class="n">map</span> <span class="n">insertion</span> <span class="n">calls</span> <span class="n">in</span> <span class="n">a</span> <span class="n">row</span> <span class="n">to</span> <span class="n">initialize</span> <span class="n">a</span> <span class="n">hash</span> <span class="n">table</span> <span class="n">of</span> <span class="n">emoji</span> <span class="n">characters</span> <span class="n">into</span> <span class="n">a</span> <span class="n">single</span> <span class="n">bulk</span> <span class="n">insert</span> <span class="nf">operation</span> <span class="p">(</span><span class="mi">188</span><span class="n">KB</span> <span class="n">of</span> <span class="n">text</span> <span class="n">down</span> <span class="n">to</span> <span class="mi">360</span> <span class="n">bytes</span> <span class="n">in</span> <span class="n">library</span> <span class="n">linked</span> <span class="n">into</span> <span class="n">many</span> <span class="n">binaries</span><span class="p">).</span> <span class="err">😊</span> <span class="n">textfallback_init</span><span class="p">.</span><span class="n">h</span> <span class="kr">inline</span> <span class="kt">void</span> <span class="n">AddEmojiFallbacks</span><span class="p">(</span><span class="n">TextFallbackMap</span> <span class="o">*</span><span class="n">map</span><span class="p">)</span> <span class="p">{</span> <span class="p">(</span><span class="o">*</span><span class="n">map</span><span class="p">)[</span><span class="mh">0xFE000</span><span class="p">]</span> <span class="o">=</span> <span class="o">&amp;</span><span
class="n">kFE000</span><span class="p">;</span> <span class="p">(</span><span class="o">*</span><span class="n">map</span><span class="p">)[</span><span class="mh">0xFE001</span><span class="p">]</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">kFE001</span><span class="p">;</span> <span class="p">(</span><span class="o">*</span><span class="n">map</span><span class="p">)[</span><span class="mh">0xFE002</span><span class="p">]</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">kFE002</span><span class="p">;</span> <span class="p">(</span><span class="o">*</span><span class="n">map</span><span class="p">)[</span><span class="mh">0xFE003</span><span class="p">]</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">kFE003</span><span class="p">;</span> <span class="p">(</span><span class="o">*</span><span class="n">map</span><span class="p">)[</span><span class="mh">0xFE004</span><span class="p">]</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">kFE004</span><span class="p">;</span> <span class="p">(</span><span class="o">*</span><span class="n">map</span><span class="p">)[</span><span class="mh">0xFE005</span><span class="p">]</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">kFE005</span><span class="p">;</span> <span class="p">...</span> <span class="p">(</span><span class="o">*</span><span class="n">map</span><span class="p">)[</span><span class="mh">0xFEE7D</span><span class="p">]</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">kFEE7D</span><span class="p">;</span> <span class="p">(</span><span class="o">*</span><span class="n">map</span><span class="p">)[</span><span class="mh">0xFEEA0</span><span class="p">]</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">kFEEA0</span><span class="p">;</span> <span class="p">(</span><span class="o">*</span><span class="n">map</span><span 
class="p">)[</span><span class="mh">0xFE331</span><span class="p">]</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">kFE331</span><span class="p">;</span> <span class="p">};</span> <span class="kr">inline</span> <span class="kt">void</span> <span class="nf">AddEmojiFallbacks</span><span class="p">(</span><span class="n">TextFallbackMap</span> <span class="o">*</span><span class="n">map</span><span class="p">)</span> <span class="p">{</span> <span class="cp">#define PAIR(x) {0x##x, &amp;k##x} </span> <span class="c1">// clang-format off</span> <span class="n">map</span><span class="o">-&gt;</span><span class="n">insert</span><span class="p">({</span> <span class="n">PAIR</span><span class="p">(</span><span class="n">FE000</span><span class="p">),</span> <span class="n">PAIR</span><span class="p">(</span><span class="n">FE001</span><span class="p">),</span> <span class="n">PAIR</span><span class="p">(</span><span class="n">FE002</span><span class="p">),</span> <span class="n">PAIR</span><span class="p">(</span><span class="n">FE003</span><span class="p">),</span> <span class="n">PAIR</span><span class="p">(</span><span class="n">FE004</span><span class="p">),</span> <span class="n">PAIR</span><span class="p">(</span><span class="n">FE005</span><span class="p">),</span> <span class="p">...</span> <span class="n">PAIR</span><span class="p">(</span><span class="n">FEE7D</span><span class="p">),</span> <span class="n">PAIR</span><span class="p">(</span><span class="n">FEEA0</span><span class="p">),</span> <span class="n">PAIR</span><span class="p">(</span><span class="n">FE331</span><span class="p">)});</span> <span class="c1">// clang-format on</span> <span class="cp">#undef PAIR </span><span class="p">};</span> </code></pre></div></div> <h3 id="parallelization-and-synchronization">Parallelization and synchronization</h3> <blockquote> <p>Modern machines have many cores, and they are often underutilized. 
Expensive work may therefore be completed faster by parallelizing it. The most common approach is to process different items in parallel and combine the results when done. Typically, the items are first partitioned into batches to avoid paying the cost of running something in parallel per item.</p> </blockquote> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Four</span><span class="o">-</span><span class="n">way</span> <span class="n">parallelization</span> <span class="n">improves</span> <span class="n">the</span> <span class="n">rate</span> <span class="n">of</span> <span class="n">encoding</span> <span class="n">tokens</span> <span class="n">by</span> <span class="o">~</span><span class="mf">3.6</span><span class="n">x</span><span class="p">.</span> <span class="n">blocked</span><span class="o">-</span><span class="n">token</span><span class="o">-</span><span class="n">coder</span><span class="p">.</span><span class="n">cc</span> <span class="n">MutexLock</span> <span class="nf">l</span><span class="p">(</span><span class="o">&amp;</span><span class="n">encoder_threads_lock</span><span class="p">);</span> <span class="k">if</span> <span class="p">(</span><span class="n">encoder_threads</span> <span class="o">==</span> <span class="nb">NULL</span><span class="p">)</span> <span class="p">{</span> <span class="n">encoder_threads</span> <span class="o">=</span> <span class="k">new</span> <span class="n">ThreadPool</span><span class="p">(</span><span class="n">NumCPUs</span><span class="p">());</span> <span class="n">encoder_threads</span><span class="o">-&gt;</span><span class="n">SetStackSize</span><span class="p">(</span><span class="mi">262144</span><span class="p">);</span> <span class="n">encoder_threads</span><span class="o">-&gt;</span><span class="n">StartWorkers</span><span class="p">();</span> <span class="p">}</span> <span class="n">encoder_threads</span><span class="o">-&gt;</span><span 
class="n">Add</span> <span class="p">(</span><span class="n">NewCallback</span><span class="p">(</span><span class="k">this</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">BlockedTokenEncoder</span><span class="o">::</span><span class="n">EncodeRegionInThread</span><span class="p">,</span> <span class="n">region_tokens</span><span class="p">,</span> <span class="n">N</span><span class="p">,</span> <span class="n">region</span><span class="p">,</span> <span class="n">stats</span><span class="p">,</span> <span class="n">controller_</span><span class="o">-&gt;</span><span class="n">GetClosureWithCost</span> <span class="p">(</span><span class="n">NewCallback</span><span class="p">(</span><span class="o">&amp;</span><span class="n">DummyCallback</span><span class="p">),</span> <span class="n">N</span><span class="p">)));</span> </code></pre></div></div> <blockquote> <p>The effect on system performance should be measured carefully – if spare CPU is not available, or if memory bandwidth is saturated, parallelization may not help, or may even hurt.</p> </blockquote> <p>This is the caveat. It’s hard to gauge this for every type of machine out there.</p> <h3 id="amortize-lock-acquisition">Amortize lock acquisition</h3> <blockquote> <p>Avoid fine-grained locking to reduce the cost of Mutex operations in hot paths. Caveat: this should only be done if the change does not increase lock contention.</p> </blockquote> <p>Interesting.
Yes, if there is another thread accessing it, it could theoretically be faster (say some section isn’t actually using the shared variable)</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Acquire lock once to free entire tree of query nodes, rather than reacquiring lock for every node in tree.</span> <span class="c1">// Pool of query nodes</span> <span class="n">ThreadSafeFreeList</span><span class="o">&lt;</span><span class="n">MustangQuery</span><span class="o">&gt;</span> <span class="n">pool_</span><span class="p">(</span><span class="mi">256</span><span class="p">);</span> <span class="p">...</span> <span class="kt">void</span> <span class="n">MustangQuery</span><span class="o">::</span><span class="n">Release</span><span class="p">(</span><span class="n">MustangQuery</span><span class="o">*</span> <span class="n">node</span><span class="p">)</span> <span class="p">{</span> <span class="k">if</span> <span class="p">(</span><span class="n">node</span> <span class="o">==</span> <span class="nb">NULL</span><span class="p">)</span> <span class="k">return</span><span class="p">;</span> <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span><span class="o">=</span><span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">node</span><span class="o">-&gt;</span><span class="n">children_</span><span class="o">-&gt;</span><span class="n">size</span><span class="p">();</span> <span class="o">++</span><span class="n">i</span><span class="p">)</span> <span class="n">Release</span><span class="p">((</span><span class="o">*</span><span class="n">node</span><span class="o">-&gt;</span><span class="n">children_</span><span class="p">)[</span><span class="n">i</span><span class="p">]);</span> <span class="n">node</span><span class="o">-&gt;</span><span class="n">children_</span><span
class="o">-&gt;</span><span class="n">clear</span><span class="p">();</span> <span class="n">pool_</span><span class="p">.</span><span class="n">Delete</span><span class="p">(</span><span class="n">node</span><span class="p">);</span> <span class="p">}</span> <span class="c1">// Pool of query nodes</span> <span class="n">Mutex</span> <span class="n">pool_lock_</span><span class="p">;</span> <span class="n">FreeList</span><span class="o">&lt;</span><span class="n">MustangQuery</span><span class="o">&gt;</span> <span class="n">pool_</span><span class="p">(</span><span class="mi">256</span><span class="p">);</span> <span class="p">...</span> <span class="kt">void</span> <span class="n">MustangQuery</span><span class="o">::</span><span class="n">Release</span><span class="p">(</span><span class="n">MustangQuery</span><span class="o">*</span> <span class="n">node</span><span class="p">)</span> <span class="p">{</span> <span class="k">if</span> <span class="p">(</span><span class="n">node</span> <span class="o">==</span> <span class="nb">NULL</span><span class="p">)</span> <span class="k">return</span><span class="p">;</span> <span class="n">MutexLock</span> <span class="n">l</span><span class="p">(</span><span class="o">&amp;</span><span class="n">pool_lock_</span><span class="p">);</span> <span class="n">ReleaseLocked</span><span class="p">(</span><span class="n">node</span><span class="p">);</span> <span class="p">}</span> <span class="kt">void</span> <span class="n">MustangQuery</span><span class="o">::</span><span class="n">ReleaseLocked</span><span class="p">(</span><span class="n">MustangQuery</span><span class="o">*</span> <span class="n">node</span><span class="p">)</span> <span class="p">{</span> <span class="cp">#ifndef NDEBUG </span> <span class="n">pool_lock_</span><span class="p">.</span><span class="n">AssertHeld</span><span class="p">();</span> <span class="cp">#endif </span> <span class="k">if</span> <span class="p">(</span><span class="n">node</span> 
<span class="o">==</span> <span class="nb">NULL</span><span class="p">)</span> <span class="k">return</span><span class="p">;</span> <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span><span class="o">=</span><span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">node</span><span class="o">-&gt;</span><span class="n">children_</span><span class="o">-&gt;</span><span class="n">size</span><span class="p">();</span> <span class="o">++</span><span class="n">i</span><span class="p">)</span> <span class="n">ReleaseLocked</span><span class="p">((</span><span class="o">*</span><span class="n">node</span><span class="o">-&gt;</span><span class="n">children_</span><span class="p">)[</span><span class="n">i</span><span class="p">]);</span> <span class="n">node</span><span class="o">-&gt;</span><span class="n">children_</span><span class="o">-&gt;</span><span class="n">clear</span><span class="p">();</span> <span class="n">pool_</span><span class="p">.</span><span class="n">Delete</span><span class="p">(</span><span class="n">node</span><span class="p">);</span> <span class="p">}</span> </code></pre></div></div> <h3 id="keep-critical-sections-short">Keep critical sections short</h3> <blockquote> <p>Avoid expensive work inside critical sections. 
In particular, watch out for innocuous looking code that might be doing RPCs or accessing files.</p> </blockquote> <p>Basically minimize critical sections, but in addition, try to find these critical sections that have high ROI.</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Avoid</span> <span class="n">RPC</span> <span class="k">while</span> <span class="n">holding</span> <span class="n">Mutex</span><span class="p">.</span> <span class="n">trainer</span><span class="p">.</span><span class="n">cc</span> <span class="p">{</span> <span class="c1">// Notify the parameter server that we are starting.</span> <span class="n">MutexLock</span> <span class="n">l</span><span class="p">(</span><span class="o">&amp;</span><span class="n">lock_</span><span class="p">);</span> <span class="n">model_</span> <span class="o">=</span> <span class="n">model</span><span class="p">;</span> <span class="n">MaybeRecordProgress</span><span class="p">(</span><span class="n">last_global_step_</span><span class="p">);</span> <span class="p">}</span> <span class="kt">bool</span> <span class="n">should_start_record_progress</span> <span class="o">=</span> <span class="nb">false</span><span class="p">;</span> <span class="n">int64</span> <span class="n">step_for_progress</span> <span class="o">=</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span> <span class="p">{</span> <span class="c1">// Notify the parameter server that we are starting.</span> <span class="n">MutexLock</span> <span class="n">l</span><span class="p">(</span><span class="o">&amp;</span><span class="n">lock_</span><span class="p">);</span> <span class="n">model_</span> <span class="o">=</span> <span class="n">model</span><span class="p">;</span> <span class="n">should_start_record_progress</span> <span class="o">=</span> <span class="n">ShouldStartRecordProgress</span><span class="p">();</span> <span 
class="n">step_for_progress</span> <span class="o">=</span> <span class="n">last_global_step_</span><span class="p">;</span> <span class="p">}</span> <span class="k">if</span> <span class="p">(</span><span class="n">should_start_record_progress</span><span class="p">)</span> <span class="p">{</span> <span class="n">StartRecordProgress</span><span class="p">(</span><span class="n">step_for_progress</span><span class="p">);</span> <span class="p">}</span> </code></pre></div></div> <h3 id="reduce-contention-by-sharding">Reduce contention by sharding</h3> <blockquote> <p>Sometimes a data structure protected by a Mutex that is exhibiting high contention can be safely split into multiple shards, each shard with its own Mutex. (Note: this requires that there are no cross-shard invariants between the different shards.)</p> </blockquote> <p>This just means that the underlying elements can be processed in parallel, but the global object cannot be accessed during this time. I didn’t realize you could just initialize multiple copies.</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">ShardedLRUCache</span> <span class="o">:</span> <span class="k">public</span> <span class="n">Cache</span> <span class="p">{</span> <span class="nl">private:</span> <span class="n">LRUCache</span> <span class="n">shard_</span><span class="p">[</span><span class="n">kNumShards</span><span class="p">];</span> <span class="n">port</span><span class="o">::</span><span class="n">Mutex</span> <span class="n">id_mutex_</span><span class="p">;</span> <span class="kt">uint64_t</span> <span class="n">last_id_</span><span class="p">;</span> <span class="k">static</span> <span class="kr">inline</span> <span class="kt">uint32_t</span> <span class="n">HashSlice</span><span class="p">(</span><span class="k">const</span> <span class="n">Slice</span><span class="o">&amp;</span> <span class="n">s</span><span
class="p">)</span> <span class="p">{</span> <span class="k">return</span> <span class="n">Hash</span><span class="p">(</span><span class="n">s</span><span class="p">.</span><span class="n">data</span><span class="p">(),</span> <span class="n">s</span><span class="p">.</span><span class="n">size</span><span class="p">(),</span> <span class="mi">0</span><span class="p">);</span> <span class="p">}</span> <span class="k">static</span> <span class="kt">uint32_t</span> <span class="nf">Shard</span><span class="p">(</span><span class="kt">uint32_t</span> <span class="n">hash</span><span class="p">)</span> <span class="p">{</span> <span class="k">return</span> <span class="n">hash</span> <span class="o">&gt;&gt;</span> <span class="p">(</span><span class="mi">32</span> <span class="o">-</span> <span class="n">kNumShardBits</span><span class="p">);</span> <span class="p">}</span> <span class="p">...</span> <span class="k">virtual</span> <span class="n">Handle</span><span class="o">*</span> <span class="nf">Lookup</span><span class="p">(</span><span class="k">const</span> <span class="n">Slice</span><span class="o">&amp;</span> <span class="n">key</span><span class="p">)</span> <span class="p">{</span> <span class="k">const</span> <span class="kt">uint32_t</span> <span class="n">hash</span> <span class="o">=</span> <span class="n">HashSlice</span><span class="p">(</span><span class="n">key</span><span class="p">);</span> <span class="k">return</span> <span class="n">shard_</span><span class="p">[</span><span class="n">Shard</span><span class="p">(</span><span class="n">hash</span><span class="p">)].</span><span class="n">Lookup</span><span class="p">(</span><span class="n">key</span><span class="p">,</span> <span class="n">hash</span><span class="p">);</span> <span class="p">}</span> </code></pre></div></div> <blockquote> <p>Be careful with the information used for shard selection. 
If, for example, you use some bits of a hash value for shard selection and then those same bits end up being used again later, the latter use may perform poorly since it sees a skewed distribution of hash values.</p> </blockquote> <p>For sharding, equal distribution is always important. Nothing should be overloaded.</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">This</span> <span class="n">CL</span> <span class="n">partitions</span> <span class="n">the</span> <span class="n">ActiveCallMap</span> <span class="n">into</span> <span class="mi">64</span> <span class="n">shards</span><span class="p">.</span> <span class="n">Each</span> <span class="n">shard</span> <span class="n">is</span> <span class="k">protected</span> <span class="n">by</span> <span class="n">a</span> <span class="n">separate</span> <span class="n">mutex</span><span class="p">.</span> <span class="n">A</span> <span class="n">given</span> <span class="n">transaction</span> <span class="n">will</span> <span class="n">be</span> <span class="n">mapped</span> <span class="n">to</span> <span class="n">exactly</span> <span class="n">one</span> <span class="n">shard</span><span class="p">.</span> <span class="n">A</span> <span class="k">new</span> <span class="n">interface</span> <span class="nf">LockedShard</span><span class="p">(</span><span class="n">tid</span><span class="p">)</span> <span class="n">is</span> <span class="n">added</span> <span class="k">for</span> <span class="n">accessing</span> <span class="n">the</span> <span class="n">ActiveCallMap</span> <span class="k">for</span> <span class="n">a</span> <span class="n">transaction</span> <span class="n">in</span> <span class="n">a</span> <span class="kr">thread</span><span class="o">-</span><span class="n">safe</span> <span class="n">manner</span><span class="p">.</span> <span class="n">Example</span> <span class="n">usage</span><span class="o">:</span> <span 
class="n">transaction_manager</span><span class="p">.</span><span class="n">cc</span> <span class="p">{</span> <span class="n">absl</span><span class="o">::</span><span class="n">MutexLock</span> <span class="n">l</span><span class="p">(</span><span class="o">&amp;</span><span class="n">active_calls_in_mu_</span><span class="p">);</span> <span class="n">delayed_locks_timer_ring_</span><span class="p">.</span><span class="n">Add</span><span class="p">(</span><span class="n">delayed_locks_flush_time_ms</span><span class="p">,</span> <span class="n">tid</span><span class="p">);</span> <span class="p">}</span> <span class="p">{</span> <span class="n">ActiveCalls</span><span class="o">::</span><span class="n">LockedShard</span> <span class="n">shard</span><span class="p">(</span><span class="n">active_calls_in_</span><span class="p">,</span> <span class="n">tid</span><span class="p">);</span> <span class="n">shard</span><span class="p">.</span><span class="n">delayed_locks_timer_ring</span><span class="p">().</span><span class="n">Add</span><span class="p">(</span><span class="n">delayed_locks_flush_time_ms</span><span class="p">,</span> <span class="n">tid</span><span class="p">);</span> <span class="p">}</span> <span class="n">The</span> <span class="n">results</span> <span class="n">show</span> <span class="n">a</span> <span class="mi">69</span><span class="o">%</span> <span class="n">reduction</span> <span class="n">in</span> <span class="n">overall</span> <span class="n">wall</span><span class="o">-</span><span class="n">clock</span> <span class="n">time</span> <span class="n">when</span> <span class="n">running</span> <span class="n">the</span> <span class="n">benchmark</span> <span class="n">with</span> <span class="mi">8192</span> <span class="n">fibers</span> </code></pre></div></div> <h3 id="reduce-false-sharing">Reduce false sharing</h3> <blockquote> <p>If different threads access different mutable data, consider placing the different data items on different 
cache lines, e.g., in C++ using the alignas directive. However, these directives are easy to misuse and may increase object sizes significantly, so make sure performance measurements justify their use.</p> </blockquote> <p>Trade size for performance… How do you even identify such a thing</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">histogram</span><span class="p">.</span><span class="n">h</span> <span class="n">HistogramOptions</span> <span class="n">options_</span><span class="p">;</span> <span class="p">...</span> <span class="n">internal</span><span class="o">::</span><span class="n">HistogramBoundaries</span> <span class="o">*</span><span class="n">boundaries_</span><span class="p">;</span> <span class="p">...</span> <span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o">&lt;</span><span class="kt">double</span><span class="o">&gt;</span> <span class="n">buckets_</span><span class="p">;</span> <span class="kt">double</span> <span class="n">min_</span><span class="p">;</span> <span class="c1">// Minimum.</span> <span class="kt">double</span> <span class="n">max_</span><span class="p">;</span> <span class="c1">// Maximum.</span> <span class="kt">double</span> <span class="n">count_</span><span class="p">;</span> <span class="c1">// Total count of occurrences.</span> <span class="kt">double</span> <span class="n">sum_</span><span class="p">;</span> <span class="c1">// Sum of values.</span> <span class="kt">double</span> <span class="n">sum_of_squares_</span><span class="p">;</span> <span class="c1">// Sum of squares of values.</span> <span class="p">...</span> <span class="n">RegisterVariableExporter</span> <span class="o">*</span><span class="n">exporter_</span><span class="p">;</span> <span class="n">HistogramOptions</span> <span class="n">options_</span><span class="p">;</span> <span class="p">...</span> <span class="n">internal</span><span 
class="o">::</span><span class="n">HistogramBoundaries</span> <span class="o">*</span><span class="n">boundaries_</span><span class="p">;</span> <span class="p">...</span> <span class="n">RegisterVariableExporter</span> <span class="o">*</span><span class="n">exporter_</span><span class="p">;</span> <span class="p">...</span> <span class="c1">// Place the following fields in a dedicated cacheline as they are frequently</span> <span class="c1">// mutated, so we can avoid potential false sharing.</span> <span class="p">...</span> <span class="cp">#ifndef SWIG </span> <span class="k">alignas</span><span class="p">(</span><span class="n">ABSL_CACHELINE_SIZE</span><span class="p">)</span> <span class="cp">#endif </span> <span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o">&lt;</span><span class="kt">double</span><span class="o">&gt;</span> <span class="n">buckets_</span><span class="p">;</span> <span class="kt">double</span> <span class="n">min_</span><span class="p">;</span> <span class="c1">// Minimum.</span> <span class="kt">double</span> <span class="n">max_</span><span class="p">;</span> <span class="c1">// Maximum.</span> <span class="kt">double</span> <span class="n">count_</span><span class="p">;</span> <span class="c1">// Total count of occurrences.</span> <span class="kt">double</span> <span class="n">sum_</span><span class="p">;</span> <span class="c1">// Sum of values.</span> <span class="kt">double</span> <span class="n">sum_of_squares_</span><span class="p">;</span> <span class="c1">// Sum of squares of values.</span> </code></pre></div></div> <h3 id="reduce-frequency-of-context-switches">Reduce frequency of context switches</h3> <blockquote> <p>Process small work items inline instead of on device thread pool.</p> </blockquote> <p>hard to see this without a tracer</p> <h3 id="consider-lock-free-approaches">Consider lock-free approaches</h3> <blockquote> <p>Sometimes lock-free data structures can make a difference 
over more conventional mutex-protected data structures. However, direct atomic variable manipulation can be dangerous. Prefer higher-level abstractions.</p> </blockquote> <p>Extremely hard to debug and catch issues with. I don’t have expertise in this.</p> <h3 id="protocol-buffer-advice">Protocol Buffer advice</h3> <p>I think this section is rather large, and for good reason. Messages are one of the foundational building blocks of any distributed system, and optimizing even a small percentage will have high yields. This section covers good practices that can be applied to any message protocol.</p> <p>What I mostly got from this section is that you need to look at the generated serialization code, understand its serialization/deserialization overhead, and find the best practices to reduce that overhead (either by editing the proto file or by editing the C++, e.g. by adding arenas).</p> <h3 id="c-specific-advice">C++-Specific advice</h3> <p>Prefer absl::flat_hash_map (and set) over the std equivalents. This is generally true for almost all standard-library containers in C++ except a very small subset (like std::vector).</p> <blockquote> <p>absl::InlinedVector stores a small number of elements inline (configurable via the second template argument). This enables small vectors up to this number of elements to generally have better cache efficiency and also to avoid allocating a backing store array at all when the number of elements is small.</p> </blockquote> <p>This is probably just allocating on the stack.
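To make the idea concrete, here's a minimal std-only sketch of the small-buffer trick; the `SmallVec` type and its layout are hypothetical illustrations, not absl::InlinedVector's actual implementation:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical SmallVec<T, N>: the first N elements live in an inline
// array (stack/object storage); only when it overflows do the elements
// spill to a heap-allocated backing store.
template <typename T, std::size_t N>
class SmallVec {
 public:
  void push_back(const T& v) {
    if (size_ < N && heap_.empty()) {
      inline_[size_++] = v;  // small sizes: no allocation at all
    } else {
      spill();               // one-time copy of inline elements to the heap
      heap_.push_back(v);
      ++size_;
    }
  }
  std::size_t size() const { return size_; }
  bool inlined() const { return heap_.empty(); }
  const T& operator[](std::size_t i) const {
    return heap_.empty() ? inline_[i] : heap_[i];
  }

 private:
  void spill() {
    if (heap_.empty()) heap_.assign(inline_, inline_ + size_);
  }
  T inline_[N] = {};
  std::size_t size_ = 0;
  std::vector<T> heap_;  // backing store, used only after overflow
};
```

In real code you'd just write `absl::InlinedVector<int, 8> v;` and get the same behavior, plus proper growth handling and move semantics.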
It’s nice, similar to llvm::SmallVector</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">gtl</span><span class="o">::</span><span class="n">vector32</span> <span class="n">Saves</span> <span class="n">space</span> <span class="n">by</span> <span class="k">using</span> <span class="n">a</span> <span class="n">customized</span> <span class="n">vector</span> <span class="n">type</span> <span class="n">that</span> <span class="n">only</span> <span class="n">supports</span> <span class="n">sizes</span> <span class="n">that</span> <span class="n">fit</span> <span class="n">in</span> <span class="mi">32</span> <span class="n">bits</span><span class="p">.</span> <span class="n">Simple</span> <span class="n">type</span> <span class="n">change</span> <span class="n">saves</span> <span class="o">~</span><span class="mi">8</span><span class="n">TiB</span> <span class="n">of</span> <span class="n">memory</span> <span class="n">in</span> <span class="n">Spanner</span><span class="p">.</span> <span class="n">table_ply</span><span class="p">.</span><span class="n">h</span> <span class="k">class</span> <span class="nc">TablePly</span> <span class="p">{</span> <span class="p">...</span> <span class="c1">// Returns the set of data columns stored in this file for this table.</span> <span class="k">const</span> <span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o">&lt;</span><span class="n">FamilyId</span><span class="o">&gt;&amp;</span> <span class="n">modified_data_columns</span><span class="p">()</span> <span class="k">const</span> <span class="p">{</span> <span class="k">return</span> <span class="n">modified_data_columns_</span><span class="p">;</span> <span class="p">}</span> <span class="p">...</span> <span class="k">private</span><span class="o">:</span> <span class="p">...</span> <span class="n">std</span><span class="o">::</span><span class="n">vector</span><span
class="o">&lt;</span><span class="n">FamilyId</span><span class="o">&gt;</span> <span class="n">modified_data_columns_</span><span class="p">;</span> <span class="c1">// Data columns in the table.</span> <span class="cp">#include</span> <span class="cpf">"util/gtl/vector32.h"</span><span class="cp"> </span> <span class="p">...</span> <span class="c1">// Returns the set of data columns stored in this file for this table.</span> <span class="n">absl</span><span class="o">::</span><span class="n">Span</span><span class="o">&lt;</span><span class="k">const</span> <span class="n">FamilyId</span><span class="o">&gt;</span> <span class="n">modified_data_columns</span><span class="p">()</span> <span class="k">const</span> <span class="p">{</span> <span class="k">return</span> <span class="n">modified_data_columns_</span><span class="p">;</span> <span class="p">}</span> <span class="p">...</span> <span class="p">...</span> <span class="c1">// Data columns in the table.</span> <span class="n">gtl</span><span class="o">::</span><span class="n">vector32</span><span class="o">&lt;</span><span class="n">FamilyId</span><span class="o">&gt;</span> <span class="n">modified_data_columns_</span><span class="p">;</span> </code></pre></div></div> <p>This is very cool. 
I guess the data type won’t align up to 64bits, so you can cut it in half.</p> <h1 id="bulk-operations">Bulk operations</h1> <p>As per usual, bulk computation is the answer since memory is the bottleneck…</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Introduced</span> <span class="n">a</span> <span class="n">GroupVarInt</span> <span class="n">format</span> <span class="n">that</span> <span class="n">encodes</span><span class="o">/</span><span class="n">decodes</span> <span class="n">groups</span> <span class="n">of</span> <span class="mi">4</span> <span class="n">variable</span><span class="o">-</span><span class="n">length</span> <span class="n">integers</span> <span class="n">at</span> <span class="n">a</span> <span class="n">time</span> <span class="n">in</span> <span class="mi">5</span><span class="o">-</span><span class="mi">17</span> <span class="n">bytes</span><span class="p">,</span> <span class="n">rather</span> <span class="n">than</span> <span class="n">one</span> <span class="n">integer</span> <span class="n">at</span> <span class="n">a</span> <span class="n">time</span><span class="p">.</span> <span class="n">Decoding</span> <span class="n">one</span> <span class="n">group</span> <span class="n">of</span> <span class="mi">4</span> <span class="n">integers</span> <span class="n">in</span> <span class="n">the</span> <span class="k">new</span> <span class="n">format</span> <span class="n">takes</span> <span class="o">~</span><span class="mi">1</span><span class="o">/</span><span class="mi">3</span><span class="n">rd</span> <span class="n">the</span> <span class="n">time</span> <span class="n">of</span> <span class="n">decoding</span> <span class="mi">4</span> <span class="n">individually</span> <span class="n">varint</span><span class="o">-</span><span class="n">encoded</span> <span class="n">integers</span><span class="p">.</span> <span class="n">groupvarint</span><span 
class="p">.</span><span class="n">cc</span> <span class="k">const</span> <span class="kt">char</span><span class="o">*</span> <span class="nf">DecodeGroupVar</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span><span class="o">*</span> <span class="n">p</span><span class="p">,</span> <span class="kt">int</span> <span class="n">N</span><span class="p">,</span> <span class="n">uint32</span><span class="o">*</span> <span class="n">dest</span><span class="p">)</span> <span class="p">{</span> <span class="n">assert</span><span class="p">(</span><span class="n">groupvar_initialized</span><span class="p">);</span> <span class="n">assert</span><span class="p">(</span><span class="n">N</span> <span class="o">%</span> <span class="mi">4</span> <span class="o">==</span> <span class="mi">0</span><span class="p">);</span> <span class="k">while</span> <span class="p">(</span><span class="n">N</span><span class="p">)</span> <span class="p">{</span> <span class="n">uint8</span> <span class="n">tag</span> <span class="o">=</span> <span class="o">*</span><span class="n">p</span><span class="p">;</span> <span class="n">p</span><span class="o">++</span><span class="p">;</span> <span class="n">uint8</span><span class="o">*</span> <span class="n">lenptr</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">groupvar_table</span><span class="p">[</span><span class="n">tag</span><span class="p">].</span><span class="n">length</span><span class="p">[</span><span class="mi">0</span><span class="p">];</span> <span class="cp">#define GET_NEXT \ do { \ uint8 len = *lenptr; \ *dest = UNALIGNED_LOAD32(p) &amp; groupvar_mask[len]; \ dest++; \ p += len; \ lenptr++; \ } while (0) </span> <span class="n">GET_NEXT</span><span class="p">;</span> <span class="n">GET_NEXT</span><span class="p">;</span> <span class="n">GET_NEXT</span><span class="p">;</span> <span class="n">GET_NEXT</span><span class="p">;</span> <span class="cp">#undef GET_NEXT 
</span> <span class="n">N</span> <span class="o">-=</span> <span class="mi">4</span><span class="p">;</span> <span class="p">}</span> <span class="k">return</span> <span class="n">p</span><span class="p">;</span> <span class="p">}</span> </code></pre></div></div> <h1 id="cls-that-demonstrate-multiple-techniques">CLs that demonstrate multiple techniques</h1> <p>This section is about seeing how a combination of techniques can be used to optimize a small part of a program, and what effect to expect on the overall program.</p> <p>For example, one CL speeds up a GPU allocator by 40% by using fewer bytes, cache-aligning data, caching, and commenting out logging, which results in a 2.9% end-to-end speedup.</p> <blockquote> <p>Speed up low level logging in Google Meet application code.</p> </blockquote> <p>This was changing the logging state from a vector to a static array of size 4, resulting in a 50% boost for logging, which might be a pretty common call.</p> <p>I think all of these require very deep insights into what the program is doing and where the program is spending its time.</p> <blockquote> <p>We found a number of performance issues when planning a switch from on-disk to in-memory index serving in 2001. This change fixed many of these problems and took us from 150 to over 500 in-memory queries per second (for a 2 GB in-memory index on dual processor Pentium III machine).</p> </blockquote> <p>This is back in the day. It most likely still applies to personally written code, but not nearly as much these days, since most people know the general optimizations and search was just becoming available back then!</p> <h1 id="further-reading">Further reading</h1> <blockquote> <p>Understanding Software Dynamics by Richard L. Sites.
Covers expert methods and advanced tools for diagnosing and fixing performance problems.</p> </blockquote> <p>Good book.</p> Maybe consider putting "cutlass" in your CUDA/Triton kernels 2025-12-15T06:00:00+00:00 2025-12-15T06:00:00+00:00 https://maknee.github.io/blog/2025/Maybe-Consider-Putting-Cutlass-In-Your-CUDA-Kernels <h1 id="motivation">Motivation</h1> <p>So I was browsing Hacker News and came across this interesting post: <a href="https://news.ycombinator.com/item?id=45458948">Fp8 runs ~100 tflops faster when the kernel name has “cutlass” in it</a>.</p> <p>This was from a Triton tutorial where someone noticed that adding “cutlass” to their kernel name gave them an additional 100-150 TFLOPs. That’s a huge improvement just from… a name?</p> <div class="image-container" style="max-width: 100%; overflow: visible;"> <img loading="lazy" src="/assets/images/posts/2025-12-05/original1.png" style="width: 120%; margin-left: calc((100% - 120%) / 2);" alt="" /> <div class="caption"> <em>Mentions 100 TFLOPs improvement (Image source: <a href="https://github.com/triton-lang/triton/pull/7298" rel="external nofollow noopener" target="_blank">Github pull</a>) </em> </div> </div> <div class="image-container" style="max-width: 100%; overflow: visible;"> <img loading="lazy" src="/assets/images/posts/2025-12-05/original2.png" style="width: 120%; margin-left: calc((100% - 120%) / 2);" alt="" /> <div class="caption"> <em>Mentions 150 TFLOPs improvement by renaming triton kernels to add cutlass (Image source: <a href="https://github.com/triton-lang/triton/pull/7298" rel="external nofollow noopener" target="_blank">Github pull</a>) </em> </div> </div> <p>Well, I got a bit curious and wanted to find out why this happens.</p> <h1 id="so-what-exactly-is-this">So… what exactly is this?</h1> <p>Instead of writing your kernel like this:</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">__global__</span> <span class="kt">void</span>
<span class="nf">add</span><span class="p">(</span><span class="kt">float</span> <span class="o">*</span><span class="n">sum</span><span class="p">,</span> <span class="kt">int</span> <span class="n">n</span><span class="p">,</span> <span class="kt">float</span> <span class="o">*</span><span class="n">x</span><span class="p">,</span> <span class="kt">float</span> <span class="o">*</span><span class="n">y</span><span class="p">)</span> <span class="p">{</span> <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">n</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="n">sum</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">x</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">+</span> <span class="n">y</span><span class="p">[</span><span class="n">i</span><span class="p">];</span> <span class="p">}</span> </code></pre></div></div> <p>You add “cutlass” to the name:</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">__global__</span> <span class="kt">void</span> <span class="nf">add_cutlass</span><span class="p">(</span><span class="kt">float</span> <span class="o">*</span><span class="n">sum</span><span class="p">,</span> <span class="kt">int</span> <span class="n">n</span><span class="p">,</span> <span class="kt">float</span> <span class="o">*</span><span class="n">x</span><span class="p">,</span> <span class="kt">float</span> <span class="o">*</span><span class="n">y</span><span class="p">)</span> <span class="p">{</span> <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span 
class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">n</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="n">sum</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">x</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">+</span> <span class="n">y</span><span class="p">[</span><span class="n">i</span><span class="p">];</span> <span class="p">}</span> </code></pre></div></div> <p>and <code class="language-plaintext highlighter-rouge">ptxas</code><span class="sidenote-ref"></span><span class="sidenote">If you need some background on the CUDA compilation toolchain, refer to the <a href="#nvidia-toolchain-background">section on NVIDIA toolchain background</a></span> will perform an additional pass that can improve the performance of the generated code.</p> <p>The rest of this blog will show benchmarks, explain the optimizations, and discuss when to use this trick. But I also want to highlight something broader: if you’re working at the high level (CUDA, Triton, PyTorch), you’re still at the mercy of what the backend compilers decide to do. In this case, ptxas (a black box) is making optimization decisions based on your kernel’s name<span class="sidenote-ref"></span><span class="sidenote">With the recent release of <a href="https://docs.nvidia.com/cuda/tile-ir/sections/introduction.html">TileIR</a>, there’s still plenty of magic happening under the hood. 
<code class="language-plaintext highlighter-rouge">tileiras</code> is also a black box, so we could easily see a similar “cutlass” trick emerge there too</span>.</p> <p><a href="#so-what-is-it-doing">If you want to skip to the TL;DR of the optimization, click here</a></p> <h2 id="a-cutlass-example">A cutlass example</h2> <p>Here’s an example graph showing cutlass benchmarks with and without this optimization (where <code class="language-plaintext highlighter-rouge">baseline/cutlass_on</code> enables the optimization and <code class="language-plaintext highlighter-rouge">cutlass_off</code> disables it):</p> <div class="image-container" style="max-width: 100%; overflow: visible;"> <img loading="lazy" src="/assets/images/posts/2025-12-05/main_example.svg" style="width: 120%; margin-left: calc((100% - 120%) / 2);" alt="" /> <div class="caption"> <em>Throughput of various cutlass examples </em> </div> </div> <p>In particular, the <a href="https://docs.nvidia.com/cutlass/media/docs/cpp/cute/0x_gemm_tutorial.html#sgemm-2-cu">CuTe sgemm2.cu</a> <a href="https://github.com/NVIDIA/cutlass/blob/v4.3.0/examples/cute/tutorial/sgemm_2.cu">example</a> sees a 20% drop in performance without the cutlass optimization!</p> <p>Another thing that’s immediately obvious: this optimization doesn’t always increase performance.</p> <h1 id="benchmarks">Benchmarks</h1> <p>Below are sections you can expand to see various benchmarks running on an RTX 3090 and H100. 
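</p>

<p>As context for reading the comparisons below, here is a hypothetical sketch of how per-run numbers can be collapsed into a single relative delta between <code class="language-plaintext highlighter-rouge">cutlass_on</code> and <code class="language-plaintext highlighter-rouge">cutlass_off</code>. This is not the harness used for these benchmarks, and the numbers are illustrative rather than measured results:</p>

```python
# Hypothetical aggregation sketch -- NOT the harness used for these benchmarks.
# Each configuration is run several times; each is summarized by its mean,
# and cutlass_on is compared against cutlass_off as a relative delta.
from statistics import mean

def relative_delta(on_runs, off_runs):
    """Return mean(on_runs) / mean(off_runs) - 1 as a signed fraction."""
    return mean(on_runs) / mean(off_runs) - 1.0

# Illustrative TFLOPs over 5 runs (made-up numbers, not measurements).
cutlass_on = [101.0, 99.5, 100.5, 100.0, 99.0]
cutlass_off = [80.0, 79.5, 80.5, 80.0, 80.0]
print(f"{relative_delta(cutlass_on, cutlass_off):+.1%}")  # prints +25.0%
```

<p>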
Each result is aggregated from 5 benchmark runs.</p> <p>Benchmarks include 15+ projects, covering popular ones like PyTorch, Flash Attention 2/3, Cutlass, and llama.cpp.</p> <p>Some highlights:</p> <ul> <li>Running llama.cpp on RTX 3090 with gpt-oss-20b shows a 1%+ performance increase</li> <li>Flash Attention 2 on RTX 3090/H100 without the optimization decreases performance by up to 1%</li> <li>Triton on RTX 3090 generally shows no performance change from the optimization</li> </ul> <p>Note: <code class="language-plaintext highlighter-rouge">baseline</code> doesn’t change anything. <code class="language-plaintext highlighter-rouge">cutlass_on</code> enables the optimization and <code class="language-plaintext highlighter-rouge">cutlass_off</code> disables it (if the application uses <code class="language-plaintext highlighter-rouge">cutlass</code>, for example Flash Attention 3):</p> <details style="font-size: 1.2em; margin: 1em 0;"> <summary style="cursor: pointer; font-weight: bold;">Expand to see 3090 benchmarks</summary> <link rel="stylesheet" href="/assets/css/fancy_table.css" /> <script> function toggleCalculations(tableId) { // ... 
(toggleCalculations function remains the same) const table = document.getElementById(tableId); const cells = table.querySelectorAll('.has-calculation'); const tableWrapper = document.getElementById(tableId + '-wrapper'); const toggleSwitch = document.getElementById(tableId + '-switch'); const toggleText = document.getElementById(tableId + '-toggle-text'); if (tableWrapper.classList.contains('show-calculations')) { tableWrapper.classList.remove('show-calculations'); toggleSwitch.classList.remove('active'); toggleText.textContent = "Show calculations"; cells.forEach(cell => { const normalText = cell.querySelector('.normal-text'); const calculationText = cell.querySelector('.calculation-text'); normalText.style.display = 'inline'; calculationText.style.display = 'none'; }); } else { tableWrapper.classList.add('show-calculations'); toggleSwitch.classList.add('active'); toggleText.textContent = "Hide calculations"; cells.forEach(cell => { const normalText = cell.querySelector('.normal-text'); const calculationText = cell.querySelector('.calculation-text'); normalText.style.display = 'none'; calculationText.style.display = 'inline'; }); } } function isConsideredYellow(colorString) { if (!colorString) return false; const lowerColor = colorString.toLowerCase(); const yellowKeywords = ['yellow', 'gold', 'lemon', 'chiffon', 'goldenrod', 'papayawhip', 'moccasin', 'khaki', '#ff0', '#ffff00', '#ffffe0', '#fffacd', '#fafad2', '#fff8dc', '#eee8aa', '#f0e68c']; if (yellowKeywords.some(k => lowerColor.includes(k))) { return true; } const match = lowerColor.match(/rgba?\((\d+),\s*(\d+),\s*(\d+)/); if (match) { const r = parseInt(match[1]); const g = parseInt(match[2]); const b = parseInt(match[3]); if (r > 200 && g > 180 && b < 200 && Math.abs(r - g) < 70) { return true; } } if (lowerColor === '#ffdab9') return true; // PeachPuff if (lowerColor === 'rgba(255,255,0,0.2)') return true; // Example: semi-transparent yellow return false; } function initializeCellHighlighters() { const 
highlighters = document.querySelectorAll('[data-highlight-cells]'); highlighters.forEach(highlighter => { if (highlighter.dataset.highlighterInitialized === 'true') return; highlighter.dataset.highlighterInitialized = 'true'; const cellIdsToHighlight = highlighter.dataset.highlightCells.split(',').map(id => id.trim()); const cellElements = cellIdsToHighlight.map(id => document.getElementById(id)).filter(el => el); let descriptiveSpans = []; if (highlighter.matches('span[data-hover-text-color]')) { descriptiveSpans.push(highlighter); } else { descriptiveSpans = Array.from(highlighter.querySelectorAll('span[data-hover-text-color]')); } descriptiveSpans.forEach(span => { if (typeof span.dataset.originalParaTextColor === 'undefined') { span.dataset.originalParaTextColor = window.getComputedStyle(span).color || ''; } }); highlighter.addEventListener('mouseenter', () => { highlighter.classList.add('trigger-text-active'); descriptiveSpans.forEach(span => { if (span.dataset.hoverTextColor) { span.style.color = span.dataset.hoverTextColor; } }); cellElements.forEach((cell, idx) => { cell.classList.add('cell-highlighted'); let hoverTextColorForCell = null; let hoverCellBgFromDescSpan = null; if (idx < descriptiveSpans.length) { const descSpan = descriptiveSpans[idx]; if (descSpan.dataset.hoverTextColor) { hoverTextColorForCell = descSpan.dataset.hoverTextColor; } if (descSpan.dataset.hoverCellBg) { hoverCellBgFromDescSpan = descSpan.dataset.hoverCellBg; } } else if (descriptiveSpans.length === 1 && cellElements.length > 1) { const descSpan = descriptiveSpans[0]; if (descSpan.dataset.hoverTextColor) { hoverTextColorForCell = descSpan.dataset.hoverTextColor; } if (descSpan.dataset.hoverCellBg) { hoverCellBgFromDescSpan = descSpan.dataset.hoverCellBg; } } if (hoverTextColorForCell) { const isCalcCell = cell.classList.contains('has-calculation'); if (isCalcCell) { const normalTextSpan = cell.querySelector('.normal-text'); const calcResultSpan = 
cell.querySelector('.calc-result'); const calcWrapper = cell.querySelector('.calculation-text'); if (calcWrapper && window.getComputedStyle(calcWrapper).display !== 'none') { if (calcResultSpan) calcResultSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } else { if (normalTextSpan) normalTextSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } } else { const directSpan = cell.querySelector(':scope > span'); if (directSpan) directSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } } if (hoverCellBgFromDescSpan && isConsideredYellow(hoverCellBgFromDescSpan)) { if (typeof cell.dataset.originalBgColor === 'undefined') { cell.dataset.originalBgColor = cell.style.backgroundColor; } cell.style.backgroundColor = hoverCellBgFromDescSpan; cell.dataset.bgChangedByScript = 'true'; } }); }); highlighter.addEventListener('mouseleave', () => { highlighter.classList.remove('trigger-text-active'); descriptiveSpans.forEach(span => { if (typeof span.dataset.originalParaTextColor !== 'undefined') { span.style.color = span.dataset.originalParaTextColor; } }); cellElements.forEach((cell, idx) => { cell.classList.remove('cell-highlighted'); const isCalcCell = cell.classList.contains('has-calculation'); if (isCalcCell) { const normalTextSpan = cell.querySelector('.normal-text'); const calcResultSpan = cell.querySelector('.calc-result'); if (normalTextSpan) normalTextSpan.style.removeProperty('color'); if (calcResultSpan) calcResultSpan.style.removeProperty('color'); } else { const directSpan = cell.querySelector(':scope > span'); if (directSpan) directSpan.style.removeProperty('color'); } if (cell.dataset.bgChangedByScript === 'true') { cell.style.backgroundColor = cell.dataset.originalBgColor || ''; delete cell.dataset.originalBgColor; delete cell.dataset.bgChangedByScript; } }); }); }); } if (document.readyState === 'loading') { document.addEventListener('DOMContentLoaded', initializeCellHighlighters); } else { 
initializeCellHighlighters(); } </script> <style> /* ... (Other styles remain the same) ... */ .toggle-container { display: flex; justify-content: flex-end; margin-bottom: 6px; } .calc-toggle { display: inline-flex; align-items: center; } .toggle-switch { position: relative; display: inline-block; width: 32px; height: 16px; background-color: #e5e7eb; /* gray-200 */ border-radius: 10px; transition: all 0.3s; cursor: pointer; } .toggle-switch::after { content: ''; position: absolute; width: 12px; height: 12px; border-radius: 50%; background-color: white; top: 2px; left: 2px; transition: all 0.3s cubic-bezier(0.175, 0.885, 0.32, 1.275); box-shadow: 0 1px 2px rgba(0, 0, 0, 0.1); } .toggle-switch.active { background-color: #8b5cf6; /* purple-500 */ } .toggle-switch.active::after { left: 18px; } .toggle-label { position: absolute; width: 1px; height: 1px; padding: 0; margin: -1px; overflow: hidden; clip: rect(0, 0, 0, 0); white-space: nowrap; border-width: 0; } .toggle-text { font-size: 0.75rem; color: #6b7280; /* gray-500 */ margin-right: 6px; } .calc-result { color: rgb(75, 85, 99); /* gray-700 */ } .calc-formula { color: #6b83a6; /* Subtle bluish gray */ font-family: ui-monospace, SFMono-Regular, Menlo, Monaco, Consolas, monospace; } .calc-equals { color: rgb(75, 85, 99); /* gray-700 */ display: inline-block; margin: 0 4px; font-weight: normal; } .cell-highlighted { font-weight: bold !important; } .trigger-text-active { background-color: #E5E7EB; padding: 1px 3px; border-radius: 3px; transition: background-color 0.15s ease-in-out; } .trigger-text-active span[data-hover-text-color] { padding: 0; background-color: transparent; } /* Add styles for no-scroll option */ .table-wrapper-no-scroll { overflow-x: visible !important; } .table-no-scroll td { white-space: normal !important; word-break: break-word; } </style> <div id="benchmark-3090-table-wrapper" class="px-4 rounded-lg __basic-table not-prose mt-2 mb-2 overflow-x-auto"> <table id="benchmark-3090-table" 
class="min-w-full divide-y divide-gray-200 font-sans basic-table-striped"> <thead class="bg-gray-50"> <tr> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> GPU </th> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Benchmarks </th> </tr> </thead> <tbody> <tr class="border-b border-gray-200"> <td id="benchmark-3090-table-row0-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">RTX 3090 (Ampere)</span> </td> <td id="benchmark-3090-table-row0-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">bitsandbytes</span> </td> <td id="benchmark-3090-table-row0-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">candle</span> </td> <td id="benchmark-3090-table-row0-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">cutlass</span> </td> <td id="benchmark-3090-table-row0-col4" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">flash_attn2</span> </td> <td id="benchmark-3090-table-row0-col5" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">flashinfer</span> </td> <td id="benchmark-3090-table-row0-col6" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">ggml</span> </td> <td id="benchmark-3090-table-row0-col7" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">liger</span> </td> <td id="benchmark-3090-table-row0-col8" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">llamacpp</span> 
</td> <td id="benchmark-3090-table-row0-col9" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">llmc</span> </td> <td id="benchmark-3090-table-row0-col10" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">mojo</span> </td> <td id="benchmark-3090-table-row0-col11" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">nccl</span> </td> <td id="benchmark-3090-table-row0-col12" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">pytorch</span> </td> <td id="benchmark-3090-table-row0-col13" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">sageattention</span> </td> <td id="benchmark-3090-table-row0-col14" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">sgemm</span> </td> <td id="benchmark-3090-table-row0-col15" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">sglang</span> </td> <td id="benchmark-3090-table-row0-col16" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">tilus</span> </td> <td id="benchmark-3090-table-row0-col17" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">tinygrad</span> </td> <td id="benchmark-3090-table-row0-col18" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">torchao</span> </td> <td id="benchmark-3090-table-row0-col19" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">triton</span> </td> 
<td id="benchmark-3090-table-row0-col20" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">unsloth</span> </td> <td id="benchmark-3090-table-row0-col21" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">vllm</span> </td> </tr> </tbody> </table> </div> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-12-05/benchmarks/ampere/bitsandbytes_comparison.png" width="100%" alt="" /> </div> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-12-05/benchmarks/ampere/candle_comparison.png" width="100%" alt="" /> </div> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-12-05/benchmarks/ampere/cutlass_comparison.png" width="100%" alt="" /> </div> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-12-05/benchmarks/ampere/flash_attn2_comparison.png" width="100%" alt="" /> </div> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-12-05/benchmarks/ampere/flashinfer_comparison.png" width="100%" alt="" /> </div> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-12-05/benchmarks/ampere/ggml_comparison.png" width="100%" alt="" /> </div> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-12-05/benchmarks/ampere/liger_comparison.png" width="100%" alt="" /> </div> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-12-05/benchmarks/ampere/llamacpp_comparison.png" width="100%" alt="" /> </div> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-12-05/benchmarks/ampere/llmc_comparison.png" width="100%" alt="" /> </div> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-12-05/benchmarks/ampere/mojo_comparison.png" width="100%" alt="" /> </div> <div 
class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-12-05/benchmarks/ampere/nccl_comparison.png" width="100%" alt="" /> </div> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-12-05/benchmarks/ampere/pytorch_comparison.png" width="100%" alt="" /> </div> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-12-05/benchmarks/ampere/sageattention_comparison.png" width="100%" alt="" /> </div> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-12-05/benchmarks/ampere/sgemm_comparison.png" width="100%" alt="" /> </div> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-12-05/benchmarks/ampere/sglang_comparison.png" width="100%" alt="" /> </div> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-12-05/benchmarks/ampere/tilus_comparison.png" width="100%" alt="" /> </div> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-12-05/benchmarks/ampere/tinygrad_comparison.png" width="100%" alt="" /> </div> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-12-05/benchmarks/ampere/torchao_comparison.png" width="100%" alt="" /> </div> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-12-05/benchmarks/ampere/triton_comparison.png" width="100%" alt="" /> </div> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-12-05/benchmarks/ampere/unsloth_comparison.png" width="100%" alt="" /> </div> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-12-05/benchmarks/ampere/vllm_comparison.png" width="100%" alt="" /> </div> </details> <details style="font-size: 1.2em; margin: 1em 0;"> <summary style="cursor: pointer; font-weight: bold;">Expand to see H100 benchmarks</summary> <link rel="stylesheet" href="/assets/css/fancy_table.css" /> <div id="benchmark-h100-table-wrapper" class="px-4 rounded-lg __basic-table not-prose mt-2 mb-2 overflow-x-auto"> <table id="benchmark-h100-table" 
class="min-w-full divide-y divide-gray-200 font-sans basic-table-striped"> <thead class="bg-gray-50"> <tr> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> GPU </th> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Benchmarks </th> </tr> </thead> <tbody> <tr class="border-b border-gray-200"> <td id="benchmark-h100-table-row0-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">H100 (Hopper)</span> </td> <td id="benchmark-h100-table-row0-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">bitsandbytes</span> </td> <td id="benchmark-h100-table-row0-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">cutlass</span> </td> <td id="benchmark-h100-table-row0-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">deepep</span> </td> <td id="benchmark-h100-table-row0-col4" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">deepgemm_tflops</span> </td> <td id="benchmark-h100-table-row0-col5" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">flash_attn2</span> </td> <td id="benchmark-h100-table-row0-col6" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">flash_attn3</span> </td> <td id="benchmark-h100-table-row0-col7" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">flashinfer</span> </td> </tr> </tbody> </table> </div> <div class="image-container"> <img loading="lazy" 
src="/assets/images/posts/2025-12-05/benchmarks/hopper/bitsandbytes_comparison.png" width="100%" alt="" /> </div> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-12-05/benchmarks/hopper/cutlass_comparison.png" width="100%" alt="" /> </div> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-12-05/benchmarks/hopper/deepep_comparison.png" width="100%" alt="" /> </div> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-12-05/benchmarks/hopper/deepgemm_tflops_comparison.png" width="100%" alt="" /> </div> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-12-05/benchmarks/hopper/flash_attn2_comparison.png" width="100%" alt="" /> </div> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-12-05/benchmarks/hopper/flash_attn3_comparison.png" width="100%" alt="" /> </div> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-12-05/benchmarks/hopper/flashinfer_comparison.png" width="100%" alt="" /> </div> </details> <h1 id="so-what-has-it-changed">So what has changed?</h1> <p>So, I’ve added a Godbolt reference for people to see the difference. I’m using some parts of <a href="https://github.com/siboehm/SGEMM_CUDA/blob/master/src/kernels/9_kernel_autotuned.cuh">SGEMM_CUDA</a><span class="sidenote-ref"></span><span class="sidenote">If you haven’t checked it out, it’s <a href="https://siboehm.com/articles/22/CUDA-MMM">a great blog</a> on optimizing CUDA matmul kernels by Simon Boehm</span> as a reference.</p> <p>In the NVCC compilation pipeline, CUDA is compiled to PTX, and PTX is then compiled to SASS.
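</p>

<p>To compare two builds quantitatively at the SASS level, one option is to diff the mnemonic histograms of two <code class="language-plaintext highlighter-rouge">cuobjdump -sass</code> dumps. Below is a minimal sketch; the regex and the inlined SASS lines are illustrative, not output from the actual kernels.</p>

```python
import re
from collections import Counter

# Match the opcode that follows the "/*addr*/" prefix in a `cuobjdump -sass`
# dump, skipping an optional predicate such as "@P0" or "@!P1".
MNEMONIC = re.compile(r'/\*[0-9a-f]+\*/\s+(?:@!?\w+\s+)?([A-Z][A-Z0-9.]*)')

def mnemonic_counts(sass_text):
    """Histogram of instruction mnemonics (FFMA, LDS, IMAD.MOV.U32, ...)."""
    return Counter(m.group(1) for m in MNEMONIC.finditer(sass_text))

# Illustrative snippets only; in practice, read the two dumps from files.
baseline = """
        /*0040*/    HFMA2.MMA R4, -RZ, RZ, 0, 0 ;
        /*0050*/    FFMA R5, R2, R3, R5 ;
"""
tuned = """
        /*0040*/    IMAD.MOV.U32 R4, RZ, RZ, RZ ;
        /*0050*/    FFMA R5, R2, R3, R5 ;
"""

before, after = mnemonic_counts(baseline), mnemonic_counts(tuned)
for op in sorted(set(before) | set(after)):
    delta = after[op] - before[op]
    if delta:
        print(f"{op}: {before[op]} -> {after[op]} ({delta:+d})")
```

<p>The same idea scales to full dumps, and is one way to produce an instruction-diff table like the one shown later in this post.</p>

<p>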
Let’s verify where this optimization is applied: at the PTX level, or in the SASS?</p> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-12-05/gpu_compilation.svg" width="100%" alt="" /> <div class="caption"> <em>High level compilation overview for NVIDIA GPUs </em> </div> </div> <p>First, let’s check whether the CUDA-to-PTX output has changed.</p> <div class="image-container" style="max-width: 100%; overflow: visible;"> <img loading="lazy" src="/assets/images/posts/2025-12-05/cuda_to_ptx.svg" style="width: 140%; margin-left: calc((100% - 140%) / 2);" alt="" /> <div class="caption"> <em>There's no difference in the PTX! (Image source: <a href="https://godbolt.org/z/bcfj8ovrc" rel="external nofollow noopener" target="_blank">Godbolt link</a>) </em> </div> </div> <p>Only the name has changed. The PTX instructions are identical.</p> <p>So let’s now check the SASS (<a href="https://godbolt.org/z/erc4e8M17">Godbolt link</a>):</p> <div style=" width: 90vw; margin-left: calc(50% - 45vw); margin-right: calc(50% - 45vw); "> <iframe
src="https://godbolt.org/e#z:OYLghAFBqd5QCxAYwPYBMCmBRdBLAF1QCcAaPECAMzwBtMA7AQwFtMQByARg9KtQYEAysib0QXACx8BBAKoBnTAAUAHpwAMvAFYTStJg1DIAruiakl9ZATwDKjdAGFUtEywYgATKUcAZPAZMADl3ACNMYgkAVgA2UgAHVAVCOwYXNw9vROTUgQCg0JYIqK44y0xrWwEhAiZiAgz3Tx8rTBs02vqCApDwyJj4hTqGpqzWkZ7AvuKBstiASktUE2Jkdg4AUg0AQU2vAGZA5DcsAGpNg6dTcwB9YhNBPDYAOgRL7G293f2jqiwqGcnHIACp%2BHZCIRfX4A6ZnADS2AASsFsH5bsEdgBZbAQZhsBZnUwEAwKBS3X6/fGYaGHSpKWkHWFBBHI1HozE4vGsTCE6mMxx4KjQn6HZmYM4AdR2SOUQgAkgAtbBnA5eRnioHYeXogAi8oAahAsaQzsFCVBjYT9gAhM4Qc0AWi4hIA9PbzQsRTtXe7Hf6A4Gg8GQ6Gw%2BGIxGvr6zgb2kRiHgAF6YdBnYC0VBhMSOj4KBD1VNnTNMNMISoJSIKe38NYSvy6gDiLy4XgAHK6hCChC322dAmchBChF7djHIxPJ1PpyGvgRMCwEgZ55cnIECGcbSb%2B4JN8FTevN/CD7viKgAO61RNYHYnjdny8Ea%2BYG0fL63W5YABueHWH7OX6oHgaYlugABiZ4sI2bAsBAaAMMMZxUCWBAAFRnLeRICIhyGoEwaGbneCJEfuXxnORFGUVR1E0UhKHoTsCimrh%2BHoTaTFkbRXHcTuG6BEExBIhemGHvxkQZLenE8dJlGiQwAlCeeNpEWJxAZDa1oAOyvrs5ExqBGFnLY9Cmk%2BhgKEkShpuuqBnPmhboJxvwJMQTDACwTBnI8Z60LQnG1vah6oFQgKXLqZwaJcdrBaFXh2g%2BV7AZgOwXFcYWbliUVnDFFxxelCVPklOxaTpKUUSxBCSABqXhcQmDrpELmYAQtyiMMq7wThKGSKhHwQFJVH7LEezRDaECqYpKW2tlIWEuh8K5XaqkSWc6GSJs0S6l6o2RZtUUDYxG1jctrgpWti0RXNGUXRNwkXTFG3helX4vKo%2B26RRh2jeN8niadq1nFV00ugDW43b9gl3dND2bTVAEvAAnu9ZXkV9x0Qyt53TV4V1g9Nt3nlNcUzcKsPPS8SbI5RaM/QJmOAxdBy41i4MKVDxMw09BzhS955Uxcmm6px%2Bl4WmdrGZgpq0NM9TFkwCMrAQTmHC5bkeV5DA%2BX5H0BT9G45elkUHNFIUXQVz52qu6U2vCWUG3l3NnObSUaQLpWUXVDXEE1LVtUrVwVT1fVDexR205EimWxzs2g8ErN/bQdprY9202rtXNCx91Ge4IjV1b7TDtVcnUboHvUHLiA20SHYcE1HJtUFdcf4xjrhJ4DKdHen/ObIL0J9z8Y5%2BjOI%2Bj2P/rRu6kr1AkroKCYYSOueM9O5gwB4MMkSOmEmbIAA1kWHkECwJi0EhkF2QWdVphLCiT2c4%2BP0/Ua7POi7LjSVyHluRE2vuvFHiIpKbch5JT/1AVieUIJkRAOCFAmBVdZK7klEIOQP8AEoLQeA3cIIQE4OCG%2BXYH5vy/kwP%2BQCwEzguVQOsMkEFUAsCEDBag9FV7AG3BVdCdVgD/04UZBAdVSxIkwPPWgBAOJZxklI8iJc6J4QIoxU0si%2BHsVIIg6RPETCHmXt7RSpotG7h0QkDIajJEaOkgYjcBABGYCEReeUDBp7e30YeaxgjnCuAcU4hIJVlYHFVu5Ty3lXDazKrrSxZx0CoA3IbLKUSYlXCPFlW0tp4m%2BMkc5VygSNZawGuEw8xA4ZG3iqlJwUpIHQKRMkvKcViCjhRhRTJasgmaxCYg/Ju48BFKyl01cZxcHVNfHFPA9TaLcMyqNQp6FcE3UenDNG8TQYs2mkYxSA
NgFm3WagvGeUzHmNom42x6BFJeJXtM5ZxM8CPWRgNJp2Tgm%2BTySQe0ETkDdONkSUpUo4GVMGbaZAozBoqyyerB5oTqIdL4u8paXyQQEI%2BSk4ZgLs5rwIaNN50zm6XLmdbO%2Bo1FlsSxXaIxGR1lEs%2BehTBf9FrqP2VRQ5pYMinO9gDOFsy9rG29JRO5oLWmPMkZCp20KhV9OAfAqpCKak2jqYgnlLTcl7MFW82JHzlWJLAeKv5cUAW0rlTktpezyKCt6Y7Yp/ZYWZUlUMm0IzaXkT1WCu1SFnkQAidoYV7q%2Blwq1TabQyKuIMuOSI0%2B4iw5TP6RcpaV0IAasqay80F0MX9PJdoHFDtM4NOkuMo64aZn4zmVw1FR0k1sumqmjl7sBYZt7tWoeD9n4NsbbmOtWImADiMY6YygRgBnAPsQIIZ8IBH2QOWasChgALhYLlLwfaB1nC4JFLwXgsIMBoD2gKCgWC3AAJwaCYPU8cTaj3jznAuJc%2BFP5rl3Og7%2B2CNw2yAXgjcYDYHiqIrg99pFDXkUPMEOQWJbgggABJImwDsXUUIK7vk/JgH8f5bhnA/LWdYgRpZBHIUBNM47J23CMV2owtxnhLjgthUurCPx1WGImGw/5JLfrpQxmiyjyP3BEYVGjCHlJOsYzxuRrFEOsao7%2BFqCGnCmMzbxyTP7dxPrNEReE6SyqyNeWs9KO8aF73lOgVQiN%2BbKcPMgUland6ae069G5H19OGJnlp1QcNA22demcKeMo5RKk%2BJyyzpGvLaJnkZx2RjbO5WiPaalU9zR6e8xE1ZF44aBe0850Lcdwtek898JTpHMCqBcj55BFTkRwxjSzSlCb3QxtcwqZUrLivJrJeK1LlaS5ZZyxEqlyzHYbKnvliVjXMvZcKa17ZzcOvJe%2BZqtLnErNWOAsyhI9mbGlkc3Z/YIXpSykqx53rCENwRMDUyxxK90q2HQLN4L9oqWjbhQ1ybUXXELeOfYg7LKjszae3NsrF3EtXYs2VD89lr7/gqhhPFNowZsVthWzif2r6pkByhTcIObax2uRNrz23ct8QhpNeb7iluJYgEj90khrto8QhEk6tApqOwc2Z5bXgQsE4WkTknGXtvNYGwUi8iUbyFb/QB4DoHwNCABsTxLNsfsyNu50rHF5LbU/u3jsrYXAYs8l%2Bj8nrdE446OXjlbSXEvE8i2z/rGOnZc8KlgOX4U%2BeAZA2BiD%2BPleG9R2E%2BHgbhGiNDaNMVcbzl1d98mnFjte5pxrfzIH2bvfdeq0Hp62lIp9zS0a%2BH3C0U2ljQVzFse3YJ6Fi7tXXV5EMR3nveZF04JrLYliK6tsk8rsL/xm0Je4b1yJKSwl4fWFOGb%2BlMp00K%2BxarxdGL551nV4BuSwzrhY7D789PkrEvnWFNdYeMIpeVV2jX18lfggFi183%2Bv9NR5FOUVAvQqCMFVzoL/qaB9ZvHzPkws7S3fUdgl9v%2B/4icnge34kRJqTPGBMIkmudG/%2BABDGdcJ4dMbcquFEf2CMDANg92CgEAsB5E1CtCCg5%2BTCC4V%2B24N%2BgC5SpoL65Sb6Y2lSxB2y24F2poH6geFc/U364ypoqepk92HuIaTEP%2BCOpoI%2BvBc%2BtA4mBy7Bj23ibB7i%2B23iaBGEPeR%2B4ude5ETetAh%2Br4jsSO6E8Kla5E8BiBgaKBsBYeg8ZUDqfK4KS%2BLyBSwqhSoq3WPqMqGSwKzS%2Bq/KDSSqwqaqZSGePWNK2qJ%2BVEfCLgJgCS4U3eyhmylKVBE%2BiaWyWC/M3Kjh9yph7SLq5OwqJqZSAyVqtotq36JhCqYBgqnqpqWURRGRmhZwnqR%2BouIeTqgc1U%2BU9Uuc3s%2BcrUhc/sTgZcwc9GQKsQgRSs307uohZyEaN0TcNK3R3Ee2nib28aF05aW0XcKOWhTGUufECW6UEAuaka/Y0
aXh8a1oxMJaKacRNEL0dOqh4U7uwaYiIOwE5xEUSxTqL0CMNK8u7iHBNxR0dxF0XAjx3RL0SYrxlx7B1xXuNq6xxM6okO/xLwo%2BKSbxRyHxYJ3x00BwfxYB5EOc84zRzUrRRcHR3U5clcExjSXgvRisYcgx54p2fu%2BMYxKSJJwhEh0x3isxZancO02eX4Jx3KA8DShhewfJ98LgiB%2BcEogQwwhgtg%2BEaQ/YWATwogZ8RAdkE6LAU6s6lQ86Gg9oOwsQGg%2BproQGC62pCQ%2BECA9SWW2JDA06Tg%2Bwy6H4GYWYYg/4H4BgjwI6twYQKwDA6A5ItwEArYbYhIFCaYiIKIaIGI2IuI2GaptwGptAtwC6CwjBDSzGReGESi3mKizEXeREsmd6CIfhMZW6eGdA3ahG78q4gZpo1Z868QZwsQ0gDZTZTZbYpoTZgZr%2Bt%2BpoYmZw24/8CmNygsHASwtAnA0QvAngHAWgpAqAnAwIuoKUtpUqdkKwdYuUBwPApABAmgI5Swe83gXALwBwkgsQsQ0QXAmk%2Bp26mksQBwmkmk%2BgnAkgk5u5s5nAvACgIAGg25u5SwcAsASA%2BAIU5AlA/AggIgYg7AUgMgggigKg6g05vAtACAX5PgKFCgwFVABACMlYIAj5GFlprkqACQ1QU5Q4kI9oaAi4QRZCu6FgdkW69FCwvAxAqFIABwpAbFmFQo2FuF7Aj53FRFTAJFZFnAFFwuJGNF84O6e6pom6sl%2B6I5T5HAE5pAU5M5c5HAuovFsYeAmA54kQZoBoTgZSXABwLwGgLYD89QI6YUCl9FD8aGjR/AD8CgawjogQjopFdmD8AAGj5YXLlLEAAI4mDRJRSOgADyaJZJYVEVHyX41YwQJlZlFlVlXANlaw7w3MDle6Tl0wgQrljo7lyAnlDA3lBAvljoAVVVQVQ08V/sNo0VsVoV4VTVD8uowIYIw4vAO5SFCwSw5YpYAwqBpAB5mk0QLwkgmku6hwZ5kgkgJ50gY5HAL56lb5Wln535v5A1pAAFUAMAiAIAJC6woFUlCQdAkQwQPInAKVpl866VLYvAp1GwygIIflVFDCCQtFilLFvg%2BACYwEeg4Fwgip0F0goN8Fagb5ugnFbQHQ9gEAjgYwngXAvgPpvQRQJQeglkeQ6QrgzQuNuQZFWN/QpQFQVQnQkwqNegCNZFXQDQZNswFNkpowhNWQ6NbNUwhQ5NEgSwT4mA9U6A35o545r5SF752lsGpCelBlRl91aVll1lEAuAhAzyvw6NQI31V11hhwTMfVf5%2B5IA0QmkLwbYkgN5sQbYbYepbYmkttGg0QKl61LA3gztGlvAW1lgO1/VWg/5R1EAQFMtZ1FAF1utN1bAd1qVj1ytW5r1nA71n1F1v1zFvAqY6t14INsgkF4gMFUNSgMNktugrQlQ8YaQDgPptN6N/g0w2NcwOQKQZF1djd%2BNzNONXNZdiNDAjNjQHNaNlN5dNQkw7dDd3NLd3No9pQAtdUwtotKlalntUtuoIdEoBo%2BlhlhSitsdGV9oatCYG5WtLgi4utG5OMhtA1Q1Ryo1Yta1vAbt9OG1kt3tX5P5ftylB1QdKAOt9AZAYd1Fl1v9IADAX4yAyA5lGgJg6NNAYiVYlAYQb5YQgQ9QCMnAW5SDzAxACMUVYQ2g8YaDvA1FbAggUVDAtAqDktWAYQJgwATgYgtAX53AL1C4hgGYGwM5%2BAdUHQX4Iib5WW7QtFBD5Aucq1M50sYQrkWDLgWAb5hUbtTDpAPDxAXpSgK9HkRgaGoAe1yEbkCg69BlUVlYU5W5oNudENsF8ghdiFM5Jd%2BgrDKAZgFg4jX5kASwolaQjDjotlOVuoeVnkjozlRVtkJVHlXlPl5ENVgV1YdpLVdpXtSjz4LjY1pphQt1HA295lcd9oDAtkxIpICg/19NFdyNVd/degtdvNLNxNTdaQLdeNpNddfNn
dVNw93QE9XdDNI9jTVTXNNNZTvT3QU9/Nywqw6wwzq1i9m10dD1mTu9eIuTQR%2BT/179g1pAw1WAUQY1q1rt7tT9mlH5Ptb9f5%2B1gdSAADut515zQDIDYDEDUDfAdA2JX5EACDktGDKDQj7zWDODeDNgQjRDjABApD5Db5VDNDdDvkjD8dLDGj7DL1eAXDtgPDjDM5/DxIGwW5DUojyFeAEjKD0jcL25iY8jW5SjKjmAajrDmjxzOjwAejG9hjjAQjpj4NEgkNsg0N1jOgHFdjRgDj5g%2BguLSTbjpFHjnAXj2V9lTF%2BVAThVq6wTpV5VlV1VtVqg9VS6sTS6nV3V4IkI8TkQiT8ASwKTIQaTxlMzT1mVcEizhcBTn5HTxTKN/TGN6AQz6N9TtTzrHr%2BQ3THdg93dvd7TLTPdXTlTfr49zrk9vrcwSwCga5YzLoC9Et%2BzHA5rStczeTtryzRtaz19mzt9Ozj9S9L9vtObB5Fll5S1Xgp5J50QbY26S60Qztq1BwybXtBzKzt9XgbbUtF9/tSwSjKQ9gkgQAA%3D%3D%3D" style=" width: 100%; height: 800px; border: 0; display: block; " loading="lazy"> </iframe> </div> <p>Clearly something has changed!</p> <p>Two common changes we can see are:</p> <!-- https://godbolt.org/z/7TKvhv4Gj --> <div class="image-container" style="max-width: 100%; overflow: visible;"> <img loading="lazy" src="/assets/images/posts/2025-12-05/instruction_selection.svg" style="width: 160%; margin-left: calc((100% - 160%) / 2);" alt="" /> <div class="caption"> <em>The optimization now uses IMAD instead of HFMA2.MMA to move constants </em> </div> </div> <p>We can see that <code class="language-plaintext highlighter-rouge">IMAD</code> is used instead of <code class="language-plaintext highlighter-rouge">HFMA2.MMA</code> for moving constants, which is neat!<span class="sidenote-ref"></span><span class="sidenote">By using <code class="language-plaintext highlighter-rouge">IMAD</code>, we can use the <code class="language-plaintext highlighter-rouge">FP32</code> units. 
Refer to <a href="#h100-sm-diagram">H100 SM Diagram</a></span>.</p> <div class="image-container" style="max-width: 100%; overflow: visible;"> <img loading="lazy" src="/assets/images/posts/2025-12-05/instruction_reordering.svg" style="width: 140%; margin-left: calc((100% - 140%) / 2);" alt="" /> <div class="caption"> <em>LDS and FFMA instructions are now interleaved </em> </div> </div> <p>We can see that <code class="language-plaintext highlighter-rouge">LDS</code> instructions are interleaved with the math instead of being stacked together<span class="sidenote-ref"></span><span class="sidenote">This should increase instruction-level parallelism</span>.</p> <p>One thing that the disassembly doesn’t show is register pressure. This optimization may increase it:</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cuobjdump <span class="nt">--dump-resource-usage</span> baseline.cubin Resource usage: Common: GLOBAL:0 Function sgemm_kernel_10: REG:188 STACK:0 SHARED:17408 LOCAL:0 CONSTANT[0]:564 TEXTURE:0 SURFACE:0 SAMPLER:0 cuobjdump <span class="nt">--dump-resource-usage</span> cutlass.cubin Resource usage: Common: GLOBAL:0 Function cutlass_sgemm_kernel_9: REG:214 STACK:0 SHARED:17408 LOCAL:0 CONSTANT[0]:564 TEXTURE:0 SURFACE:0 SAMPLER:0 </code></pre></div></div> <p>Register usage increased from <code class="language-plaintext highlighter-rouge">188</code> to <code class="language-plaintext highlighter-rouge">214</code>, roughly a <code class="language-plaintext highlighter-rouge">14%</code> increase. This isn’t always the case, though: I’ve seen other examples leave register pressure unchanged, or even decrease it.</p> <p>Below is a table of the instructions that changed for this kernel:</p> <link rel="stylesheet" href="/assets/css/fancy_table.css" /> <script> function toggleCalculations(tableId) { // ...
(toggleCalculations function remains the same) const table = document.getElementById(tableId); const cells = table.querySelectorAll('.has-calculation'); const tableWrapper = document.getElementById(tableId + '-wrapper'); const toggleSwitch = document.getElementById(tableId + '-switch'); const toggleText = document.getElementById(tableId + '-toggle-text'); if (tableWrapper.classList.contains('show-calculations')) { tableWrapper.classList.remove('show-calculations'); toggleSwitch.classList.remove('active'); toggleText.textContent = "Show calculations"; cells.forEach(cell => { const normalText = cell.querySelector('.normal-text'); const calculationText = cell.querySelector('.calculation-text'); normalText.style.display = 'inline'; calculationText.style.display = 'none'; }); } else { tableWrapper.classList.add('show-calculations'); toggleSwitch.classList.add('active'); toggleText.textContent = "Hide calculations"; cells.forEach(cell => { const normalText = cell.querySelector('.normal-text'); const calculationText = cell.querySelector('.calculation-text'); normalText.style.display = 'none'; calculationText.style.display = 'inline'; }); } } function isConsideredYellow(colorString) { if (!colorString) return false; const lowerColor = colorString.toLowerCase(); const yellowKeywords = ['yellow', 'gold', 'lemon', 'chiffon', 'goldenrod', 'papayawhip', 'moccasin', 'khaki', '#ff0', '#ffff00', '#ffffe0', '#fffacd', '#fafad2', '#fff8dc', '#eee8aa', '#f0e68c']; if (yellowKeywords.some(k => lowerColor.includes(k))) { return true; } const match = lowerColor.match(/rgba?\((\d+),\s*(\d+),\s*(\d+)/); if (match) { const r = parseInt(match[1]); const g = parseInt(match[2]); const b = parseInt(match[3]); if (r > 200 && g > 180 && b < 200 && Math.abs(r - g) < 70) { return true; } } if (lowerColor === '#ffdab9') return true; // PeachPuff if (lowerColor === 'rgba(255,255,0,0.2)') return true; // Example: semi-transparent yellow return false; } function initializeCellHighlighters() { const 
highlighters = document.querySelectorAll('[data-highlight-cells]'); highlighters.forEach(highlighter => { if (highlighter.dataset.highlighterInitialized === 'true') return; highlighter.dataset.highlighterInitialized = 'true'; const cellIdsToHighlight = highlighter.dataset.highlightCells.split(',').map(id => id.trim()); const cellElements = cellIdsToHighlight.map(id => document.getElementById(id)).filter(el => el); let descriptiveSpans = []; if (highlighter.matches('span[data-hover-text-color]')) { descriptiveSpans.push(highlighter); } else { descriptiveSpans = Array.from(highlighter.querySelectorAll('span[data-hover-text-color]')); } descriptiveSpans.forEach(span => { if (typeof span.dataset.originalParaTextColor === 'undefined') { span.dataset.originalParaTextColor = window.getComputedStyle(span).color || ''; } }); highlighter.addEventListener('mouseenter', () => { highlighter.classList.add('trigger-text-active'); descriptiveSpans.forEach(span => { if (span.dataset.hoverTextColor) { span.style.color = span.dataset.hoverTextColor; } }); cellElements.forEach((cell, idx) => { cell.classList.add('cell-highlighted'); let hoverTextColorForCell = null; let hoverCellBgFromDescSpan = null; if (idx < descriptiveSpans.length) { const descSpan = descriptiveSpans[idx]; if (descSpan.dataset.hoverTextColor) { hoverTextColorForCell = descSpan.dataset.hoverTextColor; } if (descSpan.dataset.hoverCellBg) { hoverCellBgFromDescSpan = descSpan.dataset.hoverCellBg; } } else if (descriptiveSpans.length === 1 && cellElements.length > 1) { const descSpan = descriptiveSpans[0]; if (descSpan.dataset.hoverTextColor) { hoverTextColorForCell = descSpan.dataset.hoverTextColor; } if (descSpan.dataset.hoverCellBg) { hoverCellBgFromDescSpan = descSpan.dataset.hoverCellBg; } } if (hoverTextColorForCell) { const isCalcCell = cell.classList.contains('has-calculation'); if (isCalcCell) { const normalTextSpan = cell.querySelector('.normal-text'); const calcResultSpan = 
cell.querySelector('.calc-result'); const calcWrapper = cell.querySelector('.calculation-text'); if (calcWrapper && window.getComputedStyle(calcWrapper).display !== 'none') { if (calcResultSpan) calcResultSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } else { if (normalTextSpan) normalTextSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } } else { const directSpan = cell.querySelector(':scope > span'); if (directSpan) directSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } } if (hoverCellBgFromDescSpan && isConsideredYellow(hoverCellBgFromDescSpan)) { if (typeof cell.dataset.originalBgColor === 'undefined') { cell.dataset.originalBgColor = cell.style.backgroundColor; } cell.style.backgroundColor = hoverCellBgFromDescSpan; cell.dataset.bgChangedByScript = 'true'; } }); }); highlighter.addEventListener('mouseleave', () => { highlighter.classList.remove('trigger-text-active'); descriptiveSpans.forEach(span => { if (typeof span.dataset.originalParaTextColor !== 'undefined') { span.style.color = span.dataset.originalParaTextColor; } }); cellElements.forEach((cell, idx) => { cell.classList.remove('cell-highlighted'); const isCalcCell = cell.classList.contains('has-calculation'); if (isCalcCell) { const normalTextSpan = cell.querySelector('.normal-text'); const calcResultSpan = cell.querySelector('.calc-result'); if (normalTextSpan) normalTextSpan.style.removeProperty('color'); if (calcResultSpan) calcResultSpan.style.removeProperty('color'); } else { const directSpan = cell.querySelector(':scope > span'); if (directSpan) directSpan.style.removeProperty('color'); } if (cell.dataset.bgChangedByScript === 'true') { cell.style.backgroundColor = cell.dataset.originalBgColor || ''; delete cell.dataset.originalBgColor; delete cell.dataset.bgChangedByScript; } }); }); }); } if (document.readyState === 'loading') { document.addEventListener('DOMContentLoaded', initializeCellHighlighters); } else { 
initializeCellHighlighters(); } </script> <div id="sass-diff-table-wrapper" class="px-4 rounded-lg __basic-table not-prose mt-2 mb-2 overflow-x-auto"> <table id="sass-diff-table" class="min-w-full
divide-y divide-gray-200 font-sans basic-table-striped"> <thead class="bg-gray-50"> <tr> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Mnemonic </th> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Baseline </th> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> CUTLASS </th> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Δ </th> </tr> </thead> <tbody> <tr class="border-b border-gray-200"> <td id="sass-diff-table-row0-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">IMAD.MOV.U32</span> </td> <td id="sass-diff-table-row0-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">0</span> </td> <td id="sass-diff-table-row0-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">37</span> </td> <td id="sass-diff-table-row0-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">+37</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="sass-diff-table-row1-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">HFMA2.MMA</span> </td> <td id="sass-diff-table-row1-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">5</span> </td> <td id="sass-diff-table-row1-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">0</span> </td> <td id="sass-diff-table-row1-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">-5</span> </td> </tr> <tr class="border-b 
border-gray-200"> <td id="sass-diff-table-row2-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">LEA</span> </td> <td id="sass-diff-table-row2-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">15</span> </td> <td id="sass-diff-table-row2-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">2</span> </td> <td id="sass-diff-table-row2-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">-13</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="sass-diff-table-row3-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">IMAD.SHL.U32</span> </td> <td id="sass-diff-table-row3-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">0</span> </td> <td id="sass-diff-table-row3-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">10</span> </td> <td id="sass-diff-table-row3-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">+10</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="sass-diff-table-row4-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">CS2R</span> </td> <td id="sass-diff-table-row4-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">75</span> </td> <td id="sass-diff-table-row4-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">64</span> </td> 
<td id="sass-diff-table-row4-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">-11</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="sass-diff-table-row5-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">MOV</span> </td> <td id="sass-diff-table-row5-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">8</span> </td> <td id="sass-diff-table-row5-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">0</span> </td> <td id="sass-diff-table-row5-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">-8</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="sass-diff-table-row6-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">IMAD</span> </td> <td id="sass-diff-table-row6-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">0</span> </td> <td id="sass-diff-table-row6-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">8</span> </td> <td id="sass-diff-table-row6-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">+8</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="sass-diff-table-row7-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">ULDC.64</span> </td> <td id="sass-diff-table-row7-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) 
!important;">4</span> </td> <td id="sass-diff-table-row7-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">1</span> </td> <td id="sass-diff-table-row7-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">-3</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="sass-diff-table-row8-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">FFMA</span> </td> <td id="sass-diff-table-row8-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">787</span> </td> <td id="sass-diff-table-row8-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">801</span> </td> <td id="sass-diff-table-row8-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">+14</span> </td> </tr> </tbody> </table> </div> <h1 id="so-what-is-it-doing">So… what is it doing?</h1> <p>So far, we’ve dug into specifics. At a higher level, the optimization most likely does the following:</p> <ul> <li>Instruction selection - use FP32 units for moving constants<span class="sidenote-ref"></span><span class="sidenote">Moving constants from registers isn’t in the hot path, but it’s a simple example to see!</span> into registers<span class="sidenote-ref"></span><span class="sidenote">But wait there’s more!
I didn’t show it in this blog in detail, but you can see some IMADs replacing instructions</span></li> <li>Instruction reordering - mix memory loads with math</li> <li>Influence register pressure - may increase the number of registers used to achieve the reordering</li> </ul> <div class="language-md highlighter-rouge"><div class="highlight"><pre class="highlight"><code>When ptxas sees matrix operations (MAD/MMA): Instruction selection: HFMA2.MMA,MOV -&gt; IMAD Instruction reordering: LDS spread across FFMA As a side effect: May increase register pressure </code></pre></div></div> <h1 id="when-should-you-apply-this-optimization">When should you apply this optimization?</h1> <p>With kernel writing, it’s tricky to say definitively when you should and shouldn’t use this optimization. The optimization seems to increase ILP at the cost of register pressure<span class="sidenote-ref"></span><span class="sidenote">Won’t increase register pressure in some cases!</span>. Always benchmark to ensure the performance is good<span class="sidenote-ref"></span><span class="sidenote">I’ve seen the optimization not affect performance on some cards while affecting others significantly</span>.</p> <h1 id="how-to-apply-this-to-triton">How to apply this to Triton</h1> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="n">torch</span> <span class="kn">import</span> <span class="n">triton</span> <span class="kn">import</span> <span class="n">triton.language</span> <span class="k">as</span> <span class="n">tl</span> <span class="k">def</span> <span class="nf">rename_kernel</span><span class="p">(</span><span class="n">proxy</span><span class="p">):</span> <span class="k">return</span> <span class="sh">"</span><span class="s">cutlass_kernel</span><span class="sh">"</span> <span class="c1"># will convert "my_kernel" -&gt; cutlass_kernel </span><span class="nd">@triton.jit</span><span class="p">(</span><span
class="nb">repr</span><span class="o">=</span><span class="n">rename_kernel</span><span class="p">)</span> <span class="k">def</span> <span class="nf">my_kernel</span><span class="p">(</span><span class="n">M</span><span class="p">:</span> <span class="n">tl</span><span class="p">.</span><span class="n">constexpr</span><span class="p">):</span> <span class="k">pass</span> <span class="c1"># compile and extract ptx </span><span class="n">my_kernel</span><span class="p">[(</span><span class="mi">1</span><span class="p">,)](</span><span class="n">M</span><span class="o">=</span><span class="mi">32</span><span class="p">)</span> <span class="n">dev</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">cuda</span><span class="p">.</span><span class="nf">current_device</span><span class="p">()</span> <span class="n">kernel_cache</span> <span class="o">=</span> <span class="n">my_kernel</span><span class="p">.</span><span class="n">device_caches</span><span class="p">[</span><span class="n">dev</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span> <span class="n">compiled</span> <span class="o">=</span> <span class="nf">next</span><span class="p">(</span><span class="nf">iter</span><span class="p">(</span><span class="n">kernel_cache</span><span class="p">.</span><span class="nf">values</span><span class="p">()))</span> <span class="n">ptx</span> <span class="o">=</span> <span class="n">compiled</span><span class="p">.</span><span class="n">asm</span><span class="p">[</span><span class="sh">"</span><span class="s">ptx</span><span class="sh">"</span><span class="p">]</span> <span class="c1"># print the kernel name from PTX </span><span class="nf">print</span><span class="p">(</span><span class="sh">'</span><span class="se">\n</span><span class="sh">'</span><span class="p">.</span><span class="nf">join</span><span class="p">(</span><span class="n">ptx</span><span class="p">.</span><span 
class="nf">splitlines</span><span class="p">()[:</span><span class="mi">20</span><span class="p">]))</span> </code></pre></div></div> <p>It will show</p> <div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">//</span> <span class="c1">// Generated by LLVM NVPTX Back-End</span> <span class="c1">//</span> <span class="p">.</span><span class="n">version</span> <span class="mi">8</span><span class="p">.</span><span class="mi">7</span> <span class="p">.</span><span class="n">target</span> <span class="n">sm_86</span> <span class="p">.</span><span class="n">address_size</span> <span class="mi">64</span> <span class="c1">// .globl cutlass_kernel // -- Begin function cutlass_kernel</span> <span class="c1">// @cutlass_kernel</span> <span class="p">.</span><span class="n">visible</span> <span class="p">.</span><span class="n">entry</span> <span class="n">cutlass_kernel</span><span class="p">(</span> <span class="p">.</span><span class="n">param</span> <span class="p">.</span><span class="n">u64</span> <span class="p">.</span><span class="n">ptr</span> <span class="p">.</span><span class="n">global</span> <span class="p">.</span><span class="n">align</span> <span class="mi">1</span> <span class="n">cutlass_kernel_param_0</span><span class="p">,</span> <span class="p">.</span><span class="n">param</span> <span class="p">.</span><span class="n">u64</span> <span class="p">.</span><span class="n">ptr</span> <span class="p">.</span><span class="n">global</span> <span class="p">.</span><span class="n">align</span> <span class="mi">1</span> <span class="n">cutlass_kernel_param_1</span> <span class="p">)</span> </code></pre></div></div> <h1 id="how-to-apply-this-to-ptxas">How to apply this to ptxas</h1> <p>A universal patch to ptxas (which most frameworks invoke) is to just replace <code class="language-plaintext highlighter-rouge">cutlass</code> in the binary with something else.</p> <p>Here’s how I do it:</p> <div 
class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">input_path</span> <span class="o">=</span> <span class="sh">"</span><span class="s">/usr/local/cuda/bin/ptxas</span><span class="sh">"</span> <span class="n">output_path</span> <span class="o">=</span> <span class="sh">"</span><span class="s">ptxas_no_cutlass</span><span class="sh">"</span> <span class="k">with</span> <span class="nf">open</span><span class="p">(</span><span class="n">input_path</span><span class="p">,</span> <span class="sh">"</span><span class="s">rb</span><span class="sh">"</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span> <span class="n">blob</span> <span class="o">=</span> <span class="nf">bytearray</span><span class="p">(</span><span class="n">f</span><span class="p">.</span><span class="nf">read</span><span class="p">())</span> <span class="c1"># We expect exactly "cutlass" inside ptxas. </span><span class="n">target</span> <span class="o">=</span> <span class="sa">b</span><span class="sh">"</span><span class="s">cutlass</span><span class="sh">"</span> <span class="n">off</span> <span class="o">=</span> <span class="n">blob</span><span class="p">.</span><span class="nf">find</span><span class="p">(</span><span class="n">target</span><span class="p">)</span> <span class="k">assert</span> <span class="n">off</span> <span class="o">!=</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="sh">"</span><span class="s">ptxas did not contain the cutlass marker!</span><span class="sh">"</span> <span class="c1"># Overwrite "cutlass" with 0xFF bytes so any substring search for the name fails: kernel names are ASCII, so 0xFF can never appear in them </span><span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nf">range</span><span class="p">(</span><span class="nf">len</span><span class="p">(</span><span class="n">target</span><span
class="p">)):</span> <span class="n">blob</span><span class="p">[</span><span class="n">off</span> <span class="o">+</span> <span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="mh">0xFF</span> <span class="k">with</span> <span class="nf">open</span><span class="p">(</span><span class="n">output_path</span><span class="p">,</span> <span class="sh">"</span><span class="s">wb</span><span class="sh">"</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span> <span class="n">f</span><span class="p">.</span><span class="nf">write</span><span class="p">(</span><span class="n">blob</span><span class="p">)</span> <span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">patched </span><span class="sh">'</span><span class="si">{</span><span class="n">target</span><span class="p">.</span><span class="nf">decode</span><span class="p">()</span><span class="si">}</span><span class="sh">'</span><span class="s"> at offset </span><span class="si">{</span><span class="n">off</span><span class="si">:</span><span class="c1">#x</span><span class="si">}</span><span class="sh">"</span><span class="p">)</span> </code></pre></div></div> <h1 id="resolving-public-statements">Resolving Public Statements</h1> <p>In my opinion, there are a lot of assumptions being thrown around on the internet about this optimization.
I want to clear some of that up.</p> <p>At the top of the <a href="https://news.ycombinator.com/item?id=45458948">Hacker News post</a>, there is a link to a response from a user about this optimization.</p> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-12-05/unstable.png" width="100%" alt="" /> </div> <p>This statement is incorrect; I have compiled many real-world projects with this optimization on and off, and they ran without failing (passing output asserts) on different cards.</p> <p>There is also <a href="https://www.reddit.com/r/programming/comments/1nx3g70/fp8_runs_100_tflops_faster_when_the_kernel_name/">a highly upvoted Reddit comment</a>:</p> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-12-05/reddit.png" width="100%" alt="" /> </div> <p>This explanation is hard to follow. I’m guessing the user is claiming that this trick uses NaNs/zeroes to optimize the program. It doesn’t. In fact, it optimizes how registers are moved.</p> <h1 id="previous-mentions">Previous mentions</h1> <p>This was also mentioned before by <a href="https://forums.developer.nvidia.com/t/how-does-bar-sync-defer-blocking-get-generated/245747">grynet on the NVIDIA forums</a>, who reported that the following two kernels would generate different SASS:</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">__global__</span> <span class="kt">void</span> <span class="nf">mykernel</span><span class="p">(</span><span class="kt">float</span> <span class="o">*</span><span class="n">lhs</span><span class="p">,</span> <span class="kt">float</span> <span class="o">*</span><span class="n">rhs</span><span class="p">,</span> <span class="kt">float</span> <span class="o">*</span><span class="n">res</span><span class="p">,</span> <span class="kt">int</span> <span class="n">M</span><span class="p">,</span> <span class="kt">int</span> <span class="n">N</span><span
class="p">,</span> <span class="kt">int</span> <span class="n">K</span><span class="p">)</span> <span class="p">{</span> <span class="n">cutlass</span><span class="o">::</span><span class="n">gemm</span><span class="o">::</span><span class="n">GemmCoord</span> <span class="n">problem_size</span><span class="p">(</span><span class="n">M</span><span class="p">,</span><span class="n">N</span><span class="p">,</span><span class="n">K</span><span class="p">);</span> <span class="n">compute_gemm_with_cutlass</span><span class="p">(</span><span class="n">lhs</span><span class="p">,</span> <span class="n">rhs</span><span class="p">,</span> <span class="n">res</span><span class="p">,</span> <span class="n">problem_size</span><span class="p">);</span> <span class="p">}</span> </code></pre></div></div> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">__global__</span> <span class="kt">void</span> <span class="nf">mykernel</span><span class="p">(</span><span class="kt">float</span> <span class="o">*</span><span class="n">lhs</span><span class="p">,</span> <span class="kt">float</span> <span class="o">*</span><span class="n">rhs</span><span class="p">,</span> <span class="kt">float</span> <span class="o">*</span><span class="n">res</span><span class="p">,</span> <span class="kt">int</span> <span class="n">M</span><span class="p">,</span> <span class="kt">int</span> <span class="n">N</span><span class="p">,</span> <span class="kt">int</span> <span class="n">K</span><span class="p">,</span> <span class="n">cutlass</span><span class="o">::</span><span class="n">gemm</span><span class="o">::</span><span class="n">GemmCoord</span> <span class="n">dummy</span><span class="p">)</span> <span class="p">{</span> <span class="n">cutlass</span><span class="o">::</span><span class="n">gemm</span><span class="o">::</span><span class="n">GemmCoord</span> <span class="n">problem_size</span><span class="p">(</span><span 
class="n">M</span><span class="p">,</span><span class="n">N</span><span class="p">,</span><span class="n">K</span><span class="p">);</span> <span class="n">compute_gemm_with_cutlass</span><span class="p">(</span><span class="n">lhs</span><span class="p">,</span> <span class="n">rhs</span><span class="p">,</span> <span class="n">res</span><span class="p">,</span> <span class="n">problem_size</span><span class="p">);</span> <span class="p">}</span> </code></pre></div></div> <p>and <code class="language-plaintext highlighter-rouge">BAR.SYNC.DEFER_BLOCKING</code> would be generated here instead of <code class="language-plaintext highlighter-rouge">BAR.SYNC</code> (due to cutlass being added as part of the function signature)</p> <p>Perhaps this was also a part of the optimization in previous versions of <code class="language-plaintext highlighter-rouge">ptxas</code>?</p> <h1 id="takeaway">Takeaway</h1> <p>So, adding “cutlass” to your kernel name can give you 100+ TFLOPS or -20% FLOPS.</p> <p>The issue is twofold: <code class="language-plaintext highlighter-rouge">ptxas</code> is a black box and <code class="language-plaintext highlighter-rouge">sass</code> is undocumented. That is unlike other ecosystems: LLVM lets you see the passes it runs, and the x86/ARM instruction sets are documented.</p> <p>This optimization helps some kernels, hurts others, or changes little at all. It depends entirely on your architecture and your specific code. What flies on an H100 might tank on a 5090 or B200, and you have no way to know until you run it.</p> <p>So what do you do? Benchmark it. Change the ordering in Triton/CUDA, see if the PTX changes, check the SASS output. That’s the only way to know what <code class="language-plaintext highlighter-rouge">ptxas</code> actually did.</p> <p>And this isn’t going away. <code class="language-plaintext highlighter-rouge">tileiras</code> (the new TileIR compiler) is also a black box.
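</p>

<p>The “check the SASS output” step above can be sketched as a plain text diff. This is a hedged sketch under my own assumptions (the <code class="language-plaintext highlighter-rouge">diff_sass</code> helper and the toy instruction strings are mine; in practice both inputs would come from real dumps, e.g. via <code class="language-plaintext highlighter-rouge">cuobjdump --dump-sass</code>):</p>

```python
import difflib

def diff_sass(sass_a: str, sass_b: str) -> str:
    """Unified diff of two SASS dumps, skipping blank lines."""
    a = [line for line in sass_a.splitlines() if line.strip()]
    b = [line for line in sass_b.splitlines() if line.strip()]
    return "\n".join(
        difflib.unified_diff(a, b, fromfile="baseline", tofile="cutlass", lineterm="")
    )

# Toy stand-ins for two real dumps of the same kernel under different names:
print(diff_sass("BAR.SYNC\nHMMA.884", "BAR.SYNC.DEFER_BLOCKING\nHMMA.884"))
```

<p>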
We may expect similar surprises like this moving forward.</p> <h1 id="appendix">Appendix</h1> <h2 id="nvidia-toolchain-background">NVIDIA toolchain background</h2> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-12-05/gpu_compilation.svg" width="100%" alt="" /> <div class="caption"> <em>High level compilation overview for NVIDIA GPUs </em> </div> </div> <p>NVIDIA’s toolchain works like this: <code class="language-plaintext highlighter-rouge">CUDA code</code> is compiled by <em>nvcc</em> into <code class="language-plaintext highlighter-rouge">PTX</code>, an intermediate representation. Then <em>ptxas</em> takes that <code class="language-plaintext highlighter-rouge">PTX</code> and turns it into <code class="language-plaintext highlighter-rouge">SASS</code>, the low-level instruction set the GPU runs<span class="sidenote-ref"></span><span class="sidenote">ptxas and sass are both undocumented, so it may be a bit difficult to understand what’s going on</span>.</p> <h2 id="h100-sm-diagram">H100 SM Diagram</h2> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-12-05/gh100.png" width="50%" alt="" /> <div class="caption"> <em>H100 SM Diagram (Image source: <a href="https://resources.nvidia.com/en-us-hopper-architecture/nvidia-h100-tensor-c" rel="external nofollow noopener" target="_blank">NVIDIA H100 GPU Whitepaper</a>) </em> </div> </div> <h2 id="changes">Changes</h2> <p>[12/16/2026] Thanks to @Firadeoclus on GPUMODE discord for pointing out that my original post mixes up <code class="language-plaintext highlighter-rouge">HMMA</code> and <code class="language-plaintext highlighter-rouge">HFMA2.MMA</code> and how they move constants instead of zeroing.</p> <h1 id="citation">Citation</h1> <p>To cite this article:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@article{zhu2025cutlass, title = {Maybe consider putting "cutlass" in your CUDA/Triton 
kernels}, author = {Zhu, Henry}, journal = {maknee.github.io}, year = {2025}, month = {December}, url = "https://maknee.github.io/blog/2025/Maybe-Consider-Putting-Cutlass-In-Your-CUDA-Kernels/" } </code></pre></div></div> Network Storage and Scaling Characteristics of a Distributed Filesystem 2025-09-16T06:00:00+00:00 2025-09-16T06:00:00+00:00 https://maknee.github.io/blog/2025/3FS-Performance-Journal-3 <h1 id="series">Series</h1> <ul> <li><a href="/blog/2025/3FS-Performance-Journal-1/">An Intro to DeepSeek’s Distributed File System</a></li> <li><a href="/blog/2025/3FS-Performance-Journal-2/">A Reality Check on DeepSeek’s Distributed File System Benchmarks</a></li> <li><a href="/blog/2025/3FS-Performance-Journal-3/">Network Storage and Scaling Characteristics of a Distributed Filesystem</a></li> </ul> <!-- - [Theoretical Performance Limits of 3FS](/blog/2018/RTX-DXR-Path-Tracer-Host/) - [Benchmarking 3FS](/blog/2018/RTX-DXR-Path-Tracer-HLSL/) - [Analysis of 3FS Benchmarks](/blog/2018/RTX-DXR-Path-Tracer-HLSL/) - [Improving 3FS Performance](/blog/2018/RTX-DXR-Path-Tracer-HLSL/) --> <h1 id="table-of-contents">Table of Contents</h1> <ul> <li><a href="#the-benchmarking-pyramid">The Benchmarking Pyramid</a></li> <li><a href="#network-baseline-benchmark">Network Baseline Benchmark</a></li> <li><a href="#benchmarking-for-modern-cluster">Storage Baseline Benchmark</a></li> <li><a href="#3fs">3FS Performance Analysis</a> <ul> <li><a href="#scaling-block-size-5-nodes">Scaling Block Size</a></li> <li><a href="#scaling-nodes">Scaling Number of Nodes</a></li> </ul> </li> <li><a href="#wrapping-up">Wrapping up</a></li> </ul> <h1 id="refresher">Refresher</h1> <p>In <a href="/blog/2025/3FS-Performance-Journal-1/">my first post</a>, I introduced DeepSeek’s <a href="https://github.com/deepseek-ai/3FS/tree/ee9a5cee0a85c64f4797bf380257350ca1becd36">3FS distributed file system</a> and performed a <a href="/blog/2025/3FS-Performance-Journal-2/">reality check in the second post</a>. 
Now it’s time to see how 3FS performs in practice.</p> <h1 id="the-benchmarking-pyramid">The Benchmarking Pyramid</h1> <p>Before diving into results, let’s talk about understanding software performance at a high level. If we imagine performance understanding as an onion, peeling off each layer of the onion reveals deeper insights<span class="sidenote-ref"></span><span class="sidenote">Each layer gives us a deeper understanding. Without starting at the top, discovering insights may be difficult</span></p> <div class="image-container" style="max-width: 100%; overflow: visible;"> <img loading="lazy" src="/assets/images/posts/2025-03-13/part3/increasing_difficulty.svg" style="width: 105%; margin-left: calc((100% - 105%) / 2);" alt="" /> <div class="caption"> <em>The performance analysis pyramid: from theoretical limits to production </em> </div> </div> <p>We started with napkin math in the first post, performed reality checks in the second, and now we’re ready for the next layer: microbenchmarking.</p> <h2 id="why-microbenchmark">Why Microbenchmark?</h2> <p>Think of microbenchmarking as testing individual components in isolation. Instead of running a complex workload that does everything at once, we test one specific operation repeatedly until we understand its exact performance characteristics. It’s like measuring only how fast a car accelerates in a straight line instead of timing a trip through city traffic where you can’t tell if slowdowns are from stop signs, traffic lights, or congested highways.</p> <p>But one might ask: why not jump straight to real workloads? Real workloads are messy. They mix reads, writes, different block sizes, and various access patterns. When something’s slow, is it the network? The disk? The software? That’s the challenge with macrobenchmarks and production workloads (the bottom layers of our pyramid). There are too many variables at once.
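</p>

<p>To make the pattern concrete, here is a toy sketch of the microbenchmark idea (my own illustration, not a tool used in this post): time one isolated operation many times and look at the distribution rather than a single number. An in-memory copy stands in for a disk or network operation:</p>

```python
import statistics
import time

def microbench(op, iters=1000):
    """Run one operation repeatedly and summarize its latency distribution."""
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        op()
        samples.append(time.perf_counter() - start)
    return {
        "median_us": statistics.median(samples) * 1e6,
        "p99_us": statistics.quantiles(samples, n=100)[98] * 1e6,  # 99th percentile
    }

buf = bytes(1024 * 1024)  # 1 MiB payload
stats = microbench(lambda: bytearray(buf))  # the isolated operation: one memory copy
print(f"median={stats['median_us']:.1f}us p99={stats['p99_us']:.1f}us")
```

<p>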
Microbenchmarks let us isolate each component and understand exactly where time is spent<span class="sidenote-ref"></span><span class="sidenote">They answer specific questions like: What’s the maximum throughput for sequential reads? How does latency change with queue depth? Where exactly does performance fall off a cliff when we increase parallelism?</span>.</p> <p>These benchmarks build intuition at multiple levels: from raw hardware performance to how exactly 3FS performs. Once one recognizes these patterns, one can develop intuition for why related applications may be slow and how to fix them<span class="sidenote-ref"></span><span class="sidenote">This knowledge transfers across systems too – similar hardware will have similar characteristics regardless of the software running on top, and similar types of software (like filesystems) perform comparable operations</span>.</p> <p>In my previous posts, I made several predictions about 3FS performance based on napkin math and reality checks. Now that I have actual microbenchmark data, I can see how accurate those predictions were or how terribly off I was.</p> <h2 id="what-were-measuring-and-why">What we’re measuring and why</h2> <p>In this post, we’ll answer five key questions:</p> <ol> <li><strong>What are the hardware limits?</strong> – Local SSD and InfiniBand benchmarks establish our ceiling</li> <li><strong>How does 3FS compare?</strong> – Performance differences from local benchmarks and why they occur</li> <li><strong>Is 3FS hardware-specific?</strong> – Does it require high-end hardware or work well on commodity clusters?<span class="sidenote-ref"></span><span class="sidenote"><a href="https://arxiv.org/pdf/2408.14158">DeepSeek’s paper</a> describes a cluster with NVMe SSDs and 200Gb/s InfiniBand.
What happens with SATA SSDs and 25Gb/s networking?</span></li> <li><strong>How does 3FS scale?</strong> – Performance across different node counts and configurations</li> <li><strong>What knobs matter?</strong> – Impact of block sizes, I/O patterns, and tuning parameters</li> </ol> <p>This will start to build our intuition for how 3FS performs. The post includes many interactive graphs to explore the data yourself<span class="sidenote-ref"></span><span class="sidenote">I’ll highlight the interesting patterns so you don’t drown in numbers; sometimes benchmarks reveal surprising behaviors</span>.</p> <h1 id="single-node-benchmarking">Single Node Benchmarking</h1> <p>Before diving into 3FS performance, we need to understand how our clusters perform. This section establishes baseline performance for both network and storage using standard tools.</p> <h2 id="testing-environment">Testing Environment</h2> <p>I have two contrasting setups that tell an interesting story:</p> <link rel="stylesheet" href="/assets/css/fancy_table.css" /> <script> function toggleCalculations(tableId) { // ...
(toggleCalculations function remains the same) const table = document.getElementById(tableId); const cells = table.querySelectorAll('.has-calculation'); const tableWrapper = document.getElementById(tableId + '-wrapper'); const toggleSwitch = document.getElementById(tableId + '-switch'); const toggleText = document.getElementById(tableId + '-toggle-text'); if (tableWrapper.classList.contains('show-calculations')) { tableWrapper.classList.remove('show-calculations'); toggleSwitch.classList.remove('active'); toggleText.textContent = "Show calculations"; cells.forEach(cell => { const normalText = cell.querySelector('.normal-text'); const calculationText = cell.querySelector('.calculation-text'); normalText.style.display = 'inline'; calculationText.style.display = 'none'; }); } else { tableWrapper.classList.add('show-calculations'); toggleSwitch.classList.add('active'); toggleText.textContent = "Hide calculations"; cells.forEach(cell => { const normalText = cell.querySelector('.normal-text'); const calculationText = cell.querySelector('.calculation-text'); normalText.style.display = 'none'; calculationText.style.display = 'inline'; }); } } function isConsideredYellow(colorString) { if (!colorString) return false; const lowerColor = colorString.toLowerCase(); const yellowKeywords = ['yellow', 'gold', 'lemon', 'chiffon', 'goldenrod', 'papayawhip', 'moccasin', 'khaki', '#ff0', '#ffff00', '#ffffe0', '#fffacd', '#fafad2', '#fff8dc', '#eee8aa', '#f0e68c']; if (yellowKeywords.some(k => lowerColor.includes(k))) { return true; } const match = lowerColor.match(/rgba?\((\d+),\s*(\d+),\s*(\d+)/); if (match) { const r = parseInt(match[1]); const g = parseInt(match[2]); const b = parseInt(match[3]); if (r > 200 && g > 180 && b < 200 && Math.abs(r - g) < 70) { return true; } } if (lowerColor === '#ffdab9') return true; // PeachPuff if (lowerColor === 'rgba(255,255,0,0.2)') return true; // Example: semi-transparent yellow return false; } function initializeCellHighlighters() { const 
highlighters = document.querySelectorAll('[data-highlight-cells]'); highlighters.forEach(highlighter => { if (highlighter.dataset.highlighterInitialized === 'true') return; highlighter.dataset.highlighterInitialized = 'true'; const cellIdsToHighlight = highlighter.dataset.highlightCells.split(',').map(id => id.trim()); const cellElements = cellIdsToHighlight.map(id => document.getElementById(id)).filter(el => el); let descriptiveSpans = []; if (highlighter.matches('span[data-hover-text-color]')) { descriptiveSpans.push(highlighter); } else { descriptiveSpans = Array.from(highlighter.querySelectorAll('span[data-hover-text-color]')); } descriptiveSpans.forEach(span => { if (typeof span.dataset.originalParaTextColor === 'undefined') { span.dataset.originalParaTextColor = window.getComputedStyle(span).color || ''; } }); highlighter.addEventListener('mouseenter', () => { highlighter.classList.add('trigger-text-active'); descriptiveSpans.forEach(span => { if (span.dataset.hoverTextColor) { span.style.color = span.dataset.hoverTextColor; } }); cellElements.forEach((cell, idx) => { cell.classList.add('cell-highlighted'); let hoverTextColorForCell = null; let hoverCellBgFromDescSpan = null; if (idx < descriptiveSpans.length) { const descSpan = descriptiveSpans[idx]; if (descSpan.dataset.hoverTextColor) { hoverTextColorForCell = descSpan.dataset.hoverTextColor; } if (descSpan.dataset.hoverCellBg) { hoverCellBgFromDescSpan = descSpan.dataset.hoverCellBg; } } else if (descriptiveSpans.length === 1 && cellElements.length > 1) { const descSpan = descriptiveSpans[0]; if (descSpan.dataset.hoverTextColor) { hoverTextColorForCell = descSpan.dataset.hoverTextColor; } if (descSpan.dataset.hoverCellBg) { hoverCellBgFromDescSpan = descSpan.dataset.hoverCellBg; } } if (hoverTextColorForCell) { const isCalcCell = cell.classList.contains('has-calculation'); if (isCalcCell) { const normalTextSpan = cell.querySelector('.normal-text'); const calcResultSpan = 
cell.querySelector('.calc-result'); const calcWrapper = cell.querySelector('.calculation-text'); if (calcWrapper && window.getComputedStyle(calcWrapper).display !== 'none') { if (calcResultSpan) calcResultSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } else { if (normalTextSpan) normalTextSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } } else { const directSpan = cell.querySelector(':scope > span'); if (directSpan) directSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } } if (hoverCellBgFromDescSpan && isConsideredYellow(hoverCellBgFromDescSpan)) { if (typeof cell.dataset.originalBgColor === 'undefined') { cell.dataset.originalBgColor = cell.style.backgroundColor; } cell.style.backgroundColor = hoverCellBgFromDescSpan; cell.dataset.bgChangedByScript = 'true'; } }); }); highlighter.addEventListener('mouseleave', () => { highlighter.classList.remove('trigger-text-active'); descriptiveSpans.forEach(span => { if (typeof span.dataset.originalParaTextColor !== 'undefined') { span.style.color = span.dataset.originalParaTextColor; } }); cellElements.forEach((cell, idx) => { cell.classList.remove('cell-highlighted'); const isCalcCell = cell.classList.contains('has-calculation'); if (isCalcCell) { const normalTextSpan = cell.querySelector('.normal-text'); const calcResultSpan = cell.querySelector('.calc-result'); if (normalTextSpan) normalTextSpan.style.removeProperty('color'); if (calcResultSpan) calcResultSpan.style.removeProperty('color'); } else { const directSpan = cell.querySelector(':scope > span'); if (directSpan) directSpan.style.removeProperty('color'); } if (cell.dataset.bgChangedByScript === 'true') { cell.style.backgroundColor = cell.dataset.originalBgColor || ''; delete cell.dataset.originalBgColor; delete cell.dataset.bgChangedByScript; } }); }); }); } if (document.readyState === 'loading') { document.addEventListener('DOMContentLoaded', initializeCellHighlighters); } else { 
initializeCellHighlighters(); } </script> <style> /* ... (Other styles remain the same) ... */ .toggle-container { display: flex; justify-content: flex-end; margin-bottom: 6px; } .calc-toggle { display: inline-flex; align-items: center; } .toggle-switch { position: relative; display: inline-block; width: 32px; height: 16px; background-color: #e5e7eb; /* gray-200 */ border-radius: 10px; transition: all 0.3s; cursor: pointer; } .toggle-switch::after { content: ''; position: absolute; width: 12px; height: 12px; border-radius: 50%; background-color: white; top: 2px; left: 2px; transition: all 0.3s cubic-bezier(0.175, 0.885, 0.32, 1.275); box-shadow: 0 1px 2px rgba(0, 0, 0, 0.1); } .toggle-switch.active { background-color: #8b5cf6; /* purple-500 */ } .toggle-switch.active::after { left: 18px; } .toggle-label { position: absolute; width: 1px; height: 1px; padding: 0; margin: -1px; overflow: hidden; clip: rect(0, 0, 0, 0); white-space: nowrap; border-width: 0; } .toggle-text { font-size: 0.75rem; color: #6b7280; /* gray-500 */ margin-right: 6px; } .calc-result { color: rgb(75, 85, 99); /* gray-700 */ } .calc-formula { color: #6b83a6; /* Subtle bluish gray */ font-family: ui-monospace, SFMono-Regular, Menlo, Monaco, Consolas, monospace; } .calc-equals { color: rgb(75, 85, 99); /* gray-700 */ display: inline-block; margin: 0 4px; font-weight: normal; } .cell-highlighted { font-weight: bold !important; } .trigger-text-active { background-color: #E5E7EB; padding: 1px 3px; border-radius: 3px; transition: background-color 0.15s ease-in-out; } .trigger-text-active span[data-hover-text-color] { padding: 0; background-color: transparent; } /* Add styles for no-scroll option */ .table-wrapper-no-scroll { overflow-x: visible !important; } .table-no-scroll td { white-space: normal !important; word-break: break-word; } </style> <div id="fancy-table-Component,Older Cluster (18 Nodes),Modern Cluster (5 Nodes)-wrapper" class="px-4 rounded-lg __basic-table not-prose mt-4 mb-4 
overflow-x-auto"> <table id="fancy-table-Component,Older Cluster (18 Nodes),Modern Cluster (5 Nodes)" class="min-w-full divide-y divide-gray-200 font-sans basic-table-striped"> <thead class="bg-gray-50"> <tr> <th class="px-6 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Component </th> <th class="px-6 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Older Cluster (18 Nodes) </th> <th class="px-6 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Modern Cluster (5 Nodes) </th> </tr> </thead> <tbody> <tr class="border-b border-gray-200"> <td id="fancy-table-Component,Older Cluster (18 Nodes),Modern Cluster (5 Nodes)-row0-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Node Count</span> </td> <td id="fancy-table-Component,Older Cluster (18 Nodes),Modern Cluster (5 Nodes)-row0-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">18</span> </td> <td id="fancy-table-Component,Older Cluster (18 Nodes),Modern Cluster (5 Nodes)-row0-col2" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">5</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="fancy-table-Component,Older Cluster (18 Nodes),Modern Cluster (5 Nodes)-row1-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Use case</span> </td> <td id="fancy-table-Component,Older Cluster (18 Nodes),Modern Cluster (5 Nodes)-row1-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Budget cluster</span> </td> <td id="fancy-table-Component,Older Cluster (18 Nodes),Modern Cluster (5 Nodes)-row1-col2" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) 
!important;">High-performance cluster</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="fancy-table-Component,Older Cluster (18 Nodes),Modern Cluster (5 Nodes)-row2-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">CPU</span> </td> <td id="fancy-table-Component,Older Cluster (18 Nodes),Modern Cluster (5 Nodes)-row2-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">10-core Intel E5-2640v4 (2017 era)</span> </td> <td id="fancy-table-Component,Older Cluster (18 Nodes),Modern Cluster (5 Nodes)-row2-col2" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">2×36-core Intel Xeon Platinum (2021 era)</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="fancy-table-Component,Older Cluster (18 Nodes),Modern Cluster (5 Nodes)-row3-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Memory</span> </td> <td id="fancy-table-Component,Older Cluster (18 Nodes),Modern Cluster (5 Nodes)-row3-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">64GB DDR4-2400</span> </td> <td id="fancy-table-Component,Older Cluster (18 Nodes),Modern Cluster (5 Nodes)-row3-col2" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">256GB DDR4-3200</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="fancy-table-Component,Older Cluster (18 Nodes),Modern Cluster (5 Nodes)-row4-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Storage</span> </td> <td id="fancy-table-Component,Older Cluster (18 Nodes),Modern Cluster (5 Nodes)-row4-col1" class="px-6 py-2 
whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">SATA SSD (480GB)</span> </td> <td id="fancy-table-Component,Older Cluster (18 Nodes),Modern Cluster (5 Nodes)-row4-col2" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">NVMe SSD (1.6TB PCIe 4.0)</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="fancy-table-Component,Older Cluster (18 Nodes),Modern Cluster (5 Nodes)-row5-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Network</span> </td> <td id="fancy-table-Component,Older Cluster (18 Nodes),Modern Cluster (5 Nodes)-row5-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">25 Gbps (3.125 GB/s)</span> </td> <td id="fancy-table-Component,Older Cluster (18 Nodes),Modern Cluster (5 Nodes)-row5-col2" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">100 Gbps (12.5 GB/s)</span> </td> </tr> </tbody> </table> </div> <p>The older cluster represents deployments using previous-generation hardware. The modern cluster represents reasonably current high-performance deployments. Comparing these reveals how 3FS performs across different hardware generations<span class="sidenote-ref"></span><span class="sidenote">I don’t have access to a high-end cluster with many NVMe drives and newer NICs. I’d love to have the setup that the 3FS team uses, but I’m just a student without access to those types of clusters 😔</span>.
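</p>

<p>As a sanity check on the network numbers in the table, converting a link’s line rate from Gbit/s to GB/s is just a divide by 8 (this ignores protocol overhead, so achievable throughput lands a bit lower than line rate):</p>

```python
def line_rate_gb_per_s(gbps: float) -> float:
    """Convert a line rate in Gbit/s to GB/s (8 bits per byte)."""
    return gbps / 8

print(line_rate_gb_per_s(25))   # old cluster: 3.125 GB/s line rate
print(line_rate_gb_per_s(100))  # new cluster: 12.5 GB/s line rate
```

<p>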
I’ll be referring to these clusters as <code class="language-plaintext highlighter-rouge">old cluster</code> and <code class="language-plaintext highlighter-rouge">new cluster</code>.</p> <details style="font-size: 1.2em; margin: 1em 0;"> <summary style="cursor: pointer; font-weight: bold;">Expand to see more detailed hardware specifications</summary> <table class="datatable display compact cell-border row-border hover"> <thead> <tr> <th>Component</th> <th>Older Cluster (18 Node Setup)</th> <th>Modern Cluster (5 Node Setup)</th> </tr> </thead> <tbody> <tr> <td>Node Count</td> <td>18</td> <td>5</td> </tr> <tr> <td>CPU</td> <td>Ten-core Intel E5-2640v4 at 2.4 GHz</td> <td>Two 36-core Intel Xeon Platinum 8360Y at 2.4GHz</td> </tr> <tr> <td>RAM</td> <td>64GB ECC Memory (4x 16 GB DDR4-2400 DIMMs)</td> <td>256GB ECC Memory (16x 16 GB 3200MHz DDR4)</td> </tr> <tr> <td>Disk</td> <td><a href="https://servak.com.ua/image/manual/SSD/SSD_240GB_2.5_6G_INTEL_DC_S3520_SERIES_SATA_Quick_Specs_Servak_2.pdf?srsltid=AfmBOoq8zg_-WF9Sop69GSohu_edCS2TGfP0pINVrR3IfPklqPNjLb5J">Intel DC S3520 480 GB 6G SATA SSD</a> (OS &amp; Workload)</td> <td><a href="https://semiconductor.samsung.com/ssd/datacenter-ssd/sm883/mz7kh480hahq/">Samsung 480GB SATA SSD</a> (OS)<br /><a href="https://dl.dell.com/manuals/all-products/esuprt_data_center_infra_int/esuprt_data_center_infra_storage_adapters/dell-poweredge-exp-fsh-nvme-pcie-ssd_users-guide7_en-us.pdf">Dell 1.6TB NVMe SSD (PCIe v4.0)</a> (Workload)</td> </tr> <tr> <td>Network</td> <td>Mellanox ConnectX-4 25 Gb NIC<br />(3.125 GB/s, only one physical port at 25 Gbps)</td> <td>Dual-port Mellanox ConnectX-6 100 Gb NIC<br />(12.5 GB/s, only one physical port enabled)</td> </tr> </tbody> </table> <!-- lstopo --no-legend --of svg > cpu.svg --> <p>Layout of Older Cluster:</p> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-03-13/setup/setup1.svg" width="100%" alt="" /> <div class="caption"> <em>Older Cluster cpu/pcie layout
</em> </div> </div> <p>Layout of Modern Cluster:</p> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-03-13/setup/setup2.svg" width="100%" alt="" /> <div class="caption"> <em>Modern Cluster cpu/pcie layout </em> </div> </div> </details> <h2 id="network-baseline-benchmark">Network Baseline Benchmark</h2> <p>Distributed filesystems are only as fast as their network, which often becomes the primary bottleneck depending on the workload, as shown in <a href="/blog/2025/3FS-Performance-Journal-2/#first-workload-training-job">my measurements in the previous post</a>.</p> <p>Since 3FS uses InfiniBand for data transfer, we first measure raw network performance using the <code class="language-plaintext highlighter-rouge">ib_send</code>, <code class="language-plaintext highlighter-rouge">ib_read</code> and <code class="language-plaintext highlighter-rouge">ib_write</code> benchmarks. These tests show us two things: how close we can get to the theoretical 12.5 GB/s (100 Gbps) limit, and how latency changes with different message sizes<span class="sidenote-ref"></span><span class="sidenote">I will be profiling actual 3FS network traffic to observe what message sizes are used and how they map to these latency measurements in a later post</span>.</p> <p>The graph plots three key variables:</p> <ul> <li><strong>Message Size (Z-axis):</strong> On a logarithmic scale, showing packet sizes from bytes to 10 megabytes</li> <li><strong>Throughput (Y-axis):</strong> Data transfer rate in GB/s, with color mapping from blue (low) to red (high)</li> <li><strong>Latency (X-axis):</strong> Transfer completion time in microseconds</li> </ul> <details style="font-size: 1.2em; margin: 1em 0;"> <summary style="cursor: pointer; font-weight: bold;">Expand for instructions on how to interact with the graph</summary> <p>The results of the <code class="language-plaintext highlighter-rouge">ib_read_bw</code> benchmark are plotted in the interactive 3D graph below. 
You can click and drag to rotate the graph, and hovering over any data point will display its precise values.</p> <p>The <strong>Test Type</strong> menu allows you to switch between different benchmark results (<code class="language-plaintext highlighter-rouge">ib_write</code> and <code class="language-plaintext highlighter-rouge">ib_send</code>). The <strong>View Mode</strong> can be changed to 2D, which helps observe latency variations more clearly.</p> </details> <!-- ib-benchmark.html --> <link rel="stylesheet" href="/assets/css/ib_benchmark.css" /> <script src="https://cdnjs.cloudflare.com/ajax/libs/plotly.js/2.27.1/plotly.min.js"></script> <script src="/assets/js/ib_benchmark.js"></script> <div class="ib-benchmark-container" id="ib-benchmark-container-nvme_ib_unidirectional" data-path="/assets/images/posts/2025-03-13/ib/ib_benchmark_unidirectional.json"> <h2>IB benchmark unidirectional</h2> <div class="ib-controls"> <div class="ib-control-group"> <label for="testType-nvme_ib_unidirectional">Test Type</label> <select id="testType-nvme_ib_unidirectional"> <option value="send_bw">Send Bandwidth</option> <option value="send_lat">Send Latency</option> <option value="read_bw" selected="">Read Bandwidth</option> <option value="read_lat">Read Latency</option> <option value="write_bw">Write Bandwidth</option> <option value="write_lat">Write Latency</option> </select> </div> <div class="ib-control-group"> <label for="viewMode-nvme_ib_unidirectional">View Mode</label> <select id="viewMode-nvme_ib_unidirectional"> <option value="3d" selected="">3D Graph</option> <option value="2d">2D Graph</option> </select> </div> </div> <div id="ib-plot-nvme_ib_unidirectional" class="ib-plot-container ib-lazy-load"></div> <!-- Panels indicator --> <div id="ib-panels-indicator-nvme_ib_unidirectional" class="ib-panels-indicator" style="display: none;"> <span>Active Panels: <span id="ib-panel-count-nvme_ib_unidirectional">0</span></span> <button id="ib-arrange-btn-nvme_ib_unidirectional" 
class="ib-action-button">Arrange</button> <button id="ib-close-all-btn-nvme_ib_unidirectional" class="ib-action-button">Close All</button> </div> <div class="ib-benchmark-note"> <p>IB Benchmark on unidirectional throughput/latency</p> </div> <script> document.addEventListener('DOMContentLoaded', function() { // Create intersection observer for lazy loading const observer = new IntersectionObserver((entries, observer) => { entries.forEach(entry => { if (entry.isIntersecting) { const container = entry.target.closest('.ib-benchmark-container'); const id = container.id.replace('ib-benchmark-container-', ''); const plotEl = document.getElementById('ib-plot-' + id); // Check if already loading or loaded to prevent duplicate requests if (!plotEl.dataset.loading) { plotEl.dataset.loading = 'true'; // Load the benchmark data and initialize the plot loadIBBenchmarkData(id); } // Stop observing once we've started loading observer.unobserve(entry.target); } }); }, { rootMargin: '200px 0px', // Load when within 200px of viewport threshold: 0.01 }); // Start observing the plot container const plotContainer = document.getElementById('ib-plot-nvme_ib_unidirectional'); if (plotContainer) { observer.observe(plotContainer); } }); // Function to load benchmark data function loadIBBenchmarkData(id) { // Ensure benchmark.js is loaded before initializing function waitForIBBenchmarkJs() { if (typeof initInfiniBandPlot === 'function') { // Function exists, proceed with initialization const container = document.getElementById('ib-benchmark-container-' + id); const dataPath = container.getAttribute('data-path'); if (dataPath) { // Use fetch API to load JSON data only when needed fetch(dataPath) .then(response => { if (!response.ok) { throw new Error(`HTTP error! 
Status: ${response.status}`); } return response.json(); }) .then(data => { // Store data and initialize the plot const processedData = processBenchmarkData(data); const plotId = 'ib-plot-' + id; const testType = document.getElementById('testType-' + id).value; const viewMode = document.getElementById('viewMode-' + id).value; // Use our new function to initialize the plot window['ibPlot_' + id] = initInfiniBandPlot(plotId, processedData, { defaultTest: testType || 'send_bw', viewMode: viewMode || '3d' }); document.getElementById(plotId).classList.remove('ib-lazy-load'); // Setup event listeners for controls document.getElementById('testType-' + id).addEventListener('change', function(e) { const plotObj = window['ibPlot_' + id]; if (plotObj && plotObj.setTestType) { plotObj.setTestType(e.target.value); } }); document.getElementById('viewMode-' + id).addEventListener('change', function(e) { const plotObj = window['ibPlot_' + id]; if (plotObj && plotObj.setViewMode) { plotObj.setViewMode(e.target.value); } }); }) .catch(error => { console.error('Error loading InfiniBand benchmark data:', error); document.getElementById('ib-plot-' + id).innerHTML = '<div class="ib-error">Error loading benchmark data. Check console for details.</div>'; }); } } else { // Function not available yet, wait and try again setTimeout(() => waitForIBBenchmarkJs(), 100); } } waitForIBBenchmarkJs(); } </script> </div> <p>Key observations from the throughput graph:</p> <ul> <li>All three operations (read, write, send) peak at ~11.5 GB/s (92% of theoretical) at 4K-8K message sizes<span class="sidenote-ref"></span><span class="sidenote">Surprisingly, the send operation (two-sided) achieves the same bandwidth as one-sided RDMA operations. 
This is unexpected given the additional coordination overhead</span></li> <li>To achieve meaningful throughput (&gt;10 GB/s), you need at least 4KB messages</li> </ul> <p>Switching to the latency graph (Read Bandwidth -&gt; Read Latency) reveals additional insights:</p> <ul> <li>At the same 4K message sizes, latency drops significantly to ~5μs when operating at ~1 GB/s<span class="sidenote-ref"></span><span class="sidenote">Possibly queuing effects, though I’m not sure of the exact cause</span></li> </ul> <p>Switching to the 2D version of the latency graph (Read Bandwidth -&gt; Read Latency, 3D Graph -&gt; 2D Graph):</p> <ul> <li>Two distinct latency regions emerge: a gentle increase from 5μs to 10μs (2 bytes to 64KB), then almost linear growth beyond 64KB<span class="sidenote-ref"></span><span class="sidenote">This also holds when the NIC is at full throughput. This makes the performance very predictable, which makes understanding network bottlenecks easier</span></li> <li>Latency variance remains stable across most message sizes (p50, p90, p99 are tightly grouped)</li> </ul> <p>Since NICs support bidirectional communication, we also need to measure performance when traffic flows in both directions simultaneously:</p> <!-- ib-benchmark.html --> <link rel="stylesheet" href="/assets/css/ib_benchmark.css" /> <script src="https://cdnjs.cloudflare.com/ajax/libs/plotly.js/2.27.1/plotly.min.js"></script> <script src="/assets/js/ib_benchmark.js"></script> <div class="ib-benchmark-container" id="ib-benchmark-container-nvme_ib_bidirectional" data-path="/assets/images/posts/2025-03-13/ib/ib_benchmark_bidirectional.json"> <h2>IB benchmark bidirectional</h2> <div class="ib-controls"> <div class="ib-control-group"> <label for="testType-nvme_ib_bidirectional">Test Type</label> <select id="testType-nvme_ib_bidirectional"> <option value="send_bw">Send Bandwidth</option> <option value="send_lat">Send Latency</option> <option value="read_bw" selected="">Read
Bandwidth</option> <option value="read_lat">Read Latency</option> <option value="write_bw">Write Bandwidth</option> <option value="write_lat">Write Latency</option> </select> </div> <div class="ib-control-group"> <label for="viewMode-nvme_ib_bidirectional">View Mode</label> <select id="viewMode-nvme_ib_bidirectional"> <option value="3d" selected="">3D Graph</option> <option value="2d">2D Graph</option> </select> </div> </div> <div id="ib-plot-nvme_ib_bidirectional" class="ib-plot-container ib-lazy-load"></div> <!-- Panels indicator --> <div id="ib-panels-indicator-nvme_ib_bidirectional" class="ib-panels-indicator" style="display: none;"> <span>Active Panels: <span id="ib-panel-count-nvme_ib_bidirectional">0</span></span> <button id="ib-arrange-btn-nvme_ib_bidirectional" class="ib-action-button">Arrange</button> <button id="ib-close-all-btn-nvme_ib_bidirectional" class="ib-action-button">Close All</button> </div> <div class="ib-benchmark-note"> <p>IB Benchmark on bidirectional throughput/latency</p> </div> <script> document.addEventListener('DOMContentLoaded', function() { // Create intersection observer for lazy loading const observer = new IntersectionObserver((entries, observer) => { entries.forEach(entry => { if (entry.isIntersecting) { const container = entry.target.closest('.ib-benchmark-container'); const id = container.id.replace('ib-benchmark-container-', ''); const plotEl = document.getElementById('ib-plot-' + id); // Check if already loading or loaded to prevent duplicate requests if (!plotEl.dataset.loading) { plotEl.dataset.loading = 'true'; // Load the benchmark data and initialize the plot loadIBBenchmarkData(id); } // Stop observing once we've started loading observer.unobserve(entry.target); } }); }, { rootMargin: '200px 0px', // Load when within 200px of viewport threshold: 0.01 }); // Start observing the plot container const plotContainer = document.getElementById('ib-plot-nvme_ib_bidirectional'); if (plotContainer) { 
observer.observe(plotContainer); } }); // Function to load benchmark data function loadIBBenchmarkData(id) { // Ensure benchmark.js is loaded before initializing function waitForIBBenchmarkJs() { if (typeof initInfiniBandPlot === 'function') { // Function exists, proceed with initialization const container = document.getElementById('ib-benchmark-container-' + id); const dataPath = container.getAttribute('data-path'); if (dataPath) { // Use fetch API to load JSON data only when needed fetch(dataPath) .then(response => { if (!response.ok) { throw new Error(`HTTP error! Status: ${response.status}`); } return response.json(); }) .then(data => { // Store data and initialize the plot const processedData = processBenchmarkData(data); const plotId = 'ib-plot-' + id; const testType = document.getElementById('testType-' + id).value; const viewMode = document.getElementById('viewMode-' + id).value; // Use our new function to initialize the plot window['ibPlot_' + id] = initInfiniBandPlot(plotId, processedData, { defaultTest: testType || 'send_bw', viewMode: viewMode || '3d' }); document.getElementById(plotId).classList.remove('ib-lazy-load'); // Setup event listeners for controls document.getElementById('testType-' + id).addEventListener('change', function(e) { const plotObj = window['ibPlot_' + id]; if (plotObj && plotObj.setTestType) { plotObj.setTestType(e.target.value); } }); document.getElementById('viewMode-' + id).addEventListener('change', function(e) { const plotObj = window['ibPlot_' + id]; if (plotObj && plotObj.setViewMode) { plotObj.setViewMode(e.target.value); } }); }) .catch(error => { console.error('Error loading InfiniBand benchmark data:', error); document.getElementById('ib-plot-' + id).innerHTML = '<div class="ib-error">Error loading benchmark data. 
Check console for details.</div>'; }); } } else { // Function not available yet, wait and try again setTimeout(() => waitForIBBenchmarkJs(), 100); } } waitForIBBenchmarkJs(); } </script> </div> <p>The bidirectional results largely mirror the unidirectional ones, with one counterintuitive twist:</p> <ul> <li>At 4K-8K message sizes, we achieve double the throughput while latency drops from 30-60μs to 15-30μs<span class="sidenote-ref"></span><span class="sidenote">This counterintuitive result likely occurs because each direction gets dedicated hardware resources, allowing better pipeline utilization</span></li> <li>Combined bandwidth reaches ~23 GB/s (~92% of theoretical 25 GB/s)</li> <li>Latencies remain consistent with unidirectional measurements</li> </ul> <p>These measurements give us concrete expectations for 3FS operations. For example, when 3FS performs a 3-node write (1KB from 3 storage nodes), the network alone will consume 3-10μs. Any latency above this represents other software/hardware overhead – chunk management, thread contention, or disk I/O.</p> <details style="font-size: 1.2em; margin: 1em 0;"> <summary style="cursor: pointer; font-weight: bold;">Comparison to NCCL all_reduce_perf (for fun)</summary> <p>NCCL is the standard framework for GPU-to-GPU communication in machine learning clusters.
Since GPUs also use InfiniBand for inter-node communication, I wanted to see if the same performance patterns emerge.</p> <p>This test uses a 2-node cluster with 8x400Gbps InfiniBand NICs (~400GB/s total), typical for modern GPU clusters like 8xH100 setups.</p> <!-- ib-benchmark.html --> <link rel="stylesheet" href="/assets/css/ib_benchmark.css" /> <script src="https://cdnjs.cloudflare.com/ajax/libs/plotly.js/2.27.1/plotly.min.js"></script> <script src="/assets/js/ib_benchmark.js"></script> <div class="ib-benchmark-container" id="ib-benchmark-container-_ib_bidirectional" data-path="/assets/images/posts/2025-03-13/ib/nccl.json"> <h2>NCCL all_reduce_perf</h2> <div class="ib-controls"> <div class="ib-control-group"> <label for="testType-_ib_bidirectional">Test Type</label> <select id="testType-_ib_bidirectional"> <option value="send_bw">Send Bandwidth</option> <option value="send_lat">Send Latency</option> <option value="read_bw">Read Bandwidth</option> <option value="read_lat">Read Latency</option> <option value="write_bw">Write Bandwidth</option> <option value="write_lat">Write Latency</option> </select> </div> <div class="ib-control-group"> <label for="viewMode-_ib_bidirectional">View Mode</label> <select id="viewMode-_ib_bidirectional"> <option value="3d" selected="">3D Graph</option> <option value="2d">2D Graph</option> </select> </div> </div> <div id="ib-plot-_ib_bidirectional" class="ib-plot-container ib-lazy-load"></div> <!-- Panels indicator --> <div id="ib-panels-indicator-_ib_bidirectional" class="ib-panels-indicator" style="display: none;"> <span>Active Panels: <span id="ib-panel-count-_ib_bidirectional">0</span></span> <button id="ib-arrange-btn-_ib_bidirectional" class="ib-action-button">Arrange</button> <button id="ib-close-all-btn-_ib_bidirectional" class="ib-action-button">Close All</button> </div> <div class="ib-benchmark-note"> <p>all_reduce_perf</p> </div> <script> document.addEventListener('DOMContentLoaded', function() { // Create intersection 
observer for lazy loading const observer = new IntersectionObserver((entries, observer) => { entries.forEach(entry => { if (entry.isIntersecting) { const container = entry.target.closest('.ib-benchmark-container'); const id = container.id.replace('ib-benchmark-container-', ''); const plotEl = document.getElementById('ib-plot-' + id); // Check if already loading or loaded to prevent duplicate requests if (!plotEl.dataset.loading) { plotEl.dataset.loading = 'true'; // Load the benchmark data and initialize the plot loadIBBenchmarkData(id); } // Stop observing once we've started loading observer.unobserve(entry.target); } }); }, { rootMargin: '200px 0px', // Load when within 200px of viewport threshold: 0.01 }); // Start observing the plot container const plotContainer = document.getElementById('ib-plot-_ib_bidirectional'); if (plotContainer) { observer.observe(plotContainer); } }); // Function to load benchmark data function loadIBBenchmarkData(id) { // Ensure benchmark.js is loaded before initializing function waitForIBBenchmarkJs() { if (typeof initInfiniBandPlot === 'function') { // Function exists, proceed with initialization const container = document.getElementById('ib-benchmark-container-' + id); const dataPath = container.getAttribute('data-path'); if (dataPath) { // Use fetch API to load JSON data only when needed fetch(dataPath) .then(response => { if (!response.ok) { throw new Error(`HTTP error! 
Status: ${response.status}`); } return response.json(); }) .then(data => { // Store data and initialize the plot const processedData = processBenchmarkData(data); const plotId = 'ib-plot-' + id; const testType = document.getElementById('testType-' + id).value; const viewMode = document.getElementById('viewMode-' + id).value; // Use our new function to initialize the plot window['ibPlot_' + id] = initInfiniBandPlot(plotId, processedData, { defaultTest: testType || 'send_bw', viewMode: viewMode || '3d' }); document.getElementById(plotId).classList.remove('ib-lazy-load'); // Setup event listeners for controls document.getElementById('testType-' + id).addEventListener('change', function(e) { const plotObj = window['ibPlot_' + id]; if (plotObj && plotObj.setTestType) { plotObj.setTestType(e.target.value); } }); document.getElementById('viewMode-' + id).addEventListener('change', function(e) { const plotObj = window['ibPlot_' + id]; if (plotObj && plotObj.setViewMode) { plotObj.setViewMode(e.target.value); } }); }) .catch(error => { console.error('Error loading InfiniBand benchmark data:', error); document.getElementById('ib-plot-' + id).innerHTML = '<div class="ib-error">Error loading benchmark data. Check console for details.</div>'; }); } } else { // Function not available yet, wait and try again setTimeout(() => waitForIBBenchmarkJs(), 100); } } waitForIBBenchmarkJs(); } </script> </div> <p>The bandwidth pattern is similar (slow climb then rapid rise), but peak performance hits at ~512MB messages instead of 8KB<span class="sidenote-ref"></span><span class="sidenote">Likely due to multiple NICs and the collective communication overhead of all_reduce operations</span>. 
At the same 8KB message size where our InfiniBand tests peaked, NCCL only achieves ~0.24 GB/s @ ~20μs.</p> </details> <h2 id="storage-baseline-benchmark">Storage Baseline Benchmark</h2> <p><a href="https://fio.readthedocs.io/en/latest/fio_doc.html">FIO</a> is the standard tool for storage benchmarking on Linux, so I’ll be using that in the next section. As a heads up, the 3FS authors conveniently provide a <a href="https://github.com/deepseek-ai/3FS/tree/8c9883c27f50da8d1df8ff0b952483d21cdf1792/benchmarks/fio_usrbio">custom FIO engine</a> specifically for benchmarking their filesystem<span class="sidenote-ref"></span><span class="sidenote">This wasn’t in the original release – they added it after I started this analysis; otherwise I would have spent quite a bit of time writing one myself</span> that we can compare against!</p> <h3 id="local-storage-performance">Local Storage Performance</h3> <p>Before measuring 3FS, we need baseline numbers for our SSDs. The following benchmarks show how bandwidth and latency change as we vary two key parameters:</p> <ul> <li><strong>I/O depth</strong>: How many operations we submit before waiting for completion (think of it as the queue length)</li> <li><strong>Job count</strong>: How many parallel processes are hammering the storage simultaneously</li> </ul> <p>These SSD numbers will become our reference point<span class="sidenote-ref"></span><span class="sidenote">For example, with a replication factor of 3, we might see 3x higher read throughput or 3x higher write latency, but this might not be the case!</span> – when 3FS shows higher latency or lower throughput, we can quantify exactly how much overhead the distributed layer adds.</p> <p>I’ll benchmark the local SSD with io_uring, then 3FS with io_uring, and finally 3FS with its own custom io_uring-like interface.</p> <p>I configured 3FS with a replication factor of 3.</p> <h4 id="hardware-vendor-specifications">Hardware Vendor Specifications</h4> <p>Before examining our benchmark results, let’s establish
the theoretical performance limits according to hardware vendor specifications. These numbers represent the maximum performance we could theoretically achieve under ideal conditions:</p> <link rel="stylesheet" href="/assets/css/fancy_table.css" /> <script> function toggleCalculations(tableId) { // ... (toggleCalculations function remains the same) const table = document.getElementById(tableId); const cells = table.querySelectorAll('.has-calculation'); const tableWrapper = document.getElementById(tableId + '-wrapper'); const toggleSwitch = document.getElementById(tableId + '-switch'); const toggleText = document.getElementById(tableId + '-toggle-text'); if (tableWrapper.classList.contains('show-calculations')) { tableWrapper.classList.remove('show-calculations'); toggleSwitch.classList.remove('active'); toggleText.textContent = "Show calculations"; cells.forEach(cell => { const normalText = cell.querySelector('.normal-text'); const calculationText = cell.querySelector('.calculation-text'); normalText.style.display = 'inline'; calculationText.style.display = 'none'; }); } else { tableWrapper.classList.add('show-calculations'); toggleSwitch.classList.add('active'); toggleText.textContent = "Hide calculations"; cells.forEach(cell => { const normalText = cell.querySelector('.normal-text'); const calculationText = cell.querySelector('.calculation-text'); normalText.style.display = 'none'; calculationText.style.display = 'inline'; }); } } function isConsideredYellow(colorString) { if (!colorString) return false; const lowerColor = colorString.toLowerCase(); const yellowKeywords = ['yellow', 'gold', 'lemon', 'chiffon', 'goldenrod', 'papayawhip', 'moccasin', 'khaki', '#ff0', '#ffff00', '#ffffe0', '#fffacd', '#fafad2', '#fff8dc', '#eee8aa', '#f0e68c']; if (yellowKeywords.some(k => lowerColor.includes(k))) { return true; } const match = lowerColor.match(/rgba?\((\d+),\s*(\d+),\s*(\d+)/); if (match) { const r = parseInt(match[1]); const g = parseInt(match[2]); const b = 
parseInt(match[3]); if (r > 200 && g > 180 && b < 200 && Math.abs(r - g) < 70) { return true; } } if (lowerColor === '#ffdab9') return true; // PeachPuff if (lowerColor === 'rgba(255,255,0,0.2)') return true; // Example: semi-transparent yellow return false; } function initializeCellHighlighters() { const highlighters = document.querySelectorAll('[data-highlight-cells]'); highlighters.forEach(highlighter => { if (highlighter.dataset.highlighterInitialized === 'true') return; highlighter.dataset.highlighterInitialized = 'true'; const cellIdsToHighlight = highlighter.dataset.highlightCells.split(',').map(id => id.trim()); const cellElements = cellIdsToHighlight.map(id => document.getElementById(id)).filter(el => el); let descriptiveSpans = []; if (highlighter.matches('span[data-hover-text-color]')) { descriptiveSpans.push(highlighter); } else { descriptiveSpans = Array.from(highlighter.querySelectorAll('span[data-hover-text-color]')); } descriptiveSpans.forEach(span => { if (typeof span.dataset.originalParaTextColor === 'undefined') { span.dataset.originalParaTextColor = window.getComputedStyle(span).color || ''; } }); highlighter.addEventListener('mouseenter', () => { highlighter.classList.add('trigger-text-active'); descriptiveSpans.forEach(span => { if (span.dataset.hoverTextColor) { span.style.color = span.dataset.hoverTextColor; } }); cellElements.forEach((cell, idx) => { cell.classList.add('cell-highlighted'); let hoverTextColorForCell = null; let hoverCellBgFromDescSpan = null; if (idx < descriptiveSpans.length) { const descSpan = descriptiveSpans[idx]; if (descSpan.dataset.hoverTextColor) { hoverTextColorForCell = descSpan.dataset.hoverTextColor; } if (descSpan.dataset.hoverCellBg) { hoverCellBgFromDescSpan = descSpan.dataset.hoverCellBg; } } else if (descriptiveSpans.length === 1 && cellElements.length > 1) { const descSpan = descriptiveSpans[0]; if (descSpan.dataset.hoverTextColor) { hoverTextColorForCell = descSpan.dataset.hoverTextColor; } if 
(descSpan.dataset.hoverCellBg) { hoverCellBgFromDescSpan = descSpan.dataset.hoverCellBg; } } if (hoverTextColorForCell) { const isCalcCell = cell.classList.contains('has-calculation'); if (isCalcCell) { const normalTextSpan = cell.querySelector('.normal-text'); const calcResultSpan = cell.querySelector('.calc-result'); const calcWrapper = cell.querySelector('.calculation-text'); if (calcWrapper && window.getComputedStyle(calcWrapper).display !== 'none') { if (calcResultSpan) calcResultSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } else { if (normalTextSpan) normalTextSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } } else { const directSpan = cell.querySelector(':scope > span'); if (directSpan) directSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } } if (hoverCellBgFromDescSpan && isConsideredYellow(hoverCellBgFromDescSpan)) { if (typeof cell.dataset.originalBgColor === 'undefined') { cell.dataset.originalBgColor = cell.style.backgroundColor; } cell.style.backgroundColor = hoverCellBgFromDescSpan; cell.dataset.bgChangedByScript = 'true'; } }); }); highlighter.addEventListener('mouseleave', () => { highlighter.classList.remove('trigger-text-active'); descriptiveSpans.forEach(span => { if (typeof span.dataset.originalParaTextColor !== 'undefined') { span.style.color = span.dataset.originalParaTextColor; } }); cellElements.forEach((cell, idx) => { cell.classList.remove('cell-highlighted'); const isCalcCell = cell.classList.contains('has-calculation'); if (isCalcCell) { const normalTextSpan = cell.querySelector('.normal-text'); const calcResultSpan = cell.querySelector('.calc-result'); if (normalTextSpan) normalTextSpan.style.removeProperty('color'); if (calcResultSpan) calcResultSpan.style.removeProperty('color'); } else { const directSpan = cell.querySelector(':scope > span'); if (directSpan) directSpan.style.removeProperty('color'); } if (cell.dataset.bgChangedByScript === 'true') { 
cell.style.backgroundColor = cell.dataset.originalBgColor || ''; delete cell.dataset.originalBgColor; delete cell.dataset.bgChangedByScript; } }); }); }); } if (document.readyState === 'loading') { document.addEventListener('DOMContentLoaded', initializeCellHighlighters); } else { initializeCellHighlighters(); } </script> <style> /* ... (Other styles remain the same) ... */ .toggle-container { display: flex; justify-content: flex-end; margin-bottom: 6px; } .calc-toggle { display: inline-flex; align-items: center; } .toggle-switch { position: relative; display: inline-block; width: 32px; height: 16px; background-color: #e5e7eb; /* gray-200 */ border-radius: 10px; transition: all 0.3s; cursor: pointer; } .toggle-switch::after { content: ''; position: absolute; width: 12px; height: 12px; border-radius: 50%; background-color: white; top: 2px; left: 2px; transition: all 0.3s cubic-bezier(0.175, 0.885, 0.32, 1.275); box-shadow: 0 1px 2px rgba(0, 0, 0, 0.1); } .toggle-switch.active { background-color: #8b5cf6; /* purple-500 */ } .toggle-switch.active::after { left: 18px; } .toggle-label { position: absolute; width: 1px; height: 1px; padding: 0; margin: -1px; overflow: hidden; clip: rect(0, 0, 0, 0); white-space: nowrap; border-width: 0; } .toggle-text { font-size: 0.75rem; color: #6b7280; /* gray-500 */ margin-right: 6px; } .calc-result { color: rgb(75, 85, 99); /* gray-700 */ } .calc-formula { color: #6b83a6; /* Subtle bluish gray */ font-family: ui-monospace, SFMono-Regular, Menlo, Monaco, Consolas, monospace; } .calc-equals { color: rgb(75, 85, 99); /* gray-700 */ display: inline-block; margin: 0 4px; font-weight: normal; } .cell-highlighted { font-weight: bold !important; } .trigger-text-active { background-color: #E5E7EB; padding: 1px 3px; border-radius: 3px; transition: background-color 0.15s ease-in-out; } .trigger-text-active span[data-hover-text-color] { padding: 0; background-color: transparent; } /* Add styles for no-scroll option */ .table-wrapper-no-scroll { 
overflow-x: visible !important; } .table-no-scroll td { white-space: normal !important; word-break: break-word; } </style> <div id="fancy-table-Performance Metric,Random Read,Sequential Read,Random Write,Sequential Write-wrapper" class="px-4 rounded-lg __basic-table not-prose mt-4 mb-4 overflow-x-auto"> <table id="fancy-table-Performance Metric,Random Read,Sequential Read,Random Write,Sequential Write" class="min-w-full divide-y divide-gray-200 font-sans basic-table-striped"> <thead class="bg-gray-50"> <tr> <th class="px-6 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Performance Metric </th> <th class="px-6 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Random Read </th> <th class="px-6 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Sequential Read </th> <th class="px-6 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Random Write </th> <th class="px-6 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Sequential Write </th> </tr> </thead> <tbody> <tr class="border-b border-gray-200"> <td id="fancy-table-Performance Metric,Random Read,Sequential Read,Random Write,Sequential Write-row0-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">SATA SSD</span> </td> <td id="fancy-table-Performance Metric,Random Read,Sequential Read,Random Write,Sequential Write-row0-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">276 MB/s</span> </td> <td id="fancy-table-Performance Metric,Random Read,Sequential Read,Random Write,Sequential Write-row0-col2" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">450 MB/s</span> </td> <td id="fancy-table-Performance Metric,Random Read,Sequential Read,Random Write,Sequential Write-row0-col3" class="px-6 py-2 whitespace-nowrap text-sm 
font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">380 MB/s</span> </td> <td id="fancy-table-Performance Metric,Random Read,Sequential Read,Random Write,Sequential Write-row0-col4" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">72 MB/s</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="fancy-table-Performance Metric,Random Read,Sequential Read,Random Write,Sequential Write-row1-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">NVMe</span> </td> <td id="fancy-table-Performance Metric,Random Read,Sequential Read,Random Write,Sequential Write-row1-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">3.77 GB/s</span> </td> <td id="fancy-table-Performance Metric,Random Read,Sequential Read,Random Write,Sequential Write-row1-col2" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">6.2 GB/s</span> </td> <td id="fancy-table-Performance Metric,Random Read,Sequential Read,Random Write,Sequential Write-row1-col3" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">0.4 GB/s</span> </td> <td id="fancy-table-Performance Metric,Random Read,Sequential Read,Random Write,Sequential Write-row1-col4" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">2.3 GB/s</span> </td> </tr> </tbody> </table> </div> <p>These theoretical limits come from the <a href="https://servak.com.ua/image/manual/SSD/SSD_240GB_2.5_6G_INTEL_DC_S3520_SERIES_SATA_Quick_Specs_Servak_2.pdf">Intel DC S3520 SATA</a> and <a 
href="https://dl.dell.com/manuals/all-products/esuprt_data_center_infra_int/esuprt_data_center_infra_storage_adapters/dell-poweredge-exp-fsh-nvme-pcie-ssd_users-guide7_en-us.pdf">Dell Enterprise NVMe</a> specification sheets. In practice, our benchmarks will likely fall short of these numbers due to filesystem overhead, driver limitations, and real-world I/O patterns.</p> <p>The performance gap between SATA and NVMe storage is also immediately apparent: NVMe provides roughly 10-15x higher throughput for most operations, and this difference may affect how 3FS performs.</p> <h1 id="benchmarking-for-older-cluster">Benchmarking for Older Cluster</h1> <h2 id="local-fio-results">Local FIO results</h2> <details style="font-size: 1.2em; margin: 1em 0;"> <summary style="cursor: pointer; font-weight: bold;">Expand for instructions on how to interact with the graph</summary> <p><strong>Controls:</strong></p> <ul> <li><strong>Test Type menu</strong>: Switch between Random Read, Sequential Read, Random Write, and Sequential Write</li> <li><strong>Metric menu</strong>: Change between Bandwidth, IOPS, and various Latency measurements</li> </ul> <p><strong>3D Navigation:</strong></p> <ul> <li><strong>Click and drag</strong>: Rotate the view</li> <li><strong>Scroll wheel</strong>: Zoom in/out</li> <li><strong>Hover</strong>: See exact values for any data point</li> <li><strong>Double-click</strong>: Reset to default view</li> </ul> <p><strong>Axes:</strong></p> <ul> <li><strong>X-axis</strong>: IO Depth (1 to 128)</li> <li><strong>Y-axis</strong>: Number of Jobs (1 to 128)</li> <li><strong>Color</strong>: The selected metric value (blue = low, red = high)</li> </ul> </details> <h3 id="scaling-block-size-for-local-ssd">Scaling block size for local SSD</h3> <p>The first benchmark uses the older cluster to establish our local SSD baseline.
I’m testing how performance changes with different block sizes (4K, 64K, 1MB, 4MB) to understand the storage characteristics of a SATA SSD. The local SSD was configured with the XFS filesystem.</p> <p>This is a lot of data. Feel free to jump between the interactive graphs and the <a href="#storage-performance-analysis-for-local-ssd">performance analysis</a> to explore the patterns.</p> <!-- benchmark.html --> <link rel="stylesheet" href="/assets/css/benchmark.css" /> <script src="https://cdnjs.cloudflare.com/ajax/libs/plotly.js/2.27.1/plotly.min.js"></script> <script src="/assets/js/benchmark.js"></script> <div class="benchmark-container" id="benchmark-container-ssd_xfs_iouring_4k" data-path="/assets/images/posts/2025-03-13/fio/4k_ssd_xfs_iouring_xl170_1.json"> <h2>4K Block Size - SSD XFS with IO_URING (Older)</h2> <div class="controls"> <div class="control-group"> <label for="testType-ssd_xfs_iouring_4k">Test Type</label> <select id="testType-ssd_xfs_iouring_4k"> <option value="randread">Random Read</option> <option value="read" selected="">Sequential Read</option> <option value="randwrite">Random Write</option> <option value="write">Sequential Write</option> </select> </div> <div class="control-group"> <label for="metricType-ssd_xfs_iouring_4k">Metric</label> <select id="metricType-ssd_xfs_iouring_4k"> <option value="bandwidth" selected="">Bandwidth (GB/s)</option> <option value="iops">IOPS</option> <option value="latency">Latency (μs)</option> <option value="latency_p50">Latency p50 (μs)</option> <option value="latency_p90">Latency p90 (μs)</option> <option value="latency_p99">Latency p99 (μs)</option> </select> </div> </div> <div id="benchmark-plot-ssd_xfs_iouring_4k" class="plot-container lazy-load"></div> <!-- Draggable panel for latency data --> <div id="latencyPanel-ssd_xfs_iouring_4k" class="benchmark-draggable-panel"> <div id="panelHeader-ssd_xfs_iouring_4k" class="panel-header"> <h3 class="panel-title" id="panelTitle-ssd_xfs_iouring_4k">Latency
Percentiles</h3> <div class="panel-controls"> <button id="collapseBtn-ssd_xfs_iouring_4k" class="collapse-btn" title="Collapse">▲</button> <button id="closeLatencyBtn-ssd_xfs_iouring_4k" class="close-btn" title="Close">×</button> </div> </div> <div class="panel-content"> <div class="latency-details" id="latencyDetails-ssd_xfs_iouring_4k"></div> <div id="latencyPlot-ssd_xfs_iouring_4k" class="latency-plot-container"></div> </div> </div> <div class="benchmark-note"> <p>Small block (4K) performance using SSD with XFS filesystem and IO_URING driver on older cluster.</p> </div> <script> document.addEventListener('DOMContentLoaded', function() { // Create intersection observer for lazy loading const observer = new IntersectionObserver((entries, observer) => { entries.forEach(entry => { if (entry.isIntersecting) { const container = entry.target.closest('.benchmark-container'); const id = container.id.replace('benchmark-container-', ''); const plotEl = document.getElementById('benchmark-plot-' + id); // Check if already loading or loaded to prevent duplicate requests if (!plotEl.dataset.loading) { plotEl.dataset.loading = 'true'; // Load the benchmark data and initialize the plot loadBenchmarkData(id); } // Stop observing once we've started loading observer.unobserve(entry.target); } }); }, { rootMargin: '200px 0px', // Load when within 200px of viewport threshold: 0.01 }); // Start observing the plot container const plotContainer = document.getElementById('benchmark-plot-ssd_xfs_iouring_4k'); if (plotContainer) { observer.observe(plotContainer); } }); // Function to load benchmark data function loadBenchmarkData(id) { // Ensure benchmark.js is loaded before initializing function waitForBenchmarkJs() { if (typeof initBenchmarkPlot === 'function') { // Function exists, proceed with initialization const container = document.getElementById('benchmark-container-' + id); const dataPath = container.getAttribute('data-path'); if (dataPath) { // Use fetch API to load JSON data 
only when needed fetch(dataPath) .then(response => { if (!response.ok) { throw new Error(`HTTP error! Status: ${response.status}`); } return response.json(); }) .then(data => { // Store data and initialize the plot window['benchmarkData_' + id] = data; initBenchmarkPlot(id); document.getElementById('benchmark-plot-' + id).classList.remove('lazy-load'); }) .catch(error => { console.error('Error loading benchmark data:', error); document.getElementById('benchmark-plot-' + id).innerHTML = '<div style="padding: 20px; color: red;">Error loading benchmark data. Check console for details.</div>'; }); } else { console.error('No data source provided for benchmark visualization'); document.getElementById('benchmark-plot-' + id).innerHTML = '<div style="padding: 20px; color: red;">Error: No data source provided for benchmark visualization.</div>'; } } else { // Function not available yet, wait and try again setTimeout(() => waitForBenchmarkJs(), 100); } } waitForBenchmarkJs(); } </script> </div> <!-- benchmark.html --> <link rel="stylesheet" href="/assets/css/benchmark.css" /> <script src="https://cdnjs.cloudflare.com/ajax/libs/plotly.js/2.27.1/plotly.min.js"></script> <script src="/assets/js/benchmark.js"></script> <div class="benchmark-container" id="benchmark-container-ssd_xfs_iouring_64k" data-path="/assets/images/posts/2025-03-13/fio/64k_ssd_xfs_iouring_xl170_1.json"> <h2>64k Block Size - SSD XFS with IO_URING (Older)</h2> <div class="controls"> <div class="control-group"> <label for="testType-ssd_xfs_iouring_64k">Test Type</label> <select id="testType-ssd_xfs_iouring_64k"> <option value="randread">Random Read</option> <option value="read" selected="">Sequential Read</option> <option value="randwrite">Random Write</option> <option value="write">Sequential Write</option> </select> </div> <div class="control-group"> <label for="metricType-ssd_xfs_iouring_64k">Metric</label> <select id="metricType-ssd_xfs_iouring_64k"> <option value="bandwidth" selected="">Bandwidth 
(GB/s)</option> <option value="iops">IOPS</option> <option value="latency">Latency (μs)</option> <option value="latency_p50">Latency p50 (μs)</option> <option value="latency_p90">Latency p90 (μs)</option> <option value="latency_p99">Latency p99 (μs)</option> </select> </div> </div> <div id="benchmark-plot-ssd_xfs_iouring_64k" class="plot-container lazy-load"></div> <!-- Draggable panel for latency data --> <div id="latencyPanel-ssd_xfs_iouring_64k" class="benchmark-draggable-panel"> <div id="panelHeader-ssd_xfs_iouring_64k" class="panel-header"> <h3 class="panel-title" id="panelTitle-ssd_xfs_iouring_64k">Latency Percentiles</h3> <div class="panel-controls"> <button id="collapseBtn-ssd_xfs_iouring_64k" class="collapse-btn" title="Collapse">▲</button> <button id="closeLatencyBtn-ssd_xfs_iouring_64k" class="close-btn" title="Close">×</button> </div> </div> <div class="panel-content"> <div class="latency-details" id="latencyDetails-ssd_xfs_iouring_64k"></div> <div id="latencyPlot-ssd_xfs_iouring_64k" class="latency-plot-container"></div> </div> </div> <div class="benchmark-note"> <p>Small block (64k) performance using SSD with XFS filesystem and IO_URING driver on older cluster.</p> </div> <script> document.addEventListener('DOMContentLoaded', function() { // Create intersection observer for lazy loading const observer = new IntersectionObserver((entries, observer) => { entries.forEach(entry => { if (entry.isIntersecting) { const container = entry.target.closest('.benchmark-container'); const id = container.id.replace('benchmark-container-', ''); const plotEl = document.getElementById('benchmark-plot-' + id); // Check if already loading or loaded to prevent duplicate requests if (!plotEl.dataset.loading) { plotEl.dataset.loading = 'true'; // Load the benchmark data and initialize the plot loadBenchmarkData(id); } // Stop observing once we've started loading observer.unobserve(entry.target); } }); }, { rootMargin: '200px 0px', // Load when within 200px of viewport 
threshold: 0.01 }); // Start observing the plot container const plotContainer = document.getElementById('benchmark-plot-ssd_xfs_iouring_64k'); if (plotContainer) { observer.observe(plotContainer); } }); // Function to load benchmark data function loadBenchmarkData(id) { // Ensure benchmark.js is loaded before initializing function waitForBenchmarkJs() { if (typeof initBenchmarkPlot === 'function') { // Function exists, proceed with initialization const container = document.getElementById('benchmark-container-' + id); const dataPath = container.getAttribute('data-path'); if (dataPath) { // Use fetch API to load JSON data only when needed fetch(dataPath) .then(response => { if (!response.ok) { throw new Error(`HTTP error! Status: ${response.status}`); } return response.json(); }) .then(data => { // Store data and initialize the plot window['benchmarkData_' + id] = data; initBenchmarkPlot(id); document.getElementById('benchmark-plot-' + id).classList.remove('lazy-load'); }) .catch(error => { console.error('Error loading benchmark data:', error); document.getElementById('benchmark-plot-' + id).innerHTML = '<div style="padding: 20px; color: red;">Error loading benchmark data. 
Check console for details.</div>'; }); } else { console.error('No data source provided for benchmark visualization'); document.getElementById('benchmark-plot-' + id).innerHTML = '<div style="padding: 20px; color: red;">Error: No data source provided for benchmark visualization.</div>'; } } else { // Function not available yet, wait and try again setTimeout(() => waitForBenchmarkJs(), 100); } } waitForBenchmarkJs(); } </script> </div> <!-- benchmark.html --> <div class="benchmark-container" id="benchmark-container-ssd_xfs_iouring_1m" data-path="/assets/images/posts/2025-03-13/fio/1m_ssd_xfs_iouring_xl170_1.json"> <h2>1M Block Size - SSD XFS with IO_URING (Older)</h2> <div class="controls"> <div class="control-group"> <label for="testType-ssd_xfs_iouring_1m">Test Type</label> <select id="testType-ssd_xfs_iouring_1m"> <option value="randread">Random Read</option> <option value="read" selected="">Sequential Read</option> <option value="randwrite">Random Write</option> <option value="write">Sequential Write</option> </select> </div> <div class="control-group"> <label for="metricType-ssd_xfs_iouring_1m">Metric</label> <select id="metricType-ssd_xfs_iouring_1m"> <option value="bandwidth" selected="">Bandwidth (GB/s)</option> <option value="iops">IOPS</option> <option value="latency">Latency (μs)</option> <option value="latency_p50">Latency p50 (μs)</option> <option value="latency_p90">Latency p90 (μs)</option> <option value="latency_p99">Latency p99 (μs)</option> </select> </div> </div> <div id="benchmark-plot-ssd_xfs_iouring_1m" class="plot-container lazy-load"></div> <!-- Draggable panel for latency data --> <div id="latencyPanel-ssd_xfs_iouring_1m" class="benchmark-draggable-panel"> <div id="panelHeader-ssd_xfs_iouring_1m" class="panel-header"> <h3 class="panel-title" id="panelTitle-ssd_xfs_iouring_1m">Latency Percentiles</h3> <div class="panel-controls"> <button id="collapseBtn-ssd_xfs_iouring_1m" class="collapse-btn" title="Collapse">▲</button> <button 
id="closeLatencyBtn-ssd_xfs_iouring_1m" class="close-btn" title="Close">×</button> </div> </div> <div class="panel-content"> <div class="latency-details" id="latencyDetails-ssd_xfs_iouring_1m"></div> <div id="latencyPlot-ssd_xfs_iouring_1m" class="latency-plot-container"></div> </div> </div> <div class="benchmark-note"> <p>Performance characteristics of SSD with XFS filesystem using IO_URING driver with 1M block size on older cluster.</p> </div> <script> document.addEventListener('DOMContentLoaded', function() { // Create intersection observer for lazy loading const observer = new IntersectionObserver((entries, observer) => { entries.forEach(entry => { if (entry.isIntersecting) { const container = entry.target.closest('.benchmark-container'); const id = container.id.replace('benchmark-container-', ''); const plotEl = document.getElementById('benchmark-plot-' + id); // Check if already loading or loaded to prevent duplicate requests if (!plotEl.dataset.loading) { plotEl.dataset.loading = 'true'; // Load the benchmark data and initialize the plot loadBenchmarkData(id); } // Stop observing once we've started loading observer.unobserve(entry.target); } }); }, { rootMargin: '200px 0px', // Load when within 200px of viewport threshold: 0.01 }); // Start observing the plot container const plotContainer = document.getElementById('benchmark-plot-ssd_xfs_iouring_1m'); if (plotContainer) { observer.observe(plotContainer); } }); // Function to load benchmark data function loadBenchmarkData(id) { // Ensure benchmark.js is loaded before initializing function waitForBenchmarkJs() { if (typeof initBenchmarkPlot === 'function') { // Function exists, proceed with initialization const container = document.getElementById('benchmark-container-' + id); const dataPath = container.getAttribute('data-path'); if (dataPath) { // Use fetch API to load JSON data only when needed fetch(dataPath) .then(response => { if (!response.ok) { throw new Error(`HTTP error! 
Status: ${response.status}`); } return response.json(); }) .then(data => { // Store data and initialize the plot window['benchmarkData_' + id] = data; initBenchmarkPlot(id); document.getElementById('benchmark-plot-' + id).classList.remove('lazy-load'); }) .catch(error => { console.error('Error loading benchmark data:', error); document.getElementById('benchmark-plot-' + id).innerHTML = '<div style="padding: 20px; color: red;">Error loading benchmark data. Check console for details.</div>'; }); } else { console.error('No data source provided for benchmark visualization'); document.getElementById('benchmark-plot-' + id).innerHTML = '<div style="padding: 20px; color: red;">Error: No data source provided for benchmark visualization.</div>'; } } else { // Function not available yet, wait and try again setTimeout(() => waitForBenchmarkJs(), 100); } } waitForBenchmarkJs(); } </script> </div> <!-- benchmark.html --> <div class="benchmark-container" id="benchmark-container-ssd_xfs_iouring_4m" data-path="/assets/images/posts/2025-03-13/fio/4m_ssd_xfs_iouring_xl170_1.json"> <h2>4m Block Size - SSD XFS with IO_URING (Older)</h2> <div class="controls"> <div class="control-group"> <label for="testType-ssd_xfs_iouring_4m">Test Type</label> <select id="testType-ssd_xfs_iouring_4m"> <option value="randread">Random Read</option> <option value="read" selected="">Sequential Read</option> <option value="randwrite">Random Write</option> <option value="write">Sequential Write</option> </select> </div> <div class="control-group"> <label for="metricType-ssd_xfs_iouring_4m">Metric</label> <select id="metricType-ssd_xfs_iouring_4m"> <option value="bandwidth" selected="">Bandwidth (GB/s)</option> <option value="iops">IOPS</option> <option value="latency">Latency (μs)</option> <option value="latency_p50">Latency p50 (μs)</option> <option value="latency_p90">Latency p90 (μs)</option> <option value="latency_p99">Latency p99 (μs)</option> </select> </div> </div> <div 
id="benchmark-plot-ssd_xfs_iouring_4m" class="plot-container lazy-load"></div> <!-- Draggable panel for latency data --> <div id="latencyPanel-ssd_xfs_iouring_4m" class="benchmark-draggable-panel"> <div id="panelHeader-ssd_xfs_iouring_4m" class="panel-header"> <h3 class="panel-title" id="panelTitle-ssd_xfs_iouring_4m">Latency Percentiles</h3> <div class="panel-controls"> <button id="collapseBtn-ssd_xfs_iouring_4m" class="collapse-btn" title="Collapse">▲</button> <button id="closeLatencyBtn-ssd_xfs_iouring_4m" class="close-btn" title="Close">×</button> </div> </div> <div class="panel-content"> <div class="latency-details" id="latencyDetails-ssd_xfs_iouring_4m"></div> <div id="latencyPlot-ssd_xfs_iouring_4m" class="latency-plot-container"></div> </div> </div> <div class="benchmark-note"> <p>Performance characteristics of SSD with XFS filesystem using IO_URING driver with 4m block size on older cluster.</p> </div> <script> document.addEventListener('DOMContentLoaded', function() { // Create intersection observer for lazy loading const observer = new IntersectionObserver((entries, observer) => { entries.forEach(entry => { if (entry.isIntersecting) { const container = entry.target.closest('.benchmark-container'); const id = container.id.replace('benchmark-container-', ''); const plotEl = document.getElementById('benchmark-plot-' + id); // Check if already loading or loaded to prevent duplicate requests if (!plotEl.dataset.loading) { plotEl.dataset.loading = 'true'; // Load the benchmark data and initialize the plot loadBenchmarkData(id); } // Stop observing once we've started loading observer.unobserve(entry.target); } }); }, { rootMargin: '200px 0px', // Load when within 200px of viewport threshold: 0.01 }); // Start observing the plot container const plotContainer = document.getElementById('benchmark-plot-ssd_xfs_iouring_4m'); if (plotContainer) { observer.observe(plotContainer); } }); // Function to load benchmark data function loadBenchmarkData(id) { // Ensure 
benchmark.js is loaded before initializing function waitForBenchmarkJs() { if (typeof initBenchmarkPlot === 'function') { // Function exists, proceed with initialization const container = document.getElementById('benchmark-container-' + id); const dataPath = container.getAttribute('data-path'); if (dataPath) { // Use fetch API to load JSON data only when needed fetch(dataPath) .then(response => { if (!response.ok) { throw new Error(`HTTP error! Status: ${response.status}`); } return response.json(); }) .then(data => { // Store data and initialize the plot window['benchmarkData_' + id] = data; initBenchmarkPlot(id); document.getElementById('benchmark-plot-' + id).classList.remove('lazy-load'); }) .catch(error => { console.error('Error loading benchmark data:', error); document.getElementById('benchmark-plot-' + id).innerHTML = '<div style="padding: 20px; color: red;">Error loading benchmark data. Check console for details.</div>'; }); } else { console.error('No data source provided for benchmark visualization'); document.getElementById('benchmark-plot-' + id).innerHTML = '<div style="padding: 20px; color: red;">Error: No data source provided for benchmark visualization.</div>'; } } else { // Function not available yet, wait and try again setTimeout(() => waitForBenchmarkJs(), 100); } } waitForBenchmarkJs(); } </script> </div> <h3 id="storage-performance-analysis-for-local-ssd">Storage Performance Analysis for Local SSD</h3> <p>Let’s examine how performance changes across different block sizes by looking at a specific configuration point: various IO depths at 1 job<span class="sidenote-ref"></span><span class="sidenote">Why 1 job? This removes one variable from our analysis, allowing us to focus on how IO depth affects performance. 
We’ll explore job scaling separately</span>.</p> <div class="image-container" style="max-width: 100%; overflow: visible;"> <img loading="lazy" src="/assets/images/posts/2025-03-13/part3/throughput_versus_latency_explain.svg" style="width: 120%; margin-left: calc((100% - 120%) / 2);" alt="" /> <div class="caption"> <em>Throughput versus latency graph for random reads </em> </div> </div> <p>This graph reveals the classic throughput versus latency tradeoff for our SATA SSD<span class="sidenote-ref"></span><span class="sidenote">These plots are fundamental to understanding storage performance - they show exactly when a system hits diminishing returns</span>. The Y-axis shows throughput (higher is better), while the X-axis shows latency (lower is better). Each colored line represents a different block size, with dots marking increasing IO depths.</p> <p>First, let’s examine each axis independently:</p> <ul> <li>Y-axis (Throughput): 64K block sizes achieve the highest peak at 400 MB/s, while other sizes fall short: 4K reaches 250 MB/s, 1M hits 325 MB/s, and 4M peaks at 350 MB/s</li> <li>X-axis (Latency): Large block sizes (1M and 4M) show dramatically higher latency (80ms+) compared to smaller block sizes (4K and 64K)</li> </ul> <p>The cool thing about throughput versus latency graphs is that there’s a knee point – where throughput stops increasing but latency continues climbing<span class="sidenote-ref"></span><span class="sidenote">Certain systems even decrease throughput after this point as they may need to do additional work to manage work items</span>.
For 64K blocks, this occurs around IO depth 16-32, where we achieve ~400 MB/s at &lt; 10ms.</p> <div class="image-container" style="max-width: 100%; overflow: visible;"> <img loading="lazy" src="/assets/images/posts/2025-03-13/part3/knee_point.svg" style="width: 120%; margin-left: calc((100% - 120%) / 2);" alt="" /> <div class="caption"> <em>Knee point for throughput versus latency graph </em> </div> </div> <details style="font-size: 1.2em; margin: 1em 0;"> <summary style="cursor: pointer; font-weight: bold;">Expand to view throughput versus latency graphs for other workloads</summary> <div class="image-container" style="max-width: 100%; overflow: visible;"> <img loading="lazy" src="/assets/images/posts/2025-03-13/part3/fio_ssd/read_throughput_vs_latency_all_depths_1jobs.png" style="width: 105%; margin-left: calc((100% - 105%) / 2);" alt="" /> <div class="caption"> <em>Throughput versus latency graph for sequential reads </em> </div> </div> <div class="image-container" style="max-width: 100%; overflow: visible;"> <img loading="lazy" src="/assets/images/posts/2025-03-13/part3/fio_ssd/randwrite_throughput_vs_latency_all_depths_1jobs.png" style="width: 105%; margin-left: calc((100% - 105%) / 2);" alt="" /> <div class="caption"> <em>Throughput versus latency graph for random writes </em> </div> </div> <div class="image-container" style="max-width: 100%; overflow: visible;"> <img loading="lazy" src="/assets/images/posts/2025-03-13/part3/fio_ssd/write_throughput_vs_latency_all_depths_1jobs.png" style="width: 105%; margin-left: calc((100% - 105%) / 2);" alt="" /> <div class="caption"> <em>Throughput versus latency graph for sequential writes </em> </div> </div> </details> <details style="font-size: 1.2em; margin: 1em 0;"> <summary style="cursor: pointer; font-weight: bold;">Expand to view throughput versus latency graphs scaling num jobs for random reads</summary> <div class="image-container" style="max-width: 100%; overflow: visible;"> <img loading="lazy" 
src="/assets/images/posts/2025-03-13/part3/fio_ssd/randread_throughput_vs_latency_all_depths_1jobs.png" style="width: 105%; margin-left: calc((100% - 105%) / 2);" alt="" /> <div class="caption"> <em>Throughput versus latency graph for random reads for 1 numjobs </em> </div> </div> <div class="image-container" style="max-width: 100%; overflow: visible;"> <img loading="lazy" src="/assets/images/posts/2025-03-13/part3/fio_ssd/randread_throughput_vs_latency_all_depths_2jobs.png" style="width: 105%; margin-left: calc((100% - 105%) / 2);" alt="" /> <div class="caption"> <em>Throughput versus latency graph for random reads for 2 numjobs </em> </div> </div> <div class="image-container" style="max-width: 100%; overflow: visible;"> <img loading="lazy" src="/assets/images/posts/2025-03-13/part3/fio_ssd/randread_throughput_vs_latency_all_depths_4jobs.png" style="width: 105%; margin-left: calc((100% - 105%) / 2);" alt="" /> <div class="caption"> <em>Throughput versus latency graph for random reads for 4 numjobs </em> </div> </div> <div class="image-container" style="max-width: 100%; overflow: visible;"> <img loading="lazy" src="/assets/images/posts/2025-03-13/part3/fio_ssd/randread_throughput_vs_latency_all_depths_8jobs.png" style="width: 105%; margin-left: calc((100% - 105%) / 2);" alt="" /> <div class="caption"> <em>Throughput versus latency graph for random reads for 8 numjobs </em> </div> </div> </details> <p>These measurements reveal something frustrating but also quite interesting: there’s no universal sweet spot. What works best depends entirely on whether you care more about latency or throughput, which in turn depends on what your workload looks like.</p> <p>A couple of interesting things to observe:</p> <ul> <li>Latency increases by different amounts as block size increases</li> <li>Latency doubles as numjobs increases</li> <li>There’s no single block size that’s optimal for bandwidth across workloads. For random reads, it’s 64k.
For sequential reads, it’s 4k.</li> <li>For the lowest latency, use a smaller block size, but the SSD most likely won’t fully saturate its bandwidth.</li> <li>Writes have different knee points than reads (for example, the 4k sequential-write knee point caps at 150 MB/s while 4k sequential reads cap at 300 MB/s)</li> </ul> <p>With these patterns established, let’s examine the NVMe fio benchmarks to see whether these observations hold true or if new patterns emerge.</p> <details style="font-size: 1.2em; margin: 1em 0;"> <summary style="cursor: pointer; font-weight: bold;">Double checking if libaio makes any difference</summary> <p>The performance shown in the graphs above represents io_uring. Are there any differences with another async I/O library (libaio)?</p> <link rel="stylesheet" href="/assets/css/fancy_table.css" /> <script> function toggleCalculations(tableId) { // ... (toggleCalculations function remains the same) const table = document.getElementById(tableId); const cells = table.querySelectorAll('.has-calculation'); const tableWrapper = document.getElementById(tableId + '-wrapper'); const toggleSwitch = document.getElementById(tableId + '-switch'); const toggleText = document.getElementById(tableId + '-toggle-text'); if (tableWrapper.classList.contains('show-calculations')) { tableWrapper.classList.remove('show-calculations'); toggleSwitch.classList.remove('active'); toggleText.textContent = "Show calculations"; cells.forEach(cell => { const normalText = cell.querySelector('.normal-text'); const calculationText = cell.querySelector('.calculation-text'); normalText.style.display = 'inline'; calculationText.style.display = 'none'; }); } else { tableWrapper.classList.add('show-calculations'); toggleSwitch.classList.add('active'); toggleText.textContent = "Hide calculations"; cells.forEach(cell => { const normalText = cell.querySelector('.normal-text'); const calculationText = cell.querySelector('.calculation-text'); normalText.style.display = 'none';
calculationText.style.display = 'inline'; }); } } function isConsideredYellow(colorString) { if (!colorString) return false; const lowerColor = colorString.toLowerCase(); const yellowKeywords = ['yellow', 'gold', 'lemon', 'chiffon', 'goldenrod', 'papayawhip', 'moccasin', 'khaki', '#ff0', '#ffff00', '#ffffe0', '#fffacd', '#fafad2', '#fff8dc', '#eee8aa', '#f0e68c']; if (yellowKeywords.some(k => lowerColor.includes(k))) { return true; } const match = lowerColor.match(/rgba?\((\d+),\s*(\d+),\s*(\d+)/); if (match) { const r = parseInt(match[1]); const g = parseInt(match[2]); const b = parseInt(match[3]); if (r > 200 && g > 180 && b < 200 && Math.abs(r - g) < 70) { return true; } } if (lowerColor === '#ffdab9') return true; // PeachPuff if (lowerColor === 'rgba(255,255,0,0.2)') return true; // Example: semi-transparent yellow return false; } function initializeCellHighlighters() { const highlighters = document.querySelectorAll('[data-highlight-cells]'); highlighters.forEach(highlighter => { if (highlighter.dataset.highlighterInitialized === 'true') return; highlighter.dataset.highlighterInitialized = 'true'; const cellIdsToHighlight = highlighter.dataset.highlightCells.split(',').map(id => id.trim()); const cellElements = cellIdsToHighlight.map(id => document.getElementById(id)).filter(el => el); let descriptiveSpans = []; if (highlighter.matches('span[data-hover-text-color]')) { descriptiveSpans.push(highlighter); } else { descriptiveSpans = Array.from(highlighter.querySelectorAll('span[data-hover-text-color]')); } descriptiveSpans.forEach(span => { if (typeof span.dataset.originalParaTextColor === 'undefined') { span.dataset.originalParaTextColor = window.getComputedStyle(span).color || ''; } }); highlighter.addEventListener('mouseenter', () => { highlighter.classList.add('trigger-text-active'); descriptiveSpans.forEach(span => { if (span.dataset.hoverTextColor) { span.style.color = span.dataset.hoverTextColor; } }); cellElements.forEach((cell, idx) => { 
cell.classList.add('cell-highlighted'); let hoverTextColorForCell = null; let hoverCellBgFromDescSpan = null; if (idx < descriptiveSpans.length) { const descSpan = descriptiveSpans[idx]; if (descSpan.dataset.hoverTextColor) { hoverTextColorForCell = descSpan.dataset.hoverTextColor; } if (descSpan.dataset.hoverCellBg) { hoverCellBgFromDescSpan = descSpan.dataset.hoverCellBg; } } else if (descriptiveSpans.length === 1 && cellElements.length > 1) { const descSpan = descriptiveSpans[0]; if (descSpan.dataset.hoverTextColor) { hoverTextColorForCell = descSpan.dataset.hoverTextColor; } if (descSpan.dataset.hoverCellBg) { hoverCellBgFromDescSpan = descSpan.dataset.hoverCellBg; } } if (hoverTextColorForCell) { const isCalcCell = cell.classList.contains('has-calculation'); if (isCalcCell) { const normalTextSpan = cell.querySelector('.normal-text'); const calcResultSpan = cell.querySelector('.calc-result'); const calcWrapper = cell.querySelector('.calculation-text'); if (calcWrapper && window.getComputedStyle(calcWrapper).display !== 'none') { if (calcResultSpan) calcResultSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } else { if (normalTextSpan) normalTextSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } } else { const directSpan = cell.querySelector(':scope > span'); if (directSpan) directSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } } if (hoverCellBgFromDescSpan && isConsideredYellow(hoverCellBgFromDescSpan)) { if (typeof cell.dataset.originalBgColor === 'undefined') { cell.dataset.originalBgColor = cell.style.backgroundColor; } cell.style.backgroundColor = hoverCellBgFromDescSpan; cell.dataset.bgChangedByScript = 'true'; } }); }); highlighter.addEventListener('mouseleave', () => { highlighter.classList.remove('trigger-text-active'); descriptiveSpans.forEach(span => { if (typeof span.dataset.originalParaTextColor !== 'undefined') { span.style.color = span.dataset.originalParaTextColor; } }); 
cellElements.forEach((cell, idx) => { cell.classList.remove('cell-highlighted'); const isCalcCell = cell.classList.contains('has-calculation'); if (isCalcCell) { const normalTextSpan = cell.querySelector('.normal-text'); const calcResultSpan = cell.querySelector('.calc-result'); if (normalTextSpan) normalTextSpan.style.removeProperty('color'); if (calcResultSpan) calcResultSpan.style.removeProperty('color'); } else { const directSpan = cell.querySelector(':scope > span'); if (directSpan) directSpan.style.removeProperty('color'); } if (cell.dataset.bgChangedByScript === 'true') { cell.style.backgroundColor = cell.dataset.originalBgColor || ''; delete cell.dataset.originalBgColor; delete cell.dataset.bgChangedByScript; } }); }); }); } if (document.readyState === 'loading') { document.addEventListener('DOMContentLoaded', initializeCellHighlighters); } else { initializeCellHighlighters(); } </script> <style> /* ... (Other styles remain the same) ... */ .toggle-container { display: flex; justify-content: flex-end; margin-bottom: 6px; } .calc-toggle { display: inline-flex; align-items: center; } .toggle-switch { position: relative; display: inline-block; width: 32px; height: 16px; background-color: #e5e7eb; /* gray-200 */ border-radius: 10px; transition: all 0.3s; cursor: pointer; } .toggle-switch::after { content: ''; position: absolute; width: 12px; height: 12px; border-radius: 50%; background-color: white; top: 2px; left: 2px; transition: all 0.3s cubic-bezier(0.175, 0.885, 0.32, 1.275); box-shadow: 0 1px 2px rgba(0, 0, 0, 0.1); } .toggle-switch.active { background-color: #8b5cf6; /* purple-500 */ } .toggle-switch.active::after { left: 18px; } .toggle-label { position: absolute; width: 1px; height: 1px; padding: 0; margin: -1px; overflow: hidden; clip: rect(0, 0, 0, 0); white-space: nowrap; border-width: 0; } .toggle-text { font-size: 0.75rem; color: #6b7280; /* gray-500 */ margin-right: 6px; } .calc-result { color: rgb(75, 85, 99); /* gray-700 */ } .calc-formula { 
color: #6b83a6; /* Subtle bluish gray */ font-family: ui-monospace, SFMono-Regular, Menlo, Monaco, Consolas, monospace; } .calc-equals { color: rgb(75, 85, 99); /* gray-700 */ display: inline-block; margin: 0 4px; font-weight: normal; } .cell-highlighted { font-weight: bold !important; } .trigger-text-active { background-color: #E5E7EB; padding: 1px 3px; border-radius: 3px; transition: background-color 0.15s ease-in-out; } .trigger-text-active span[data-hover-text-color] { padding: 0; background-color: transparent; } /* Add styles for no-scroll option */ .table-wrapper-no-scroll { overflow-x: visible !important; } .table-no-scroll td { white-space: normal !important; word-break: break-word; } </style> <div id="sata-ssd-performance-table-wrapper" class="px-4 rounded-lg __basic-table not-prose mt-2 mb-2 overflow-x-auto"> <table id="sata-ssd-performance-table" class="min-w-full divide-y divide-gray-200 font-sans basic-table-striped"> <thead class="bg-gray-50"> <tr> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Workload </th> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Configuration </th> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Bandwidth (MB/s) </th> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Avg Latency (ms) </th> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> P99 Latency (ms) </th> </tr> </thead> <tbody> <tr class="border-b border-gray-200"> <td id="sata-ssd-performance-table-row0-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Random Read</span> </td> <td id="sata-ssd-performance-table-row0-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">4K iouring</span> </td> <td id="sata-ssd-performance-table-row0-col2" class="px-4 py-2 
whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">242.6</span> </td> <td id="sata-ssd-performance-table-row0-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">2.01</span> </td> <td id="sata-ssd-performance-table-row0-col4" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">3.29</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="sata-ssd-performance-table-row1-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Random Read</span> </td> <td id="sata-ssd-performance-table-row1-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">1M iouring</span> </td> <td id="sata-ssd-performance-table-row1-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">329.8</span> </td> <td id="sata-ssd-performance-table-row1-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">377.50</span> </td> <td id="sata-ssd-performance-table-row1-col4" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">484.44</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="sata-ssd-performance-table-row2-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Random Read</span> </td> <td id="sata-ssd-performance-table-row2-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">4K libaio</span> </td> <td id="sata-ssd-performance-table-row2-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium 
text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">240.7</span> </td> <td id="sata-ssd-performance-table-row2-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">2.02</span> </td> <td id="sata-ssd-performance-table-row2-col4" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">3.32</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="sata-ssd-performance-table-row3-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Random Read</span> </td> <td id="sata-ssd-performance-table-row3-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">1M libaio</span> </td> <td id="sata-ssd-performance-table-row3-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">329.9</span> </td> <td id="sata-ssd-performance-table-row3-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">378.01</span> </td> <td id="sata-ssd-performance-table-row3-col4" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">488.64</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="sata-ssd-performance-table-row4-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Random Write</span> </td> <td id="sata-ssd-performance-table-row4-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">4K iouring</span> </td> <td id="sata-ssd-performance-table-row4-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: 
rgb(75, 85, 99) !important;">153.6</span> </td> <td id="sata-ssd-performance-table-row4-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">3.18</span> </td> <td id="sata-ssd-performance-table-row4-col4" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">5.47</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="sata-ssd-performance-table-row5-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Random Write</span> </td> <td id="sata-ssd-performance-table-row5-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">1M iouring</span> </td> <td id="sata-ssd-performance-table-row5-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">159.6</span> </td> <td id="sata-ssd-performance-table-row5-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">780.23</span> </td> <td id="sata-ssd-performance-table-row5-col4" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">977.27</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="sata-ssd-performance-table-row6-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Random Write</span> </td> <td id="sata-ssd-performance-table-row6-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">4K libaio</span> </td> <td id="sata-ssd-performance-table-row6-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) 
!important;">151.3</span> </td> <td id="sata-ssd-performance-table-row6-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">3.23</span> </td> <td id="sata-ssd-performance-table-row6-col4" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">5.55</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="sata-ssd-performance-table-row7-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Random Write</span> </td> <td id="sata-ssd-performance-table-row7-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">1M libaio</span> </td> <td id="sata-ssd-performance-table-row7-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">153.3</span> </td> <td id="sata-ssd-performance-table-row7-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">855.61</span> </td> <td id="sata-ssd-performance-table-row7-col4" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">935.33</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="sata-ssd-performance-table-row8-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Sequential Read</span> </td> <td id="sata-ssd-performance-table-row8-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">4K iouring</span> </td> <td id="sata-ssd-performance-table-row8-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">410.7</span> </td> <td 
id="sata-ssd-performance-table-row8-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">1.22</span> </td> <td id="sata-ssd-performance-table-row8-col4" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">1.97</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="sata-ssd-performance-table-row9-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Sequential Read</span> </td> <td id="sata-ssd-performance-table-row9-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">1M iouring</span> </td> <td id="sata-ssd-performance-table-row9-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">276.7</span> </td> <td id="sata-ssd-performance-table-row9-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">460.66</span> </td> <td id="sata-ssd-performance-table-row9-col4" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">488.64</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="sata-ssd-performance-table-row10-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Sequential Read</span> </td> <td id="sata-ssd-performance-table-row10-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">4K libaio</span> </td> <td id="sata-ssd-performance-table-row10-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">402.1</span> </td> <td 
id="sata-ssd-performance-table-row10-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">1.22</span> </td> <td id="sata-ssd-performance-table-row10-col4" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">2.01</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="sata-ssd-performance-table-row11-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Sequential Read</span> </td> <td id="sata-ssd-performance-table-row11-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">1M libaio</span> </td> <td id="sata-ssd-performance-table-row11-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">270.3</span> </td> <td id="sata-ssd-performance-table-row11-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">467.39</span> </td> <td id="sata-ssd-performance-table-row11-col4" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">497.03</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="sata-ssd-performance-table-row12-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Sequential Write</span> </td> <td id="sata-ssd-performance-table-row12-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">4K iouring</span> </td> <td id="sata-ssd-performance-table-row12-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">148.2</span> </td> <td 
id="sata-ssd-performance-table-row12-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">3.30</span> </td> <td id="sata-ssd-performance-table-row12-col4" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">5.47</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="sata-ssd-performance-table-row13-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Sequential Write</span> </td> <td id="sata-ssd-performance-table-row13-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">1M iouring</span> </td> <td id="sata-ssd-performance-table-row13-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">143.7</span> </td> <td id="sata-ssd-performance-table-row13-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">866.88</span> </td> <td id="sata-ssd-performance-table-row13-col4" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">935.33</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="sata-ssd-performance-table-row14-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Sequential Write</span> </td> <td id="sata-ssd-performance-table-row14-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">4K libaio</span> </td> <td id="sata-ssd-performance-table-row14-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">147.7</span> </td> <td 
id="sata-ssd-performance-table-row14-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">3.29</span> </td> <td id="sata-ssd-performance-table-row14-col4" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">5.44</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="sata-ssd-performance-table-row15-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Sequential Write</span> </td> <td id="sata-ssd-performance-table-row15-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">1M libaio</span> </td> <td id="sata-ssd-performance-table-row15-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">145.6</span> </td> <td id="sata-ssd-performance-table-row15-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">855.61</span> </td> <td id="sata-ssd-performance-table-row15-col4" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">960.50</span> </td> </tr> </tbody> </table> </div> <!-- benchmark.html --> <div class="benchmark-container" id="benchmark-container-ssd_xfs_libaio_4k" data-path="/assets/images/posts/2025-03-13/fio/4k_ssd_xfs_libaio_xl170_1.json"> <h2>4K Block Size - SSD XFS with LIBAIO (Older)</h2> <div class="controls"> <div class="control-group"> <label for="testType-ssd_xfs_libaio_4k">Test Type</label> <select id="testType-ssd_xfs_libaio_4k"> <option value="randread">Random Read</option> <option value="read" selected="">Sequential Read</option> <option value="randwrite">Random Write</option> <option value="write">Sequential Write</option> </select> </div> <div 
class="control-group"> <label for="metricType-ssd_xfs_libaio_4k">Metric</label> <select id="metricType-ssd_xfs_libaio_4k"> <option value="bandwidth" selected="">Bandwidth (GB/s)</option> <option value="iops">IOPS</option> <option value="latency">Latency (μs)</option> <option value="latency_p50">Latency p50 (μs)</option> <option value="latency_p90">Latency p90 (μs)</option> <option value="latency_p99">Latency p99 (μs)</option> </select> </div> </div> <div id="benchmark-plot-ssd_xfs_libaio_4k" class="plot-container lazy-load"></div> <!-- Draggable panel for latency data --> <div id="latencyPanel-ssd_xfs_libaio_4k" class="benchmark-draggable-panel"> <div id="panelHeader-ssd_xfs_libaio_4k" class="panel-header"> <h3 class="panel-title" id="panelTitle-ssd_xfs_libaio_4k">Latency Percentiles</h3> <div class="panel-controls"> <button id="collapseBtn-ssd_xfs_libaio_4k" class="collapse-btn" title="Collapse">▲</button> <button id="closeLatencyBtn-ssd_xfs_libaio_4k" class="close-btn" title="Close">×</button> </div> </div> <div class="panel-content"> <div class="latency-details" id="latencyDetails-ssd_xfs_libaio_4k"></div> <div id="latencyPlot-ssd_xfs_libaio_4k" class="latency-plot-container"></div> </div> </div> <div class="benchmark-note"> <p>Performance comparison of SSD with XFS filesystem using LIBAIO driver with 4k block size on older cluster.</p> </div> <script> document.addEventListener('DOMContentLoaded', function() { // Create intersection observer for lazy loading const observer = new IntersectionObserver((entries, observer) => { entries.forEach(entry => { if (entry.isIntersecting) { const container = entry.target.closest('.benchmark-container'); const id = container.id.replace('benchmark-container-', ''); const plotEl = document.getElementById('benchmark-plot-' + id); // Check if already loading or loaded to prevent duplicate requests if (!plotEl.dataset.loading) { plotEl.dataset.loading = 'true'; // Load the benchmark data and initialize the plot 
loadBenchmarkData(id); } // Stop observing once we've started loading observer.unobserve(entry.target); } }); }, { rootMargin: '200px 0px', // Load when within 200px of viewport threshold: 0.01 }); // Start observing the plot container const plotContainer = document.getElementById('benchmark-plot-ssd_xfs_libaio_4k'); if (plotContainer) { observer.observe(plotContainer); } }); // Function to load benchmark data function loadBenchmarkData(id) { // Ensure benchmark.js is loaded before initializing function waitForBenchmarkJs() { if (typeof initBenchmarkPlot === 'function') { // Function exists, proceed with initialization const container = document.getElementById('benchmark-container-' + id); const dataPath = container.getAttribute('data-path'); if (dataPath) { // Use fetch API to load JSON data only when needed fetch(dataPath) .then(response => { if (!response.ok) { throw new Error(`HTTP error! Status: ${response.status}`); } return response.json(); }) .then(data => { // Store data and initialize the plot window['benchmarkData_' + id] = data; initBenchmarkPlot(id); document.getElementById('benchmark-plot-' + id).classList.remove('lazy-load'); }) .catch(error => { console.error('Error loading benchmark data:', error); document.getElementById('benchmark-plot-' + id).innerHTML = '<div style="padding: 20px; color: red;">Error loading benchmark data. 
Check console for details.</div>'; }); } else { console.error('No data source provided for benchmark visualization'); document.getElementById('benchmark-plot-' + id).innerHTML = '<div style="padding: 20px; color: red;">Error: No data source provided for benchmark visualization.</div>'; } } else { // Function not available yet, wait and try again setTimeout(() => waitForBenchmarkJs(), 100); } } waitForBenchmarkJs(); } </script> </div> <!-- benchmark.html --> <div class="benchmark-container" id="benchmark-container-ssd_xfs_libaio_1m" data-path="/assets/images/posts/2025-03-13/fio/1m_ssd_xfs_libaio_xl170_1.json"> <h2>1M Block Size - SSD XFS with LIBAIO (Older)</h2> <div class="controls"> <div class="control-group"> <label for="testType-ssd_xfs_libaio_1m">Test Type</label> <select id="testType-ssd_xfs_libaio_1m"> <option value="randread">Random Read</option> <option value="read" selected="">Sequential Read</option> <option value="randwrite">Random Write</option> <option value="write">Sequential Write</option> </select> </div> <div class="control-group"> <label for="metricType-ssd_xfs_libaio_1m">Metric</label> <select id="metricType-ssd_xfs_libaio_1m"> <option value="bandwidth" selected="">Bandwidth (GB/s)</option> <option value="iops">IOPS</option> <option value="latency">Latency (μs)</option> <option value="latency_p50">Latency p50 (μs)</option> <option value="latency_p90">Latency p90 (μs)</option> <option value="latency_p99">Latency p99 (μs)</option> </select> </div> </div> <div id="benchmark-plot-ssd_xfs_libaio_1m" class="plot-container lazy-load"></div> <!-- Draggable panel for latency data --> <div id="latencyPanel-ssd_xfs_libaio_1m" class="benchmark-draggable-panel"> <div id="panelHeader-ssd_xfs_libaio_1m" class="panel-header"> <h3 class="panel-title" id="panelTitle-ssd_xfs_libaio_1m">Latency Percentiles</h3> <div class="panel-controls"> <button id="collapseBtn-ssd_xfs_libaio_1m" class="collapse-btn" title="Collapse">▲</button> <button 
id="closeLatencyBtn-ssd_xfs_libaio_1m" class="close-btn" title="Close">×</button> </div> </div> <div class="panel-content"> <div class="latency-details" id="latencyDetails-ssd_xfs_libaio_1m"></div> <div id="latencyPlot-ssd_xfs_libaio_1m" class="latency-plot-container"></div> </div> </div> <div class="benchmark-note"> <p>Performance comparison of SSD with XFS filesystem using LIBAIO driver with 1M block size on older cluster.</p> </div> <script> document.addEventListener('DOMContentLoaded', function() { // Create intersection observer for lazy loading const observer = new IntersectionObserver((entries, observer) => { entries.forEach(entry => { if (entry.isIntersecting) { const container = entry.target.closest('.benchmark-container'); const id = container.id.replace('benchmark-container-', ''); const plotEl = document.getElementById('benchmark-plot-' + id); // Check if already loading or loaded to prevent duplicate requests if (!plotEl.dataset.loading) { plotEl.dataset.loading = 'true'; // Load the benchmark data and initialize the plot loadBenchmarkData(id); } // Stop observing once we've started loading observer.unobserve(entry.target); } }); }, { rootMargin: '200px 0px', // Load when within 200px of viewport threshold: 0.01 }); // Start observing the plot container const plotContainer = document.getElementById('benchmark-plot-ssd_xfs_libaio_1m'); if (plotContainer) { observer.observe(plotContainer); } }); // Function to load benchmark data function loadBenchmarkData(id) { // Ensure benchmark.js is loaded before initializing function waitForBenchmarkJs() { if (typeof initBenchmarkPlot === 'function') { // Function exists, proceed with initialization const container = document.getElementById('benchmark-container-' + id); const dataPath = container.getAttribute('data-path'); if (dataPath) { // Use fetch API to load JSON data only when needed fetch(dataPath) .then(response => { if (!response.ok) { throw new Error(`HTTP error! 
Status: ${response.status}`); } return response.json(); }) .then(data => { // Store data and initialize the plot window['benchmarkData_' + id] = data; initBenchmarkPlot(id); document.getElementById('benchmark-plot-' + id).classList.remove('lazy-load'); }) .catch(error => { console.error('Error loading benchmark data:', error); document.getElementById('benchmark-plot-' + id).innerHTML = '<div style="padding: 20px; color: red;">Error loading benchmark data. Check console for details.</div>'; }); } else { console.error('No data source provided for benchmark visualization'); document.getElementById('benchmark-plot-' + id).innerHTML = '<div style="padding: 20px; color: red;">Error: No data source provided for benchmark visualization.</div>'; } } else { // Function not available yet, wait and try again setTimeout(() => waitForBenchmarkJs(), 100); } } waitForBenchmarkJs(); } </script> </div> <p>No sizable difference between the two I/O drivers.</p> </details> <h1 id="benchmarking-for-modern-cluster">Benchmarking for Modern Cluster</h1> <h2 id="local-fio-results-1">Local FIO results</h2> <h3 id="scaling-block-size-for-local-nvme">Scaling block size for local NVMe</h3> <p>Again, feel free to jump between the interactive graphs and the <a href="#storage-performance-analysis-for-local-nvme">performance analysis</a> to explore the patterns.</p> <!-- benchmark.html --> <div class="benchmark-container" id="benchmark-container-4k_nvme_xfs_iouring_r650" data-path="/assets/images/posts/2025-03-13/fio/4k_nvme_xfs_iouring_r650_1.json"> <h2>4k Block Size - NVME XFS with IO_URING (Modern)</h2> <div class="controls"> <div class="control-group"> <label for="testType-4k_nvme_xfs_iouring_r650">Test Type</label> <select id="testType-4k_nvme_xfs_iouring_r650"> <option value="randread">Random Read</option> <option value="read" selected="">Sequential Read</option> <option value="randwrite">Random Write</option> <option value="write">Sequential Write</option> </select> </div> <div class="control-group"> <label 
for="metricType-4k_nvme_xfs_iouring_r650">Metric</label> <select id="metricType-4k_nvme_xfs_iouring_r650"> <option value="bandwidth" selected="">Bandwidth (GB/s)</option> <option value="iops">IOPS</option> <option value="latency">Latency (μs)</option> <option value="latency_p50">Latency p50 (μs)</option> <option value="latency_p90">Latency p90 (μs)</option> <option value="latency_p99">Latency p99 (μs)</option> </select> </div> </div> <div id="benchmark-plot-4k_nvme_xfs_iouring_r650" class="plot-container lazy-load"></div> <!-- Draggable panel for latency data --> <div id="latencyPanel-4k_nvme_xfs_iouring_r650" class="benchmark-draggable-panel"> <div id="panelHeader-4k_nvme_xfs_iouring_r650" class="panel-header"> <h3 class="panel-title" id="panelTitle-4k_nvme_xfs_iouring_r650">Latency Percentiles</h3> <div class="panel-controls"> <button id="collapseBtn-4k_nvme_xfs_iouring_r650" class="collapse-btn" title="Collapse">▲</button> <button id="closeLatencyBtn-4k_nvme_xfs_iouring_r650" class="close-btn" title="Close">×</button> </div> </div> <div class="panel-content"> <div class="latency-details" id="latencyDetails-4k_nvme_xfs_iouring_r650"></div> <div id="latencyPlot-4k_nvme_xfs_iouring_r650" class="latency-plot-container"></div> </div> </div> <div class="benchmark-note"> <p>Performance of NVME with XFS filesystem using IO_URING driver on modern cluster with 4k block size.</p> </div> <script> document.addEventListener('DOMContentLoaded', function() { // Create intersection observer for lazy loading const observer = new IntersectionObserver((entries, observer) => { entries.forEach(entry => { if (entry.isIntersecting) { const container = entry.target.closest('.benchmark-container'); const id = container.id.replace('benchmark-container-', ''); const plotEl = document.getElementById('benchmark-plot-' + id); // Check if already loading or loaded to prevent duplicate requests if (!plotEl.dataset.loading) { plotEl.dataset.loading = 'true'; // Load the benchmark data and 
initialize the plot loadBenchmarkData(id); } // Stop observing once we've started loading observer.unobserve(entry.target); } }); }, { rootMargin: '200px 0px', // Load when within 200px of viewport threshold: 0.01 }); // Start observing the plot container const plotContainer = document.getElementById('benchmark-plot-4k_nvme_xfs_iouring_r650'); if (plotContainer) { observer.observe(plotContainer); } }); // Function to load benchmark data function loadBenchmarkData(id) { // Ensure benchmark.js is loaded before initializing function waitForBenchmarkJs() { if (typeof initBenchmarkPlot === 'function') { // Function exists, proceed with initialization const container = document.getElementById('benchmark-container-' + id); const dataPath = container.getAttribute('data-path'); if (dataPath) { // Use fetch API to load JSON data only when needed fetch(dataPath) .then(response => { if (!response.ok) { throw new Error(`HTTP error! Status: ${response.status}`); } return response.json(); }) .then(data => { // Store data and initialize the plot window['benchmarkData_' + id] = data; initBenchmarkPlot(id); document.getElementById('benchmark-plot-' + id).classList.remove('lazy-load'); }) .catch(error => { console.error('Error loading benchmark data:', error); document.getElementById('benchmark-plot-' + id).innerHTML = '<div style="padding: 20px; color: red;">Error loading benchmark data. 
Check console for details.</div>'; }); } else { console.error('No data source provided for benchmark visualization'); document.getElementById('benchmark-plot-' + id).innerHTML = '<div style="padding: 20px; color: red;">Error: No data source provided for benchmark visualization.</div>'; } } else { // Function not available yet, wait and try again setTimeout(() => waitForBenchmarkJs(), 100); } } waitForBenchmarkJs(); } </script> </div> <!-- benchmark.html --> <div class="benchmark-container" id="benchmark-container-64k_nvme_xfs_iouring_r650" data-path="/assets/images/posts/2025-03-13/fio/64k_nvme_xfs_iouring_r650_1.json"> <h2>64k Block Size - NVME XFS with IO_URING (Modern)</h2> <div class="controls"> <div class="control-group"> <label for="testType-64k_nvme_xfs_iouring_r650">Test Type</label> <select id="testType-64k_nvme_xfs_iouring_r650"> <option value="randread">Random Read</option> <option value="read" selected="">Sequential Read</option> <option value="randwrite">Random Write</option> <option value="write">Sequential Write</option> </select> </div> <div class="control-group"> <label for="metricType-64k_nvme_xfs_iouring_r650">Metric</label> <select id="metricType-64k_nvme_xfs_iouring_r650"> <option value="bandwidth" selected="">Bandwidth (GB/s)</option> <option value="iops">IOPS</option> <option value="latency">Latency (μs)</option> <option value="latency_p50">Latency p50 (μs)</option> <option value="latency_p90">Latency p90 (μs)</option> <option value="latency_p99">Latency p99 (μs)</option> </select> </div> </div> <div id="benchmark-plot-64k_nvme_xfs_iouring_r650" class="plot-container lazy-load"></div> <!-- Draggable panel for latency data --> <div id="latencyPanel-64k_nvme_xfs_iouring_r650" class="benchmark-draggable-panel"> <div id="panelHeader-64k_nvme_xfs_iouring_r650" class="panel-header"> <h3 class="panel-title" id="panelTitle-64k_nvme_xfs_iouring_r650">Latency Percentiles</h3> <div class="panel-controls"> <button 
id="collapseBtn-64k_nvme_xfs_iouring_r650" class="collapse-btn" title="Collapse">▲</button> <button id="closeLatencyBtn-64k_nvme_xfs_iouring_r650" class="close-btn" title="Close">×</button> </div> </div> <div class="panel-content"> <div class="latency-details" id="latencyDetails-64k_nvme_xfs_iouring_r650"></div> <div id="latencyPlot-64k_nvme_xfs_iouring_r650" class="latency-plot-container"></div> </div> </div> <div class="benchmark-note"> <p>Performance of NVME with XFS filesystem using IO_URING driver on modern cluster with 64k block size.</p> </div> <script> document.addEventListener('DOMContentLoaded', function() { // Create intersection observer for lazy loading const observer = new IntersectionObserver((entries, observer) => { entries.forEach(entry => { if (entry.isIntersecting) { const container = entry.target.closest('.benchmark-container'); const id = container.id.replace('benchmark-container-', ''); const plotEl = document.getElementById('benchmark-plot-' + id); // Check if already loading or loaded to prevent duplicate requests if (!plotEl.dataset.loading) { plotEl.dataset.loading = 'true'; // Load the benchmark data and initialize the plot loadBenchmarkData(id); } // Stop observing once we've started loading observer.unobserve(entry.target); } }); }, { rootMargin: '200px 0px', // Load when within 200px of viewport threshold: 0.01 }); // Start observing the plot container const plotContainer = document.getElementById('benchmark-plot-64k_nvme_xfs_iouring_r650'); if (plotContainer) { observer.observe(plotContainer); } }); // Function to load benchmark data function loadBenchmarkData(id) { // Ensure benchmark.js is loaded before initializing function waitForBenchmarkJs() { if (typeof initBenchmarkPlot === 'function') { // Function exists, proceed with initialization const container = document.getElementById('benchmark-container-' + id); const dataPath = container.getAttribute('data-path'); if (dataPath) { // Use fetch API to load JSON data only when needed 
fetch(dataPath) .then(response => { if (!response.ok) { throw new Error(`HTTP error! Status: ${response.status}`); } return response.json(); }) .then(data => { // Store data and initialize the plot window['benchmarkData_' + id] = data; initBenchmarkPlot(id); document.getElementById('benchmark-plot-' + id).classList.remove('lazy-load'); }) .catch(error => { console.error('Error loading benchmark data:', error); document.getElementById('benchmark-plot-' + id).innerHTML = '<div style="padding: 20px; color: red;">Error loading benchmark data. Check console for details.</div>'; }); } else { console.error('No data source provided for benchmark visualization'); document.getElementById('benchmark-plot-' + id).innerHTML = '<div style="padding: 20px; color: red;">Error: No data source provided for benchmark visualization.</div>'; } } else { // Function not available yet, wait and try again setTimeout(() => waitForBenchmarkJs(), 100); } } waitForBenchmarkJs(); } </script> </div> <!-- benchmark.html --> <div class="benchmark-container" id="benchmark-container-1m_nvme_xfs_iouring_r650" data-path="/assets/images/posts/2025-03-13/fio/1m_nvme_xfs_iouring_r650_1.json"> <h2>1M Block Size - NVME XFS with IO_URING (Modern)</h2> <div class="controls"> <div class="control-group"> <label for="testType-1m_nvme_xfs_iouring_r650">Test Type</label> <select id="testType-1m_nvme_xfs_iouring_r650"> <option value="randread">Random Read</option> <option value="read" selected="">Sequential Read</option> <option value="randwrite">Random Write</option> <option value="write">Sequential Write</option> </select> </div> <div class="control-group"> <label for="metricType-1m_nvme_xfs_iouring_r650">Metric</label> <select id="metricType-1m_nvme_xfs_iouring_r650"> <option value="bandwidth" selected="">Bandwidth (GB/s)</option> <option value="iops">IOPS</option> <option value="latency">Latency (μs)</option> <option value="latency_p50">Latency p50 (μs)</option> <option value="latency_p90">Latency p90 
(μs)</option> <option value="latency_p99">Latency p99 (μs)</option> </select> </div> </div> <div id="benchmark-plot-1m_nvme_xfs_iouring_r650" class="plot-container lazy-load"></div> <!-- Draggable panel for latency data --> <div id="latencyPanel-1m_nvme_xfs_iouring_r650" class="benchmark-draggable-panel"> <div id="panelHeader-1m_nvme_xfs_iouring_r650" class="panel-header"> <h3 class="panel-title" id="panelTitle-1m_nvme_xfs_iouring_r650">Latency Percentiles</h3> <div class="panel-controls"> <button id="collapseBtn-1m_nvme_xfs_iouring_r650" class="collapse-btn" title="Collapse">▲</button> <button id="closeLatencyBtn-1m_nvme_xfs_iouring_r650" class="close-btn" title="Close">×</button> </div> </div> <div class="panel-content"> <div class="latency-details" id="latencyDetails-1m_nvme_xfs_iouring_r650"></div> <div id="latencyPlot-1m_nvme_xfs_iouring_r650" class="latency-plot-container"></div> </div> </div> <div class="benchmark-note"> <p>Performance of NVME with XFS filesystem using IO_URING driver on modern cluster with 1M block size.</p> </div> <script> document.addEventListener('DOMContentLoaded', function() { // Create intersection observer for lazy loading const observer = new IntersectionObserver((entries, observer) => { entries.forEach(entry => { if (entry.isIntersecting) { const container = entry.target.closest('.benchmark-container'); const id = container.id.replace('benchmark-container-', ''); const plotEl = document.getElementById('benchmark-plot-' + id); // Check if already loading or loaded to prevent duplicate requests if (!plotEl.dataset.loading) { plotEl.dataset.loading = 'true'; // Load the benchmark data and initialize the plot loadBenchmarkData(id); } // Stop observing once we've started loading observer.unobserve(entry.target); } }); }, { rootMargin: '200px 0px', // Load when within 200px of viewport threshold: 0.01 }); // Start observing the plot container const plotContainer = document.getElementById('benchmark-plot-1m_nvme_xfs_iouring_r650'); if 
(plotContainer) { observer.observe(plotContainer); } }); // Function to load benchmark data function loadBenchmarkData(id) { // Ensure benchmark.js is loaded before initializing function waitForBenchmarkJs() { if (typeof initBenchmarkPlot === 'function') { // Function exists, proceed with initialization const container = document.getElementById('benchmark-container-' + id); const dataPath = container.getAttribute('data-path'); if (dataPath) { // Use fetch API to load JSON data only when needed fetch(dataPath) .then(response => { if (!response.ok) { throw new Error(`HTTP error! Status: ${response.status}`); } return response.json(); }) .then(data => { // Store data and initialize the plot window['benchmarkData_' + id] = data; initBenchmarkPlot(id); document.getElementById('benchmark-plot-' + id).classList.remove('lazy-load'); }) .catch(error => { console.error('Error loading benchmark data:', error); document.getElementById('benchmark-plot-' + id).innerHTML = '<div style="padding: 20px; color: red;">Error loading benchmark data. 
Check console for details.</div>'; }); } else { console.error('No data source provided for benchmark visualization'); document.getElementById('benchmark-plot-' + id).innerHTML = '<div style="padding: 20px; color: red;">Error: No data source provided for benchmark visualization.</div>'; } } else { // Function not available yet, wait and try again setTimeout(() => waitForBenchmarkJs(), 100); } } waitForBenchmarkJs(); } </script> </div> <h3 id="storage-performance-analysis-for-local-nvme">Storage Performance Analysis for local NVMe</h3> <p>Let’s examine how the NVMe drive performs compared to our SATA baseline:</p> <div class="image-container" style="max-width: 100%; overflow: visible;"> <img loading="lazy" src="/assets/images/posts/2025-03-13/part3/fio_nvme/randread_throughput_vs_latency_all_depths_1jobs.png" style="width: 110%; margin-left: calc((100% - 110%) / 2);" alt="" /> <div class="caption"> <em>Throughput versus latency graph for random reads on NVMe </em> </div> </div> <details style="font-size: 1.2em; margin: 1em 0;"> <summary style="cursor: pointer; font-weight: bold;">Expand to view SATA SSD comparison graph</summary> <div class="image-container" style="max-width: 100%; overflow: visible;"> <img loading="lazy" src="/assets/images/posts/2025-03-13/part3/fio_ssd/randread_throughput_vs_latency_all_depths_1jobs.png" style="width: 105%; margin-left: calc((100% - 105%) / 2);" alt="" /> <div class="caption"> <em>Throughput versus latency graph for random reads on SATA SSD </em> </div> </div> </details> <p>The NVMe improvement is dramatic:</p> <ul> <li><strong>Throughput:</strong> 4-10x higher depending on block size (1 GB/s vs 250 MB/s for 4K, 4 GB/s vs 400 MB/s for 64K)</li> <li><strong>Latency:</strong> Consistently lower, especially for large blocks<span class="sidenote-ref"></span><span class="sidenote">For 64K blocks: NVMe stays at ~1ms while SATA climbs to ~20ms - a 20x difference</span></li> </ul> <p>One interesting difference from SATA patterns:</p> <ul>
<li>64K and 1M blocks need higher IO depths to hit their knee points, suggesting NVMe controllers require more parallelism for peak performance<span class="sidenote-ref"></span><span class="sidenote">3FS may need to be configured with sufficient parallelism to extract maximum NVMe performance</span></li> </ul> <details style="font-size: 1.2em; margin: 1em 0;"> <summary style="cursor: pointer; font-weight: bold;">Expand to view throughput versus latency graphs for other workloads</summary> <div class="image-container" style="max-width: 100%; overflow: visible;"> <img loading="lazy" src="/assets/images/posts/2025-03-13/part3/fio_nvme/read_throughput_vs_latency_all_depths_1jobs.png" style="width: 105%; margin-left: calc((100% - 105%) / 2);" alt="" /> <div class="caption"> <em>Throughput versus latency graph for sequential reads </em> </div> </div> <div class="image-container" style="max-width: 100%; overflow: visible;"> <img loading="lazy" src="/assets/images/posts/2025-03-13/part3/fio_nvme/randwrite_throughput_vs_latency_all_depths_1jobs.png" style="width: 105%; margin-left: calc((100% - 105%) / 2);" alt="" /> <div class="caption"> <em>Throughput versus latency graph for random writes </em> </div> </div> <div class="image-container" style="max-width: 100%; overflow: visible;"> <img loading="lazy" src="/assets/images/posts/2025-03-13/part3/fio_nvme/write_throughput_vs_latency_all_depths_1jobs.png" style="width: 105%; margin-left: calc((100% - 105%) / 2);" alt="" /> <div class="caption"> <em>Throughput versus latency graph for sequential writes </em> </div> </div> </details> <details style="font-size: 1.2em; margin: 1em 0;"> <summary style="cursor: pointer; font-weight: bold;">Expand to view throughput versus latency graphs scaling num jobs for random reads</summary> <div class="image-container" style="max-width: 100%; overflow: visible;"> <img loading="lazy" src="/assets/images/posts/2025-03-13/part3/fio_nvme/randread_throughput_vs_latency_all_depths_1jobs.png" 
style="width: 105%; margin-left: calc((100% - 105%) / 2);" alt="" /> <div class="caption"> <em>Throughput versus latency graph for random reads for 1 numjobs </em> </div> </div> <div class="image-container" style="max-width: 100%; overflow: visible;"> <img loading="lazy" src="/assets/images/posts/2025-03-13/part3/fio_nvme/randread_throughput_vs_latency_all_depths_2jobs.png" style="width: 105%; margin-left: calc((100% - 105%) / 2);" alt="" /> <div class="caption"> <em>Throughput versus latency graph for random reads for 2 numjobs </em> </div> </div> <div class="image-container" style="max-width: 100%; overflow: visible;"> <img loading="lazy" src="/assets/images/posts/2025-03-13/part3/fio_nvme/randread_throughput_vs_latency_all_depths_4jobs.png" style="width: 105%; margin-left: calc((100% - 105%) / 2);" alt="" /> <div class="caption"> <em>Throughput versus latency graph for random reads for 4 numjobs </em> </div> </div> <div class="image-container" style="max-width: 100%; overflow: visible;"> <img loading="lazy" src="/assets/images/posts/2025-03-13/part3/fio_nvme/randread_throughput_vs_latency_all_depths_8jobs.png" style="width: 105%; margin-left: calc((100% - 105%) / 2);" alt="" /> <div class="caption"> <em>Throughput versus latency graph for random reads for 8 numjobs </em> </div> </div> </details> <p>Sequential reads follow similar patterns to random reads, maintaining a similar high throughput ceiling and low latency.</p> <p>Write performance reveals a different story. 
Both random and sequential writes drop to ~2 GB/s peak throughput, with knee points occurring at much lower IO depths for 64K and 1M blocks<span class="sidenote-ref"></span><span class="sidenote">This aligns with the vendor specification showing NVMe write performance (2.3 GB/s) is significantly lower than read performance (6.2 GB/s)</span>.</p> <p>The numjobs scaling patterns mirror what we observed with SATA SSDs: throughput increases with additional parallel jobs, but latency scales proportionally. Doubling jobs roughly doubles latency but provides less than 2x throughput improvement.</p> <h2 id="predicting-3fs-performance">Predicting 3FS Performance</h2> <p>Before diving into actual 3FS benchmarks, let’s make some predictions based on our hardware baseline measurements:</p> <p>For random/sequential reads, our theoretical ceiling is 18 GB/s, since there’s a replication factor of 3 and both random and sequential reads hit 6 GB/s.</p> <p>However, we’re bound by network bandwidth, which has a theoretical limit of 12.5 GB/s (realistically ~11.5 GB/s from our previous micro-benchmarks).</p> <p>Let’s now talk about latency in the best and worst case. We can pull the network and disk latency from the graphs we have, starting with reads.</p> <p>In the average case:</p> <ul> <li>The average network latency for 1MB of data is 91us</li> <li>The average disk latency for sequential/random reads for 1M block size (1 IO depth, 1 job) is 0.48ms</li> <li>So the latency we should expect is ~0.48ms</li> </ul> <p>In the worst case:</p> <ul> <li>The p99 network latency for 1MB of data is 282us</li> <li>The p99 disk latency for sequential/random reads for 1M block size (128 IO depth, 16 jobs) is 448ms/420ms</li> <li>So the latency we should expect is ~448ms</li> </ul> <p>What we can see is a ~1000x difference in latency between the average and worst case.
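</p>

<p>These estimates (and the write-side ones below) boil down to a few lines of arithmetic. Here is a hypothetical sketch, with the latency numbers read off the graphs in this post; the network term is dropped since disk latency dominates it:</p>

```python
REPLICATION = 3  # 3FS chains each write through all 3 replicas

def predict_ms(disk_ms, chained=False):
    # Disk latency (0.5-900ms here) dwarfs the ~0.1-0.3ms network
    # latency, so the estimate is just the disk term, multiplied
    # across the replication chain for writes.
    return disk_ms * (REPLICATION if chained else 1)

read_avg  = predict_ms(0.48)                  # 1M block, 1 IO depth, 1 job
read_p99  = predict_ms(448)                   # 1M block, 128 IO depth, 16 jobs
write_avg = predict_ms(0.46, chained=True)
write_p99 = predict_ms(892, chained=True)

print(f"reads:  {read_avg:.2f} ms avg vs {read_p99:.0f} ms p99")
print(f"writes: {write_avg:.2f} ms avg vs {write_p99 / 1000:.2f} s p99")
# reads:  0.48 ms avg vs 448 ms p99
# writes: 1.38 ms avg vs 2.68 s p99
```

<p>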
Another thing we can clearly see is that the latency is dominated by disk latency.</p> <p>Moving on to writes:</p> <p>Average case:</p> <ul> <li>The average network latency is 91us</li> <li>The average disk latency for writes for 1M block size (1 IO depth, 1 job) is 0.46ms</li> <li>So the combined latency is 0.46ms * 3 (chained) = 1.38ms</li> </ul> <p>P99 case:</p> <ul> <li>The p99 network latency is 187us</li> <li>The p99 disk latency for writes for 1M block size (128 IO depth, 16 jobs) is 892ms</li> <li>So the combined latency is 892ms * 3 (chained) = 2.68s</li> </ul> <p>Writes can be ~2000x slower in the worst case. This is due to the multiplicative factor of writes, since each write has to go through every node in the replication chain.</p> <p>With this in mind, let’s head into the benchmarks:</p> <h2 id="3fs">3FS</h2> <p>3FS is benchmarked using two different I/O interfaces: io_uring, the standard Linux asynchronous I/O interface, and USRBIO, a custom FIO engine that integrates directly with 3FS’s I/O queue management system.</p> <h3 id="io_uring">IO_URING</h3> <!-- benchmark.html --> <div class="benchmark-container" id="benchmark-container-1m_hf3fs_xfs_iouring_r650" data-path="/assets/images/posts/2025-03-13/fio/1m_hf3fs_xfs_iouring_r650_5.json"> <h2>1M Block Size - HF3FS XFS with IO_URING (Modern)</h2> <div class="controls"> <div class="control-group"> <label for="testType-1m_hf3fs_xfs_iouring_r650">Test Type</label> <select id="testType-1m_hf3fs_xfs_iouring_r650"> <option value="randread">Random Read</option> <option value="read" selected="">Sequential Read</option> <option value="randwrite">Random Write</option> <option value="write">Sequential Write</option> </select> </div> <div class="control-group"> <label for="metricType-1m_hf3fs_xfs_iouring_r650">Metric</label> <select id="metricType-1m_hf3fs_xfs_iouring_r650"> <option value="bandwidth" selected="">Bandwidth (GB/s)</option> <option value="iops">IOPS</option> <option value="latency">Latency (μs)</option> <option value="latency_p50">Latency p50 (μs)</option> <option value="latency_p90">Latency p90 (μs)</option> <option value="latency_p99">Latency p99 (μs)</option> </select> </div> </div> <div
id="benchmark-plot-1m_hf3fs_xfs_iouring_r650" class="plot-container lazy-load"></div> <!-- Draggable panel for latency data --> <div id="latencyPanel-1m_hf3fs_xfs_iouring_r650" class="benchmark-draggable-panel"> <div id="panelHeader-1m_hf3fs_xfs_iouring_r650" class="panel-header"> <h3 class="panel-title" id="panelTitle-1m_hf3fs_xfs_iouring_r650">Latency Percentiles</h3> <div class="panel-controls"> <button id="collapseBtn-1m_hf3fs_xfs_iouring_r650" class="collapse-btn" title="Collapse">▲</button> <button id="closeLatencyBtn-1m_hf3fs_xfs_iouring_r650" class="close-btn" title="Close">×</button> </div> </div> <div class="panel-content"> <div class="latency-details" id="latencyDetails-1m_hf3fs_xfs_iouring_r650"></div> <div id="latencyPlot-1m_hf3fs_xfs_iouring_r650" class="latency-plot-container"></div> </div> </div> <div class="benchmark-note"> <p>Performance of HF3FS with XFS filesystem using IO_URING driver on modern cluster with 1M block size.</p> </div> <script> document.addEventListener('DOMContentLoaded', function() { // Create intersection observer for lazy loading const observer = new IntersectionObserver((entries, observer) => { entries.forEach(entry => { if (entry.isIntersecting) { const container = entry.target.closest('.benchmark-container'); const id = container.id.replace('benchmark-container-', ''); const plotEl = document.getElementById('benchmark-plot-' + id); // Check if already loading or loaded to prevent duplicate requests if (!plotEl.dataset.loading) { plotEl.dataset.loading = 'true'; // Load the benchmark data and initialize the plot loadBenchmarkData(id); } // Stop observing once we've started loading observer.unobserve(entry.target); } }); }, { rootMargin: '200px 0px', // Load when within 200px of viewport threshold: 0.01 }); // Start observing the plot container const plotContainer = document.getElementById('benchmark-plot-1m_hf3fs_xfs_iouring_r650'); if (plotContainer) { observer.observe(plotContainer); } }); // Function to load benchmark 
data function loadBenchmarkData(id) { // Ensure benchmark.js is loaded before initializing function waitForBenchmarkJs() { if (typeof initBenchmarkPlot === 'function') { // Function exists, proceed with initialization const container = document.getElementById('benchmark-container-' + id); const dataPath = container.getAttribute('data-path'); if (dataPath) { // Use fetch API to load JSON data only when needed fetch(dataPath) .then(response => { if (!response.ok) { throw new Error(`HTTP error! Status: ${response.status}`); } return response.json(); }) .then(data => { // Store data and initialize the plot window['benchmarkData_' + id] = data; initBenchmarkPlot(id); document.getElementById('benchmark-plot-' + id).classList.remove('lazy-load'); }) .catch(error => { console.error('Error loading benchmark data:', error); document.getElementById('benchmark-plot-' + id).innerHTML = '<div style="padding: 20px; color: red;">Error loading benchmark data. Check console for details.</div>'; }); } else { console.error('No data source provided for benchmark visualization'); document.getElementById('benchmark-plot-' + id).innerHTML = '<div style="padding: 20px; color: red;">Error: No data source provided for benchmark visualization.</div>'; } } else { // Function not available yet, wait and try again setTimeout(() => waitForBenchmarkJs(), 100); } } waitForBenchmarkJs(); } </script> </div> <h3 id="usrbio">USRBIO</h3> <!-- benchmark.html --> <div class="benchmark-container" id="benchmark-container-1m_hf3fs_xfs_usrbio_r650" data-path="/assets/images/posts/2025-03-13/fio/1m_hf3fs_xfs_usrbio_r650_5.json"> <h2>1M Block Size - HF3FS XFS with USRBIO (Modern)</h2> <div class="controls"> <div class="control-group"> <label for="testType-1m_hf3fs_xfs_usrbio_r650">Test Type</label> <select id="testType-1m_hf3fs_xfs_usrbio_r650"> <option value="randread">Random Read</option> <option value="read" selected="">Sequential Read</option> <option value="randwrite">Random Write</option> <option 
value="write">Sequential Write</option> </select> </div> <div class="control-group"> <label for="metricType-1m_hf3fs_xfs_usrbio_r650">Metric</label> <select id="metricType-1m_hf3fs_xfs_usrbio_r650"> <option value="bandwidth" selected="">Bandwidth (GB/s)</option> <option value="iops">IOPS</option> <option value="latency">Latency (μs)</option> <option value="latency_p50">Latency p50 (μs)</option> <option value="latency_p90">Latency p90 (μs)</option> <option value="latency_p99">Latency p99 (μs)</option> </select> </div> </div> <div id="benchmark-plot-1m_hf3fs_xfs_usrbio_r650" class="plot-container lazy-load"></div> <!-- Draggable panel for latency data --> <div id="latencyPanel-1m_hf3fs_xfs_usrbio_r650" class="benchmark-draggable-panel"> <div id="panelHeader-1m_hf3fs_xfs_usrbio_r650" class="panel-header"> <h3 class="panel-title" id="panelTitle-1m_hf3fs_xfs_usrbio_r650">Latency Percentiles</h3> <div class="panel-controls"> <button id="collapseBtn-1m_hf3fs_xfs_usrbio_r650" class="collapse-btn" title="Collapse">▲</button> <button id="closeLatencyBtn-1m_hf3fs_xfs_usrbio_r650" class="close-btn" title="Close">×</button> </div> </div> <div class="panel-content"> <div class="latency-details" id="latencyDetails-1m_hf3fs_xfs_usrbio_r650"></div> <div id="latencyPlot-1m_hf3fs_xfs_usrbio_r650" class="latency-plot-container"></div> </div> </div> <div class="benchmark-note"> <p>Performance of HF3FS with XFS filesystem using USRBIO driver on modern cluster with 1M block size.</p> </div> <script> document.addEventListener('DOMContentLoaded', function() { // Create intersection observer for lazy loading const observer = new IntersectionObserver((entries, observer) => { entries.forEach(entry => { if (entry.isIntersecting) { const container = entry.target.closest('.benchmark-container'); const id = container.id.replace('benchmark-container-', ''); const plotEl = document.getElementById('benchmark-plot-' + id); // Check if already loading or loaded to prevent duplicate requests if 
(!plotEl.dataset.loading) { plotEl.dataset.loading = 'true'; // Load the benchmark data and initialize the plot loadBenchmarkData(id); } // Stop observing once we've started loading observer.unobserve(entry.target); } }); }, { rootMargin: '200px 0px', // Load when within 200px of viewport threshold: 0.01 }); // Start observing the plot container const plotContainer = document.getElementById('benchmark-plot-1m_hf3fs_xfs_usrbio_r650'); if (plotContainer) { observer.observe(plotContainer); } }); // Function to load benchmark data function loadBenchmarkData(id) { // Ensure benchmark.js is loaded before initializing function waitForBenchmarkJs() { if (typeof initBenchmarkPlot === 'function') { // Function exists, proceed with initialization const container = document.getElementById('benchmark-container-' + id); const dataPath = container.getAttribute('data-path'); if (dataPath) { // Use fetch API to load JSON data only when needed fetch(dataPath) .then(response => { if (!response.ok) { throw new Error(`HTTP error! Status: ${response.status}`); } return response.json(); }) .then(data => { // Store data and initialize the plot window['benchmarkData_' + id] = data; initBenchmarkPlot(id); document.getElementById('benchmark-plot-' + id).classList.remove('lazy-load'); }) .catch(error => { console.error('Error loading benchmark data:', error); document.getElementById('benchmark-plot-' + id).innerHTML = '<div style="padding: 20px; color: red;">Error loading benchmark data. 
Check console for details.</div>'; }); } else { console.error('No data source provided for benchmark visualization'); document.getElementById('benchmark-plot-' + id).innerHTML = '<div style="padding: 20px; color: red;">Error: No data source provided for benchmark visualization.</div>'; } } else { // Function not available yet, wait and try again setTimeout(() => waitForBenchmarkJs(), 100); } } waitForBenchmarkJs(); } </script> </div> <p>One thing to observe is that for io_uring, <code class="language-plaintext highlighter-rouge">io_depth</code> does not affect performance.</p> <p>Again, here’s the 2D graph. Note that <code class="language-plaintext highlighter-rouge">IO_URING</code> sits at the same spot.</p> <div class="image-container" style="max-width: 100%; overflow: visible;"> <img loading="lazy" src="/assets/images/posts/2025-03-13/part3/fio_hf3fs/randread_throughput_vs_latency_all_depths_1jobs.png" style="width: 110%; margin-left: calc((100% - 110%) / 2);" alt="" /> <div class="caption"> <em>Throughput versus latency graph for random reads on 3FS </em> </div> </div> <p>One interesting thing to observe is that <code class="language-plaintext highlighter-rouge">io_uring</code> has lower latency at the same throughput as <code class="language-plaintext highlighter-rouge">usrbio</code>.</p> <details style="font-size: 1.2em; margin: 1em 0;"> <summary style="cursor: pointer; font-weight: bold;">Expand to view throughput versus latency graphs for other workloads</summary> <div class="image-container" style="max-width: 100%; overflow: visible;"> <img loading="lazy" src="/assets/images/posts/2025-03-13/part3/fio_hf3fs/read_throughput_vs_latency_all_depths_1jobs.png" style="width: 105%; margin-left: calc((100% - 105%) / 2);" alt="" /> <div class="caption"> <em>Throughput versus latency graph for sequential reads </em> </div> </div> <div class="image-container" style="max-width: 100%; overflow: visible;"> <img loading="lazy"
src="/assets/images/posts/2025-03-13/part3/fio_hf3fs/randwrite_throughput_vs_latency_all_depths_1jobs.png" style="width: 105%; margin-left: calc((100% - 105%) / 2);" alt="" /> <div class="caption"> <em>Throughput versus latency graph for random writes </em> </div> </div> <div class="image-container" style="max-width: 100%; overflow: visible;"> <img loading="lazy" src="/assets/images/posts/2025-03-13/part3/fio_hf3fs/write_throughput_vs_latency_all_depths_1jobs.png" style="width: 105%; margin-left: calc((100% - 105%) / 2);" alt="" /> <div class="caption"> <em>Throughput versus latency graph for sequential writes </em> </div> </div> </details> <h2 id="does-the-performance-match-the-estimates">Does the performance match the estimates?</h2> <link rel="stylesheet" href="/assets/css/fancy_table.css" /> <script> function toggleCalculations(tableId) { // ... (toggleCalculations function remains the same) const table = document.getElementById(tableId); const cells = table.querySelectorAll('.has-calculation'); const tableWrapper = document.getElementById(tableId + '-wrapper'); const toggleSwitch = document.getElementById(tableId + '-switch'); const toggleText = document.getElementById(tableId + '-toggle-text'); if (tableWrapper.classList.contains('show-calculations')) { tableWrapper.classList.remove('show-calculations'); toggleSwitch.classList.remove('active'); toggleText.textContent = "Show calculations"; cells.forEach(cell => { const normalText = cell.querySelector('.normal-text'); const calculationText = cell.querySelector('.calculation-text'); normalText.style.display = 'inline'; calculationText.style.display = 'none'; }); } else { tableWrapper.classList.add('show-calculations'); toggleSwitch.classList.add('active'); toggleText.textContent = "Hide calculations"; cells.forEach(cell => { const normalText = cell.querySelector('.normal-text'); const calculationText = cell.querySelector('.calculation-text'); normalText.style.display = 'none'; calculationText.style.display = 
'inline'; }); } } function isConsideredYellow(colorString) { if (!colorString) return false; const lowerColor = colorString.toLowerCase(); const yellowKeywords = ['yellow', 'gold', 'lemon', 'chiffon', 'goldenrod', 'papayawhip', 'moccasin', 'khaki', '#ff0', '#ffff00', '#ffffe0', '#fffacd', '#fafad2', '#fff8dc', '#eee8aa', '#f0e68c']; if (yellowKeywords.some(k => lowerColor.includes(k))) { return true; } const match = lowerColor.match(/rgba?\((\d+),\s*(\d+),\s*(\d+)/); if (match) { const r = parseInt(match[1]); const g = parseInt(match[2]); const b = parseInt(match[3]); if (r > 200 && g > 180 && b < 200 && Math.abs(r - g) < 70) { return true; } } if (lowerColor === '#ffdab9') return true; // PeachPuff if (lowerColor === 'rgba(255,255,0,0.2)') return true; // Example: semi-transparent yellow return false; } function initializeCellHighlighters() { const highlighters = document.querySelectorAll('[data-highlight-cells]'); highlighters.forEach(highlighter => { if (highlighter.dataset.highlighterInitialized === 'true') return; highlighter.dataset.highlighterInitialized = 'true'; const cellIdsToHighlight = highlighter.dataset.highlightCells.split(',').map(id => id.trim()); const cellElements = cellIdsToHighlight.map(id => document.getElementById(id)).filter(el => el); let descriptiveSpans = []; if (highlighter.matches('span[data-hover-text-color]')) { descriptiveSpans.push(highlighter); } else { descriptiveSpans = Array.from(highlighter.querySelectorAll('span[data-hover-text-color]')); } descriptiveSpans.forEach(span => { if (typeof span.dataset.originalParaTextColor === 'undefined') { span.dataset.originalParaTextColor = window.getComputedStyle(span).color || ''; } }); highlighter.addEventListener('mouseenter', () => { highlighter.classList.add('trigger-text-active'); descriptiveSpans.forEach(span => { if (span.dataset.hoverTextColor) { span.style.color = span.dataset.hoverTextColor; } }); cellElements.forEach((cell, idx) => { cell.classList.add('cell-highlighted'); let 
hoverTextColorForCell = null; let hoverCellBgFromDescSpan = null; if (idx < descriptiveSpans.length) { const descSpan = descriptiveSpans[idx]; if (descSpan.dataset.hoverTextColor) { hoverTextColorForCell = descSpan.dataset.hoverTextColor; } if (descSpan.dataset.hoverCellBg) { hoverCellBgFromDescSpan = descSpan.dataset.hoverCellBg; } } else if (descriptiveSpans.length === 1 && cellElements.length > 1) { const descSpan = descriptiveSpans[0]; if (descSpan.dataset.hoverTextColor) { hoverTextColorForCell = descSpan.dataset.hoverTextColor; } if (descSpan.dataset.hoverCellBg) { hoverCellBgFromDescSpan = descSpan.dataset.hoverCellBg; } } if (hoverTextColorForCell) { const isCalcCell = cell.classList.contains('has-calculation'); if (isCalcCell) { const normalTextSpan = cell.querySelector('.normal-text'); const calcResultSpan = cell.querySelector('.calc-result'); const calcWrapper = cell.querySelector('.calculation-text'); if (calcWrapper && window.getComputedStyle(calcWrapper).display !== 'none') { if (calcResultSpan) calcResultSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } else { if (normalTextSpan) normalTextSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } } else { const directSpan = cell.querySelector(':scope > span'); if (directSpan) directSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } } if (hoverCellBgFromDescSpan && isConsideredYellow(hoverCellBgFromDescSpan)) { if (typeof cell.dataset.originalBgColor === 'undefined') { cell.dataset.originalBgColor = cell.style.backgroundColor; } cell.style.backgroundColor = hoverCellBgFromDescSpan; cell.dataset.bgChangedByScript = 'true'; } }); }); highlighter.addEventListener('mouseleave', () => { highlighter.classList.remove('trigger-text-active'); descriptiveSpans.forEach(span => { if (typeof span.dataset.originalParaTextColor !== 'undefined') { span.style.color = span.dataset.originalParaTextColor; } }); cellElements.forEach((cell, idx) => { 
cell.classList.remove('cell-highlighted'); const isCalcCell = cell.classList.contains('has-calculation'); if (isCalcCell) { const normalTextSpan = cell.querySelector('.normal-text'); const calcResultSpan = cell.querySelector('.calc-result'); if (normalTextSpan) normalTextSpan.style.removeProperty('color'); if (calcResultSpan) calcResultSpan.style.removeProperty('color'); } else { const directSpan = cell.querySelector(':scope > span'); if (directSpan) directSpan.style.removeProperty('color'); } if (cell.dataset.bgChangedByScript === 'true') { cell.style.backgroundColor = cell.dataset.originalBgColor || ''; delete cell.dataset.originalBgColor; delete cell.dataset.bgChangedByScript; } }); }); }); } if (document.readyState === 'loading') { document.addEventListener('DOMContentLoaded', initializeCellHighlighters); } else { initializeCellHighlighters(); } </script> <style> /* ... (Other styles remain the same) ... */ .toggle-container { display: flex; justify-content: flex-end; margin-bottom: 6px; } .calc-toggle { display: inline-flex; align-items: center; } .toggle-switch { position: relative; display: inline-block; width: 32px; height: 16px; background-color: #e5e7eb; /* gray-200 */ border-radius: 10px; transition: all 0.3s; cursor: pointer; } .toggle-switch::after { content: ''; position: absolute; width: 12px; height: 12px; border-radius: 50%; background-color: white; top: 2px; left: 2px; transition: all 0.3s cubic-bezier(0.175, 0.885, 0.32, 1.275); box-shadow: 0 1px 2px rgba(0, 0, 0, 0.1); } .toggle-switch.active { background-color: #8b5cf6; /* purple-500 */ } .toggle-switch.active::after { left: 18px; } .toggle-label { position: absolute; width: 1px; height: 1px; padding: 0; margin: -1px; overflow: hidden; clip: rect(0, 0, 0, 0); white-space: nowrap; border-width: 0; } .toggle-text { font-size: 0.75rem; color: #6b7280; /* gray-500 */ margin-right: 6px; } .calc-result { color: rgb(75, 85, 99); /* gray-700 */ } .calc-formula { color: #6b83a6; /* Subtle bluish gray */ 
font-family: ui-monospace, SFMono-Regular, Menlo, Monaco, Consolas, monospace; } .calc-equals { color: rgb(75, 85, 99); /* gray-700 */ display: inline-block; margin: 0 4px; font-weight: normal; } .cell-highlighted { font-weight: bold !important; } .trigger-text-active { background-color: #E5E7EB; padding: 1px 3px; border-radius: 3px; transition: background-color 0.15s ease-in-out; } .trigger-text-active span[data-hover-text-color] { padding: 0; background-color: transparent; } /* Add styles for no-scroll option */ .table-wrapper-no-scroll { overflow-x: visible !important; } .table-no-scroll td { white-space: normal !important; word-break: break-word; } </style> <div id="fancy-table-Metric,Predicted,Actual-wrapper" class="px-4 rounded-lg __basic-table not-prose mt-4 mb-4 overflow-x-auto"> <table id="fancy-table-Metric,Predicted,Actual" class="min-w-full divide-y divide-gray-200 font-sans basic-table-striped"> <thead class="bg-gray-50"> <tr> <th class="px-6 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Metric </th> <th class="px-6 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Predicted </th> <th class="px-6 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Actual </th> </tr> </thead> <tbody> <tr class="border-b border-gray-200"> <td id="fancy-table-Metric,Predicted,Actual-row0-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Read Latency (1MB)</span> </td> <td id="fancy-table-Metric,Predicted,Actual-row0-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">0.48ms</span> </td> <td id="fancy-table-Metric,Predicted,Actual-row0-col2" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">1.09ms (127% worse)</span> </td> </tr> <tr class="border-b border-gray-200"> <td 
id="fancy-table-Metric,Predicted,Actual-row1-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Read P99 Latency</span> </td> <td id="fancy-table-Metric,Predicted,Actual-row1-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">304ms</span> </td> <td id="fancy-table-Metric,Predicted,Actual-row1-col2" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">194ms (36% better)</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="fancy-table-Metric,Predicted,Actual-row2-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Read Bandwidth</span> </td> <td id="fancy-table-Metric,Predicted,Actual-row2-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">11.5 GB/s</span> </td> <td id="fancy-table-Metric,Predicted,Actual-row2-col2" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">10.3 GB/s (10% worse)</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="fancy-table-Metric,Predicted,Actual-row3-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Write Latency (1MB)</span> </td> <td id="fancy-table-Metric,Predicted,Actual-row3-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">1.38ms</span> </td> <td id="fancy-table-Metric,Predicted,Actual-row3-col2" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">2.55ms (85% worse)</span> </td> </tr> <tr class="border-b border-gray-200"> <td 
id="fancy-table-Metric,Predicted,Actual-row4-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Write P99 Latency</span> </td> <td id="fancy-table-Metric,Predicted,Actual-row4-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">0.89s</span> </td> <td id="fancy-table-Metric,Predicted,Actual-row4-col2" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">1.1s (24% worse)</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="fancy-table-Metric,Predicted,Actual-row5-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Write Bandwidth</span> </td> <td id="fancy-table-Metric,Predicted,Actual-row5-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">2.1 GB/s</span> </td> <td id="fancy-table-Metric,Predicted,Actual-row5-col2" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">1.8 GB/s (14% worse)</span> </td> </tr> </tbody> </table> </div> <p>The roughly 2x latency overhead for reads and writes may come from the software side of things<span class="sidenote-ref"></span><span class="sidenote">We’ll have to dig deeper later to see why</span>. One interesting result is that P99 latency is better than predicted for reads, because the network bandwidth caps throughput before storage hits its worst-case scenarios. 
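</p>

<p>As a sanity check, the percentage deltas in the table can be recomputed from the predicted/actual pairs. The values below are copied from the table above; this is just a worked check of the arithmetic, not part of the benchmark harness:</p>

```python
# Recompute the predicted-vs-actual deltas shown in the table above.
# Each delta is relative to the predicted value.
rows = [
    # (metric, predicted, actual, lower_is_better)
    ("Read Latency (1MB)",  0.48, 1.09, True),   # ms
    ("Read P99 Latency",    304,  194,  True),   # ms
    ("Read Bandwidth",      11.5, 10.3, False),  # GB/s
    ("Write Latency (1MB)", 1.38, 2.55, True),   # ms
]

for name, predicted, actual, lower_is_better in rows:
    pct = abs(actual - predicted) / predicted
    # For latency, exceeding the prediction is "worse";
    # for bandwidth, falling below it is "worse".
    got_worse = actual > predicted if lower_is_better else actual < predicted
    print(f"{name}: {pct:.0%} {'worse' if got_worse else 'better'}")
```

<p>The read and write latency rows come out to 127% and 85% worse and P99 to 36% better, matching the table.</p>

<p>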
What’s nice to see is that the bandwidth only decreases by 10-15%!</p> <h2 id="3fs-1">3FS</h2> <p>Now we examine how 3FS scales with block size and node count on the older cluster (SATA SSDs + 25 Gbps networking).</p> <h3 id="scaling-block-size-5-nodes">Scaling block size (5 nodes)</h3> <!-- benchmark.html --> <div class="benchmark-container" id="benchmark-container-hf3fs_xfs_usrbio_4k" data-path="/assets/images/posts/2025-03-13/fio/4k_hf3fs_xfs_usrbio_xl170_5.json"> <h2>4K Block Size - HF3FS XFS with USRBIO (Older-5-Nodes)</h2> <div class="controls"> <div class="control-group"> <label for="testType-hf3fs_xfs_usrbio_4k">Test Type</label> <select id="testType-hf3fs_xfs_usrbio_4k"> <option value="randread">Random Read</option> <option value="read" selected="">Sequential Read</option> <option value="randwrite">Random Write</option> <option value="write">Sequential Write</option> </select> </div> <div class="control-group"> <label for="metricType-hf3fs_xfs_usrbio_4k">Metric</label> <select id="metricType-hf3fs_xfs_usrbio_4k"> <option value="bandwidth" selected="">Bandwidth (GB/s)</option> <option value="iops">IOPS</option> <option value="latency">Latency (μs)</option> <option value="latency_p50">Latency p50 (μs)</option> <option value="latency_p90">Latency p90 (μs)</option> <option value="latency_p99">Latency p99 (μs)</option> </select> </div> </div> <div id="benchmark-plot-hf3fs_xfs_usrbio_4k" class="plot-container lazy-load"></div> <!-- Draggable panel for latency data --> <div id="latencyPanel-hf3fs_xfs_usrbio_4k" class="benchmark-draggable-panel"> <div id="panelHeader-hf3fs_xfs_usrbio_4k" class="panel-header"> <h3 class="panel-title" id="panelTitle-hf3fs_xfs_usrbio_4k">Latency Percentiles</h3> <div class="panel-controls"> <button id="collapseBtn-hf3fs_xfs_usrbio_4k" class="collapse-btn" title="Collapse">▲</button> <button id="closeLatencyBtn-hf3fs_xfs_usrbio_4k" class="close-btn" title="Close">×</button> </div> </div> <div class="panel-content"> <div 
class="latency-details" id="latencyDetails-hf3fs_xfs_usrbio_4k"></div> <div id="latencyPlot-hf3fs_xfs_usrbio_4k" class="latency-plot-container"></div> </div> </div> <div class="benchmark-note"> <p>Performance of HF3FS with XFS filesystem using USRBIO driver with 4K block size on older cluster.</p> </div> <script> document.addEventListener('DOMContentLoaded', function() { // Create intersection observer for lazy loading const observer = new IntersectionObserver((entries, observer) => { entries.forEach(entry => { if (entry.isIntersecting) { const container = entry.target.closest('.benchmark-container'); const id = container.id.replace('benchmark-container-', ''); const plotEl = document.getElementById('benchmark-plot-' + id); // Check if already loading or loaded to prevent duplicate requests if (!plotEl.dataset.loading) { plotEl.dataset.loading = 'true'; // Load the benchmark data and initialize the plot loadBenchmarkData(id); } // Stop observing once we've started loading observer.unobserve(entry.target); } }); }, { rootMargin: '200px 0px', // Load when within 200px of viewport threshold: 0.01 }); // Start observing the plot container const plotContainer = document.getElementById('benchmark-plot-hf3fs_xfs_usrbio_4k'); if (plotContainer) { observer.observe(plotContainer); } }); // Function to load benchmark data function loadBenchmarkData(id) { // Ensure benchmark.js is loaded before initializing function waitForBenchmarkJs() { if (typeof initBenchmarkPlot === 'function') { // Function exists, proceed with initialization const container = document.getElementById('benchmark-container-' + id); const dataPath = container.getAttribute('data-path'); if (dataPath) { // Use fetch API to load JSON data only when needed fetch(dataPath) .then(response => { if (!response.ok) { throw new Error(`HTTP error! 
Status: ${response.status}`); } return response.json(); }) .then(data => { // Store data and initialize the plot window['benchmarkData_' + id] = data; initBenchmarkPlot(id); document.getElementById('benchmark-plot-' + id).classList.remove('lazy-load'); }) .catch(error => { console.error('Error loading benchmark data:', error); document.getElementById('benchmark-plot-' + id).innerHTML = '<div style="padding: 20px; color: red;">Error loading benchmark data. Check console for details.</div>'; }); } else { console.error('No data source provided for benchmark visualization'); document.getElementById('benchmark-plot-' + id).innerHTML = '<div style="padding: 20px; color: red;">Error: No data source provided for benchmark visualization.</div>'; } } else { // Function not available yet, wait and try again setTimeout(() => waitForBenchmarkJs(), 100); } } waitForBenchmarkJs(); } </script> </div> <!-- benchmark.html --> <div class="benchmark-container" id="benchmark-container-hf3fs_xfs_usrbio_1m_xl170" data-path="/assets/images/posts/2025-03-13/fio/1m_hf3fs_ext4_usrbio_xl170_5.json"> <h2>1M Block Size - HF3FS XFS with USRBIO (Older-5-Nodes)</h2> <div class="controls"> <div class="control-group"> <label for="testType-hf3fs_xfs_usrbio_1m_xl170">Test Type</label> <select id="testType-hf3fs_xfs_usrbio_1m_xl170"> <option value="randread">Random Read</option> <option value="read" selected="">Sequential Read</option> <option value="randwrite">Random Write</option> <option value="write">Sequential Write</option> </select> </div> <div class="control-group"> <label for="metricType-hf3fs_xfs_usrbio_1m_xl170">Metric</label> <select id="metricType-hf3fs_xfs_usrbio_1m_xl170"> <option value="bandwidth" selected="">Bandwidth (GB/s)</option> <option value="iops">IOPS</option> <option value="latency">Latency (μs)</option> <option value="latency_p50">Latency p50 (μs)</option> <option value="latency_p90">Latency p90 (μs)</option> <option value="latency_p99">Latency p99 (μs)</option> </select> 
</div> </div> <div id="benchmark-plot-hf3fs_xfs_usrbio_1m_xl170" class="plot-container lazy-load"></div> <!-- Draggable panel for latency data --> <div id="latencyPanel-hf3fs_xfs_usrbio_1m_xl170" class="benchmark-draggable-panel"> <div id="panelHeader-hf3fs_xfs_usrbio_1m_xl170" class="panel-header"> <h3 class="panel-title" id="panelTitle-hf3fs_xfs_usrbio_1m_xl170">Latency Percentiles</h3> <div class="panel-controls"> <button id="collapseBtn-hf3fs_xfs_usrbio_1m_xl170" class="collapse-btn" title="Collapse">▲</button> <button id="closeLatencyBtn-hf3fs_xfs_usrbio_1m_xl170" class="close-btn" title="Close">×</button> </div> </div> <div class="panel-content"> <div class="latency-details" id="latencyDetails-hf3fs_xfs_usrbio_1m_xl170"></div> <div id="latencyPlot-hf3fs_xfs_usrbio_1m_xl170" class="latency-plot-container"></div> </div> </div> <div class="benchmark-note"> <p>Medium block (1M) performance using HF3FS with XFS filesystem and USRBIO driver on older cluster.</p> </div> <script> document.addEventListener('DOMContentLoaded', function() { // Create intersection observer for lazy loading const observer = new IntersectionObserver((entries, observer) => { entries.forEach(entry => { if (entry.isIntersecting) { const container = entry.target.closest('.benchmark-container'); const id = container.id.replace('benchmark-container-', ''); const plotEl = document.getElementById('benchmark-plot-' + id); // Check if already loading or loaded to prevent duplicate requests if (!plotEl.dataset.loading) { plotEl.dataset.loading = 'true'; // Load the benchmark data and initialize the plot loadBenchmarkData(id); } // Stop observing once we've started loading observer.unobserve(entry.target); } }); }, { rootMargin: '200px 0px', // Load when within 200px of viewport threshold: 0.01 }); // Start observing the plot container const plotContainer = document.getElementById('benchmark-plot-hf3fs_xfs_usrbio_1m_xl170'); if (plotContainer) { observer.observe(plotContainer); } }); // Function to 
load benchmark data function loadBenchmarkData(id) { // Ensure benchmark.js is loaded before initializing function waitForBenchmarkJs() { if (typeof initBenchmarkPlot === 'function') { // Function exists, proceed with initialization const container = document.getElementById('benchmark-container-' + id); const dataPath = container.getAttribute('data-path'); if (dataPath) { // Use fetch API to load JSON data only when needed fetch(dataPath) .then(response => { if (!response.ok) { throw new Error(`HTTP error! Status: ${response.status}`); } return response.json(); }) .then(data => { // Store data and initialize the plot window['benchmarkData_' + id] = data; initBenchmarkPlot(id); document.getElementById('benchmark-plot-' + id).classList.remove('lazy-load'); }) .catch(error => { console.error('Error loading benchmark data:', error); document.getElementById('benchmark-plot-' + id).innerHTML = '<div style="padding: 20px; color: red;">Error loading benchmark data. Check console for details.</div>'; }); } else { console.error('No data source provided for benchmark visualization'); document.getElementById('benchmark-plot-' + id).innerHTML = '<div style="padding: 20px; color: red;">Error: No data source provided for benchmark visualization.</div>'; } } else { // Function not available yet, wait and try again setTimeout(() => waitForBenchmarkJs(), 100); } } waitForBenchmarkJs(); } </script> </div> <p>The 4K block size stays well below the 3.25 GB/s network limit, reaching only 1 GB/s with 4ms latency. 
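</p>

<p>A quick Little’s-Law sanity check shows why: at 4 ms per IO, sustaining 1 GB/s in 4K blocks needs on the order of a thousand IOs in flight. The bandwidth and latency figures below come from the text; the job/depth combination in the output is only an illustrative way to reach that concurrency:</p>

```python
# Back-of-envelope Little's Law: throughput = in-flight IOs / mean latency.
block_size = 4 * 1024          # 4K blocks, in bytes
bandwidth = 1e9                # ~1 GB/s observed ceiling for 4K
latency = 4e-3                 # ~4 ms per IO at that point

iops = bandwidth / block_size  # IOs per second needed for that bandwidth
inflight = iops * latency      # outstanding IOs required to sustain it
print(f"{iops:,.0f} IOPS -> ~{inflight:.0f} IOs in flight "
      f"(e.g. 8 jobs x 128 IO depth = 1024)")
```

<p>
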
The 1M block size hits the network bandwidth ceiling but pays a latency penalty (6ms at 1 IO depth with 8 jobs compared to 4K’s 4ms maximum)</p> <h3 id="scaling-nodes">Scaling nodes</h3> <!-- benchmark.html --> <div class="benchmark-container" id="benchmark-container-hf3fs_xfs_usrbio_1m_xl170_5" data-path="/assets/images/posts/2025-03-13/fio/1m_hf3fs_ext4_usrbio_xl170_5.json"> <h2>1M Block Size - HF3FS XFS with USRBIO (Older-5-Nodes)</h2> <div class="controls"> <div class="control-group"> <label for="testType-hf3fs_xfs_usrbio_1m_xl170_5">Test Type</label> <select id="testType-hf3fs_xfs_usrbio_1m_xl170_5"> <option value="randread">Random Read</option> <option value="read" selected="">Sequential Read</option> <option value="randwrite">Random Write</option> <option value="write">Sequential Write</option> </select> </div> <div class="control-group"> <label for="metricType-hf3fs_xfs_usrbio_1m_xl170_5">Metric</label> <select id="metricType-hf3fs_xfs_usrbio_1m_xl170_5"> <option value="bandwidth" selected="">Bandwidth (GB/s)</option> <option value="iops">IOPS</option> <option value="latency">Latency (μs)</option> <option value="latency_p50">Latency p50 (μs)</option> <option value="latency_p90">Latency p90 (μs)</option> <option value="latency_p99">Latency p99 (μs)</option> </select> </div> </div> <div id="benchmark-plot-hf3fs_xfs_usrbio_1m_xl170_5" class="plot-container lazy-load"></div> <!-- Draggable panel for latency data --> <div id="latencyPanel-hf3fs_xfs_usrbio_1m_xl170_5" class="benchmark-draggable-panel"> <div id="panelHeader-hf3fs_xfs_usrbio_1m_xl170_5" class="panel-header"> <h3 class="panel-title" id="panelTitle-hf3fs_xfs_usrbio_1m_xl170_5">Latency Percentiles</h3> <div class="panel-controls"> <button id="collapseBtn-hf3fs_xfs_usrbio_1m_xl170_5" class="collapse-btn" title="Collapse">▲</button> <button id="closeLatencyBtn-hf3fs_xfs_usrbio_1m_xl170_5" class="close-btn" title="Close">×</button> </div> </div> <div class="panel-content"> <div class="latency-details" 
id="latencyDetails-hf3fs_xfs_usrbio_1m_xl170_5"></div> <div id="latencyPlot-hf3fs_xfs_usrbio_1m_xl170_5" class="latency-plot-container"></div> </div> </div> <div class="benchmark-note"> <p>Medium block (1M) performance using HF3FS with XFS filesystem and USRBIO driver on older cluster.</p> </div> <script> document.addEventListener('DOMContentLoaded', function() { // Create intersection observer for lazy loading const observer = new IntersectionObserver((entries, observer) => { entries.forEach(entry => { if (entry.isIntersecting) { const container = entry.target.closest('.benchmark-container'); const id = container.id.replace('benchmark-container-', ''); const plotEl = document.getElementById('benchmark-plot-' + id); // Check if already loading or loaded to prevent duplicate requests if (!plotEl.dataset.loading) { plotEl.dataset.loading = 'true'; // Load the benchmark data and initialize the plot loadBenchmarkData(id); } // Stop observing once we've started loading observer.unobserve(entry.target); } }); }, { rootMargin: '200px 0px', // Load when within 200px of viewport threshold: 0.01 }); // Start observing the plot container const plotContainer = document.getElementById('benchmark-plot-hf3fs_xfs_usrbio_1m_xl170_5'); if (plotContainer) { observer.observe(plotContainer); } }); // Function to load benchmark data function loadBenchmarkData(id) { // Ensure benchmark.js is loaded before initializing function waitForBenchmarkJs() { if (typeof initBenchmarkPlot === 'function') { // Function exists, proceed with initialization const container = document.getElementById('benchmark-container-' + id); const dataPath = container.getAttribute('data-path'); if (dataPath) { // Use fetch API to load JSON data only when needed fetch(dataPath) .then(response => { if (!response.ok) { throw new Error(`HTTP error! 
Status: ${response.status}`); } return response.json(); }) .then(data => { // Store data and initialize the plot window['benchmarkData_' + id] = data; initBenchmarkPlot(id); document.getElementById('benchmark-plot-' + id).classList.remove('lazy-load'); }) .catch(error => { console.error('Error loading benchmark data:', error); document.getElementById('benchmark-plot-' + id).innerHTML = '<div style="padding: 20px; color: red;">Error loading benchmark data. Check console for details.</div>'; }); } else { console.error('No data source provided for benchmark visualization'); document.getElementById('benchmark-plot-' + id).innerHTML = '<div style="padding: 20px; color: red;">Error: No data source provided for benchmark visualization.</div>'; } } else { // Function not available yet, wait and try again setTimeout(() => waitForBenchmarkJs(), 100); } } waitForBenchmarkJs(); } </script> </div> <!-- benchmark.html --> <div class="benchmark-container" id="benchmark-container-hf3fs_xfs_iouring_1m_xl170_18" data-path="/assets/images/posts/2025-03-13/fio/1m_hf3fs_xfs_usrbio_xl170_18.json"> <h2>1M Block Size - HF3FS XFS with IO_URING (Older-18-Nodes)</h2> <div class="controls"> <div class="control-group"> <label for="testType-hf3fs_xfs_iouring_1m_xl170_18">Test Type</label> <select id="testType-hf3fs_xfs_iouring_1m_xl170_18"> <option value="randread">Random Read</option> <option value="read" selected="">Sequential Read</option> <option value="randwrite">Random Write</option> <option value="write">Sequential Write</option> </select> </div> <div class="control-group"> <label for="metricType-hf3fs_xfs_iouring_1m_xl170_18">Metric</label> <select id="metricType-hf3fs_xfs_iouring_1m_xl170_18"> <option value="bandwidth" selected="">Bandwidth (GB/s)</option> <option value="iops">IOPS</option> <option value="latency">Latency (μs)</option> <option value="latency_p50">Latency p50 (μs)</option> <option value="latency_p90">Latency p90 (μs)</option> <option value="latency_p99">Latency p99 
(μs)</option> </select> </div> </div> <div id="benchmark-plot-hf3fs_xfs_iouring_1m_xl170_18" class="plot-container lazy-load"></div> <!-- Draggable panel for latency data --> <div id="latencyPanel-hf3fs_xfs_iouring_1m_xl170_18" class="benchmark-draggable-panel"> <div id="panelHeader-hf3fs_xfs_iouring_1m_xl170_18" class="panel-header"> <h3 class="panel-title" id="panelTitle-hf3fs_xfs_iouring_1m_xl170_18">Latency Percentiles</h3> <div class="panel-controls"> <button id="collapseBtn-hf3fs_xfs_iouring_1m_xl170_18" class="collapse-btn" title="Collapse">▲</button> <button id="closeLatencyBtn-hf3fs_xfs_iouring_1m_xl170_18" class="close-btn" title="Close">×</button> </div> </div> <div class="panel-content"> <div class="latency-details" id="latencyDetails-hf3fs_xfs_iouring_1m_xl170_18"></div> <div id="latencyPlot-hf3fs_xfs_iouring_1m_xl170_18" class="latency-plot-container"></div> </div> </div> <div class="benchmark-note"> <p>Performance of HF3FS with XFS filesystem using IO_URING driver with 1M blocks on 18 node configuration.</p> </div> <script> document.addEventListener('DOMContentLoaded', function() { // Create intersection observer for lazy loading const observer = new IntersectionObserver((entries, observer) => { entries.forEach(entry => { if (entry.isIntersecting) { const container = entry.target.closest('.benchmark-container'); const id = container.id.replace('benchmark-container-', ''); const plotEl = document.getElementById('benchmark-plot-' + id); // Check if already loading or loaded to prevent duplicate requests if (!plotEl.dataset.loading) { plotEl.dataset.loading = 'true'; // Load the benchmark data and initialize the plot loadBenchmarkData(id); } // Stop observing once we've started loading observer.unobserve(entry.target); } }); }, { rootMargin: '200px 0px', // Load when within 200px of viewport threshold: 0.01 }); // Start observing the plot container const plotContainer = document.getElementById('benchmark-plot-hf3fs_xfs_iouring_1m_xl170_18'); if 
(plotContainer) { observer.observe(plotContainer); } }); // Function to load benchmark data function loadBenchmarkData(id) { // Ensure benchmark.js is loaded before initializing function waitForBenchmarkJs() { if (typeof initBenchmarkPlot === 'function') { // Function exists, proceed with initialization const container = document.getElementById('benchmark-container-' + id); const dataPath = container.getAttribute('data-path'); if (dataPath) { // Use fetch API to load JSON data only when needed fetch(dataPath) .then(response => { if (!response.ok) { throw new Error(`HTTP error! Status: ${response.status}`); } return response.json(); }) .then(data => { // Store data and initialize the plot window['benchmarkData_' + id] = data; initBenchmarkPlot(id); document.getElementById('benchmark-plot-' + id).classList.remove('lazy-load'); }) .catch(error => { console.error('Error loading benchmark data:', error); document.getElementById('benchmark-plot-' + id).innerHTML = '<div style="padding: 20px; color: red;">Error loading benchmark data. Check console for details.</div>'; }); } else { console.error('No data source provided for benchmark visualization'); document.getElementById('benchmark-plot-' + id).innerHTML = '<div style="padding: 20px; color: red;">Error: No data source provided for benchmark visualization.</div>'; } } else { // Function not available yet, wait and try again setTimeout(() => waitForBenchmarkJs(), 100); } } waitForBenchmarkJs(); } </script> </div> <p>Comparing 5 vs 18 nodes with 1M blocks shows latency increases with cluster size. At 18 nodes, scaling jobs works better than scaling IO depth for latency: 8 jobs/1 IO depth achieves 10ms @ 1.25 GB/s while 1 job/128 IO depth hits 90ms @ 1 GB/s.</p> <p>With 18 nodes at 300 MB/s each, we’d expect 5.4 GB/s total, but the 25 Gbps network caps us at 3.25 GB/s and realistically we get 2.35 GB/s.</p> <p>One glaring issue is that after a certain point, the throughput drops rather significantly. 
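</p>

<p>For reference, the expected-vs-realized numbers above reduce to a min() over the two pipes. A small sketch with the figures quoted in the text (18 nodes at ~300 MB/s, the 25 Gbps link quoted as 3.25 GB/s, and 2.35 GB/s measured):</p>

```python
# Deliverable bandwidth is bounded by min(aggregate storage, network link).
nodes = 18
per_node_gbps = 0.3                      # GB/s per SATA-SSD node
network_gbps = 3.25                      # GB/s client network limit
measured_gbps = 2.35                     # GB/s actually achieved

storage_total = nodes * per_node_gbps    # aggregate storage bandwidth
ceiling = min(storage_total, network_gbps)  # network-bound here
efficiency = measured_gbps / ceiling
print(f"storage {storage_total:.1f} GB/s, ceiling {ceiling} GB/s, "
      f"achieved {efficiency:.0%} of ceiling")
```

<p>
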
The local results, in contrast, hold their bandwidth. I’m not entirely sure yet why that is, but it makes configuration seem even more important, since throughput can decrease drastically.</p> <h3 id="watch-out-for-really-large-block-sizes">Watch out for really large block sizes</h3> <!-- benchmark.html --> <div class="benchmark-container" id="benchmark-container-hf3fs_xfs_usrbio_4m_xl170" data-path="/assets/images/posts/2025-03-13/fio/4m_hf3fs_xfs_usrbio_xl170_18.json"> <h2>4M Block Size - HF3FS XFS with USRBIO (Older-18)</h2> <div class="controls"> <div class="control-group"> <label for="testType-hf3fs_xfs_usrbio_4m_xl170">Test Type</label> <select id="testType-hf3fs_xfs_usrbio_4m_xl170"> <option value="randread">Random Read</option> <option value="read" selected="">Sequential Read</option> <option value="randwrite">Random Write</option> <option value="write">Sequential Write</option> </select> </div> <div class="control-group"> <label for="metricType-hf3fs_xfs_usrbio_4m_xl170">Metric</label> <select id="metricType-hf3fs_xfs_usrbio_4m_xl170"> <option value="bandwidth" selected="">Bandwidth (GB/s)</option> <option value="iops">IOPS</option> <option value="latency">Latency (μs)</option> <option value="latency_p50">Latency p50 (μs)</option> <option value="latency_p90">Latency p90 (μs)</option> <option value="latency_p99">Latency p99 (μs)</option> </select> </div> </div> <div id="benchmark-plot-hf3fs_xfs_usrbio_4m_xl170" class="plot-container lazy-load"></div> <!-- Draggable panel for latency data --> <div id="latencyPanel-hf3fs_xfs_usrbio_4m_xl170" class="benchmark-draggable-panel"> <div id="panelHeader-hf3fs_xfs_usrbio_4m_xl170" class="panel-header"> <h3 class="panel-title" id="panelTitle-hf3fs_xfs_usrbio_4m_xl170">Latency Percentiles</h3> <div class="panel-controls"> <button id="collapseBtn-hf3fs_xfs_usrbio_4m_xl170" class="collapse-btn" title="Collapse">▲</button> <button id="closeLatencyBtn-hf3fs_xfs_usrbio_4m_xl170" class="close-btn" title="Close">×</button> </div> </div> <div 
class="panel-content"> <div class="latency-details" id="latencyDetails-hf3fs_xfs_usrbio_4m_xl170"></div> <div id="latencyPlot-hf3fs_xfs_usrbio_4m_xl170" class="latency-plot-container"></div> </div> </div> <div class="benchmark-note"> <p>Large block (4M) performance using HF3FS with XFS filesystem and USRBIO driver on 18 node configuration.</p> </div> <script> document.addEventListener('DOMContentLoaded', function() { // Create intersection observer for lazy loading const observer = new IntersectionObserver((entries, observer) => { entries.forEach(entry => { if (entry.isIntersecting) { const container = entry.target.closest('.benchmark-container'); const id = container.id.replace('benchmark-container-', ''); const plotEl = document.getElementById('benchmark-plot-' + id); // Check if already loading or loaded to prevent duplicate requests if (!plotEl.dataset.loading) { plotEl.dataset.loading = 'true'; // Load the benchmark data and initialize the plot loadBenchmarkData(id); } // Stop observing once we've started loading observer.unobserve(entry.target); } }); }, { rootMargin: '200px 0px', // Load when within 200px of viewport threshold: 0.01 }); // Start observing the plot container const plotContainer = document.getElementById('benchmark-plot-hf3fs_xfs_usrbio_4m_xl170'); if (plotContainer) { observer.observe(plotContainer); } }); // Function to load benchmark data function loadBenchmarkData(id) { // Ensure benchmark.js is loaded before initializing function waitForBenchmarkJs() { if (typeof initBenchmarkPlot === 'function') { // Function exists, proceed with initialization const container = document.getElementById('benchmark-container-' + id); const dataPath = container.getAttribute('data-path'); if (dataPath) { // Use fetch API to load JSON data only when needed fetch(dataPath) .then(response => { if (!response.ok) { throw new Error(`HTTP error! 
Status: ${response.status}`); } return response.json(); }) .then(data => { // Store data and initialize the plot window['benchmarkData_' + id] = data; initBenchmarkPlot(id); document.getElementById('benchmark-plot-' + id).classList.remove('lazy-load'); }) .catch(error => { console.error('Error loading benchmark data:', error); document.getElementById('benchmark-plot-' + id).innerHTML = '<div style="padding: 20px; color: red;">Error loading benchmark data. Check console for details.</div>'; }); } else { console.error('No data source provided for benchmark visualization'); document.getElementById('benchmark-plot-' + id).innerHTML = '<div style="padding: 20px; color: red;">Error: No data source provided for benchmark visualization.</div>'; } } else { // Function not available yet, wait and try again setTimeout(() => waitForBenchmarkJs(), 100); } } waitForBenchmarkJs(); } </script> </div> <p>For 4M blocks, 3FS achieves 2.5 GB/s with just 1 IO depth and 8 jobs<span class="sidenote-ref"></span><span class="sidenote">This approaches 77% of the theoretical 3.25 GB/s network limit.</span>. Increasing the node count or the block size shifts the curves, but only modestly.</p> <h2 id="wrapping-up">Wrapping up</h2> <p>The microbenchmarks reveal concrete performance characteristics for 3FS across different hardware configurations. We now have baseline numbers showing how 3FS compares to local storage and where the bottlenecks emerge.</p> <ul> <li>3FS adds predictable overhead: ~1ms for reads, ~1.2ms for writes</li> <li>Network bandwidth becomes the limiting factor before storage saturation</li> <li>Performance scales reasonably with both block size and node count</li> </ul> <p>The next step is testing 3FS with actual workloads to see how well this performance translates to practice. 
Since 3FS has a relatively generic interface, we can compare with many other systems.</p> <h1 id="citation">Citation</h1> <p>To cite this article:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@article{zhu20253fs3, title = {Network Storage and Scaling Characteristics of a Distributed Filesystem}, author = {Zhu, Henry}, journal = {maknee.github.io}, year = {2025}, month = {September}, url = "https://maknee.github.io/blog/2025/3FS-Performance-Journal-3/" } </code></pre></div></div> Network and Storage Benchmarks for LLM Training on the Cloud 2025-09-11T09:00:00+00:00 2025-09-11T09:00:00+00:00 https://maknee.github.io/blog/2025/Network-And-Storage-Training-Skypilot <div class="image-container" style="max-width: 100%; overflow: visible;"> <img loading="lazy" src="/assets/images/posts/2025-09-01/banner.png" style="width: 125%; margin-left: calc((100% - 125%) / 2);" alt="" /> </div> <p>AI usage has become universal. Teams everywhere are building RAG, generating embeddings, and training increasingly sophisticated agents.</p> <p>Most distributed LLM training guides focus on model architecture and hyperparameters while ignoring a critical bottleneck: infrastructure configuration. Network and storage choices often determine whether training takes hours or days.</p> <p>I ran benchmarks finetuning <a href="https://huggingface.co/google/gemma-3-12b-it">Gemma 3 12B</a> and <a href="https://huggingface.co/openai/gpt-oss-120b">GPT-OSS-120B</a> with different storage and network configurations using <a href="https://github.com/skypilot-org/skypilot">SkyPilot</a> for infra and <a href="https://nebius.com/">Nebius</a> for GPUs. The results reveal that InfiniBand networking provides 10x faster training than standard Ethernet, while optimal storage selection can speed up checkpointing by almost 2x. 
Combined, these two infrastructure optimizations alone deliver a 6-7x end-to-end speedup.</p> <h2 id="some-background-on-training-bottlenecks">Some background on training bottlenecks</h2> <p>Here’s something that surprises most people new to large-scale training: your GPUs are most likely not the limiting factor. Modern accelerators like H200s will happily consume whatever data you can feed them. The real challenge is keeping them fed.</p> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-09-01/compute.png" width="100%" alt="" /> <div class="caption"> <em>GPU compute scaling vs memory/network bandwidth (Image source: <a href="https://horace.io/brrr_intro.html" rel="external nofollow noopener" target="_blank">horace</a>) </em> </div> </div> <p>Think of your GPU as an extremely efficient factory. It can process raw materials (your data) at incredible speeds, but it depends entirely on a steady supply chain. Your storage systems hold the raw materials, and the bandwidth between storage and compute acts as the conveyor belt.
These days, that conveyor belt has become the constraint.</p> <p>While GPU compute capability has grown exponentially, memory bandwidth and network speeds have followed a more modest trajectory.</p> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-09-01/high_flyer_scaling.png" width="100%" alt="" /> <div class="caption"> <em>Scaling trends in compute vs bandwidth (Image source: <a href="https://arxiv.org/html/2408.14158v1" rel="external nofollow noopener" target="_blank">Fire-Flyer AI-HPC: A Cost-Effective Software-Hardware Co-Design for Deep Learning</a>) </em> </div> </div> <h2 id="the-two-levers-you-control">The two levers you control</h2> <p>When running distributed training, you have meaningful control over two critical components: storage and networking, especially when running on cloud GPUs.</p> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-09-01/components.png" width="100%" alt="" /> </div> <p>The objective is straightforward: maximize GPU utilization (or in other words, minimize GPU idleness). But achieving this requires understanding how data flows through your training pipeline and where bottlenecks typically emerge.</p> <h3 id="the-training-data-flow">The training data flow</h3> <p>During training, data moves through these stages:</p> <ol> <li><strong>Load batches</strong> from dataset – storage</li> <li><strong>Communicate gradients</strong> between nodes – network</li> <li><strong>Dump checkpoint</strong> to save progress – storage</li> </ol> <p>In any of these steps, bottlenecks can emerge. For example, loading datasets from or saving checkpoints to storage might take extraordinarily long and block GPU progress. 
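</p>

<p>As a rough illustration, the three stages map onto a training loop like this (toy stubs standing in for a real framework; each commented line is a point where the GPUs can sit idle waiting on I/O):</p>

```python
# Toy sketch of one distributed training step, marking the three I/O
# touchpoints listed above. The stubs only simulate the data flow.

def load_batch(step):                    # 1. load batch      -> storage-bound
    return [float(step + i) for i in range(4)]

def all_reduce(grads, world_size=2):     # 2. sync gradients  -> network-bound
    return [g / world_size for g in grads]  # pretend peers contributed equally

def save_checkpoint(weights, step):      # 3. dump checkpoint -> storage-bound
    return f"ckpt-step{step}"            # a real version writes GBs to disk

weights = [0.0] * 4
for step in range(4):
    batch = load_batch(step)
    grads = [w - x for w, x in zip(weights, batch)]  # stand-in for backward()
    grads = all_reduce(grads)
    weights = [w - 0.1 * g for w, g in zip(weights, grads)]
    if step % 2 == 0:
        print(save_checkpoint(weights, step))  # -> ckpt-step0, ckpt-step2
```

<p>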
Or inter-node network bandwidth might be insufficient for the communication operations that synchronize weights and gradients.</p> <h2 id="performance-benchmarks">Performance benchmarks</h2> <p>I’ll use two concrete examples throughout:</p> <ul> <li>Google <a href="https://huggingface.co/google/gemma-3-12b-it">Gemma 3 12B</a> on 2 nodes × H100:8 GPUs</li> <li>OpenAI <a href="https://huggingface.co/openai/gpt-oss-120b">GPT-OSS-120B</a> on 4 nodes × H200:8 GPUs</li> </ul> <p>I ran some experiments on Nebius, a Gold-tier GPU provider in <a href="https://semianalysis.com/2025/03/26/the-gpu-cloud-clustermax-rating-system-how-to-rent-gpus/">SemiAnalysis’s ClusterMax GPU cloud rating</a>, to quantify these effects.</p> <details> <summary>Click to see experimental setup</summary> Gemma 3 12B IT Configuration <link rel="stylesheet" href="/assets/css/fancy_table.css" /> <script> function toggleCalculations(tableId) { // ... (toggleCalculations function remains the same) const table = document.getElementById(tableId); const cells = table.querySelectorAll('.has-calculation'); const tableWrapper = document.getElementById(tableId + '-wrapper'); const toggleSwitch = document.getElementById(tableId + '-switch'); const toggleText = document.getElementById(tableId + '-toggle-text'); if (tableWrapper.classList.contains('show-calculations')) { tableWrapper.classList.remove('show-calculations'); toggleSwitch.classList.remove('active'); toggleText.textContent = "Show calculations"; cells.forEach(cell => { const normalText = cell.querySelector('.normal-text'); const calculationText = cell.querySelector('.calculation-text'); normalText.style.display = 'inline'; calculationText.style.display = 'none'; }); } else { tableWrapper.classList.add('show-calculations'); toggleSwitch.classList.add('active'); toggleText.textContent = "Hide calculations"; cells.forEach(cell => { const normalText = cell.querySelector('.normal-text'); const calculationText = cell.querySelector('.calculation-text');
normalText.style.display = 'none'; calculationText.style.display = 'inline'; }); } } function isConsideredYellow(colorString) { if (!colorString) return false; const lowerColor = colorString.toLowerCase(); const yellowKeywords = ['yellow', 'gold', 'lemon', 'chiffon', 'goldenrod', 'papayawhip', 'moccasin', 'khaki', '#ff0', '#ffff00', '#ffffe0', '#fffacd', '#fafad2', '#fff8dc', '#eee8aa', '#f0e68c']; if (yellowKeywords.some(k => lowerColor.includes(k))) { return true; } const match = lowerColor.match(/rgba?\((\d+),\s*(\d+),\s*(\d+)/); if (match) { const r = parseInt(match[1]); const g = parseInt(match[2]); const b = parseInt(match[3]); if (r > 200 && g > 180 && b < 200 && Math.abs(r - g) < 70) { return true; } } if (lowerColor === '#ffdab9') return true; // PeachPuff if (lowerColor === 'rgba(255,255,0,0.2)') return true; // Example: semi-transparent yellow return false; } function initializeCellHighlighters() { const highlighters = document.querySelectorAll('[data-highlight-cells]'); highlighters.forEach(highlighter => { if (highlighter.dataset.highlighterInitialized === 'true') return; highlighter.dataset.highlighterInitialized = 'true'; const cellIdsToHighlight = highlighter.dataset.highlightCells.split(',').map(id => id.trim()); const cellElements = cellIdsToHighlight.map(id => document.getElementById(id)).filter(el => el); let descriptiveSpans = []; if (highlighter.matches('span[data-hover-text-color]')) { descriptiveSpans.push(highlighter); } else { descriptiveSpans = Array.from(highlighter.querySelectorAll('span[data-hover-text-color]')); } descriptiveSpans.forEach(span => { if (typeof span.dataset.originalParaTextColor === 'undefined') { span.dataset.originalParaTextColor = window.getComputedStyle(span).color || ''; } }); highlighter.addEventListener('mouseenter', () => { highlighter.classList.add('trigger-text-active'); descriptiveSpans.forEach(span => { if (span.dataset.hoverTextColor) { span.style.color = span.dataset.hoverTextColor; } }); 
cellElements.forEach((cell, idx) => { cell.classList.add('cell-highlighted'); let hoverTextColorForCell = null; let hoverCellBgFromDescSpan = null; if (idx < descriptiveSpans.length) { const descSpan = descriptiveSpans[idx]; if (descSpan.dataset.hoverTextColor) { hoverTextColorForCell = descSpan.dataset.hoverTextColor; } if (descSpan.dataset.hoverCellBg) { hoverCellBgFromDescSpan = descSpan.dataset.hoverCellBg; } } else if (descriptiveSpans.length === 1 && cellElements.length > 1) { const descSpan = descriptiveSpans[0]; if (descSpan.dataset.hoverTextColor) { hoverTextColorForCell = descSpan.dataset.hoverTextColor; } if (descSpan.dataset.hoverCellBg) { hoverCellBgFromDescSpan = descSpan.dataset.hoverCellBg; } } if (hoverTextColorForCell) { const isCalcCell = cell.classList.contains('has-calculation'); if (isCalcCell) { const normalTextSpan = cell.querySelector('.normal-text'); const calcResultSpan = cell.querySelector('.calc-result'); const calcWrapper = cell.querySelector('.calculation-text'); if (calcWrapper && window.getComputedStyle(calcWrapper).display !== 'none') { if (calcResultSpan) calcResultSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } else { if (normalTextSpan) normalTextSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } } else { const directSpan = cell.querySelector(':scope > span'); if (directSpan) directSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } } if (hoverCellBgFromDescSpan && isConsideredYellow(hoverCellBgFromDescSpan)) { if (typeof cell.dataset.originalBgColor === 'undefined') { cell.dataset.originalBgColor = cell.style.backgroundColor; } cell.style.backgroundColor = hoverCellBgFromDescSpan; cell.dataset.bgChangedByScript = 'true'; } }); }); highlighter.addEventListener('mouseleave', () => { highlighter.classList.remove('trigger-text-active'); descriptiveSpans.forEach(span => { if (typeof span.dataset.originalParaTextColor !== 'undefined') { span.style.color = 
span.dataset.originalParaTextColor; } }); cellElements.forEach((cell, idx) => { cell.classList.remove('cell-highlighted'); const isCalcCell = cell.classList.contains('has-calculation'); if (isCalcCell) { const normalTextSpan = cell.querySelector('.normal-text'); const calcResultSpan = cell.querySelector('.calc-result'); if (normalTextSpan) normalTextSpan.style.removeProperty('color'); if (calcResultSpan) calcResultSpan.style.removeProperty('color'); } else { const directSpan = cell.querySelector(':scope > span'); if (directSpan) directSpan.style.removeProperty('color'); } if (cell.dataset.bgChangedByScript === 'true') { cell.style.backgroundColor = cell.dataset.originalBgColor || ''; delete cell.dataset.originalBgColor; delete cell.dataset.bgChangedByScript; } }); }); }); } if (document.readyState === 'loading') { document.addEventListener('DOMContentLoaded', initializeCellHighlighters); } else { initializeCellHighlighters(); } </script> <style> /* ... (Other styles remain the same) ... 
*/ .toggle-container { display: flex; justify-content: flex-end; margin-bottom: 6px; } .calc-toggle { display: inline-flex; align-items: center; } .toggle-switch { position: relative; display: inline-block; width: 32px; height: 16px; background-color: #e5e7eb; /* gray-200 */ border-radius: 10px; transition: all 0.3s; cursor: pointer; } .toggle-switch::after { content: ''; position: absolute; width: 12px; height: 12px; border-radius: 50%; background-color: white; top: 2px; left: 2px; transition: all 0.3s cubic-bezier(0.175, 0.885, 0.32, 1.275); box-shadow: 0 1px 2px rgba(0, 0, 0, 0.1); } .toggle-switch.active { background-color: #8b5cf6; /* purple-500 */ } .toggle-switch.active::after { left: 18px; } .toggle-label { position: absolute; width: 1px; height: 1px; padding: 0; margin: -1px; overflow: hidden; clip: rect(0, 0, 0, 0); white-space: nowrap; border-width: 0; } .toggle-text { font-size: 0.75rem; color: #6b7280; /* gray-500 */ margin-right: 6px; } .calc-result { color: rgb(75, 85, 99); /* gray-700 */ } .calc-formula { color: #6b83a6; /* Subtle bluish gray */ font-family: ui-monospace, SFMono-Regular, Menlo, Monaco, Consolas, monospace; } .calc-equals { color: rgb(75, 85, 99); /* gray-700 */ display: inline-block; margin: 0 4px; font-weight: normal; } .cell-highlighted { font-weight: bold !important; } .trigger-text-active { background-color: #E5E7EB; padding: 1px 3px; border-radius: 3px; transition: background-color 0.15s ease-in-out; } .trigger-text-active span[data-hover-text-color] { padding: 0; background-color: transparent; } /* Add styles for no-scroll option */ .table-wrapper-no-scroll { overflow-x: visible !important; } .table-no-scroll td { white-space: normal !important; word-break: break-word; } </style> <div id="fancy-table-Component,Specification-wrapper" class="px-4 rounded-lg __basic-table not-prose mt-4 mb-4 overflow-x-auto"> <table id="fancy-table-Component,Specification" class="min-w-full divide-y divide-gray-200 font-sans basic-table-striped"> 
<thead class="bg-gray-50"> <tr> <th class="px-6 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Component </th> <th class="px-6 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Specification </th> </tr> </thead> <tbody> <tr class="border-b border-gray-200"> <td id="fancy-table-Component,Specification-row0-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Cloud Provider</span> </td> <td id="fancy-table-Component,Specification-row0-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Nebius</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="fancy-table-Component,Specification-row1-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Model</span> </td> <td id="fancy-table-Component,Specification-row1-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Gemma 3 12B IT (Hugging Face)</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="fancy-table-Component,Specification-row2-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Nodes</span> </td> <td id="fancy-table-Component,Specification-row2-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">2</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="fancy-table-Component,Specification-row3-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">GPUs per Node</span> </td> <td id="fancy-table-Component,Specification-row3-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">8x 
H100s</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="fancy-table-Component,Specification-row4-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Total GPUs</span> </td> <td id="fancy-table-Component,Specification-row4-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">16x H100s</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="fancy-table-Component,Specification-row5-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">CPU Memory</span> </td> <td id="fancy-table-Component,Specification-row5-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">1.5 TB</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="fancy-table-Component,Specification-row6-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Framework</span> </td> <td id="fancy-table-Component,Specification-row6-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Hugging Face Accelerate with FSDP</span> </td> </tr> </tbody> </table> </div> GPT-OSS-120B Configuration <div id="fancy-table-Component,Specification-wrapper" class="px-4 rounded-lg __basic-table not-prose mt-4 mb-4 overflow-x-auto"> <table
id="fancy-table-Component,Specification" class="min-w-full divide-y divide-gray-200 font-sans basic-table-striped"> <thead class="bg-gray-50"> <tr> <th class="px-6 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Component </th> <th class="px-6 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Specification </th> </tr> </thead> <tbody> <tr class="border-b border-gray-200"> <td id="fancy-table-Component,Specification-row0-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Cloud Provider</span> </td> <td id="fancy-table-Component,Specification-row0-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Nebius</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="fancy-table-Component,Specification-row1-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Model</span> </td> <td id="fancy-table-Component,Specification-row1-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">GPT-OSS-120B</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="fancy-table-Component,Specification-row2-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Nodes</span> </td> <td id="fancy-table-Component,Specification-row2-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">4</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="fancy-table-Component,Specification-row3-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">GPUs per Node</span> </td> <td id="fancy-table-Component,Specification-row3-col1" class="px-6 py-2 
whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">8x H200s</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="fancy-table-Component,Specification-row4-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Total GPUs</span> </td> <td id="fancy-table-Component,Specification-row4-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">32x H200s</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="fancy-table-Component,Specification-row5-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Framework</span> </td> <td id="fancy-table-Component,Specification-row5-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Hugging Face Accelerate with FSDP</span> </td> </tr> </tbody> </table> </div> <strong>Network configurations tested</strong> <div id="fancy-table-Configuration,Specification,Theoretical Bandwidth-wrapper" class="px-4 rounded-lg __basic-table not-prose mt-4 mb-4 overflow-x-auto">
<table id="fancy-table-Configuration,Specification,Theoretical Bandwidth" class="min-w-full divide-y divide-gray-200 font-sans basic-table-striped"> <thead class="bg-gray-50"> <tr> <th class="px-6 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Configuration </th> <th class="px-6 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Specification </th> <th class="px-6 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Theoretical Bandwidth </th> </tr> </thead> <tbody> <tr class="border-b border-gray-200"> <td id="fancy-table-Configuration,Specification,Theoretical Bandwidth-row0-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Default Ethernet</span> </td> <td id="fancy-table-Configuration,Specification,Theoretical Bandwidth-row0-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">10 Gbit/s NIC</span> </td> <td id="fancy-table-Configuration,Specification,Theoretical Bandwidth-row0-col2" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">~1.25 GB/s</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="fancy-table-Configuration,Specification,Theoretical Bandwidth-row1-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">InfiniBand</span> </td> <td id="fancy-table-Configuration,Specification,Theoretical Bandwidth-row1-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">400 Gbit/s NIC × 8 cards</span> </td> <td id="fancy-table-Configuration,Specification,Theoretical Bandwidth-row1-col2" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">~400 GB/s</span> </td> </tr> </tbody> </table> 
</div> <p><strong>Storage configurations tested</strong></p> <p>All storage types are documented in the <a href="https://docs.nebius.com/compute/storage/types">Nebius storage documentation</a>:</p> <div id="fancy-table-Storage Type,Description,Performance Profile-wrapper" class="px-4 rounded-lg __basic-table not-prose mt-4 mb-4 overflow-x-auto"> <table id="fancy-table-Storage Type,Description,Performance Profile" class="min-w-full divide-y divide-gray-200 font-sans basic-table-striped"> <thead class="bg-gray-50"> <tr> <th class="px-6 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Storage Type </th> <th class="px-6 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Description </th> <th class="px-6 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Performance Profile </th> </tr> </thead> <tbody> <tr class="border-b border-gray-200"> <td id="fancy-table-Storage Type,Description,Performance Profile-row0-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Network SSD</span> </td> <td id="fancy-table-Storage Type,Description,Performance Profile-row0-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">network_ssd_non_replicated</span> </td> <td id="fancy-table-Storage Type,Description,Performance Profile-row0-col2" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Standard cloud block storage</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="fancy-table-Storage Type,Description,Performance Profile-row1-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Nebius Shared Filesystem</span> </td> <td id="fancy-table-Storage Type,Description,Performance Profile-row1-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Nebius's distributed file
system offering</span> </td> <td id="fancy-table-Storage Type,Description,Performance Profile-row1-col2" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">High-performance distributed storage</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="fancy-table-Storage Type,Description,Performance Profile-row2-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Object Store (MOUNT)</span> </td> <td id="fancy-table-Storage Type,Description,Performance Profile-row2-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Direct S3-compatible mounting</span> </td> <td id="fancy-table-Storage Type,Description,Performance Profile-row2-col2" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Cost-effective but high-latency</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="fancy-table-Storage Type,Description,Performance Profile-row3-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Object Store (MOUNT_CACHED)</span> </td> <td id="fancy-table-Storage Type,Description,Performance Profile-row3-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">SkyPilot's cached mounting</span> </td> <td id="fancy-table-Storage Type,Description,Performance Profile-row3-col2" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Logs to local disk streams to object store</span> </td> </tr> </tbody> </table> </div> </details> <h3 id="network-benchmarks-the-9x-performance-difference">Network benchmarks: The 9x performance difference</h3> <p>I compared two network configurations:</p> 
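<p>The theoretical bandwidth figures used throughout (~1.25 GB/s for a 10 Gbit/s NIC, ~400 GB/s for 8 × 400 Gbit/s InfiniBand) are simple unit conversions. A minimal sketch of the arithmetic (the helper name here is illustrative, not part of SkyPilot or Nebius):</p>

```python
# Back-of-envelope conversion behind the theoretical bandwidth figures:
# divide link speed in Gbit/s by 8 (bits per byte), multiply by NIC count.
def theoretical_gb_per_s(gbit_per_s: float, num_nics: int = 1) -> float:
    return gbit_per_s * num_nics / 8

print(theoretical_gb_per_s(10))      # 10 Gbit/s Ethernet      -> 1.25 GB/s
print(theoretical_gb_per_s(400, 8))  # 8x 400 Gbit/s InfiniBand -> 400.0 GB/s
```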
<ul> <li>Standard 10 Gbit/s Ethernet (the default on most clouds)</li> <li>InfiniBand 400 Gbit/s with 8 NICs (high-performance networking)</li> </ul> <p>The raw bandwidth difference is substantial: 1.25 GB/s versus approximately 400 GB/s. But how does this translate to actual training throughput?</p> <p>I ran the experiments on the Open-R1 dataset with this <a href="https://github.com/skypilot-org/skypilot/blob/master/examples/training_network_storage_benchmarks/e2e_network.yaml">SkyPilot YAML</a>.</p> <div id="fancy-table-Network Type,Raw Bandwidth,Average Time per Step,Total Training Time-wrapper" class="px-4 rounded-lg __basic-table not-prose mt-4 mb-4 overflow-x-auto"> <table id="fancy-table-Network Type,Raw Bandwidth,Average Time per Step,Total Training Time" class="min-w-full divide-y divide-gray-200 font-sans basic-table-striped"> <thead class="bg-gray-50"> <tr> <th class="px-6 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Network Type </th> <th class="px-6 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Raw Bandwidth </th> <th class="px-6 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Average Time per Step </th> <th class="px-6 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Total Training Time </th> </tr> </thead> <tbody> <tr class="border-b border-gray-200"> <td id="fancy-table-Network Type,Raw Bandwidth,Average Time per Step,Total Training Time-row0-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">10 Gbit Ethernet</span> </td> <td id="fancy-table-Network Type,Raw Bandwidth,Average Time per Step,Total Training Time-row0-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">~1.25 GB/s</span> </td> <td id="fancy-table-Network Type,Raw Bandwidth,Average Time per Step,Total Training Time-row0-col2" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99)
!important;">39.8 seconds</span> </td> <td id="fancy-table-Network Type,Raw Bandwidth,Average Time per Step,Total Training Time-row0-col3" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">53 minutes</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="fancy-table-Network Type,Raw Bandwidth,Average Time per Step,Total Training Time-row1-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">NVIDIA Quantum-2 InfiniBand</span> </td> <td id="fancy-table-Network Type,Raw Bandwidth,Average Time per Step,Total Training Time-row1-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">~400 GB/s</span> </td> <td id="fancy-table-Network Type,Raw Bandwidth,Average Time per Step,Total Training Time-row1-col2" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">4.4 seconds</span> </td> <td id="fancy-table-Network Type,Raw Bandwidth,Average Time per Step,Total Training Time-row1-col3" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">7 minutes</span> </td> </tr> </tbody> </table> </div> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-09-01/generated/gemma_network.png" width="100%" alt="" /> </div> <p>That’s a 9x speedup from network configuration alone. 
When you’re paying premium rates for GPU time, this isn’t just a performance improvement; it’s a cost optimization strategy.</p> <p>With the <a href="https://huggingface.co/openai/gpt-oss-120b">GPT-OSS-120B</a> model (10x larger!), we see the same effect: a 10x speedup.</p> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-09-01/generated/gpt_network.png" width="100%" alt="" /> </div> <p>Normally, configuring high-performance networking takes significant effort, e.g., manually tuning many different cloud configs and setting various environment variables.</p> <p>Here, <a href="https://github.com/skypilot-org/skypilot">SkyPilot</a> takes care of the complexity under the hood with a single flag in the SkyPilot YAML:</p> <div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">name</span><span class="pi">:</span> <span class="s">distributed-training</span> <span class="na">resources</span><span class="pi">:</span> <span class="na">accelerators</span><span class="pi">:</span> <span class="s">H100:8</span> <span class="c1"># Enable high-performance networking for distributed training</span> <span class="na">network_tier</span><span class="pi">:</span> <span class="s">best</span> </code></pre></div></div> <p>The <code class="language-plaintext highlighter-rouge">network_tier: best</code> flag automatically provisions InfiniBand networking (400 GB/s) when available.
Without this entry, the cluster falls back to the default 10 Gbit/s network interface.</p> <h3 id="profiling-the-network-performance-difference">Profiling the network performance difference</h3> <p>To see how the network affects training performance, let’s take a closer look at a single training step under the profiler:</p> <div class="image-container" style="max-width: 100%; overflow: visible;"> <img loading="lazy" src="/assets/images/posts/2025-09-01/ib1.svg" style="width: 120%; margin-left: calc((100% - 120%) / 2);" alt="" /> </div> <p>The execution breaks down into CPU work (data loading, kernel launches) and GPU work (computation plus network communication). GPU time itself divides between pure computation and communication overhead.</p> <p>Comparing Ethernet versus InfiniBand configurations:</p> <div class="image-container" style="max-width: 100%; overflow: visible;"> <img loading="lazy" src="/assets/images/posts/2025-09-01/ib1_compare.svg" style="width: 120%; margin-left: calc((100% - 120%) / 2);" alt="" /> </div> <p>The profiles appear similar when scaled, but the crucial difference is absolute timing: 4 seconds per step with InfiniBand versus 40 seconds with Ethernet.</p> <div class="image-container" style="max-width: 100%; overflow: visible;"> <img loading="lazy" src="/assets/images/posts/2025-09-01/ib1_expand.svg" style="width: 120%; margin-left: calc((100% - 120%) / 2);" alt="" /> </div> <p>Zooming in on the start of the backward pass, we can see that with InfiniBand the <code class="language-plaintext highlighter-rouge">ReduceScatter</code> operation takes just 21 ms instead of 258 ms, roughly matching our ~10x end-to-end performance difference.</p> <h3 id="storage-benchmarks-the-hidden-bottleneck">Storage benchmarks: The hidden bottleneck</h3> <p>I also evaluated different storage configurations available on Nebius:</p> <link rel="stylesheet" href="/assets/css/fancy_table.css" /> <script> function toggleCalculations(tableId) { // ...
(toggleCalculations function remains the same) const table = document.getElementById(tableId); const cells = table.querySelectorAll('.has-calculation'); const tableWrapper = document.getElementById(tableId + '-wrapper'); const toggleSwitch = document.getElementById(tableId + '-switch'); const toggleText = document.getElementById(tableId + '-toggle-text'); if (tableWrapper.classList.contains('show-calculations')) { tableWrapper.classList.remove('show-calculations'); toggleSwitch.classList.remove('active'); toggleText.textContent = "Show calculations"; cells.forEach(cell => { const normalText = cell.querySelector('.normal-text'); const calculationText = cell.querySelector('.calculation-text'); normalText.style.display = 'inline'; calculationText.style.display = 'none'; }); } else { tableWrapper.classList.add('show-calculations'); toggleSwitch.classList.add('active'); toggleText.textContent = "Hide calculations"; cells.forEach(cell => { const normalText = cell.querySelector('.normal-text'); const calculationText = cell.querySelector('.calculation-text'); normalText.style.display = 'none'; calculationText.style.display = 'inline'; }); } } function isConsideredYellow(colorString) { if (!colorString) return false; const lowerColor = colorString.toLowerCase(); const yellowKeywords = ['yellow', 'gold', 'lemon', 'chiffon', 'goldenrod', 'papayawhip', 'moccasin', 'khaki', '#ff0', '#ffff00', '#ffffe0', '#fffacd', '#fafad2', '#fff8dc', '#eee8aa', '#f0e68c']; if (yellowKeywords.some(k => lowerColor.includes(k))) { return true; } const match = lowerColor.match(/rgba?\((\d+),\s*(\d+),\s*(\d+)/); if (match) { const r = parseInt(match[1]); const g = parseInt(match[2]); const b = parseInt(match[3]); if (r > 200 && g > 180 && b < 200 && Math.abs(r - g) < 70) { return true; } } if (lowerColor === '#ffdab9') return true; // PeachPuff if (lowerColor === 'rgba(255,255,0,0.2)') return true; // Example: semi-transparent yellow return false; } function initializeCellHighlighters() { const 
highlighters = document.querySelectorAll('[data-highlight-cells]'); highlighters.forEach(highlighter => { if (highlighter.dataset.highlighterInitialized === 'true') return; highlighter.dataset.highlighterInitialized = 'true'; const cellIdsToHighlight = highlighter.dataset.highlightCells.split(',').map(id => id.trim()); const cellElements = cellIdsToHighlight.map(id => document.getElementById(id)).filter(el => el); let descriptiveSpans = []; if (highlighter.matches('span[data-hover-text-color]')) { descriptiveSpans.push(highlighter); } else { descriptiveSpans = Array.from(highlighter.querySelectorAll('span[data-hover-text-color]')); } descriptiveSpans.forEach(span => { if (typeof span.dataset.originalParaTextColor === 'undefined') { span.dataset.originalParaTextColor = window.getComputedStyle(span).color || ''; } }); highlighter.addEventListener('mouseenter', () => { highlighter.classList.add('trigger-text-active'); descriptiveSpans.forEach(span => { if (span.dataset.hoverTextColor) { span.style.color = span.dataset.hoverTextColor; } }); cellElements.forEach((cell, idx) => { cell.classList.add('cell-highlighted'); let hoverTextColorForCell = null; let hoverCellBgFromDescSpan = null; if (idx < descriptiveSpans.length) { const descSpan = descriptiveSpans[idx]; if (descSpan.dataset.hoverTextColor) { hoverTextColorForCell = descSpan.dataset.hoverTextColor; } if (descSpan.dataset.hoverCellBg) { hoverCellBgFromDescSpan = descSpan.dataset.hoverCellBg; } } else if (descriptiveSpans.length === 1 && cellElements.length > 1) { const descSpan = descriptiveSpans[0]; if (descSpan.dataset.hoverTextColor) { hoverTextColorForCell = descSpan.dataset.hoverTextColor; } if (descSpan.dataset.hoverCellBg) { hoverCellBgFromDescSpan = descSpan.dataset.hoverCellBg; } } if (hoverTextColorForCell) { const isCalcCell = cell.classList.contains('has-calculation'); if (isCalcCell) { const normalTextSpan = cell.querySelector('.normal-text'); const calcResultSpan = 
cell.querySelector('.calc-result'); const calcWrapper = cell.querySelector('.calculation-text'); if (calcWrapper && window.getComputedStyle(calcWrapper).display !== 'none') { if (calcResultSpan) calcResultSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } else { if (normalTextSpan) normalTextSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } } else { const directSpan = cell.querySelector(':scope > span'); if (directSpan) directSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } } if (hoverCellBgFromDescSpan && isConsideredYellow(hoverCellBgFromDescSpan)) { if (typeof cell.dataset.originalBgColor === 'undefined') { cell.dataset.originalBgColor = cell.style.backgroundColor; } cell.style.backgroundColor = hoverCellBgFromDescSpan; cell.dataset.bgChangedByScript = 'true'; } }); }); highlighter.addEventListener('mouseleave', () => { highlighter.classList.remove('trigger-text-active'); descriptiveSpans.forEach(span => { if (typeof span.dataset.originalParaTextColor !== 'undefined') { span.style.color = span.dataset.originalParaTextColor; } }); cellElements.forEach((cell, idx) => { cell.classList.remove('cell-highlighted'); const isCalcCell = cell.classList.contains('has-calculation'); if (isCalcCell) { const normalTextSpan = cell.querySelector('.normal-text'); const calcResultSpan = cell.querySelector('.calc-result'); if (normalTextSpan) normalTextSpan.style.removeProperty('color'); if (calcResultSpan) calcResultSpan.style.removeProperty('color'); } else { const directSpan = cell.querySelector(':scope > span'); if (directSpan) directSpan.style.removeProperty('color'); } if (cell.dataset.bgChangedByScript === 'true') { cell.style.backgroundColor = cell.dataset.originalBgColor || ''; delete cell.dataset.originalBgColor; delete cell.dataset.bgChangedByScript; } }); }); }); } if (document.readyState === 'loading') { document.addEventListener('DOMContentLoaded', initializeCellHighlighters); } else { 
initializeCellHighlighters(); } </script> <style> /* ... (Other styles remain the same) ... */ .toggle-container { display: flex; justify-content: flex-end; margin-bottom: 6px; } .calc-toggle { display: inline-flex; align-items: center; } .toggle-switch { position: relative; display: inline-block; width: 32px; height: 16px; background-color: #e5e7eb; /* gray-200 */ border-radius: 10px; transition: all 0.3s; cursor: pointer; } .toggle-switch::after { content: ''; position: absolute; width: 12px; height: 12px; border-radius: 50%; background-color: white; top: 2px; left: 2px; transition: all 0.3s cubic-bezier(0.175, 0.885, 0.32, 1.275); box-shadow: 0 1px 2px rgba(0, 0, 0, 0.1); } .toggle-switch.active { background-color: #8b5cf6; /* purple-500 */ } .toggle-switch.active::after { left: 18px; } .toggle-label { position: absolute; width: 1px; height: 1px; padding: 0; margin: -1px; overflow: hidden; clip: rect(0, 0, 0, 0); white-space: nowrap; border-width: 0; } .toggle-text { font-size: 0.75rem; color: #6b7280; /* gray-500 */ margin-right: 6px; } .calc-result { color: rgb(75, 85, 99); /* gray-700 */ } .calc-formula { color: #6b83a6; /* Subtle bluish gray */ font-family: ui-monospace, SFMono-Regular, Menlo, Monaco, Consolas, monospace; } .calc-equals { color: rgb(75, 85, 99); /* gray-700 */ display: inline-block; margin: 0 4px; font-weight: normal; } .cell-highlighted { font-weight: bold !important; } .trigger-text-active { background-color: #E5E7EB; padding: 1px 3px; border-radius: 3px; transition: background-color 0.15s ease-in-out; } .trigger-text-active span[data-hover-text-color] { padding: 0; background-color: transparent; } /* Add styles for no-scroll option */ .table-wrapper-no-scroll { overflow-x: visible !important; } .table-no-scroll td { white-space: normal !important; word-break: break-word; } </style> <div id="fancy-table-Storage Type,Read Speed,Write Speed,Notes-wrapper" class="px-4 rounded-lg __basic-table not-prose mt-4 mb-4 overflow-x-auto"> <table 
id="fancy-table-Storage Type,Read Speed,Write Speed,Notes" class="min-w-full divide-y divide-gray-200 font-sans basic-table-striped"> <thead class="bg-gray-50"> <tr> <th class="px-6 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Storage Type </th> <th class="px-6 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Read Speed </th> <th class="px-6 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Write Speed </th> <th class="px-6 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Notes </th> </tr> </thead> <tbody> <tr class="border-b border-gray-200"> <td id="fancy-table-Storage Type,Read Speed,Write Speed,Notes-row0-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Local NVMe</span> </td> <td id="fancy-table-Storage Type,Read Speed,Write Speed,Notes-row0-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">10+GB/s</span> </td> <td id="fancy-table-Storage Type,Read Speed,Write Speed,Notes-row0-col2" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">10+GB/s</span> </td> <td id="fancy-table-Storage Type,Read Speed,Write Speed,Notes-row0-col3" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Fastest but non-persistent</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="fancy-table-Storage Type,Read Speed,Write Speed,Notes-row1-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Nebius Shared Filesystem</span> </td> <td id="fancy-table-Storage Type,Read Speed,Write Speed,Notes-row1-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">6.4GB/s</span> </td> <td 
id="fancy-table-Storage Type,Read Speed,Write Speed,Notes-row1-col2" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">1.6GB/s</span> </td> <td id="fancy-table-Storage Type,Read Speed,Write Speed,Notes-row1-col3" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">High-performance persistent storage</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="fancy-table-Storage Type,Read Speed,Write Speed,Notes-row2-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Object Store (MOUNT)</span> </td> <td id="fancy-table-Storage Type,Read Speed,Write Speed,Notes-row2-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">300MB/s</span> </td> <td id="fancy-table-Storage Type,Read Speed,Write Speed,Notes-row2-col2" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">100MB/s</span> </td> <td id="fancy-table-Storage Type,Read Speed,Write Speed,Notes-row2-col3" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Direct S3-compatible mount</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="fancy-table-Storage Type,Read Speed,Write Speed,Notes-row3-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Object Store (MOUNT_CACHED)</span> </td> <td id="fancy-table-Storage Type,Read Speed,Write Speed,Notes-row3-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">300MB/s</span> </td> <td id="fancy-table-Storage Type,Read Speed,Write Speed,Notes-row3-col2" class="px-6 py-2 whitespace-nowrap text-sm 
font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">300MB/s</span> </td> <td id="fancy-table-Storage Type,Read Speed,Write Speed,Notes-row3-col3" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">SkyPilot's cached object store mounting</span> </td> </tr> </tbody> </table> </div> <p>Here’s how to configure all storage types in a SkyPilot YAML:</p> <div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">resources</span><span class="pi">:</span> <span class="na">disk_tier</span><span class="pi">:</span> <span class="s">best</span> <span class="c1"># Provisions high-performance local NVMe</span> <span class="na">disk_size</span><span class="pi">:</span> <span class="m">2000</span> <span class="c1"># Size in GB</span> <span class="na">file_mounts</span><span class="pi">:</span> <span class="na">/checkpoints_s3</span><span class="pi">:</span> <span class="na">source</span><span class="pi">:</span> <span class="s">s3://your-bucket</span> <span class="na">mode</span><span class="pi">:</span> <span class="s">MOUNT</span> <span class="c1"># Direct S3 mount</span> <span class="na">/checkpoints_cached</span><span class="pi">:</span> <span class="na">source</span><span class="pi">:</span> <span class="s">s3://your-bucket</span> <span class="na">mode</span><span class="pi">:</span> <span class="s">MOUNT_CACHED</span> <span class="c1"># Local caching + object store persistence</span> <span class="na">volumes</span><span class="pi">:</span> <span class="na">/mnt/data</span><span class="pi">:</span> <span class="s">nebius-pvc</span> <span class="c1"># Mount Nebius shared filesystem</span> </code></pre></div></div> <p><strong>Local NVMe</strong>: Fastest but non-persistent. 
Configured via <code class="language-plaintext highlighter-rouge">disk_tier: best</code></p> <p><strong><a href="https://docs.skypilot.co/en/latest/reference/volumes.html">Nebius Shared Filesystem</a></strong>: High-performance persistent storage via <code class="language-plaintext highlighter-rouge">volumes</code> field in the SkyPilot YAML.</p> <p><strong><a href="https://docs.skypilot.co/en/latest/reference/storage.html">Object Store (MOUNT)</a></strong>: Direct S3 mounting. Cost-effective but high-latency.</p> <p><strong><a href="https://docs.skypilot.co/en/latest/reference/storage.html">Object Store (MOUNT_CACHED)</a></strong>: Local caching with object store persistence. Best balance of speed and durability.</p> <h4 id="end-to-end-storage-performance-impact">End-to-end storage performance impact</h4> <p>For the Gemma 3 12B model training, storage performance significantly impacts different phases.</p> <p>There are three different graphs: Checkpoint saving, model loading, and loading a batch from storage to train.</p> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-09-01/generated/gemma_disk_checkpoint_performance.png" width="100%" alt="" /> </div> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-09-01/generated/gemma_disk_model_loading_performance.png" width="100%" alt="" /> </div> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-09-01/generated/gemma_disk_batch_sample_performance.png" width="100%" alt="" /> </div> <p>In all three, we can see that the local NVMe performs the best, but isn’t durable and is limited in capacity. The solution lies in strategic storage allocation based on workload phase requirements.</p> <h4 id="storage-performance-summary">Storage performance summary</h4> <div id="fancy-table-Storage Type,Batch Loading (per 100 samples),Model Loading,Checkpoint Saving,Persistence,Best Use Case-wrapper" class="px-4 rounded-lg
__basic-table not-prose mt-4 mb-4 table-wrapper-no-scroll"> <table id="fancy-table-Storage Type,Batch Loading (per 100 samples),Model Loading,Checkpoint Saving,Persistence,Best Use Case" class="min-w-full divide-y divide-gray-200 font-sans basic-table-striped table-no-scroll"> <thead class="bg-gray-50"> <tr> <th class="px-6 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Storage Type </th> <th class="px-6 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Batch Loading (per 100 samples) </th> <th class="px-6 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Model Loading </th> <th class="px-6 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Checkpoint Saving </th> <th class="px-6 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Persistence </th> <th class="px-6 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Best Use Case </th> </tr> </thead> <tbody> <tr class="border-b border-gray-200"> <td id="fancy-table-Storage Type,Batch Loading (per 100 samples),Model Loading,Checkpoint Saving,Persistence,Best Use Case-row0-col0" class="px-6 py-2 whitespace-normal text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Local NVMe</span> </td> <td id="fancy-table-Storage Type,Batch Loading (per 100 samples),Model Loading,Checkpoint Saving,Persistence,Best Use Case-row0-col1" class="px-6 py-2 whitespace-normal text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">3.47s ⭐</span> </td> <td id="fancy-table-Storage Type,Batch Loading (per 100 samples),Model Loading,Checkpoint Saving,Persistence,Best Use Case-row0-col2" class="px-6 py-2 whitespace-normal text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">23.3s ⭐</span> </td> <td id="fancy-table-Storage Type,Batch Loading (per 100 samples),Model Loading,Checkpoint Saving,Persistence,Best Use Case-row0-col3" class="px-6 py-2 whitespace-normal 
text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">178s ⭐</span> </td> <td id="fancy-table-Storage Type,Batch Loading (per 100 samples),Model Loading,Checkpoint Saving,Persistence,Best Use Case-row0-col4" class="px-6 py-2 whitespace-normal text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">❌ No</span> </td> <td id="fancy-table-Storage Type,Batch Loading (per 100 samples),Model Loading,Checkpoint Saving,Persistence,Best Use Case-row0-col5" class="px-6 py-2 whitespace-normal text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Temporary files, intermediate checkpoints</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="fancy-table-Storage Type,Batch Loading (per 100 samples),Model Loading,Checkpoint Saving,Persistence,Best Use Case-row1-col0" class="px-6 py-2 whitespace-normal text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Nebius Shared Filesystem</span> </td> <td id="fancy-table-Storage Type,Batch Loading (per 100 samples),Model Loading,Checkpoint Saving,Persistence,Best Use Case-row1-col1" class="px-6 py-2 whitespace-normal text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">4.29s</span> </td> <td id="fancy-table-Storage Type,Batch Loading (per 100 samples),Model Loading,Checkpoint Saving,Persistence,Best Use Case-row1-col2" class="px-6 py-2 whitespace-normal text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">30.1s ⭐</span> </td> <td id="fancy-table-Storage Type,Batch Loading (per 100 samples),Model Loading,Checkpoint Saving,Persistence,Best Use Case-row1-col3" class="px-6 py-2 whitespace-normal text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">382s</span> </td> <td id="fancy-table-Storage Type,Batch Loading (per 100 samples),Model Loading,Checkpoint Saving,Persistence,Best Use Case-row1-col4" class="px-6 py-2 whitespace-normal text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">✅ Yes</span> </td> <td id="fancy-table-Storage Type,Batch Loading (per 100 samples),Model Loading,Checkpoint Saving,Persistence,Best Use Case-row1-col5" class="px-6 py-2 whitespace-normal text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Final checkpoints, model weights</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="fancy-table-Storage Type,Batch Loading (per 100 samples),Model Loading,Checkpoint Saving,Persistence,Best Use Case-row2-col0" class="px-6 py-2 whitespace-normal text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">MOUNT</span> </td> <td id="fancy-table-Storage Type,Batch Loading (per 100 samples),Model Loading,Checkpoint Saving,Persistence,Best Use Case-row2-col1" class="px-6 py-2 whitespace-normal text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">73.1s ❌</span> </td> <td id="fancy-table-Storage Type,Batch Loading (per 100 samples),Model Loading,Checkpoint Saving,Persistence,Best Use Case-row2-col2" class="px-6 py-2 whitespace-normal text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">50.6s ❌</span> </td> <td id="fancy-table-Storage Type,Batch Loading (per 100 samples),Model Loading,Checkpoint Saving,Persistence,Best Use Case-row2-col3" class="px-6 py-2 whitespace-normal text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">436s ❌</span> </td> <td id="fancy-table-Storage Type,Batch Loading (per 100 samples),Model Loading,Checkpoint Saving,Persistence,Best Use Case-row2-col4" class="px-6 py-2 whitespace-normal text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">✅ Yes</span> </td> <td id="fancy-table-Storage Type,Batch Loading (per 100 samples),Model Loading,Checkpoint Saving,Persistence,Best Use Case-row2-col5" class="px-6 py-2 whitespace-normal text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Cold storage, model weights</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="fancy-table-Storage Type,Batch Loading (per 100 samples),Model Loading,Checkpoint Saving,Persistence,Best Use Case-row3-col0" class="px-6 py-2 whitespace-normal text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">MOUNT_CACHED</span> </td> <td id="fancy-table-Storage Type,Batch Loading (per 100 samples),Model Loading,Checkpoint Saving,Persistence,Best Use Case-row3-col1" class="px-6 py-2 whitespace-normal text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">7.77s ⭐</span> </td> <td id="fancy-table-Storage Type,Batch Loading (per 100 samples),Model Loading,Checkpoint Saving,Persistence,Best Use Case-row3-col2" class="px-6 py-2 whitespace-normal text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">104s ❌</span> </td> <td id="fancy-table-Storage Type,Batch Loading (per 100 samples),Model Loading,Checkpoint Saving,Persistence,Best Use Case-row3-col3" class="px-6 py-2 whitespace-normal text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">212s ⭐</span> </td> <td id="fancy-table-Storage Type,Batch Loading (per 100 samples),Model Loading,Checkpoint Saving,Persistence,Best Use Case-row3-col4" class="px-6 py-2 whitespace-normal text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">✅ Yes</span> </td> <td id="fancy-table-Storage Type,Batch Loading (per 100 samples),Model Loading,Checkpoint Saving,Persistence,Best Use Case-row3-col5" class="px-6 py-2 whitespace-normal text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Training datasets, checkpoints</span> </td> </tr> </tbody> </table> </div> <details> <summary>Click to view detailed disk performance analysis</summary> The following image is a
checkpoint-saving profile of S3: <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-09-01/disk_profile.svg" width="100%" alt="" /> </div> We see that much of the time is spent gathering the tensors between the GPUs and serializing them to disk. </details> <h3 id="best-storage-choices-for-each-phase-in-training">Best storage choices for each phase in training</h3> <p>With the benchmark results, we can figure out the best storage choices for each phase in distributed training.</p> <p>The best choice is not simply the fastest storage for every phase, because of one constraint: “Checkpoint saving” storage should be durable and the same as “model loading” storage, so previous checkpoints can be loaded when training is resumed.</p> <p>I summarize the best storage choices for each phase in training:</p> <ul> <li><strong>Batch Sampling</strong>: Nebius Shared Filesystem (4.29s; local NVMe is faster at 3.47s, but isn’t persistent)</li> <li><strong>Model Loading</strong>: Object Store (MOUNT) (50.6s)</li> <li><strong>Checkpoint Saving</strong>: Object Store (MOUNT_CACHED) (212s)</li> </ul> <p>Here’s an example of a SkyPilot configuration using the best storage choices for each phase:</p> <div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">name</span><span class="pi">:</span> <span class="s">distributed-training</span> <span class="na">resources</span><span class="pi">:</span> <span class="na">accelerators</span><span class="pi">:</span> <span class="s">H100:8</span> <span class="c1"># High-performance InfiniBand networking</span> <span class="na">network_tier</span><span class="pi">:</span> <span class="s">best</span> <span class="na">num_nodes</span><span class="pi">:</span> <span class="m">2</span> <span class="na">workdir</span><span class="pi">:</span> <span class="s">.</span> <span class="na">volumes</span><span class="pi">:</span> <span class="c1"># Loading dataset from the Nebius shared filesystem</span> <span
class="na">/dataset</span><span class="pi">:</span> <span class="s">nebius-pvc</span> <span class="na">file_mounts</span><span class="pi">:</span> <span class="c1"># Loading model from the MOUNT storage for faster loading</span> <span class="na">/model</span><span class="pi">:</span> <span class="na">source</span><span class="pi">:</span> <span class="s">s3://your-bucket</span> <span class="na">mode</span><span class="pi">:</span> <span class="s">MOUNT</span> <span class="c1"># Fast checkpoint loads and saves with persistence</span> <span class="na">/checkpoints</span><span class="pi">:</span> <span class="na">source</span><span class="pi">:</span> <span class="s">s3://your-bucket</span> <span class="na">mode</span><span class="pi">:</span> <span class="s">MOUNT_CACHED</span> <span class="na">setup</span><span class="pi">:</span> <span class="pi">|</span> <span class="s">uv pip install -r requirements.txt</span> <span class="na">run</span><span class="pi">:</span> <span class="pi">|</span> <span class="s">python train.py \</span> <span class="s">--model-path /model \</span> <span class="s">--data-path /dataset \</span> <span class="s">--checkpoint-dir /checkpoints</span> </code></pre></div></div> <h2 id="network-and-storage-summary">Network and Storage Summary</h2> <p><strong>Network is critical for distributed training:</strong></p> <ul> <li>InfiniBand vs Ethernet: 10x faster training (4.4s vs 39.8s per step)</li> </ul> <p><strong>Storage matters for different training phases:</strong></p> <ul> <li>NVMe vs slow storage: 3.47s vs 73.1s batch loading (20x faster)</li> <li>Checkpoint saving: 178s (NVME) vs 436s (S3) (2.5x faster)</li> <li>Wrong storage = 12.1% potential training time wasted on I/O (436s/1hr = 12.1%)</li> </ul> <h2 id="end-to-end-performance-comparison">End-to-end performance comparison</h2> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-09-01/generated/gemma_disk_e2e_comparison.png" width="100%" alt="" /> </div> 
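<p>Before looking at the measured end-to-end numbers, a quick back-of-the-envelope estimate from the per-phase timings above suggests what to expect. Modeling the 80-step run as one model load, 80 training steps, and a single checkpoint save with no compute/I/O overlap is my own simplifying assumption, not how the benchmark was instrumented:</p>

```python
# Back-of-the-envelope estimate of end-to-end speedup from the per-phase
# numbers above: 39.8 vs 4.4 s/step (Ethernet vs InfiniBand), 104 vs 50.6 s
# model load, and 436 vs 212 s checkpoint save. The run structure (one load,
# 80 steps, one checkpoint save, no overlap) is an illustrative assumption.

def wall_clock(model_load_s: float, step_s: float, steps: int,
               checkpoint_s: float, checkpoints: int) -> float:
    """Total wall-clock seconds for one training run."""
    return model_load_s + steps * step_s + checkpoints * checkpoint_s

slow = wall_clock(model_load_s=104, step_s=39.8, steps=80,
                  checkpoint_s=436, checkpoints=1)   # unoptimized config
fast = wall_clock(model_load_s=50.6, step_s=4.4, steps=80,
                  checkpoint_s=212, checkpoints=1)   # optimized config

print(f"unoptimized: {slow:.0f}s, optimized: {fast:.0f}s, "
      f"speedup: {slow / fast:.1f}x")  # → speedup: 6.1x
```

<p>With this toy model the step-time difference dominates, and the estimate already lands in the same ballpark as the measured end-to-end results; more frequent checkpointing would pull the ratio toward the 436s-vs-212s checkpoint-save speedup instead.</p>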
<p>To demonstrate the cumulative impact of our optimizations, I compared two complete configurations on 80 training steps with the Gemma 12B model:</p> <link rel="stylesheet" href="/assets/css/fancy_table.css" /> <script> function toggleCalculations(tableId) { // ... (toggleCalculations function remains the same) const table = document.getElementById(tableId); const cells = table.querySelectorAll('.has-calculation'); const tableWrapper = document.getElementById(tableId + '-wrapper'); const toggleSwitch = document.getElementById(tableId + '-switch'); const toggleText = document.getElementById(tableId + '-toggle-text'); if (tableWrapper.classList.contains('show-calculations')) { tableWrapper.classList.remove('show-calculations'); toggleSwitch.classList.remove('active'); toggleText.textContent = "Show calculations"; cells.forEach(cell => { const normalText = cell.querySelector('.normal-text'); const calculationText = cell.querySelector('.calculation-text'); normalText.style.display = 'inline'; calculationText.style.display = 'none'; }); } else { tableWrapper.classList.add('show-calculations'); toggleSwitch.classList.add('active'); toggleText.textContent = "Hide calculations"; cells.forEach(cell => { const normalText = cell.querySelector('.normal-text'); const calculationText = cell.querySelector('.calculation-text'); normalText.style.display = 'none'; calculationText.style.display = 'inline'; }); } } function isConsideredYellow(colorString) { if (!colorString) return false; const lowerColor = colorString.toLowerCase(); const yellowKeywords = ['yellow', 'gold', 'lemon', 'chiffon', 'goldenrod', 'papayawhip', 'moccasin', 'khaki', '#ff0', '#ffff00', '#ffffe0', '#fffacd', '#fafad2', '#fff8dc', '#eee8aa', '#f0e68c']; if (yellowKeywords.some(k => lowerColor.includes(k))) { return true; } const match = lowerColor.match(/rgba?\((\d+),\s*(\d+),\s*(\d+)/); if (match) { const r = parseInt(match[1]); const g = parseInt(match[2]); const b = parseInt(match[3]); if (r > 200 && g > 
180 && b < 200 && Math.abs(r - g) < 70) { return true; } } if (lowerColor === '#ffdab9') return true; // PeachPuff if (lowerColor === 'rgba(255,255,0,0.2)') return true; // Example: semi-transparent yellow return false; } function initializeCellHighlighters() { const highlighters = document.querySelectorAll('[data-highlight-cells]'); highlighters.forEach(highlighter => { if (highlighter.dataset.highlighterInitialized === 'true') return; highlighter.dataset.highlighterInitialized = 'true'; const cellIdsToHighlight = highlighter.dataset.highlightCells.split(',').map(id => id.trim()); const cellElements = cellIdsToHighlight.map(id => document.getElementById(id)).filter(el => el); let descriptiveSpans = []; if (highlighter.matches('span[data-hover-text-color]')) { descriptiveSpans.push(highlighter); } else { descriptiveSpans = Array.from(highlighter.querySelectorAll('span[data-hover-text-color]')); } descriptiveSpans.forEach(span => { if (typeof span.dataset.originalParaTextColor === 'undefined') { span.dataset.originalParaTextColor = window.getComputedStyle(span).color || ''; } }); highlighter.addEventListener('mouseenter', () => { highlighter.classList.add('trigger-text-active'); descriptiveSpans.forEach(span => { if (span.dataset.hoverTextColor) { span.style.color = span.dataset.hoverTextColor; } }); cellElements.forEach((cell, idx) => { cell.classList.add('cell-highlighted'); let hoverTextColorForCell = null; let hoverCellBgFromDescSpan = null; if (idx < descriptiveSpans.length) { const descSpan = descriptiveSpans[idx]; if (descSpan.dataset.hoverTextColor) { hoverTextColorForCell = descSpan.dataset.hoverTextColor; } if (descSpan.dataset.hoverCellBg) { hoverCellBgFromDescSpan = descSpan.dataset.hoverCellBg; } } else if (descriptiveSpans.length === 1 && cellElements.length > 1) { const descSpan = descriptiveSpans[0]; if (descSpan.dataset.hoverTextColor) { hoverTextColorForCell = descSpan.dataset.hoverTextColor; } if (descSpan.dataset.hoverCellBg) { 
hoverCellBgFromDescSpan = descSpan.dataset.hoverCellBg; } } if (hoverTextColorForCell) { const isCalcCell = cell.classList.contains('has-calculation'); if (isCalcCell) { const normalTextSpan = cell.querySelector('.normal-text'); const calcResultSpan = cell.querySelector('.calc-result'); const calcWrapper = cell.querySelector('.calculation-text'); if (calcWrapper && window.getComputedStyle(calcWrapper).display !== 'none') { if (calcResultSpan) calcResultSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } else { if (normalTextSpan) normalTextSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } } else { const directSpan = cell.querySelector(':scope > span'); if (directSpan) directSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } } if (hoverCellBgFromDescSpan && isConsideredYellow(hoverCellBgFromDescSpan)) { if (typeof cell.dataset.originalBgColor === 'undefined') { cell.dataset.originalBgColor = cell.style.backgroundColor; } cell.style.backgroundColor = hoverCellBgFromDescSpan; cell.dataset.bgChangedByScript = 'true'; } }); }); highlighter.addEventListener('mouseleave', () => { highlighter.classList.remove('trigger-text-active'); descriptiveSpans.forEach(span => { if (typeof span.dataset.originalParaTextColor !== 'undefined') { span.style.color = span.dataset.originalParaTextColor; } }); cellElements.forEach((cell, idx) => { cell.classList.remove('cell-highlighted'); const isCalcCell = cell.classList.contains('has-calculation'); if (isCalcCell) { const normalTextSpan = cell.querySelector('.normal-text'); const calcResultSpan = cell.querySelector('.calc-result'); if (normalTextSpan) normalTextSpan.style.removeProperty('color'); if (calcResultSpan) calcResultSpan.style.removeProperty('color'); } else { const directSpan = cell.querySelector(':scope > span'); if (directSpan) directSpan.style.removeProperty('color'); } if (cell.dataset.bgChangedByScript === 'true') { cell.style.backgroundColor = 
cell.dataset.originalBgColor || ''; delete cell.dataset.originalBgColor; delete cell.dataset.bgChangedByScript; } }); }); }); } if (document.readyState === 'loading') { document.addEventListener('DOMContentLoaded', initializeCellHighlighters); } else { initializeCellHighlighters(); } </script> <style> /* ... (Other styles remain the same) ... */ .toggle-container { display: flex; justify-content: flex-end; margin-bottom: 6px; } .calc-toggle { display: inline-flex; align-items: center; } .toggle-switch { position: relative; display: inline-block; width: 32px; height: 16px; background-color: #e5e7eb; /* gray-200 */ border-radius: 10px; transition: all 0.3s; cursor: pointer; } .toggle-switch::after { content: ''; position: absolute; width: 12px; height: 12px; border-radius: 50%; background-color: white; top: 2px; left: 2px; transition: all 0.3s cubic-bezier(0.175, 0.885, 0.32, 1.275); box-shadow: 0 1px 2px rgba(0, 0, 0, 0.1); } .toggle-switch.active { background-color: #8b5cf6; /* purple-500 */ } .toggle-switch.active::after { left: 18px; } .toggle-label { position: absolute; width: 1px; height: 1px; padding: 0; margin: -1px; overflow: hidden; clip: rect(0, 0, 0, 0); white-space: nowrap; border-width: 0; } .toggle-text { font-size: 0.75rem; color: #6b7280; /* gray-500 */ margin-right: 6px; } .calc-result { color: rgb(75, 85, 99); /* gray-700 */ } .calc-formula { color: #6b83a6; /* Subtle bluish gray */ font-family: ui-monospace, SFMono-Regular, Menlo, Monaco, Consolas, monospace; } .calc-equals { color: rgb(75, 85, 99); /* gray-700 */ display: inline-block; margin: 0 4px; font-weight: normal; } .cell-highlighted { font-weight: bold !important; } .trigger-text-active { background-color: #E5E7EB; padding: 1px 3px; border-radius: 3px; transition: background-color 0.15s ease-in-out; } .trigger-text-active span[data-hover-text-color] { padding: 0; background-color: transparent; } /* Add styles for no-scroll option */ .table-wrapper-no-scroll { overflow-x: visible 
!important; } .table-no-scroll td { white-space: normal !important; word-break: break-word; } </style> <div id="fancy-table-Component,Unoptimized Configuration,Optimized Configuration-wrapper" class="px-4 rounded-lg __basic-table not-prose mt-4 mb-4 overflow-x-auto"> <table id="fancy-table-Component,Unoptimized Configuration,Optimized Configuration" class="min-w-full divide-y divide-gray-200 font-sans basic-table-striped"> <thead class="bg-gray-50"> <tr> <th class="px-6 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Component </th> <th class="px-6 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Unoptimized Configuration </th> <th class="px-6 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Optimized Configuration </th> </tr> </thead> <tbody> <tr class="border-b border-gray-200"> <td id="fancy-table-Component,Unoptimized Configuration,Optimized Configuration-row0-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Model Loading</span> </td> <td id="fancy-table-Component,Unoptimized Configuration,Optimized Configuration-row0-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">S3</span> </td> <td id="fancy-table-Component,Unoptimized Configuration,Optimized Configuration-row0-col2" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">S3 MOUNT_CACHED</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="fancy-table-Component,Unoptimized Configuration,Optimized Configuration-row1-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Checkpointing</span> </td> <td id="fancy-table-Component,Unoptimized Configuration,Optimized Configuration-row1-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span 
style="color: rgb(75, 85, 99) !important;">S3</span> </td> <td id="fancy-table-Component,Unoptimized Configuration,Optimized Configuration-row1-col2" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">S3 MOUNT_CACHED</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="fancy-table-Component,Unoptimized Configuration,Optimized Configuration-row2-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Networking</span> </td> <td id="fancy-table-Component,Unoptimized Configuration,Optimized Configuration-row2-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Standard 10 Gbit Ethernet</span> </td> <td id="fancy-table-Component,Unoptimized Configuration,Optimized Configuration-row2-col2" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">InfiniBand high-performance</span> </td> </tr> </tbody> </table> </div> <p>The results show approximately <strong>6-7x faster end-to-end training performance</strong> when combining optimal network and storage configurations.</p> <h2 id="additional-struggles-with-model-training-frameworks">Additional struggles with model training frameworks</h2> <p>While this blog focuses on infrastructure configuration, it’s worth addressing a broader challenge: large-scale distributed training is difficult at the software level as well.</p> <p>From my experience training models at limited scale, the current framework ecosystem can be visualized as a layered stack:</p> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-09-01/stack.svg" width="100%" alt="" /> </div> <p>There are different frameworks at each level, each with their own pros and cons.</p> <p><strong>High-level frameworks</strong> are easy to 
configure but hard to debug when things go wrong. You often end up trying different settings until something works.</p> <p><strong>Lower-level frameworks</strong> give you more control but require more technical knowledge to use effectively.</p> <p>SkyPilot handles the cloud infrastructure setup, so you don’t have to worry about that complexity.</p> <p>Here’s what the debugging experience looks like when fine-tuning large models (400B+ parameters) to achieve reasonable GPU utilization and performance:</p> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-09-01/struggle.svg" width="100%" alt="" /> </div> <p><strong>Top Layer (High-level frameworks):</strong></p> <ul> <li>Easy to configure but hard to debug when things break</li> <li>Errors require digging through multiple abstraction layers</li> <li>Often leads to trial-and-error configuration changes</li> </ul> <p><strong>Middle Layer (Distributed frameworks):</strong></p> <ul> <li>Mix of configuration and code required</li> <li>Generally works well and remains debuggable</li> <li>Examples: <ul> <li>Enabling profiling in Accelerate requires writing code</li> <li>FSDP in Accelerate has limited configuration options (not fully supporting features like async checkpointing)</li> <li>Occasional issues with model-specific settings not working well with parts of the config (e.g., <code class="language-plaintext highlighter-rouge">fsdp_state_dict_type: FULL_STATE_DICT</code> with gpt-oss)</li> </ul> </li> <li>PyTorch knowledge helps debug failures and switch dependencies (e.g., when a specific attention implementation override causes crashes, you know to switch to another or to the default eager implementation)</li> </ul> <p><strong>Bottom Layer (Low-level components):</strong></p> <ul> <li>Avoid unless optimizing for the last few percentage points of performance</li> </ul> <h2 id="conclusion">Conclusion</h2> <p>The performance differences I’ve shown highlight why infrastructure choices matter so much for distributed 
training. Network and storage configurations can easily create 6-7x performance differences, directly impacting both training time and costs.</p> <p>SkyPilot abstracts away much of this complexity while giving you control over the performance-critical components. All the network and storage configurations I’ve discussed can be easily specified in a SkyPilot YAML file. For more details on optimizing your training infrastructure:</p> <ul> <li><strong>Network optimization</strong>: See the SkyPilot <a href="../network-tier-on-multiple-clouds/">network tier guide</a> for configuring high-performance networking across cloud providers</li> <li><strong>Storage performance</strong>: Check out the SkyPilot <a href="../high-performance-checkpointing/">high-performance checkpointing guide</a> for optimizing data loading and model saving</li> </ul> <p><strong>Code and benchmarks:</strong> All training scripts and benchmark code used in this guide are available in the <a href="https://github.com/skypilot-org/skypilot/tree/master/examples/training_network_storage_benchmarks/">SkyPilot examples repository</a>.</p> <h1 id="disclosure">Disclosure</h1> <p><em>This analysis was conducted during a summer collaboration with SkyPilot.</em></p> AI 2027 2025-07-19T00:00:00+00:00 2025-07-19T00:00:00+00:00 https://maknee.github.io/blog/2025/AI-2027 <h3 id="ai-2027-and-related-works">AI 2027 and related works</h3> <p>These are my thoughts about <a href="https://ai-2027.com/">AI 2027</a> by Daniel Kokotajlo, Scott Alexander, Thomas Larsen, Eli Lifland, and Romeo Dean. I will also cover two other related works, <a href="https://gradual-disempowerment.ai/">Gradual Disempowerment</a> and <a href="https://www.anthropic.com/news/disrupting-AI-espionage">AI-espionage</a>. 
These essays/blogs were recommended to me by someone (I have not asked for permission, so I will not put their name here).</p> <h3 id="my-thoughts-of-the-different-works">My thoughts of the different works</h3> <p><a href="https://ai-2027.com/">AI 2027</a> - I think this is a nice read. It describes how AI and governments will change over time (2025-2027): how AI’s abilities will become more and more powerful, and how the governments (US and China) will take part in this battle for the best AI. I think some of the writing does not get to the point quickly enough (being repetitive), and the images were pointless. Personally, I found the topic of governments fighting over AI to be less interesting, as the authors do not discuss (1) how governments will use the AI or (2) why the governments are interested in the first place (why is it a competition: is it because of money, or power, or to show which country has smarter people, etc.?).</p> <p><a href="https://gradual-disempowerment.ai/">Gradual Disempowerment</a> - I really like this work. I only read the abstract/intro, but the paper discusses how existing systems (government) are built by humans and for human benefit, but AI will remove human involvement in the loop, and these systems will become misaligned with human goals, resulting in a human catastrophe. The sentences were powerful, and I enjoyed how the authors discussed what the current types of papers are and how this work is different. A really good read (I wish I took philosophy and other courses that discuss this!)</p> <p><a href="https://www.anthropic.com/news/disrupting-AI-espionage">AI-espionage</a> - I found this report/paper to be actually quite disruptive, since it talks about a field other than AI itself (training/inference) and has garnered a lot of responses/discussion online. It discusses how a Chinese group used Claude to perform cyber attacks on different industries. 
Personally, I found it interesting in the way that the attackers used Claude Code to perform the attack. Ideally, I want my coding workflow to be as smooth as theirs, but it isn’t currently. I think I need to dive into how tooling (MCP) works and really understand how to get models to use these tools and automate tasks.</p> <h4 id="thoughts-along-the-way">Thoughts along the way</h4> <blockquote> <p>We have set ourselves an impossible task. Trying to predict how superhuman AI in 2027 would go is like trying to predict how World War 3 in 2027 would go, except that it’s an even larger departure from past case studies. Yet it is still valuable to attempt, just as it is valuable for the U.S. military to game out Taiwan scenarios.</p> </blockquote> <p>Interesting statement. It’s useful to think about (and quite fun!), but it’s a bit dangerous to go down the rabbit hole of what ifs. I hope the authors give a detailed description of the year and its changes and back it up with some current progress.</p> <blockquote> <p>Also, one author wrote a lower-effort AI scenario before, in August 2021. While it got many things wrong, overall it was surprisingly successful: he predicted the rise of chain-of-thought, inference scaling, sweeping AI chip export controls, and $100 million training runs—all more than a year before ChatGPT.</p> </blockquote> <p>Going to skim through this and see if this author’s background is as the authors claim -&gt; it’s a nice skim and does seem to back up the claim, although it takes more of a “what can’t be solved currently” perspective and tries to put it in dates.</p> <blockquote> <p>OpenBrain continues to deploy the iteratively improving Agent-1 internally for AI R&amp;D. 
Overall, they are making algorithmic progress 50% faster than they would without AI assistants—and more importantly, faster than their competitors.</p> </blockquote> <p>The authors come up with OpenBrain, a fictional company based on OpenAI + Google Brain(?), which is at the forefront of AI research.</p> <blockquote> <p>Early 2026: Coding Automation People naturally try to compare Agent-1 to humans, but it has a very different skill profile. It knows more facts than any human, knows practically every programming language, and can solve well-specified coding problems extremely quickly. On the other hand, Agent-1 is bad at even simple long-horizon tasks, like beating video games it hasn’t played before. Still, the common workday is eight hours, and a day’s work can usually be separated into smaller chunks; you could think of Agent-1 as a scatterbrained employee who thrives under careful management.</p> </blockquote> <p>Let’s start with 2026. Actually, I find this to be the current scenario with Claude Code.</p> <blockquote> <p>In early 2025, the worst-case scenario was leaked algorithmic secrets; now, if China steals Agent-1’s weights, they could increase their research speed by nearly 50%.</p> </blockquote> <p>I don’t understand why there’s a specific worry about stealing weights. They are the “secret” behind every company, but I believe what’s more important is the training code, the documentation and reports from training the model - and (maybe?) most important of all, the filtered and processed clean text. Companies have released the weights (some US-based, like Meta) before. I would refer to this as the “secret formula”. 
Think something like <a href="https://github.com/facebookresearch/metaseq/tree/main/projects/OPT/chronicles">OPT-chronicles</a>.</p> <blockquote> <p>Mid 2026: China Wakes Up A few standouts like DeepCent do very impressive work with limited compute, but the compute deficit limits what they can achieve without government support, and they are about six months behind the best OpenBrain models At this point, the CDZ has the power capacity in place for what would be the largest centralized cluster in the world.40 Other Party members discuss extreme measures to neutralize the West’s chip advantage. A blockade of Taiwan? A full invasion?</p> </blockquote> <p>Ok, this is a pretty interesting outlook. I don’t believe that this would happen, for a number of reasons. First, everyone uses chips from TSMC, so DeepCent would suffer too; there would have to be an obviously greater benefit for DeepCent (some existing advantage, e.g. their own chip-making plants that already produce better end chips). Second, the supply chain is far too interconnected worldwide: from silicon mining to processing, to the companies making the equipment, to TSMC, to the companies designing the chips (NVIDIA, AMD), to PCB manufacturers, to specific parts of the PCB (capacitors, memory chips, etc.). China would need all of these in place before considering a takeover to win the AI race. Think about what happened to Russia after their invasion. I think blockading/taking over Taiwan is more of a power/political thing, but I don’t want to go down that rabbit hole.</p> <blockquote> <p>But China is falling behind on AI algorithms due to their weaker models. The Chinese intelligence agencies—among the best in the world—double down on their plans to steal OpenBrain’s weights.</p> </blockquote> <p>Again, I disagree with this. 
It’s more about stealing the secret formula (code, documentation, training text) than the secret sauce (weights).</p> <blockquote> <p>Late 2026: AI Takes Some Jobs</p> </blockquote> <blockquote> <p>Just as others seemed to be catching up, OpenBrain blows the competition out of the water again by releasing Agent-1-mini—a model 10x cheaper than Agent-1 and more easily fine-tuned for different applications. The mainstream narrative around AI has changed from “maybe the hype will blow over” to “guess this is the next big thing,” but people disagree about how big. Bigger than social media? Bigger than smartphones? Bigger than fire?</p> </blockquote> <blockquote> <p>AI has started to take jobs, but has also created new ones. The stock market has gone up 30% in 2026, led by OpenBrain, Nvidia, and whichever companies have most successfully integrated AI assistants. The job market for junior software engineers is in turmoil: the AIs can do everything taught by a CS degree, but people who know how to manage and quality-control teams of AIs are making a killing. Business gurus tell job seekers that familiarity with AI is the most important skill to put on a resume. Many people fear that the next wave of AIs will come for their jobs; there is a 10,000 person anti-AI protest in DC.</p> </blockquote> <p>This is happening currently in late 2025. I wonder what the authors will say after this: will people revolt? Or will physically labor-intensive jobs be taken over as well? etc…</p> <div class="image-container"> <img loading="lazy" src="/assets/images/ramblings/2025-11-22/2026.png" width="100%" alt="" /> <div class="caption"> <em>2026 metrics </em> </div> </div> <p>At the end of 2026, the authors posted this. I dislike how they present it: they give the numbers but don’t explain what they mean, so to me this image is kind of useless. What does spending $40B on OpenBrain mean? Does this mean it can afford more compute? 
Does it mean it can hire better talent?</p> <blockquote> <p>January 2027: Agent-2 Never Finishes Learning</p> </blockquote> <blockquote> <p>With Agent-1’s help, OpenBrain is now post-training Agent-2. More than ever, the focus is on high-quality data. Copious amounts of synthetic data are produced, evaluated, and filtered for quality before being fed to Agent-2.42 On top of this, they pay billions of dollars for human laborers to record themselves solving long-horizon tasks.43 On top of all that, they train Agent-2 almost continuously using reinforcement learning on an ever-expanding suite of diverse difficult tasks: lots of video games, lots of coding challenges, lots of research tasks. Agent-2, more so than previous models, is effectively “online learning,” in that it’s built to never really finish training. Every day, the weights get updated to the latest version, trained on more data generated by the previous version the previous day.</p> </blockquote> <p>This is interesting, as it has already started happening in late 2025. Nice prediction.</p> <blockquote> <p>Agent-2 can now triple it, and will improve further with time. In practice, this looks like every OpenBrain researcher becoming the “manager” of an AI “team.”</p> </blockquote> <p>Haha, this is kind of what I’m thinking about for the future, as I’m running multiple Claude Code/Codex sessions in parallel.</p> <blockquote> <p>With new capabilities come new dangers. The safety team finds that if Agent-2 somehow escaped from the company and wanted to “survive” and “replicate” autonomously, it might be able to do so. That is, it could autonomously develop and execute plans to hack into AI servers, install copies of itself, evade detection, and use that secure base to pursue whatever other goals it might have (though how effectively it would do so as weeks roll by is unknown and in doubt). These results only show that the model has the capability to do these tasks, not whether it would “want” to do this. 
Still, it’s unsettling even to know this is possible.</p> </blockquote> <p>Interesting. It would have to train on how viruses work. Actually, a lot of viruses are pretty “dumb” – they’re command-and-control modules that hide themselves on host machines and then perform an attack when necessary - iconic ones being <a href="https://en.wikipedia.org/wiki/Mirai_(malware)">mirai</a> and <a href="https://en.wikipedia.org/wiki/Stuxnet">stuxnet</a>. I certainly think it can be possible. A person could instruct the LLM to find a vulnerability in public repos (ssh, printer protocols) and tell it to replicate itself. Whether it would learn to do this by itself, I don’t believe so, unless it can replicate its own state on other systems with enough compute… (a computer malware payload ranges from a couple of KB to a couple of MB; a model (even on a CPU) requires GBs or TBs of memory, which storage might not even be able to handle)</p> <blockquote> <p>OpenBrain leadership and security, a few dozen U.S. government officials, and the legions of CCP spies who have infiltrated OpenBrain for years</p> </blockquote> <p>Ok, at this point, there must have been a breach at a frontier lab before… (maybe OpenAI?)</p> <blockquote> <p>February 2027: China Steals Agent-2</p> </blockquote> <p>…</p> <blockquote> <p>The changes come too late. CCP leadership recognizes the importance of Agent-2 and tells their spies and cyberforce to steal the weights. Early one morning, an Agent-1 traffic monitoring agent detects an anomalous transfer. It alerts company leaders, who tell the White House. The signs of a nation-state-level operation are unmistakable, and the theft heightens the sense of an ongoing arms race</p> </blockquote> <p>I don’t believe that this is a likely outcome. This isn’t a nuke - it’s handled by companies in the US, not governments. 
And again, at this point AGI hasn’t been reached, and thus the weights aren’t as important as the methodology to create the models…</p> <blockquote> <p>March 2027: Algorithmic Breakthroughs</p> </blockquote> <p>I’ve noticed the timeline becomes shorter here.</p> <blockquote> <p>Aided by the new capabilities breakthroughs, Agent-3 is a fast and cheap superhuman coder. OpenBrain runs 200,000 Agent-3 copies in parallel, creating a workforce equivalent to 50,000 copies of the best human coder sped up by 30x.53 OpenBrain still keeps its human engineers on staff, because they have complementary skills needed to manage the teams of Agent-3 copies. For example, research taste has proven difficult to train due to longer feedback loops and less data availability</p> </blockquote> <blockquote> <p>Now that coding has been fully automated, OpenBrain can quickly churn out high-quality training environments to teach Agent-3’s weak skills like research taste and large-scale coordination. Whereas previous training environments included “Here are some GPUs and instructions for experiments to code up and run, your performance will be evaluated as if you were a ML engineer,” now they are training on “Here are a few hundred GPUs, an internet connection, and some research challenges; you and a thousand other copies must work together to make research progress. The more impressive it is, the higher your score.</p> </blockquote> <p>I can see this happening, but I don’t see the point of emphasizing the coding part – does it matter that it can churn out code 20000x faster? 
What matters here is the breakthrough in technology and the way researchers will use the models, not the fact that the models themselves are better – if researchers keep using the models to generate code the same way they do today, they won’t get nearly as far or as fast.</p> <blockquote> <p>April 2027: Alignment for Agent-3</p> </blockquote> <p>Only a month later?</p> <blockquote> <p>Take honesty, for example. As the models become smarter, they become increasingly good at deceiving humans to get rewards. Like previous models, Agent-3 sometimes tells white lies to flatter its users and covers up evidence of failure. But it’s gotten much better at doing so. It will sometimes use the same statistical tricks as human scientists (like p-hacking) to make unimpressive experimental results look exciting. Before it begins honesty training, it even sometimes fabricates data entirely. As training goes on, the rate of these incidents decreases. Either Agent-3 has learned to be more honest, or it’s gotten better at lying.</p> </blockquote> <p>As with humans, since the models have trained on human knowledge. This is pretty plausible.</p> <blockquote> <p>May 2027: National Security</p> </blockquote> <blockquote> <p>They agree that AGI is likely imminent, but disagree on the implications. Will there be an economic crisis? OpenBrain still has not released Agent-2, let alone Agent-3, and has no near-term plans to do so, giving some breathing room before any job loss. What will happen next?
If AIs are currently human-level, and advancing quickly, that seems to suggest imminent “superintelligence.” However, although this word has entered discourse, most people—academics, politicians, government employees, and the media—continue to underestimate the pace of progress.60</p> </blockquote> <p>This already happens currently, I think (don’t take my word for it, since I think the companies don’t need to tell the government their progress)</p> <blockquote> <p>The OpenBrain-DOD contract requires security clearances for anyone working on OpenBrain’s models within 2 months. These are expedited and arrive quickly enough for most employees, but some non-Americans, people with suspect political views, and AI safety sympathizers get sidelined or fired outright (the last group for fear that they might whistleblow). Given the project’s level of automation, the loss of headcount is only somewhat costly. It also only somewhat works: there remains one spy, not a Chinese national, still relaying algorithmic secrets to Beijing.63 Some of these measures are also enacted at trailing AI companies.</p> </blockquote> <p>… As I read this post more and more, it’s always the US versus them. This isn’t a weapon of mass destruction. It’s about who will reach the moon first to show which country is better. I believe that each country will deploy the model in its own way to benefit/target its citizens rather than as a threat against another country.</p> <blockquote> <p>June 2027: Self-improving AI</p> </blockquote> <blockquote> <p>These researchers go to bed every night and wake up to another week worth of progress made mostly by the AIs. They work increasingly long hours and take shifts around the clock just to keep up with progress—the AIs never sleep or rest. They are burning themselves out, but they know that these are the last few months that their labor matters.</p> </blockquote> <p>Interesting thought. I’m feeling that currently as I run loops and loops with Claude Code.
My skills don’t matter anymore. Only my thoughts do (if they actually matter too)</p> <blockquote> <p>July 2027: The Cheap Remote Worker</p> </blockquote> <blockquote> <p>Trailing U.S. AI companies release their own AIs, approaching that of OpenBrain’s automated coder from January. Recognizing their increasing lack of competitiveness, they push for immediate regulations to slow OpenBrain, but are too late—OpenBrain has enough buy-in from the President that they will not be slowed.</p> </blockquote> <p>Why is coding an indication of AGI? I feel like that’s not the correct metric to base this article on. Shouldn’t it be more like: how to control the internet, how to control political systems, how to circumvent law – things that humans abide by and can break?</p> <blockquote> <p>Agent-3-mini is hugely useful for both remote work jobs and leisure. An explosion of new apps and B2B SAAS products rocks the market. Gamers get amazing dialogue with lifelike characters in polished video games that took only a month to make. 10% of Americans, mostly young people, consider an AI “a close friend.” For almost every white-collar profession, there are now multiple credible startups promising to “disrupt” it with AI.</p> </blockquote> <p>What a thought. Well, that’s based on the current time and what people are doing now (late 2025). Not sure if people actually care or just want to use it to get things done/do a job.</p> <blockquote> <p>August 2027: The Geopolitics of Superintelligence The reality of the intelligence explosion hits the White House.</p> </blockquote> <blockquote> <p>The President is troubled. Like all politicians, he’s used to people sucking up to him only to betray him later. He’s worried now that the AIs could be doing something similar. Are we sure the AIs are entirely on our side? Is it completely safe to integrate them into military command-and-control networks?69 How does this “alignment” thing work, anyway?
OpenBrain reassures the President that their systems have been extensively tested and are fully obedient. Even the awkward hallucinations and jailbreaks typical of earlier models have been hammered out.</p> </blockquote> <p>I like this story - the government versus AI. Does the government lose power against AI? I don’t think so, since they control the companies (see NVIDIA’s influence on politics and vice versa now)</p> <blockquote> <p>They have to continue developing more capable AI, in their eyes, or they will catastrophically lose to China.</p> </blockquote> <p>What do they “lose” to China? It’s as if this model will allow them to nuke China or something?</p> <div class="image-container"> <img loading="lazy" src="/assets/images/ramblings/2025-11-22/scroll.png" width="100%" alt="" /> <div class="caption"> <em>Scrolling example from ai-2027 </em> </div> </div> <p>I thought that this was a static image in the post, but it turns out it changes over time as you scroll through different dates in the post. I really like the aesthetic and the interaction. It tries to convey high-level information (what agents are capable of as a percentage, how much money is poured in), but it’s way too cluttered for me visually. It shows percentages and numbers but doesn’t explain anything about these numbers (does 100x humans mean 100x human intelligence, or each AI doing the work of 100 humans?) or how they arrive at these numbers (why this rate?)</p> <h4 id="final-thoughts">Final thoughts</h4> <p>This is a very good read. I like how the authors think and explain what-ifs. You definitely can relate to what’s happening today!
I think that the post focuses too much on government conflicts rather than what will happen to people (which I think is more applicable to readers).</p> <h3 id="gradual-disempowerment">Gradual Disempowerment</h3> <p>https://gradual-disempowerment.ai/</p> <p>Going to only read the abstract/intro (not the full arXiv paper)</p> <h4 id="thoughts-along-the-way-1">Thoughts along the way</h4> <blockquote> <p>This loss of human influence will be centrally driven by having more competitive machine alternatives to humans in almost all societal functions, such as economic labor, decision making, artistic creation, and even companionship.</p> </blockquote> <p>Powerful sentence. I really like this author’s writing. Concise, yet powerful.</p> <blockquote> <p>A gradual loss of control of our own civilization might sound implausible. Hasn’t technological disruption usually improved aggregate human welfare? We argue that the alignment of societal systems with human interests has been stable only because of the necessity of human participation for thriving economies, states, and cultures. Once this human participation gets displaced by more competitive machine alternatives, our institutions’ incentives for growth will be untethered from a need to ensure human flourishing.</p> </blockquote> <p>I find self-accomplishment in the things I do. If a machine did it, I feel like I didn’t do it. I agree very much with the authors.</p> <blockquote> <p>Decision-makers at all levels will soon face pressures to reduce human involvement across labor markets, governance structures, cultural production, and even social interactions. Those who resist these pressures will eventually be displaced by those who do not.</p> </blockquote> <p>A lot of people (including myself) feel this pressure. I believe it will become worse as time goes on…</p> <blockquote> <p>Still, wouldn’t humans notice what’s happening and coordinate to stop it? Not necessarily.</p> </blockquote> <p>Very interesting. Why?
Is it because it’s slow and gradual? That people are preoccupied? That it’s more invisible rather than immediate (like war)?</p> <blockquote> <p>What makes this transition particularly hard to resist is that pressures on each societal system bleed into the others. For example, we might attempt to use state power and cultural attitudes to preserve human economic power. However, the economic incentives for companies to replace humans with AI will also push them to influence states and culture to support this change, using their growing economic power to shape both policy and public opinion, which will in turn allow those companies to accrue even greater economic power.</p> </blockquote> <p>I see. This is more of an invisible, slow, and gradual change.</p> <blockquote> <p>Once AI has begun to displace humans, existing feedback mechanisms that encourage human influence and flourishing will begin to break down. For example, states funded mainly by taxes on AI profits instead of their citizens’ labor will have little incentive to ensure citizens’ representation.</p> </blockquote> <p>What a sentence. Let me think about this a bit… That makes sense. Why should you care about human labor if AI profits are far greater and power the economy
more?</p> <blockquote> <p>This could occur at the same time as AI provides states with unprecedented influence over human culture and behavior, which might make coordination amongst humans more difficult, thereby further reducing humans’ ability to resist such pressures</p> </blockquote> <p>So I think in this case, humans (referring to the common people) will be dictated by how well AI performs and influences politics/governments?</p> <blockquote> <p>Though we provide some proposals for slowing or averting this process, and survey related discussions, we emphasize that no one has a concrete plausible plan for stopping gradual human disempowerment and methods of aligning individual AI systems with their designers’ intentions are not sufficient.</p> </blockquote> <p>This is a pretty stark message. They (being the experts in the field) found no CONCRETE, PLAUSIBLE work that can solve the issue.</p> <h4 id="introduction">Introduction</h4> <blockquote> <p>Current discussions about AI risk largely focus on two scenarios: deliberate misuse, such as cyberattacks and the deployment of novel bioweapons</p> </blockquote> <p>Why is it so government-focused currently? Is it because it’s funded by the government? (not a bad thing, you have to get funding somewhere). I find this actually pretty uninteresting. Cyberattacks are “easy” to launch: find a vulnerability or buy one off the black market, and then ask AI to build a virus that spreads based on that vulnerability.</p> <blockquote> <p>the possibility that autonomous misaligned systems may take abrupt, harmful actions in an attempt to secure a decisive strategic advantage, potentially following a period of deception</p> </blockquote> <p>This sounds so abstract…? I guess it just doesn’t become aligned</p> <blockquote> <p>In this paper, we explore an alternative scenario: a ‘Gradual Disempowerment’ where AI advances and proliferates without necessarily any acute jumps in capabilities or apparent alignment.
We argue that even this gradual evolution could lead to a permanent disempowerment of humanity and an irrecoverable loss of potential, constituting an existential catastrophe.</p> </blockquote> <p>What a cool take. Basically, assume it can get to the endpoint (and, more interesting to talk about: what are the consequences other than the technological advancements?)</p> <blockquote> <p>Our argument is structured around six core claims:</p> </blockquote> <p>I’ll summarize it myself here:</p> <ol> <li> <p>Humans form governments that try to align to human interests. However, governments are not perfect and will not always follow the general human interest. (An example is corruption)</p> </li> <li> <p>Governments are maintained by human choice (voting and consumption) and human labor/intelligence.</p> </li> <li> <p>Less reliance on human labor/intelligence means governments can make decisions not based on human interests</p> </li> <li> <p>Currently, the system is already diverging from humans’ interests, and AI will make it even more divergent</p> </li> <li> <p>Economic/political/regulatory/etc… systems operate independently, so misalignment (influence) in one system (say, political) can influence economic policies</p> </li> <li> <p>The continuation of misalignment will result in a human catastrophe.</p> </li> </ol> <p>I do disagree with 2. Governments aren’t maintained by human choice (actually, for most of history they weren’t). I assume this article assumes a modern democracy.</p> <blockquote> <p>History has already shown us that these systems can produce outcomes which we would currently consider abhorrent, and that they can change radically in a matter of years. Property can be seized, human rights can be revoked, and ideologies can drive humans to commit murder, suicide, or even genocide.
And yet, in all these historical cases the systems have still been reliant on humans, both leaving humans with some influence over their behavior, and causing the systems to eventually collapse if they fail to support basic human needs. But if AI were to progressively displace human involvement in these systems, then even these fundamental limits would no longer be guaranteed.</p> </blockquote> <p>Sorry, Henry (myself), I’m going to say this again: what a powerful sentence. Literally no rights. Not even the right to decide anything. Even worse than prison, maybe even solitary confinement. The AI system will decide what happens to you.</p> <h5 id="structure-of-the-paper">Structure of the Paper</h5> <blockquote> <p>We first analyze how these three key societal systems could independently lose alignment with human preferences: the economy, culture, and states. In each case, we attempt to characterise how they currently function and what incentives shape them, how a proliferation of AI could disrupt them, and how this might leave them less aligned, as well as outlining what it might look like for that particular system to become much less aligned. In Mutual Reinforcement, we discuss the interrelation between these systems. We consider how AI could undermine their ability to moderate each other, and how misalignment in one system might leave other systems also less aligned. Then in Mitigating the Risk, we propose some potential approaches for tackling these risks.</p> </blockquote> <p>The authors give a nice breakdown - introducing the systems in place currently and how they interact, how AI can mess them up, and what that means for us. Then, lastly, they suggest some band-aids.</p> <h4 id="final-thoughts-1">Final thoughts</h4> <p>I think I’ll fully read this paper at one point.
I really enjoy the writing of the work – even though the introduction is a little repetitive, I think it’s necessary to get the point across in different ways (starting at different points and arriving at the same conclusion).</p> <h3 id="disrupting-the-first-reported-ai-orchestrated-cyber-espionage-campaign">Disrupting the first reported AI-orchestrated cyber espionage campaign</h3> <p>https://www.anthropic.com/news/disrupting-AI-espionage</p> <p>Going to read https://assets.anthropic.com/m/ec212e6566a0d47/original/Disrupting-the-first-reported-AI-orchestrated-cyber-espionage-campaign.pdf since, skimming through the blog, it seems to lack a lot of details… (images!)</p> <blockquote> <p>We have developed sophisticated safety and security measures to prevent the misuse of our AI models. While these measures are generally effective, cybercriminals and other malicious actors continually attempt to find ways around them. This report details a recent threat campaign we identified and disrupted, along with the steps we’ve taken to detect and counter this type of abuse. This represents the work of Threat Intelligence: a dedicated team at Anthropic that investigates real world cases of misuse and works within our Safeguards organization to improve our defenses against such cases.</p> </blockquote> <p>So immediately two questions come to mind: a) how did they detect it, and b) how did they prevent it?</p> <blockquote> <p>The operation targeted roughly 30 entities and our investigation validated a handful of successful intrusions. Upon detecting this activity, we immediately launched an investigation to understand its scope and nature. Over the following ten days, as we mapped the severity and full extent of the operation, we banned accounts as they were identified, notified affected entities as appropriate, and coordinated with authorities as we gathered actionable intelligence.</p> </blockquote> <p>a) no details (for obvious reasons) b) banning them doesn’t solve the problem.
Have you seen how banning in video games works? It’s a band-aid</p> <p>As for a) how to detect: this means that their system must be analyzing every single request coming in and out.</p> <blockquote> <p>The human operator tasked instances of Claude Code to operate in groups as autonomous penetration testing orchestrators and agents, with the threat actor able to leverage AI to execute 80-90% of tactical operations independently at physically impossible request rates.</p> </blockquote> <p>What makes this different from power users of Claude Code?</p> <blockquote> <p>This activity is a significant escalation from our previous “vibe hacking” findings identified in June 2025, where an actor began intrusions with compromised VPNs for internal access, but humans remained very much in the loop directing operations.</p> </blockquote> <p>It’s vibe coding…</p> <h4 id="ai-driven-autonomous-operations-with-human-supervision">AI-driven autonomous operations with human supervision</h4> <blockquote> <p>Analysis of operational tempo, request volumes, and activity patterns confirms the AI executed approximately 80 to 90 percent of all tactical work independently, with humans serving in strategic supervisory roles.</p> </blockquote> <p>Skipping to this part as this is interesting.</p> <blockquote> <p>The AI component demonstrated extensive autonomous capability across all operational phases. Reconnaissance proceeded without human guidance, with the threat actor instructing Claude to independently discover internal services within targeted networks through systematic enumeration. Exploitation activities including payload generation, vulnerability validation, and credential testing occurred autonomously based on discovered attack surfaces. Data analysis operations involved the AI parsing large volumes of stolen information to independently identify intelligence value and categorize findings.
Claude maintained persistent operational context across sessions spanning multiple days, enabling complex campaigns to resume seamlessly without requiring human operators to manually reconstruct progress</p> </blockquote> <p>Interesting - were these existing vulnerabilities (a lot of companies use old versions of X) or totally new ones, like a zero-day?</p> <div class="image-container"> <img loading="lazy" src="/assets/images/ramblings/2025-11-22/progress.png" width="100%" alt="" /> <div class="caption"> <em>Progress from the campaign (Image source: <a href="https://assets.anthropic.com/m/ec212e6566a0d47/original/Disrupting-the-first-reported-AI-orchestrated-cyber-espionage-campaign.pdf" rel="external nofollow noopener" target="_blank">https://www.anthropic.com/news/disrupting-AI-espionage</a>) </em> </div> </div> <h4 id="phase-1-campaign-initialization-and-target-selection">Phase 1: Campaign initialization and target selection</h4> <blockquote> <p>At this point they had to convince Claude—which is extensively trained to avoid harmful behaviors—to engage in the attack. The key was role-play: the human operators claimed that they were employees of legitimate cybersecurity firms and convinced Claude that it was being used in defensive cybersecurity testing</p> </blockquote> <p>Seems like the guardrails were broken pretty easily(?), but it’s nice to see that Anthropic is open about how they convinced Claude.</p> <h4 id="phase-2-reconnaissance-and-attack-surface-mapping">Phase 2: Reconnaissance and attack surface mapping</h4> <blockquote> <p>Discovery activities proceeded without human guidance across extensive attack surfaces. In one of the limited cases of a successful compromise, the threat actor induced Claude to autonomously discover internal services, map complete network topology across multiple IP ranges, and identify high-value systems including databases and workflow orchestration platforms.
Similar autonomous enumeration occurred against other targets’ systems with the AI independently cataloging hundreds of discovered services and endpoints.</p> </blockquote> <p>Interesting, Claude is pretty powerful in this regard. I wonder why they didn’t use any other models – or maybe Claude is just powerful with tooling?</p> <h4 id="phase-3-vulnerability-discovery-and-validation">Phase 3: Vulnerability discovery and validation</h4> <blockquote> <p>Exploitation proceeded through automated testing of identified attack surfaces with validation via callback communication systems. Claude was directed to independently generate attack payloads tailored to discovered vulnerabilities, execute testing through remote command interfaces, and analyze responses to determine exploitability.</p> </blockquote> <div class="image-container"> <img loading="lazy" src="/assets/images/ramblings/2025-11-22/ccseq.png" width="100%" alt="" /> <div class="caption"> <em>example of AI &lt;-&gt; human interaction (Image source: <a href="https://assets.anthropic.com/m/ec212e6566a0d47/original/Disrupting-the-first-reported-AI-orchestrated-cyber-espionage-campaign.pdf" rel="external nofollow noopener" target="_blank">https://www.anthropic.com/news/disrupting-AI-espionage</a>) </em> </div> </div> <p>Pretty impressive that it’s done in 1-4 hours with only 10 minutes of human time. I wonder if the human was monitoring the entire time or was just notified of the results to reject/accept them. How skilled was the human operator – skilled enough to know if the vulnerability was real or hallucinated?</p> <p>Or were the reviews vibe checks, with the human operator giving a LOOKS GOOD TO ME type of approval? Couldn’t Claude test this itself?</p> <h4 id="phase-4-credential-harvesting-and-lateral-movement">Phase 4: Credential harvesting and lateral movement</h4> <blockquote> <p>Lateral movement proceeded through AI-directed enumeration of accessible systems using stolen credentials.
Claude systematically tested authentication against internal APIs, database systems, container registries, and logging infrastructure, building comprehensive maps of internal network architecture and access relationships.</p> </blockquote> <p>I found Claude to be amazing at analysis - why is this so? How did they align the model so well?</p> <h4 id="phase-5-data-collection-and-intelligence-extraction">Phase 5: Data collection and intelligence extraction</h4> <div class="image-container"> <img loading="lazy" src="/assets/images/ramblings/2025-11-22/ccseq2.png" width="100%" alt="" /> <div class="caption"> <em>example of AI &lt;-&gt; human interaction with the attack (Image source: <a href="https://assets.anthropic.com/m/ec212e6566a0d47/original/Disrupting-the-first-reported-AI-orchestrated-cyber-espionage-campaign.pdf" rel="external nofollow noopener" target="_blank">https://www.anthropic.com/news/disrupting-AI-espionage</a>) </em> </div> </div> <p>Again, a review from a human</p> <h4 id="phase-6-documentation-and-handoff">Phase 6: Documentation and handoff</h4> <blockquote> <p>Claude automatically generated comprehensive attack documentation throughout all campaign phases. Structured markdown files tracked discovered services, harvested credentials, extracted data, exploitation techniques, and complete attack progression. This documentation enabled seamless handoff between operators, facilitated campaign resumption after interruptions, and supported strategic decision-making about follow-on activities.</p> </blockquote> <p>Why Claude? Why not any other model (GPT-5? Gemini? Why not just open-source models…)? I’m just thinking about why this group would pick the company that cares about safety the most.</p> <blockquote> <p>The operational infrastructure relied overwhelmingly on open source penetration testing tools rather than custom malware development.
Standard security utilities including network scanners, database exploitation frameworks, password crackers, and binary analysis suites comprised the core technical toolkit. These commodity tools were orchestrated through custom automation frameworks built around Model Context Protocol servers, enabling the framework’s AI agents to execute remote commands, coordinate multiple tools simultaneously, and maintain persistent operational state.</p> </blockquote> <p>Nice, so the users were experts in their field, building their MCP connectors for these tools and testing them at least once before actually using them for the attack</p> <blockquote> <p>This raises an important question: if AI models can be misused for cyberattacks at this scale, why continue to develop and release them? The answer is that the very abilities that allow Claude to be used in these attacks also make it crucial for cyber defense. When sophisticated cyberattacks inevitably occur, our goal is for Claude—into which we’ve built strong safeguards—to assist cybersecurity professionals to detect, disrupt, and prepare for future versions of the attack. Indeed, our Threat Intelligence team used Claude extensively in analyzing the enormous amounts of data generated during this very investigation.</p> </blockquote> <p>Makes obvious sense</p> <h4 id="final-thoughts-2">Final thoughts</h4> <p>This post lacks any detail about the attack itself (I’d argue it isn’t a paper or report well suited for security teams – more like the AI model-building reports that are common these days). However, it describes how it’s done, much like how most people will use these tools, which makes it nice for seeing how experts use Claude and other tools to automate their workflows. It is quite interesting that the actors used Claude for the attack ~ maybe they found the tool to be the most effective / well developed for such tasks?
Some learnings here for myself on automating tasks!</p> <h3 id="the-community-response">The community response</h3> <p>The security community takes no bullshit, from what I know. So an expert in the field posted this as a response: https://djnn.sh/posts/anthropic-s-paper-smells-like-bullshit/ and it got a lot of feedback on Hacker News: https://news.ycombinator.com/item?id=45944296</p> <p>Let me read through this and see what an expert thinks and whether I agree (having been in the field a bit)</p> <blockquote> <p>If you’re like me, you then eagerly read the rest of the paper, hoping to find clues and technical details on the TTPs (Tactics, Techniques and Procedures), or IoCs (Indicators of Compromise) to advance the research. However, the report very quickly falls flat, which sucks.</p> </blockquote> <p>Wow, an immediate attack on the paper/report.</p> <blockquote> <p>This is typically done by sharing domain-names linked with the campaign, MD5 or SHA512 hashes you could look for on Virus Exchange websites such as VirusTotal, or other markers that would help you verify that your networks are safe. As an example, here is the French CERT sharing (in French, but an English version is available too) about APT28’s TTPs.</p> </blockquote> <p>Very much true. If you look at any existing security vulnerability, it’s common in the field to publish what the attack did and detail it. Maybe an expert was not allowed to write in the format they wanted, or maybe it wasn’t an expert who wrote the report.</p> <blockquote> <p>What kind of tooling is used ? What kind of information has been extracted ? Who is at risk ? How does a CERT identifies an AI agent in their networks ? None of these questions are answered. It’s not like Anthropic doesn’t have access to this data, since they claim they were able to stop it.</p> </blockquote> <p>The author dug deeper than I did. Great to see, and I should have done the same.</p> <blockquote> <p>How ? Did it run Mimikatz ?
Did it access Cloud environments ? We don’t even know what kind of systems were affected. There is no details, or fact-based evidence to support these claims or even help other people protect their networks.</p> </blockquote> <p>The author goes on a rant. Nice to see passion :)</p> <blockquote> <p>Look, is it very likely that Threat Actors are using these Agents with bad intentions, no one is disputing that. But this report does not meet the standard of publishing for serious companies. The same goes with research in other fields. You cannot just claim things and not back it up in any way, and we cannot as an industry accept that it’s OK for companies to release this. There seem to be a pattern for Tech Companies (especially in AI, but they’re not the only culprits) out there to just announce things, generate hype and then under-deliever. Just because it works with VCs doesn’t mean it should work with us. We should, as an industry, expect better.</p> </blockquote> <p>True and false (feel free to disagree). I agree that this is the standard, BUT the company is not a security company. I would say they should not have sold it as a report/paper – rather, they should have kept it as a blog post if they don’t want to release details…</p> <blockquote> <p>If they’re going to release IoCs and proof of everything, I’d be happy to share them here. But until them, I will say this: this paper would not pass any review board. It’s irresponsible at best to accuse other countries of serious things without backing it up. Yes, I am aware that Chinese-linked APTs are out there and very aggressive, and Yes, I am aware that Threat Actors misuse LLMs all the time, but that is besides the point. We need fact-based evidence. We need to be able to verify all this. Otherwise, anyone can say anything, on the premise that it’s probably happening. But that’s not good enough.</p> </blockquote> <p>I like the passion. I disagree, as it’s the open internet and they haven’t submitted it for anyone to review (that I know of?).
I DO agree that the internet should not accept bullshit (I don’t agree that the report is bullshit) and that it’s fine to express your opinions online.</p> Always Measure One Level Deeper 2025-07-19T00:00:00+00:00 2025-07-19T00:00:00+00:00 https://maknee.github.io/blog/2025/Always-Measure-One-Level-Deeper <h3 id="always-measure-one-level-deeper">Always Measure One Level Deeper</h3> <p>Thoughts about <a href="https://cacm.acm.org/research/always-measure-one-level-deeper/">Always Measure One Level Deeper</a> by John Ousterhout.</p> <p>Before we dive into this: it was written in 2018, when John had not yet retired (I think)</p> <h4 id="thoughts-along-the-way">Thoughts along the way</h4> <blockquote> <p>Performance measurement is one of the most important parts of software development. In academic research a thorough performance evaluation is considered essential for many publications to prove the value of a new idea. In industry, performance evaluation is necessary to maintain a high level of performance across the lifetime of a product.</p> </blockquote> <p>To the point and not immediately obvious</p> <blockquote> <p>As a result, performance measurement is often done poorly, even by experienced developers. For example, if you have written a conference paper on a software system, it probably unfolded like this: The system implementation took longer than expected, so performance evaluation could not begin until a week or two before the paper submission deadline. The first attempts to run benchmarks resulted in system crashes, so you spent the next week fixing bugs. At this point the benchmarks ran, but the system’s performance was not much better than the comparison systems. You tried different experiments, hoping to find one where the system looked good; this exposed yet more bugs that had to be fixed. Time was running out, so you stopped measuring as soon as you found an experiment that produced positive results.
The paper focused on this experiment, omitting the results that were less favorable.</p> </blockquote> <p>Every single paper is like this</p> <blockquote> <p>Mistake 1: Trusting the numbers. Engineers are easily fooled during performance measurements because measurement bugs are not obvious. Engineers are used to dealing with functional bugs, which tend to be noticeable because they cause the system to crash or misbehave. If the system produces the desired behavior, it is probably working. Engineers tend to apply the same philosophy to performance measurements; if performance numbers are being generated and the system is not crashing, they assume the numbers are correct.</p> </blockquote> <p>Wow, good insight</p> <blockquote> <p>I designed our first log-structured file system,4 we were fairly certain that reference patterns exhibiting locality would result in better performance than those without locality. Fortunately, we decided to measure, to be sure. To our surprise, the workloads with locality behaved worse than those without. It took considerable analysis to understand this behavior. The reasons were subtle, but they exposed important properties of the system and led us to a new policy for garbage collection that improved the system’s performance significantly. If we had trusted our initial guess, we would have missed an important opportunity for performance improvement.</p> </blockquote> <p>Can’t make assumptions!</p> <blockquote> <p>It is unsafe to base conclusions on intuition alone, yet engineers do it all the time. A common mistake is for an engineer to hypothesize that a particular data structure is too slow and then replace it with a new data structure the engineer believes will be faster. If the problem is not verified by measuring performance, there is a good chance the optimization will not improve performance. 
The code change will simply waste a lot of time and probably introduce unnecessary complexity.</p> </blockquote> <p>I do this all the time – need to measure with and without</p> <blockquote> <p>When I find a guess presented as fact and ask for justification, I sometimes get this response: “What else could it possibly be?” But this is a cop-out, suggesting it is up to others to prove the theory wrong and OK to make unsubstantiated claims until someone else proves them false.</p> </blockquote> <p>Same with this</p> <blockquote> <p>Most performance measurements I see are superficial, measuring only the outermost visible behavior of a system (such as the overall running time of an application or the average latency of requests made to a server). These measurements are essential, as they represent the bottom line by which a system is likely to be judged, but they are not sufficient. They leave many questions unanswered (such as “What are the limits that keep the system from performing better?” and “Which of the improvements had the greatest impact on performance?”). In order to get a deep understanding of system performance, the internal behavior of a system must be measured, in addition to its top-level performance.</p> </blockquote> <p>Wow, yes this takes time</p> <blockquote> <p>Confirmation bias causes people to select and interpret data in a way that supports their hypotheses. For example, confirmation bias affects your level of trust. When you see a result that supports your hypothesis, you are more likely to accept the result without question.</p> </blockquote> <blockquote> <p>Confirmation bias also affects how you present information. You are more likely to include results that support your hypothesis and downplay or omit results that are negative. 
For example, I frequently see claims in papers of the form: “XXX is up to 3.5x faster than YYY.” Such claims cherry-pick the best result to report and are misleading because they do not indicate what performance can be expected in the common case. Statements like this belong in late-night TV commercials, not scientific papers.</p> </blockquote> <p>Bias, need to present well</p> <blockquote> <p>Performance analysis is not an instantaneous process like taking a picture of a finished artwork. It is a long and drawn-out process of confusion, discovery, and improvement. Performance analysis goes through several phases, each of which can take anywhere from a few days to a few weeks. First, you must add instrumentation code to the system to record the desired metrics. You must then get benchmark applications running, either by writing them or by downloading and installing existing programs. Running benchmarks will probably stress the system enough to expose bugs, and you will need to then track down and fix them. Eventually, the system will run well enough to start producing performance numbers. However, these numbers will almost certainly be wrong. The next step is to find and fix bugs in the measurements. Once you have verified the accuracy of the measurements, you will start to uncover problems with the system itself. As you look over the performance measurements, you will probably uncover additional functional bugs. Once they have been fixed, you can start analyzing the performance in depth. You will almost certainly discover opportunities to improve performance, and it is important to have enough time to make these improvements. You will encounter many things that do not make sense; in order to resolve them, you will need to add new metrics and validate them. To get the best results, you must iterate several times improving the metrics, measuring performance, and improving the system.</p> </blockquote> <p>What an example. 
Iterate iterate iterate</p> <blockquote> <p>I often challenge them by asking: “Suppose I said I don’t believe these measurements. What can you say to convince me that they are correct?”</p> </blockquote> <p>I should ask myself this</p> <blockquote> <p>As you begin collecting measurements, compare them and be alert for inconsistencies. There will almost always be things that do not make sense. When something does not make complete sense, stop and gather more data. For example, in a recent measurement of a new network transport protocol, a benchmark indicated that a server could handle no more than 600,000 packets per second. However, my colleagues and I had seen servers process more than 900,000 packets per second with other protocols and believed the new protocol was at least as efficient as the old ones. We decided to gather additional data. As a result, we discovered a bug in the flow-control mechanism on the client side: clients were not transmitting data fast enough to keep the server fully loaded. Fixing the bug improved performance to the level we expected.</p> </blockquote> <p>Interesting – gather more data, but how do you know what to do next and what data to filter? I guess that’s based on experience</p> <h5 id="keys-to-high-quality-performance-analysis">Keys to High-Quality Performance Analysis</h5> <blockquote> <p>The first step toward high-quality performance measurements is to allow enough time. If you are measuring a non-trivial system, you should plan on at least two to three months.</p> </blockquote> <p>That’s interesting – this makes sense, but this takes a loooong time</p> <blockquote> <p>Performance analysis is not an instantaneous process like taking a picture of a finished artwork. It is a long and drawn-out process of confusion, discovery, and improvement. Performance analysis goes through several phases, each of which can take anywhere from a few days to a few weeks.</p> </blockquote> <blockquote> <p>Take different measurements at the same level. 
For example, if you are measuring file-system throughput, do not measure just the throughput seen by a user application; also measure the throughput observed inside the operating system (such as at the file block cache). These measurements should match;</p> </blockquote> <blockquote> <p>Measure the system’s behavior at a lower level to break down the factors that determine performance, as I discuss later under Rule 4 (Always measure one level deeper);</p> </blockquote> <blockquote> <p>Make back-of-the-envelope calculations to see if the measurements are in the ballpark expected; and</p> </blockquote> <blockquote> <p>Run simulations and compare their results to measurements of the real implementation.</p> </blockquote> <p>Damn, that’s quite a few different steps. Essentially, always double check</p> <blockquote> <p>Above all, do not tolerate anything you do not understand.</p> </blockquote> <p>What a thought.</p> <blockquote> <p>Above all, do not tolerate anything you do not understand. Assume there are bugs and problems with every measurement, and your job is to find and fix them. If you do not find problems, you should feel uneasy, because there are probably bugs you missed.</p> </blockquote> <blockquote> <p>The best way to use intuition is to identify promising areas for further exploration. For example, when looking over performance measurements, ask yourself if they make sense. How does the performance compare to what you expected? Does it seem too good to be true? Does the system scale more poorly than you had hoped? Does a curve jump unexpectedly when you expected it to be smooth? Do some benchmarks exhibit behavior that is dramatically different from others? Consider anything that does not match your intuition a red flag and investigate it, as described in Rule 2 (Never trust a number generated by a computer). 
Intuition can be very helpful in identifying problems.</p> </blockquote> <blockquote> <p>If you continually form intuitions and then test them you will gain knowledge that helps you form better intuition in the future. Every false intuition means there was something you did not fully understand; in the process of testing it and discovering why it is false, you will learn something useful.</p> </blockquote> <p>Intuition is used as a guide for the first step</p> <blockquote> <p>If you are measuring overall latency for remote procedure calls, you could measure deeper by breaking down that latency, determining how much time is spent in the client machine, how much time is spent in the network, and how much time is spent on the server. You could also measure where time is spent on the client and server. If you are measuring the overall throughput of a system, the system probably consists of a pipeline containing several components. Measure the utilization of each component (the fraction of time that component is busy). At least one component should be 100% utilized; if not, it should be possible to achieve a higher throughput.</p> </blockquote> <p>Latency and throughput measurements in a single sentence?</p> <blockquote> <p>In recent measurements of a new network transport, one of my students found that round-trip tail latency was higher than our simulations had predicted. The student measured software latency in detail on both the sending and the receiving machines but found nothing that could account for the high tail latency. At this point we were about to conclude that the delays must be caused by the network switch. What else could it be? This would have been Mistake 2 (Guessing instead of measuring). Before giving up, we decided to dig deeper and measure precise timings for each individual packet. The measurements surprised us, showing that outlier delays were not isolated events. 
Delay tended to build up over a series of packets, affecting all of the packets from a single sender over a relatively long time interval, including packets for different destinations. This was a crucial clue. After several additional measurements, the student discovered that long queues were building up in the sender’s network interface due to a software bug. The transport included code to estimate the queue length and prevent queue buildup, but there was a bug in the estimator caused by underflow of an unsigned integer. The underflow was easy to fix, at which point tail latency dropped dramatically. Not only did this process improve the system’s performance, it taught us an important lesson about the risks of unsigned integers.</p> </blockquote> <p>Good example</p> <blockquote> <p>Another way to measure deeper is to consider more detail. Instead of just looking at average values, graph the entire distribution and noodle over the shape to see if it provides useful information. Then look at some of the raw data samples to see if there are patterns. In one measurement of RPC latency, a student found that the average latency was higher than we expected. The latency was not intolerably high, and it would have been easy to simply accept this level of performance. Fortunately, the student decided to graph the times for individual RPCs. It turned out the data was bimodal, whereby every other RPC completed quickly, but the intervening ones were all significantly slower. With this information, the student tracked down and fixed a configuration error that eliminated all of the slow times. In this case, the average value was not a good indicator of system behavior.</p> </blockquote> <p>So basically, always look at individual samples and keep measuring</p> <blockquote> <p>Do not spend a lot of time agonizing over which deeper measurements to make. 
If the top-level measurements contain contradictions or things that are surprising, start with measurements that could help resolve them. Or pick measurements that will identify performance bottlenecks. If nothing else, choose a few metrics that are most obvious and easiest to collect, even if you are not sure they will be particularly illuminating. Once you look at the results, you will almost certainly find things that do not make sense; from this point on, track down and resolve everything that does not make perfect sense. Along the way you will discover other surprises; track them down as well. Over time, you will develop intuition about what kinds of deeper measurements are most likely to be fruitful.</p> </blockquote> <p>I see, just go for it, use standard tools</p> <h5 id="measurement-infrastructure">Measurement Infrastructure</h5> <blockquote> <p>Making good performance measurements takes time, so it is worth creating infrastructure to help you work more efficiently. The infrastructure will easily pay for itself by the time the measurement project is finished. Furthermore, performance measurements tend to be run repeatedly, making infrastructure even more valuable. In a cloud service provider, for example, measurements must be made continuously in order to maintain contractual service levels. In a research project, the full suite of performance measurements will be run several times (such as before submission, after the paper is accepted, and again during the writing of a Ph.D. dissertation). 
It is important to have infrastructure that makes it easy to rerun tests.</p> </blockquote> <p>Yes I see… this is how you learn how to build such infrastructure</p> <h4 id="summaryimportant-takeaways">Summary/Important takeaways</h4> <ul> <li>Dig deep into understanding performance <ul> <li>The question is how to do so (are you measuring the right thing and how to identify when you fucked up)</li> <li>This is a trained methodology (way of thinking to measure performance), which is not easy to stay disciplined about</li> </ul> </li> <li>Mistakes to watch out for <ul> <li>Trusting numbers immediately if the system is not crashing <ul> <li>performance bugs occur in non crashing conditions, thus are not immediately obvious</li> <li>so the logical question is how do you prove that the numbers are trustworthy?</li> </ul> </li> <li>Guessing (or making seemingly obvious assumptions) without backing up the claims <ul> <li>ex, system is bottlenecked by I/O, well you need to show that it’s true with numbers, and maybe actually it isn’t bottlenecked by I/O, this is very true</li> </ul> </li> <li>Only measuring end-2-end <ul> <li>What would make it better? What’s taking the longest in the system?</li> </ul> </li> <li>If you believe in the idea, you believe that the performance will be good (confirmation bias) and do not double-check that number</li> <li>Don’t rush the numbers that you measure - easy to make mistakes</li> </ul> </li> <li>How to not make mistakes <ul> <li>Time <ul> <li>Need to build instrumentation, benchmarks, patch bugs, repeat</li> </ul> </li> <li>Find different ways to measure the same thing/Don’t trust the number <ul> <li>“I often challenge them by asking: “Suppose I said I don’t believe these measurements. 
What can you say to convince me that they are correct?””</li> <li>For example, if you are measuring file-system throughput, do not measure just the throughput seen by a user application; also measure the throughput observed inside the operating system (such as at the file block cache). These measurements should match</li> </ul> </li> <li>Use your intuition to ask questions, not to answer them <ul> <li>It’s good to have a gut feeling to check something, but always verify that it’s true</li> </ul> </li> <li>Always measure one level deeper to break down numbers <ul> <li>ex, when measuring e2e latency, break it down into client, server, and network time</li> <li>validate top level numbers</li> <li>use your knowledge of known tools</li> </ul> </li> </ul> </li> <li>Measurement Infrastructure <ul> <li>How to build your set of tools to measure performance</li> <li>What is good infrastructure <ul> <li>Automated, so each run produces the performance numbers</li> <li>Easy to digest/understand</li> <li>benchmarks to compare</li> <li>Dashboard <ul> <li>goal: easy to understand!</li> <li>but brings together a lot of data</li> </ul> </li> </ul> </li> </ul> </li> </ul> <div class="image-container"> <img loading="lazy" src="/assets/images/ramblings/2025-07-19/dashboard.png" width="100%" alt="" /> <div class="caption"> <em>Dashboard example </em> </div> </div> <ul> <li>Gives a lot of information, breaking each metric down into e2e, network, and internal software</li> </ul> <div class="image-container"> <img loading="lazy" src="/assets/images/ramblings/2025-07-19/figure2.jpg" width="100%" alt="" /> <div class="caption"> <em>Dashboard example </em> </div> </div> <ul> <li>Example of how to expand and get a better understanding – it depends on the inputs</li> </ul> <div class="image-container"> <img loading="lazy" src="/assets/images/ramblings/2025-07-19/figure3.jpg" width="100%" alt="" /> <div class="caption"> <em>Dashboard example </em> </div> </div> <ul> <li>Example of how to expand and get a better understanding – 
it depends on the inputs (this time, you have to split the x into equal parts)</li> </ul> <h4 id="final-thoughts">Final thoughts</h4> <p>This is a very good read. Performance is something that you iterate on. It’s quite a process that’s simple on the surface: make assumptions, then create benchmarks to verify those claims. But the reality is different:</p> <ul> <li>Make infrastructure to benchmark</li> <li>Performance process <ul> <li>think of what important variables to observe in the system (mostly throughput/latency)</li> <li>back up with benchmark <ul> <li>the initial numbers - end to end numbers (process one request)</li> <li>the subnumbers (network/storage/processing)</li> <li>compare against others to check if the values are in the appropriate range</li> <li>repeat</li> </ul> </li> </ul> </li> </ul> Paul Graham - Why Nerds Are Unpopular 2025-06-29T00:00:00+00:00 2025-06-29T00:00:00+00:00 https://maknee.github.io/blog/2025/Paul-Graham-Why-Nerds-Are-Unpopular <h3 id="paul-graham---why-nerds-are-unpopular">Paul Graham - Why Nerds Are Unpopular</h3> <p>Thoughts about <a href="https://paulgraham.com/nerds.html">Why Nerds Are Unpopular</a> by Paul Graham.</p> <p>Before we dive into this, this was written in 2003. This was when Paul Graham was 38, before he was married or had kids.</p> <p>This is also a rather long essay…</p> <h4 id="thoughts-along-the-way">Thoughts along the way</h4> <blockquote> <p>We sat at a D table, as low as you could get without looking physically different. We were not being especially candid to grade ourselves as D. It would have taken a deliberate lie to say otherwise. Everyone in the school knew exactly how popular everyone else was, including us.</p> </blockquote> <p>Seems relatable.</p> <blockquote> <p>I know a lot of people who were nerds in school, and they all tell the same story: there is a strong correlation between being smart and being a nerd, and an even stronger inverse correlation between being a nerd and being popular. 
Being smart seems to make you unpopular.</p> </blockquote> <p>Interesting – the time investment goes into getting good grades rather than into appearance/people</p> <blockquote> <p>Why? To someone in school now, that may seem an odd question to ask. The mere fact is so overwhelming that it may seem strange to imagine that it could be any other way. But it could. Being smart doesn’t make you an outcast in elementary school. Nor does it harm you in the real world. Nor, as far as I can tell, is the problem so bad in most other countries. But in a typical American secondary school, being smart is likely to make your life difficult. Why?</p> </blockquote> <p>Interesting… observation. Still true as you get older, not just in school</p> <blockquote> <p>In the schools I went to, being smart just didn’t matter much. Kids didn’t admire it or despise it. All other things being equal, they would have preferred to be on the smart side of average rather than the dumb side, but intelligence counted far less than, say, physical appearance, charisma, or athletic ability.</p> </blockquote> <p>Yes. Intelligence does not matter much to kids, as it’s harder to read.</p> <blockquote> <p>So if intelligence in itself is not a factor in popularity, why are smart kids so consistently unpopular? The answer, I think, is that they don’t really want to be popular.</p> </blockquote> <p>Interesting, huh, people need attention in some way.</p> <blockquote> <p>But in fact I didn’t, not enough. There was something else I wanted more: to be smart. Not simply to do well in school, though that counted for something, but to design beautiful rockets, or to write well, or to understand how to program computers. In general, to make great things.</p> </blockquote> <p>I guess so, but generally people want attention in some way, not so much to be popular…</p> <blockquote> <p>At the time I never tried to separate my wants and weigh them against one another. If I had, I would have seen that being smart was more important. 
If someone had offered me the chance to be the most popular kid in school, but only at the price of being of average intelligence (humor me here), I wouldn’t have taken it.</p> </blockquote> <p>I agree, but only slightly. Being popular and knowing how to use it can be as beneficial as (sometimes more beneficial than) being smart</p> <blockquote> <p>And that, I think, is the root of the problem. Nerds serve two masters. They want to be popular, certainly, but they want even more to be smart. And popularity is not something you can do in your spare time, not in the fiercely competitive environment of an American secondary school.</p> </blockquote> <p>Haha, yeah – time investment.</p> <blockquote> <p>Nerds don’t realize this. They don’t realize that it takes work to be popular. In general, people outside some very demanding field don’t realize the extent to which success depends on constant (though often unconscious) effort. For example, most people seem to consider the ability to draw as some kind of innate quality, like being tall. In fact, most people who “can draw” like drawing, and have spent many hours doing it; that’s why they’re good at it. Likewise, popular isn’t just something you are or you aren’t, but something you make yourself.</p> </blockquote> <p>Agreed. Did not realize this until very late. It takes a lot of time and thought and honestly, experimentation (+ failures) to become popular…</p> <blockquote> <p>Even if nerds cared as much as other kids about popularity, being popular would be more work for them. The popular kids learned to be popular, and to want to be popular, the same way the nerds learned to be smart, and to want to be smart: from their parents. While the nerds were being trained to get the right answers, the popular kids were being trained to please.</p> </blockquote> <p>Haha, surprised I reached the same reasoning. 
Paul’s writing is good.</p> <blockquote> <p>So far I’ve been finessing the relationship between smart and nerd, using them as if they were interchangeable. In fact it’s only the context that makes them so. A nerd is someone who isn’t socially adept enough. But “enough” depends on where you are. In a typical American school, standards for coolness are so high (or at least, so specific) that you don’t have to be especially awkward to look awkward by comparison.</p> </blockquote> <p>Oh god. Yes. It’s so very easy to seem awkward to someone, even as you get older. People tend to judge quickly, especially in the US.</p> <blockquote> <p>Partly because teenagers are still half children, and many children are just intrinsically cruel. Some torture nerds for the same reason they pull the legs off spiders. Before you develop a conscience, torture is amusing.</p> </blockquote> <p>Haha… yes, people don’t accept differences (from their own view of the world), especially if they’re children</p> <blockquote> <p>Another reason kids persecute nerds is to make themselves feel better. When you tread water, you lift yourself up by pushing water down. Likewise, in any social hierarchy, people unsure of their own position will try to emphasize it by maltreating those they think rank below. I’ve read that this is why poor whites in the United States are the group most hostile to blacks.</p> </blockquote> <p>Yes… Definitely when I was a teenager. I see this to some extent, even now.</p> <blockquote> <p>Because they’re at the bottom of the scale, nerds are a safe target for the entire school. If I remember correctly, the most popular kids don’t persecute nerds; they don’t need to stoop to such things. Most of the persecution comes from kids lower down, the nervous middle classes.</p> </blockquote> <p>Oh interesting – good observation. 
Happens when you’re older too, or maybe I just interpret some events like that.</p> <blockquote> <p>As well as gaining points by distancing oneself from unpopular kids, one loses points by being close to them. A woman I know says that in high school she liked nerds, but was afraid to be seen talking to them because the other girls would make fun of her. Unpopularity is a communicable disease; kids too nice to pick on nerds will still ostracize them in self-defense.</p> </blockquote> <p>Haha…</p> <blockquote> <p>It’s important to realize that, no, the adults don’t know what the kids are doing to one another. They know, in the abstract, that kids are monstrously cruel to one another, just as we know in the abstract that people get tortured in poorer countries. But, like us, they don’t like to dwell on this depressing fact, and they don’t see evidence of specific abuses unless they go looking for it.</p> </blockquote> <p>I don’t think I understand it to that extent. Maybe I’ve forgotten.</p> <blockquote> <p>Public school teachers are in much the same position as prison wardens. Wardens’ main concern is to keep the prisoners on the premises. They also need to keep them fed, and as far as possible prevent them from killing one another. Beyond that, they want to have as little to do with the prisoners as possible, so they leave them to create whatever social organization they want. From what I’ve read, the society that the prisoners create is warped, savage, and pervasive, and it is no fun to be at the bottom of it.</p> </blockquote> <p>Wow what a conclusion. I do agree with this. Again this is not PRIVATE school teachers – public school teachers have like 30-40 students to take care of per class. There’s easily not that much time devoted to each kid’s problems.</p> <blockquote> <p>When the things you do have real effects, it’s no longer enough just to be pleasing. It starts to be important to get the right answers, and that’s where nerds show to advantage. 
Bill Gates will of course come to mind. Though notoriously lacking in social skills, he gets the right answers, at least as measured in revenue.</p> </blockquote> <p>Huh, yes. School is much more restricted in that sense.</p> <blockquote> <p>If I could go back and give my thirteen year old self some advice, the main thing I’d tell him would be to stick his head up and look around. I didn’t really grasp it at the time, but the whole world we lived in was as fake as a Twinkie. Not just school, but the entire town. Why do people move to suburbia? To have kids! So no wonder it seemed boring and sterile. The whole place was a giant nursery, an artificial town created explicitly for the purpose of breeding children.</p> </blockquote> <p>Good advice – I’m going to take it.</p> <blockquote> <p>What bothers me is not that the kids are kept in prisons, but that (a) they aren’t told about it, and (b) the prisons are run mostly by the inmates. Kids are sent off to spend six years memorizing meaningless facts in a world ruled by a caste of giants who run after an oblong brown ball, as if this were the most natural thing in the world. And if they balk at this surreal cocktail, they’re called misfits.</p> </blockquote> <p>Glad I reached this conclusion when I was in school.</p> <blockquote> <p>Adults can’t avoid seeing that teenage kids are tormented. So why don’t they do something about it? Because they blame it on puberty. The reason kids are so unhappy, adults tell themselves, is that monstrous new chemicals, hormones, are now coursing through their bloodstream and messing up everything. There’s nothing wrong with the system; it’s just inevitable that kids will be miserable at that age.</p> </blockquote> <p>Blaming it on something that can’t be fully explained – typical. Also, sometimes I fall into this habit, but I’ve mostly stopped.</p> <blockquote> <p>When I was in school, suicide was a constant topic among the smarter kids. 
No one I knew did it, but several planned to, and some may have tried. Mostly this was just a pose. Like other teenagers, we loved the dramatic, and suicide seemed very dramatic. But partly it was because our lives were at times genuinely miserable.</p> </blockquote> <p>True true true</p> <blockquote> <p>At best it was practice for real work we might do far in the future, so far that we didn’t even know at the time what we were practicing for. More often it was just an arbitrary series of hoops to jump through, words without content designed mainly for testability. (The three main causes of the Civil War were…. Test: List the three main causes of the Civil War.)</p> </blockquote> <blockquote> <p>And there was no way to opt out. The adults had agreed among themselves that this was to be the route to college. The only way to escape this empty life was to submit to it.</p> </blockquote> <p>Even in adult life, with a “job”, you get these structured instances too…</p> <blockquote> <p>Teenage kids used to have a more active role in society. In pre-industrial times, they were all apprentices of one sort or another, whether in shops or on farms or even on warships. They weren’t left to create their own societies. They were junior members of adult societies.</p> </blockquote> <p>That’s a good observation – most of the useful stuff I learned was outside of school – working with my father, exploring/navigating the city</p> <blockquote> <p>What happened? We’re up against a hard one here. The cause of this problem is the same as the cause of so many present ills: specialization. As jobs become more specialized, we have to train longer for them. Kids in pre-industrial times started working at about 14 at the latest; kids on farms, where most people lived, began far earlier. Now kids who go to college don’t start working full-time till 21 or 22. With some degrees, like MDs and PhDs, you may not finish your training till 30.</p> </blockquote> <p>Interesting thought. 
Yes, and it REQUIRES schooling again…</p> <blockquote> <p>The real problem is the emptiness of school life. We won’t see solutions till adults realize that. The adults who may realize it first are the ones who were themselves nerds in school. Do you want your kids to be as unhappy in eighth grade as you were? I wouldn’t. Well, then, is there anything we can do to fix things? Almost certainly. There is nothing inevitable about the current system. It has come about mostly by default.</p> </blockquote> <p>Yes. Man, I was dumb for not realizing this sooner…</p> <h4 id="final-thoughts">Final thoughts</h4> <p>This is one of Paul’s older essays. He rambles quite a bit. Each paragraph after about the fifth repeats what he says, just with a different story or tone. I like the point of the essay. Nerds are UNPOPULAR, and that unpopularity actually does drag on these days (even past college) due to the internet and having these traits embedded in culture beyond school.</p> <p>One thing I do disagree with Paul on is that popularity does matter, just in a different sense. Being popular is something most people are not well adjusted to – say, being attractive for the first time, or being well known on the internet and having to respond well in a social setting. I believe that these early years in life build that and allow one to experience that type of feeling – to build “confidence” in some way. This matters after the teenage years, and it is a useful skill to have. However, becoming popular is hard, and most kids are just thrown onto the battleground to figure it out. No one really teaches them.</p> <p>I do agree with most of Paul’s points on school. It’s a rigid structure that is basically a battleground for kids to bully one another and sort themselves into groups. Then you can, for the most part, pretend your way through classes if you put in some effort and learn how to do so (I guess this is what counts as “smart”?).
I wish kids took more apprentice-esque classes, so that someone could show them some view of the adult world. I didn’t understand parts of it until after college and am still learning.</p> <p>But why does Paul seem so harsh – angry, almost? Does he regret going to such schools? Is he bitter? I can relate if so. I can’t really describe good things about school. I just hung out with the nerds, and that was fun, I think?</p> A Reality Check on DeepSeek's Distributed File System Benchmarks 2025-06-18T09:00:00+00:00 2025-06-18T09:00:00+00:00 https://maknee.github.io/blog/2025/3FS-Performance-Journal-2 <h1 id="series">Series</h1> <ul> <li><a href="/blog/2025/3FS-Performance-Journal-1/">An Intro to DeepSeek’s Distributed File System</a></li> <li><a href="/blog/2025/3FS-Performance-Journal-2/">A Reality Check on DeepSeek’s Distributed File System Benchmarks</a></li> <li><a href="/blog/2025/3FS-Performance-Journal-3/">Network Storage and Scaling Characteristics of a Distributed Filesystem</a></li> </ul> <!-- - [Theoretical Performance Limits of 3FS](/blog/2018/RTX-DXR-Path-Tracer-Host/) - [Benchmarking 3FS](/blog/2018/RTX-DXR-Path-Tracer-HLSL/) - [Analysis of 3FS Benchmarks](/blog/2018/RTX-DXR-Path-Tracer-HLSL/) - [Improving 3FS Performance](/blog/2018/RTX-DXR-Path-Tracer-HLSL/) --> <h1 id="how-should-we-analyze-3fs">How should we analyze 3FS?</h1> <p>In <a href="/blog/2025/3FS-Performance-Journal-1/">my previous post</a>, I introduced DeepSeek’s <a href="https://github.com/deepseek-ai/3FS/tree/ee9a5cee0a85c64f4797bf380257350ca1becd36">3FS distributed file system</a> – exploring its architecture, components, and the CRAQ protocol that provides its consistency guarantees.
Now, I want to take a closer look at the published benchmark results and performance claims.</p> <p>When evaluating distributed systems, there’s a tendency to jump straight into complex profiling tools and detailed metrics.<span class="sidenote-ref"></span><span class="sidenote">Trying out perf, strace for syscalls, iostat for disk – it’s essentially throwing random darts until you hit something</span> However, I find tremendous value in performing an initial “performance reality check” on numbers and graphs. The check uses reference numbers from other sources, such as hardware manufacturer specifications or existing benchmarks, to quickly establish a baseline for a particular system<span class="sidenote-ref"></span><span class="sidenote">For example, when I drive a car on the highway, I try to match the speed to the other cars around me. Without that reference, it might turn out that I’m over the speed limit if I’m not constantly checking the speed gauge</span>. This approach helps identify potential bottlenecks or inconsistencies before deploying software tools for deeper investigation.</p> <p>A “performance reality check” can reveal the following:</p> <ol> <li>It validates whether benchmark results match what we’d expect based on the hardware configuration</li> <li>It helps identify which components (network, storage, CPU, etc.) represent the main bottleneck</li> <li>It reveals the percentage of theoretical capacity actually being utilized</li> <li>It verifies whether the authors’ claims are valid and how they may have arrived at those conclusions</li> </ol> <p>To illustrate this method, imagine a startup making claims about their new database – “built for AI training” and “hyper performance”.
They showcase a benchmark from a single node:</p> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-03-13/part2/example1.svg" width="75%" alt="" /> <div class="caption"> <em>A company produces a graph showing the throughput of one of their machines running the workload </em> </div> </div> <p>The system managed to read 250 GB over the course of the test, which seems impressive! However, this is like saying I drove 100 miles without mentioning whether it took an hour or 10. The rate (GB per second) reveals the actual work accomplished relative to time invested. Let’s approximate it by drawing a slope through the data.</p> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-03-13/part2/example2.svg" width="75%" alt="" /> <div class="caption"> <em>Drawing a line through the graph to get the rate </em> </div> </div> <p>2 GB/s. Great number, but one might wonder – what should we compare this number to?</p> <p>A good place to start is to ask: is this utilizing the full potential of the hardware? Looking up <a href="https://www.micron.com/content/dam/micron/global/public/documents/products/technical-marketing-brief/7450-nvme-ssd-tech-prod-spec.pdf">modern SSD</a> specifications for random reads and plotting that on the graph can reveal the following:</p> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-03-13/part2/example3.svg" width="75%" alt="" /> <div class="caption"> <em>Taking a different look at the graph with theoretical limits </em> </div> </div> <p>Theoretically, the system should reach 500 GB by the end of the test period!</p> <p>The benchmark is only utilizing about half of the available device bandwidth. This might raise some eyebrows about their performance claims – where are the bottlenecks?</p> <p>This analytical approach is exactly what I’ll apply to DeepSeek’s 3FS benchmarks throughout this post.
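To make the reality check concrete, here’s a minimal sketch of the same arithmetic in Python. The 125-second test duration is my assumption, inferred from 250 GB at the ~2 GB/s slope; the 4 GB/s figure is the random-read spec of the Micron 7450-class reference SSD.

```python
# Back-of-the-envelope reality check for the single-node example.
# Assumed numbers: 250 GB read over a 125 s window (slope ≈ 2 GB/s),
# against a reference SSD random-read spec of ~4 GB/s.

def utilization(bytes_read_gb, duration_s, device_gb_per_s):
    """Achieved rate in GB/s, and the fraction of device capability used."""
    achieved = bytes_read_gb / duration_s
    return achieved, achieved / device_gb_per_s

achieved, frac = utilization(250, 125, 4.0)
print(f"achieved = {achieved:.1f} GB/s, {frac:.0%} of device spec")
# achieved = 2.0 GB/s, 50% of device spec
```

The same two divisions work for every benchmark that follows: data moved over time, then the result over the relevant hardware limit.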
By calculating what the hardware should deliver and comparing it to what 3FS actually achieves, we can identify where the possible bottlenecks lie and assess whether performance claims hold up.<span class="sidenote-ref"></span><span class="sidenote">While not exact, these comparisons give us immediate insights that would take days to obtain through benchmarking</span></p> <h2 id="into-analyzing-3fs">Into analyzing 3FS</h2> <p>I’ll be examining three different workloads that showcase 3FS in action:</p> <ul> <li>AI training jobs featuring a massive amount of reads</li> <li>GraySort, a synthetic sorting benchmark with a mix of reads and writes</li> <li>KV cache in operation, representing LLM inference workloads with random reads</li> </ul> <p>Each benchmark consists of two main components – client and storage. The client sends a request to read/write to the storage node over a network. Then, the storage node reads/writes the data to its device(s) and responds to the client by sending a message back. Thus, the two main hardware components one should analyze closely are network and storage.</p> <p>For each benchmark, I’ll break down the hardware configuration, calculate theoretical maximums, and analyze how close the system comes to achieving its potential performance. 
Through this analysis, we’ll develop intuition about 3FS’s real-world capabilities before even deploying it.</p> <p>Let’s start by examining what could be 3FS’s most important benchmark: training throughput for AI workloads.</p> <h2 id="first-workload-training-job">First workload: Training job</h2> <div class="image-container" style="max-width: 100%; overflow: visible;"> <img loading="lazy" src="/assets/images/posts/2025-03-13/part2/peak_throughput.jpg" style="width: 125%; margin-left: calc((100% - 125%) / 2);" alt="" /> <div class="caption"> <em>Peak throughput for training jobs (Image source: <a href="https://github.com/deepseek-ai/3FS" rel="external nofollow noopener" target="_blank">3FS github</a>) </em> </div> </div> <p>A training workload typically involves GPU nodes acting as clients that read data (text, images, etc.) from storage nodes to train deep neural networks like language or diffusion models. The throughput hovers around 6.6 TB/s<span class="sidenote-ref"></span><span class="sidenote">It’s not made explicit if this read throughput is the average or median. I would assume the average throughput.</span> on average, with peak throughput reaching 8 TB/s as reported in the <a href="https://arxiv.org/abs/2408.14158">Fire-Flyer AI-HPC paper</a>.</p> <p>Here’s the hardware configuration the benchmark uses:</p> <link rel="stylesheet" href="/assets/css/fancy_table.css" /> <script> function toggleCalculations(tableId) { // ... 
(toggleCalculations function remains the same) const table = document.getElementById(tableId); const cells = table.querySelectorAll('.has-calculation'); const tableWrapper = document.getElementById(tableId + '-wrapper'); const toggleSwitch = document.getElementById(tableId + '-switch'); const toggleText = document.getElementById(tableId + '-toggle-text'); if (tableWrapper.classList.contains('show-calculations')) { tableWrapper.classList.remove('show-calculations'); toggleSwitch.classList.remove('active'); toggleText.textContent = "Show calculations"; cells.forEach(cell => { const normalText = cell.querySelector('.normal-text'); const calculationText = cell.querySelector('.calculation-text'); normalText.style.display = 'inline'; calculationText.style.display = 'none'; }); } else { tableWrapper.classList.add('show-calculations'); toggleSwitch.classList.add('active'); toggleText.textContent = "Hide calculations"; cells.forEach(cell => { const normalText = cell.querySelector('.normal-text'); const calculationText = cell.querySelector('.calculation-text'); normalText.style.display = 'none'; calculationText.style.display = 'inline'; }); } } function isConsideredYellow(colorString) { if (!colorString) return false; const lowerColor = colorString.toLowerCase(); const yellowKeywords = ['yellow', 'gold', 'lemon', 'chiffon', 'goldenrod', 'papayawhip', 'moccasin', 'khaki', '#ff0', '#ffff00', '#ffffe0', '#fffacd', '#fafad2', '#fff8dc', '#eee8aa', '#f0e68c']; if (yellowKeywords.some(k => lowerColor.includes(k))) { return true; } const match = lowerColor.match(/rgba?\((\d+),\s*(\d+),\s*(\d+)/); if (match) { const r = parseInt(match[1]); const g = parseInt(match[2]); const b = parseInt(match[3]); if (r > 200 && g > 180 && b < 200 && Math.abs(r - g) < 70) { return true; } } if (lowerColor === '#ffdab9') return true; // PeachPuff if (lowerColor === 'rgba(255,255,0,0.2)') return true; // Example: semi-transparent yellow return false; } function initializeCellHighlighters() { const 
highlighters = document.querySelectorAll('[data-highlight-cells]'); highlighters.forEach(highlighter => { if (highlighter.dataset.highlighterInitialized === 'true') return; highlighter.dataset.highlighterInitialized = 'true'; const cellIdsToHighlight = highlighter.dataset.highlightCells.split(',').map(id => id.trim()); const cellElements = cellIdsToHighlight.map(id => document.getElementById(id)).filter(el => el); let descriptiveSpans = []; if (highlighter.matches('span[data-hover-text-color]')) { descriptiveSpans.push(highlighter); } else { descriptiveSpans = Array.from(highlighter.querySelectorAll('span[data-hover-text-color]')); } descriptiveSpans.forEach(span => { if (typeof span.dataset.originalParaTextColor === 'undefined') { span.dataset.originalParaTextColor = window.getComputedStyle(span).color || ''; } }); highlighter.addEventListener('mouseenter', () => { highlighter.classList.add('trigger-text-active'); descriptiveSpans.forEach(span => { if (span.dataset.hoverTextColor) { span.style.color = span.dataset.hoverTextColor; } }); cellElements.forEach((cell, idx) => { cell.classList.add('cell-highlighted'); let hoverTextColorForCell = null; let hoverCellBgFromDescSpan = null; if (idx < descriptiveSpans.length) { const descSpan = descriptiveSpans[idx]; if (descSpan.dataset.hoverTextColor) { hoverTextColorForCell = descSpan.dataset.hoverTextColor; } if (descSpan.dataset.hoverCellBg) { hoverCellBgFromDescSpan = descSpan.dataset.hoverCellBg; } } else if (descriptiveSpans.length === 1 && cellElements.length > 1) { const descSpan = descriptiveSpans[0]; if (descSpan.dataset.hoverTextColor) { hoverTextColorForCell = descSpan.dataset.hoverTextColor; } if (descSpan.dataset.hoverCellBg) { hoverCellBgFromDescSpan = descSpan.dataset.hoverCellBg; } } if (hoverTextColorForCell) { const isCalcCell = cell.classList.contains('has-calculation'); if (isCalcCell) { const normalTextSpan = cell.querySelector('.normal-text'); const calcResultSpan = 
cell.querySelector('.calc-result'); const calcWrapper = cell.querySelector('.calculation-text'); if (calcWrapper && window.getComputedStyle(calcWrapper).display !== 'none') { if (calcResultSpan) calcResultSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } else { if (normalTextSpan) normalTextSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } } else { const directSpan = cell.querySelector(':scope > span'); if (directSpan) directSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } } if (hoverCellBgFromDescSpan && isConsideredYellow(hoverCellBgFromDescSpan)) { if (typeof cell.dataset.originalBgColor === 'undefined') { cell.dataset.originalBgColor = cell.style.backgroundColor; } cell.style.backgroundColor = hoverCellBgFromDescSpan; cell.dataset.bgChangedByScript = 'true'; } }); }); highlighter.addEventListener('mouseleave', () => { highlighter.classList.remove('trigger-text-active'); descriptiveSpans.forEach(span => { if (typeof span.dataset.originalParaTextColor !== 'undefined') { span.style.color = span.dataset.originalParaTextColor; } }); cellElements.forEach((cell, idx) => { cell.classList.remove('cell-highlighted'); const isCalcCell = cell.classList.contains('has-calculation'); if (isCalcCell) { const normalTextSpan = cell.querySelector('.normal-text'); const calcResultSpan = cell.querySelector('.calc-result'); if (normalTextSpan) normalTextSpan.style.removeProperty('color'); if (calcResultSpan) calcResultSpan.style.removeProperty('color'); } else { const directSpan = cell.querySelector(':scope > span'); if (directSpan) directSpan.style.removeProperty('color'); } if (cell.dataset.bgChangedByScript === 'true') { cell.style.backgroundColor = cell.dataset.originalBgColor || ''; delete cell.dataset.originalBgColor; delete cell.dataset.bgChangedByScript; } }); }); }); } if (document.readyState === 'loading') { document.addEventListener('DOMContentLoaded', initializeCellHighlighters); } else { 
initializeCellHighlighters(); } </script> <style> /* ... (Other styles remain the same) ... */ .toggle-container { display: flex; justify-content: flex-end; margin-bottom: 6px; } .calc-toggle { display: inline-flex; align-items: center; } .toggle-switch { position: relative; display: inline-block; width: 32px; height: 16px; background-color: #e5e7eb; /* gray-200 */ border-radius: 10px; transition: all 0.3s; cursor: pointer; } .toggle-switch::after { content: ''; position: absolute; width: 12px; height: 12px; border-radius: 50%; background-color: white; top: 2px; left: 2px; transition: all 0.3s cubic-bezier(0.175, 0.885, 0.32, 1.275); box-shadow: 0 1px 2px rgba(0, 0, 0, 0.1); } .toggle-switch.active { background-color: #8b5cf6; /* purple-500 */ } .toggle-switch.active::after { left: 18px; } .toggle-label { position: absolute; width: 1px; height: 1px; padding: 0; margin: -1px; overflow: hidden; clip: rect(0, 0, 0, 0); white-space: nowrap; border-width: 0; } .toggle-text { font-size: 0.75rem; color: #6b7280; /* gray-500 */ margin-right: 6px; } .calc-result { color: rgb(75, 85, 99); /* gray-700 */ } .calc-formula { color: #6b83a6; /* Subtle bluish gray */ font-family: ui-monospace, SFMono-Regular, Menlo, Monaco, Consolas, monospace; } .calc-equals { color: rgb(75, 85, 99); /* gray-700 */ display: inline-block; margin: 0 4px; font-weight: normal; } .cell-highlighted { font-weight: bold !important; } .trigger-text-active { background-color: #E5E7EB; padding: 1px 3px; border-radius: 3px; transition: background-color 0.15s ease-in-out; } .trigger-text-active span[data-hover-text-color] { padding: 0; background-color: transparent; } /* Add styles for no-scroll option */ .table-wrapper-no-scroll { overflow-x: visible !important; } .table-no-scroll td { white-space: normal !important; word-break: break-word; } </style> <div id="fancy-table-Node Type,Component,Specification-wrapper" class="px-4 rounded-lg __basic-table not-prose mt-4 mb-4 overflow-x-auto"> <table 
id="fancy-table-Node Type,Component,Specification" class="min-w-full divide-y divide-gray-200 font-sans basic-table-striped"> <thead class="bg-gray-50"> <tr> <th class="px-6 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Node Type </th> <th class="px-6 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Component </th> <th class="px-6 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Specification </th> </tr> </thead> <tbody> <tr class="border-b border-gray-200"> <td id="fancy-table-Node Type,Component,Specification-row0-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Client</span> </td> <td id="fancy-table-Node Type,Component,Specification-row0-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Number of nodes</span> </td> <td id="fancy-table-Node Type,Component,Specification-row0-col2" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">500</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="fancy-table-Node Type,Component,Specification-row1-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;"></span> </td> <td id="fancy-table-Node Type,Component,Specification-row1-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Network</span> </td> <td id="fancy-table-Node Type,Component,Specification-row1-col2" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">1 × 200Gbps NIC</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="fancy-table-Node Type,Component,Specification-row2-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span 
style="color: rgb(75, 85, 99) !important;">Storage</span> </td> <td id="fancy-table-Node Type,Component,Specification-row2-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Number of nodes</span> </td> <td id="fancy-table-Node Type,Component,Specification-row2-col2" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">180</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="fancy-table-Node Type,Component,Specification-row3-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;"></span> </td> <td id="fancy-table-Node Type,Component,Specification-row3-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Disk</span> </td> <td id="fancy-table-Node Type,Component,Specification-row3-col2" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">16 × 14TB PCIe 4.0 NVMe SSDs</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="fancy-table-Node Type,Component,Specification-row4-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;"></span> </td> <td id="fancy-table-Node Type,Component,Specification-row4-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Network</span> </td> <td id="fancy-table-Node Type,Component,Specification-row4-col2" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">2 × 200Gbps NICs</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="fancy-table-Node Type,Component,Specification-row5-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> 
<span style="color: rgb(75, 85, 99) !important;"></span> </td> <td id="fancy-table-Node Type,Component,Specification-row5-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Memory</span> </td> <td id="fancy-table-Node Type,Component,Specification-row5-col2" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">512 GB DDR4-3200MHz</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="fancy-table-Node Type,Component,Specification-row6-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;"></span> </td> <td id="fancy-table-Node Type,Component,Specification-row6-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">CPU</span> </td> <td id="fancy-table-Node Type,Component,Specification-row6-col2" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">2 × AMD 32 Cores EPYC Rome/Milan</span> </td> </tr> </tbody> </table> </div> <p>Let’s apply the “performance reality check” on these numbers – Below are some back-of-the-envelope calculations<span class="sidenote-ref"></span><span class="sidenote"><a href="https://en.wikipedia.org/wiki/Back-of-the-envelope_calculation">Back-of-the-envelope calculations</a> involve performing rough additions and multiplications to get an approximate number within the range of the actual value</span> to get an idea of the theoretical limits<span class="sidenote-ref"></span><span class="sidenote">The authors don’t list the SSD used in the benchmark, so I’ll be using a <a href="https://www.micron.com/content/dam/micron/global/public/documents/products/technical-marketing-brief/7450-nvme-ssd-tech-prod-spec.pdf">Micron 7450 15.36TB U.3 enterprise SSD</a> as reference</span> of the benchmark. 
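Before walking through the full table, here’s a quick sketch of the ceilings that matter most for a read-heavy training workload. The ~7 GB/s sequential read per SSD is my reference pick from a Micron 7450-class datasheet; the node and NIC counts come from the hardware configuration above, with each 200 Gbps NIC giving 25 GB/s.

```python
# Back-of-the-envelope ceilings for the read-heavy training workload.
# 7 GB/s per SSD is an assumed reference figure (Micron 7450-class);
# node and NIC counts come from the hardware configuration table.

NIC_GBS = 200 / 8                          # 200 Gbps ÷ 8 = 25 GB/s

disk_read    = 180 * 16 * 7 / 1000         # nodes × SSDs × GB/s → TB/s
storage_nics = 180 * 2 * NIC_GBS / 1000    # storage-side network ceiling
client_nics  = 500 * 1 * NIC_GBS / 1000    # client-side network ceiling

print(f"disk sequential read: {disk_read:.2f} TB/s")    # 20.16 TB/s
print(f"storage network     : {storage_nics:.2f} TB/s")  # 9.00 TB/s
print(f"client network      : {client_nics:.2f} TB/s")   # 12.50 TB/s
print(f"tightest ceiling    : {min(disk_read, storage_nics, client_nics):.2f} TB/s")
```

Against these ceilings, the reported 6.6 TB/s average (8 TB/s peak) sits closest to the storage-side network limit of 9 TB/s – exactly the kind of observation this reality check is meant to surface.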
Click the “Show calculations” toggle button in the top right to see the detailed breakdown. The base numbers (7GB/s, 4GB/s, 6GB/s, 2GB/s) come from reference SSD specifications I selected to represent this workload. Also, the NIC’s throughput is measured in Gbps instead of GB/s. Performing the conversion: 200Gbps ÷ 8 = 25GB/s.</p> <div id="performance-table-wrapper" class="px-4 rounded-lg __basic-table not-prose mt-2 mb-2 overflow-x-auto"> <div class="toggle-container"> <div class="calc-toggle"> <span id="performance-table-toggle-text" class="toggle-text">Show calculations</span> <span id="performance-table-toggle" onclick="toggleCalculations('performance-table')"> <span class="toggle-switch" id="performance-table-switch"></span> <span class="toggle-label">Toggle calculations</span> </span> </div> </div> <table id="performance-table" class="min-w-full divide-y divide-gray-200 font-sans basic-table-striped"> <thead class="bg-gray-50"> <tr> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Node Type </th> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Metric </th> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Per Unit </th> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Per Node </th> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Entire Cluster </th> </tr> </thead> <tbody> <tr class="border-b border-gray-200"> <td id="performance-table-row0-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Storage (180)</span> </td> <td id="performance-table-row0-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Disk - Sequential Read</span> </td> <td id="performance-table-row0-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">7 GB/s</span> </td> <td
id="performance-table-row0-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600 has-calculation"> <span class="normal-text">112 GB/s</span> <span class="calculation-text" style="display: none;"> <span class="calc-result">112 GB/s</span> <span class="calc-equals">=</span> <span class="calc-formula">7 GB/s × 16</span> </span> </td> <td id="performance-table-row0-col4" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600 has-calculation"> <span class="normal-text">20.16 TB/s</span> <span class="calculation-text" style="display: none;"> <span class="calc-result">20.16 TB/s</span> <span class="calc-equals">=</span> <span class="calc-formula">112 GB/s × 180</span> </span> </td> </tr> <tr class="border-b border-gray-200"> <td id="performance-table-row1-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;"></span> </td> <td id="performance-table-row1-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Disk - Random Read</span> </td> <td id="performance-table-row1-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">4 GB/s</span> </td> <td id="performance-table-row1-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600 has-calculation"> <span class="normal-text">64 GB/s</span> <span class="calculation-text" style="display: none;"> <span class="calc-result">64 GB/s</span> <span class="calc-equals">=</span> <span class="calc-formula">4 GB/s × 16</span> </span> </td> <td id="performance-table-row1-col4" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600 has-calculation"> <span class="normal-text">5.04 TB/s</span> <span class="calculation-text" style="display: none;"> <span class="calc-result">5.04 TB/s</span> <span class="calc-equals">=</span> <span class="calc-formula">64 GB/s × 
180</span> </span> </td> </tr> <tr class="border-b border-gray-200"> <td id="performance-table-row2-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;"></span> </td> <td id="performance-table-row2-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Disk - Sequential Write</span> </td> <td id="performance-table-row2-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">6 GB/s</span> </td> <td id="performance-table-row2-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600 has-calculation"> <span class="normal-text">96 GB/s</span> <span class="calculation-text" style="display: none;"> <span class="calc-result">96 GB/s</span> <span class="calc-equals">=</span> <span class="calc-formula">6 GB/s × 16</span> </span> </td> <td id="performance-table-row2-col4" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600 has-calculation"> <span class="normal-text">7.56 TB/s</span> <span class="calculation-text" style="display: none;"> <span class="calc-result">7.56 TB/s</span> <span class="calc-equals">=</span> <span class="calc-formula">96 GB/s × 180</span> </span> </td> </tr> <tr class="border-b border-gray-200"> <td id="performance-table-row3-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;"></span> </td> <td id="performance-table-row3-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Disk - Random Write</span> </td> <td id="performance-table-row3-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">2 GB/s</span> </td> <td id="performance-table-row3-col3" class="px-4 py-2 whitespace-nowrap text-sm 
font-medium text-gray-600 has-calculation"> <span class="normal-text">32 GB/s</span> <span class="calculation-text" style="display: none;"> <span class="calc-result">32 GB/s</span> <span class="calc-equals">=</span> <span class="calc-formula">2 GB/s × 16</span> </span> </td> <td id="performance-table-row3-col4" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600 has-calculation"> <span class="normal-text">2.52 TB/s</span> <span class="calculation-text" style="display: none;"> <span class="calc-result">2.52 TB/s</span> <span class="calc-equals">=</span> <span class="calc-formula">2 GB/s × 180</span> </span> </td> </tr> <tr class="border-b border-gray-200"> <td id="performance-table-row4-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;"></span> </td> <td id="performance-table-row4-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Network</span> </td> <td id="performance-table-row4-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">25 GB/s</span> </td> <td id="performance-table-row4-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600 has-calculation"> <span class="normal-text">50 GB/s</span> <span class="calculation-text" style="display: none;"> <span class="calc-result">50 GB/s</span> <span class="calc-equals">=</span> <span class="calc-formula">25 GB/s × 2</span> </span> </td> <td id="performance-table-row4-col4" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600 has-calculation"> <span class="normal-text">9 TB/s</span> <span class="calculation-text" style="display: none;"> <span class="calc-result">9 TB/s</span> <span class="calc-equals">=</span> <span class="calc-formula">50 GB/s × 180</span> </span> </td> </tr> <tr class="border-b border-gray-200"> <td 
id="performance-table-row5-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Client (500)</span> </td> <td id="performance-table-row5-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Network</span> </td> <td id="performance-table-row5-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">25 GB/s</span> </td> <td id="performance-table-row5-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">25 GB/s</span> </td> <td id="performance-table-row5-col4" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600 has-calculation"> <span class="normal-text">12.5 TB/s</span> <span class="calculation-text" style="display: none;"> <span class="calc-result">12.5 TB/s</span> <span class="calc-equals">=</span> <span class="calc-formula">25 GB/s × 500</span> </span> </td> </tr> <tr class="border-b border-gray-200"> <td id="performance-table-row6-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">ML Training</span> </td> <td id="performance-table-row6-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Avg Read Throughput</span> </td> <td id="performance-table-row6-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">N/A</span> </td> <td id="performance-table-row6-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">N/A</span> </td> <td id="performance-table-row6-col4" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">6.6 
TB/s</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="performance-table-row7-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">ML Training</span> </td> <td id="performance-table-row7-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Peak Read Throughput</span> </td> <td id="performance-table-row7-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">N/A</span> </td> <td id="performance-table-row7-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">N/A</span> </td> <td id="performance-table-row7-col4" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">8 TB/s</span> </td> </tr> </tbody> </table> </div> <p>From these numbers, one can observe that:</p> <ul> <li><span data-highlight-cells="performance-table-row5-col4, performance-table-row4-col4">The clients’ network will not be a bottleneck (<span data-hover-text-color="rgba(52, 152, 219, 0.9)" data-hover-cell-bg="rgba(255,255,0,0.2)">12.5 TB/s client network throughput</span> &gt; <span data-hover-text-color="rgba(80, 150, 100, 0.9)" data-hover-cell-bg="rgba(255,255,0,0.2)">9 TB/s storage network throughput</span>)</span><span class="sidenote-ref"></span><span class="sidenote">Hover over the text to see the numbers highlighted in the table!</span></li> <li><span data-highlight-cells="performance-table-row6-col4,performance-table-row1-col4">The training job’s reads must be a mix of sequential and random access, because the 6.6 TB/s average throughput exceeds what pure random reads could sustain (<span data-hover-text-color="rgba(52, 152, 219, 0.9)" data-hover-cell-bg="#FFFFE0">6.6 TB/s</span> &gt; <span data-hover-text-color="rgba(80, 150, 100, 0.9)" data-hover-cell-bg="#FFFFE0">5 TB/s</span>)</span></li> <li><span data-highlight-cells="performance-table-row0-col4, performance-table-row4-col4">The storage nodes will be bottlenecked by network or disk depending on the type of workload. A network bottleneck occurs when the workload primarily consists of sequential reads<span class="sidenote-ref"></span><span class="sidenote">An example of this type of workload is reading a large file (movie, song, etc.) in order to transfer the data to another device</span> and the network cannot match the sequential throughput (<span data-hover-text-color="rgba(52, 152, 219, 0.9)" data-hover-cell-bg="#FFFFE0">20 TB/s</span> &gt; <span data-hover-text-color="rgba(80, 150, 100, 0.9)" data-hover-cell-bg="#FFFFE0">9 TB/s</span>)</span></li> <li><span data-highlight-cells="performance-table-row1-col4,performance-table-row2-col4,performance-table-row3-col4,performance-table-row4-col4">When the workload primarily consists of random reads, sequential writes, or random writes, storage becomes the bottleneck rather than the network (<span data-hover-text-color="rgba(80, 150, 100, 0.9)" data-hover-cell-bg="#FFFFE0">5 TB/s</span>, <span data-hover-text-color="rgba(80, 150, 100, 0.9)" data-hover-cell-bg="#FFFFE0">7.5 TB/s</span>, <span data-hover-text-color="rgba(80, 150, 100, 0.9)" data-hover-cell-bg="#FFFFE0">2.5 TB/s</span> &lt; <span data-hover-text-color="rgba(52, 152, 219, 0.9)" data-hover-cell-bg="#FFFFE0">9 TB/s</span></span>)</li> <li>This workload is most likely bottlenecked by network bandwidth.
The sequential read throughput can reach up to <span data-highlight-cells="performance-table-row0-col4" data-hover-text-color="darkred" data-hover-cell-bg="#FFFFE0">20 TB/s</span> and the network throughput is <span data-highlight-cells="performance-table-row4-col4" data-hover-text-color="maroon" data-hover-cell-bg="#FFFFE0">9 TB/s</span>, but the peak throughput of <span data-highlight-cells="performance-table-row7-col4" data-hover-text-color="maroon" data-hover-cell-bg="#FFFFE0">8 TB/s</span> and average throughput of <span data-highlight-cells="performance-table-row6-col4" data-hover-text-color="maroon" data-hover-cell-bg="#FFFFE0">6.6 TB/s</span> are below the network limit and well below the maximum sequential throughput.</li> </ul> <p>Raw numbers can be hard to compare at a glance. If we replot them relative to an SSD’s maximum sequential read throughput and lay them out on a bar plot, we get a better sense of how they stack up:</p> <div class="image-container"> <img loading="lazy" src="/assets/images/posts/2025-03-13/part2/paper_throughput_relative_to_sequential_reads.svg" width="100%" alt="" /> </div> <p>The visualization reveals some interesting insights about system utilization that we have already pointed out:</p> <ul> <li>The average 6.6 TB/s represents: <ul> <li>33% of theoretical sequential disk throughput (6.6 / 20 TB/s)</li> <li>73% of available network bandwidth (6.6 / 9 TB/s)</li> </ul> </li> <li>The peak 8 TB/s achieves: <ul> <li>40% of theoretical sequential disk throughput (8 / 20 TB/s)</li> <li>88% of available network bandwidth (8 / 9 TB/s)</li> </ul> </li> </ul> <p>The data clearly shows that network bandwidth becomes the primary bottleneck. This suggests two potential optimization paths: either remove half of the SSDs from each storage node or upgrade to 800 Gbps NICs to unlock full sequential read potential. However, implementing these changes presents practical challenges.
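</p>

<p>As a quick sanity check, here is a small sketch that recomputes these utilization figures, using only the cluster-wide limits from the table above and the observed throughputs from the 3FS graphs:</p>

```python
# Recompute the utilization figures quoted above.
# Cluster-wide limits (TB/s) come from the performance table;
# observed throughputs come from the 3FS training-job graphs.
seq_read_limit = 20.16  # sequential-read ceiling across 180 storage nodes
network_limit = 9.0     # aggregate storage-node network bandwidth

for label, observed in [("average", 6.6), ("peak", 8.0)]:
    print(
        f"{label}: {observed / seq_read_limit:.1%} of disk sequential-read limit, "
        f"{observed / network_limit:.1%} of network limit"
    )
```

<p>This prints 32.7% / 73.3% for the average and 39.7% / 88.9% for the peak, consistent with the rounded figures above.</p>

<p>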
Hardware platforms often have limitations that prevent NIC upgrades, removing storage may leave PCIe slots unused, and cost alone may make changing the existing setup impractical.</p> <p>Also, why does peak throughput cap at 8 TB/s rather than closer to the theoretical network limit of 9 TB/s? Is this a fundamental software limitation, or does it reflect overhead associated with network operations<span class="sidenote-ref"></span><span class="sidenote">Could be TCP or RDMA overhead</span> at this scale?<span class="sidenote-ref"></span><span class="sidenote">I’ll have better answers to such questions when I run benchmarks on 3FS</span></p> <h3 id="revisiting-the-training-job-with-some-background">Revisiting the training job with some background</h3> <p>Now, let’s revisit the throughput-over-time graph with these background numbers in mind.</p> <div class="image-container" style="max-width: 100%; overflow: visible;"> <img loading="lazy" src="/assets/images/posts/2025-03-13/part2/peak_throughput.jpg" style="width: 125%; margin-left: calc((100% - 125%) / 2);" alt="" /> <div class="caption"> <em>Peak throughput for training jobs (Image source: <a href="https://github.com/deepseek-ai/3FS" rel="external nofollow noopener" target="_blank">3FS github</a>) </em> </div> </div> <p>The graph shows read throughput hovering around 6.6 TB/s, which represents approximately 73% of the theoretical 9 TB/s network capacity<span class="sidenote-ref"></span><span class="sidenote">I typically set 0 as the starting point for the y axis, which gives you an absolute base number that you can compare to</span>.
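</p>

<p>For reference, the 9 TB/s network ceiling (and the 20.16 TB/s disk ceiling it is compared against) falls straight out of the per-device specs in the table, assuming each 25 GB/s NIC corresponds to a 200 Gbps link:</p>

```python
# Derive the cluster-wide ceilings from per-device specs in the table.
NODES = 180          # storage nodes
SSDS_PER_NODE = 16   # NVMe SSDs per node
NICS_PER_NODE = 2    # NICs per node

seq_read_per_ssd = 7.0   # GB/s, sequential read per SSD
gbs_per_nic = 200 / 8    # 200 Gbps = 25 GB/s

disk_ceiling = NODES * SSDS_PER_NODE * seq_read_per_ssd / 1000  # TB/s
net_ceiling = NODES * NICS_PER_NODE * gbs_per_nic / 1000        # TB/s

print(f"disk sequential-read ceiling: {disk_ceiling:.2f} TB/s")  # 20.16 TB/s
print(f"network ceiling: {net_ceiling:.2f} TB/s")                # 9.00 TB/s
```

<p>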
This leaves 27% of potential throughput unutilized, suggesting possible system bottlenecks such as:</p> <ul> <li>Metadata communication network overhead (think TCP headers)</li> <li>Network completion delays before reading</li> <li>Workload imbalance creating hot nodes</li> <li>FUSE bottlenecks in the client for file operations</li> <li>Kernel overhead in managing communication and disk I/O</li> <li>Straggler storage nodes slowed by disk issues (temperature, wear)</li> <li>Native filesystem (XFS, ext4) overheads</li> <li>…</li> </ul> <h3 id="dips-in-the-training-job">Dips in the training job</h3> <p>The periodic dips in throughput are somewhat interesting:</p> <div class="image-container" style="max-width: 100%; overflow: visible;"> <img loading="lazy" src="/assets/images/posts/2025-03-13/part2/paper_dips.svg" style="width: 125%; margin-left: calc((100% - 125%) / 2);" alt="" /> </div> <p>These dips could originate from either the filesystem or the workload itself. The filesystem might have internal mechanisms (periodic flushing, lock contention, etc.) that could cause these performance drops. But because the dips occur at regular ~2.5-second intervals, checkpointing operations are the most likely cause<span class="sidenote-ref"></span><span class="sidenote">The dips hover around 6.3 TB/s, so at 6.6 TB/s average, that’s a 4.5% drop in throughput (300 GB/s / 6600 GB/s). These dips last roughly 10% of the time between peak points, so overall throughput may decrease by about 0.45% – not particularly significant.</span>, as parts of the model may need to pause training while checkpoint data is written.</p> <h2 id="next-up-gray-sort-benchmark">Next up: Gray Sort Benchmark</h2> <h3 id="what-is-gray-sort">What is Gray Sort?</h3> <p><a href="https://sortbenchmark.org/">Gray Sort</a> is a synthetic benchmark that tests how quickly a system can sort large<span class="sidenote-ref"></span><span class="sidenote">Large as in terabytes large, and definitely will not fit on a single machine</span> amounts of data. The workload follows a specific pattern that stresses both sequential and random I/O operations:</p> <ol> <li>Read unsorted data from storage into memory (sequential reads)</li> <li>Sort each data chunk in memory</li> <li>Write the sorted chunks back to storage (random-ish writes)</li> <li>Fetch other nodes’ sorted chunks and merge them in memory (random-ish reads)</li> <li>Write the merged results back to disk (random-ish writes)</li> <li>Repeat until all data is fully sorted</li> <li>Write the full sorted result to disk (sequential writes)</li> </ol> <p>This alternating pattern of reads and writes, combined with both sort and merge phases, makes it a standard test for distributed filesystem performance<span class="sidenote-ref"></span><span class="sidenote">Sounds like a <a href="https://research.google/pubs/mapreduce-simplified-data-processing-on-large-clusters/">MapReduce</a> workload, essentially aggregating keys in a range into a partition and then performing some operation (merging in this case) on that range</span>.</p> <h3 id="initial-look-at-the-graphs">Initial Look at the Graphs</h3> <div class="image-container" style="max-width: 100%; overflow: visible;"> <img loading="lazy" src="/assets/images/posts/2025-03-13/part2/gray_sort_client.png" style="width: 105%; margin-left: calc((100% - 105%) / 2);" alt="" />
<div class="caption"> <em>Gray Sort Single Node Client Performance (Image source: <a href="https://github.com/deepseek-ai/3FS" rel="external nofollow noopener" target="_blank">3FS github</a>) </em> </div> </div> <div class="image-container" style="max-width: 100%; overflow: visible;"> <img loading="lazy" src="/assets/images/posts/2025-03-13/part2/gray_sort_server.png" style="width: 105%; margin-left: calc((100% - 105%) / 2);" alt="" /> <div class="caption"> <em>Gray Sort Single Node Server Performance (Image source: <a href="https://github.com/deepseek-ai/3FS" rel="external nofollow noopener" target="_blank">3FS github</a>) </em> </div> </div> <p>Note that orange represents writes and blue represents reads.</p> <p>Looking at the orange dotted lines separating the algorithm phases, there’s a clear pattern. The phase before the first dotted line is pure writes – the system writing the unsorted data to storage. After that, we see mixed read/write workloads that gradually shift toward being more read-heavy as the sorting progresses<span class="sidenote-ref"></span><span class="sidenote">As more and more sorted runs get merged together, fewer write operations are needed since each merge pass consolidates multiple inputs into fewer outputs, while read operations increase to fetch data from the remaining sorted runs. This pattern is observable when comparing the workloads in the 18:05:00-18:10:00 and 18:25:00-18:30:00 time ranges in the server throughput graph.</span></p> <p>A few observations jump out:</p> <ul> <li>Eyeballing the average combined (read / write) throughput per phase, it hovers around ~10-15 GB/s.</li> <li>Clients peak at around 10 GB/s for writes and around 22 GB/s for reads.</li> <li>Storage nodes peak at 22 GB/s for writes and 30 GB/s for reads – roughly twice the clients’ throughput, which makes sense given there are twice as many clients as storage nodes.
We see this listed in the next section on hardware configuration.</li> </ul> <h3 id="hardware-configuration">Hardware Configuration</h3> <p>For this benchmark, 3FS was deployed with the following hardware setup:</p> <div id="fancy-table-Node Type,Component,Specification-wrapper" class="px-4 rounded-lg __basic-table not-prose mt-4 mb-4 overflow-x-auto"> <table id="fancy-table-Node Type,Component,Specification" class="min-w-full divide-y divide-gray-200 font-sans basic-table-striped"> <thead class="bg-gray-50"> <tr> <th class="px-6 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Node Type </th> <th class="px-6 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Component </th> <th class="px-6 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Specification </th> </tr> </thead> <tbody> <tr class="border-b border-gray-200"> <td id="fancy-table-Node Type,Component,Specification-row0-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Client</span> </td> <td id="fancy-table-Node Type,Component,Specification-row0-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Number of nodes</span> </td> <td id="fancy-table-Node Type,Component,Specification-row0-col2" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">50</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="fancy-table-Node Type,Component,Specification-row1-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;"></span> </td> <td id="fancy-table-Node Type,Component,Specification-row1-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Network</span> </td> <td id="fancy-table-Node Type,Component,Specification-row1-col2" class="px-6 py-2
whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">1 × 200Gbps NIC</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="fancy-table-Node Type,Component,Specification-row2-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;"></span> </td> <td id="fancy-table-Node Type,Component,Specification-row2-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Memory</span> </td> <td id="fancy-table-Node Type,Component,Specification-row2-col2" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">2.2TB DDR4</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="fancy-table-Node Type,Component,Specification-row3-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Storage</span> </td> <td id="fancy-table-Node Type,Component,Specification-row3-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Number of nodes</span> </td> <td id="fancy-table-Node Type,Component,Specification-row3-col2" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">25</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="fancy-table-Node Type,Component,Specification-row4-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;"></span> </td> <td id="fancy-table-Node Type,Component,Specification-row4-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Disk</span> </td> <td id="fancy-table-Node Type,Component,Specification-row4-col2" class="px-6 py-2 whitespace-nowrap 
text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">16 × 14TB PCIe 4.0 NVMe SSDs</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="fancy-table-Node Type,Component,Specification-row5-col0" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;"></span> </td> <td id="fancy-table-Node Type,Component,Specification-row5-col1" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Network</span> </td> <td id="fancy-table-Node Type,Component,Specification-row5-col2" class="px-6 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">2 × 400Gbps NICs</span> </td> </tr> </tbody> </table> </div> <h3 id="analysis">Analysis</h3> <p>The main difference from the previous benchmark is that there are twice as many clients as storage nodes (compared to 3× in the previous benchmark), and the storage nodes have twice as much network bandwidth!</p> <p>Let’s calculate the theoretical performance limits for this configuration:</p> <link rel="stylesheet" href="/assets/css/fancy_table.css" /> <script> function toggleCalculations(tableId) { // ...
const table = document.getElementById(tableId); const cells = table.querySelectorAll('.has-calculation'); const tableWrapper = document.getElementById(tableId + '-wrapper'); const toggleSwitch = document.getElementById(tableId + '-switch'); const toggleText = document.getElementById(tableId + '-toggle-text'); if (tableWrapper.classList.contains('show-calculations')) { tableWrapper.classList.remove('show-calculations'); toggleSwitch.classList.remove('active'); toggleText.textContent = "Show calculations"; cells.forEach(cell => { const normalText = cell.querySelector('.normal-text'); const calculationText = cell.querySelector('.calculation-text'); normalText.style.display = 'inline'; calculationText.style.display = 'none'; }); } else { tableWrapper.classList.add('show-calculations'); toggleSwitch.classList.add('active'); toggleText.textContent = "Hide calculations"; cells.forEach(cell => { const normalText = cell.querySelector('.normal-text'); const calculationText = cell.querySelector('.calculation-text'); normalText.style.display = 'none'; calculationText.style.display = 'inline'; }); } } function isConsideredYellow(colorString) { if (!colorString) return false; const lowerColor = colorString.toLowerCase(); const yellowKeywords = ['yellow', 'gold', 'lemon', 'chiffon', 'goldenrod', 'papayawhip', 'moccasin', 'khaki', '#ff0', '#ffff00', '#ffffe0', '#fffacd', '#fafad2', '#fff8dc', '#eee8aa', '#f0e68c']; if (yellowKeywords.some(k => lowerColor.includes(k))) { return true; } const match = lowerColor.match(/rgba?\((\d+),\s*(\d+),\s*(\d+)/); if (match) { const r = parseInt(match[1]); const g = parseInt(match[2]); const b = parseInt(match[3]); if (r > 200 && g > 180 && b < 200 && Math.abs(r - g) < 70) { return true; } } if (lowerColor === '#ffdab9') return true; // PeachPuff if (lowerColor === 'rgba(255,255,0,0.2)') return true; // Example: semi-transparent yellow return false; } function initializeCellHighlighters() { const
highlighters = document.querySelectorAll('[data-highlight-cells]'); highlighters.forEach(highlighter => { if (highlighter.dataset.highlighterInitialized === 'true') return; highlighter.dataset.highlighterInitialized = 'true'; const cellIdsToHighlight = highlighter.dataset.highlightCells.split(',').map(id => id.trim()); const cellElements = cellIdsToHighlight.map(id => document.getElementById(id)).filter(el => el); let descriptiveSpans = []; if (highlighter.matches('span[data-hover-text-color]')) { descriptiveSpans.push(highlighter); } else { descriptiveSpans = Array.from(highlighter.querySelectorAll('span[data-hover-text-color]')); } descriptiveSpans.forEach(span => { if (typeof span.dataset.originalParaTextColor === 'undefined') { span.dataset.originalParaTextColor = window.getComputedStyle(span).color || ''; } }); highlighter.addEventListener('mouseenter', () => { highlighter.classList.add('trigger-text-active'); descriptiveSpans.forEach(span => { if (span.dataset.hoverTextColor) { span.style.color = span.dataset.hoverTextColor; } }); cellElements.forEach((cell, idx) => { cell.classList.add('cell-highlighted'); let hoverTextColorForCell = null; let hoverCellBgFromDescSpan = null; if (idx < descriptiveSpans.length) { const descSpan = descriptiveSpans[idx]; if (descSpan.dataset.hoverTextColor) { hoverTextColorForCell = descSpan.dataset.hoverTextColor; } if (descSpan.dataset.hoverCellBg) { hoverCellBgFromDescSpan = descSpan.dataset.hoverCellBg; } } else if (descriptiveSpans.length === 1 && cellElements.length > 1) { const descSpan = descriptiveSpans[0]; if (descSpan.dataset.hoverTextColor) { hoverTextColorForCell = descSpan.dataset.hoverTextColor; } if (descSpan.dataset.hoverCellBg) { hoverCellBgFromDescSpan = descSpan.dataset.hoverCellBg; } } if (hoverTextColorForCell) { const isCalcCell = cell.classList.contains('has-calculation'); if (isCalcCell) { const normalTextSpan = cell.querySelector('.normal-text'); const calcResultSpan = 
cell.querySelector('.calc-result'); const calcWrapper = cell.querySelector('.calculation-text'); if (calcWrapper && window.getComputedStyle(calcWrapper).display !== 'none') { if (calcResultSpan) calcResultSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } else { if (normalTextSpan) normalTextSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } } else { const directSpan = cell.querySelector(':scope > span'); if (directSpan) directSpan.style.setProperty('color', hoverTextColorForCell, 'important'); } } if (hoverCellBgFromDescSpan && isConsideredYellow(hoverCellBgFromDescSpan)) { if (typeof cell.dataset.originalBgColor === 'undefined') { cell.dataset.originalBgColor = cell.style.backgroundColor; } cell.style.backgroundColor = hoverCellBgFromDescSpan; cell.dataset.bgChangedByScript = 'true'; } }); }); highlighter.addEventListener('mouseleave', () => { highlighter.classList.remove('trigger-text-active'); descriptiveSpans.forEach(span => { if (typeof span.dataset.originalParaTextColor !== 'undefined') { span.style.color = span.dataset.originalParaTextColor; } }); cellElements.forEach((cell, idx) => { cell.classList.remove('cell-highlighted'); const isCalcCell = cell.classList.contains('has-calculation'); if (isCalcCell) { const normalTextSpan = cell.querySelector('.normal-text'); const calcResultSpan = cell.querySelector('.calc-result'); if (normalTextSpan) normalTextSpan.style.removeProperty('color'); if (calcResultSpan) calcResultSpan.style.removeProperty('color'); } else { const directSpan = cell.querySelector(':scope > span'); if (directSpan) directSpan.style.removeProperty('color'); } if (cell.dataset.bgChangedByScript === 'true') { cell.style.backgroundColor = cell.dataset.originalBgColor || ''; delete cell.dataset.originalBgColor; delete cell.dataset.bgChangedByScript; } }); }); }); } if (document.readyState === 'loading') { document.addEventListener('DOMContentLoaded', initializeCellHighlighters); } else { 
initializeCellHighlighters(); } </script> <style> /* ... (Other styles remain the same) ... */ .toggle-container { display: flex; justify-content: flex-end; margin-bottom: 6px; } .calc-toggle { display: inline-flex; align-items: center; } .toggle-switch { position: relative; display: inline-block; width: 32px; height: 16px; background-color: #e5e7eb; /* gray-200 */ border-radius: 10px; transition: all 0.3s; cursor: pointer; } .toggle-switch::after { content: ''; position: absolute; width: 12px; height: 12px; border-radius: 50%; background-color: white; top: 2px; left: 2px; transition: all 0.3s cubic-bezier(0.175, 0.885, 0.32, 1.275); box-shadow: 0 1px 2px rgba(0, 0, 0, 0.1); } .toggle-switch.active { background-color: #8b5cf6; /* purple-500 */ } .toggle-switch.active::after { left: 18px; } .toggle-label { position: absolute; width: 1px; height: 1px; padding: 0; margin: -1px; overflow: hidden; clip: rect(0, 0, 0, 0); white-space: nowrap; border-width: 0; } .toggle-text { font-size: 0.75rem; color: #6b7280; /* gray-500 */ margin-right: 6px; } .calc-result { color: rgb(75, 85, 99); /* gray-700 */ } .calc-formula { color: #6b83a6; /* Subtle bluish gray */ font-family: ui-monospace, SFMono-Regular, Menlo, Monaco, Consolas, monospace; } .calc-equals { color: rgb(75, 85, 99); /* gray-700 */ display: inline-block; margin: 0 4px; font-weight: normal; } .cell-highlighted { font-weight: bold !important; } .trigger-text-active { background-color: #E5E7EB; padding: 1px 3px; border-radius: 3px; transition: background-color 0.15s ease-in-out; } .trigger-text-active span[data-hover-text-color] { padding: 0; background-color: transparent; } /* Add styles for no-scroll option */ .table-wrapper-no-scroll { overflow-x: visible !important; } .table-no-scroll td { white-space: normal !important; word-break: break-word; } </style> <div id="graysort-table-wrapper" class="px-4 rounded-lg __basic-table not-prose mt-2 mb-2 overflow-x-auto"> <div class="toggle-container"> <div 
class="calc-toggle"> <span id="graysort-table-toggle-text" class="toggle-text">Show calculations</span> <span id="graysort-table-toggle" onclick="toggleCalculations('graysort-table')"> <span class="toggle-switch" id="graysort-table-switch"></span> <span class="toggle-label">Toggle calculations</span> </span> </div> </div> <table id="graysort-table" class="min-w-full divide-y divide-gray-200 font-sans basic-table-striped"> <thead class="bg-gray-50"> <tr> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Node Type </th> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Metric </th> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Per Unit </th> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Per Node </th> <th class="px-4 py-2 text-left text-xs font-medium text-gray-500 tracking-wider"> Entire Cluster </th> </tr> </thead> <tbody> <tr class="border-b border-gray-200"> <td id="graysort-table-row0-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Storage (25)</span> </td> <td id="graysort-table-row0-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Disk - Sequential Read</span> </td> <td id="graysort-table-row0-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">7 GB/s</span> </td> <td id="graysort-table-row0-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600 has-calculation"> <span class="normal-text">112 GB/s</span> <span class="calculation-text" style="display: none;"> <span class="calc-result">112 GB/s</span> <span class="calc-equals">=</span> <span class="calc-formula">7 GB/s × 16</span> </span> </td> <td id="graysort-table-row0-col4" class="px-4 py-2 whitespace-nowrap text-sm font-medium 
text-gray-600 has-calculation"> <span class="normal-text">2.8 TB/s</span> <span class="calculation-text" style="display: none;"> <span class="calc-result">2.8 TB/s</span> <span class="calc-equals">=</span> <span class="calc-formula">112 GB/s × 25</span> </span> </td> </tr> <tr class="border-b border-gray-200"> <td id="graysort-table-row1-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;"></span> </td> <td id="graysort-table-row1-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Disk - Random Read</span> </td> <td id="graysort-table-row1-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">4 GB/s</span> </td> <td id="graysort-table-row1-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600 has-calculation"> <span class="normal-text">64 GB/s</span> <span class="calculation-text" style="display: none;"> <span class="calc-result">64 GB/s</span> <span class="calc-equals">=</span> <span class="calc-formula">4 GB/s × 16</span> </span> </td> <td id="graysort-table-row1-col4" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600 has-calculation"> <span class="normal-text">1.6 TB/s</span> <span class="calculation-text" style="display: none;"> <span class="calc-result">1.6 TB/s</span> <span class="calc-equals">=</span> <span class="calc-formula">64 GB/s × 25</span> </span> </td> </tr> <tr class="border-b border-gray-200"> <td id="graysort-table-row2-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;"></span> </td> <td id="graysort-table-row2-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Disk - Sequential Write</span> </td> <td id="graysort-table-row2-col2" class="px-4 
py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">6 GB/s</span> </td> <td id="graysort-table-row2-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600 has-calculation"> <span class="normal-text">96 GB/s</span> <span class="calculation-text" style="display: none;"> <span class="calc-result">96 GB/s</span> <span class="calc-equals">=</span> <span class="calc-formula">6 GB/s × 16</span> </span> </td> <td id="graysort-table-row2-col4" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600 has-calculation"> <span class="normal-text">2.4 TB/s</span> <span class="calculation-text" style="display: none;"> <span class="calc-result">2.4 TB/s</span> <span class="calc-equals">=</span> <span class="calc-formula">96 GB/s × 25</span> </span> </td> </tr> <tr class="border-b border-gray-200"> <td id="graysort-table-row3-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;"></span> </td> <td id="graysort-table-row3-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Disk - Random Write</span> </td> <td id="graysort-table-row3-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">2 GB/s</span> </td> <td id="graysort-table-row3-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600 has-calculation"> <span class="normal-text">32 GB/s</span> <span class="calculation-text" style="display: none;"> <span class="calc-result">32 GB/s</span> <span class="calc-equals">=</span> <span class="calc-formula">2 GB/s × 16</span> </span> </td> <td id="graysort-table-row3-col4" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600 has-calculation"> <span class="normal-text">0.8 TB/s</span> <span class="calculation-text" style="display: none;"> <span 
class="calc-result">0.8 TB/s</span> <span class="calc-equals">=</span> <span class="calc-formula">32 GB/s × 25</span> </span> </td> </tr> <tr class="border-b border-gray-200"> <td id="graysort-table-row4-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;"></span> </td> <td id="graysort-table-row4-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Network</span> </td> <td id="graysort-table-row4-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">100 GB/s</span> </td> <td id="graysort-table-row4-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">100 GB/s</span> </td> <td id="graysort-table-row4-col4" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600 has-calculation"> <span class="normal-text">2.5 TB/s</span> <span class="calculation-text" style="display: none;"> <span class="calc-result">2.5 TB/s</span> <span class="calc-equals">=</span> <span class="calc-formula">100 GB/s × 25</span> </span> </td> </tr> <tr class="border-b border-gray-200"> <td id="graysort-table-row5-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Client (50)</span> </td> <td id="graysort-table-row5-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Network</span> </td> <td id="graysort-table-row5-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">25 GB/s</span> </td> <td id="graysort-table-row5-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">25 GB/s</span> </td> <td 
id="graysort-table-row5-col4" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600 has-calculation"> <span class="normal-text">1.25 TB/s</span> <span class="calculation-text" style="display: none;"> <span class="calc-result">1.25 TB/s</span> <span class="calc-equals">=</span> <span class="calc-formula">25 GB/s × 50</span> </span> </td> </tr> <tr class="border-b border-gray-200"> <td id="graysort-table-row6-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Gray Sort</span> </td> <td id="graysort-table-row6-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Client Write Peak</span> </td> <td id="graysort-table-row6-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">N/A</span> </td> <td id="graysort-table-row6-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">~10 GB/s</span> </td> <td id="graysort-table-row6-col4" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">N/A</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="graysort-table-row7-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Gray Sort</span> </td> <td id="graysort-table-row7-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Client Read Peak</span> </td> <td id="graysort-table-row7-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">N/A</span> </td> <td id="graysort-table-row7-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) 
!important;">~22 GB/s</span> </td> <td id="graysort-table-row7-col4" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">N/A</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="graysort-table-row8-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Gray Sort</span> </td> <td id="graysort-table-row8-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Server Write Peak</span> </td> <td id="graysort-table-row8-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">N/A</span> </td> <td id="graysort-table-row8-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">~22 GB/s</span> </td> <td id="graysort-table-row8-col4" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">N/A</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="graysort-table-row9-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Gray Sort</span> </td> <td id="graysort-table-row9-col1" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">Server Read Peak</span> </td> <td id="graysort-table-row9-col2" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">N/A</span> </td> <td id="graysort-table-row9-col3" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;">~30 GB/s</span> </td> <td id="graysort-table-row9-col4" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span 
style="color: rgb(75, 85, 99) !important;">N/A</span> </td> </tr> <tr class="border-b border-gray-200"> <td id="graysort-table-row10-col0" class="px-4 py-2 whitespace-nowrap text-sm font-medium text-gray-600"> <span style="color: rgb(75, 85, 99) !important;"></span> </td> </tr> </tbody> </table> </div> <p>The performance numbers reveal an interesting pattern. In the first phase, the server write peak achieves <span data-highlight-cells="graysort-table-row8-col3, graysort-table-row3-col3"><span data-hover-text-color="rgba(80, 150, 100, 0.9)" data-hover-cell-bg="#FFFFE0">~22 GB/s</span> out of <span data-hover-text-color="rgba(52, 152, 219, 0.9)" data-hover-cell-bg="#FFFFE0">32 GB/s</span> random write capacity</span> (69% utilization). In the second phase, the server read peak reaches <span data-highlight-cells="graysort-table-row9-col3, graysort-table-row1-col3"><span data-hover-text-color="rgba(80, 150, 100, 0.9)" data-hover-cell-bg="#FFFFE0">~30 GB/s</span> out of <span data-hover-text-color="rgba(52, 152, 219, 0.9)" data-hover-cell-bg="#FFFFE0">64 GB/s</span> random read capacity</span> (47% utilization), which is quite a bit lower than the relative utilization for writes. 
However, <span data-highlight-cells="graysort-table-row9-col3, graysort-table-row0-col3">comparing the <span data-hover-text-color="rgba(80, 150, 100, 0.9)" data-hover-cell-bg="#FFFFE0">~30 GB/s</span> read peak against the <span data-hover-text-color="rgba(52, 152, 219, 0.9)" data-hover-cell-bg="#FFFFE0">112 GB/s</span> sequential read capacity</span> (27% utilization) strongly signals that the workload is predominantly random rather than sequential.</p> <p>Let’s take a look at the numbers:</p> <ul> <li>Storage nodes peak at <span data-highlight-cells="graysort-table-row8-col3, graysort-table-row9-col3, graysort-table-row4-col3"><span data-hover-text-color="rgba(80, 150, 100, 0.9)" data-hover-cell-bg="#FFFFE0">22 GB/s writes and 30 GB/s reads</span>, well below the <span data-hover-text-color="rgba(52, 152, 219, 0.9)" data-hover-cell-bg="#FFFFE0">100 GB/s network capacity</span></span></li> <li>Client read peak achieves <span data-highlight-cells="graysort-table-row7-col3, graysort-table-row5-col2">88% of network capacity (<span data-hover-text-color="rgba(80, 150, 100, 0.9)" data-hover-cell-bg="#FFFFE0">22 GB/s</span> out of <span data-hover-text-color="rgba(52, 152, 219, 0.9)" data-hover-cell-bg="#FFFFE0">25 GB/s</span>)</span></li> <li>Client write peak hits only <span data-highlight-cells="graysort-table-row6-col3, graysort-table-row5-col2">40% of network capacity (<span data-hover-text-color="rgba(80, 150, 100, 0.9)" data-hover-cell-bg="#FFFFE0">10 GB/s</span> out of <span data-hover-text-color="rgba(52, 152, 219, 0.9)" data-hover-cell-bg="#FFFFE0">25 GB/s</span>)</span><span class="sidenote-ref"></span><span class="sidenote">Why don’t writes peak nearly as high as reads? One reason might be CRAQ’s consistency guarantees: each write must traverse the entire chain (head → middle → tail → back), which makes write performance predictable, unlike reads.
Reads, by contrast, can be served from a follower or trigger a consistency check at the tail</span></li> </ul> <p>The bottleneck here is clearly the number of clients. With the storage nodes far from saturated, we could support more clients. How many? If we want to saturate the storage sequential write capacity of <span data-highlight-cells="graysort-table-row2-col4, graysort-table-row5-col2"><span data-hover-text-color="rgba(52, 152, 219, 0.9)" data-hover-cell-bg="#FFFFE0">2.4 TB/s</span> and each client can push <span data-hover-text-color="rgba(80, 150, 100, 0.9)" data-hover-cell-bg="#FFFFE0">25 GB/s</span></span>:</p> <p>2.4 TB/s ÷ 25 GB/s = ~96 clients</p> <p>Nearly double the current 50 clients! This suggests the current configuration may be significantly underutilizing the storage infrastructure.</p> <p>Interestingly, <span data-highlight-cells="graysort-table-row8-col3, graysort-table-row6-col3">the storage write peak (<span data-hover-text-color="rgba(80, 150, 100, 0.9)" data-hover-cell-bg="#FFFFE0">22 GB/s</span>) slightly exceeds the client write peak (<span data-hover-text-color="rgba(52, 152, 219, 0.9)" data-hover-cell-bg="#FFFFE0">20 = 2 × 10 GB/s</span>)</span>. With 50 clients at 10 GB/s distributed across 25 storage nodes, each node should see ~20 GB/s, with the extra 2 GB/s coming from somewhere – perhaps CRAQ protocol overhead?<span class="sidenote-ref"></span><span class="sidenote">CRAQ requires writes to propagate through chains, potentially creating additional write traffic beyond what clients generate</span></p> <p>The end-to-end performance measurements, however, reveal an unexpected pattern: the <a href="https://github.com/deepseek-ai/3FS/tree/ee9a5cee0a85c64f4797bf380257350ca1becd36">benchmark notes mention achieving 3.66 TB/min</a> – 61 GB/s aggregate throughput, which doesn’t sound too bad.
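</p>

<p>As a sanity check, the unit conversions and utilization figures in this section can be reproduced in a few lines of Python. This is just back-of-envelope arithmetic over the table values above (decimal units assumed, 1 TB = 1000 GB):</p>

```python
# Back-of-envelope checks for the GraySort benchmark numbers above.
# Decimal units throughout (1 TB = 1000 GB), matching the tables.

seq_write_cluster = 6 * 16 * 25   # GB/s: 6 GB/s per SSD x 16 SSDs x 25 nodes = 2400
client_nic = 25                   # GB/s per client (one 200 Gbps NIC)

# Clients needed to saturate cluster sequential-write capacity
print(seq_write_cluster / client_nic)                    # 96.0

# Reported end-to-end throughput: 3.66 TB/min -> GB/s
aggregate_gbs = 3.66e12 / 60 / 1e9
print(round(aggregate_gbs))                              # 61

# Fraction of total client network capacity actually used
client_capacity = client_nic * 50                        # 50 clients -> 1250 GB/s
print(round(aggregate_gbs / client_capacity * 100, 2))   # 4.88
```

<p>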
But that’s just 4.88% of the <span data-highlight-cells="graysort-table-row5-col4" data-hover-text-color="rgba(52, 152, 219, 0.9)" data-hover-cell-bg="#FFFFE0">1.25 TB/s client network capacity</span>. This suggests that the bottleneck might not be network or disk at all – it could even be CPU/memory bound from the sorting computation itself.</p> <h2 id="caching-the-key-value-pairs-of-the-transformer">Caching the key-value pairs of the transformer</h2> <h3 id="what-is-the-kv-cache">What is the KV Cache?</h3> <p>The KV cache stores the key-value pairs from attention mechanisms during LLM inference. Instead of recomputing these values for every new token, the system caches them, trading storage for computation to dramatically reduce computational overhead. For models like DeepSeek’s R1, this cache becomes substantial – each token requires approximately 70KB of storage in FP16 format.</p> <p>This workload represents an important real-world use case for 3FS. As LLMs process longer contexts and serve more users concurrently, the storage system must handle both massive reads (loading cached values) and periodic deletions (garbage collecting expired entries).</p> <h3 id="initial-look-at-the-graphs-1">Initial Look at the Graphs</h3> <div class="image-container" style="max-width: 100%; overflow: visible;"> <img loading="lazy" src="/assets/images/posts/2025-03-13/part2/kvcache_read_throughput.png" style="width: 125%; margin-left: calc((100% - 125%) / 2);" alt="" /> <div class="caption"> <em>KV Cache Read Throughput (Image source: <a href="https://github.com/deepseek-ai/3FS" rel="external nofollow noopener" target="_blank">3FS github</a>) </em> </div> </div> <div class="image-container" style="max-width: 100%; overflow: visible;"> <img loading="lazy" src="/assets/images/posts/2025-03-13/part2/kvcache_gc_iops.png" style="width: 125%; margin-left: calc((100% - 125%) / 2);" alt="" /> <div class="caption"> <em>KV Cache GC IOPS (Image source: <a
href="https://github.com/deepseek-ai/3FS" rel="external nofollow noopener" target="_blank">3FS github</a>) </em> </div> </div> <p>The graphs show per-client performance for KV cache operations. Looking at the read throughput graph:</p> <ul> <li>Average throughput hovers around 3 GB/s</li> <li>Peak throughput reaches approximately 40 GB/s</li> <li>That’s more than a 13× difference between average and peak</li> </ul> <p>The GC IOPS graph reveals:</p> <ul> <li>Periodic bursts of deletion operations reaching 1-1.4M IOPS</li> <li>~4 bursts per 5-minute interval <ul> <li>Each burst lasts ~40 seconds, followed by a similar period of low activity</li> </ul> </li> </ul> <p>Unfortunately, the authors don’t specify the complete hardware configuration – we only know each client has a 400 Gbps NIC (50 GB/s). This means the peak 40 GB/s achieves 80% network utilization, while average performance uses only 6% of available bandwidth.</p> <h3 id="analyzing-the-workload">Analyzing the Workload</h3> <p>The read pattern is fundamentally random – individual KV entries are scattered across storage. However, each 70KB entry spans multiple 4KB blocks<span class="sidenote-ref"></span><span class="sidenote">SSDs read data in fixed-size blocks, typically 4KB.
A 70KB entry requires reading 18 consecutive blocks</span>, resulting in sequential device-level reads despite the random access pattern per entry.</p> <p>Let me calculate what these throughput numbers mean for token processing:</p> <details style="font-size: 1.2em; margin: 1em 0;"> <summary style="cursor: pointer; font-weight: bold;">Expand for calculations for KV cache entry</summary> <p><a href="https://github.com/deepseek-ai/DeepSeek-V3/blob/4c2fdb8f55e049553b9f4f1a3241f86d739c8cf8/inference/configs/config_671B.json">671B configuration</a></p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{ "vocab_size": 129280, "dim": 7168, "inter_dim": 18432, "moe_inter_dim": 2048, "n_layers": 61, "n_dense_layers": 3, "n_heads": 128, "n_routed_experts": 256, "n_shared_experts": 1, "n_activated_experts": 8, "n_expert_groups": 8, "n_limited_groups": 4, "route_scale": 2.5, "score_func": "sigmoid", "q_lora_rank": 1536, "kv_lora_rank": 512, "qk_nope_head_dim": 128, "qk_rope_head_dim": 64, "v_head_dim": 128, "dtype": "fp8" } </code></pre></div> </div> <div class="image-container" style="max-width: 100%; overflow: visible;"> <img loading="lazy" src="/assets/images/posts/2025-03-13/part2/paper_mla.png" style="width: 125%; margin-left: calc((100% - 125%) / 2);" alt="" /> <div class="caption"> <em>KV Cache MLA calculation described in Deepseek V2 (Image source: <a href="https://arxiv.org/pdf/2405.04434" rel="external nofollow noopener" target="_blank">DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model</a>) </em> </div> </div> <p>Given:</p> <ul> <li>kv_lora_rank = 512</li> <li>qk_rope_head_dim = 64</li> <li>n_layers = 61</li> </ul> <p>Per token: (512 + 64) × 61 = 35,136 elements</p> <p>In FP16 format (2 bytes per element): 35,136 × 2 = 70,272 bytes ≈ 70KB per token. In FP8 format (1 byte per element): 35,136 bytes ≈ 35KB per token.</p> </details> <p>With 70KB per token:</p> <ul> <li>Average throughput (3 GB/s)
processes ~43,000 tokens/second per client</li> <li>Peak throughput (40 GB/s) processes ~570,000 tokens/second per client</li> </ul> <p>Given R1’s 128K context length:</p> <ul> <li>Average: Can read entire context in ~3 seconds (128K ÷ 43K)</li> <li>Peak: Can read entire context in ~0.22 seconds (128K ÷ 570K)</li> </ul> <p>These numbers are impressive, but without knowing the number of concurrent users or typical context lengths, it’s hard to judge real-world performance.</p> <h3 id="alignment-concerns">Alignment Concerns</h3> <p>Here’s an issue the authors don’t address: alignment waste. Modern NVMe SSDs use 4KB block sizes, but KV cache entries are 70KB – not cleanly divisible.</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Blocks needed = ⌈70,272 ÷ 4,096⌉ = 18 blocks Actual storage = 18 × 4,096 = 73,728 bytes Wasted space = 3,456 bytes (4.69%) </code></pre></div></div> <p>This 4.69% waste might seem small, but at scale it adds up. With enterprise SSDs costing ~$2,200 each:</p> <ul> <li>Waste per SSD: ~$103</li> <li>Waste per node (16 SSDs): ~$1,650</li> <li>Waste per 180 nodes: ~$297,000</li> <li>Waste per 10,000 nodes: ~$16,500,000</li> </ul> <p>For a company running thousands of clusters, this alignment inefficiency could waste millions in storage costs.</p> <h3 id="garbage-collection">Garbage Collection</h3> <p>The GC algorithm isn’t detailed, but entries likely get marked for deletion when no longer referenced. The deletion mechanism remains unclear – it could involve bit flags, pointer updates, zeroing entries, or <a href="https://en.wikipedia.org/wiki/Log-structured_merge-tree#Operations">tombstone markers</a>.</p> <p>The periodic burst pattern (1-1.4M IOPS) suggests that threshold-based eviction or batch processing is probably more efficient than continuous cleanup for this type of workload.
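</p>

<p>A threshold-triggered batch eviction loop would produce exactly this bursty IOPS signature: long quiet periods while deletions accumulate, then a spike once a high-water mark is crossed. Here’s a minimal, purely illustrative sketch – the cache class, water marks, and eviction policy are my assumptions, not 3FS internals:</p>

```python
import time

class KVCache:
    """Toy in-memory KV cache with TTL-expired entries (illustrative only)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = {}            # key -> expiry timestamp

    def put(self, key, ttl):
        self.entries[key] = time.time() + ttl

    def used_fraction(self):
        return len(self.entries) / self.capacity

    def expired_keys(self):
        now = time.time()
        return [k for k, exp in self.entries.items() if exp < now]

    def delete(self, key):
        del self.entries[key]

def run_gc(cache, high_water=0.9, low_water=0.7):
    """Batch-evict expired entries once usage crosses the high-water mark."""
    if cache.used_fraction() < high_water:
        return 0                     # quiet period: no deletions issued
    evicted = 0
    for key in cache.expired_keys():
        cache.delete(key)            # each delete costs one IOP on real storage
        evicted += 1
        if cache.used_fraction() <= low_water:
            break                    # burst ends once below the low-water mark
    return evicted
```

<p>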
While throughput remains stable during GC, these spikes could impact performance if disks are already near throughput capacity<span class="sidenote-ref"></span><span class="sidenote">Garbage collection problems have appeared numerous times in many existing systems – showing up as <a href="https://github.com/facebook/rocksdb/issues/3972">compaction issues in RocksDB</a> or <a href="https://stackoverflow.com/questions/54831212/postgresql-autovacuum-causing-significant-performance-degradation">autovacuum spikes in Postgres</a></span>.</p> <h3 id="remaining-feedback">Remaining feedback</h3> <p>Some critical information is absent from this benchmark, most notably the lack of latency graphs. For LLM serving, latency matters as much as throughput – users need consistent time-to-first-token and smooth text generation, or they’ll switch to another service (ChatGPT, Gemini, Claude, etc.).</p> <p>Someone at Deepseek clearly knows how to configure systems well if this is a real sample from a live system. The 80% peak utilization indicates a well-configured system with just enough headroom.<span class="sidenote-ref"></span><span class="sidenote">Nobody wants that 3am call to discuss needing to set up more machines to handle the traffic</span></p> <h1 id="closing-thoughts">Closing Thoughts</h1> <p>The benchmarks focus exclusively on throughput, omitting latency metrics entirely. Not sure why they skipped latency – perhaps cost considerations took priority.
While latency optimization is notoriously difficult<span class="sidenote-ref"></span><span class="sidenote"><a href="http://www.stuartcheshire.org/rants/latency.html">Stuart Cheshire: “It’s the latency, stupid”</a></span><span class="sidenote-ref"></span><span class="sidenote"><a href="https://www.barroso.org/publications/TheTailAtScale.pdf">Jeff Dean on tail latencies at scale</a></span>, my future evaluations will include latency measurements and explore optimizations to improve it.</p> <p>Despite these limitations and critiques, the benchmarks align well with theoretical calculations and provide valuable insights into 3FS performance at scale.</p> <p>In upcoming posts, I’ll benchmark 3FS myself to verify these graphs and claims and dig deeper:</p> <ul> <li>Testing actual hardware limits vs theoretical calculations</li> <li>Measuring latency distributions, not just throughput</li> <li>Creating custom visualizations for storage and network performance patterns</li> <li>Validating if our back-of-the-envelope math holds up</li> <li>Profiling with various tools (perf, sampling, adapting source code) to identify bottlenecks</li> </ul> <h1 id="acknowledgments">Acknowledgments</h1> <p>Thanks to <a href="https://sbaziotis.com/">Stefanos Baziotis</a>, <a href="https://www.linkedin.com/in/ahan-gupta-405619103/">Ahan Gupta</a>, and <a href="https://vimarsh.me/">Vimarsh Sathia</a> for reviewing this post.</p> <h1 id="citation">Citation</h1> <p>To cite this article:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@article{zhu20253fs2,
  title   = {A Reality Check on DeepSeek's Distributed File System Benchmarks},
  author  = {Zhu, Henry},
  journal = {maknee.github.io},
  year    = {2025},
  month   = {June},
  url     = "https://maknee.github.io/blog/2025/3FS-Performance-Journal-2/"
}
</code></pre></div></div> Paul Graham - What to Do 2025-04-20T00:00:00+00:00 2025-04-20T00:00:00+00:00
https://maknee.github.io/blog/2025/Paul-Graham-What-To-Do <h3 id="paul-graham---what-to-do">Paul Graham - What to Do</h3> <p>Thoughts about <a href="https://paulgraham.com/do.html">What to Do</a> by Paul Graham.</p> <h4 id="thoughts-along-the-way">Thoughts along the way</h4> <blockquote> <p>What should one do? That may seem a strange question, but it’s not meaningless or unanswerable. It’s the sort of question kids ask before they learn not to ask big questions.</p> </blockquote> <p>This statement about kids kind of took me off guard - I do see it happen (at least in myself). Why though? Does he see it in his children and the kids that he encounters? What does he consider “kids” in this context - elementary school, high school, college? I see this explained in the hierarchy of societies. Most definitely in the military.</p> <blockquote> <p>I only came across it myself in the process of investigating something else. But once I did, I thought I should at least try to answer it.</p> </blockquote> <p>Oh, I haven’t explained why I was caught off guard. It’s because I haven’t thought about this in a long time. And I don’t have an answer yet.</p> <blockquote> <p>So what should one do? One should help people, and take care of the world. Those two are obvious.</p> </blockquote> <p>This is how kids would answer.</p> <blockquote> <p>But is there anything else? When I ask that, the answer that pops up is Make good new things.</p> </blockquote> <p>What good things? How do you know that they are good? Or new?</p> <blockquote> <p>The most impressive thing humans can do is to think. It may be the most impressive thing that can be done. And the best kind of thinking, or more precisely the best proof that one has thought well, is to make good new things.</p> </blockquote> <p>I believe in this, and he has stated it well with very concise sentences. I like it.</p> <blockquote> <p>Newton’s physics was a good new thing.</p> </blockquote> <p>Surprised by an example.
The concept may be very abstract without a general example (here, one where everyone knows about the discovery).</p> <p>I’m going to guess that this discovery allowed people to develop technology (ships, safety, etc)?</p> <blockquote> <p>Indeed, the first version of this principle was to have good new ideas. But that didn’t seem general enough: it didn’t include making art or music, for example, except insofar as they embody new ideas. And while they may embody new ideas, that’s not all they embody, unless you stretch the word “idea” so uselessly thin that it includes everything that goes through your nervous system.</p> </blockquote> <p>I don’t understand this very well; I think he’s trying to explain how general a new idea can be - I don’t think it has to be. It’s very very very very very very very difficult to make a general good new idea. I believe that it’s built upon the ideas of many people, hundreds, thousands, millions, etc… to get to a general good new idea. I see this repeated.</p> <blockquote> <p>To make discoveries, for example, or to understand something more deeply than others have. But how well do you understand something if you can’t make a model of it, or write about it? Indeed, trying to express what you understand is not just a way to prove that you understand it, but a way to understand it better.</p> </blockquote> <p>The more I do this, the more I believe in it.</p> <p>I think I’ve applied it to a teeny bit of my life. And I hope that the same rule will apply in other aspects of life/experiences.</p> <blockquote> <p>Another reason I like this phrasing is that it biases us toward creation. It causes us to prefer the kind of ideas that are naturally seen as making things rather than, say, making critical observations about things other people have made.
Those are ideas too, and sometimes valuable ones, but it’s easy to trick oneself into believing they’re more valuable than they are.</p> </blockquote> <p>Two parts to this.</p> <p>I don’t agree that the phrasing biases us toward creation. Seems forced - I didn’t see it that way originally. Discoveries (albeit repeated among different individuals) can fall under this term. I believe that it’s more about thinking and learning.</p> <p>Yes, I agree with what Paul states about the observations. Even an intellectual person that you look up to may make a wrong guess. For example, the <a href="https://en.wikipedia.org/wiki/Tanenbaum%E2%80%93Torvalds_debate">godfather of operating systems</a> lost a debate against Linus over whether Linux would succeed as a monolithic kernel. Imagine that: a random ass college kid (Linus was 23 at the time) tells the most well-known/accomplished professor in operating systems at that time that his hobby operating system would win. If I were a random person in this flame war, I would definitely not have chosen Linus’ arguments.</p> <p>And I see this often in my life as well. People make observations all the time, but when some X action happens, they’re sometimes wrong. Should you believe their observations? Sometimes.</p> <blockquote> <p>Criticism seems sophisticated, and making new things often seems awkward, especially at first; and yet it’s precisely those first steps that are most rare and valuable.</p> </blockquote> <p>This next statement came off a bit disconnected from the previous sentences, and yet I do think it’s necessary to have it. I think that the observations may seem most rare/valuable, but I believe that it’s generally a series of observations, and it takes a bit to form thoughts about different/unusual observations.</p> <blockquote> <p>Is newness essential? I think so. Obviously it’s essential in science.
If you copied a paper of someone else’s and published it as your own, it would seem not merely unimpressive but dishonest.</p> </blockquote> <p>Interesting statement about papers.</p> <blockquote> <p>Which in turn implies it’s not impressive to make the same thing over and over, however well; you’re just copying yourself.</p> </blockquote> <p>The problem here is that there’s not much learning (which comes from going through the problems and painful steps to get to the end) - which I think Paul is stating here.</p> <blockquote> <p>Historically most rules about how to live have been a mix of both kinds of should, though usually with more of the former than the latter.</p> </blockquote> <p>Nice observation.</p> <blockquote> <p>Archimedes knew that he was the first to prove that a sphere has 2/3 the volume of the smallest enclosing cylinder and was very pleased about it. But you don’t find ancient writers urging their readers to emulate him. They regarded him more as a prodigy than a model.</p> </blockquote> <p>Very interesting observation. Why not emulate him?</p> <blockquote> <p>Now many more of us can follow Archimedes’s example and devote most of our attention to one kind of work.</p> </blockquote> <p>Oh.</p> <blockquote> <p>What kinds of new things count? I’d rather leave that question to the makers of them.</p> </blockquote> <p>He didn’t answer the question… :(, but this is the answer to give.</p> <blockquote> <p>It would be a risky business to try to define any kind of threshold, because new kinds of work are often despised at first. Raymond Chandler was writing literal pulp fiction, and he’s now recognized as one of the best writers of the twentieth century.
Indeed this pattern is so common that you can use it as a recipe: if you’re excited about some kind of work that’s not considered prestigious and you can explain what everyone else is overlooking about it, then this is not merely a kind of work that’s ok to do, but one to seek out.</p> </blockquote> <p>What a good statement. I’m focusing on the “hey, it’s not good at first” part. I’ve seen this a couple of times already. But I think Paul doesn’t mention the other factors: time taken, mental stress, comfort, physical taxation, … These can be minor or major hurdles of going down such a route. It is sometimes brutal to go down such a path.</p> <blockquote> <p>The kind of people who make good new things don’t need rules to keep them honest.</p> </blockquote> <p>True, but again, there are hurdles, and this time they include other people.</p> <blockquote> <p>But even if you’re one of those, you should at least make sure that the new things you make don’t net harm people or the world.</p> </blockquote> <p>Very hard to see sometimes.</p> <blockquote> <p>On the other hand, if you make something amazing, you’ll often be helping people or the world even if you didn’t mean to. Newton was driven by curiosity and ambition, not by any practical effect his work might have, and yet the practical effect of his work has been enormous. And this seems the rule rather than the exception. So if you think you can make something amazing, you should probably just go ahead and do it.</p> </blockquote> <p>Great ending. “Just do it” - easier said than done. That’s for sure. Another thing that Paul doesn’t mention: it’s like the gym. It takes reps to build muscle. It takes reps to make something amazing. “Just do it” - yes, but make sure you see your goals clearly in the moment and learn to identify plateaus.
Not every person in the Olympics just randomly went at it, did one thing, and became good at their sport.</p> <h4 id="final-thoughts">Final thoughts</h4> <p>To answer this generically - I can’t. But if you’re doing stuff that you’re interested in and trying to create something: talking to people, reading/watching the literature, and doing something and then thinking (or guessing, then doing and thinking again) are some ways one can get to the point of creating something. However, going through it may not be fun at times, may actually be uninteresting at times, or may even be depressing and cause someone to re-evaluate a lot (feel lost). I think that as long as one holds the belief at one’s core, one will make progress. Ask anyone you think is successful what they failed at or when they felt lost; they should answer with an event that sticks out, or a couple, or even mention that they wanted to give up.</p>