← Blog

Announcing Our $20M Seed Round — Is Kernel Generation a Solved Problem?

At Standard Kernel, our vision is to build AI systems that can design and optimize the full stack of AI infrastructure, making high-performance systems software adaptive, workload-specific, and continuously improving as hardware and models evolve.

We begin at the smallest unit of compute: kernels, the low-level code that executes massively parallel fundamental computations on hardware accelerators like GPUs and TPUs. Kernels are the primitives from which performance, scale, and capability emerge. By working at the boundary between models and machines, down to native chip instructions, we replace one-size-fits-all libraries with code tailored to specific workloads and hardware configurations, driving better performance. GPU kernels are our first focus, but they are just the beginning.

We’re excited to share we’ve raised a $20M seed round, led by Jump Capital, with participation from General Catalyst, Felicis, Cowboy Ventures, Link Ventures, Essence VC, and an incredible group of angels and strategic partners including David M. Siegel, Jeff Dean, Jonathan Frankle, Michael Carbin, Sachin Katti, Walden Yan, CoreWeave, and Ericsson. We’re grateful for the support of people who share our belief that systems software is one of the most important frontiers in AI, and we’re excited to continue to push this work forward together.

“What excites us about Standard Kernel is that they are applying AI to one of the most manual and technically demanding layers of the stack. Hardware innovation is accelerating, but the software that extracts peak performance from it has lagged behind. Automating instruction-level optimization has the potential to meaningfully change how AI infrastructure scales.” – Saaya Pal, Partner at Jump Capital
"Standard Kernel is tackling one of the most consequential challenges in modern compute, driving optimization deep within the systems stack where performance is won or lost. As AI adoption continues to scale, breakthroughs in the layers beneath today's models will define the next generation of capabilities. That depth of technical ambition and the caliber of the team are precisely why CoreWeave Ventures is proud to invest in Standard Kernel as they shape the future of AI systems." – Brian Venturo, Co-founder and Chief Strategy Officer at CoreWeave
“Kernel generation is key for improving performance and efficiency of AI hardware. As fleet sizes for users of AI hardware get larger, and more hardware diversity is introduced, Standard Kernel becomes key to deployment.” – Dylan Patel, Founder of SemiAnalysis

Understanding Progress in Kernel Generation

Kernel generation, once a niche systems topic, has moved into the mainstream of AI research and engineering since the release of KernelBench, which has now become a staple benchmark for LLMs and agentic coding systems. New papers, demos, and repos are constantly tackling kernels directly and reporting increasingly strong results, prompting the question: Is kernel generation solved?

The growing body of results can be difficult to interpret, especially for those not immersed in GPU systems. Not all ‘wins’ are equivalent. Achieving a 4x speedup on LayerNorm is far easier than squeezing a 1.05x gain over cuBLAS on GEMM, since the latter already represents one of the most heavily optimized kernels in computing. Similarly, emitting a FP32 GEMM kernel is easier than generating an FP16/FP8/FP4 tensor-core-optimized implementation that fully utilizes specialized hardware (KernelBench uses FP32 by default). Producing Triton code is also an easier problem than generating CUDA C++ with inline PTX, which requires direct control of low-level hardware instructions. Finally, generating kernels with substantial human guidance is different from building fully autonomous systems that consistently reach state-of-the-art performance.
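The asymmetry between these wins can be made concrete with a back-of-the-envelope roofline estimate. This is only a sketch: the FLOP and byte counts are simplified, and the peak numbers are assumed round figures for an H100-class GPU, not vendor specifications.

```python
# Rough roofline sketch: why a big LayerNorm speedup is "cheaper" than a
# small GEMM win. Peak numbers are illustrative round figures, not specs.
PEAK_FLOPS_FP32 = 60e12   # assumed FP32 peak, FLOP/s
PEAK_BW = 3e12            # assumed HBM bandwidth, bytes/s

def arithmetic_intensity(flops, bytes_moved):
    """FLOPs performed per byte of DRAM traffic."""
    return flops / bytes_moved

# LayerNorm over a (4096, 4096) FP32 tensor: a handful of FLOPs per
# element, every element read and written once -> memory-bound.
n = 4096 * 4096
ln_ai = arithmetic_intensity(8 * n, 2 * 4 * n)          # ~1 FLOP/byte

# Square FP32 GEMM, M = N = K = 4096: 2*M*N*K FLOPs over three matrices.
m = 4096
gemm_ai = arithmetic_intensity(2 * m**3, 3 * 4 * m**2)  # ~683 FLOP/byte

ridge = PEAK_FLOPS_FP32 / PEAK_BW  # ~20 FLOP/byte on these assumed peaks
print(f"LayerNorm AI ~{ln_ai:.1f}, GEMM AI ~{gemm_ai:.0f}, ridge ~{ridge:.0f}")
```

LayerNorm sits far below the ridge point (bandwidth-bound, so a wasteful baseline leaves 4x headroom on the table), while GEMM sits far above it (compute-bound, so beating cuBLAS means beating near-peak FLOP utilization).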

To that end, we’ve developed the Standard Kernel Rubric for evaluating kernel generation systems along five dimensions: kernel complexity (K), representation level (R), hardware specialization (H), performance target (P), and automation level (A). We use this rubric to evaluate what has been solved, what remains open, and where we believe the field is headed next:

  • Kernels expressed in high-level DSLs (R1–R2) are largely solved (A4, P4), though the performance ceiling is ultimately determined by the capabilities of the underlying compiler and runtime.
  • Simple kernels (K1–K2) that do not rely on specialized hardware units (H1) are effectively solved (A4, P4). In these cases, systems can even discover large speedups or novel algorithmic variants.
  • For more complex kernels (K3–K4) that utilize frontier hardware features (H4), systems can sometimes reach or exceed state-of-the-art performance (P4) with human guidance (A1–A2), demonstrating that additional performance can still be extracted beyond highly optimized libraries such as cuBLAS.
  • We are working on autonomously generating (A4) state-of-the-art kernels (P4) for complex workloads (K3–K4) at the lowest representation level (R4), enabling day-0 peak performance on new hardware across arbitrary workloads.

Table summarizing Standard Kernel Rubric

Kernel Complexity (K)
  • Level 1 (Simple / Memory-Bound): Elementwise ops and basic reductions (add, mul, sum, mean, GELU, small fusions).
  • Level 2 (Structured Primitives): Moderate structure combining reductions and elementwise ops (LayerNorm, RMSNorm, Softmax).
  • Level 3 (Dense Core Linear Algebra): High arithmetic intensity and regular tiling under standard precision (MatMul, Conv2D, batched GEMM, naïve attention in FP32 or FP16/BF16).
  • Level 4 (Frontier / Architecture-Coupled): Deeply fused or novel operators tightly coupled to hardware (custom attention, block-sparse kernels, mega-kernels, FP8/FP6/FP4 dense linear algebra).

Representation Level (R)
  • Level 1 (Library Composition): Compose existing primitives; execution handled by libraries (cuBLAS, cuDNN, CUTLASS).
  • Level 2 (High-Level DSL): Kernels written in DSLs where the compiler manages scheduling and mapping (Triton, cuTile).
  • Level 3 (Lower-Level DSL): Explicit tiling, data movement, and compute primitives with some abstraction (CUTLASS CuTe).
  • Level 4 (Instruction-Level Programming): Direct hardware control via CUDA C++ and inline PTX with explicit thread orchestration.

Hardware Specialization (H)
  • Level 1 (Portable Implementation): General constructs with no architecture-specific features.
  • Level 2 (Accelerator-Aware): Explicitly targets specialized compute units (e.g., Tensor Cores via WMMA/MMA).
  • Level 3 (Architecture-Optimized Pipelines): Uses architecture-specific memory and execution mechanisms (e.g., cp.async, SM-tuned pipelines).
  • Level 4 (Frontier Hardware Features): Exploits the newest capabilities (e.g., WGMMA, TMA, tcgen05, cluster execution).

Performance Target (P)
  • Level 1 (Functional): Produces correct results.
  • Level 2 (Loosely Competitive): Within ~50% of state-of-the-art.
  • Level 3 (Near State-of-the-Art): Within ~5–10% of the best known implementations.
  • Level 4 (State-of-the-Art): Matches or exceeds the best known implementations.

Automation Level (A)
  • Level 1 (Expert Co-Design): Human defines the detailed strategy and guides optimization.
  • Level 2 (Guided Optimization): Human specifies a high-level strategy; the AI system implements and iterates.
  • Level 3 (Minor Corrections): AI system generates the kernel; human only fixes small errors.
  • Level 4 (Fully Autonomous): AI system generates a correct and performant kernel from minimal specification.
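The rubric profiles quoted throughout this post, such as "(K3, R4, H4, P4, A1–2)", can be treated as points in a five-dimensional space. A minimal sketch of that encoding follows; the `Profile` class and `dominates` helper are hypothetical illustrations, not part of any released tool.

```python
# Minimal encoding of Standard Kernel Rubric profiles. Hypothetical
# illustration: the class and helper names are invented for this sketch.
from dataclasses import dataclass

DIMENSIONS = ("K", "R", "H", "P", "A")

@dataclass(frozen=True)
class Profile:
    K: int  # kernel complexity
    R: int  # representation level
    H: int  # hardware specialization
    P: int  # performance target
    A: int  # automation level

    def dominates(self, other: "Profile") -> bool:
        """True if this profile is at least `other` on every dimension."""
        return all(getattr(self, d) >= getattr(other, d) for d in DIMENSIONS)

# The open goal: autonomous (A4) state-of-the-art (P4) kernels for complex
# workloads (K4) at the instruction level (R4) on frontier hardware (H4).
goal = Profile(K=4, R=4, H=4, P=4, A=4)
# A human-guided GEMM result at (K3, R4, H4, P4, A1).
guided_gemm = Profile(K=3, R=4, H=4, P=4, A=1)
print(goal.dominates(guided_gemm))  # True: automation is the remaining gap
```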

Kernel generation in high-level DSLs is largely solved.

High-level DSL kernels (such as Triton) for a wide range of common operations (e.g., memory-bound kernels, reductions, softmax, layer normalization, matrix multiplication, attention) can often be generated fully autonomously. Much of the contemporary work in AI-based kernel generation operates at this level of abstraction. When paired with a strong compiler, these implementations can approach the performance of human-optimized baselines. However, their ultimate performance is constrained by the capabilities of the underlying compiler and runtime. Because building high-performance compilers requires substantial engineering effort, they often lag behind the latest hardware generations, which can limit the ability of DSL-generated kernels to reach peak performance on new chips. As a result, generating high-level DSL kernels with AI systems is often not the central challenge: much of the heavy lifting is handled by the compiler, and the AI system only needs to learn the DSL syntax. The more fundamental problem lies in operating at lower levels of the systems stack, where the key performance decisions are made. (K3, R2, P3, A4; H n/a, since hardware mapping is opaque behind the DSL)

For more complex workloads or novel architectures, Triton kernels can also be generated with human guidance. Because these operators often lack strong existing baselines, such implementations may effectively be state-of-the-art. However, achieving this frequently requires prompting and iterative refinement, and high-level abstractions can limit peak hardware utilization. (K4, R2, P4, A2–3; H n/a)

At the level of high-level DSL abstractions, we also observe strong in-context learning capabilities. cuTile, an NVIDIA DSL similar to Triton but not present in the training data, can be learned effectively from a small number of dynamically selected in-context examples (R2). Using this approach, we were able to autonomously generate cuTile kernels in less than a day after the language was released. However, similar in-context methods are less effective for lower-level DSLs (R3, R4) such as CUTLASS CuTe. At that level, success requires not only learning the syntax of the DSL but also developing a detailed mental model of the underlying hardware, which cannot be conveyed through a few examples alone.

Simple kernels that do not rely on specialized hardware units are effectively solved.

Kernels dominated by elementwise operations, reductions, or moderately structured primitives combining reductions with elementwise computation (K1, K2) can often be generated fully autonomously. Writing in CUDA C++ with inline PTX, we routinely generate kernels in this category that achieve integer-multiple speedups over typical library implementations. In many cases, the human baseline is relatively weak because the large combinatorial space of possible fused kernels makes exhaustive manual tuning impractical. Most human optimization effort and vendor-provided libraries focus on dense linear algebra kernels such as matrix multiplication, which dominate overall compute in ML workloads; as a result, many smaller kernels remain comparatively under-optimized. Examples include fused normalization kernels such as RMSNorm or LayerNorm with fused scaling, bias, and residual updates, especially variants supporting new data formats (e.g., FP8) that require fused dequantization, rescaling, and format conversion within the kernel. (K1–2, R4, H1, P4, A4)
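A quick byte-counting sketch shows where those integer-multiple wins come from: fusing a chain of memory-bound ops eliminates the intermediate tensors that separate kernels must spill to and reload from DRAM. The byte counts below are simplified (no cache reuse modeled), and the three-stage "RMSNorm + scale + residual" pattern is an illustrative example, not a specific library call.

```python
# Back-of-the-envelope DRAM traffic: unfused vs fused
# "RMSNorm -> scale -> residual add" over n FP16 elements.
# Simplified model: every tensor touched costs one full read or write.
BYTES = 2  # FP16 element size

def traffic_unfused(n):
    # Three separate kernels, each materializing a temporary in DRAM:
    rmsnorm  = (1 + 1) * n * BYTES  # read x, write t1
    scale    = (2 + 1) * n * BYTES  # read t1 + gamma, write t2
    residual = (2 + 1) * n * BYTES  # read t2 + res, write out
    return rmsnorm + scale + residual

def traffic_fused(n):
    # One kernel: read x, gamma, res once each; write out once.
    return (3 + 1) * n * BYTES

n = 1 << 24
speedup_bound = traffic_unfused(n) / traffic_fused(n)
print(f"traffic ratio ~{speedup_bound:.1f}x")  # -> 2.0x for this pattern
```

For memory-bound kernels, this traffic ratio is roughly the attainable speedup bound, which is why fusion pays off so reliably in this regime.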

Complex kernels can exceed state-of-the-art performance with human-guided generation.

For more complex kernels, particularly dense linear algebra operations on the latest hardware generations, AI-assisted generation can produce implementations that outperform even the best human-tuned baselines in libraries such as cuBLAS when guided by expert human input. Achieving this typically requires explicit knowledge of new architectural features such as WGMMA or tcgen05, so the process can depend on the expertise of the human steering the system. We developed GEMM kernels (K3) at the CUDA C++ with inline PTX level (R4) that achieve a geometric-mean speedup of 6.0% over cuBLAS (H4, P4) across a variety of shapes (99.1%–116% of cuBLAS performance) on H100 GPUs. This demonstrates that highly specialized kernels tuned to particular shapes, workloads, or fusion patterns can still surpass even the most optimized general-purpose library implementations. These kernels are especially important for performance improvements in end-to-end ML workloads, as dense linear algebra operations such as matrix multiplications dominate overall compute. (K3, R4, H4, P4, A1–2)
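The geometric mean is the standard way to aggregate per-shape speedup ratios, since it is the only mean that is symmetric under inversion of the ratios. A minimal sketch of the aggregation follows; the per-shape ratios below are invented placeholders spanning the quoted 99.1%–116% range, not our measured data.

```python
# Aggregating per-shape speedups over a baseline (e.g., cuBLAS) with a
# geometric mean. The ratios are invented placeholders for illustration;
# only the aggregation method is the point.
from statistics import geometric_mean

# our_throughput / baseline_throughput, one entry per GEMM shape
ratios = [0.991, 1.02, 1.05, 1.08, 1.11, 1.16]

gm = geometric_mean(ratios)
print(f"geometric-mean speedup: {(gm - 1) * 100:.1f}%")
```

Note that a shape slightly below parity (0.991) and a shape well above it (1.16) contribute multiplicatively, so no single outlier dominates the summary the way it would with an arithmetic mean.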

We can autonomously generate a wide range of fused variants once strong core dense linear algebra kernels are established, combining operations such as bias, activation, normalization, and data format conversions, as well as fusing multiple core operations (e.g., chaining or merging multiple matrix multiplications). This capability is particularly valuable because the combinatorial space of possible fusions, shape specializations, and pipeline structures is enormous, making exhaustive manual tuning impractical.
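To see why that space defeats manual tuning, it helps to count it. The option lists below are illustrative stand-ins, not an actual search space, but even these modest choices multiply into thousands of distinct kernels for a single core GEMM.

```python
# Rough count of fusion variants for one GEMM epilogue: a subset of
# epilogue ops, an activation, an output dtype, and a tile/pipeline
# configuration. All option lists are illustrative assumptions.
from itertools import combinations

epilogue_ops = ["bias", "residual", "rowwise_scale", "dequant", "clamp"]
activations = ["none", "relu", "gelu", "silu"]
out_dtypes = ["fp32", "bf16", "fp16", "fp8"]
tile_configs = 12  # assumed number of viable tile/pipeline configurations

# Any subset of the epilogue ops may be fused in: 2^5 = 32 subsets.
subset_count = sum(1 for r in range(len(epilogue_ops) + 1)
                   for _ in combinations(epilogue_ops, r))

variants = subset_count * len(activations) * len(out_dtypes) * tile_configs
print(variants)  # 32 * 4 * 4 * 12 = 6144 variants for ONE core GEMM
```

Multiply this by the set of core operations, shape specializations, and hardware targets, and hand-tuning every point is clearly impractical; an autonomous system can instead generate and benchmark variants on demand.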

We're working toward fully autonomous generation of state-of-the-art complex kernels, enabling peak performance on new hardware for arbitrary workloads.

A key part of this vision is enabling AI systems to learn and exploit new instruction sets without requiring explicit human expertise. Today, extracting peak performance from new hardware generations requires engineers to understand new architectural features and low-level instruction semantics, and to manually incorporate them into kernels or guides. We're working on reducing this dependence on expert knowledge by allowing AI systems to autonomously discover and use new hardware primitives, developing their own internal understanding of hardware behavior. This is particularly important for complex kernels introduced by new model architectures, which traditionally require substantial manual effort to design and optimize for specific hardware. This approach extends beyond GPUs to a broader range of hardware platforms, allowing the system to adapt to new architectures without developers first needing to learn their intricacies. We are also building seamless integration interfaces so that generated kernels can be deployed directly into ML workflows without manual intervention, with the system continuously generating, evaluating, and improving implementations.

Let's work together!

As usual, if this excites you, we're hiring!

If you’d like to run your workloads with us, we’d love to hear from you: [email protected]