LLM-Based Kernel Generation: From Manual Optimization to Automated Code Synthesis

Automated GPU kernel generation using large language models: from benchmarks and evaluation frameworks to agentic systems and compiler infrastructure.

Writing efficient GPU kernels has traditionally required deep expertise in hardware architecture, memory hierarchies, and low-level programming. A single expert-written CUDA or Triton kernel can deliver orders-of-magnitude speedups over a naive implementation, but that expertise barrier limits who can optimize code at this level. Large language models are beginning to change this landscape, offering the promise of automated kernel generation that matches or exceeds hand-tuned implementations.

This post surveys the emerging field of LLM-based kernel generation, tracing its evolution from the initial benchmarks that revealed how hard the task is to sophisticated agentic systems that iterate toward highly optimized implementations. We organize the literature into a problem-solution hierarchy: benchmarks that quantify the challenge, LLM-based approaches that tackle generation, and the compiler infrastructure that enables it all.

Benchmarks and Evaluation

KernelBench and Beyond

The first question when evaluating LLM-based kernel generation is: how do we measure success? Early attempts used simple correctness metrics, but correctness alone is insufficient—a correct kernel that runs 10× slower than the baseline has failed its primary purpose.
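
Most recent benchmarks therefore pair a correctness gate with wall-clock speedup over a reference implementation. The harness below is a minimal sketch of that pattern in PyTorch; the `evaluate_kernel` name, tolerances, and trial counts are illustrative choices, not KernelBench's actual code.

```python
import torch

def evaluate_kernel(candidate, reference, make_inputs, n_trials=100, atol=1e-3, rtol=1e-3):
    """Score a generated kernel: correctness first, then speedup vs. the reference."""
    inputs = make_inputs()

    # Correctness gate: a fast-but-wrong kernel scores zero.
    if not torch.allclose(candidate(*inputs), reference(*inputs), atol=atol, rtol=rtol):
        return {"correct": False, "speedup": 0.0}

    def time_fn(fn):
        # Warm up, then time with CUDA events so host-side overhead is excluded.
        for _ in range(10):
            fn(*inputs)
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        torch.cuda.synchronize()
        start.record()
        for _ in range(n_trials):
            fn(*inputs)
        end.record()
        torch.cuda.synchronize()
        return start.elapsed_time(end) / n_trials  # milliseconds per call

    speedup = time_fn(reference) / time_fn(candidate)
    return {"correct": True, "speedup": speedup}
```

Gating on `torch.allclose` before timing is the key design choice: a kernel that is fast but numerically wrong earns no credit at all rather than a partial score.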

Evaluation Frameworks

LLM-Based Generation Approaches

Reinforcement Learning Methods

Reinforcement learning has emerged as a powerful paradigm for kernel generation, treating the problem as a sequential decision-making task where the reward signal comes from actual kernel performance.
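
Concretely, most of these methods shape the reward from compilation success, output correctness, and measured latency. The function below is a hedged sketch of one plausible reward, with `time_fn` standing in for a timing harness like the one above; published systems differ in how they weight, clip, and normalize these terms.

```python
import torch

def kernel_reward(kernel, reference, inputs, baseline_ms, time_fn, atol=1e-3, rtol=1e-3):
    """Reward for one rollout: hard zero for broken or wrong kernels, speedup otherwise."""
    try:
        out = kernel(*inputs)                       # runtime crashes are failures, not partial credit
    except Exception:
        return 0.0
    if not torch.allclose(out, reference(*inputs), atol=atol, rtol=rtol):
        return 0.0                                  # correctness is a gate, not a weighted term
    candidate_ms = time_fn(kernel, inputs)          # measured latency of the candidate
    return baseline_ms / candidate_ms               # exceeds 1.0 only when it beats the baseline
```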

Agentic Systems

Rather than generating kernels in a single forward pass, agentic systems employ iterative refinement loops with feedback from profilers and verification systems.
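
A typical loop alternates generation, evaluation, and feedback. The sketch below uses placeholder interfaces (`llm.generate`, `task.describe`, `evaluate`, `profile`, `result.report`) rather than any particular system's API; the point is that compiler errors, failing tests, and profiler output are folded back into the next prompt.

```python
def refine_kernel(llm, task, evaluate, profile, max_iters=8):
    """Generate-evaluate-repair loop: the model sees measured feedback each round."""
    best_src, best_speedup = None, 0.0
    feedback = ""
    for _ in range(max_iters):
        src = llm.generate(prompt=task.describe() + feedback)   # propose a kernel
        result = evaluate(src, task)                            # compile, check, time
        if result.correct and result.speedup > best_speedup:
            best_src, best_speedup = src, result.speedup
        # Fold the outcome and profiler counters into the next prompt.
        feedback = ("\n\nPrevious attempt:\n" + src +
                    "\nOutcome:\n" + result.report() +
                    "\nProfile:\n" + profile(src, task))
    return best_src, best_speedup
```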

Test-Time Scaling

A key 2025 insight is that allocating more compute at inference time—generating multiple candidates, iteratively refining, or using chain-of-thought reasoning—can dramatically improve kernel quality.
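
The simplest form is best-of-N sampling: draw many candidates independently and keep the fastest one that passes verification. Unlike the agentic loop above, there is no feedback between attempts, only more attempts. The helper below reuses the same placeholder `llm` and `evaluate` interfaces and is a sketch, not any specific system's implementation.

```python
def best_of_n(llm, task, evaluate, n_samples=32, temperature=0.8):
    """Inference-time scaling in its simplest form: sample widely, keep the best verified kernel."""
    best_src, best_speedup = None, 0.0
    for _ in range(n_samples):
        src = llm.generate(prompt=task.describe(), temperature=temperature)
        result = evaluate(src, task)               # correctness gate plus timing, as above
        if result.correct and result.speedup > best_speedup:
            best_src, best_speedup = src, result.speedup
    return best_src, best_speedup
```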

Compiler Infrastructure and DSLs

Triton and ML-Triton

The Triton ecosystem has become the primary target for many LLM-based kernel generation efforts due to its higher-level abstractions compared to CUDA.
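
The abstraction gap is easiest to see in a small example. The vector-add kernel below is essentially Triton's tutorial kernel: the programmer writes block-level loads, stores, and arithmetic in Python, and the compiler handles the thread mapping, memory coalescing, and scheduling that a CUDA version would spell out by hand.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one block of elements; Triton decides how
    # that block maps onto threads, vector widths, and memory transactions.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                    # guard the ragged final block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)                 # one program per block of 1024 elements
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```

This smaller, more regular programming surface is a large part of why Triton has become an attractive target for code-generating models.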

Superoptimizers and Megakernels

Beyond generating individual kernels, recent work explores compiling entire computation graphs into optimized fused kernels.
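
The payoff of fusion is visible even at the smallest scale: instead of launching one kernel per operator and writing intermediates to global memory, a fused kernel reads its inputs once and writes only the final result. The Triton kernel below fuses a bias add and a ReLU as a toy illustration; it assumes a contiguous row-major input, and it is not the graph-level compilation that Mirage/MPK or megakernel systems perform.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def fused_bias_relu_kernel(x_ptr, b_ptr, out_ptr, n_cols, n_elements, BLOCK: tl.constexpr):
    # One launch for what would otherwise be two (add, then relu); the
    # intermediate x + b never touches global memory.
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n_elements
    x = tl.load(x_ptr + offs, mask=mask)
    b = tl.load(b_ptr + (offs % n_cols), mask=mask)   # broadcast the bias across rows
    tl.store(out_ptr + offs, tl.maximum(x + b, 0.0), mask=mask)

def fused_bias_relu(x: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
    # Assumes x is contiguous and bias has length x.shape[-1].
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)
    fused_bias_relu_kernel[grid](x, bias, out, x.shape[-1], n, BLOCK=1024)
    return out  # matches torch.relu(x + bias) with a single kernel launch
```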

Classical Auto-Scheduling

While LLMs represent a new paradigm, classical compiler techniques remain essential infrastructure and provide baselines for comparison.
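
Auto-schedulers such as Ansor and MetaSchedule search a space of loop tilings, unroll factors, and vectorization choices, measuring candidates on real hardware and steering the search with cost models. The toy search below illustrates the idea with random sampling over a hand-written space; `build` and `measure` are placeholder callables, and real systems derive the schedule space from the program automatically.

```python
import itertools
import random

def auto_schedule(build, measure, trials=64):
    """Toy empirical schedule search: enumerate (tile, unroll, vectorize)
    configurations, compile each with build(), time each with measure(),
    and keep the fastest. Real auto-schedulers replace the random sampling
    with learned cost models and evolutionary search."""
    space = list(itertools.product(
        [16, 32, 64, 128],      # tile size
        [1, 2, 4, 8],           # unroll factor
        [True, False],          # vectorize the innermost loop
    ))
    best_cfg, best_ms = None, float("inf")
    for cfg in random.sample(space, min(trials, len(space))):
        kernel = build(*cfg)            # lower the same computation with this schedule
        ms = measure(kernel)            # run on hardware and record latency
        if ms < best_ms:
            best_cfg, best_ms = cfg, ms
    return best_cfg, best_ms
```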

The Road Ahead

The field of LLM-based kernel generation has made remarkable progress in 2025, but significant challenges remain:

Reality Check: Benchmarks reveal that even frontier LLMs struggle with kernel generation—matching PyTorch performance in fewer than 20% of KernelBench tasks without iterative refinement. The gap between human expert performance and automated systems remains substantial.

What’s Working: Three approaches show consistent promise:

  1. Agentic loops with verifiable rewards (TritonRL) that ensure training signals are clean and meaningful
  2. Test-time/inference-time scaling (DeepSeek-R1, GEAK) that allocates more compute to harder problems
  3. Domain-specialized data and benchmarks (KernelBench, TritonBench, robust-kbench) that drive systematic progress

Systems Backdrop: Classical compiler infrastructure (TVM, Ansor, MetaSchedule, TensorIR, Relax) remains essential. These systems define search spaces, handle lowering, and enable portability across hardware targets. LLM-based generation doesn’t replace this infrastructure—it augments it by generating candidates and guiding search.

Megakernels and Fusion: An exciting frontier is compiling entire computation graphs into fused kernels (Mirage/MPK, Hazy Research’s megakernels). For latency-bound inference workloads, eliminating kernel launch overhead through fusion can be transformative.

Open Questions: Can LLMs learn to reason about memory hierarchies and hardware constraints? How do we ensure generated kernels are not just fast but also numerically stable? Can we extend these techniques beyond GPUs to other accelerators? The next generation of research will need to address these fundamental challenges.