LLM-Based Kernel Generation: From Manual Optimization to Automated Code Synthesis
Automated GPU kernel generation using large language models: from benchmarks and evaluation frameworks to agentic systems and compiler infrastructure.
Writing efficient GPU kernels has traditionally required deep expertise in hardware architecture, memory hierarchies, and low-level programming. A single expert-written CUDA or Triton kernel can deliver orders of magnitude speedup over naive implementations, but the expertise barrier limits who can optimize code at this level. Large language models are beginning to change this landscape, offering the promise of automated kernel generation that matches or exceeds hand-tuned implementations.
This post surveys the emerging field of LLM-based kernel generation, tracing the evolution from initial benchmarks revealing the difficulty of the task to sophisticated agentic systems that iterate toward optimal implementations. We organize the literature into a problem-solution hierarchy: benchmarks that quantify the challenge, LLM-based approaches that tackle generation, and the compiler infrastructure that enables it all.
Benchmarks and Evaluation
KernelBench and Beyond
The first question when evaluating LLM-based kernel generation is: how do we measure success? Early attempts used simple correctness metrics, but correctness alone is insufficient—a correct kernel that runs 10× slower than the baseline has failed its primary purpose.
KernelBench (February 2025)
Introduced the fast_p metric, which combines correctness with speedup relative to PyTorch baselines: a task counts toward fast_p only if the generated kernel is functionally correct and more than p× faster than the reference. The benchmark contains 250 diverse tasks spanning common GPU operations (matmul, attention, convolution, etc.). Initial results were sobering: frontier LLMs matched PyTorch performance in fewer than 20% of tasks without test-time refinement, revealing that kernel generation remains a hard problem even for state-of-the-art models.
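A minimal sketch of how a fast_p-style score can be computed, assuming each task result records correctness and measured runtimes (the field names are illustrative, not KernelBench's actual schema):

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    correct: bool          # generated kernel matches the reference output
    baseline_ms: float     # PyTorch reference runtime
    generated_ms: float    # generated kernel runtime

def fast_p(results: list[TaskResult], p: float = 1.0) -> float:
    """Fraction of tasks that are both correct and more than p-times
    faster than the baseline (fast_1 means 'correct and not slower')."""
    if not results:
        return 0.0
    wins = sum(
        r.correct and (r.baseline_ms / r.generated_ms) > p
        for r in results
    )
    return wins / len(results)
```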
TritonBench (Findings of ACL 2025)
Designed specifically for evaluating Triton code generation, with two evaluation tracks: real-world GitHub kernels (TritonBench-G, ~23.9% execution accuracy) and PyTorch-aligned operator tasks (TritonBench-T, ~53.0% execution accuracy). Best models achieve speedups of 1.56× and 1.91× on the two tracks, respectively. Provides a systematic evaluation methodology for Triton generation quality beyond simple correctness.
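Execution accuracy in this setting reduces to running the generated kernel against a trusted PyTorch reference on shared inputs. The sketch below shows one way to do that; it is not TritonBench's actual harness, and the tolerances are assumptions:

```python
import torch

def execution_accurate(generated_fn, reference_fn, example_inputs,
                       rtol=1e-3, atol=1e-3) -> bool:
    """Run both implementations on the same inputs and compare outputs.
    Returns False if the generated kernel crashes or diverges numerically."""
    try:
        out_gen = generated_fn(*[x.clone() for x in example_inputs])
        out_ref = reference_fn(*[x.clone() for x in example_inputs])
        return torch.allclose(out_gen, out_ref, rtol=rtol, atol=atol)
    except Exception:
        return False
```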
robust-kbench (September 2025)
Addresses data contamination concerns by introducing stronger evaluation conditions with anti-cheating measures. Bundled with an agentic pipeline for kernel discovery, verification, and optimization, ensuring that benchmarks remain valid as models improve.
Evaluation Frameworks
METR: Measuring Automated Kernel Engineering (February 2025)
Early empirical assessment framework from METR examining how well AI systems can optimize compute kernels in practice. Focuses on measuring actual engineering capability rather than just benchmark performance.
GEAK Evaluation Suites (AMD, July 2025)
AMD’s GEAK introduces an evaluation suite and an agent for Triton on MI300X/MI250, reporting up to 63% correctness and up to 2.59× speedup on modified TritonBench workloads. Demonstrates the importance of hardware-specific benchmarks, as different GPU architectures have distinct optimization characteristics.
LLM-Based Generation Approaches
Reinforcement Learning Methods
Reinforcement learning has emerged as a powerful paradigm for kernel generation, treating the problem as a sequential decision-making task where the reward signal comes from actual kernel performance.
CUDA-LLM (June 2025)
Introduced a “Feature Search and Reinforcement” approach using correctness and runtime as reward signals. Demonstrates that RL can discover non-obvious optimizations, reporting significant speedups on selected kernels through learned search over the optimization space.
TritonRL (October 2025)
Trains Triton-specialized LLMs using verifiable rewards without test-set contamination. Employs hierarchical reward structure that guides the model toward both correct and performant implementations. Achieves state-of-the-art results on KernelBench among Triton-focused models by ensuring training signals are truly verifiable rather than relying on potentially contaminated test sets.
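The shape of such a reward is roughly a gated ladder: no credit unless the kernel compiles, partial credit for a compiling but incorrect kernel, and the rest scaled by measured speedup. A simplified sketch of that idea; the tiers and weights below are illustrative, not TritonRL's actual reward:

```python
import math

def kernel_reward(compiles: bool, correct: bool, speedup: float) -> float:
    """Hierarchical reward sketch: compilation gates correctness,
    correctness gates performance, and performance is scored on a log
    scale so that 2x and 4x speedups are evenly spaced."""
    if not compiles:
        return 0.0
    if not correct:
        return 0.1                                  # small credit for valid code
    perf = max(0.0, math.log2(max(speedup, 1e-6)))  # 0 at parity with the baseline
    return 0.5 + 0.5 * min(perf, 3.0) / 3.0         # saturates around an 8x speedup
```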
Kevin-32B (May 2025)
A 32B parameter model trained with multi-turn reinforcement learning specifically for CUDA kernel generation. Reports improvements on KernelBench-style tasks by learning from iterative refinement trajectories.
CuAsmRL (CGO 2025)
Takes a different approach by applying RL at the assembly level for NVIDIA SASS instruction rescheduling. Achieves up to 26% additional speedup on specialized kernels (9% average) by optimizing instruction scheduling—demonstrating that RL can work at multiple levels of the compilation stack.
Agentic Systems
Rather than generating kernels in a single forward pass, agentic systems employ iterative refinement loops with feedback from profilers and verification systems.
AutoTriton (July 2025)
Uses iterative feedback from KernelBench and TritonBench to refine generated kernels. The system generates an initial implementation, profiles it, analyzes bottlenecks, and generates improved versions in subsequent iterations.
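The loop these systems share is easy to state: generate a kernel, check it, profile it, and feed the observations back into the next prompt. A minimal sketch of that loop, where generate_kernel, check_correctness, and profile_kernel are hypothetical stand-ins for the model call and the measurement harness:

```python
def refine_kernel(task, generate_kernel, check_correctness, profile_kernel,
                  max_iters: int = 5):
    """Iterative generate-profile-refine loop in the style of agentic
    kernel generators. Keeps the fastest correct candidate seen so far."""
    best = None
    feedback = ""
    for _ in range(max_iters):
        candidate = generate_kernel(task, feedback)       # LLM call
        ok, error = check_correctness(candidate, task)    # compile + compare to reference
        if not ok:
            feedback = f"The previous kernel was incorrect: {error}"
            continue
        stats = profile_kernel(candidate, task)           # e.g. runtime, occupancy, DRAM throughput
        if best is None or stats["runtime_ms"] < best[1]["runtime_ms"]:
            best = (candidate, stats)
        feedback = (
            f"The previous kernel was correct but ran in {stats['runtime_ms']:.3f} ms. "
            f"Profiler summary: {stats}. Focus on the main bottleneck."
        )
    return best
```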
GEAK: Generating Efficient AI-centric GPU Kernels (AMD, July 2025)
AMD’s agentic generation framework for Triton, particularly targeting MI300X/MI250. Employs inference-time compute scaling, allocating more compute budget to harder optimization problems during the generation process.
GPU Kernel Scientist (August 2025)
Implements a multi-turn refine-and-verify loop where each iteration analyzes performance characteristics and generates targeted improvements. The “scientist” framing emphasizes hypothesis-driven optimization rather than random search.
Multi-Agent System for GPU Kernel Performance Optimization (September 2025)
Employs multiple specialized agents targeting different aspects of kernel performance. Evaluated on real kernels from production systems like SGLang, demonstrating that multi-agent collaboration can tackle the multi-faceted nature of kernel optimization (memory access patterns, thread utilization, instruction scheduling, etc.).
EvoEngineer (October 2025)
Applies evolutionary algorithms to CUDA kernel generation, maintaining a population of candidate implementations and evolving them through mutation and crossover operations guided by performance metrics.
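A hedged sketch of such a loop, assuming at least two seed candidates and treating evaluate, mutate_kernel, and crossover_kernels as hypothetical stand-ins for the measurement harness and LLM-driven edits:

```python
import random

def evolve_kernels(seed_kernels, evaluate, mutate_kernel, crossover_kernels,
                   generations: int = 10, population_size: int = 16):
    """Population-based search over kernel candidates: score everyone,
    keep the fittest, and produce children via mutation and crossover."""
    population = list(seed_kernels)
    for _ in range(generations):
        scored = [(evaluate(k), k) for k in population]   # e.g. speedup, -inf if incorrect
        scored.sort(key=lambda s: s[0], reverse=True)
        parents = [k for _, k in scored[: population_size // 2]]
        children = []
        while len(parents) + len(children) < population_size:
            a, b = random.sample(parents, 2)
            child = crossover_kernels(a, b)               # merge two candidate implementations
            children.append(mutate_kernel(child))         # LLM-proposed local edit
        population = parents + children
    return max(population, key=evaluate)
```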
CudaForge (October 2025)
Implements a two-agent framework with distinct coder and judge agents. The coder generates implementations while the judge evaluates them using profiler feedback, creating an adversarial refinement loop.
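Schematically the split of responsibilities looks like the sketch below, with coder_llm, judge_llm, and run_and_profile as hypothetical placeholders rather than CudaForge's actual interfaces:

```python
def coder_judge_loop(task, coder_llm, judge_llm, run_and_profile, rounds: int = 4):
    """Two-agent refinement: the coder proposes a kernel, the judge turns
    profiler feedback into a concrete critique for the next round."""
    critique = "Write a first implementation."
    kernel = None
    for _ in range(rounds):
        kernel = coder_llm(task, critique)                # propose or revise the kernel
        report = run_and_profile(kernel, task)            # correctness + profiler counters
        verdict = judge_llm(task, kernel, report)         # e.g. {"accept": bool, "critique": str}
        if verdict["accept"]:
            break
        critique = verdict["critique"]
    return kernel
```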
Test-Time Scaling
A key 2025 insight is that allocating more compute at inference time—generating multiple candidates, iteratively refining, or using chain-of-thought reasoning—can dramatically improve kernel quality.
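In its simplest form this is best-of-N sampling with a verifier in the loop: draw several candidates, discard those that fail the correctness check, and keep the fastest survivor. A sketch under those assumptions, with sample_kernel, is_correct, and benchmark as hypothetical helpers:

```python
def best_of_n(task, sample_kernel, is_correct, benchmark, n: int = 16):
    """Generate n candidate kernels and return the fastest correct one,
    or None if every sample fails verification."""
    best_kernel, best_ms = None, float("inf")
    for _ in range(n):
        kernel = sample_kernel(task)          # one LLM sample, e.g. at temperature > 0
        if not is_correct(kernel, task):
            continue
        ms = benchmark(kernel, task)          # median runtime over repeated launches
        if ms < best_ms:
            best_kernel, best_ms = kernel, ms
    return best_kernel
```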
ConCuR: Conciseness Makes State-of-the-Art Kernel Generation (October 2025)
Introduces KernelCoder model trained on curated data with concise reasoning traces. Shows that data quality and reasoning structure matter as much as model size, achieving strong KernelBench results through carefully designed training data.
NVIDIA Developer Blog: Automating GPU Kernel Generation with DeepSeek-R1 and Inference-Time Scaling (February 2025)
Demonstrates how test-time compute scaling with DeepSeek-R1’s reasoning capabilities improves kernel generation. By spending more inference-time compute on harder problems, the system achieves better results than single-shot generation.
KernelLLM
Llama-based model fine-tuned specifically on PyTorch↔Triton translation pairs, demonstrating that domain-specific fine-tuning on kernel translation tasks can produce specialized models for this narrow but important task.
Compiler Infrastructure and DSLs
Triton and ML-Triton
The Triton ecosystem has become the primary target for many LLM-based kernel generation efforts due to its higher-level abstractions compared to CUDA.
Triton (MAPL 2019)
The foundational Pythonic DSL that compiles to efficient GPU kernels. Triton’s block-level programming model abstracts away many low-level details while maintaining performance competitive with hand-written CUDA.
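For readers who have not seen Triton, the flavor of the DSL matters here: kernels are written over blocks of elements rather than individual threads, which is part of why it is a friendlier target for LLMs than raw CUDA. A standard vector-add kernel looks roughly like this:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-sized chunk of the vectors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                 # guard the ragged last block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = (triton.cdiv(n, 1024),)              # one program per block of 1024 elements
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```

Because the unit of work is a block of elements rather than a single thread, the author (human or model) never reasons about warp-level details directly, which is exactly the abstraction gap that makes Triton a popular generation target.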
ML-Triton (March 2025)
Multi-level compilation framework extending Triton’s capabilities. Addresses optimization challenges across different levels of the compilation stack, making Triton more suitable as a target for automated generation.
TileLang (April 2025)
A composable tiled programming model that decouples dataflow specification from scheduling primitives (thread binding, layout, tensorization, pipelining). This separation of concerns makes it easier for LLMs to reason about correctness and performance independently.
Superoptimizers and Megakernels
Beyond generating individual kernels, recent work explores compiling entire computation graphs into optimized fused kernels.
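The core idea is easy to see at small scale: two elementwise ops that would normally be two kernel launches, with a round trip through DRAM in between, can be emitted as one kernel. A toy fused add-plus-ReLU in Triton, shown purely to illustrate the principle rather than anything Mirage or MPK actually emits:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def fused_add_relu_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # A single kernel computes relu(x + y); the unfused version would launch
    # one kernel for the add and a second for the relu, materializing the
    # intermediate tensor in DRAM in between.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, tl.maximum(x + y, 0.0), mask=mask)

def fused_add_relu(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = (triton.cdiv(n, 1024),)
    fused_add_relu_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```

The eager PyTorch equivalent, torch.relu(x + y), performs the same math as two separate launches with an intermediate tensor; megakernel work pushes this idea from pairs of ops to entire models.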
Mirage: Multi-Level Superoptimizer for Tensor Programs (OSDI 2025)
Uses μGraphs to represent and optimize tensor programs across kernel, block, and thread levels. Can discover custom fused kernels that outperform hand-written implementations by exploring a vast optimization space through principled search.
Mirage Persistent Kernel (MPK) (June 2025)
Extends Mirage to compile entire LLM inference pipelines into single “megakernels.” By eliminating kernel launch overhead and enabling cross-kernel optimizations, MPK reduces LLM inference latency by 1.2×–6.7×.
Hazy Research: Look Ma, No Bubbles! (May 2025)
Demonstrates end-to-end megakernel compilation for Llama-1B with minimal pipeline bubbles, achieving low-latency single-device inference through aggressive kernel fusion.
Classical Auto-Scheduling
While LLMs represent a new paradigm, classical compiler techniques remain essential infrastructure and provide baselines for comparison.
TVM (OSDI 2018)
End-to-end optimizing compiler with AutoTVM for cost-model-driven kernel tuning. Established the template-based approach where search focuses on schedule optimization rather than code generation.
Ansor (OSDI 2020)
Hierarchical search with learned cost models for generating high-performance tensor programs. Demonstrates that learned cost models can guide search more effectively than hand-crafted heuristics.
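The loop these systems share: a cheap learned model ranks many candidate schedules, only the most promising few are measured on hardware, and the measurements become training data for the next round. A schematic sketch with a placeholder model object (anything exposing fit/predict) and hypothetical featurize/measure helpers, not Ansor's actual featurization:

```python
def cost_model_search(candidates, featurize, measure, model,
                      rounds: int = 8, measure_per_round: int = 4):
    """Learned-cost-model-guided schedule search: rank candidate schedules
    with a cheap model, measure only the top few on hardware, and retrain
    the model on the new measurements."""
    history, measured, best = [], set(), None
    for _ in range(rounds):
        unmeasured = [i for i in range(len(candidates)) if i not in measured]
        if history:                                 # after the first round, rank with the model
            unmeasured.sort(key=lambda i: model.predict(featurize(candidates[i])))
        for i in unmeasured[:measure_per_round]:
            latency = measure(candidates[i])        # real on-device run
            measured.add(i)
            history.append((featurize(candidates[i]), latency))
            if best is None or latency < best[1]:
                best = (candidates[i], latency)
        feats, latencies = zip(*history)
        model.fit(feats, latencies)                 # refresh the cost model
    return best
```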
MetaSchedule (NeurIPS 2022)
Probabilistic programming approach to defining search spaces for tensor programs. Enables expressing complex scheduling constraints in a principled way.
TensorIR (ASPLOS 2023)
IR specifically designed for tensorized optimization, generalizing traditional loop-nest IR to handle tensor intrinsics and block-level operations that are common in modern accelerators.
Relax (ASPLOS 2025)
Cross-level IR unifying computational graphs, tensor programs, and library calls. Handles dynamic shapes and enables optimization across abstraction boundaries.
Hidet (ASPLOS 2023)
Task-mapping DSL that embeds scheduling decisions directly in programs rather than separating compute and schedule. Makes scheduling more explicit and debuggable.
SparseTIR (ASPLOS 2023)
Composable abstractions for sparse tensor compilation. Demonstrates that systematic compiler support for sparsity requires co-designing formats and transformations.
EINNET (OSDI 2023)
Derivation-based approach to tensor program optimization. Uses mathematical properties of tensor operations to guide transformations, ensuring correctness by construction.
ROLLER (OSDI 2022)
Construction-based tensor compilation using the rTile abstraction. Shows that construction-based approaches can be faster and more effective than template-based search for certain workload classes.
The Road Ahead
The field of LLM-based kernel generation has made remarkable progress in 2025, but significant challenges remain:
Reality Check: Benchmarks reveal that even frontier LLMs struggle with kernel generation—matching PyTorch performance in fewer than 20% of KernelBench tasks without iterative refinement. The gap between human expert performance and automated systems remains substantial.
What’s Working: Three approaches show consistent promise:
Reinforcement learning with verifiable rewards (TritonRL) and agentic refinement loops (AutoTriton, CudaForge) that keep training and feedback signals clean and meaningful
Test-time/inference-time scaling (DeepSeek-R1, GEAK) that allocates more compute to harder problems
Domain-specialized data and benchmarks (KernelBench, TritonBench, robust-kbench) that drive systematic progress
Systems Backdrop: Classical compiler infrastructure (TVM, Ansor, MetaSchedule, TensorIR, Relax) remains essential. These systems define search spaces, handle lowering, and enable portability across hardware targets. LLM-based generation doesn’t replace this infrastructure—it augments it by generating candidates and guiding search.
Megakernels and Fusion: An exciting frontier is compiling entire computation graphs into fused kernels (Mirage/MPK, Hazy Research’s megakernels). For latency-bound inference workloads, eliminating kernel launch overhead through fusion can be transformative.
Open Questions: Can LLMs learn to reason about memory hierarchies and hardware constraints? How do we ensure generated kernels are not just fast but also numerically stable? Can we extend these techniques beyond GPUs to other accelerators? The next generation of research will need to address these fundamental challenges.