Predict, Monitor, Profile: A Framework for HPC & MLSys Performance Analysis
A review and guide to the evolution of performance analysis tools, from methodologies to simulators and profilers. Work in progress:)
Your LLM inference is too slow. Your distributed training job isn’t scaling. You’ve thrown more GPUs at the problem, but costs are spiraling and performance gains are minimal. Where is the bottleneck? Is it the code, the compiler, the network, or the hardware itself? Answering these questions requires more than guesswork—it requires a systematic approach to performance analysis.
The landscape of performance analysis tools has undergone remarkable transformation over the decades, evolving from rudimentary profilers to sophisticated, ML-aware frameworks. This post explores performance analysis through a three-phase lens: Pre-Execution (modeling and prediction), During Execution (real-time monitoring), and Post-Execution (profiling and debugging). We trace this evolution from the foundational HPC profilers to today’s specialized ML performance ecosystems.
Figure 1. Three-phase framework for performance analysis in machine learning systems: Pre-Execution (prediction and modeling), During Execution (real-time monitoring), and Post-Execution (profiling and debugging).
Why This Framework Matters: Whether you’re a compiler engineer optimizing kernel performance, an infrastructure engineer scaling distributed training, or an ML scientist deploying LLMs to production, understanding when and how to apply these tools can dramatically reduce debugging time and improve system efficiency. This three-phase approach helps you choose the right tool for your specific performance challenge—from predicting bottlenecks before they occur to diagnosing complex distributed training failures.
Pre-Execution: Modeling and Prediction
Performance analysis before running an ML workload – predicting how fast it will run or how an optimization will affect runtime – is incredibly valuable. This phase encompasses analytical modeling, trace-driven simulation, and benchmark generation.
Analytical Models
LogP Model (1993) · Roofline Model (2009) · Cache-aware Roofline (2013) · GPU Instruction Roofline (2019) · Amped (2023)
In the era of supercomputers, researchers built simple performance models to estimate runtime and scalability without executing full experiments. As distributed-memory parallel architectures proliferated in the 1990s, inter-node communication emerged as a critical bottleneck. The LogP Model (1993) elegantly captures this through four parameters: L (latency), o (overhead), g (gap, representing bandwidth limits), and P (the number of processors). By quantifying whether bottlenecks are latency- or bandwidth-dominated, LogP enables algorithm designers to optimize communication strategies across diverse parallel machines.
Figure 2. The LogP model quantifies communication bottlenecks through its parameters: L (latency), o (overhead), and g (gap, the reciprocal of which corresponds to the available per-processor communication bandwidth).
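To make this concrete, here is a minimal sketch (our own illustration, with made-up parameter values, not from any cited work) of how LogP turns these parameters into a cost estimate for a pipelined stream of small messages:

```python
# A minimal LogP cost sketch: g is the minimum interval between sends,
# o is per-message CPU overhead, L is wire latency. Values illustrative.

def logp_send_time(n_msgs: int, L: float, o: float, g: float) -> float:
    """Time for one processor to stream n small messages to another."""
    if n_msgs == 0:
        return 0.0
    # Consecutive sends are spaced by max(o, g); the last message then
    # spends L in flight and costs o on the receiving side.
    return o + (n_msgs - 1) * max(o, g) + L + o

# Latency- vs. bandwidth-dominated regimes (times in microseconds):
print(logp_send_time(1,   L=5.0, o=1.0, g=0.5))   # one message: 2o + L = 7.0
print(logp_send_time(100, L=5.0, o=1.0, g=0.5))   # pipelining amortizes L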
The Roofline Model (2009) revolutionized performance analysis by providing analytical upper bounds based on operational intensity and memory bandwidth. As Jouppi et al. describe in their TPU analysis:
This simple visual model is not perfect, yet it offers insights on the causes of performance bottlenecks. The assumption behind the model is that applications don’t fit in on-chip caches, so they are either computation-limited or memory bandwidth-limited.
The approach proved so valuable that it spawned numerous extensions: the Cache-aware Roofline Model (2013) accounts for cache hierarchy effects, while the Instruction Roofline Model for GPUs (2019) adapts the framework for GPU architectures with instruction-level analysis.
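The core bound is simple enough to sketch in a few lines. This is a minimal illustration with hypothetical accelerator numbers (312 TFLOP/s peak, 1.6 TB/s memory bandwidth), not measurements:

```python
# Basic Roofline bound: attainable FLOP/s is capped either by peak
# compute or by memory bandwidth times operational intensity.

def roofline_gflops(intensity: float, peak_gflops: float,
                    mem_bw_gbs: float) -> float:
    """Attainable GFLOP/s given operational intensity (FLOPs/byte)."""
    return min(peak_gflops, mem_bw_gbs * intensity)

PEAK, BW = 312_000.0, 1_600.0
ridge = PEAK / BW  # ~195 FLOPs/byte: below this a kernel is memory-bound
for oi in (1.0, 50.0, ridge, 400.0):
    print(f"OI={oi:6.1f} FLOPs/B -> {roofline_gflops(oi, PEAK, BW):9.1f} GFLOP/s")
```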
More recently, Amped (2023) demonstrates how analytical modeling continues to evolve for modern workloads, providing performance predictions specifically for distributed transformer training. These efforts established that complex systems can be understood through abstract performance metrics and that the right abstraction enables reasoning about performance before exhaustive experiments.
Performance Modeling for Deep Learning
Modern performance modeling for deep learning considers the entire distributed training process. Recent surveys analyze simulators across three dimensions: workload representation, the simulation infrastructure itself, and models for Total Cost of Ownership (TCO) that include carbon emissions. These studies compare how different frameworks abstract workloads and detail their underlying assumptions and capabilities.
Workload Representation
MLIR (2021) · Chakra (2023)
A critical aspect of simulation is how DNN workloads are represented, which directly impacts accuracy, performance, and scalability. Workload representations are broadly classified into configuration-based descriptions and operator/layer-level Intermediate Representations (IRs). There has been a clear trend away from high-level configuration-based descriptions toward more detailed operator-level IRs, which provide the fidelity needed to model fine-grained behaviors such as scheduling, compute-communication overlap, and operator-specific optimizations. These IRs can be either framework-specific (e.g., PyTorch's torch.fx) or framework-agnostic (e.g., ONNX, StableHLO, Chakra). While many simulators use custom IRs, torch.fx is a popular choice due to its tight integration with PyTorch's profiling infrastructure.
Figure 3. A summary of intermediate representations with potential for use in DNN simulations.
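As a minimal sketch of what an operator-level IR looks like in practice, symbolically tracing a small module with torch.fx yields a graph whose nodes are individual operators (the toy module here is our own example, not from any cited simulator):

```python
# Capture an operator-level IR with torch.fx. Requires PyTorch.
import torch
import torch.fx as fx
import torch.nn as nn

class TinyMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(128, 256)
        self.fc2 = nn.Linear(256, 10)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

# Symbolic tracing records each operator as a node in a graph IR --
# the level of detail simulators need to model scheduling and overlap.
graph_module = fx.symbolic_trace(TinyMLP())
print(graph_module.graph)  # one node per op: call_module, call_function, ...
```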
Chakra (2023) (by MLCommons) standardizes ML execution trace representation, providing a unified trace schema and tools to collect traces from frameworks like PyTorch, making it easier to share and analyze traces across different tools and enabling collaborative performance research. (MLSys '23 Workshop) Chakra GitHub · Chakra Replay GitHub · YouTube · Chakra Wiki
Figure 4. Chakra execution trace schema and tools for standardizing ML performance analysis across different frameworks.
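A minimal sketch of the collection side, assuming a recent PyTorch that ships torch.profiler.ExecutionTraceObserver (the converter tool names afterward are per the Chakra repository, not verified here):

```python
# Capture a raw PyTorch execution trace of the kind Chakra's
# converters standardize and replay. Assumes a recent PyTorch.
import torch
from torch.profiler import ExecutionTraceObserver

model = torch.nn.Linear(64, 64)
x = torch.randn(32, 64)

observer = ExecutionTraceObserver()
observer.register_callback("pytorch_et.json")  # raw PyTorch ET output
observer.start()
model(x).sum().backward()
observer.stop()
observer.unregister_callback()
# The JSON can then be converted into the Chakra schema using the
# MLCommons tooling (e.g., chakra_trace_link / chakra_converter,
# per the Chakra repo) for sharing and simulator replay.
```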
Distributed DNN Training Simulators
Simulators for distributed DNN training are essential for exploring the vast design space of modern hardware and software systems without the prohibitive cost of physical prototyping. These simulators can be categorized by their core methodology into three main types: analytical, profiling-based, and execution-driven, each offering a different trade-off between speed, fidelity, and scalability.
Analytical simulators (e.g., Paleo, Calculon) use mathematical models for rapid, though less precise, performance estimation.
Profiling-based simulators (e.g., ASTRA-sim, SimAI) balance fidelity and speed by using empirical data from real hardware to calibrate analytical models. This has become an increasingly popular approach.
Execution-driven simulators (e.g., LLMCompass) provide the highest precision by modeling hardware at a micro-architectural level, but this comes at a high cost to simulation speed and scalability.
Figure 5. Taxonomy of distributed DNN training simulators categorized into analytical, profiling-based, and execution-driven simulation methodologies.
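To make the analytical category concrete, here is a minimal sketch (our own illustration, not any cited simulator) of the kind of closed-form step-time estimate such tools produce for data-parallel training:

```python
# Analytical step-time estimate for one data-parallel training step.
# Every parameter below is illustrative, not a measurement.

def step_time_s(step_flops: float, peak_flops: float, mfu: float,
                grad_bytes: float, allreduce_bw: float,
                overlap_fraction: float) -> float:
    """Seconds per step with partial compute/communication overlap."""
    compute = step_flops / (peak_flops * mfu)
    comm = 2 * grad_bytes / allreduce_bw   # ring all-reduce moves ~2x payload
    hidden = min(comm, overlap_fraction * compute)
    return compute + comm - hidden

# Hypothetical 1B-parameter model, ~4K tokens per GPU per step, fp16
# gradients, 100 GB/s effective all-reduce bandwidth, 40% MFU.
print(step_time_s(step_flops=6 * 1e9 * 4096, peak_flops=312e12, mfu=0.4,
                  grad_bytes=2e9, allreduce_bw=100e9, overlap_fraction=0.8))
```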
The landscape of distributed DNN training simulators is diverse, with tools employing different methodologies to balance prediction accuracy, simulation speed, and the level of system detail they model. These simulators are crucial for researchers and engineers to explore design spaces, evaluate new hardware, and optimize training strategies without the need for extensive real-world experimentation. To provide a clearer comparative overview, the following figure summarizes key characteristics of prominent simulators in this domain, drawing from recent surveys:
Figure 6. Summary of distributed DNN training simulators and their main characteristics.
Benchmarking and Workload Generation
Standardized benchmarks are crucial for fair, reproducible performance comparisons across diverse hardware and software stacks.
Mystique (ISCA '23): Generates miniature, representative benchmarks from production traces, enabling performance testing and analysis on representative workloads without exposing proprietary models or requiring access to full-scale production systems
MLPerf (MLSys '20, ISCA '20): The industry standard for architecture-neutral training and inference benchmarks, covering domains from datacenter to edge. (Training, Inference)
ByteMLPerf: A production-oriented benchmark suite that evaluates performance, accuracy, and software/hardware usability, with a focus on compiler coverage
PARAM: Provides parametrized benchmarks for recommendation and AI models, complementing MLPerf with detailed micro-benchmarks for communication and compute analysis
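In the spirit of PARAM-style compute micro-benchmarks, here is a minimal sketch (our own, not taken from any suite) that times a square GEMM and reports achieved TFLOP/s:

```python
# Time a square GEMM and report achieved TFLOP/s. Requires PyTorch;
# falls back to CPU when no GPU is present.
import time
import torch

def gemm_tflops(n: int = 4096, iters: int = 20) -> float:
    device = "cuda" if torch.cuda.is_available() else "cpu"
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)
    for _ in range(3):                 # warm-up to exclude init costs
        _ = a @ b
    if device == "cuda":
        torch.cuda.synchronize()       # GPU work is async; fence before timing
    t0 = time.perf_counter()
    for _ in range(iters):
        _ = a @ b
    if device == "cuda":
        torch.cuda.synchronize()
    dt = (time.perf_counter() - t0) / iters
    return 2 * n**3 / dt / 1e12        # 2*n^3 FLOPs per square GEMM

print(f"{gemm_tflops():.1f} TFLOP/s")
```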
Research in Performance Modeling
2025:
NeuSight (ASPLOS '25): Data-driven approach for forecasting deep learning performance on GPUs
Lumos (MLSys '25): Focuses on large-scale LLM training performance, achieving ~3.3% average error on up to 512 NVIDIA H100 GPUs; it estimates performance for new configurations by intelligently modifying and replaying existing execution traces
TrioSim (ISCA '25): A lightweight simulator for large-scale DNN workloads on multi-GPU systems
SimAI (NSDI '25): A profiling-based simulator validated on a 1024-node A100 GPU cluster, notable for its detailed network simulation at scale
Multiverse (NSDI '25): GPU-based parallel simulator for large-scale LLM training design space exploration, achieving <3% error and up to 73.2× speedup on clusters with 54,000+ GPUs
LLM Performance Prediction (AI4Sys Workshop @ HPDC '25): Investigating whether large language models can predict parallel code performance
shallowsim (GitHub): A specialized inference performance simulator for DeepSeek-V3/R1 models that exemplifies the practical value of pre-execution analysis. Rather than simply providing benchmark numbers, shallowsim generates comprehensive performance rankings that guide critical infrastructure decisions
Figure 7. Performance ranking analysis from shallowsim showing hardware configurations for LLM inference. The visualization demonstrates how pre-execution modeling guides infrastructure decisions through systematic performance comparisons and bottleneck identification, enabling cost-effective hardware selection before deployment. [Image Source]
2024:
TraceSim (MLArchSys '24): Improves accuracy for large-scale training by explicitly modeling concurrent streams and inter-thread dependencies; it can predict performance for scaled-up scenarios with high accuracy
Proteus (TPDS '24): Simulates the performance of distributed DNN training with a focus on communication patterns
Centimani (ATC '24): Performance predictor for rapid AI accelerator selection in DNN training
Vidur (MLSys '24): A large-scale simulation framework specifically designed for LLM inference workloads
LLMServingSim (IISWC '24): HW/SW co-simulation infrastructure for LLM inference serving at scale, included in recent surveys for its architectural strengths and potential for extension to training
LLM Performance Modeling (IISWC '24): Performance modeling and workload analysis of distributed large language model training and inference
2022-2023:
dPRO (MLSys '22): Constructs a global dataflow graph from operation-level traces to estimate training performance and guide optimizations for distributed training
ASTRA-sim (ISPASS '20, '23): An influential, modular framework that can be extended with external compute, network, and memory models. Its 2.0 version added support for hierarchical networks, disaggregated systems, and improved trace-based simulation capabilities
2020:
Daydream (ATC '20): Pioneered trace-based what-if analysis by building fine-grained dependency graphs using NVIDIA's CUPTI. By applying graph transformations that emulate optimizations, Daydream can simulate training runtime with those optimizations applied
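A minimal sketch of this dependency-graph what-if idea (our own toy graph with illustrative durations, not Daydream's actual machinery): iteration time is the critical path over timed ops, and an optimization is simulated by editing a node and re-evaluating:

```python
# Toy dependency graph: op -> (duration_us, upstream dependencies).
deps = {
    "fwd":       (900,  []),
    "bwd_early": (600,  ["fwd"]),
    "bwd_late":  (1200, ["bwd_early"]),
    "allreduce": (1000, ["bwd_early"]),      # overlaps with bwd_late
    "optimizer": (300,  ["bwd_late", "allreduce"]),
}

def iteration_us(graph: dict) -> int:
    """Critical-path length of the dependency graph (memoized DFS)."""
    memo = {}
    def finish(op):
        if op not in memo:
            dur, parents = graph[op]
            memo[op] = dur + max((finish(p) for p in parents), default=0)
        return memo[op]
    return max(finish(op) for op in graph)

baseline = iteration_us(deps)

# What-if: a 4x-faster interconnect shrinks the all-reduce. Because the
# collective was already hidden behind bwd_late, the critical path --
# and thus the predicted iteration time -- does not improve at all.
faster = dict(deps)
faster["allreduce"] = (250, ["bwd_early"])
print(baseline, iteration_us(faster))  # 3000 3000
```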
But models are never perfect. To validate these predictions and uncover unexpected bottlenecks, we need to observe the system as it runs, which brings us to real-time analysis.
During Execution: Real-Time Analysis
The second phase of our framework focuses on real-time monitoring and on-the-fly adaptations to catch performance issues as they occur. This phase bridges prediction and post-mortem analysis by providing immediate visibility into system behavior.
This capability is powered by low-overhead technologies. eBPF allows for non-intrusive tracing of kernel and API interactions, while vendor suites like NVIDIA's DCGM expose critical hardware counters.
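To illustrate the eBPF side, here is a minimal sketch (assumptions loudly stated: bcc installed, root privileges, a dynamically linked CUDA runtime at the path below, which varies by system; statically linked runtimes will not expose the symbol) of the zero-instrumentation idea behind tools like GPUprobe:

```python
# Count cudaLaunchKernel calls per process via a uprobe, without
# touching application code. All paths/symbols here are assumptions.
from bcc import BPF

prog = r"""
BPF_HASH(launches, u32, u64);

int on_launch(struct pt_regs *ctx) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    u64 zero = 0;
    u64 *count = launches.lookup_or_try_init(&pid, &zero);
    if (count) { (*count)++; }
    return 0;
}
"""

b = BPF(text=prog)
b.attach_uprobe(name="/usr/local/cuda/lib64/libcudart.so",  # adjust path
                sym="cudaLaunchKernel", fn_name="on_launch")

print("Counting kernel launches for 30s...")
import time
time.sleep(30)
for pid, count in b["launches"].items():
    print(f"pid {pid.value}: {count.value} kernel launches")
```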
eGPU (2025): Extending eBPF programmability and observability to GPUs
GPUprobe (2024): Uses eBPF to provide zero-instrumentation CUDA monitoring. Blog post
PrecisionProbe (2024): Non-intrusive performance analysis tool specifically designed for DLRM, providing real-time monitoring capabilities
Variorum (LLNL, 2018+): Vendor-neutral library for exposing power and performance monitoring and control interfaces across diverse architectures (AMD, ARM, IBM, Intel, NVIDIA). Docs
Dynolog (Meta, 2022+): A fleet-wide continuous profiling agent for production-scale monitoring
nvitop (2021+): A popular, htop-like command-line dashboard for real-time GPU monitoring
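For the hardware-counter side, a minimal polling sketch using the NVML Python bindings (import name pynvml; assumes an NVIDIA driver is present) shows the kind of metrics nvitop and fleet agents like Dynolog surface:

```python
# Poll GPU 0 once per second for utilization, memory, and power.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
try:
    for _ in range(5):
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # mW -> W
        print(f"sm={util.gpu}% mem_io={util.memory}% "
              f"used={mem.used / 2**30:.1f} GiB power={power_w:.0f} W")
        time.sleep(1.0)
finally:
    pynvml.nvmlShutdown()
```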
While essential for live monitoring, deep diagnosis requires the comprehensive data gathered in post-execution profiling.
Post-Execution: Profiling and Debugging
The third phase of our framework—post-mortem profiling—provides the most detailed insights through offline analysis of execution data. While sampling profilers offer low-overhead glimpses into average performance, they often miss tail latency issues and complex interactions. Tracing methodologies capture comprehensive event logs, enabling a deeper understanding of system behavior and root causes that sampling might miss—a distinction well-articulated in Dan Luu’s Sampling vs. Tracing.
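The distinction is easy to demonstrate in miniature (our own toy sketch, POSIX-only because it uses interval timers): a tracer records every call, while a sampler only sees whatever is on the stack when the timer fires, so it naturally weights functions by elapsed time rather than call count:

```python
# Contrast tracing (every call) with sampling (periodic snapshots).
import collections
import signal
import sys
import time

WATCH = {"fast", "slow"}
trace_counts = collections.Counter()
sample_counts = collections.Counter()

def tracer(frame, event, arg):
    if event == "call" and frame.f_code.co_name in WATCH:
        trace_counts[frame.f_code.co_name] += 1

def sampler(signum, frame):
    name = frame.f_code.co_name
    sample_counts[name if name in WATCH else "other"] += 1

def fast():
    pass

def slow():
    time.sleep(0.005)

signal.signal(signal.SIGALRM, sampler)
signal.setitimer(signal.ITIMER_REAL, 0.001, 0.001)  # ~1 kHz sampler
sys.setprofile(tracer)
for _ in range(200):
    fast()
    slow()
sys.setprofile(None)
signal.setitimer(signal.ITIMER_REAL, 0, 0)

print("traced calls: ", dict(trace_counts))   # fast == slow == 200
print("sampled hits: ", dict(sample_counts))  # dominated by slow()
```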
The Early Days
prof (1979): Unix profiler for CPU and memory usage
gprof (1982): Early portable profiler producing human-readable reports and call graphs
ProfileMe (MICRO '97): Instruction-level sampling to accurately attribute performance events (cache misses, latencies) to specific instructions on out-of-order processors, enabling precise bottleneck identification through paired sampling analysis
HPC Profilers
Historic Foundation Tools:
HPCToolkit (2000s-present): Performance analysis suite using sampling and hardware counters without requiring source modifications, with recent extensions for GPU-accelerated applications
TAU (2006-present, IJHPCA '25): Portable profiling and tracing toolkit emphasizing flexibility across programming models, now extended to support modern ML frameworks
Other influential tools from this era include Score-P (a measurement infrastructure), Vampir (trace visualization), and Scalasca (large-scale parallel performance analysis)
These tools established core principles: use sampling to reduce overhead, incorporate hardware counters, and support hierarchical analysis.
Modern CPU Profilers
System Profilers:
Linux: perf (2009+), Perfetto, DTrace, and bpftrace (a high-level tracing language built on eBPF, inspired by DTrace)
Ray: Comprehensive profiling suite with native Dashboard integration for CPU (py-spy), memory (memray), GPU (PyTorch Profiler, Nsight Systems), and Ray Task/Actor timeline profiling
Perfetto: System-wide trace visualization merging CPU, GPU, and framework events
Flame Graphs: Visualization technique for profiled software, allowing quick identification of hot code paths
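As a minimal sketch of how these pieces fit together in an ML stack, the PyTorch Profiler can export a Chrome/Perfetto-compatible trace for timeline inspection (our own toy model; CUDA activity is recorded only when a GPU is available):

```python
# Profile a few forward passes and export a trace for Perfetto.
import torch
from torch.profiler import ProfilerActivity, profile

model = torch.nn.Sequential(
    torch.nn.Linear(512, 512), torch.nn.ReLU(), torch.nn.Linear(512, 10)
)
x = torch.randn(64, 512)

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities, record_shapes=True) as prof:
    for _ in range(10):
        model(x)

# Table view for quick triage; the JSON opens in Perfetto or
# chrome://tracing for timeline analysis.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
prof.export_chrome_trace("trace.json")
```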
Research in Post Mortem Analysis
2025:
Tintin (OSDI '25)
Neutrino (OSDI '25): Fine-grained GPU kernel profiling via programmable probing (Website, Blog post)
KPerfIR (OSDI '25): Compiler-centric ecosystem for GPU kernel performance tooling on modern AI workloads
Dilu (ASPLOS '25): Enabling GPU resourcing-on-demand for serverless DL serving via introspective elasticity
hpcanalysis (ICS '25): A new approach to analyzing the performance of applications measured with HPCToolkit, with a focus on measurement data from large-scale executions. GitLab
STAD (TPDS '25): A performance analysis tool for parallel programs that considers both spatial and temporal patterns within trace data
SKIP (System-Aware Kernel Inference Profiler) (ISPASS '25): Analyzes LLM inference with fine-grained latency attribution, introducing TKLQT (Total Kernel Launch and Queuing Time) and critical batch size identification
Shared-Memory Atomic Bottlenecks (GPGPU '25): A study on modeling utilization to identify bottlenecks in shared-memory atomic operations
GPU Utilization Measurement (2025): Measuring GPU utilization one level deeper for more accurate performance analysis
2024:
DrPy (CGO '24): A tool for pinpointing inefficient memory usage in multi-layer Python applications
EasyView (CGO '24): A tool that integrates performance profiles into development environments to enhance debugging and optimization. GitHub · Zenodo
Snoopie (ICS '24): A multi-GPU communication profiler and visualizer for analyzing distributed GPU workloads. GitHub
bperf and BCOZ (OSDI '24): A method to identify both on- and off-CPU bottlenecks using blocked samples
Retrospection (HiPC '24): A review of performance analysis tools for large-scale HPC programs
PRoof (ICPP '24): A comprehensive hierarchical profiling framework for deep neural networks that integrates Roofline analysis to provide insights into performance bottlenecks across different system levels
Orion (EuroSys '24): Interference-aware, fine-grained GPU sharing for ML applications
Scalana (TPDS '24): Leveraging graph analysis to pinpoint root causes of scalability issues for parallel applications
Comparative Profiling (MLSys Workshop '24): Insights into latent diffusion model training through comparative profiling
LLM Serving Simulation (2024): Simulation-based approach for identifying optimal parallelism in high-performance LLM serving
Nanoflow (2024): Towards optimal large language model serving throughput
Sniper (TACO '24): Exploiting spatial and temporal sampling for large-scale performance analysis
DeepContext (2024): Context-aware, cross-platform performance analysis for deep learning workloads
Low-overhead GPU Trace Collection (TOPC '24): Low-overhead trace collection and profiling on GPU compute kernels
2023:
Propeller (ASPLOS '23): A profile-guided, relinking optimizer for warehouse-scale applications. GitHub
Vclinic (ASPLOS '23): A portable and efficient framework for fine-grained value profilers
Drgpum (ASPLOS '23): Memory optimization guidance for GPU-accelerated applications
Scalene (OSDI '23): CPU, GPU, and memory profiling. GitHub
GPU Execution Time Prediction (MICRO '23): Fast and accurate prediction of GPU execution time for DNN workloads beyond traditional simulators
TrivialSpy (SC '23): Identifying software triviality through fine-grained and dataflow-based value profiling
Hotline Profiler (MLSys '23): Automatic annotation and multi-scale timeline for visualizing time-use in DNN training
Thicket (HPDC '23): Seeing the performance experiment forest for the individual run trees. GitHub · Tutorial · Publications
Drgpu (ICPE '23): Top-down profiler providing systematic guidance from performance issues to root causes
ML Training Monitoring (MLSys Workshop '23): Profiling and monitoring deep learning training tasks
Torchbench (2023): Benchmarking tool for PyTorch that ensures high API surface coverage
2022:
ValueExpert (ASPLOS '22): Analyzes value patterns and redundancy in GPU applications to uncover optimization opportunities
PerFlow (PPoPP '22): A domain-specific framework for automatic performance analysis of parallel applications
GPU Variability Study (SC '22): Characterizing variability in large-scale, accelerator-rich systems
Low Overhead GPU Profiling (ICS '22): Low overhead and context sensitive profiling of GPU-accelerated applications
Hatchet Call Path Querying (e-Science '22): Enabling call path querying in Hatchet to identify performance bottlenecks in scientific applications
Amazon SageMaker Profiling (KDD '22): Profiling deep learning workloads at scale using Amazon SageMaker
2021:
Tiered Memory Profiling (IPDPS '21): Techniques for profiling performance on systems with tiered memory architectures
GPA (CGO '21): A GPU performance advisor based on instruction sampling
Efficient Python-Native Interactions (ESEC/FSE '21): Exploring efficient interactions between Python and native libraries to improve performance
2020:
Precision of PEBS (APSys '20): An investigation into the precision of Precise Event-Based Sampling
XSP (IPDPS '20): Across-stack profiling and analysis of machine learning models on GPUs
DrCCTProf (SC '20): A fine-grained call path profiler for ARM-based clusters (Performance Track Best Paper, All Tracks Best Paper nominee). GitHub
GVProf (SC '20): A value profiler designed for GPU-based clusters to enhance performance analysis
Scalana (SC '20): Automating scaling loss detection with graph analysis for parallel applications
ZeroSpy (SC '20): Exploring software inefficiency with redundant zeros to identify optimization opportunities
GPU Top-Down Analysis (ICS '20): Tools for top-down performance analysis of GPU-accelerated applications
Skyline (UIST '20): Interactive in-editor computational performance profiling for deep neural network training
GPU Computing Tools (2020): Debugging and performance analysis tools for heterogeneous HPC applications
2019:
Diogenes (SC '19): An honest CPU/GPU performance measurement tool
2003:
Memory Profiling (SC '03): Memory profiling using hardware counters
2002:
Dynamic Statistical Profiling (SIGMETRICS '02): Dynamic statistical profiling of communication activity in distributed applications
2000:
Store Instruction Locality (ISCA '00): Analysis of value locality in store instructions
1997:
Continuous Profiling (TOCS '97): Foundational work on continuous profiling to identify where all the cycles have gone
1996:
Exceeding Dataflow Limits (MICRO '96): Techniques for exceeding dataflow limits via value prediction
Value Locality (ASPLOS '96): Foundational work on value locality and load value prediction
The detailed traces from post-mortem analysis don't just solve today's problems. They become the input for building more accurate predictive models for the next generation of workloads, bringing our framework full circle.
Approach Selection Guide
Choosing the right performance analysis approach depends on your specific scenario:
Scenario | Pre-Execution | During Execution | Post-Execution
New model architecture | Roofline, ASTRA-sim | - | PyTorch Profiler + HTA
Distributed training scaling | SimAI, Multiverse | nvitop, Dynolog | Nsight Systems
Production LLM serving | Centimani (hardware selection) | nvidia-smi, monitoring | SKIP, Nsight Compute
Memory optimization | Analytical models | - | Drgpum, ValueExpert
Kernel-level debugging | - | perf top | KPerfIR, Drgpu
Cross-platform analysis | Benchmarking (MLPerf, PARAM) | - | HPCToolkit, DeepContext
Notable PhD Dissertations
This section highlights doctoral dissertations that have made significant contributions to performance analysis:
Qidong Zhao. "Novel Profiling Techniques for Emerging Architectures and Applications." PhD Dissertation, North Carolina State University, 2025.
Yueming Hao. "Let GPUs Work Harder: Performance Analysis and Optimization for GPGPU Applications." PhD Dissertation, North Carolina State University, 2024.
Zhongyi Lin. "Performance Modeling and Optimization for Machine Learning Workloads." PhD Dissertation, University of California, Davis, 2023.
Hongyu Zhu. "Benchmarking, Profiling and White-Box Performance Modeling for Deep Neural Network Training Workloads." PhD Dissertation, University of Toronto, 2022.
Keren Zhou. "Performance Measurement, Analysis, and Optimization of GPU-accelerated Applications." PhD Dissertation, Rice University, 2022.
Pengfei Su. "Understanding Performance Inefficiencies in Native and Managed Languages." PhD Dissertation, The College of William and Mary, 2021.
Hui Zhang. "Data-centric Performance Measurement and Mapping for Highly Parallel Programming Models." PhD Dissertation, University of Maryland, College Park, 2018.
Xu Liu. "Performance Analysis of Program Executions on Modern Parallel Architectures." PhD Dissertation, Rice University, 2014.
Awards and Recognition
SIGHPC Outstanding Doctoral Dissertation Award recognizes doctoral dissertations with HPC as a central research theme, including computational capabilities delivering much higher performance than desktop systems. The award is presented annually at the SC conference.
Further Reading
For readers interested in diving deeper into specific aspects of performance analysis, we recommend the following comprehensive resources:
NVLink-C2C Interconnect (ISSCC '23): Details on the NVLink-C2C chip-to-chip interconnect, a key technology for modern multi-GPU systems
Trends in Data Locality Abstractions (TPDS '17): A comprehensive overview of data locality abstractions for HPC systems
GPU-Centric Communication Survey (2024): Comprehensive analysis of the landscape of GPU-centric communication patterns and optimization strategies
Survey on Distributed Training Optimization: Provides an extensive survey on efficient training of large language models on distributed infrastructures, covering optimization techniques, communication strategies, and system design considerations
Centauri (ASPLOS '24): A novel approach for enabling efficient scheduling and communication-computation overlap in large model training through communication partitioning techniques
Data Locality in Post-Exascale Era (CSE '25): Analysis of persistent challenges in data locality for emerging computing paradigms
Understanding Data Movement in Tightly Coupled Heterogeneous Systems (2024): A case study of data movement on the Grace Hopper superchip
The journey from a slow, inefficient ML system to a highly optimized one is no longer an art form defined by guesswork. By adopting a structured, three-phase approach—Predicting bottlenecks before they occur, Monitoring systems in real-time, and Profiling for deep insights—engineers and scientists can systematically dismantle performance mysteries. Whether you are designing the next generation of hardware or deploying a model to millions of users, this framework provides a map for navigating the complex landscape of modern HPC and ML performance. The right tool exists for your challenge; the key is knowing when and how to use it.