Predict, Monitor, Profile: A Framework for HPC & MLSys Performance Analysis

A review and guide to the evolution of performance analysis tools, from methodologies to simulators and profilers. Work in progress:)

Your LLM inference is too slow. Your distributed training job isn’t scaling. You’ve thrown more GPUs at the problem, but costs are spiraling and performance gains are minimal. Where is the bottleneck? Is it the code, the compiler, the network, or the hardware itself? Answering these questions requires more than guesswork—it requires a systematic approach to performance analysis.

The landscape of performance analysis tools has undergone remarkable transformation over the decades, evolving from rudimentary profilers to sophisticated, ML-aware frameworks. This post explores performance analysis through a three-phase lens: Pre-Execution (modeling and prediction), During Execution (real-time monitoring), and Post-Execution (profiling and debugging). We trace this evolution from the foundational HPC profilers to today’s specialized ML performance ecosystems.

Figure 1. Three-phase framework for performance analysis in machine learning systems: Pre-Execution (prediction and modeling), During Execution (real-time monitoring), and Post-Execution (profiling and debugging).

Why This Framework Matters: Whether you’re a compiler engineer optimizing kernel performance, an infrastructure engineer scaling distributed training, or an ML scientist deploying LLMs to production, understanding when and how to apply these tools can dramatically reduce debugging time and improve system efficiency. This three-phase approach helps you choose the right tool for your specific performance challenge—from predicting bottlenecks before they occur to diagnosing complex distributed training failures.

Pre-Execution: Modeling and Prediction

Performance analysis before running an ML workload – predicting how fast it will run or how an optimization will affect runtime – is incredibly valuable. This phase encompasses analytical modeling, trace-driven simulation, and benchmark generation.

Analytical Models

Timeline: LogP Model (1993) → Roofline Model (2009) → Cache-aware Roofline (2013) → GPU Instruction Roofline (2019) → Amped (2023)

In the early supercomputing era, researchers built simple performance models to estimate runtime and scalability without running full experiments. As distributed-memory parallel architectures proliferated in the 1990s, inter-node communication emerged as a critical bottleneck. The LogP Model (1993) elegantly captures this through four parameters: L (latency), o (overhead), g (gap, the inverse of per-processor bandwidth), and P (the number of processors). By quantifying whether bottlenecks are latency- or bandwidth-dominated, LogP lets algorithm designers tailor communication strategies to diverse parallel machines.

Figure 2. The LogP model quantifies communication bottlenecks through its parameters: L (latency), o (overhead), and g (gap, the reciprocal of which corresponds to the available per-processor communication bandwidth).
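
To make the model concrete, here is a minimal sketch of how LogP-style cost estimates work. The parameter values are illustrative assumptions, not measurements from any real machine:

```python
# Minimal LogP cost estimates; parameter values are illustrative, not measured.
L = 6.0   # L: network latency per message (microseconds)
o = 2.0   # o: CPU overhead to send or receive one message (microseconds)
g = 4.0   # g: minimum interval between consecutive message injections (microseconds)

def one_message_us():
    """Time for a single small message: send overhead + latency + receive overhead."""
    return o + L + o

def k_messages_us(k):
    """Time to deliver k back-to-back messages from one sender to one receiver.
    After the first injection, the sender is limited by max(o, g) per message."""
    return o + (k - 1) * max(o, g) + L + o

print(f"1 message : {one_message_us():.1f} us")
print(f"8 messages: {k_messages_us(8):.1f} us  (the bandwidth term, g, dominates)")
```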

The Roofline Model (2009) revolutionized performance analysis by providing analytical upper bounds based on operational intensity and memory bandwidth. As Jouppi et al. describe in their TPU analysis:

This simple visual model is not perfect, yet it offers insights on the causes of performance bottlenecks. The assumption behind the model is that applications don’t fit in on-chip caches, so they are either computation-limited or memory bandwidth-limited.
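
In code, the bound the model describes is a one-liner: attainable performance is the minimum of peak compute and peak memory bandwidth times operational intensity. The machine numbers below are illustrative placeholders, not the specs of any particular accelerator:

```python
def roofline_gflops(intensity_flops_per_byte, peak_gflops, peak_bw_gbs):
    """Attainable GFLOP/s = min(peak compute, memory bandwidth x operational intensity)."""
    return min(peak_gflops, peak_bw_gbs * intensity_flops_per_byte)

# Illustrative machine: 100 TFLOP/s peak compute, 2 TB/s memory bandwidth.
peak_gflops, peak_bw_gbs = 100_000.0, 2_000.0
ridge_point = peak_gflops / peak_bw_gbs  # intensity where the two limits meet (FLOP/byte)

for oi in (1.0, 10.0, ridge_point, 100.0):
    bound = roofline_gflops(oi, peak_gflops, peak_bw_gbs)
    regime = "memory-bound" if oi < ridge_point else "compute-bound"
    print(f"intensity {oi:6.1f} FLOP/B -> {bound:9.1f} GFLOP/s ({regime})")
```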

The approach proved so valuable that it spawned numerous extensions: the Cache-aware Roofline Model (2013) accounts for cache hierarchy effects, while the Instruction Roofline Model for GPUs (2019) adapts the framework for GPU architectures with instruction-level analysis.

More recently, Amped (2023) demonstrates how analytical modeling continues to evolve for modern workloads, providing performance predictions specifically for distributed transformer training. These efforts established that complex systems can be understood through abstract performance metrics and that the right abstraction enables reasoning about performance before exhaustive experiments.

Performance Modeling for Deep Learning

Modern performance modeling for deep learning considers the entire distributed training process. Recent surveys analyze simulators across three dimensions: workload representation, the simulation infrastructure itself, and models for Total Cost of Ownership (TCO) that include carbon emissions. These studies compare how different frameworks abstract workloads and detail their underlying assumptions and capabilities.

Workload Representation

Timeline: MLIR (2021) → Chakra (2023)

A critical aspect of simulation is how DNN workloads are represented, since this directly impacts accuracy, performance, and scalability. Workload representations fall broadly into configuration-based descriptions and operator/layer-level Intermediate Representations (IRs). There has been a clear trend away from high-level configuration-based descriptions toward more detailed operator-level IRs, because they provide the fidelity needed to model fine-grained behaviors such as scheduling, communication overlap, and operator-specific optimizations. These IRs can be either framework-specific (e.g., PyTorch’s torch.fx) or framework-agnostic (e.g., ONNX, StableHLO, Chakra). While many simulators use custom IRs, torch.fx is a popular choice due to its tight integration with PyTorch’s profiling infrastructure.

Figure 3. A summary of intermediate representations with potential for use in DNN simulations.
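
As a small illustration of an operator-level IR, torch.fx can symbolically trace a PyTorch module into a graph of framework-level operations. The toy two-layer model below is only for demonstration; a real workload representation would also carry shapes, timings, and communication operators:

```python
import torch
from torch import nn
from torch.fx import symbolic_trace

class TinyMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(128, 256)
        self.fc2 = nn.Linear(256, 10)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

# symbolic_trace records the operator-level graph without running real data.
graph_module = symbolic_trace(TinyMLP())
for node in graph_module.graph.nodes:
    print(node.op, node.target)   # e.g. call_module fc1, call_function relu, ...
```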

Chakra (2023), an MLCommons project, standardizes the representation of ML execution traces. It provides a unified trace schema and tools to collect traces from frameworks such as PyTorch, making it easier to share and analyze traces across different tools and enabling collaborative performance research. Resources: MLSys ‘23 Workshop, Chakra GitHub, Chakra Replay GitHub, YouTube, Chakra Wiki.

Figure 4. Chakra execution trace schema and tools for standardizing ML performance analysis across different frameworks.
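
One way to produce traces that Chakra’s converters consume is PyTorch’s ExecutionTraceObserver in torch.profiler. The sketch below is an outline under that assumption; the exact API and trace format can differ across PyTorch versions:

```python
import torch
from torch import nn
from torch.profiler import ExecutionTraceObserver

model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))
inputs = torch.randn(32, 128)

# Register an observer that writes a PyTorch execution trace (ET) to JSON.
et = ExecutionTraceObserver()
et.register_callback("pytorch_et.json")
et.start()
model(inputs).sum().backward()   # capture one forward + backward step
et.stop()
et.unregister_callback()
# The resulting JSON can then be converted to the Chakra schema with Chakra's tooling.
```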

Distributed DNN Training Simulators

Simulators for distributed DNN training are essential for exploring the vast design space of modern hardware and software systems without the prohibitive cost of physical prototyping. These simulators can be categorized by their core methodology into three main types: analytical, profiling-based, and execution-driven, each offering a different trade-off between speed, fidelity, and scalability.

Figure 5. Taxonomy of distributed DNN training simulators, categorized into analytical, profiling-based, and execution-driven simulation methodologies.
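
To make the analytical end of this taxonomy concrete, here is a deliberately simplified data-parallel step-time estimator in the spirit of such models. Every input is an assumption you would replace with profiled or datasheet numbers:

```python
def data_parallel_step_time_s(
    flops_per_step,         # forward + backward FLOPs for one worker's micro-batch
    peak_flops,             # sustained compute throughput per GPU (FLOP/s)
    grad_bytes,             # gradient bytes exchanged each step
    bus_bandwidth_bytes_s,  # per-GPU interconnect bandwidth (bytes/s)
    num_gpus,
    overlap_fraction=0.0,   # fraction of the all-reduce hidden behind compute
):
    compute = flops_per_step / peak_flops
    # A ring all-reduce moves ~2*(N-1)/N of the gradient bytes over each link.
    comm = (2 * (num_gpus - 1) / num_gpus) * grad_bytes / bus_bandwidth_bytes_s
    return compute + (1.0 - overlap_fraction) * comm

# Illustrative numbers only: fp16 gradients of a 7B-parameter model on 8 GPUs.
step = data_parallel_step_time_s(
    flops_per_step=6e12, peak_flops=150e12,
    grad_bytes=14e9, bus_bandwidth_bytes_s=300e9,
    num_gpus=8, overlap_fraction=0.5,
)
print(f"estimated step time: {step * 1e3:.1f} ms")
```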

The landscape of distributed DNN training simulators is diverse, with tools that balance prediction accuracy, simulation speed, and the level of system detail they model in different ways. To provide a clearer comparative overview, the following figure summarizes the key characteristics of prominent simulators in this domain, drawing from recent surveys:

Figure 6. Summary of distributed DNN training simulators and their main characteristics.

Benchmarking and Workload Generation

Standardized benchmarks are crucial for fair, reproducible performance comparisons across diverse hardware and software stacks.

Figure 7. The MLPerf family of benchmarks. [Image Source]

Research in Performance Modeling

2025:

Figure 8. Performance ranking analysis from shallowsim showing hardware configurations for LLM inference. The visualization demonstrates how pre-execution modeling guides infrastructure decisions through systematic performance comparisons and bottleneck identification, enabling cost-effective hardware selection before deployment. [Image Source]

2024:

2022-2023:

2020:

But models are never perfect. To validate these predictions and uncover unexpected bottlenecks, we need to observe the system as it runs, which brings us to real-time analysis.

During Execution: Real-Time Analysis

The second phase of our framework focuses on real-time monitoring and on-the-fly adaptations to catch performance issues as they occur. This phase bridges prediction and post-mortem analysis by providing immediate visibility into system behavior.

This capability is powered by low-overhead technologies. eBPF allows for non-intrusive tracing of kernel and API interactions, while vendor suites like NVIDIA's DCGM expose critical hardware counters.
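
As a lightweight example of the kind of signal this phase provides, NVML (exposed in Python by the nvidia-ml-py package as the pynvml module) can be polled for utilization and memory while a job runs. This sketch assumes a single local GPU and the standard NVML bindings:

```python
import time
import pynvml  # provided by the nvidia-ml-py package

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

try:
    for _ in range(10):  # poll roughly once per second for ~10 seconds
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        print(f"GPU util {util.gpu:3d}%  mem {mem.used / 2**30:6.2f} GiB")
        time.sleep(1.0)
finally:
    pynvml.nvmlShutdown()
```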

While essential for live monitoring, deep diagnosis requires the comprehensive data gathered in post-execution profiling.

Post-Execution: Profiling and Debugging

The third phase of our framework—post-mortem profiling—provides the most detailed insights through offline analysis of execution data. While sampling profilers offer low-overhead glimpses into average performance, they often miss tail latency issues and complex interactions. Tracing methodologies capture comprehensive event logs, enabling a deeper understanding of system behavior and root causes that sampling might miss—a distinction well-articulated in Dan Luu’s Sampling vs. Tracing.
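
The distinction is easy to see in miniature: a sampling profiler periodically interrupts the program and records wherever it happens to be, while a tracer records every call and return. The toy sketch below (Unix-only, since it relies on SIGPROF) contrasts the two:

```python
import collections
import signal
import sys

samples = collections.Counter()
trace_events = []

def on_sample(signum, frame):
    # Sampling: attribute this tick to whatever function the timer interrupted.
    samples[frame.f_code.co_name] += 1

def on_call(frame, event, arg):
    # Tracing: deterministically record every function call and return.
    if event in ("call", "return"):
        trace_events.append((event, frame.f_code.co_name))

def busy_work():
    total = 0
    for i in range(500_000):
        total += i * i
    return total

# Sampling run: a SIGPROF timer fires every ~10 ms of CPU time.
signal.signal(signal.SIGPROF, on_sample)
signal.setitimer(signal.ITIMER_PROF, 0.01, 0.01)
for _ in range(20):
    busy_work()
signal.setitimer(signal.ITIMER_PROF, 0, 0)  # stop sampling

# Tracing run: sys.setprofile sees every call/return, at much higher overhead.
sys.setprofile(on_call)
busy_work()
sys.setprofile(None)

print("samples:", samples.most_common(3))
print("trace events recorded:", len(trace_events))
```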

The Early Days

HPC Profilers

Historic Foundation Tools:

These tools established core principles: use sampling to reduce overhead, incorporate hardware counters, and support hierarchical analysis.

Modern CPU Profilers

System Profilers:

Python Profilers:

Continuous Profiling:

Profiling for GPUs and AI Frameworks

GPU Profilers by Vendor:

ML Framework Profilers:

Collective Communication:

Production-Scale Systems:

Visualization Tools:

Research in Post-Mortem Analysis

2025:

2024:

2023:

2022:

2021:

2020:

2019:

2018:

2017:

2016:

2015:

2014:

2012:

2011:

2010:

2007:

2003:

2002:

2000:

1997:

1996:

The detailed traces from post-mortem analysis don't just solve today's problems. They become the input for building more accurate predictive models for the next generation of workloads, bringing our framework full circle.

Approach Selection Guide

Choosing the right performance analysis approach depends on your specific scenario:

| Scenario | Pre-Execution | During Execution | Post-Execution |
|---|---|---|---|
| New model architecture | Roofline, ASTRA-sim | - | PyTorch Profiler + HTA |
| Distributed training scaling | SimAI, Multiverse | nvitop, Dynolog | Nsight Systems |
| Production LLM serving | Centimani (hardware selection) | nvidia-smi, monitoring | SKIP, Nsight Compute |
| Memory optimization | Analytical models | - | Drgpum, ValueExpert |
| Kernel-level debugging | - | perf top | KPerfIR, Drgpu |
| Cross-platform analysis | Benchmarking (MLPerf, PARAM) | - | HPCToolkit, DeepContext |

Notable PhD Dissertations

This section highlights doctoral dissertations that have made significant contributions to performance analysis:

Awards and Recognition

The SIGHPC Outstanding Doctoral Dissertation Award recognizes doctoral dissertations whose central research theme is HPC, understood broadly as computational capabilities that deliver much higher performance than desktop systems. The award is presented annually at the SC conference.

Further Reading

For readers interested in diving deeper into specific aspects of performance analysis, we recommend the following comprehensive resources:

Conclusion: From Guesswork to a Science

The journey from a slow, inefficient ML system to a highly optimized one is no longer an art form defined by guesswork. By adopting a structured, three-phase approach—Predicting bottlenecks before they occur, Monitoring systems in real-time, and Profiling for deep insights—engineers and scientists can systematically dismantle performance mysteries. Whether you are designing the next generation of hardware or deploying a model to millions of users, this framework provides a map for navigating the complex landscape of modern HPC and ML performance. The right tool exists for your challenge; the key is knowing when and how to use it.