Extra-P and Score-P: Automated Performance Modeling for HPC
Extra-P's empirical performance modeling and Score-P's measurement infrastructure: from automated scalability bug detection to noise-resilient modeling for exascale systems.
Performance modeling in high-performance computing has traditionally been a manual, expert-driven process. Developers would run small-scale experiments, manually derive scaling formulas, and hope these predictions held at production scale. This approach is time-consuming, error-prone, and doesn’t scale to the complexity of modern HPC applications with thousands of parameters and code paths.
The Extra-P (Exascale Performance Prediction) framework, developed collaboratively by TU Darmstadt and ETH Zürich’s SPCL group, automates this entire process. Combined with Score-P, the unified measurement infrastructure from the VI-HPS community, these tools enable developers to automatically discover performance models, detect scalability bugs before they manifest at scale, and make data-driven optimization decisions.
This post traces the evolution of Extra-P from its foundational work on automated scalability bug detection to recent advances in noise-resilient modeling using deep neural networks, all built on Score-P’s robust measurement infrastructure.
The Extra-P Ecosystem
Extra-P and Score-P form a complete performance analysis pipeline:
Score-P provides the measurement layer: instrumentation, profiling, and tracing infrastructure that captures performance data from running applications
Extra-P provides the analysis layer: empirical performance modeling that automatically derives scaling formulas from Score-P measurements
The partnership between TU Darmstadt’s Laboratory for Parallel Programming and ETH Zürich’s Scalable Parallel Computing Lab (SPCL) has driven continuous innovation in both tools since 2013. Score-P serves as the measurement backend not just for Extra-P but for the entire VI-HPS tool ecosystem including Vampir, Scalasca, and TAU.
Core Methodology
Automated Scalability Analysis
The foundational insight behind Extra-P is that performance bottlenecks often follow predictable mathematical patterns—but discovering these patterns manually is impractical for complex codes.
Using Automated Performance Modeling to Find Scalability Bugs in Complex Codes (SC 2013)
The foundational Extra-P paper. Introduced automated empirical performance modeling that analyzes every program part (by call path) from modest small-scale runs and extrapolates to identify regions whose growth will impede scaling. This dramatically improved both the coverage and the speed of scalability analysis compared with modeling only a few hand-picked kernel routines. Rather than requiring developers to manually select what to model, Extra-P systematically models everything and automatically flags problematic scaling behavior.
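To make the idea concrete, here is a minimal sketch (not Extra-P's actual implementation) of the kind of model search the paper describes: candidate functions drawn from the performance model normal form (PMNF), fitted by least squares, with the lowest-error hypothesis kept and used for extrapolation. All measurement values below are illustrative.

```python
import itertools
import numpy as np

# Measurements: runtimes of one call path at a few small process counts (illustrative).
p = np.array([64, 128, 256, 512, 1024], dtype=float)
t = np.array([1.02, 1.48, 2.11, 3.05, 4.41])

# PMNF-style single-term hypotheses: t(p) ~ c0 + c1 * p^i * log2(p)^j
I = [0, 0.25, 0.5, 0.75, 1, 1.25, 1.5, 2]   # candidate exponents
J = [0, 1, 2]                                # candidate log powers

def fit(i, j):
    """Least-squares fit of c0 + c1 * p^i * log2(p)^j; returns coefficients and error."""
    basis = np.column_stack([np.ones_like(p), p**i * np.log2(p)**j])
    coeffs, *_ = np.linalg.lstsq(basis, t, rcond=None)
    rss = float(np.sum((basis @ coeffs - t) ** 2))
    return coeffs, rss

# Evaluate every hypothesis and keep the one with the smallest residual error.
best = min(((fit(i, j), (i, j)) for i, j in itertools.product(I, J)),
           key=lambda x: x[0][1])
(c, rss), (i, j) = best
print(f"best hypothesis: {c[0]:.3g} + {c[1]:.3g} * p^{i} * log2(p)^{j}  (RSS={rss:.3g})")

# Extrapolate to a scale that was never measured.
p_big = 16384.0
print("predicted time at p=16384:", c[0] + c[1] * p_big**i * np.log2(p_big)**j)
```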
10,000 Performance Models per Minute—Scalability of the UG4 Simulation Framework (Euro-Par 2015)
Demonstrated Extra-P’s ability to generate performance models at unprecedented scale, automatically analyzing UG4’s complex multigrid solver across thousands of code regions. Showed that automated modeling could scale to realistic scientific applications with intricate call graphs.
Multi-Parameter Modeling
Real applications don’t scale along a single dimension—they have multiple problem sizes, decompositions, and algorithmic parameters.
Fast Multi-Parameter Performance Modeling (CLUSTER 2016)
Extended Extra-P to multi-parameter modeling, handling applications where performance depends on multiple independent variables (e.g., problem size, processor count, algorithmic parameters). Introduced efficient search strategies for the multi-dimensional parameter space that avoid exponential blowup.
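A hedged sketch of the multi-parameter flavor of this search, assuming hypotheses built as products of single-parameter PMNF terms in the process count p and problem size n; the candidate term lists, synthetic data, and selection by residual error are simplifications for illustration, not the paper's exact algorithm.

```python
import itertools
import numpy as np

# Illustrative measurements over two parameters: process count p and problem size n.
p = np.array([32, 32, 32, 64, 64, 64, 128, 128, 128], dtype=float)
n = np.array([1e6, 2e6, 4e6, 1e6, 2e6, 4e6, 1e6, 2e6, 4e6])
t = 0.8 + 2.5e-7 * n / p * np.log2(p)                                   # synthetic "truth"
t = t * (1 + 0.02 * np.random.default_rng(0).standard_normal(t.size))  # plus mild noise

# Single-parameter term candidates for each parameter: (exponent, log power).
terms_p = [(0, 0), (0.5, 0), (1, 0), (1, 1), (-1, 0), (-1, 1)]
terms_n = [(0, 0), (0.5, 0), (1, 0), (1, 1)]

def column(x, exp, logp):
    return x**exp * np.log2(x)**logp

best = None
for (ip, jp), (in_, jn) in itertools.product(terms_p, terms_n):
    # Multi-parameter hypothesis: t ~ c0 + c1 * f(p) * g(n)
    basis = np.column_stack([np.ones_like(t), column(p, ip, jp) * column(n, in_, jn)])
    coeffs, *_ = np.linalg.lstsq(basis, t, rcond=None)
    rss = float(np.sum((basis @ coeffs - t) ** 2))
    if best is None or rss < best[0]:
        best = (rss, coeffs, (ip, jp, in_, jn))

rss, coeffs, (ip, jp, in_, jn) = best
print(f"t(p, n) ~ {coeffs[0]:.3g} + {coeffs[1]:.3g} * p^{ip} log2(p)^{jp} * n^{in_} log2(n)^{jn}")
```

The point of the paper is that even this naive cross-product of term candidates explodes combinatorially as parameters are added, and that smarter search strategies keep the cost manageable.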
Off-Road Performance Modeling—How to Deal with Segmented Data (Euro-Par 2017)
Addressed the challenge of modeling performance when data exhibits distinct regimes (e.g., cache effects causing performance phase transitions). Developed techniques to automatically detect and model segmented performance behavior where different scaling formulas apply in different parameter ranges.
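The sketch below illustrates the segmentation idea in its simplest form: fit separate models on each side of every candidate breakpoint and keep the split only if it reduces the error substantially. The data and the acceptance threshold are illustrative; the actual technique is considerably more careful.

```python
import numpy as np

# Illustrative runtimes with a regime change (e.g., the working set falling out of cache).
x = np.array([1, 2, 4, 8, 16, 32, 64, 128, 256], dtype=float)
y = np.array([0.11, 0.21, 0.40, 0.82, 1.60, 4.9, 10.2, 20.5, 41.0])

def linfit_rss(xs, ys):
    """Least-squares fit of a + b*x on one segment; returns the residual sum of squares."""
    A = np.column_stack([np.ones_like(xs), xs])
    coeffs, *_ = np.linalg.lstsq(A, ys, rcond=None)
    return float(np.sum((A @ coeffs - ys) ** 2))

# Try every breakpoint that leaves at least 3 points per segment,
# and compare against a single unsegmented fit.
single = linfit_rss(x, y)
best_split, best_rss = None, single
for k in range(3, len(x) - 2):
    rss = linfit_rss(x[:k], y[:k]) + linfit_rss(x[k:], y[k:])
    if rss < best_rss:
        best_split, best_rss = k, rss

if best_split is not None and best_rss < 0.5 * single:   # crude acceptance threshold
    print(f"segmented model preferred; breakpoint at x={x[best_split]:g}")
else:
    print("single model is adequate")
```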
Following the Blind Seer—Creating Better Performance Models Using Less Information (Euro-Par 2017)
Explored how to build accurate performance models with minimal measurement overhead by intelligently selecting which configurations to measure. Demonstrated that strategic sampling of the parameter space, guided by partial models, can achieve accuracy comparable to exhaustive measurement while dramatically reducing profiling costs.
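A rough sketch of the underlying idea: measure the next configuration where the current partial models disagree most, instead of sweeping the whole space. The disagreement criterion, the two competing hypotheses, and the synthetic "measurement" function are assumptions for illustration, not the paper's heuristic.

```python
import numpy as np

rng = np.random.default_rng(1)

def measure(p):
    """Stand-in for an expensive instrumented run (illustrative ground truth plus noise)."""
    return 0.5 + 3e-3 * p * np.log2(p) * (1 + 0.03 * rng.standard_normal())

candidates = np.array([64, 128, 256, 512, 1024, 2048, 4096], dtype=float)
measured = {64: measure(64), 128: measure(128), 256: measure(256)}  # cheap seed runs

def fit(points):
    ps = np.array(sorted(points))
    ts = np.array([points[q] for q in ps])
    # Two competing hypotheses: c0 + c1*p and c0 + c1*p*log2(p).
    models = []
    for basis_fn in (lambda p: p, lambda p: p * np.log2(p)):
        A = np.column_stack([np.ones_like(ps), basis_fn(ps)])
        c, *_ = np.linalg.lstsq(A, ts, rcond=None)
        models.append((c, basis_fn))
    return models

for _ in range(2):  # add two more measurements, chosen adaptively
    models = fit(measured)
    # Predict each unmeasured candidate with both hypotheses; measure where they disagree most.
    unmeasured = [p for p in candidates if p not in measured]
    preds = [[c[0] + c[1] * f(p) for c, f in models] for p in unmeasured]
    spread = [abs(a - b) for a, b in preds]
    nxt = unmeasured[int(np.argmax(spread))]
    measured[nxt] = measure(nxt)
    print(f"measured p={int(nxt)} next (model disagreement {max(spread):.3g}s)")
```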
Noise-Resilient Modeling
Performance measurements in real systems are noisy—network jitter, OS interference, and hardware variability create measurement uncertainty that can corrupt performance models.
Learning Cost-Effective Sampling Strategies for Empirical Performance Modeling (IPDPS 2020)
Introduced machine learning techniques to determine optimal sampling strategies that balance measurement cost against model accuracy. The learned strategies identify which measurements contribute most to model quality and adapt sampling accordingly, reducing the time required to build accurate models by focusing measurement effort where it matters most.
Noise-Resilient Empirical Performance Modeling with Deep Neural Networks (IPDPS 2021)
Applied deep neural networks to performance modeling, demonstrating that DNNs can learn robust models even from noisy measurements. The approach uses neural networks as a denoising layer that filters measurement noise before fitting parametric models, achieving better accuracy on real-world data than traditional regression.
Denoising Application Performance Models with Noise-Resilient Priors (arXiv 2025)
Latest advance in noise-resilient modeling. Incorporates domain knowledge as Bayesian priors to guide model fitting in the presence of noise. By encoding expectations about performance behavior (e.g., monotonicity, smoothness), the approach achieves robust models even when measurements are severely corrupted. Particularly important for production systems where measurement overhead must be minimal and noise is unavoidable.
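A minimal MAP-style sketch of what a prior-guided fit can look like: a least-squares likelihood on log-runtimes combined with penalty terms that encode prior expectations (non-negative coefficients, an exponent near a plausible range). The model form, penalty weights, and data are assumptions for illustration and do not reproduce the paper's formulation.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
p = np.array([64, 128, 256, 512, 1024], dtype=float)
t_true = 1.0 + 4e-3 * p                                   # illustrative "true" behavior
t = t_true * (1 + 0.15 * rng.standard_normal(p.size))     # heavy multiplicative noise

def neg_log_posterior(theta):
    c0, c1, alpha = theta
    pred = np.maximum(c0 + c1 * p**alpha, 1e-12)
    # Likelihood term: squared error on log-runtimes (robust to multiplicative noise).
    nll = np.sum((np.log(pred) - np.log(t)) ** 2)
    # Prior terms (assumptions): coefficients should be non-negative,
    # and the exponent should stay near a physically plausible range.
    prior = 1e3 * (min(c0, 0) ** 2 + min(c1, 0) ** 2) + 0.1 * (alpha - 1.0) ** 2
    return nll + prior

res = minimize(neg_log_posterior, x0=np.array([1.0, 1e-3, 1.0]),
               method="Nelder-Mead", options={"maxiter": 5000})
c0, c1, alpha = res.x
print(f"t(p) ~ {c0:.3g} + {c1:.3g} * p^{alpha:.2f}")
```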
Score-P: Measurement Infrastructure
System Architecture
Score-P serves as the unified measurement infrastructure for the entire VI-HPS tool ecosystem, providing a common foundation for diverse performance tools.
Score-P – A Joint Performance Measurement Run-Time Infrastructure for Periscope, Scalasca, TAU, and Vampir (Parallel Tools Workshop 2011)
The canonical Score-P system paper. Describes the architecture that unifies measurement capabilities across multiple HPC performance tools. Score-P provides both profiling (aggregated metrics) and tracing (detailed event streams), allowing tools to choose the appropriate level of detail. The unified infrastructure eliminates redundant instrumentation overhead when using multiple tools together.
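In practice, Score-P is configured at run time through environment variables. The sketch below shows a small Python driver for a profiling-only measurement series of the kind Extra-P consumes; SCOREP_ENABLE_PROFILING, SCOREP_ENABLE_TRACING, and SCOREP_EXPERIMENT_DIRECTORY are standard Score-P settings, while the application name and launcher are placeholders for your own setup.

```python
import os
import subprocess

# Hypothetical Score-P-instrumented application and launcher; adjust for your system.
APP = "./solver_scorep"          # built with the scorep compiler wrapper
LAUNCH = "srun"                   # or mpirun/mpiexec

for nprocs in (64, 128, 256, 512):
    env = os.environ.copy()
    env.update({
        "SCOREP_ENABLE_PROFILING": "true",    # aggregated call-path profiles
        "SCOREP_ENABLE_TRACING": "false",     # keep full OTF2 tracing off for model input
        "SCOREP_EXPERIMENT_DIRECTORY": f"scorep_profile_p{nprocs}",
    })
    subprocess.run([LAUNCH, "-n", str(nprocs), APP], env=env, check=True)
# The resulting per-scale profiles can then be loaded into Extra-P to fit models.
```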
Score-P: A Unified Performance Measurement System for Petascale Applications (2011)
Early system overview describing Score-P’s design for petascale systems. Addresses challenges of low-overhead instrumentation, scalable data collection, and efficient parallel I/O for performance data.
Open Trace Format 2 (OTF2): The Next Generation of Scalable Trace Formats and Support Libraries (2013)
Describes OTF2, the trace format used by Score-P and downstream analysis tools. OTF2 was designed with exascale systems in mind, providing efficient parallel I/O, compression, and support for diverse event types (communication, I/O, accelerators, etc.).
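OTF2 also ships Python bindings for reading traces. A hedged sketch, assuming the reader interface of recent OTF2 releases (otf2.reader.open and iteration over (location, event) pairs); the trace path is a placeholder and the exact API may differ between versions.

```python
import otf2
from otf2.events import Enter

# Placeholder path: Score-P writes the traces.otf2 anchor file inside its experiment directory.
with otf2.reader.open("scorep_trace_p64/traces.otf2") as trace:
    mpi_enters = 0
    for location, event in trace.events:
        # Events are typed (Enter, Leave, MpiSend, ...); count entries into MPI regions.
        if isinstance(event, Enter) and event.region.name.startswith("MPI_"):
            mpi_enters += 1
    print("MPI region entries:", mpi_enters)
```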
Generic Support for Remote Memory Access Operations in Score-P and OTF2 (2013)
Extended Score-P to capture remote memory access (RMA) operations from one-sided communication models like PGAS languages and MPI-3 RMA. Essential for profiling modern communication patterns beyond traditional message passing.
Language and Platform Support
Score-P has evolved to support diverse programming models and execution platforms as HPC software has diversified.
An LLVM Instrumentation Plug-in for Score-P (SC 2017)
Developed compiler-based instrumentation through LLVM, enabling Score-P to instrument applications at compile time without source modification. The LLVM plugin approach provides finer control over instrumentation granularity and enables optimization-aware instrumentation that adapts to compiler transformations.
Advanced Python Performance Monitoring with Score-P (2020)
Brought Score-P’s capabilities to Python applications, addressing the growing use of Python in scientific computing. Developed techniques to handle Python’s dynamic nature while maintaining Score-P’s low-overhead measurement approach, enabling performance analysis of mixed Python/compiled workflows.
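A hedged sketch using these Python bindings, which are typically run via `python -m scorep`; the region_begin/region_end calls and the --mpp=mpi flag reflect the bindings' manual-instrumentation interface as documented, but the exact API may differ between versions.

```python
# Run as:  python -m scorep --mpp=mpi train.py
# (module and flags from the Score-P Python bindings; availability may vary by install)
import scorep.user

def expensive_kernel(data):
    # Mark a hand-chosen region so it appears as its own call path in the profile.
    scorep.user.region_begin("expensive_kernel")
    result = sum(x * x for x in data)   # placeholder for the real computation
    scorep.user.region_end("expensive_kernel")
    return result

if __name__ == "__main__":
    print(expensive_kernel(range(1_000_000)))
```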
Advanced Applications
Isoefficiency and Configuration
Understanding how applications scale requires more than just runtime measurements—it requires principled analysis of efficiency and configuration choices.
Isoefficiency in Practice: Configuring and Understanding the Performance of Task-based Applications (PPoPP 2017)
Applied isoefficiency analysis using Extra-P-generated models to understand and configure task-based runtime systems. Isoefficiency characterizes how problem size must grow with processor count to maintain constant efficiency—a fundamental scalability metric. The paper demonstrates using automated models to guide runtime configuration decisions that optimize efficiency at scale.
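A small numeric sketch of the isoefficiency computation: given runtime models for the sequential and parallel cases, solve for the problem size n that holds efficiency at a target value as p grows. The model coefficients and target efficiency are illustrative assumptions, not values from the paper.

```python
import numpy as np
from scipy.optimize import brentq

# Illustrative fitted models (coefficients are assumptions, not from a real code):
a, b = 1e-6, 5e-4

def T_seq(n):
    return a * n                          # sequential time

def T_par(p, n):
    return a * n / p + b * np.log2(p)     # parallel time with communication overhead

def efficiency(p, n):
    return T_seq(n) / (p * T_par(p, n))

target = 0.8   # keep 80% parallel efficiency
for p in (64, 256, 1024, 4096):
    # Find the problem size n at which efficiency(p, n) equals the target.
    n_iso = brentq(lambda n: efficiency(p, n) - target, 1e3, 1e12)
    print(f"p={p:5d}: need n of about {n_iso:,.0f} to hold {target:.0%} efficiency")
```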
Engineering Algorithms for Scalability through Continuous Validation of Performance Expectations (IEEE TPDS 2019)
Introduced continuous validation methodology where Extra-P models are used as executable specifications during algorithm development. Developers express expected performance behavior as models, and the system automatically validates that implementations match expectations. When deviations occur, they indicate potential bugs or optimization opportunities.
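A toy sketch of an executable performance expectation: estimate the dominant growth exponent from the latest measurements and fail the check if it exceeds what the developer declared. This is a stand-in for the paper's methodology, with illustrative numbers and a deliberately crude exponent estimate.

```python
import numpy as np

def dominant_exponent(p, t):
    """Crude estimate of the asymptotic growth exponent from the two largest scales."""
    return np.log(t[-1] / t[-2]) / np.log(p[-1] / p[-2])

# Expectation declared by the developer: this phase should scale ~O(p^1) or better.
EXPECTED_MAX_EXPONENT = 1.1   # small tolerance above linear

# Measurements from the latest benchmark run (illustrative numbers).
p = np.array([128, 256, 512, 1024], dtype=float)
t = np.array([2.0, 4.1, 8.5, 17.9])

observed = dominant_exponent(p, t)
assert observed <= EXPECTED_MAX_EXPONENT, (
    f"scaling regression: observed ~p^{observed:.2f}, expected at most p^{EXPECTED_MAX_EXPONENT}"
)
print(f"scaling within expectation (~p^{observed:.2f})")
```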
Lightweight Requirements Engineering for Exascale Co-design (CLUSTER 2018)
Demonstrates using Extra-P models in co-design processes where hardware and software are developed in tandem. Performance models enable reasoning about whether proposed hardware features will benefit target applications before hardware exists, informing design decisions with quantitative predictions.
Case Studies and Real-World Applications
Extra-Deep: Automated Empirical Performance Modeling for Distributed Deep Learning (SC-W 2023)
Extended the Extra-P methodology to distributed deep learning, addressing the distinct performance characteristics of ML training workloads. Handles GPU kernels, communication collectives, and pipeline parallelism: performance factors that differ substantially from those of traditional HPC codes. Demonstrates that Extra-P's empirical approach generalizes beyond its original HPC target domain.
Practical Empirical Performance Modeling for CFD Applications Using Extra-P (2024)
Case study applying Score-P and Extra-P to OpenFOAM computational fluid dynamics workflows. Provides a detailed workflow describing how to instrument CFD codes, collect measurements, build models, and use the resulting insights for optimization. Demonstrates practical integration into the workflows of domain scientists.
ExtraPeak: Advanced Automatic Performance Modeling for HPC Applications (2020)
Overview of the ExtraPeak project that extended Extra-P with advanced modeling capabilities. Describes the research program underlying Extra-P's evolution from initial scalability bug detection to a comprehensive performance modeling framework.
The Path Forward
The Extra-P and Score-P ecosystem has fundamentally changed how HPC developers approach performance analysis. What once required manual derivation of scaling formulas and expert intuition is now automated, systematic, and data-driven.
Key Achievements:
Automation: From manual performance modeling to 10,000+ models per minute
Noise resilience: From fragile fits corrupted by measurement noise to robust models using deep learning and Bayesian priors
Breadth: From single-parameter models to multi-parameter analysis across complex parameter spaces
Integration: From isolated tools to unified measurement infrastructure (Score-P) supporting an entire ecosystem
Current Frontiers:
Deep learning workloads: Extending to GPU-accelerated ML training with distinct performance characteristics (Extra-Deep)
Noise and uncertainty: Continued improvements in noise-resilient modeling as systems become more complex
Real-time validation: Moving toward continuous validation where models guide development in real time
Cross-platform analysis: Adapting techniques across diverse architectures (CPUs, GPUs, accelerators)
Open Challenges:
Can we extend empirical modeling to capture emergent behavior in extreme-scale systems that doesn’t appear at small scales?
How do we model performance when hardware behavior is increasingly non-deterministic (power management, thermal throttling, etc.)?
Can we integrate performance models directly into compilers and runtime systems for automated optimization?
The partnership between measurement infrastructure (Score-P) and automated analysis (Extra-P) has created a powerful platform for performance understanding. As HPC systems continue to grow in complexity, automated empirical modeling becomes not just useful but essential—manual approaches simply cannot scale to the complexity of exascale applications with millions of lines of code and thousands of configuration parameters.
For researchers and practitioners, the Extra-P/Score-P ecosystem provides both a mature toolchain for immediate use and an active research platform for advancing the state of performance analysis. The code is open source, the methodology is well-documented, and the community continues to push the boundaries of what automated performance modeling can achieve.