Reading Notes

A running log of my casual notes, thoughts, and summaries on papers in performance engineering, MLSys, and HPC. This is a living document.

ProfileMe ‘97

Compaq's Alpha 21264 provides two performance counters that support privilege-mode qualification and interrupt generation on counter overflow.

| Feature | ProfileMe | NVIDIA CUPTI PC Sampling |
| --- | --- | --- |
| Company | Digital Equipment Corp. (DEC) | NVIDIA |
| Target Architecture | Out-of-order superscalar CPUs | GPUs (SIMT architecture) |
| Chips | Compaq Alpha 21264 processor | NVIDIA GPUs (e.g., Ampere, Hopper, Blackwell) |
| Core Goal | Identify instruction-level bottlenecks and accurately attribute events to specific instructions | Identify warp-level stall bottlenecks and attribute them to the responsible code |
| Sampling Philosophy | “Instruction biography”: follow a single, randomly chosen instruction through its entire lifecycle | “Warp snapshot”: periodically check the current state and stall reason of all active warps |
| Sampling Trigger | Instruction-triggered: a hardware counter randomly selects an instruction to profile | Time-triggered: a periodic clock-cycle interval initiates sample collection across the GPU (see the toy simulation after this table) |
| Data Collection Model | Longitudinal & deep: gathers a rich, multi-faceted record for one instruction’s complete execution | Cross-sectional & broad: gathers a focused record from many warps at a single moment in time |
| Data Richness per Sample | Very high: PC, multiple pipeline-stage latencies, cache-miss data, effective addresses, completion status, etc. | Focused: primarily the PC and a single, dominant stall reason for a warp |
| Unit of Analysis | A single instruction | A warp (a group of 32 threads executing in lockstep) |
| Hardware Requirement | Requires new, dedicated hardware in the CPU core, such as extra counters and registers | Uses the existing, built-in Performance Monitoring Units (PMUs) standard on modern GPUs |
| Key Innovation | Paired sampling: simultaneously profiles two instructions to measure concurrency and interaction on an out-of-order CPU | Scalability: a scalable method to analyze performance across thousands of parallel threads |
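
A toy simulation of the two trigger mechanisms (my own illustration, not either tool's real interface): instruction-triggered sampling picks PCs uniformly over the instruction stream, while time-triggered sampling lands disproportionately on long-latency instructions, which is exactly why PC sampling surfaces stall-heavy code.

```python
import random

random.seed(0)

# Toy instruction stream: (pc, latency_in_cycles). A few instructions are
# slow (think cache misses); most are single-cycle.
stream = [(pc, random.choice([1, 1, 1, 4, 300])) for pc in range(10_000)]

def instruction_triggered(stream, period=997):
    """ProfileMe-style: pick every `period`-th fetched instruction and record
    a rich per-instruction record (here, just its latency)."""
    samples, countdown = [], random.randint(0, period)
    for pc, lat in stream:
        if countdown == 0:
            samples.append({"pc": pc, "latency": lat})
            countdown = period
        countdown -= 1
    return samples

def time_triggered(stream, interval=2_048):
    """CUPTI-style: every `interval` cycles, snapshot whichever instruction
    is in flight (a focused record: PC plus a stall indicator)."""
    samples, clock, next_tick = [], 0, interval
    for pc, lat in stream:
        clock += lat
        if clock >= next_tick:
            samples.append({"pc": pc, "stalled": lat > 10})
            next_tick += interval
    return samples
```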

gprof ‘82

1982

gprof measures cost with statistical sampling: a periodic timer records which function is executing, and those samples add up to each function's total self-time. Call counts come from instrumentation inserted at every function entry. gprof then distributes cost with proportional propagation: a function's time (self-time plus time propagated from its descendants) is divided among its callers in proportion to their call counts.

A critical aspect of gprof's design is a simplifying assumption it makes to maintain low overhead: it assumes that every call to a particular function takes the same amount of time. The profiler calculates a function's total self-time from all its executions and then effectively computes an "average cost per call." This average cost is what gets distributed to its callers.

However, this model has limitations, as the paper itself acknowledges. Consider a function like matrix_multiply(size). If one part of the program calls it with a small 4x4 matrix, while another calls it with a massive 1000x1000 matrix, the actual costs are vastly different. gprof will average these costs, potentially over-attributing time to the cheap call and under-attributing it to the expensive one. This trade-off between precision and low performance impact is a fundamental characteristic of gprof's design.
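
A tiny numeric sketch of that averaging effect (function names and numbers are hypothetical, not gprof output):

```python
# Toy illustration of gprof's average-cost-per-call assumption.
self_time_s = 50.0  # sampled self-time of matrix_multiply across the run

# Call counts recorded by the entry-hook instrumentation: caller -> calls.
calls = {"render_preview": 99,   # 99 cheap 4x4 multiplies
         "run_simulation": 1}    # one huge 1000x1000 multiply

avg_cost = self_time_s / sum(calls.values())  # 0.5 s "per call"

for caller, n in calls.items():
    print(f"{caller}: charged {n * avg_cost:.1f} s")
# render_preview: charged 49.5 s   <- almost all the cost
# run_simulation: charged 0.5 s    <- despite doing nearly all the work
```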

DSL and DSA ‘20

2020

slides

The Hardware Lottery ‘21

2021

Hooker introduces the term “hardware lottery” to describe a phenomenon where research ideas succeed not because they are inherently superior, but because they happen to be compatible with available hardware and software at the time. This creates noise in the scientific process, potentially causing promising research directions to be overlooked or delayed for decades due to lack of suitable tools, while suboptimal ideas that align with existing infrastructure flourish.

To avoid future hardware lotteries, the key lies in reducing the cost and complexity of exploring different algorithm-software-hardware combinations.

Hardware Challenges:

Software Revolution:

The Bitter Lesson ‘19

2019

link

The most powerful advances in AI come not from encoding human knowledge, but from leveraging massive computational power through general methods like search and learning.

The “Bitter Lesson” refers to a recurring pattern in AI history:

  1. Short-term Temptation: Researchers often try to embed their own understanding and domain-specific knowledge into AI systems.
  2. Short-term Success: These efforts yield initial improvements, giving a sense of progress.
  3. Long-term Limitations: Over time, however, such hand-crafted approaches hit a ceiling and fail to scale.
  4. Breakthrough via Computation: Ultimately, breakthroughs emerge from approaches that exploit large-scale computation—often through brute-force search or learning—rather than human-designed strategies.

Historical Examples:

Underlying Reason:
This lesson is driven by the exponential growth in computational power (a la Moore’s Law). Methods that scale with computation—search and learning—consistently outperform those limited by human knowledge.

What Should We Learn?

In essence, the future of AI lies in creating systems that learn and discover as we do, not ones that merely reflect what we already know.

3Ps of PP ‘19

2011

link

A successful parallel programming model should have three broad characteristics (known as the three Ps): productivity, performance, and portability.

Torch ‘02

2002

link

This paper reminds me to have a look at the history of PyTorch.

Torch’s journey spans 20+ years: the C‑based Torch (2002) laid the modular groundwork, Torch7 (2011) rebuilt it in LuaJIT for GPU speed, and frustration with Lua’s niche led the same researchers to ship PyTorch 0.1 in January 2017—bringing define‑by‑run automatic differentiation to Python. PyTorch 1.0 (Dec 2018) added TorchScript and ONNX export so the research‑friendly API could ship to production, and in 2022 Meta transferred the project to a vendor‑neutral PyTorch Foundation. The compiler‑driven PyTorch 2.0 series (March 2023 →) layers torch.compile and TorchInductor on top of eager mode for dramatic performance gains without changing user code.
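
That "without changing user code" claim boils down to wrapping an existing module. A minimal sketch (the model is a made-up toy):

```python
import torch

# Any ordinary eager-mode module; its code is untouched.
model = torch.nn.Sequential(
    torch.nn.Linear(128, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
)

compiled = torch.compile(model)   # TorchDynamo captures, TorchInductor compiles
x = torch.randn(32, 128)
out = compiled(x)                 # first call compiles; later calls reuse the kernels
```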

Historical evolution at a glance

| Era (years) | Library | Key milestones & capabilities | Representative articles |
| --- | --- | --- | --- |
| Early Torch (2002–2005) | Torch (C + scripting) | Collobert & Bengio introduce a BSD‑licensed, modular ML library with BLAS/ATLAS kernels (publications.idiap.ch) | “Torch: A Modular Machine Learning Software Library” (2002) (publications.idiap.ch) |
| Torch7 (2011–2017) | Torch7 (LuaJIT) | Full LuaJIT rewrite; CUDA tensor backend; adopted at Facebook AI & DeepMind (ronan.collobert.com, en.wikipedia.org) | Collobert et al., “Torch7: A MATLAB‑like Environment for ML” (2011) (ronan.collobert.com) |
| PyTorch 0.x preview (Jan 2017 – mid‑2017) | PyTorch 0.1–0.2 | Define‑by‑run dynamic graphs, CUDA tensors, eager autograd; fast researcher uptake (github.com, pytorch.org) | GitHub release timeline (0.1) (github.com); “PyTorch, a Year In” (2018‑01‑19) (pytorch.org) |
| PyTorch 1.x growth (2018–2021) | PyTorch 1.0–1.13 | 1.0 stable (Dec 2018) merges the Caffe2 runtime, adds the TorchScript compiler & ONNX export; ecosystem libraries and cloud support blossom (engineering.fb.com, developers.facebook.com, pytorch.org) | FB Eng. Blog “Developer Ecosystem Expands, 1.0 Stable” (2018‑12‑07) (engineering.fb.com); FB Dev Blog “Announcing PyTorch 1.0” (2018‑05‑02) (developers.facebook.com) |
| Governance shift (Sept 2022) | PyTorch Foundation | Meta donates the project to the new PyTorch Foundation under the Linux Foundation; multi‑vendor board (AMD, AWS, Google, Microsoft, Nvidia) (linuxfoundation.org, axios.com) | Linux Foundation press release (2022‑09‑12) (linuxfoundation.org); Axios coverage “Meta moves PyTorch to LF” (2022‑09‑12) (axios.com) |
| PyTorch 2.x compiler era (Mar 2023 →) | PyTorch 2.0–2.6 | torch.compile captures eager code and feeds TorchDynamo + Inductor for 30–200% speedups; dynamic shapes & backend plug‑ins keep expanding (pytorch.org) | PyTorch Blog “PyTorch 2.0: Next‑generation Release” (2023‑03‑15) (pytorch.org); “PyTorch 2.x Overview” docs (pytorch.org) |

EasyView ‘24

Traditional profiling tools are often separate from IDEs, creating a steep learning curve and forcing developers to switch contexts between coding and analysis, which is inefficient. While some IDEs have attempted to integrate profilers, these solutions typically offer limited support, basic functionality, and poor performance with large analysis files.

EASYVIEW addresses these pain points by bringing performance analysis directly into the coding environment. Its core advantages include:

Figure 1. Overview of EASYVIEW. EASYVIEW consists of three components: profile abstraction, analysis, and visualization.
Figure 2. EASYVIEW shows an aggregate memory profile for a gRPC application, collected by PProf. The snapshot shows EASYVIEW's main views: flame graphs, source-code annotations, and a metric histogram. This profile identifies potential memory leaks.

Deep learning with COTS HPC systems ‘13

The paper demonstrated that large-scale deep neural networks could be efficiently trained using clusters of commodity hardware—specifically, GPU servers interconnected with high-speed networks. Prior to this work, training billion-parameter models typically required massive cloud infrastructure and thousands of CPU cores. The authors showed that by leveraging GPUs—which offer far greater arithmetic throughput than standard CPUs—along with optimized distributed computing strategies, it was possible to achieve better performance using far fewer machines.

Neutrino ‘25

Figure 3. NEUTRINO DMAT plot (captured on an RTX 3080) for different attention algorithms. Comparing the DMATs of the algorithms, we can visually identify that the improvement of (b) FlashAttn-v1 over (c) Memory-Efficient Attention comes from I/O efficiency, while the gain of (a) FlashAttn-v2 comes from better pipelining; both are consistent with their respective claims.
| Dimension | DeepContext | NEUTRINO |
| --- | --- | --- |
| Instrumentation / Data Pipeline | Uses a shim layer (DLMonitor) to intercept deep‑learning framework calls, CUPTI/RocTracer GPU events, and CPU Perf/PAPI counters, then co‑aggregates them in one profile database. | Injects “probes” (snippet + trace‑point + map) directly into GPU assembly at runtime; a hook driver and probe engine reassemble the code and stream results to user space. |
| Context Breadth Reconstructed | Rebuilds the full call‑path—from Python ➜ DL operator ➜ C++ ➜ GPU kernel—across PyTorch & JAX, NVIDIA & AMD GPUs, and even CPUs, so one trace covers the whole stack. | Stays inside a single GPU kernel but reaches instruction‑level detail; nevertheless works on both NVIDIA & AMD back‑ends. |
| Automation & Guidance | Ships a pattern‑matching performance analyzer that auto‑suggests fixes (e.g., fuse ops, change data layout). | Provides raw measurements; interpretation is left to the user. The paper explicitly says leveraging those traces “is not our primary goal.” |
| Visual Experience | Integrated VS Code panel with hierarchical flame‑paths and metric drill‑downs. | Introduces the DMAT plot, which overlays memory pressure, time, and thread density in one heat‑map for instant visual insight. |
| Programmability / Extensibility | Users extend analysis by writing new pattern‑rules over the stored context/metrics; instrumentation itself is fixed (handled by DLMonitor). | A Python DSL or TOML lets you write custom probes, combine them cooperatively, and even build ring‑buffer maps—akin to eBPF for GPUs (see the toy illustration after this table). |
| Cross‑Platform Coverage | One binary profiles PyTorch or JAX on NVIDIA, AMD, x86, or ARM with the same UI. | Currently supports CUDA 12 + ROCm/HIP on Linux for NVIDIA & AMD GPUs. |
| Profiling Overhead | Comparable to state‑of‑the‑art tools while cutting trace size dramatically. | Lightweight probes slow kernels only 1.04× on average and burn ≈4 extra registers; heavy DMAT probes are ~7×. |
| Safety / Verification | Relies on mature vendor APIs, so safety comes “for free” from CUPTI/RocTracer. | Includes a verifier that rejects probes touching original registers, flow‑control instructions, or shared‑memory abuse, preventing crashes. |
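
To make the probe = snippet + trace-point + map idea concrete, here is a toy software analogue in Python. This is only an illustration of the concept; it is not NEUTRINO's actual DSL, API, or semantics:

```python
from collections import defaultdict

class Probe:
    """Toy software analogue of a (snippet, trace-point, map) probe."""
    def __init__(self, tracepoint, snippet):
        self.tracepoint = tracepoint        # where to attach, e.g. "after:LDG"
        self.snippet = snippet              # what to record when it fires
        self.map = defaultdict(list)        # per-warp storage for the records

    def fire(self, ctx):
        self.map[ctx["warp_id"]].append(self.snippet(ctx))

# Record (cycle, address) for every simulated global-memory load.
probe = Probe("after:LDG", lambda ctx: (ctx["clock"], ctx["addr"]))
for event in ({"warp_id": 0, "clock": 10, "addr": 0x100},
              {"warp_id": 1, "clock": 12, "addr": 0x180},
              {"warp_id": 0, "clock": 40, "addr": 0x140}):
    probe.fire(event)
print(dict(probe.map))  # {0: [(10, 256), (40, 320)], 1: [(12, 384)]}
```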

How they fit together

XSP ‘20

| Dimension | DeepContext | NEUTRINO | XSP |
| --- | --- | --- | --- |
| Instrumentation / Data Pipeline | Uses a shim layer (DLMonitor) to intercept deep-learning framework calls, CUPTI/RocTracer GPU events, and CPU Perf/PAPI counters, then co-aggregates them in one profile database. | Injects “probes” (snippet + trace-point + map) directly into GPU assembly at runtime; a hook driver and probe engine reassemble the code and stream results to user space. | Leverages distributed tracing by combining outputs from existing tools: manual timing at the model level, framework-specific profilers at the layer level, and NVIDIA’s CUPTI at the GPU kernel level. |
| Context Breadth Reconstructed | Rebuilds the full call-path—from Python ➜ DL operator ➜ C++ ➜ GPU kernel—across PyTorch & JAX, NVIDIA & AMD GPUs, and even CPUs, so one trace covers the whole stack. | Stays inside a single GPU kernel but reaches instruction-level detail; works on both NVIDIA & AMD back-ends. | Correlates events across the model, layer, and GPU kernel levels by building an interval tree from span timestamps to reconstruct parent-child relationships (see the sketch after this table). |
| Automation & Guidance | Ships a pattern-matching performance analyzer that auto-suggests fixes (e.g., fuse ops, change data layout). | Provides raw measurements; interpretation is left to the user. The paper explicitly says leveraging those traces “is not our primary goal.” | Features an automated analysis pipeline that performs 15 types of analyses, including tables and roofline plots that systematically characterize models. |
| Visual Experience | Integrated VS Code panel with hierarchical flame-paths and metric drill-downs. | Introduces the DMAT plot, which overlays memory pressure, time, and thread density in one heat-map. | Generates static plots and tables, such as throughput graphs, layer-distribution pie charts, and roofline plots. |
| Programmability / Extensibility | Users extend analysis by writing new pattern-rules over the stored context/metrics; instrumentation itself is fixed (handled by DLMonitor). | A Python DSL or TOML lets you write custom probes, combine them cooperatively, and even build ring-buffer maps—akin to eBPF for GPUs. | Extensible design: new profiling tools plug in via a “tracer interface,” and new analyses can be added to the automated pipeline. |
| Cross-Platform Coverage | One binary profiles PyTorch or JAX on NVIDIA, AMD, x86, or ARM with the same UI. | Currently supports CUDA 12 + ROCm/HIP on Linux for NVIDIA & AMD GPUs. | The evaluated implementation targets TensorFlow and MXNet on NVIDIA GPUs (Turing, Volta, Pascal, Maxwell). |
| Profiling Overhead | Comparable to state-of-the-art tools while cutting trace size dramatically through runtime metric aggregation. | Lightweight probes slow kernels only 1.04× on average and burn ≈4 extra registers; heavy DMAT probes are ~7×. | Uses a “leveled experimentation” approach to measure and account for overhead from underlying profilers like CUPTI, which can be significant (over 100× for memory metrics). |
| Safety / Verification | Relies on mature vendor APIs, so safety comes “for free” from CUPTI/RocTracer. | Includes a verifier that rejects probes touching original registers, flow-control instructions, or shared memory, preventing crashes. | Inherits safety from the established profilers it uses at each stack level (e.g., framework profilers, CUPTI). |
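
A minimal sketch of the span-correlation idea: given (start, end) timestamps, parent-child relationships fall out of interval containment. XSP builds an interval tree for this; the stack-based version below is my own simplification, and the span names and timestamps are hypothetical:

```python
# Reconstruct parent-child relationships from trace spans by containment.
spans = [  # (name, start, end) in microseconds -- hypothetical values
    ("model:predict",          0, 1000),
    ("layer:conv1",           10,  400),
    ("kernel:implicit_gemm",  20,  390),
    ("layer:fc1",            450,  900),
]

# Sort by start time, longer spans first, so parents precede their children.
spans.sort(key=lambda s: (s[1], -(s[2] - s[1])))

parents, stack = {}, []
for name, start, end in spans:
    while stack and stack[-1][2] <= start:   # pop spans that already ended
        stack.pop()
    parents[name] = stack[-1][0] if stack else None
    stack.append((name, start, end))

print(parents)
# {'model:predict': None, 'layer:conv1': 'model:predict',
#  'kernel:implicit_gemm': 'layer:conv1', 'layer:fc1': 'model:predict'}
```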

How They Fit Together

These three tools represent different philosophies and target distinct, but complementary, levels of the deep learning performance optimization workflow.

In a practical workflow, a developer might first use DeepContext to identify a performance bottleneck and determine that a specific GPU kernel in their PyTorch model is the culprit. If the automated suggestions aren’t sufficient, they could then use NEUTRINO to attach custom probes to that specific kernel’s assembly code, analyzing its memory access patterns with a DMAT plot to understand the root cause of the inefficiency at the hardware level. XSP provides the foundational concepts and analyses that inform the design of both of these more modern tools.

Scalana ‘20 ‘24

There are two versions of ScalAna: one published at SC '20, the other in TPDS '24. The table below compares the two versions of the paper.

| Comparison Dimension | Early Version (SC '20) | Current Version (TPDS '24) |
| --- | --- | --- |
| Core Idea | Combines static program analysis with dynamic, sampling-based profiling to automatically identify root causes of scalability bottlenecks, using a Program Performance Graph (PPG) and a backtracking algorithm (see the toy sketch after this table). | The core idea remains the same, but the methodology, technical implementation, and evaluation are significantly deepened and expanded. |
| Analysis Scope | Primarily supports C and Fortran MPI programs via source-code analysis in LLVM. Extending to other models like OpenMP is noted as future work. | Significantly expanded: supports direct analysis of executable binaries (using Dyninst); adds support for GPU programs (CUDA) and other parallel models like OpenMP and Pthreads. |
| Key Techniques & Depth | Proposes the fundamental framework of graph generation and backtracking detection. Uses LLVM for static analysis. | Technically deepened: introduces optimizations for binary analysis such as parallel analysis and granularity coarsening; details more refined, low-overhead data collection (e.g., lightweight indirect-call analysis); adds a formal complexity analysis of the backtracking algorithm. |
| Evaluation Scale | Evaluated on up to 2,048 processor cores. | Significantly enlarged: evaluated on up to 16,384 processor cores; includes a scalability analysis of the tool itself against parameters like process count and sampling frequency. |
| Overhead & Performance | Runtime overhead: 1.73% on average for up to 2,048 processes. Storage cost: kilobytes (KB) to megabytes (MB), far less than tracing-based tools. | Runtime overhead: 5.65% on average for up to 16,384 processes (including I/O). Storage cost: megabytes (MB) to gigabytes (GB), still significantly less than tracing tools. |
| Optimization Impact | Achieved up to 11.11% performance improvement on 2,048 processes by fixing detected bottlenecks. | More significant: up to 33.01% performance improvement. |
| Tool Maturity & Usability | Provides a GUI named ScalAna-viewer and a clear command-line workflow. | More mature: provides the same GUI and adds explicit configuration guidance for key parameters like MaxLoopDepth and ImbThd, improving usability. |
| Case Studies | Evaluated applications including NPB, Zeus-MP, SST, and Nekbone. | Updated and more in-depth: evaluates new applications such as LAMMPS (CPU/GPU), Sweep3D, HPL-GPU, and HPCG-GPU, replacing some earlier cases. |
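
A toy rendition of the backtracking idea (my own illustration, not ScalAna's actual PPG representation or algorithm): given per-region times at two scales, start from the region whose time grows worst with scale and walk backward through dependence edges while a predecessor explains the growth:

```python
# Per-region time (seconds) at a small and a large process count.
time_small = {"compute_A": 10.0, "compute_B": 10.0, "mpi_wait": 1.0}
time_large = {"compute_A": 10.1, "compute_B": 16.0, "mpi_wait": 7.5}
# Edges point from cause to effect: the MPI wait is "fed" by the compute
# regions that delay its sender.
predecessors = {"mpi_wait": ["compute_A", "compute_B"],
                "compute_A": [], "compute_B": []}

def scaling_loss(v):
    return time_large[v] - time_small[v]

v = max(time_large, key=scaling_loss)                # worst-scaling vertex
while predecessors[v]:
    worst = max(predecessors[v], key=scaling_loss)
    if scaling_loss(worst) < scaling_loss(v) * 0.5:  # arbitrary toy threshold
        break                                        # growth not explained upstream
    v = worst
print("suspected root cause:", v)                    # -> compute_B
```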

Blog post: Scalana: A (Not-So) Deep Dive into its Codebase

The two versions of the paper are worth reading side by side:

Landscape of GPU-Centric Communication ‘24

Figure 4. The evolution of NVIDIA's GPUDirect technology. Each version progressively removes the CPU as a bottleneck, enabling more direct and efficient communication paths between GPUs and other devices like NICs, both within a single server and across the network.

The GPUDirect family represents a series of innovations that progressively eliminate CPU involvement:

NVLink‑C2C is NVIDIA’s ground‑referenced, single‑ended 40 Gbps/pin link that stitches Grace CPUs to Hopper/Blackwell GPUs (or to another Grace) into a single, cache‑coherent address space. It supplies up to 900 GB/s aggregate bandwidth (450 GB/s each direction), consumes only 1.3 pJ/bit, and achieves a bit-error rate of 10⁻¹⁵ with a 0.3 UI eye margin, without forward‑error correction. This moves the traditional CPU↔GPU “data copy” bottleneck out of the critical path; instead, locality, coherence traffic, and NUMA effects dominate performance analysis.

Physical‑Layer Fundamentals

Architectural Integration

Performance Characteristics

| Metric | Peak | Measured (research/benchmarks) | Notes |
| --- | --- | --- | --- |
| Bandwidth (H2D) | 450 GB/s | 375 GB/s | Comm-Scope micro-benchmarks |
| Bandwidth (D2H) | 450 GB/s | 297 GB/s | Same study |
| CPU→GPU remote-HBM latency | ~800 ns | 800–1000 ns | Memory-disaggregation eval. |
| GPU local HBM latency | ~300 ns | — | Baseline for comparison |

Effective payload bandwidth is ≈ 672 GB/s total—~75 % of the wire peak—once protocol and software overheads are included.

GH200 Case Study ‘24

This paper analyzes the memory hierarchy and data-movement performance of the Quad GH200 node in the Alps supercomputer. As HPC and AI workloads become increasingly limited by memory bottlenecks, optimizing intra-system communication latency and bandwidth has become a critical driver of new architectural developments.

Figure 5. The Quad GH200 node within the Alps supercomputer.
| Component | Description | Key Specifications |
| --- | --- | --- |
| Quad GH200 node | The basic unit of the system (shown in the figure above), composed of 4 GH200 superchips. | 288 CPU cores, 4 GPUs |
| GH200 superchip | An integrated chip with a Grace CPU + Hopper GPU. | 96 GB of HBM3 and 128 GB of LPDDR5 per GH200 |
| NVLink-C2C | Connects the CPU and GPU within a single GH200. | 900 GB/s bidirectional bandwidth, cache coherent |
| NVLink | Connects the Hopper GPUs of different GH200s. | 150 GB/s unidirectional bandwidth |
| Grace Interconnect | Connects the Grace CPUs of different GH200s. | 150 GB/s bandwidth |
| Slingshot | Connects the node to the external supercomputer network. | 25 GB/s unidirectional bandwidth (per GH200) |
| NUMA architecture | The entire node is a single system with 8 NUMA domains. | Grace CPUs: NUMA 0, 1, 2, 3; Hopper GPUs: NUMA 4, 12, 20, 28 |

Core System Architecture

The fundamental building block of the system is the Quad GH200 node, which integrates four Grace Hopper Superchips into a single, cohesive NUMA system.

| Component | Description | Key Features |
| --- | --- | --- |
| Grace Hopper Superchip (GH200) | The basic unit, tightly integrating a Grace CPU (Arm architecture) and a Hopper GPU. | Forms a single, powerful processing unit. |
| NVLink-C2C interconnect | A high-speed chip-to-chip link connecting the Grace CPU and Hopper GPU. | Bandwidth: 900 GB/s bidirectional. Cache coherency: CPU and GPU share a unified address space and can access all main memory. Advantage: 7x higher bandwidth than traditional PCIe 5.0. |
| Quad GH200 node | The building block of the Alps supercomputer, consisting of four GH200 superchips. | Connectivity: fully interconnected via NVLink (for GPUs) and the Grace Interconnect (GI, for CPUs). Total resources: 288 CPU cores and 4 GPUs per node. System view: functions as a single NUMA system. |
| Heterogeneous memory | Each GH200 features two types of memory: LPDDR5 attached to the Grace CPU and HBM3 attached to the Hopper GPU. | LPDDR5 (CPU): ~500 GB/s bandwidth. HBM3 (GPU): >4 TB/s bandwidth. Implication: the memory pool itself is heterogeneous, making intelligent data placement crucial for performance. |

Key Performance Insights & Micro-benchmarking

The authors used a series of micro-benchmarks to explore the performance of different processing units (PUs) when performing read, write, and copy operations across various physical memory locations.

Read/Write Bandwidth

| Scenario | Observation | Performance Implication |
| --- | --- | --- |
| Hopper GPU accessing local DDR | Highly efficient, achieving 93% of theoretical bandwidth via the C2C interconnect. | The C2C interconnect gives the GPU effective access to CPU-attached memory. |
| Grace CPU accessing local HBM | Less efficient, achieving only 53–64% of theoretical bandwidth. | Accessing HBM from the CPU is possible but not as performant as the reverse. |
| Concurrent access (noise test) | When CPU and GPU access memory simultaneously, the C2C interconnect becomes a bottleneck. | Performance is significantly impacted, especially for write operations to HBM. |
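
A rough PyTorch analogue of these read/write experiments, timing pinned-host ↔ HBM copies with CUDA events (a sketch for intuition, not the paper's benchmark code; on a GH200 the host↔device path exercises NVLink-C2C):

```python
import torch

def copy_bandwidth_gbs(src, dst, iters=20):
    """Time repeated src->dst copies with CUDA events; return GB/s."""
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        dst.copy_(src, non_blocking=True)
    end.record()
    torch.cuda.synchronize()
    seconds = start.elapsed_time(end) / 1000  # elapsed_time returns ms
    return src.nbytes * iters / seconds / 1e9

n = 1 << 28  # 256 Mi float32 elements = 1 GiB
host = torch.empty(n, dtype=torch.float32, pin_memory=True)
dev = torch.empty(n, dtype=torch.float32, device="cuda")
print("H2D:", copy_bandwidth_gbs(host, dev), "GB/s")
print("D2H:", copy_bandwidth_gbs(dev, host), "GB/s")
```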

Memory Copy Performance

| Operation Path | Observation | Limiting Factor |
| --- | --- | --- |
| DDR to DDR (within the same GH200) | The data must traverse the interconnect twice, limiting effective bandwidth to half of the interconnect's total. | The C2C interconnect bandwidth is the bottleneck. |
| Grace CPU copying to a peer GH200 | Performance is asymmetric: the Grace CPU is more efficient at pushing data from its local memory to a peer GH200 than the reverse operation. | The underlying data path and interconnect topology create performance asymmetries. |

Access Latency & Inter-Processor Communication

| Test / Scenario | Finding | Key Influencing Factor |
| --- | --- | --- |
| Pointer chase (cross-C2C access) | Accessing memory across the C2C interconnect (e.g., Hopper accessing DDR) introduces significant latency. | The physical hop across the interconnect is a primary source of delay. |
| Cache behavior | Hopper's L2 cache can cache data from HBM; in some cases, accessing cached HBM data on a peer GH200 can be faster than accessing local DDR. | Caching plays a critical role in mitigating latency, sometimes in non-intuitive ways. |
| Ping-pong test (atomic flags) | Communication latency is extremely sensitive to the physical memory location of the atomic flag. | Latency is lowest when communication happens within a single, local GH200. |

Inter-Node Communication

| Benchmark | Finding | Recommendation |
| --- | --- | --- |
| MPI | Fully saturating the external network bandwidth of a Quad GH200 node requires launching at least four processes. | Bind each process to a separate network interface to maximize network throughput. |
| NCCL | Data locality is paramount for collective operations like AllReduce: performance is far superior when intra-node communication uses memory on the same GH200 (see the sketch below). | Avoid placing data in the DDR memory of a peer GH200 during inter-node communication; doing so severely degrades overall performance. |
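
A minimal torch.distributed sketch of the locality recommendation, keeping the all-reduce buffer in the local GPU's HBM (assumes a torchrun launch with one process per GPU; not the paper's benchmark code):

```python
# Launch: torchrun --nproc_per_node=4 this_script.py
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Allocate the buffer in the local GPU's HBM; NCCL then moves data over
# NVLink/Slingshot directly instead of staging through (peer) DDR.
buf = torch.ones(1 << 20, device=f"cuda:{local_rank}")
dist.all_reduce(buf)

dist.destroy_process_group()
```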

Application Performance Analysis

The study further applied these micro-benchmark insights to real-world, memory-intensive applications.

| Application | Scenario | Performance Impact |
| --- | --- | --- |
| GEMM | All matrix data is stored in local HBM. | The task is compute-bound, achieving high performance (see the sketch below). |
| GEMM | Even one input matrix is moved out of local HBM. | Performance drops sharply; the task immediately becomes memory-bound. |
| Llama2 inference | Model weights are placed in local HBM. | Inference time is significantly reduced. |
| Llama2 inference | Model weights are placed in DDR or peer memory. | Inference time increases substantially, demonstrating the critical role of memory access speed. |
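
A quick way to reproduce the flavor of the GEMM observation with stock PyTorch (an illustration, not the paper's methodology): compare a GEMM whose operands stay in HBM against one that must pull an input from CPU-side memory on every call:

```python
import time
import torch

n = 4096
a_hbm = torch.randn(n, n, device="cuda")
b_hbm = torch.randn(n, n, device="cuda")
b_host = b_hbm.cpu().pin_memory()  # same data, resident in CPU-side memory

def bench(fn, iters=10):
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters

print("all-HBM GEMM:      ", bench(lambda: a_hbm @ b_hbm))
print("B fetched per call:", bench(lambda: a_hbm @ b_host.to("cuda", non_blocking=True)))
```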

Conclusion

The paper concludes that although the Quad GH200 node is a highly integrated, tightly coupled system, developers must treat it as a collection of four independent superchips joined by a high-speed network to achieve optimal performance. NVLink-C2C successfully expands the memory pool available to each processing unit, opening up new possibilities for novel heterogeneous applications that mix CPU and GPU computing. However, a deep understanding of the data paths and a meticulous memory-layout strategy are indispensable for fully harnessing its potential.