PC Sampling in CPU Systems: A Comprehensive Survey
Tracing the evolution of PC sampling from early profiling techniques in the 1980s to modern continuous profiling systems: hardware innovations, compiler optimizations, and large-scale deployments.
Program Counter (PC) sampling has evolved from a niche debugging aid in the 1980s to an indispensable technique for performance profiling in modern computing systems. This survey traces the development of PC sampling research across four decades, highlighting key innovations in hardware support, low-overhead profiling, and large-scale deployments that have made always-on profiling a reality in production systems.
Early Profiling Techniques (1980s–Early 90s)
Gprof (1982)
Graham, Kessler, McKusick. SIGPLAN 1982
One of the earliest profiling tools to use program counter sampling. Gprof combined sampling-based time profiling with instrumentation of function calls, producing a call-graph execution profile. This was a significant advance for its time, giving developers insight into where programs spent execution time at modest cost. However, gprof’s coarse granularity (attributing time per procedure) and the overhead of instrumenting every call motivated research into more efficient sampling methods.
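As a rough illustration of this hybrid design (toy C stand-ins, not gprof’s actual code): a timer-driven sampler bins the interrupted PC into a time histogram, while a hook inserted at every function entry by the compiler (mcount under gcc -pg) counts caller-to-callee arcs; gprof then combines the two into its call-graph profile.

```c
/* Illustrative sketch of gprof's two data sources (toy stand-ins, not
 * gprof's real data structures). */
#include <stdint.h>

#define HIST_BUCKETS (1 << 16)

static uint32_t pc_histogram[HIST_BUCKETS];  /* time samples per code region */
static uint64_t arc_counts[HIST_BUCKETS];    /* toy caller->callee arc table */

/* Sampling side: on each profiling timer tick, bucket the interrupted PC. */
static void on_timer_tick(uintptr_t pc, uintptr_t text_base)
{
    pc_histogram[((pc - text_base) >> 4) % HIST_BUCKETS]++;
}

/* Instrumentation side: called on function entry (the role mcount plays)
 * to count one execution of a caller->callee arc. */
static void record_arc(uintptr_t caller_pc, uintptr_t callee_pc)
{
    arc_counts[(caller_pc ^ callee_pc) % HIST_BUCKETS]++;  /* toy arc hash */
}
```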
Quartz (1990)
Anderson & Lazowska. SIGMETRICS 1990
A performance tuning tool for parallel programs. While not focused exclusively on PC sampling, it highlighted the challenges of profiling parallel applications. Quartz primarily used instrumentation (counts of procedure executions) rather than pure PC sampling, incurring high overhead. Its introduction underscored the need for lower-overhead profiling techniques as systems grew in complexity.
Dynamic Instrumentation vs. Sampling (Early 90s)
Reiser & Skudlarek. SIGPLAN Notices 1994
During the early 1990s, various tools explored program instrumentation and dynamic binary rewriting for profiling. This work described profiling via machine-language rewriting, an approach that inserts code to record events. Such instrumentation-based profilers (including MTOOL and others) could collect detailed information but often imposed high runtime overhead. This period set the stage for statistical sampling as a lightweight alternative, since purely instrumentation-based systems were impractical for continuous or always-on profiling in production. PC sampling began to gain attention as a way to periodically interrupt execution and record the current PC, trading a small, random perturbation for broad coverage of program behavior.
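To make the contrast concrete, here is a minimal sketch of the sampling approach on a modern Linux/x86-64 system: a profiling timer periodically interrupts execution and the signal handler records the interrupted PC. It illustrates the general technique rather than any specific 1990s tool.

```c
/* Minimal timer-driven PC sampler (Linux/x86-64 sketch).
 * Build with: gcc -O2 -o sampler sampler.c */
#define _GNU_SOURCE
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/time.h>
#include <ucontext.h>

#define MAX_SAMPLES 100000

static uint64_t samples[MAX_SAMPLES];   /* raw sampled PC values */
static volatile size_t nsamples;

/* SIGPROF handler: record the program counter at the point of interruption. */
static void on_sigprof(int sig, siginfo_t *si, void *ctx)
{
    (void)sig; (void)si;
    ucontext_t *uc = ctx;
    if (nsamples < MAX_SAMPLES)
        samples[nsamples++] = (uint64_t)uc->uc_mcontext.gregs[REG_RIP];
}

static void start_sampling(long period_usec)
{
    struct sigaction sa = { 0 };
    sa.sa_sigaction = on_sigprof;
    sa.sa_flags = SA_SIGINFO | SA_RESTART;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGPROF, &sa, NULL);

    struct itimerval it = { 0 };
    it.it_interval.tv_usec = period_usec;
    it.it_value.tv_usec = period_usec;
    setitimer(ITIMER_PROF, &it, NULL);   /* fires while the process runs on a CPU */
}

/* A CPU-bound workload to profile. */
static double spin(void)
{
    double x = 0.0;
    for (long i = 1; i < 50000000; i++)
        x += 1.0 / (double)i;
    return x;
}

int main(void)
{
    start_sampling(1000);   /* roughly 1000 samples per second of CPU time */
    double r = spin();
    printf("result=%f, collected %zu PC samples\n", r, nsamples);
    /* A real profiler would map these addresses back to functions and lines. */
    return 0;
}
```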
Emergence of Low-Overhead Sampling (Mid-1990s)
Hardware Counter Profiling (1996)
Zagha et al. SC 1996
By the mid-90s, CPUs began to include hardware performance counters that could trigger interrupts for profiling. Zagha et al. used the MIPS R10000’s hardware counters for performance analysis, sampling events such as cache misses and recording the PC at counter overflow, which let analysts attribute performance costs to code addresses with far lower overhead than full instrumentation. The work demonstrated the feasibility of sampling-based profiling on modern processors and inspired subsequent system-wide profilers.
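The same overflow-sampling idea survives in today’s Linux perf_event_open interface; a hedged sketch of the configuration (modern Linux API names, not the SGI tooling described in the paper):

```c
/* Sketch: configure a hardware counter to interrupt every N events and
 * record the PC at overflow, via Linux perf_event_open. Samples land in a
 * ring buffer that a profiler would mmap() and drain. Illustrative only. */
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static int open_cache_miss_sampler(pid_t pid)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = PERF_COUNT_HW_CACHE_MISSES; /* event that drives sampling */
    attr.sample_period = 10000;               /* interrupt every 10,000 misses */
    attr.sample_type = PERF_SAMPLE_IP;        /* record the PC at each overflow */
    attr.disabled = 1;                        /* enable later with an ioctl */

    /* Profile the given process on any CPU. */
    return (int)syscall(SYS_perf_event_open, &attr, pid, -1, -1, 0);
}
```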
Digital Continuous Profiling Infrastructure (DCPI, 1997)
Anderson et al. SOSP 1997
A landmark in PC sampling. DCPI is a continuous profiling system that periodically samples the PC across all running code (user and kernel) using hardware counter interrupts. The sampler runs with low overhead (~1–3% slowdown) yet collects fine-grained statistics, attributing stall cycles and events to individual program instructions. DCPI introduced randomization of sampling intervals to avoid bias and could profile entire production systems continuously. The paper “Continuous Profiling: Where Have All the Cycles Gone?” reported that DCPI’s statistical sampling could pinpoint performance bottlenecks (like cache misses or branch mispredictions) at the instruction level in large applications. This was a major advance, showing that always-on profiling via PC sampling was practical for complex systems.
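As a rough sketch of the randomization idea (not DCPI’s actual code), a sampler can re-arm the counter after each overflow with a period drawn from a window around the mean, so samples do not lock onto periodic program behavior:

```c
/* Sketch of DCPI-style interval randomization. write_counter() below is a
 * hypothetical hardware-specific helper, named only for illustration. */
#include <stdint.h>
#include <stdlib.h>

/* Draw the next sampling period uniformly from [mean - jitter, mean + jitter]. */
static uint64_t next_sample_period(uint64_t mean, uint64_t jitter)
{
    return mean - jitter + (uint64_t)rand() % (2 * jitter + 1);
}

/* In the overflow handler, after recording the PC:
 *     write_counter(next_sample_period(mean_period, mean_period / 32));
 */
```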
Morph (OS Support for Profiling, 1997)
Zhang et al. SOSP 1997
Another SOSP 1997 paper focused on operating system support for automatic profiling and optimization. Morph implemented low-overhead sampling on commodity operating systems (targeting Windows NT and Unix) to monitor desktop applications continuously. Like DCPI, it relied on periodic PC sampling (using timer interrupts or counters) to gather profiles with minimal performance impact. Morph’s goal was to automate not just profiling but also dynamic optimizations based on the profiles. While Morph’s sampling was coarser (it used existing OS timer interrupts, which limited accuracy inside interrupt handlers), it demonstrated how an OS could integrate profiling into everyday application execution. This line of work reinforced that statistical PC sampling was becoming an accepted strategy in top operating systems research for achieving always-on performance monitoring.
ProfileMe (1997)
Dean et al. MICRO 1997
Traditional hardware counters only counted events and sampled the PC on overflow, which often could not attribute events to the exact instruction that caused them on out-of-order CPUs. This work introduced ProfileMe, a new hardware support scheme for instruction-driven sampling on out-of-order processors. Instead of sampling on events, ProfileMe randomly samples individual instructions as they flow through the pipeline. For each sampled instruction, the hardware records detailed information: its PC, per-stage pipeline latencies, whether it missed in caches or was mispredicted, and so on. This produces a rich profile of pipeline events attributed accurately to instructions, overcoming the “skid” problem (where simple PC sampling might pinpoint the wrong instruction due to out-of-order execution).
ProfileMe could even do paired sampling, capturing two concurrently in-flight instructions to study interactions (measuring overlap and concurrency). The ProfileMe paper showed that sampling a few instructions with hardware support can yield fine-grained insight comparable to full simulation, with far less overhead. Although ProfileMe was a research prototype (not a commercial product at the time), its ideas influenced later hardware features. For instance, modern processors from AMD introduced “Instruction-Based Sampling (IBS)” around 2007, and Intel added Precise Event-Based Sampling (PEBS), both of which echo ProfileMe’s goal of capturing detailed per-instruction events via sampling.
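On current Linux systems, that kind of precise attribution is requested through the perf interface; a hedged sketch of the relevant configuration (assuming a PEBS- or IBS-capable CPU and kernel support):

```c
/* Sketch: request low-skid, instruction-accurate samples via Linux
 * perf_event_open, the modern descendant of ProfileMe's idea (PEBS on
 * Intel, IBS on AMD). Illustrative configuration only. */
#include <linux/perf_event.h>
#include <string.h>

static void configure_precise_sampling(struct perf_event_attr *attr)
{
    memset(attr, 0, sizeof(*attr));
    attr->size = sizeof(*attr);
    attr->type = PERF_TYPE_HARDWARE;
    attr->config = PERF_COUNT_HW_CPU_CYCLES;
    attr->sample_period = 100003;       /* sample roughly every 100k cycles */
    attr->sample_type = PERF_SAMPLE_IP; /* record the sampled PC */
    attr->precise_ip = 2;               /* request zero-skid attribution */
}
```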
Sampling-Based Profiling in HPC (2000s)
HPCToolkit and Call-Path Profiling (2005)
Froyd, Mellor-Crummey, Fowler. ICS 2005
As multicore and supercomputer systems grew, sampling techniques were adopted in HPC performance tools to handle large-scale codes with minimal overhead. HPCToolkit, developed at Rice University, is an example of an HPC profiling suite that relies heavily on PC sampling. This work presented low-overhead call-path profiling of optimized binaries using sampling. The technique not only samples the program counter but also unwinds the call stack at each sample to attribute costs along the full calling context. By sampling asynchronously, it avoids instrumenting every function, keeping overhead low even for complex parallel codes. This was the first demonstration of efficiently capturing full calling contexts via sampling, even for fully optimized (inlined) code, something previous profilers struggled with.
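A minimal sketch of the core mechanism, collecting the calling context at a sample point; HPCToolkit uses its own unwinder engineered for fully optimized binaries, so libunwind here only illustrates the idea:

```c
/* Sketch: unwind the call stack at a sample point to attribute the sample
 * to its full calling context. Illustrative use of libunwind (link with
 * -lunwind), not HPCToolkit's actual unwinder. */
#define UNW_LOCAL_ONLY
#include <libunwind.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_DEPTH 64

/* Fill pcs[] with the PCs of the current call path; returns the depth. */
static size_t sample_call_path(uint64_t *pcs, size_t max_depth)
{
    unw_context_t ctx;
    unw_cursor_t cursor;
    size_t depth = 0;

    unw_getcontext(&ctx);
    unw_init_local(&cursor, &ctx);

    while (depth < max_depth && unw_step(&cursor) > 0) {
        unw_word_t ip;
        unw_get_reg(&cursor, UNW_REG_IP, &ip);
        pcs[depth++] = (uint64_t)ip;
    }
    return depth;   /* the profiler charges the sample's cost to this path */
}

/* Typical use from a sampling signal handler:
 *     uint64_t pcs[MAX_DEPTH];
 *     size_t n = sample_call_path(pcs, MAX_DEPTH);
 */
```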
Memory Profiling Using Hardware Counters (2003)
Itzkowitz, Wylie, Aoki, Kosche. SC 2003
Marty Itzkowitz and collaborators extended sampling beyond instruction costs to the data side of performance. Their SC’03 paper described how Sun’s performance tools used hardware counters to attribute memory hierarchy events directly to data structures and even individual elements. Applying the technique to the SPEC CPU2000 MCF benchmark, they produced breakdowns that showed which data objects incurred the most misses and bandwidth, guiding targeted optimizations. The paper also outlined how the collected memory profiles could feed compiler-directed prefetching and richer reports, demonstrating that PC sampling infrastructure could provide first-class visibility into data behavior, not just instruction hotspots.
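A hedged sketch of the same idea on today’s Linux perf interface (Itzkowitz et al. worked with Sun’s counter tooling; this assumes a CPU and kernel that support precise memory sampling, e.g. PEBS, so that data addresses are captured):

```c
/* Sketch: sample cache-miss events and record the effective data address
 * alongside the PC, so misses can later be mapped back to data objects.
 * Illustrative configuration only. */
#include <linux/perf_event.h>
#include <string.h>

static void configure_data_profiling(struct perf_event_attr *attr)
{
    memset(attr, 0, sizeof(*attr));
    attr->size = sizeof(*attr);
    attr->type = PERF_TYPE_HW_CACHE;
    attr->config = PERF_COUNT_HW_CACHE_L1D                  /* L1 data cache */
                 | (PERF_COUNT_HW_CACHE_OP_READ << 8)       /* read accesses */
                 | (PERF_COUNT_HW_CACHE_RESULT_MISS << 16); /* that missed   */
    attr->sample_period = 1000;               /* every 1,000 misses */
    attr->precise_ip = 2;                     /* precise samples, where supported */
    attr->sample_type = PERF_SAMPLE_IP        /* the missing instruction     */
                      | PERF_SAMPLE_ADDR      /* the data address it touched */
                      | PERF_SAMPLE_DATA_SRC; /* which level served the access */
    /* The profiler then maps sampled addresses back to heap and global objects. */
}
```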
Scaling to Petascale (2009)
Tallent et al. SC 2009
With petascale supercomputers, profiling needed to scale to thousands of cores. This work described how HPCToolkit uses sampling to pinpoint both node-level and scaling bottlenecks in emerging petascale applications. Their approach collects sample-based profiles across all processes of large MPI jobs, then correlates and analyzes them to identify inefficiencies down to source lines in full program context. They showed that statistical sampling could be made accurate and precise enough to guide optimizations even on massively parallel runs, something infeasible with instrumentation-based profilers due to perturbation.
HPC tools like Cray’s performance analysis toolkit and IBM’s Parallel Performance Toolkit also began to incorporate sample-based techniques in this era. Additionally, the PAPI project (2000) provided a standard API for accessing hardware counters, enabling HPC researchers to set up timer or event-driven PC sampling uniformly across platforms. By the end of the 2000s, sampling had become the de facto method for performance analysis in HPC.
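As an illustration of the portable setup PAPI enables, a minimal sketch of event-driven PC sampling through its overflow API (error handling omitted; link with -lpapi):

```c
/* Sketch: portable event-driven PC sampling through PAPI's overflow API. */
#include <papi.h>
#include <stdio.h>

/* Invoked on each counter overflow; `address` is the sampled PC. */
static void on_overflow(int event_set, void *address,
                        long long overflow_vector, void *context)
{
    (void)event_set; (void)overflow_vector; (void)context;
    printf("sample at PC %p\n", address);
}

int main(void)
{
    int event_set = PAPI_NULL;
    long long counts[1];

    PAPI_library_init(PAPI_VER_CURRENT);
    PAPI_create_eventset(&event_set);
    PAPI_add_event(event_set, PAPI_TOT_CYC);            /* total cycles */
    PAPI_overflow(event_set, PAPI_TOT_CYC, 1000000, 0,  /* every 1M cycles */
                  on_overflow);
    PAPI_start(event_set);

    /* ... run the code being profiled ... */

    PAPI_stop(event_set, counts);
    return 0;
}
```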
Profiling in Managed Runtimes and Compilers (2000s–2010s)
Java and Managed Languages
Arnold & Grove. CGO 2005
Academic work in the 2000s also looked at applying PC sampling to managed runtime environments (JVMs, .NET). Researchers developed sampling-based profilers for Java Virtual Machines that avoid bytecode instrumentation by using periodic native sampling. This work introduced a low-overhead sampling strategy to rapidly identify hot call edges in Java programs, converging quickly on an accurate dynamic call graph. Implemented in Jikes RVM and J9, their approach showed that sampling can find hot methods and call relationships nearly as precisely as exhaustive instrumentation, but with far less overhead.
Compiler Optimizations with Sampled Profiles
Chen et al. CGO 2010
In traditional Feedback-Directed Optimization (FDO), profiles were gathered by instrumenting binaries and running training workloads. Around 2010, researchers began to replace instrumentation with sampled hardware profiles to guide compilers, making FDO more practical for large-scale deployments. The paper “Taming Hardware Event Samples for FDO Compilation” addressed the challenges of using sample-based profiles for optimization. The authors noted issues like skid and sample bias when attributing costs to instructions, and proposed techniques to improve profile accuracy (e.g., using Intel’s PEBS to obtain precise instruction addresses). By correcting for sampling error, their method produced profiles whose performance gains nearly matched those of traditional instrumentation-based profiles.
Chen, Li, Moseley. CGO 2016 (AutoFDO)
Building on prior work, Google engineers developed AutoFDO, presented as “Automatic Feedback-Directed Optimization for Warehouse-Scale Applications”. AutoFDO collects profiles from production machines by sampling hardware performance monitors (PCs associated with cache-misses, cycles, etc.), then feeds these profiles into the compiler for the next build. The paper showed that sampling could gather representative profiles with negligible overhead in live systems, vastly increasing the adoption of FDO across Google’s large codebases. The sampled profiles (augmented with address-to-source mapping) achieved ~85% of the performance gains of classical FDO while being much easier to deploy at scale. This success exemplified how far PC sampling had come: from a debugging aid to an integral part of production compiler optimization pipelines.
Continuous Profiling at Scale (2010s)
Google-Wide Profiling (2010)
Ren et al. IEEE Micro 2010
This article reported on Google-Wide Profiling (GWP), Google’s internal continuous profiling system. GWP extended the DCPI philosophy to Google’s data centers, continuously sampling across thousands of machines. It collected stack traces and events from live services with minimal overhead, uncovering performance issues at scale. GWP’s design drew heavily on prior research (the authors cite DCPI as their inspiration) and in turn influenced academia by demonstrating the value of always-on profiling in production.
Following this, other large-scale systems (e.g., Uber’s and Facebook’s profiling infrastructures) and research prototypes also embraced always-on PC sampling in distributed environments. This trend was occasionally reported in systems conferences, showing that techniques pioneered in the 80s and 90s have become fundamental for performance analysis in the 21st century.
Recent Developments
Research continues to refine CPU sampling. Recent papers explore memory-access sampling (to profile data locality) and context-sensitive sampling (e.g., differentiating samples by execution context or input). Recent analyses examine the precision and overheads of Intel’s PEBS and AMD’s IBS for memory profiling (Yi et al. 2020; Sasongko 2022), and SOSP’23 Memtis demonstrates PEBS-based memory-access sampling at scale.
Meanwhile, the rise of GPUs led to PC sampling being introduced in GPU drivers (NVIDIA’s CUPTI and AMD’s ROCm now offer PC sampling for kernels). This mirrors the CPU history: early GPU profilers relied on instrumentation or simulation, and only more recently has statistical sampling been deployed to profile GPU instruction stalls and bottlenecks with low overhead.
In summary, decades of research – from gprof to the present – have established PC sampling as an indispensable technique for performance profiling, tracing a line through top conferences in systems, architecture, PL, HPC, and compilers. Each generation of work built on the last, continually reducing overhead and improving fidelity, to make profiling an everyday, everywhere capability in modern computing systems.