A detailed analysis of the Scalana codebase, its workflow, components, and data formats.
A (not-so) deep dive into the Scalana tool, its workflow, core components, and data formats, based on the SC20 paper and its accompanying source code.
SCALANA is a performance analysis tool designed to automatically detect scalability bottlenecks in large-scale parallel programs. Traditional performance tools often force a choice between low-overhead profiling, which lacks the detail for root-cause analysis, and high-detail tracing, which can incur prohibitive overhead.
SCALANA’s core innovation is a hybrid approach that combines static analysis at compile-time with lightweight, sampling-based profiling at runtime. This allows it to construct a detailed Program Performance Graph (PPG) that captures both the program’s structure and its performance characteristics, enabling deep analysis at a fraction of the cost of full tracing. This post breaks down the tool’s workflow, key concepts, and data formats to provide a comprehensive understanding of how it works under the hood.
To compile and run the SCALANA artifact, the following dependencies are required; the specific versions used are described in the paper.
- g++: The C++ compiler.
- make: The build automation tool used to execute instructions in the Makefile.
- binutils: Provides the addr2line utility, required for post-mortem symbol resolution.
- llvm-config: Used by the Makefile to get the correct compiler and linker flags for building the LLVM pass.
- opt: The LLVM optimizer, used to load and run the analysis pass (irs.so).
- llc: The LLVM static compiler, used to compile LLVM bitcode into object files.
- mpirun: The launcher for parallel programs.
- mpicxx / mpif90: MPI compiler wrappers used for linking the final instrumented executable.
- python: Runs the main driver script (run.py) and the analysis scripts in the python/ directory. Requires standard data science libraries like pandas, numpy, and matplotlib.
- java: Runs the Scalana-viewer GUI for visual analysis.

The tool's end-to-end process can be understood as a three-act play, moving from static code analysis to dynamic data collection and finally to post-mortem fusion and analysis.
Act I corresponds to Section III-A of the paper and involves analyzing the code’s structure and instrumenting it.

- Trigger: The opt_cmd command in run.py.
- Core code: IRStruct.cpp (compiled into irs.so), which runs as an LLVM Pass.
- Inputs:
  - n{...}.bc: The application’s original LLVM bitcode.
  - irs.so: The compiled LLVM Pass.
  - in.txt: A configuration file listing MPI functions and their assigned IDs.
- Process: The opt tool loads the irs.so pass, which walks the IR to build the PSG and inserts calls to entryPoint and exitPoint (from sampler.cpp) at the boundaries of functions, loops, and other key nodes in the PSG (a sketch of the equivalent command sequence follows this list).
- Outputs:
  - out.txt: The serialized PSG, which serves as a static “map” of the program.
  - i{...}.bc: The new, instrumented bitcode containing the added tracing calls.
  - i{...}: The final instrumented executable, ready for the runtime phase.
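To make the pipeline concrete, here is a minimal Python sketch of the kind of command sequence the opt_cmd stage issues. The pass registration name (-irs here) and the file names are placeholders rather than the actual values from run.py, and linking directly against the sampling library is an assumption.

```python
import subprocess

def instrument(bitcode="n1.bc", inst_bc="i1.bc", exe="i1"):
    # 1. Run the IRStruct pass over the original bitcode: this emits the PSG
    #    (out.txt) and the instrumented bitcode. "-irs" is a placeholder pass name.
    subprocess.run(["opt", "-load", "./irs.so", "-irs", bitcode, "-o", inst_bc],
                   check=True)
    # 2. Lower the instrumented bitcode to a native object file.
    subprocess.run(["llc", "-filetype=obj", inst_bc, "-o", exe + ".o"], check=True)
    # 3. Link with the MPI wrapper (assumed here to also pull in the sampling
    #    runtime) so the executable can run under mpirun.
    subprocess.run(["mpicxx", exe + ".o", "-L.", "-lsampler", "-o", exe], check=True)
    return exe
```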
Act II corresponds to Section III-B of the paper, where the instrumented program is run to collect performance data.
- Trigger: The sub_cmd command in run.py.
- Core code: sampler.cpp (compiled into libsampler.so and loaded by the program at runtime).
- Inputs:
  - i{...}: The instrumented executable from Act I.
  - libsampler.so: The runtime sampling and tracing library.
- Process: Two streams of data are collected while the program runs. Tracing: every instrumented region calls entryPoint or exitPoint from the runtime library, and each event is appended to a trace buffer (call_log). Sampling: whenever a PAPI interrupt fires, papi_handler is invoked, which performs a stack backtrace and records the instruction addresses in another buffer (address_log).
- Outputs:
  - LOG{rank}.txt: Contains the precise, ordered execution trace for each process.
  - SAMPLE{rank}.txt: Contains the raw sampled call stack addresses for each process.

Act III corresponds to Sections III-C and IV of the paper and brings all the data together for the final analysis.
Step 3a: Symbolication.

- Trigger: The parse_cmd command in run.py, which executes parse.sh.
- Core code: The addr2line utility.
- Inputs:
  - SAMPLE{rank}.txt: Raw address log from Act II.
  - i{...}: The executable with debug symbols.
- Process: Pipes the addresses from SAMPLE.txt into addr2line to resolve them to filename:line_number format.
- Output: SAMPLE{rank}.txt-symb: The symbolicated sample log.

Step 3b: Graph fusion.

- Trigger: The anly_cmd command in run.py, which executes the analyze program.
- Core code: log2stat.cpp (compiled into the analyze executable).
- Inputs:
  - out.txt: The static PSG from Act I.
  - LOG{rank}.txt: The execution trace from Act II.
  - SAMPLE{rank}.txt-symb: The symbolicated log from Step 3a.
- Process: analyze first reads out.txt to reconstruct the static PSG in memory, then replays LOG.txt to resolve dynamic behaviors like indirect calls and refine the in-memory graph, and finally processes the SAMPLE.txt-symb files. For each sample, it traverses the in-memory graph to find the node corresponding to the sample’s location and increments its performance counters.
- Output: stat{rank}.txt: The final Program Performance Graph (PPG), one for each process.

Step 4: Scalability analysis.

- Performed by: The python scripts or the java GUI.
- Input: stat.txt files, typically from multiple runs with varying process counts.
- Process: scalebottleneck_all_node_fit.ipynb loads the PPGs, performs regression analysis on the performance data of each node across different scales, and identifies nodes with poor scaling behavior. Scalana-viewer loads the PPGs and provides an interactive interface to explore the call paths of detected root causes and view the corresponding source code.

The “magic” behind Scalana’s static analysis is the LLVM Pass implemented in IRStruct.cpp. After a compiler front-end (like Clang) translates C++ or Fortran into the language-agnostic LLVM Intermediate Representation (IR), this pass operates directly on the IR. It performs two critical tasks:
1. Structure extraction: It identifies the program’s functions, loops, and call sites and serializes this hierarchy, the PSG, to out.txt.
2. Instrumentation: It injects calls to the runtime library functions (e.g., entryPoint) at the entry and exit points of these structures.

This powerful mechanism allows Scalana to access the program’s structure directly from the compiler’s perspective and reliably inject the necessary hooks for runtime tracing.
The entire workflow masterfully decouples static and dynamic analysis. The process can be understood with an analogy:

- The static pass acts as a surveyor: it assigns a unique ID to every “building” (function, loop, etc.) and records this information on a master map (out.txt).
- The runtime sampler acts as a GPS tracker: it periodically records the coordinates of where the program currently is, as raw call stack addresses (SAMPLE.txt).
- Post-mortem, the analyze program takes the GPS coordinates (SAMPLE.txt-symb), looks up the corresponding building on the master map (out.txt), and makes a mark on it. The fully marked-up map is the PPG (stat.txt).
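A minimal Python sketch of that attribution step, assuming the PSG has already been parsed into node objects carrying a file name and a line range (the field and function names here are illustrative, not the ones used in log2stat.cpp):

```python
from collections import defaultdict

def attribute_samples(nodes, symbolicated_samples):
    """nodes: iterable of PSG nodes with .id, .file, .start_line, .end_line.
    symbolicated_samples: list of call stacks, each a list of 'file:line' frames."""
    counts = defaultdict(int)
    for stack in symbolicated_samples:
        fname, _, line = stack[0].rpartition(":")   # innermost frame of the sample
        line = int(line)
        # Candidate nodes whose source range covers the sampled location.
        hits = [n for n in nodes
                if n.file == fname and n.start_line <= line <= n.end_line]
        if hits:
            # Pick the deepest (narrowest) covering node and "make a mark" on it.
            node = min(hits, key=lambda n: n.end_line - n.start_line)
            counts[node.id] += 1
    return counts
```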
A deep analysis reveals a significant difference between the paper’s design and the provided codebase regarding communication analysis.
Design in the Paper (Section III-B2): The paper explicitly states the use of the PMPI (Profiling MPI) interface to intercept MPI calls at runtime. This is crucial for capturing detailed communication dependencies, such as the source/destination rank and message tags, which are needed to draw the inter-process dependency edges in the PPG.
Implementation in the Codebase: The code in sampler.cpp and IRStruct.cpp does not implement PMPI. Instead, it treats MPI calls like any other function:

- The static pass matches the MPI_ function names from in.txt and creates nodes for them in the PSG.
- At runtime, these calls simply go through the same generic entryPoint/exitPoint hooks.

Conclusion: The current codebase can identify where MPI calls occur and measure their duration, but it lacks the mechanism to resolve inter-process dependencies (i.e., who talks to whom). The generated PPG contains detailed intra-process control flow and performance data but does not contain the explicit communication edges between processes described in the paper.
Understanding the intermediate files is key to understanding the workflow.
Intro: out.txt is the Program Structure Graph (PSG). It is a hierarchical representation of the program’s structure, generated at compile-time. It contains no runtime data. It is a forest of trees, where each tree represents a function.
Format: The file begins with the number of trees, followed by a depth-first serialization of each node. It ends with lookup tables for directory and file names.
Node Line Format: <ID> <Type> <1st> <Last> <Child_Count> <Dir_ID> <File_ID> <Start_Line> <End_Line>
- ID: A globally unique, zero-based identifier for the node.
- Type: A numeric code for the node’s type, corresponding to the NodeType enum in IRStruct.h. Negative values denote structural nodes (e.g., -1=LOOP, -4=FUNCTION, -5=CALL); positive values are the MPI call IDs assigned in in.txt.
- 1st & Last: Relate to graph contraction. For a normal node, these are identical to Type. For a merged COMBINE (-8) node, they indicate the types of the original first and last nodes in the sequence.
- Child_Count: The number of direct children this node has in the tree.
- Dir_ID & File_ID: Zero-based indices into the directory and file lookup tables at the end of the file.
- Start_Line & End_Line: The source code line range for the node.

Example Visualized:
# out.txt content
1
0 -4 -4 -4 1 0 0 10 50
1 -1 -1 -1 1 0 0 15 40
2 42 42 42 0 0 0 20 20
...
This translates to the following structure:
- FUNCTION (ID 0, main.c:10-50)
|
+--- LOOP (ID 1, main.c:15-40)
|
+--- MPI_CALL (ID 2, main.c:20)
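Because each node line carries its Child_Count and a node’s children follow it immediately, the forest can be rebuilt with a small recursive reader. A minimal sketch (field handling follows the format above; the directory/file lookup tables at the end of the file are ignored, and error handling is omitted):

```python
def parse_psg(path="out.txt"):
    """Rebuild the PSG forest from its depth-first serialization."""
    with open(path) as f:
        rows = [line.split() for line in f if line.strip()]
    it = iter(rows)
    num_trees = int(next(it)[0])            # first line: number of trees

    def read_node():
        (node_id, ntype, first, last, n_children,
         dir_id, file_id, start, end) = map(int, next(it))
        node = {"id": node_id, "type": ntype, "first": first, "last": last,
                "dir": dir_id, "file": file_id, "lines": (start, end),
                "children": []}
        for _ in range(n_children):         # a node's subtree follows it directly
            node["children"].append(read_node())
        return node

    return [read_node() for _ in range(num_trees)]
```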
Intro: LOG{rank}.txt is the Execution Trace. It records the precise, ordered sequence of instrumented code blocks that were entered and exited during a single run.
Format: Each line is an event.
Event Line Format: <Event_Type> <Node_ID> <Node_Type>
- Event_Type: A character flag: 'Y' for Entry, 'T' for Exit.
- Node_ID: The bridge to the static graph, corresponding to the ID in out.txt.
- Node_Type: The type of the node, for context.

Example:
Y 0 -4 # Enter main (ID 0)
Y 1 -1 # Enter loop (ID 1)
Y 2 42 # Enter MPI call (ID 2)
T 2 42 # Exit MPI call
T 1 -1 # Exit loop
T 0 -4 # Exit main
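Because entry ('Y') and exit ('T') events are strictly nested, the trace can be replayed with a stack. A small sketch that counts how often each static node was entered (the concrete file name for rank 0 is assumed; the trailing comments in the example above are illustrative and are simply ignored here):

```python
from collections import Counter

def replay_log(path="LOG0.txt"):
    """Replay a LOG{rank}.txt trace and count entries per PSG node."""
    entries = Counter()
    stack = []
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) < 3:
                continue
            event, node_id = parts[0], int(parts[1])
            if event == "Y":                         # entry event
                stack.append(node_id)
                entries[node_id] += 1
            elif event == "T":                       # exit event closes the open node
                assert stack and stack.pop() == node_id, "unbalanced trace"
    return entries
```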
- SAMPLE.txt Format: A list of hexadecimal instruction addresses representing the call stack at the time of a PAPI interrupt. Samples are separated by blank lines.
- SAMPLE.txt-symb Format: The same structure, but with addresses translated to filename:line_number by addr2line.
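The symbolication step that parse.sh drives can be reproduced in a few lines of Python. This sketch assumes addr2line from binutils is on the PATH, that the instrumented executable was built with debug info, and that the whole sample file fits comfortably into one addr2line invocation:

```python
import subprocess

def symbolicate(sample_path, executable, out_path):
    """Translate the hex addresses in a SAMPLE{rank}.txt file to file:line form."""
    with open(sample_path) as f:
        raw = [line.strip() for line in f]
    addrs = [line for line in raw if line]            # blank lines separate samples
    result = subprocess.run(["addr2line", "-e", executable] + addrs,
                            capture_output=True, text=True, check=True)
    resolved = iter(result.stdout.splitlines())
    with open(out_path, "w") as out:
        for line in raw:                              # keep the blank separators
            out.write((next(resolved) if line else "") + "\n")
```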
Intro: stat{rank}.txt is the Program Performance Graph (PPG). It has the same hierarchical structure as the PSG but is produced post-mortem and annotated with runtime data. It is a forest of trees, where each tree represents a function.
Format: Nearly identical to out.txt, but with additional columns for runtime performance metrics.
Node Line Format: <ID> <Type> <Child_Count> ... <Sample_Count> <Time_Share>
Example Visualized:
Imagine the program structure from the out.txt example, but now annotated with performance data. This makes bottlenecks instantly visible.
- FUNCTION (ID 0, main.c:10-50) [Samples: 0, Time: 0.0%]
|
+--- LOOP (ID 1, main.c:15-40) [Samples: 5, Time: 5.0%]
|
+--- MPI_CALL (ID 2, main.c:20) [Samples: 95, Time: 95.0%] <-- Performance Hotspot
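The scalability analysis described earlier boils down to fitting each node’s cost against the process count and flagging nodes whose share grows. A minimal sketch of that idea; the input layout, the log-linear model, and the threshold are illustrative assumptions, not the exact method used in scalebottleneck_all_node_fit.ipynb:

```python
import numpy as np

def find_scaling_bottlenecks(ppgs, threshold=0.01):
    """ppgs: {num_procs: {node_id: time_share}} parsed from stat.txt files
    collected at different scales. Returns nodes whose time share grows with scale."""
    scales = sorted(ppgs)
    x = np.log2(np.array(scales, dtype=float))
    suspects = {}
    for node in set().union(*ppgs.values()):
        y = np.array([ppgs[p].get(node, 0.0) for p in scales])
        slope, _ = np.polyfit(x, y, 1)        # linear fit of time share vs log2(procs)
        if slope > threshold:                 # growing share -> poor scaling behavior
            suspects[node] = slope
    return dict(sorted(suspects.items(), key=lambda kv: -kv[1]))
```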
The table below summarizes the entire pipeline, connecting each phase to its core components and data artifacts.
| Paper Phase | Workflow Step | Key File/Code | Input (Source) | Output (Destination) |
|---|---|---|---|---|
| PSG Construction (III-A) | Static Analysis (Compile-time) | IRStruct.cpp (irs.so) | *.bc (Source), in.txt (Disk) | out.txt (Disk), i{...} (Disk) |
| Runtime Data (III-B) | Dynamic Analysis (Runtime) | sampler.cpp (libsampler.so) | i{...} (from Phase 1) | LOG.txt & SAMPLE.txt (Disk) |
| PPG Construction (III-C) | Post-mortem Processing | parse.sh, log2stat.cpp | SAMPLE.txt, LOG.txt (from Phase 2), out.txt (from Phase 1) | stat.txt (Disk) |
| Scalability Analysis (IV) | Bottleneck Detection & Backtracing | Python Scripts, Java Viewer | stat.txt (from Phase 3) | Visualizations, Reports (Screen/Disk) |