A detailed analysis of the Scalana codebase, its workflow, components, and data formats.
A (not-so) deep dive into the Scalana tool, its workflow, core components, and data formats, based on the SC20 paper and its accompanying source code.
SCALANA is a performance analysis tool designed to automatically detect scalability bottlenecks in large-scale parallel programs. Traditional performance tools often force a choice between low-overhead profiling, which lacks the detail for root-cause analysis, and high-detail tracing, which can incur prohibitive overhead.
SCALANA’s core innovation is a hybrid approach that combines static analysis at compile-time with lightweight, sampling-based profiling at runtime. This allows it to construct a detailed Program Performance Graph (PPG) that captures both the program’s structure and its performance characteristics, enabling deep analysis at a fraction of the cost of full tracing. This post breaks down the tool’s workflow, key concepts, and data formats to provide a comprehensive understanding of how it works under the hood.
To compile and run the SCALANA artifact, the following dependencies are required; the specific versions used are described in the paper.
- g++: The C++ compiler.
- make: The build automation tool used to execute instructions in the Makefile.
- binutils: Provides the addr2line utility, required for post-mortem symbol resolution.
- llvm-config: Used by the Makefile to get the correct compiler and linker flags for building the LLVM pass.
- opt: The LLVM optimizer, used to load and run the analysis pass (irs.so).
- llc: The LLVM static compiler, used to compile LLVM bitcode into object files.
- mpirun: The launcher for parallel programs.
- mpicxx / mpif90: MPI compiler wrappers used for linking the final instrumented executable.
- python: Runs the main driver script (run.py) and the analysis scripts in the python/ directory. Requires standard data science libraries like pandas, numpy, and matplotlib.
- java: Runs the Scalana-viewer GUI for visual analysis.

The tool's end-to-end process can be understood as a three-act play, moving from static code analysis to dynamic data collection and finally to post-mortem fusion and analysis.
Act I corresponds to Section III-A of the paper and involves analyzing the code’s structure and instrumenting it.

- Trigger: The opt_cmd command in run.py.
- Core code: IRStruct.cpp (compiled into irs.so), which runs as an LLVM Pass.
- Inputs:
  - n{...}.bc: The application’s original LLVM bitcode.
  - irs.so: The compiled LLVM Pass.
  - in.txt: A configuration file listing MPI functions and their assigned IDs.
- Process: The opt tool loads the irs.so pass, which walks the IR to build the PSG and inserts calls to entryPoint and exitPoint (from sampler.cpp) at the boundaries of functions, loops, and other key nodes in the PSG (a sketch of the equivalent command sequence follows this list).
- Outputs:
  - out.txt: The serialized PSG, which serves as a static “map” of the program.
  - i{...}.bc: The new, instrumented bitcode containing the added tracing calls.
  - i{...}: The final instrumented executable, ready for the runtime phase.
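To make the pipeline concrete, here is a minimal Python sketch of the kind of command sequence the opt_cmd stage issues. The pass registration name (-irs here) and the file names are placeholders rather than the actual values from run.py, and linking directly against the sampling library is an assumption.

```python
import subprocess

def instrument(bitcode="n1.bc", inst_bc="i1.bc", exe="i1"):
    # 1. Run the IRStruct pass over the original bitcode: this emits the PSG
    #    (out.txt) and the instrumented bitcode. "-irs" is a placeholder pass name.
    subprocess.run(["opt", "-load", "./irs.so", "-irs", bitcode, "-o", inst_bc],
                   check=True)
    # 2. Lower the instrumented bitcode to a native object file.
    subprocess.run(["llc", "-filetype=obj", inst_bc, "-o", exe + ".o"], check=True)
    # 3. Link with the MPI wrapper (assumed here to also pull in the sampling
    #    runtime) so the executable can run under mpirun.
    subprocess.run(["mpicxx", exe + ".o", "-L.", "-lsampler", "-o", exe], check=True)
    return exe
```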
Act II corresponds to Section III-B of the paper, where the instrumented program is run to collect performance data.
- Trigger: The sub_cmd command in run.py.
- Core code: sampler.cpp (compiled into libsampler.so and loaded by the program at runtime).
- Inputs:
  - i{...}: The instrumented executable from Act I.
  - libsampler.so: The runtime sampling and tracing library.
- Process: Two streams of data are collected while the program runs. Tracing: every instrumented region calls entryPoint or exitPoint from the runtime library, and each event is appended to a trace buffer (call_log). Sampling: whenever a PAPI interrupt fires, papi_handler is invoked, which performs a stack backtrace and records the instruction addresses in another buffer (address_log).
- Outputs:
  - LOG{rank}.txt: Contains the precise, ordered execution trace for each process.
  - SAMPLE{rank}.txt: Contains the raw sampled call stack addresses for each process.

Act III corresponds to Sections III-C and IV of the paper and brings all the data together for the final analysis.
Step 3a: Symbolication.

- Trigger: The parse_cmd command in run.py, which executes parse.sh.
- Core code: The addr2line utility.
- Inputs:
  - SAMPLE{rank}.txt: Raw address log from Act II.
  - i{...}: The executable with debug symbols.
- Process: Pipes the addresses from SAMPLE.txt into addr2line to resolve them to filename:line_number format.
- Output: SAMPLE{rank}.txt-symb: The symbolicated sample log.

Step 3b: Graph fusion.

- Trigger: The anly_cmd command in run.py, which executes the analyze program.
- Core code: log2stat.cpp (compiled into the analyze executable).
- Inputs:
  - out.txt: The static PSG from Act I.
  - LOG{rank}.txt: The execution trace from Act II.
  - SAMPLE{rank}.txt-symb: The symbolicated log from Step 3a.
- Process: analyze first reads out.txt to reconstruct the static PSG in memory, then replays LOG.txt to resolve dynamic behaviors like indirect calls and refine the in-memory graph, and finally processes the SAMPLE.txt-symb files. For each sample, it traverses the in-memory graph to find the node corresponding to the sample’s location and increments its performance counters.
- Output: stat{rank}.txt: The final Program Performance Graph (PPG), one for each process.

Step 4: Scalability analysis.

- Performed by: The python scripts or the java GUI.
- Input: stat.txt files, typically from multiple runs with varying process counts.
- Process: scalebottleneck_all_node_fit.ipynb loads the PPGs, performs regression analysis on the performance data of each node across different scales, and identifies nodes with poor scaling behavior. Scalana-viewer loads the PPGs and provides an interactive interface to explore the call paths of detected root causes and view the corresponding source code.

The “magic” behind Scalana’s static analysis is the LLVM Pass implemented in IRStruct.cpp. After a compiler front-end (like Clang) translates C++ or Fortran into the language-agnostic LLVM Intermediate Representation (IR), this pass operates directly on the IR. It performs two critical tasks:
1. Structure extraction: It identifies the program’s functions, loops, and call sites and serializes this hierarchy, the PSG, to out.txt.
2. Instrumentation: It injects calls to the runtime library functions (e.g., entryPoint) at the entry and exit points of these structures.

This powerful mechanism allows Scalana to access the program’s structure directly from the compiler’s perspective and reliably inject the necessary hooks for runtime tracing.
The entire workflow masterfully decouples static and dynamic analysis. The process can be understood with an analogy:

- The static pass acts as a surveyor: it assigns a unique ID to every “building” (function, loop, etc.) and records this information on a master map (out.txt).
- The runtime sampler acts as a GPS tracker: it periodically records the coordinates of where the program currently is, as raw call stack addresses (SAMPLE.txt).
- Post-mortem, the analyze program takes the GPS coordinates (SAMPLE.txt-symb), looks up the corresponding building on the master map (out.txt), and makes a mark on it. The fully marked-up map is the PPG (stat.txt).
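A minimal Python sketch of that attribution step, assuming the PSG has already been parsed into node objects carrying a file name and a line range (the field and function names here are illustrative, not the ones used in log2stat.cpp):

```python
from collections import defaultdict

def attribute_samples(nodes, symbolicated_samples):
    """nodes: iterable of PSG nodes with .id, .file, .start_line, .end_line.
    symbolicated_samples: list of call stacks, each a list of 'file:line' frames."""
    counts = defaultdict(int)
    for stack in symbolicated_samples:
        fname, _, line = stack[0].rpartition(":")   # innermost frame of the sample
        line = int(line)
        # Candidate nodes whose source range covers the sampled location.
        hits = [n for n in nodes
                if n.file == fname and n.start_line <= line <= n.end_line]
        if hits:
            # Pick the deepest (narrowest) covering node and "make a mark" on it.
            node = min(hits, key=lambda n: n.end_line - n.start_line)
            counts[node.id] += 1
    return counts
```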
A deep analysis reveals a significant difference between the paper’s design and the provided codebase regarding communication analysis.
Design in the Paper (Section III-B2): The paper explicitly states the use of the PMPI (Profiling MPI) interface to intercept MPI calls at runtime. This is crucial for capturing detailed communication dependencies, such as the source/destination rank and message tags, which are needed to draw the inter-process dependency edges in the PPG.
Implementation in the Codebase: The code in sampler.cpp and IRStruct.cpp does not implement PMPI. Instead, it treats MPI calls like any other function:

- The static pass matches the MPI_ function names from in.txt and creates nodes for them in the PSG.
- At runtime, these calls simply go through the same generic entryPoint/exitPoint hooks.

Conclusion: The current codebase can identify where MPI calls occur and measure their duration, but it lacks the mechanism to resolve inter-process dependencies (i.e., who talks to whom). The generated PPG contains detailed intra-process control flow and performance data but does not contain the explicit communication edges between processes described in the paper.
Understanding the intermediate files is key to understanding the workflow.
Intro: out.txt is the Program Structure Graph (PSG). It is a hierarchical representation of the program’s structure, generated at compile-time. It contains no runtime data. It is a forest of trees, where each tree represents a function.
Format: The file begins with the number of trees, followed by a depth-first serialization of each node. It ends with lookup tables for directory and file names.
Node Line Format: <ID> <Type> <1st> <Last> <Child_Count> <Dir_ID> <File_ID> <Start_Line> <End_Line>
- ID: A globally unique, zero-based identifier for the node.
- Type: A numeric code for the node’s type, corresponding to the NodeType enum in IRStruct.h. Negative values denote structural nodes (e.g., -1=LOOP, -4=FUNCTION, -5=CALL); positive values are the MPI call IDs assigned in in.txt.
- 1st & Last: Relate to graph contraction. For a normal node, these are identical to Type. For a merged COMBINE (-8) node, they indicate the types of the original first and last nodes in the sequence.
- Child_Count: The number of direct children this node has in the tree.
- Dir_ID & File_ID: Zero-based indices into the directory and file lookup tables at the end of the file.
- Start_Line & End_Line: The source code line range for the node.

Example Visualized:
# out.txt content
1
0 -4 -4 -4 1 0 0 10 50
1 -1 -1 -1 1 0 0 15 40
2 42 42 42 0 0 0 20 20
...
This translates to the following structure:
- FUNCTION (ID 0, main.c:10-50)
|
+--- LOOP (ID 1, main.c:15-40)
|
+--- MPI_CALL (ID 2, main.c:20)
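Because each node line carries its Child_Count and a node’s children follow it immediately, the forest can be rebuilt with a small recursive reader. A minimal sketch (field handling follows the format above; the directory/file lookup tables at the end of the file are ignored, and error handling is omitted):

```python
def parse_psg(path="out.txt"):
    """Rebuild the PSG forest from its depth-first serialization."""
    with open(path) as f:
        rows = [line.split() for line in f if line.strip()]
    it = iter(rows)
    num_trees = int(next(it)[0])            # first line: number of trees

    def read_node():
        (node_id, ntype, first, last, n_children,
         dir_id, file_id, start, end) = map(int, next(it))
        node = {"id": node_id, "type": ntype, "first": first, "last": last,
                "dir": dir_id, "file": file_id, "lines": (start, end),
                "children": []}
        for _ in range(n_children):         # a node's subtree follows it directly
            node["children"].append(read_node())
        return node

    return [read_node() for _ in range(num_trees)]
```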
Intro: LOG{rank}.txt is the Execution Trace. It records the precise, ordered sequence of instrumented code blocks that were entered and exited during a single run.
Format: Each line is an event.
Event Line Format: <Event_Type> <Node_ID> <Node_Type>
- Event_Type: A character flag: 'Y' for Entry, 'T' for Exit.
- Node_ID: The bridge to the static graph, corresponding to the ID in out.txt.
- Node_Type: The type of the node, for context.

Example:
Y 0 -4 # Enter main (ID 0)
Y 1 -1 # Enter loop (ID 1)
Y 2 42 # Enter MPI call (ID 2)
T 2 42 # Exit MPI call
T 1 -1 # Exit loop
T 0 -4 # Exit main
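Because entry ('Y') and exit ('T') events are strictly nested, the trace can be replayed with a stack. A small sketch that counts how often each static node was entered (the concrete file name for rank 0 is assumed; the trailing comments in the example above are illustrative and are simply ignored here):

```python
from collections import Counter

def replay_log(path="LOG0.txt"):
    """Replay a LOG{rank}.txt trace and count entries per PSG node."""
    entries = Counter()
    stack = []
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) < 3:
                continue
            event, node_id = parts[0], int(parts[1])
            if event == "Y":                         # entry event
                stack.append(node_id)
                entries[node_id] += 1
            elif event == "T":                       # exit event closes the open node
                assert stack and stack.pop() == node_id, "unbalanced trace"
    return entries
```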
- SAMPLE.txt Format: A list of hexadecimal instruction addresses representing the call stack at the time of a PAPI interrupt. Samples are separated by blank lines.
- SAMPLE.txt-symb Format: The same structure, but with addresses translated to filename:line_number by addr2line.
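The symbolication step that parse.sh drives can be reproduced in a few lines of Python. This sketch assumes addr2line from binutils is on the PATH, that the instrumented executable was built with debug info, and that the whole sample file fits comfortably into one addr2line invocation:

```python
import subprocess

def symbolicate(sample_path, executable, out_path):
    """Translate the hex addresses in a SAMPLE{rank}.txt file to file:line form."""
    with open(sample_path) as f:
        raw = [line.strip() for line in f]
    addrs = [line for line in raw if line]            # blank lines separate samples
    result = subprocess.run(["addr2line", "-e", executable] + addrs,
                            capture_output=True, text=True, check=True)
    resolved = iter(result.stdout.splitlines())
    with open(out_path, "w") as out:
        for line in raw:                              # keep the blank separators
            out.write((next(resolved) if line else "") + "\n")
```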
Intro: stat{rank}.txt is the Program Performance Graph (PPG). It has the same hierarchical structure as the PSG but is produced post-mortem and annotated with runtime data. It is a forest of trees, where each tree represents a function.
Format: Nearly identical to out.txt, but with additional columns for runtime performance metrics.
Node Line Format: <ID> <Type> <Child_Count> ... <Sample_Count> <Time_Share>
Example Visualized:
Imagine the program structure from the out.txt example, but now annotated with performance data. This makes bottlenecks instantly visible.
- FUNCTION (ID 0, main.c:10-50) [Samples: 0, Time: 0.0%]
|
+--- LOOP (ID 1, main.c:15-40) [Samples: 5, Time: 5.0%]
|
+--- MPI_CALL (ID 2, main.c:20) [Samples: 95, Time: 95.0%] <-- Performance Hotspot
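The scalability analysis described earlier boils down to fitting each node’s cost against the process count and flagging nodes whose share grows. A minimal sketch of that idea; the input layout, the log-linear model, and the threshold are illustrative assumptions, not the exact method used in scalebottleneck_all_node_fit.ipynb:

```python
import numpy as np

def find_scaling_bottlenecks(ppgs, threshold=0.01):
    """ppgs: {num_procs: {node_id: time_share}} parsed from stat.txt files
    collected at different scales. Returns nodes whose time share grows with scale."""
    scales = sorted(ppgs)
    x = np.log2(np.array(scales, dtype=float))
    suspects = {}
    for node in set().union(*ppgs.values()):
        y = np.array([ppgs[p].get(node, 0.0) for p in scales])
        slope, _ = np.polyfit(x, y, 1)        # linear fit of time share vs log2(procs)
        if slope > threshold:                 # growing share -> poor scaling behavior
            suspects[node] = slope
    return dict(sorted(suspects.items(), key=lambda kv: -kv[1]))
```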
The table below summarizes the entire pipeline, connecting each phase to its core components and data artifacts.
| Paper Phase | Workflow Step | Key File/Code | Input (Source) | Output (Destination) |
|---|---|---|---|---|
| PSG Construction (III-A) | Static Analysis (Compile-time) | IRStruct.cpp (irs.so) | *.bc (Source), in.txt (Disk) | out.txt (Disk), i{...} (Disk) |
| Runtime Data (III-B) | Dynamic Analysis (Runtime) | sampler.cpp (libsampler.so) | i{...} (from Phase 1) | LOG.txt & SAMPLE.txt (Disk) |
| PPG Construction (III-C) | Post-mortem Processing | parse.sh, log2stat.cpp | SAMPLE.txt, LOG.txt (from Phase 2), out.txt (from Phase 1) | stat.txt (Disk) |
| Scalability Analysis (IV) | Bottleneck Detection & Backtracing | Python Scripts, Java Viewer | stat.txt (from Phase 3) | Visualizations, Reports (Screen/Disk) |