Personal watchlist of SC25 sessions and workshops, grouped by their official tracks with quick notes on why each paper matters.
Workshop
Official session: SC25 Workshops · Mon, Nov 17 · various times
Towards Efficient Load Balancing BFS on GPUs: One Code for AMD, Intel & Nvidia: Presents a SYCL-based multi-GPU BFS implementation with work-group localization, even workload distribution, and strided heuristics; a compelling example of portable performance across MI300X, Max 1550, and A100 nodes. [WACCPD · Mon, Nov 17 · 9:00 AM–5:30 PM CST · Room 266]
Roofline Analysis of Tightly-Coupled CPU-GPU Superchips: A Study on MI300A and GH200: Extends Roofline modeling to capture contention in concurrent CPU+GPU execution on MI300A and Grace Hopper, highlighting how different allocators and execution modes interact under unified memory and shared power budgets. [P3HPC · Mon, Nov 17 · 9:00 AM–5:30 PM CST · Room 231]
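For context, the classic single-roof model the paper extends fits in a few lines; the peak compute and bandwidth numbers below are placeholders rather than MI300A/GH200 specs, and none of the CPU+GPU contention effects the study actually measures are modeled here.

```python
# Minimal classic roofline model: attainable performance is capped either by
# peak compute or by memory bandwidth times arithmetic intensity.
# Hardware numbers are illustrative placeholders, not vendor specs.

PEAK_FLOPS = 60e12   # peak FP64 throughput (FLOP/s), placeholder
PEAK_BW = 3.0e12     # peak memory bandwidth (bytes/s), placeholder

def attainable_flops(arithmetic_intensity):
    """Classic roofline: min(compute roof, bandwidth roof * AI)."""
    return min(PEAK_FLOPS, PEAK_BW * arithmetic_intensity)

for ai in (0.25, 1.0, 4.0, 16.0, 64.0):   # FLOP per byte moved
    bound = "memory" if PEAK_BW * ai < PEAK_FLOPS else "compute"
    print(f"AI={ai:6.2f} FLOP/B -> {attainable_flops(ai) / 1e12:6.1f} TFLOP/s ({bound}-bound)")
```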
PEAK: Cost-Adaptive Profiling in a Heartbeat: Introduces a DBI-based profiler that caps instrumentation overhead either statically or through a heartbeat feedback loop, so we get predictable profiling costs even on long, dynamic HPC jobs. [ProTools · Mon, Nov 17 · 2:00–5:30 PM CST · Room 241]
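The heartbeat mechanism is easy to picture as a feedback loop; the toy simulation below (invented per-event costs, not PEAK's actual DBI machinery) just shows how a sampling rate can be steered toward a fixed overhead budget.

```python
import random

# Toy heartbeat-style overhead capping (not PEAK's algorithm): each "heartbeat",
# compare measured instrumentation time against a budget and scale the sampling
# rate so the overhead settles near the target.

OVERHEAD_BUDGET = 0.05   # aim for ~5% instrumentation overhead
sample_rate = 1.0        # start by instrumenting every event

def run_heartbeat_interval(rate, n_events=10_000):
    """Simulate one interval; return (app_time, instrumentation_time) in seconds."""
    app_time = 0.010                                      # pretend application work
    instrumented = sum(random.random() < rate for _ in range(n_events))
    return app_time, instrumented * 2e-7                  # pretend cost per instrumented event

for beat in range(10):
    app_t, instr_t = run_heartbeat_interval(sample_rate)
    overhead = instr_t / (app_t + instr_t)
    # Proportional correction toward the budget, clamped to [0.001, 1.0].
    sample_rate = max(1e-3, min(1.0, sample_rate * OVERHEAD_BUDGET / max(overhead, 1e-9)))
    print(f"beat {beat}: overhead={overhead:.1%}, next sample_rate={sample_rate:.3f}")
```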
Extending THAPI with CXI Hardware Counter Sampling for High-Resolution NIC Telemetry: Adds Cassini CXI counter sampling so Perfetto timelines can correlate HPC traces with congestion/retry telemetry, giving actionable network diagnostics at negligible overhead. [ProTools · Mon, Nov 17 · 2:00–5:30 PM CST · Room 241]
Scalable, High-Fidelity Monitoring of Application Communication Patterns in Vernier: Introduces histogram-based instrumentation in the Vernier system to capture annotated communication patterns at scale, yielding specific networking insights without overwhelming trace volume. [ProTools · Mon, Nov 17 · 2:00–5:30 PM CST · Room 241]
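Mostly want to see how much fidelity survives the histogram step; as a mental model, binning along the lines of the sketch below (log2 size buckets keyed by annotated region and rank pair, which is my guess at the shape of the data, not Vernier's real format) keeps the volume bounded no matter how long the run is.

```python
import math
from collections import defaultdict

# Toy histogram-based communication capture: instead of logging every message,
# bucket sizes into log2-spaced bins keyed by (region annotation, src, dst).

histograms = defaultdict(lambda: defaultdict(int))

def record_message(region, src, dst, nbytes):
    size_bin = int(math.log2(max(nbytes, 1)))   # log2-spaced size bins
    histograms[(region, src, dst)][size_bin] += 1

# Simulated traffic from an annotated halo-exchange region plus a reduction.
for step in range(100):
    record_message("halo_exchange", src=0, dst=1, nbytes=64 * 1024)
    record_message("halo_exchange", src=1, dst=0, nbytes=64 * 1024)
    record_message("reduce", src=1, dst=0, nbytes=8)

for key, bins in sorted(histograms.items()):
    summary = ", ".join(f"2^{b} B x{count}" for b, count in sorted(bins.items()))
    print(key, "->", summary)
```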
Auto-Tuning
Official session: Auto-Tuning, Compilation, and Code Generation · Tue, Nov 18 · 10:30 AM–12:00 PM CST
PerfDojo: Frames kernel optimization as an RL “game” over a human-readable IR so LLM+RL agents can auto-generate ML library kernels that stay portable across CPU, GPU, and accelerator targets.
Constraint-Driven Auto-Tuning of GEMM-like Operators for MT-3000: DynaChain splits MT-3000 kernels into compute+data sub-problems, prunes the search via dependency constraints, and uses ILP-backed tiling plus custom micro-kernels to match expert-tuned code.
Benchmarks
Official session: Performance: Benchmarks and Optimization · Tue, Nov 18 · 10:30 AM–12:00 PM CST
Zero-Value Code Specialization: ZeroSpec pairs a static control/data-flow analysis with runtime profiling to detect zero-propagation hot spots, then emits specialized fast paths, delivering up to 1.31× speedups on SPEC/NPB codes.
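The pass works at the compiler level, but the payoff is the same as a hand-written zero fast path; a minimal hand-rolled analogue, assuming the zero check is cheaper than the work it skips:

```python
import numpy as np

# Hand-written zero-value specialization (ZeroSpec derives this automatically):
# when an operand is zero, the update is a no-op and can be skipped outright.

def saxpy_specialized(a, x, y):
    """y + a * x, with a fast path when a == 0 or x is entirely zero."""
    if a == 0.0 or not x.any():   # specialized zero fast path
        return y                  # nothing to add
    return y + a * x              # generic path

rng = np.random.default_rng(0)
x_dense = rng.standard_normal(1 << 20)
x_zero = np.zeros(1 << 20)
y = rng.standard_normal(1 << 20)

y = saxpy_specialized(2.0, x_dense, y)   # generic path
y = saxpy_specialized(2.0, x_zero, y)    # fast path: skips the multiply-add
```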
Analysis Tools
Official session: Performance: Analysis Tools · Tue, Nov 18 · 1:30–3:00 PM CST
C.A.T.S.: Introduces a whole-program tracing format that records control-flow stacks plus memory events, feeding interactive visualizations that surfaced layout fixes and 3× footprint reductions in the case studies.
RedSan: Uses binary instrumentation to pinpoint redundant memory instructions in fully optimized CUDA kernels, netting up to 6.27× speedups and 3× fewer memory ops.
TraceFlow: Redistributes trace events by interaction pattern so analysis runs nearly communication-free and 13.5× faster than current replay pipelines.
State of Practice
Official session: State of the Practice · Tue, Nov 18 · 1:30–3:00 PM CST
ChatHPC: Builds specialized Code Llama–based assistants for each layer of the HPC stack, using RAG + expert-supervised fine-tuning to hit ~90% higher trustworthiness than GPT-4o while running on just a pair of H100s.
Cloud Utilization
Official session: System Software and Cloud Computing: Resource Utilization · Tue, Nov 18 · 1:30–3:00 PM CST
HELM: Derives interpretable locality metrics straight from Unified Memory driver telemetry so we can pick migration/placement policies under oversubscription, yielding 3.5× better performance than the UM defaults.
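The appeal is that the metric is interpretable; the score below is an invented stand-in built from hypothetical per-region access and migration counters, not HELM's actual formulation, but it captures the flavor of the decision.

```python
# Toy locality score from UM-style driver counters (invented, not HELM's metric):
# regions whose pages serve many accesses per migration are worth keeping
# device-resident under oversubscription; thrashing regions are not.

regions = {
    # region: (gpu_accesses, page_migrations), hypothetical telemetry
    "grid_arrays":   (5_000_000,        300),
    "halo_buffers":  (   80_000,     35_000),
    "lookup_table":  (  900_000,         20),
}

def accesses_per_migration(accesses, migrations):
    """Crude reuse proxy: how much work each migrated page pays for."""
    return accesses / max(migrations, 1)

for name, (acc, mig) in regions.items():
    score = accesses_per_migration(acc, mig)
    policy = "keep device-resident" if score > 1_000 else "pin on host / rely on access counters"
    print(f"{name:12s} score={score:9.1f} -> {policy}")
```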
Sparse Computation
Official session: Performance: Sparse Matrix and Tensor Computation · Tue, Nov 18 · 3:30–5:00 PM CST
FaSTCC: Provides a hashing-based sparse tensor contraction engine that models every loop permutation, then applies a 2D tiled contraction-index scheme with probabilistic tile selection to beat prior CPU implementations.
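The hash-accumulation core is simple to sketch, even though the paper's contribution is the tiling and probabilistic tile selection layered on top; a bare-bones COO-dictionary contraction for reference:

```python
from collections import defaultdict

# Minimal hash-accumulation sparse contraction: C[i, k] = sum_j A[i, j] * B[j, k],
# with A and B stored as COO dictionaries. Conceptual only; FaSTCC's tiled
# contraction-index scheme is far more involved.

A = {(0, 1): 2.0, (0, 3): 1.5, (2, 1): -1.0}   # {(i, j): value}
B = {(1, 0): 4.0, (3, 0): 2.0, (1, 2): 0.5}    # {(j, k): value}

# Hash B's nonzeros by their contraction index j for O(1) lookup per A nonzero.
B_by_j = defaultdict(list)
for (j, k), v in B.items():
    B_by_j[j].append((k, v))

C = defaultdict(float)
for (i, j), a in A.items():
    for k, b in B_by_j.get(j, ()):
        C[(i, k)] += a * b

print(dict(C))   # {(0, 0): 11.0, (0, 2): 1.0, (2, 0): -4.0, (2, 2): -0.5}
```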
Bridging the Gap between Unstructured SpMM and Structured Sparse Tensor Cores: MP-SpMM reformulates arbitrary sparsity via maximum matching + padding so 2:4 SpTCs stay fed, complete with a custom storage format and GPU kernel.
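Worth keeping in mind what the 2:4 constraint actually is; the snippet below only demonstrates the pattern via magnitude pruning, whereas the paper's point is remapping arbitrary sparsity onto it (matching plus padding) without discarding nonzeros.

```python
import numpy as np

# 2:4 structured sparsity: in every group of 4 consecutive values along a row,
# at most 2 may be nonzero. Here the pattern is produced by magnitude pruning,
# which is NOT what MP-SpMM does; it rearranges existing sparsity to fit.

def prune_2_to_4(matrix):
    m = matrix.copy()
    rows, cols = m.shape
    assert cols % 4 == 0, "columns must be a multiple of 4"
    groups = m.reshape(rows, cols // 4, 4)
    # Zero the two smallest-magnitude entries in each group of four.
    drop = np.argsort(np.abs(groups), axis=-1)[..., :2]
    np.put_along_axis(groups, drop, 0.0, axis=-1)
    return groups.reshape(rows, cols)

A = np.random.default_rng(1).standard_normal((4, 8))
A_24 = prune_2_to_4(A)
assert (np.count_nonzero(A_24.reshape(4, 2, 4), axis=-1) <= 2).all()
print(A_24)
```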
Precision & Reals
Official session: Precision and Real Number Representations · Tue, Nov 18 · 3:30–5:00 PM CST
RAPTOR: Offers an LLVM-based numerical profiler that auto-rewrites kernels into alternate precisions (down to FP16/custom formats) and reports stability impacts, as demonstrated on Flash-X multiphysics apps.
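RAPTOR does this automatically at the LLVM level; the manual NumPy version of the same question (is a lower precision safe for this kernel?) looks roughly like the sketch below, with a made-up toy kernel standing in for real physics.

```python
import numpy as np

# Rerun a kernel in lower precision and compare against an FP64 reference to
# judge whether the downcast is numerically acceptable. Toy kernel, not Flash-X.

def kernel(x):
    """Cumulative sums are deliberately precision-sensitive."""
    return np.cumsum(x * 1e-3) + np.sin(x)

x64 = np.random.default_rng(2).standard_normal(100_000)
ref = kernel(x64)

for dtype in (np.float32, np.float16):
    approx = kernel(x64.astype(dtype)).astype(np.float64)
    rel_err = np.linalg.norm(approx - ref) / np.linalg.norm(ref)
    print(f"{np.dtype(dtype).name}: norm-wise relative error = {rel_err:.2e}")
```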
Energy & Power
Official session: Energy, Power, and Sustainability · Wed, Nov 19 · 10:30 AM–12:00 PM CST
Characterizing Performance, Power, and Energy of AMD CDNA3 GPU Family: Presents a third-party MI300X/MI325X study that measures compute scaling, memory latency/bandwidth, and energy tradeoffs, showing when power capping the MI325X actually wins on efficiency.
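The efficiency question comes down to energy-to-solution arithmetic; the numbers below are invented for illustration, not the paper's MI325X measurements.

```python
# Energy-to-solution = average power x runtime: a cap that costs a little
# runtime can still win on energy if it cuts power enough. Invented numbers.

runs = {
    # label: (avg power in W, runtime in s)
    "uncapped":       (1000.0, 100.0),
    "capped to 80%":  ( 800.0, 108.0),
    "capped to 60%":  ( 600.0, 135.0),
}

for label, (power_w, runtime_s) in runs.items():
    energy_kj = power_w * runtime_s / 1e3
    print(f"{label:13s}: {runtime_s:6.1f} s, {energy_kj:6.1f} kJ to solution")
```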
ML Methods
Official session: Machine Learning: Methods · Wed, Nov 19 · 10:30 AM–12:00 PM CST
TurboFNO: Debuts a fully fused FFT-GEMM-iFFT kernel that bakes in truncation/zero-padding and shared-memory swizzling so FNO layers avoid extra launches and run up to 150% faster than PyTorch+cuBLAS/cuFFT.
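Helpful to keep the unfused baseline in mind when reading the fusion claims; a NumPy reference of the FFT → truncate → per-mode channel-mix GEMM → zero-pad → inverse-FFT chain, with illustrative shapes:

```python
import numpy as np

# Unfused FNO spectral layer: the sequence TurboFNO collapses into one kernel.

def fno_spectral_layer(x, weights, keep_modes):
    """x: (batch, channels, n) real; weights: (keep_modes, c_in, c_out) complex."""
    x_hat = np.fft.rfft(x, axis=-1)              # FFT along the spatial dim
    x_trunc = x_hat[..., :keep_modes]            # truncate high frequencies
    # Mix channels independently per retained mode: the batched GEMM part.
    y_trunc = np.einsum("bci,ico->boi", x_trunc, weights)
    y_hat = np.zeros_like(x_hat)                 # zero-pad the dropped modes
    y_hat[..., :keep_modes] = y_trunc
    return np.fft.irfft(y_hat, n=x.shape[-1], axis=-1)

rng = np.random.default_rng(3)
x = rng.standard_normal((8, 16, 256))            # batch, channels, grid points
w = rng.standard_normal((32, 16, 16)) + 1j * rng.standard_normal((32, 16, 16))
print(fno_spectral_layer(x, w, keep_modes=32).shape)   # (8, 16, 256)
```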
Data & Storage
Official session: Data Analytics, Visualization & Storage · Wed, Nov 19 · 1:30–3:00 PM CST
STELLAR: Acts as an autonomous multi-agent LLM tuner that scrapes manuals, reasons over I/O traces, executes experiments, and converges to near-optimal PFS configs within five trials on unseen workloads.
gParaKV: Implements a GPU-parallel key-value design that accelerates LSM compaction and garbage collection via bitmap-based marking and merge-sort offload, beating RocksDB-family KV stores on write-heavy loads.
ML at Scale 1
Official session: Machine Learning: Training at Scale 1 · Wed, Nov 19 · 1:30–3:00 PM CST
HPC-R1: Delivers a full-stack characterization of training large reasoning models on Perlmutter, covering SFT, GRPO, generation, and distillation with 19 concrete observations that feed back into HPC-AI system design.
Resilience
Official session: Anomaly Detection, Failure Management, and Resilience 2 · Wed, Nov 19 · 3:30–5:00 PM CST
Story of Two GPUs: Analyzes 2.5 years of Delta 1 logs (11.7M GPU-hours) to show that H100s hit 3.2× more memory errors than A100s yet exhibit stronger core resilience, and that operators still need ~5% node overprovisioning to absorb failures.
ML at Scale 2
Official session: Machine Learning: Training at Scale 2 · Wed, Nov 19 · 3:30–5:00 PM CST
MLP-Offload: Introduces multi-level, multi-path offloading of optimizer states that exploits idle tiers and controls concurrency, cutting I/O stalls so 280B-parameter pre-training runs up to 2.5× faster than today’s runtimes.
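The multi-path part is essentially bandwidth aggregation; toy numbers below (not the paper's), assuming a flush split proportionally to each path's bandwidth.

```python
# Splitting an optimizer-state flush across offload paths instead of saturating
# one tier. Sizes and bandwidths are invented for illustration.

opt_state_gb = 120.0
paths_gb_s = {"nvme": 6.0, "host_dram": 20.0}   # hypothetical offload paths

single_path = opt_state_gb / paths_gb_s["nvme"]
multi_path = opt_state_gb / sum(paths_gb_s.values())   # proportional split
print(f"single-path flush: {single_path:.1f} s, multi-path flush: {multi_path:.1f} s")
```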
GEMM Optimization
Official session: Algorithms: Matrix Multiplication and GEMM Optimization · Thu, Nov 20 · 10:30 AM–12:00 PM CST
KAMI: Extends communication-avoiding 1D/2D/3D GEMM theory to a single GPU, keeping tensor cores busy on batched, low-rank, and sparse multiplies by staging data through registers and shared memory.
HyTiS: Combines two-level tile scheduling, latency-aware partial waves, and layout-aware L2 modeling to mitigate wave quantization on big GPUs, hitting up to 2× cuBLAS speedups on H100/A100.
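Wave quantization itself is visible with a little arithmetic; the SM count below is a placeholder for an H100-class part, and none of HyTiS's scheduling is modeled.

```python
import math

# Wave quantization: if the tile count lands just past a multiple of the number
# of SMs, the final wave runs nearly empty and utilization drops.

NUM_SMS = 132   # placeholder SM count

def wave_utilization(m, n, tile_m=128, tile_n=128):
    tiles = math.ceil(m / tile_m) * math.ceil(n / tile_n)
    waves = math.ceil(tiles / NUM_SMS)
    return tiles, waves, tiles / (waves * NUM_SMS)

for m, n in ((4096, 4096), (4352, 4096)):
    tiles, waves, util = wave_utilization(m, n)
    print(f"{m}x{n}: {tiles} tiles, {waves} waves, {util:.1%} of tile slots busy")
```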
LiquidGEMM: Introduces LiquidQuant and an implicit warp pipeline so W4A8 GEMMs dequantize with two instructions per four values and fully overlap load/dequant/MMA, boosting LLM serving throughput up to 4.9×.
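Not the two-instruction GPU trick, but a scalar NumPy sketch of what W4A8 dequantization has to do for each pair of packed nibbles:

```python
import numpy as np

# Two signed 4-bit weights are packed per byte; dequantization unpacks them,
# sign-extends to int8, and applies a per-group scale before the MMA.

def unpack_int4(packed):
    """packed: uint8 array where each byte holds two signed 4-bit weights."""
    lo = (packed & 0x0F).astype(np.int8)
    hi = (packed >> 4).astype(np.int8)
    # Sign-extend 4-bit two's-complement values into int8.
    lo = np.where(lo >= 8, lo - 16, lo)
    hi = np.where(hi >= 8, hi - 16, hi)
    return np.stack([lo, hi], axis=-1).reshape(*packed.shape[:-1], -1)

packed = np.array([[0x1F, 0x7A]], dtype=np.uint8)   # two bytes -> four weights
w_int4 = unpack_int4(packed)                        # [[-1  1 -6  7]]
scale = 0.05                                        # per-group dequant scale
print(w_int4, w_int4.astype(np.float32) * scale)
```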
Scheduling & Tiling
Official session: Scheduling, Tiling, and Parallelism · Thu, Nov 20 · 10:30 AM–12:00 PM CST
SIREN: Deploys process-level data taps plus fuzzy hashing to reliably identify repeated or unknown software on shared systems, giving operators observability that job-name heuristics can’t.
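As a loose mental model of fuzzy-hash matching (SIREN's data taps and hashing are its own), comparing shingle-hash sets of two byte streams already separates "slightly modified build of the same code" from "unrelated software":

```python
import hashlib

# Hash every overlapping k-byte window of a blob and compare set overlap.
# A crude analogue of fuzzy hashing, not SIREN's actual method.

def shingle_set(data: bytes, k: int = 8) -> set:
    return {hashlib.blake2b(data[i:i + k], digest_size=8).digest()
            for i in range(len(data) - k + 1)}

def similarity(a: bytes, b: bytes) -> float:
    sa, sb = shingle_set(a), shingle_set(b)
    return len(sa & sb) / max(len(sa | sb), 1)

original = bytes(range(256)) * 8                                   # pretend binary A
patched = original[:1000] + b"\x00\x01\x02" + original[1000:]      # slightly modified build
other = bytes(reversed(range(256))) * 8                            # unrelated binary

print(f"original vs patched: {similarity(original, patched):.2f}")  # close to 1.0
print(f"original vs other:   {similarity(original, other):.2f}")    # close to 0.0
```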
Compression
Official session: Compression and Data Reduction 1 · Thu, Nov 20 · 1:30–3:00 PM CST
What to Support When You’re Compressing: Surveys nine supercomputing domains, distills 24 takeaways on requirements vs. current compressors, and flags concrete research gaps for error-bounded lossy compression.