Personal watchlist of SC25 sessions and workshops, grouped by their official tracks with quick notes on why each paper matters.
Workshop
Official session: SC25 Workshops · Mon, Nov 17 · various times
Towards Efficient Load Balancing BFS on GPUs: One Code for AMD, Intel & Nvidia: Presents a SYCL-based multi-GPU BFS implementation with work-group localization, even workload distribution, and strided heuristics; a compelling example of portable performance across MI300X, Max 1550, and A100 nodes. [WACCPD · Mon, Nov 17 · 9:00 AM–5:30 PM CST · Room 266]
Roofline Analysis of Tightly-Coupled CPU-GPU Superchips: A Study on MI300A and GH200: Extends Roofline modeling to capture contention in concurrent CPU+GPU execution on MI300A and Grace Hopper, highlighting how different allocators and execution modes interact under unified memory and shared power budgets. [P3HPC · Mon, Nov 17 · 9:00 AM–5:30 PM CST · Room 231]
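For context, the classic single-roof model the paper extends fits in a few lines; the peak compute and bandwidth numbers below are placeholders rather than MI300A/GH200 specs, and none of the CPU+GPU contention effects the study actually measures are modeled here.

```python
# Minimal classic roofline model: attainable performance is capped either by
# peak compute or by memory bandwidth times arithmetic intensity.
# Hardware numbers are illustrative placeholders, not vendor specs.

PEAK_FLOPS = 60e12   # peak FP64 throughput (FLOP/s), placeholder
PEAK_BW = 3.0e12     # peak memory bandwidth (bytes/s), placeholder

def attainable_flops(arithmetic_intensity):
    """Classic roofline: min(compute roof, bandwidth roof * AI)."""
    return min(PEAK_FLOPS, PEAK_BW * arithmetic_intensity)

for ai in (0.25, 1.0, 4.0, 16.0, 64.0):   # FLOP per byte moved
    bound = "memory" if PEAK_BW * ai < PEAK_FLOPS else "compute"
    print(f"AI={ai:6.2f} FLOP/B -> {attainable_flops(ai) / 1e12:6.1f} TFLOP/s ({bound}-bound)")
```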
PEAK: Cost-Adaptive Profiling in a Heartbeat: Introduces a DBI-based profiler that caps instrumentation overhead either statically or through a heartbeat feedback loop, so we get predictable profiling costs even on long, dynamic HPC jobs. [ProTools · Mon, Nov 17 · 2:00–5:30 PM CST · Room 241]
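The heartbeat mechanism is easy to picture as a feedback loop; the toy simulation below (invented per-event costs, not PEAK's actual DBI machinery) just shows how a sampling rate can be steered toward a fixed overhead budget.

```python
import random

# Toy heartbeat-style overhead capping (not PEAK's algorithm): each "heartbeat",
# compare measured instrumentation time against a budget and scale the sampling
# rate so the overhead settles near the target.

OVERHEAD_BUDGET = 0.05   # aim for ~5% instrumentation overhead
sample_rate = 1.0        # start by instrumenting every event

def run_heartbeat_interval(rate, n_events=10_000):
    """Simulate one interval; return (app_time, instrumentation_time) in seconds."""
    app_time = 0.010                                      # pretend application work
    instrumented = sum(random.random() < rate for _ in range(n_events))
    return app_time, instrumented * 2e-7                  # pretend cost per instrumented event

for beat in range(10):
    app_t, instr_t = run_heartbeat_interval(sample_rate)
    overhead = instr_t / (app_t + instr_t)
    # Proportional correction toward the budget, clamped to [0.001, 1.0].
    sample_rate = max(1e-3, min(1.0, sample_rate * OVERHEAD_BUDGET / max(overhead, 1e-9)))
    print(f"beat {beat}: overhead={overhead:.1%}, next sample_rate={sample_rate:.3f}")
```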
Extending THAPI with CXI Hardware Counter Sampling for High-Resolution NIC Telemetry: Adds Cassini CXI counter sampling so Perfetto timelines can correlate HPC traces with congestion/retry telemetry, giving actionable network diagnostics at negligible overhead. [ProTools · Mon, Nov 17 · 2:00–5:30 PM CST · Room 241]
Scalable, High-Fidelity Monitoring of Application Communication Patterns in Vernier: Introduces histogram-based instrumentation in the Vernier system to capture annotated communication patterns at scale, yielding specific networking insights without overwhelming trace volume. [ProTools · Mon, Nov 17 · 2:00–5:30 PM CST · Room 241]
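Mostly want to see how much fidelity survives the histogram step; as a mental model, binning along the lines of the sketch below (log2 size buckets keyed by annotated region and rank pair, which is my guess at the shape of the data, not Vernier's real format) keeps the volume bounded no matter how long the run is.

```python
import math
from collections import defaultdict

# Toy histogram-based communication capture: instead of logging every message,
# bucket sizes into log2-spaced bins keyed by (region annotation, src, dst).

histograms = defaultdict(lambda: defaultdict(int))

def record_message(region, src, dst, nbytes):
    size_bin = int(math.log2(max(nbytes, 1)))   # log2-spaced size bins
    histograms[(region, src, dst)][size_bin] += 1

# Simulated traffic from an annotated halo-exchange region plus a reduction.
for step in range(100):
    record_message("halo_exchange", src=0, dst=1, nbytes=64 * 1024)
    record_message("halo_exchange", src=1, dst=0, nbytes=64 * 1024)
    record_message("reduce", src=1, dst=0, nbytes=8)

for key, bins in sorted(histograms.items()):
    summary = ", ".join(f"2^{b} B x{count}" for b, count in sorted(bins.items()))
    print(key, "->", summary)
```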
Auto-Tuning
Official session: Auto-Tuning, Compilation, and Code Generation · Tue, Nov 18 · 10:30 AM–12:00 PM CST
PerfDojo: Frames kernel optimization as an RL “game” over a human-readable IR so LLM+RL agents can auto-generate ML library kernels that stay portable across CPU, GPU, and accelerator targets.
Constraint-Driven Auto-Tuning of GEMM-like Operators for MT-3000: DynaChain splits MT-3000 kernels into compute+data sub-problems, prunes the search via dependency constraints, and uses ILP-backed tiling plus custom micro-kernels to match expert-tuned code.
Benchmarks
Official session: Performance: Benchmarks and Optimization · Tue, Nov 18 · 10:30 AM–12:00 PM CST
Zero-Value Code Specialization: ZeroSpec pairs a static control/data-flow analysis with runtime profiling to detect zero-propagation hot spots, then emits specialized fast paths, delivering up to 1.31× speedups on SPEC/NPB codes.
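The pass works at the compiler level, but the payoff is the same as a hand-written zero fast path; a minimal hand-rolled analogue, assuming the zero check is cheaper than the work it skips:

```python
import numpy as np

# Hand-written zero-value specialization (ZeroSpec derives this automatically):
# when an operand is zero, the update is a no-op and can be skipped outright.

def saxpy_specialized(a, x, y):
    """y + a * x, with a fast path when a == 0 or x is entirely zero."""
    if a == 0.0 or not x.any():   # specialized zero fast path
        return y                  # nothing to add
    return y + a * x              # generic path

rng = np.random.default_rng(0)
x_dense = rng.standard_normal(1 << 20)
x_zero = np.zeros(1 << 20)
y = rng.standard_normal(1 << 20)

y = saxpy_specialized(2.0, x_dense, y)   # generic path
y = saxpy_specialized(2.0, x_zero, y)    # fast path: skips the multiply-add
```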
Analysis Tools
Official session: Performance: Analysis Tools · Tue, Nov 18 · 1:30–3:00 PM CST
C.A.T.S.: Introduces a whole-program tracing format that records control-flow stacks plus memory events, feeding interactive visualizations that surfaced layout fixes and 3× footprint reductions in the case studies.
RedSan: Uses binary instrumentation to pinpoint redundant memory instructions in fully optimized CUDA kernels, netting up to 6.27× speedups and 3× fewer memory ops.
TraceFlow: Redistributes trace events by interaction pattern so analysis runs nearly communication-free and 13.5× faster than current replay pipelines.
State of Practice
Official session: State of the Practice · Tue, Nov 18 · 1:30–3:00 PM CST
ChatHPC: Builds specialized Code Llama–based assistants for each layer of the HPC stack, using RAG + expert-supervised fine-tuning to hit ~90% higher trustworthiness than GPT-4o while running on just a pair of H100s.
Cloud Utilization
Official session: System Software and Cloud Computing: Resource Utilization · Tue, Nov 18 · 1:30–3:00 PM CST
HELM: Derives interpretable locality metrics straight from Unified Memory driver telemetry so we can pick migration/placement policies under oversubscription, yielding 3.5× better performance than the UM defaults.
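The appeal is that the metric is interpretable; the score below is an invented stand-in built from hypothetical per-region access and migration counters, not HELM's actual formulation, but it captures the flavor of the decision.

```python
# Toy locality score from UM-style driver counters (invented, not HELM's metric):
# regions whose pages serve many accesses per migration are worth keeping
# device-resident under oversubscription; thrashing regions are not.

regions = {
    # region: (gpu_accesses, page_migrations), hypothetical telemetry
    "grid_arrays":   (5_000_000,        300),
    "halo_buffers":  (   80_000,     35_000),
    "lookup_table":  (  900_000,         20),
}

def accesses_per_migration(accesses, migrations):
    """Crude reuse proxy: how much work each migrated page pays for."""
    return accesses / max(migrations, 1)

for name, (acc, mig) in regions.items():
    score = accesses_per_migration(acc, mig)
    policy = "keep device-resident" if score > 1_000 else "pin on host / rely on access counters"
    print(f"{name:12s} score={score:9.1f} -> {policy}")
```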
Sparse Computation
Official session: Performance: Sparse Matrix and Tensor Computation · Tue, Nov 18 · 3:30–5:00 PM CST
FaSTCC: Provides a hashing-based sparse tensor contraction engine that models every loop permutation, then applies a 2D tiled contraction-index scheme with probabilistic tile selection to beat prior CPU implementations.
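The hash-accumulation core is simple to sketch, even though the paper's contribution is the tiling and probabilistic tile selection layered on top; a bare-bones COO-dictionary contraction for reference:

```python
from collections import defaultdict

# Minimal hash-accumulation sparse contraction: C[i, k] = sum_j A[i, j] * B[j, k],
# with A and B stored as COO dictionaries. Conceptual only; FaSTCC's tiled
# contraction-index scheme is far more involved.

A = {(0, 1): 2.0, (0, 3): 1.5, (2, 1): -1.0}   # {(i, j): value}
B = {(1, 0): 4.0, (3, 0): 2.0, (1, 2): 0.5}    # {(j, k): value}

# Hash B's nonzeros by their contraction index j for O(1) lookup per A nonzero.
B_by_j = defaultdict(list)
for (j, k), v in B.items():
    B_by_j[j].append((k, v))

C = defaultdict(float)
for (i, j), a in A.items():
    for k, b in B_by_j.get(j, ()):
        C[(i, k)] += a * b

print(dict(C))   # {(0, 0): 11.0, (0, 2): 1.0, (2, 0): -4.0, (2, 2): -0.5}
```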
Bridging the Gap between Unstructured SpMM and Structured Sparse Tensor Cores: MP-SpMM reformulates arbitrary sparsity via maximum matching + padding so 2:4 SpTCs stay fed, complete with a custom storage format and GPU kernel.
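Worth keeping in mind what the 2:4 constraint actually is; the snippet below only demonstrates the pattern via magnitude pruning, whereas the paper's point is remapping arbitrary sparsity onto it (matching plus padding) without discarding nonzeros.

```python
import numpy as np

# 2:4 structured sparsity: in every group of 4 consecutive values along a row,
# at most 2 may be nonzero. Here the pattern is produced by magnitude pruning,
# which is NOT what MP-SpMM does; it rearranges existing sparsity to fit.

def prune_2_to_4(matrix):
    m = matrix.copy()
    rows, cols = m.shape
    assert cols % 4 == 0, "columns must be a multiple of 4"
    groups = m.reshape(rows, cols // 4, 4)
    # Zero the two smallest-magnitude entries in each group of four.
    drop = np.argsort(np.abs(groups), axis=-1)[..., :2]
    np.put_along_axis(groups, drop, 0.0, axis=-1)
    return groups.reshape(rows, cols)

A = np.random.default_rng(1).standard_normal((4, 8))
A_24 = prune_2_to_4(A)
assert (np.count_nonzero(A_24.reshape(4, 2, 4), axis=-1) <= 2).all()
print(A_24)
```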
Precision & Reals
Official session: Precision and Real Number Representations · Tue, Nov 18 · 3:30–5:00 PM CST
RAPTOR: Offers an LLVM-based numerical profiler that auto-rewrites kernels into alternate precisions (down to FP16/custom formats) and reports stability impacts, as demonstrated on Flash-X multiphysics apps.
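RAPTOR does this automatically at the LLVM level; the manual NumPy version of the same question (is a lower precision safe for this kernel?) looks roughly like the sketch below, with a made-up toy kernel standing in for real physics.

```python
import numpy as np

# Rerun a kernel in lower precision and compare against an FP64 reference to
# judge whether the downcast is numerically acceptable. Toy kernel, not Flash-X.

def kernel(x):
    """Cumulative sums are deliberately precision-sensitive."""
    return np.cumsum(x * 1e-3) + np.sin(x)

x64 = np.random.default_rng(2).standard_normal(100_000)
ref = kernel(x64)

for dtype in (np.float32, np.float16):
    approx = kernel(x64.astype(dtype)).astype(np.float64)
    rel_err = np.linalg.norm(approx - ref) / np.linalg.norm(ref)
    print(f"{np.dtype(dtype).name}: norm-wise relative error = {rel_err:.2e}")
```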
Energy & Power
Official session: Energy, Power, and Sustainability · Wed, Nov 19 · 10:30 AM–12:00 PM CST
Characterizing Performance, Power, and Energy of AMD CDNA3 GPU Family: Presents a third-party MI300X/MI325X study that measures compute scaling, memory latency/bandwidth, and energy tradeoffs, showing when power capping the MI325X actually wins on efficiency.
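The efficiency question comes down to energy-to-solution arithmetic; the numbers below are invented for illustration, not the paper's MI325X measurements.

```python
# Energy-to-solution = average power x runtime: a cap that costs a little
# runtime can still win on energy if it cuts power enough. Invented numbers.

runs = {
    # label: (avg power in W, runtime in s)
    "uncapped":       (1000.0, 100.0),
    "capped to 80%":  ( 800.0, 108.0),
    "capped to 60%":  ( 600.0, 135.0),
}

for label, (power_w, runtime_s) in runs.items():
    energy_kj = power_w * runtime_s / 1e3
    print(f"{label:13s}: {runtime_s:6.1f} s, {energy_kj:6.1f} kJ to solution")
```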
ML Methods
Official session: Machine Learning: Methods · Wed, Nov 19 · 10:30 AM–12:00 PM CST
TurboFNO: Debuts a fully fused FFT-GEMM-iFFT kernel that bakes in truncation/zero-padding and shared-memory swizzling so FNO layers avoid extra launches and run up to 150% faster than PyTorch+cuBLAS/cuFFT.
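Helpful to keep the unfused baseline in mind when reading the fusion claims; a NumPy reference of the FFT → truncate → per-mode channel-mix GEMM → zero-pad → inverse-FFT chain, with illustrative shapes:

```python
import numpy as np

# Unfused FNO spectral layer: the sequence TurboFNO collapses into one kernel.

def fno_spectral_layer(x, weights, keep_modes):
    """x: (batch, channels, n) real; weights: (keep_modes, c_in, c_out) complex."""
    x_hat = np.fft.rfft(x, axis=-1)              # FFT along the spatial dim
    x_trunc = x_hat[..., :keep_modes]            # truncate high frequencies
    # Mix channels independently per retained mode: the batched GEMM part.
    y_trunc = np.einsum("bci,ico->boi", x_trunc, weights)
    y_hat = np.zeros_like(x_hat)                 # zero-pad the dropped modes
    y_hat[..., :keep_modes] = y_trunc
    return np.fft.irfft(y_hat, n=x.shape[-1], axis=-1)

rng = np.random.default_rng(3)
x = rng.standard_normal((8, 16, 256))            # batch, channels, grid points
w = rng.standard_normal((32, 16, 16)) + 1j * rng.standard_normal((32, 16, 16))
print(fno_spectral_layer(x, w, keep_modes=32).shape)   # (8, 16, 256)
```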
Data & Storage
Official session: Data Analytics, Visualization & Storage · Wed, Nov 19 · 1:30–3:00 PM CST
STELLAR: Acts as an autonomous multi-agent LLM tuner that scrapes manuals, reasons over I/O traces, executes experiments, and converges to near-optimal PFS configs within five trials on unseen workloads.
gParaKV: Implements a GPU-parallel key-value design that accelerates LSM compaction and garbage collection via bitmap-based marking and merge-sort offload, beating RocksDB-family KV stores on write-heavy loads.
ML at Scale 1
Official session: Machine Learning: Training at Scale 1 · Wed, Nov 19 · 1:30–3:00 PM CST
HPC-R1: Delivers a full-stack characterization of training large reasoning models on Perlmutter, covering SFT, GRPO, generation, and distillation with 19 concrete observations that feed back into HPC-AI system design.
Resilience
Official session: Anomaly Detection, Failure Management, and Resilience 2 · Wed, Nov 19 · 3:30–5:00 PM CST
Story of Two GPUs: Analyzes 2.5 years of Delta 1 logs (11.7M GPU-hours) to show that H100s hit 3.2× more memory errors than A100s yet exhibit stronger core resilience, and that operators still need ~5% node overprovisioning to absorb failures.
ML at Scale 2
Official session: Machine Learning: Training at Scale 2 · Wed, Nov 19 · 3:30–5:00 PM CST
MLP-Offload: Introduces multi-level, multi-path offloading of optimizer states that exploits idle tiers and controls concurrency, cutting I/O stalls so 280B-parameter pre-training runs up to 2.5× faster than today’s runtimes.
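The multi-path part is essentially bandwidth aggregation; toy numbers below (not the paper's), assuming a flush split proportionally to each path's bandwidth.

```python
# Splitting an optimizer-state flush across offload paths instead of saturating
# one tier. Sizes and bandwidths are invented for illustration.

opt_state_gb = 120.0
paths_gb_s = {"nvme": 6.0, "host_dram": 20.0}   # hypothetical offload paths

single_path = opt_state_gb / paths_gb_s["nvme"]
multi_path = opt_state_gb / sum(paths_gb_s.values())   # proportional split
print(f"single-path flush: {single_path:.1f} s, multi-path flush: {multi_path:.1f} s")
```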
GEMM Optimization
Official session: Algorithms: Matrix Multiplication and GEMM Optimization · Thu, Nov 20 · 10:30 AM–12:00 PM CST
KAMI: Extends communication-avoiding 1D/2D/3D GEMM theory to a single GPU, keeping tensor cores busy on batched, low-rank, and sparse multiplies by staging data through registers and shared memory.
HyTiS: Combines two-level tile scheduling, latency-aware partial waves, and layout-aware L2 modeling to mitigate wave quantization on big GPUs, hitting up to 2× cuBLAS speedups on H100/A100.
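Wave quantization itself is visible with a little arithmetic; the SM count below is a placeholder for an H100-class part, and none of HyTiS's scheduling is modeled.

```python
import math

# Wave quantization: if the tile count lands just past a multiple of the number
# of SMs, the final wave runs nearly empty and utilization drops.

NUM_SMS = 132   # placeholder SM count

def wave_utilization(m, n, tile_m=128, tile_n=128):
    tiles = math.ceil(m / tile_m) * math.ceil(n / tile_n)
    waves = math.ceil(tiles / NUM_SMS)
    return tiles, waves, tiles / (waves * NUM_SMS)

for m, n in ((4096, 4096), (4352, 4096)):
    tiles, waves, util = wave_utilization(m, n)
    print(f"{m}x{n}: {tiles} tiles, {waves} waves, {util:.1%} of tile slots busy")
```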
LiquidGEMM: Introduces LiquidQuant and an implicit warp pipeline so W4A8 GEMMs dequantize with two instructions per four values and fully overlap load/dequant/MMA, boosting LLM serving throughput up to 4.9×.
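Not the two-instruction GPU trick, but a scalar NumPy sketch of what W4A8 dequantization has to do for each pair of packed nibbles:

```python
import numpy as np

# Two signed 4-bit weights are packed per byte; dequantization unpacks them,
# sign-extends to int8, and applies a per-group scale before the MMA.

def unpack_int4(packed):
    """packed: uint8 array where each byte holds two signed 4-bit weights."""
    lo = (packed & 0x0F).astype(np.int8)
    hi = (packed >> 4).astype(np.int8)
    # Sign-extend 4-bit two's-complement values into int8.
    lo = np.where(lo >= 8, lo - 16, lo)
    hi = np.where(hi >= 8, hi - 16, hi)
    return np.stack([lo, hi], axis=-1).reshape(*packed.shape[:-1], -1)

packed = np.array([[0x1F, 0x7A]], dtype=np.uint8)   # two bytes -> four weights
w_int4 = unpack_int4(packed)                        # [[-1  1 -6  7]]
scale = 0.05                                        # per-group dequant scale
print(w_int4, w_int4.astype(np.float32) * scale)
```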
Scheduling & Tiling
Official session: Scheduling, Tiling, and Parallelism · Thu, Nov 20 · 10:30 AM–12:00 PM CST
SIREN: Deploys process-level data taps plus fuzzy hashing to reliably identify repeated or unknown software on shared systems, giving operators observability that job-name heuristics can’t.
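As a loose mental model of fuzzy-hash matching (SIREN's data taps and hashing are its own), comparing shingle-hash sets of two byte streams already separates "slightly modified build of the same code" from "unrelated software":

```python
import hashlib

# Hash every overlapping k-byte window of a blob and compare set overlap.
# A crude analogue of fuzzy hashing, not SIREN's actual method.

def shingle_set(data: bytes, k: int = 8) -> set:
    return {hashlib.blake2b(data[i:i + k], digest_size=8).digest()
            for i in range(len(data) - k + 1)}

def similarity(a: bytes, b: bytes) -> float:
    sa, sb = shingle_set(a), shingle_set(b)
    return len(sa & sb) / max(len(sa | sb), 1)

original = bytes(range(256)) * 8                                   # pretend binary A
patched = original[:1000] + b"\x00\x01\x02" + original[1000:]      # slightly modified build
other = bytes(reversed(range(256))) * 8                            # unrelated binary

print(f"original vs patched: {similarity(original, patched):.2f}")  # close to 1.0
print(f"original vs other:   {similarity(original, other):.2f}")    # close to 0.0
```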
Compression
Official session: Compression and Data Reduction 1 · Thu, Nov 20 · 1:30–3:00 PM CST
What to Support When You’re Compressing: Surveys nine supercomputing domains, distills 24 takeaways on requirements vs. current compressors, and flags concrete research gaps for error-bounded lossy compression.