Exploring emerging AI accelerator architectures: AWS Trainium's distributed training capabilities and Cerebras WSE's wafer-scale computing approach for large language models and HPC workloads.
The landscape of AI accelerators is rapidly evolving beyond traditional GPU architectures. While NVIDIA GPUs have dominated ML training and inference, emerging architectures like AWS Trainium and the Cerebras Wafer-Scale Engine (WSE) challenge fundamental assumptions about how we design hardware for large-scale machine learning. This post surveys the academic research characterizing these novel architectures, focusing on archival, top-conference papers that provide rigorous performance analysis and algorithmic insights.
AWS Trainium: Distributed Training at Scale
Distributed Training of Large Language Models on AWS Trainium (SoCC 2024)
Introduces the Neuron Distributed Training Library, which enables distributed LLM training across Trainium instances. The paper describes the software stack architecture and reports performance comparisons between Trainium (trn1) and NVIDIA A100 for large language model training workloads. Provides baseline performance characteristics, demonstrating competitive throughput and cost advantages for specific LLM training scenarios.
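To make the software stack concrete, here is a minimal data-parallel training step written against PyTorch/XLA, the layer the Neuron SDK exposes for Trainium. This is only a sketch of the general pattern: the model, optimizer settings, and `train_loader` are placeholders, and the Neuron Distributed Training Library's own tensor- and pipeline-parallel APIs (the paper's actual contribution) are not shown here.

```python
# Minimal data-parallel training step via PyTorch/XLA, which the Neuron SDK
# maps onto Trainium's NeuronCores. Model, hyperparameters, and train_loader
# are placeholders, not the paper's configuration.
import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm

def train_epoch(model: nn.Module, train_loader, lr: float = 1e-4):
    device = xm.xla_device()            # an XLA device (a NeuronCore on trn1 instances)
    model = model.to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()

    for inputs, labels in train_loader:
        inputs, labels = inputs.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), labels)
        loss.backward()
        xm.optimizer_step(optimizer)    # all-reduce gradients across workers, then step
        xm.mark_step()                  # cut the lazily built XLA graph and dispatch it
```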
Cerebras Wafer-Scale Engine
The Cerebras Wafer-Scale Engine implements an entire ML accelerator on a single silicon wafer, with WSE-2 containing 850,000 cores interconnected by a 2D mesh network. This architecture differs fundamentally from traditional multi-chip clusters.
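A quick back-of-the-envelope model helps convey the scale. The sketch below computes worst-case hop counts on a 2D mesh of roughly WSE-2's core count; the square grid shape is an illustrative assumption, not the wafer's actual layout.

```python
# Toy hop-count model for nearest-neighbour routing on a 2D mesh.
# The grid shape is an illustrative stand-in for ~850,000 cores,
# not Cerebras's actual fabric dimensions.
def mesh_hops(src: tuple[int, int], dst: tuple[int, int]) -> int:
    """Manhattan distance between two cores on the mesh."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])

GRID = (922, 922)   # 922 * 922 ≈ 850,000 cores (assumed square grid)
worst_case = mesh_hops((0, 0), (GRID[0] - 1, GRID[1] - 1))
print(f"worst-case hop count across the wafer: {worst_case}")
```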
LLM Inference at Wafer Scale
WaferLLM: Large Language Model Inference at Wafer Scale (OSDI 2025)
First comprehensive system for wafer-scale LLM inference on Cerebras WSE-2. Develops device models capturing WSE-2’s characteristics (850,000 cores, 48 KB SRAM per core, 2D mesh topology). Introduces MeshGEMM and MeshGEMV operators designed for wafer-scale matrix operations. Reports substantial speedups over A100 GPU clusters for inference workloads by exploiting uniform on-chip connectivity and fine-grained tensor parallelism across hundreds of thousands of cores.
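The core idea behind sharding matrix operations over a 2D grid of cores can be illustrated with a small NumPy simulation: each "core" holds one tile of the weight matrix, computes a partial product, and partial results are accumulated along mesh rows. This is only a conceptual sketch of 2D tensor sharding, not the MeshGEMV algorithm from the paper.

```python
# Conceptual GEMV sharded over a P x Q grid of cores. Each core owns one
# tile of W and one slice of x; partial products are summed along mesh rows.
# Illustrative only; WaferLLM's MeshGEMV is designed for the real on-wafer fabric.
import numpy as np

def sharded_gemv(W: np.ndarray, x: np.ndarray, P: int, Q: int) -> np.ndarray:
    M, N = W.shape
    assert M % P == 0 and N % Q == 0
    tm, tn = M // P, N // Q
    y = np.zeros(M)
    for i in range(P):                  # mesh row index
        for j in range(Q):              # mesh column index
            W_tile = W[i*tm:(i+1)*tm, j*tn:(j+1)*tn]
            x_tile = x[j*tn:(j+1)*tn]
            y[i*tm:(i+1)*tm] += W_tile @ x_tile   # row-wise reduction of partials
    return y

W, x = np.random.randn(64, 64), np.random.randn(64)
assert np.allclose(sharded_gemv(W, x, P=8, Q=8), W @ x)
```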
Collective Communications
Near-Optimal Wafer-Scale Reduce (HPDC 2024)
Systematic study of Reduce and AllReduce operations on Cerebras WSE. Develops performance models specific to WSE’s 2D mesh topology and proposes algorithms that outperform vendor-provided collectives, achieving near-optimal bandwidth utilization. Demonstrates that WSE’s uniform on-chip network enables reduction patterns fundamentally different from hierarchical inter-node topologies.
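For intuition, the textbook dimension-ordered scheme for a mesh AllReduce sums along each row, then along a column, then broadcasts the result back. The simulation below shows only this baseline pattern; the paper's near-optimal algorithms go further, balancing bandwidth across the mesh links rather than following this naive schedule.

```python
# Naive two-phase AllReduce on a P x Q mesh: reduce along rows, then along a
# column, then broadcast back. Baseline illustration only, not the paper's
# near-optimal algorithm.
import numpy as np

def mesh_allreduce(values: np.ndarray) -> np.ndarray:
    """values[i, j] is the scalar held by core (i, j); returns what every core holds afterwards."""
    row_sums = values.sum(axis=1, keepdims=True)    # phase 1: reduce along mesh rows
    total = row_sums.sum(axis=0, keepdims=True)     # phase 2: reduce along one column
    return np.broadcast_to(total, values.shape)     # phase 3: broadcast to all cores

grid = np.arange(12, dtype=float).reshape(3, 4)     # toy 3 x 4 mesh
assert np.allclose(mesh_allreduce(grid), grid.sum())
```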
Scientific Computing Kernels
Wafer-Scale Fast Fourier Transforms (ICS 2023)
Optimizes 1D, 2D, and 3D FFTs for Cerebras CS-2, achieving sub-millisecond execution for 512³ 3D FFTs. Analyzes the interplay between computation and communication on WSE’s 2D mesh. Demonstrates that carefully designed data layouts exploit the wafer-scale architecture’s bandwidth to achieve performance previously requiring supercomputer-scale resources.
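The standard way to decompose a 3D FFT over a 2D processor grid is the pencil decomposition: apply 1D FFTs along one axis, exchange data to re-pencil along the next axis, and repeat. The NumPy sketch below captures that structure in a single address space; the paper's contribution lies in the data layouts and communication schedules that make the exchanges fast on the wafer, which are not reproduced here.

```python
# Pencil-style 3D FFT: 1D transforms along each axis in turn. On a 2D mesh,
# the axis changes correspond to wafer-wide data exchanges; NumPy stands in
# for the on-wafer kernels in this conceptual sketch.
import numpy as np

def pencil_fft3d(x: np.ndarray) -> np.ndarray:
    x = np.fft.fft(x, axis=0)   # 1D FFTs along x-pencils
    x = np.fft.fft(x, axis=1)   # (exchange) then 1D FFTs along y-pencils
    x = np.fft.fft(x, axis=2)   # (exchange) then 1D FFTs along z-pencils
    return x

cube = np.random.randn(32, 32, 32)
assert np.allclose(pencil_fft3d(cube), np.fft.fftn(cube))
```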
Data Compression
CereSZ: Enabling and Scaling Error-bounded Lossy Compression on Cerebras CS-2 (HPDC 2024)
Maps error-bounded lossy compression algorithms to WSE, exploiting both data parallelism and pipeline parallelism to achieve high compression throughput. Demonstrates that WSE’s massive core count can accelerate traditionally sequential compression algorithms through novel parallelization strategies.
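To see why these algorithms are traditionally sequential, consider a minimal SZ-style scheme: each value is predicted from the previously reconstructed value, and only the quantized prediction error is stored, subject to a user-set error bound. The scalar reference below conveys that data dependence; CereSZ's contribution is mapping this class of algorithm onto the WSE's data- and pipeline-parallel fabric, which this sketch does not attempt.

```python
# Minimal error-bounded lossy compressor in the SZ style: previous-value
# predictor plus error quantization. Reference sketch only, not CereSZ's pipeline.
import numpy as np

def compress(data: np.ndarray, eb: float) -> np.ndarray:
    codes = np.empty(len(data), dtype=np.int32)
    prev = 0.0
    for i, v in enumerate(data):
        codes[i] = int(np.round((v - prev) / (2 * eb)))  # quantized prediction error
        prev = prev + codes[i] * 2 * eb                  # track the decompressed value
    return codes

def decompress(codes: np.ndarray, eb: float) -> np.ndarray:
    out, prev = np.empty(len(codes)), 0.0
    for i, c in enumerate(codes):
        prev = prev + c * 2 * eb
        out[i] = prev
    return out

data = np.cumsum(np.random.randn(1000))      # smooth-ish test signal
recon = decompress(compress(data, 1e-3), 1e-3)
assert np.max(np.abs(data - recon)) <= 1e-3 + 1e-12
```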