Foundations of Generative Inference #

Overview #

As machine learning models and computing systems grow larger and more complex, system performance increasingly depends on how well algorithms and hardware are designed to work together. This principle, known as software–hardware co-design, involves shaping software to exploit the strengths of the hardware (such as parallel processors, the memory hierarchy, and supported precision formats), while simultaneously developing hardware that anticipates the evolving needs of future algorithms.

Hardware vendors commit to a design today while algorithms continue to evolve. If hardware is misaligned with those future workloads, even sophisticated algorithms may run inefficiently, and powerful chips can remain underutilized. Conversely, a mathematically elegant algorithm that fails to map efficiently onto hardware (e.g. due to communication overheads, memory bottlenecks, or interconnect latency) will rarely achieve its theoretical potential.

Co-design bridges this gap by aligning the algorithm design with the physical realities of computation. It encourages optimization of data movement, matching of precision formats to arithmetic units, and rethinking of architectures to minimize communication costs. Modern systems achieve the greatest efficiency when models, compilers, and hardware are tuned together, allowing computation to flow smoothly through every layer. This shift from isolated component optimization to holistic system-level thinking has fueled breakthroughs such as specialized AI accelerators and optimized training and inference pipelines, delivering substantial performance gains in speed, energy efficiency, and scalability.

This section establishes the foundational knowledge required to understand and optimize generative inference systems. We begin by examining the core transformer architecture that underpins modern large language models, then explore how these models are implemented across different architectural families. We then shift our focus to the hardware infrastructure that executes these models, examining specialized AI accelerators and their memory hierarchies. To reason about performance systematically, we introduce roofline analysis as a framework for understanding compute and memory bottlenecks. We also establish the key metrics used to measure inference efficiency in production systems. Finally, we explore parallelism strategies and communication primitives that enable efficient scaling across multiple devices.

Together, these topics provide the essential background for understanding how software algorithms and hardware systems interact to deliver high-performance generative inference.

Contents of this section #

1.1 Overview of Transformer Architecture #

Introduces the fundamental components of the transformer architecture, including self-attention mechanisms, positional encodings, feed-forward networks, and layer normalization. Explains the distinction between prefill and decoding phases in autoregressive inference and how each stage presents different computational challenges.
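
As a quick preview of the prefill/decode distinction, the NumPy sketch below contrasts the two phases with a toy single-head attention; the dimensions, random weights, and single head are illustrative assumptions rather than any particular model. Prefill attends over the entire prompt in one batched pass and fills the key/value cache, while each decode step attends a single new query against the growing cache.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # q: (n_queries, d), k and v: (n_keys, d) -> output: (n_queries, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

d = 64
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.02 for _ in range(3))

# Prefill: attend over the full prompt in one batched pass and fill the KV cache.
prompt = rng.standard_normal((16, d))            # 16 toy prompt-token embeddings
k_cache = prompt @ Wk
v_cache = prompt @ Wv
_ = attention(prompt @ Wq, k_cache, v_cache)     # large matmuls: compute-heavy

# Decode: one token at a time, each step appending to and re-reading the cache.
x = rng.standard_normal((1, d))                  # embedding of the latest token
for _ in range(4):
    k_cache = np.vstack([k_cache, x @ Wk])
    v_cache = np.vstack([v_cache, x @ Wv])
    out = attention(x @ Wq, k_cache, v_cache)    # tiny matmuls, full cache read: memory-heavy
    x = out                                      # stand-in for the next token's embedding
```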

1.2 Representative LLM Architectures #

Surveys prominent large language model families including Llama, DeepSeek, OLMo, Gemma, and Qwen, highlighting key architectural innovations such as Grouped Query Attention (GQA), Mixture-of-Experts (MoE), and alternative attention mechanisms. Demonstrates how architectural choices directly impact inference performance and memory requirements.
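
To make one of these ideas concrete, the sketch below shows grouped-query attention with assumed toy sizes (8 query heads sharing 2 cached key/value heads). The only point is that the KV cache shrinks by the group factor while the attention arithmetic is otherwise unchanged.

```python
import numpy as np

n_q_heads, n_kv_heads, d_head, seq = 8, 2, 64, 128    # assumed toy sizes
group = n_q_heads // n_kv_heads                        # 4 query heads share each KV head

rng = np.random.default_rng(0)
q = rng.standard_normal((n_q_heads, seq, d_head))
k = rng.standard_normal((n_kv_heads, seq, d_head))     # only the 2 KV heads are cached
v = rng.standard_normal((n_kv_heads, seq, d_head))

# Broadcast each KV head to its group of query heads, then attend as usual.
k_full = np.repeat(k, group, axis=0)                   # (8, seq, d_head)
v_full = np.repeat(v, group, axis=0)

scores = q @ k_full.transpose(0, 2, 1) / np.sqrt(d_head)
scores -= scores.max(axis=-1, keepdims=True)
probs = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
out = probs @ v_full                                   # (8, seq, d_head)

print("KV entries with GQA:", k.size + v.size,
      "vs full multi-head:", group * (k.size + v.size))
```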

1.3 Hardware Accelerators for Generative AI #

Examines the architecture of specialized AI accelerators, using AWS Trainium as a case study to illustrate compute engines (tensor, vector, scalar, GPSIMD), memory hierarchies (HBM, SBUF, PSUM), and scale-up/scale-out networking. Explains how understanding hardware constraints informs effective kernel design and optimization strategies.

1.4 Roofline Analysis #

Presents the roofline model as a framework for analyzing performance bottlenecks, distinguishing between compute-bound and memory-bound regimes. Provides a systematic approach to identifying whether an operation is limited by arithmetic throughput or memory bandwidth, guiding optimization efforts.
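
The model reduces to a single formula: attainable throughput = min(peak compute, memory bandwidth × arithmetic intensity). The sketch below evaluates it with placeholder peak numbers; the 100 TFLOP/s and 1 TB/s figures are assumptions for illustration, not any specific accelerator's specifications.

```python
# Roofline sketch: attainable throughput = min(peak compute, bandwidth * intensity).
PEAK_FLOPS = 100e12      # 100 TFLOP/s -- assumed placeholder, not a real device spec
PEAK_BW = 1e12           # 1 TB/s      -- assumed placeholder

def attainable(intensity):
    """intensity = FLOPs performed per byte moved to/from memory."""
    return min(PEAK_FLOPS, PEAK_BW * intensity)

ridge = PEAK_FLOPS / PEAK_BW   # intensity at which the two limits meet (100 FLOP/byte here)

for name, intensity in [("decode-style matrix-vector product", 1.0),
                        ("prefill-style matrix-matrix product", 300.0)]:
    regime = "memory-bound" if intensity < ridge else "compute-bound"
    print(f"{name}: {attainable(intensity) / 1e12:.1f} TFLOP/s attainable ({regime})")
```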

1.5 Measuring Inference Efficiency #

Defines key performance metrics for LLM inference: Time-to-First-Token (TTFT), Output Tokens Per Second (OTPS), and throughput. Explains how these metrics capture different aspects of inference behavior and how they relate to user experience and system efficiency.
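
As a minimal preview, the sketch below derives these metrics from per-token timestamps; the timing values are made up purely for illustration.

```python
# Derive TTFT, OTPS, and per-request throughput from token timestamps (values are illustrative).
request_start = 0.00
token_times = [0.35, 0.40, 0.45, 0.50, 0.55, 0.60]    # arrival time (s) of each output token

ttft = token_times[0] - request_start                  # Time-to-First-Token: dominated by prefill
otps = (len(token_times) - 1) / (token_times[-1] - token_times[0])  # steady-state decode rate
throughput = len(token_times) / (token_times[-1] - request_start)   # end-to-end tokens per second

print(f"TTFT = {ttft:.2f} s, OTPS = {otps:.1f} tok/s, request throughput = {throughput:.1f} tok/s")
```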

1.6 Parallelism Strategies and Communication Collectives #

Explores how matrix operations are partitioned across multiple devices and the communication primitives required to combine partial results. Covers sharded matrix multiplication patterns, AllGather, AllReduce, ReduceScatter, and All-to-All collectives, which form the foundation for tensor parallelism, pipeline parallelism, and expert parallelism.
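
As a preview, the NumPy sketch below simulates the pattern on a single machine, with a list of arrays standing in for devices: the contraction dimension of a matrix multiply is sharded, each "device" produces a partial result, and AllReduce, ReduceScatter, and AllGather combine the partials. The device count and shapes are arbitrary assumptions.

```python
import numpy as np

# Four "devices" simulated with plain NumPy arrays; shapes are arbitrary assumptions.
n_dev, batch, d_in, d_out = 4, 8, 256, 512
rng = np.random.default_rng(0)
x = rng.standard_normal((batch, d_in))
W = rng.standard_normal((d_in, d_out))

# Shard the contraction dimension: device i holds a slice of x's columns and W's rows,
# computes a partial product, and the partials must be summed to recover x @ W.
x_shards = np.split(x, n_dev, axis=1)
W_shards = np.split(W, n_dev, axis=0)
partials = [xs @ ws for xs, ws in zip(x_shards, W_shards)]

# AllReduce: every device ends up with the full summed result.
allreduced = sum(partials)
assert np.allclose(allreduced, x @ W)

# ReduceScatter: same sum, but device i keeps only its 1/n_dev slice of the output columns.
scattered = np.split(sum(partials), n_dev, axis=1)

# AllGather: concatenating the slices gives every device the full result again.
allgathered = np.concatenate(scattered, axis=1)
assert np.allclose(allgathered, x @ W)
```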