Disaggregated Inference #
In the previous section, we explored chunked prefill as a practical solution to the scheduling tension between prefill and decode phases. By breaking long prefill operations into smaller chunks, chunked prefill enables interleaving with decode steps, improving system stability and fairness. However, this approach still operates within a fundamental constraint: both prefill and decode must share the same GPU resources, creating an inherent tradeoff between prefill efficiency and decode responsiveness.
Disaggregated Inference [1] represents a paradigm shift that eliminates this constraint by physically separating prefill and decode execution onto different devices or specialized hardware tiers. This architectural separation allows each phase to be optimized independently, unlocking performance and efficiency gains that unified systems cannot achieve.
In this section, we examine why disaggregation is necessary, how it works, and the benefits and challenges it introduces.
The Fundamental Tension: Why Separation Matters #
As we’ve seen in earlier sections, LLM inference consists of two distinct phases with fundamentally different characteristics (a back-of-envelope comparison after the two lists below quantifies the contrast):
Prefill Phase #
- Compute-bound: Requires large matrix multiplications over the entire input sequence
- Memory-intensive: Generates large intermediate activations and attention matrices
- Batch-friendly: Benefits from processing multiple long sequences together
- Throughput-optimized: Best performance comes from saturating the GPU with large, uninterrupted workloads
Decode Phase #
- Memory-bandwidth-bound: Small per-token computation; each step is limited by reading the model weights and the growing KV cache
- Latency-sensitive: Users expect low inter-token latency for interactive applications
- Dynamic batching: Requires fine-grained scheduling to handle variable-length sequences
- Responsiveness-optimized: Best performance comes from frequent, small scheduling opportunities
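To make this contrast concrete, here is a rough back-of-envelope comparison. All numbers are illustrative assumptions (a 70B-parameter dense model in FP16, a GPU with roughly 1 PFLOP/s of compute and 3 TB/s of HBM bandwidth), not measurements:

```python
# Back-of-envelope: why prefill is compute-bound and decode is bandwidth-bound.
# All constants below are illustrative assumptions, not measured values.
PARAMS = 70e9          # model parameters (assumed 70B dense model)
BYTES_PER_PARAM = 2    # FP16 weights
GPU_FLOPS = 1e15       # ~1 PFLOP/s dense FP16 (rough, assumed)
GPU_BW = 3e12          # ~3 TB/s HBM bandwidth (rough, assumed)

def forward_flops(num_tokens: int) -> float:
    # ~2 FLOPs per parameter per token for a forward pass
    # (ignores attention-score FLOPs and activation traffic)
    return 2 * PARAMS * num_tokens

# Prefill: a 4096-token prompt processed in one pass; weights are read once.
prefill_compute_s = forward_flops(4096) / GPU_FLOPS      # ~0.57 s of arithmetic
prefill_memory_s = PARAMS * BYTES_PER_PARAM / GPU_BW     # ~0.05 s of weight reads
print(f"prefill: compute {prefill_compute_s:.2f} s vs weight reads {prefill_memory_s:.2f} s")

# Decode: one token per step per request, but every step re-reads all the weights.
decode_compute_s = forward_flops(1) / GPU_FLOPS           # ~0.14 ms of arithmetic
decode_memory_s = PARAMS * BYTES_PER_PARAM / GPU_BW       # ~47 ms of weight reads
print(f"decode:  compute {decode_compute_s*1e3:.2f} ms vs weight reads {decode_memory_s*1e3:.0f} ms")
```

Prefill performs far more arithmetic than memory traffic, so it saturates the GPU's compute units; a single decode step does the opposite, which is why decode throughput depends on batching many concurrent requests and on memory bandwidth.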
When both phases run on the same GPU, they compete for resources in ways that create unavoidable tradeoffs:
Any improvement to decode responsiveness (e.g., smaller chunk sizes in chunked prefill) comes at the expense of prefill efficiency, and vice versa.
This tension becomes more severe as:
- Context lengths grow: Longer prompts make prefill more expensive and disruptive
- Concurrency increases: More concurrent requests amplify scheduling conflicts
- Workload diversity increases: Mixing short and long prompts creates unpredictable interference
Disaggregated Inference: The Key Idea #
Disaggregated inference addresses this tension by recognizing that prefill and decode have different resource requirements and should be optimized separately:
Execute prefill and decode on separate devices or hardware tiers, allowing each to be optimized for its specific workload characteristics.
This separation enables:
- Independent optimization: Prefill devices can prioritize compute throughput; decode devices can prioritize memory bandwidth and low latency
- Specialized hardware: Different device types (e.g., high-compute GPUs for prefill, memory-optimized GPUs for decode) can be selected for each phase
- Better resource utilization: Each device type operates at peak efficiency for its intended workload
- Improved isolation: Long prefill operations no longer block decode requests
How Disaggregated Inference Works #
Disaggregated serving separates prefill and decode on different devices (credits: NVIDIA Dynamo blog)
Separate Prefill and Decode Clusters #
The system is divided into two logical clusters:
- Prefill cluster: Dedicated devices optimized for compute-intensive prefill operations
- Decode cluster: Dedicated devices optimized for memory-bandwidth-bound decode operations
Each cluster can have different hardware configurations, scheduling policies, and optimization strategies.
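As a sketch, the two clusters could be described with a configuration like the following. The `ClusterConfig` fields and values are hypothetical and not tied to any particular serving framework:

```python
from dataclasses import dataclass

# Hypothetical description of the two logical clusters; field names and
# values are illustrative only.
@dataclass
class ClusterConfig:
    role: str              # "prefill" or "decode"
    gpu_type: str          # hardware chosen for the phase's bottleneck
    num_gpus: int
    max_batch_tokens: int  # scheduling knob: token budget per batch

prefill_cluster = ClusterConfig(role="prefill", gpu_type="compute-optimized",
                                num_gpus=8, max_batch_tokens=16384)
decode_cluster = ClusterConfig(role="decode", gpu_type="bandwidth-optimized",
                               num_gpus=16, max_batch_tokens=4096)
```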
Request Lifecycle #
When a new request arrives, it moves through four steps (sketched in code after this list):
- Prefill execution: The request is routed to a prefill device, where the prompt is processed through the model to generate the initial KV cache
- State transfer: After prefill completes, the KV cache (and any necessary request state) is transferred to a decode device
- Decode execution: The request joins the decode cluster’s continuous batching system, generating tokens one at a time
- Completion: When generation finishes, the decode device releases resources
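The sketch below traces these four steps in simplified Python. The `PrefillWorker`/`DecodeWorker` objects and their methods are hypothetical placeholders; real systems such as DistServe [1] or NVIDIA Dynamo wrap each step in asynchronous scheduling, batching, and failure handling.

```python
import uuid

# Minimal lifecycle sketch; the worker objects and their methods are assumed
# for illustration only.
def serve_request(prompt_tokens, prefill_cluster, decode_cluster, max_new_tokens):
    request_id = uuid.uuid4()

    # 1. Prefill execution: build the KV cache (and sample the first token)
    #    on a prefill device.
    prefill_worker = prefill_cluster.pick_worker()
    kv_cache, first_token = prefill_worker.prefill(request_id, prompt_tokens)

    # 2. State transfer: ship the KV cache to a decode device
    #    (over NVLink, RDMA, or the datacenter network).
    decode_worker = decode_cluster.pick_worker()
    decode_worker.receive_kv_cache(request_id, kv_cache)

    # 3. Decode execution: the request joins the decode cluster's continuous
    #    batch and generates one token at a time.
    output = [first_token]
    while len(output) < max_new_tokens and output[-1] != decode_worker.eos_token:
        output.append(decode_worker.decode_step(request_id))

    # 4. Completion: free the KV-cache blocks on the decode device.
    decode_worker.release(request_id)
    return output
```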
Independent Scheduling #
Each cluster operates its own scheduler, optimized for its workload (a minimal sketch of the two policies follows the list):
- Prefill scheduler: Can batch multiple long prompts together, maximizing GPU utilization without worrying about decode latency
- Decode scheduler: Can prioritize low-latency token generation, using continuous batching without interference from prefill operations
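A minimal sketch of how the two policies might differ, assuming hypothetical request objects with a `num_prompt_tokens` field:

```python
# Sketch of the two schedulers' differing objectives; interfaces are assumed
# for illustration.
class PrefillScheduler:
    """Throughput-oriented: pack waiting prompts into large batches up to a token budget."""
    def __init__(self, max_batch_tokens: int = 16384):
        self.queue = []                      # FIFO of waiting requests
        self.max_batch_tokens = max_batch_tokens

    def next_batch(self):
        batch, budget = [], self.max_batch_tokens
        while self.queue and self.queue[0].num_prompt_tokens <= budget:
            request = self.queue.pop(0)
            budget -= request.num_prompt_tokens
            batch.append(request)
        return batch                         # large, uninterrupted batch; nothing preempts it


class DecodeScheduler:
    """Latency-oriented: continuous batching, one token per running request per step."""
    def __init__(self):
        self.running = []                    # requests with KV cache resident on this device

    def step_batch(self):
        # Every running request advances by exactly one token each iteration,
        # so no request ever waits behind a long prefill.
        return list(self.running)
```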
Dynamic Load Balancing #
The system must balance load across both clusters (simple routing policies are sketched after this list):
- Prefill load balancing: Distribute incoming requests across prefill devices based on capacity and current load
- Decode load balancing: Distribute completed prefill requests to decode devices, ensuring decode capacity matches prefill throughput
- Adaptive scaling: Scale each cluster independently based on workload characteristics
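For illustration, routing policies for the two tiers might look like the following; the worker attributes (`queued_prompt_tokens`, `free_kv_bytes`) are assumptions made for this sketch:

```python
# Illustrative routing policies; worker attributes are hypothetical.
def pick_prefill_worker(prefill_workers):
    # Send new requests to the prefill worker with the least queued prompt work.
    return min(prefill_workers, key=lambda w: w.queued_prompt_tokens)

def pick_decode_worker(decode_workers, kv_cache_bytes):
    # Send finished prefills to a decode worker with room for the KV cache,
    # preferring the one with the most free KV-cache memory.
    candidates = [w for w in decode_workers if w.free_kv_bytes >= kv_cache_bytes]
    if not candidates:
        return None  # decode tier saturated: apply backpressure to the prefill tier
    return max(candidates, key=lambda w: w.free_kv_bytes)
```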
Performance Advantages #
Disaggregated inference provides several key benefits over unified systems:
Eliminates Scheduling Tension #
By physically separating prefill and decode, the fundamental tradeoff between prefill efficiency and decode responsiveness is eliminated. Each phase can be optimized independently:
- Prefill devices can run large, uninterrupted batches without impacting decode latency
- Decode devices can prioritize low-latency token generation without waiting for prefill operations
Enables Specialized Hardware #
Different phases can leverage hardware optimized for their specific needs:
- Prefill devices: High-compute GPUs (e.g., H100, A100) with large memory for processing long sequences
- Decode devices: Memory-bandwidth-optimized GPUs or specialized inference accelerators designed for low-latency, high-throughput token generation
This specialization can significantly improve cost efficiency compared to using general-purpose hardware for both phases.
Improves Resource Utilization #
Each device type operates at peak efficiency for its intended workload:
- Prefill devices maintain high compute utilization by batching long prompts
- Decode devices maintain high memory bandwidth utilization through continuous batching of many concurrent requests
This eliminates the efficiency losses that occur when trying to optimize for both workloads simultaneously.
Better Performance Isolation #
Long prefill operations no longer block decode requests:
- Requests with long prompts (and thus expensive prefills) don’t degrade inter-token latency for interactive users
- System performance becomes more predictable and stable under diverse workload patterns
Independent Scaling #
Each cluster can be scaled independently based on workload characteristics (a toy scaling heuristic follows the list):
- High prefill load (e.g., document processing) can be met by scaling the prefill cluster
- High decode load (e.g., chat applications) can be met by scaling the decode cluster
- This enables more cost-effective resource allocation
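As an illustration, an autoscaling rule could key each tier off its own bottleneck signal; the function and thresholds below are made-up examples, not a production policy:

```python
# Toy scaling heuristic: each tier scales on the signal that reflects its own
# bottleneck. Thresholds are illustrative assumptions.
def desired_replicas(prefill_queue_seconds: float, decode_kv_utilization: float,
                     prefill_replicas: int, decode_replicas: int):
    if prefill_queue_seconds > 2.0:       # prompts waiting too long -> TTFT degrades
        prefill_replicas += 1
    if decode_kv_utilization > 0.85:      # KV-cache memory nearly full -> risk of preemption
        decode_replicas += 1
    return prefill_replicas, decode_replicas
```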
Challenges and Tradeoffs #
While disaggregated inference offers significant benefits, it also introduces new challenges:
State Transfer Overhead #
After prefill completes, the KV cache must be transferred from prefill devices to decode devices. This transfer:
- Adds latency: Network transfer time increases time-to-first-token (TTFT)
- Consumes bandwidth: Large KV caches (especially for long contexts) require significant network capacity
- Requires coordination: The system must manage state transfer without blocking either cluster
Optimizations such as compression, pipelining, and efficient serialization can mitigate but not eliminate this overhead.
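To get a feel for the magnitude, here is a back-of-envelope estimate assuming a Llama-3-70B-like KV-cache layout (80 layers, 8 KV heads via grouped-query attention, head dimension 128, FP16) and two rough interconnect speeds; adjust the constants for your own model and network:

```python
# Back-of-envelope KV-cache size and transfer time; constants are assumptions
# for a Llama-3-70B-like model in FP16.
LAYERS, KV_HEADS, HEAD_DIM, DTYPE_BYTES = 80, 8, 128, 2

def kv_cache_bytes(context_tokens: int) -> int:
    # keys + values, for every layer, for every token
    return context_tokens * LAYERS * 2 * KV_HEADS * HEAD_DIM * DTYPE_BYTES

for ctx in (2_048, 32_768, 128_000):
    size_gb = kv_cache_bytes(ctx) / 1e9
    # Rough interconnect speeds: 400 Gb/s Ethernet (~50 GB/s), NVLink (~450 GB/s)
    print(f"{ctx:>7} tokens: {size_gb:5.1f} GB "
          f"-> ~{size_gb / 50 * 1e3:4.0f} ms over 400GbE, "
          f"~{size_gb / 450 * 1e3:3.0f} ms over NVLink")
```

At long contexts the transfer alone can reach hundreds of milliseconds over a commodity network, which is one reason production systems favor NVLink or RDMA paths and overlap the transfer with prefill (e.g., streaming the cache layer by layer as it is produced).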
System Complexity #
Disaggregated systems are more complex to design and operate:
- Two-tier scheduling: Must coordinate scheduling across both clusters
- Load balancing: Must balance load across prefill and decode clusters
- Failure handling: Must handle failures in either cluster gracefully
- State management: Must track request state across cluster boundaries
This complexity increases development and operational costs compared to unified systems.
Resource Fragmentation #
Separating resources into two clusters can lead to fragmentation:
- Imbalanced load: If prefill and decode workloads are imbalanced, one cluster may be underutilized while the other is overloaded
- Fixed allocation: Resources allocated to one cluster cannot be easily repurposed for the other
Dynamic resource allocation and autoscaling can help, but perfect balance is difficult to achieve.
Cost Considerations #
While disaggregation can improve efficiency, it may also increase costs:
- Infrastructure overhead: Managing two clusters requires additional orchestration infrastructure
- Network costs: State transfer consumes network bandwidth, which may be expensive in cloud environments
- Operational complexity: More complex systems require more sophisticated monitoring and management
The cost-benefit tradeoff depends on workload characteristics and scale.
When Disaggregation Makes Sense #
Disaggregated inference is most beneficial when:
- High workload diversity: Systems serving both long-document processing and interactive chat benefit from separation
- Scale: Large-scale deployments can amortize the complexity and overhead across many requests
- Specialized hardware available: Access to different device types optimized for prefill vs. decode
- Strict latency requirements: Applications requiring very low decode latency benefit from eliminating prefill interference
For smaller deployments or homogeneous workloads, the added complexity may not justify the benefits, and unified systems with chunked prefill may be more practical.
Summary #
Disaggregated inference represents a fundamental architectural shift that addresses the inherent tension between prefill and decode phases by physically separating their execution. By allowing each phase to run on specialized hardware optimized for its specific workload characteristics, disaggregated systems can achieve better performance, efficiency, and isolation than unified systems.
However, this separation comes with costs: state transfer overhead, increased system complexity, and potential resource fragmentation. The decision to adopt disaggregated inference depends on workload characteristics, scale, and available infrastructure.
As context lengths continue to grow and workloads become more diverse, disaggregated inference offers a path toward systems that can simultaneously achieve high prefill throughput and low decode latency—goals that are fundamentally at odds in unified architectures.
References #
1. Zhong et al. “DistServe: Disaggregating Prefill and Decoding for Goodput-Optimized Large Language Model Serving.” 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2024.
2. NVIDIA Dynamo blog: “NVIDIA Dynamo, A Low-Latency Distributed Inference Framework for Scaling Reasoning AI Models,” March 2025.