4.1 vLLM

vLLM: High-Throughput, Memory-Efficient LLM Serving #

vLLM is an open-source, high-performance inference engine designed to optimize Large Language Model (LLM) deployment. Originally developed by UC Berkeley’s Sky Computing Lab, vLLM addresses critical bottlenecks in traditional LLM serving, namely memory fragmentation, low serving throughput, and inefficient GPU utilization, through innovative algorithms and system design.

Core Features & Innovations #

Continuous Batching #

Unlike static batching, which waits to assemble a full batch of requests before running them, vLLM integrates new requests dynamically during decoding. New requests are inserted into ongoing decoding iterations without stalling the device on which inference is running. This maximizes device utilization (by overlapping computation and data transfer) and reduces latency, especially under variable request traffic: vLLM can sustain sub-second response times with more than 1,000 concurrent users. See Continuous Batching for more details.
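The toy loop below illustrates the idea (a simplified sketch, not vLLM's actual scheduler): finished requests free their slot immediately, and waiting requests are admitted at every decode step, so the batch never drains before new work is added.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt: str
    max_new_tokens: int
    generated: list = field(default_factory=list)

def decode_step(batch):
    # Stand-in for one forward pass that emits one token per running request.
    for req in batch:
        req.generated.append("<tok>")

def continuous_batching_loop(waiting: deque, max_batch_size: int = 8):
    running = []
    while waiting or running:
        # Admit new requests as soon as slots are free, instead of waiting
        # for the whole batch to finish (static batching).
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())
        decode_step(running)
        # Retire finished requests; their slots are reused on the next step.
        running = [r for r in running if len(r.generated) < r.max_new_tokens]

requests = deque(Request(f"prompt {i}", max_new_tokens=4 + i % 5) for i in range(20))
continuous_batching_loop(requests)
```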

PagedAttention #

Traditional LLM serving suffers from KV cache memory fragmentation due to variable-length sequences, wasting up to 80% of allocated memory. Inspired by virtual memory in operating systems, PagedAttention divides the KV cache into reusable, fixed-size blocks, or “pages” (e.g. 16 tokens per block in vLLM's default configuration), enabling dynamic memory allocation, reuse, and sharing across requests with identical prefixes. This reduces memory waste and supports parallel sampling, beam search, and shared prefixes. As a result, PagedAttention reduces KV cache memory usage by 60-80% and achieves up to 24$\times$ higher throughput compared to Hugging Face Transformers in high-concurrency workloads, allowing larger models to be served on fewer devices. See PagedAttention for more details.
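A minimal sketch of the block-table bookkeeping behind this idea, using a toy allocator (this mirrors the concept from the PagedAttention paper, not vLLM's actual data structures):

```python
class PagedKVCache:
    """Toy block allocator illustrating the PagedAttention idea."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}            # request id -> list of physical block ids
        self.ref_counts = [0] * num_blocks

    def append_token(self, req_id: str, position: int):
        # Allocate a new physical block only when a logical block boundary is crossed,
        # so memory grows on demand instead of being reserved for the max length up front.
        table = self.block_tables.setdefault(req_id, [])
        if position % self.block_size == 0:
            block = self.free_blocks.pop()
            self.ref_counts[block] += 1
            table.append(block)

    def fork(self, parent_id: str, child_id: str):
        # Share the parent's blocks (e.g. a common prompt prefix) copy-on-write style.
        self.block_tables[child_id] = list(self.block_tables[parent_id])
        for block in self.block_tables[parent_id]:
            self.ref_counts[block] += 1

cache = PagedKVCache(num_blocks=64)
for pos in range(40):              # a 40-token sequence needs only ceil(40/16) = 3 blocks
    cache.append_token("req-0", pos)
cache.fork("req-0", "req-1")       # a second sample shares the same prompt blocks
```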

Tensor Parallelism #

vLLM supports multi-GPU distributed inference for large models, e.g., Llama-70B. By splitting each layer's weights across GPUs and minimizing cross-GPU data transfers during decoding, tensor parallelism enables near-linear scaling across devices and supports models that exceed single-GPU memory. See Parallelism Strategies and Communication Collectives for more details.
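In vLLM this is exposed through the `tensor_parallel_size` argument; the model name and GPU count below are illustrative:

```python
from vllm import LLM, SamplingParams

# Shard the model's weights across 4 GPUs (illustrative; adjust to your hardware).
llm = LLM(model="meta-llama/Llama-2-70b-hf", tensor_parallel_size=4)

outputs = llm.generate(
    ["Explain tensor parallelism in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```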

OpenAI-Compatible API #

vLLM provides a REST API compatible with OpenAI’s Chat Completions and Completions endpoints, simplifying integration into existing applications. Developers who have already built apps against OpenAI’s API can point them at a vLLM server, whether in the cloud or on-premise, with very few code changes.
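For example, assuming a server has been started with `vllm serve` for the (illustrative) model below, the official OpenAI Python client only needs a different base URL:

```python
from openai import OpenAI

# Point the standard OpenAI client at the local vLLM endpoint; the API key is not checked.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",  # must match the model being served
    messages=[{"role": "user", "content": "What is PagedAttention?"}],
)
print(response.choices[0].message.content)
```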

Quantization #

vLLM supports FP8, INT4 (GPTQ), AWQ, and other low-bit weight quantization techniques to reduce memory footprint and accelerate inference. It also leverages optimized CUDA kernels, including FlashAttention, for faster attention and other core operations. Quantized weights can yield roughly 2-3$\times$ speedups. See Model Quantization for more details.
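Loading quantized weights is a constructor argument; the AWQ checkpoint name below is illustrative:

```python
from vllm import LLM, SamplingParams

# Load 4-bit AWQ weights to cut the memory footprint (checkpoint name is illustrative).
llm = LLM(model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ", quantization="awq")

out = llm.generate(
    ["Summarize weight quantization in one sentence."],
    SamplingParams(max_tokens=32),
)
print(out[0].outputs[0].text)
```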

Heterogeneous Batching #

By processing requests with different lengths, formats, and tasks (e.g. chat, code generation) in a single batch, vLLM handles mixed workloads efficiently. It dynamically groups requests based on shared prefixes or operations and achieves higher throughput by avoiding the overhead of maintaining separate homogeneous batches, as in the sketch below.
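A short sketch of submitting such a mixed batch through vLLM's offline API, with one `SamplingParams` per prompt (model name and parameter values are illustrative):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")  # illustrative model

prompts = [
    "Write a haiku about GPUs.",                                   # short creative chat
    "def quicksort(arr):",                                         # code completion
    "Summarize the theory of relativity in three bullet points.",  # longer summary
]
# One SamplingParams per prompt; vLLM batches these heterogeneous requests together.
params = [
    SamplingParams(temperature=0.9, max_tokens=40),
    SamplingParams(temperature=0.0, max_tokens=120),
    SamplingParams(temperature=0.3, max_tokens=200),
]

for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```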

Use Cases #

vLLM excels in:

  • Chatbots & AI Assistants: Handling thousands of concurrent users with low latency.
  • Content Generation: High-throughput tasks, such as code generation or document summarization.
  • Multi-Modal Models: Serving vision-language models (e.g. LLaVA) via extensions.
