1.5 Measuring Inference Efficiency


Evaluating the performance of an LLM during inference requires a clear understanding of both speed and cost efficiency. While model quality and reasoning capability are crucial (and often non-negotiable), the practical utility of an LLM, especially in production systems, often hinges on how quickly and economically it can generate responses. Several key metrics quantify the different aspects of inference behavior; they are visualized and described in detail below.


Figure: Visualization of the performance metrics TTFT and ITL

Latency (Time-to-First-Token) #

Time-to-first-token (TTFT) measures the elapsed time between submitting a prompt to an LLM and receiving the first generated token in response. It captures the model’s initial latency, encompassing tokenization, prompt encoding, etc., and the first forward pass through the network. TTFT is especially important for interactive or streaming applications such as chatbots or autocomplete systems, where perceived responsiveness directly affects user experience.

For example, consider the prompt: “The quick brown fox jumps over the”. In a next-word-prediction task, the model might output the next token “lazy”. If it takes 480 ms from sending the prompt to receiving “lazy”, then the TTFT is 480 ms, regardless of how quickly subsequent tokens (e.g., “dog”, “.”) are generated afterward. TTFT therefore governs the delay a user perceives before the model begins to respond.
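As a sketch, TTFT can be measured by timing a streamed response. The snippet below uses a hypothetical `fake_stream` generator as a stand-in for a real model's streaming API; the 480 ms sleep simulates the prefill latency from the example above.

```python
import time

def measure_ttft(stream):
    """Return (ttft_seconds, tokens) for any iterable that yields tokens."""
    start = time.perf_counter()
    ttft = None
    tokens = []
    for tok in stream:
        if ttft is None:
            # elapsed time until the very first token arrives
            ttft = time.perf_counter() - start
        tokens.append(tok)
    return ttft, tokens

def fake_stream():
    """Hypothetical stand-in for an LLM streaming API (not a real client)."""
    time.sleep(0.48)      # simulated prefill: tokenization + first forward pass
    for tok in ["lazy", " dog", "."]:
        time.sleep(0.01)  # simulated per-token decode time
        yield tok

ttft, tokens = measure_ttft(fake_stream())
```

The same `measure_ttft` helper works unchanged with any real streaming client that yields tokens, since it only iterates and timestamps.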

Because the prefill stage processes the entire input sequence in a single pass before any decoding begins, TTFT is generally compute-bound, i.e., its latency is dominated by dense matrix multiplications and attention operations over long input sequences.

Output Speed (Tokens per Second) #

Output tokens per second (OTPS) quantifies how quickly an LLM generates tokens after the first token has appeared. It is typically measured as the average number of tokens produced per second (tokens/sec) during the streaming phase of inference, and can equivalently be expressed through its inverse, inter-token latency (ITL), the average time between successive tokens.

For example, continuing with the prompt “The quick brown fox jumps over the”: if the model generates the next 6 tokens (“lazy dog and runs away.”) in 0.3 seconds, the output speed is 20 tokens/sec. In other words, after the initial TTFT, the model sustains a generation rate of 20 tokens per second until completion.
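Given per-token arrival timestamps, OTPS and ITL can be computed directly; the sketch below mirrors the worked example (first token at t = 0, six more tokens over the next 0.3 s).

```python
def output_speed(token_times):
    """Compute (OTPS, mean ITL) from per-token arrival timestamps in seconds.

    token_times[0] is the arrival of the first token, so OTPS covers the
    streaming phase only and TTFT is deliberately excluded.
    """
    if len(token_times) < 2:
        raise ValueError("need at least two tokens to measure streaming speed")
    span = token_times[-1] - token_times[0]
    intervals = len(token_times) - 1
    otps = intervals / span   # tokens generated per second
    itl = span / intervals    # inter-token latency, the inverse of OTPS
    return otps, itl

# First token at t=0, then 6 more tokens over the next 0.3 s (one every 50 ms)
times = [0.00, 0.05, 0.10, 0.15, 0.20, 0.25, 0.30]
otps, itl = output_speed(times)   # ≈ 20 tokens/sec, ≈ 50 ms ITL
```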

Unlike prefill, decoding is incremental, i.e., each step depends on the previously generated token, and is therefore often bandwidth-bound rather than compute-bound. The need to repeatedly access and update large key–value (KV) caches for attention makes memory throughput and cache locality key determinants of OTPS. High OTPS is especially important for long-form generation and large-scale serving workloads, where sustained decoding performance dictates overall system efficiency.
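To see why decoding tends to be bandwidth-bound, it helps to estimate how much KV-cache state each decode step must touch. The sketch below uses the standard accounting (one key and one value vector per layer per token) with hypothetical 7B-class dimensions: 32 layers, 32 KV heads, head dimension 128, and an fp16 cache. The numbers are illustrative, not those of any specific model.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    """KV-cache bytes written per generated token: one K and one V vector
    per layer, each of size n_kv_heads * head_dim elements."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

# Illustrative 7B-class configuration (hypothetical numbers, fp16 cache)
per_token = kv_bytes_per_token(n_layers=32, n_kv_heads=32, head_dim=128)
print(per_token)   # 524288 bytes, i.e. 0.5 MiB per token

# At a 4,096-token context, each sequence holds 2 GiB of KV state, and the
# attention of every decode step must re-read it, which is why memory
# throughput, not compute, typically limits OTPS.
context_bytes = per_token * 4096
```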

Throughput #

While Time-to-First-Token (TTFT) and Output Tokens per Second (OTPS) describe the latency and generation speed of a single inference request, real-world LLM deployments rarely serve one user at a time. Production systems such as chat services, retrieval-augmented APIs, or model endpoints handle many concurrent prompts arriving continuously, each with varying sequence lengths and response demands. In such scenarios, performance depends not only on how fast one request completes, but on how efficiently the entire system processes multiple requests simultaneously while utilizing available compute resources.

Throughput captures this aggregate behavior by measuring the total number of tokens generated per second across all concurrent inferences. It reflects how effectively the model server converts hardware capacity into useful output, and is often reported as tokens per second per device or tokens per second per deployment. High throughput indicates strong hardware utilization and scheduling efficiency, typically achieved through techniques such as dynamic batching, asynchronous queuing, and pipeline or tensor parallelism.

In deployment benchmarks (e.g., vLLM, TensorRT-LLM), throughput is typically measured over total wall-clock time, which therefore includes the TTFT periods of concurrent requests. It is given by

$$\text{Throughput}_{\text{system}} = \frac{\text{Total tokens generated across all requests}}{\text{Total wall-clock duration}}$$

Throughput differs from TTFT and OTPS in scope and interpretation. TTFT measures responsiveness (how quickly the first token is produced), while OTPS measures streaming speed once generation begins. Throughput, in contrast, represents system-level efficiency under concurrency. Although TTFT does not appear as an explicit term in the throughput formula, long TTFTs can still reduce throughput by delaying when generation can begin. Systems that overlap the prefill and decoding phases, such as those employing continuous batching (e.g., vLLM), maintain high throughput even when TTFT varies across requests.

For example, suppose a model server processes 16 simultaneous prompts, each generating 100 tokens over 4 seconds. The system outputs a total of 1,600 tokens, giving a throughput of 400 tokens/sec. Even if each request has a TTFT of 600 ms, the aggregate throughput remains high because token generation for later requests overlaps with the TTFT of earlier ones. This illustrates how TTFT governs perceived user latency, while throughput governs aggregate system efficiency and cost in large-scale LLM serving.
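The arithmetic of this example reduces to a single division, shown here as a minimal sketch:

```python
def system_throughput(total_tokens, wall_clock_seconds):
    """Aggregate tokens generated per second across all concurrent requests."""
    return total_tokens / wall_clock_seconds

# 16 concurrent requests, each generating 100 tokens, over 4 s of wall time
tput = system_throughput(16 * 100, 4.0)   # 400.0 tokens/sec
```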
