The tutorial is organized into three sessions that collectively build a comprehensive understanding of efficient inference for GenAI models. Session 1 provides an overview of the diverse architectural classes of GenAI models, ranging from decoder-only LLMs to multimodal models. This is followed by an introduction to the multi-layered inference stack, spanning high-level ML frameworks such as PyTorch down to low-level hardware instructions — an understanding of which is essential for systematically uncovering performance bottlenecks. This will be accompanied by a demonstration of profiling techniques for analyzing these bottlenecks. Session 2 dives into algorithmic, modeling, and systems-level optimizations, covering techniques such as efficient attention variants, quantization, speculative decoding, and kernel fusion — while examining how each technique alleviates the bottlenecks discussed in Session 1. Session 3 concludes the tutorial with practical tools and frameworks that implement many of the previously discussed optimizations, accompanied by demos. It will highlight vLLM for high-throughput LLM serving and TensorRT for hardware-optimized compilation across modalities. Together, these sessions will equip participants with the technical depth and tooling needed to deploy GenAI models efficiently in real-world systems.
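As a taste of the profiling demonstrations in Session 1, the sketch below shows the kind of per-operator bottleneck analysis the tutorial will cover, using PyTorch's built-in `torch.profiler`. The tiny MLP and CPU-only run are stand-in assumptions for illustration, not the tutorial's actual demo workload.

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Stand-in workload: a small MLP (the tutorial targets full GenAI models).
model = torch.nn.Sequential(
    torch.nn.Linear(512, 2048),
    torch.nn.GELU(),
    torch.nn.Linear(2048, 512),
)
x = torch.randn(8, 512)

# Profile one forward pass to surface per-operator time and memory costs.
with profile(activities=[ProfilerActivity.CPU], profile_memory=True) as prof:
    with torch.no_grad():
        model(x)

# Rank operators by total CPU time to locate the dominant bottlenecks.
report = prof.key_averages().table(sort_by="cpu_time_total", row_limit=5)
print(report)
```

On accelerators, the same API accepts `ProfilerActivity.CUDA` to attribute time to device kernels, which is where inference bottlenecks for large models typically appear.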
This tutorial is designed for researchers and practitioners in AI who are interested in understanding and improving the efficiency of GenAI models. It will be particularly valuable for those with expertise in a specific layer of the inference stack, such as modeling, quantization, or infrastructure, who are seeking to develop a holistic perspective connecting algorithmic techniques with system-level and hardware-aware optimizations. Attendees are expected to have basic familiarity with deep learning frameworks (e.g., PyTorch). Prior experience with LLM architectures and deployment is beneficial, but not required.
Rajarshi Saha is an Applied Scientist at AWS AI Research and Education, where he focuses on developing algorithms for resource-efficient training and inference of large foundation models. His research broadly explores theoretical approaches to challenges in this space, with an emphasis on investigating optimality and developing theoretically grounded practical algorithms. Before joining AWS, he earned a PhD in Electrical Engineering from Stanford University. Prior to that, he completed his Bachelor's and Master's degrees at the Indian Institute of Technology (IIT) Kharagpur, where he received the Best Undergraduate Thesis award and the Prime Minister of India Gold Medal as class valedictorian.
Aninda Manocha is an Applied Scientist at AWS AI Research and Education, where she works on memory management for LLM inference and on optimized kernel generation for the Neuron Kernel Interface (NKI). Previously, she was a memory-subsystem architect at Rivos, a RISC-V chip startup. She received her PhD in Computer Science from Princeton University with a focus on computer architecture; her dissertation optimized the mapping of irregular applications with sparse, memory-bound characteristics onto modern hardware.
Youngsuk Park is a Senior Applied Scientist & Manager at AWS Annapurna Labs, leading a core algorithm team advancing scalable and efficient LLM training and inference methods. He manages a high-caliber research group pioneering innovations in quantization, structured sparsity, and hardware-aware modeling and algorithms optimized for AWS Trainium. His algorithmic work spans the full model lifecycle, from efficient large-scale training recipes to low-latency inference deployment, powering foundation models across Amazon Bedrock, AGI, and partners such as Anthropic. He has co-authored more than 30 papers at ICLR, ICML, AISTATS, and KDD on LLM training, inference, optimization, time series, and reinforcement learning. He regularly organizes tutorials and workshops at top AI conferences, sharing practical insights on deploying foundation models efficiently on AI accelerators.
Lingfan Yu, Applied Scientist, AWS AI
Kaan Ozkara, Applied Scientist, AWS AI
Wei Tang, Applied Scientist, AWS AI
Tao Yu, Applied Scientist, AWS AI
Jiaji Huang, Senior Applied Scientist, AWS AI
Liangfu Chen, Senior Software Engineer, AWS AI
Jonas Kübler, Senior Applied Scientist, AWS AI
Yida Wang, Principal Scientist, AWS AI
George Karypis, Senior Principal Scientist, AWS AI