The tutorial is organized into three sessions that collectively build a comprehensive understanding of efficient inference for GenAI models. Session 1 provides an overview of the diverse architectural classes of GenAI models, ranging from decoder-only LLMs to multimodal models. This is followed by an introduction to the multi-layered inference stack, spanning high-level ML frameworks such as PyTorch down to low-level hardware instructions — an understanding of which is essential for systematically uncovering performance bottlenecks. This will be accompanied by a demonstration of profiling techniques for analyzing these bottlenecks. Session 2 dives into algorithmic, modeling, and systems-level optimizations, covering techniques such as efficient attention variants, quantization, speculative decoding, and kernel fusion — while examining how each technique alleviates the bottlenecks discussed in Session 1. Session 3 concludes the tutorial with practical tools and frameworks that implement many of the previously discussed optimizations, accompanied by demos. It will highlight vLLM for high-throughput LLM serving and TensorRT for hardware-optimized compilation across modalities. Together, these sessions will equip participants with the technical depth and tooling needed to deploy GenAI models efficiently in real-world systems.
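As a taste of the profiling demonstrations in Session 1, the sketch below shows the kind of per-operator bottleneck analysis the tutorial will cover, using PyTorch's built-in `torch.profiler`. The tiny MLP and CPU-only run are stand-in assumptions for illustration, not the tutorial's actual demo workload.

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Stand-in workload: a small MLP (the tutorial targets full GenAI models).
model = torch.nn.Sequential(
    torch.nn.Linear(512, 2048),
    torch.nn.GELU(),
    torch.nn.Linear(2048, 512),
)
x = torch.randn(8, 512)

# Profile one forward pass to surface per-operator time and memory costs.
with profile(activities=[ProfilerActivity.CPU], profile_memory=True) as prof:
    with torch.no_grad():
        model(x)

# Rank operators by total CPU time to locate the dominant bottlenecks.
report = prof.key_averages().table(sort_by="cpu_time_total", row_limit=5)
print(report)
```

On accelerators, the same API accepts `ProfilerActivity.CUDA` to attribute time to device kernels, which is where inference bottlenecks for large models typically appear.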
This tutorial is designed for researchers and practitioners in AI who are interested in understanding and improving the efficiency of GenAI models. It will be particularly valuable for those with expertise in a specific layer of the inference stack, such as modeling, quantization, or infrastructure, who are seeking to develop a holistic perspective connecting algorithmic techniques with system-level and hardware-aware optimizations. Attendees are expected to have basic familiarity with deep learning frameworks (e.g., PyTorch). Prior experience with LLM architectures and deployment is beneficial, but not required.
Rajarshi Saha is an Applied Scientist at AWS AI Research and Education, where he focuses on developing algorithms for resource-efficient training and inference of large foundation models. His research broadly explores theoretical approaches to challenges in this space, with an emphasis on investigating optimality and developing theoretically grounded practical algorithms. Before joining AWS, he earned a PhD in Electrical Engineering from Stanford University. Prior to that, he completed his Bachelor's and Master's degrees at the Indian Institute of Technology (IIT) Kharagpur, where he received the Best Undergraduate Thesis award and the Prime Minister of India Gold Medal as class valedictorian.
Aninda Manocha is an Applied Scientist at AWS AI Research and Education, where she works on memory management for LLM inference and on optimized kernel generation for the Neuron Kernel Interface (NKI). Previously, she was a memory-subsystem architect at Rivos, a RISC-V chip startup. She received her PhD in Computer Science from Princeton University with a focus on computer architecture; her dissertation optimized the mapping of irregular applications with sparse, memory-bound characteristics onto modern hardware.
Youngsuk Park is a Senior Applied Scientist & Manager at AWS Annapurna Labs, leading a core algorithm team advancing scalable and efficient LLM training and inference methods. He manages a high-caliber research group pioneering innovations in quantization, structured sparsity, and hardware-aware modeling and algorithms optimized for AWS Trainium. His algorithmic work spans the full model lifecycle, from efficient large-scale training recipes to low-latency inference deployment, powering foundation models across Amazon Bedrock, AGI, and partners such as Anthropic. He has co-authored more than 30 papers at ICLR, ICML, AISTATS, and KDD on LLM training, inference, optimization, time series, and reinforcement learning. He regularly organizes tutorials and workshops at top AI conferences, sharing practical insights on deploying foundation models efficiently on AI accelerators.
Lingfan Yu, Applied Scientist, AWS AI
Kaan Ozkara, Applied Scientist, AWS AI
Wei Tang, Applied Scientist, AWS AI
Tao Yu, Applied Scientist, AWS AI
Jiaji Huang, Senior Applied Scientist, AWS AI
Liangfu Chen, Senior Software Engineer, AWS AI
Jonas Kübler, Senior Applied Scientist, AWS AI
Yida Wang, Principal Scientist, AWS AI
George Karypis, Senior Principal Scientist, AWS AI